Baidu and Nvidia claim AI system that can imitate your voice
Tue 6 Sep 2016
Chinese internet giant Baidu has developed an artificial intelligence system which, it claims, can successfully imitate a human voice after listening to a person speak for only thirty minutes.
At Baidu World, the company’s annual tech expo, CEO Robin Li launched the AI project entitled ‘Baidu Brain’, a tripartite initiative which the company has undertaken in partnership with Nvidia. Baidu Brain is concerned with AI algorithms, computing power, and big data, and Li observed that the extensive speech synthesis capabilities of the system extend to imitation:
“Anyone just records 50 sentences as required in 30 minutes, and our speech synthesis technology could simulate the person’s voice. We could let everyone have their own voice model.”
It’s an off-hand claim about one of the oldest objects of curiosity in science (and science-fiction) – the ability to replicate an individual’s vocal patterns.
The last time voice cloning made notable headlines was 15 years ago when AT&T Labs promised to bring dead celebrities back to life with a “custom voice” product called Natural Voices. Though mooted as the first commercial venture for the pure research arm back in 2001, that technology seems to have got no further since it was acquired by speech synthesis specialists Nuance – who currently list no such functionality in their portfolio of products.
However, Natural Voices required that a subject record 30 to 40 hours of speech, and reconstructed new sentences from phoneme samples drawn from a database derived from the recordings.
The speech synthesis engine from Israeli startup VivoText uses a technology called Music Objects Recognition (MOR), developed by founder Gershon Silbert, a former concert pianist. In a demonstration video, Silbert compares AT&T’s product with the ‘expressive’ VivoText interpretation of the same text – and later demonstrates the facility to make the system assign radically different emotional nuances to the same text.
American radio journalist Neal Conan has worked with VivoText to have his own voice cloned, and commented in 2013: “One truly intriguing part of this project, if this technology works as well as we expect it to, is that essentially my voice could live forever.”
However, as one commenter notes, VivoText have no product on the market yet.
Other companies have attempted digital approaches to voice mimicry in recent years, chiefly in the service of Voice Banking, which aims to preserve an individual’s authentic voice template – to enable patients with progressively debilitating diseases to retain familiar vocalising powers via text-to-speech in later stages of the disease.
CereVoice Me’s voice cloning service offers a cloud-based voice templating system. The examples of the voice ‘Jess’ on the CereVoice home page are not encouraging compared to VivoText’s examples in the video above, but inputting an excerpt from Lady Macbeth’s speeches into one of the Scottish female voices in the test section produced surprisingly credible results.
In 1999 The Washington Post was impressed to hear audio scientist George Papcun put words into the mouth of General Carl Steiner, former Commander-in-chief of U.S. Special Operations Command. Papcun was able to derive an authentic voice template of Steiner from a mere ten-minute recording of the general’s voice, making him say “Gentlemen! We have called you together to inform you that we are going to overthrow the United States government.”
Steiner was reported to be so impressed with the performance that he asked for a copy of the tape. Whilst experimenting with the technique, Papcun’s team also ‘made’ former Chairman of the Joint Chiefs of Staff Colin Powell say “I am being treated well by my captors.”
Speaking of the conspiracy theories which claim that he generated the voices of passengers on United Flight 93 on September 11th 2001, Papcun observes one realistic obstacle to casual audio impersonation of a subject:
“I cannot imagine how I might have obtained extensive samples of the voices of the passengers on Flight 93, especially not knowing which of them might call home. Additionally, in this situation, it would be necessary to know what someone would say to his or her loved ones under such circumstances. What pet names would be used? What references would be made to children and other loved ones? Do believers actually suppose that the government (or I) listens in to everyone’s pillow talk?”
A future tense
According to the paper Current Trends in Linguistics (1974), Pope Sylvester II was the first recorded individual to attempt a system to imitate the human voice, a thousand years ago; Albertus Magnus and Roger Bacon also built speaking heads in the 13th century. But Hungarian author, inventor and hoaxer Wolfgang von Kempelen broke genuine new ground in the late 18th century, inventing a speaking machine after twenty years of research. R.K. Potter’s invention of the sound spectrograph during WWII brought voice synthesis research away from physical machines and into the analogue realm, later allowing scientists to literally ‘paint’ reproducible speech.
We haven’t heard yet what Baidu’s AI system can really accomplish in terms of voice cloning, but since it’s a topic that has continued to interest the U.S. military in the last 25 years, it’s probably one worth keeping a weather eye on from the point of view of personal and civil liberty.