The future of voice synthesis after Google WaveNet debut
Tue 13 Sep 2016
Dr Matthew Aylett is the Chief Scientific Officer and co-founder at CereProc, which develops voice synthesis systems primarily for the healthcare field. In the light of last week’s announcements from Baidu/NVidia and from Google, he takes a look at the challenges and motivations involved in reproducing the human voice artificially…
Our voice is very personal to us. It conveys where we are from, our childhood, our sense of self, and our emotions and intentions. It allows us to interact with others and is the cement (if not the bricks) that makes up the social aspect of humanity.
So the first and clearest reason for replicating a voice is to help someone who will lose (or has lost) their own through surgery or illness. This area of voice replacement is very important – losing the power of speech is traumatic, and being able to at least retain your vocal identity can offer some relief from this trauma. By taking audio recordings, CereProc can recreate the patient’s voice and allow them to use it themselves to communicate with others.
Personal voice cloning is a growing area, and even if the cloned voice sounds less natural, our research indicates that users prefer a simulation of their voice to an off-the-shelf voice, and will use (if not listen to) a personalised voice, because it conveys who they are.
Replacing lost voices
Currently the main market for cloned voices is clinical/medical, and voice banking (recording audio of your own voice so it can be used to build a duplicate at a later date) is gathering momentum. It still takes us three hours of recordings to build a voice that sounds like the subject – but this does not sound as natural as our commercial synthesis voices.
Custom voices require a lot more recording time, and recording large amounts of audio can be a hard process for people, especially if they have a problem with speaking. We were able to replicate Roger Ebert’s voice from DVD commentaries, but for Steve Gleason we used recordings he made in the early stage of his illness (he has ALS), which were limited; so the voice we built, which he uses in the Microsoft Super Bowl advert, sounds like him but is also a little buzzy, and not as natural as we would like.
Vocal impersonation through voice synthesis
Voice cloning is also generating interest because of its potential for mimicry. There is already a big industry of sound-alikes who can mimic famous voices for entertainment and satire, and our own George Bush site was in this vein.
As media fears occasionally reflect, there is potentially a darker side to this. It is possible to duplicate a person’s voice for malicious reasons and use it not to mimic someone, but to impersonate them. Technologies such as voice print identification could be duped, bogus telephone messages could be created and audio recordings could be manipulated. Listening to George W. Bush making two contradictory statements via voice synthesis, can you distinguish the real voice? You can probably guess by the content, but not by the voice.
Making robots and AI socially accessible
There is also an impetus to extend our humanity through technology. All good quality synthesis voices are currently copies of someone’s voice; they are used to add humanity to information systems, allow computers to enter the social domain, and to extend ourselves into cyberspace – for example a cloned voice can be used to build a virtual me that sounds like me and can converse with others.
Much of the promise of this still lies ahead of us, and society and the law need to catch up with our ability to copy and use a person’s voice with and without permission. At CereProc we made a decision never to sell a voice which was recorded without a speaker’s permission (including the George W. Bush voice, which we will only use for satirical purposes).
But the legal aspects of this are still indistinct; the GWB voice was created with open Creative Commons presidential speech audio, and technically that audio can be used for any purpose.
The rise of the virtual assistant
Speech technology, both recognition and synthesis, has improved dramatically over the last decade. In 2006 the main use of synthesis and recognition was for automatic phone systems (voted the most hated technology of the 21st century by Wired magazine in 2011). However, the focus of current interest is in the field of artificial assistants.
Speech technology is a requirement for eye/hands busy systems but also for computing within the social domain – where how you say something may be as important as what you say. Siri was the first example of a widely adopted speech-based service after the advent of automatic phone systems; Amazon Echo, Google Now and Cortana have all entered the fray.
But the scope and development of new systems and applications that require speech technology, including good quality expressive speech synthesis, are exploding, with virtual assistants envisioned even in non-obvious contexts: at CES 2016, Intel announced a partnership with Oakley to produce sunglasses that offer fitness coaching using CereProc custom voices.
As an engineer it is easy to be carried away by the applications and miss the key commercial drivers behind this new enthusiasm for speech technology. Speech technology offers a direct channel to the user – an intimate and profound control of the user’s interaction with a system.
When Google first took Siri out of the box they didn’t say “Nice recognition, interesting application, cool array microphones” – they said “Where are our Ads?” Google’s business model almost entirely depends on selling Ads. Siri replaces the browser and returns control to Apple.
Similarly, if you ask Amazon Echo to buy a new Beyoncé track, it won’t be buying it from iTunes. Speech interfaces offer unprecedented control of the user channel, and this is why large corporates are desperate to develop or source high-quality recognition and synthesis systems – to keep control of their channels to market.
With Siri, speech technology has entered the commercial war zone, and it will be fascinating to see who the winners and losers will be. One major loser I predict will be the automatic IVR phone system in its current form. With CereProc’s and others’ technology improving personal assistants and mobile and voice-operated web services, these applications will dominate the way humans interact with commercial systems for purchases, information and knowledge.
And if the IVR disappears, I doubt many tears will be shed.
The ‘uncanny valley’ of voice cloning
In voice synthesis, as with CGI, there are many subtle cues that distinguish what is ‘natural’. Much of the animation in a computer game or movie CGI character is also sampled, with the creators using an actor’s motion-captured movements to animate a computer generated model – and a decision must be taken as to the extent to which the final product should attempt absolute realism.
The latter is a hard proposition in both fields; even small incongruent facets can destroy the sense of believability when attempting high levels of audio/photo-realism, while a more ‘artificial’ product can be smoother and have greater integrity and usability, even if it will not be confused with real video/audio.
Parametric synthesis (where speech is generated from a statistically built voice model), as opposed to Unit Selection (where samples are stitched together), faces particular challenges, notably in recreating the expressiveness and variation in a subject’s speech.
Based typically on statistical or neural net models, it is hard for parametric synthesis to create expressive speech, whereas with unit selection, pre-recorded prompts can be included and regenerated as appropriate. Thus, although parametric speech synthesis has been a focus of R&D, it still doesn’t sound as good as unit selection unless there is limited speech (less than 5-10 hours) to build the voice from.
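As a rough illustration of the ‘stitching’ idea behind unit selection, the Python sketch below picks one recorded unit per target phone so that neighbouring units join smoothly, then concatenates their samples. It is a toy under stated assumptions, not CereProc’s or anyone else’s production method: the Unit fields, the pitch-only join_cost and the inventory values in the usage lines are all hypothetical, and a real system would also score how well each unit fits its target context.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    phone: str          # phone label this recording covers
    start_pitch: float  # pitch (Hz) at the start of the recording (assumed feature)
    end_pitch: float    # pitch (Hz) at the end of the recording (assumed feature)
    samples: list       # stand-in for the recorded audio samples


def join_cost(left, right):
    """Hypothetical join cost: pitch mismatch at the boundary between two units."""
    return abs(left.end_pitch - right.start_pitch)


def select_units(targets, inventory):
    """Dynamic-programming search for the unit sequence with the lowest total join cost."""
    candidates = [[u for u in inventory if u.phone == p] for p in targets]
    if any(not c for c in candidates):
        raise ValueError("no recorded unit for some target phone")
    # best[i][j] = (cheapest total cost of a path ending at candidates[i][j], back-pointer)
    best = [[(0.0, None) for _ in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            costs = [best[i - 1][k][0] + join_cost(prev, unit)
                     for k, prev in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_min], k_min))
        best.append(row)
    # Trace back from the cheapest final unit to recover the chosen sequence.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path


def synthesise(targets, inventory):
    """Stitch the selected units' samples together into one output sequence."""
    out = []
    for unit in select_units(targets, inventory):
        out.extend(unit.samples)
    return out


# Usage with a made-up inventory: two recordings each of "h" and "i".
inventory = [
    Unit("h", 100.0, 120.0, [1]),
    Unit("h", 100.0, 200.0, [2]),
    Unit("i", 118.0, 110.0, [3]),
    Unit("i", 205.0, 190.0, [4]),
]
stitched = synthesise(["h", "i"], inventory)
```

Because the output is built only from recorded material, whatever expressiveness was captured in the recordings carries through, which is the advantage described above; the price is that quality depends on having a unit in the inventory that joins well.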
But this will change. At CereProc we have Unit Selection, Statistical and Neural Net systems running side-by-side. The gap is getting smaller, and new techniques are changing the way we see these approaches.
DeepMind Synthesis and the future
The recent Google DeepMind WaveNet synthesis was a startling example of how almost unlimited processing power could be directed at the problem and produce good results. In the published paper Google point out that their system is currently very slow (and Google has access to a lot of compute power). But the technique is potentially ground-breaking because it tries to model all the subtleties in speech, and thus to approach (or even surpass) the well understood and commercially successful unit selection systems. How to harness such techniques and produce releasable systems for servers and small devices will be at the centre of commercial TTS development over the next few years.
Before the Google DeepMind results, I would have said that the chief challenge wasn’t computing power, but rather having the skill to use the power to produce high quality synthesis. Unlike speech recognition, where you need to be able to recognise any speaker with any content, in general with TTS the requirements are more modest.
Thus, nearly all commercial systems run in real time. The DeepMind results are new in that the model requires enormous compute power and produces quite good (although by no means perfect) results. We are likely to see some interesting developments here over the next few years, and brute force computation will play an important part.
However, I would also say that understanding how speech is produced, and what the true degrees of freedom are, will play a critical role in developing the next generation of systems. Just as Microsoft Sam was still the default synthesis system on Windows XP, years after higher quality unit selection systems were available, there is a long road to producing commercial systems using these new approaches, and it’s unclear even then what availability they may have. If you listen to default Android synthesis, it has a long way to go to reach Google’s own results with DeepMind.
CereProc in the voice synthesis market
CereVoiceMe is a low cost web based replacement voice service that we developed with input from MND Scotland among others, the first service of its kind at launch in 2014. Our brief was to offer a straightforward, easy to use service that people could use at home to record their voice and for CereProc to be there in the background to offer help and advice via the web etc. if required.
To make the process easier, CereProc supplies the microphone; the customer then records their voice, and once we have checked and tested the data, CereProc builds a voice replacement for them.
What makes CereProc unique in the field of voice replication (or cloning as it is often described) is our ability to produce scalable voices. Unlike competitors, our voice-building system is heavily automated, making even our full commercial-grade custom voices a fifth of the cost of our competitors’; but we also have a set of different speech synthesis techniques which are commercially available, and allow us to build voices with different amounts of source audio – from an hour, to the many hours of audio in an audio book.
From day one CereProc understood that, to create voices that people liked, it was important to retain the character of the person and to allow emotional content to be part of the voice recording process. CereProc voices therefore retain more of the variation and expressiveness than our competitors’ voices, which makes them more interesting to listen to, and so better for reading out longer text and for social interaction.
Finally we are constantly developing and upgrading our systems. Some of the legacy TTS systems used in call centres and other support systems have not been updated for years. Perhaps this is OK for reading your bank balance out, but for more speech-intensive applications, these old voices are just bland and boring.
Dr Aylett is a recognized world authority in speech technology research and development, having worked at the International Computer Science Institute (ICSI) at Berkeley, California, before returning to Scotland in late 2005 to co-found CereProc. Prior to ICSI he was the senior development engineer at speech synthesis leader Rhetorical Systems (later acquired by ScanSoft), which he joined at its formation. At Rhetorical, Matthew was responsible for the design, implementation, and testing of the company’s core speech technology. Before Rhetorical, he spent five years at the University of Edinburgh researching speech and dialogue technologies. Dr Aylett holds a BA in Computing and Artificial Intelligence from the University of Sussex, and an MSc with Distinction and PhD in Speech and Language Technology from the University of Edinburgh.