By Charlie Choi

The Road to General Intelligence in Synthetic Speech (Part 1)

2020 was a year of incredible progress in synthetic speech research. Powered by new methods in AI, cloned synthetic voices are now virtually indistinguishable from the original speaker, leaping across the uncanny valley with impeccable naturalness and fidelity.


A slew of new projects and companies have started offering synthetic speech technology to a range of industries, thanks to amazing open source repositories such as Nvidia/Tacotron2 and Tensorspeech/TensorflowTTS quickly implementing results from cutting-edge research. We see companies like Wellsaid Labs making their mark in the corporate L&D industry, which, surprisingly, is the largest segment of the US voiceover market at 24.3% market share. We also see companies like Replica Studios and Sonantic tackling more difficult industries, like game production, that demand much more emotional expressivity. And of course, there is 15.ai, which lets you create voice memes using the internet’s favorite cartoon and game characters. In case you are wondering: yes, they are LOVO’s competition, because we do all of that as well, but none of us are real threats to each other, yet.


All of the promise shown so far is fantastic. However, for synthetic speech technology to change the paradigm of voice content production, we need to reach what we at LOVO call General Intelligence in Synthetic Speech.


General Intelligence in Synthetic Speech: What is it?


We see the progress of synthetic speech in three stages: natural enunciation, prosodic control, and timbre control.


Natural Enunciation: In this stage, the primary goal of a speech synthesis system is to learn to enunciate as naturally as a human, with perfect similarity to the original speaker’s voice and the highest audio fidelity possible. Research in this field went on for decades with concatenative speech synthesis. In 2016, a major breakthrough occurred with DeepMind’s WaveNet, followed in 2017 by Google’s Tacotron, both showcasing huge improvements in naturalness and similarity. In subsequent years, multiple innovations improved upon these results, focusing on scalability and stability. For instance, Nvidia’s WaveGlow aims to speed up WaveNet by several orders of magnitude for real-world applications. Numerous finetuning techniques also emerged to reduce the data required to train speech synthesis systems, from hours to minutes to even seconds.


By mid-2020, academia and industry together pretty much mastered this stage, which is why you now see so many companies offering similar technologies to customers.


Prosodic Control: Prosody is by definition “concerned with those elements of speech that are not individual phonetic segments but are properties of syllables and larger units of speech, including linguistic functions such as intonation, stress, and rhythm.” Here we see efforts to give end users prosodic control over speech synthesis systems. We can now feed prosodic information, such as intonation and rhythm, into AI models along with text and audio data during training. At LOVO, we can even make our AI sing! Even if the original speaker never sang, we can make them sing any song through prosodic control. We can also clone singers’ voices to sing songs they never sang before. Research in this field is still ongoing, and much engineering work on stability is needed to bring these enhancements to market.
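To make the idea of feeding prosodic information alongside text concrete, here is a minimal sketch of what such a conditioned input might look like. The function name, phoneme IDs, and feature layout are all hypothetical illustrations, not LOVO’s actual pipeline; real systems pair each phoneme with learned or extracted pitch and duration features in a similar spirit.

```python
import numpy as np

def build_prosody_conditioned_input(phoneme_ids, f0_contour, durations):
    """Pair each phoneme with a pitch value (Hz) and a duration (frames),
    yielding the kind of joint input a prosody-controllable TTS model
    might consume alongside plain text. Purely illustrative."""
    assert len(phoneme_ids) == len(f0_contour) == len(durations)
    # One row per phoneme: [phoneme id, pitch, duration]
    return np.column_stack([phoneme_ids, f0_contour, durations]).astype(np.float32)

# Hypothetical three-phoneme utterance with a rising (question-like) intonation
features = build_prosody_conditioned_input(
    phoneme_ids=[12, 7, 31],
    f0_contour=[110.0, 130.0, 180.0],  # rising pitch contour
    durations=[5, 8, 12],              # lengths in spectrogram frames
)
print(features.shape)  # (3, 3)
```

Editing the `f0_contour` column while keeping the phonemes fixed is, conceptually, how the same sentence can be made to sound like a statement, a question, or a melody.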


Timbre Control: Timbre is everything that uniquely defines someone’s voice other than prosody. Imagine a violin and a piano playing the same pitch at the same rhythm, yet obviously sounding very different from each other.


You may have noticed that this definition is quite open-ended, and there is a good reason for that. The characteristics of human speech other than prosody are extremely difficult and subjective to define semantically. If you ask 10 people how they define a “rusty” voice, you may well end up with 10 varying definitions. Conversely, if you ask 10 people to describe someone’s voice after listening to it, you may also get very different adjectives.


To a superficial degree, we have achieved timbre control by cloning someone’s voice perfectly and linking this timbre information with speaker identity. Hence, we are now able to decouple prosodic information from content and timbre information. Some research goes further, mapping speaker timbre in an unsupervised fashion to categorize and interpolate between speakers, inventing completely new ones (a truly synthetic voice! What would halfway between Obama and Trump sound like?). However, none of these efforts can semantically and objectively define classes of timbre that humans can easily interpret.
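The “halfway between two speakers” idea above is often realized by interpolating between speaker embeddings. Here is a minimal sketch under the common assumption that speaker embeddings are unit-length vectors; the function and dimensions are illustrative, not any particular model’s API.

```python
import numpy as np

def interpolate_speakers(emb_a, emb_b, alpha=0.5):
    """Linearly blend two speaker embeddings: alpha=0 returns speaker A,
    alpha=1 returns speaker B, 0.5 an even mix. The result is renormalized
    to the unit sphere, where many embedding models place their speakers."""
    mixed = (1 - alpha) * emb_a + alpha * emb_b
    return mixed / np.linalg.norm(mixed)

# Two random stand-in speaker embeddings (256-dimensional, unit length)
rng = np.random.default_rng(0)
speaker_a = rng.normal(size=256); speaker_a /= np.linalg.norm(speaker_a)
speaker_b = rng.normal(size=256); speaker_b /= np.linalg.norm(speaker_b)

halfway = interpolate_speakers(speaker_a, speaker_b, alpha=0.5)
print(halfway.shape)  # (256,)
```

Feeding `halfway` into a multi-speaker synthesizer in place of a real speaker’s embedding is what yields a voice that belongs to no one, i.e. a truly synthetic speaker.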


If we can achieve fully interpretable control of timbre, we will be able to enter commands with multiple timbre parameters, such as “I want a casual, confident, husky Russian voice in a friendly tone, in their 50s, to be used for audio newsletters,” and the AI will build or retrieve that voice for you right away.
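One simple way such a query could work, assuming voices were already scored on interpretable timbre axes, is nearest-neighbor matching over an attribute vector. Everything below — the axis names, the catalog, the scores — is a toy illustration of the idea, not an existing product feature.

```python
import numpy as np

# Hypothetical catalog: each voice scored 0-1 on interpretable timbre axes
AXES = ["casual", "confident", "husky", "friendly"]
CATALOG = {
    "voice_a": np.array([0.9, 0.8, 0.7, 0.9]),
    "voice_b": np.array([0.2, 0.9, 0.1, 0.3]),
    "voice_c": np.array([0.8, 0.4, 0.9, 0.6]),
}

def query_voice(request):
    """Return the catalog voice whose timbre profile is closest
    (by cosine similarity) to the requested attribute scores."""
    q = np.array([request.get(axis, 0.0) for axis in AXES])
    def similarity(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return max(CATALOG, key=lambda name: similarity(CATALOG[name]))

best = query_voice({"casual": 1.0, "confident": 0.9, "husky": 0.8, "friendly": 1.0})
print(best)  # voice_a
```

The hard part, of course, is not the lookup but producing those attribute scores in the first place: an objective, human-interpretable labeling of timbre, which is exactly what the field has not yet cracked.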


So how do we achieve this last stage of Timbre control?


Data. A LOT of data.


This is what we are currently working on at LOVO. We will dig deeper in Part 2.


See you soon!