The Road to General Intelligence in Synthetic Speech (Part 1)
2020 was a year of incredible progress in synthetic speech research. Powered by new methods in AI, cloned synthetic voices are now virtually indistinguishable from the real voice, leaping across the uncanny valley with impeccable naturalness and fidelity. A slew of new projects and companies were started to offer synthetic speech technology to various industries, thanks to amazing open source repositories such as Nvidia/Tacotron2 or Tensorspeech/TensorflowTTS quickly implementing results from cutting-edge research. We see companies like WellSaid Labs making their mark in the corporate L&D industry, which surprisingly is the largest segment of the US voiceover market, taking 24.3% of the market share. We also see companies like Replica Studios and Sonantic tackling more difficult industries like game production that require much more emotional expressivity. And of course, there is 15.ai, providing the ability to create voice memes using the internet’s favorite cartoon and game characters. In case you are wondering, yes, they are LOVO’s competition because we do all of that as well, but none of us are real threats to each other, yet.

All of the promise shown so far is fantastic. However, in order for synthetic speech technology to change the paradigm of voice content production, we need to reach what we at LOVO call General Intelligence for Synthetic Speech.

General Intelligence in Synthetic Speech: What is it?

We see the progress of synthetic speech in three stages: natural enunciation, prosodic control, and timbre control.

Natural Enunciation: In this stage, the primary goal of the speech synthesis system is to learn how to enunciate naturally like a human, with perfect similarity to the original speaker’s voice and the highest audio fidelity possible. Research in this field has been going on for decades with concatenative speech synthesis.
In 2016, a major breakthrough occurred with Deepmind’s WaveNet, and in 2017 with Google’s Tacotron, showcasing huge improvements in naturalness and similarity. In subsequent years, multiple innovations arose to improve upon these results, focusing on scalability and stability. For instance, Nvidia’s Waveglow aims to speed up WaveNet by several orders of magnitude for real world applications. Numerous finetuning techniques came about to reduce the data requirement to train speech synthesis systems, from hours to minutes to even seconds. By mid-2020, academia and industry together pretty much mastered this stage, which is why you now see so many companies offering similar technologies to customers. Prosodic Control: Prosody is by definition “concerned with those elements of speech that are not individual phonetic segments but are properties of syllables and larger units of speech, including linguistic functions such as intonation, stress, and rhythm.” Here we see efforts to improve speech synthesis systems to provide end users prosodic controllability. We can now feed into AI models prosodic information, such as intonation and rhythm, along with text and audio data when training. At LOVO, we can even make our AI sing! Even if the original speaker did not sing ever, we can make them sing any song with prosodic control. We can also clone singer’s voices to sing songs that they never sang before. Research in this field is still on-going, and much engineering for stability is still needed to bring these enhancements to the market. Timbre Control: Timbre is defined as anything else that uniquely defines someone’s voice, other than prosody. You can imagine a violin and a piano playing the same pitch at same rhythm, but obviously sounding very different from each other. You may have noticed that this definition is quite open ended, and there is a good reason for this. 
The characteristics of human speech other than prosody are extremely difficult and subjective to define semantically. If you ask 10 people how they define a “rusty” voice, you may well end up with 10 different definitions. Conversely, if you ask 10 people to describe someone’s voice after listening to it, you may also get very different adjectives. To a superficial degree, we have achieved timbre control by cloning someone’s voice perfectly, and linking this timbre information with speaker identity. Hence, we are now able to decouple prosodic information from content and timbre information. Some research goes further, mapping speaker timbre in an unsupervised fashion in order to categorize voices and interpolate between them to invent completely new speakers (a truly synthetic voice! What would halfway between Obama and Trump sound like?). However, none of these efforts can semantically and objectively define different classes of timbre in a way that humans can easily interpret. If we can achieve full interpretable control of timbre, we can enter commands with multiple timbre parameters such as “I want a casual, confident, husky Russian voice in a friendly tone, in their 50s, to be used for audio newsletters”, and AI will build or query that voice for you right away.

So how do we achieve this last stage of Timbre control? Data. A LOT of data. This is what we are currently working on at LOVO. We will dig deeper in Part 2. See you soon!
TTS Market Update (April 2021)
The increasing demand for automation and consumer convenience is at the forefront of the growth of the TTS (Text-to-Speech) market. According to the latest reports from Emergen Research, the global TTS market is expected to grow from USD 2.0 billion to USD 7.06 billion by 2028, at a steady CAGR of 14.7%, while the overall Speech and Voice Recognition Market is expected to reach USD 31.82 billion by 2025, at a CAGR of 17.2%. This market growth can be attributed to the leaps in innovation in neural networks and custom voice cloning in recent years. With the latest announcement of OpenAI’s GPT-3 language prediction model, these advancements are bound to continue.

While large enterprise accounts are the leading adopters of TTS services as of now, SMEs are expected to grow their purchasing interest in TTS substantially during the forecast period (2020 ~ 2027), primarily driven by the increasing awareness of the cost-efficiency of adopting TTS with existing CRM tools. These adoptions can come in the form of Intelligent Virtual Assistants (IVA), interactive chatbots, and branded custom voices.

Post-COVID TTS

While SME usage grows, the TTS market for the healthcare vertical is predicted to grow with the highest component CAGR, as personalized healthcare applications for an increasing elderly and visually impaired population require high-quality voice notifications. According to the Brouton Lab, there has been a substantial spike in demand for TTS with the rise of COVID, as the technology allows for rapid publishing of explainer videos and audio manuals, which are crucial for encouraging active engagement from patients and increasing awareness of health guidelines, especially for audiences with visual disabilities and language constraints. In fact, LOVO recently partnered with Stanford MedIC to accelerate their content production process by allowing them to easily localize education videos around COVID.
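As a quick sanity check on the market figures quoted at the top of this update, the compound growth arithmetic can be sketched in a few lines. This is only an illustration: the function names are made up for this example, and because the base year of the forecast is not stated here, the snippet backs out the time horizon that the two endpoints and the 14.7% rate imply rather than assuming one.

```python
import math


def project(start_value: float, cagr: float, years: float) -> float:
    """Value reached by compounding start_value at rate cagr for the given years."""
    return start_value * (1.0 + cagr) ** years


def implied_years(start_value: float, end_value: float, cagr: float) -> float:
    """Years needed to grow from start_value to end_value at the given CAGR."""
    return math.log(end_value / start_value) / math.log(1.0 + cagr)


def implied_cagr(start_value: float, end_value: float, years: float) -> float:
    """CAGR that turns start_value into end_value over the given years."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted in the text (USD billions); the horizon is derived, not quoted.
START_USD_BN = 2.0
END_USD_BN = 7.06
CAGR = 0.147

years = implied_years(START_USD_BN, END_USD_BN, CAGR)
print(f"Implied horizon: {years:.1f} years")
print(f"Check: {project(START_USD_BN, CAGR, years):.2f} USD bn")
```

Under these figures the implied horizon comes out at roughly nine years, which is a useful cross-check when comparing forecasts that quote different base years.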
Distance Education

With the global transition from physical classrooms to virtual ones, the TTS industry has seen a wave of support from government education funds. For example, through the Individuals with Disabilities Education Act (IDEA), the US Department of Education annually grants USD 10,000 ~ 20,000 to each student with a disability, in coordination with service packages including adoption of accessibility technology, and TTS-powered media services and curriculums. Several Learning Management Systems (LMS), textbook publishers, and MOOCs are also leveraging TTS by applying it to their services, providing a more audio-interactive blended learning experience for students.

Emergence of APAC

In Asia-Pacific markets such as India, China, Australia, Japan, South Korea, Indonesia, and Singapore, TTS services have been able to penetrate the consumer sector, with wide adoption of voice-activated technology in daily life. But while AI-enabled TTS has been deployed in public airports and ATMs, the surge can also be witnessed in the private sector, especially in the form of cloud-based TTS solutions. The scalability and applicability of these cloud applications have driven interest among APAC-based SaaS enterprises, with companies integrating branded AI Voices into their products to build personalized user experiences.

LOVO AI

The TTS industry may be packed, but LOVO differentiates itself with its user-friendly platform and industry-leading HD synthesis technology. LOVO not only includes over 180 voices in 34 languages for content creators to choose from, but also allows users to easily integrate TTS with their APIs, and build natural-sounding clones of their voice with just 15 minutes of recording data. See how LOVO compares to other enterprise offerings:
Text-to-Speech (TTS): A Beginner’s Guide
The ability to make computers read text has been a commonplace feature of daily life for quite some time now. We have grown accustomed to hearing metallic voices in our smartphones reading monotone summaries of the day’s weather and headlines. Recently, however, artificial voice technology is going through a renaissance in output quality. Content creators now have access to not only traditional TTS (Text-to-Speech) services, but incredibly human-like voiceover software that mimics natural speech. Futurist Ray Kurzweil once predicted in 2005 that the cost-performance ratio from speech synthesizers would be exponential, allowing for creators to not only reduce significant costs in production, but also democratize access to previously industry-exclusive sound processing tools. It would not be remiss to say that we have now reached, or are close to reaching, the future that Kurzweil predicted, with machine-read audiobooks and virtual influencers. However, before we arrive at such a judgement of the present, let us take a look back on the technological advancements that enabled us to enter this new period of TTS growth. Origins & Methodology The foundation of voice cloning technology is the synthesization of speech using a TTS engine. TTS is in fact a decades old technology that dates back to the 1960s, when researchers like Noriko Umeda of the Electrotechnical Laboratory and physicist John Larry Kelly Jr. developed the first versions of computer-based speech synthesis. Curiously enough, it was Kelly’s synthesizer demonstration, which recreated the song “Daisy Bell”, that inspired Stanley Kubrick and Arthur C. Clarke to use the electronic synthesizer in the climax of the film “2001: A Space Odyssey”. In the past, before the development of neural-network powered TTS models that we use today, there were two main approaches to the cloning of voice: Concatenative TTS and Parametric TTS. 
Both approaches attempted to maximize naturalness and intelligibility, the most important characteristics in speech synthesis.

Concatenative TTS describes the process of amassing a database of short sound samples, ranging from 10 milliseconds to 1 second, which the user then directly manipulates and merges to generate specific sequences of sound. These sequences can be combined to create audible and intelligible sentences, but due to their uniform and static nature, Concatenative TTS lacked the phonetic variations and idiosyncrasies that make speech sound natural and emotionally expressive. In addition, producing the datasets required for a fully functioning Concatenative TTS system was incredibly time-consuming.

Parametric TTS uses statistical models to predict variations in the parameters that make up speech. Once a voice actor is recorded reading a script, the researcher can train a generative model to learn the distributions of the recorded sound parameters (acoustics, frequency, magnitude spectrum, prosody, spectrogram) and the linguistics of the text, and then use the TTS to reproduce artificial speech with parameters similar to the original voice recording (vocoding). As a result, the data footprint is significantly smaller than that of Concatenative TTS, and the output model is far more flexible in adapting specific vocal expressions and accents. The TTS also “oversmooths” the recording, which makes sound discontinuities rare but also makes the speech flatter and more monotone, so it is easily distinguished from a natural voice.

Even with their limitations, it was the development of such TTS methodologies using Linear Predictive Coding (LPC) that allowed for the manufacturing of iconic consumer speech synthesizers such as the one used by Stephen Hawking in 1999 and games like Milton.
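To make the concatenative idea described above more concrete, here is a minimal sketch of joining pre-stored sound units into one waveform. Everything in it is an assumption for illustration: the unit database is filled with sine tones standing in for recorded samples, the unit names are invented, and the short linear crossfade is just one simple way to soften the joins, not a description of any particular historical system.

```python
import math
from typing import Dict, List

SAMPLE_RATE = 16_000  # samples per second (assumed for this sketch)


def tone(freq_hz: float, duration_ms: int) -> List[float]:
    """Stand-in for a recorded unit: a sine tone of the given length."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical unit database; a real one would hold recorded speech snippets.
UNITS: Dict[str, List[float]] = {
    "HH": tone(220.0, 50),
    "AY": tone(330.0, 120),
}


def crossfade_join(a: List[float], b: List[float], overlap: int) -> List[float]:
    """Append b to a, blending the last/first `overlap` samples linearly."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    tail_start = len(a) - overlap
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the incoming unit
        mixed.append((1.0 - w) * a[tail_start + i] + w * b[i])
    return a[:tail_start] + mixed + b[overlap:]


def synthesize(unit_names: List[str], overlap_ms: int = 5) -> List[float]:
    """Look up each unit by name and concatenate them with a short crossfade."""
    overlap = SAMPLE_RATE * overlap_ms // 1000
    out: List[float] = []
    for name in unit_names:
        out = crossfade_join(out, UNITS[name], overlap)
    return out


waveform = synthesize(["HH", "AY"])
print(len(waveform), "samples")
```

Because every output sample comes straight from a stored unit (or a blend of two at the seam), the result can only ever be as varied as the database itself, which is the limitation the text points out.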
AI Powered TTS (LOVO)

Today, TTS is going through a rapid new stage of innovation, and is dominated primarily by the Deep Neural Network (DNN) approach. By leveraging artificial intelligence and machine learning algorithms, the DNN method attempts to remove all human intervention from the voice cloning process, fully automating tasks such as smoothing and parameter generation. Of course, science has not yet reached a stage of full automation, but we are getting there. Some early pioneers of the DNN approach include Google DeepMind’s WaveNet, an autoregressive parametric model built from causal convolutions, and Baidu’s Deep Voice, which uses a convolutional neural network. All of these confusingly technical descriptions don’t help us understand the developmental stage we have now arrived at: with minimal audio samples, we can now recreate human-like AI Voiceovers to accelerate content production.

This is where LOVO comes in. Through our advanced HD speech synthesis technology, we not only provide 180+ voice skins in 34 languages for creators to effortlessly build artificial narration through our Studio platform, but also let users create natural-sounding clones of their voice with just 15 minutes of recording data. Don’t believe us? Check out our demos:
Are You Free to Welcome LOVO to Clubhouse?
The Accountant. Stephen Hawking. Morgan Freeman. Clubhouse. What do these 4 have in common? Yes, you guessed it correctly: VOICE.

"Heavy Sigh"... What's the Plan?

The 2016 movie stars Ben Affleck (playing Christian Wolff) as a martial arts-trained CPA with high-functioning autism, and features Alison Wright (playing Justine), who goes by the moniker “The Voice” and uses Text-to-Speech to communicate with clients and Christian. Stephen Hawking brought the speech synthesizer to global awareness, as many people considered his robotic voice, among other traits, a core part of the late genius’ character. We've all had that one friend who prided themselves on doing the best Morgan Freeman impression, and Clubhouse… well, that’s probably the loudest bandwagon I’ve heard in a while.

Hawking and The Accountant represent the fundamental use of synthetic voice, where people leveraged it as a basic means of communication. It was so rudimentary that "The Voice" in The Accountant reads out loud "heavy sigh". Singers saw a business value in enhancing their voices, and people on Clubhouse, and the folks mimicking Morgan Freeman, are finding entertainment value. Since The Accountant, we’ve seen synthesized voices playing many roles in Hollywood, whether as a character (Jarvis of Iron Man, although the voice is actually played by a human actor) or providing the necessary audio behind the scenes (Saw series). This evolution of synthetic voice from a barebones single-purpose tool to a fleshed-out multi-purpose application occurred due to significant improvements in the quality of the voices as well as the supporting technology, and a society that is more and more welcoming to synthetic media in general.

In the visually crowded world we live in, audio was always secondary.
But platforms like Spotify, podcasts, and most recently Clubhouse, are liberating audio from its chains, and the newer generations seem to be all ears.

Voice is Heavy

So, what's all this noise with Clubhouse? There have been numerous social platforms emphasizing community, so why the clamor now? It’s because voice, traditionally, has been very “heavy”. Voices pack a punch: they add depth to the text they carry, but also convey their own meanings beyond the words. A voice can send shudders down the spine (ergo the fandom for ASMR) or perk up ears (imagine a puppy hearing you calling out). It is emotional and powerful, and people grow a strong attachment to it: the DECtalk speech synthesizer, more famously known as Dr. Hawking’s voice, was based on the voice of the creator of the machinery, Dennis H. Klatt. When Hawking was offered an improved synthesizer by Speech Plus in 1988 with a different voice, he asked them to replace it with the original voice of “Perfect Paul” recorded by Klatt himself. The synthetic voice had become a part of his identity.

But it’s also heavy in that it is cumbersome: you see people reading texts or watching YouTube on mute in subways, but without earphones, your consumption of audio is limited. People have phone phobias, but far fewer people are afraid of reading texts or communicating over emails, and the rate of absorption of information is much slower for audible words compared to written script. I still take a couple of minutes to gather my thoughts before answering a phone call, and if it’s speaking in front of any number of people, you bet I’m prepping beforehand.

Homo Sapiens Vox

But when Clubhouse opened its doors, people embraced it with an unparalleled eagerness to talk and listen to each other.
What made it so different from Chatroulette, WeChat groups, and other audio/video community players that have come and gone is that they made audio light in a not-so-shallow way: it was light in that people could listen on mute and go in and out silently, but meaningful in that the invite-only structure with rapid participation from high-level industry leaders led to people maintaining a sense of professionalism, each room built with a specific purpose in mind. Combine this with the loss of in-person conversations due to COVID, we have people hungry to talk, to chat, to get their voices heard. Watch out, visual world, here come the Homo Sapiens Vox – and their AI. p.s.; if you need an invite to Clubhouse, leave a comment below!
How a Massive UK Game Company is Utilizing TTS AI Voiceovers for Rapid Iteration at a Lower Cost
*For confidentiality purposes, I cannot share the names of the company or the title until it is released. But I bet if you have not played their games yet, you have at least heard them being mentioned in the annual gaming awards' nominee lists.

Why is this big news?

As a former RPG-nerd (and a current fan of the FIFA, AC, and Uncharted series), I'm personally enthralled by this newly formed partnership. We've had a music record label, a giant US movie production firm, and small indie game studios use our voices in the past, but this is the first time LOVO voices are actually going to be used in the narration and NPC voices of a well-established gaming franchise. But really, the problems the gaming industry has been facing were clear:
1) How can you get hundreds of voices to use for NPCs without renting out tens of recording studios and breaking the bank?
2) How do you get the main narration in 10 different languages, and edit it multiple times through game development, production, and post-release?

More specifically for this release, it was: How do we get hundreds of voices with scripts as short as "Oh!" to as long as a 5-minute tutorial in 10 different languages without setting up studios and managers in several countries? And how do we create the main narration of the game that we change over and over throughout the production, and even after the game has launched?

So what are they doing actually?

After testing out a few options (let's be honest, there are a lot of TTS companies out there, starting with Google, Amazon, and IBM, not to mention other startups) they decided to leverage LOVO's AI voices via the Text to Speech API to generate, edit, and automatically upload audio files into their dev environment. This cuts the cost AND production time by more than 99%, including not only the actual VO fee but also the overhead of searching for talent, setting up recording booths, and managing finished audio, plus the bottleneck in game development.

We are super excited to see how this game turns out, and will be sure to follow its release! When the dust settles down a bit and we receive the green light from our partner to disclose everything, I'll make sure to come back and share it with you all :) Feel free to contact us if you have any questions about our voices, our web application LOVO Studio, or our API!
The Art of Writing Great Voiceover Scripts
Guest Post by Bridgette Hernandez We often overlook the significance of voiceover in whichever medium we consume. Video games, animated movies, online ads, as well as audiobooks all rely on voiceover to relay their content to us. Recently, the voiceover industry has started to seep into the corporate sector due to COVID-19 regulations and social distancing, preventing traditional studio work. According to Adobe, 91% of brands already actively invest in voiceover creation for their customer servicing and content marketing on the global market. Additionally, 71% of them stated that voices genuinely improve the user’s experience of their online platforms, ads, or services in general. However, in order to take full advantage of VO, writing a great voiceover script is pivotal. With AI platforms such as LOVO available for voiceover and text-to-speech, writing a proper script becomes even more important. Whether you simply want to edit and plug in an existing script into LOVO or are starting from scratch, let’s discuss voiceover scriptwriting. Tips to Writing Great Voiceover Scripts 1. Outline your Script The best way to get ahead with your scriptwriting and give it a good direction is to start with an outline. A voiceover script outline can take many shapes, depending on your sensibilities. You can create hand drawn storyboards, outline your script like an essay, or describe each scene and conversation without the actual lines present. It’s also a good idea to time-limit your VO script to get a good sense of how long the final product will be. This is an important first step in voiceover scriptwriting since it will focus your work and give it a purpose. Without an outline, you can spend a lot of time editing, revising, and outright abandoning certain ideas due to them being invalid with your vision. 2. Rely on Short, Informative Sentences Every sentence you write for your voiceover script should have a meaning behind it. 
Avoid writing sentences which serve as fluff and don’t offer tangible information to the listener. It’s worth mentioning that there is a clear distinction between narrative and conversational VO scripts in relation to sentence length. Narrative VO scripts typically serve to enrich multimedia content meant for advertisement or YouTube tutorials with animated visuals. However, conversational voiceover scripts are mostly found in indie films, animated movies, and other forms of media where two or more individuals speak. A shorter sentence structure with more pauses will make your script easier to reproduce, especially through an AI platform, as well as easier to listen to.

3. Provide Clear Direction

It’s not enough to simply write dialogue or a narrative voiceover script and hand it to a voiceover artist or AI algorithm. Similar to traditional film scripts, your voiceover script should include as much direction and contextual information as possible. Start with simple annotations of emotions, tone of voice, and state of being for every voice line. Is the character sad, happy, horrified, excited, or melancholic? What is their state of mind: are they tired, agitated, worried, or nervous? These elements are pivotal for quality voiceover line delivery, and they will make your script stand out that much more. You can go a step further and start every conversational scenario with a brief summary of the current location, weather, and time of day. These elements can also affect the voice actor and enrich the final product, giving it a sense of life and genuine artistic integrity beyond writing.

4. Don’t Be Afraid of Silence

While it may seem counterintuitive at first, silence is an important part of artistic voice acting. A dramatic pause can improve your scriptwriting significantly and add a certain flair to the final product. You can name characters and leave them “speechless” to emphasize their shock or to show them thinking about what occurred previously.
Likewise, a dramatic pause can be used as a collective reaction of all present characters to a joke, a tragic event, or other unordinary events. A pause will also help the listener sigh a breath of relief and “rest” for a few seconds before the voiceover continues. Don’t be afraid of taking a proverbial break, and the script will be much stronger for it. 5. Give your Listeners a Call to Action Lastly, you want your audience to react to the voiceover script in some capacity, depending on its content. Integrating a call to action into scriptwriting is simple when you write voiceover for an online ad or YouTube video. What about a film or an audiobook? This is where artistic expression comes into play. Different movies, books, and video games carry messages which can resonate with the active participant. These are referred to as themes. You can make the theme of your story revolve around “courage” and “friendship”, giving the listener an opportunity to identify with these calls to action. Never write voiceover scripts which don’t carry a long-term message or call the listener to a certain action. They are a wasted opportunity to convey a personal message to the listener and will often be forgotten as soon as they finish. Benefits of Using AI with Voice-Over Scripts Now that we have a clearer idea of how you can write a great script let’s switch gears and discuss AI’s role in voiceover. Artificial intelligence algorithms have come a long way since their inception. We have witnessed technologies such as chatbots and eCommerce personalization make life easier for people across the globe. The same can be said of the voiceover industry and AI’s place in it. Richard Ernst, Content Creator and Editor at Trust My Paper, said that: “Creators should use every tool at their disposal to make the process from inception to publishing as smooth as possible. 
Given the plethora of options in regards to AI’s implementation in voiceover, it is a logical next step for filmmakers and marketers alike.” If we were to break down the concrete advantages of implementing an AI voiceover platform such as LOVO in your creative process, they would include:
● Faster and more affordable voiceover content creation
● Numerous voice post-processing and editing capabilities
● Cloning possibilities with existing voiceover recordings
● Multilingual dubbing possibilities

In Conclusion

Whether you work on a film or a series of content pieces for online marketing, make sure that your voiceover scripts are your own. Write from your heart and make them original to stand out as much as possible. When it comes to practical implementation, leveraging AI technologies rather than real-world voice actors is a no-brainer, especially with COVID-19 making studio work nigh impossible. Whichever path you choose to take, writing a great voiceover script is a good first step in making sure that your ideas find their audience.

Bridgette Hernandez holds a Master’s in Anthropology, is interested in writing, and is planning to publish her own book in the near future. She is currently a content editor at TrustMyPaper. Brid also works with professional writing companies such as BestEssayEducation. The texts she writes are always informative and based on qualitative research, but nevertheless pleasant to read.
Localize Corporate Training & E-Learning Across Global Locations - Construction Media
Making Corporate Training, HR, and E-Learning materials is not easy.
Especially when you need them in more than one language. Whether you are a company that specializes in LMS, or simply need to create training contents for internal or client purposes, there are steps you need to follow as a global entity:
1) Write a script in your dominant language
2) Gather visuals
3) Recruit voice talent
4) Interview, sample, and pin your top choice
5) Get the recorded audio
6) Request edits
7) Mix the visuals with the audio for the final product
8) Test how it performs in your dominant language / market
9) Find voice talent for other languages under your umbrella, which requires people who speak the languages as well as those who know the VO market in respective countries
10) Carry out steps 4-8 for each country.

This is why Construction Media, a Dutch company focusing on audiovisual courses for safety, tool usage, and VCA exams, chose to work with LOVO to help them create employment training courses.
All they now need to do is:
1) Write a script in their dominant language
2) Gather visuals
3) Find the voice they like from LOVO and make the audio file
4) Mix the visuals with the audio for the final product
5) Test how it performs in their dominant language / market
6) Find voices on LOVO for different languages.
They are seeing savings in cost, time, and effort unlike anything before, and are looking to expand their classes from Dutch to English and other major vernaculars worldwide, serving more clients and even upselling their existing customers to leverage the courses in their own global offices. With the LOVO TTS API, they are also able to automate the creation and distribution process even further, making it much easier to make quick edits to existing courses whenever a new law gets enacted or a product name updates.
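The "quick edits" workflow described above can be sketched as a small incremental pipeline: fingerprint each script segment, and only re-generate audio for the segments whose text or voice changed. This is a hedged illustration, not the LOVO TTS API. The `synthesize` callable is a hypothetical placeholder that a real integration would replace with an actual API call, and the segment IDs, voice name, and manifest format are invented for the example.

```python
import hashlib
from typing import Callable, Dict, List


def fingerprint(text: str, voice: str) -> str:
    """Stable hash of a segment's text and voice, used to detect edits."""
    return hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()


def plan_updates(script: Dict[str, str], voice: str,
                 manifest: Dict[str, str]) -> List[str]:
    """Return IDs of segments that are new or changed since the last render."""
    return [seg_id for seg_id, text in script.items()
            if manifest.get(seg_id) != fingerprint(text, voice)]


def render_changed(script: Dict[str, str], voice: str,
                   manifest: Dict[str, str],
                   synthesize: Callable[[str, str], bytes]) -> Dict[str, bytes]:
    """Re-synthesize only the changed segments and record them in the manifest."""
    audio: Dict[str, bytes] = {}
    for seg_id in plan_updates(script, voice, manifest):
        audio[seg_id] = synthesize(script[seg_id], voice)
        manifest[seg_id] = fingerprint(script[seg_id], voice)
    return audio


def fake_synthesize(text: str, voice: str) -> bytes:
    """Placeholder for a real TTS call; returns dummy bytes instead of audio."""
    return f"[{voice}] {text}".encode("utf-8")


course = {"intro": "Welcome to the safety course.",
          "rule-1": "Always wear a helmet on site."}
manifest: Dict[str, str] = {}

first_run = render_changed(course, "voice-a", manifest, fake_synthesize)
course["rule-1"] = "Always wear a helmet and gloves on site."  # e.g. a rule change
second_run = render_changed(course, "voice-a", manifest, fake_synthesize)
print(sorted(first_run), sorted(second_run))
```

Keeping the manifest alongside the course files means that a one-line script change triggers one new audio file rather than a re-render of the whole course.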
Check out for yourself at
Adding a Voice to Your Brand, in the Literal Sense - Old Spice & Bouncer
If you live in North America, when you see the phrase "Smell like a man, man", the low, charismatic yet humorous voice behind the script rings around your mental ears along with the whistling tune that follows. Or vice versa, when you hear the voice of Isaiah Amir Mustafa, your brain suddenly conjures up images from the series of ads that Old Spice airs. This is the power of the voice - in the literal sense - of your brand.

For many decades, companies have dedicated a plethora of resources to developing eye-catching logos, repeatable catch phrases, and jingles that resound in customers' minds. Everybody recognizes the giant arch of McDonald's; remembers Nike's "Just Do It" slogan; and sings along to Huggies' "I'm a big kid now". It is only recently that the actual voices behind these have received the spotlight: the Aflac duck's quacking and the Pillsbury Doughboy's high-pitched squeaks, for instance. And while still in its nascent stages, it's a new way for companies to cut through the literal noise in the market and be remembered in people's thoughts.

This is why Bouncer, a popular email verification platform, used LOVO's voice as the voice of their new mascot Winston. You can hear him explain what Bouncer is, and what it can do for your business here: According to Radek, the CEO of Bouncer, "Winston for past three years was developing his personality, but finally is complete with his own voice!!!" As Bouncer builds out their brand and creates a story around their company and the character of Winston, his voice will play a crucial part in being both more personable and more notable in prospects' minds.

Check out what Winston has to say about Bouncer, and also feel free to try out LOVO's 150+ voices available for free at

+ If you can tell which LOVO voice Winston is created with, shoot us an email at and we'll give a free month of LOVO Studio to the first 3 people ;)
How CPA Canada is Leveraging Human-like AI Voices to Optimize for Efficiency in E-Learning
With over 17,000 students taking its classes to become accountants, Chartered Professional Accountants of Canada (CPA Canada) is a leading educational institution in North America. With the COVID-19 pandemic sticking around for longer than anyone expected, social distancing and digital learning have become the new norm. As they had already been offering courses online, CPA Canada only had to shift more weight to that medium and create more content. Easy, right?
Not exactly. To create these lessons, you need to go through script-writing, visual content production, audio recording, and overall editing and distribution. After you have the script and the visual parts ready, the traditional method of obtaining the audio was to:
1) Search for voiceover talent
2) Check out their recordings or ask for previews
3) Negotiate a deal
4) Give them the script
5) Wait for drafts
6) Listen and suggest edits
7) Wait some more
8) Receive the final copies
9) Combine them with video materials
10) If a number or a couple of words had to be changed over time, contact the original VO again or find a new VO and run through steps 1-9 all over again.
That's a lot of effort, time, and $$$. A whole lot, actually: tens or even hundreds of thousands of dollars if you are doing it at the scale of CPA Canada. This is why they looked to a solution like LOVO Studio's AI Voiceover platform to help them save resources that they could instead allocate to increasing the quality and quantity of teaching materials, improving user experience, and fostering an online community of educators and students.
LOVO's voices are practically indistinguishable from humans'; content creators can quickly pick, from a list of 150+ voices in 33 languages, the one that suits their needs, input text and get audio in minutes, and even fine-tune speech and intonation patterns so that the message is delivered optimally.
Learn more about LOVO Studio at: Find out more about LOVO API here:
What do APIs and Legos have in common?
They are both fun! (Ok, sorry, maybe not so much.) They are amazing building blocks that can be put together, modified, and reused easily to create something bigger and better.
What are APIs?
Think of them as Legos for software, which means even if they go missing, you won't have to worry about stepping on them and gasping for breath in agony. There are 4 key advantages to leveraging APIs compared to hard-coding, or building things from scratch:
1) Standardization: APIs are universal, standardized building blocks.
2) Usability: Each API comes packed with the information and documentation needed to carry out the task it's assigned to. No hassle, just plug-and-play.
3) Customizability: Because of the above 2 traits, you can mix-and-match to create custom functionalities.
4) Innovation: Ease of customization leads to innovation. And we all love innovation.
I'm not an engineer, so I'll try to explain it as simply as possible:
Imagine you are putting together a car. It's one car, so you start from scratch, and it's doable. ==> Hard-coding
Now all of a sudden, you get an order to create 1,000 cars. You don't have the time or the resources to create each car individually from scratch, so you go find a supplier that has standardized models for doors, tires, and body frames. ==> Standardization
Uh-oh, the management decides that you need to change the doors on all 1,000 cars. No worries, all of your parts are standardized and easily replaceable: you take out the existing doors, get a new model of doors, and simply plug them into the cars. ==> Usability
Then you realize, hmm, maybe we can create different models of cars by mixing-and-matching the different doors, tires, and body frames we already have in stock. ==> Customization & Innovation
What's it mean, then?
Engineers don't need to do redundant work, saving them from fatigue and demotivation. Management saves on associated costs, and can use the resources to carry out other business functions.
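For readers who do like to see code, the car analogy above can be sketched in a few lines of Python. This is only an illustration: `TextToSpeech`, `ProviderA`, `ProviderB`, and `build_video` are hypothetical names made up for this example, not any vendor's real API, and the "audio" is just tagged bytes standing in for what a real service would return.

```python
from typing import Protocol


class TextToSpeech(Protocol):
    """The standardized 'door': any provider takes text and returns audio bytes."""

    def synthesize(self, text: str) -> bytes: ...


class ProviderA:
    # Hypothetical stand-in for one TTS vendor; a real one would call its API here.
    def synthesize(self, text: str) -> bytes:
        return b"A:" + text.encode("utf-8")


class ProviderB:
    # A second hypothetical vendor exposing the same interface.
    def synthesize(self, text: str) -> bytes:
        return b"B:" + text.encode("utf-8")


def build_video(script: list[str], tts: TextToSpeech) -> list[bytes]:
    """The 'car': voices each script line without knowing which vendor is plugged in."""
    return [tts.synthesize(line) for line in script]


script = ["Welcome to our product.", "Sign up today."]
clips_a = build_video(script, ProviderA())
clips_b = build_video(script, ProviderB())  # swapping the 'door' is a one-argument change
```

Because both providers fit the same interface, `build_video` never has to change when the voice supplier does, which is the standardization and usability point in miniature.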
Business as a whole can move faster, react to changing competitive environments, and even innovate ahead of the market.
Ok, so what's this got to do with the voiceover & text to speech industry?
Currently, you need to go to a voiceover or TTS platform, put in an order or work on the text, get the audio file, listen to it, work on it some more, then download the finished file to your computer and upload it to whatever 3rd party environment you are using (video makers, audio editing platforms, presentation tools, or websites). But what if you could start in that 3rd party environment, do all of your text to speech work there, and wrap up your entire project without hopping between different tools and websites?
Let's take Powtoon, Vyond, or Moovly for example (these are all online video makers). You have to make a marketing video, so you get started with compiling the animations, images, background music, and the script. Instead of being forced to leave the website and look for voices on Voice123.com, Fiverr, or a text to speech app like LOVO, what if you could simply get your voice needs met there? Ahh... life would be much easier. (And those companies would probably love it, too, because they want to keep YOU on their website for as long as possible.)
That's why tech giants like Google, Amazon, and IBM have created APIs for their TTS. And LOVO has joined them as well, with better voices that you can test out for free on their text to speech web application, LOVO Studio. If you are curious about their APIs, check out their docs here: Ready to get started? Sign up here:
Bottom Line
Some people compare this to the question of Buy vs. Build. However, I think APIs go beyond a simple comparison of financial cost: from workers' mental well-being to the business' speed of innovation, there is a whole lot at stake when you pit APIs and micro-services against a hard-coded architecture. Let's call it a matter of "Would you rather have your team run on hamster wheels" vs. "Have your team spring forward in leaps and bounds".
"I like running on a hamster wheel at work", said nobody.
Democratization of Education by Voice AI Technology (2) - Corporate
Question: Who is the biggest fan of mandatory corporate classes?
1. The management
4. The hired instructor getting paid by the hour
5. The owners of the venue, F&B catering, travel services, etc.
I admit, some were beneficial, but most of the time, I felt like I was sitting through an aggrandized HR stunt that on the surface provided value to the employees but was mostly a show for the management and fattened the purses of the outsourced vendors. This is not to berate the HR folks or belittle the corporate training events that actually proved to be incredibly valuable to my career or personal learning; it’s just the reality of how most of these sessions are.
By the Numbers
According to the 2018 Training Industry Report by Training Mag, total U.S. training expenditures amounted to a staggering $87.6 billion. (To put it into perspective, Dell’s 2019 revenue was $90.6 billion, and Facebook’s revenue was $70.7 billion.) More than half ($47 billion) of the expenditure was spent on training payroll, with about $30 billion spent on travel, facilities, equipment, etc. That leaves only about 11.4% to be spent on any product or service directly related to training. If you submitted a marketing proposal to your CMO or CEO in which only 11.4% of your budget had anything to do with the actual content production, distribution, and customer engagement, you would be out on the street before that proposal hit the trash can.
Going into more detail, companies across the board spent on average $986 per learner. But note that this excludes the opportunity cost of workers taking time away from their daily responsibilities, and the near-impossible task of getting everyone’s schedule aligned, so the final cost soars even higher. But wait, that’s not all: because most companies use a ‘spray and pray’ tactic and don’t follow up on these workshops, the bulk of the investment in training gets washed down the drain. Kind of like the “Spaghetti Test”, where you throw pasta against a wall to see if it is cooked. If it sticks, it is ready to serve, but if it doesn’t, shucks, boil it some more!
Rich Dad vs. Poor Dad
What’s worse is that not every organization has that luxury: giant corporations can absorb the fees and the opportunity cost, and hope that something will stick with a portion of their large pool of employees. Furthermore, getting 1 speaker to speak to 100,000 workers drives down the cost per employee compared to hiring one for 100 or even 1,000. Smaller firms just don’t have the capacity to provide for their teams or the buffer to take the bet, even though they are usually more in need of training and learning than their bigger counterparts.
Rightfully so, online training courses are gaining momentum across the industry, with even the most prehistoric companies leveraging them for cost savings and efficiency. According to LinkedIn Learning’s 2019 Report, the majority of Learning & Development personnel are increasing their budget for online learning, while the number for instructor-led training is on the decline.
So shouldn’t that level the playing field for everyone now? The field is less uneven, but there is still a sizable gap. Big firms can shell out tens or hundreds of thousands of dollars on voiceover work to create as much new content as they want, when they want. Heck, they could have voiceover artists on payroll. Or, they could have their own key employees create videos in a nice studio without their business coming to a screeching halt.
Startups and SMBs can’t afford the sheer cost of working with voice actors, let alone the time and effort it takes to find, audition, hire, submit work, wait, and request edits. Having voice talent on payroll is also unthinkable, and since every person is a crucial piece of the organization, taking someone out for hours or days on end to film new training materials in a fancy studio would be financially and operationally disastrous.
Ok, Then What?
What if you could create human-like voices in real time from text that you write and edit whenever you want? Say, at 1% of the cost of securing a contract with an external voiceover talent or agency? Not the typical robotic voices you hear from your Siri or Alexa, but authentic, emotional, natural human voices?
Imagine how efficiently you could create new training materials in sync with new industry trends, without worrying about your budget or setting up buffers for backlogs. New employees can get a welcome message from your CEO without your CEO ever having to take time out of the day. Better yet, create customized sales and on-boarding videos for your clients without re-recording everything.
The Verdict
The human voice has the power to deliver a message in a way that a written script just cannot. However, at the same time, it is neither scalable nor affordable (the good ones, at least) in most cases. LOVO is trying to break that barrier by making authentic human voices scalable and affordable for corporations and organizations to use in their content creation and business operations. Try it for yourself at
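As a quick sanity check on the budget split in "By the Numbers", the quoted figures can be plugged into a few lines of Python. This is only a sketch that uses the rounded numbers from the article; with those inputs the leftover share comes out at roughly 12%, in the same ballpark as the 11.4% cited, and the small gap presumably reflects rounding in the quoted figures.

```python
# Figures quoted in the article from the 2018 Training Industry Report,
# in billions of USD. The payroll and travel numbers are rounded.
total = 87.6
payroll = 47.0            # "more than half"
travel_facilities = 30.0  # "about $30 billion"

# Whatever is not payroll or travel/facilities/equipment is what is left
# for products and services directly related to training.
remainder = total - payroll - travel_facilities
remainder_share = remainder / total * 100

print(f"Left for training products and services: ${remainder:.1f}B ({remainder_share:.1f}%)")
```

Either way, the point stands: only around a tenth of the spend goes to the training content itself.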
Democratization of Education by Voice AI Technology (1)
They say nothing beats face-to-face interaction between a pupil and a teacher, whether it be young children huddled up with their peers in a classroom, or full-grown adults battling the onset of sleep and the constant buzzing of notifications on their phones while the instructor drones on in an auditorium. But what if you can’t? Better yet, what does democratization of education even look like? In this 2-part series, I will first discuss education in the traditional sense, for K-12 and universities, in this article. In the next, I’ll discuss e-learning in the corporate sense, for HR, training, and marketing.
Evolution of Technology and Education
My father grew up making a 4-hour round trip from his rural home to his elementary school because there were no educators in his town. I was fortunate to have the chance to take online classes, sit virtual exams, get help from the internet, and so forth. He worked diligently to provide for me, to level the playing field so I could compete with the most successful people in the world. You still can’t beat the connections and the massive amounts of money that the traditionally wealthy and powerful pour into raising their young to follow in their path, but I have nothing to complain about, only things to be grateful for. And it was technology that really helped me get on somewhat equal footing.
With the advent of video, and later the PC, technology started to really penetrate education at every step of its progress:
1980~1990: Video / Audio
1990~: PC / Laptop
2000~: Mobile
2015~: AR/VR
2018~: Voice Assistants
2019~: AI
Let’s focus on the two most recent trends: Voice Assistants & AI.
1. Voice Assistants as Teaching Assistants
Voice assistants opened doors for students with difficulties to find other avenues of learning and being tested, and for students to interact with the speaker as their new teacher, fellow student, and library of information. They freed both educators and students from temporal and spatial limitations, and created a new platform for providers of educational materials. Aside from being potential substitutes, they complement in-person classes with verbal readouts, reminders, taking care of simpler tasks, leading small group discussions, etc. The influx of apps and platforms like Alexa Skills and the Google Assistant platform has opened doors for flexible, custom services that cater to individual types of students and educators, and for developers and companies to truly bridge the gap in digital learning.
A recent study by Glide, a fibre broadband firm in the UK, indicated that 48% of UK students would like to have voice assistants assist them during their studies. We might have to revisit the notion that “nothing beats face-to-face education” quite soon.
2. “Actual” Intelligence of Artificial Intelligence (AI)
AI has been the new buzzword for the past couple of years. From predicting test scores and student behaviors to full-blown character and content production, AI is carving out its own territory in the education sector. Circling back to the point of requiring some sort of “presence” of an educator, combining an AI-created visual representation with audio course materials increases the engagement and performance of students.
But what happens when professors and teachers don’t have the time to go into a studio to record new classes? Or what if the cost of creating these courses far outweighs the benefits? Multiple studies put the cost of creating 1 hour of online course content at between $10K and $50K. A semester’s worth of classes would set you back a few million dollars.
That’s where LOVO Studio’s automated voice content creation comes into play. You do not need to have the professors go into a studio for hours on end every few months to record new materials, nor do you need to pay for the studio rental or potential backup voice actors. All you have to do is create a copy of your voice in a few minutes, upload text whenever you need to create a new course, and convert the script into a usable audio file.
As students and educators, we all know that doing it yourself once is worth reading about it a hundred times, so go ahead and try it for yourself at LOVO Studio!
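To see how the $10K to $50K per hour figure quoted above adds up to "a few million dollars", here is a rough back-of-the-envelope sketch. The per-hour range is from the article; the 45 hours per course and 4 courses per semester are assumptions picked purely for illustration, so swap in your own numbers.

```python
# Per-hour production cost range quoted in the article (USD).
COST_PER_HOUR_LOW = 10_000
COST_PER_HOUR_HIGH = 50_000


def course_cost_range(hours: int) -> tuple[int, int]:
    """Return the (low, high) production cost for `hours` of course content."""
    return hours * COST_PER_HOUR_LOW, hours * COST_PER_HOUR_HIGH


# Assumptions, not from the article: 45 hours of lectures per course
# (3 hours a week for 15 weeks) and 4 courses in a semester.
hours_per_course = 45
courses = 4

low, high = course_cost_range(hours_per_course * courses)
print(f"${low:,} to ${high:,}")
```

Under these assumptions, a semester lands between $1.8 million and $9 million, which is in line with the "few million dollars" estimate.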