What is Speech Synthesis? How to collect TTS data for speech synthesis?

In the previous article, we mentioned that voice data collection falls into two common types: speech recognition (ASR) data and speech synthesis (TTS) data. In this article, we introduce what speech synthesis technology is and how speech synthesis data is collected and produced, to help you quickly understand the background and basic principles of speech synthesis.

 

What Is Text-to-Speech (TTS)?

As human-computer interaction becomes more and more common in our lives, speakers and sound waves serve as the main audio transmission medium. The continuous iteration of text-to-speech technology has enriched how we communicate, and machine speech has become more flexible and natural. All of this is inseparable from advances in speech synthesis technology.

How to Collect Speech Synthesis Data

Background of Speech Synthesis Technology

Speech synthesis is text-to-speech (TTS) technology: producing a computer voice from text. The earliest known device built to mimic human speech was constructed by Wolfgang von Kempelen more than 200 years ago. His machines consisted of elements imitating the various organs humans use to produce speech: bellows for the lungs, tubes for the vocal tract, side branches for the nostrils, and so on. Interest in such mechanical analogs of the human vocal organs continued into the twentieth century. In the second half of the 19th century, Helmholtz and others began to synthesize vowels and other speech sounds by superimposing harmonic waveforms with appropriate amplitudes. Traditional TTS is implemented mainly as a pipeline of multiple modules, and the whole system can be roughly divided into a frontend and a backend.

Speech Synthesis (TTS) Technology Principle

We can think of TTS as a sequence-to-sequence problem with two main stages: text analysis and speech synthesis. Text analysis is fairly similar to general natural language processing (NLP) pipelines (although heavy preprocessing may not be needed when using deep neural networks), covering steps such as sentence segmentation, word segmentation, and part-of-speech (POS) tagging. The first stage ends with grapheme-to-phoneme (G2P) conversion, and its phoneme sequence becomes the input of the second stage, speech synthesis, which generates a waveform from it.
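The first stage described above can be sketched in a few lines of Python. This is a minimal toy, not a production pipeline: the tiny lexicon and the fallback of spelling out unknown words are illustrative assumptions, whereas real systems use large pronunciation dictionaries plus a trained G2P model for out-of-vocabulary words.

```python
import re

# Hypothetical mini-lexicon mapping words to ARPAbet-style phonemes.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def text_analysis(text: str) -> list[str]:
    """Text analysis reduced to its simplest form: lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def g2p(words: list[str]) -> list[str]:
    """Grapheme-to-phoneme: look each word up, else spell it letter by letter."""
    phonemes: list[str] = []
    for w in words:
        phonemes.extend(LEXICON.get(w, list(w.upper())))
    return phonemes

print(g2p(text_analysis("Hello, world!")))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```

The phoneme list printed at the end is exactly what the second stage (waveform generation) would consume as input.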

Text-to-Speech (TTS) Systems and Data Production

A TTS system combines two components. Natural language processing (NLP) converts raw text (including punctuation, abbreviations, numbers, and symbols) into a symbolic speech transcription, which includes phonemes (pronunciations) and prosody (intonation, rhythm, rate) derived from cues in the text. Digital signal processing (DSP) then converts this speech representation into audible speech through the audio output of a computer or other device. The DSP side relies on a phonetic lexicon: a series of phrases recorded by humans that tries to cover every combination of phonemes in the language. The system builds speech from this lexicon by concatenating audio samples, then applies algorithms to smooth the completed phrase and adjust aspects such as volume and speaking rate.

Although machines of the past could produce intelligible sound, as demand for better human-computer interaction has grown, such voices sound flat and stiff and cannot provide a vivid interactive experience. Modern speech synthesis systems therefore put more emphasis on personalized, experience-first output, which can be divided into general TTS, personalized TTS, and emotional TTS:

  • General TTS: meets the needs of commercial use. The production process includes preparing recording personnel, choosing a recording location, recording (data collection), post-recording data cleaning, and data labeling, yielding a complete "commercial voice database".
  • Personalized TTS: different types of voices are provided according to the characteristics of the data product, building a customized voice library.
  • Emotional TTS: prosodic parameters are specified via XML-style tagging; this preprocessing helps the TTS system generate synthetic speech that contains emotional cues. Emotional intent recognition is one of the key technologies of emotional TTS and is closely related to natural language processing. Enterprises want their products to come closer to real human language, endowing machines with emotion instead of leaving them cold repeaters. For such a machine to speak vividly, the database behind emotional speech synthesis must also be richer and more diverse.
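The XML-style prosody tagging mentioned for emotional TTS has a standardized form in the W3C Speech Synthesis Markup Language (SSML). A small illustrative fragment (the text and the specific attribute values are made up for this example) might look like:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <s>
    I have some news for you.
    <!-- Faster, higher, louder delivery to suggest excitement -->
    <prosody rate="fast" pitch="+15%" volume="loud">
      This is wonderful!
    </prosody>
  </s>
</speak>
```

Tags like `<prosody>` let the author steer rate, pitch, and volume at the phrase level, which is exactly the kind of cue an emotional TTS system consumes.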

Two common methods of speech synthesis are the splicing (concatenative) method and the parametric method.

  • Splicing method: suitable splicing units are extracted from a pre-recorded corpus. Its strict requirements on recording quality and its demand for large-scale data both drive up commercial costs.
  • Parametric method: the corpus is modeled parametrically, with three modules: front-end processing, modeling, and a vocoder. It requires a smaller database, but the sound quality is rougher.
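The splicing method above can be illustrated with a toy sketch: pre-recorded units are joined, and the seams are smoothed with a short linear crossfade. The two "units" below are made-up sample lists standing in for recorded segments; real systems select units from a large corpus using cost functions.

```python
def crossfade_concat(units: list[list[float]], overlap: int = 4) -> list[float]:
    """Concatenate waveform units, linearly crossfading `overlap` samples at each seam."""
    out = list(units[0])
    for unit in units[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # fade-in weight for the incoming unit
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * unit[i]
        out.extend(unit[overlap:])
    return out

# Two hypothetical 8-sample units standing in for recorded splicing units.
a = [0.0, 0.2, 0.4, 0.6, 0.6, 0.4, 0.2, 0.0]
b = [1.0, 0.8, 0.6, 0.4, 0.4, 0.6, 0.8, 1.0]
wave = crossfade_concat([a, b])
print(len(wave))  # 12 samples: 8 + 8, with 4 overlapped at the seam
```

The crossfade is the toy counterpart of the smoothing algorithms mentioned earlier; without it, the abrupt jump at each seam would be audible as a click.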

 

Common Application Scenarios of Speech Synthesis

Finally, how is speech synthesis, as an upstream technology, applied in downstream AI scenarios? Voice assistants, smart customer service, audiobooks, call centers, in-vehicle entertainment devices, and more are all common application scenarios for speech synthesis. To make the user experience more real and rich, many upstream data collection companies work directly with voice actors, letting customers choose voices that meet the needs of their end users. Imagine tossing and turning with insomnia at night: you open a blog and hear the voice of Hiroshi Kamiya reading to you. How would that feel?

 

Origin blog.csdn.net/Appen_China/article/details/132064303