DeepVoice3: Baidu's multi-speaker speech synthesis in practice

Baidu's DeepVoice has gone through three versions, each more optimized and efficient than the last. I only recently found time to build and test it.

DeepVoice V1 appeared in early 2017. It uses deep learning to convert text into speech. This version can synthesize simple short sentences, and the output is close to a human voice; without listening carefully, it is almost indistinguishable from real speech. However, the system can only learn one voice at a time and needs hours of recordings to master each voice.
DeepVoice V2 can learn hundreds of different voices. It needs less than half an hour of data from each speaker, yet still achieves high audio quality. The system finds commonalities among the training voices on its own, without any prior guidance.

DeepVoice3 can learn 2,500 voices in half an hour. With earlier systems, achieving a similar goal required at least 20 hours of training data per voice.

1. Principles from the paper

The paper (https://arxiv.org/pdf/1710.07654.pdf) highlights several major contributions of DeepVoice3:

(1) A fully convolutional character-to-spectrogram architecture, which enables fully parallel computation over all elements of a sequence and trains dramatically faster than analogous architectures that rely on recurrent cells (e.g., Wang et al., 2017). A minimal sketch of this kind of convolutional building block follows this list.
(2) It scales to large datasets: the experiments train on nearly 820 hours of recordings from 2,484 speakers.
(3) Experimental results show that the method learns monotonic attention behavior, avoiding common error modes of attention-based speech synthesis.
(4) Experiments compare the quality of several waveform synthesis methods for single-speaker synthesis, including WORLD (Morise et al., 2016), Griffin-Lim (Griffin & Lim, 1984), and WaveNet (Oord et al., 2016).

(5) It is very fast: a single GPU server can handle up to ten million inference queries per day.
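As a rough illustration of the fully convolutional building block mentioned in (1), here is a minimal sketch of a causal, gated 1-D convolution layer in PyTorch. The class name and hyperparameters are my own placeholders, not the authors' implementation; the real model adds speaker embeddings, dropout, and attention on top of blocks like this.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalGatedConv1d(nn.Module):
    """Causal 1-D convolution block with a gated linear unit and a scaled residual."""

    def __init__(self, channels, kernel_size=5, dilation=1):
        super().__init__()
        # Left-only padding keeps the convolution causal: the output at time t
        # depends only on inputs at times <= t.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, dilation=dilation)

    def forward(self, x):                                   # x: (batch, channels, time)
        y = self.conv(F.pad(x, (self.left_pad, 0)))         # pad on the left, length preserved
        content, gate = y.chunk(2, dim=1)                   # split for the gated linear unit
        return (x + content * torch.sigmoid(gate)) * (0.5 ** 0.5)  # residual, scaled by sqrt(0.5)

A quick smoke test: CausalGatedConv1d(64)(torch.randn(2, 64, 100)) returns a tensor of the same shape, so the blocks can be stacked freely in the encoder and decoder.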

The framework of the paper is as follows:

[Figure: DeepVoice3 architecture, showing the encoder, decoder, and converter stages]

The framework converts textual features (characters, phonemes, stresses) into acoustic features (mel-band spectrograms, linear-scale log-magnitude spectrograms, or a set of vocoder features such as the fundamental frequency, spectral envelope, and aperiodicity parameters). These acoustic features are then used as input to a waveform synthesis model. As the figure shows, the TTS pipeline follows a Seq2Seq design.
Encoder: a fully convolutional encoder that converts text features into an internally learned representation.
Decoder: a fully convolutional causal decoder that decodes the learned representation, in an autoregressive manner, into a low-dimensional acoustic representation (mel-band spectrogram).

Converter: a fully convolutional post-processing network that predicts the final output features from the decoder's hidden states (the exact features depend on the chosen waveform synthesis method). Unlike the decoder, the converter is non-causal and can therefore rely on future context.
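To make the division of labor concrete, here is a minimal structural sketch of the three stages in PyTorch. The module and argument names (encoder, decoder, converter, text_ids, mel_targets) are placeholders, not the paper's code; a real implementation adds attention blocks, speaker embeddings, and stop-token prediction.

import torch.nn as nn

class DeepVoice3Pipeline(nn.Module):
    """Sketch of the three-stage flow: encoder -> attention decoder -> converter."""

    def __init__(self, encoder, decoder, converter):
        super().__init__()
        self.encoder = encoder      # fully convolutional text encoder
        self.decoder = decoder      # causal convolutional decoder with attention
        self.converter = converter  # non-causal post-net predicting final features

    def forward(self, text_ids, mel_targets=None):
        # Text features -> internal (key, value) representation
        keys, values = self.encoder(text_ids)
        # Autoregressive decoding into mel-band spectrogram frames
        mel_out, decoder_states = self.decoder((keys, values), mel_targets)
        # Non-causal conversion of decoder states into the final vocoder features
        final_features = self.converter(decoder_states)
        return mel_out, final_features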

2. Hands-on test

After rebuilding and running the source code, the results are as follows:



Testing on a sentence from the news shows that commas in the text can be passed to the model directly.

In addition, testing shows that DeepVoice3 is extremely fast and can imitate the voices of multiple speakers. Multi-speaker synthesis looks like this:

# Synthesize the same text once for each of the N speaker embeddings
for speaker_id in range(N):
    print(speaker_id)
    tts(model, text, speaker_id=speaker_id, figures=False)

The news sentence used in the test:

National rejuvenation relies on the "hard work" of the Chinese people, and the country's innovation capacity must be raised through independent efforts, President Xi Jinping said on Tuesday.
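The tts helper called in the loop above is not shown in the post. Below is a minimal sketch of what such a helper might do, assuming the model exposes a synthesize(text, speaker_id) call that returns a float waveform in [-1, 1]; the method name, sample rate, and output path are all assumptions, so adapt them to the actual inference API of the code you build.

import numpy as np
from scipy.io import wavfile

def tts(model, text, speaker_id=0, figures=False, sample_rate=22050):
    """Synthesize `text` with one speaker embedding and write it to a wav file."""
    # Assumed inference call; replace with the real model's synthesis API.
    waveform = model.synthesize(text, speaker_id=speaker_id)
    waveform = np.clip(np.asarray(waveform, dtype=np.float32), -1.0, 1.0)
    # `figures` is accepted only for signature compatibility; plotting is omitted here.
    # Convert to 16-bit PCM and save one file per speaker.
    wavfile.write(f"speaker_{speaker_id}.wav", sample_rate,
                  (waveform * 32767).astype(np.int16))
    return waveform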

