李宏毅DLHLP.12.Speech Synthesis.1/2.Tacotron

Introduction

This is a new 2020 course by Prof. Hung-yi Lee (李宏毅): Deep Learning for Human Language Processing (DLHLP).
Course website
Bilibili video
The speech synthesis part of the course covers four topics:
• TTS before end-to-end: speech synthesis before end-to-end learning
• Tacotron: end-to-end TTS
• Beyond Tacotron: speech synthesis after Tacotron
• Controllable TTS: how to control TTS
This section covers the first two.

TTS before End-to-end

Concatenative Approach

The idea is very simple: store a large amount of recorded speech in a database, and splice the needed pieces together at synthesis time.
For example, to say "artificial intelligence" (four characters in Chinese), the system looks up a recorded pronunciation for each character and concatenates them. The joins tend to sound stiff, so a lot of research has gone into reducing the concatenation cost and making the result more natural.
One drawback of this method is that it cannot synthesize an arbitrary person's voice: if Zhang San's voice is not in the database, Zhang San's voice cannot be synthesized.

Parametric Approach

HMM-based speech synthesis has a fairly complicated structure and is not expanded on here. A representative system is the HMM/DNN-based Speech Synthesis System (HTS).

Deep Voice

An approach closer to end-to-end is Deep Voice, from Baidu: text goes in, speech comes out.
It consists of four modules:
Grapheme-to-phoneme: guesses the pronunciation (phonemes) from the letters.
Duration Prediction: predicts how long each phoneme should last.
Fundamental Frequency Prediction: predicts the pitch, i.e., the frequency of vocal-cord vibration; for unvoiced sounds (no vocal-cord vibration) it outputs nothing.
Audio Synthesis: takes the phonemes, durations, and fundamental frequency and generates the waveform.
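To make the modular structure concrete, here is a minimal sketch of how the four modules chain together; every function below is a placeholder stand-in, not Baidu's actual model or API:

```python
# Minimal sketch of a Deep Voice style pipeline. All four stages are trivial
# placeholders so the data flow can be run end to end; real systems replace
# each one with a separately trained model.

def grapheme_to_phoneme(text):
    # placeholder: a real model maps letters to phonemes
    return list(text.replace(" ", ""))

def predict_durations(phonemes):
    # placeholder: a real model predicts a length (in frames) per phoneme
    return [5 for _ in phonemes]

def predict_f0(phonemes, durations):
    # placeholder: a real model predicts pitch per frame, 0 where unvoiced
    return [120.0] * sum(durations)

def synthesize_audio(phonemes, durations, f0):
    # placeholder: a real model turns phonemes + durations + pitch into samples
    return [0.0] * (sum(durations) * 200)

def deep_voice_tts(text):
    phonemes = grapheme_to_phoneme(text)
    durations = predict_durations(phonemes)
    f0 = predict_f0(phonemes, durations)
    return synthesize_audio(phonemes, durations, f0)

samples = deep_voice_tts("hello world")
```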

Tacotron: End-to-end TTS

TACOTRON: TOWARDS END-TO-END SPEECH SYNTHESIS
[Wang, et al., INTERSPEECH'17]
[Shen, et al., ICASSP'18]
Before end-to-end Tacotron appeared, researchers had already made some attempts; for comparison:
Tacotron
• Input: character
• Output: (linear) spectrogram (a linear spectrogram is only a simple transform away from the raw waveform)

First Step Towards End-to-end Parametric TTS Synthesis
• Input: phoneme
• Output: acoustic features for STRAIGHT (vocoder)
Char2wav
• Input: character
• Output: acoustic features for SampleRNN (vocoder)
The overall architecture of Tacotron includes four components: Encoder, Attention, Decoder, and Post-processing. Each component is explained in detail below.

Encoder: similar to Grapheme-to-phoneme

The Encoder input is characters, which are turned into embeddings.
Punctuation can be included in the input, so the model can learn phrasing and intonation from it.
The embeddings then go through a Pre-net, which is essentially a stack of fully connected layers with dropout.
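A minimal PyTorch sketch of such a pre-net is below; the layer sizes (256 then 128) and the 0.5 dropout rate follow the Tacotron paper's setup, but take the exact numbers here as assumptions for illustration.

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    """Fully connected layers with ReLU and dropout, as described above.
    Sizes (256 -> 128) and dropout 0.5 are assumptions for illustration."""
    def __init__(self, in_dim=256, sizes=(256, 128), dropout=0.5):
        super().__init__()
        dims = [in_dim] + list(sizes)
        self.layers = nn.ModuleList(
            [nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])]
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        for layer in self.layers:
            x = self.dropout(torch.relu(layer(x)))
        return x

# Character embeddings feed into the pre-net:
embedding = nn.Embedding(num_embeddings=70, embedding_dim=256)  # ~70 symbols assumed
chars = torch.randint(0, 70, (1, 20))       # (batch, sequence of character ids)
pre_net_out = PreNet()(embedding(chars))    # -> (1, 20, 128)
```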
After the Pre-net comes the CBHG module (Convolutional Bank + Highway network + bidirectional GRU); its internal structure is not expanded here.
In Tacotron v2 the CBHG is gone: the encoder is simplified to a stack of convolution layers followed by a bidirectional LSTM.
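A rough PyTorch sketch of that simplified v2-style encoder (convolution stack plus bidirectional LSTM); the symbol count, channel sizes, and kernel width here are assumptions for illustration, not verbatim paper hyperparameters.

```python
import torch
import torch.nn as nn

class TacotronV2StyleEncoder(nn.Module):
    """Conv stack + bidirectional LSTM, in the spirit of the Tacotron 2 encoder."""
    def __init__(self, num_symbols=70, emb_dim=512, n_convs=3, kernel=5):
        super().__init__()
        self.embedding = nn.Embedding(num_symbols, emb_dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(emb_dim, emb_dim, kernel, padding=kernel // 2),
                nn.BatchNorm1d(emb_dim),
                nn.ReLU(),
                nn.Dropout(0.5),
            )
            for _ in range(n_convs)
        ])
        self.lstm = nn.LSTM(emb_dim, emb_dim // 2, batch_first=True,
                            bidirectional=True)

    def forward(self, char_ids):                      # (batch, time)
        x = self.embedding(char_ids).transpose(1, 2)  # (batch, emb, time) for conv
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                         # back to (batch, time, emb)
        outputs, _ = self.lstm(x)                     # (batch, time, emb)
        return outputs

encoder = TacotronV2StyleEncoder()
encoded = encoder(torch.randint(0, 70, (2, 20)))      # -> (2, 20, 512)
```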

Attention: similar to Duration Prediction

• The output audio and the input text must be monotonically aligned.
A well-trained model produces a roughly diagonal (monotonic) attention alignment between decoder steps and input characters, while a badly trained model produces a scattered one. Through attention, the model effectively decides how long each input token is pronounced.
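A common way to check this during training is to plot the attention matrix and see whether it is roughly diagonal. A minimal matplotlib sketch, assuming `alignment` holds the attention weights collected while decoding (one row per decoder step, one column per input character):

```python
import matplotlib.pyplot as plt
import numpy as np

# `alignment` is assumed to come from the model; random data stands in here.
alignment = np.random.rand(80, 30)   # (decoder_steps, encoder_steps)

plt.imshow(alignment.T, origin="lower", aspect="auto", cmap="viridis")
plt.xlabel("Decoder step")
plt.ylabel("Encoder step (character)")
plt.title("Attention alignment (roughly diagonal = healthy training)")
plt.colorbar(label="attention weight")
plt.show()
```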

Decoder: similar to Audio Synthesis

The decoder takes the encoder output and generates the acoustic signal. It works almost like an ordinary seq2seq decoder: starting from an all-zero vector, the input goes through a Pre-net, then an RNN with attention over the encoder output, and then further RNN layers that produce the output frames. The difference from an ordinary seq2seq decoder, which outputs one vector per step, is that here several spectrogram frames are produced at once; the number of frames per step is a hyperparameter, denoted r.
The reason is that the output is speech: a spectrogram is very long, and one second of audio corresponds to many frames. If the RNN produced only one frame per step, the sequence would be extremely long, the RNN would struggle, and decoding would be slow and strained. With r = 3 the sequence length becomes one third of the original. In Tacotron v2, however, r = 1 works fine.
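A minimal sketch of what "r frames per step" looks like in code: the decoder RNN state is projected to r * n_mels values and reshaped into r mel frames (attention is omitted and all sizes are assumptions):

```python
import torch
import torch.nn as nn

n_mels, r = 80, 3            # assumed mel dimension and reduction factor r
rnn_dim = 256                # assumed decoder RNN size

rnn = nn.GRUCell(n_mels, rnn_dim)            # toy decoder cell (attention omitted)
to_frames = nn.Linear(rnn_dim, n_mels * r)   # project one RNN state to r frames

prev_frame = torch.zeros(1, n_mels)          # <GO> frame: all zeros
hidden = torch.zeros(1, rnn_dim)

hidden = rnn(prev_frame, hidden)
frames = to_frames(hidden).view(1, r, n_mels)  # r mel frames from a single step
prev_frame = frames[:, -1]                     # last frame feeds the next step
```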
There is a mismatch between training and testing: during training the decoder is fed the correct previous frames (teacher forcing), and a model trained only that way can fail at test time, when it must rely on its own outputs. The dropout in the decoder Pre-net helps here: it randomly discards part of the ground-truth input, so the model does not lean on it too heavily and still performs well when it runs fully on its own outputs at test time.
Since the output is a spectrogram, the model does not naturally know when to stop (unlike text generation, where an end-of-sequence token can be produced). So at each timestep an extra module, a binary classifier, looks at the decoder state and decides, based on a threshold, whether generation should end.
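A sketch of such a stop-token predictor, assuming the current decoder state is available as a 256-dimensional vector (the size is an assumption):

```python
import torch
import torch.nn as nn

stop_predictor = nn.Linear(256, 1)   # 256 = assumed decoder state size

def should_stop(decoder_state, threshold=0.5):
    """Binary 'end of utterance?' decision from the current decoder state."""
    prob = torch.sigmoid(stop_predictor(decoder_state))
    return bool(prob.item() > threshold)

state = torch.randn(1, 256)          # placeholder decoder state
done = should_stop(state)
```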

Post processing

The post-processing module takes the decoder RNN's output frames and produces another sequence of vectors.
The motivation is that the decoder generates its sequence autoregressively: each frame is produced from the previous frames, so once an early frame is wrong there is no way to go back and fix it. The post-processing module sees the whole decoder output at once and can revise it.
Therefore, there are two targets when training the model:
1. The Mel-spectrogram output by the decoder RNN should be as close to the ground truth as possible.
2. The Mel-spectrogram (or linear spectrogram) output by the CBHG post-processing module should also be as close to the ground truth as possible.
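A minimal sketch of how those two targets can be combined into one training loss; L1 is used here for illustration, and for simplicity both outputs are compared against the same mel target (in Tacotron v1 the post-processing output is actually compared against the linear spectrogram):

```python
import torch
import torch.nn.functional as F

def tacotron_loss(decoder_mel, postnet_mel, target_mel):
    """Sum of the before- and after-post-processing reconstruction losses,
    mirroring the two training targets described above."""
    loss_before = F.l1_loss(decoder_mel, target_mel)
    loss_after = F.l1_loss(postnet_mel, target_mel)
    return loss_before + loss_after
```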
Finally, the spectrogram is converted into a waveform by a vocoder (covered in more detail later; this module is usually trained separately):
Griffin-Lim in v1
WaveNet in v2
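For the Griffin-Lim case, an off-the-shelf implementation can be used directly; a minimal librosa sketch, where the input magnitude spectrogram is generated from a test tone purely for illustration:

```python
import librosa
import numpy as np
import soundfile as sf

# Build a linear magnitude spectrogram from a 1-second test tone; in practice
# this would be the spectrogram predicted by the TTS model, with matching
# STFT parameters.
n_fft, hop_length, sr = 1024, 256, 22050
tone = librosa.tone(440, sr=sr, length=sr)
linear_spec = np.abs(librosa.stft(tone, n_fft=n_fft, hop_length=hop_length))

# Griffin-Lim iteratively estimates the missing phase and inverts the STFT.
waveform = librosa.griffinlim(linear_spec, n_iter=32,
                              hop_length=hop_length, n_fft=n_fft)
sf.write("griffin_lim_demo.wav", waveform, sr)
```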

Tacotron performance

The evaluation metric is MOS (Mean Opinion Score): people listen directly to the synthesized audio and give it a score (0 to 5 points).
Version 1 [Wang, et al., INTERSPEECH'17]
Version 2 [Shen, et al., ICASSP'18]
WaveNet is much better than Griffin-Lim (which is intuitive, since WaveNet is a learned deep model).
WaveNet itself has to be trained, and what it is trained on makes a difference: training WaveNet on the spectrograms produced by the synthesis model gives the best final result, while training it on spectrograms from real human recordings works less well.


Origin blog.csdn.net/oldmao_2001/article/details/109099576