Paper reading: Tacotron 2

Summary

This paper describes Tacotron 2, a neural network system for synthesizing speech directly from text. The system is composed of a recurrent sequence-to-sequence network that maps character sequences to mel spectrograms, followed by a modified WaveNet that synthesizes time-domain waveforms from those spectrograms. The model achieves a mean opinion score (MOS) of 4.53. To validate the design choices, the authors present ablation studies of key components and assess the impact of using mel spectrograms (rather than linguistic features, duration, and F0) as the conditioning input to WaveNet. They further show that this compact acoustic intermediate representation allows the WaveNet architecture to be substantially reduced in size.

Introduction

Despite decades of research, TTS remains a hard problem, and different techniques have dominated the field over time. Concatenative synthesis with unit selection, which stitches together small pieces of pre-recorded waveforms, was the state of the art for a long time. Statistical parametric speech synthesis, which directly generates smooth trajectories of speech features that are then synthesized by a vocoder, eliminated the audible boundary artifacts of concatenative synthesis. However, compared with human speech, its output still sounds somewhat unnatural.

WaveNet, a generative model of time-domain waveforms, produces audio quality comparable to human speech and is now used in many TTS systems. However, its inputs (linguistic features, predicted log fundamental frequency, and phoneme durations) require significant domain expertise to produce, including an elaborate text-analysis system and a robust pronunciation lexicon.

Tacotron, a sequence-to-sequence architecture, generates magnitude spectrograms directly from a sequence of characters, simplifying the traditional synthesis pipeline: a single network trained from data alone replaces both the linguistic and acoustic feature extraction. To vocode the resulting spectrograms, Tacotron uses the Griffin-Lim algorithm for phase estimation, followed by an inverse short-time Fourier transform. As the authors note, this was merely a placeholder; compared with Griffin-Lim, WaveNet has a clear advantage in audio quality.

In this paper, the authors describe a unified, entirely neural approach that combines the best of the previous work: a Tacotron-style sequence-to-sequence model that generates mel spectrograms, followed by a WaveNet vocoder. Trained directly on characters and the corresponding waveforms, the model produces highly natural speech that is hard to distinguish from real human recordings.

Deep Voice 3 takes a similar approach; unlike this system, however, its naturalness has not been shown to rival that of human speech. Char2Wav is also similar and uses a neural vocoder, but it relies on different intermediate representations, and its model architecture differs substantially.

2 Model architecture

The proposed system consists of two components, as shown in the figure.
[Figure: block diagram of the Tacotron 2 system architecture]

  • A recurrent sequence-to-sequence feature prediction network with attention, which predicts a sequence of mel spectrogram frames from an input character sequence
  • A modified version of WaveNet, which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames

2.1 Intermediate feature representation

In this work, a low-level acoustic representation, the mel spectrogram, is chosen to bridge the two components. Using a representation that is easily computed from time-domain waveforms makes it possible to train the two components separately. This representation is also smoother than waveform samples and is easier to train with a squared-error loss, because it is invariant to phase within each frame.

The mel spectrogram is related to the linear-frequency spectrogram, i.e., the magnitude of the short-time Fourier transform (STFT). It is obtained by applying a nonlinear transform to the STFT's frequency axis, inspired by measurements of the human auditory system, and it summarizes the frequency content with fewer dimensions. This auditory frequency scale emphasizes details in the lower frequencies, which are critical to speech intelligibility, while de-emphasizing high-frequency details, which are dominated by fricatives and noise bursts. Because of these properties, features derived from the mel scale have been used in speech processing for many decades.

The linear spectrogram discards phase information, and algorithms such as Griffin-Lim can estimate the discarded phase, enabling conversion to the time domain via the inverse short-time Fourier transform. The mel spectrogram discards even more information, presenting a more challenging inverse problem. However, compared with the linguistic and acoustic features used by WaveNet, the mel spectrogram is a simpler, lower-level representation of the audio signal.
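As a concrete illustration of the frequency warping described above, the sketch below places the 80 filter centers of a filterbank like the one in Section 2.2 uniformly on the mel axis. The HTK-style mel formula is an assumption; the paper does not say which mel variant it uses.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel scale: compresses high frequencies, preserves low-frequency detail."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping from mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Band edges of Tacotron 2's 80-channel filterbank (125 Hz to 7.6 kHz).
lo, hi = hz_to_mel(125.0), hz_to_mel(7600.0)

# 80 filter centers spaced uniformly in mel, hence non-uniformly in Hz:
# dense at low frequencies, sparse at high frequencies.
centers_hz = [mel_to_hz(lo + (hi - lo) * i / 81) for i in range(1, 81)]
```

The widening gaps between successive entries of `centers_hz` are exactly the "fewer dimensions at high frequency" property the text describes.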

2.2 Spectrogram prediction network

As in Tacotron, mel spectrograms are computed from the short-time Fourier transform using a 50 ms frame size, a 12.5 ms frame hop, and a Hann window. The authors experimented with a 5 ms frame hop to match the conditioning frequency of the original WaveNet, but it caused significant pronunciation problems.

The STFT magnitude is transformed to the mel scale using an 80-channel mel filterbank spanning 125 Hz to 7.6 kHz, followed by log dynamic range compression. Before log compression, the filterbank output magnitudes are clipped to a minimum value of 0.01 to limit the dynamic range in the log domain.
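In code, these framing and compression choices reduce to a few constants. A minimal sketch, assuming the analysis runs at the 24 kHz rate the vocoder synthesizes at (the paper does not state the analysis sample rate explicitly):

```python
import math

SAMPLE_RATE = 24000            # Hz; assumed to match the vocoder output rate
FRAME_SIZE_MS = 50.0           # STFT window length
FRAME_HOP_MS = 12.5            # hop between successive frames

frame_length = int(SAMPLE_RATE * FRAME_SIZE_MS / 1000)   # samples per window
frame_hop = int(SAMPLE_RATE * FRAME_HOP_MS / 1000)       # samples per hop

def compress(filterbank_magnitude: float) -> float:
    """Log dynamic range compression with the 0.01 floor described above."""
    return math.log(max(filterbank_magnitude, 0.01))
```

The 0.01 floor means silence maps to log(0.01) rather than negative infinity, which keeps the squared-error training target bounded.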

The network consists of an encoder and a decoder with attention. The encoder converts a character sequence into a hidden feature representation, which the decoder consumes to predict the spectrogram. Input characters are represented with a learned 512-dimensional character embedding, which is passed through a stack of three convolutional layers, each containing 512 filters of shape 5 × 1 (i.e., each filter spans five characters), followed by batch normalization and ReLU activation. As in Tacotron, these convolutional layers model longer-term context in the input character sequence. The output of the final convolutional layer is passed to a single bidirectional LSTM layer containing 512 units (256 in each direction) to generate the encoded features.
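One consequence of this design is easy to check: stacking three width-5, stride-1 convolutions gives each encoder position a fixed span of visible characters before the bidirectional LSTM adds full-sequence context. A small sketch of that receptive-field arithmetic:

```python
def conv_stack_receptive_field(num_layers: int, kernel_size: int) -> int:
    """Receptive field of a stack of stride-1 convolutions:
    each layer extends the span by (kernel_size - 1) positions."""
    return 1 + num_layers * (kernel_size - 1)

# Tacotron 2's encoder: 3 layers of 5x1 filters.
span = conv_stack_receptive_field(3, 5)  # characters visible to each position
```

So each position in the convolutional output directly sees 13 characters, enough to capture local context such as a word and its neighbors.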

The encoder output is consumed by an attention network, which summarizes the full encoded sequence into a fixed-length context vector for each decoder output step. Location-sensitive attention is used, which extends additive attention by using the cumulative attention weights from previous decoder time steps as an additional feature. This encourages the model to move consistently forward through the input, mitigating potential failure modes in which subsequences are repeated or skipped by the decoder. Attention probabilities are computed after projecting the inputs and location features to 128-dimensional hidden representations; the location features are computed with 32 one-dimensional convolution filters of length 31.
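The bookkeeping that makes the attention "location-sensitive" is the running sum of past attention weights, which is fed back as an extra feature at each step. The toy sketch below shows only that accumulation; the hand-made scores stand in for the learned projections and the 32-filter location convolution, which are not reproduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

encoder_len = 5
cumulative = [0.0] * encoder_len  # the "location" feature, one value per input position

for step in range(3):
    # Stand-in scores that peak near the current step; in the real model these
    # come from projecting the decoder state, the encoder outputs, and a
    # convolution over `cumulative` into a 128-dim space.
    scores = [-abs(i - step) for i in range(encoder_len)]
    weights = softmax(scores)
    cumulative = [c + w for c, w in zip(cumulative, weights)]
```

After each step, `cumulative` records how much total attention each input position has received, which is what lets the model notice positions it has already covered.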

The decoder is an autoregressive recurrent neural network that predicts the mel spectrogram from the encoded input sequence, one frame at a time. The prediction from the previous step is first passed through a small pre-net containing two fully connected layers of 256 ReLU units each; acting as an information bottleneck, the pre-net is essential for learning attention. The pre-net output and the attention context vector are concatenated and passed through a stack of two unidirectional LSTM layers with 1024 units. The LSTM output is concatenated again with the attention context vector and projected through a linear transform to predict the target spectrogram frame. Finally, the predicted mel spectrogram frames are passed through a 5-layer convolutional post-net, which predicts a residual to add to the prediction. Each post-net layer consists of 512 filters of shape 5 × 1 followed by batch normalization, with tanh activation on all but the final layer.

The summed mean squared error (MSE) computed before and after the post-net is minimized to aid convergence. The authors also experimented with a log-likelihood loss by modeling the output distribution with a Mixture Density Network, to avoid assuming a constant variance over time, but found it harder to train, and it did not lead to better-sounding samples.
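The summed loss can be sketched with toy numbers (all values below are hypothetical, and a real frame has 80 mel bins rather than 4):

```python
def mse(pred, target):
    """Mean squared error over one frame."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

# Decoder output before the post-net, the post-net's predicted residual,
# and the ground-truth frame (toy 4-bin "spectrogram" values).
before = [0.2, 0.5, 0.1, 0.4]
residual = [0.05, -0.1, 0.02, 0.0]
after = [b + r for b, r in zip(before, residual)]
target = [0.25, 0.45, 0.1, 0.42]

# Both prediction paths are penalized, so the decoder cannot lean
# entirely on the post-net to fix its output.
loss = mse(before, target) + mse(after, target)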

In parallel with spectrogram frame prediction, the concatenation of the decoder LSTM output and the attention context is projected down to a scalar and passed through a sigmoid activation to predict the probability that the output sequence has completed. This "stop token" prediction lets the model dynamically decide when to terminate generation during inference, instead of always generating for a fixed duration. Specifically, generation stops at the first frame for which this probability exceeds a threshold of 0.5.
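A minimal sketch of the stop-token loop, with hand-picked logits standing in for the network's per-frame predictions:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

STOP_THRESHOLD = 0.5

def generate(stop_logits):
    """Emit one frame per logit, stopping at the first frame whose
    predicted completion probability exceeds the threshold."""
    frames = []
    for t, logit in enumerate(stop_logits):
        frames.append(t)  # stand-in for the mel frame predicted at step t
        if sigmoid(logit) > STOP_THRESHOLD:
            break
    return frames

# Toy logits: strongly negative (keep going) until the model decides it is done.
emitted = generate([-4.0, -3.0, -1.0, 2.0, 5.0])  # stops at the 4th frame
```

Since sigmoid(logit) > 0.5 exactly when the logit is positive, the threshold amounts to stopping at the first positive stop logit.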

The convolutional layers in the network are regularized with dropout of probability 0.5, and the LSTM layers with zoneout of probability 0.1. To introduce output variation at inference time, dropout with probability 0.5 is applied only to the layers of the pre-net in the autoregressive decoder.

Compared with the original Tacotron, this model uses simpler building blocks: plain LSTM and convolutional layers in the encoder and decoder, instead of Tacotron's CBHG stacks and GRU recurrent layers. No reduction factor is used on the output; each decoder step produces a single spectrogram frame.

2.3 WaveNet vocoder

A modified version of the WaveNet architecture is used to invert the mel spectrogram into time-domain waveform samples. As in the original architecture, there are 30 dilated convolution layers, grouped into three dilation cycles; the dilation rate of layer k (k = 0, ..., 29) is 2^(k mod 10). To work with the 12.5 ms frame hop of the spectrogram frames, only two upsampling layers are used in the conditioning stack instead of three.
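The dilation schedule is easy to write out explicitly. The kernel size of 2 used in the receptive-field calculation below is the original WaveNet's choice and is an assumption here:

```python
# Dilation schedule of the 30-layer stack: three identical cycles of
# rates 1, 2, 4, ..., 512, i.e. 2 ** (k mod 10) for layer k.
dilations = [2 ** (k % 10) for k in range(30)]

# Receptive field (in samples) of stacked dilated convolutions with
# kernel size 2: each layer adds its dilation rate to the span.
receptive_field = 1 + sum(dilations)
```

Restarting the cycle every 10 layers keeps per-layer dilation bounded while the stacked cycles still grow the receptive field to several thousand samples.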

Instead of predicting discretized buckets with a softmax layer, the model follows PixelCNN++ and Parallel WaveNet in using a 10-component mixture of logistic distributions (MoL) to generate 16-bit speech samples at 24 kHz. To compute the mixture-of-logistics distribution, the WaveNet stack output is passed through a ReLU activation followed by a linear projection layer that predicts the parameters (mean, log scale, mixture weight) of each mixture component. The loss is computed as the negative log-likelihood of the ground-truth sample.

3 Results

3.1 Training setup

The training process consists of first training the feature prediction network on its own, and then training the modified WaveNet on the outputs generated by that network.

To train the feature prediction network, a standard maximum-likelihood procedure is used with a batch size of 64 on a single GPU: instead of feeding the decoder its own predictions, the ground-truth result is fed in, a method known as teacher forcing. The optimizer is Adam with β1 = 0.9, β2 = 0.999, ε = 10^-6, and a learning rate of 10^-3, exponentially decayed to 10^-5 starting after 50,000 iterations. L2 regularization with weight 10^-6 is also applied.
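The schedule can be sketched as follows; the per-step decay factor is an assumption, since the paper gives only the start value, the end value, and the 50,000-iteration onset:

```python
INITIAL_LR = 1e-3
FINAL_LR = 1e-5
DECAY_START = 50_000
DECAY_RATE = 0.9999  # per-step factor; an assumption, not from the paper

def learning_rate(step: int) -> float:
    """Constant LR until DECAY_START, then exponential decay floored at FINAL_LR."""
    if step < DECAY_START:
        return INITIAL_LR
    decayed = INITIAL_LR * DECAY_RATE ** (step - DECAY_START)
    return max(decayed, FINAL_LR)
```

Any per-step factor gives the same qualitative shape: flat, then geometric decay, then a floor at the paper's final value.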

The modified WaveNet is then trained on the ground-truth-aligned predictions of the feature prediction network. That is, the prediction network is run in teacher-forcing mode, where each predicted frame is conditioned on the encoded input sequence and the corresponding previous frame of the ground-truth spectrogram; this ensures that each predicted frame exactly aligns with the target waveform samples.

WaveNet training uses a batch size of 128 distributed across 32 GPUs, again with the Adam optimizer (β1 = 0.9, β2 = 0.999, ε = 10^-8) and a fixed learning rate of 10^-4. It helps quality to maintain an exponentially weighted moving average of the network parameters over recent updates, with a decay of 0.9999, and to use this average for inference. To speed up convergence, the waveform targets are scaled by a factor of 127.5, which brings the initial outputs of the mixture-of-logistics layer closer to the final distributions.
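A minimal sketch of the parameter averaging, with a toy two-parameter "model" (the 0.9999 decay is from the paper; everything else is illustrative):

```python
DECAY = 0.9999  # averaging decay used for WaveNet inference weights

def ema_update(shadow, params, decay=DECAY):
    """One exponential-moving-average step over the model parameters.
    The shadow copy drifts slowly toward the current training weights."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]

# Toy run: the training weights stay fixed, the shadow creeps toward them.
shadow = [0.0, 0.0]
for _ in range(5):
    shadow = ema_update(shadow, [1.0, 2.0])
```

With a decay this close to 1, the shadow weights effectively average over roughly the last 10,000 updates, smoothing out step-to-step noise.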

The models are trained on a North American English dataset containing 24.6 hours of speech from a single professional female speaker. All text in the dataset is normalized; for example, "16" is written as "sixteen".

3.2 Evaluation

When generating speech at inference time, there is no ground truth available, so unlike in training, the predicted output from the previous step is fed directly into the decoder.

Next up, emmm, the comparative experiments.
[Figures: comparative experiment results; spectrogram prediction network (Tacotron 2)]



Origin: blog.csdn.net/u010095372/article/details/104258292