Paper Translation - Speech Synthesis: Tacotron 2

Original paper address: https://arxiv.org/abs/1712.05884

 

NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS
Jonathan Shen1, Ruoming Pang1, Ron J. Weiss1, Mike Schuster1, Navdeep Jaitly1, Zongheng Yang ∗2, Zhifeng Chen1, Yu Zhang1, Yuxuan Wang1, RJ Skerry-Ryan1, Rif A. Saurous1, Yannis Agiomyrgiannakis1, and Yonghui Wu1
1Google, Inc.
2University of California, Berkeley

{jonathanasdf,rpang,yonghui}@google.com

 

Summary

This paper describes Tacotron 2, a neural network architecture for synthesizing speech directly from text. The system consists of two parts: a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel spectrograms, followed by a modified WaveNet model acting as a vocoder that synthesizes time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, while professionally recorded speech achieves an MOS of 4.58. To validate our design choices, we present ablation studies of key components of the system and evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic, duration, and F0 features. We further show that using this compact acoustic intermediate representation allows the WaveNet architecture to be significantly simplified.

Index Terms — Tacotron 2, WaveNet, text-to-speech

1. Introduction

Generating natural speech from text (text-to-speech synthesis, TTS) has been studied for decades [1] and remains a challenging task, with the leading approaches in the field shifting over time. Unit selection with concatenative synthesis, a technique that stitches together small fragments of pre-recorded speech waveforms [2, 3], represented the state of the art for many years. Statistical parametric speech synthesis [4, 5, 6, 7] instead directly generates smooth trajectories of speech features that are then passed to a vocoder to synthesize speech, which fixes many of the boundary-artifact problems of concatenative synthesis. However, the speech produced by systems built with either method often sounds muffled and unnatural compared to human speech.

WaveNet [8], a generative model of time-domain waveforms, produces audio whose fidelity begins to rival that of real human speech, and it has already been applied in several complete speech synthesis systems [9, 10, 11]. However, WaveNet's inputs (linguistic features, predicted log fundamental frequency (F0), and phoneme durations) require significant domain expertise to produce, including an elaborate text-analysis system and a robust lexicon (pronunciation guide).

Tacotron [12] is a seq2seq architecture [13] that generates magnitude spectrograms from a sequence of characters; it replaces the separate linguistic and acoustic feature generation modules with a single neural network trained from data alone, thereby simplifying the traditional speech synthesis pipeline. To synthesize a waveform from the predicted magnitude spectrogram, Tacotron uses the Griffin-Lim algorithm [14] to estimate the phase, followed by an inverse short-time Fourier transform. The authors note that Griffin-Lim produces characteristic artifacts and lower audio fidelity than approaches like WaveNet, so it was only a temporary measure to be replaced by a neural vocoder in the future.

In this paper, we describe a unified, entirely neural approach to speech synthesis that combines the strengths of the two previous approaches: a sequence-to-sequence Tacotron-style model [12] that generates mel spectrograms, followed by a modified WaveNet vocoder [10, 15]. Trained directly on character sequences and corresponding speech waveforms, the system learns to synthesize speech whose naturalness approaches that of real human speech.

Deep Voice 3 [11] describes a similar approach; however, unlike our system, its speech fidelity has not been shown to rival that of human speech. Char2Wav [16] proposes another similar approach to end-to-end TTS using a neural vocoder, but it relies on a different intermediate feature representation (traditional vocoder features), and its model architecture differs significantly from ours.

 

2. Model Architecture

Our proposed system consists of two components, shown in Figure 1:

(1) a recurrent sequence-to-sequence feature prediction network with attention, which predicts a sequence of mel spectrogram frames from an input character sequence;

(2) a modified version of WaveNet, which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames.

Figure 1: Tacotron 2 system architecture.

2.1. Intermediate feature representation

In this work, we choose a low-level acoustic representation, the mel-frequency spectrogram, to bridge the two components of the system. Mel spectrograms are easily computed from time-domain waveforms, and using a representation with this property allows us to train the two components separately. A mel spectrogram is also smoother than waveform samples and is easier to train with a mean squared error (MSE) loss because it is invariant to phase within each frame.

A mel-frequency spectrogram is related to the linear-frequency spectrogram, i.e., the magnitude of the short-time Fourier transform (STFT). It is obtained by applying a nonlinear transform, inspired by measured responses of the human auditory system, to the frequency axis of the STFT, summarizing the frequency content with fewer dimensions. Such an auditory frequency scale emphasizes low-frequency details of speech, which are critical to intelligibility, while de-emphasizing high-frequency details, which are dominated by fricatives and other noise bursts and generally do not need to be modeled with high fidelity. Because of these properties, features derived from the mel scale have been widely used in speech recognition for many decades [17].

While linear spectrograms discard phase information (and are therefore lossy), algorithms such as Griffin-Lim [14] can estimate the discarded phase, and an inverse short-time Fourier transform then converts the spectrogram back into a time-domain waveform. Mel spectrograms discard even more information, which makes the inverse waveform synthesis problem more challenging. However, compared with the linguistic and acoustic features used by WaveNet, the mel spectrogram is a simpler, lower-level acoustic representation of the audio signal, so training a WaveNet-like neural vocoder conditioned on mel spectrograms should be more straightforward. We will show that high-quality audio can indeed be generated from mel spectrograms using a modified WaveNet architecture.

2.2. Spectrogram prediction network

As in Tacotron, the mel spectrograms are computed with a short-time Fourier transform (STFT) using a 50 ms frame length, a 12.5 ms frame shift, and a Hann window. The STFT magnitudes are then mapped to the mel scale using an 80-channel mel filter bank spanning 125 Hz to 7.6 kHz, followed by log dynamic-range compression. Before log compression, the filter bank output magnitudes are clipped to a minimum value of 0.01 in order to limit their dynamic range in the log domain.
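As a concrete illustration, here is a minimal sketch of this feature extraction using librosa, assuming a 24 kHz sampling rate (taken from the vocoder description in Section 2.3); the authors' exact implementation is not published, so treat this as an approximation:

```python
import numpy as np
import librosa

def mel_spectrogram(wav, sr=24000):
    """Log-mel spectrogram roughly as described in Sec. 2.2 (sr is an assumption)."""
    n_fft = int(0.050 * sr)      # 50 ms frame length
    hop = int(0.0125 * sr)       # 12.5 ms frame shift
    stft = librosa.stft(wav, n_fft=n_fft, hop_length=hop, window="hann")
    magnitude = np.abs(stft)
    # 80-channel mel filter bank spanning 125 Hz - 7.6 kHz
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=80,
                                    fmin=125.0, fmax=7600.0)
    mel = mel_basis @ magnitude
    # clip at 0.01 to limit dynamic range, then log-compress
    return np.log(np.maximum(mel, 0.01))
```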

The spectrogram prediction network consists of an encoder and an attention-based decoder. The encoder converts the character sequence into a hidden feature representation, which the decoder consumes to predict the spectrogram. Input characters are represented as learned 512-dimensional character embeddings, which are passed through a stack of 3 convolutional layers, each containing 512 filters of shape 5 × 1 (i.e., each filter spans 5 characters), followed by batch normalization [18] and a ReLU activation. As in Tacotron, these convolutional layers model longer-term context (e.g., N-grams) in the input character sequence. The output of the final convolutional layer is passed into a single bidirectional [19] LSTM [20] layer containing 512 units (256 in each direction) to generate the encoded features.
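The following PyTorch sketch illustrates one plausible implementation of this encoder; module and parameter names are our own, and this is not the authors' code:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the Tacotron 2 encoder: 512-d character embedding,
    3 conv layers (512 filters, width 5) with batch norm + ReLU,
    then a bidirectional LSTM with 256 units per direction."""
    def __init__(self, n_symbols, dim=512):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(dim),
                nn.ReLU(),
                nn.Dropout(0.5),   # conv layers are regularized with dropout 0.5
            ) for _ in range(3)
        ])
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, char_ids):               # char_ids: (batch, T)
        x = self.embedding(char_ids)            # (batch, T, 512)
        x = x.transpose(1, 2)                   # (batch, 512, T) for Conv1d
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                   # back to (batch, T, 512)
        outputs, _ = self.lstm(x)               # (batch, T, 512) encoded features
        return outputs
```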

The encoder output is consumed by an attention network, which summarizes the full encoded sequence as a fixed-length context vector at each decoder output step. We use the location-sensitive attention of [21], which extends additive attention [22] to use the cumulative attention weights from previous decoder time steps as an additional feature. This encourages the model to move forward consistently through the input, mitigating potential failure modes in which subsequences are repeated or skipped during decoding. Location features are computed with 32 one-dimensional convolution filters of length 31; the inputs and location features are then projected to a 128-dimensional hidden representation, from which the attention probabilities are computed.
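A simplified sketch of such a location-sensitive attention module is shown below. The dimensions follow the text (128-dimensional projection, 32 filters of length 31), while the remaining details (bias terms, masking of padded positions) are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """Simplified location-sensitive attention (Chorowski et al. [21]):
    additive attention extended with features computed from the cumulative
    attention weights of previous decoder steps."""
    def __init__(self, query_dim=1024, memory_dim=512, attn_dim=128,
                 n_location_filters=32, location_kernel=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(memory_dim, attn_dim, bias=False)
        self.location_conv = nn.Conv1d(1, n_location_filters, location_kernel,
                                       padding=(location_kernel - 1) // 2, bias=False)
        self.location_layer = nn.Linear(n_location_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, cumulative_weights):
        # query: (B, query_dim); memory: (B, T, memory_dim)
        # cumulative_weights: (B, T) -- sum of attention weights so far
        loc = self.location_conv(cumulative_weights.unsqueeze(1))    # (B, 32, T)
        loc = self.location_layer(loc.transpose(1, 2))               # (B, T, 128)
        energies = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1) + self.memory_layer(memory) + loc
        )).squeeze(-1)                                               # (B, T)
        weights = F.softmax(energies, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)  # (B, memory_dim)
        return context, weights
```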

The decoder is an autoregressive recurrent neural network that predicts the output spectrogram from the encoded input sequence one frame at a time. The spectrogram frame predicted at the previous step is first passed through a small "pre-net" containing 2 fully connected layers of 256 hidden ReLU units; we found that, acting as an information bottleneck, the pre-net is essential for learning attention. The pre-net output and the attention context vector are concatenated and passed through a stack of 2 unidirectional LSTM layers with 1024 units. The LSTM output is again concatenated with the attention context vector and projected through a linear transform to predict the target spectrogram frame. Finally, the predicted frames are passed through a 5-layer convolutional "post-net", which predicts a residual that is added to the pre-post-net prediction to improve the overall spectrogram reconstruction. Each post-net layer consists of 512 filters of shape 5 × 1 with batch normalization, followed by a tanh activation on all but the final layer.
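A sketch of the pre-net and post-net described above might look as follows (a simplified illustration, not the reference implementation; the always-on pre-net dropout reflects the regularization paragraph further below):

```python
import torch.nn as nn

class PreNet(nn.Module):
    """2 fully connected layers of 256 ReLU units; dropout 0.5 stays on
    even at inference time, as noted later in Sec. 2.2."""
    def __init__(self, in_dim=80, hidden=256, p=0.5):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(in_dim, hidden),
                                     nn.Linear(hidden, hidden)])
        self.p = p

    def forward(self, x):
        for layer in self.layers:
            x = nn.functional.dropout(nn.functional.relu(layer(x)),
                                      p=self.p, training=True)  # always on
        return x

class PostNet(nn.Module):
    """5 conv layers, 512 filters of width 5, batch norm; tanh on all but the
    last layer. Predicts a residual added to the decoder's spectrogram."""
    def __init__(self, n_mels=80, dim=512):
        super().__init__()
        dims = [n_mels, dim, dim, dim, dim, n_mels]
        blocks = []
        for i in range(5):
            block = [nn.Conv1d(dims[i], dims[i + 1], kernel_size=5, padding=2),
                     nn.BatchNorm1d(dims[i + 1])]
            if i < 4:
                block.append(nn.Tanh())
            blocks.append(nn.Sequential(*block))
        self.convs = nn.ModuleList(blocks)

    def forward(self, mel):          # mel: (B, n_mels, T)
        x = mel
        for conv in self.convs:
            x = conv(x)
        return mel + x               # residual connection
```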

We minimize the summed mean squared error (MSE) of the predictions before and after the post-net to aid convergence. We also experimented with a mixture density network [23, 24], minimizing the negative log-likelihood of the output distribution in order to avoid assuming a constant variance over time, but found that this made training more difficult and did not lead to better-sounding samples.

In parallel to spectrogram frame prediction, the concatenation of the decoder LSTM output and the attention context vector is projected to a scalar and passed through a sigmoid activation to predict the probability that the output sequence has completed. This "stop token" prediction allows the model to dynamically decide when to terminate generation at inference time, rather than always generating for a fixed duration.
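In code, the stop-token head is simply a linear projection of that concatenation followed by a sigmoid; the 0.5 decision threshold in the comment is an assumption, since the paper does not state one:

```python
import torch
import torch.nn as nn

class StopTokenPredictor(nn.Module):
    """Projects the concatenation of the decoder LSTM output (1024) and the
    attention context (512) to a scalar 'stop' probability."""
    def __init__(self, in_dim=1024 + 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, 1)

    def forward(self, lstm_out, context):
        logit = self.proj(torch.cat([lstm_out, context], dim=-1))
        # at inference, stop once this exceeds a threshold (e.g. 0.5, an assumption)
        return torch.sigmoid(logit)
```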

The convolutional layers in the network are regularized using dropout [25] with probability 0.5, and the LSTM layers are regularized using zoneout [26] with probability 0.1. To introduce output variation at inference time, dropout with probability 0.5 is applied only to the layers in the pre-net of the autoregressive decoder.

Compared with the original Tacotron, our model uses simpler building blocks: plain LSTM and convolutional layers in the encoder and decoder instead of the "CBHG" stacks and GRU recurrent layers. We do not use a "reduction factor" at the decoder output, i.e., each decoder step corresponds to a single spectrogram frame.

2.3. WaveNet Vocoder

We use a modified version of the WaveNet architecture from [8] to invert the mel spectrogram feature representation into time-domain waveform samples. As in the original architecture, there are 30 dilated convolutional layers grouped into 3 dilation cycles, i.e., the dilation rate of layer k (k = 0, ..., 29) is 2^(k mod 10).
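The dilation schedule can be written in one line; the short snippet below simply enumerates it:

```python
# Dilation pattern of the 30-layer WaveNet stack: 3 cycles of
# dilations 1, 2, 4, ..., 512 (i.e. dilation of layer k is 2 ** (k % 10)).
dilations = [2 ** (k % 10) for k in range(30)]
print(dilations[:12])   # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2]
```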

However, instead of predicting discretized buckets with a softmax layer as in the original WaveNet, we follow PixelCNN++ [27] and a recent improved version of WaveNet [28] and use a 10-component mixture of logistic distributions (MoL) to generate 16-bit samples at 24 kHz. To compute the mixture parameters, the output of the WaveNet stack is passed through a ReLU activation followed by a linear projection that predicts the parameters (mean, log scale, mixture weight) of each component. The loss is computed as the negative log-likelihood of the ground-truth sample.
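A minimal sketch of this output head is given below; the 512-channel stack output is an assumption, since the paper does not state the channel width of the residual stack:

```python
import torch
import torch.nn as nn

class MoLOutput(nn.Module):
    """Mixture-of-logistics output head sketch: the WaveNet stack output goes
    through ReLU and a linear projection to 3 parameters per component
    (mean, log scale, mixture weight logit) for 10 components."""
    def __init__(self, stack_dim=512, n_mix=10):
        super().__init__()
        self.proj = nn.Linear(stack_dim, 3 * n_mix)
        self.n_mix = n_mix

    def forward(self, h):                          # h: (B, T, stack_dim)
        params = self.proj(torch.relu(h))
        logit_probs, means, log_scales = params.chunk(3, dim=-1)
        return logit_probs, means, log_scales      # each (B, T, n_mix)
```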

The original WaveNet is conditioned on linguistic features, phoneme durations, and log F0 at a 5 ms frame rate. We found in our experiments that conditioning at this frame rate led to significant pronunciation problems when using predicted spectrogram frames, so we modified the WaveNet conditioning stack to use 2 upsampling layers implemented with transposed convolutions, matching the 12.5 ms frame shift.
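The sketch below illustrates such a two-layer transposed-convolution upsampler. The split of the total 300x upsampling factor (12.5 ms at 24 kHz) into strides of 10 and 30, and the kernel sizes, are assumptions; the paper only states that 2 upsampling layers are used:

```python
import torch.nn as nn

class ConditioningUpsampler(nn.Module):
    """Stretches 80-channel mel frames (12.5 ms hop = 300 samples at 24 kHz)
    to the waveform sample rate with two transposed convolutions."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose1d(n_mels, n_mels, kernel_size=20, stride=10, padding=5),
            nn.ConvTranspose1d(n_mels, n_mels, kernel_size=60, stride=30, padding=15),
        )

    def forward(self, mel):        # mel: (B, 80, T_frames)
        return self.up(mel)        # (B, 80, T_frames * 300)
```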

 

3. Experiments & Results

3.1. Training setup

Our training process first trains the feature prediction network on its own, and then trains a modified WaveNet on the outputs generated by that network.

We train the feature prediction network on a single GPU with a maximum-likelihood procedure (feeding in the correct output, rather than the predicted output, on the decoder side; this is also known as teacher forcing), using a batch size of 64 and the Adam optimizer with β1 = 0.9, β2 = 0.999, ε = 10⁻⁶. The learning rate starts at 10⁻³ and decays exponentially to 10⁻⁵ after 50,000 iterations, and we apply L2 regularization with weight 10⁻⁶.
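A hedged sketch of these optimizer settings in PyTorch follows; the decay horizon after the 50k-step mark is an assumption, since only the start and end learning rates are given, and `model` is a stand-in for the full feature prediction network:

```python
import torch

model = torch.nn.Linear(10, 10)   # stand-in for the feature prediction network

# Adam with the hyperparameters from Sec. 3.1; weight_decay approximates the
# L2 regularization with weight 1e-6.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-6,
                             weight_decay=1e-6)

def learning_rate(step, start=1e-3, end=1e-5,
                  decay_start=50_000, decay_steps=50_000):
    """Exponential decay from 1e-3 to 1e-5 beginning at 50k steps; the
    decay_steps horizon is an assumption, not stated in the paper."""
    if step < decay_start:
        return start
    t = min((step - decay_start) / decay_steps, 1.0)
    return start * (end / start) ** t
```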

We then train the modified WaveNet on the outputs of the feature prediction network aligned with the ground truth: these predictions are generated in teacher-forcing mode, so each predicted spectrogram frame is exactly aligned with the corresponding waveform samples. For this training we again use the Adam optimizer with β1 = 0.9, β2 = 0.999, ε = 10⁻⁸ and a fixed learning rate of 10⁻⁴, with a batch size of 128 distributed across 32 GPUs with synchronous updates. To help the weights converge to a well-balanced average of the most recent updates, we maintain an exponentially weighted moving average of the network parameters with a decay rate of 0.9999, and this averaged version is used for inference (see [29]). To speed up convergence, we scale the waveform targets by a factor of 127.5, which brings the initial outputs of the mixture-of-logistics layer closer to the final distribution.
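The parameter averaging can be implemented as a simple exponential moving average updated after each optimizer step, as sketched below; `model` here is only a stand-in for the WaveNet vocoder:

```python
import copy
import torch

model = torch.nn.Linear(10, 10)    # stand-in for the WaveNet vocoder
ema_model = copy.deepcopy(model)   # averaged copy used for inference
decay = 0.9999

@torch.no_grad()
def update_ema():
    """Call after each optimizer step: ema <- decay * ema + (1 - decay) * param."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```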

We train all models on an internal US English dataset containing 24.6 hours of speech from a single professional female speaker. All text in the dataset is spelled out, e.g. "16" is written as "sixteen", i.e. all models are trained on normalized text.

3.2. Evaluation

When generating speech at inference time there is no ground truth target, so unlike the teacher-forcing setup used during training, the predicted output of the previous step is fed back into the decoder.

We randomly selected 100 fixed examples from the test set as an evaluation set. Audio generated on this set was sent to a human rating service similar to Amazon Mechanical Turk, where each sample is rated by at least 8 raters on a scale from 1 to 5 in 0.5-point increments, from which a subjective mean opinion score (MOS) is computed. Each evaluation is conducted independently of the others, so the outputs of two different models are not directly compared when raters assign their scores.
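For reference, an MOS with an error bar of the kind reported below can be computed as in this small sketch; the normal-approximation 95% interval is an assumption, since the paper does not describe its interval construction:

```python
import numpy as np

def mos_with_ci(scores):
    """Mean opinion score with a normal-approximation 95% confidence interval."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
    return mean, half_width

# e.g. mos_with_ci([4.5, 5.0, 4.0, 4.5, 5.0, 4.5, 4.0, 4.5])
```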

Note that while the instances in the evaluation set never appear in the training set, there are still some recurring patterns and common words shared between the two sets. This may yield a higher MOS than would be obtained on a set of fully random words, but it allows us to easily compare against the ground truth. Since all of the systems we compare are trained on the same data, relative comparisons between them remain meaningful.

Table 1 shows the comparison of our method against various prior systems. In order to better isolate the effect of using mel spectrograms as features, we also compare to a WaveNet conditioned on linguistic features, trained with a similarly modified WaveNet architecture as described above. We further compare with the original Tacotron, which predicts linear spectrograms and synthesizes with Griffin-Lim, as well as with concatenative [30] and parametric [31] baseline systems, both of which have been used in production at Google. We find that the proposed system significantly outperforms all other TTS systems, and produces audio that is comparable to the ground truth.

Table 1: MOS comparison of Tacotron 2 with prior TTS systems.

We also conducted a side-by-side evaluation between audio synthesized by our system and the ground truth, in which raters were asked to give a score ranging from -3 (synthesized audio much worse than ground truth) to 3 (synthesized audio much better than ground truth). The overall mean score of −0.270 ± 0.155 shows that raters have a small but statistically significant preference for the ground truth over our results. See Figure 2 for a detailed breakdown. Comments from the raters indicate that occasional mispronunciation by our system is the primary reason for this preference.

Figure 2: Detailed breakdown of the side-by-side evaluation of Tacotron 2 against ground truth.

We also manually analyzed the error modes of our system on the 100-sentence test set from Appendix E of [11]. In the audio synthesized from these sentences there were no repeated words, 6 mispronunciations, 1 skipped word, and 23 cases of unnatural prosody, such as emphasis on the wrong syllables or words, or unnatural pitch. On this set our model achieves an MOS of 4.354. These results show that while the system can reliably attend to the entire input, there is still room for improvement in prosody modeling.

Finally, we evaluated audio synthesized from 37 news headlines to test the system's ability to generalize to out-of-domain text. In this evaluation our model receives an MOS of 4.148 ± 0.124, while WaveNet conditioned on linguistic features receives 4.137 ± 0.128. A side-by-side evaluation of audio generated by the two systems also shows that they are on a par, with a statistically insignificant preference of 0.142 ± 0.338 towards our system. Examination of rater comments shows that our neural system tends to produce speech that feels more natural and human-like, but it sometimes runs into pronunciation difficulties. This result points to a challenge faced by end-to-end approaches: they require training on data that covers the intended target domain.

3.3. Ablation studies

3.3.1. Predicted features versus ground truth

While the two components of our model are trained separately, the WaveNet component depends on the predicted features of the feature prediction network for training. An alternative is to train WaveNet on mel spectrograms extracted from the ground truth audio, which allows it to be trained in isolation from the feature prediction network. We explore this possibility in Table 2.

Table 2: Comparison of training WaveNet on predicted features versus ground truth mel spectrograms.

As expected, performance is best when the features used for training match those used for inference. However, when the model is trained on ground truth mel spectrograms and then used to synthesize from predicted features, the score is worse than in the opposite configuration. This may be because a model trained on ground truth features never learns to handle the inherent noise present in the predicted features at inference time. The gap between the two configurations could potentially be reduced through data augmentation.

3.3.2. Linear Spectrogram

To compare against the use of mel spectrograms, we trained the feature prediction network to predict linear-frequency spectrograms instead, making it possible to invert the spectrogram with the Griffin-Lim algorithm.

Table 3: Comparison of mel and linear spectrograms as the intermediate representation.

As noted in [10], WaveNet produces much higher quality audio than Griffin-Lim. However, there is not much difference between using linear-scale and mel-scale spectrograms. As such, the mel spectrogram appears to be the better choice, since it is a more compact representation. It would be interesting future work to explore the trade-off between the number of mel frequency channels and audio quality (MOS).

3.3.3. Post-processing network

Since future frames cannot be used before they have been decoded, we use a convolutional post-processing network to incorporate both past and future frames after decoding in order to improve the feature predictions. However, because WaveNet already contains convolutional layers, one may wonder whether the post-net is still necessary when a WaveNet vocoder is used. To answer this question, we compared our model with and without the post-net and found that without it the model only obtains an MOS of 4.429 ± 0.071, compared with 4.526 ± 0.066 with it, meaning that empirically the post-net remains an important part of the network design.

3.3.4. Simplifying WaveNet

A distinguishing feature of WaveNet is its use of dilated convolutions, which make the receptive field grow exponentially with the number of layers. We hypothesize that, since mel spectrograms are a much closer representation of the waveform than linguistic features and already capture long-term dependencies across frames, a shallower network with a smaller receptive field may also solve the problem satisfactorily. We evaluate WaveNet models with different receptive field sizes and numbers of layers to test this hypothesis.

As shown in Table 4, we find that compared with the baseline model with 30 layers of convolution and a receptive field of 256 ms, the model can still generate high-quality audio when reduced to 12 layers and a receptive field of 10.5 ms. This result confirms the observation in [9] that a large receptive field is not essential for audio quality. However, we speculate that it is the choice of the mel spectrogram as conditioning input that allows this reduction in complexity.

On the other hand, if we remove the dilated convolutions entirely, the receptive field becomes two orders of magnitude smaller than in the baseline and the audio quality drops dramatically, even though the stack is just as deep. This indicates that the model needs sufficient context along the time axis in order to generate high-quality sound.

Table 4: WaveNet with various numbers of layers and receptive field sizes.

4. Conclusion

This paper describes Tacotron 2, a system that integrates two components: a sequence-to-sequence recurrent network with attention that predicts mel spectrograms, and a modified WaveNet vocoder. The resulting system synthesizes speech with Tacotron-level prosody and WaveNet-level audio quality. It can be trained directly from data without relying on complex feature engineering, and achieves state-of-the-art sound quality that is very close to that of natural human speech.

 

5. Acknowledgments

The authors thank Jan Chorowski, Samy Bengio, Aaron van den Oord, and the WaveNet and Machine Hearing teams for their helpful discussions and advice, as well as Heiga Zen and the Google TTS team for their feedback and assistance with running evaluations.

 

6. References

[1] P. Taylor, Text-to-Speech Synthesis, Cambridge University Press, New York, NY, USA, 1st edition, 2009.
[2] A. J. Hunt and A. W. Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” in Proceedings of ICASSP, 1996, pp. 373–376.
[3] A. W. Black and P. Taylor, “Automatically clustering similar units for unit selection in speech synthesis,” in Proceedings of Eurospeech, September 1997, pp. 601–604.
[4] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” in Proceedings of ICASSP, 2000, pp. 1315–1318.
[5] H. Zen, K. Tokuda, and A. W. Black, “Statistical parametric speech synthesis,” Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
[6] H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech synthesis using deep neural networks,” in Proceedings of ICASSP, 2013, pp. 7962–7966.
[7] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, “Speech synthesis based on hidden Markov models,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1234–1252, 2013.
[8] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” CoRR, vol. abs/1609.03499, 2016.
[9] S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta, and M. Shoeybi, “Deep Voice: Real-time neural text-to-speech,” CoRR, vol. abs/1702.07825, 2017.
[10] S. O. Arik, G. F. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, “Deep Voice 2: Multi-speaker neural text-to-speech,” CoRR, vol. abs/1705.08947, 2017.
[11] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: 2000-speaker neural text-to-speech,” CoRR, vol. abs/1710.07654, 2017.
[12] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” in Proceedings of Interspeech, Aug. 2017.
[13] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in NIPS, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., 2014, pp. 3104–3112.
[14] D. W. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech and Signal Processing, pp. 236–243, 1984.
[15] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent WaveNet vocoder,” in Proceedings of Interspeech, 2017, pp. 1118–1122.
[16] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, “Char2Wav: End-to-end speech synthesis,” in Proceedings of ICLR, 2017.
[17] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[18] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of ICML, 2015, pp. 448–456.
[19] M. Schuster and K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, Nov. 1997.
[20] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[21] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[22] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of ICLR, 2015.
[23] C. M. Bishop, “Mixture density networks,” Tech. Rep., 1994.
[24] M. Schuster, On Supervised Learning from Sequential Data with Applications for Speech Recognition, Ph.D. thesis, Nara Institute of Science and Technology, 1999.
[25] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[26] D. Krueger, T. Maharaj, J. Kramar, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, H. Larochelle, A. Courville, et al., “Zoneout: Regularizing RNNs by randomly preserving hidden activations,” in Proceedings of ICLR, 2017.
[27] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications,” in Proceedings of ICLR, 2017.
[28] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, “Parallel WaveNet: Fast high-fidelity speech synthesis,” CoRR, vol. abs/1711.10433, Nov. 2017.
[29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of ICLR, 2015.
[30] X. Gonzalvo, S. Tazari, C.-a. Chan, M. Becker, A. Gutkin, and H. Silen, “Recent advances in Google real-time HMM-driven unit selection synthesizer,” in Proceedings of Interspeech, 2016.
[31] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak, “Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices,” in Proceedings of Interspeech, 2016.
