DURIAN: DURATION INFORMED ATTENTION NETWORK FOR MULTIMODAL SYNTHESIS (Paper Notes)

-1. Description

DURIAN: DURATION INFORMED ATTENTION NETWORK FOR MULTIMODAL SYNTHESIS

DurIAN: multimodal synthesis in which the network is explicitly told the durations

  • This paper came after Tacotron, so alignment should be easier; I hope training will be faster as well.

0. Summary

In this paper, we propose a generic and robust multimodal synthesis system that generates highly natural speech and facial expressions simultaneously. The key component of the system is the Duration Informed Attention Network (DurIAN), an autoregressive model in which the alignment between the input text and the output acoustic features is inferred explicitly from a duration model. This is different from the attention mechanism used in existing end-to-end systems, and as a result the various unavoidable artifacts of end-to-end speech synthesis systems such as Tacotron can be fundamentally avoided. In addition, DurIAN can be used to generate high-quality facial expressions that are synchronized with the generated speech, with or without parallel speech and face data. To improve the efficiency of speech generation, we also propose a parallel generation strategy based on a multi-band WaveRNN model. The proposed multi-band WaveRNN effectively reduces the total computational complexity from 9.8 GFLOPS to 3.6 GFLOPS and can generate audio 6 times faster than real time on a single CPU core. We show that DurIAN can produce highly natural speech on par with the current state-of-the-art end-to-end systems, while avoiding the word skipping/repeating errors of those systems. Finally, a simple and effective method for fine-grained control of the expressiveness of speech and facial expressions is introduced.

1. Introduction

Traditional speech synthesis methods, including concatenative methods [1, 2] and statistical parametric systems [3, 4, 5], are all based on acoustic feature analysis and synthesis. These methods are still widely used in industrial applications thanks to their robustness and efficiency, but the naturalness of the generated speech is limited. End-to-end methods [6, 7, 8, 9, 10, 11] have recently received much attention because their synthesis results are significantly more natural and the training process is simplified. Unfortunately, existing end-to-end systems lack robustness when generating speech: they produce unpredictable artifacts in which random words of the source text are repeated or skipped in the generated speech [7, 11], especially when synthesizing out-of-domain text. For multimodal synthesis tasks, synchronization between speech and facial expressions is another challenge faced by end-to-end systems. Speech and facial features could be modeled jointly in an end-to-end generative model, but this requires a large amount of paired speech and facial expression data for training. Such paired data is expensive to collect, and speech and face data from different sources cannot easily be combined for the desired voice and avatar scenario.

  • [1] A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 1, pp. 373–376, IEEE, 1996.
  • [2] A. W. Black and P. A. Taylor, "Automatically clustering similar units for unit selection in speech synthesis," 1997.
  • [3] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1315–1318, IEEE, 2000.
  • [4] H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
  • [5] H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7962–7966, IEEE, 2013.
  • [6] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., "Tacotron: Towards end-to-end speech synthesis," arXiv preprint arXiv:1703.10135, 2017.
  • [7] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783, IEEE, 2018.
  • [8] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou, "Close to human quality TTS with Transformer," arXiv preprint arXiv:1809.08895, 2018.
  • [9] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," arXiv preprint arXiv:1807.07281, 2018.
  • [10] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, "Char2Wav: End-to-end speech synthesis," 2017.
  • [11] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," arXiv preprint arXiv:1710.07654, 2017.

It is somewhat similar to FastSpeech, but FastSpeech is a distillation approach: it imitates the teacher's attention alignment through distillation, so its upper bound is essentially Transformer TTS plus discretized phoneme durations.

DurIAN, in contrast, takes forced-alignment durations as explicit information and then applies attention on top for fine adjustment, so each part does its own job. The amount of information also increases: every frame corresponds to a concrete phoneme duration and position, rather than just the single, monotonically increasing position signal that FastSpeech adds.

 

In this paper, we propose the Duration Informed Attention Network (DurIAN), a generic multimodal synthesis framework that produces highly natural and robust speech and facial expressions. DurIAN combines the traditional parametric system with the latest end-to-end systems, achieving naturalness and robustness in speech generation at the same time. The latest end-to-end systems surpass traditional parametric systems in several respects, including the use of an encoder to replace hand-designed linguistic features, an autoregressive model to mitigate the over-smoothing problem in prediction, a neural vocoder to replace the traditional source-filter vocoder, and an attention mechanism for end-to-end training and optimization.

 

Our observation and analysis show that the instability of existing end-to-end systems is caused by the end-to-end attention mechanism. The core idea behind DurIAN is therefore to replace the end-to-end attention mechanism with an alignment model similar to the one in parametric systems, while retaining the other advances of existing end-to-end systems. Alignment + attention are both introduced. Thanks to the duration module, facial features can also be generated easily, without a parallel corpus.

 

The main contributions of this article are as follows:

  • 1. We propose replacing the end-to-end attention mechanism in the Tacotron 2 [7] system with the alignment model from traditional parametric systems. Our experiments show that the proposed method produces highly natural speech on par with Tacotron 2, while DurIAN's output is more robust and stable. Note that attention is not abandoned: alignment is done the traditional way, and an attention module is still there for fine adjustment.
  • 2. We use a skip encoder to represent the phoneme sequence together with the hierarchical prosodic structure of Chinese, improving the DurIAN system on out-of-domain Chinese speech synthesis tasks.
  • 3. We propose a simple but effective method for fine-grained style control in a supervised setting without fine-grained labels during training; it is an extension of conventional multi-style training.
  • 4. We describe a multi-band synchronized parallel WaveRNN algorithm that reduces the computation of the original WaveRNN model [14] and speeds up the inference process on a single CPU.

2. DurIAN

  • The length N' of the hidden-state sequence output by the skip encoder differs from the length N of the input sequence, because the hidden states associated with prosodic boundaries are excluded from the final output of the skip encoder.
  • State expansion basically copies each hidden state sequentially according to the duration of the corresponding phoneme. During training, given the input phoneme sequence and the target acoustic features y1:T, the duration of each phoneme is obtained by forced alignment. In the synthesis stage, we use the phoneme durations predicted by the duration model. The expanded hidden states in the alignment model are paired exactly with the target acoustic frames, so the decoder network can be trained to predict each acoustic frame autoregressively (see the sketch below).
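A minimal sketch of this state-expansion step, assuming a PyTorch-style implementation (the function name, tensor names, and shapes are my own, not the paper's code):

```python
import torch

def expand_states(encoder_states: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme's hidden state `duration` times along the time axis.

    encoder_states: (N', hidden_dim) -- one vector per phoneme from the skip encoder
    durations:      (N',)            -- integer frame counts: forced alignment at
                                        training time, duration-model output at synthesis
    returns:        (T, hidden_dim)  -- T = durations.sum(), aligned 1:1 with acoustic frames
    """
    return torch.repeat_interleave(encoder_states, durations, dim=0)

# Example: 3 phonemes lasting 2, 4 and 3 frames -> 9 expanded states.
states = torch.randn(3, 8)
frames = expand_states(states, torch.tensor([2, 4, 3]))
assert frames.shape == (9, 8)
```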

 

2.1. Skip Encoder

Note: in practice you also have to consider how the silence symbol (sil) maps to frames in the forced alignment.

The main purpose of the skip encoder is to encode the phoneme sequence together with the hierarchical prosodic structure in the hidden states. Prosodic structure is important for improving the generalization of the system to out-of-domain text in Chinese speech synthesis tasks. To build the input of the skip encoder, the source text is first converted into a phoneme sequence. To encode the different levels of prosodic structure, special symbols representing the boundaries of each prosody level are inserted between the input phonemes. The figure in the paper illustrates how these prosodic-boundary symbols are inserted.

  • The details of the skip encoder are omitted here; check the code later (a toy sketch of the skip idea follows below).
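A toy sketch of the skip idea as I understand it (my own simplification; the boundary labels "#1"/"#3" and everything else here are assumptions, not the released code):

```python
import torch

# Prosodic boundary symbols are inserted between phonemes so the encoder sees the
# prosodic structure, but their hidden states are dropped ("skipped") afterwards,
# so the output length matches the number of real phonemes/silences.
symbols = ["sil", "n", "i", "#1", "h", "ao", "#3", "sil"]
is_boundary = torch.tensor([s.startswith("#") for s in symbols])

hidden = torch.randn(len(symbols), 8)   # encoder outputs, one per input symbol
skipped = hidden[~is_boundary]          # keep only the states of real phonemes
assert skipped.shape[0] == len(symbols) - 2   # the two boundary states were skipped
```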

2.2. Alignment Model

As expected, I won't write this out in detail:

  • An important task in speech synthesis is to recover the hidden alignment between the phoneme sequence and the target feature/spectrogram sequence. End-to-end systems rely on attention-based mechanisms to discover this alignment. However, existing end-to-end attention mechanisms often produce unpredictable artifacts in which some words are skipped or repeated in the generated speech. Since production speech synthesis systems have very low tolerance for this kind of instability, end-to-end systems have not been widely deployed in practical applications. In DurIAN, we replace the attention mechanism with an alignment model [15, 16], in which the alignment between the phoneme sequence and the target acoustic sequence is derived from a phoneme duration prediction model. The duration of each phoneme is measured by the number of acoustic frames aligned to it. During training, the alignment between the acoustic frame sequence and the input phoneme sequence is obtained by forced alignment, as widely used in speech recognition; this alignment is then used for hidden-state expansion, which simply copies each hidden state according to the phoneme's duration. At synthesis time, a separate duration model predicts the duration of each phoneme. This duration model is trained to minimize the mean squared error between the predicted phoneme durations and the durations obtained by forced alignment, given the whole sentence as input. After state expansion, the relative position of each frame within its phoneme is encoded as a value between 0 and 1 and appended to the encoder state. The expanded encoder states are similar to the estimated attention context in end-to-end systems, except that in DurIAN they are derived from the predicted phoneme durations. A small sketch of the duration loss and the relative-position encoding follows below.
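A small sketch of the two pieces described above, i.e. the MSE duration loss and the relative frame position in [0, 1] appended to each expanded encoder state (shapes and names are my assumptions, not the official implementation):

```python
import torch
import torch.nn.functional as F

def duration_loss(predicted: torch.Tensor, forced_alignment: torch.Tensor) -> torch.Tensor:
    # Both have shape (N',): one duration in frames per phoneme.
    return F.mse_loss(predicted, forced_alignment.float())

def relative_positions(durations: torch.Tensor) -> torch.Tensor:
    # For a phoneme lasting d frames, emit d values spread over [0, 1].
    return torch.cat([torch.linspace(0.0, 1.0, int(d)) for d in durations])

durations = torch.tensor([2, 4, 3])
expanded = torch.repeat_interleave(torch.randn(3, 8), durations, dim=0)  # (9, 8)
pos = relative_positions(durations).unsqueeze(1)                         # (9, 1)
decoder_input = torch.cat([expanded, pos], dim=1)                        # (9, 9)
```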

2.3 Decoder

The paper doesn't make this very clear; I'll wait for the code to check. In short, the decoder part should also contain an attention module.

  • Does it still use the tangled reduce_factor?
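For reference, a tiny sketch of what a Tacotron-style reduction factor does; whether DurIAN actually uses it is the open question above, so this is purely an assumption:

```python
import torch

# With reduction factor r, each decoder step predicts r consecutive mel frames,
# shortening the autoregressive loop by a factor of r.
r, n_mels, T = 2, 80, 10
mel_target = torch.randn(T, n_mels)
grouped = mel_target.reshape(T // r, r * n_mels)   # (5, 160): 5 decoder steps of 2 frames each
assert grouped.shape == (T // r, r * n_mels)
```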

2.4 Multimodal Synthesis

Skipped.

3. Fine-grained Style Control

Skipped.

4. Multi-band WaveRNN

Skipped.

5. Other


Origin blog.csdn.net/u013625492/article/details/114827085