Paper Reading_Speech Synthesis_VALL-E

paper reading

name_en: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
name_ch: Neural Codec Language Model to achieve zero-shot TTS
paper_addr: http://arxiv.org/abs/2301.02111
date_read: 2023-04-25
date_publish: 2023-01-05
tags: ['deep learning','speech synthesis']
author: Chengyi Wang, Microsoft
code: https://github.com/microsoft/unilm/tree/master/valle

1 Overview

VALL-E is a speech synthesis model: the inputs are the text to be synthesized and a 3-second recording of the target speaker, and the output is synthesized speech whose content matches the text and whose voice matches the recording.

2 Differences from traditional TTS

Previous TTS pipelines go phoneme -> mel spectrogram -> audio; VALL-E goes phoneme -> discrete codec codes -> audio.
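To make the contrast concrete, here is a small illustrative sketch of the two intermediate representations: a mel spectrogram is a matrix of continuous values, while the codec representation is a matrix of discrete token indices. The 8 quantizers with 1024-entry codebooks follow the EnCodec setting used by VALL-E; the 80 mel bins are a common choice in traditional TTS, not a value from this post.

```python
# Illustrative shapes only: the two intermediate representations differ in kind.
import numpy as np

T = 300  # number of frames for some utterance (illustrative)

# Traditional TTS intermediate: a continuous mel spectrogram, e.g. 80 bins per frame.
mel_spectrogram = np.random.randn(T, 80).astype(np.float32)

# VALL-E intermediate: discrete acoustic codes from a neural codec,
# 8 quantizer streams per frame, each an index into a 1024-entry codebook.
acoustic_codes = np.random.randint(0, 1024, size=(T, 8))

print(mel_spectrogram.shape)  # (300, 80), continuous values
print(acoustic_codes.shape)   # (300, 8), integer token indices
```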

3 Main contributions

• We propose VALL-E, a TTS framework with strong in-context learning capability, which uses audio codec codes as intermediate representations in place of traditional mel spectrograms.
• It builds a generalized TTS system in the speaker dimension by leveraging a large amount of semi-supervised data.
• VALL-E can produce diverse outputs for the same input text, and it preserves the acoustic environment and the speaker's emotion of the acoustic prompt.
• In the zero-shot scenario, it synthesizes natural speech with high speaker similarity simply by prompting.

4 Background

The challenges of synthesizing audio directly include the large number of probabilities that must be produced at each time step (audio is typically stored as 16-bit integers, i.e. 2^16 possible values) and the very long sequence lengths. To address these issues, speech quantization techniques are used to compress the data and speed up inference. Vector quantization is widely used for feature extraction in self-supervised speech models such as vq-wav2vec and HuBERT.
Recent studies show that the codes of self-supervised models can also reconstruct the content, with faster inference than WaveNet; however, speaker identity is discarded and reconstruction quality is low. AudioLM effectively addresses these problems. Neural audio codecs have also improved considerably, and VALL-E adopts EnCodec as its audio codec.
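To see what these discrete codes look like in practice, here is a minimal sketch using the open-source encodec package (my own usage example based on that library's documented API, not code from the VALL-E authors; the audio file name is a placeholder):

```python
# Sketch: encode a waveform into discrete acoustic codes with EnCodec.
# Requires `pip install encodec torchaudio`; follows the encodec package's documented usage.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 6 kbps -> 8 quantizers, the setting used by VALL-E

wav, sr = torchaudio.load("prompt_3s.wav")  # placeholder path to a 3-second recording
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)                      # list of (codes, scale) tuples
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)  # (batch, 8, T): 8 discrete code streams, ~75 frames per second
```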

5 Methods

5.1 Problem statement

Given a dataset where y is an audio sample and x = {x0, x1, ..., xL} is its corresponding phoneme transcription, a pre-trained neural codec model encodes each audio sample into a discrete acoustic code matrix, denoted Encodec(y) = C^(T×8), where T is the down-sampled utterance length and 8 is the number of quantizers. The codec decoder reconstructs the waveform: Decodec(C) ≈ ŷ.
At inference time, given a phoneme sequence and a 3-second enrollment recording of an unseen speaker, the trained language model first estimates the acoustic code matrix carrying the target content and the speaker's voice; the neural codec decoder then synthesizes a high-quality waveform from these codes.
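Stated as a formula (following the paper's notation, where C̃ is the acoustic prompt extracted from the enrollment recording), zero-shot TTS is cast as conditional codec language modeling with the objective:

```latex
\text{Encodec}(y) = C^{T \times 8}, \qquad
\text{Decodec}(C) \approx \hat{y}, \qquad
\max_{\theta}\; p\bigl(C \mid x, \tilde{C};\, \theta\bigr)
```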

5.2 Training

Two conditional language models are designed in a hierarchical manner: an autoregressive (AR) model generates the codes of the first quantizer, c:,1, and a non-autoregressive (NAR) model refines the voice by generating the codes of quantizers 2-8, c:,2..8. Combining the AR and NAR models gives a good trade-off between speech quality and inference speed.
The AR model uses the phoneme sequence as a phoneme prompt so that the generated speech has the desired content; the codes of the other seven quantizers are then produced by the NAR model. Unlike the AR model, the NAR model lets each token attend to all input tokens in the self-attention layers.
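Written out (my transcription of the factorization in the paper), the first quantizer is modeled autoregressively over time and the remaining seven quantizers non-autoregressively, each conditioned on all lower-index code streams:

```latex
p\bigl(C \mid x, \tilde{C}; \theta\bigr)
  = p\bigl(c_{:,1} \mid \tilde{C}_{:,1}, x;\, \theta_{AR}\bigr)
    \prod_{j=2}^{8} p\bigl(c_{:,j} \mid c_{:,<j}, x, \tilde{C};\, \theta_{NAR}\bigr),
\qquad
p\bigl(c_{:,1} \mid \tilde{C}_{:,1}, x;\, \theta_{AR}\bigr)
  = \prod_{t} p\bigl(c_{t,1} \mid c_{<t,1}, \tilde{C}_{:,1}, x;\, \theta_{AR}\bigr)
```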

5.3 Inference

A model is considered to have in-context learning capability if it can synthesize high-quality speech for an unseen speaker without fine-tuning.
The text is first converted into a phoneme sequence and the enrollment recording is encoded into an acoustic code matrix, which together form the phoneme prompt and the acoustic prompt. For the AR model, sampling-based decoding conditioned on the prompts is used, since it significantly increases the diversity of the outputs. For the NAR model, greedy decoding selects the token with the highest probability. Finally, the neural codec decoder generates the waveform conditioned on the complete eight code sequences.
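Below is a minimal sketch of the two decoding strategies, with random logits standing in for the outputs of the (hypothetical) AR and NAR models; it only illustrates the sampling-vs-greedy mechanics, not the actual VALL-E networks:

```python
# Decoding sketch: sampling for the AR stage (diversity), greedy argmax for the NAR stage.
# The random logits below are placeholders for the outputs of the trained models.
import torch

vocab_size = 1024   # EnCodec codebook size
T = 225             # number of acoustic frames to generate (illustrative)

# AR stage: sample the first-quantizer codes token by token, conditioned on the prompts.
ar_codes = []
for t in range(T):
    logits = torch.randn(vocab_size)                      # stand-in for ar_model(prompts, ar_codes)
    probs = torch.softmax(logits, dim=-1)
    ar_codes.append(torch.multinomial(probs, 1).item())   # sampling -> diverse outputs

# NAR stage: for quantizers 2..8, pick the highest-probability token at every frame.
nar_codes = []
for j in range(2, 9):
    logits = torch.randn(T, vocab_size)                   # stand-in for nar_model(prompts, codes, j)
    nar_codes.append(logits.argmax(dim=-1))               # greedy decoding

codes = torch.stack([torch.tensor(ar_codes)] + nar_codes)  # (8, T) codes for the codec decoder
print(codes.shape)
```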
