Paper Reading_Speech Synthesis_Spear-TTS

Paper information

name_en: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision
name_ch: Speak, Read and Prompt: high-fidelity text-to-speech with a small amount of supervision
paper_addr: http://arxiv.org/abs/2302.03540
date_read: 2023-04-25
date_publish: 2023-02-07
tags: ['Deep Learning','TTS']
author: Eugene Kharitonov, Google Research
code: https://google-research.github.io/seanet/speartts/examples/

1 Feedback

This is a complete TTS system, which can be regarded as an extension of AudioLM.

2 Summary

SPEAR-TTS is a multi-speaker speech synthesis system trained with a large amount of unsupervised data and a small amount of supervised data. It combines two types of discrete speech representations and decouples synthesis into two stages: generating semantic tokens from text ("reading") and generating acoustic tokens from semantic tokens ("speaking"). A large amount of audio-only data is used to train the "speaking" module, which reduces the "reading" module's demand for parallel data (text-audio pairs).
For speaker control, a prompting mechanism is used: only 3 seconds of audio is needed to synthesize speech in the voice of a speaker not seen during training.
Experiments show that with only 15 minutes of parallel data, SPEAR-TTS achieves a character error rate comparable to state-of-the-art methods, and subjective tests show it is comparable to real speech in naturalness and acoustic quality.

3 Discrete Speech Representations

See AudioLM for details

3.1 Semantic token

The role of semantic tokens is to provide a coarse, high-level conditioning signal for generating the subsequent acoustic tokens. They should therefore form a representation in which linguistic content (from phonetics to semantics) is salient, while paralinguistic information such as speaker identity and acoustic details is discarded.
To obtain such a representation, a self-supervised speech representation model based on w2v-BERT is trained. The model combines masked language modeling and contrastive learning to learn speech representations. After training, k-means clustering is run on the mean-variance normalized outputs of a particular layer, and the centroid indices are used as discrete tokens.
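As a minimal sketch (not the paper's exact configuration), the tokenization step can be illustrated as follows, assuming frame-level features from a trained w2v-BERT-style encoder are already available; the layer choice and cluster count are placeholder hyperparameters:

```python
# Hypothetical sketch: discretizing self-supervised speech features into
# semantic tokens. Features are assumed to come from one layer of a trained
# w2v-BERT-style model; n_clusters is a placeholder, not the paper's value.
from sklearn.cluster import KMeans

def fit_semantic_tokenizer(features, n_clusters=512):
    """Fit k-means on mean-variance normalized frame features.

    features: (num_frames, dim) array pooled over a training corpus.
    Returns the fitted k-means model and the normalization statistics.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit((features - mean) / std)
    return kmeans, mean, std

def to_semantic_tokens(features, kmeans, mean, std):
    """Map each frame to the index of its nearest centroid (its semantic token)."""
    return kmeans.predict((features - mean) / std)  # (num_frames,) integer ids
```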

3.2 Acoustic token

Acoustic tokens are discrete audio representations that allow high-fidelity reconstruction of acoustic details. A SoundStream neural codec is trained to reconstruct speech while compressing it into a sequence of discrete units. SoundStream achieves this by placing a residual vector quantizer in the bottleneck of a convolutional autoencoder.
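To make the residual quantizer idea concrete, here is a minimal sketch of residual vector quantization with placeholder (random) codebooks; a real codec such as SoundStream learns the codebooks jointly with the convolutional encoder/decoder and its reconstruction and adversarial losses:

```python
# Minimal RVQ sketch: each quantizer level encodes the residual left over by
# the previous level. Codebooks here are random placeholders for illustration.
import numpy as np

def rvq_encode(latent, codebooks):
    """latent: (dim,) encoder output for one frame.
    codebooks: list of (codebook_size, dim) arrays, one per level.
    Returns one code index per level: the frame's acoustic tokens."""
    residual = latent.copy()
    codes = []
    for codebook in codebooks:
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        codes.append(idx)
        residual = residual - codebook[idx]  # next level quantizes what remains
    return codes

def rvq_decode(codes, codebooks):
    """Sum the selected codewords to approximate the original latent."""
    return sum(cb[i] for i, cb in zip(codes, codebooks))

# Example: 3 quantizer levels, 256-entry codebooks, 64-dim latents.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(3)]
codes = rvq_encode(rng.normal(size=64), codebooks)
```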

4 SPEAR-TTS Overview

SPEAR-TTS extends AudioLM by using text as the generation condition. As shown in Figure 1, it consists of two main stages: S1 converts text into discrete semantic tokens, and S2 converts semantic tokens into acoustic tokens, which are then turned into audio by the SoundStream decoder.
The two-step conversion is used because semantic tokens sit logically between text and acoustic information, and because the semantic-to-acoustic mapping can be trained on audio-only data. In addition, a third stage similar to AudioLM can be added to improve the quality of the synthesized speech by predicting the acoustic tokens of the finer residual vector quantization levels.
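The data flow can be summarized with the following pseudo-pipeline; the three model handles are assumptions standing in for the trained components, so only the stage boundaries are meaningful:

```python
# Hypothetical glue code illustrating the two-stage structure; s1_model,
# s2_model, and soundstream are assumed interfaces, not real APIs.
def spear_tts_synthesize(text, s1_model, s2_model, soundstream, prompt=None):
    # S1 ("reading"): text -> discrete semantic tokens.
    semantic_tokens = s1_model.generate(text)
    # S2 ("speaking"): semantic tokens -> acoustic tokens, optionally
    # conditioned on a short audio prompt that fixes the voice.
    acoustic_tokens = s2_model.generate(semantic_tokens, prompt=prompt)
    # SoundStream decoder: acoustic tokens -> waveform.
    return soundstream.decode(acoustic_tokens)
```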

5 S1: Improve supervision efficiency

By extracting semantic tokens from a speech synthesis dataset, the text-to-semantic-token mapping of S1 becomes a supervised sequence-to-sequence (seq2seq) task, implemented here with a Transformer.

Supervised learning requires a large amount of labeled data, which is hard to obtain for low-resource languages. The paper uses two strategies to improve supervision efficiency:

5.1 Pre-training

An encoder-decoder Transformer is pre-trained on a denoising task: the model is fed a corrupted version of an original semantic token sequence, and the goal is to produce the corresponding uncorrupted sequence.
Typical corruption methods include randomly replacing, deleting, or masking individual tokens or entire token spans. Preliminary experiments found that independently deleting individual tokens with a constant probability works better than the alternatives.
After pre-training on the denoising task, the model is fine-tuned on the S1 task, with the upper layers of the encoder and the decoder parameters frozen.
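A sketch of the deletion-based corruption is below; the deletion probability is a placeholder, not the value used in the paper:

```python
# Denoising pre-training data: corrupt a semantic token sequence by deleting
# each token independently with a fixed probability; the (corrupted, original)
# pair forms one encoder-decoder training example. delete_prob is a placeholder.
import random

def corrupt_by_deletion(tokens, delete_prob=0.5, seed=None):
    """Drop each token independently with probability delete_prob."""
    rng = random.Random(seed)
    return [t for t in tokens if rng.random() >= delete_prob]

# The model reads `corrupted` and is trained to reconstruct `tokens`.
tokens = [17, 203, 203, 5, 88, 41, 41, 9]
corrupted = corrupt_by_deletion(tokens, seed=0)
```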

5.2 Backtranslation

The same text can correspond to many different audio realizations, with different voices, accents, prosody, emotional content, and recording conditions, which makes the text-audio relationship highly asymmetric. Backtranslation trains a speech-to-text model on the available parallel data and then uses it, together with an audio-only corpus, to generate synthetic parallel data that augments the training set.

Reading Figure 2 from the bottom left: first, pre-train model P with the corruption-based denoising task (corrupt, then reconstruct) on semantic tokens extracted from audio-only data; then obtain the backtranslation model B by fine-tuning the decoder on the small amount of parallel data; next, use B together with a large amount of unlabeled audio to generate a large amount of synthetic parallel data for training (top right); finally, fine-tune the S1 model on all parallel data (bottom right), updating only the lower layers of the encoder. The synthetic-data step is sketched below.
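In the following sketch, `tokenize_audio` and `backtranslation_model` are assumed handles for the semantic tokenizer and model B described above:

```python
# Hypothetical backtranslation loop: turn an audio-only corpus into synthetic
# (text, semantic tokens) pairs for training S1.
def generate_synthetic_parallel_data(audio_corpus, tokenize_audio, backtranslation_model):
    synthetic_pairs = []
    for audio in audio_corpus:
        semantic_tokens = tokenize_audio(audio)                  # audio -> semantic tokens
        text = backtranslation_model.generate(semantic_tokens)   # semantic tokens -> transcript
        synthetic_pairs.append((text, semantic_tokens))          # extra parallel data for S1
    return synthetic_pairs
```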

6 S2: Controlling the generation process

The second stage maps semantic tokens to acoustic tokens. Both kinds of tokens can be extracted from utterances in an audio-only dataset, so a Transformer seq2seq model can be trained to translate between them. This stage generates utterances with random variations in voice, tempo, and recording conditions, reproducing the characteristics observed in its training data.
Because the training of S1 and S2 is decoupled, S2 preserves the diversity of the generated speech even when S1 is trained on a single-speaker dataset.

To control the characteristics of the speaker's voice, both the case with an audio prompt and the case without one are considered during training, as shown in Figure 3:

In Figure 3, the red blocks are semantic tokens, the yellow blocks are acoustic tokens, and the gray blocks are prompt separators. In the prompted case (bottom), the model is trained on the concatenated sequence of semantic tokens from the prompt, semantic tokens from the target, and acoustic tokens from the prompt. It then generates the acoustic tokens (output) corresponding to the target semantic tokens while preserving the voice and speaking conditions captured in the prompt's acoustic tokens.
At training time, two non-overlapping speech windows are randomly selected from each training example, and semantic and acoustic token sequences are computed from them; one window serves as the prompt and the other as the target output.
At inference time, the input is likewise the first three blocks, and the output is generated autoregressively.
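A minimal sketch of how the prompted sequence is assembled (the separator id and the model interface are assumptions):

```python
# Prompted S2 input, following the description above: the model sees
# [semantic(prompt); semantic(target); acoustic(prompt)] and must continue
# with acoustic(target). SEP is an assumed separator token id.
SEP = 0  # placeholder "prompt delimiter" id

def build_prompted_prefix(sem_prompt, sem_target, ac_prompt):
    """Concatenate the three known blocks; the model emits the fourth."""
    return sem_prompt + [SEP] + sem_target + [SEP] + ac_prompt + [SEP]

def synthesize_with_prompt(model, sem_prompt, sem_target, ac_prompt):
    prefix = build_prompted_prefix(sem_prompt, sem_target, ac_prompt)
    # Autoregressive continuation: target acoustic tokens in the prompt's voice.
    return model.generate(prefix)
```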
