- code & pretrained model
- ACL 2022
- Junyi Ao, Rui Wang, et al. (Southern University of Science and Technology & The Hong Kong Polytechnic University)
abstract
SpeechT5 projects speech and text into a shared high-dimensional space to extract common modal representations.
A shared encoder-decoder backbone handles the sequence-to-sequence conversion, while six modality-specific (speech/text) pre-nets and post-nets process text and speech separately.
The model gains advantages on multiple downstream tasks, including ASR, TTS, speech translation (ST), voice conversion (VC), speaker identification (SID), and speech enhancement (SE).
intro
Pre-trained models have been very successful in NLP, and speech has similarly successful precedents such as wav2vec 2.0 and HuBERT.
However, existing speech pre-training models have two problems: (1) most are self-supervised on unlabeled speech data alone, ignoring the importance of text data and lacking the modality-conversion ability needed for spoken-language tasks; (2) most only pre-train a speech encoder that is then connected to downstream tasks, with no pre-trained decoder available for seq2seq generation.
This paper proposes SpeechT5, a unified-modal pre-training framework that makes full use of unlabeled audio and text data to perform conversions between speech and text in various combinations. Modality-specific pre-nets map speech/text into the same shared space, the encoder-decoder network performs the seq2seq conversion, and separate post-nets then generate speech or text.
To align the speech and text data, the model (1) maps speech/text into a shared vector-quantization space, and (2) randomly mixes the quantized latent representations with the contextual states, which helps the quantizer perform cross-modal modeling (detailed under joint pre-training below).
method
Model Architecture
- Input/Output Representations:
- text: encoder input and decoder output are both text
- speech: the encoder takes the raw waveform; the decoder outputs a log mel-filterbank, which is converted to a waveform by a HiFi-GAN (HFG) vocoder
- Encoder-Decoder Backbone
- Transformer encoder-decoder
- Relative position embedding is added to the dot-product weights of self-attention
- Speech Pre/Post-Net
- speech encoder prenet: the convolutional feature extractor of wav2vec 2.0, which downsamples the raw waveform
- speech decoder prenet: 3 linear layers with ReLU; takes the log mel-filterbank as input and concatenates an x-vector (passed through a linear layer) to control multi-speaker synthesis
- speech decoder postnet: (1) a linear layer predicts the log mel-filterbank $Y^f$, and 5 conv1d layers predict a residual that supplements the details; (2) a linear module predicts the stop token of the mel sequence
- Text Pre/Post-Net: share a single embedding matrix
- pre-net: converts token indices into embedding vectors
- post-net: projects decoder states back through the shared embedding and picks the token with the highest softmax probability over the vocabulary (a sketch of the full wiring follows this list)
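A minimal PyTorch sketch of the wiring above, shown for the TTS direction (text in, mel out). All names, dimensions, and the simplified pre/post-nets are my own illustrative assumptions, not the released implementation; the relative position bias in self-attention and the attention masks are omitted for brevity.

```python
import torch
import torch.nn as nn

class SpeechT5Sketch(nn.Module):
    def __init__(self, d_model=768, vocab_size=10000, n_mels=80, xvec_dim=512):
        super().__init__()
        # shared encoder-decoder backbone (12 encoder / 6 decoder layers, 12 heads)
        self.backbone = nn.Transformer(d_model=d_model, nhead=12,
                                       num_encoder_layers=12, num_decoder_layers=6,
                                       batch_first=True)
        # text pre/post-net: one shared embedding matrix
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # speech encoder pre-net: crude stand-in for the wav2vec 2.0 conv extractor
        self.speech_enc_prenet = nn.Conv1d(1, d_model, kernel_size=400, stride=320)
        # speech decoder pre-net: 3 linear+ReLU layers over log mel-fbank frames
        self.speech_dec_prenet = nn.Sequential(
            nn.Linear(n_mels, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU())
        self.xvec_proj = nn.Linear(xvec_dim, d_model)  # speaker x-vector projection
        # speech decoder post-net: linear mel prediction + 5 conv1d residual layers
        self.mel_linear = nn.Linear(d_model, n_mels)
        self.postnet = nn.Sequential(
            *[nn.Conv1d(n_mels, n_mels, kernel_size=5, padding=2) for _ in range(5)])
        self.stop_linear = nn.Linear(d_model, 1)  # stop-token prediction

    def forward(self, text_ids, mel_shifted, xvec):
        enc_in = self.text_embed(text_ids)                 # text pre-net
        dec_in = (self.speech_dec_prenet(mel_shifted)
                  + self.xvec_proj(xvec).unsqueeze(1))     # add speaker info
        h = self.backbone(enc_in, dec_in)                  # seq2seq conversion
        mel_before = self.mel_linear(h)                    # coarse mel frames
        residual = self.postnet(mel_before.transpose(1, 2)).transpose(1, 2)
        return mel_before + residual, self.stop_linear(h).squeeze(-1)

model = SpeechT5Sketch()
mel, stop = model(torch.randint(0, 10000, (2, 20)),  # (batch, text length)
                  torch.randn(2, 50, 80),             # shifted mel targets
                  torch.randn(2, 512))                # speaker x-vectors
```

For ASR the same backbone would instead be fed by `speech_enc_prenet` and decoded through the shared text embedding, which is what the fine-tuning section below selects.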
Pre-Training
- Speech Pre-Training
- Similar to HuBERT, a masked language model is used: the audio is first processed into frame-level features whose cluster ids serve as pseudo-labels $Z$; 8% of the time steps are randomly selected as mask-span starts with a mask length of 10, and the masked features are fed into the Transformer encoder to predict the pseudo-labels, constrained by a cross-entropy (CE) loss.
H = speech_encoder_prenet(audio)   # convolutional feature extractor
U = transformer_encoder(H)         # contextual representations
Z = k_means(U)                     # cluster ids serve as frame-level pseudo-labels
- Reconstruction loss: L1 loss $\|Y - X\|_1$ between the speech-decoder post-net output $Y$ and the input target $X$, plus a BCE loss on the stop tokens.
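A hedged sketch of the speech pre-training objectives above: HuBERT-style span masking with cross-entropy on masked frames only, plus the L1 + stop-token BCE reconstruction losses on the decoder side. Function and tensor names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def span_mask(num_frames, start_prob=0.08, span_len=10):
    # choose ~8% of frames as span starts, then mask the next 10 frames of each
    starts = torch.nonzero(torch.rand(num_frames) < start_prob).flatten()
    mask = torch.zeros(num_frames, dtype=torch.bool)
    for s in starts.tolist():
        mask[s:s + span_len] = True
    return mask

def speech_pretrain_loss(logits, pseudo_labels, mask,
                         mel_pred, mel_target, stop_logits, stop_target):
    # encoder side: CE only on masked frames, predicting the k-means labels Z
    ce = F.cross_entropy(logits[mask], pseudo_labels[mask])
    # decoder side: L1 reconstruction plus BCE on the stop token
    l1 = F.l1_loss(mel_pred, mel_target)
    bce = F.binary_cross_entropy_with_logits(stop_logits, stop_target)
    return ce + l1 + bce
```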
- text pre-training: corrupted text $X'$ is taken as input and the model predicts the original text; a proportion of the input tokens are masked, with some replaced by mask tokens, following the corruption strategies of BART and T5 (a minimal sketch follows).
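A minimal sketch of the corruption step, assuming simple token masking (BART/T5 themselves use richer span-infilling variants); the 30% ratio and the ids are assumptions.

```python
import torch

def corrupt_text(token_ids, mask_token_id, mask_ratio=0.3):
    # X': replace a random proportion of tokens with the mask token;
    # the encoder-decoder is trained to reconstruct the original X
    corrupted = token_ids.clone()
    corrupted[torch.rand(token_ids.shape) < mask_ratio] = mask_token_id
    return corrupted
```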
- joint pre-training: the methods above each model only one modality and are hard to apply to cross-modal conversion. The paper therefore proposes a cross-modal quantization method: speech and text use the same codebook to discretize features, mapping both into the same space.
$U$ and $C$ are the output of the encoder and the fixed codebook features, respectively.
Replacing a part of the contextual representations with the quantized features of the corresponding time steps helps the quantizer use cross-modal information, and a diversity loss helps it learn more shared information, where $p_k$ is the probability of the quantizer's multi-class prediction choosing the $k$-th codebook entry (see the sketch below).
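A sketch of the quantize-and-mix step described above, assuming nearest-neighbor lookup into the shared codebook; the 10% mixing proportion is an assumption.

```python
import torch

def quantize_and_mix(U, C, mix_ratio=0.1):
    # U: (T, d) encoder states; C: (K, d) shared speech/text codebook
    idx = torch.cdist(U, C).argmin(dim=-1)        # nearest code per time step
    Q = C[idx]                                    # quantized representations
    replace = torch.rand(U.size(0)) < mix_ratio   # steps whose states get swapped
    mixed = torch.where(replace.unsqueeze(-1), Q, U)
    # diversity loss: p_k = empirical probability of picking code k;
    # minimizing (1/K) * sum_k p_k log p_k pushes codebook usage toward uniform
    p = torch.bincount(idx, minlength=C.size(0)).float() / idx.numel()
    L_d = (p * p.clamp_min(1e-9).log()).sum() / C.size(0)
    return mixed, L_d
```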
- total loss: the sum of the speech masked-prediction loss, the speech reconstruction losses, the text loss, and the diversity loss
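Written out from the components above (a reconstruction; the diversity weight $\gamma$ and the superscript notation are mine):

$\mathcal{L} = \mathcal{L}^{s}_{\mathrm{mlm}} + \mathcal{L}^{s}_{1} + \mathcal{L}^{s}_{\mathrm{bce}} + \mathcal{L}^{t}_{\mathrm{mle}} + \gamma \, \mathcal{L}_{d}$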
fine-tuning
- After pre-training, fine-tune on the dataset of each downstream task, selecting only the modules that task needs. For ASR, for example, select the speech-encoder prenet, the encoder-decoder backbone, and the text-decoder pre/post-nets (see the sketch below).
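A hypothetical helper showing this module selection for ASR, reusing the `SpeechT5Sketch` names from the architecture sketch above (the actual implementation swaps in the task's pre/post-nets rather than freezing):

```python
def select_for_asr(model):
    # ASR path: speech-encoder pre-net -> backbone -> text decoder (shared embedding)
    needed = ("speech_enc_prenet", "backbone", "text_embed")
    for name, param in model.named_parameters():
        # parameters outside the ASR path are frozen (dropping them also works)
        param.requires_grad = name.startswith(needed)
    return model
```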
experiment
- model arch: 12-layer Transformer encoder + 6-layer Transformer decoder, 12 attention heads (MHA=12). The encoder parameter settings are the same as wav2vec 2.0 and HuBERT.
- Dataset: speech: LibriSpeech 960h; text: 400M sentences
- Machine: 32 V100 GPUs; per-GPU batch of 90 s of speech or 12k text tokens; gradients accumulated over 2 steps per update; 500k updates in total
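The same setup restated as a config dict for quick reference (values copied from above; the field names are my own):

```python
pretrain_setup = {
    "gpus": "32 x V100",
    "per_gpu_batch": {"speech": "90 s", "text": "12k tokens"},
    "gradient_accumulation": 2,      # one parameter update every 2 steps
    "total_updates": 500_000,
    "encoder_layers": 12, "decoder_layers": 6, "attention_heads": 12,
}
```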