SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

  • code & pretrained model
  • ACL 2022
  • Junyi Ao, Rui Wang, et al., Southern University of Science and Technology & The Hong Kong Polytechnic University

abstract

SpeechT5 projects speech and text into a shared high-dimensional space to extract representations common to both modalities.
A shared encoder-decoder backbone is combined with six modal-specific (speech/text) pre-nets and post-nets that process text and speech separately.
The model gains advantages on multiple downstream tasks, including automatic speech recognition (ASR), text-to-speech (TTS), speech translation (ST), voice conversion (VC), speaker identification (SID), and speech enhancement (SE).

intro

Pre-trained models have been very successful in NLP, and there are successful precedents in speech as well, such as wav2vec 2.0 and HuBERT.
However, existing speech pre-training models have two problems: (1) most of them are trained in a self-supervised way on unlabeled speech data alone, ignoring the importance of text data and lacking the modality-conversion ability needed for many spoken language tasks; (2) most of them pre-train only a speech encoder that is then connected to downstream tasks, without a pre-trained decoder for sequence-to-sequence generation.

This paper proposes SpeechT5, a unified-modal pre-training framework that makes full use of unlabeled speech and text data to learn the various conversions between speech and text. Modality-specific pre-nets map speech/text into the same shared space, the shared encoder-decoder network performs the sequence-to-sequence conversion, and separate modality-specific post-nets then generate the output speech/text.
To align the text and speech data, the framework (1) maps speech/text into a shared vector-quantization space and (2) randomly mixes the quantized latent representations with the contextual states, which helps the quantizer perform better cross-modal modeling.

method

(figure: SpeechT5 model architecture with the shared encoder-decoder and modal-specific pre/post-nets)

Model Architecture

  • Input/Output Representations:
    • text: both the encoder input and the decoder output are text tokens
    • speech: the encoder takes the raw waveform as input and the decoder outputs mel-spectrogram frames, which are converted to a waveform by a HiFi-GAN (HFG) vocoder
  • Encoder-Decoder Backbone
    • Transformer encoder-decoder
    • A relative position embedding is added to the dot-product weights of self-attention (a sketch follows this list)
  • Speech Pre/Post-Net
    • speech encoder prenet: the convolutional feature extractor of wav2vec 2.0, which downsamples the raw waveform
    • speech decoder prenet: three linear layers with ReLU applied to the log mel-filterbank input; the speaker x-vector is concatenated and followed by a linear layer, to control multi-speaker synthesis (a PyTorch sketch follows this list)
    • speech decoder postnet: (1) a linear layer predicts the log mel-filterbank Y^f, and a stack of 5 conv1d layers predicts a residual that refines the details; (2) another linear layer predicts the stop token
  • Text Pre/Post-Net: shared embedding matrix
    • pre-net: converts token indices into embedding vectors
    • post-net: uses the shared embedding matrix to map the decoder output to token probabilities via softmax and takes the highest-probability token
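
To make the relative position embedding above concrete, here is a minimal sketch of adding a relative-position bias to the self-attention scores. This is one common realization of the idea rather than SpeechT5's exact implementation; the function name and tensor shapes are assumptions.

    import torch

    def attention_scores_with_relative_bias(q, k, rel_bias):
        # q, k: (batch, heads, T, d_head); rel_bias: (heads, T, T) relative-position term.
        # Content term: scaled dot product between queries and keys.
        scores = torch.matmul(q, k.transpose(-2, -1)) / q.size(-1) ** 0.5
        # Relative-position term is added directly to the pre-softmax attention weights.
        return scores + rel_bias.unsqueeze(0)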
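
Below is a minimal PyTorch sketch of the speech-decoder pre-net described above (three linear+ReLU layers on log mel-filterbank frames, concatenation of the speaker x-vector, then a linear projection). All layer sizes and names are assumptions, not the released model's configuration.

    import torch
    import torch.nn as nn

    class SpeechDecoderPrenet(nn.Module):
        def __init__(self, n_mels=80, hidden=256, d_model=768, xvector_dim=512):
            super().__init__()
            # Three fully-connected layers with ReLU on the log mel-fbank frames.
            self.mlp = nn.Sequential(
                nn.Linear(n_mels, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, d_model), nn.ReLU(),
            )
            # Linear layer applied after concatenating the speaker x-vector.
            self.speaker_proj = nn.Linear(d_model + xvector_dim, d_model)

        def forward(self, mel, xvector):
            # mel: (batch, frames, n_mels); xvector: (batch, xvector_dim)
            h = self.mlp(mel)
            spk = xvector.unsqueeze(1).expand(-1, h.size(1), -1)
            return self.speaker_proj(torch.cat([h, spk], dim=-1))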

Pre-Training

  • Speech Pre-Training
    • Similar to HuBERT, speech pre-training uses masked prediction: the audio is first processed into frame-level features, 8% of the time steps are randomly selected as the start of a mask span with span length 10, the masked features are fed into the Transformer encoder, and the model predicts the frame-level pseudo-labels Z (obtained by k-means) at the masked positions under a cross-entropy (CE) loss, as in the pseudocode below.
      H = speech_encoder_prenet(audio)        # wav2vec 2.0-style convolutional feature extractor
      H_masked = mask_spans(H, p=0.08, l=10)  # 8% of frames start a masked span of length 10
      U = transformer_encoder(H_masked)       # contextual representations
      Z = k_means(U)                          # frame-level pseudo-labels (prediction targets)

      (equation image: cross-entropy loss over the masked positions)
    • Reconstruction loss: an L1 loss on the predicted log mel-filterbank plus a BCE loss on the stop tokens, where Y is the speech-decoder-postnet output and X is the target input feature sequence.
      (equation image: L1 reconstruction loss and stop-token BCE loss)

  • text pre-training: the corrupted text X′ is used as input and the model is trained to reconstruct the original text; a certain proportion of the input is masked, with some of the masked words replaced by a mask token, following corruption strategies similar to BART and T5.
    (equation image: text reconstruction loss)
  • joint pre-training: the objectives above each model only a single modality, which is not enough for cross-modal conversion, so this paper proposes a cross-modal quantization method: speech and text features are discretized with the same shared codebook and thus mapped into the same space.
    U and C are the encoder output and the vectors of the fixed-size shared codebook, respectively; each encoder output u_i is replaced by its nearest codebook entry c_i.
    (equation image: nearest-neighbour quantization with the shared codebook)
    Randomly replacing a proportion of the contextual representations with the quantized vectors of the corresponding time steps helps the quantizer use cross-modal information, and a diversity loss encourages all codebook entries to be used so that more shared information is learned, where p_k is the softmax probability of choosing the k-th codebook entry (a PyTorch sketch follows this list).
    (equation image: diversity loss over the codebook probabilities p_k)
  • total loss: the overall pre-training objective sums the speech masked-prediction loss, the speech reconstruction losses (L1 + stop-token BCE), the text reconstruction loss, and the weighted diversity loss (a LaTeX sketch follows this list).
    (equation image: total pre-training loss)
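
Because the original loss equations survive only as images, the overall objective implied by the bullets above can be sketched in LaTeX as follows; the symbol names, the diversity weight γ, and the exact form of the diversity term are assumptions.

    \mathcal{L} \;=\;
        \underbrace{\mathcal{L}^{s}_{\mathrm{mlm}}}_{\text{masked speech prediction (CE)}}
      + \underbrace{\mathcal{L}^{s}_{1} + \mathcal{L}^{s}_{\mathrm{bce}}}_{\text{mel L1 + stop-token BCE}}
      + \underbrace{\mathcal{L}^{t}_{\mathrm{mle}}}_{\text{text reconstruction}}
      + \gamma\,\underbrace{\mathcal{L}_{d}}_{\text{diversity}},
    \qquad
    \mathcal{L}_{d} \;=\; \frac{1}{K}\sum_{k=1}^{K} p_k \log p_k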
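
The joint pre-training step can also be illustrated with a small PyTorch sketch of shared-codebook quantization and random mixing. The function and argument names, the L2 nearest-neighbour rule, the mixing probability, and the hard-argmin codebook probabilities are assumptions for illustration, not the paper's exact recipe (a real quantizer needs a differentiable selection such as Gumbel-softmax).

    import torch
    import torch.nn.functional as F

    def cross_modal_quantize(U, codebook, mix_prob=0.1):
        # U: (batch, time, dim) contextual encoder outputs (speech or text).
        # codebook: (K, dim) shared codebook vectors C.
        B = U.size(0)
        # Nearest codebook entry (by L2 distance) for every time step.
        dists = torch.cdist(U, codebook.unsqueeze(0).expand(B, -1, -1))  # (B, T, K)
        idx = dists.argmin(dim=-1)                                       # (B, T)
        quantized = codebook[idx]                                        # (B, T, dim)
        # Randomly replace a proportion of contextual states with their quantized vectors.
        mix_mask = (torch.rand(U.shape[:2], device=U.device) < mix_prob).unsqueeze(-1)
        mixed = torch.where(mix_mask, quantized, U)
        # Diversity loss: minimizing (1/K) * sum_k p_k log p_k pushes the codebook-usage
        # distribution p_k toward uniform, so all entries get used.
        p_k = F.one_hot(idx, codebook.size(0)).float().mean(dim=(0, 1))  # (K,)
        diversity_loss = (p_k * torch.log(p_k + 1e-8)).sum() / codebook.size(0)
        return mixed, diversity_loss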

fine tuning

  • After pre-training, the model is fine-tuned on the dataset of each downstream task, selecting the corresponding modules in a targeted manner; for the ASR task, for example, the speech-encoder prenet, the encoder-decoder backbone, and the text-decoder pre/post-nets are selected.
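
As an illustration of this module selection, a hypothetical Python mapping is sketched below; only the "asr" entry follows from the description above, the "tts" entry is the assumed mirror case, and all identifiers are made up for illustration.

    # Hypothetical task -> module selection for fine-tuning (identifiers are illustrative).
    FINETUNE_MODULES = {
        "asr": ["speech_encoder_prenet", "encoder_decoder",
                "text_decoder_prenet", "text_decoder_postnet"],
        "tts": ["text_encoder_prenet", "encoder_decoder",       # assumed mirror of the ASR case
                "speech_decoder_prenet", "speech_decoder_postnet"],
    }

    def select_modules(task, pretrained):
        # Keep only the pre-trained sub-modules a downstream task needs.
        return {name: pretrained[name] for name in FINETUNE_MODULES[task]}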

experiment

  • model architecture: a 12-layer Transformer encoder + 6-layer Transformer decoder with 12 attention heads; the encoder settings are the same as wav2vec 2.0 and HuBERT.
  • dataset: speech — the 960-hour LibriSpeech corpus; text — 400M sentences.
  • training: 32 V100 GPUs, a per-GPU batch of about 90 s of speech / 12k text tokens, gradients accumulated so that the model is updated once every 2 steps, for 500k updates in total.

ASR evaluation

(results table figure omitted)

TTS evaluation

(results table figure omitted)

VC evaluation

(results table figure omitted)

Origin blog.csdn.net/qq_40168949/article/details/129739985