wav2vec2.0: A Framework for Self-Supervised Learning of Speech Representations

(1) Paper ideas

Building on the idea of vq-wav2vec, the model masks spans of the latent speech representation and trains a contrastive task that distinguishes the true quantized latent variable from negative examples, while jointly learning the quantized latent representations. The resulting representations, fine-tuned on a small amount of labeled data, achieve strong results. Compared with vq-wav2vec, there is no need to attach a separate BERT model: merging the two previously separately trained models into one is the better engineering design.

(2) Model architecture

[Figure: the wav2vec 2.0 model architecture (feature encoder, quantization module, Transformer context network).]

  • Feature encoder
    X -> Z: the raw audio waveform passes through a multi-layer convolutional network that outputs latent speech representations. Each block consists of a convolutional layer followed by GELU; the convolution in the first block is additionally followed by group normalization before the GELU, and layer normalization is applied to the output of the network. (A minimal sketch of these components appears after this list.)


  • Contextualized representations with Transformers (Z -> C): compared with vq-wav2vec, the input is not quantized first; the continuous speech representation is fed directly into the Transformer. Instead of absolute positional encoding, a convolution with kernel size 128 and 16 groups serves as a relative positional embedding.

  • Quantization module
    Converts z into a discrete representation through product quantization. This differs from vq-wav2vec (where sharing the codebook vectors across groups worked better when using multiple groups): here, G separate codebooks are used for the G groups, the final representation is the concatenation of the entries chosen from the different codebooks, followed by a linear transformation.

The probability of choosing the v-th codebook entry in group g is computed with a Gumbel softmax:

$p_{g,v} = \frac{\exp((l_{g,v} + n_v)/\tau)}{\sum_{k=1}^{V} \exp((l_{g,k} + n_k)/\tau)}$

where $l \in \mathbb{R}^{G \times V}$ are the logits, $\tau$ is the temperature, and $n_v = -\log(-\log(u_v))$ with $u_v$ sampled from $U(0, 1)$.
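
To make these components concrete, here is a minimal PyTorch-style sketch of the feature encoder, the convolutional relative positional embedding, and the Gumbel-softmax product quantizer. All class names, dimensions, and hyperparameters below are illustrative assumptions for the sketch, not the actual fairseq implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureEncoder(nn.Module):
    """X -> Z: a stack of 1-d convolutions with GELU; group norm after the
    first convolution, layer norm on the output (kernel/stride values follow
    the paper: one latent frame per ~20 ms of 16 kHz audio)."""
    def __init__(self, dim=512):
        super().__init__()
        cfg = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
        blocks, in_ch = [], 1
        for i, (k, s) in enumerate(cfg):
            layers = [nn.Conv1d(in_ch, dim, k, stride=s)]
            if i == 0:
                layers.append(nn.GroupNorm(dim, dim))
            layers.append(nn.GELU())
            blocks.append(nn.Sequential(*layers))
            in_ch = dim
        self.blocks = nn.Sequential(*blocks)
        self.norm = nn.LayerNorm(dim)

    def forward(self, wav):                     # wav: (batch, samples)
        z = self.blocks(wav.unsqueeze(1))       # (batch, dim, frames)
        return self.norm(z.transpose(1, 2))     # (batch, frames, dim)

class ConvPositionalEmbedding(nn.Module):
    """Relative positional information via a grouped convolution
    (kernel size 128, 16 groups) added to the Transformer input."""
    def __init__(self, dim=512, kernel=128, groups=16):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=groups)

    def forward(self, x):                       # x: (batch, frames, dim)
        pos = self.conv(x.transpose(1, 2))[..., : x.shape[1]]
        return x + F.gelu(pos).transpose(1, 2)

class GumbelProductQuantizer(nn.Module):
    """Z -> Q: product quantization with G codebooks of V entries each;
    entries are chosen with a straight-through Gumbel softmax."""
    def __init__(self, dim_in=512, dim_out=256, groups=2, entries=320, tau=2.0):
        super().__init__()
        self.G, self.V, self.tau = groups, entries, tau
        self.to_logits = nn.Linear(dim_in, groups * entries)
        self.codebooks = nn.Parameter(
            torch.randn(groups, entries, dim_out // groups))
        self.out_proj = nn.Linear(dim_out, dim_out)

    def forward(self, z):                       # z: (batch, frames, dim_in)
        B, T, _ = z.shape
        logits = self.to_logits(z).view(B, T, self.G, self.V)
        # differentiable one-hot choice per group (the formula above)
        one_hot = F.gumbel_softmax(logits, tau=self.tau, hard=True, dim=-1)
        # concatenate the chosen entry of every codebook, then project
        q = torch.einsum('btgv,gvd->btgd', one_hot, self.codebooks)
        return self.out_proj(q.reshape(B, T, -1))

# usage: 1 second of 16 kHz audio -> ~49 latent frames -> quantized targets
wav = torch.randn(2, 16000)
z = FeatureEncoder()(wav)                       # (2, 49, 512)
c_in = ConvPositionalEmbedding()(z)             # Transformer input
q = GumbelProductQuantizer()(z)                 # (2, 49, 256) targets
```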

(3) Training details

The training objective is, for every masked time step, to distinguish the true quantized latent $q_t$ from $K$ distractors $\tilde{q}$.

  • Masking
    The continuous latent speech representations output by the feature encoder are masked: each time step is selected as the start of a mask span with probability p = 0.065, and each span covers 10 time steps (spans may overlap). As a result, about 49% of all time steps are masked, and the average masked span length is 14.9 time steps, or roughly 299 ms. (The sampling procedure is sketched in the code after the loss description below.)

  • Training objective

The overall loss is $L = L_m + \alpha L_d$, where $L_m$ is the contrastive loss

$L_m = -\log \frac{\exp(sim(c_t, q_t)/\kappa)}{\sum_{\tilde{q} \in Q_t} \exp(sim(c_t, \tilde{q})/\kappa)}$

Here $Q_t$ consists of the true target $q_t$ and $K$ distractors sampled uniformly from the other masked time steps of the same sequence, $\kappa$ is a temperature, and the similarity is the cosine similarity $sim(a, b) = a^T b / (\lVert a \rVert \, \lVert b \rVert)$.

$L_d$ is the diversity loss, designed to encourage equal use of the $V$ entries in each of the $G$ codebooks. It maximizes the entropy of the averaged softmax distribution over the $G \times V$ codebook entries:

$L_d = \frac{1}{GV}\sum_{g=1}^{G} -H(\bar{p}_g) = \frac{1}{GV}\sum_{g=1}^{G}\sum_{v=1}^{V} \bar{p}_{g,v}\log \bar{p}_{g,v}$

where $\bar{p}_g$ denotes the softmax distribution of codebook $g$ averaged over a batch of utterances.
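
Under the same caveat as before (illustrative shapes, names, and hyperparameters, not the fairseq code), the span masking and the two loss terms could be computed roughly as follows:

```python
import torch
import torch.nn.functional as F

def sample_span_mask(num_frames, p=0.065, span=10):
    """Pick each frame as a span start with probability p and mask the next
    `span` frames; overlapping spans simply merge."""
    mask = torch.zeros(num_frames, dtype=torch.bool)
    for start in (torch.rand(num_frames) < p).nonzero().flatten().tolist():
        mask[start:start + span] = True
    return mask

def contrastive_loss(c, q, mask, num_distractors=100, kappa=0.1):
    """L_m for one utterance. c: (T, D) context vectors, q: (T, D) quantized
    targets; distractors are drawn from the other masked time steps."""
    masked = mask.nonzero().flatten().tolist()
    losses = []
    for t in masked:
        others = torch.tensor([m for m in masked if m != t])
        idx = others[torch.randint(len(others), (num_distractors,))]
        candidates = torch.cat([q[t:t + 1], q[idx]])      # true target first
        sims = F.cosine_similarity(c[t:t + 1], candidates) / kappa
        losses.append(F.cross_entropy(sims.unsqueeze(0),
                                      torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

def diversity_loss(probs):
    """L_d from soft codebook probabilities of shape (N, G, V)."""
    p_bar = probs.mean(dim=0)                             # averaged distribution
    G, V = p_bar.shape
    # minimizing this maximizes the entropy of each group's distribution
    return (p_bar * torch.log(p_bar + 1e-7)).sum() / (G * V)

# usage with random stand-ins for context vectors and quantized targets
T, D = 500, 256
mask = sample_span_mask(T)
loss = contrastive_loss(torch.randn(T, D), torch.randn(T, D), mask) \
       + 0.1 * diversity_loss(torch.softmax(torch.randn(64, 2, 320), dim=-1))
```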

To stabilize training, an L2 penalty on the activations of the final layer of the feature encoder (applied before layer normalization) is added to the loss.

For fine-tuning, a randomly initialized linear layer is added on top of the context network outputs, and the model is trained by minimizing a CTC loss.
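
A minimal sketch of this fine-tuning step, assuming a character vocabulary of 32 symbols and pre-computed context vectors (both placeholders for whatever the real setup provides):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 32                     # e.g. characters + CTC blank (assumed)
proj = nn.Linear(768, vocab_size)   # randomly initialized linear layer on top of C
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# context: (batch, frames, 768) output of the pretrained context network
context = torch.randn(4, 120, 768)
targets = torch.randint(1, vocab_size, (4, 30))          # dummy transcripts
input_lengths = torch.full((4,), 120, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)

log_probs = F.log_softmax(proj(context), dim=-1)         # (batch, frames, vocab)
loss = ctc(log_probs.transpose(0, 1),                    # CTC expects (T, N, C)
           targets, input_lengths, target_lengths)
loss.backward()
```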

(4) Experimental configuration and results

Pre-training uses two configurations: a BASE model and a LARGE model.
For decoding, both a 4-gram language model and a Transformer LM are used.
[Table: results of fine-tuning with small amounts of labeled data.]

[Table: comparison with supervised and semi-supervised models when fine-tuning on the full 960 h of labeled data.]
[Table: TIMIT phoneme recognition results.]
The model sets a new state of the art on TIMIT phoneme recognition, reducing PER by 23% and 29% relative to the previous best result.
[Table: ablation comparing continuous vs. quantized inputs and targets.]
The ablation shows that the best training strategy is to feed continuous inputs to the context network (which retain more contextual information) while predicting quantized targets (which makes training more stable).
Possible future improvements: (1) replace the Transformer + CTC setup with a seq2seq model; (2) fix the mismatch between the acoustic model's vocabulary (characters) and the language model's vocabulary (words); (3) use word pieces; (4) apply data balancing; (5) introduce self-training.

Original post: blog.csdn.net/pitaojun/article/details/108164898