1. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
(1) Paper ideas
Building on vq-wav2vec, spans of the latent speech representations are masked, and a contrastive task is trained to distinguish the true quantized latent representation from negative examples (while the quantization itself is learned jointly). Fine-tuning the resulting representations on a small amount of labeled data achieves strong results. Unlike vq-wav2vec, there is no need to attach a separate BERT stage: merging the two separately trained models into one is a better engineering implementation.
(2) Model architecture
- Feature encoder
X -> Z: the raw audio is passed through a stack of convolutional blocks to produce latent speech representations. Each block is a convolutional layer followed by GELU; the convolution in the first block is additionally followed by group normalization, and layer normalization is applied to the output channels of the network.
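As a concrete check on the temporal resolution this stack produces, the sketch below computes how many latent frames come out of a given number of raw samples (the kernel widths and strides are the values reported in the paper; no padding is assumed):

```python
def encoder_out_len(n_samples, kernels=(10, 3, 3, 3, 3, 2, 2),
                    strides=(5, 2, 2, 2, 2, 2, 2)):
    # Frame count after the seven conv blocks of the feature encoder
    # (kernel widths and strides from the paper; no padding).
    n = n_samples
    for k, s in zip(kernels, strides):
        n = (n - k) // s + 1
    return n

# One second of 16 kHz audio -> 49 latent frames, i.e. roughly a 20 ms hop.
print(encoder_out_len(16000))  # 49
```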
- Contextualized representations with a Transformer
Z -> C: unlike vq-wav2vec, the input to the Transformer is not quantized; the continuous latent speech representations are fed in directly. Instead of absolute positional embeddings, a convolution with kernel size 128 and 16 groups is used as a relative positional embedding.
- Quantization module
z is converted into a discrete representation via product quantization. Unlike vq-wav2vec (where sharing the codebook vectors across groups worked better), G separate codebooks are used for the G groups; the final representation is the concatenation of the entries chosen from the different codebooks, followed by a linear transformation.
The probability of choosing the v-th entry of the g-th codebook is given by a Gumbel softmax: p_{g,v} = exp((l_{g,v} + n_v)/τ) / Σ_{k=1}^{V} exp((l_{g,k} + n_k)/τ), where n = -log(-log(u)) and u ~ U(0, 1).
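The selection step can be sketched with a hard Gumbel softmax in numpy. The sizes G = 2 and V = 320 follow the paper; the per-group entry dimension and the function name are illustrative, and the final linear transformation is omitted:

```python
import numpy as np

def quantize(logits, codebooks, tau=2.0, rng=np.random.default_rng(0)):
    # logits: (G, V) per-codebook logits; codebooks: (G, V, d) entry vectors.
    # Gumbel(0, 1) noise: n = -log(-log(u)), u ~ U(0, 1)
    u = rng.random(logits.shape)
    noise = -np.log(-np.log(u))
    scaled = (logits + noise) / tau
    scaled -= scaled.max(axis=-1, keepdims=True)          # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum(-1, keepdims=True)
    idx = probs.argmax(-1)                                # hard choice per group
    # Concatenate the chosen entry from each of the G codebooks;
    # the paper then applies a linear transformation, omitted here.
    return np.concatenate([codebooks[g, i] for g, i in enumerate(idx)])

G, V, d = 2, 320, 128          # G=2 groups, V=320 entries per the paper
q = quantize(np.zeros((G, V)), np.zeros((G, V, d)))
print(q.shape)  # (256,)
```

During the forward pass the choice is hard (argmax), while the softmax probabilities carry the gradient in the real straight-through estimator, which this sketch does not implement.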
(3) Training details
The training objective is, at each masked time step, to distinguish the true quantized representation q_t from K distractors q̃ sampled from Q_t.
- Masking
Spans of the continuous latent speech representations output by the feature encoder are masked: each time step is selected as a span start with probability p = 0.065, and each span covers the following M = 10 time steps (spans may overlap). With this setting, about 49% of all time steps are masked, and the average span length is 14.9 time steps, or 299 ms.
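A minimal numpy sketch of this span sampling (the function name is mine) reproduces the roughly 49% coverage:

```python
import numpy as np

def sample_mask(seq_len, p=0.065, span=10, rng=None):
    # Every time step is an independent candidate span start with
    # probability p; the `span` steps following each start are masked,
    # so spans can overlap and merge into longer runs.
    rng = rng or np.random.default_rng(0)
    mask = np.zeros(seq_len, dtype=bool)
    for start in np.flatnonzero(rng.random(seq_len) < p):
        mask[start:start + span] = True
    return mask

mask = sample_mask(100000)
# Expected coverage is 1 - (1 - p)^span ≈ 0.49, matching the paper.
print(f"{mask.mean():.2f}")
```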
- Training objective
The loss is L = L_m + α L_d. L_m is the contrastive loss
L_m = -log( exp(sim(c_t, q_t)/κ) / Σ_{q̃ ∈ Q_t} exp(sim(c_t, q̃)/κ) ),
where the K distractors are sampled uniformly from the other masked time steps of the same sequence and sim(a, b) = aᵀb / (||a|| · ||b||) is the cosine similarity.
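A numpy sketch of this contrastive term (function names are illustrative; κ = 0.1 as in the paper). A context vector matched with its own quantized target should score a much lower loss than one paired with a random "positive":

```python
import numpy as np

def cosine_sim(a, b):
    # sim(a, b) = a^T b / (||a|| * ||b||)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(c_t, q_t, distractors, kappa=0.1):
    # -log softmax over the true target plus the K distractors,
    # scaled by the temperature kappa.
    sims = np.array([cosine_sim(c_t, q) for q in [q_t] + list(distractors)])
    sims = sims / kappa
    sims -= sims.max()                       # numerical stability
    p = np.exp(sims) / np.exp(sims).sum()
    return -np.log(p[0])

rng = np.random.default_rng(0)
c = rng.standard_normal(64)
negs = rng.standard_normal((10, 64))
loss_true = contrastive_loss(c, c, negs)         # target matches the context
loss_wrong = contrastive_loss(c, negs[0], negs)  # mismatched "positive"
print(loss_true < loss_wrong)
```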
L_d is the diversity loss, which encourages equal use of the V entries in each of the G codebooks by maximizing the entropy of the averaged softmax distribution over the G × V entries: L_d = (1/(G·V)) Σ_{g=1}^{G} -H(p̄_g), where p̄_g is the softmax distribution of codebook g averaged over the utterances in a batch.
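A numpy sketch of the diversity term under these definitions (the shapes and names are mine). The loss is lowest when all entries are used equally and rises when a codebook collapses onto a few entries:

```python
import numpy as np

def diversity_loss(logits):
    # logits: (batch, G, V) codebook logits. p̄_g is the softmax
    # distribution of codebook g averaged over the batch;
    # L_d = (1/(G·V)) Σ_g -H(p̄_g) is minimized at the uniform distribution.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p = e / e.sum(axis=-1, keepdims=True)            # softmax per (batch, g)
    p_bar = p.mean(axis=0)                           # (G, V) averaged dist.
    neg_entropy = (p_bar * np.log(p_bar + 1e-9)).sum(axis=-1)  # -H(p̄_g)
    G, V = p_bar.shape
    return neg_entropy.sum() / (G * V)

uniform = np.zeros((4, 2, 320))        # every entry equally likely
peaked = np.zeros((4, 2, 320))
peaked[..., 0] = 10.0                  # codebooks collapsed onto entry 0
print(diversity_loss(uniform) < diversity_loss(peaked))  # True
```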
To stabilize training, an L2 penalty is applied to the activations of the final layer of the feature encoder, before layer normalization.
For fine-tuning, a randomly initialized linear layer is added on top of the context network, and the model is trained by minimizing the CTC loss.
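The paper decodes with beam search and a language model; as a simpler illustration of how CTC outputs map to label sequences, here is a generic greedy decoding sketch (not the paper's decoder):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    # CTC collapse: merge consecutive repeats, then drop blank tokens.
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# Frame-level argmax ids -> label sequence
print(ctc_greedy_decode([0, 3, 3, 0, 1, 1, 1, 0, 3]))  # [3, 1, 3]
```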
(4) Experimental configuration and results
The pre-training stage uses two configurations: a BASE model and a LARGE model.
For decoding, a 4-gram language model and a Transformer LM are used.
Results when training with small amounts of labeled data.
Comparison against supervised and semi-supervised models when fine-tuning on the full 960 h of labeled data.
Reaches a new state of the art on TIMIT phoneme recognition, reducing PER by 23% and 29% relative to the previous best.
Ablations show that the setup used here, continuous inputs to the Transformer (which retain richer context information) combined with quantized targets (which stabilize training), performs best.
Possible future improvements: (1) replace the Transformer + CTC setup with a seq2seq model; (2) resolve the mismatch between the acoustic model's vocabulary (characters) and the LM's vocabulary (words); (3) use word pieces; (4) apply data balancing; (5) introduce self-training.