Paper Reading_Audio Representation_wav2vec_2.0

Paper information

name_en: wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
name_ch: wav2vec 2.0: A self-supervised learning framework for speech representations
paper_addr: http://arxiv.org/abs/2006.11477
date_read: 2023-04-27
date_publish: 2020-10-22
tags: ['Deep Learning','Audio Representation']
author: Alexei Baevski, Facebook AI
code: https://github.com/pytorch/fairseq

1 Feedback

The model is used for speech recognition, and its structure combines a CNN and a Transformer. The paper is concise and clearly written, and its structure is very pleasant to read.

2 Summary

First learn representations of audio from unlabeled speech, then fine-tune with a small amount of labeled data; the resulting model outperforms models trained on large amounts of labeled data, and the principle is very simple.
Using only ten minutes of labeled data and pre-training on 53k hours of unlabeled data, the model achieves a WER of 4.8/8.2 (clean/other test sets). This demonstrates the feasibility of speech recognition with a very limited amount of labeled data.

3 Introduction

Speech recognition systems typically require tens of thousands of hours of transcribed speech (audio plus the corresponding text) to reach acceptable performance, and for most of the nearly 7,000 languages in the world that much labeled data does not exist.
Neural networks benefit from large amounts of labeled training data, but labeled data is far scarcer than unlabeled data. Self-supervised learning methods can learn general data representations from unlabeled examples and then fine-tune the model on labeled data; this has led to important advances in both natural language processing and computer vision.
This paper proposes a self-supervised learning framework that learns general representations from raw audio. The method encodes speech audio with a multi-layer convolutional neural network, masks spans of the resulting latent features (similar to masking in NLP), builds contextualized representations with a Transformer network, and trains the model via a contrastive task.

4 Model

The model first maps the input audio X to a latent space Z with a convolutional network, and then feeds Z into a Transformer network to build contextualized representations C that draw on information from the surrounding context; in addition, the latent features Z are fed to a quantization module to produce discrete quantized representations Q. In this way a representation of the audio is learned.
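A minimal PyTorch sketch of this data flow, with the three components passed in as placeholder modules (the concrete encoder, Transformer, and quantizer are sketched in the sections below); this is an illustration of the structure, not the fairseq implementation:

```python
import torch.nn as nn

class Wav2Vec2Sketch(nn.Module):
    """Toy skeleton of the wav2vec 2.0 data flow: X -> Z -> (C, Q)."""
    def __init__(self, feature_encoder, context_network, quantizer):
        super().__init__()
        self.feature_encoder = feature_encoder   # CNN: raw audio X -> latent features Z
        self.context_network = context_network   # Transformer: Z -> contextualized C
        self.quantizer = quantizer               # product quantizer: Z -> discrete targets Q

    def forward(self, waveform):
        z = self.feature_encoder(waveform)       # (batch, frames, dim)
        c = self.context_network(z)              # used for the contrastive prediction
        q = self.quantizer(z)                    # quantized targets for that prediction
        return c, q
```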

4.1 Feature Encoder

The encoder consists of several blocks, each containing a temporal convolution followed by layer normalization and a GELU activation. The raw waveform input to the encoder is normalized to zero mean and unit variance, and the encoder output is fed to the Transformer.
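A minimal sketch of such an encoder in PyTorch; the channel count, kernel widths, and strides below follow the configuration reported in the paper (seven blocks, 512 channels, roughly a 20 ms hop at 16 kHz), but the code is illustrative rather than the fairseq implementation:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Temporal convolution -> layer normalization -> GELU (Sec. 4.1)."""
    def __init__(self, in_ch, out_ch, kernel, stride):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride=stride)
        self.norm = nn.LayerNorm(out_ch)
        self.act = nn.GELU()

    def forward(self, x):                                  # x: (batch, channels, time)
        x = self.conv(x)
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)   # normalize over channels
        return self.act(x)

strides = [5, 2, 2, 2, 2, 2, 2]
kernels = [10, 3, 3, 3, 3, 2, 2]
blocks, in_ch = [], 1
for k, s in zip(kernels, strides):
    blocks.append(ConvBlock(in_ch, 512, k, s))
    in_ch = 512
feature_encoder = nn.Sequential(*blocks)

wav = torch.randn(1, 1, 16000)      # 1 s of 16 kHz audio, normalized to zero mean / unit variance
z = feature_encoder(wav)            # -> (1, 512, 49): one 512-dim latent per ~20 ms frame
```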

4.2 Contextualized representations with a Transformer

The output of the feature encoder is fed to a context network with a Transformer architecture. Instead of fixed positional embeddings, a convolutional layer acts as a relative positional embedding: the output of the convolution, followed by a GELU, is added to the input, and layer normalization is then applied.
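A sketch of this positional convolution plus a stock Transformer encoder; the kernel size and group count (128, 16) follow the paper's reported values, while the stand-in `nn.TransformerEncoder` replaces the fairseq context network for illustration:

```python
import torch
import torch.nn as nn

class ConvPositionalEmbedding(nn.Module):
    """Relative positional embedding via a grouped temporal convolution (Sec. 4.2)."""
    def __init__(self, dim=768, kernel=128, groups=16):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=groups)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(dim)

    def forward(self, z):                          # z: (batch, frames, dim)
        pos = self.conv(z.transpose(1, 2)).transpose(1, 2)
        pos = pos[:, : z.size(1)]                  # trim the extra frame from even-sized padding
        return self.norm(z + self.act(pos))        # add positional info, then layer norm

context_network = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)
z = torch.randn(2, 49, 768)                        # encoder output projected to the model dimension
c = context_network(ConvPositionalEmbedding()(z))  # contextualized representations C
```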

4.3 Quantization module

During the self-supervised training phase, the output of the feature encoder z is discretized into a finite set of speech representations via product quantization. Product quantization amounts to choosing quantized representations from multiple codebooks and concatenating them: given G codebooks (groups), one entry is selected from each codebook, the chosen vectors e_1, ..., e_G are concatenated, and a linear transformation is applied to obtain q. The Gumbel softmax makes the selection of discrete codebook entries fully differentiable.
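A hedged sketch of product quantization with the Gumbel softmax; `groups=2` and `entries=320` follow the paper, while the entry and output dimensions here are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductQuantizer(nn.Module):
    """Gumbel-softmax product quantization sketch (Sec. 4.3): pick one entry per
    codebook, concatenate the picked entries, then apply a linear transformation."""
    def __init__(self, in_dim=512, groups=2, entries=320, entry_dim=128, out_dim=256):
        super().__init__()
        self.groups, self.entries = groups, entries
        self.codebooks = nn.Parameter(torch.randn(groups, entries, entry_dim))
        self.to_logits = nn.Linear(in_dim, groups * entries)
        self.proj = nn.Linear(groups * entry_dim, out_dim)

    def forward(self, z, tau=2.0):                    # z: (batch, frames, in_dim)
        b, t, _ = z.shape
        logits = self.to_logits(z).view(b, t, self.groups, self.entries)
        # hard=True selects one-hot entries in the forward pass yet stays differentiable
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        picked = torch.einsum("btgv,gvd->btgd", onehot, self.codebooks)
        return self.proj(picked.reshape(b, t, -1))    # q: (batch, frames, out_dim)

q = ProductQuantizer()(torch.randn(2, 49, 512))
```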

5 Training & Experiment

5.1 Masking

Similar to BERT's masking strategy, a proportion of the feature encoder outputs is masked before they are fed to the context network: a certain proportion of time steps is randomly sampled (without replacement) as starting indices, the M consecutive time steps following each starting index are masked, and the resulting spans may overlap.
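A sketch of this span masking; `p=0.065` and `span=10` mirror the defaults reported in the paper, and `mask_emb` stands in for the learned mask embedding that replaces masked frames:

```python
import torch

def sample_span_mask(batch, frames, p=0.065, span=10):
    """Span masking sketch (Sec. 5.1): pick ~p of the time steps as start indices
    (without replacement) and mask the `span` steps that follow; spans may overlap."""
    mask = torch.zeros(batch, frames, dtype=torch.bool)
    num_starts = max(1, int(p * frames))
    for b in range(batch):
        for s in torch.randperm(frames)[:num_starts].tolist():   # distinct start indices
            mask[b, s : s + span] = True                         # neighbouring spans can overlap
    return mask

z = torch.randn(2, 49, 768)                # feature encoder outputs (projected)
mask_emb = torch.randn(768)                # a learned vector in the real model
mask = sample_span_mask(*z.shape[:2])
z[mask] = mask_emb                         # masked frames are replaced before the Transformer
```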

5.2 Objectives

During pre-training, the model is optimized with a contrastive loss $\mathcal{L}_m$ together with a diversity loss $\mathcal{L}_d$ that encourages the model to make use of the codebook entries:

$$\mathcal{L} = \mathcal{L}_m + \alpha \mathcal{L}_d$$

where $\alpha$ is a tuned hyperparameter.

5.2.1 Loss for Contrastive Learning

For a masked time step t, the output c_t of the context network has to identify the true quantized representation q_t among a set of K+1 candidates Q_t, consisting of q_t and K distractors sampled from other masked time steps of the same utterance:

$$\mathcal{L}_m = -\log \frac{\exp(\mathrm{sim}(c_t, q_t)/\kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(c_t, \tilde{q})/\kappa)}$$

Here sim(a, b) = aᵀb / (‖a‖‖b‖) is the cosine similarity between the context representation and the quantized latent representation, and κ is a temperature.
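A minimal sketch of this loss for a single masked time step; the temperature κ=0.1 and K=100 distractors match the values reported in the paper, and the candidate layout (true target at index 0) is an implementation convenience:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, kappa=0.1):
    """Contrastive loss for one masked step (Sec. 5.2.1).
    c_t: (dim,) context output; q_t: (dim,) true quantized target;
    distractors: (K, dim) quantized targets sampled from other masked steps."""
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)            # (K+1, dim)
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1) / kappa  # scaled sim()
    # the true target sits at index 0, so this reduces to a (K+1)-way classification
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))

loss_m = contrastive_loss(torch.randn(256), torch.randn(256), torch.randn(100, 256))
```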

5.2.2 Diversity loss
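The diversity loss encourages equal use of the V entries in each of the G codebooks by maximizing the entropy of the softmax distribution over codebook entries, averaged across a batch of utterances:

$$\mathcal{L}_d = \frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} \bar{p}_{g,v} \log \bar{p}_{g,v}$$

A small sketch of this term, assuming `p_bar` holds the quantizer's Gumbel-softmax probabilities averaged over the batch:

```python
import torch

def diversity_loss(p_bar, eps=1e-7):
    """Diversity loss sketch (Sec. 5.2.2).
    p_bar: (G, V) softmax probabilities over codebook entries, averaged over the batch.
    Minimizing sum(p * log p) maximizes entropy, i.e. spreads use over all entries."""
    G, V = p_bar.shape
    return (p_bar * torch.log(p_bar + eps)).sum() / (G * V)

# Example: average the quantizer's Gumbel-softmax probabilities over batch and time
logits = torch.randn(2, 49, 2, 320)                       # (batch, frames, G, V)
p_bar = torch.softmax(logits, dim=-1).mean(dim=(0, 1))    # (G, V)
loss_d = diversity_loss(p_bar)
```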

5.3 Fine tuning

The pre-trained model is fine-tuned for speech recognition on the Librispeech dataset: a linear projection is added on top of the context network to map the representations to the output vocabulary, and the model is optimized by minimizing a CTC loss.
LibriSpeech is a corpus of about 1000 hours of 16 kHz read English speech derived from the audiobooks of the LibriVox project, carefully segmented and aligned.
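A sketch of this fine-tuning head; the vocabulary size and tensor shapes are illustrative (a small character set plus a CTC blank), and `F.ctc_loss` stands in for the training recipe described in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 32                                    # assumed: characters + CTC blank at index 0
proj = nn.Linear(768, vocab_size)                  # linear projection on top of the context network

c = torch.randn(4, 49, 768)                        # context representations (batch, frames, dim)
log_probs = F.log_softmax(proj(c), dim=-1).transpose(0, 1)   # (frames, batch, vocab) for CTC

targets = torch.randint(1, vocab_size, (4, 20))    # dummy character transcripts (no blanks)
input_lengths = torch.full((4,), 49, dtype=torch.long)
target_lengths = torch.full((4,), 20, dtype=torch.long)
loss_ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
```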
