Masked Autoencoders (MAE) for Self-Supervised Learning: Audio Recognition

1. References

Masked Autoencoders that Listen

2. Background

Transformers and self-supervised learning (SSL) now occupy a dominant position in both computer vision (CV) and natural language processing (NLP).

BERT-style masked autoencoding, via self-supervised pre-training on large-scale language corpora, set a new state of the art across a range of NLP tasks. Similarly, in the CV community, Vision Transformers (ViT) have become increasingly popular for self-supervision. In image representation learning, masked autoencoders (MAE) brought the CV community closer to BERT's success in NLP.

This work studies the listening side, i.e., audio recognition, evaluated on AudioSet (the largest audio dataset), environmental sound classification (ESC-50), speech command recognition (SPC-2, SPC-1), and speaker identification (VoxCeleb).

3. Masked autoencoder

The overall pipeline of MAE for audio works as follows (the architecture figure is omitted here; see the original paper):

① Divide the audio time-frequency spectrogram into patches and mask out most of them (a patchification sketch follows this list);

② Encode only the remaining visible patches;

③ Restore the full token sequence by inserting mask tokens at the masked positions, then reconstruct the masked patches through the decoder;

④ Compute the MSE loss against the target spectrogram patches and update the encoder and decoder.
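Step ① can be made concrete with torchaudio. A minimal sketch, assuming a placeholder file "example.wav" and illustrative hyperparameters (128 mel bins, 16×16 patches; the paper's exact settings may differ):

```python
import torch
import torchaudio

# "example.wav" is a placeholder path; all hyperparameters are illustrative.
wave, sr = torchaudio.load("example.wav")
wave = wave.mean(dim=0, keepdim=True)                  # downmix to mono
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=400, hop_length=160, n_mels=128)(wave)
logmel = torch.log(mel + 1e-6)                         # (1, 128, T)

# Step ①: cut the spectrogram into non-overlapping 16x16 patches.
p = 16
patches = logmel.unfold(1, p, p).unfold(2, p, p)       # (1, 8, T//16, 16, 16); leftover frames dropped
patches = patches.reshape(1, -1, p * p)                # (1, N, 256): N flattened patches
```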

The encoder is a 12-layer ViT-Base (ViT-B).

The decoder is a stack of standard Transformer blocks.
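The full pipeline (steps ① to ④) can be condensed into a minimal PyTorch sketch. The 12-layer, 768-dim, 12-head encoder matches standard ViT-B; the decoder depth and width, the 80% mask ratio, and the omission of positional embeddings are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class TinyAudioMAE(nn.Module):
    """Minimal masked-autoencoder sketch for spectrogram patches.

    Positional embeddings and other details are omitted for brevity;
    decoder depth/width are illustrative assumptions.
    """
    def __init__(self, patch_dim=256, embed_dim=768, dec_dim=512):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        self.encoder = nn.TransformerEncoder(          # 12-layer ViT-B-like encoder
            nn.TransformerEncoderLayer(embed_dim, nhead=12, batch_first=True),
            num_layers=12)
        self.enc_to_dec = nn.Linear(embed_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(          # small standard-Transformer decoder
            nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True),
            num_layers=4)
        self.head = nn.Linear(dec_dim, patch_dim)      # predicts raw patch values

    def forward(self, patches, mask_ratio=0.8):
        B, N, D = patches.shape
        num_keep = int(N * (1 - mask_ratio))
        # ① randomly choose a small visible subset of patches
        perm = torch.rand(B, N, device=patches.device).argsort(dim=1)
        keep = perm[:, :num_keep]
        visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        # ② encode only the visible patches
        enc = self.encoder(self.patch_embed(visible))
        # ③ scatter encoded tokens back, fill masked slots with a mask token, decode
        dec_in = self.mask_token.expand(B, N, -1).clone()
        dec_in.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, dec_in.size(-1)),
                        self.enc_to_dec(enc))
        pred = self.head(self.decoder(dec_in))
        # ④ MSE loss computed on the masked patches only
        masked = torch.ones(B, N, device=patches.device)
        masked.scatter_(1, keep, 0.0)
        loss = (((pred - patches) ** 2).mean(-1) * masked).sum() / masked.sum()
        return loss, pred
```

For example, `loss, _ = TinyAudioMAE()(torch.randn(2, 48, 256))` runs the forward pass of one pre-training step on random patches.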

See the original paper for further details.

4. Fine-tuning on downstream tasks

After pre-training, MAE keeps only the encoder; the decoder is discarded so that the encoder can be transferred to downstream tasks.
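A sketch of this transfer, reusing the `TinyAudioMAE` sketch above (527 is AudioSet's label count; the linear head and mean pooling are illustrative assumptions rather than the paper's exact fine-tuning recipe):

```python
import torch.nn as nn

class AudioClassifier(nn.Module):
    """Reuse the pre-trained MAE encoder; the decoder is simply discarded."""
    def __init__(self, pretrained_mae, embed_dim=768, num_classes=527):
        super().__init__()
        self.patch_embed = pretrained_mae.patch_embed   # pre-trained weights
        self.encoder = pretrained_mae.encoder           # pre-trained weights
        self.head = nn.Linear(embed_dim, num_classes)   # new, randomly initialized

    def forward(self, patches):                         # (B, N, patch_dim), no masking
        tokens = self.encoder(self.patch_embed(patches))
        return self.head(tokens.mean(dim=1))            # mean-pool, then classify
```

Only the head is new; the encoder starts from the pre-trained weights and is fine-tuned end to end.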

5. Results

Spectrogram reconstruction examples are shown in the original paper (figure omitted here).

The downstream-task results are reported in the original paper (table omitted here).

6. Extended applications

The pre-trained MAE model transfers to a variety of downstream tasks and substantially improves recognition accuracy.

Origin blog.csdn.net/pk296256948/article/details/128666880