[Paper Reading] MAE

Model overview
  • Transformer
    An encoder-decoder architecture based purely on the attention mechanism.
  • BERT
    Extends the Transformer encoder to more general NLP tasks. Its cloze-style self-supervised training mechanism requires no labels: by predicting masked words in a sentence, it learns to extract text features, broadening the range of applications of the Transformer.
  • ViT
    Applies the Transformer to CV.
  • MAE
    The CV counterpart of BERT. Building on the ViT paper, it switches to a self-supervised training method and extends training to unlabeled data.

Link to the paper:
Masked Autoencoders Are Scalable Vision Learners

1. Title

Masked: with a mask; "masked" here carries the same cloze meaning as in BERT.
autoencoder: "auto" does not mean automatic but "self", as in the class of models called autoregressive models, where the labels (y) and the samples (x) come from the same thing. For example, a language model uses the preceding words to predict some of the following words; the predicted words (y) are themselves part of the sample x.
scalable: the model can be scaled up (it is relatively large).
vision learners: a deliberately general term rather than, say, "classifier"; it signals that the model has a wide range of applicability and is meant as a backbone.
Two words that are often used when writing papers: scalable and efficient.

  • scalable: the model can be scaled up (it is relatively large)
  • efficient: the model is fast (runs/trains quickly)

2. Introduction

Using masked autoencoders in computer vision is not new in itself. For example, the denoising autoencoder adds heavy noise to an image and learns to understand the image by removing that noise. MAE, however, is built on the Transformer.
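As a rough illustration of the denoising-autoencoder idea mentioned above (a minimal sketch, not MAE itself; the tiny convolutional model and the noise level are arbitrary assumptions made for this example):

```python
import torch
import torch.nn as nn

# A toy image-to-image network standing in for "some autoencoder"; any such model would do.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)

def denoising_step(clean_images, noise_std=0.5):
    noisy = clean_images + noise_std * torch.randn_like(clean_images)  # corrupt the input heavily
    recon = model(noisy)                                               # try to undo the corruption
    return ((recon - clean_images) ** 2).mean()                        # MSE against the clean image

loss = denoising_step(torch.rand(8, 3, 32, 32))  # fake batch of 8 RGB images
loss.backward()
```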

3. Related work

  • BEiT
    BERT on images. BEiT assigns a discrete token label to each patch, whereas MAE directly reconstructs the original pixel information.
  • Self-supervised learning
    Contrastive learning, which relies on data augmentation.

In the related-work section, it is best to explain clearly how your own work differs from the prior work.

4. Approach

[Figure: MAE architecture overview]
MAE is an autoencoder that sees part of the data and can reconstruct the complete original signal.

Map the observed signal to a latent representation (a representation in the semantic space), and then use a decoder to reconstruct the original signal from this latent representation.

Unlike the classic autoencoder, MAE has an asymmetric structure: the masked tokens are not fed into the encoder, which saves a large amount of computation.

  • Masking
    Randomly masks 75% of the patches, which makes the task harder and forces the model to learn useful information.

  • MAE encoder
    A ViT, but applied only to the visible (unmasked) patches; only those patches are fed into the ViT.

  • MAE decoder
    The decoder needs to see both the masked and the unmasked blocks in order to reconstruct the image. It is no longer a simple linear layer but another Transformer, so positional encodings must also be provided for the masked blocks. The decoder is mainly used during pre-training: when MAE is transferred to other tasks, the decoder is dropped and only the encoder is used to encode the image. (At prediction time, detach the decoder and use the encoder as a feature extractor.)

  • Reconstruction target
    The last layer of the decoder is a linear layer. If patch_size = 16*16, this linear layer projects each token to a vector of length 256, which is then reshaped to 16*16 to recover the original pixel information. The loss function is MSE, computed only on the masked blocks. The pixels to be predicted can also be normalized so that their mean becomes 0 and their variance becomes 1, which makes the values more stable.

  • Simple implementation
    Random sampling is implemented by shuffling. When the decoder is used, each previously masked patch is represented by a shared learnable vector of the same length (the mask token, an initialization), and positional encodings are added to all patches. After training with the loss function, these vectors are learned, and the pixel information can finally be reconstructed after a reshape.
    Through shuffle and unshuffle, no sparse operations are needed, so the implementation is very fast and does not interfere with the subsequent ViT operations. (A code sketch of these steps follows below.)
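To make the steps above concrete, here is a minimal PyTorch-style sketch (my own simplified rendering, not the official implementation): random masking by shuffling patch indices with argsort, restoring masked positions with a shared learnable mask token by unshuffling, and an MSE loss computed only on the masked patches with optional per-patch pixel normalization. The helper names (patchify, random_masking, append_mask_tokens, mae_loss) and patch_size = 16 are illustrative assumptions.

```python
import torch

patch_size = 16    # each patch is 16x16 pixels
mask_ratio = 0.75  # 75% of the patches are masked

def patchify(imgs):
    # (B, C, H, W) -> (B, N, patch_size*patch_size*C): flatten each patch into a vector.
    B, C, H, W = imgs.shape
    p = patch_size
    x = imgs.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)

def random_masking(x, mask_ratio=0.75):
    # "Shuffle": sort patches by a random score, keep the first 25%, remember how to unshuffle.
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)        # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # a random permutation of the patches
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation ("unshuffle")
    ids_keep = ids_shuffle[:, :len_keep]
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=x.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # 1 = masked, 0 = visible, in original order
    return x_visible, mask, ids_restore              # only x_visible goes through the encoder

def append_mask_tokens(latent, ids_restore, mask_token):
    # Before the decoder: pad with the shared learnable mask token, then unshuffle so every
    # patch sits at its original position again (positional encodings are added afterwards).
    # mask_token: a learnable parameter of shape (1, 1, D), e.g. nn.Parameter(torch.zeros(1, 1, D)).
    B, N = ids_restore.shape
    D = latent.shape[-1]
    mask_tokens = mask_token.expand(B, N - latent.shape[1], D)
    x = torch.cat([latent, mask_tokens], dim=1)
    return torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))

def mae_loss(imgs, pred, mask, norm_pix=True):
    # MSE only on the masked patches; optionally normalize each target patch (zero mean, unit variance).
    target = patchify(imgs)
    if norm_pix:
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1e-6) ** 0.5
    loss = ((pred - target) ** 2).mean(dim=-1)       # per-patch MSE
    return (loss * mask).sum() / mask.sum()          # average over masked patches only
```

Because only about 25% of the patches ever pass through the (large) encoder, while the mask tokens are introduced only in front of the (small) decoder, the asymmetric design saves most of the computation; the shuffle/unshuffle is done with plain gather operations, so no sparse operations are needed.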

5. Experimental design

So MAE is ultimately used as pre-training, e.g. for a classification task?

Yes, it is pre-training. ViTAE, a recent paper on 2D human-body keypoint detection, also uses MAE pre-training, and its SOTA results are indeed very good.

So does it just use two Transformers?

MAE uses only the encoder block from the Transformer paper as the building unit of both its encoder and decoder; the "encoder" and "decoder" of MAE's autoencoder are different concepts from the encoder and decoder of the Transformer.

How to find ideas


Origin: blog.csdn.net/verse_armour/article/details/128447022