Basic paper study (5) - MAE

MAE: Masked Autoencoders Are Scalable Vision Learners

Self-Supervised Learning

  • Step 1: First, use an unlabeled dataset to train the parameters from scratch (a blank sheet of paper) into a preliminary pre-trained model, which yields a Visual Representation of the data.
  • Step 2: Starting from this preliminary model, use a labeled dataset to train the parameters further according to your Downstream Tasks. Note that this is a 2-stage process.


The first stage does not involve any downstream task: the model is pre-trained on a pile of unlabeled data without a specific task. In the paper's terminology, this is done in a task-agnostic way.

The second stage involves the downstream task: the model is fine-tuned on labeled data for that task. In the paper's terminology, this is done in a task-specific way.

Self-Supervised Learning is not limited to NLP; there are also many classic works in CV and speech, as shown in Figure 2 below. They can be divided into 3 categories: Data Centric, Prediction (also called Generative), and Contrastive.
[Figure 2: categories of self-supervised learning methods]
Two of the mainstream directions are Generative-based methods and Contrastive-based methods, briefly introduced below.

  • Generative-based methods focus mainly on the reconstruction error. For example, in an NLP task a token in the middle of a sentence is masked, the model predicts it, and the error between the prediction and the real token is used as the loss. Examples include Diffusion models and VAEs.
  • Contrastive-based methods do not require the model to reconstruct the original input; instead, the model is expected to distinguish different inputs in feature space. Examples include SimCLR.
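To make the distinction concrete, here is a minimal sketch in PyTorch contrasting the two kinds of objectives: a plain reconstruction (MSE) loss for generative methods and a simple InfoNCE-style contrastive loss in the spirit of SimCLR. The function names and tensor shapes are illustrative only.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(pred, target):
    # Generative objective: penalize the reconstruction error directly.
    # pred / target: (batch, dim) reconstructed and original signals.
    return F.mse_loss(pred, target)

def info_nce_loss(z1, z2, temperature=0.1):
    # Contrastive objective: pull two views of the same input together,
    # push views of different inputs apart in feature space.
    # z1 / z2: (batch, dim) embeddings of two augmented views.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                   # (batch, batch) similarities
    labels = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```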


1. Masked AutoEncoders (MAE) Principle Architecture

What Masked Autoencoders (MAE) does is still self-supervised learning: fill in the image patches that were masked out. It belongs to the Generative (Predictive) type of pre-training. Another well-known example of this type of self-supervised learning is BERT.

For the BERT model, some tokens in a sentence are masked, the model predicts them, and the error between the predictions and the real tokens is used as the loss. BERT tells us that directly reconstructing the sentence can work very well.

For the MAE model, some patches of an image are masked, the model predicts them, and the error between the predictions and the real image patches is used as the loss. MAE tells us that directly reconstructing the original image can also do a good job.
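The shared idea in both cases is that the loss is computed only on the masked positions. A minimal sketch of that idea in PyTorch (shapes and names are illustrative):

```python
import torch

def masked_reconstruction_loss(pred, target, mask):
    # pred / target: (B, N, D) predictions and ground-truth tokens/patches
    # mask: (B, N) with 1 where a position was masked, 0 where it was visible
    per_token = ((pred - target) ** 2).mean(dim=-1)   # (B, N) per-position error
    return (per_token * mask).sum() / mask.sum()      # average over masked positions only

# toy usage
pred, target = torch.randn(2, 8, 4), torch.randn(2, 8, 4)
mask = (torch.rand(2, 8) < 0.75).float()              # e.g. ~75% of positions masked
loss = masked_reconstruction_loss(pred, target, mask)
```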


MAE architecture: mask random patches of the input image and reconstruct them. It is based on two core ideas. The researchers propose an asymmetric encoder-decoder architecture in which the Encoder operates only on the visible subset of patches (the tokens that are not masked), while the Decoder reconstructs the original image from the latent representation together with the mask tokens. The Decoder can be a very lightweight model, and its specific architecture has a great impact on the performance of the model. The researchers further found that masking a large proportion of the input image (e.g., 75%) yields a nontrivial and meaningful self-supervised task.

MAE is, strictly speaking, a type of Denoising Autoencoder (DAE): a class of autoencoders that corrupt the input signal and learn to reconstruct the original, uncorrupted signal. The Encoder and Decoder of MAE have different, asymmetric structures: the Encoder maps the input into a latent representation, and the Decoder reconstructs the original signal from that latent representation.

MAE divides the image into regular, non-overlapping patches in the same way as ViT. It then samples a subset of the patches from a uniform distribution without replacement and masks the remaining ones. The mask ratio used by the authors is high enough that the redundancy among patches is greatly reduced, so reconstructing the image is no longer easy (the hard-sample idea: a larger loss speeds up convergence).

Algorithm flow:

  • First, divide the input image into patches and apply the mask operation (75%). Send only the visible patches to the encoder, then use the encoder output (latent representations) together with the mask tokens as the input of the lightweight decoder, which reconstructs the whole image (see the sketch after this list).

  • Encoder: the encoder is actually a ViT. After the input image is divided into non-overlapping patches, a linear projection is applied, positional embeddings (the sine-cosine version) are added, and the result is fed into the Transformer blocks.

  • Decoder: also a Transformer (ViT-style). It takes the mask tokens plus the encoded visible patches as input, with positional encodings (the sine-cosine version) added. The last layer of the decoder is a linear projection whose number of output channels equals the number of pixel values in a patch (convenient for reconstruction); the output is then reshaped to reconstruct the image. The loss function is MSE, computed only on the masked patches (as in BERT). The authors also tried a normalization variant: compute the mean and standard deviation of the pixel values in each patch and normalize the patch with them; the reconstruction target then becomes the normalized pixel values. Experiments show this works better.

  • The design of the decoder in MAE is not critical, because after pre-training only the encoder is kept and the decoder is needed only for the image reconstruction task during pre-training. The authors do note, however, that the decoder determines the semantic level of the latent representations.
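Putting the steps above together, here is a compact sketch of one MAE pre-training step. All modules (`patch_embed`, `encoder`, `decoder`, `patchify`, `random_masking`) are passed in as callables and are only assumed to behave as described above; the `random_masking` helper is sketched in section 1.4 below. This follows the flow described in the text, not the official code.

```python
def mae_pretrain_step(imgs, patch_embed, encoder, decoder, patchify,
                      random_masking, mask_ratio=0.75):
    """One MAE pre-training step (a sketch; modules are passed in as callables)."""
    # 1. Split the image into patches, embed them, add positional encodings.
    tokens = patch_embed(imgs)                          # (B, N, D)
    # 2. Randomly keep ~25% of the tokens, mask the remaining ~75%.
    visible, mask, ids_restore = random_masking(tokens, mask_ratio)
    # 3. The encoder (a plain ViT) sees only the visible tokens.
    latent = encoder(visible)                           # (B, N_visible, D)
    # 4. The lightweight decoder receives latent tokens plus shared mask tokens,
    #    restored to the original order, and predicts pixel values per patch.
    pred = decoder(latent, ids_restore)                 # (B, N, pixels_per_patch)
    # 5. MSE is computed on the masked patches only.
    target = patchify(imgs)                             # (B, N, pixels_per_patch)
    per_patch = ((pred - target) ** 2).mean(dim=-1)     # (B, N)
    return (per_patch * mask).sum() / mask.sum()
```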

Why, after BERT (2018) had been around for so long, was there no comparable "CV BERT" until BEiT (2021.6) and MAE (2021.11)?

  1. The mainstream architectures of CV and NLP were different: until the emergence of ViT (2020.12), the mainstream architecture in CV was based on convolutional networks, while the mainstream architecture in NLP was based on Transformers. A convolution kernel slides over a grid of pixels, so the concept of a token, as in a Transformer, does not arise naturally; with convolutional networks alone, the notion of an image token is not intuitive, which makes token-based self-supervised learning in the Transformer style hard to set up. This was the first difficulty.
  2. The information density of language and of images (video) is different: language is a signal created by humans; it is highly semantic and information-dense. Images (videos) are naturally occurring signals with heavy spatial redundancy: a masked patch can usually be guessed from the patches around it. So language has high information density while images have low information density. This was the second difficulty. The authors propose a simple remedy: mask a much higher proportion of patches. If 30% of the patches of a picture were masked, they could easily be predicted from the surrounding patches; if 90% are masked, can they still be predicted so easily?
  3. The Decoder of an autoencoder (the module that reconstructs the intermediate features back into the input) plays different roles in CV and NLP: in CV, the Decoder reconstructs image pixels, so the semantic level of its output is very low; in NLP, the Decoder reconstructs the words of a sentence, so the semantic level of its output is very rich.

1.1 MAE Encoder

MAE Encoder uses the ViT architecture but operates only on the unmasked patches. As in ViT, the MAE Encoder first embeds the patches through a Linear Projection, adds positional encodings, and then feeds them into a stack of Transformer Blocks. However, the encoder operates only on a small subset (e.g., 25%) of the full set of image patches and discards the masked patches (75%). This differs from BERT: BERT replaces the masked parts with a special token, while MAE uses no mask tokens in the encoder.

1.2 MAE Decoder

MAE Decoder also adopts the Transformer architecture, and its input is the full set of image tokens: not only the unmasked tokens (blue blocks in the figure) but also the masked parts (gray blocks in the figure). Each mask token is a shared, learned vector indicating that a token is to be predicted at that position. The authors add positional embeddings to all tokens in this full set, so that each token carries information about its position in the image.

MAE Decoder is only used to perform the image reconstruction task during pre-training, since in this self-supervised setup only the pre-trained Encoder is kept for the downstream (e.g., classification) task. The decoder can therefore be designed flexibly, independently of the encoder. The authors experiment with a very small decoder that is narrower and shallower than the encoder. With this asymmetric design, the full token set is processed only by the lightweight decoder, which greatly reduces pre-training time.
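A hedged sketch of how the decoder input can be assembled from the encoder output and the shared, learned mask token, following the description above. Class and attribute names are illustrative, not the official implementation; the paper uses fixed sine-cosine positional embeddings, while a plain learnable parameter is used here for brevity.

```python
import torch
import torch.nn as nn

class DecoderInputBuilder(nn.Module):
    """Builds the decoder input: encoded visible tokens + shared mask tokens,
    unshuffled back to the original patch order, plus positional embeddings."""

    def __init__(self, num_patches, dim):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))         # shared, learned
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, latent, ids_restore):
        # latent: (B, N_visible, D) encoder output; ids_restore: (B, N) unshuffle indices
        B, N = ids_restore.shape
        num_masked = N - latent.size(1)
        mask_tokens = self.mask_token.expand(B, num_masked, -1)
        x = torch.cat([latent, mask_tokens], dim=1)                    # (B, N, D)
        # Unshuffle: put every token back to its original position in the image.
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        return x + self.pos_embed     # then fed through the decoder Transformer blocks
```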

1.3 Self-supervised learning objective function Reconstruction Target

The last layer of the Decoder is a Linear Projection layer whose number of output channels equals the number of pixel values in a patch. The output of the Decoder is then reshaped back into the shape of the image. The loss function is MSE: the closer the reconstructed image is to the input image, the better (and, as noted above, the loss is computed only on the masked patches).

The authors also tried another reconstruction target: first compute the mean and standard deviation of the pixel values of each patch, use them to normalize every pixel value of that patch, and then compute the MSE loss on the normalized pixel values. This turns out to work better than MSE on raw pixels.
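A minimal sketch of this normalized-pixel target, assuming `target` holds the raw pixel values of each patch; the masked-only MSE itself is the same as in the earlier loss sketch.

```python
def normalized_pixel_target(target, eps=1e-6):
    # target: (B, N, pixels_per_patch) raw pixel values of each patch.
    # Normalize every patch by its own mean and standard deviation, so the
    # decoder has to reconstruct normalized rather than raw pixel values.
    mean = target.mean(dim=-1, keepdim=True)
    var = target.var(dim=-1, keepdim=True)
    return (target - mean) / (var + eps) ** 0.5
```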

1.4 Implementation

The specific implementation of MAE is as follows:

  • First, obtain image tokens through Linear Projection and add positional encodings.
  • Randomly shuffle these tokens and discard the last part according to the masking ratio.
  • Feed the unmasked tokens into the Encoder to obtain their representations.
  • Combine the Encoder output with the masked tokens (learnable vectors), apply the unshuffle operation to restore the original order, and feed them into the Decoder together (see the sketch below).
  • The time overhead of the shuffle and unshuffle operations is negligible.
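A sketch of the shuffle/unshuffle masking described above: sample uniform noise per token, argsort it to get a random permutation, keep the first part, and remember the inverse permutation for the decoder. Names follow the earlier sketches and are illustrative.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Per-sample random masking via argsort of uniform noise (a sketch)."""
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=tokens.device)   # uniform noise, one value per token
    ids_shuffle = noise.argsort(dim=1)               # random permutation per sample
    ids_restore = ids_shuffle.argsort(dim=1)         # inverse permutation (unshuffle)

    ids_keep = ids_shuffle[:, :len_keep]             # keep the first part, drop the rest
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Binary mask in the original patch order: 0 = visible, 1 = masked.
    mask = torch.ones(B, N, device=tokens.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore
```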

Advantages of MAEs

(1) Scalable: the encoder operates only on the visible patches, while the mask tokens are handled by a decoder with few parameters. This greatly reduces the amount of computation, especially when the mask ratio is high, which greatly shortens pre-training time and makes it easy to scale MAE to large models. Experiments show that as the model grows, the results keep improving.

(2) High capacity and good performance: with MAE pre-training, very high-capacity models such as ViT-Large/Huge can be trained. When the pre-trained ViT-Huge is transferred to downstream tasks, it performs very well, even surpassing the same model with supervised pre-training. This shows that the representations learned by MAE pre-training generalize well to various downstream tasks.

2. Experimental Analysis

Self-supervised pre-training is performed on ImageNet-1K with the standard ViT architecture. After pre-training, the encoder is evaluated via fine-tuning and linear probing. Because the evaluation task is image classification, a class token (an auxiliary dummy token) is added to the input as in ViT; the experiments show that using average pooling instead achieves the same effect.
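A hedged sketch of the two classification-head options mentioned above (class token vs. global average pooling over the encoder's output tokens); class and attribute names are illustrative:

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Maps encoder output tokens to class logits, either from the class token
    or by average-pooling all tokens (reported to work comparably well)."""

    def __init__(self, dim, num_classes, use_cls_token=False):
        super().__init__()
        self.use_cls_token = use_cls_token
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):                # tokens: (B, 1 + N, D) or (B, N, D)
        if self.use_cls_token:
            feat = tokens[:, 0]               # the auxiliary class token
        else:
            feat = tokens.mean(dim=1)         # global average pooling over tokens
        return self.fc(self.norm(feat))
```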

(1) Pre-training stage

Color jittering (a data augmentation method), drop path (a variant of dropout), and gradient clipping (thresholding gradients to prevent gradient explosion) are not used. As in the official ViT code, all Transformer blocks are initialized with xavier uniform, and the linear learning rate scaling rule is used.
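The linear learning rate scaling rule referred to above is simply lr = base_lr × batch_size / 256; as a one-line helper:

```python
def scaled_lr(base_lr, batch_size):
    # Linear lr scaling rule: scale the base learning rate with the batch size.
    return base_lr * batch_size / 256
```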


(2) End-to-end fine-tuning

Layer-wise learning rate decay is used.
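Layer-wise learning rate decay assigns smaller learning rates to layers closer to the input; a hedged sketch of the per-layer scale factors (the decay value is illustrative):

```python
def layerwise_lr_scales(num_layers, decay=0.75):
    # Scale for layer i = decay ** (num_layers - i): the first block gets the
    # smallest learning rate, the final block / head gets the full rate.
    return [decay ** (num_layers - i) for i in range(num_layers + 1)]

# Example: 12 Transformer blocks -> 13 scale factors, from decay**12 up to 1.0.
print(layerwise_lr_scales(12))
```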


(3) Linear probing

The training settings follow MoCo v3. Linear probing differs greatly from end-to-end fine-tuning: regularization can hurt linear-probing performance, so, as in MoCo v3, some regularization strategies are dropped.

(4) Partial fine-tuning:

Linear probing misses the opportunity to exploit strong but non-linear features, which is precisely a strength of deep learning. Partial fine-tuning updates only the last layers of the encoder; its hyperparameters and other settings are the same as for full fine-tuning (Table 9), except that the number of fine-tuning epochs is adjusted.
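A hedged sketch of partial fine-tuning: freeze everything and unfreeze only the last few Transformer blocks plus the head. The attribute names `model.blocks` and `model.head` are assumptions about the model object, not part of the paper.

```python
def partial_finetune(model, num_trainable_blocks):
    """Freeze all parameters, then unfreeze only the last `num_trainable_blocks`
    Transformer blocks and the classification head."""
    for p in model.parameters():
        p.requires_grad = False
    if num_trainable_blocks > 0:
        for blk in model.blocks[-num_trainable_blocks:]:
            for p in blk.parameters():
                p.requires_grad = True
    for p in model.head.parameters():
        p.requires_grad = True
```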

Top-1 accuracy (224x224) is reported for all four evaluation settings, with ViT-Large as the baseline for the ablation study. Comparing ViT-Large trained from scratch (200 epochs) with MAE pre-training followed by fine-tuning (50 epochs) shows that training from scratch is worse than fine-tuning.

Pre-training with MAE on ImageNet-1K alone reaches 87.8% top-1 accuracy, surpassing all ViT variants pre-trained on ImageNet-21K. From a methodological point of view, MAE chooses to reconstruct the original image pixels directly and proves that this is feasible, which changed people's perception; the approach covers almost all recognition tasks in CV and seems to open a new direction. Directly reconstructing the original pixels matters because it completes the MIM (masked image modeling) task in the most intuitive way, so the potential of MIM is gradually being confirmed. With the transition from MLM to MIM demonstrated, a GPT-3-style large pre-trained model for CV may not be far away.
