[Paper Reading] MAE: Masked Autoencoders Are Scalable Vision Learners


Foreword

Inspired by BERT in natural language processing, and after considering the differences between natural language and images, MAE is proposed to fit the characteristics of images. MAE masks random patches of the input image and reconstructs the missing pixels. Its core designs: 1. an asymmetric encoder-decoder architecture; 2. masking a large proportion of the input image, which yields a nontrivial and meaningful self-supervised task.
Paper: https://arxiv.org/abs/2111.06377
Code: https://github.com/facebookresearch/mae


1. Why propose MAE?

Because of the rapid development of hardware, today's models can easily overfit one million images, so the demand for data keeps growing. In natural language processing, this appetite for data has been successfully addressed by self-supervised pre-training.
BERT has achieved great success, but the development of autoencoding methods in the vision field lags behind NLP.
What makes masked autoencoding different between vision and language?

  1. The architectures are different. If you apply a mask inside a convolution, the boundary of the masked region cannot be distinguished. My understanding is that convolution processes the whole image with a sliding window, so overlapping receptive fields tie the entire image together. With the arrival of ViT, however, the image is processed as multiple patches; each patch is an independent token and can be masked on its own (see the patchify sketch after this list).
  2. The information density is different. Language is highly semantic and information-dense, while images have heavy spatial redundancy: a missing patch can often be recovered from its visible neighbors without much high-level understanding.
  3. The autoencoder's decoder, which maps the latent representation back to the input, plays a different role when reconstructing text versus images. The words predicted by a text decoder carry rich semantic information; in vision, the decoder reconstructs pixels, whose semantic level is lower than that of common recognition tasks.
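To make the patch-based view concrete, below is a minimal sketch (my own, not the paper's code) of splitting an image into non-overlapping patches; the 16x16 patch size and 224x224 input are the standard ViT choices. Each row of the output is one token that can be kept or masked independently.

```python
import torch

def patchify(imgs: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split images (B, C, H, W) into non-overlapping patches.

    Returns (B, num_patches, patch_size * patch_size * C); each row is one
    independent token that can be kept or masked on its own.
    """
    B, C, H, W = imgs.shape
    assert H % patch_size == 0 and W % patch_size == 0
    h, w = H // patch_size, W // patch_size
    x = imgs.reshape(B, C, h, patch_size, w, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1)            # (B, h, w, p, p, C)
    return x.reshape(B, h * w, patch_size * patch_size * C)

imgs = torch.randn(2, 3, 224, 224)
print(patchify(imgs).shape)                    # torch.Size([2, 196, 768])
```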

2. MAE Approach


Images carry very redundant information, so MAE masks most of the image, forcing the model to learn holistic information instead of simply completing patches from nearby pixels (a very challenging self-supervised task).
Masking: Following ViT, divide the image into regular non-overlapping patches, sample a subset of them, and mask the rest. Strategy: sample random patches following a uniform distribution. Advantages: 1. the masking ratio is high, creating a task that cannot be easily solved by extrapolating from visible neighboring patches; 2. the uniform distribution prevents a potential center bias; 3. the highly sparse input creates an opportunity to design an efficient encoder.
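A minimal sketch of this sampling step, assuming (B, N, D) patch embeddings; it uses the argsort-of-random-noise trick that the "Simple implementation" paragraph below describes, but it is only an illustration, not the repository code.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a uniformly random subset of tokens (no center bias).

    tokens: (B, N, D) patch embeddings.
    Returns the visible tokens, a binary mask (1 = removed) in the original
    patch order, and the indices needed to restore that order later.
    """
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N)                   # uniform noise, one value per patch
    ids_shuffle = noise.argsort(dim=1)         # a random permutation of the patches
    ids_restore = ids_shuffle.argsort(dim=1)   # its inverse permutation

    ids_keep = ids_shuffle[:, :len_keep]       # only the first 25% survive
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N)
    mask[:, :len_keep] = 0                     # 0 = kept, 1 = masked (shuffled order)
    mask = torch.gather(mask, 1, ids_restore)  # back to the original patch order
    return visible, mask, ids_restore
```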
MAE encoder: The encoder processes only the visible patches. As in ViT, its input is the linearly projected patches plus position embeddings. Because the masked patches are removed, compute and memory drop substantially.
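A toy illustration of this point: with a 75% masking ratio the encoder's self-attention runs over only a quarter of the tokens. The layer sizes and depth below are placeholders, not the real ViT-Large, and the random tensors stand in for projected patches with position embeddings.

```python
import torch
import torch.nn as nn

# Stand-in for the ViT encoder blocks (toy depth and width, not the real ViT-L).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)

tokens = torch.randn(2, 196, 768)     # stand-in for projected patches + position embeddings
ids_keep = torch.randperm(196)[:49]   # keep 25% of the patches
visible = tokens[:, ids_keep, :]      # the masked tokens are simply dropped

latent = encoder(visible)             # self-attention over 49 tokens instead of 196
print(latent.shape)                   # torch.Size([2, 49, 768])
```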
MAE decoder: The decoder input consists of 1. the encoded visible patches and 2. mask tokens for the removed patches. Every removed patch is represented by the same shared vector, which is learned.
The decoder is only used during pre-training and can be designed independently of the encoder, so the authors use a lightweight decoder. When the trained encoder is transferred to downstream tasks, it has already learned effective image features.
Reconstruction target: MAE reconstructs the input by predicting the pixel values of each masked patch; each element of the decoder output is a vector of pixel values for one patch. The loss is the MSE computed only on the masked patches. Using per-patch normalized pixel values as the reconstruction target is also studied and improves representation quality.
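A sketch of this loss, following the description above (optional per-patch normalization plus MSE restricted to the masked patches); the tensor shapes assume 16x16 RGB patches and are only illustrative.

```python
import torch

def mae_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor,
             norm_pix: bool = True) -> torch.Tensor:
    """MSE computed only on the masked patches.

    pred, target: (B, N, p*p*C) per-patch pixel values.
    mask: (B, N), 1 for masked (removed) patches, 0 for visible ones.
    norm_pix: normalize each target patch by its own mean/std, which the
    paper reports improves representation quality.
    """
    if norm_pix:
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1e-6).sqrt()

    loss = ((pred - target) ** 2).mean(dim=-1)   # per-patch MSE
    return (loss * mask).sum() / mask.sum()      # average over masked patches only

# Toy usage with random tensors.
pred, target = torch.randn(2, 196, 768), torch.randn(2, 196, 768)
mask = (torch.rand(2, 196) < 0.75).float()
print(mae_loss(pred, target, mask))
```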
Simple implementation: Overall, the image is first turned into a sequence of tokens, one per patch, via a linear projection plus position embeddings. The sequence is randomly shuffled and the last portion is removed, which completes the random sampling, and the remaining tokens go into the encoder. After encoding, mask tokens are appended to the encoded latent tokens so that the sequence regains its original length; it is then unshuffled back to the original order, position embeddings are added, and the result is fed to the decoder.
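The decoder-side bookkeeping from this paragraph (append a shared learnable mask token, unshuffle with the saved indices, add position embeddings, then run a lightweight decoder) might look roughly like the sketch below. Widths and depth are placeholders, the class token is omitted, and the random tensors stand in for the outputs of the masking and encoding steps above.

```python
import torch
import torch.nn as nn

B, N, D_enc, D_dec = 2, 196, 768, 512
len_keep = N // 4                                   # 25% of the patches were visible

# Stand-ins for the outputs of the masking + encoding steps above.
latent = torch.randn(B, len_keep, D_enc)            # encoded visible tokens
ids_restore = torch.stack([torch.randperm(N) for _ in range(B)])  # saved by the masking step

embed = nn.Linear(D_enc, D_dec)                     # project to the narrower decoder width
mask_token = nn.Parameter(torch.zeros(1, 1, D_dec)) # one shared, learnable vector

x = embed(latent)                                   # (B, len_keep, D_dec)
mask_tokens = mask_token.expand(B, N - len_keep, D_dec)
x = torch.cat([x, mask_tokens], dim=1)              # (B, N, D_dec), still in shuffled order
x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D_dec))  # unshuffle

dec_pos = torch.zeros(1, N, D_dec)                  # decoder position embeddings
decoder = nn.TransformerEncoder(                    # lightweight decoder (toy depth here)
    nn.TransformerEncoderLayer(d_model=D_dec, nhead=8, batch_first=True),
    num_layers=2,
)
pred_head = nn.Linear(D_dec, 16 * 16 * 3)           # one pixel vector per patch
pred = pred_head(decoder(x + dec_pos))
print(pred.shape)                                   # torch.Size([2, 196, 768])
```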

Summary

Self-supervised learning in vision may be on a trajectory similar to NLP's. Images are recorded light that has not been semantically decomposed into the visual analogue of words. Instead of removing objects, we remove random image patches that most likely do not constitute a semantic segment.
MAE infers complex, holistic reconstructions showing that it has learned many visual concepts, namely semantics. We hypothesize that this behavior occurs through rich hidden representations inside the MAE. We hope this perspective inspires future work.
The model makes predictions based on its training set, so it reflects biases in the data and may generate content that does not exist.

Mushen's video:
MAE makes two improvements over ViT; the ViT paper had noted that directly moving BERT-style masked pre-training to images did not work very well:

  1. More image patches are masked (a strong data-augmentation method; the article says performance at 1600 epochs can still improve, in other words the 1600-epoch model has not yet converged). New idea ①: could other data augmentations be considered?
  2. ViT at the time used a simple linear layer as its output head; the author felt that going directly from a linear layer to image pixels is too large a gap, so the output is improved with a Transformer decoder. (MAE uses the ViT architecture, and the author also says other new architectures could be tried. New idea ②: a new model.)
    And new idea ③: is it possible to use a new loss?

MAE follow-up work: https://zhuanlan.zhihu.com/p/528720386
