Kaiming He's MAE is accepted to CVPR 2022 as an Oral! Up to 87.8% accuracy! A new representative work in self-supervised learning


Author: happy | Reprinted from: Extreme City Platform

Overview

 

Kaiming He proposed Masked Autoencoders (MAE), a scalable self-supervised learning scheme for computer vision. The proposed MAE is extremely simple: randomly mask patches of the input image and reconstruct the missing pixels. This scheme yields high-accuracy models with good generalization: using only ImageNet-1K data, ViT-Huge achieves 87.8% top-1 accuracy.

Amusi recently noticed that MAE has been accepted to CVPR 2022 (Oral); the code has been open-sourced, and its GitHub stars have exceeded 3.3k!

New works have also applied MAE to tasks such as 3D point clouds, medical images, and multimodal learning. Masked image modeling (MIM) is bound to become increasingly popular in the near future and is worth following.


On Google Scholar, MAE has already been cited more than 100 times!


The paper interpretation follows; it is recommended reading!


Masked Autoencoders Are Scalable Vision Learners

Paper: https://arxiv.org/abs/2111.06377

Code (open source):

https://github.com/facebookresearch/mae

From Kaiming, it is bound to be a gem! This paper continues his usual style: simple and effective. It also stays within Kaiming's main research area of the past two years, self-supervised learning (a field his work helped set on fire). The starting point is BERT's masked autoencoding mechanism: remove a portion of the data and learn to predict the removed content. Masked autoencoding originated in vision but flourished in NLP, so Kaiming asks: what causes masked autoencoding to behave so differently in vision and language? Attempting to answer this from several angles leads to the MAE of this paper.

Abstract

Kaiming proposes Masked Autoencoders (MAE), a scalable self-supervised learning scheme for computer vision. The proposed MAE is extremely simple: randomly mask patches of the input image and reconstruct the missing pixels. It rests on two core designs:

  • We design an asymmetric encoder-decoder architecture, in which the encoder operates only on the visible patches (without mask tokens), while the decoder reconstructs the original image from the latent representation and mask tokens;

  • We find that masking a high proportion of the input image (say 75%) yields a nontrivial and meaningful self-supervised task.

These two designs allow us to train large models efficiently: training is accelerated by 3x or more while accuracy improves. The scheme yields high-accuracy models with good generalization: using only ImageNet-1K data, ViT-Huge achieves 87.8% top-1 accuracy. Transfer to downstream tasks outperforms supervised pre-training, confirming the scalability of the scheme.

In Brief

The paper in a few sentences:

  • From Kaiming, it is bound to be a gem! MAE continues his consistent research style: simple and practical;

  • Masked autoencoding arose from denoising autoencoders, but flourished with BERT in NLP. So what causes masked autoencoding to behave so differently in CV versus NLP? This is the starting point of the paper.

  • Angle 1: The architectures of CV and NLP differ. CV has long relied on the regular grid operations of convolution, and only recently did ViT close this architectural gap;

  • Angle 2: The information density is different. Language is invented by humans and is highly semantic and information-dense, while images are natural signals with heavy spatial redundancy: a missing patch can be reconstructed from neighboring patches without any global understanding. To overcome this discrepancy, a simple strategy is used: masking a very high proportion of random patches, which drastically reduces redundancy.

  • Angle 3: The autoencoder's decoder plays a different role in reconstruction. In vision, the decoder reconstructs pixels, which carry low semantic information; in NLP, the decoder predicts missing words, which carry rich semantic information.

  • Based on the above three points of analysis, the author proposes a very simple masked autoencoder MAE for visual representation learning.

  • MAE adopts an asymmetric encoder-decoder architecture. The encoder acts only on the visible image patches (a high proportion of patches, up to 75%, is discarded) and produces latent representations. The decoder takes the mask tokens and latent representations as input and reconstructs the missing patches.

  • ViT-Huge pre-trained with MAE sets a new record on ImageNet-1K: 87.8%; at the same time, the MAE pre-trained model generalizes very well.

Method

The proposed MAE is a very simple autoencoder scheme: it reconstructs the original signal from partial observations. Like other autoencoders, MAE consists of an encoder that maps the observed signal to a latent representation and a decoder that reconstructs the original signal from that latent representation. Unlike classic autoencoders, MAE adopts an asymmetric design: the encoder operates only on the partial observations (without mask tokens), while a lightweight decoder reconstructs the original signal from the latent representation together with the mask tokens (see the figure below).

[Figure 1: MAE architecture]

Masking  Following ViT, we split the input image into non-overlapping patches, then sample a subset of patches and remove (i.e., mask) the rest. Our sampling strategy is very simple: sample patches uniformly at random without replacement; we call this "random sampling". Random sampling with a high mask ratio largely eliminates redundancy, creating a task that cannot easily be solved by extrapolating from neighboring patches (see the figure below). The uniform distribution also avoids a potential center bias.

[Figures: example masked inputs and MAE reconstructions]
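To make the sampling step concrete, here is a minimal PyTorch sketch of per-sample random masking; the function name, tensor shapes, and the noise-argsort trick are illustrative assumptions, not a claim about the official implementation:

```python
import torch

def random_masking(x, mask_ratio=0.75):
    """Per-sample random masking of patch tokens. x: [B, N, D]."""
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=x.device)        # uniform noise, one value per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation, used to unshuffle later

    ids_keep = ids_shuffle[:, :len_keep]             # indices of the visible patches
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Binary mask in the original patch order: 0 = visible, 1 = masked
    mask = torch.ones(B, N, device=x.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return x_visible, mask, ids_restore
```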

MAE Encoder  The encoder in MAE is a ViT, but it operates only on the visible, unmasked patches. As in standard ViT, the encoder embeds patches by a linear projection with added positional embeddings, then processes them through a series of Transformer blocks. However, because the encoder operates only on a small subset of the patches (e.g., 25%) and uses no mask tokens, we can afford to train very large encoders.

MAE Decoder  The input to the MAE decoder consists of: (1) the encoder output; (2) mask tokens. As shown in Figure 1, each mask token is a shared learnable vector that indicates a missing patch to be predicted. Positional embeddings are added to all tokens, and the decoder is likewise a series of Transformer blocks.

Note: the MAE decoder is used only for image reconstruction during pre-training; only the encoder is used to produce image representations for recognition. The decoder design can therefore be chosen independently of the encoder, with a high degree of flexibility. In the experiments we adopt a very small (narrow and shallow) decoder; for example, the per-token computation of the default decoder is less than 10% of the encoder's. With this asymmetric design, the full set of tokens is processed only by the lightweight decoder, which greatly reduces pre-training time.
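As an illustration of this asymmetric design, the sketch below shows a narrow, shallow decoder that appends a shared learnable mask token, unshuffles the sequence back to the original patch order, adds positional embeddings, and predicts per-patch pixels. The class name, default sizes, and the use of `nn.TransformerEncoderLayer` are assumptions for the sketch, not the paper's exact modules:

```python
import torch
import torch.nn as nn

class LightweightDecoder(nn.Module):
    """Illustrative MAE-style decoder: much narrower and shallower than the encoder."""
    def __init__(self, enc_dim=1024, dec_dim=512, depth=8, num_heads=16,
                 num_patches=196, patch_pixels=16 * 16 * 3):
        super().__init__()
        self.embed = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))       # one shared learnable vector
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        block = nn.TransformerEncoderLayer(dec_dim, num_heads, dec_dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, depth)
        self.pred = nn.Linear(dec_dim, patch_pixels)                     # pixels of one patch per token

    def forward(self, latent, ids_restore):
        # latent: [B, len_keep, enc_dim] encoder outputs for the visible patches only
        x = self.embed(latent)
        B, len_keep, D = x.shape
        num_masked = ids_restore.shape[1] - len_keep
        x = torch.cat([x, self.mask_token.expand(B, num_masked, -1)], dim=1)
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))  # unshuffle to original order
        x = x + self.pos_embed                                           # positions for all tokens
        x = self.blocks(x)
        return self.pred(x)                                              # [B, num_patches, patch_pixels]
```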

Reconstruction target  MAE reconstructs the input by predicting the pixel values of each masked patch. The last layer of the decoder is a linear projection whose number of output channels equals the number of pixels per patch; the decoder output is reshaped to form the reconstructed image. The loss function is MSE, and, as in BERT, the loss is computed only on the masked patches.

We also investigate a variant whose reconstruction target is the normalized pixel values of each masked patch: we compute the mean and standard deviation of each patch, use them to normalize the patch, and take the normalized pixels as the reconstruction target, which improves representation quality.
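A minimal sketch of this loss under assumed tensor shapes (the function name is hypothetical): mean-squared error per pixel, averaged only over the masked patches, with the optional per-patch normalization of the target:

```python
import torch

def mae_reconstruction_loss(pred, target, mask, norm_pix=True):
    """pred, target: [B, N, patch_pixels]; mask: [B, N], 1 for masked patches, 0 for visible."""
    if norm_pix:
        # Normalize each target patch by its own mean and standard deviation
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1e-6).sqrt()

    loss = (pred - target) ** 2       # squared error per pixel
    loss = loss.mean(dim=-1)          # mean over pixels -> per-patch MSE, shape [B, N]
    # As in BERT-style training, average the loss over the masked patches only
    return (loss * mask).sum() / mask.sum()
```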

Simple implementation  MAE pre-training is extremely efficient and, more importantly, requires no specialized sparse operations. The implementation can be described as follows:

  • First, we generate a token for each input patch via linear projection plus a positional embedding;

  • Next, we randomly shuffle the token sequence and drop the last portion of tokens according to the mask ratio;

  • Then, after encoding, we append mask tokens to the encoded visible tokens and unshuffle to restore the full-length token sequence, aligned with the targets;

  • Finally, we apply the decoder to this full token sequence.

As mentioned above, MAE requires no sparse operations. In addition, the shuffle and unshuffle operations are very fast, and the computation they introduce is negligible.
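Putting these steps together, one hypothetical pre-training step could look like the following, reusing the `random_masking` and `mae_reconstruction_loss` sketches above; the `patchify`, `patch_embed`, `pos_embed`, `encoder_blocks`, and `decoder` objects are assumed stand-ins, not the official API:

```python
import torch

def pretrain_step(images, patchify, patch_embed, pos_embed, encoder_blocks,
                  decoder, optimizer, mask_ratio=0.75):
    """One illustrative MAE pre-training step; every component here is an assumed stand-in."""
    patches = patchify(images)                     # [B, N, patch_pixels], also the regression targets
    tokens = patch_embed(patches) + pos_embed      # linear projection + positional embedding

    # Shuffle and drop: only the visible ~25% of tokens go through the heavy encoder
    x_visible, mask, ids_restore = random_masking(tokens, mask_ratio)
    latent = encoder_blocks(x_visible)

    # The lightweight decoder processes the full-length sequence (visible latents + mask tokens)
    pred = decoder(latent, ids_restore)
    loss = mae_reconstruction_loss(pred, patches, mask)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```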

Experiments

We perform self-supervised pre-training on ImageNet-1K, then evaluate the learned representations through supervised training (end-to-end fine-tuning or linear probing).

Main Properties

[Table: ViT-Large trained from scratch vs. fine-tuned from MAE pre-training]

Baseline: ViT-Large  We use ViT-Large as the backbone for the ablation experiments; the table above compares training from scratch with fine-tuning from MAE pre-training. Training ViT-L from scratch for 200 epochs reaches 82.5% and needs a recipe with strong regularization, while fine-tuning from MAE pre-training for only 50 epochs already brings a large improvement.

[Table 1: MAE ablation experiments]

The table above presents ablation results from several angles, which we go through one by one.

Decoder Design  From Table 1a and Table 1b we can see that the decoder design can be very flexible. Overall, the default decoder is very lightweight: only 8 blocks with width 512, and its per-token computation is only 9% of the encoder's.

Mask Token  An important design in MAE is to skip the mask tokens in the encoder and introduce them only in the decoder. Table 1c compares the two options: having the encoder also process mask tokens degrades performance.

Reconstruction target  Table 1d compares different reconstruction targets; introducing per-patch normalization further improves accuracy.

Data Augmentation  Table 1e compares the impact of different data augmentations. MAE performs very well with cropping alone, but adding color jitter hurts performance. Surprisingly, MAE still performs well even without any data augmentation.

Mask Sampling  Table 1f compares different mask sampling strategies; random sampling performs best among them.

Masking ratio  The figure below shows the effect of the masking ratio; the optimal ratio is surprisingly high. A 75% mask ratio benefits both evaluation protocols (end-to-end fine-tuning and linear probing). This is in sharp contrast to BERT, whose typical mask ratio is 15%.

[Figure: effect of masking ratio on fine-tuning and linear probing accuracy]

The figure also shows that end-to-end fine-tuning and linear probing follow different trends:

  • For linear probing, accuracy rises steadily with the mask ratio until it peaks, with a gap of about 20% between the lowest and highest points;

  • For fine-tuning, performance is largely insensitive to the mask ratio over a wide range, and all fine-tuning results outperform linear probing.

Training Schedule  The figure below compares different training schedules (the ablations above use 800-epoch pre-training). Longer training brings steady accuracy improvements; the authors note that even at 1600 epochs, linear probing accuracy has not yet saturated. This is in stark contrast to MoCo v3, which saturates at 300 epochs: per epoch, MAE sees only 25% of the image patches, while MoCo v3 sees 200% or more.

[Figure: accuracy vs. length of pre-training schedule]

Comparisons with Previous Results

[Table: comparison with previous self-supervised methods on ImageNet-1K]

The table above compares the proposed MAE with other self-supervised schemes, from which we can see:

  • For ViT-B, different methods perform very similarly; for ViT-L, the gaps widen, suggesting that reducing overfitting is more challenging for larger models.

  • MAE scales easily to larger models with steady gains. For example, ViT-H reaches 86.9% accuracy, and 87.8% after fine-tuning at 448 resolution, surpassing the previous best of 87.1% by VOLO (at 512 resolution). Note that this result uses vanilla ViT only; stronger architectures may perform even better.

Transfer Learning Experiments

[Table: COCO object detection and segmentation transfer results]

The table above compares transfer performance on COCO detection and segmentation. Compared with supervised pre-training, MAE performs best across configurations: with a ViT-B backbone, MAE brings a 2.4 AP improvement; with ViT-L, a 4.0 AP improvement.

[Table: ADE20K semantic segmentation transfer results]

The table above compares transfer performance on ADE20K semantic segmentation. MAE greatly improves ViT-L, outperforming supervised pre-training by 3.7 points.

This concludes the interpretation. For more experimental results and analysis, please refer to the original paper.

