MAE Self-Supervised Method Based on Convolutional Neural Networks

This article is shared from the Huawei Cloud Community post "MAE Self-Supervised Method Based on Convolutional Neural Networks", author: Hint.

Image self-supervised pre-training has been an important research direction in recent years, and MAE is a representative ViT-based method that learns robust visual features. MAE (Masked Autoencoders) is a self-supervised pre-training method proposed by Kaiming He et al. Drawing on BERT's pre-training task, it masks a large proportion of the input image's patches and uses an asymmetric ViT encoder-decoder to reconstruct the masked patches. It outperforms earlier contrastive-learning methods such as the MoCo series. However, ViT is structurally complex and computationally expensive, so an MAE-like method based on CNNs has high research value. Due to the structural characteristics of CNNs, though, the standard MAE recipe cannot be applied to them directly. This article introduces SparK [1], an ICLR 2023 method that realizes a CNN-based MAE.
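To make the masking step concrete, here is a minimal sketch of MAE-style random patch masking in PyTorch. The function name is hypothetical; the 16-pixel patch size and 75% mask ratio follow the common MAE recipe, but this is an illustrative sketch, not the paper's implementation.

```python
import torch

def random_mask_patches(images, patch_size=16, mask_ratio=0.75):
    # Divide each image into non-overlapping patches and randomly mask a
    # large fraction of them, as in MAE. Returns a binary mask per image.
    B, C, H, W = images.shape
    num_patches = (H // patch_size) * (W // patch_size)
    num_keep = int(num_patches * (1 - mask_ratio))
    noise = torch.rand(B, num_patches)             # per-patch random scores
    ids_keep = noise.argsort(dim=1)[:, :num_keep]  # indices of visible patches
    mask = torch.ones(B, num_patches)              # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0.0)
    return mask, ids_keep

imgs = torch.randn(2, 3, 224, 224)
mask, ids_keep = random_mask_patches(imgs)         # mask.mean() ~= 0.75
```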

[Figure: pixel-value histograms of a masked image as seen by ViT vs. CNN]

As shown in the figure above, when a masked image is fed in, the histogram of the values entering the ViT matches the distribution of the unmasked image, while the histogram for the CNN shifts substantially. This is because the ViT architecture naturally handles variable-length, irregular inputs, and its patches are processed without overlap; the CNN's sliding-window operation and fixed, regular kernel shape let the masked region leak into the computation, seriously affecting the model.
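This effect is easy to reproduce. The toy check below zeroes out masked patches and measures how much the "visible" outputs change for a ViT-style patch embedding versus an ordinary convolution; the layer shapes and masking ratio are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
img = torch.randn(1, 3, 224, 224)
patch_mask = (torch.rand(1, 1, 14, 14) > 0.6).float()            # 1 = visible
pixel_mask = patch_mask.repeat_interleave(16, 2).repeat_interleave(16, 3)
masked_img = img * pixel_mask

patch_embed = nn.Conv2d(3, 8, kernel_size=16, stride=16)  # ViT-style tokenizer
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)          # ordinary CNN layer

with torch.no_grad():
    # ViT: each patch is embedded independently, so visible tokens are unchanged.
    vit_diff = ((patch_embed(img) - patch_embed(masked_img)) * patch_mask).abs().max()
    # CNN: sliding windows straddle masked pixels, so even visible positions shift.
    cnn_diff = ((conv(img) - conv(masked_img)) * pixel_mask).abs().max()

print(f"ViT change at visible patches: {vit_diff.item():.4f}")  # 0.0000
print(f"CNN change at visible pixels:  {cnn_diff.item():.4f}")  # > 0
```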

[Figure: sparse convolution computes only on unmasked positions]

The authors therefore borrow sparse convolution from the field of 3D point clouds. It computes only on unmasked pixels, ignores masked ones, and can handle irregular inputs, achieving an effect similar to ViT's. In addition, to learn multi-scale features and match the hierarchical structure of CNNs, the authors design a hierarchical decoder inspired by the UNet architecture; a sketch of each part follows below.
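One simple way to approximate sparse convolution on dense tensors is to re-mask the input before each convolution and re-mask the output afterwards, so masked sites never become non-empty. This re-masking trick is a simplified stand-in for real sparse-convolution kernels, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    # Approximates submanifold sparse convolution on a dense tensor: masked
    # positions contribute nothing to the input and stay empty in the output.
    def forward(self, x, mask):
        # mask: (B, 1, h, w), 1 = visible, 0 = masked; resized to each resolution.
        m_in = F.interpolate(mask, size=x.shape[-2:], mode="nearest")
        out = super().forward(x * m_in)
        m_out = F.interpolate(mask, size=out.shape[-2:], mode="nearest")
        return out * m_out

conv = MaskedConv2d(3, 16, kernel_size=3, stride=2, padding=1)
x = torch.randn(2, 3, 224, 224)
mask = (torch.rand(2, 1, 14, 14) > 0.6).float()
y = conv(x, mask)   # masked sites remain zero at the new scale as well
```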

[Figure: overall SparK architecture with sparse encoder and hierarchical decoder]
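The decoder side can be sketched as follows: masked positions in each encoder feature map are filled with a learnable mask embedding (densified), then the maps are fused coarse-to-fine, UNet-style. All channel sizes, layer choices, and the output head here are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalDecoder(nn.Module):
    # Sketch of a UNet-style decoder over multi-scale sparse encoder features.
    def __init__(self, channels=(512, 256, 128), out_ch=3):
        super().__init__()
        # One learnable mask embedding per scale, used to fill masked slots.
        self.mask_tokens = nn.ParameterList(
            [nn.Parameter(torch.zeros(1, c, 1, 1)) for c in channels])
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(channels[i], channels[i + 1], 2, stride=2)
             for i in range(len(channels) - 1)])
        # Project to pixel channels at the finest feature scale (a full
        # implementation would continue upsampling to image resolution).
        self.head = nn.Conv2d(channels[-1], out_ch, 1)

    def forward(self, feats, masks):
        # feats: coarse-to-fine feature maps; masks: matching (B,1,h,w) maps,
        # 1 = visible, 0 = masked.
        x = None
        for i, (f, m) in enumerate(zip(feats, masks)):
            dense = f * m + self.mask_tokens[i] * (1 - m)   # densify
            x = dense if x is None else self.up[i - 1](x) + dense
        return self.head(x)

feats = [torch.randn(2, 512, 7, 7),
         torch.randn(2, 256, 14, 14),
         torch.randn(2, 128, 28, 28)]
base = (torch.rand(2, 1, 7, 7) > 0.6).float()
masks = [F.interpolate(base, size=(s, s), mode="nearest") for s in (7, 14, 28)]
out = HierarchicalDecoder()(feats, masks)   # (2, 3, 28, 28)
```

Filling the masked slots with a learnable embedding before fusion is what lets a dense decoder reconstruct from the sparse features the encoder produces.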

The experimental results below show that the method's performance is comparable to the original MAE, and it achieves state-of-the-art results on various downstream tasks. The authors also conduct ablations demonstrating the effectiveness of each design module and the generality of the method.

[Figures: experimental results — comparisons with MAE and contrastive methods, downstream-task results, and ablation studies]

[1] Tian K, Jiang Y, Diao Q, et al. Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling. arXiv preprint arXiv:2301.03580, 2023.

