ConvMAE: Masked Convolution Meets Masked Autoencoders

Machine learning is commonly divided into supervised, unsupervised, and semi-supervised learning. Self-Supervised Learning is a form of unsupervised learning whose main goal is to learn a general feature representation that can be transferred to downstream tasks, with the supervision signal constructed from the data itself. As a representative work, Kaiming He's MoCo sparked a wave of heated discussion, and Yann LeCun said at AAAI that Self-Supervised Learning is the general trend of the future.

So what does ConvMAE (Masked Convolution Meets Masked Autoencoders) actually do? Its main proposal is that a multi-scale hybrid convolution-transformer model can benefit from the Masked Auto-Encoding (MAE) training paradigm and learn better representations through it. What exactly counts as a "better" feature representation is a trickier question — if you were only given noise images, could anything useful still be learned? That point is admittedly hard to pin down.
1.1 Self-Supervised Learning
In the pre-training stage of Self-Supervised Learning, we use unlabeled data, because labeled datasets are very expensive: annotation takes a great deal of manual labor, so the cost is high. Unlabeled data, by contrast, can simply be crawled from the web and is cheap. When training the model parameters, we do not try to go from a blank, randomly initialized state to a fully trained model in one step using labeled data alone, because that data is too expensive. Instead, Self-Supervised Learning first trains the parameters from a blank slate into a rough initial shape, and then from that initial shape into their final form. Note that these are two separate stages. The representation obtained after the first stage is what we call the Visual Representation. Pre-training, i.e. taking the parameters from a blank slate to their initial shape, uses an unlabeled dataset. Once the parameters are roughly in place, a labeled dataset is used to fine-tune them for the downstream tasks (Downstream Tasks); the amount of labeled data needed at this point does not need to be large, because the parameters are already mostly trained after the first stage.
The first stage does not involve any downstream task: it pre-trains on unlabeled data with no specific task in mind, which the literature calls a task-agnostic way. The second stage involves the downstream task: it fine-tunes on labeled data for that task, which the literature calls a task-specific way.
This is the core idea of Self-Supervised Learning, as shown in Figure 1 below.
[Figure 1: the two-stage pipeline of Self-Supervised Learning — task-agnostic pre-training followed by task-specific fine-tuning]
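
To make the two stages concrete, here is a minimal PyTorch-style sketch. The tiny backbone, pretext head, and task head below are hypothetical stand-ins, not the actual MAE/ConvMAE code:

```python
import torch
import torch.nn as nn

# Hypothetical tiny backbone standing in for a ViT / ConvMAE encoder.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())

# ---- Stage 1: task-agnostic pre-training on unlabeled images ----
# A pretext head (here: reconstruct the flattened image) provides the
# self-supervised signal; no labels are involved.
pretext_head = nn.Linear(128, 3 * 32 * 32)
unlabeled = torch.randn(8, 3, 32, 32)            # stand-in for web-crawled images
recon = pretext_head(backbone(unlabeled))
pretrain_loss = nn.functional.mse_loss(recon, unlabeled.flatten(1))

# ---- Stage 2: task-specific fine-tuning on labeled data ----
# The pre-trained backbone is reused; only a small labeled set is needed.
task_head = nn.Linear(128, 10)                   # e.g. 10-class classification
images, labels = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
logits = task_head(backbone(images))
finetune_loss = nn.functional.cross_entropy(logits, labels)
```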

Self-Supervised Learning methods can be divided into 3 categories: Data Centric, Prediction (also called Generative) and Contrastive.

[Figure: the three categories of Self-Supervised Learning methods]

The two mainstream families are Generative-based methods and Contrastive-based methods, briefly introduced here and illustrated in the figure below. Generative-based methods mainly focus on a reconstruction error: in an NLP task, for example, a token in the middle of a sentence is masked, the model is asked to predict it, and the error between the prediction and the real token is used as the loss. Contrastive-based methods do not require the model to reconstruct the original input; instead, they only ask that the model can distinguish different inputs in the feature space.
[Figure: Generative-based vs. Contrastive-based self-supervised learning]
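
To make the distinction concrete, here is a minimal PyTorch sketch with made-up feature tensors (not taken from any particular model): a reconstruction loss for the generative family versus a simple InfoNCE-style loss for the contrastive family.

```python
import torch
import torch.nn.functional as F

# Generative / predictive: the loss measures how well the model reconstructs
# the masked-out content (e.g. pixel values or a masked token).
prediction = torch.randn(8, 768)        # model output for the masked positions
target = torch.randn(8, 768)            # the original (ground-truth) content
generative_loss = F.mse_loss(prediction, target)

# Contrastive: the loss only asks that two views of the same image end up
# close in feature space and far from other images (InfoNCE-style).
z1 = F.normalize(torch.randn(8, 128), dim=1)   # features of view 1
z2 = F.normalize(torch.randn(8, 128), dim=1)   # features of view 2
logits = z1 @ z2.t() / 0.07                    # similarity matrix / temperature
labels = torch.arange(8)                       # positives lie on the diagonal
contrastive_loss = F.cross_entropy(logits, labels)
```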
1.2 Motivation of ConvMAE
ConvMAE builds on the observation that many prior works (such as MoCo [1], MAE [2], BEiT [3], DINO [4]) have already verified that Self-Supervised Learning training paradigms can unleash the potential of the Vision Transformer model and achieve very good performance on downstream tasks.

As a representative work of this paradigm, MAE develops an asymmetric encoder-decoder architecture in which the encoder operates only on the subset of visible patches (i.e. the tokens that are not masked), while a separate decoder reconstructs the original image from the latent representations and the masked tokens. The decoder can be a very lightweight model, and its specific architecture has a large impact on the performance of the model. The researchers further found that masking out a large portion of the input image (e.g. 75%) yields a meaningful and challenging self-supervised task. At the same time, the MAE training paradigm not only learns a representation with strong discriminative power without requiring a large-scale dataset (JFT-300M, ImageNet-22K), but is also easily scalable to larger models; experiments show that as the model grows, the performance keeps improving.

To speed up ViT training and obtain better performance, a large body of work (such as SMCA-DETR [5], SAM-DETR [6], DAB-DETR [7], Uniformer [8], CoAtNet [9], ConViT [10], Early Convolution [11]) has verified that a local inductive bias can further improve the performance of ViT models. A similar improvement can also be achieved through a multi-scale pyramidal representation (such as Swin Transformer [12], PVT [13]). The effectiveness of combining the two has been verified on a large number of supervised recognition, detection, and segmentation tasks.

A natural question is therefore: after MAE-style training, can a model combining a multi-scale pyramidal architecture with a local inductive bias further tap into and improve the performance of MAE?
This is exactly the question this paper explores. In short, ConvMAE is: a multi-scale pyramid architecture + local inductive bias model, trained with MAE's self-supervised learning recipe.

Compared with MAE-Base, ConvMAE-Base improves the ImageNet-1k fine-tuning accuracy to 85.0% (+1.4%), the box AP of Mask R-CNN on COCO detection to 53.2% (+2.9%), and the mIoU of UperNet on ADE20k segmentation to 51.7% (+3.6%).

1.3 ConvMAE Encoder Architecture
The MAE approach is shown in Figure 3 below. MAE is a framework that pre-trains a ViT backbone in a self-supervised manner. The method is simple: mask random patches of the input image and reconstruct them. It rests on two core designs. First, an asymmetric encoder-decoder architecture, where the encoder operates only on the subset of visible patches (i.e. the tokens that are not masked) and a simple decoder reconstructs the original image from the latent representations together with learnable mask tokens; the decoder can be a very lightweight model, and its specific architecture has a large impact on performance. Second, masking out a large portion of the input image (e.g. 75%) yields a meaningful and challenging self-supervised task. Combining these two designs enables efficient training of large models: it speeds up training by a factor of 3 or more and improves accuracy.
[Figure 3: the MAE framework — masked patches, asymmetric encoder and decoder, pixel reconstruction]
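
Below is a minimal, simplified sketch of this asymmetric design (an illustration of the idea, not the official MAE implementation): 75% of the patch tokens are dropped before the encoder, and the decoder receives the encoded visible tokens plus a shared learnable mask token at the masked positions before predicting pixels.

```python
import torch
import torch.nn as nn

B, N, D = 2, 196, 768                  # batch, number of patches, embed dim
mask_ratio = 0.75
patches = torch.randn(B, N, D)         # patch embeddings (plus position encoding)

# Randomly keep 25% of the patches; the rest are dropped before the encoder.
num_keep = int(N * (1 - mask_ratio))
noise = torch.rand(B, N)
ids_keep = noise.argsort(dim=1)[:, :num_keep]
visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

# Encoder (a single transformer layer as a stand-in) sees only visible tokens.
encoder = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
latent = encoder(visible)                              # (B, num_keep, D)

# Decoder input: a shared learnable mask token at every position, with the
# encoded visible tokens scattered back to where they came from.
mask_token = nn.Parameter(torch.zeros(1, 1, D))
dec_in = mask_token.expand(B, N, D).clone()
dec_in.scatter_(1, ids_keep.unsqueeze(-1).expand(-1, -1, D), latent)

decoder = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
pred = nn.Linear(D, 16 * 16 * 3)(decoder(dec_in))      # 16x16x3 pixels per patch
```

Because the heavy encoder only processes 25% of the tokens while the lightweight decoder handles the full set, pre-training is far cheaper than running a full ViT over every patch.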
Compared with the MAE framework, ConvMAE makes some small but very effective changes. As mentioned above, its defining features are a multi-scale pyramid architecture and a local inductive bias.

Figure 4 below shows the ConvMAE framework, which likewise consists of an Encoder and a Decoder. The Encoder is a convolution-transformer hybrid architecture, while the Decoder is a pure transformer.

First look at the gray Encoder part in the upper-left corner. It consists of 3 stages: letting h × w be the size of the input image, the output features of the three stages have resolutions h/4 × w/4, h/8 × w/8, and h/16 × w/16, respectively. The first two stages are convolutional: they process features with Masked Convolutional Blocks, whose structure is shown in the lower-right corner of the figure below (the depthwise convolution uses a 5×5 kernel). Between stages, a stride-2 convolution performs the downsampling. The last stage is a Transformer module, which enlarges the receptive field and fuses the features of all patches. In addition, the authors found that absolute position encoding performs best.
[Figure 4: the ConvMAE framework — three-stage convolution-transformer encoder, block-wise masking, and transformer decoder; the Masked Convolutional Block is shown in the lower-right corner]
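
A simplified sketch of this three-stage hybrid encoder is given below (the channel widths, single block per stage, 4× patchify stem, and head count are illustrative assumptions, not the paper's exact configuration): two convolutional stages built on 5×5 depthwise convolutions with stride-2 downsampling between stages, followed by a transformer stage operating on the h/16 × w/16 tokens.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Simplified stand-in for ConvMAE's Masked Convolutional Block
    (masking omitted here; see the masked-convolution sketch in Section 1.4):
    5x5 depthwise conv + pointwise MLP, both with residual connections."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.mlp = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                 nn.Conv2d(4 * dim, dim, 1))

    def forward(self, x):
        x = x + self.dwconv(x)
        return x + self.mlp(self.norm(x))

h = w = 224
x = torch.randn(1, 3, h, w)

stem   = nn.Conv2d(3, 64, 4, stride=4)            # h/4  x w/4 features
stage1 = ConvBlock(64)
down1  = nn.Conv2d(64, 128, 2, stride=2)          # h/8  x w/8
stage2 = ConvBlock(128)
down2  = nn.Conv2d(128, 256, 2, stride=2)         # h/16 x w/16
stage3 = nn.TransformerEncoderLayer(256, nhead=8, batch_first=True)

f = stage2(down1(stage1(stem(x))))                # convolutional stages 1-2
tokens = down2(f).flatten(2).transpose(1, 2)      # (1, (h/16)*(w/16), 256)
out = stage3(tokens)                              # global self-attention stage
```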
1.4 ConvMAE mask strategy
MAE applies a random mask directly to the patches of the input image; however, the same strategy cannot be applied directly to the ConvMAE encoder. Because ConvMAE's features are progressively downsampled across stages, randomly masking the features would leave every stage-3 token with some visible information. The ConvMAE authors therefore mask the stage-3 tokens (e.g. 75% of them) and then upsample this mask by 2× and 4× to obtain the masks for the first two stages. The masked tokens are discarded during encoding, and the decoder is expected to reconstruct them. In this way, ConvMAE only needs to keep 25% of the tokens during training.
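
A small sketch of this mask generation (assuming a 224×224 input, so the stage-3 grid is 14×14; this is an illustration, not the released ConvMAE code): a 75% random mask is drawn at stage-3 resolution and then upsampled by 2× and 4× with nearest-neighbour repetition, so all three stages hide exactly the same image regions.

```python
import torch
import torch.nn.functional as F

mask_ratio = 0.75
H3 = W3 = 14                                  # stage-3 token grid (h/16 x w/16)

# Random mask at stage-3 resolution: 1 = masked, 0 = visible.
num_mask = int(H3 * W3 * mask_ratio)
scores = torch.rand(H3 * W3)
mask3 = torch.zeros(H3 * W3)
mask3[scores.argsort()[:num_mask]] = 1.0
mask3 = mask3.view(1, 1, H3, W3)

# Upsample the same mask to stage-2 (h/8) and stage-1 (h/4) resolutions, so a
# masked stage-3 token hides the corresponding 2x2 / 4x4 feature regions.
mask2 = F.interpolate(mask3, scale_factor=2, mode="nearest")   # 28 x 28
mask1 = F.interpolate(mask3, scale_factor=4, mode="nearest")   # 56 x 56
```

Upsampling one stage-3 mask, rather than drawing independent masks per stage, is what prevents information about a masked region from leaking in through the earlier high-resolution stages.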

However, the receptive field of the 5×5 depthwise convolutions in the first two stages may be larger than the size of a masked patch. To ensure the quality of pre-training, the authors therefore use masked convolution [14][15] in the first two stages, which guarantees that the masked regions do not participate in the encoding process.
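
The idea can be sketched as follows (a simplification of masked convolution, not ConvMAE's exact implementation): features at masked positions are zeroed before the depthwise convolution, and the output is re-masked afterwards, so nothing from the masked regions leaks into the visible features through the 5×5 receptive field.

```python
import torch
import torch.nn as nn

dim = 64
dwconv = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)  # 5x5 depthwise conv

def masked_dwconv(x, mask):
    """x: (B, C, H, W) features; mask: (B, 1, H, W) with 1 = masked, 0 = visible."""
    visible = 1.0 - mask
    x = dwconv(x * visible)      # masked inputs contribute nothing to the conv
    return x * visible           # and masked output positions stay zeroed

x = torch.randn(1, dim, 56, 56)
mask1 = (torch.rand(1, 1, 56, 56) < 0.75).float()   # illustrative stage-1 mask
out = masked_dwconv(x, mask1)
```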



Origin blog.csdn.net/hasque2019/article/details/124816019