ViT: Transformer's Milestone in CV


Foreword

Transformer [1] was originally proposed for NLP, where it has been hugely successful, largely displacing RNN models and becoming the new baseline architecture for the field. The ViT paper is inspired by that success and tries to apply the Transformer to CV. In the paper's experiments, the best model reaches 88.55% accuracy on ImageNet-1K (after first pre-training on the large JFT dataset), showing that the Transformer is indeed effective in CV, especially with the support of large-scale pre-training data.

How much data does this "big data" support require? The authors ran the experiment shown in the figure below: the horizontal axis lists pre-training datasets of increasing size (1.3 million, 21 million, and 300 million images, from left to right), and the vertical axis is classification accuracy. The gray band marks the performance range achievable by pure-convolution ResNets; circles of different colors represent ViT models of different sizes. The results show that with roughly 1 million images (e.g. ImageNet-1k), ViT's classification accuracy is worse than the CNN models'; with roughly 21 million images (e.g. ImageNet-21k), ViT is comparable to the CNNs; and with roughly 300 million images (e.g. JFT-300M), ViT is slightly better than the CNNs.
[Figure: ViT vs. ResNet accuracy as a function of pre-training dataset size]

Paper title: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Paper link: https://arxiv.org/pdf/2010.11929
PyTorch implementation: https://github.com/Arwin-Yu/Deep-Learning-Classification-Models-Based-CNN-or-Attention


Due to length constraints, this article does not cover the Transformer model or the self-attention mechanism in detail, but both are prerequisite knowledge for understanding ViT.

1. Vision Transformer

The model design follows the original Transformer as closely as possible. The goal is a unified algorithmic framework that CV and NLP can share, which is also why ViT opened up so much follow-up work on multimodal tasks, especially those combining text and images.

In the paper, the authors mainly compare three models: ResNet, ViT (a pure Transformer model), and Hybrid (a model combining convolution and Transformer).

The figure below shows the Vision Transformer (ViT) architecture from the original paper. In simple terms, the model consists of three modules:

  • Linear Projection of Flattened Patches (the embedding layer, which maps image patches to vectors)
  • Transformer Encoder (a more detailed structure is given on the right side of the figure; it computes on and learns from the input tokens)
  • MLP Head (the final classification layer, similar to the classification head commonly placed on top of a CNN)

[Figure: ViT architecture, with the Transformer Encoder detailed on the right]

1. Linear Projection of Flattened Patches

A standard Transformer module expects a sequence of tokens (vectors) as input, i.e. a two-dimensional matrix of shape [num_token, token_dim]. Image data has shape [H, W, C], a three-dimensional array, which is obviously not what the Transformer wants, so the data must first be transformed by an embedding layer.

Taking ViT-B/16 as an example, the input image (224x224) is divided into patches of size 16x16, which yields (224/16) x (224/16) = 196 patches. Each patch is then mapped to a one-dimensional vector by a linear projection (Linear Projection).

Concretely, the linear projection is implemented directly as a convolution with kernel size 16x16, stride 16, and 768 output channels. This convolution changes the shape from [224, 224, 3] to [14, 14, 768]; the H and W dimensions are then flattened (Flattened Patches), giving [14, 14, 768] -> [196, 768], which is exactly the two-dimensional matrix the Transformer wants. Here 196 is the number of patches; each patch has shape [16, 16, 3] and is mapped by the convolution to a vector of length 768 (referred to simply as a token below).
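As a minimal sketch of this step (the class and parameter names here are assumptions following common PyTorch convention, not necessarily the code of the linked repository):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into 16x16 patches and linearly project each patch to a 768-dim token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # (224/16)^2 = 196
        # A conv with kernel_size = stride = patch_size is equivalent to cutting the
        # image into non-overlapping patches and applying the same linear map to each.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: [B, 3, 224, 224]
        x = self.proj(x)                      # [B, 768, 14, 14]
        x = x.flatten(2).transpose(1, 2)      # [B, 196, 768]
        return x

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                           # torch.Size([1, 196, 768])
```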

Note that a [class] token and a Position Embedding must be added before the sequence enters the Transformer Encoder. In the original paper, the authors use a [class] token rather than GAP (global average pooling) for classification mainly in reference to BERT, keeping the model structure as close to the original Transformer as possible in order to show that the Transformer transfers to the image domain. Concretely, a [class] token dedicated to classification is inserted in front of the tokens produced by the Linear Projection of Flattened Patches. This [class] token is a trainable parameter with the same format as the other tokens; for ViT-B/16 it is a vector of length 768. Concatenating it with the tokens generated from the image changes the dimension as Cat([1, 768], [196, 768]) -> [197, 768]. Since the self-attention mechanism in the Transformer block can attend to all tokens, we have good reason to believe that the [class] token, like GAP, can aggregate everything the Transformer has learned for the subsequent classification computation.

The Position Embedding is a trainable one-dimensional positional encoding (1D Pos. Emb.) that is added element-wise to the tokens, so the shapes must match. For ViT-B/16, the shape after concatenating the [class] token is [197, 768], so the Position Embedding also has shape [197, 768]. Self-attention lets all elements interact pairwise and is therefore order-free, but an image is a whole: its patches have an order and are spatially related. We therefore add a set of positional parameters, the position embedding, to the patch embeddings and let the model learn the spatial relationships between patches by itself.
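Continuing the sketch above (again with assumed names), the [class] token and the learnable 1D position embedding could be wired up like this:

```python
import torch
import torch.nn as nn

embed_dim, num_patches = 768, 196

# Trainable [class] token and 1D position embedding (ViT-B/16 sizes).
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                 # [1, 1, 768]
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))   # [1, 197, 768]

def add_cls_and_pos(patch_tokens):             # patch_tokens: [B, 196, 768]
    B = patch_tokens.shape[0]
    cls = cls_token.expand(B, -1, -1)          # [B, 1, 768]
    x = torch.cat((cls, patch_tokens), dim=1)  # concatenate -> [B, 197, 768]
    x = x + pos_embed                          # element-wise add, shapes must match
    return x
```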

In a CNN, the inductive bias (the prior knowledge built into the model at design time) runs through the entire model, and the convolutional inductive bias matches the nature of images, namely locality and translation equivariance. In ViT, only the MLP layers are local and translation-equivariant, while the self-attention layers are global. The 2D neighborhood structure of the image is barely used; it appears only at the very beginning when the image is cut into patches. It is worth noting that the position encoding is randomly initialized and carries no 2D information about patch locations, so the spatial relationships between patches must be learned from scratch. ViT therefore uses far less inductive bias, which is why its results on small and medium datasets fall short of CNNs. With large-scale data, however, ViT surpasses CNNs, which to some extent suggests that the knowledge a model learns from big data is more useful than the prior knowledge people build into it.

Finally, the authors also ran a series of comparisons between different Position Embedding schemes; the results are shown below. The source code uses 1D Pos. Emb. by default. Compared with using no Position Embedding at all, it improves accuracy by about 3 points, and it is barely different from 2D Pos. Emb. The authors' explanation is that ViT operates at the patch level, not the pixel level: at the patch level the spatial resolution is (224/16) x (224/16), far smaller than the pixel-level 224 x 224, and learning to represent spatial position at this low resolution is easy regardless of which strategy is used, so the results are about the same.
[Table: ablation of different position embedding strategies]

2. Transformer Encoder

The Transformer Encoder simply stacks the Encoder Block shown in the figure below L times; it consists mainly of the following parts (a minimal code sketch of one block follows the figure):

  • Layer Norm [2]: a normalization method proposed mainly for NLP; here it normalizes each token, playing a role similar to BN in CNNs.
  • Multi-Head Attention: identical to the structure in the original Transformer, so it is not described again here.
  • Dropout/DropPath [3]: the code of the original paper uses a plain Dropout layer here, but the linked implementation uses DropPath (stochastic depth), and the latter may work better.
  • MLP Block: a fully connected layer + GELU activation + Dropout, which is also very simple. Note that the first fully connected layer quadruples the number of nodes, [197, 768] -> [197, 3072], and the second fully connected layer restores the original size, [197, 3072] -> [197, 768].
    [Figure: Encoder Block structure]
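The sketch below shows one such block using standard PyTorch modules (a simplification under assumed names; it uses plain Dropout rather than DropPath, and nn.MultiheadAttention in place of a hand-written attention layer):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer Encoder block: LN -> Multi-Head Attention -> residual, LN -> MLP -> residual."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0, drop=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=drop, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)          # 768 * 4 = 3072
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Dropout(drop),
            nn.Linear(hidden, dim), nn.Dropout(drop),
        )

    def forward(self, x):                      # x: [B, 197, 768]
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attention + residual
        x = x + self.mlp(self.norm2(x))                     # MLP block + residual
        return x
```

Stacking this block L times (12 for ViT-B) gives the full Transformer Encoder.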

3. MLP Head

The output of the Transformer Encoder has the same shape as its input: for ViT-B/16, the input is [197, 768] and the output is also [197, 768]. Note that there is actually a Layer Norm after the Transformer Encoder that is not drawn in the paper's figure; the diagram below, which I drew myself, shows the detailed structure.

Here we only need the classification information, so we only extract the output corresponding to the [class] token, i.e. the [1, 768] slice for the [class] token is taken out of the [197, 768] output; because self-attention computes over global information, this [class] token has already aggregated the information from the other tokens. The final classification result is then obtained through the MLP Head. The original paper states that when pre-training on ImageNet-21K the MLP Head consists of Linear + tanh activation + Linear, but when transferring to ImageNet-1K or your own data a single Linear layer is enough.
[Figure: detailed ViT-B/16 structure]
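A hedged sketch of this final step (single-Linear head as used for fine-tuning; the names are assumptions):

```python
import torch.nn as nn

embed_dim, num_classes = 768, 1000
norm = nn.LayerNorm(embed_dim)            # the final Layer Norm after the encoder
head = nn.Linear(embed_dim, num_classes)  # a single Linear suffices when fine-tuning

def classify(encoder_output):             # encoder_output: [B, 197, 768]
    x = norm(encoder_output)
    cls = x[:, 0]                         # keep only the [class] token -> [B, 768]
    return head(cls)                      # logits: [B, num_classes]
```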
It is also worth noting that the authors compared the [class] token with GAP in ablation experiments. The results show that GAP and the [class] token reach similar classification accuracy, so the [class] token is used here simply to imitate the Transformer as closely as possible. The specific results are shown in the figure below.
[Figure: ablation of [class] token vs. GAP under different learning rates]
Note that when using GAP, a smaller learning rate should be chosen, otherwise the final accuracy suffers. One point worth remembering: in deep learning, when an operation does not work well, the problem is not necessarily the operation itself; it may be the training strategy, i.e. the "alchemy" of hyperparameter tuning.

4. Model Scaling

Table 1 of the paper lists three model sizes (Base/Large/Huge). Besides the 16x16 patch size, the source code also provides 32x32. In the table, Layers is the number of times the Encoder Block is stacked in the Transformer Encoder; Hidden Size is the dim (vector length) of each token after the embedding layer; MLP Size is the width of the first fully connected layer in the MLP Block (four times the Hidden Size); and Heads is the number of heads in the Multi-Head Attention.
[Table 1: ViT Base/Large/Huge configurations]
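For quick reference, the three variants can be written as simple configuration dictionaries (values quoted from the paper's Table 1; treat this as an illustrative sketch rather than official code):

```python
# layers = stacked Encoder Blocks, hidden = token dim, mlp = first FC width (4x hidden), heads = attention heads
vit_configs = {
    "ViT-Base":  {"layers": 12, "hidden": 768,  "mlp": 3072, "heads": 12},
    "ViT-Large": {"layers": 24, "hidden": 1024, "mlp": 4096, "heads": 16},
    "ViT-Huge":  {"layers": 32, "hidden": 1280, "mlp": 5120, "heads": 16},
}
```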

2. Hybrid Vision Transformer

Section 4.1 of the paper (Model Variants) describes the Hybrid model in more detail: traditional CNN feature extraction is combined with a Transformer. The figure below shows a hybrid model using ResNet50 as the feature extractor, but the ResNet here differs somewhat from the standard one. First, the convolution layers of this R50 use StdConv2d rather than the ordinary Conv2d, and all BatchNorm layers are replaced with GroupNorm. Second, in the original ResNet50 the four stages are stacked 3, 4, 6, and 3 times respectively, whereas in this R50 the 3 blocks of stage 4 are moved into stage 3, so stage 3 is stacked 9 times in total.

After feature extraction by the R50 backbone, the feature map has shape [14, 14, 1024] and is fed into the Patch Embedding layer. Note that the kernel_size and stride of the Conv2d in Patch Embedding both become 1, so it is used only to adjust the channel count, and the result again has shape [196, 768]. Everything after that is exactly the same as in the pure ViT described above, so it is not repeated here.
[Figure: hybrid model with the modified R50 backbone feeding the Transformer Encoder]
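A minimal sketch of the hybrid embedding step described above (the real backbone, a modified ResNet50 with StdConv2d and GroupNorm, is replaced here by a dummy feature map; names are assumptions):

```python
import torch
import torch.nn as nn

# In the hybrid model the "patch" convolution becomes 1x1 with stride 1: the R50 backbone
# has already reduced the spatial size to 14x14, so this conv only maps 1024 channels to 768.
proj = nn.Conv2d(1024, 768, kernel_size=1, stride=1)

def hybrid_embed(r50_features):                 # r50_features: [B, 1024, 14, 14]
    tokens = proj(r50_features)                 # [B, 768, 14, 14]
    return tokens.flatten(2).transpose(1, 2)    # [B, 196, 768], same shape as pure ViT

dummy = torch.randn(2, 1024, 14, 14)            # stand-in for the backbone output
print(hybrid_embed(dummy).shape)                # torch.Size([2, 196, 768])
```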

The experimental results are shown below. The horizontal axis is the model's computational cost, i.e. model size; the vertical axis is classification accuracy. The left plot shows the average performance over five datasets, and the right plot shows performance on ImageNet. The results show that when the models are small, Hybrid-ViT performs best, which is understandable since it combines the advantages of both approaches. When the models are large, however, the pure ViT works best. Personally, I think this again suggests that the knowledge a ViT learns from the data is more useful than the prior knowledge people design into a CNN.

[Figure: accuracy vs. compute for ViT, ResNet, and Hybrid models]

3. Summary

The starting point of the ViT work is to show that the Transformer, transplanted from NLP, can also handle image data well, especially with big-data support. This shook the dominance of CNNs in CV for the first time, and many researchers have since tried to figure out exactly what makes the Transformer work so well.

Ever since self-attention was popularized by the Transformer paper, people assumed for years that self-attention was the key ingredient, but some recent work shows this is not necessarily the case. Researchers have replaced the multi-head self-attention in ViT's Transformer Encoder with an MLP, and the model still performs well, which shows that self-attention is not an indispensable operation in the Transformer. Interestingly, once multi-head self-attention is replaced by an MLP, the whole ViT model essentially becomes a pure MLP model, circling back to the starting point of deep learning, the plain neural network, and even creating a three-way contest between CNNs, Transformers, and MLPs.

Moreover, some more radical researchers have replaced the self-attention in the Transformer with a pooling layer that has no learnable parameters, and the resulting model still performs well. They therefore argue that the key to the Transformer's success is the overall design of the architecture, which they call MetaFormer. So what exactly makes the Transformer so effective? Academia still has no unified answer. Some say self-attention is all you need; some say MLP is all you need; some say patches are all you need; others say MetaFormer is all you need. Personally, I think the precondition for a Transformer to work well is a large amount of training data, and training on that data requires massive compute, so the answer is: money is all you need! (Just kidding, don't take it seriously.)

In CV, deep learning initially evolved from MLPs to CNNs and then from CNNs to Transformers, and now it seems to be circling back to MLPs. Note, however, that today's MLP models, such as MLP-Mixer, are not the original MLP: they are improved versions. Historically, technology tends to advance in a spiral, and I look forward to seeing whether CNNs will reclaim their dominant position in CV after the impact of Transformers and MLPs.

4. Citation

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

[2] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

[3] Huang, G., Sun, Y., Liu, Z., Sedra, D., & Weinberger, K. Q. (2016, October). Deep networks with stochastic depth. In European conference on computer vision (pp. 646-661). Springer, Cham.
