ViT (Vision Transformer) paper notes

(AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE)
Original code: https://github.com/google-research/vision_transformer

Abstract

Although the Transformer architecture has become the de facto standard for natural language processing tasks, its application to computer vision is still limited. In vision, attention is either used in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is unnecessary and that a pure Transformer applied directly to a sequence of image patches can perform image classification tasks well. When pre-trained on large amounts of data and transferred to multiple small and medium-sized image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) achieves superior performance compared to state-of-the-art convolutional networks, while requiring far fewer computing resources to train (the "small" resource requirement is relative: the largest model itself takes about 2,500 TPU-v3 core-days).

1、Introduction

Self-attention-based architectures, especially Transformers (Vaswani et al., 2017), have become the preferred models for natural language processing (NLP). The dominant approach is to pre-train on large text corpora and then fine-tune on smaller task-specific datasets (Devlin et al., 2019). Thanks to Transformers' computational efficiency and scalability, it has become possible to train models of unprecedented scale with over 100B parameters (Brown et al., 2020; Lepikhin et al., 2020). As the models and datasets grow, performance still shows no sign of saturation.

Difficulties in applying Transformers to computer vision:

1) In NLP, the sequence length fed to a Transformer is typically a few hundred tokens; BERT uses 512. Because of how self-attention is computed, the time complexity is O(n^2) in the sequence length, which is already expensive. In CV, training images are typically 224x224; flattening one into a 1D sequence of pixels gives a sequence of length 224x224 = 50,176, roughly 100 times BERT's 512, so the amount of computation becomes enormous.

2) The computation grows even further for other CV tasks. For example, in video classification the input frames can be 800x800, which increases the cost even more.

However, in computer vision, convolutional architectures still dominate (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016). Inspired by the success of NLP, many works try to combine CNN-like architectures with self-attention (Wang et al., 2018; Carion et al., 2020), and some works replace convolution entirely (e.g., with axial attention) (Ramachandran et al., 2019; Wang et al., 2020a). The latter models, while theoretically efficient, have not yet been scaled effectively on modern hardware accelerators because they use specialized attention patterns. Therefore, in large-scale image recognition, classic ResNet-like architectures are still state of the art (Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al., 2020).

Inspired by the success of Transformer scaling in NLP, we try to apply a standard Transformer directly to images with as few modifications as possible. To this end, we split an image into multiple patches and feed the sequence of linear embeddings of these patches as input to the Transformer. Image patches are treated the same way as tokens (words) in NLP applications. We train the model for image classification in a supervised manner. (The image is cut into 16x16 patches, so the earlier length-50,176 pixel sequence for a 224x224 image becomes a sequence of 14x14 = 196 patches.)
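
To make these sequence lengths concrete, a quick back-of-the-envelope calculation (plain Python; the 224 and 16 values are the ones used above):

```python
# Sequence length if every pixel of a 224x224 image were a token,
# versus the ViT approach of using 16x16 patches as tokens.
image_size = 224
patch_size = 16

pixels_as_tokens = image_size * image_size           # 50,176 tokens
patches_as_tokens = (image_size // patch_size) ** 2  # 14 * 14 = 196 tokens

print(pixels_as_tokens)   # 50176 -- roughly 100x BERT's 512-token sequences
print(patches_as_tokens)  # 196   -- easily handled by a standard Transformer

# Self-attention cost scales with the square of the sequence length:
print(pixels_as_tokens ** 2 / patches_as_tokens ** 2)  # ~65,536x more attention work
```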

[Figure: image-20220404093127826]

When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield accuracies several percentage points below a comparably sized ResNet. This seemingly discouraging result is to be expected: Transformers lack some of the inductive biases inherent in CNNs (essentially a kind of prior knowledge, or assumptions built in ahead of time). CNNs carry two such assumptions: the first is locality, meaning that neighboring regions tend to have related features; the second is translation equivariance, expressed as f(g(x)) = g(f(x)), where g is a translation and f the convolution. Because Transformers lack this built-in equivariance and locality, they do not generalize well when the amount of data is insufficient.

However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large-scale training trumps inductive bias. Our Vision Transformer (ViT) achieves excellent results when pre-trained at sufficient scale and transferred to tasks with fewer data points. When pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches or exceeds the state of the art on multiple image recognition benchmarks. In particular, the best model reaches 88.55% accuracy on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.

2、Related Work

Transformers were proposed by Vaswani et al. (2017) for machine translation and have become the state-of-the-art method for many NLP tasks. Large Transformer-based models are often pre-trained on large corpora and then fine-tuned for the task at hand: BERT (Devlin et al., 2019) uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task (Radford et al., 2018; 2019; Brown et al., 2020). (BERT predicts in a cloze-like manner, while GPT predicts the next word in a sentence, i.e., next-word prediction. Both are self-supervised.)

Naively applying self-attention to an image would require each pixel to attend to every other pixel. Since this cost is quadratic in the number of pixels, it does not scale to realistic input sizes. Therefore, to apply Transformers to image processing, several approximations have been tried in the past. Parmar et al. (2018) apply self-attention only in local neighborhoods of each query pixel rather than globally. Such local multi-head dot-product self-attention blocks can completely replace convolutions (Hu et al., 2019; Ramachandran et al., 2019; Zhao et al., 2020). In another line of work, Sparse Transformers (Child et al., 2019) employ scalable approximations to global self-attention so that it can be applied to images. Another way to scale attention is to apply it in blocks of varying sizes (Weissenborn et al., 2019), in the extreme case only along individual axes (axial attention, first along x and then along y) (Ho et al., 2019; Wang et al., 2020a). Many of these specialized attention architectures demonstrate promising results on computer vision tasks, but require complex engineering to be implemented efficiently on hardware accelerators.

Most related to ours is the model of Cordonnier et al. (2020), which extracts patches of size 2x2 from the input image and applies full self-attention on top. That model is very similar to ViT, but our work further demonstrates that large-scale pre-training enables vanilla Transformers to compete with (or even outperform) state-of-the-art CNNs. Furthermore, Cordonnier et al. (2020) use a small patch size of 2x2 pixels, which restricts the model to small-resolution images, while we also handle medium-resolution images.

There is also interest in combining convolutional neural networks (CNNs) with forms of self-attention, for example by augmenting feature maps for image classification (Bello et al., 2019), or by using self-attention to further process the output of a CNN, e.g. for object detection (Hu et al., 2018; Carion et al., 2020), video processing (Wang et al., 2018; Sun et al., 2019), image classification (Wu et al., 2020), unsupervised object discovery (Locatello et al., 2020), or unified text-vision tasks (Chen et al., 2020c; Lu et al., 2019; Li et al., 2019).

Another recent related model is Image GPT (iGPT) (Chen et al., 2020a), which applies transformers to image pixels after reducing the image resolution and color space. The model is trained as a generative model in an unsupervised manner, and the resulting representation can then be fine-tuned or linearly probed to improve classification performance, achieving a maximum accuracy of **72%** on ImageNet.

Our work adds to the growing number of papers exploring image recognition at larger scales than the standard ImageNet dataset. Using additional data sources allows state-of-the-art results on standard benchmarks (Mahajan et al., 2018; Touvron et al., 2019; Xie et al., 2020). In addition, Sun et al. (2017) and Kolesnikov et al. (2020) studied how CNN performance scales with dataset size, and Djolonga et al. (2020) conducted an empirical exploration of CNN transfer learning from large-scale datasets such as ImageNet-21k and JFT-300M. We also focus on these latter two datasets, but train Transformers instead of the ResNet-based models used in prior work.

3、Method

In the model design, we follow the original Transformer (Vaswani et al., 2017) as closely as possible. One advantage of this intentionally simple setup is that scalable NLP Transformer architectures and their efficient implementations can be used almost out of the box.

[Figure: image-20220405101444836]

Note: much of the design here borrows from BERT. Position embeddings are added to record the position of each patch, and an extra learnable embedding, the [class] token (marked [*] in the figure, with position 0), is prepended. The reason is that the Transformer encoder produces one output per input token; the output of this [class] token is the one used for the final classification (the paper argues this token can learn useful information from the other embeddings). The MLP Head is a simple classification head. The Transformer structure itself is unchanged, and only the encoder is used; there is no decoder.

Rough parameter derivation: [Figure: image-20220405105618415]

[Figure: image-20220405105722489]
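
A minimal sketch of this dimension bookkeeping in PyTorch (the conv-as-patchify trick and the names here are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn as nn

B, C, H, W = 1, 3, 224, 224      # batch, channels, height, width
P, D = 16, 768                   # patch size, embedding dimension (ViT-Base)
N = (H // P) * (W // P)          # 14 * 14 = 196 patches

x = torch.randn(B, C, H, W)

# Patchify: a Conv2d with kernel_size = stride = P is equivalent to cutting
# 16x16 patches, flattening them, and applying a shared linear projection.
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)
tokens = patch_embed(x)                     # (B, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (B, 196, 768)

# Prepend the learnable [class] token and add learnable 1D position embeddings.
cls_token = nn.Parameter(torch.zeros(1, 1, D))
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))
tokens = torch.cat([cls_token.expand(B, -1, -1), tokens], dim=1) + pos_embed

print(tokens.shape)  # torch.Size([1, 197, 768]) -- the sequence fed to the encoder
```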

For the image preprocessing steps and dimension calculations, see the Bilibili video below (around the 34-minute mark):

https://www.bilibili.com/video/BV15P4y137jb?spm_id_from=333.1007.top_right_bar_window_history.content.click

The appendix also contains several ablation experiments. One shows that taking the Transformer encoder's output through the [class] token and instead using global average pooling over all output tokens perform equally well. Another examines changes to the position information relative to the original Transformer: 2D position embeddings (separate x and y embeddings of size D/2 each) might seem better suited to images than the 1D embeddings of size D, but the experiments show that 1D versus 2D makes little difference. The author therefore keeps the structure as close as possible to the original Transformer, only adapting it to images where necessary. The two readouts compared in the first ablation are sketched below.
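
A tiny illustrative sketch of those two readouts (tensor shapes assume ViT-B/16 at 224x224; not the authors' code):

```python
import torch

encoder_out = torch.randn(8, 197, 768)  # (batch, 1 + 196 tokens, hidden dim)

# Option 1 (used in the paper): take the [class] token's output.
cls_readout = encoder_out[:, 0]               # (8, 768)

# Option 2 (ablation): global average pooling over the 196 patch tokens.
gap_readout = encoder_out[:, 1:].mean(dim=1)  # (8, 768)
```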

3.1、Vision Transformer(ViT)

In order to clarify the calculation process, the original text is used here:

[Figure: image-20220405110001879 (excerpt from the original paper)]

Similar to BERT's [class] token, we prepend a learnable embedding to the sequence of embedded patches (z_0^0 = x_class), whose state at the output of the Transformer encoder (z_L^0) serves as the image representation y (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to z_L^0. The classification head is implemented by an MLP with one hidden layer during pre-training and by a single linear layer during fine-tuning.

Position embeddings will be added to patch embeddings to preserve position information. We use standard learnable 1D Position embeddings because we do not observe significant performance gains from using the more advanced 2D-aware position embeddings ( Appendix D.4). The generated sequence of embedding vectors is used as input to the encoder.

The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multi-head self-attention (MSA, see Appendix A) and MLP blocks (Eqs. 2, 3). Layernorm (LN) is applied before every block, and residual connections after every block (Wang et al., 2019; Baevski & Auli, 2019).
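
A compact sketch of one such pre-norm encoder block (PyTorch; the dimensions follow ViT-Base, but the class itself is an illustration, not the authors' implementation):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder layer: LN -> MSA -> residual, then LN -> MLP -> residual."""
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # Eq. 2
        z = z + self.mlp(self.ln2(z))                       # Eq. 3
        return z

block = EncoderBlock()
out = block(torch.randn(2, 197, 768))  # shape preserved: (2, 197, 768)
```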

The formula of the entire process is expressed as follows:

[Figure: image-20220405110611524 (Eqs. 1-4)]
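
For reference, the four formulas shown in that figure, transcribed from the paper:

```latex
\begin{aligned}
z_0 &= [\,x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \cdots;\; x_p^N E\,] + E_{pos},
     \quad E \in \mathbb{R}^{(P^2 \cdot C)\times D},\; E_{pos} \in \mathbb{R}^{(N+1)\times D} &(1)\\
z'_\ell &= \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1 \ldots L &(2)\\
z_\ell &= \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \quad \ell = 1 \ldots L &(3)\\
y &= \mathrm{LN}(z_L^0) &(4)
\end{aligned}
```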

Inductive bias: We note that the Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, the two-dimensional neighborhood structure, and translation equivariance are baked into every layer throughout the model. In ViT, only the MLP layers are local and translation-equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: only at the beginning of the model, when cutting the image into patches, and at fine-tuning time, when adjusting the position embeddings for images of different resolution (described below). Beyond that, the position embeddings at initialization carry no information about the 2D positions of the patches, and all spatial relations between patches have to be learned from scratch.

Hybrid Architecture: As an alternative to raw image patches, the input sequence can be formed from the feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection E (Eq. 1) is applied to patches extracted from a CNN feature map. As a special case, the patches can have spatial size 1x1, which means the input sequence is obtained by simply flattening the spatial dimensions of the feature map and projecting to the Transformer dimension. The classification input embedding and position embeddings are added as described above. (That is, instead of patchifying the raw image, a CNN first reduces it to a small, e.g. 14x14, feature map.)
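
A hedged sketch of the hybrid variant, using a torchvision ResNet-50 truncated after stage 3 as the backbone (torchvision >= 0.13 assumed; the truncation point and projection dimension are illustrative):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Keep conv stem + stages 1-3 of ResNet-50; drop stage 4, pooling, and the classifier.
backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-3])

x = torch.randn(1, 3, 224, 224)
feat = backbone(x)                        # (1, 1024, 14, 14) feature map
tokens = feat.flatten(2).transpose(1, 2)  # (1, 196, 1024) -- 1x1 "patches"

proj = nn.Linear(1024, 768)               # patch embedding projection E (Eq. 1)
tokens = proj(tokens)                     # (1, 196, 768); then add the [class] token
                                          # and position embeddings as before
```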

3.2. Fine-Tuning and Higher Resolution (larger input images)

Typically, we pre-train ViT on large datasets and fine-tune it on (smaller) downstream tasks. To do this, we remove the pre-trained prediction head and attach a zero-initialized D×K feedforward layer, where K is the number of downstream classes. It is often beneficial to fine-tune at a higher resolution than during pre-training (Touvron et al., 2019; Kolesnikov et al., 2020). When feeding higher-resolution images, we keep the patch size the same, which results in a larger effective sequence length. The Vision Transformer can handle arbitrary sequence lengths (up to memory limits); however, the pre-trained position embeddings may no longer be meaningful. We therefore perform 2D interpolation of the pre-trained position embeddings according to their locations in the original image. Note that this resolution adjustment and the patch extraction are the only points at which an inductive bias about the 2D structure of the images is manually injected into the Vision Transformer.
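
A minimal sketch of that 2D position-embedding interpolation (PyTorch; the 224-to-384 resolution example, the bicubic mode, and the function name are assumptions for illustration, not the exact fine-tuning recipe):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """pos_embed: (1, 1 + old_grid**2, D) learnable position embeddings."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    D = patch_pe.shape[-1]
    # Reshape to a 2D grid, interpolate, and flatten back to a sequence.
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pe, patch_pe], dim=1)

# Fine-tuning at 384x384 with 16x16 patches gives a 24x24 grid of patches.
new_pe = resize_pos_embed(torch.randn(1, 1 + 14 * 14, 768))
print(new_pe.shape)  # torch.Size([1, 577, 768])
```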

4、Experiments

We evaluate the representation learning capabilities of ResNet, Vision Transformer (ViT), and the hybrid. To understand the data requirements of each model, we pre-train on datasets of different sizes and evaluate on many benchmark tasks. When the computational cost of pre-training is taken into account, ViT performs very favorably, reaching state of the art on most recognition benchmarks at a comparatively low pre-training cost. Finally, we run a small experiment using self-supervision and show that self-supervised ViT holds promise for the future.

4.1、Setup

Datasets: To explore model scalability, we use the ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images (we refer to it as ImageNet below), its superset ImageNet-21k with 21k classes and 14M images (Deng et al., 2009), and JFT (Sun et al., 2017) with 18k classes and 303M high-resolution images. We de-duplicate the pre-training datasets with respect to the test sets of the downstream tasks, following Kolesnikov et al. (2020). We transfer the models trained on these datasets to several benchmark tasks: ImageNet on the original validation labels and the cleaned-up ReaL labels (Beyer et al., 2020), CIFAR-10/100 (Krizhevsky, 2009), Oxford-IIIT Pets (Parkhi et al., 2012), and Oxford Flowers-102 (Nilsback & Zisserman, 2008). For these datasets, preprocessing follows Kolesnikov et al. (2020).

The detailed network parameters are as follows:

[Table: image-20220406092015738]

We also evaluate on the 19-task VTAB classification suite (Zhai et al., 2019b). VTAB evaluates low-data transfer on diverse tasks, using 1,000 training examples per task. The tasks are divided into three groups: Natural — tasks like the above, Pets, CIFAR, etc.; Specialized — medical and satellite imagery; and Structured — tasks that require geometric understanding, such as localization.

**Model Variants:** We base the ViT configurations on those used for BERT (Devlin et al., 2019), as shown in Table 1. The "Base" and "Large" models are taken directly from BERT, and we add the larger "Huge" model. In what follows, we use brief notation to indicate model size and input patch size: for example, ViT-L/16 denotes the "Large" variant with 16×16 input patch size. Note that the Transformer's sequence length is inversely proportional to the square of the patch size, so models with smaller patch size are computationally more expensive.

For the baseline CNNs, we use ResNet (He et al., 2016), but replace the Batch Normalization layers (Ioffe & Szegedy, 2015) with Group Normalization (Wu & He, 2018) and use standardized convolutions (Qiao et al., 2019). These modifications improve transfer (Kolesnikov et al., 2020), and we denote the modified model "ResNet (BiT)". For the hybrids, we feed the intermediate feature maps into ViT with a patch size of one "pixel". To experiment with different sequence lengths, we either (i) take the output of stage 4 of a regular ResNet50, or (ii) remove stage 4, place the same number of layers in stage 3 (keeping the total number of layers), and take the output of this extended stage 3. Option (ii) results in a 4× longer sequence length, and a more expensive ViT model.

Training & Fine-tuning: We train all models, including ResNets, using Adam (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.999, a batch size of 4096, and a high weight decay of 0.1, which we found useful for transfer of all models (Appendix D.1 shows that, in contrast to common practice, Adam works slightly better than SGD for ResNets in our setting). We use a linear learning-rate warm-up and decay; see Appendix B.1 for details. For fine-tuning we use SGD with momentum and a batch size of 512 for all models; see Appendix B.1.1. For the ImageNet results in Table 2, we fine-tuned at higher resolution: 512 for ViT-L/16 and 518 for ViT-H/14, and also used Polyak & Juditsky (1992) averaging with a factor of 0.9999 (Ramachandran et al., 2019; Wang et al., 2020b).

**Metrics:** We report results on downstream datasets either through few-shot or fine-tuning accuracy. Fine-tuning accuracy captures the performance of each model after fine-tuning it on the respective dataset. Few-shot accuracy is obtained by solving a regularized least-squares regression problem that maps the (frozen) representation of a subset of training images to {−1, 1}^K target vectors. This formulation allows us to recover the exact solution in closed form. While we mainly focus on fine-tuning performance, we sometimes use linear few-shot accuracy for fast, on-the-fly evaluation where fine-tuning would be too costly.
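
A small sketch of that closed-form linear few-shot evaluation (NumPy; the regularization constant and helper name are assumptions, not from the paper):

```python
import numpy as np

def fewshot_linear_eval(train_feats, train_labels, test_feats, num_classes, lam=1e-3):
    """Map frozen features to {-1, +1}^K targets with ridge regression (closed form)."""
    Y = -np.ones((len(train_labels), num_classes))
    Y[np.arange(len(train_labels)), train_labels] = 1.0           # {-1, 1} targets
    X = train_feats
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    return test_feats @ W                                          # class scores

# Toy usage with random "frozen" features.
rng = np.random.default_rng(0)
scores = fewshot_linear_eval(rng.normal(size=(50, 768)),
                             rng.integers(0, 10, size=50),
                             rng.normal(size=(20, 768)), num_classes=10)
preds = scores.argmax(axis=1)
```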

4.2 COMPARISON TO STATE OF THE ART

[Table: image-20220406093718937]

It can be seen that ViT-H, the larger and more expensive model, performs best in almost every experiment (on one benchmark it is not the best, but it is very close). Although it beats the previous SOTA models, the improvement is not dramatic, so the author also analyzes the compute: compared with the previous models, the training time of ViT is greatly reduced (see the last row of the table; "cheap" here is only relative).

How much data is needed before ViT pays off?

[Figure: image-20220406094404505]

The gray shaded area marks the range of results achieved by BiT, and the colored dots are ViT models of different sizes.

The figure shows that on the smaller ImageNet dataset, ViT's overall performance is not as good as BiT's; as the amount of data increases, ViT gradually overtakes BiT. The conclusion is that the ViT model works well on large datasets, at least at ImageNet-21k scale; for small datasets, a traditional convolutional neural network remains the better choice.

Comparison of few-shot results

[Figure: image-20220406095138097]

As in the figure from the previous section, ResNet performs better when the sample size is small, and ViT improves as the amount of data grows, ending up slightly better than ResNet152 in the largest setting. The author proposes few-shot learning on ViT as a direction for future research (another pit dug for later).

Why is pre-training ViT cheaper than pre-training convolutional neural networks?

[Figure: image-20220406095719843]

The pre-training in this comparison is all done on the JFT dataset.

Evaluation:

Average-5: evaluate on five datasets and average the results: ImageNet ReaL, Pets, Flowers, CIFAR-10, CIFAR-100.

ImageNet is shown in a separate panel because it is considered more important.

1) Comparing ViT and BiT across different amounts of training compute, ViT reaches higher accuracy for the same compute budget.

2) At lower compute budgets, the hybrid model performs best, which shows that the hybrid absorbs the respective advantages of ViT and ResNet. However, as compute increases, the hybrid's results approach those of the ViT model, and can even fall below ViT. Therefore, large-scale pre-training has a great impact on ViT's performance.

3) Neither BiT nor ViT appears to have saturated yet; the trend still looks roughly linear.

4.5. Inspecting Vision Transformer (visualization)

To begin to understand how the Vision Transformer processes image data, we analyze its internal representations. The first layer of the Vision Transformer linearly projects the flattened patches into a lower-dimensional space (Eq. 1). Figure 7 (left) shows the top principal components of the learned embedding filters. These components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch (lines, colors, blobs, etc.).

patch embedding:

[Figure: image-20220406101846483]

position embedding

1) The position embeddings do learn position information. For example, in the sub-plot for the center patch, positions closer to the center show higher similarity (closer to 1), while positions farther from the center show lower similarity (closer to 0).

2) Row and column structure is also learned: for each patch, the embeddings in the same row and the same column show relatively higher similarity (so the 1D encoding has effectively learned the 2D position information).

[Figure: image-20220406102157439]

Mean attention distance (in pixels) of different heads at different network depths:

It can be seen that as network depth increases, the mean attention distance of the heads also increases. In the early layers, some heads attend very locally while others already attend very far away, which shows that self-attention can attend to global information right from the start. In the later layers, the attention distance of all heads becomes very large, which shows that the network no longer relies on nearby pixels and has learned high-level semantic information (it has learned the concept of semantics).

[Figure: image-20220406103006487]
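
A sketch of how such a mean attention distance can be computed from one head's attention weights (the exact averaging in the paper may differ; this is only an illustration):

```python
import torch

def mean_attention_distance(attn, grid=14, patch_px=16):
    """attn: (tokens, tokens) attention weights for one head over the patch tokens."""
    coords = torch.stack(torch.meshgrid(torch.arange(grid), torch.arange(grid),
                                        indexing="ij"), dim=-1).reshape(-1, 2).float()
    # Pairwise pixel distance between the centers of every pair of patches.
    dist = torch.cdist(coords, coords) * patch_px              # (196, 196)
    # Weight each distance by how much attention flows between the two patches,
    # then average the expected distance over all query patches.
    return (attn * dist).sum(dim=-1).mean().item()

attn = torch.softmax(torch.randn(196, 196), dim=-1)            # stand-in attention map
print(mean_attention_distance(attn))
```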

In order to verify the above point of view, the author mapped the output token to the original image, as shown below:

[Figure: image-20220406103521618]

We find that some heads already attend to most of the image in the lowest layers, which shows that the model does use its ability to integrate information globally. Other attention heads have consistently small attention distances in the lower layers. This highly localized attention is less pronounced in the hybrid model, where a ResNet is applied before the Transformer (Figure 7, right), suggesting that it may serve a function similar to the early convolutional layers of a CNN. Furthermore, the attention distance increases with network depth. Globally, we find that the model attends to image regions that are semantically relevant for classification (Figure 6). (As the attention distance between heads grows, the network stops relying on relationships between neighboring pixels for classification and instead uses higher-level semantic features.)

[Figure: image-20220406103705341]

4.6. Self-supervision (self-supervision)

Transformers perform well on NLP tasks. However, much of their success stems not only from their excellent scalability but also from large-scale self-supervised pre-training. We therefore mimic the masked language modeling task used in BERT and perform a preliminary exploration of masked patch prediction for self-supervision. With self-supervised pre-training, our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant 2% improvement over training from scratch, but still 4% behind supervised pre-training. Appendix B.1.2 contains more details. We leave the exploration of contrastive pre-training (Chen et al., 2020b; He et al., 2020; Bachman et al., 2019; Hénaff et al., 2020) to future work. (Contrastive learning, transfer learning, and transfer to other domains are all big pits left to be filled.)
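
Appendix B.1.2 describes masked patch prediction as corrupting 50% of the patch embeddings and predicting the (3-bit) mean color of each masked patch. Below is a rough sketch of just the masking step (the helper name and the mask-token handling are illustrative, not the paper's code):

```python
import torch

def mask_patches(patch_tokens, mask_token, mask_ratio=0.5):
    """Replace a random 50% of patch embeddings with a learnable [mask] embedding."""
    B, N, D = patch_tokens.shape
    masked = torch.rand(B, N) < mask_ratio   # which patches to corrupt
    out = patch_tokens.clone()
    out[masked] = mask_token                 # (D,) learnable embedding, broadcast in
    return out, masked                       # encoder input + positions to predict

tokens = torch.randn(4, 196, 768)
corrupted, masked = mask_patches(tokens, torch.zeros(768))
# A prediction head on the corrupted positions would then regress/classify each
# masked patch's mean color, with the loss taken only over the masked positions.
```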

5、Conclusion

We explore the direct application of Transformers to image recognition. Unlike prior work using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture beyond the initial patch extraction step. Instead, we interpret an image as a sequence of patches and process it with a standard Transformer encoder as used in NLP. This simple yet scalable strategy works surprisingly well when combined with pre-training on large datasets. Thus, the Vision Transformer matches or exceeds the state of the art on many image classification datasets while being relatively cheap to pre-train. (This sums up the results well.)

While these preliminary results are encouraging, many challenges remain. One is to apply ViT to other computer vision tasks such as detection and segmentation. Our results, together with those of Carion et al. (2020), show the promise of this approach. Another challenge is the continued exploration of self-supervised pre-training methods. Our preliminary experiments show improvements from self-supervised pre-training, but there is still a large gap between self-supervised and large-scale supervised pre-training. Finally, scaling ViT further is likely to improve performance. (More pits dug: ViT can also do a lot in other areas.)

Source: blog.csdn.net/charles_zhang_/article/details/123989506