Vision Transformer Study Notes

foreword

This post records my process of studying the Vision Transformer paper. It collects and organizes material from various web resources for easy future reference.


Table of contents

foreword

1. Learning links

1. Main learning resources for this post

2. Worthwhile ViT learning links

3. Links for related background knowledge

2. Paper study notes

1. About VIT

2. Title

3. Abstract

4. Introduction

5. Conclusion

6. Related Work

7. Method

7.1 Vision Transformer (VIT)

7.2 Fine-tuning and Higher Resolution

8. Experiments

8.1 Setup 

8.1.1 Datasets

8.1.2 Model Variants

8.2 Comparison to State of the Art

8.3 Pre-training Data Requirements

8.4 Scaling Study

8.5 Inspecting Vision Transformer

8.6 Self-Supervision

3. Summary of Vision Transformer


1. Learning links

1. Main learning resources for this post

  Mu Li's paragraph-by-paragraph intensive reading of the ViT paper (from his "Intensive Paper Reading" series)

  VIT paper link: https://arxiv.org/pdf/2010.11929.pdf

  VIT source link: https://github.com/rwightman/pytorch-image-models

  VIT paper translation: https://blog.csdn.net/jjw_zyfx/article/details/125036387

2. Worthwhile ViT learning links

[ViT Model] How does the arrogant Transformer "wave"! : https://www.bilibili.com/video/BV13B4y1x7jQ 

【How much do you know】What is ViT (Vision Transformer)? : https://www.bilibili.com/video/BV18u411m7PY

  ViT (Vision Transformer) analysis: https://zhuanlan.zhihu.com/p/445122996

3. Links for related background knowledge

  Inductive Bias: https://blog.csdn.net/qq_39478403/article/details/121107057


2. Paper study notes

1. About VIT

Vision Transformer: Opening a New Era in CV Field

  • Challenged the absolute dominance that CNNs have held in the CV field since AlexNet in 2012
  • With enough pre-training data, an NLP-style Transformer moved to CV can also achieve strong results
  • VIT not only breaks the barriers of CV and NLP, but also has great potential in the field of multimodality

The impact of Vision Transformer: see the Papers with Code website (which aggregates leaderboards of the best-performing methods for a given task or dataset)

  • On the ImageNet leaderboard of Papers with Code, the top entries are all based on Vision Transformer
  • On the COCO (object detection) leaderboard, the top entries are all based on Swin Transformer (ICCV 2021 best paper: a multi-scale ViT)
  • It can also be applied in visual fields such as semantic segmentation, instance segmentation, video, medical imaging, and remote sensing

Examples where CNNs perform poorly but ViT performs well

  • For example: occlusion, distribution shift (texture removal), an adversarial patch pasted on a bird's head, and images cut into patches and shuffled

2. Title

The authors are from Google Research and the Google Brain team.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Translation: an image is equivalent to many 16x16 words; Transformers for large-scale image recognition

3. Abstract

Paper translation:

  • The Transformer architecture has become a standard building block in NLP, but its application in CV is still limited. In vision, attention is either used in combination with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure unchanged.
  • This paper shows that the reliance on CNNs is not necessary and that pure transformers applied directly to sequences of image patches can also work well in image classification tasks.
  • When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks, the Vision Transformer (ViT) attains excellent results compared with SOTA convolutional networks while requiring fewer computational resources to train.

Q1: How is the attention in CV used?

  • attention + CNN, or attention replaces CNN components but still maintains the overall structure of CNN.

Q2: How to understand that the overall structure of CNN remains unchanged?

  • For example, ResNet-50 has 4 stages (res2, res3, res4, res5); the stage layout stays the same, and attention only replaces operations inside the blocks of each stage without changing the overall structure.

Q3: Does VIT require less computing resources for training?

  • Only relatively. "Fewer training resources" here still means on the order of 2,500 TPUv3-core-days; it is fewer only in comparison with even more expensive baselines.

4. Introduction

Paper translation:

  • Architectures based on self-attention, in particular Transformers, have become the model of choice in NLP. The dominant approach, popularized by BERT, is to pre-train on a large corpus and then fine-tune on a smaller task-specific dataset. Thanks to the computational efficiency and scalability of Transformers, it is now possible to train models with more than 100 billion parameters, and there is still no sign of performance saturation as models and datasets keep growing.
  • In computer vision, however, convolutional architectures still dominate. Inspired by the success in NLP, multiple works have tried to combine CNN-like structures with attention mechanisms (Wang et al., 2018; Carion et al., 2020), and some have replaced convolutions entirely (Ramachandran et al., 2019; Wang et al., 2020a). These fully attention-based models, while theoretically attractive, have not yet been scaled effectively on modern hardware accelerators because they rely on specialized attention patterns. In large-scale image recognition, classic ResNet-like architectures therefore remain state of the art.
  • Inspired by the success of scaling Transformers in NLP, we try to apply a standard Transformer directly to images with as few modifications as possible. To do so, we split an image into patches and feed the sequence of linear embeddings of these patches into the Transformer. Image patches are treated the same way as tokens (words) in NLP. We train the model on image classification in a supervised fashion.
  • When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies, a few percentage points below ResNets of comparable size. This seemingly discouraging result is to be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.
  • However, the picture changes if the models are trained on larger datasets (14M to 300M images). We find that large-scale training trumps the inductive biases of CNNs. Our Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer data points. When pre-trained on the public ImageNet-21k dataset or on Google's in-house JFT-300M dataset, ViT approaches or beats the state of the art on multiple image recognition benchmarks. In particular, the best model reaches 88.55% accuracy on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the 19-task VTAB suite.

Q1: What is the phenomenon of performance saturation?

  • For many models, simply enlarging the dataset or the model stops yielding better results; scaling up the model in particular easily leads to overfitting. Transformers in NLP have not yet shown this kind of saturation.

Q2: Is there any difficulty in applying Transformer to CV?

  • Yes. Computing self-attention over raw pixels produces a very long sequence, and the cost explodes: the complexity of self-attention is quadratic in the sequence length n, i.e. O(n^2). The sequence lengths current hardware can support are typically a few hundred to a few thousand tokens (BERT uses 512). In computer vision we must convert a 2D image into a 1D sequence, and the most intuitive way is to treat every pixel as a sequence element; but for classification, a 224x224 image already contains 50,176 pixels, roughly 100 times BERT's sequence length. For detection or segmentation, inputs of 600x600 or larger make the cost even higher (a rough cost comparison is sketched below).
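A quick back-of-envelope comparison of the two tokenizations (my own sketch, ignoring constant factors and the MLP blocks; 16x16 is the patch size ViT uses):

```python
# Self-attention cost grows as O(n^2 * d) in the sequence length n.
def attn_cost(n, d=768):
    return n * n * d                      # ignoring constants and the MLP blocks

pixels  = 224 * 224                       # every pixel as a token -> 50,176 tokens
patches = (224 // 16) ** 2                # 16x16 patches          ->    196 tokens

print(pixels / 512)                             # ~98x BERT's sequence length of 512
print(attn_cost(pixels) / attn_cost(patches))   # ~65,536x more expensive than patches
```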

Q3: How is the structure of CNN generally combined with the attention mechanism?

  • Since the pixel sequence is too long, one can find ways to shorten it. For example, instead of feeding raw pixels, the intermediate feature map of a CNN can be used as the Transformer input, as in the CVPR 2018 Non-local Networks example (Wang et al., 2018) cited in the paper: the res4 stage produces a 14x14 feature map, so the flattened sequence has only 196 elements.
  • The paper lists two families of approaches for shortening the sequence: using a feature map as the Transformer input (Wang et al., 2018), and replacing convolution entirely with self-attention (Ramachandran et al., 2019, stand-alone attention; Wang et al., 2020, axial attention). Stand-alone attention restricts self-attention to a local window to control the computational cost, which is quite similar to convolution: convolution is also local, a sliding local window. Axial attention decomposes the 2D attention into two 1D operations: because the sequence length n = H * W of an image is too long, the 2D grid is split into its two axes, and self-attention is applied first along the height (H) dimension and then along the width (W) dimension, which greatly reduces the cost (a minimal sketch follows).
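A minimal sketch of the axial-attention idea (my own illustration, not the official implementation): attention runs over one spatial axis at a time, so each step is quadratic in H or W rather than in H*W.

```python
import torch
import torch.nn as nn

class AxialAttention2D(nn.Module):
    """Toy axial attention: self-attention along W (per row), then along H (per column)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                   # x: (B, H, W, C)
        B, H, W, C = x.shape
        rows = x.reshape(B * H, W, C)                       # each row is a length-W sequence
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(B, H, W, C)
        cols = x.permute(0, 2, 1, 3).reshape(B * W, H, C)   # each column is a length-H sequence
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(B, W, H, C).permute(0, 2, 1, 3)

x = torch.randn(2, 14, 14, 64)
print(AxialAttention2D(64)(x).shape)                        # torch.Size([2, 14, 14, 64])
```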

Q4: How to apply a standard Transformer directly to an image with as little modification as possible?

  • Divide the image into patches, each of size 16x16. For example, with a 224x224 input image split into fixed 16x16 patches, each image yields 224x224 / (16x16) = 196 patches, i.e. an input sequence of length 196 (see the small check below).
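The patch arithmetic can be verified directly in PyTorch (a small sketch; `unfold` is just one convenient way to cut out non-overlapping 16x16 patches):

```python
import torch

img = torch.randn(1, 3, 224, 224)                      # (B, C, H, W)
P = 16
patches = img.unfold(2, P, P).unfold(3, P, P)          # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * P * P)
print(patches.shape)                                   # torch.Size([1, 196, 768])
```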

Q5: Why is VIT less accurate than CNN? What is inductive bias?

  • The Transformer has fewer inductive biases than a CNN. An inductive bias is prior knowledge, or an assumption built into the model in advance. The inductive biases CNNs rely on are locality and translation equivariance.
  • Locality: a CNN slides a convolution window over the image, assuming that neighboring regions share related features. For example, a table and a chair are likely to appear together; the closer two items are, the more correlated they tend to be.
  • Translation equivariance: f(g(x)) = g(f(x)), where f is convolution and g is translation. Whether you translate first and then convolve, or convolve first and then translate, the result is the same, because a convolution kernel acts like a template: wherever the same object moves, it meets the same kernel and produces the same output (a small numerical check is sketched after this list).
  • With the priors of locality and translation equivariance, a CNN carries a lot of prior information and needs relatively little data to learn a good model. The Transformer has no such priors and must learn its perception of the visual world entirely from the image data.
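A tiny numerical check of f(g(x)) = g(f(x)) (my own sketch: a circular shift together with circular padding makes the identity exact, whereas with ordinary zero padding it only holds away from the image borders):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)                                          # a small "image"
w = torch.randn(1, 1, 3, 3)                                          # a 3x3 convolution kernel

g = lambda t: torch.roll(t, shifts=2, dims=-1)                       # translation g (circular shift)
f = lambda t: F.conv2d(F.pad(t, (1, 1, 1, 1), mode="circular"), w)   # convolution f (circular padding)

print(torch.allclose(f(g(x)), g(f(x)), atol=1e-5))                   # True: f(g(x)) == g(f(x))
```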

Summary of the introduction:

  1. The first paragraph: Transformer scales very well in NLP; with big models and big datasets it has not saturated and performance keeps improving. Can the Transformer also be scaled up like this in CV?
  2. The second paragraph: previous work. Who has explored similar ideas, and how does this work differ from theirs? Previous work was either CNN + attention or attention in place of CNN components; no prior work applied a standard Transformer directly in the CV field and obtained good scaling results.
  3. The third paragraph: Vision Transformer uses the standard Transformer model; the image is simply split into patches, which are fed to the Transformer. A CV problem is thus treated as an NLP problem, and ViT bridges the CV and NLP fields.
  4. The fourth and fifth paragraphs: the results. With enough pre-training data, ViT achieves very good results.

5. Conclusion

Paper translation:

  • We have explored the direct application of Transformers to image recognition. Unlike previous uses of self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture beyond the initial patch extraction step. Instead, we decompose the image into a sequence of patches and process it with a standard Transformer encoder, just as in NLP. This simple yet scalable strategy works surprisingly well when combined with pre-training on large datasets. Vision Transformer thus matches or exceeds the best models on many image classification datasets, while being relatively cheap to pre-train.
  • Although these initial results are encouraging, many challenges remain. One is to apply ViT to other vision tasks, such as detection and segmentation. Our results, together with those of Carion et al. (2020), suggest this is promising. Another challenge is to keep exploring self-supervised pre-training methods: our preliminary experiments show improvement from self-supervised pre-training, but there is still a large gap between self-supervised and large-scale supervised pre-training. Finally, further scaling of ViT would likely improve performance.

Q1: Can Transformer do CV?

  • The question raised right after ViT's release: besides classification, can Vision Transformer handle the other two mainstream vision tasks, segmentation and object detection? DETR (Carion et al., 2020), a landmark work in object detection, had already changed the way bounding boxes are predicted, suggesting that ViT should also work well on other CV tasks.
  • Indeed, about a month and a half after ViT appeared (December 2020), ViT-FRCNN applied ViT to detection and SETR applied ViT to segmentation. Three months later came Swin Transformer, which combines the Transformer with a multi-scale design and proves that the Transformer can serve as a general-purpose backbone in the CV field.
  • Another future direction of work is self-supervised pre-training. The large transformer models of NLP all use self-supervised pre-training. ViT has initial experiments to prove that self-supervised pre-training is also possible, but there is a gap with supervised training. It is possible to achieve better results by making the ViT larger.
  • ViT lays a foundation for the CV field and stimulates follow-up exploration, such as further vision tasks and multimodal tasks: can a single Transformer handle both CV and NLP?

6. Related Work

Paper translation:

  • The Transformer was proposed by Vaswani et al. (2017) for machine translation and has since become the state-of-the-art method for many NLP tasks. Large Transformer-based models are usually pre-trained on large corpora and then fine-tuned for the task at hand: BERT (Devlin et al., 2019) uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task (Radford et al., 2018; 2019; Brown et al., 2020).
  • Naively applying self-attention to images would require each pixel to attend to every other pixel. Because this cost is quadratic in the number of pixels, it does not scale to realistic input sizes. Several approaches have therefore been tried for applying Transformers to images: using self-attention only in a local neighborhood of each pixel rather than globally (such local multi-head dot-product self-attention blocks can completely replace convolutions); Sparse Transformers, which employ scalable approximations of global self-attention; or applying attention in blocks of varying sizes. Many of these specialized attention architectures show promising results on computer vision tasks, but they require complex engineering to run efficiently on hardware accelerators.
  • The model closest to ours is that of Cordonnier et al. (2020), which extracts 2x2 patches from the input image and applies full self-attention on top. That model is very similar to ViT, but our work further demonstrates that large-scale pre-training lets a vanilla Transformer compete with state-of-the-art CNNs. Moreover, Cordonnier et al. (2020) use a small patch size of 2x2 pixels, which restricts the model to small-resolution images, while ours also handles medium-resolution images.
  • There is also a lot of interest in combining convolutional neural networks with self-attention, for example by augmenting feature maps for image classification (Bello et al., 2019), or by further processing CNN outputs with self-attention, e.g. for object detection, video processing, image classification, unsupervised object discovery, or unified text-vision tasks.
  • Another related model is image GPT (iGPT), which applies Transformers to image pixels after reducing the image resolution and color space. iGPT is trained in an unsupervised fashion as a generative model, and the resulting representation is then fine-tuned or linearly probed for classification, achieving a best accuracy of 72% on ImageNet.
  • Our work explores image recognition on datasets larger than the standard ImageNet dataset; using additional data sources allows state-of-the-art results on standard benchmarks. Sun et al. (2017) study how CNN performance scales with dataset size, and Kolesnikov et al. (2020) and Djolonga et al. (2020) provide empirical explorations of CNN transfer learning from large datasets such as ImageNet-21k and JFT-300M. We also focus on these two datasets, but train Transformers instead of the ResNet-based models used in prior work.

Related work summary:

  1. The first paragraph: Application of Transformer in the field of NLP: BERT, GPT
  2. The second paragraph: the application of self-attention in the field of vision, in order to solve the "dimension explosion problem", some methods proposed in the past are listed.
  3. The third paragraph: introduces the model of Cordonnier et al. (2020), which is very similar to VIT, but the training scale is not large enough.
  4. Fourth paragraph: points out that there is a lot of interest in combining convolutional neural networks with self-attention, covering detection, classification, video processing, multimodality, etc.
  5. Fifth paragraph: Another similar work: image GPT, GPT is a generative model of NLP, unsupervised pre-training, ImageNet accuracy rate is 72%.
  6. Paragraph 6: VIT focuses on the two data sets of ImageNet-21k and JFT-300M, using Transformers.

7. Method

In model design, we follow the original Transformer as closely as possible. An advantage of this intentionally simple setup is that scalable NLP Transformer architectures, together with their efficient implementations, can be used almost out of the box.

 Figure 1: Model overview. We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors into a standard Transformer encoder. To perform classification, we follow the standard approach of adding an extra learnable "classification token" to the sequence. The illustration of the Transformer encoder is inspired by Vaswani et al. (2017).

7.1 Vision Transformer (VIT)

Paper translation:

  • Figure 1 gives an overview of the model. The standard Transformer takes a 1D sequence of token embeddings as input. To handle 2D images, we reshape the image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW / P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses a constant latent vector size $D$ through all of its layers, so we flatten the patches and map them to $D$ dimensions with a trainable linear projection (Eq. 1). We refer to the output of this projection as the patch embeddings.
  • Similar to BERT's [class] token, we prepend a learnable embedding $z_0^0 = x_{class}$ to the sequence of patch embeddings; its state at the output of the Transformer encoder, $z_L^0$, serves as the image representation $y$ (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to $z_L^0$: it is implemented as an MLP with one hidden layer during pre-training and as a single linear layer during fine-tuning.
  • Position embeddings are added to the patch embeddings to retain positional information. We use standard learnable 1D position embeddings, since we did not observe significant performance gains from more advanced 2D-aware position embeddings (Appendix D.4). The resulting sequence of embedding vectors serves as the input to the encoder.
  • The Transformer encoder consists of alternating layers of multi-head self-attention (MSA) and MLP blocks. Layer normalization (LN) is applied before every block, and residual connections are used after every block. The MLP contains two layers with a GELU nonlinearity (a minimal sketch of such a block is given after this list).

  • Inductive bias: We note that Vision Transformer has much less image-specific inductive bias than CNNs. In a CNN, locality, the 2D neighborhood structure, and translation equivariance are baked into every layer throughout the whole model. In ViT, only the MLP layers are local and translation equivariant, while the self-attention layers are global. The 2D neighborhood structure is used very sparingly: at the beginning of the model, when the image is cut into patches, and at fine-tuning time, when the position embeddings are adjusted for images of different resolution (as described below). Apart from that, the position embeddings at initialization carry no information about the 2D positions of the patches, and all spatial relations between the patches have to be learned from scratch.
  • Hybrid architecture: As an alternative to raw image patches, the input sequence can be formed from the feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection E (Eq. 1) is applied to patches extracted from a CNN feature map. As a special case, the patches can have spatial size 1x1, which means the input sequence is obtained by simply flattening the spatial dimensions of the feature map and projecting to the Transformer dimension. The classification input embedding and position embeddings are added as described above.
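A minimal pre-norm encoder block matching the description in the list above (LN before each sub-block, residual connection after it, two-layer MLP with GELU). This is my own sketch rather than the official implementation:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT-style block: LN -> multi-head self-attention -> residual, LN -> MLP(GELU) -> residual."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                                   # x: (B, N, D)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # MSA with residual
        x = x + self.mlp(self.ln2(x))                       # MLP with residual
        return x

tokens = torch.randn(1, 197, 768)                           # 196 patches + 1 [class] token
print(EncoderBlock()(tokens).shape)                         # torch.Size([1, 197, 768])
```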

Q1: The overall process of VIT:

  • The overall pipeline: given an image, split it into patches (a 3x3 grid in Figure 1) and turn them into a sequence; each patch goes through a linear projection layer to produce a feature (the patch embedding). After position embeddings are added, the tokens are fed into the Transformer encoder, which produces one output per token. Which output should be used for classification? Following BERT, an extra learnable embedding (the special [cls] token, drawn as * in the figure) is prepended; it also receives a position embedding, with position index 0. Because every token attends to every other token, the [cls] token can gather useful information from all the other embeddings, so the final classification decision is made from its output alone. The MLP head on top is an ordinary classification head, and the model is trained with a cross-entropy loss.
  • Recommended video (source: "[ViT Model] How does the arrogant Transformer 'wave'!" on bilibili)

The overall process of VIT

Q2: VIT forward process (Vision problem becomes NLP problem):

  1. Patch embedding: with a 224x224 input image split into fixed 16x16 patches, each image yields 224x224 / (16x16) = 196 patches, so the input sequence length is 196 and each patch has dimension 16x16x3 = 768. The linear projection layer is a 768x768 matrix (D = 768), so after the projection the shape is still 196x768: 196 tokens, each of dimension 768. A special [cls] token is then prepended, giving a final shape of 197x768. At this point, patch embedding has turned a vision problem into an NLP problem.
  2. Positional encoding (standard learnable 1D position embeddings): ViT also adds position embeddings. The position embedding can be viewed as a table with N rows, where N equals the input sequence length and each row is a vector of the same dimension as the patch embedding (768). Note that the position embedding is added (summed), not concatenated, so after adding position information the shape is still 197x768.
  3. LN / multi-head attention / LN: the LN output is still 197x768. In multi-head self-attention the input is first mapped to q, k, v; with a single head, q, k, v would each be 197x768, but with 12 heads each head's q, k, v are 197x64 (768/12 = 64). The outputs of the 12 heads are concatenated back to 197x768, and after another LN the shape is still 197x768.
  4. MLP: the dimension is expanded and then shrunk back: 197x768 is enlarged to 197x3072 and then reduced to 197x768.

After one block, the shape is the same as the input, 197x768, so multiple blocks can be stacked. Finally, the output corresponding to the special [cls] token, $z_L^0$, is used as the final output of the encoder, i.e. the image representation (an alternative is to average the outputs of all tokens instead of adding a [cls] token), as in the paper's formula; an MLP head then classifies the image. (The shape bookkeeping is traced in the sketch below.)
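The shapes in steps 1-4 can be traced end to end with a short sketch. Assumption: PyTorch's built-in nn.TransformerEncoderLayer (with norm_first=True and GELU) stands in for a ViT block; the real implementation is in the pytorch-image-models repository linked above.

```python
import torch
import torch.nn as nn

B, P, D = 1, 16, 768
img = torch.randn(B, 3, 224, 224)

# 1. patch embedding: 196 patches of dimension 16*16*3 = 768, projected to D = 768
patches = img.unfold(2, P, P).unfold(3, P, P).permute(0, 2, 3, 1, 4, 5).reshape(B, 196, 768)
tokens = nn.Linear(768, D)(patches)                       # (1, 196, 768)

# prepend the learnable [cls] token -> 197 tokens
cls = nn.Parameter(torch.zeros(1, 1, D)).expand(B, -1, -1)
tokens = torch.cat([cls, tokens], dim=1)                  # (1, 197, 768)

# 2. add (not concatenate) the learnable 1D position embedding
tokens = tokens + nn.Parameter(torch.zeros(1, 197, D))    # (1, 197, 768)

# 3.-4. one pre-norm encoder block: LN / 12-head MSA / LN / MLP 768 -> 3072 -> 768
block = nn.TransformerEncoderLayer(d_model=D, nhead=12, dim_feedforward=3072,
                                   activation="gelu", batch_first=True, norm_first=True)
out = block(tokens)                                       # (1, 197, 768)

# classification head on the [cls] output
logits = nn.Linear(D, 1000)(out[:, 0])                    # (1, 1000)
print(out.shape, logits.shape)
```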

Q3: Inductive bias:

  • References: https://blog.csdn.net/qq_39478403/article/details/121107057
  • CNN's inductive biases, locality and translation equivariance, are present in every layer of the model, so this prior knowledge runs through the entire network from start to finish. In ViT, only the MLP layers are local and translation equivariant, while the self-attention layers are global; in other words, ViT has far less inductive bias than a CNN (only the MLP part). The 2D position information of the patches and the spatial relations between patches have to be learned from scratch. Because ViT carries so little inductive bias, training it on small and medium-sized datasets gives worse results than a CNN.

Q4: Hybrid architecture

Because the Transformer has strong global modeling ability, while a CNN is more data-efficient and needs less training data, it is natural to consider a CNN front end followed by a Transformer. The experiments compare two image preprocessing pipelines: ViT's own pipeline, which splits the image into patches and passes them directly through a fully connected (linear projection) layer; and the hybrid pipeline, which does not split patches but instead runs a CNN (ResNet-50, whose feature map is 14x14 = 196 positions) and then applies the linear projection E to obtain the patch embeddings. (A sketch of the hybrid input pipeline follows.)
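A rough illustration of the hybrid input pipeline, assuming torchvision's standard resnet50 (the paper actually uses a modified, BiT-style ResNet, so this is only a sketch): take the 14x14 feature map and project it with E to get 196 tokens.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()
# keep everything up to (and including) layer3 -> a 14x14 feature map for a 224x224 input
stem = nn.Sequential(*list(backbone.children())[:-3])     # drop layer4, avgpool, fc
proj = nn.Linear(1024, 768)                               # "E": linear projection to the Transformer width

x = torch.randn(1, 3, 224, 224)
fmap = stem(x)                                            # (1, 1024, 14, 14)
tokens = proj(fmap.flatten(2).transpose(1, 2))            # (1, 196, 768)
print(fmap.shape, tokens.shape)
```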

7.2 Fine-tuning and Higher Resolution

Paper translation:

  • Typically, we pre-train ViT on large datasets and fine-tune it on (smaller) downstream tasks. To do this, we remove the pre-trained prediction head and attach a zero-initialized D x K feed-forward layer, where K is the number of downstream classes. It is often beneficial to fine-tune at a higher resolution than was used for pre-training (Touvron et al., 2019; Kolesnikov et al., 2020). When feeding higher-resolution images we keep the patch size the same, which yields a longer effective sequence length. ViT can handle arbitrary sequence lengths (up to memory limits), but the pre-trained position embeddings may no longer be meaningful. We therefore perform 2D interpolation of the pre-trained position embeddings according to their location in the original image. Note: this resolution adjustment, together with the patch extraction, is the only point at which an inductive bias about the 2D structure of the image is manually injected into the Vision Transformer.

Q1: When the number of patches increases, how can the pre-trained position embeddings be reused?

In theory, the Transformer can handle sequences of any length, but the position embeddings trained in advance may become invalid: for instance, if the original 3x3 grid of patches (positions 1-9) grows, the position encodings would need to cover 1-25. So how do we reuse the pre-trained position codes when the number of patches increases? A simple 2D interpolation, implemented with torch's official interpolate function, is used (a small sketch follows). However, not every length increase preserves accuracy: going from a short sequence to a much longer one (e.g. 256 → 512 → 768) and interpolating directly degrades the final performance, so interpolation is only a stop-gap and a limitation of ViT fine-tuning. Because the 2D layout of the image is used to do this interpolation, resolution adjustment and patch extraction are the only places where Vision Transformer injects an inductive bias based on the image's 2D structure.
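A minimal sketch of this 2D interpolation of position embeddings (my own code built on torch.nn.functional.interpolate, not the authors'); the [cls] position is kept as-is and only the 14x14 grid part is resized:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_grid):              # pos_embed: (1, 1 + N, D)
    cls_tok, grid_tok = pos_embed[:, :1], pos_embed[:, 1:]
    old = int(grid_tok.shape[1] ** 0.5)                 # e.g. 14 for 196 patches
    grid_tok = grid_tok.reshape(1, old, old, -1).permute(0, 3, 1, 2)
    grid_tok = F.interpolate(grid_tok, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    grid_tok = grid_tok.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, -1)
    return torch.cat([cls_tok, grid_tok], dim=1)

pe = torch.randn(1, 197, 768)                           # pre-trained for a 14x14 patch grid
print(resize_pos_embed(pe, 24).shape)                   # torch.Size([1, 577, 768])
```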


8. Experiments

Paper translation:

  • We evaluate the representation learning capabilities of ResNet, Vision Transformer (ViT), and hybrids of the two. To understand the data requirements of each model, we pre-train on very large datasets and evaluate on many benchmark tasks. When the computational cost of pre-training is taken into account, ViT performs very favorably, reaching state of the art on most recognition benchmarks at a lower pre-training cost. Finally, we run a small self-supervision experiment and show that self-supervised ViT holds promise for the future.

8.1 Setup 

8.1.1 Datasets

  • To explore model scalability, three datasets are used for pre-training: ImageNet-1k (1.3M images), ImageNet-21k (14M images), and JFT (18k classes, 303M images). Following BiT, the pre-training datasets are de-duplicated with respect to the test sets of the downstream tasks.
  • Downstream datasets include: ImageNet (on the original validation labels), ImageNet (on the cleaned-up ReaL labels), CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102, VTAB (19 tasks)
  • ImageNet ReaL reference 2020-Are we done with imagenet?  VTAB reference 2019-A large-scale study of representation learning with the visual task adaptation benchmark , preprocessing of all datasets reference BiT

8.1.2 Model Variants

  1. ViT: Following BERT, three model variants are defined (with an additional Huge variant), as listed in Table 1 below. For example, ViT-L/16 denotes the Large variant with a 16x16 input patch size.
  2. CNN: the baseline CNNs are ResNets, with Batch Normalization replaced by Group Normalization and with standardized convolutions, which improves transfer performance.
  3. Hybrid: the hybrid model feeds the feature maps output by ResNet50 into ViT; different stages produce feature maps of different sizes, i.e. sequences of different lengths.
Table 1: Details of Vision Transformer model variants
  • All models are trained with Adam (β1 = 0.9, β2 = 0.999), batch size 4096, and a high weight decay of 0.1, using a linear learning-rate warmup and decay; for fine-tuning, SGD with momentum and batch size 512 are used (see the sketch after this list).
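For reference, a sketch of the variant configurations reported in the paper's Table 1, plus an optimizer setup matching the hyperparameters above. The learning rate and step counts below are illustrative assumptions; the paper's exact values depend on the model/dataset combination.

```python
import torch

# ViT variants from Table 1 of the paper: layers, hidden size D, MLP size, heads, parameter count
vit_variants = {
    "ViT-Base":  dict(layers=12, hidden=768,  mlp=3072, heads=12, params="86M"),
    "ViT-Large": dict(layers=24, hidden=1024, mlp=4096, heads=16, params="307M"),
    "ViT-Huge":  dict(layers=32, hidden=1280, mlp=5120, heads=16, params="632M"),
}

model = torch.nn.Linear(768, 1000)                            # stand-in for the actual ViT parameters
optimizer = torch.optim.Adam(model.parameters(), lr=3e-3,     # lr is illustrative
                             betas=(0.9, 0.999), weight_decay=0.1)

# linear warmup followed by linear decay (illustrative step counts)
warmup_steps, total_steps = 10_000, 100_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: step / warmup_steps if step < warmup_steps
    else max(0.0, (total_steps - step) / (total_steps - warmup_steps)))
```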

8.2 Comparison to State of the Art

Table 2: Comparison of ViT with other SOTA models, showing the mean and standard deviation of accuracy, averaged over three fine-tuning runs. Vision Transformer models pre-trained on the JFT-300M dataset outperform ResNet-based baselines on all datasets while requiring substantially fewer computational resources to pre-train. ViT pre-trained on the smaller, public ImageNet-21k dataset also performs well. (A slightly improved 88.5% result is also noted.)
  • The ViT models pre-trained on the JFT dataset and transferred to downstream tasks outperform the ResNet-based BiT and the EfficientNet-based Noisy Student, while requiring less pre-training time. This highlights ViT's advantages: better accuracy and faster (cheaper) training.
Figure 2: Performance of various models on VTAB; ViT also performs better.
  • The experiments above show that ViT surpasses CNNs when pre-trained on very large datasets. Next, the effect of pre-training dataset size on model performance is explored (one cannot look only at the super-large-data regime).

8.3 Pre-training Data Requirements

Figure 3: Transfer to ImageNet. When pre-trained on small datasets, large ViT models perform worse than BiT ResNets (shaded area); when pre-trained on larger datasets, the opposite holds. Similarly, larger ViT variants overtake smaller ones as the dataset size grows.
  • Here, when pre-training on the smaller datasets (ImageNet), three regularization hyperparameters are tuned to help the model: weight decay, dropout, and label smoothing. The result: pre-trained on the small ImageNet-1k (1.3M images), fine-tuned ViT is clearly worse than ResNet; pre-trained on the medium ImageNet-21k (14M images), the two are comparable; pre-trained on the large JFT-300M (300M images), ViT is better. So when only a smaller dataset is available, ResNet is the more suitable choice (not every dataset should be forced onto a Transformer).
Figure 4: Linear few-shot evaluation on ImageNet versus pre-training dataset size. ResNets perform better on smaller pre-training datasets but plateau sooner than ViT, which performs better with larger pre-training. ViT-b is ViT-B with all hidden dimensions halved.
  • As shown above, subsets of different sizes (10M, 30M, 100M, 300M) are sampled from the same dataset (JFT) to avoid confounds between different datasets; no extra regularization is applied and the hyperparameters are kept the same. Linear evaluation means using the pre-trained model directly as a feature extractor, without fine-tuning, and fitting a simple linear (logistic-regression-style) classifier on the extracted features. Few-shot means that only five images per class are sampled for the evaluation.
  • When the dataset is small, the CNN pre-trained model performs better, confirming the value of CNN's inductive biases; but when the dataset is large enough, inductive bias no longer gives an advantage over the Transformer, and learning the patterns directly from data, even without that bias, works better. Looking closely, even with very large pre-training data, ViT's few-shot performance gain eventually flattens, so how to use ViT for few-shot learning remains a direction for further study.

8.4 Scaling Study

Figure 5: Performance versus pre-training compute for different architectures (Vision Transformers, ResNets, and hybrids of the two). At the same computational budget, Vision Transformers generally outperform ResNets; for smaller models the hybrid beats the pure Transformer, but for larger models the gap disappears.
  • The experiment above shows that ViT pre-training is cheaper than ResNet: at the same pre-training compute, ViT performs better. When models are small, the hybrid performs best, but as models grow, ViT overtakes the hybrid (which is somewhat counter-intuitive: the hybrid supposedly absorbs the strengths of both and should perform better).

8.5 Inspecting Vision Transformer

Figure 7: Left: filters of the initial linear embedding layer of ViT-L/32 applied to RGB patches. Middle: similarity of the position embeddings of ViT-L/32; each tile shows the cosine similarity between the position embedding of the patch at the indicated row and column and the position embeddings of all other patches. Right: size of the attended region by head and network depth. Each dot shows the mean attention distance of one of 16 heads at one layer, averaged over images; see the appendix for details.
  1. Left: the first 28 principal components of the filters in ViT's first (linear projection) layer. The figure shows that they are very similar to what a CNN learns: like Gabor filters, they capture color and texture and can serve as basis functions describing the low-level structure within each image patch.
  2. Middle: position embedding similarity analysis (cosine). The closer two patches are, the higher the similarity of their position embeddings, and patches in the same row/column have similar embeddings. This shows that although the position embedding is 1D, it has already learned to represent 2D image position and distance, which is why switching to a 2D position embedding brings little improvement.
  3. Right: to understand how self-attention aggregates information (i.e. whether self-attention actually works), the mean attention distance is computed for every head at every layer from the attention weights, analogous to the receptive field size in a CNN. Concretely, each attention weight is multiplied by the distance between the query patch and every other patch, and the results are averaged (a small sketch of this computation follows the list). The figure shows that already in the first layers some heads attend to almost the entire image, i.e. self-attention captures global information from the very start, whereas the receptive field of a CNN's first layer is tiny and only sees nearby pixels. As the network deepens, the learned features become more and more high-level and semantic, and the average attention distance keeps growing.
    Figure 6:  An example representation of attention from output tokens to input space.
  4. To show that self-attention can relate pixels that are far apart, the authors give Figure 6, produced from the output token of ViT's last layer: mapping the attention back onto the original input image shows that ViT has genuinely learned meaningful concepts such as the dog and the airplane. As the authors' closing sentence puts it, globally, the output token fuses global feature information, and the model attends to the image regions that are relevant for classification.
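A small sketch of the mean attention distance computation described in point 3 (my own reconstruction of the metric, not the authors' code; distances are measured in patch-grid units): each query's attention weights are multiplied by the distance to every other patch and then averaged.

```python
import torch

def mean_attention_distance(attn, grid):        # attn: (heads, N, N) over an N = grid*grid patch grid
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (N, 2) patch coordinates
    dist = torch.cdist(coords, coords)                                   # pairwise patch distances
    return (attn * dist).sum(dim=-1).mean(dim=-1)                        # per-head average distance

attn = torch.softmax(torch.randn(12, 196, 196), dim=-1)                  # dummy attention maps
print(mean_attention_distance(attn, 14))                                 # one value per head
```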

8.6 Self-Supervision

Transformers show impressive performance on NLP tasks, but their success stems not only from excellent scalability but also from large-scale self-supervised pre-training. We therefore also perform a preliminary exploration of masked patch prediction for self-supervision, mimicking the masked language modeling task used in BERT. With self-supervised pre-training, our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant 2% improvement over training from scratch, but still 4% behind supervised pre-training. Appendix B.1.2 contains more details. We leave the exploration of contrastive pre-training to future work.


3. Summary of Vision Transformer

  1. Writing angle: concise and clear, with good prioritization (important results in the main text) and clean figures and tables.
  2. Content angle: Vision Transformer opens up a huge space for follow-up work: it can be analyzed, improved, or extended from many angles.
  3. Task angle: ViT only does classification, so detection, segmentation, and tasks in other fields can follow.
  4. Architecture angle: the tokenization at the input, the Transformer blocks in the middle, and the training objective (supervised, or various self-supervised schemes) can all be improved.
  5. ViT bridges the gap between CV and NLP and opens an even larger multimodal direction: signals of all modalities, such as video, audio, or touch, can be used.

Source: https://blog.csdn.net/weixin_44074191/article/details/127512506