Vision Transformer paper + detailed explanation (ViT)

The paper is titled "AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE". As the name suggests, ViT divides the image into 16x16 patches and then treats these patches as the input tokens of a Transformer, just like words in a sentence. Let's go through the paper together.

Paper address: https://arxiv.org/pdf/2010.11929.pdf

PyTorch source code: written by rwightman and available in the timm library

tf source code: https://github.com/google-research/vision_transformer

Table of contents

Abstract

1 Introduction

 2 Related Work

3 Method 

3.1 VISION TRANSFORMER (ViT)

Embedding layer

Transformer Encoder

MLP Head

3.2 FINE-TUNING AND HIGHER RESOLUTION

4 Experiments

5 Conclusion


Abstract

The abstract essentially says one thing: in the vision domain, a pure Transformer can achieve better results than CNNs.

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its application in computer vision remains limited. In vision, attention is either used in conjunction with convolutional networks or used to replace some components of convolutional networks while keeping their overall structure unchanged. We show that this reliance on CNNs is unnecessary and that pure transformers applied directly to sequences of image patches can perform remarkably well on image classification tasks. When pretrained on large amounts of data and transferred to several medium or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), the Vision Transformer (ViT) achieves excellent results compared to state-of-the-art convolutional neural networks, while requiring substantially fewer computational resources to train.

⚠️ Note: Be careful when the paper calls ImageNet a small or medium-sized dataset; that is by Google's standards, so don't project our own sense of scale onto it. Likewise, the "fewer training resources" mentioned here still means on the order of 2,500 TPUv3-core-days of training.

1 Introduction

The introduction mainly has these points:

  1. Transformers have been applied at massive scale in NLP, with good computational efficiency and scalability, and performance has not yet saturated as models grow (one reason for bringing the Transformer into the vision domain).
  2. The authors modify the Transformer as little as possible, so they can show that the good results come from the Transformer itself rather than from any vision-specific tricks added on top.
  3. Training on small and medium-sized datasets is indeed not as effective as ResNet and similar networks, because traditional CNNs carry two kinds of prior knowledge (inductive biases): translation equivariance (convolving then translating gives the same result as translating then convolving) and locality (neighboring pixels in an image are related). When training on a large enough dataset, these inductive biases are no longer needed, so training at larger scale can achieve better results than CNNs.

        Self-attention based architectures, especially Transformers, have become the model of choice for Natural Language Processing (NLP). The main approach is to pre-train on large text corpora and then fine-tune on smaller task-specific datasets. Due to the computational efficiency and scalability of Transformers, it is possible to train unprecedented models with over 100B parameters. There is still no sign of saturating performance as models and datasets grow.

        However, in computer vision, convolutional architectures still dominate. Inspired by the success of NLP, many works try to combine CNN-like architectures with self-attention, some of which completely replace convolutions. Some models, while theoretically efficient, have not yet scaled efficiently on modern hardware accelerators due to the use of specialized attention patterns. Therefore, in large-scale image recognition, the classic ResNet architecture is still state-of-the-art.

        Inspired by the scalability of Transformers in NLP, we try to apply a standard Transformer directly to images with as little modification as possible. To do this, we split the image into patches and provide the sequence of linear embeddings of these patches as input to the Transformer. Image patches are treated in the same way as tokens (words) in NLP applications. We train the model on image classification in a supervised manner.

        When trained on a medium-sized dataset such as ImageNet without strong regularization, these models produce accuracy that is several percentage points lower than a ResNet of comparable size. This seemingly dismal result may be expected: Transformers lack some of the inductive biases inherent in CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.

        However, things change if the model is trained on a larger dataset (14M-300M images). We find that training at scale outperforms inductive bias. Our Vision Transformer (ViT) achieves excellent results when pretrained at sufficient scale and transferred to tasks with fewer data points. When pretrained on the ImageNet-21k dataset or the JFT-300M dataset, ViT approaches or exceeds the state-of-the-art on multiple image recognition benchmarks. In particular, the best model achieves 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on VTAB of 19 tasks.

 2 Related Work

If you flatten an image directly into a sequence of pixels, the sequence length is far too long for a Transformer and the computational complexity is too high. That is why the model of Cordonnier et al. extracts patches of size 2 × 2 from the input image; a Transformer was effectively already being applied to image patches there, but Google had more resources. With 16 × 16 patches, 224 × 224 images can be processed, and very good results are achieved on large datasets.

         Transformer was proposed by Vaswani et al. for machine translation and has become the state-of-the-art method in many NLP tasks. Large Transformer-based models are often pretrained on large corpora and then fine-tuned for the task at hand: BERT uses a denoising self-supervised pretraining task, while the GPT line of work uses language modeling as its pretraining task.

        Simply applying self-attention to images requires every pixel to pay attention to every other pixel. This does not scale to practical input sizes due to the quadratic cost in the number of pixels. Therefore, to apply Transformers in the context of image processing, several approximations have been tried in the past. Parmar et al. apply self-attention only in the local neighborhood of each query pixel, rather than globally. Such local multi-head dot-product self-attention blocks can completely replace convolutions. In another line of work, Sparse Transformers (Child et al., 2019) employ a scalable approximation to global self-attention so that it works for images. Another way to scale attention is to apply it to patches of different sizes, in the extreme case only along a single axis. Many of these specialized attention architectures have shown promising results on computer vision tasks, but require complex engineering to be implemented efficiently on hardware accelerators.

        Most relevant to us is the model of Cordonnier et al. , which extracts patches of size 2 × 2 from the input image and applies full self-attention on top. The model is very similar to ViT, but our work further demonstrates that large-scale pre-training enables vanilla Transformers to compete (or even outperform) state-of-the-art CNNs. Furthermore, Cordonnier et al. use a patch size of 2 × 2 pixels, which makes the model only suitable for small-resolution images, whereas we also deal with medium-resolution images.

        There is also a lot of interest in combining convolutional neural networks (CNNs) with a form of self-attention, e.g. by augmenting feature maps for image classification or by using self-attention to further process the output of the CNN, e.g. for object detection (Hu et al., 2018; Carion et al., 2020), video processing (Wang et al., 2018; Sun et al., 2019), image classification (Wu et al., 2020), unsupervised object discovery (Locatello et al., 2020) or unified text vision tasks (Chen et al., 2020c; Lu et al., 2019; Li et al., 2019).

        Another recent related model is Image GPT (iGPT) (Chen et al., 2020a), which applies Transformers to image pixels after reducing image resolution and color space. The model is trained in an unsupervised manner as a generative model, and the resulting representations can then be fine-tuned or linearly probed to improve classification performance, achieving a maximum accuracy of 72% on ImageNet.

        Our work adds to a growing body of papers exploring image recognition on a larger scale than the standard ImageNet dataset. State-of-the-art results can be achieved on standard benchmarks using additional data sources (Mahajan et al., 2018; Touvron et al., 2019; Xie et al., 2020). Furthermore, Sun et al. (2017) study how CNN performance scales with dataset size, and Kolesnikov et al. (2020); Djolonga et al. (2020) conduct an empirical exploration of CNN transfer learning on large-scale datasets such as ImageNet-21k and JFT-300M. We also focus on the latter two datasets, but train Transformers instead of the ResNet-based models used in previous work.

3 Method 

        In model design, we follow the original Transformer (Vaswani et al., 2017) as much as possible. The advantage of this intentionally simple setup is that the scalable NLP Transformer architecture and its efficient implementation work almost out of the box.

3.1 VISION TRANSFORMER (ViT)

        An overview of the model is shown in Figure 1. The standard Transformer accepts a one-dimensional sequence of token embeddings as input. To process 2D images, we reshape the image x \in \mathbb{R}^{H \times W \times C} into a sequence of flattened 2D patches x_{p} \in \mathbb{R}^{N \times (P^{2} \cdot C)}, where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and N = HW/P^{2} is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses a constant latent vector size D across all of its layers, so we flatten the patches and map them to D dimensions with a trainable linear projection (Eq. 1). We refer to the output of this projection as the patch embeddings.

        Similar to BERT's [class] token, we prepend a learnable embedding to the sequence of embedded patches ( z_{0}^{0} = x_{class} ), whose state at the output of the Transformer encoder ( z_{L}^{0} ) serves as the image representation y (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to z_{L}^{0}. The classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.

        Position embedding is added to patch embedding to preserve position information. We use standard learnable 1D position embeddings, as we do not observe significant performance improvements using more advanced 2D-aware position embeddings (Appendix D.4). The resulting sequence of embedding vectors is used as input to the encoder.

        The Transformer encoder consists of alternating layers of multi-head self-attention and MLP blocks. Layernorm (LN) is applied before every block, and residual connections after every block.
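For reference, the forward pass just described corresponds to Eqs. 1-4 in the paper, where E is the patch projection, E_{pos} the position embeddings, and MSA the multi-head self-attention:

z_{0} = [x_{class}; x_{p}^{1}E; x_{p}^{2}E; \cdots; x_{p}^{N}E] + E_{pos}        (1)

z'_{\ell} = MSA(LN(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1 \dots L        (2)

z_{\ell} = MLP(LN(z'_{\ell})) + z'_{\ell}, \quad \ell = 1 \dots L        (3)

y = LN(z_{L}^{0})        (4)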

        Inductive bias.  We note that the Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, the 2D neighborhood structure, and translation equivariance are baked into every layer of the whole model. In ViT, only the MLP layers are local and translationally equivariant, while the self-attention layers are global. The 2D neighborhood structure is used very sparingly: by cutting the image into patches at the beginning of the model, and by adjusting the position embeddings for images of different resolutions at fine-tuning time (described below). Other than that, the position embeddings at initialization carry no information about the 2D positions of the patches, and all spatial relationships between patches have to be learned from scratch.

       Hybrid Architecture.  As an alternative to raw image patches, the input sequence can be formed from the feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection E (Eq. 1) is applied to patches extracted from a CNN feature map. As a special case, the patches can have a spatial size of 1x1, which means that the input sequence is obtained by simply flattening the spatial dimensions of the feature map and projecting to the Transformer dimension. The classification input embedding and position embeddings are added as described above.
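A minimal sketch of the hybrid idea, assuming a torchvision ResNet-50 truncated after its second stage; the paper actually uses a modified ResNet (BiT) backbone, so the backbone choice and names here are only illustrative:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Hybrid sketch: feed CNN feature-map "patches" to the Transformer instead of raw image patches.
# Assumption: a plain torchvision ResNet-50 truncated at layer2; the paper uses a modified ResNet (BiT).
backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-4])  # conv1 .. layer2

proj = nn.Conv2d(512, 768, kernel_size=1)  # "1x1 patches": project every spatial position to D=768

x = torch.randn(1, 3, 224, 224)
feat = backbone(x)                               # [1, 512, 28, 28]
tokens = proj(feat).flatten(2).transpose(1, 2)   # [1, 784, 768] -> sequence of 784 tokens
print(tokens.shape)                              # torch.Size([1, 784, 768])
```

The [class] token and position embeddings would then be added to this sequence exactly as in the pure-patch case.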

        This paragraph already makes the ViT structure very clear: an Embedding layer + the standard Transformer Encoder + an MLP head. It really is that simple!


Embedding layer

The original text explains the embedding layer very clearly. Here I use ViT-B/16 (ViT_base_patch16) as an example. The input is 224 x 224; dividing the image into 16 x 16 patches gives (224/16)^{2} = 196 patches. Each of these 196 patches is then mapped to a vector of dimension d = 768: each patch has shape [16, 16, 3] (height and width 16, 3 channels), and 16 x 16 x 3 = 768. In the code, this step is implemented with a single convolution:

nn.Conv2d(in_c, embed_dim, kernel_size=patch_size, stride=patch_size)

The kernel size is 16, the stride is 16, the number of input channels is 3, and the number of output channels is 768. Wonderful!

In this way, the image is changed from [224, 224, 3] to [14, 14, 768], and after Flatten, [196, 768] is obtained.
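A minimal runnable sketch of this patch-embedding step, with the shapes written out (the names in_c, embed_dim and patch_size follow the snippet above and are only illustrative):

```python
import torch
import torch.nn as nn

in_c, embed_dim, patch_size = 3, 768, 16

# ViT-B/16 patch embedding: a 16x16 convolution with stride 16 splits and projects in one step.
patch_embed = nn.Conv2d(in_c, embed_dim, kernel_size=patch_size, stride=patch_size)

img = torch.randn(1, 3, 224, 224)     # [B, C, H, W]
x = patch_embed(img)                  # [1, 768, 14, 14]
x = x.flatten(2).transpose(1, 2)      # [1, 196, 768]: 196 patch tokens of dimension 768
print(x.shape)                        # torch.Size([1, 196, 768])
```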

Add a token specifically for classification. It is mentioned in the article that the addition method here is similar to BERT, adding a [class] token that can be obtained through learning. In order to keep the dimensions consistent, the dimension of the [class] token is [1, 768]. Through the Concat operation, [196, 768] and [1, 768] are concatenated to obtain [197, 768].

Then position information is added to these tokens, i.e. the position embedding. As in the Transformer, position embeddings are added to every token; in ViT they are a trainable parameter, and since they are added to all tokens, the shape is also [197, 768].
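Continuing the sketch above, the [class] token and position embedding could look like this (tensor names are illustrative):

```python
import torch
import torch.nn as nn

embed_dim, num_patches = 768, 196

cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                 # learnable [class] token, [1, 1, 768]
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))   # learnable position embedding, [1, 197, 768]

x = torch.randn(1, num_patches, embed_dim)    # patch tokens from the previous step, [1, 196, 768]
cls = cls_token.expand(x.shape[0], -1, -1)    # broadcast the [class] token over the batch
x = torch.cat((cls, x), dim=1)                # concat -> [1, 197, 768]
x = x + pos_embed                             # add position information -> [1, 197, 768]
print(x.shape)                                # torch.Size([1, 197, 768])
```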


Transformer Encoder

This part is exactly the same as the encoder in the transformer. You can read the detailed explanation of the transformer written before.

Simply put, it is N x Blocks (Multi-Head Attention + MLP)

In ViT-B/16, the input is [197, 768] and the output is also [197, 768] .
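A minimal pre-norm encoder block under these shapes (this uses PyTorch's built-in nn.MultiheadAttention for brevity; real implementations such as timm write their own attention module, so treat this as a sketch rather than the reference code):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One ViT encoder block: LN -> multi-head self-attention -> residual, then LN -> MLP -> residual."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual after attention
        x = x + self.mlp(self.norm2(x))                     # residual after MLP
        return x

blocks = nn.Sequential(*[Block() for _ in range(12)])  # ViT-B/16 stacks 12 such blocks
x = torch.randn(1, 197, 768)
print(blocks(x).shape)                                  # torch.Size([1, 197, 768])
```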

Then it is connected to an MLP Head to output the final classification result.


MLP Head

This part is even simpler. The only difference from the original Transformer is that the Transformer uses all of its output tokens, whereas ViT only does classification, so only the output at the position of the [class] token is needed.

The head itself is just a fully connected layer. It's that simple!
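A sketch of this final step, following the shapes above (num_classes is whatever the downstream task needs; at pre-training time the paper uses an MLP with one hidden layer instead of this single linear layer):

```python
import torch
import torch.nn as nn

num_classes = 1000                    # e.g. ImageNet-1k at fine-tuning time
norm = nn.LayerNorm(768)              # final LayerNorm on the encoder output
head = nn.Linear(768, num_classes)    # single linear layer, as used for fine-tuning

x = torch.randn(1, 197, 768)          # Transformer encoder output
cls_out = norm(x)[:, 0]               # keep only the [class] token -> [1, 768]
logits = head(cls_out)                # [1, 1000]
print(logits.shape)
```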


3.2 FINE-TUNING AND HIGHER RESOLUTION

        Typically, we pre-train ViT on large datasets and fine-tune on (smaller) downstream tasks. To this end, we remove the pretrained prediction head and append a zero-initialized D × K feed-forward layer, where K is the number of downstream classes. Fine-tuning at a higher resolution is often beneficial compared to pre-training (Touvron et al., 2019; Kolesnikov et al., 2020). When provided with higher resolution images, we keep the patch size the same, resulting in a larger effective sequence length. Vision Transformer can handle arbitrary sequence lengths (up to memory limit), but the pre-trained positional information is no longer meaningful. Therefore, we 2D interpolate the pre-trained positional embeddings according to their position in the original image. Note that this resolution adjustment and patch extraction is the only point at which inductive bias about the 2D structure of the image is manually injected into the vision transformer.
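A hedged sketch of that position-embedding interpolation; this mirrors what implementations such as timm do when loading pre-trained weights at a new resolution, but the function and variable names here are illustrative:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_grid):
    """Interpolate pre-trained position embeddings to a new patch grid.

    pos_embed: [1, 1 + old_grid**2, D], where the first token is the [class] embedding.
    new_grid:  side length of the new grid, e.g. 24 for a 384x384 image with 16x16 patches.
    """
    cls_tok, grid_tok = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(grid_tok.shape[1] ** 0.5)
    grid_tok = grid_tok.reshape(1, old_grid, old_grid, -1).permute(0, 3, 1, 2)   # [1, D, old, old]
    grid_tok = F.interpolate(grid_tok, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)                # 2D interpolation
    grid_tok = grid_tok.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, -1)  # back to a sequence
    return torch.cat([cls_tok, grid_tok], dim=1)

pos_embed = torch.randn(1, 197, 768)           # pre-trained at 224x224 (14x14 grid)
print(resize_pos_embed(pos_embed, 24).shape)   # torch.Size([1, 577, 768]) for 384x384 input
```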

4 Experiments

The parameter counts of these models are still far beyond what most of us can train from scratch, so when training our own ViT we generally start from the released pre-trained weights.

5 Conclusion

        We have explored the direct application of transformers in image recognition. Unlike previous work using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture beyond the initial patch extraction step. Instead, we interpret the image as a sequence of patches and process it by using standard Transformer encoders in NLP. This simple but scalable strategy works surprisingly well when combined with pre-training on large datasets. As a result, Vision Transformer matches or exceeds the state-of-the-art on many image classification datasets while being relatively inexpensive to pre-train.

        While these initial results are encouraging, many challenges remain. One is to apply ViT to other computer vision tasks, such as detection and segmentation. Our results, together with those of Carion et al., demonstrate the promise of this approach. Another challenge is to continue to explore self-supervised pre-training methods. Our initial experiments showed improvements in self-supervised pre-training, but there is still a large gap between self-supervised and large-scale supervised pre-training. Finally, extending ViT further may improve performance. 

Original article: blog.csdn.net/like_jmo/article/details/126094133