Detailed explanation of ViT paper

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE 

Bilibili explanation: https://www.bilibili.com/video/BV1GB4y1X72R/?spm_id_from=333.788&vd_source=d2733c762a7b4f17d4f010131fbf1834

1.Introduction

Self-attention-based architectures, especially Transformers (Vaswani et al., 2017), have become the model of choice for natural language processing (NLP). The dominant approach is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset (Devlin et al., 2019). Thanks to the computational efficiency and scalability of Transformers, it has become possible to train models of unprecedented size, with over 100B parameters (Brown et al., 2020; Lepikhin et al., 2020). As models and datasets grow, there is still no sign of performance saturating.

In computer vision, however, convolutional architectures remain dominant. Inspired by the successes in NLP, a number of works have tried to combine CNN-like architectures with self-attention, some replacing convolutions entirely. The latter models, while theoretically efficient, have not yet been scaled effectively on modern hardware accelerators because they rely on specialized attention patterns. Classic ResNet-like architectures therefore remain the first choice.

Inspired by the success of Transformer scaling in NLP, we apply a standard Transformer directly to images with as few modifications as possible. We split the image into patches and feed the sequence of linear embeddings of these patches to the Transformer. Image patches are treated the same way as tokens (words) in NLP: just as a sentence contains some number of words, an image contains some number of patches. We train the image classification model in a supervised manner (whereas pre-training in NLP is typically unsupervised).

When a Transformer without strong regularization is trained on a medium-sized dataset such as ImageNet, its accuracy is several percentage points below a comparably sized ResNet. This is because the Transformer lacks some of the inductive biases inherent in CNNs, such as translation equivariance and locality, and therefore does not generalize well when the amount of data is insufficient.

However, the situation changes if the model is trained on larger datasets (14M-300M images). We find that large-scale training trumps inductive bias.

2.Related Work

The most naive application of self-attention to an image requires each pixel to attend to every other pixel, which does not scale to realistic input sizes because the cost is quadratic in the number of pixels. Therefore, to apply Transformers to images, prior work either restricts self-attention to a local neighborhood around each query pixel or uses scalable approximations to global self-attention (sparse attention). Another way to scale attention is to apply it to blocks of different sizes (Weissenborn et al., 2019), or in the extreme case only along individual axes (horizontal, vertical) (Ho et al., 2019; Wang et al., 2020a). Many of these specialized attention architectures have demonstrated promising results on computer vision tasks but require complex engineering to implement efficiently on hardware accelerators.

Most relevant to us is the model of Cordonnier et al. (2020), which extracts patches of size 2x2 from the input image and applies full self-attention on top. This model is very similar to ViT, but our work further demonstrates that large-scale pre-training enables vanilla Transformers to compete with (or even outperform) state-of-the-art CNNs. Furthermore, Cordonnier et al. (2020) use a patch size of 2x2 pixels, which makes their model applicable only to small-resolution images, whereas we also handle medium-resolution images.

Another recent related model is Image GPT (iGPT) (Chen et al., 2020a), which applies Transformers to image pixels after reducing the image resolution and color space. The model is trained in an unsupervised manner as a generative model, and the generated representations can then be fine-tuned or linearly probed to improve classification performance, achieving a maximum accuracy of 72% on ImageNet.

3.Method

In model design, we followed the original Transformer (Vaswani et al., 2017) as much as possible. One advantage of this intentionally simple setup is that the scalable NLP Transformer architecture and its efficient implementation can be used almost out of the box.

Figure 1: Model overview. We split the image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. To perform classification, we use the standard approach of adding an extra learnable "classification token" to the sequence. The illustration of the Transformer encoder is inspired by Vaswani et al. (2017).

Steps: image --> split into patches --> linear projection layer --> add position embedding --> Transformer encoder --> MLP head --> class prediction

The following walks through the figure using an input of size 224x224x3.

Input: the 224x224x3 image is split into 16x16 patches, giving (224/16)^2 = 196 patches; each flattened patch has 16x16x3 = 768 dimensions, so the input sequence is 196x768.

Linear projection layer E: 768x768 (its output size is the dimension D in the paper). Input to E: 196x768. Class token: 1x768.

Through the linear projection layer: [196x768] x [768x768] = [196, 768] (matrix multiplication).
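A minimal PyTorch sketch of these two steps (shapes only, random weights), assuming the 224x224x3 input and 16x16 patches used in this walkthrough:

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)   # (B, C, H, W)
P, D = 16, 768                      # patch size and embedding dimension D

# Cut into 14x14 = 196 patches of 16x16 pixels, flatten each to 16*16*3 = 768 values.
patches = img.unfold(2, P, P).unfold(3, P, P)                               # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * P * P)  # (1, 196, 768)

E = nn.Linear(3 * P * P, D)   # the 768x768 linear projection E
tokens = E(patches)           # (1, 196, 768)
print(tokens.shape)           # torch.Size([1, 196, 768])
```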

Prepend a cls token, which can aggregate classification features from the other 196 patch embeddings: torch.cat([1,768], [196,768]) ==> [197,768].

Add the position encoding: [197,768] + [197,768] ==> [197,768]. The position embedding is summed with the token embedding rather than concatenated, following the original 2017 Transformer, where addition was found to work as well as concatenation.
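A short sketch of the cls-token concatenation and the position-embedding sum; the tokens tensor here is a random stand-in for the projected patches from the previous step:

```python
import torch
import torch.nn as nn

D = 768
tokens = torch.randn(1, 196, D)                    # stand-in for the projected patch embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, D))     # learnable [class] token
pos_embed = nn.Parameter(torch.zeros(1, 197, D))   # learnable 1D position embeddings

x = torch.cat([cls_token.expand(tokens.shape[0], -1, -1), tokens], dim=1)  # (1, 197, 768)
x = x + pos_embed                                  # element-wise sum keeps the shape (1, 197, 768)
```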

Multi-head attention: with, e.g., 12 heads, each head has dimension 768/12 = 64, so each head produces a 197x64 output; concatenating the heads gives 197x768 again.
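A hedged sketch of the head-splitting arithmetic (random weights, no dropout), just to make the 197x64-per-head and 197x768-after-concatenation shapes concrete:

```python
import torch
import torch.nn as nn

B, N, D, num_heads = 1, 197, 768, 12
head_dim = D // num_heads                            # 768 / 12 = 64 per head
x = torch.randn(B, N, D)

qkv = nn.Linear(D, 3 * D)(x)                         # (1, 197, 2304)
qkv = qkv.reshape(B, N, 3, num_heads, head_dim).permute(2, 0, 3, 1, 4)
q, k, v = qkv[0], qkv[1], qkv[2]                     # each (1, 12, 197, 64)

attn = (q @ k.transpose(-2, -1)) * head_dim ** -0.5  # (1, 12, 197, 197) attention weights
attn = attn.softmax(dim=-1)
out = (attn @ v).transpose(1, 2).reshape(B, N, D)    # heads concatenated back: (1, 197, 768)
out = nn.Linear(D, D)(out)                           # output projection
```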

MLP: the dimension is first expanded four times, from 197x768 to 197x3072, and then projected back down to 197x768.
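The corresponding two-layer MLP sketch (4x expansion, then projection back):

```python
import torch
import torch.nn as nn

D = 768
mlp = nn.Sequential(
    nn.Linear(D, 4 * D),   # expand: 768 -> 3072
    nn.GELU(),
    nn.Linear(4 * D, D),   # project back: 3072 -> 768
)
print(mlp(torch.randn(1, 197, D)).shape)   # torch.Size([1, 197, 768])
```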

Layer norm: unlike BatchNorm, which normalizes each channel across all samples in a batch, LayerNorm normalizes the feature dimension of each token (each sample) independently.
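A tiny illustration of where LayerNorm computes its statistics:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 197, 768)   # (batch, tokens, features)
ln = nn.LayerNorm(768)         # normalizes the 768 features of each token independently
y = ln(x)                      # mean/variance are computed per token, not across the batch
# BatchNorm, by contrast, computes statistics for each feature across the whole batch.
```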

3.1 Vision Transformer

The standard Transformer accepts a 1D sequence of token embeddings as input. To process 2D images, the image x ∈ R^(H×W×C) is reshaped into a sequence of flattened 2D patches x_p ∈ R^(N×(P²·C)), where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each patch, and N = HW/P² is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses a constant latent vector size D in all of its layers, so we flatten the patches and map them to D dimensions with a trainable linear projection (Equation 1). We call the output of this projection the patch embeddings.
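In clean notation, with the 224x224x3, P = 16 example from the walkthrough plugged in:

```latex
x \in \mathbb{R}^{H \times W \times C} \rightarrow x_p \in \mathbb{R}^{N \times (P^2 \cdot C)},
\qquad N = \frac{HW}{P^2} = \frac{224 \cdot 224}{16^2} = 196,
\qquad P^2 \cdot C = 16 \cdot 16 \cdot 3 = 768 .
```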

Similar to BERT's [class] token, we prepend a learnable embedding to the sequence of patch embeddings (z_0^0 = x_class), whose state at the output of the Transformer encoder (z_L^0) serves as the image representation y (Equation 4). During both pre-training and fine-tuning, a classification head is attached to z_L^0. The classification head is implemented by an MLP with one hidden layer during pre-training and by a single linear layer during fine-tuning.
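A sketch of both heads, assuming 1000 output classes; the hidden size and tanh activation of the pre-training head follow common ViT implementations and are assumptions here, not something stated in this note:

```python
import torch.nn as nn

D, num_classes = 768, 1000
# Pre-training: MLP head with one hidden layer (hidden size D and tanh are assumptions).
pretrain_head = nn.Sequential(nn.Linear(D, D), nn.Tanh(), nn.Linear(D, num_classes))
# Fine-tuning: a single linear layer.
finetune_head = nn.Linear(D, num_classes)
# Either head is applied only to the [class] token output z_L^0.
```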

Position embedding is added to patch embedding to preserve position information. We use standard learnable 1D position embedding because we do not observe significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4). The resulting sequence of embedding vectors is used as input to the encoder.

The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multi-head self-attention (MSA, see Appendix A) and MLP blocks (Equations 2, 3). Layernorm (LN) is applied before each block and residual connections are applied after each block (Wang et al., 2019; Baevski & Auli, 2019).

The MLP contains two layers with GELU nonlinearity.
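Putting Equations 2 and 3 together, here is a self-contained sketch of one encoder layer (using PyTorch's built-in nn.MultiheadAttention rather than a hand-written MSA):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder layer: pre-LayerNorm, then MSA and MLP, each wrapped in a residual connection."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # Eq. 2: z' = MSA(LN(z)) + z
        z = z + self.mlp(self.norm2(z))                      # Eq. 3: z  = MLP(LN(z')) + z'
        return z

print(EncoderBlock()(torch.randn(1, 197, 768)).shape)        # torch.Size([1, 197, 768])
```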


Inductive bias. The Vision Transformer has much less image-specific inductive bias than CNNs. In a CNN, locality, two-dimensional neighborhood structure, and translation equivariance are baked into every layer throughout the whole model. In ViT, only the MLP layers are local and translationally equivariant, while the self-attention layers are global; the 2D structure is used only sparingly, when cutting the image into patches and when adjusting the position embeddings for images of different resolution at fine-tuning time. Beyond that, the position embeddings at initialization carry no information about the 2D positions of the patches, and all spatial relations between patches must be learned from scratch.

Hybrid architecture. As an alternative to raw image patches, the input sequence can be formed from the feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection E (Eq. 1) is applied to patches extracted from CNN feature maps. As a special case, the patches can have spatial size 1x1, which means the input sequence is obtained by simply flattening the spatial dimensions of the feature map and projecting to the Transformer dimension. The classification input embedding and position embeddings are added as described above.
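A sketch of the hybrid input path; the ResNet-50 backbone and its 2048-channel, 7x7 output are assumptions used only to make the shapes concrete:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Backbone truncated before global pooling and the FC head (ResNet-50 here is an assumption).
backbone = nn.Sequential(*list(resnet50().children())[:-2])   # (B, 2048, 7, 7) for a 224x224 input
proj = nn.Linear(2048, 768)                                   # patch embedding E applied to 1x1 "patches"

feat = backbone(torch.randn(1, 3, 224, 224))                  # (1, 2048, 7, 7)
tokens = proj(feat.flatten(2).transpose(1, 2))                # (1, 49, 2048) -> (1, 49, 768)
```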

3.2 Fine-tuning and higher resolution

Typically, we pretrain ViT on large datasets and fine-tune it to (smaller) downstream tasks. To this end, we remove the pretrained prediction head and append a zero-initialized DxK feedforward layer, where K is the number of downstream classes. Fine-tuning at a higher resolution is often beneficial compared to pre-training (Touvron et al., 2019; Kolesnikov et al., 2020). When providing higher resolution images, we keep the patch size the same, resulting in a larger effective sequence length. The Vision Transformer can handle arbitrary sequence lengths (up to memory limits), however, pretrained positional embeddings may no longer make sense. Therefore, we 2D interpolate the pre-trained position embeddings based on their positions in the original image. Note that this resolution adjustment and patch extraction are the only points at which inductive biases about the 2D structure of the image are manually injected into the vision transformer.
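A sketch of that 2D interpolation for a 224px to 384px resolution change (a 14x14 grid becomes a 24x24 grid); the bicubic mode is an assumption:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """2D-interpolate pre-trained position embeddings, e.g. from a 14x14 to a 24x24 patch grid."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]   # keep the [class] token embedding as-is
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, -1).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, -1)
    return torch.cat([cls_pe, patch_pe], dim=1)

print(resize_pos_embed(torch.randn(1, 197, 768)).shape)      # torch.Size([1, 577, 768])
```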

4.Experiments

Compared with other methods, ViT has better performance and shorter training time.

The gray region in the figure indicates the accuracy range that ResNets can achieve. On small datasets, ViT performs worse than ResNet; on large datasets, ViT performs better.

The figure shows three model families: Vision Transformer, ResNet, and hybrid. Performance keeps improving as pre-training compute (FLOPs) increases, with no sign of saturation yet.

For models with the same FLOPs, Transformer and hybrid models outperform ResNet.

Hybrid models improve upon pure Transformers at smaller model sizes, but the advantage disappears for larger models.

4.5 Inspecting the Vision Transformer

The first layer of the Vision Transformer linearly projects the flattened patches into a lower-dimensional space. Figure 7 (left) shows the top principal components of the learned embedding filters. These components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch.

After projection, the learned position embeddings are added to the patch representations. Figure 7 (middle) shows that the model learns to encode distance within the image in the similarity of position embeddings, i.e., closer patches tend to have more similar position embeddings. Further, a row-column structure emerges: patches in the same row/column have similar embeddings. Finally, for larger grids, a sinusoidal structure sometimes appears (Appendix D). That the position embeddings learn to represent the 2D image topology explains why handcrafted 2D-aware embedding variants do not yield improvements (Appendix D.4): the Transformer already learns the spatial layout from the 1D position embeddings.
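A small sketch of how such a similarity map could be computed (the position embeddings here are a random stand-in for the learned ones):

```python
import torch
import torch.nn.functional as F

# Cosine similarity between one patch's position embedding and all others, on the 14x14 grid,
# in the spirit of Figure 7 (middle).
pos_embed = torch.randn(196, 768)   # random stand-in for the learned patch position embeddings
i = 7 * 14 + 7                      # patch at grid position (row 7, column 7)
sim = F.cosine_similarity(pos_embed[i:i + 1], pos_embed, dim=-1).reshape(14, 14)
print(sim.shape)                    # torch.Size([14, 14])
```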

Self-attention allows ViT to integrate information across the entire image, even in the lowest layers. We investigate to what degree the network makes use of this capability. Specifically, we compute the average distance in image space across which information is integrated, based on the attention weights (Figure 7, right). This "attention distance" is analogous to the receptive field size in CNNs. Some heads attend to most of the image already in the lowest layers, showing that the model does use its ability to integrate information globally, while other heads in the low layers consistently have small attention distances. This highly localized attention is less pronounced in the hybrid model, where a ResNet is applied before the Transformer (Figure 7, right), suggesting that it may serve a function similar to early convolutional layers in CNNs. Furthermore, the attention distance increases with network depth. Globally, we find that the model attends to image regions that are semantically relevant for classification.

5.Conclusion

We have explored the direct application of Transformer in image recognition. Unlike previous work using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture beyond the initial patch extraction step. Instead, we interpret the image as a series of patches and process them through the standard Transformer encoder used in NLP. This simple yet scalable strategy works surprisingly well when combined with pre-training on large datasets. As a result, Vision Transformer matches or exceeds existing techniques on many image classification datasets while having relatively low pre-training costs.


Origin blog.csdn.net/qq_37424778/article/details/125417315