[Computer Vision] Vision Transformer (ViT) model structure and principle analysis

1. Introduction

The Vision Transformer (ViT) comes from the paper "AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE" and marks the beginning of Transformer-based models in the vision field.


This article will introduce the overall architecture and basic principles of the ViT model as concisely as possible.

The ViT model is based on the Transformer Encoder model, and it is assumed that the reader already understands the basics of Transformer.

2. How Vision Transformer works

We know that the Transformer model was originally used in the field of natural language processing (NLP). NLP mainly deals with text, sentences, paragraphs, and other sequence data, whereas computer vision deals with image data. Applying the Transformer model to image data therefore faces several challenges, for the following reasons:

  1. Unlike text data such as words, sentences, and paragraphs, an image carries far more raw information, presented in the form of pixel values.
  2. Processing an image the way text is processed, pixel by pixel, is prohibitively expensive even on current hardware.
  3. The Transformer lacks the inductive biases of CNNs, such as translation equivariance and locality (locally restricted receptive fields).
  4. A CNN extracts features through convolution operations, and its receptive field grows gradually as the network gets deeper; the Transformer's global self-attention, by contrast, is more computationally expensive than convolution.
  5. The Transformer consumes sequence data and cannot directly process grid-structured data such as images.

In order to solve the above problems, Google's research team proposed the ViT model. Its essence is actually very simple: since the Transformer can only process sequence data, we can convert the image data into sequence data. Let's take a look at how ViT does it.

3. ViT model architecture

Let's first roughly analyze the workflow of ViT, as follows (a quick shape check follows the list):

  1. Divide an image into patches;
  2. Flatten the patches;
  3. Linearly map the flattened patches to the embedding dimension;
  4. Add position embedding information;
  5. Feed the resulting image sequence into a standard Transformer encoder;
  6. Pre-train on a larger dataset;
  7. Fine-tune on the downstream dataset for image classification.
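
Before going through each step in detail, here is a quick shape check under assumed ViT-Base-style settings (224x224 input, 16x16 patches); these concrete numbers are illustrative examples, not requirements of the method.

```python
# Sequence-length arithmetic for assumed ViT-Base-style settings.
H = W = 224               # input image resolution (assumed example)
P = 16                    # patch size
C = 3                     # RGB channels

N = (H * W) // (P * P)    # number of patches = length of the token sequence
patch_dim = P * P * C     # dimension of one flattened patch

print(N, patch_dim)       # 196 768
```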

4. Analysis of the working principle of ViT

We decompose this process into 6 steps and then analyze the principle of each step in turn, as follows:


4.1 Step 1: Convert the image into a sequence of patches

This step is critical. For the Transformer to process image data, the first step must be to convert the image into sequence data, but how? Suppose we have an image $x \in R^{H \times W \times C}$ and the patch size is $P$. We can then create $N$ image patches, which can be written as $x_p \in R^{N \times (P^2 \cdot C)}$, where $N = \frac{HW}{P^2}$. Here $N$ is the length of the sequence, similar to the number of words in a sentence. In the paper's illustration, the image is divided into 9 patches.
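
For concreteness, here is a minimal PyTorch sketch of this step, splitting a dummy image batch into non-overlapping patches; the tensor sizes are illustrative assumptions rather than anything taken from the paper's code.

```python
import torch

B, C, H, W = 1, 3, 224, 224   # batch, channels, height, width (assumed example)
P = 16                        # patch size
x = torch.randn(B, C, H, W)   # dummy image batch

# (B, C, H, W) -> (B, N, P*P*C), where N = H*W / P^2
patches = x.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)   # (B, H/P, W/P, C, P, P)
patches = patches.reshape(B, -1, C * P * P)   # (B, 196, 768)
print(patches.shape)                          # torch.Size([1, 196, 768])
```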

4.2 Step 2: Flatten the patches

In the original paper, the patch size chosen by the authors is 16, so a patch has shape (3, 16, 16): 3 channels of 16x16 pixels. After flattening, its size is 3x16x16 = 768; that is, each patch becomes a vector of length 768.

However, this vector is still a bit large, so a linear transformation is added at this point: a linear mapping layer projects each patch to the embedding dimension we specify, analogous to a word vector in NLP.
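
A sketch of this linear projection under the same assumed sizes; the strided Conv2d at the end is a commonly used equivalent implementation (patchify and project in one step), not something prescribed by this article.

```python
import torch
import torch.nn as nn

B, N, patch_dim, D = 1, 196, 16 * 16 * 3, 768   # D = 768 is the ViT-Base embedding size
patches = torch.randn(B, N, patch_dim)          # flattened patches from the previous step

proj = nn.Linear(patch_dim, D)                  # linear patch embedding
tokens = proj(patches)                          # (B, 196, D)

# Common equivalent: a strided convolution splits and projects in one operation.
x = torch.randn(B, 3, 224, 224)
conv_embed = nn.Conv2d(3, D, kernel_size=16, stride=16)
tokens_alt = conv_embed(x).flatten(2).transpose(1, 2)   # (B, 196, D)
```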

4.3 Step 3: Add Position embedding

Unlike a CNN, the model at this point does not know the location of each patch within the sequence. Therefore position information must first be attached to these patches; these are the numbered vectors in the paper's illustration.

Experiments show that different choices of position encoding have little effect on the final result. The original Transformer paper uses fixed (sinusoidal) position encodings, whereas ViT uses learnable position embedding vectors, which are added to the corresponding patch embeddings.
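
Below is a minimal sketch of ViT-style learnable position embeddings in PyTorch with assumed sizes. In actual implementations the class token from the next step is usually prepended before this addition, giving N + 1 position vectors; here only the N patch tokens are shown.

```python
import torch
import torch.nn as nn

B, N, D = 1, 196, 768
tokens = torch.randn(B, N, D)                    # projected patch embeddings

pos_embed = nn.Parameter(torch.zeros(1, N, D))   # one learnable vector per position
nn.init.trunc_normal_(pos_embed, std=0.02)       # a common initialization choice
tokens = tokens + pos_embed                      # broadcast over the batch dimension
```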

4.4 Step 4: Add class token

Before inputting to the Transformer Encoder, a special class token needs to be added, which is mainly borrowed from the BERT model.

The class token is added because ViT treats this token's output from the Transformer Encoder as the encoded feature of the entire input image; it is subsequently fed into the MLP head, where the loss is computed against the image label.
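
A sketch of prepending a learnable class token, in the spirit of BERT's [CLS] token; the parameter names and sizes below are illustrative.

```python
import torch
import torch.nn as nn

B, N, D = 1, 196, 768
tokens = torch.randn(B, N, D)                    # patch embeddings

cls_token = nn.Parameter(torch.zeros(1, 1, D))   # one learnable token shared by all images
cls = cls_token.expand(B, -1, -1)                # (B, 1, D)
tokens = torch.cat([cls, tokens], dim=1)         # (B, 197, D), class token first
```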

4.5 Step 5: Input Transformer Encoder

The class token and the patch embeddings are concatenated into one sequence and fed into the standard Transformer Encoder.
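
As an illustration, the sketch below uses PyTorch's built-in nn.TransformerEncoder as a stand-in for the ViT encoder, with roughly ViT-Base-like hyperparameters (12 layers, 12 heads, hidden size 768, MLP size 3072, pre-norm, GELU); the paper's exact encoder implementation may differ in details.

```python
import torch
import torch.nn as nn

B, D = 1, 768
tokens = torch.randn(B, 197, D)    # class token + 196 patch tokens, position added

layer = nn.TransformerEncoderLayer(
    d_model=D, nhead=12, dim_feedforward=3072,
    activation="gelu", batch_first=True, norm_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)
out = encoder(tokens)              # (B, 197, D), same shape as the input
```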

4.6 Step 6: Classification

Note that the output of the Transformer Encoder is itself a sequence, but ViT uses only the output corresponding to the class token, which is sent to the MLP head to produce the final classification result.
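
A sketch of this final step with assumed sizes; the head shown here (LayerNorm followed by a linear layer) is a common implementation choice, whereas the paper describes an MLP with one hidden layer for pre-training and a single linear layer for fine-tuning.

```python
import torch
import torch.nn as nn

B, D, num_classes = 1, 768, 1000          # num_classes is an assumed example value
encoder_out = torch.randn(B, 197, D)      # output sequence from the encoder

cls_out = encoder_out[:, 0]               # (B, D): final state of the class token
head = nn.Sequential(nn.LayerNorm(D), nn.Linear(D, num_classes))
logits = head(cls_out)                    # (B, num_classes)
```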

5. Summary

The overall idea of ViT is relatively simple: it converts the image classification problem into a sequence problem, that is, it turns image patches into tokens so that they can be processed by a Transformer.

It sounds simple, but ViT needs to be pre-trained on massive datasets and then fine-tuned on downstream datasets to achieve good results; otherwise its performance does not match CNN-based models such as ResNet50.


Origin blog.csdn.net/wzk4869/article/details/130480240