Principle Analysis of Vision Transformer

Vision Transformer (ViT) is a model proposed by the Google team in 2021 to apply the Transformer to image classification. Because of its simple architecture, strong performance, and good scalability, it has become a milestone work in the CV field and has sparked a large amount of follow-up research.


1. Motivation

From 2020 to 2021, the CV field was still dominated by CNNs, while the Transformer had become the standard in NLP. Many works were dedicated to extending ideas from NLP to CV, and they can be roughly divided into two categories: first, combining the attention mechanism with CNNs; second, replacing some structures of a CNN with attention while keeping the overall architecture unchanged. For example, applying self-attention to semantic segmentation produced excellent models such as Non-local, DANet, and CCNet.

However, these methods still rely on a CNN to extract image features. Google researchers therefore asked whether a Transformer could be used as an encoder to extract image features without relying on a CNN at all, becoming a standard backbone for image feature extraction in CV, just as ResNet is.

The most direct way is to transfer the practice from NLP. The standard pre-trained model for extracting text features in NLP happens to be BERT, also released by Google, so ViT is essentially asking how to extend BERT to the CV field. By comparison, ViT differs from BERT mainly in the input format (the image must be processed into an embedding sequence, the way text is in NLP) and in the pre-training task (the NSP and masked LM tasks of BERT are removed and replaced with an image classification task).

2. Conclusion of the article

The core conclusion of the original ViT paper is that with enough pre-training data, ViT outperforms CNNs, overcoming the limitation of the Transformer's lack of inductive bias, and it transfers better to downstream tasks. However, when the amount of training data is not large enough, ViT usually performs worse than a ResNet of the same size, because the Transformer lacks the inductive biases, i.e., the prior knowledge, that a CNN has.

3. Model details

The network architecture diagram of Vision Transformer is as follows:

[Figure: Vision Transformer network architecture]

Part 1: Model Input

Assume the model input is a three-channel image of size (512, 512, 3); as a batched tensor this is torch.Size([1, 3, 512, 512]).

Part 2: Linear Projection of Flatten Patches

Divide the entire (512, 512, 3) image into a (16, 16) grid of patches, each patch having size (32, 32, 3). The tensor torch.Size([1, 3, 512, 512]) can then be reshaped to torch.Size([1, 3, 16, 32, 16, 32]).

Merge the (16, 16) dimensions into one, representing all 256 patches, and the (3, 32, 32) dimensions into one, representing the 3072 pixel values in each patch, giving torch.Size([1, 256, 3072]).

A linear layer nn.Linear(3072, 1024) converts this to torch.Size([1, 256, 1024]). After a GELU activation and a LayerNorm normalization layer, the shape remains torch.Size([1, 256, 1024]).
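A rough sketch of Parts 1 and 2 is shown below. This is only an illustration following the shapes described above, not the original project code, and the variable names are my own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch; variable names are not from the original project.
x = torch.randn(1, 3, 512, 512)                 # Part 1: batch of 1, 3 channels, 512 x 512

# Part 2: split into a 16 x 16 grid of patches, each patch being (3, 32, 32)
patches = x.reshape(1, 3, 16, 32, 16, 32)       # [1, 3, 16, 32, 16, 32]
patches = patches.permute(0, 2, 4, 1, 3, 5)     # [1, 16, 16, 3, 32, 32]
patches = patches.reshape(1, 256, 3 * 32 * 32)  # [1, 256, 3072]

# Linear projection of flattened patches, then GELU and LayerNorm (order as in the text)
patch_embed = nn.Linear(3072, 1024)
tokens = nn.LayerNorm(1024)(F.gelu(patch_embed(patches)))
print(tokens.shape)                             # torch.Size([1, 256, 1024])
```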

Part 3: Patches+Position Embedding

Construct a lookup table nn.Embedding(config.max_position_embeddings=500, config.hidden_size=1024), which converts an index into a vector of the specified dimension; the entries of this table are also parameters that are learned during training.

For the (16, 16) = 256 patches, generate tensor([0, 1, 2, 3, ..., 254, 255]) and look up the corresponding 1024-dimensional vector for each index in the embedding table, giving torch.Size([1, 256, 1024]).

Add the position encoding vectors to the image features obtained earlier, giving torch.Size([1, 256, 1024]). Then pass through a LayerNorm normalization layer and a Dropout layer; the shape remains torch.Size([1, 256, 1024]).
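Continuing the sketch, Part 3 might look like the following. The dropout probability of 0.1 is an assumed value, not stated in the text:

```python
# Learnable position embedding table: 500 positions, 1024-dimensional vectors
pos_embed = nn.Embedding(500, 1024)

position_ids = torch.arange(256).unsqueeze(0)   # tensor([[0, 1, 2, ..., 255]])
pos_vectors = pos_embed(position_ids)           # [1, 256, 1024]

# Add position encodings to the patch features, then LayerNorm and Dropout
tokens = tokens + pos_vectors                   # [1, 256, 1024]
tokens = nn.Dropout(p=0.1)(nn.LayerNorm(1024)(tokens))  # p=0.1 is an assumed value
```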

Part 4: Transformer Encoder — Block Stacking

The Transformer Encoder contains a series of Blocks (that is, Parts 5 to 8 repeated). Each Block has the same structure: the input shape is torch.Size([1, 256, 1024]) and the output shape is torch.Size([1, 256, 1024]).
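Structurally, the encoder just applies the same Block repeatedly. The sketch below only illustrates this stacking pattern: nn.TransformerEncoderLayer is used as a stand-in with the same input/output shape (its internals differ from the Block described in Parts 5 to 8), and the depth of 12 blocks is an illustrative assumption.

```python
# Stand-in stack; nn.TransformerEncoderLayer is not the Block from Parts 5-8,
# it only shares the [1, 256, 1024] -> [1, 256, 1024] interface. Depth 12 is assumed.
blocks = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
    for _ in range(12)
])

hidden = tokens                                 # [1, 256, 1024]
for block in blocks:
    hidden = block(hidden)                      # shape preserved by every Block
print(hidden.shape)                             # torch.Size([1, 256, 1024])
```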

Part 5: Transformer Encoder — Block — Multi-Head Attention

The input shape is torch.Size([1, 256, 1024]). Define three linear layers nn.Linear(1024, 1024) to produce the Q, K, and V matrices, each of shape torch.Size([1, 256, 1024]).

Convert to multi-head form with 16 heads: the Q, K, and V matrices are each reshaped to torch.Size([1, 16, 256, 64]).

Perform the self-attention operation per head, obtaining torch.Size([1, 16, 256, 64]), and then merge the heads back into torch.Size([1, 256, 1024]).
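The per-head computation might be sketched as follows; here x stands for the Block input of shape [1, 256, 1024], and the function and layer names are illustrative rather than taken from the original code:

```python
# Illustrative names; three projections produce Q, K, V, each [1, 256, 1024]
q_proj = nn.Linear(1024, 1024)
k_proj = nn.Linear(1024, 1024)
v_proj = nn.Linear(1024, 1024)

def multi_head_attention(x):
    B, L, D = x.shape                                    # [1, 256, 1024]
    # Reshape to 16 heads of dimension 64: [1, 16, 256, 64]
    q = q_proj(x).reshape(B, L, 16, 64).transpose(1, 2)
    k = k_proj(x).reshape(B, L, 16, 64).transpose(1, 2)
    v = v_proj(x).reshape(B, L, 16, 64).transpose(1, 2)
    # Scaled dot-product attention within each head
    attn = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, dim=-1)  # [1, 16, 256, 256]
    out = attn @ v                                       # [1, 16, 256, 64]
    # Merge the heads back into [1, 256, 1024]
    return out.transpose(1, 2).reshape(B, L, D)

x = tokens                                               # input to the first Block
attn_out = multi_head_attention(x)                       # [1, 256, 1024]
```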

Part 6: Transformer Encoder — Block — First Residual Connection

Add the torch.Size([1, 256, 1024]) output of the self-attention mechanism to the data before self-attention (also torch.Size([1, 256, 1024])), which forms a residual connection. After a LayerNorm normalization layer, the shape is still torch.Size([1, 256, 1024]).
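In code this is just an addition followed by LayerNorm (a post-norm arrangement, matching the order described above), continuing the sketch:

```python
norm1 = nn.LayerNorm(1024)
x = norm1(x + attn_out)                         # residual + LayerNorm, [1, 256, 1024]
```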

Part 7: Transformer Encoder — Block — MLP

The input shape is torch.Size([1, 256, 1024]). Define the MLP layer nn.Linear(1024, 1024), which keeps the shape at torch.Size([1, 256, 1024]), then apply a GELU activation; the shape remains torch.Size([1, 256, 1024]).
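A minimal sketch of this MLP, matching the nn.Linear(1024, 1024) plus GELU described above:

```python
mlp = nn.Sequential(
    nn.Linear(1024, 1024),
    nn.GELU(),
)
mlp_out = mlp(x)                                # [1, 256, 1024]
```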

Part 8: Transformer Encoder — Block — Second Residual Connection

The torch.Size([1, 256, 1024]) output of the MLP is added to the data before the MLP (also torch.Size([1, 256, 1024])), forming the second residual connection. After a LayerNorm normalization layer, the shape is still torch.Size([1, 256, 1024]).
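This mirrors the first residual connection, continuing the sketch:

```python
norm2 = nn.LayerNorm(1024)
x = norm2(x + mlp_out)                          # output of one Block, [1, 256, 1024]
```

Wrapping Parts 5 to 8 into a single nn.Module and stacking it as in Part 4 gives the full encoder.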

Part 9: Image classification output results

After passing through the series of Transformer Blocks, the final extracted feature has shape torch.Size([1, 256, 1024]).
Averaging over the patch dimension fuses the features extracted from each patch, giving torch.Size([1, 1024]).
Finally, a linear layer nn.Linear(1024, 10) produces the classification output torch.Size([1, 10]).
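A sketch of this classification head, where hidden is the encoder output from the Part 4 sketch and the 10 classes follow the text:

```python
features = hidden.mean(dim=1)                   # average over the 256 patches -> [1, 1024]
classifier = nn.Linear(1024, 10)
logits = classifier(features)                   # [1, 10]
```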

4. Related thoughts

(1) How should the nn.Embedding function be understood, and how does it differ from the nn.Linear function?

Answer blog: https://blog.csdn.net/qq_43391414/article/details/120783887

torch.nn.Embedding turns an index into a vector of a specified dimension: for example, the index 1 becomes one 128-dimensional vector and the index 2 becomes another, but these 128-dimensional vectors are not fixed. They participate in model training and are updated, so that index 1 gradually obtains a better 128-dimensional representation. This behaves very much like a fully connected layer, which is why many people say the Embedding layer is a special case of a fully connected layer.

Embedding and Linear are almost the same; the difference is the input: the former takes an index, the latter takes a one-hot vector. Roughly speaking, assuming there are K distinct values, each to be mapped to an N-dimensional vector:
Embedding(K, N) = One-hot(K) followed by Linear(K, N) without bias
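A small self-contained check of this claim (K and N here are arbitrary illustrative values): a bias-free nn.Linear applied to a one-hot vector picks out exactly the row that nn.Embedding looks up, as long as the two share parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, N = 5, 128                                   # arbitrary illustrative sizes
emb = nn.Embedding(K, N)
lin = nn.Linear(K, N, bias=False)
lin.weight.data = emb.weight.data.t()           # share the same parameters

idx = torch.tensor([1])
one_hot = F.one_hot(idx, num_classes=K).float() # [1, K]
print(torch.allclose(emb(idx), lin(one_hot)))   # True
```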

5. Source code

If you need the source code, you can find the project link on my homepage; the code and experimental results above come from my own experiments:
https://blog.csdn.net/Twilight737
