Overview of Vision Transformer
1. Introduction to Transformer
Transformer, first applied in the field of natural language processing, is a deep neural network based on the self-attention mechanism. Because of its powerful representation capabilities, researchers are looking for ways to apply Transformers to computer vision tasks. On various vision benchmarks, Transformer-based models perform similarly to or better than other types of networks, such as convolutional and recurrent neural networks. Transformers are receiving increasing attention from the computer vision community due to their high performance and lesser need for vision-specific inductive bias. In this paper, we review these Vision Transformer models, categorize them according to different tasks, and analyze their advantages and disadvantages. The main categories we explore include backbone networks, high/mid-level vision, low-level vision, and video processing. We also cover efficient Transformer methods for pushing Transformers into real device-based applications. Additionally, we briefly introduce the self-attention mechanism in computer vision, as it is a fundamental component of Transformers. At the end of this paper, we discuss the challenges faced by Vision Transformers and suggest further research directions.
Here, we review the work related to Transformer-based vision models to track progress in this area. Figure 1 shows the development timeline of the Vision Transformer – no doubt there will be many more milestones in the future.
Figure 1. Key milestones in Transformer development. Vision Transformer models are marked in red.
2. Transformer composition
Transformer was first used in the field of natural language processing (NLP) for machine translation tasks. Figure 2 shows the structure of the original Transformer: it consists of an encoder and a decoder, each containing several Transformer blocks with the same architecture.
The encoder produces an encoding of the input, while the decoder takes all encodings and uses their combined contextual information to generate an output sequence. Each Transformer block consists of multi-head attention layers, feed-forward neural networks, shortcut connections, and layer normalization. Below, we describe each component of the transformer in detail.
2.1 Self-Attention
In the self-attention layer, the input vector is first transformed into three different vectors:
- the query vector q
- the key vector k
- the value vector v
The dimensions of the three vectors are d_q = d_k = d_v = d_model = 512, and the vectors derived from different inputs are packed into three different matrices, namely Q, K and V. Then, the attention function between different input vectors is calculated, as shown in Figure 3 below:
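The packing of inputs into Q, K and V can be sketched as follows. This is a minimal NumPy illustration, not part of the original paper; the function name `project_qkv` and the randomly initialized weights are assumptions for demonstration only.

```python
import numpy as np

def project_qkv(X, Wq, Wk, Wv):
    """Project each input vector (a row of X) into query, key and value
    vectors, then pack them into the matrices Q, K and V."""
    Q = X @ Wq  # queries, shape (n, d_q)
    K = X @ Wk  # keys,    shape (n, d_k)
    V = X @ Wv  # values,  shape (n, d_v)
    return Q, K, V

# Toy example: 3 input vectors with d_model = 512 (weights are random here;
# in a trained model they are learned parameters).
rng = np.random.default_rng(0)
d_model = 512
X = rng.standard_normal((3, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
Q, K, V = project_qkv(X, Wq, Wk, Wv)
print(Q.shape, K.shape, V.shape)  # (3, 512) (3, 512) (3, 512)
```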
Step 1: Calculate the scores S = Q · K^T between different input vectors
Step 2: Normalize the scores by dividing by √d_k for gradient stability
Step 3: Use the softmax function to convert the scores into probabilities P = softmax(S / √d_k)
Step 4: Obtain the weighted value matrix Z = P · V
This process can be unified into a single function (with d_k = d_model = 512):

Attention(Q, K, V) = softmax(QK^T / √d_k) · V    (1)
The logic behind formula (1) is simple.
Step 1 calculates the scores between each pair of different vectors, these scores determine how much attention we give to other words when encoding the word at the current position
Step 2 normalizes the scores to enhance gradient stability for improved training;
Step 3 converts scores into probabilities.
Finally, the value vectors are summed, weighted by these probabilities; value vectors with larger probabilities receive extra attention.
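The four steps above can be sketched as a short NumPy function. This is an illustrative implementation of scaled dot-product attention under the definitions in formula (1); the toy shapes are assumptions for the demo.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, following the four steps above."""
    d_k = Q.shape[-1]
    S = Q @ K.T                 # Step 1: pairwise scores between inputs
    S = S / np.sqrt(d_k)        # Step 2: scale for gradient stability
    P = softmax(S, axis=-1)     # Step 3: scores -> probabilities
    Z = P @ V                   # Step 4: probability-weighted values
    return Z

# Toy example: 4 tokens with d_k = 64.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 64)) for _ in range(3))
Z = attention(Q, K, V)
print(Z.shape)  # (4, 64)
```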
The encoder-decoder attention layer in the decoder module is similar to the self-attention layer in the encoder module, with the following exceptions: the key matrix K and value matrix V are derived from the encoder module, and the query matrix Q is derived from the previous layer.
Note that the previous process is invariant to the position of each word, which means that the self-attention layer lacks the ability to capture the positional information of words in a sentence. However, the sequential nature of sentences in language requires us to incorporate positional information into our encodings. To solve this problem and obtain the final input vector of a word, a positional encoding of dimension d_model is added to the original input embedding. Specifically, the position is encoded with the following formulas:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Here pos represents the position of the word in the sentence, and i indexes the current dimension of the positional encoding. In this way, each element of the positional encoding corresponds to a sinusoid, which allows the Transformer model to learn to attend by relative positions and to extrapolate to longer sequence lengths during inference.
In addition to fixed positional encodings in vanilla transformers, learned and relative positional encodings are used in various models.
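The fixed sinusoidal encoding can be written in a few lines of NumPy. This is a sketch of the formula above, assuming an even d_model; the function name is illustrative.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Fixed sinusoidal positional encoding:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    Assumes d_model is even."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # one sinusoid per pair
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Encoding for a 50-token sequence with d_model = 512; this matrix is
# added element-wise to the input embeddings.
pe = positional_encoding(50, 512)
print(pe.shape)  # (50, 512)
```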
Multi-Head Attention
Multi-head attention is a mechanism that can be used to improve the performance of vanilla self-attention layers.
Note that for a given reference word, we usually want to attend to several other words throughout the sentence. A single self-attention layer limits our ability to focus on one or more specific positions without simultaneously compromising attention to other equally important positions. Multi-head attention addresses this by giving the attention layers different representation subspaces.
Specifically, different query matrices and key-value matrices are used for different heads. These matrices are randomly initialized, and the input vectors can be projected into different representation subspaces after training.
To illustrate this in more detail, given an input vector and a number of attention heads h (here h = 8, since d_model = 512):
- First, the input vector is transformed into three different groups of vectors: the query group, the key group and the value group.
- Each group contains h vectors with dimensions d_q' = d_k' = d_v' = d_model/h = 64.
- Then, the vectors derived from different inputs are packed into three different groups of matrices: {Q_i}, {K_i} and {V_i}, for i = 1, …, h.
The multi-head attention process is shown in the figure below:

MultiHead(Q', K', V') = Concat(head_1, …, head_h) W_o

where Q' (and likewise K' and V') is the concatenation of the {Q_i}, and W_o is the projection weight.
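The multi-head process can be sketched as follows: each head runs scaled dot-product attention in its own 64-dimensional subspace, and the concatenated outputs are mixed by W_o. This is an illustrative NumPy sketch with randomly initialized (rather than learned) weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, Wo):
    """heads is a list of (Wq, Wk, Wv) projection triples, one per head.
    Each head attends in its own d_model/h-dimensional subspace; the
    per-head outputs are concatenated and mixed by the projection Wo."""
    outs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        P = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        outs.append(P @ V)                         # one head's output
    return np.concatenate(outs, axis=-1) @ Wo      # Concat(heads) @ Wo

# Toy example: d_model = 512, h = 8 heads of dimension 64 each.
rng = np.random.default_rng(0)
d_model, h, d_head = 512, 8, 64
X = rng.standard_normal((4, d_model))
heads = [tuple(rng.standard_normal((d_model, d_head)) for _ in range(3))
         for _ in range(h)]
Wo = rng.standard_normal((h * d_head, d_model))
out = multi_head_attention(X, heads, Wo)
print(out.shape)  # (4, 512)
```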
2.2 Other key concepts of Transformer
2.2.1 Feed-Forward Network
A feed-forward network (FFN) is applied after the self-attention layers in each encoder and decoder block. It consists of two linear transformation layers with a nonlinear activation function between them, and can be expressed as

FFN(X) = W_2 σ(W_1 X)

where W_1 and W_2 are the parameter matrices of the two linear transformation layers, and σ is a nonlinear activation function, such as GELU. The dimension of the hidden layer is d_h = 2048.
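A position-wise FFN is just two matrix multiplications around a nonlinearity. Below is a minimal NumPy sketch with the tanh approximation of GELU; the bias terms and random weight initialization are assumptions for the demo.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x**3)))

def ffn(X, W1, b1, W2, b2):
    """FFN(X) = W2 * sigma(W1 * X): two linear layers around a
    nonlinearity, applied independently at every position."""
    return gelu(X @ W1 + b1) @ W2 + b2

# Toy example: d_model = 512 expanded to the hidden dimension d_h = 2048,
# then projected back to d_model.
rng = np.random.default_rng(0)
d_model, d_h = 512, 2048
X = rng.standard_normal((4, d_model))
W1, b1 = rng.standard_normal((d_model, d_h)) * 0.02, np.zeros(d_h)
W2, b2 = rng.standard_normal((d_h, d_model)) * 0.02, np.zeros(d_model)
print(ffn(X, W1, b1, W2, b2).shape)  # (4, 512)
```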
2.2.2 Residual Connection
As shown in Figure 2, a residual connection (black arrow) is added in each sublayer of the encoder and decoder.
This strengthens the flow of information, which leads to higher performance. After the residual connection, layer normalization is applied. The output of these operations can be described as

LayerNorm(X + F(X))
Here X is the input of the sublayer F (self-attention or the FFN); in the self-attention layer, the query, key and value matrices Q, K and V are all derived from the same input matrix X.
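The residual-plus-normalization pattern can be sketched as follows. This is an illustrative NumPy version without the learned scale and shift parameters that full layer normalization usually carries.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # Normalize each row (token vector) to zero mean and unit variance.
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def sublayer(X, f):
    """Residual connection followed by layer normalization:
    LayerNorm(X + f(X)), where f is self-attention or the FFN."""
    return layer_norm(X + f(X))

# Toy example: for illustration, f is just the identity function.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 512))
out = sublayer(X, lambda x: x)
print(out.shape)  # (4, 512)
```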
2.2.3 The last layer in the decoder
The last layer in the decoder is used to convert the stack of vectors back into a word. This is achieved with a linear layer and a softmax layer.
The linear layer projects this vector into a vector of logits with dimension d_word, where d_word is the number of words in the vocabulary. The logit vector is then converted to probabilities using a softmax layer.
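This final step is a single projection plus softmax. The sketch below uses a hypothetical 10,000-word vocabulary and random weights, purely for illustration.

```python
import numpy as np

def output_layer(x, W_vocab, b_vocab):
    """Project a decoder output vector to d_word logits, then apply
    softmax to obtain a probability distribution over the vocabulary."""
    logits = x @ W_vocab + b_vocab        # shape (d_word,)
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()

# Toy example: d_model = 512 projected onto a 10,000-word vocabulary.
rng = np.random.default_rng(0)
d_model, d_word = 512, 10000
x = rng.standard_normal(d_model)
W_vocab = rng.standard_normal((d_model, d_word)) * 0.02
b_vocab = np.zeros(d_word)
probs = output_layer(x, W_vocab, b_vocab)
print(probs.shape)  # (10000,)
```

The predicted word is then typically the index with the highest probability (greedy decoding), or it is sampled from this distribution.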
When used for computer vision (CV) tasks, most Transformers adopt the original Transformer's encoder module. Such a Transformer can be seen as a new type of feature extractor. Compared with CNNs (convolutional neural networks), which focus only on local features, the Transformer can capture long-distance dependencies, which means it can easily obtain global information.
Compared with RNNs (recurrent neural networks), which must compute hidden states sequentially, Transformers are more efficient because the outputs of the self-attention and fully connected layers can be computed in parallel and easily accelerated. From this, we can conclude that further research into the application of Transformers in both computer vision and natural language processing will yield beneficial results.