Vision Transformer Overview, Part I: Introduction and Composition of the Transformer

1. Introduction to Transformer

Transformer, first applied in the field of natural language processing, is a deep neural network based on the self-attention mechanism. Because of its powerful representation capabilities, researchers have been looking for ways to apply Transformers to computer vision tasks. On various vision benchmarks, Transformer-based models perform similarly to or better than other types of networks, such as convolutional and recurrent neural networks. Transformers are receiving increasing attention from the computer vision community due to their high performance and reduced need for vision-specific inductive biases. In this paper, we review these vision Transformer models, categorize them according to different tasks, and analyze their advantages and disadvantages. The main categories we explore include backbone networks, high/mid-level vision, low-level vision, and video processing. We also include efficient Transformer methods for pushing Transformers into real, device-based applications. Additionally, we briefly introduce the self-attention mechanism in computer vision, as it is a fundamental component of Transformers. At the end of this paper, we discuss the challenges faced by vision Transformers and suggest further research directions.

Here, we review the work related to Transformer-based vision models to track progress in this area. Figure 1 shows the development timeline of the Vision Transformer – no doubt there will be many more milestones in the future.


Figure 1. Key milestones in Transformer development. Vision Transformer models are marked in red.

2. Transformer composition

Transformer was first used in the field of natural language processing (NLP) for machine translation tasks. Figure 2 shows the structure of the original Transformer: it consists of an encoder and a decoder, each containing several Transformer blocks with the same architecture.

Figure 2. The structure of the original Transformer (encoder-decoder architecture).

The encoder produces an encoding of the input, while the decoder takes all encodings and uses their combined contextual information to generate an output sequence. Each Transformer block consists of multi-head attention layers, feed-forward neural networks, shortcut connections, and layer normalization. Below, we describe each component of the transformer in detail.

2.1 Self-Attention

In the self-attention layer, the input vector is first transformed into three different vectors:

  1. query vector q

  2. key vector k

  3. value vector v

The three vectors all have dimension dq = dk = dv = dmodel = 512. The vectors derived from different inputs are then packed into three different matrices, namely Q, K, and V. Next, the attention function between the different input vectors is calculated, as shown in Figure 3 below:

Figure 3. The self-attention calculation process.

Step 1: Calculate the scores between the different input vectors: S = Q · K^T

Step 2: Normalize the scores for gradient stability: Sn = S / √dk

Step 3: Use the softmax function to convert the scores into probabilities: P = softmax(Sn)

Step 4: Obtain the weighted value matrix: Z = P · V

This process can be unified into a single function (with dk = dmodel = 512): Attention(Q, K, V) = softmax(Q · K^T / √dk) · V    (1)

The logic behind formula (1) is simple.

Step 1 calculates the scores between each pair of different vectors; these scores determine how much attention we give to other words when encoding the word at the current position.

Step 2 normalizes the scores to enhance gradient stability for improved training;

Step 3 converts scores into probabilities.

Finally, each value vector is weighted by its probability and the weighted vectors are summed. Vectors with larger probabilities receive extra attention.
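The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the random projection matrices W_q, W_k, and W_v stand in for the learned weights described earlier.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # project inputs into query/key/value matrices
    d_k = K.shape[-1]
    S = Q @ K.T                              # Step 1: pairwise scores
    S_n = S / np.sqrt(d_k)                   # Step 2: scale for gradient stability
    P = softmax(S_n, axis=-1)                # Step 3: scores -> probabilities
    return P @ V                             # Step 4: weighted sum of the value vectors

# toy example: 4 tokens, d_model = 512
d_model = 512
X = np.random.randn(4, d_model)
W_q, W_k, W_v = (np.random.randn(d_model, d_model) / np.sqrt(d_model) for _ in range(3))
Z = self_attention(X, W_q, W_k, W_v)         # Z has shape (4, 512)
```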

The encoder-decoder attention layer in the decoder module is similar to the self-attention layer in the encoder module, with the following exception: the key matrix K and value matrix V are derived from the encoder module, while the query matrix Q is derived from the previous decoder layer.

Note that the preceding process is invariant to the position of each word, which means that the self-attention layer lacks the ability to capture the positional information of words in the sentence. However, the sequential nature of sentences in language requires us to incorporate positional information into the encodings. To solve this problem and obtain the final input vector for each word, a positional encoding of dimension dmodel is added to the original input embedding. Specifically, the position is encoded with the following formulas:

PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))

Here pos represents the position of the word in the sentence, and i represents the current dimension of the positional encoding. In this way, each element of the positional encoding corresponds to a sinusoid, which allows the Transformer model to learn to attend by relative positions and to extrapolate to longer sequence lengths during inference.

In addition to the fixed positional encoding of the vanilla Transformer, learned and relative positional encodings are used in various models.
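As a concrete reference, the sinusoidal encoding above can be computed directly from the formulas; the snippet below is a sketch and the sequence length of 10 is only illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    """Fixed sinusoidal positional encoding of shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]                 # word position in the sentence
    i = np.arange(d_model // 2)[None, :]              # dimension index of the encoding
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                       # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)                       # odd dimensions use cosine
    return pe

# the encoding is added to the input embeddings before the first encoder layer
embeddings = np.random.randn(10, 512)
inputs = embeddings + positional_encoding(10, 512)
```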

Multi-Head Attention

Multi-head attention is a mechanism that can be used to improve the performance of vanilla self-attention layers.

Note that for a given reference word, we usually want to attend to several other words throughout the sentence. A single self-attention layer limits our ability to focus on one or more specific positions without simultaneously compromising attention to other, equally important positions. Multi-head attention addresses this by giving the attention layer multiple representation subspaces.

Specifically, different query, key, and value matrices are used for different heads. These matrices are randomly initialized, and after training the input vectors can be projected into different representation subspaces.

To illustrate this in more detail, given an input vector and the number of attention heads h (with dmodel the model dimension):

  1. First, the input vector is transformed into three different groups of vectors: the query group, the key group, and the value group.

  2. In each group, there are h vectors with dimension dq' = dk' = dv' = dmodel/h = 64.

  3. Then, the vectors derived from different inputs are packed into three different groups of matrices: {Qi} (i = 1, ..., h), {Ki} (i = 1, ..., h), and {Vi} (i = 1, ..., h).

  4. The multi-head attention process is then: MultiHead(Q', K', V') = Concat(head_1, ..., head_h) · Wo, where head_i = Attention(Qi, Ki, Vi).

Here Q' (and likewise K' and V') is the concatenation of {Qi}, and Wo is the projection weight matrix.
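The sketch below builds h = 8 heads on top of the `self_attention` function from the earlier example (assumed to be in scope); the per-head projection matrices and the output projection W_o are random placeholders for the learned weights.

```python
import numpy as np

def multi_head_attention(X, heads, W_o):
    """`heads` is a list of (W_q, W_k, W_v) tuples, one per attention head."""
    # each head attends in its own d_model/h-dimensional subspace; the results are concatenated
    Z = np.concatenate([self_attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads], axis=-1)
    return Z @ W_o                                    # final projection back to d_model

h, d_model = 8, 512
d_head = d_model // h                                 # 64, as in the original Transformer
heads = [tuple(np.random.randn(d_model, d_head) / np.sqrt(d_model) for _ in range(3))
         for _ in range(h)]
W_o = np.random.randn(h * d_head, d_model) / np.sqrt(d_model)
Z = multi_head_attention(np.random.randn(4, d_model), heads, W_o)   # shape (4, 512)
```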

2.2 Other key concepts of Transformer

2.2.1 Feed-Forward Network

A feed-forward network (FFN) is applied after the self-attention layer in each encoder and decoder block. It consists of two linear transformation layers with a nonlinear activation function between them, and can be expressed as FFN(X) = W2 σ(W1 X), where W1 and W2 are the parameter matrices of the two linear transformation layers and σ is the nonlinear activation function, such as GELU. The dimension of the hidden layer is dh = 2048.
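In code, the position-wise FFN is just two matrix multiplications with a nonlinearity in between. This is a minimal sketch with random weights; the GELU here uses the common tanh approximation, purely for illustration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, used here purely for illustration
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def ffn(X, W1, b1, W2, b2):
    """Position-wise feed-forward network, applied to each token independently."""
    return gelu(X @ W1 + b1) @ W2 + b2    # two linear layers with a nonlinearity in between

d_model, d_h = 512, 2048                  # hidden dimension d_h = 2048, as in the text
W1, b1 = np.random.randn(d_model, d_h) / np.sqrt(d_model), np.zeros(d_h)
W2, b2 = np.random.randn(d_h, d_model) / np.sqrt(d_h), np.zeros(d_model)
out = ffn(np.random.randn(4, d_model), W1, b1, W2, b2)   # shape (4, 512)
```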

2.2.2 Residual Connection

As shown in Figure 2, a residual connection (black arrow) is added in each sublayer of the encoder and decoder.


This strengthens the flow of information and leads to higher performance. After the residual connection, layer normalization is applied. The output of these operations can be described as LayerNorm(X + Attention(X)).

Here X is used as the input of the self-attention layer, and the query, key, and value matrices Q, K, and V are all derived from the same input matrix X.
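A minimal sketch of the residual-plus-normalization pattern, assuming `sublayer` is either the attention or FFN function defined in the earlier examples; the learned scale and shift of layer normalization are omitted for brevity.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (learned scale/shift omitted)."""
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return (X - mean) / (std + eps)

def residual_sublayer(X, sublayer):
    # LayerNorm(X + sublayer(X)): the post-norm arrangement of the original Transformer
    return layer_norm(X + sublayer(X))

# e.g. one encoder sublayer built from the attention sketch above:
# out = residual_sublayer(X, lambda X: multi_head_attention(X, heads, W_o))
```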

2.2.3 The last layer in the decoder

The last layer in the decoder is used to convert the stack of vectors back into a word. This is achieved with a linear layer and a softmax layer.


The linear layer projects the vector to a logits vector of dimension dword, where dword is the number of words in the vocabulary. The softmax layer then converts the logits into probabilities over the vocabulary.
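The output projection can be sketched as a single matrix multiplication followed by a softmax over the vocabulary; the vocabulary size of 30000 and the random weight matrix below are illustrative placeholders.

```python
import numpy as np

def decode_to_words(X, W_vocab):
    """Project decoder outputs of shape (n, d_model) to word probabilities of shape (n, d_word)."""
    logits = X @ W_vocab                               # linear layer: d_model -> d_word logits
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)          # softmax over the vocabulary
    return probs.argmax(axis=-1), probs                # predicted word indices and distribution

d_model, d_word = 512, 30000                           # d_word: vocabulary size (illustrative)
W_vocab = np.random.randn(d_model, d_word) / np.sqrt(d_model)
word_ids, probs = decode_to_words(np.random.randn(4, d_model), W_vocab)
```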

When used for computer vision (CV) tasks, most Transformers adopt the original Transformer's encoder module. Such a Transformer can be seen as a new type of feature extractor. Compared with CNNs (convolutional neural networks), which focus only on local features, the Transformer can capture long-range dependencies, which means it can easily obtain global information.

Compared to RNNs (Recurrent Neural Networks), which have to compute hidden states sequentially, Transformers are more efficient because the outputs of self-attention and fully-connected layers can be computed in parallel and easily accelerated. From this, we can conclude that further research into the application of transformers in computer vision and natural language processing will yield beneficial results.
