Overview of Vision Transformer

1. Introduction to Transformer

The Transformer, first applied in the field of natural language processing, is a deep neural network based on the self-attention mechanism. Because of its powerful representation capability, researchers have been looking for ways to apply the Transformer to computer vision tasks. On various vision benchmarks, Transformer-based models perform comparably to or better than other types of networks, such as convolutional and recurrent neural networks. Transformers are receiving increasing attention from the computer vision community due to their high performance and their reduced need for vision-specific inductive biases. In this paper, we review these vision transformer models, categorize them according to different tasks, and analyze their advantages and disadvantages. The main categories we explore include backbone networks, high/mid-level vision, low-level vision, and video processing. We also include efficient transformer methods for pushing transformers into real-device applications. In addition, we briefly revisit the self-attention mechanism in computer vision, as it is a fundamental component of the Transformer. At the end of this paper, we discuss the challenges faced by vision transformers and suggest directions for further research.

Here, we review the work related to Transformer-based vision models to track progress in this area. Figure 1 shows the development timeline of the Vision Transformer – no doubt there will be many more milestones in the future.

[Figure 1. Key milestones in Transformer development. Vision Transformer models are marked in red.]

2. Components of the Transformer

The Transformer was first used in the field of natural language processing (NLP) for machine translation tasks. Figure 2 shows the structure of the original Transformer: it consists of an encoder and a decoder, each containing several Transformer blocks with the same architecture.

[Figure 2. The structure of the original Transformer.]

The encoder produces an encoding of the input, while the decoder takes all encodings and uses their combined contextual information to generate an output sequence. Each Transformer block consists of multi-head attention layers, feed-forward neural networks, shortcut connections, and layer normalization. Below, we describe each component of the transformer in detail.

2.1 Self-Attention

In the self-attention layer, the input vector is first transformed into three different vectors:

  1. the query vector q

  2. the key vector k

  3. the value vector v

All three vectors have dimension $d_q = d_k = d_v = d_{model} = 512$, and the vectors derived from the different inputs are packed into three matrices, Q, K, and V. The attention function between the input vectors is then computed as illustrated in Figure 3:

[Figure 3. The self-attention process.]

Step 1: Compute the scores between the different input vectors: $S = Q \cdot K^{T}$

Step 2: Normalize the scores for gradient stability: $S_n = S / \sqrt{d_k}$

Step 3: Convert the scores into probabilities with the softmax function: $P = \mathrm{softmax}(S_n)$

Step 4: Obtain the weighted value matrix: $Z = P \cdot V$

This process can be unified into a single function (with $d_k = d_{model} = 512$):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q \cdot K^{T}}{\sqrt{d_k}}\right) \cdot V \tag{1}$$

The logic behind formula (1) is simple.

Step 1 computes the scores between each pair of input vectors; these scores determine how much attention we pay to other words when encoding the word at the current position.

Step 2 normalizes the scores to improve gradient stability during training.

Step 3 converts the scores into probabilities.

Finally, Step 4 weights each value vector by its probability and sums the results, so that vectors with larger probabilities receive more attention from the following layers.
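To make these steps concrete, here is a minimal NumPy sketch of scaled dot-product attention following formula (1); the function name and toy shapes are illustrative and not taken from the survey.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Sketch of formula (1): softmax(Q·K^T / sqrt(d_k)) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # Steps 1-2: pairwise scores, scaled for gradient stability
    scores -= scores.max(axis=-1, keepdims=True)          # subtract the row maximum for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # Step 3: softmax over each row
    return probs @ V                                      # Step 4: weighted sum of the value vectors

# Toy example: 4 input tokens with d_q = d_k = d_v = d_model = 512
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 512)) for _ in range(3))
Z = scaled_dot_product_attention(Q, K, V)                 # Z has shape (4, 512)
```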

The encoder-decoder attention layer in the decoder module is similar to the self-attention layer in the encoder module, except that the key matrix K and value matrix V are derived from the encoder module, while the query matrix Q comes from the previous decoder layer.

Note that this process is invariant to the position of each word, which means that the self-attention layer lacks the ability to capture the positional information of words in a sentence. However, the sequential nature of sentences requires us to incorporate positional information into our encoding. To address this and obtain the final input vector for each word, a positional encoding with dimension $d_{model}$ is added to the original input embedding. Specifically, the position is encoded with the following formula:

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right)$$

Here $pos$ denotes the position of the word in the sentence and $i$ denotes the current dimension of the positional encoding. In this way, each element of the positional encoding corresponds to a sinusoid, which allows the Transformer to learn to attend by relative positions and to extrapolate to longer sequence lengths during inference.
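The formula above can be computed directly. The following is a minimal NumPy sketch of the fixed sinusoidal encoding, assuming $d_{model} = 512$; the sequence length and embeddings are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model=512):
    """Fixed positional encoding of the vanilla Transformer: sine on even dimensions, cosine on odd ones."""
    pos = np.arange(seq_len)[:, None]                     # word positions, shape (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # dimension index i, shape (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)     # pos / 10000^(2i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                          # PE(pos, 2i)   = sin(...)
    pe[:, 1::2] = np.cos(angles)                          # PE(pos, 2i+1) = cos(...)
    return pe

# The encoding is simply added to the input embeddings before the first encoder layer
embeddings = np.random.randn(10, 512)                     # 10 illustrative word embeddings
inputs = embeddings + sinusoidal_positional_encoding(10, 512)
```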

In addition to fixed positional encodings in vanilla transformers, learned and relative positional encodings are used in various models.

Multi-Head Attention

Multi-head attention is a mechanism that can be used to improve the performance of vanilla self-attention layers.

Note that when encoding a given word, we usually want to attend to several other words throughout the sentence. A single self-attention layer limits our ability to focus on one or several specific positions without simultaneously weakening the attention paid to other, equally important positions. Multi-head attention addresses this by giving the attention layer multiple different representation subspaces.

Specifically, different query matrices and key-value matrices are used for different heads. These matrices are randomly initialized, and the input vectors can be projected into different representation subspaces after training.

To illustrate this in more detail, given an input vector and the number of attention heads h (with model dimension $d_{model}$):

  1. The input vectors are first transformed into three different groups of vectors: the query group, the key group, and the value group.

  2. Each group contains h vectors with dimension $d_{q'} = d_{k'} = d_{v'} = d_{model}/h = 64$.

  3. The vectors derived from the different inputs are then packed into three different groups of matrices: $\{Q_i\}_{i=1}^{h}$, $\{K_i\}_{i=1}^{h}$, and $\{V_i\}_{i=1}^{h}$.

  4. The multi-head attention process is then

$$\mathrm{MultiHead}(Q', K', V') = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{o}, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)$$

Here Q' is the concatenation of $\{Q_i\}_{i=1}^{h}$ (and likewise for K' and V'), and $W^{o}$ is the projection weight matrix.
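A minimal sketch of this procedure is given below; the small attention helper repeats the scaled dot-product attention from earlier so the sketch is self-contained, and the weight shapes and initialization are illustrative placeholders rather than trained parameters.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, softmax(Q·K^T / sqrt(d_k))·V, as in formula (1)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    """Each head i projects X into its own query/key/value subspace, attends there,
    and the h head outputs are concatenated and projected back with W_o."""
    heads = [attention(X @ W_q[i], X @ W_k[i], X @ W_v[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy shapes: d_model = 512 and h = 8 heads of dimension 64 each
d_model, h, d_head, n = 512, 8, 64, 10
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((h, d_model, d_head)) * 0.02 for _ in range(3))
W_o = rng.standard_normal((h * d_head, d_model)) * 0.02
X = rng.standard_normal((n, d_model))
Z = multi_head_attention(X, W_q, W_k, W_v, W_o, h)        # shape (10, 512)
```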

2.2 Other key concepts of Transformer

2.2.1 Feed-Forward Network

A feed-forward network (FFN) is applied after the self-attention layers in each encoder and decoder block. It consists of two linear transformation layers with a nonlinear activation function between them, and can be expressed as $\mathrm{FFN}(X) = W_2\,\sigma(W_1 X)$, where $W_1$ and $W_2$ are the parameter matrices of the two linear transformation layers and $\sigma$ is a nonlinear activation function such as GELU. The dimension of the hidden layer is $d_h = 2048$.
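A minimal sketch of the FFN, assuming GELU as the activation and the dimensions stated above; the weights below are random placeholders.

```python
import numpy as np

def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(X, W1, W2):
    """FFN(X) = W2 · sigma(W1 · X): two linear layers with a nonlinearity in between."""
    return gelu(X @ W1) @ W2

# d_model = 512, hidden dimension d_h = 2048
d_model, d_h, n = 512, 2048, 10
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_h)) * 0.02
W2 = rng.standard_normal((d_h, d_model)) * 0.02
X = rng.standard_normal((n, d_model))
out = feed_forward(X, W1, W2)                             # shape (10, 512)
```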

2.2.2 Residual Connection

As shown in Figure 2, a residual connection (black arrow) is added to each sub-layer of the encoder and decoder. This strengthens the flow of information and helps achieve higher performance. Layer normalization is applied after the residual connection, and the output of these operations can be described as

$$\mathrm{LayerNorm}\big(X + \mathrm{Attention}(X)\big)$$

Here X is used as the input to the self-attention layer, and the query, key, and value matrices Q, K, and V are all derived from the same input matrix X.
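As a sketch, the residual connection and layer normalization can be written as a small wrapper around any sub-layer; the dummy sub-layer below is only a placeholder for self-attention or the FFN, and the learned scale/shift of layer normalization is omitted.

```python
import numpy as np

def layer_norm(X, eps=1e-6):
    """Normalize each row (token vector) to zero mean and unit variance."""
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return (X - mean) / (std + eps)

def residual_sublayer(X, sublayer):
    """Residual connection followed by layer normalization: LayerNorm(X + Sublayer(X))."""
    return layer_norm(X + sublayer(X))

# Example with a dummy sub-layer; in the encoder, `sublayer` would be self-attention or the FFN,
# with Q, K, and V all derived from the same input matrix X.
X = np.random.randn(10, 512)
out = residual_sublayer(X, lambda inp: 0.1 * inp)
```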

2.2.3 The last layer in the decoder

The last layer in the decoder is used to convert the stack of vectors back into a word. This is achieved with a linear layer and a softmax layer.


The linear layer projects the vector into a logits vector of dimension $d_{word}$, where $d_{word}$ is the number of words in the vocabulary. A softmax layer then converts the logits into probabilities over the vocabulary.
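A minimal sketch of this final projection and softmax; the vocabulary size, projection matrix, and decoder hidden states below are illustrative placeholders.

```python
import numpy as np

def decoder_output_probs(H, W_vocab):
    """Project decoder outputs to logits over the vocabulary, then softmax into word probabilities."""
    logits = H @ W_vocab                                   # (n, d_word) logit vectors
    logits -= logits.max(axis=-1, keepdims=True)           # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# d_model = 512, an illustrative vocabulary of d_word = 30000 words
d_model, d_word = 512, 30000
rng = np.random.default_rng(0)
W_vocab = rng.standard_normal((d_model, d_word)) * 0.02
H = rng.standard_normal((5, d_model))                      # 5 decoder output vectors
P = decoder_output_probs(H, W_vocab)                       # each row sums to 1
```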

When used for computer vision (CV) tasks, most transformers adopt the encoder module of the original Transformer, which can be viewed as a new type of feature extractor. Compared with CNNs (convolutional neural networks), which focus only on local features, the Transformer can capture long-distance dependencies and therefore easily obtains global information.

Compared to RNNs (Recurrent Neural Networks), which have to compute hidden states sequentially, Transformers are more efficient because the outputs of self-attention and fully-connected layers can be computed in parallel and easily accelerated. From this, we can conclude that further research into the application of transformers in computer vision and natural language processing will yield beneficial results.

3. VISION TRANSFORMER

In this section, we review the applications of Transformer-based models in computer vision, including image classification, high/mid-level vision, low-level vision, and video processing. We also briefly summarize the applications of the self-attention mechanism and of model compression methods for efficient transformers.

3.1 Backbone for Representation Learning

Compared with text, images involve more dimensions, more noise, and more redundancy, and are therefore considered more difficult to model generatively. Apart from pure CNNs, the Transformer can also serve as part of the backbone: for example, ResNet can be adopted as a baseline and the final stage of convolutions replaced with a vision transformer.

The low-level features extracted by the convolutional layer are input to the vision transformer, and then a tokenizer is used to group pixels into a small number of visual tokens, each token representing a semantic concept in the image.

These visual tokens are used directly for image classification, with the transformer modeling the relationships between the tokens.

[Figure 4. Taxonomy of transformer backbones: pure transformers and transformers combined with convolution.]

As shown in Figure 4, these works can be divided into those that use transformers purely and those that combine CNNs with transformers.

We summarize the results of these models in Table 2 and Figure 6 to show the development of backbones.

[Table 2. Results of the transformer backbone models.]

[Figure 6. Development of the backbone models.]

Besides supervised learning, self-supervised learning has also been explored in visual transformers.

3.1.1 Pure Transformer

[Figure 5. The framework of ViT.]

Base ViT. The Vision Transformer (ViT) is a pure transformer applied directly to a sequence of image patches for image classification. It follows the original Transformer design as closely as possible. The framework of ViT is shown in Figure 5: the 2D image is reshaped into a sequence of flattened patches, each patch is embedded through a linear projection, and positional encodings are added to the patch embeddings.

Let (p, p) be the resolution of each patch and (h, w) the resolution of the image; the effective sequence length of the transformer is then $n = hw/p^2$. Since the Transformer uses a constant width across all of its layers, a trainable linear projection maps each vectorized patch to the model dimension d; its outputs are called patch embeddings.

It is worth noting that ViT uses only the standard Transformer encoder (except for the placement of layer normalization), whose output is followed by an MLP head. In most cases, ViT is pre-trained on large datasets and then fine-tuned on smaller downstream datasets.
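The patch-embedding step described above can be sketched as follows; the weights are random placeholders and d = 768 (the ViT-Base width) is used only for illustration.

```python
import numpy as np

def vit_patch_embedding(image, p=16, d=768):
    """Sketch of ViT input processing: split an (h, w, c) image into p x p patches,
    flatten each patch, project it to the model dimension d with a trainable linear
    map, and add positional embeddings (both random placeholders here)."""
    h, w, c = image.shape
    n = (h // p) * (w // p)                                   # effective sequence length n = hw / p^2
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(n, p * p * c)   # one flattened patch per row
    W_proj = np.random.randn(p * p * c, d) * 0.02             # trainable linear projection (random here)
    pos_embed = np.random.randn(n, d) * 0.02                  # positional embedding (random here)
    return patches @ W_proj + pos_embed                       # patch embeddings fed to the encoder

# A 224 x 224 RGB image with 16 x 16 patches gives a sequence of 196 tokens
tokens = vit_patch_embedding(np.random.rand(224, 224, 3))     # shape (196, 768)
```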

**Variants of ViT.** Following the ViT paradigm, a series of ViT variants have been proposed to improve performance on vision tasks. The main avenues include enhancing locality, improving the self-attention mechanism, and architecture design.

There are also other directions for further improving the vision transformer, such as positional encoding, normalization strategies, shortcut connections, and removing attention.

3.1.2 Transformer With Convolution

The feed-forward network (FFN) in each Transformer block can be combined with convolutional layers to promote correlation between neighboring tokens. In addition, some researchers have shown that transformer-based models can be harder to fit well to the data; in other words, they are very sensitive to the choice of optimizer, hyperparameters, and training schedule.

Visformer reveals the gap between transformers and CNNs using two different training settings. The first is the standard setting for CNNs: a shorter training schedule and data augmentation consisting only of random cropping and horizontal flipping. The other uses a longer training schedule and stronger data augmentation. Another line of work changes ViT's early visual processing by replacing its patch-embedding stem with a standard convolutional stem, finding that this change lets ViT converge faster and allows the use of either AdamW or SGD without a significant loss of accuracy. Besides these two works, other methods also choose to add a convolutional stem on top of the transformer.

3.1.3 Self-supervised representation learning

Generation-based approach: Here we briefly describe how iGPT works. The approach consists of a pre-training stage followed by a fine-tuning stage. In the pre-training stage, both an autoregressive objective and a BERT-style objective are explored. A sequence-transformer architecture is adopted to predict pixels instead of language tokens. When combined with early stopping, pre-training can be regarded as a beneficial initialization or regularizer. In the fine-tuning stage, a small classification head is added to the model, which helps optimize the classification objective and adapts all weights.

Image pixels are converted into sequence data by k-means clustering of their colour values. Given an unlabeled dataset $X$ consisting of high-dimensional data $x = (x_1, \ldots, x_n)$, the model is trained by minimizing the negative log-likelihood of the data:

$$L_{AR} = \mathop{\mathbb{E}}_{x \sim X}\left[-\log p(x)\right]$$

p(x) is the probability density of the image data, which can be modeled as:

$$p(x) = \prod_{i=1}^{n} p\left(x_{\pi_i} \mid x_{\pi_1}, \ldots, x_{\pi_{i-1}}, \theta\right)$$

The identity permutation $\pi_i = i$ for $1 \le i \le n$ is adopted, which is also known as the raster order.
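As a sketch of the autoregressive objective under the raster order, the per-sequence negative log-likelihood can be computed as below; the palette size, sequence, and predicted distributions are illustrative placeholders rather than outputs of a trained iGPT model.

```python
import numpy as np

def negative_log_likelihood(probs, sequence):
    """-log p(x) with p(x) factorized autoregressively in raster order:
    probs[i] is the model's predicted distribution over the palette at position i,
    conditioned on sequence[:i]."""
    token_logps = np.log(probs[np.arange(len(sequence)), sequence])
    return -token_logps.sum()

# Toy example: a 9-step pixel sequence over a 512-entry colour palette (as produced by k-means)
rng = np.random.default_rng(0)
palette_size, n = 512, 9
sequence = rng.integers(0, palette_size, size=n)            # quantized pixels in raster order
logits = rng.standard_normal((n, palette_size))             # placeholder model outputs
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
loss = negative_log_likelihood(probs, sequence)
```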

iGPT and ViT are two pioneering works that apply transformers to vision tasks. The differences between iGPT and ViT-like models are mainly reflected in three aspects:

  1. The input to iGPT is a sequence of palette indices obtained by clustering pixels, while ViT evenly divides the image into a number of local patches;

  2. The iGPT architecture is an encoder-decoder framework, while ViT only has a transformer encoder;

  3. iGPT is trained with an autoregressive self-supervised loss, while ViT is trained with a supervised image classification task.

Contrastive-learning-based approach: Contrastive learning is currently the most popular form of self-supervised learning in computer vision, and it has been applied to the unsupervised pre-training of vision transformers.

3.1.4 Discussion

Multi-head self-attention, the multi-layer perceptron, shortcut connections, layer normalization, positional encoding, and the network topology all play key roles in visual recognition with vision transformers. As mentioned above, many works have been proposed to improve the effectiveness and efficiency of vision transformers. From the results in Figure 6, it can be seen that combining CNNs and transformers achieves better performance, which indicates that they complement each other through local and global connections. Further research on backbone networks will benefit the entire vision community. As for self-supervised representation learning with vision transformers, efforts are still needed to match the success of large-scale pre-training in NLP.

3.2 High/Mid-Level Vision Tasks

Recently, there has been increasing interest in using Transformers for high/mid-level computer vision tasks such as object detection, lane detection, segmentation, and pose estimation. We review these methods in this section.

3.2.1 Generic object detection

Traditional object detection methods are mainly based on CNN, while Transformer-based object detection methods have attracted extensive attention due to their superior performance.

Some object detection methods try to exploit the self-attention mechanism of transformers, and then enhance specific modules of modern detectors, such as feature fusion modules and prediction heads.

Transformer-based object detection methods can be broadly classified into two categories:

  1. Transformer-based set prediction methods

  2. Transformer-based backbone methods, as shown in Figure 7

[Figure 7. The two categories of transformer-based object detection.]

Table 3 shows the detection results of the different Transformer-based object detectors mentioned above on the COCO 2017 val set.

[Table 3. Detection results of transformer-based object detectors on the COCO val set.]

Transformer-based set prediction detection (DETR)

DETR, a simple and fully end-to-end object detector, treats object detection as an intuitive set prediction problem, eliminating traditional hand-crafted components such as anchor generation and non-maximum suppression (NMS) post-processing.

[Figure: the overall framework of DETR.]

  1. To supplement the image features with positional information, fixed positional encodings are added to them before they are fed into the encoder-decoder transformer.

  2. The decoder consumes the embeddings from the encoder along with N learned positional encodings (object queries) and produces N output embeddings. (Here N is a predefined parameter, typically larger than the number of objects in an image.)

  3. A simple feed-forward network (FFN) computes the final predictions, which include the bounding box coordinates and a class label indicating a specific class of object (or the absence of an object).

  4. Unlike the original Transformer, which computes predictions sequentially, DETR decodes the N objects in parallel. DETR employs a bipartite matching algorithm to assign predicted objects to ground-truth objects (a minimal matching sketch follows this list). As shown in Equation (11), the Hungarian loss is used to compute the loss over all matched pairs of objects:

$$\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N}\left[-\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big)\right] \tag{11}$$
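The bipartite matching step in item 4 can be sketched with the Hungarian algorithm from SciPy; the cost below uses only a classification term and an L1 box term (the actual DETR cost also includes a generalized IoU term), and all inputs are illustrative placeholders.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions_to_targets(pred_boxes, pred_probs, gt_boxes, gt_labels):
    """Bipartite matching sketch: build an N x M cost matrix from classification
    probability and L1 box distance, then solve it with the Hungarian algorithm so
    each ground-truth object is assigned exactly one prediction."""
    cls_cost = -pred_probs[:, gt_labels]                                      # (N, M) classification cost
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (N, M) L1 box cost
    cost = cls_cost + box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)                            # optimal one-to-one assignment
    return list(zip(pred_idx, gt_idx))

# N = 5 predictions (object queries), M = 2 ground-truth objects, 4 classes + "no object"
rng = np.random.default_rng(0)
pred_boxes = rng.random((5, 4))
pred_probs = rng.dirichlet(np.ones(5), size=5)      # each row is a distribution over 5 labels
gt_boxes = rng.random((2, 4))
gt_labels = np.array([1, 3])
matches = match_predictions_to_targets(pred_boxes, pred_probs, gt_boxes, gt_labels)
```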

DETR has shown impressive performance on object detection, delivering accuracy and speed comparable to the popular and well-established Faster R-CNN baseline on the COCO benchmark. DETR is a new design for transformer-based object detection frameworks and enables the community to develop fully end-to-end detectors. However, vanilla DETR poses several challenges, specifically a long training schedule and poor performance on small object detection.

Deformable DETR requires 10x less training cost and runs 1.6x faster at inference than DETR. By adopting an iterative bounding-box refinement method and a two-stage scheme, Deformable DETR can further improve detection performance.

Efficient DETR incorporates dense prior information into the detection pipeline via an additional region proposal network. The better initialization enables it to use only one decoder layer instead of six, achieving competitive performance with a more compact network.

Transformer-based detection backbone (R-CNN)

Unlike DETR, which redesigns object detection as a set prediction task via transformers, Beal et al. [4] use transformers as the backbone of common detection frameworks such as Faster R-CNN. The input image is divided into several patches and fed into the vision transformer, whose output embeddings are reorganized according to their spatial information and then passed through a detection head to obtain the final result. A large-scale pre-trained transformer backbone brings benefits to the proposed ViT-FRCNN.

There are also quite a few approaches that explore versatile vision transformer backbone designs and transfer these backbones to traditional detection frameworks such as RetinaNet and Cascade R-CNN. For example, Swin Transformer achieves around a 4 box AP gain over a ResNet-50 backbone with similar FLOPs across various detection frameworks.

Transformer-based pre-training (YOLO)

Inspired by the pre-trained Transformer scheme in NLP, several methods have been proposed to explore different pre-training schemes for Transformer-based object detection. Dai et al. proposed unsupervised pre-training for object detection (UP-DETR).

Specifically, a new unsupervised pretext task, random query patch detection, is proposed to pre-train the DETR model. With this unsupervised pre-training scheme, UP-DETR significantly improves detection accuracy on a relatively small dataset (PASCAL VOC). On the COCO benchmark, which has sufficient training data, UP-DETR still outperforms DETR, demonstrating the effectiveness of the unsupervised pre-training scheme.

YOLOS first removes the classification token in ViT and appends learnable detection tokens. In addition, a bipartite matching loss is used to perform set prediction on objects. With this simple pre-training scheme on the ImageNet dataset, YOLOS shows competitive object detection performance on the COCO benchmark.

3.2.2 Segmentation

Segmentation is an important research topic in the field of computer vision, broadly including panoptic segmentation, instance segmentation, and semantic segmentation. Vision transformers have also shown impressive potential in the field of segmentation.

Transformer (DETR) for Panoptic Segmentation

DETR can be naturally extended to panoptic segmentation tasks and achieves competitive results by attaching a mask head to the decoder.

Transformer (VisTR) for instance segmentation

VisTR is a transformer-based video instance segmentation model.

Transformer (SETR) for Semantic Segmentation

SETR is a transformer-based semantic segmentation network.

Transformer (Cell-DETR) for medical image segmentation

This is a U-Net-like pure Transformer for medical image segmentation: tokenized image patches are fed into a Transformer-based U-shaped encoder-decoder architecture with skip connections for local and global semantic feature learning.

3.2.3 Pose Estimation

Human pose and hand pose estimation are fundamental topics that have attracted significant interest from the research community. Articulated pose estimation is akin to a structured prediction task, aiming to predict joint coordinates or mesh vertices from input RGB/D images. Two categories are worth noting:

  1. Transformers for Hand Pose Estimation
  2. Transformers for Human Pose Estimation

3.2.4 Other tasks

There are also many different high/intermediate vision tasks that have explored the use of vision transformers for better performance. We will briefly review the following tasks.

Pedestrian Detection

Lane Detection

Scene Graph

Tracking

Re-Identification

Point Cloud Learning

These topics are not explained further here; interested readers can explore them on their own.

3.3 Low-Level Vision

There is little research on applying transformers in low-level vision domains such as image super-resolution and image generation. These tasks usually have images as output (e.g., high-resolution or denoised images), which are more challenging than higher-level vision tasks such as classification, segmentation, and detection, where the output is labels or boxes.

3.3.1 Image Generation

[Figure 9. A general framework for transformers in image generation.]

3.3.2 Image Processing

3.4 Video processing

3.5 Multimodal tasks

Due to the success of Transformers in text-based natural language processing tasks, many studies are keen to exploit their potential when dealing with multimodal tasks. One such example is VideoBERT, which uses a CNN-based module to preprocess videos to obtain representation tokens.

3.6 Efficient Transformer

While Transformer models have been successful in a variety of tasks, their high memory and computational resource requirements hinder their deployment on resource-constrained devices such as mobile phones. In this section, we review research on compressing and accelerating transformer models for efficient implementation, including network pruning, low-rank decomposition, knowledge distillation, network quantization, and compact architecture design.

Table 4 lists some representative compressed transformer models.

[Table 4. Representative compressed transformer models.]

[Figure 13. Different methods of compressing a transformer.]

The methods described above take different approaches when trying to identify redundancies in the transformer model (see Figure 13). Pruning and factorization methods usually require predefined redundant models. Specifically, pruning focuses on reducing the number of components (e.g., layers, heads) in the transformer model, while decomposition represents an original matrix with multiple small matrices. Compact models can also be designed directly by hand (requiring sufficient expertise) or automatically (e.g. via NAS). The resulting compact model can be further represented with low bits by quantization methods for efficient deployment on resource-constrained devices.
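As a small illustration of the quantization idea mentioned above, the sketch below applies symmetric post-training int8 quantization to a single weight matrix; this is a simplified example, not the procedure of any specific compressed transformer.

```python
import numpy as np

def quantize_int8(W):
    """Symmetric post-training quantization: represent a float32 weight matrix
    with int8 values plus a single scale factor."""
    scale = np.abs(W).max() / 127.0
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_q, scale

def dequantize(W_q, scale):
    return W_q.astype(np.float32) * scale

W = np.random.randn(512, 512).astype(np.float32)    # e.g. one attention projection matrix
W_q, scale = quantize_int8(W)                        # 4x smaller storage (8 bits vs 32 bits)
error = np.abs(W - dequantize(W_q, scale)).mean()    # small reconstruction error
```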

4. Conclusion and Discussion

Compared with CNNs, the strong performance and huge potential of transformers make them a hot topic in the field of computer vision. To discover and exploit the power of transformers, many methods have been proposed in recent years, as summarized in this survey. These methods perform well on a wide range of vision tasks, including backbone design, high/mid-level vision, low-level vision, and video processing. However, the potential of transformers for computer vision has not yet been fully explored, which means several challenges remain to be solved. In this section, we discuss these challenges and provide insights into future prospects.

To advance the development of visual transformers, we propose several potential research directions.

One direction concerns the effectiveness and efficiency of transformers in computer vision. The goal is to develop effective and efficient vision transformers; specifically, high-performance transformers with low resource cost. Performance determines whether the model can be applied to real-world applications, while resource cost influences deployment on devices. Effectiveness is usually correlated with efficiency, so determining how to achieve a better balance between them is a meaningful topic for future research.

There are various types of neural networks, such as CNNs, RNNs, and transformers. In the CV world, CNNs used to be the mainstream choice, but transformers are now becoming increasingly popular. CNNs capture inductive biases such as translation equivariance and locality, whereas ViT relies on large-scale training to go beyond such inductive biases. From the currently available evidence [15], CNNs perform well on small datasets, while transformers perform better on large datasets. The question for the future is whether to use CNNs or transformers.


Origin blog.csdn.net/qq_43537420/article/details/131221043