Visual Transformers (VTs): Enhancing CNNs with Transformers


Summary of deep learning knowledge points

Column link:
https://blog.csdn.net/qq_39707285/article/details/124005405

This column summarizes key knowledge points in deep learning. Starting from the major dataset competitions, it introduces the winning algorithms over the years, and also covers important topics such as loss functions, optimizers, classic algorithms, and optimization strategies like Bag of Freebies (BoF).


From RNN to Attention to Transformer series

Column link:
https://blog.csdn.net/qq_39707285/category_11814303.html

This column introduces RNN, LSTM, Attention, and Transformer, together with their code implementations.


YOLO series target detection algorithm

Column link:
https://blog.csdn.net/qq_39707285/category_12009356.html

This column covers the YOLO family of detection algorithms in detail, including the official YOLOv1, YOLOv2, YOLOv3, YOLOv4, Scaled-YOLOv4, and YOLOv7, as well as YOLOv5, Meituan's YOLOv6, PaddlePaddle's PP-YOLO and PP-YOLOv2, and also YOLOR, YOLOX, YOLOS, and more.


Visual Transformer

Column link:
https://blog.csdn.net/qq_39707285/category_12184436.html

This column introduces various Visual Transformers in detail, including algorithms applied to classification, detection, and segmentation.



VTs
VTs enhance a CNN with a Transformer. Paper: "Visual Transformers: Token-based Image Representation and Processing for Computer Vision".

1 Introduction

  The success of computer vision depends on:

  • Representing an image as an array of uniformly arranged pixels;
  • Applying convolutions to extract highly localized features.

  But there are several problems:

  • Convolution treats all image pixels equally, regardless of their importance;
  • Convolutional filters model all concepts across all images, regardless of content;
  • Convolutions have difficulty relating spatially distant concepts.

  To address these problems, this paper rethinks the fundamentals of the pixel-convolution paradigm and introduces the Visual Transformer (VT), a new paradigm for representing and processing high-level concepts in images, as shown in the figure below.
[Figure: overview of the Visual Transformer pipeline]
  The key insight of this paper is that a sentence of only a few words (visual tokens) is sufficient to describe the high-level concepts in an image. Spatial attention is used to convert feature maps into a compact set of semantic tokens. These tokens are then fed to a Transformer that captures token interactions. The resulting visual tokens can be used directly for image-level prediction tasks (e.g., classification), or spatially re-projected onto the feature map for pixel-level prediction tasks (e.g., segmentation).

  Unlike convolution, the VT in this paper can better deal with the above three challenges:

  1. Distribute computation wisely by focusing on important regions, instead of treating all pixels equally;
  2. Encode semantic concepts in a few visual tokens relevant to the image, rather than modeling all concepts across all images;
  3. Relate spatially distant concepts through self-attention in token space.

  While using fewer FLOPs and parameters, VT-based models achieve ImageNet top-1 accuracy 4.6 to 7 points higher than their ResNet counterparts. For semantic segmentation on LIP and COCO, the VT-based Feature Pyramid Network (FPN) improves mIoU by 0.35 points while reducing the FLOPs of the FPN module by a factor of 6.5.

2. VT

  See the previous section for the overall flow chart. The input image is first processed with several convolutional blocks, and the resulting feature map is fed to the VT. The highlight of this approach is to exploit the respective strengths of convolutions and VTs:

  • Early in the network, convolutions are used to learn densely distributed low-level patterns;
  • In the later stages of the network, VTs are used to learn and relate more sparsely distributed, higher-level semantic concepts;
  • At the end of the network, visual tokens are used for image-level prediction tasks, and augmented feature maps are used for pixel-level prediction tasks.

  The VT module consists of three steps:

  1. Group pixels into semantic concepts to generate a compact set of visual tokens;
  2. A Transformer is applied to these visual tokens to model the relationships between semantic concepts;
  3. These visual tokens are projected back to pixel space to obtain enhanced feature maps.

2.1 Tokenizer

  The point of this paper is that an image can be summarized by a handful of words, i.e., visual tokens. This is in stark contrast to convolutions, which use hundreds of kernels, and graph convolutions, which use hundreds of "latent nodes", to detect all possible concepts regardless of image content. To exploit this perspective, this paper introduces a tokenizer module that converts the feature map into a compact set of visual tokens: the input feature map $X \in R^{HW \times C}$ is transformed into tokens $T \in R^{L \times C}$, where $L \ll HW$ and $L$ is the number of tokens.
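  As an illustration, here is a minimal PyTorch sketch of such a tokenizer, assuming a filter-based form in which per-pixel token weights come from a learnable matrix $W_A$ and are softmax-normalized over the $HW$ spatial positions; the class and argument names are illustrative, not taken from the paper, and for simplicity the token channel size is kept equal to the feature channel size $C$ (the paper allows them to differ).

```python
import torch
import torch.nn as nn


class Tokenizer(nn.Module):
    """Sketch of a spatial-attention tokenizer: pools HW pixels into L visual tokens.

    Assumed form: A = softmax_HW(X @ W_A), T = A^T @ X, with W_A in R^{C x L}.
    """

    def __init__(self, channels: int, num_tokens: int):
        super().__init__()
        self.token_attention = nn.Linear(channels, num_tokens, bias=False)  # W_A

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, HW, C) flattened feature map
        attn = torch.softmax(self.token_attention(x), dim=1)  # (B, HW, L), normalized over pixels
        return attn.transpose(1, 2) @ x                       # (B, L, C) visual tokens
```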

2.2 Transformer

  After tokenization, the interactions between these visual tokens need to be modeled. This paper adopts a Transformer, whose weights are input-dependent. As a result, the Transformer supports visual tokens with variable meaning, covering more possible concepts with fewer tokens.

  This article uses a standard Transformer, but with some small changes:
$$T'_{out} = T_{in} + \operatorname{softmax}_L\left((T_{in}K)(T_{in}Q)^T\right)T_{in} \tag{3}$$

$$T_{out} = T'_{out} + \sigma(T'_{out}F_1)F_2 \tag{4}$$

  where $T_{in}, T'_{out}, T_{out} \in R^{L \times C}$ are the visual tokens. In the Transformer, the weights between tokens depend on the input and are computed as a key-query product, $(T_{in}K)(T_{in}Q)^T$, which allows as few as 16 visual tokens to be used. After self-attention, Equation (4) applies a nonlinearity and two pointwise convolutions, where $F_1, F_2 \in R^{C \times C}$ are learnable weights and $\sigma(\cdot)$ is the ReLU function.
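  The following is a minimal sketch of the token-space Transformer of Equations (3)–(4), assuming single-head attention and full-channel key/query projections (the projection sizes are an assumption, not taken from the paper):

```python
import torch
import torch.nn as nn


class TokenTransformer(nn.Module):
    """Sketch of Eq. (3)-(4): input-dependent key-query self-attention over the
    L visual tokens, followed by two pointwise (1x1) layers with a residual."""

    def __init__(self, channels: int):
        super().__init__()
        self.key = nn.Linear(channels, channels, bias=False)    # K
        self.query = nn.Linear(channels, channels, bias=False)  # Q
        self.f1 = nn.Linear(channels, channels, bias=False)     # F_1
        self.f2 = nn.Linear(channels, channels, bias=False)     # F_2

    def forward(self, t_in: torch.Tensor) -> torch.Tensor:
        # t_in: (B, L, C) visual tokens
        attn = self.key(t_in) @ self.query(t_in).transpose(1, 2)  # (B, L, L) key-query product
        t_mid = t_in + torch.softmax(attn, dim=-1) @ t_in         # Eq. (3)
        return t_mid + self.f2(torch.relu(self.f1(t_mid)))        # Eq. (4)
```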

2.3 Projector (projection)

  Many vision tasks require pixel-level details, which are not preserved in the visual tokens. Therefore, the Transformer output is fused back into the feature map to refine its pixel-array representation:

$$X_{out} = X_{in} + \operatorname{softmax}_L\left((X_{in}W_Q)(TW_K)^T\right)T \tag{5}$$

where $X_{in}, X_{out} \in R^{HW \times C}$ are the input and output feature maps; $(X_{in}W_Q) \in R^{HW \times C}$ is the query computed from the input feature map, and $(X_{in}W_Q)_p \in R^C$ encodes the information that pixel $p$ needs from the visual tokens; $(TW_K) \in R^{L \times C}$ is the key computed from the tokens $T$, and $(TW_K)_l \in R^C$ represents the information encoded by the $l$-th token. The key-query product determines how the information encoded in the visual tokens $T$ is projected back onto the original feature map. $W_Q \in R^{C \times C}$ and $W_K \in R^{C \times C}$ are the learnable weights used to compute the queries and keys.
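  A minimal sketch of the projector in Equation (5) follows, together with a small module that chains the three steps of this section. It reuses the Tokenizer and TokenTransformer sketches above, again assumes the token channel size equals the feature channel size, and uses illustrative names.

```python
import torch
import torch.nn as nn


class Projector(nn.Module):
    """Sketch of Eq. (5): pixel queries attend over token keys, and the attended
    token contents are added back to the feature map as a residual."""

    def __init__(self, channels: int):
        super().__init__()
        self.w_q = nn.Linear(channels, channels, bias=False)  # W_Q, applied to pixels
        self.w_k = nn.Linear(channels, channels, bias=False)  # W_K, applied to tokens

    def forward(self, x_in: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # x_in: (B, HW, C) feature map, tokens: (B, L, C) visual tokens
        attn = self.w_q(x_in) @ self.w_k(tokens).transpose(1, 2)  # (B, HW, L)
        return x_in + torch.softmax(attn, dim=-1) @ tokens        # residual fusion, Eq. (5)


class VTBlock(nn.Module):
    """Tokenize -> model token interactions -> project back (the three steps above)."""

    def __init__(self, channels: int, num_tokens: int):
        super().__init__()
        self.tokenizer = Tokenizer(channels, num_tokens)  # sketched in Section 2.1
        self.transformer = TokenTransformer(channels)     # sketched in Section 2.2
        self.projector = Projector(channels)

    def forward(self, x: torch.Tensor):
        tokens = self.transformer(self.tokenizer(x))
        return self.projector(x, tokens), tokens  # enhanced feature map and visual tokens
```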

3. Using Visual Transformers (VTs) in vision models

  In this section, we describe how to use VTs as building blocks in vision models. Three hyperparameters are defined for each VT: the channel size of feature maps; the channel size of visual tokens; and the number of visual tokens.
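  For concreteness, these three hyperparameters could be collected in a small configuration object; the field names below are illustrative, and the example values follow the classification setting described in Section 3.1.

```python
from dataclasses import dataclass


@dataclass
class VTConfig:
    """The three hyperparameters defined for each VT block."""
    feature_channels: int  # channel size of the feature map fed to the VT
    token_channels: int    # channel size of the visual tokens
    num_tokens: int        # number of visual tokens


# Example: the last-stage VT blocks of VT-ResNet-18 (Section 3.1)
vt_resnet18 = VTConfig(feature_channels=256, token_channels=1024, num_tokens=16)
```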

3.1 For image classification models

  Based on ResNet-{18, 34, 50, 101}, the corresponding Visual-Transformer-ResNets (VT-ResNets) are constructed by replacing the last stage of convolutions with VT modules. The last stage of ResNet-{18, 34} contains 2 and 3 basic blocks respectively, and the last stage of ResNet-{50, 101} contains 3 bottleneck blocks each; these blocks are replaced with the same number of VT modules.

  • At the end of stage 4 (before the max pooling in stage 5), ResNet-{18, 34} produces feature maps of shape $14^2 \times 256$, and ResNet-{50, 101} produces feature maps of shape $14^2 \times 1024$. For ResNet-{18, 34, 50, 101}, the feature map channel size of the VT is therefore set to 256, 256, 1024, and 1024 respectively;
  • All models use 16 visual tokens with a channel size of 1024;
  • At the end of the network, the 16 visual tokens are fed to the classification head, which applies average pooling over the tokens and uses a fully connected layer to predict class probabilities, as sketched below.
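  A minimal sketch of such a classification head, assuming the tokens arrive as a (batch, tokens, channels) tensor; the class name and the ImageNet class count in the example are illustrative.

```python
import torch
import torch.nn as nn


class TokenClassificationHead(nn.Module):
    """Average-pool the visual tokens, then a fully connected layer gives class logits."""

    def __init__(self, token_channels: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(token_channels, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, L, C), e.g. the 16 visual tokens from the last VT block
        pooled = tokens.mean(dim=1)  # average pooling over the token dimension
        return self.fc(pooled)       # logits; softmax / cross-entropy applied outside


# Example: 16 tokens of width 1024 -> 1000 ImageNet classes
logits = TokenClassificationHead(1024, 1000)(torch.randn(2, 16, 1024))
```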

  The specific model composition is shown in the following table:
[Table: VT-ResNet model configurations]
  Since VTs operate on only 16 visual tokens, the FLOPs of the final stage are reduced by a factor of 6.9, as shown in Table 1:
[Table 1: final-stage FLOPs comparison between ResNet and VT-ResNet]

3.2 Model for Image Semantic Segmentation

  Using VTs for semantic segmentation can address several challenges of the pixel-wise convolution paradigm:

  • First, the computational cost of convolution grows with image resolution;
  • Second, it is difficult for convolutions to capture long-range interactions between pixels.
  • VTs, on the other hand, operate on a small number of visual tokens regardless of image resolution, and since they model concept interactions in token space, they bypass the "long-range" challenge of pixel arrays.

  To test this hypothesis, the Panoptic Feature Pyramid Network (FPN) is used as a baseline and improved with VTs. Panoptic FPN uses a ResNet backbone to extract feature maps at different resolutions from different stages. These feature maps are then fused in a top-down manner through a feature pyramid network to generate multi-scale, detail-preserving feature maps with rich semantics for segmentation (left side of the figure below). The FPN is computationally expensive because it relies heavily on spatial convolutions operating on high-resolution feature maps with large channel counts. This paper replaces the convolutions in the FPN with VTs and names the new module VT-FPN (right side of the figure below).
[Figure: Panoptic FPN (left) vs. VT-FPN (right)]
  From the feature map at each resolution, VT-FPN extracts 8 visual tokens with a channel size of 1024. The visual tokens from all levels are merged and fed into a Transformer to compute interactions between tokens of different resolutions. The output tokens are then projected back to the original feature maps, which are used for pixel-wise prediction. Compared with the original FPN, the computational cost of VT-FPN is much smaller, because it operates on a very small number of visual tokens rather than on all pixels. Experiments show that VT-FPN uses 6.4 times fewer FLOPs than the FPN while maintaining or exceeding its performance (results are shown in the two tables below, and a structural sketch follows).
[Tables: semantic segmentation results with VT-FPN]
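  Below is a minimal sketch of this structure, reusing the Tokenizer, TokenTransformer, and Projector sketches from Section 2. It assumes every pyramid level shares the same channel width and that each level's pixels attend over the full set of merged tokens during re-projection; both are simplifying assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class VTFPN(nn.Module):
    """Sketch of VT-FPN: tokenize each pyramid level, run one Transformer over the
    merged tokens, then project the tokens back into every level's feature map."""

    def __init__(self, channels: int, num_levels: int, tokens_per_level: int = 8):
        super().__init__()
        self.tokenizers = nn.ModuleList(
            Tokenizer(channels, tokens_per_level) for _ in range(num_levels)
        )
        self.transformer = TokenTransformer(channels)
        self.projectors = nn.ModuleList(
            Projector(channels) for _ in range(num_levels)
        )

    def forward(self, features):
        # features: list of (B, HW_i, C) flattened feature maps, one per resolution
        tokens = [tok(f) for tok, f in zip(self.tokenizers, features)]
        merged = self.transformer(torch.cat(tokens, dim=1))  # cross-resolution interactions
        return [proj(f, merged) for proj, f in zip(self.projectors, features)]
```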

4 Conclusion

  The convention in computer vision is to represent an image as an array of pixels and to apply the core deep learning operator, convolution. Instead, this paper proposes Visual Transformers (VTs), the centerpiece of a new computer-vision paradigm that learns and relates sparsely distributed, high-level concepts more efficiently:

  • Unlike pixel arrays, VTs use only visual tokens to represent high-level concepts in images;
  • Unlike convolutions, VTs apply a Transformer to directly relate semantic concepts in token space.

  To evaluate this idea, the convolutional modules are replaced with VTs, which yields significant accuracy improvements across multiple tasks and datasets. With advanced training configurations, VTs improve ResNet accuracy on ImageNet by 4.6 to 7 points. For semantic segmentation on LIP and COCO, the VT-based feature pyramid network (FPN) achieves a 0.35-point mIoU improvement while using 6.5 times fewer FLOPs than the convolutional FPN module.

  Furthermore, this paradigm can be combined with other contemporaneous techniques beyond the scope of this paper, such as additional training data and neural architecture search. However, the goal of this paper is not to demonstrate a large collection of deep learning tricks, but to show that the pixel-convolution paradigm is full of redundancy, and that modern methods compensate for it with a multitude of tricks that add considerable computational complexity.

  Rather than adding further computation, this paper addresses the problem at its root, using the token-Transformer paradigm to remove the redundancy inherent in pixel-wise convolution.
