ShiftViT: Swin Transformer accuracy at ResNet-like speed — has the discussion of ViT's success been missing the point?




Author: ChaucerG

Source: Jizhi Shutong


The attention mechanism is widely regarded as the key to the success of Vision Transformer (ViT), as it provides a flexible and powerful way to model spatial relationships. But is the attention mechanism really an integral part of ViT? Can it be replaced by some other alternative? To uncover the role of the attention mechanism, the authors reduce it to an extremely simple case: zero FLOPs and zero parameters.

Specifically, the authors revisit the Shift operation. It contains no parameters and no arithmetic calculations; its only effect is to move a small portion of the channels between neighboring features. Based on this simple operation, the authors construct a new Backbone, namely ShiftViT, in which the attention layers of ViT are replaced by shift operations.

Surprisingly, ShiftViT works well on several mainstream tasks, such as classification, detection, and segmentation. Its performance is even better than that of Swin Transformer. These results suggest that the attention mechanism may not be the key factor that makes ViT successful; it can even be replaced by a zero-parameter operation. In future work, more attention should be paid to the remaining parts of ViT.

1 Introduction

The design of the Backbone plays a vital role in computer vision. Convolutional Neural Networks (CNNs) have dominated this field for almost 10 years since the revolutionary advance of AlexNet. However, recent ViTs have shown the potential to challenge this throne. The advantages of ViT were first demonstrated on the image classification task, where ViT-based Backbones significantly outperform their CNN counterparts. Due to the excellent performance of ViT, its variants have flourished and been rapidly applied to many other computer vision tasks, such as object detection, semantic segmentation, and action recognition.

Despite the impressive performance of recent ViT variants, it remains unclear what makes ViT beneficial for vision tasks. Some works tend to attribute the success to the attention mechanism, as it provides a flexible and powerful way to model spatial relationships. Specifically, the attention mechanism uses a self-attention matrix to aggregate features from arbitrary locations. This has two notable advantages over the convolution operation in CNNs. First, it can capture short-range and long-range dependencies simultaneously, getting rid of the locality constraint of convolution. Second, the interaction between two spatial locations dynamically depends on their respective features rather than on a fixed convolution kernel. Because of these properties, some works believe that the attention mechanism is what gives ViT its powerful expressive ability.

However, are these two strengths really the key to success? The answer is probably not. Existing work shows that ViT variants can still work well without these properties. First, global dependency may not be indispensable: more and more ViTs introduce a local attention mechanism that restricts attention to a small local area, such as Swin Transformer and LocalViT, and experimental results show that performance is not degraded by the locality constraint. Second, other studies have questioned the need for dynamic aggregation. MLP-Mixer replaces the attention layer with a linear projection layer whose weights are not dynamically generated, and it can still achieve leading performance on the ImageNet dataset.

Since neither the global nor the dynamic property may be essential to the ViT framework, what is the root cause of ViT's success? For clarity, the authors further simplify the attention layer to an extreme case: no global modeling, no dynamics, and not even any parameters or arithmetic computations. The question is whether ViT can maintain good performance in this extreme case.

Figure 1

Conceptually, this zero-parameter alternative must rely on hand-crafted rules to model spatial relationships. In this work, the Shift operation is revisited, which the authors consider to be one of the simplest spatial modeling modules. As shown in Figure 1, a standard ViT building block consists of two parts: an attention layer and a feed-forward network (FFN).

The authors replace the attention layer with a Shift operation while keeping the FFN part unchanged. Given an input feature, the proposed building block first moves a small portion of the channels along 4 spatial directions, namely left, right, up, and down. Information from adjacent positions is thereby mixed into the shifted channels. The subsequent FFN then performs channel mixing to further fuse this information across channels.

Based on this shift building block, a ViT-like Backbone is built, namely ShiftViT. Surprisingly, this Backbone works well on mainstream visual recognition tasks, with performance comparable to or better than Swin Transformer. Specifically, within the same computational budget as the Swin-T model, ShiftViT achieves 81.7% top-1 accuracy on ImageNet (vs. 81.3% for Swin-T). For dense prediction tasks, the mAP is 45.7% on the COCO detection dataset (vs. 43.7% for Swin-T) and the mIoU is 46.3% on the ADE20K segmentation dataset (vs. 44.5% for Swin-T).

Since the Shift operation is already the simplest spatial modeling module, excellent performance must come from the remaining components, such as linear layers and activation functions in FFN. These components are less studied in existing work because they seem trivial. However, to further demystify why ViT works, the authors argue that more attention should be paid to these components rather than just the attention mechanism.

In summary, the contributions of this work are two-fold:

  • A ViT-like Backbone is proposed, where the ordinary attention layer is replaced by a very simple Shift operation. This model can achieve better performance than Swin Transformer

  • The reasons behind the success of ViT are analyzed. This hints that the attention mechanism may not be the key factor in making ViT work. The remaining components should be taken seriously in future ViT studies

2 Related work

2.1 Attention and Vision Transformers

Transformers were first introduced in the field of natural language processing (NLP), employing only the attention mechanism to establish connections between tokens in different languages. Due to their outstanding performance, Transformers quickly came to dominate the NLP field and became the de facto standard.

Inspired by this success in natural language processing, the attention mechanism has also received increasing interest in the computer vision community. Early explorations can be roughly divided into two categories. On the one hand, some works treat attention as a plug-and-play module that can be seamlessly integrated into existing CNN architectures; representative works include Non-Local Networks, Relation Networks, and CCNet. On the other hand, some works aim to replace all convolution operations with attention mechanisms, such as Local Relation Networks and stand-alone self-attention models.

While these two lines of work showed promising results, they were still based on CNN architectures. ViT is a pioneering work that uses a pure Transformer architecture for visual recognition tasks. Due to its impressive performance, the field has recently seen a rising wave of research on vision Transformers.

Along this research direction, the main research focus is on improving the attention mechanism so that it can satisfy the intrinsic properties of visual signals. For example, MSViT builds hierarchical attention layers to obtain multi-scale features. Swin-Transformer introduces a locality constraint in its attention mechanism. Related work also includes pyramid attention, local-global attention, cross attention and so on.

In contrast to the intense interest in the attention mechanism, the remaining components of ViT are less studied. DeiT establishes a standard training pipeline for vision Transformers, and most subsequent works inherit its configuration with only some modifications to the attention mechanism. The work in this paper also follows this paradigm. However, the purpose here is not to design a sophisticated attention mechanism; rather, it is to show that the attention mechanism may not be a critical part of making ViT work, and that it can even be replaced by a very simple Shift operation. The authors hope these results will inspire researchers to rethink the role of the attention mechanism.

2.2 MLP Variants

The work in this paper is related to recent Multilayer Perceptron (MLP) variants. Specifically, these MLP variants propose to extract image features with a pure MLP-like architecture, stepping outside the attention-based framework of ViT. For example, MLP-Mixer introduces a token-mixing MLP to directly connect all spatial locations; it removes the dynamic nature of ViT without losing accuracy. Subsequent work investigated more MLP designs, such as spatial gating units or recurrent connections.

ShiftViT can also be classified as a pure MLP architecture, where the Shift operation is treated as a special token-mixing layer. Compared with existing MLP work, the Shift operation is simpler because it contains no parameters and no FLOPs. Furthermore, ordinary MLP variants cannot handle variable input sizes because their linear weights are fixed. The Shift operation overcomes this obstacle, enabling the Backbone to be used for more vision tasks such as object detection and semantic segmentation.

2.3 Shift Operation

Shift operations are nothing new in computer vision. Back in 2017, the shift was proposed as an efficient alternative to the spatial convolution: a sandwich-like architecture with two 1×1 convolutions around a Shift operation approximates a K×K convolution. Follow-up work extended the Shift operation into different variants, such as active Shift, sparse Shift, and partial Shift.

3 ShiftViT

3.1 Architecture Overview

For a fair comparison, the authors follow the architecture of the Swin Transformer. An overview of the architecture is shown in Figure 2(a).

Figure 2(a): architecture overview

Specifically, given an input image of shape H × W × 3, the model first splits it into non-overlapping patches of 4 × 4 pixels. The patch partition therefore outputs H/4 × W/4 tokens, each with a channel size of 48 (4 × 4 × 3).

The rest of the network is divided into 4 stages. Each stage consists of two parts: embedding generation and stacked shift blocks. For the first stage, embedding generation is a linear projection layer that maps each token into an embedding of channel size C. For the remaining stages, adjacent patches are merged by a convolution with kernel size 2 × 2. After patch merging, the spatial size is halved and the channel size is doubled, i.e., from C to 2C.
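
As a rough illustration, here is a minimal PyTorch sketch of the patch partition/embedding and the 2 × 2 patch merging described above. The stage-1 width C = 96 is an assumed example value, and fusing the partition with the linear projection into a single strided convolution is an implementation convenience, not necessarily how the authors wrote it.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an H x W x 3 image into non-overlapping 4x4 patches (48-dim tokens)
    and linearly project them to the stage-1 embedding size C.
    A strided convolution is mathematically equivalent to 'partition + linear projection'."""
    def __init__(self, embed_dim=96, patch_size=4):   # embed_dim=96 is an illustrative choice
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):          # x: (B, 3, H, W)
        return self.proj(x)        # (B, C, H/4, W/4)

class PatchMerging(nn.Module):
    """Merge adjacent patches with a 2x2 strided convolution:
    spatial size is halved, channel size is doubled (C -> 2C)."""
    def __init__(self, in_dim):
        super().__init__()
        self.reduction = nn.Conv2d(in_dim, 2 * in_dim, kernel_size=2, stride=2)

    def forward(self, x):          # x: (B, C, H, W)
        return self.reduction(x)   # (B, 2C, H/2, W/2)
```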

The stacked shift blocks are made up of repeated basic units. The detailed design of each shift block is shown in Figure 2(b); it consists of a shift operation, layer normalization, and an MLP network. This design is almost identical to the standard Transformer block; the only difference is that a shift operation is used in place of the attention layer. The number of shift blocks can differ per stage, denoted as N1, N2, N3, and N4, respectively. In the implementation, these values are carefully chosen so that the whole model has a similar number of parameters to the Baseline Swin Transformer model.

3.2 Shift Block

The detailed architecture of Shift Block is shown in Figure 2(b).

Figure 2(b): the Shift block

Specifically, the block consists of 3 sequentially stacked components: Shift operation, layer normalization, and MLP network.

Shift operations have been well studied in CNNs, and they admit many design choices, such as active Shift and sparse Shift. This work follows the partial Shift operation in TSM, as shown in Figure 1(b). Given an input tensor, a small portion of the channels is shifted along 4 spatial directions, i.e., left, right, up, and down, while the remaining channels stay unchanged. After the shift, out-of-range pixels are simply dropped and the vacated positions are filled with zeros. In this work, the shift step size is set to 1 pixel.

Formally, the input feature z is assumed to have shape H×W×C, where C is the number of channels and H and W are the spatial height and width, respectively. The output feature z′ has the same shape as the input. It can be written as:

$$
z'_{h,w,c} =
\begin{cases}
z_{h,\,w+1,\,c}, & 0 \le c < \gamma C \ \text{(shift left)}\\
z_{h,\,w-1,\,c}, & \gamma C \le c < 2\gamma C \ \text{(shift right)}\\
z_{h+1,\,w,\,c}, & 2\gamma C \le c < 3\gamma C \ \text{(shift up)}\\
z_{h-1,\,w,\,c}, & 3\gamma C \le c < 4\gamma C \ \text{(shift down)}\\
z_{h,\,w,\,c}, & \text{otherwise,}
\end{cases}
$$

where out-of-range indices are treated as zero (zero padding).

where γ is a scaling factor controlling the fraction of channels shifted in each direction. In most experiments, γ is set to 1/12, so 4γ ≈ 33% of the channels are shifted in total.

Algorithm 1: PyTorch implementation of the Shift operation

It is worth noting that the Shift operation contains no parameters and no arithmetic calculations; the only work involved is memory copying. Therefore, the Shift operation is efficient and easy to implement. Its pseudocode is presented in Algorithm 1. Compared with the self-attention mechanism, the Shift operation is cleaner, simpler, and more friendly to deep learning inference libraries such as TensorRT.
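
Since the operation is nothing more than strided memory copies, it can be written in a few lines. Below is a minimal PyTorch sketch reconstructed from the description above, not the authors' exact Algorithm 1; in particular, the assignment of channel groups to the four directions is an assumption.

```python
import torch

def shift_feature(x, gamma=1.0 / 12):
    """Partial shift: move gamma*C channels by one pixel in each of the four spatial
    directions (left, right, up, down); the remaining channels are left untouched.
    Out-of-range pixels are dropped and vacated positions are zero-filled.
    x: tensor of shape (B, C, H, W). No learnable parameters, no FLOPs."""
    B, C, H, W = x.shape
    g = int(C * gamma)                                   # channels shifted per direction
    out = torch.zeros_like(x)
    out[:, 0*g:1*g, :, :-1] = x[:, 0*g:1*g, :, 1:]       # shift left
    out[:, 1*g:2*g, :, 1:]  = x[:, 1*g:2*g, :, :-1]      # shift right
    out[:, 2*g:3*g, :-1, :] = x[:, 2*g:3*g, 1:, :]       # shift up
    out[:, 3*g:4*g, 1:, :]  = x[:, 3*g:4*g, :-1, :]      # shift down
    out[:, 4*g:, :, :]      = x[:, 4*g:, :, :]           # untouched channels
    return out
```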

The rest of the Shift block is the same as the standard ViT building block. The MLP network has 2 linear layers: the first expands the input channels to a higher dimension, e.g., from C to τC, and the second projects the high-dimensional features back to the original channel size C. Between these two layers, GELU is adopted as the nonlinear activation function.
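
Putting the pieces together, one shift block might be sketched as follows, reusing the shift_feature function above. The pre-norm layout and the residual connection around the MLP follow the "standard Transformer block" the text refers to; these placement details are assumptions about the exact implementation.

```python
import torch.nn as nn

class ShiftBlock(nn.Module):
    """One ShiftViT block: parameter-free partial shift for spatial mixing,
    then LayerNorm + a 2-layer MLP (expansion ratio tau, GELU) for channel mixing."""
    def __init__(self, dim, tau=2, gamma=1.0 / 12):
        super().__init__()
        self.gamma = gamma
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, tau * dim),   # expand: C -> tau*C
            nn.GELU(),                   # nonlinearity between the two linear layers
            nn.Linear(tau * dim, dim),   # project back: tau*C -> C
        )

    def forward(self, x):                        # x: (B, C, H, W)
        x = shift_feature(x, self.gamma)         # zero-parameter spatial mixing
        y = x.permute(0, 2, 3, 1)                # (B, H, W, C) so LayerNorm/Linear act on C
        y = y + self.mlp(self.norm(y))           # channel mixing with a residual connection
        return y.permute(0, 3, 1, 2)             # back to (B, C, H, W)
```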

3.3 Architecture Variants

For comparison with the Baseline Swin Transformer, the authors also build models with different numbers of parameters and computational complexity. Specifically, the Shift-T(iny), Shift-S(mall), and Shift-B(ase) variants are introduced, corresponding to Swin-T, Swin-S, and Swin-B, respectively. Shift-T is the smallest and is similar in size to Swin-T and ResNet-50. The other two variants, Shift-S and Shift-B, are about 2x and 4x more complex than Shift-T. The detailed configurations, i.e., the base embedding channel C and the number of blocks {N1, N2, N3, N4}, are as follows:

Detailed configurations (embedding channel C and blocks per stage) of Shift-T, Shift-S, and Shift-B

In addition to model size, the authors also look more closely at model depth. In the proposed model, almost all parameters are concentrated in the MLP part, so the expansion ratio τ of the MLP can be reduced to afford a deeper network. Unless otherwise specified, the expansion ratio τ is set to 2. Ablation analysis shows that deeper models achieve better performance.
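
A back-of-the-envelope parameter count makes this trade-off concrete; it is an approximation derived from the block structure in Section 3.2, not a figure from the paper. Ignoring biases, normalization, and the parameter-free shift, one block costs

$$
P_{\text{block}} \approx \underbrace{C \cdot \tau C}_{\text{expansion}} + \underbrace{\tau C \cdot C}_{\text{projection}} = 2\tau C^{2},
\qquad
\text{affordable depth} \approx \frac{P_{\text{total}}}{2\tau C^{2}} \;\propto\; \frac{1}{\tau},
$$

so at a fixed parameter budget and width C, halving τ roughly doubles the number of blocks that fit into the model.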

4 Experiments

4.1 Ablation Experiments

1. Expansion ratio of the MLP

The previous experiments demonstrate the design principle of this paper: a large model depth can compensate for the simplicity of each component. In general, there is a trade-off between model depth and building-block complexity. With a fixed computational budget, lightweight building blocks allow a deeper network architecture.

Table 5: ablation on the expansion ratio of the MLP

To further investigate this trade-off, the authors evaluate ShiftViT models of different depths. Since most parameters lie in the MLP part, the model depth can be controlled by changing the expansion ratio τ of the MLP. As shown in Table 5, Shift-T is chosen as the baseline model and expansion ratios τ from 1 to 4 are investigated. Note that the parameter counts and FLOPs of the different entries are almost identical.

From Table 5, a clear trend can be observed: the deeper the model, the better the performance. When the depth of ShiftViT is increased to 225 layers, the absolute gains over the 57-layer model are 0.5%, 1.2%, and 2.9% on classification, detection, and segmentation, respectively. This trend supports the conjecture that powerful but heavy modules, such as attention, may not be the best choice for a Backbone.

2. Percentage of shifted channels

The Shift operation has only one hyperparameter: the percentage of shifted channels. By default, it is set to 33% (i.e., γ = 1/12 per direction across 4 directions). Other settings are explored in this section, with the proportion of shifted channels set to 20%, 25%, 33%, and 50%, respectively. The results, shown in Figure 3, indicate that the final performance is not very sensitive to this hyperparameter: shifting 25% of the channels only costs an absolute 0.3% compared with the best setting. Within a reasonable range (25%-50%), all settings achieve better accuracy than the Swin-T Baseline.

3. Shifted pixels

In the Shift operation, a small portion of the channels is shifted by one pixel in 4 directions. For a more thorough exploration, different shift step sizes were also tried. With a step of 0 pixels, i.e., no shift at all, the top-1 accuracy on ImageNet is only 72.9%, significantly lower than the Baseline of this paper (81.7%). This is not surprising, since no shift means no interaction between different spatial locations. Furthermore, with a shift of two pixels, the model achieves 80.2% top-1 accuracy on ImageNet, which is also slightly worse than the default setting.

4. ViT-style training scheme

Shift operations have been well studied in CNNs, yet previous work did not report results as strong as those in this paper. The ImageNet accuracy of Shift-ResNet-50 is only 75.6%, far below the 81.7% reported here. This gap raises a natural question: what is it about the ViT setting that helps so much?

The authors suspect that the reason may lie in the ViT-style training scheme. Specifically, most existing ViT variants follow the settings of DeiT, which differ considerably from the standard pipeline for training CNNs. For example, the ViT-style scheme uses the AdamW optimizer and trains on ImageNet for 300 epochs, whereas CNN-style schemes tend to favor the SGD optimizer with a training schedule of usually only 90 epochs. Since the proposed model inherits the ViT-style training scheme, it is interesting to observe how these differences affect performance.
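
As a rough sketch, the two recipes contrasted above might be configured as follows. Only the optimizer families and epoch counts come from the text; the learning rates and weight decays are illustrative placeholders rather than the paper's settings.

```python
import torch

def build_training_recipe(model, style="vit"):
    """Return (optimizer, num_epochs) for a ViT-style or CNN-style recipe.
    Values other than the optimizer type and epoch count are placeholders."""
    if style == "vit":
        # DeiT-style recipe inherited by most ViT variants: AdamW, 300-epoch schedule.
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
        num_epochs = 300
    else:
        # Classic CNN recipe: SGD with momentum, 90-epoch schedule.
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                    momentum=0.9, weight_decay=1e-4)
        num_epochs = 90
    return optimizer, num_epochs
```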

Table 6: ablation on ViT-style vs. CNN-style training settings

Due to resource constraints, the ViT-style and CNN-style pipelines cannot be fully aligned across all settings. Therefore, 4 factors believed to be the most informative are selected: the optimizer, activation function, normalization layer, and training schedule. As can be seen from Table 6, these factors can significantly affect accuracy, especially the training schedule. These results show that the good performance of ShiftViT is partly brought about by the ViT-style training scheme. Likewise, the success of ViT may also be related to its particular training scheme, and this issue should be taken seriously in future ViT studies.

4.2 ImageNet, COCO, and ADE20K

Table 2: classification results on ImageNet

Overall, the method achieves performance comparable to the state of the art. Among ViT-based and MLP-based methods, the best performance is around 83.5%, while this model achieves 83.3% accuracy. Compared with CNN-based methods, the model is slightly worse, but the comparison is not entirely fair because EfficientNet takes a larger input size.

Another interesting comparison is with two related works, S^2-MLP and AS-MLP. Both share a similar idea based on the Shift operation, but they introduce auxiliary modules into their building blocks, such as pre-projection and post-projection layers. In Table 2, this paper performs slightly better than these two works, which supports the design choice that a good Backbone can be built with just a simple Shift operation.


Besides the classification task, similar trends can be observed on the object detection and semantic segmentation tasks. It is worth noting that some ViT- and MLP-based methods do not easily scale to such dense prediction tasks, because high-resolution inputs create an unbearable computational burden. Thanks to the efficiency of the Shift operation, this method does not suffer from that obstacle.

Table 3: object detection results on COCO

Table 4: semantic segmentation results on ADE20K

As shown in Tables 3 and 4, the advantages of ShiftViT are obvious. Shift-T achieves an mAP of 47.1 on object detection and an mIoU of 47.8 on semantic segmentation, significantly better than the other methods.

5 References

[1] When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism

