CVPR 2022 | Tsinghua Open-Sources ACmix: Fusing Self-Attention and Convolution, with Gains in Both Performance and Speed!


Reprinted from: Jizhi Shutong


On the Integration of Self-Attention and Convolution

Paper: https://arxiv.org/abs/2111.14556

Code (partially open source):

https://github.com/Panxuran/ACmix

Convolution and Self-Attention are two powerful representation learning techniques, and they are usually regarded as two distinct approaches.

This paper demonstrates that there is a strong underlying relationship between them, since the bulk of the computation in both approaches is in fact performed with the same operations. Specifically:

  • First, it is shown that a traditional k×k convolution can be decomposed into k² individual 1×1 convolutions, followed by shift and summation operations;

  • Then, the projections of query, key, and value in the Self-Attention module are interpreted as multiple 1×1 convolutions, followed by the computation of attention weights and the aggregation of values.

Therefore, the first stage of both modules comprises similar operations. More importantly, the first stage dominates the computational complexity (quadratic in the channel size) compared with the second stage.

This observation naturally leads to an elegant integration of these two seemingly different paradigms, i.e., a hybrid module that enjoys the advantages of both Self-Attention and Convolution while incurring minimal computational overhead. Extensive experiments show that the method achieves consistently improved results on image recognition and downstream tasks.

1 Introduction

In recent years, convolution and Self-Attention have made great strides in the field of computer vision. Convolutional neural networks are widely used in image recognition, semantic segmentation, and object detection, and have achieved state-of-the-art performance on various benchmarks. Recently, with the advent of the Vision Transformer, Self-Attention-based modules have achieved comparable or even better performance than their CNN counterparts on many vision tasks.

Although both approaches have achieved great success, convolutional and Self-Attention modules generally follow different design paradigms. Traditional convolution applies an aggregation function over a local receptive field using convolution kernel weights that are shared across the entire feature map. These inherent characteristics impose crucial inductive biases for image processing.

In contrast, the Self-Attention module employs a weighted average operation based on the context of the input features, which dynamically computes attention weights through the similarity function between related pixel pairs. This flexibility enables the attention module to adaptively focus on different regions and capture more features.

Considering the different yet complementary nature of convolution and Self-Attention, there is potential to benefit from both paradigms by integrating these modules. Previous work has explored the combination of Self-Attention and convolution from several different perspectives.

Earlier studies, such as SENet and CBAM, showed that Self-Attention can be used to enhance convolutional modules. More recently, Self-Attention has been proposed as an independent block to replace traditional convolution in CNN models, as in SAN and BoTNet.

Another line of research focuses on combining Self-Attention and convolution in a single block, such as AA-ResNet and Container, but these architectures are limited to designing independent paths for each module. Therefore, existing methods still treat Self-Attention and convolution as distinct parts and do not fully exploit the intrinsic relationship between them.

In this paper, the authors attempt to reveal a closer relationship between Self-Attention and convolution. Decomposing the operations of these two modules shows that they rely heavily on the same convolution operations. The authors develop a hybrid model based on this observation, named ACmix , and elegantly integrate Self-Attention and convolution with minimal computational overhead.

Specifically:

  • First, a rich set of intermediate features is obtained by projecting the input features with 1×1 convolutions;

  • Then, intermediate features are reused and aggregated according to different modes (Self-Attention and Convolution, respectively).

In this way, ACmix enjoys the advantages of both modules while effectively avoiding performing the expensive projection operations twice.

Main contributions:

  1. Reveals the powerful underlying relationship between Self-Attention and convolution, providing a new perspective for understanding the connection between the two modules and inspiration for designing new learning paradigms;

  2. An elegant integration of Self-Attention and Convolutional modules is introduced, which enjoys the advantages of both. Empirical evidence shows that hybrid models consistently outperform their pure convolutional or Self-Attention counterparts.

2 Related work

Convolutional neural networks use convolution kernels to extract local features and have become the most powerful and widely used technique in various vision tasks. Meanwhile, Self-Attention has shown strong performance in a wide range of language tasks, such as BERT and GPT-3. Theoretical analysis shows that, given sufficient capacity, Self-Attention can express the function class of any convolutional layer. Therefore, recent studies have explored the possibility of introducing Self-Attention into vision tasks.

There are two main methods:

  • One is to use Self-Attention as building blocks in the network;

  • The other is to use Self-Attention and convolution as complementary parts.

2.1 Self-Attention only

Some studies have shown that Self-Attention can completely replace convolution operations. Recently, the Vision Transformer showed that, given enough data, an image can be treated as a sequence of patch tokens and the Transformer model can achieve competitive results in image recognition. Furthermore, the Transformer paradigm has been adopted in vision tasks such as detection, segmentation, and point cloud recognition.

2.2 Convolution with attention boosting

Multiple previously proposed image attention mechanisms have shown that attention can overcome the locality limitation of convolutional networks. Therefore, many researchers have explored using attention modules or additional relational information to augment convolutional networks.

  • Squeeze-and-Excitation (SE) and Gather-Excite (GE) reweight the feature map of each channel.

  • BAM and CBAM reweight both channels and spatial locations to better refine the feature maps.

  • AA-ResNet augments some convolutional layers by concatenating attention maps from another independent Self-Attention.

  • BoTNet replaces convolution with Self-Attention.

Other works aim to design more flexible feature extractors by aggregating information from a larger range of pixels. Hu et al. proposed Local Relation Networks to adaptively determine aggregation weights based on the compositional relationships of local pixels. Wang et al. proposed Non-Local Networks, which enlarge the receptive field by introducing Non-Local blocks that model the similarity between global pixels.

2.3 Using Convolution to Improve Attention

With the advent of the Vision Transformer, many Transformer-based variants have been proposed and have achieved significant improvements on computer vision tasks. Existing research mainly focuses on introducing convolution operations into Transformer models to provide additional inductive bias.

  • CvT adopts convolution in the tokenization process and uses convolution to reduce the computational complexity of Self-Attention.

  • ViT with a convolutional stem adds convolutions in the early stage to achieve more stable training.

  • The CSWin Transformer adopts a convolution-based positional encoding technique and shows improvements on downstream tasks.

  • Conformer combines the Transformer with an independent CNN branch to integrate both types of features.

3 Revisiting Convolution and Self-Attention

3.1 Convolution Operation

Convolution is one of the most important components of modern ConvNets. The standard convolution operation is first reviewed and reformulated from a different perspective, as shown in Figure 2(a). For simplicity, the stride of the convolution is assumed to be 1.

Consider a standard convolution with kernel $K \in \mathbb{R}^{C_{out} \times C_{in} \times k \times k}$, where $k$ is the kernel size and $C_{in}$, $C_{out}$ are the input and output channel sizes.

Given the input tensor $F \in \mathbb{R}^{C_{in} \times H \times W}$ and output tensor $G \in \mathbb{R}^{C_{out} \times H \times W}$, where $H$, $W$ denote height and width, let $f_{ij} \in \mathbb{R}^{C_{in}}$ and $g_{ij} \in \mathbb{R}^{C_{out}}$ denote the feature vectors of pixel $(i, j)$ in $F$ and $G$, respectively. Then the standard convolution can be expressed as:

$$g_{ij} = \sum_{p,q} K_{p,q}\, f_{i+p-\lfloor k/2\rfloor,\; j+q-\lfloor k/2\rfloor} \tag{1}$$

where $K_{p,q} \in \mathbb{R}^{C_{out} \times C_{in}}$ denotes the kernel weights at kernel position $(p, q)$.

For convenience, equation (1) can be rewritten as the sum of feature maps from different kernel positions:

$$g_{ij} = \sum_{p,q} g_{ij}^{(p,q)}, \qquad g_{ij}^{(p,q)} = K_{p,q}\, f_{i+p-\lfloor k/2\rfloor,\; j+q-\lfloor k/2\rfloor} \tag{2, 3}$$

To further simplify the formula, the Shift operation $\tilde{f} \triangleq \text{Shift}(f, \Delta x, \Delta y)$ is defined as

$$\tilde{f}_{ij} = f_{i+\Delta x,\; j+\Delta y}, \quad \forall\, i, j \tag{4}$$

where Δx and Δy are the horizontal and vertical displacements. Formula (3) can then be rewritten as:

$$g_{ij}^{(p,q)} = \text{Shift}\!\left(K_{p,q}\, f_{ij},\; p-\lfloor k/2\rfloor,\; q-\lfloor k/2\rfloor\right) \tag{5}$$

Therefore, standard convolution can be summarized in two stages:

$$\text{Stage I:}\quad \tilde{g}_{ij}^{(p,q)} = K_{p,q}\, f_{ij} \tag{6}$$

$$\text{Stage II:}\quad g_{ij}^{(p,q)} = \text{Shift}\!\left(\tilde{g}^{(p,q)},\; p-\lfloor k/2\rfloor,\; q-\lfloor k/2\rfloor\right), \qquad g_{ij} = \sum_{p,q} g_{ij}^{(p,q)} \tag{7, 8}$$

[Figure 2(a): decomposition of standard convolution into Stage I (1×1 projections) and Stage II (shift and summation)]
  • The first stage : the input feature map is linearly projected with the kernel weights at a given position, which is identical to a standard 1×1 convolution.

  • The second stage : the projected feature maps are shifted according to their kernel positions and finally aggregated together. It can easily be observed that most of the computational cost lies in the 1×1 convolutions, while the subsequent shift and aggregation are lightweight (see the sketch below for a quick check of this decomposition).
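To make the two-stage view concrete, here is a minimal PyTorch sketch (my own check, not the authors' code; tensor sizes are arbitrary) verifying that a standard k×k convolution equals Stage I (k² independent 1×1 convolutions) followed by Stage II (shifting each projected map by its kernel offset and summing):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C_in, C_out, k, H, W = 4, 8, 3, 16, 16
x = torch.randn(1, C_in, H, W)
weight = torch.randn(C_out, C_in, k, k)                   # standard conv kernel, no bias

# Reference: standard k x k convolution with zero padding.
ref = F.conv2d(x, weight, padding=k // 2)

# Stage I: one 1x1 convolution per kernel position (p, q).
# Stage II: shift each projected map by (p - k//2, q - k//2) and accumulate.
out = torch.zeros_like(ref)
for p in range(k):
    for q in range(k):
        proj = F.conv2d(x, weight[:, :, p, q, None, None])   # 1x1 projection
        dy, dx = p - k // 2, q - k // 2
        out += torch.roll(proj, shifts=(-dy, -dx), dims=(2, 3))

# torch.roll wraps around the border, so compare interior pixels only,
# where both formulations agree exactly.
pad = k // 2
assert torch.allclose(ref[..., pad:-pad, pad:-pad],
                      out[..., pad:-pad, pad:-pad], atol=1e-4)
print("k x k conv == 1x1 projections + shift + sum (interior pixels)")
```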

3.2 Self-Attention operation

Attention mechanisms are also widely used in vision tasks. Attention allows the model to focus on important regions within a larger context than traditional convolutions, as shown in Figure 2(b).

Consider a standard Self-Attention module with $N$ heads. Let the input tensor be $F \in \mathbb{R}^{C_{in} \times H \times W}$ and the output tensor be $G \in \mathbb{R}^{C_{out} \times H \times W}$, where $H$, $W$ denote height and width, and let $f_{ij} \in \mathbb{R}^{C_{in}}$, $g_{ij} \in \mathbb{R}^{C_{out}}$ denote the feature vectors of pixel $(i, j)$ in $F$ and $G$, respectively. Then, the output of the attention module is computed as:

$$g_{ij} = \big\Vert_{l=1}^{N} \Big( \sum_{a,b \in \mathcal{N}_k(i,j)} A\big(W_q^{(l)} f_{ij},\, W_k^{(l)} f_{ab}\big)\, W_v^{(l)} f_{ab} \Big) \tag{9}$$

where $\Vert$ denotes the concatenation of the outputs of the $N$ attention heads, and $W_q^{(l)}$, $W_k^{(l)}$, $W_v^{(l)}$ are the projection matrices for query, key, and value. $\mathcal{N}_k(i,j)$ denotes the local region of pixels with spatial extent $k$ centered at $(i, j)$, and $A\big(W_q^{(l)} f_{ij}, W_k^{(l)} f_{ab}\big)$ is the corresponding attention weight for the features within it.

For the widely adopted self-attention module, the attention weights are calculated as:

$$A\big(q_{ij}^{(l)}, k_{ab}^{(l)}\big) = \operatorname{softmax}_{\mathcal{N}_k(i,j)}\!\Big( \frac{ q_{ij}^{(l)\top} k_{ab}^{(l)} }{ \sqrt{d} } \Big) \tag{10}$$

where $d$ is the feature dimension of $q_{ij}^{(l)}$.

Furthermore, multi-head self-attention can be decomposed into two stages and reformulated as:

$$\text{Stage I:}\quad q_{ij}^{(l)} = W_q^{(l)} f_{ij}, \quad k_{ij}^{(l)} = W_k^{(l)} f_{ij}, \quad v_{ij}^{(l)} = W_v^{(l)} f_{ij} \tag{11}$$

$$\text{Stage II:}\quad g_{ij} = \big\Vert_{l=1}^{N} \Big( \sum_{a,b \in \mathcal{N}_k(i,j)} A\big(q_{ij}^{(l)}, k_{ab}^{(l)}\big)\, v_{ab}^{(l)} \Big) \tag{12}$$

[Figure 2(b): decomposition of self-attention into Stage I (1×1 projections) and Stage II (attention-weighted aggregation)]
  • The first stage : 1×1 convolutions project the input features into query, key and value;

  • The second stage : the attention weights are computed and the value matrices aggregated, i.e., local features are gathered. Compared with the first stage, the computational cost is smaller, and the module follows the same two-stage pattern as convolution. (A per-pixel check of the Stage I claim is sketched below.)
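As a quick sanity check of the Stage I claim (a sketch under my own assumptions, not code from the paper), the snippet below verifies that projecting every pixel feature with a matrix $W_q$, as in Eq. (11), is exactly a 1×1 convolution:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C, H, W = 2, 16, 8, 8
x = torch.randn(B, C, H, W)
W_q = torch.randn(C, C)                     # per-pixel projection matrix

# Per-pixel matrix multiply: q_ij = W_q f_ij for every pixel (i, j).
per_pixel = (x.permute(0, 2, 3, 1) @ W_q.t()).permute(0, 3, 1, 2)

# The same projection phrased as a 1x1 convolution.
as_conv = F.conv2d(x, W_q[:, :, None, None])

assert torch.allclose(per_pixel, as_conv, atol=1e-5)
print("per-pixel projection == 1x1 convolution")
```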

3.3 Computational Cost

In order to fully understand the computational bottleneck of the convolution module and self-attention module, the authors analyze the floating-point operations (FLOPs) and the number of parameters in each stage, summarized in Table 1.

[Table 1: theoretical FLOPs and parameters of each stage of the convolution and self-attention modules]

The results show:

  • For the convolution module : the theoretical FLOPs and parameters of stage one have quadratic complexity with respect to the channel size C, while the computational cost of stage two is linear in C and requires no additional training parameters.

  • For the self-attention module : a similar trend is found, with all training parameters kept in stage one. For the theoretical FLOPs, a typical case is considered for a ResNet-like model with kernel size k = 7 and channel sizes C = 64, 128, 256, 512 at different layer depths. The results show that the operations consumed in the first stage far exceed those of the second stage, and the gap becomes more pronounced as the channel size grows.

To further verify the validity of the analysis, the authors also summarize the actual computational cost of the convolution and self-attention modules in a ResNet50 model, summing the costs of all 3×3 convolution (or self-attention) modules to reflect the trend from the model's perspective. The results show that 99% of the convolution computation and 83% of the self-attention computation lie in the first stage, which is consistent with the theoretical analysis.
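The quadratic-versus-linear trend is easy to reproduce with a back-of-the-envelope count (a sketch of the Table 1 analysis under assumed feature-map sizes, not the paper's exact accounting):

```python
def conv_flops(C, H, W, k):
    stage1 = H * W * k * k * C * C      # k*k independent 1x1 projections (quadratic in C)
    stage2 = H * W * k * k * C          # shift and aggregate projected maps (linear in C)
    return stage1, stage2

def attention_flops(C, H, W, k):
    stage1 = 3 * H * W * C * C          # query / key / value 1x1 projections (quadratic in C)
    stage2 = 2 * H * W * k * k * C      # attention weights + weighted sum over a k*k window
    return stage1, stage2

H = W = 56
for C in (64, 128, 256, 512):
    c1, c2 = conv_flops(C, H, W, k=3)
    a1, a2 = attention_flops(C, H, W, k=7)
    print(f"C={C:3d}  conv stage I/II: {c1 / c2:5.0f}x   attention stage I/II: {a1 / a2:4.1f}x")
# Stage I dominates in both modules, and the gap widens linearly with C.
```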

4 Method

4.1 Linking Self-Attention to Convolution

The decompositions of the self-attention and convolution modules introduced above reveal a deeper relationship from several perspectives. First, the roles of the two stages are very similar: stage one is a feature learning module, and both methods share the same operation of projecting features into a deeper space with 1×1 convolutions. The second stage, in turn, corresponds to the feature aggregation process.

From a computational point of view, the 1×1 convolutions performed in the first stage by both the convolution module and the self-attention module require theoretical FLOPs and numbers of parameters that are quadratic in the channel size C. In contrast, in the second stage both modules are lightweight, requiring little or no extra computation.

Taken together, the above analysis shows that:

  1. Convolution and self-attention actually share the same operation of projecting the input feature map through 1×1 convolutions, which also accounts for the dominant computational cost of both modules;

  2. Although crucial for capturing semantic features, the aggregation operations in the second stage are lightweight and do not require additional learned parameters.

4.2 Integration of Self-Attention and Convolution

The above observations naturally lead to an elegant combination of convolution and self-attention. Since both modules share the same 1×1 convolution operations, the projection only needs to be performed once, and the resulting intermediate feature maps can be reused by the two different aggregation operations. The authors therefore propose the hybrid module ACmix, as shown in Figure 2(c).

[Figure 2(c): the ACmix module, which shares the Stage I projections between the self-attention path and the convolution path]

Specifically, ACmix still includes two stages:

  1. In the first stage : the input features are projected by three 1×1 convolutions and reshaped into N pieces, yielding a rich set of intermediate features containing 3×N feature maps.

  2. In the second stage : the two paths follow different paradigms. For the self-attention path, the intermediate features are gathered into N groups, each containing three feature maps, one from each 1×1 convolution. The corresponding three feature maps are used as query, key, and value, respectively, following the traditional multi-head self-attention module. For the convolution path with kernel size k, a light fully connected layer is used to generate k² feature maps. Shifting and aggregating the generated features then convolves the input features and, as in conventional convolution, gathers information from a local receptive field.

Finally, the outputs of the two paths are added together, with their strengths controlled by two learnable scalars:

$$F_{out} = \alpha\, F_{att} + \beta\, F_{conv} \tag{13}$$
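Putting the pieces together, here is a condensed ACmix-style module in PyTorch. It is a sketch based on the description above rather than the authors' released implementation: the Stage I projections are shared, the attention path uses full-map attention for brevity (the paper uses a local window), and the convolution path uses a light fully connected layer followed by a group convolution standing in for shift-and-sum (the paper additionally initializes those kernels to shift patterns). The class name ACmixSketch and all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class ACmixSketch(nn.Module):
    def __init__(self, dim, heads=4, kernel_size=3):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dim_head = heads, dim // heads
        # Stage I: three 1x1 projections, shared by both paths.
        self.proj_q = nn.Conv2d(dim, dim, 1)
        self.proj_k = nn.Conv2d(dim, dim, 1)
        self.proj_v = nn.Conv2d(dim, dim, 1)
        # Convolution path: light FC over the 3 projections, then a group
        # convolution that plays the role of shift-and-sum.
        self.fc = nn.Conv2d(3 * dim, kernel_size * kernel_size * dim, 1)
        self.dep_conv = nn.Conv2d(kernel_size * kernel_size * dim, dim,
                                  kernel_size, padding=kernel_size // 2, groups=dim)
        # Learnable mixing scalars alpha and beta.
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        q, k, v = self.proj_q(x), self.proj_k(x), self.proj_v(x)

        # Self-attention path (Stage II): softmax-weighted aggregation of values.
        def split(t):                                      # -> (B, heads, H*W, dim_head)
            return t.view(B, self.heads, self.dim_head, H * W).transpose(2, 3)
        qh, kh, vh = split(q), split(k), split(v)
        attn = torch.softmax(qh @ kh.transpose(-2, -1) / self.dim_head ** 0.5, dim=-1)
        f_att = (attn @ vh).transpose(2, 3).reshape(B, C, H, W)

        # Convolution path (Stage II): reuse the same intermediate features.
        f_conv = self.dep_conv(self.fc(torch.cat([q, k, v], dim=1)))

        # F_out = alpha * F_att + beta * F_conv
        return self.alpha * f_att + self.beta * f_conv

x = torch.randn(2, 32, 14, 14)
print(ACmixSketch(32)(x).shape)                            # torch.Size([2, 32, 14, 14])
```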

4.3 Improved Shift and Summation

As shown in Section 4.2 and Figure 2, the intermediate features in the convolution path follow the shift-and-sum operations of traditional convolution modules. Although theoretically lightweight, shifting tensors in different directions breaks data locality in practice and is hard to vectorize, which can greatly impair the actual inference efficiency.

[Figure 3: replacing inefficient tensor shifts with fixed-kernel depthwise convolutions, then fusing them into a single group convolution]

As a remedy, depthwise convolutions with fixed kernels are used instead of the inefficient tensor shifts, as shown in Figure 3(b). Taking the shifted feature $\tilde{f} = \text{Shift}(f, -1, -1)$ as an example, it is computed as:

$$\tilde{f}_{c,i,j} = f_{c,\, i-1,\, j-1},$$

where $c$ indexes each channel of the input feature.

On the other hand, if the convolution kernel (kernel size k = 3) is chosen as:

$$K_c = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},$$

the corresponding depthwise convolution output can be expressed as:

$$f^{(dwc)}_{c,i,j} = \sum_{p,q} K_{c,p,q}\, f_{c,\, i+p-1,\, j+q-1} = f_{c,\, i-1,\, j-1}.$$

Therefore, for a specific shift direction, the depthwise convolution output with a carefully designed kernel is equivalent to a simple tensor shift. To further incorporate the summation of features from different directions, all input features and convolution kernels are concatenated separately, and the shift operations are expressed as a single group convolution, as shown in Figure 3(c.I). This modification makes the module more computationally efficient.
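Below is a small standalone check (my own sketch, not the released code) that a depthwise convolution with a fixed one-hot kernel reproduces Shift(f, −1, −1), which is what allows the shift-and-sum step to be fused into a single group convolution:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C, H, W = 1, 4, 6, 6
f = torch.randn(B, C, H, W)

# Fixed 3x3 kernel with a single 1 in the top-left corner, one per channel.
kernel = torch.zeros(C, 1, 3, 3)
kernel[:, 0, 0, 0] = 1.0

# Depthwise convolution (groups = C) with zero padding.
shifted_conv = F.conv2d(f, kernel, padding=1, groups=C)

# Direct tensor shift: output[..., i, j] = f[..., i-1, j-1] (zeros at the border).
shifted_ref = torch.zeros_like(f)
shifted_ref[..., 1:, 1:] = f[..., :-1, :-1]

assert torch.allclose(shifted_conv, shifted_ref)
print("depthwise conv with a one-hot kernel == Shift(f, -1, -1)")
```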

On this basis, some adaptations are introduced to enhance the flexibility of the module. As shown in Figure 3(c.II), the convolution kernels are released as learnable weights, with the shift kernels used as initialization. This increases the capacity of the model while preserving the behavior of the original shift operation. Multiple groups of convolution kernels are also used to match the output channel dimensions of the convolution and self-attention paths, as shown in Figure 3(c.III).

4.4 Computational cost of ACmix

For better comparison, the FLOPs and parameters of ACmix are summarized in Table 1.


The computational cost and training parameters of the first stage are the same as for self-attention and lighter than for traditional convolutions (e.g., 3×3 conv). In the second stage, ACmix introduces additional computational overhead (the fully connected layer and the group convolution), whose computational complexity is linear in the channel size C and relatively small compared with stage one. The actual cost measured on the ResNet50 model shows a trend similar to the theoretical analysis.

4.5 Generalization to other attention modes

With the development of self-attention mechanisms, many studies have explored variants of attention to further improve model performance. The patchwise attention proposed by some researchers combines information from all features in a local region into the attention weights, replacing the original softmax operation. The window attention adopted by the Swin Transformer keeps the receptive field of tokens within the same local window to save computation and achieve fast inference. ViT and DeiT, on the other hand, use global attention, which maintains long-range dependencies within a single layer. These modifications have proven effective under specific model architectures.

In this case, it is worth noting that the proposed ACmix is independent of the exact self-attention formulation and can easily be applied to the above variants. Specifically, the attention weights can be summarized as:

$$A\big(q_{ij}, k_{ab}\big) = \phi\big(\big[q_{ij},\, k_{\mathcal{N}_k(i,j)}\big]\big)_{ab} \quad \text{(Patchwise)}$$

$$A\big(q_{ij}, k_{ab}\big) = \operatorname{softmax}_{a,b \in \mathcal{W}_k(i,j)}\!\Big(\frac{q_{ij}^{\top} k_{ab}}{\sqrt{d}}\Big) \quad \text{(Window)}$$

$$A\big(q_{ij}, k_{ab}\big) = \operatorname{softmax}_{a,b \in \mathcal{W}}\!\Big(\frac{q_{ij}^{\top} k_{ab}}{\sqrt{d}}\Big) \quad \text{(Global)}$$

where $[\,\cdot\,]$ denotes feature concatenation, $\phi(\cdot)$ denotes two linear projection layers with an intermediate nonlinear activation, $\mathcal{W}_k(i,j)$ is the specialized receptive field of each query token, and $\mathcal{W}$ represents the entire feature map. The computed attention weights can then be applied to Eq. (12) and still fit the general two-stage formulation.
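To illustrate that the Stage II aggregation is agnostic to how the weights are computed, here is a simplified single-head sketch with flattened pixels and 1-D non-overlapping windows (my own assumptions rather than code from the paper):

```python
import torch

def aggregate(q, k, v, weight_fn):
    """q, k, v: (N, d) per-pixel projections from Stage I; weight_fn returns (N, N) weights."""
    return weight_fn(q, k) @ v                # Stage II: weighted sum of values

def global_weights(q, k):                     # ViT / DeiT-style global attention
    return torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)

def window_weights(q, k, window=16):          # Swin-style: keys restricted to a local window
    n, d = q.shape
    scores = q @ k.t() / d ** 0.5
    mask = torch.full((n, n), float("-inf"))
    for start in range(0, n, window):
        mask[start:start + window, start:start + window] = 0.0
    return torch.softmax(scores + mask, dim=-1)

q, k, v = (torch.randn(64, 32) for _ in range(3))
print(aggregate(q, k, v, global_weights).shape)   # torch.Size([64, 32])
print(aggregate(q, k, v, window_weights).shape)   # torch.Size([64, 32])
```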

5 Experiments

5.1 ImageNet

[ImageNet classification results for ResNet-ACmix, SAN-ACmix, PVT-ACmix and Swin-ACmix compared with their baselines]

The classification results are shown above. The ResNet-ACmix models outperform all baselines with comparable FLOPs or parameters.

For example, ResNet-ACmix 26 achieves the same top-1 accuracy as SASA-ResNet 50 with only 80% of the execution time. With similar FLOPs, the model outperforms SASA by 0.35%–0.8%, and the advantage over the other baselines is even larger.

For SAN-ACmix, PVT-ACmix and Swin-ACmix, the models achieve consistent improvements. SAN-ACmix 15 outperforms SAN 19 with 80% of the FLOPs. PVT-ACmix-T shows performance comparable to PVT-Large with only 40% of the FLOPs. Swin-ACmix-S achieves higher accuracy than Swin-B with 60% of the FLOPs.

5.2 Semantic Segmentation and Object Detection

[Semantic segmentation results on ADE20K]

The authors evaluate the effectiveness of the model on the ADE20K dataset with two segmentation frameworks, Semantic FPN and UperNet, using backbones pretrained on ImageNet-1K. The results show that ACmix achieves improvements under all settings.


The authors also conducted experiments on COCO.

[Tables 3 and 4: COCO object detection results with ResNet-based and Transformer-based backbones]

Tables 3 and 4 show the results of ResNet-based and Transformer-based models with different detection heads, including RetinaNet, Mask R-CNN, and Cascade Mask R-CNN. ACmix consistently outperforms the baselines with similar parameters or FLOPs, which further validates the effectiveness of ACmix when transferred to downstream tasks.

5.3 Ablation experiment

1. Combining the outputs of the two paths

[Table 6: ablation on different ways of combining the convolution and self-attention outputs]

The authors explore how different ways of combining the convolutional and self-attention outputs affect model performance, and the results are summarized in Table 6. By replacing window attention with traditional 3×3 convolutions, they also report single-path models: Swin-T, with self-attention only, and Conv-Swin-T, with convolution only. As observed, the combination of convolution and self-attention consistently outperforms models using a single path. Fixing the ratio of convolution to self-attention for all operators also degrades performance. In contrast, the learned parameters give ACmix more flexibility, allowing the strengths of the convolution and self-attention paths to be adjusted adaptively according to the position of the operator in the network.

2. Group convolution kernels

[Table 7: ablation on the group convolution kernel design and its effect on inference speed]

The authors also conduct ablation experiments on the choice of group convolution kernels, demonstrating in Table 7 the effectiveness of each adaptation empirically, and its impact on practical inference speed. Replacing tensor displacement with group convolution greatly improves inference speed. Furthermore, the use of learnable convolution kernels and well-designed initialization enhances the flexibility of the model and contributes to the final performance.

5.5 Bias towards Different Paths

[Figure 5: learned α and β across layers of SAN-ACmix and Swin-ACmix, and the ratio between the two paths]

It is also worth noting that ACmix introduces two learnable scalars α, β to combine the outputs from the two paths. This leads to a by-product of the module, where α and β actually reflect the model's bias towards convolution or self-attention at different depths.

Parallel experiments are performed, and Figure 5 shows the parameters α and β learned at different layers of the SAN-ACmix and Swin-ACmix models. The left and middle plots show the trends of the self-attention and convolution path weights, respectively. The variation across parallel runs is relatively small, especially for deeper layers.

This observation suggests that deep models have stable preferences for different design patterns. A more pronounced trend appears in the right-hand plot, where the ratio between the two paths is shown explicitly. It can be seen that:

  • In the early stages of the Transformer model, convolutions can be good feature extractors.

  • In the intermediate stages of the network, the model tends to use a mixture of the two paths, with a gradually increasing bias towards self-attention.

  • In the final stage, self-attention shows a larger advantage than convolution. This is consistent with the design patterns of previous works, where self-attention mostly replaces the original 3×3 convolutions in the final stages, and early-stage convolutions prove more effective for vision transformers.

By analyzing the changes of α and β, it is found that the deep model exhibits different biases towards convolution and self-attention at different stages.
