CVPR 2022 | Integrating self-attention and convolution: ACmix improves both performance and speed


Preface

Researchers from Tsinghua University and collaborators have proposed a new paradigm that combines convolution and self-attention for vision tasks, improving both performance and speed. The official code has been open-sourced.

Introduction

Convolution and self-attention are two powerful techniques for representation learning, and they are usually regarded as two approaches with different mechanisms. In this paper, the authors show that most of the computation in these two paradigms is actually performed by the same operation, revealing a strong intrinsic relationship between them. The authors split both convolution and self-attention into two stages. A traditional convolution with a k×k kernel can be decomposed into k×k individual 1×1 convolutions, followed by shift and summation operations. In the self-attention module, the projections of queries, keys, and values can be interpreted as multiple 1×1 convolutions, followed by the computation of attention weights and the aggregation of the values. Therefore, the first stage of both modules consists of the same operation. More importantly, the first stage accounts for the dominant computational cost (quadratic in the channel size) compared with the second stage. This makes it possible to combine these two seemingly different paradigms into ACmix, which enjoys the benefits of both self-attention and convolution while adding minimal computational overhead compared with pure convolution or pure self-attention. Extensive experiments show that the model achieves consistent improvements over competitive baselines on image recognition and downstream tasks.

Main content

[Figure: the 1×1 projections shared between convolution and self-attention]
Structurally, as shown in the figure above, the 1×1 convolutions implicitly contained in both convolution and self-attention are shared, which reduces the computation of this part.

The paper only gives a general sketch of how this is achieved, which is not detailed enough on its own. To understand the process more clearly, we walk through the concrete operations based on the authors' formulas, parameter counts, and computational costs.

Convolution

In a traditional convolution, the k×k kernel transforms the features at each position and aggregates them locally before sliding to the next position. Here, the convolution is instead viewed as follows: all feature transformations are first computed with 1×1 kernels over the whole feature map, and the results are then shifted and aggregated. This splits the convolution into two stages, transformation followed by shift-and-aggregate:
[Figures: the two-stage decomposition of a standard convolution (Stage I: 1×1 transformations; Stage II: shift and summation), with parameter and FLOP counts]
Step 1: the k×k convolution kernel is split into k×k separate 1×1 kernels, and each 1×1 kernel multiplies the feature map without any summation across kernel positions. All of the kernel weights participate in this stage, so the parameter count is k × k × C_in × C_out and the computational cost is also k × k × C_in × C_out, as shown in the figure above.

Step 2: to make the outputs of the 1×1 convolutions correspond to a standard k×k convolution, the feature maps must be added at the corresponding positions. The first operation is the shift, and the second is the summation of the shifted values at each position. This step only sums the outputs of step 1 and introduces no new parameters, so the parameter count is 0. The shift itself generates no computation, so the total computational cost of this stage is k × k × C_out.
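To make this two-stage view concrete, here is a minimal PyTorch sketch (sizes and variable names are my own, not from the official repository) that decomposes a k×k convolution into k×k separate 1×1 convolutions followed by shift-and-sum, and checks the result against a standard convolution:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes for illustration only.
B, Cin, Cout, H, W, k = 2, 4, 8, 16, 16, 3
p = k // 2

x = torch.randn(B, Cin, H, W)
weight = torch.randn(Cout, Cin, k, k)

# Stage I: k*k separate 1x1 convolutions, one per kernel position.
x_pad = F.pad(x, (p, p, p, p))  # zero padding, as in a standard "same" convolution
stage1 = [F.conv2d(x_pad, weight[:, :, i:i + 1, j:j + 1])
          for i in range(k) for j in range(k)]

# Stage II: shift each 1x1 output to its kernel position and sum.
out_decomposed = sum(stage1[i * k + j][:, :, i:i + H, j:j + W]
                     for i in range(k) for j in range(k))

# Reference: the standard kxk convolution with the same weights (no bias).
out_standard = F.conv2d(x, weight, padding=p)
print(torch.allclose(out_decomposed, out_standard, atol=1e-5))  # expect True
```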

Self-attention

The self-attention mechanism is also widely used in computer vision, for example in the Transformer family. Compared with traditional convolution, it allows the model to focus on important regions within a larger context. The formula is as follows (considering N heads):
[Formula: multi-head self-attention aggregated over a local k×k window]
where:
[Formula: the query, key, and value projections obtained from 1×1 transformations of the input features]
Similarly, multi-head attention can also be viewed as two stages:
[Figures: the two-stage decomposition of multi-head self-attention (Stage I: 1×1 projections of q, k, v; Stage II: attention weights and aggregation), with parameter and FLOP counts]
In the first stage, the queries, keys, and values are computed with 1×1 convolutional transformations. Why can 1×1 convolutions be used here? Because the projection from input to output is a per-position fully connected layer, and such a layer can be replaced by a 1×1 convolution. These are simply three 1×1 convolution matrices, so the parameter count is 3 × C × C. The second stage computes the attention weights from the queries and keys, aggregates the values, and concatenates the different heads. Since only elements within a k×k window are considered (i.e., only local features are collected), the sequence length in the cost is fixed at k×k. The query-key computation therefore costs k × k × C, and applying the weights to the values costs another k × k × C, so the total is 2 × k × k × C. No additional parameters are learned in this stage, so its parameter count is 0.
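As a rough illustration of this two-stage view of local self-attention, here is a minimal PyTorch sketch (my own simplification, not the authors' implementation; relative positional encoding is omitted). The 1×1-conv projections form stage one, and the windowed attention plus aggregation form stage two:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalWindowAttention(nn.Module):
    """Minimal sketch of kxk local self-attention with 1x1-conv projections.
    Shapes and naming are illustrative, not the official implementation."""
    def __init__(self, channels, num_heads=4, window=3):
        super().__init__()
        assert channels % num_heads == 0
        self.h, self.k, self.d = num_heads, window, channels // num_heads
        # Stage I: q/k/v projections are plain 1x1 convolutions.
        self.to_q = nn.Conv2d(channels, channels, 1)
        self.to_k = nn.Conv2d(channels, channels, 1)
        self.to_v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        pad = self.k // 2
        # Gather the kxk neighbourhood of every pixel for keys and values.
        k = F.unfold(k, self.k, padding=pad).view(B, self.h, self.d, self.k * self.k, H * W)
        v = F.unfold(v, self.k, padding=pad).view(B, self.h, self.d, self.k * self.k, H * W)
        q = q.view(B, self.h, self.d, 1, H * W)
        # Stage II: attention weights over the local window, then aggregation.
        attn = (q * k).sum(dim=2, keepdim=True) / self.d ** 0.5  # (B, h, 1, k*k, H*W)
        attn = attn.softmax(dim=3)
        out = (attn * v).sum(dim=3)                              # (B, h, d, H*W)
        return out.view(B, C, H, W)

# usage: y = LocalWindowAttention(64, num_heads=4, window=3)(torch.randn(1, 64, 32, 32))
```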

ACmix

ACmix integrates the convolution and self-attention operations, mainly by sharing the feature transformation of their first stage, as shown in the figure below.
[Figure: the ACmix architecture, with a shared Stage I projection and separate convolution and self-attention paths in Stage II]
In the first stage, three 1×1 convolutions generate three feature maps (essentially the q, k, and v of self-attention), and each feature map is split into N groups along the depth dimension (the N heads of self-attention). The computation and parameter counts here correspond to three independent 1×1 convolutions. It seems that the authors designed the first stage mainly around the self-attention mechanism; since the main role of convolution is feature extraction, reusing these projections does not hurt the convolution path. The parameter count and computational cost of this stage are both 3 × C × C (a minimal sketch follows below).
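A minimal sketch of this shared Stage I, assuming channel count C and N heads (names and sizes are illustrative, not taken from the official code):

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration.
C, N, H, W = 64, 4, 32, 32
x = torch.randn(1, C, H, W)

# Stage I: three 1x1 convolutions produce the q / k / v feature maps.
proj_q, proj_k, proj_v = (nn.Conv2d(C, C, kernel_size=1) for _ in range(3))
q_map, k_map, v_map = proj_q(x), proj_k(x), proj_v(x)

# Each map is split into N groups along the channel (depth) dimension,
# giving 3N intermediate feature groups reused by both paths.
q_heads = q_map.view(1, N, C // N, H, W)
k_heads = k_map.view(1, N, C // N, H, W)
v_heads = v_map.view(1, N, C // N, H, W)

# Parameters of this stage: 3 * C * C (ignoring bias); cost per pixel: 3 * C * C.
```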
The second stage considers two parts. In the convolution path, the channels are first expanded with a fully connected layer along the channel dimension. Each group of size H × W × C/N can be viewed as a basic unit of a group operation (not shared across the N groups, shared within the C/N channels). The depth along the channel dimension is 3N, and a fully connected layer maps 3N → k × k × N, where k × k corresponds to the convolution kernel size. After this transformation, the result corresponds to the first step of the convolution decomposition above, and the subsequent processing is the same as its second step: first shift, then aggregate into the output dimensions.
[Figure: three implementations of the shift operation]
Several points are worth noting here (a minimal sketch of the shift-as-convolution trick follows this list):
1. A direct spatial shift breaks the regular data layout and is hard to vectorize, so the authors implement the shift with a convolution-like equivalent transformation, as shown in the figure above. For the shift options, refer to [6]: the first is a manual translation, the second uses a matrix transformation, and the third uses a grouped convolution. The authors adopt the third option, realized with a learnable group convolution of kernel size k_c × k_c.
2. From the computational cost of the spatial shift, k_c⁴ × C, it can be deduced that within the N-group convolution the per-group results are not summed across channels; a depthwise (depth-separable) convolution is used instead.
3. From the computational cost and parameter count, it can be deduced that when ACmix performs the second-step convolution, the aggregation cost is k × k × C and the processing is parallelized.
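Here is a minimal sketch of the third option: a shift emulated by a depthwise convolution whose kernel contains a single 1 (the fixed-kernel case; in ACmix such kernels initialize a learnable group convolution). The helper name and sizes are my own:

```python
import torch
import torch.nn.functional as F

def shift_as_depthwise_conv(x, dy, dx, k=3):
    """Emulate the spatial shift out[y, x] = in[y + dy, x + dx] (zero-padded at
    the borders) with a depthwise convolution whose kernel is all zeros except
    a single 1. Kept fixed here for clarity; ACmix makes it learnable."""
    B, C, H, W = x.shape
    kernel = torch.zeros(C, 1, k, k, device=x.device, dtype=x.dtype)
    kernel[:, 0, k // 2 + dy, k // 2 + dx] = 1.0  # the single 1 encodes the shift
    return F.conv2d(x, kernel, padding=k // 2, groups=C)

x = torch.randn(1, 4, 8, 8)
y = shift_as_depthwise_conv(x, dy=1, dx=-1)

# Check against a direct (zero-padded) shift.
ref = torch.zeros_like(x)
ref[:, :, :-1, 1:] = x[:, :, 1:, :-1]
print(torch.allclose(y, ref))  # expect True
```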

The self-attention path follows the normal attention computation. Since it is split into N heads, the computational cost is N × k_a × (C/N) × k_a + N × k_a × k_a × (C/N) = 2 k_a² × C. No new parameters are introduced in this path, so its parameter count is 0.

Finally, the outputs of the convolution path and the self-attention path are fused with different weights:
F_out = α · F_att + β · F_conv
where α and β are learnable parameters.
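A minimal sketch of this fusion step (the initialization of α and β to 1.0 is an assumption, not taken from the official code):

```python
import torch
import torch.nn as nn

class ACmixFusion(nn.Module):
    """Sketch of the final fusion only: combine the two path outputs with
    learnable scalar rates, F_out = alpha * F_att + beta * F_conv."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, f_att, f_conv):
        return self.alpha * f_att + self.beta * f_conv

# usage: f_out = ACmixFusion()(f_att, f_conv), where f_att and f_conv are the
# outputs of the self-attention path and the convolution path.
```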

Experimental results

The authors conducted experiments on image recognition and downstream tasks.

Image recognition

Applying ACmix to four base models (ResNet, SAN, PVT, and Swin-Transformer), the results on the ImageNet dataset are as follows:
[Table: ImageNet classification results]
With almost the same computational cost and parameter count, the top-1 accuracy improves by about 1 point.

Segmentation task

On the two segmentation networks, Semantic FPN and UperNet, the results on the ADE20K dataset are as follows:
[Table: ADE20K segmentation results]

Object detection task

For object detection, experiments are run with both ResNet-based and Transformer-based backbones on the COCO dataset; the results are as follows:
[Tables: COCO object detection results with ResNet-based and Transformer-based backbones]

Inference speed

With an input size of (3, 576, 576), running inference on the Ascend 910 hardware platform with the MindSpore framework, the inference speed is as follows:
[Table: inference speed on Ascend 910 with MindSpore]

Ablation test

1. Ablation on fusing the two paths with different weights:
[Table: ablation results for different fusion weights]
2. Comparison of the three shift implementations for convolution:
[Table: comparison of the three shift implementations]

Weights for different paths

The authors show the learned α and β in different layers of the SAN-ACmix network: in the early layers, convolution has a higher weight and acts as a feature extractor; in later layers, self-attention gradually takes over with a higher weight.
[Figure: learned α and β across layers of SAN-ACmix]

Conclusion

In summary, the authors decompose convolution and the self-attention mechanism, share the feature projection stage when mapping feature maps (and thus its computational overhead), and fuse the convolution and self-attention operations, demonstrating effectiveness on image classification and downstream tasks. With the strong performance of Transformers in computer vision, convolution and self-attention have been combined in various forms. The authors fuse them with learnable weights so that both convolution and self-attention are retained; because the feature extraction is shared with the convolution branch, the self-attention remains confined to a fixed window rather than being global.

Paper: https://arxiv.org/pdf/2111.14556.pdf
Official code: https://github.com/LeapLabTHU/ACmix
References:
[1] https://arxiv.org/pdf/2111.14556.pdf
[2] https://www.yuque.com/lart/papers/nlu51g
[3] https://zhuanlan.zhihu.com/p/440649716
[4] https://blog.csdn.net/qq_37151108/article/details/121938837
[5] https://zhuanlan.zhihu.com/p/439676274
[6] https://blog.csdn.net/P_LarT/article/details/122521114

For more deep learning content, follow the official account "The Invincible Zhang Dao".
