Literature reading (52) - Integration of self-attention and convolution

On the Integration of Self-Attention and Convolution
CVPR 2022

Prior knowledge / background

  • convolutional network

    • advantage

      • Parameter sharing: the same kernel weights are reused as the kernel slides over the image, which greatly reduces the model's parameter count and, with it, training time and memory consumption.
      • Local perception: each convolution output depends only on a local region, so features are less affected by global noise, improving their robustness.
      • Spatial invariance: convolution acts identically under translation across the entire image, giving CNNs a degree of spatial (translation) invariance when processing images.
    • shortcoming

      • Large kernels: capturing more complex features requires large convolution kernels, which inflates the parameter count and makes the model prone to overfitting.
      • Fixed receptive field: because kernel size and stride are fixed, a CNN layer can only perceive a fixed-size region and may fail to capture features at other scales.
  • self-attention mechanism

    • advantage

      • Dynamic weights: the attention weights are computed from the input itself, so the model can focus on different features for different inputs and tasks.
      • Flexibility: the attention mechanism can be integrated with various neural network structures, such as CNNs, RNNs, and Transformers.
    • shortcoming

      • Computational cost: computing an importance score for every pair of features adds overhead, which can make training considerably slower.
      • Adversarial examples: attention may reduce robustness to adversarial examples, because it can over-focus on a few salient features while ignoring others.
  • between the two

    • A convolution is a static operation: it extracts features with the same fixed weights over the entire image. The attention mechanism is a dynamic operation that assigns input-dependent weights. In addition, convolution only processes local information, while attention can capture global information.
    • Traditional convolution applies aggregation over local receptive fields using filter weights that are shared across the entire feature map; these intrinsic properties impose important inductive biases for image processing. In contrast, the self-attention module uses a weighted average operation based on the input feature context, where the attention weights are computed dynamically by a similarity function between related pixel pairs. This flexibility lets the attention module adaptively focus on different regions and capture more informative features. The sketch below contrasts the two.
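A minimal sketch (assumed PyTorch code, not from the paper) of this contrast: the convolution's kernel weights are fixed and shared across all locations, while the attention weights are recomputed from every input.

```python
# Minimal sketch (not from the paper): static convolution weights
# versus dynamic self-attention weights on the same feature map.
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 16, 16)             # (batch, channels, H, W)

# Convolution: the 3x3 kernel is fixed after training and shared
# across every spatial location (parameter sharing, local perception).
w = torch.randn(8, 8, 3, 3)
conv_out = F.conv2d(x, w, padding=1)

# Self-attention: the weights are recomputed from the input each time.
# wq, wk, wv are hypothetical stand-ins for learned 1x1 projections.
wq, wk, wv = (torch.randn(8, 8) for _ in range(3))
tokens = x.flatten(2).transpose(1, 2)      # (batch, H*W, channels)
q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
attn = F.softmax(q @ k.transpose(1, 2) / 8 ** 0.5, dim=-1)  # dynamic, global
attn_out = attn @ v                        # weighted average over all positions
```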

Article structure

  • abstract
  • related work
  • revisiting convolution and self-attention
  • method★
  • experiments
  • conclusion

Background


Questions raised:

  • Convolution attends to local information with static weights, while the attention mechanism uses dynamic weights; yet there is a potential connection between the two.
  • Through decomposition, both can be shown to rely on the same 1×1 convolution operation.

Based on this finding, the authors develop a hybrid model that elegantly integrates self-attention and convolution at minimal computational cost. A small sketch below illustrates the decomposition behind it.
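A hypothetical verification sketch (assumed PyTorch, not the authors' code): a 3×3 convolution can indeed be rewritten as nine 1×1 convolutions followed by spatial shifts and a sum.

```python
# Hypothetical sketch: a 3x3 convolution equals the sum of nine
# 1x1 convolutions followed by spatial shifts (Stage I / Stage II).
import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 8, 8)
w = torch.randn(6, 4, 3, 3)                # (C_out, C_in, 3, 3)

full = F.conv2d(x, w, padding=1)           # standard 3x3 convolution

acc = torch.zeros_like(full)
for p in range(3):
    for q in range(3):
        # Stage I: a 1x1 convolution using the (p, q)-th kernel slice.
        piece = F.conv2d(x, w[:, :, p:p+1, q:q+1])
        # Stage II: shift by the kernel offset and aggregate.
        acc += torch.roll(piece, shifts=(1 - p, 1 - q), dims=(2, 3))

# Interiors match; borders differ because roll wraps around while
# zero-padding does not.
print(torch.allclose(full[..., 1:-1, 1:-1], acc[..., 1:-1, 1:-1], atol=1e-5))
```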


Article method

ACmix
[Figure: ACmix overview]

1. Relating Self-Attention with Convolution

  • Convolution process: a k×k convolution can be decomposed into Stage I, a set of 1×1 convolutions, and Stage II, a shift-and-sum aggregation (see the equations after this list).

  • The self-attention process: Stage I projects the input into queries, keys, and values with 1×1 convolutions; Stage II computes a softmax-weighted average over a local neighborhood (see the equations after this list).
    From this derivation it can be seen that:

  • In Stage I, convolution and self-attention are actually the same operation: projecting the input features through 1×1 convolutions. This is also the most computationally expensive step.

  • Stage II of both is crucial for capturing semantic features, yet it is lightweight and introduces no additional learnable parameters.
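A sketch of the two-stage decomposition, with notation paraphrased from the paper: f is the input feature map, K the k×k kernel, W_q, W_k, W_v the projection matrices, d the feature dimension, and N_k(i, j) the local k×k neighborhood of position (i, j).

$$\text{Convolution, Stage I:}\quad \tilde{g}^{(p,q)}_{ij} = K_{p,q}\, f_{ij} \qquad \text{(1×1 convolutions)}$$

$$\text{Convolution, Stage II:}\quad g_{ij} = \sum_{p,q} \tilde{g}^{(p,q)}_{i+p-\lfloor k/2 \rfloor,\; j+q-\lfloor k/2 \rfloor} \qquad \text{(shift and sum)}$$

$$\text{Attention, Stage I:}\quad q_{ij} = W_q f_{ij}, \quad k_{ij} = W_k f_{ij}, \quad v_{ij} = W_v f_{ij} \qquad \text{(1×1 convolutions)}$$

$$\text{Attention, Stage II:}\quad g_{ij} = \sum_{a,b \in \mathcal{N}_k(i,j)} \operatorname{softmax}\!\left(\frac{q_{ij}^{\top} k_{ab}}{\sqrt{d}}\right) v_{ab} \qquad \text{(weighted average)}$$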

Overall design
[Figure: overall design of ACmix]

In the end, the output is a weighted sum of the features from the two paths, with learnable scalar weights:

$$F_{\text{out}} = \alpha\, F_{\text{att}} + \beta\, F_{\text{conv}}$$
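A hypothetical end-to-end sketch of this design (assumed PyTorch, simplified: global attention stands in for the paper's local-window attention, and a depthwise convolution approximates the fully connected layer plus fixed-shift aggregation of the convolution path; the class name ACmixSketch is made up):

```python
# Hypothetical ACmix-style block (simplified, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACmixSketch(nn.Module):
    def __init__(self, dim, k=3):
        super().__init__()
        # Stage I (shared by both paths): q, k, v as 1x1 convolutions.
        self.qkv = nn.Conv2d(dim, 3 * dim, kernel_size=1)
        # Convolution path, Stage II: mix the three projections, then
        # approximate shift-and-aggregate with a depthwise convolution.
        self.mix = nn.Conv2d(3 * dim, dim, kernel_size=1)
        self.agg = nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)
        # Learnable path weights (the alpha and beta above).
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)          # shared Stage I
        # Attention path, Stage II (global attention for brevity).
        qf = q.flatten(2).transpose(1, 2)              # (b, hw, c)
        kf = k.flatten(2)                              # (b, c, hw)
        vf = v.flatten(2).transpose(1, 2)              # (b, hw, c)
        attn = F.softmax(qf @ kf / c ** 0.5, dim=-1)
        f_att = (attn @ vf).transpose(1, 2).reshape(b, c, h, w)
        # Convolution path, Stage II.
        f_conv = self.agg(self.mix(torch.cat([q, k, v], dim=1)))
        return self.alpha * f_att + self.beta * f_conv

out = ACmixSketch(16)(torch.randn(2, 16, 8, 8))        # -> (2, 16, 8, 8)
```

Note how the expensive projections are computed once and shared, which is where the paper's "minimal computational cost" claim comes from.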

Article results

1. Classification

[Table: classification results]

2. Segmentation

[Table: segmentation results]

3. Object detection

[Tables: object detection results]

4. Ablation Experiment

(1) Combined block vs. single block

The authors compare attention-only, convolution-only, and the combined block in terms of performance, parameter count, and FLOPs.
[Table: ablation results]

(2) Group convolution kernels

[Table: ablation on group convolution kernels]

(3) Hyperparameters

[Figure: hyperparameter ablation]
It can be seen that in the early stages of a Transformer-style model, convolution extracts better features, while in the final stages the attention mechanism provides better features.

Contributions

  1. Reveals a strong underlying relationship between self-attention and convolution, providing a new perspective for understanding both in depth.
  2. Proposes a module that combines the advantages of both; ablation experiments show the hybrid model is more effective than either alone.

Summary

Judging from the authors' ablation experiments, the results are quite encouraging. It is a new perspective and well worth learning from!

What can be learned / borrowed from it?

Go read it, everyone!

Origin blog.csdn.net/qq_43368987/article/details/130653505