[Paper reading notes] ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks

Paper: ECA-Net
Paper code: https://github.com/BangguWu/ECANet

Paper summary

  ECA-Net is an extension of SE-Net. Its core argument is that the dimensionality reduction between the two FC layers of the SE block is harmful to learning channel attention weights: the weights should correspond to the channels directly, one to one. The authors ran a series of experiments to demonstrate the importance of keeping the number of channels unchanged inside the attention block.

  The procedure of ECA-Net is: (1) global average pooling produces a $1 \times 1 \times C$ vector; (2) a one-dimensional convolution (1D-Conv) carries out the cross-channel information interaction.
  The kernel size of the 1D convolution is chosen adaptively by a function of the channel count, so that layers with more channels get more cross-channel interaction. The adaptive kernel size is computed as $k = \psi(C) = \left|\frac{\log_2(C)}{\gamma} + \frac{b}{\gamma}\right|_{odd}$, where $\gamma = 2$ and $b = 1$.
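  As a quick illustration, here is a minimal Python sketch of this adaptive kernel-size rule (my own code, not the authors'; the rounding to the nearest odd number follows the $|\cdot|_{odd}$ operation described later in the adaptive kernel_size section):

```python
import math

def adaptive_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Compute the ECA 1D-conv kernel size k = psi(C) = |log2(C)/gamma + b/gamma|_odd."""
    t = int(abs((math.log2(channels) + b) / gamma))
    # |.|_odd: take the nearest odd number (even values are bumped up by one)
    return t if t % 2 else t + 1

# More channels -> larger kernel -> more cross-channel interaction
print(adaptive_kernel_size(64), adaptive_kernel_size(256), adaptive_kernel_size(512))  # 3 5 5
```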

Paper content

  According to the authors, most recent extensions of SE-Net focus on building more complex attention modules to obtain better performance, which inevitably increases model complexity. The method in this paper, ECA-Net, introduces only a handful of parameters and achieves two goals: (1) it avoids reducing the feature dimensionality; (2) it adds cross-channel information interaction (via one-dimensional convolution), reducing complexity while maintaining performance.

  The SE block consists of two parts: (1) global average pooling produces a $1 \times 1 \times C$ feature vector; (2) two FC layers (with dimensionality reduction in between) generate the weight for each channel.
  The structure of ECA-Net, shown in the figure below, is: (1) global average pooling produces a $1 \times 1 \times C$ feature vector; (2) the adaptive kernel_size is computed; (3) a one-dimensional convolution with that kernel_size produces the weight for each channel.
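  The following is a minimal PyTorch sketch of such a module, written from the description above rather than copied from the official repository (the class name `ECALayer` is my own choice):

```python
import math
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    """Efficient Channel Attention: GAP -> shared 1D conv across channels -> sigmoid gating."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Adaptive kernel size k = psi(C)
        t = int(abs((math.log2(channels) + b) / gamma))
        k_size = t if t % 2 else t + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # A single 1D conv shared over all channels (no dimensionality reduction)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) -> y: (N, C, 1, 1)
        y = self.avg_pool(x)
        # Treat the C channel descriptors as a length-C sequence for the 1D conv
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        return x * self.sigmoid(y)

# Usage example
feat = torch.randn(2, 256, 32, 32)
out = ECALayer(256)(feat)
print(out.shape)  # torch.Size([2, 256, 32, 32])
```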

Ablation study

  The authors' attempts to extend SE-Net are shown in the figure below (here $y$ denotes the output of the global average pooling layer): SE-Var1 (apply a non-linearity directly to $y$, no parameters), SE-Var2 (a per-channel weight on $y$, i.e., one parameter per channel, like a depth-wise operation), SE-Var3 (a single fully connected layer on $y$), SE-GC (a grouped convolution on $y$), ECA-NS (each channel has its own $k$ parameters for cross-channel interaction), and ECA (a one-dimensional convolution whose $k$ parameters are shared across channels).

  From the above figure, the following conclusions can be drawn:

  1. Comparing SE-Var1 with the vanilla network shows that attention is effective even without adding any parameters;
  2. Comparing SE-Var2 with SE shows that although SE-Var2 has fewer parameters, it still outperforms the original SE, which demonstrates the importance of keeping the channel dimension unchanged; this matters more than modelling non-linear cross-channel dependencies;
  3. Comparing SE-Var3 with SE shows that a single FC layer (SE-Var3) is better than the two dimensionality-reducing FC layers of SE;
  4. Comparing SE-Var3 with SE-Var2 shows the importance of cross-channel information interaction for learning channel attention;
  5. Group convolution (SE-GC) is a compromise between SE-Var3 and SE-Var2, yet it performs no better than SE-Var2; this is likely because group convolution discards the dependencies between different groups, leading to incomplete information exchange;
  6. ECA-NS gives each channel its own $k$ parameters, which avoids the information isolation between groups, and its results are reasonable;
  7. ECA uses shared weights, and the results show this is feasible; at the same time, weight sharing reduces the number of model parameters (see the sketch after this list contrasting shared and per-channel weights);
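
  To make the parameter-count difference concrete, here is a small illustrative sketch (my own, not from the paper's code) of SE-Var2 (one weight per channel) versus ECA ($k$ weights shared across all channels):

```python
import torch
import torch.nn as nn

C, k = 256, 5  # channel count and (adaptive) kernel size

# SE-Var2: one learnable scalar per channel, applied element-wise to y (C parameters)
se_var2_weight = nn.Parameter(torch.ones(C))

# ECA: a single 1D convolution whose k weights are shared by every channel (k parameters)
eca_conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

y = torch.randn(1, C)                          # pooled channel descriptors
se_var2_out = y * se_var2_weight               # (1, C), independent per-channel scaling
eca_out = eca_conv(y.unsqueeze(1)).squeeze(1)  # (1, C), local cross-channel interaction

print(se_var2_weight.numel())                           # 256 parameters
print(sum(p.numel() for p in eca_conv.parameters()))    # 5 parameters
```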
Adaptive kernel_size

  The authors gave up on using a simple linear mapping to obtain the kernel_size, considering it too restrictive. Since the number of channels is usually a power of 2, they use the following formula instead:

$$k = \psi(C) = \left|\frac{\log_2(C)}{\gamma} + \frac{b}{\gamma}\right|_{odd}$$

  Here $|x|_{odd}$ means taking the odd number nearest to $x$; $\gamma = 2$ and $b = 1$.
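
  As a worked example (my own, using the paper's $\gamma = 2$, $b = 1$): for $C = 256$,

$$k = \left|\frac{\log_2(256)}{2} + \frac{1}{2}\right|_{odd} = \left|4.5\right|_{odd} = 5,$$

  and for $C = 512$, $k = |5|_{odd} = 5$.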

Experimental results

Performance comparison with other SE-Net extensions:
Structural comparison with other SE-Net extensions:
Performance of ECA-Net applied to various frameworks:
Manual selection experiment for kernel_size:
Overall performance of ECA-Net:

Origin: blog.csdn.net/qq_19784349/article/details/107107432