SENet paper notes: attention mechanism

Squeeze-and-Excitation Networks 2019

Abstract

  • Traditional convolutional networks improve their representational power mainly by improving the quality of spatial encodings at the feature level.
  • SENet focuses on channel relationships: it adaptively reweights the channel-wise feature maps and explicitly models the interdependencies between channels.
  • Adding the SE module improves performance while increasing the amount of computation only slightly.

Introduction

  • Models from around 2015 use local receptive fields to fuse spatial and channel information simultaneously.
  • The Inception structure proposed later helps capture spatial correlations between features (by fusing large and small receptive fields?).
  • Later work introduced spatial attention.

The SE block builds on the relationships between channels and improves the representational ability of the network by explicitly modelling and recalibrating the interdependencies between convolutional feature channels.

Feature recalibration: allows the network to learn to use global information to selectively enhance useful features and suppress useless features.

[Figure: the Squeeze-and-Excitation block]

First, the Ftr (transformation) step is the original network's operation, which maps the input X to a feature map U:
$$\mathbf{F}_{tr}: \mathbf{X} \rightarrow \mathbf{U}, \quad \mathbf{X} \in \mathbb{R}^{H^{\prime} \times W^{\prime} \times C^{\prime}}, \quad \mathbf{U} \in \mathbb{R}^{H \times W \times C}$$
Feature recalibration: the feature map U first undergoes the squeeze operation.

The squeeze step generates channel descriptors by aggregating each feature map over its spatial dimensions (global average pooling, GAP).

The function of this descriptor is to produce an embedding of the global distribution of channel-wise feature responses, allowing information from the global receptive field of the network to be used by all its layers.

The aggregation is followed by an excitation operation (two FC layers) that generates a weight for each channel; these weights rescale U, giving the output of the SE block.

  • The SE block is easily integrated into a range of architectures.
  • In the early stages of the network (the first few layers), SE blocks strengthen shared low-level representations in a class-agnostic way. In later layers, SE blocks become increasingly specialized, that is, they respond differently to different categories in the image.
  • The SE block has few hyperparameters and can be integrated directly into existing models with only a slight increase in computational complexity.

SQUEEZE-AND-EXCITATION BLOCKS

The formula for Ftr is formula 1 below (a convolution operation: vc denotes the c-th convolution kernel, xs denotes the s-th input channel, and the summation is the accumulation over input channels in the convolution calculation):

$$\mathbf{u}_{c}=\mathbf{v}_{c} * \mathbf{X}=\sum_{s=1}^{C^{\prime}} \mathbf{v}_{c}^{s} * \mathbf{x}^{s}$$

The U obtained by Ftr is the second three-dimensional tensor in the figure: C feature maps (a tensor) of size H*W. uc denotes the c-th two-dimensional matrix in U, with the subscript c indexing the channel.
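As a sanity check of formula 1, the short sketch below (the tensor names X and V and all shapes are arbitrary choices of ours) confirms numerically that a standard convolution computes each output channel as a sum over input channels of 2D convolutions:

```python
import torch
import torch.nn.functional as F

# Minimal numerical check of formula 1 (names and shapes are illustrative only).
torch.manual_seed(0)
Cin, Cout, H, W, k = 3, 4, 8, 8, 3          # C' input channels, C output channels
X = torch.randn(1, Cin, H, W)
V = torch.randn(Cout, Cin, k, k)            # V[c, s] plays the role of v_c^s

U = F.conv2d(X, V, padding=1)               # the usual convolution Ftr

# Recompute output channel c = 0 as an explicit sum over the input channels s
u0 = sum(F.conv2d(X[:, s:s+1], V[0:1, s:s+1], padding=1) for s in range(Cin))
print(torch.allclose(U[:, 0:1], u0, atol=1e-5))   # True
```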

Squeeze operation: global average pooling converts the H * W * C input into a 1 * 1 * C output, representing global information.
$$z_{c}=\mathbf{F}_{sq}\left(\mathbf{u}_{c}\right)=\frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_{c}(i, j)$$
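As a quick shape illustration (the tensor U and its size below are an arbitrary example), the squeeze step in PyTorch is simply a mean over the spatial dimensions:

```python
import torch

U = torch.randn(1, 64, 32, 32)        # example N x C x H x W feature map
z = U.mean(dim=(2, 3))                # squeeze: GAP over H and W -> shape (1, 64), i.e. 1 * 1 * C
# equivalently: torch.nn.functional.adaptive_avg_pool2d(U, 1).flatten(1)
print(z.shape)                        # torch.Size([1, 64])
```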
Excitation operation:
$$\mathbf{s}=\mathbf{F}_{ex}(\mathbf{z}, \mathbf{W})=\sigma(g(\mathbf{z}, \mathbf{W}))=\sigma\left(\mathbf{W}_{2}\, \delta\left(\mathbf{W}_{1} \mathbf{z}\right)\right)$$
The result of the previous squeeze step is z. W1 is then multiplied by z, which is a fully connected layer operation.
The dimension of W1 (a weight matrix) is C/r * C, and the dimension of z is 1 * 1 * C.

r is a reduction ratio, set to 16 based on experiments. Its purpose is to reduce the number of channels and thereby reduce the amount of computation.

So the result of W1z has dimension 1 * 1 * C/r; it then passes through a ReLU layer, which leaves the dimension unchanged, and is multiplied by W2, another fully connected layer whose weight matrix has dimension C * C/r, so the output dimension returns to 1 * 1 * C; finally, the sigmoid function produces s.

s represents the weights of the C feature maps in U.

**Scale operation:** each channel is multiplied by its corresponding weight.
$$\tilde{\mathbf{x}}_{c}=\mathbf{F}_{scale}\left(\mathbf{u}_{c}, s_{c}\right)=s_{c} \cdot \mathbf{u}_{c}$$
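Putting the three steps together, here is a minimal PyTorch sketch of an SE block (the layer names, bias settings and the default of the `reduction` argument are our own choices, not taken from the authors' released code):

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Minimal SE block sketch: squeeze (GAP) -> excitation (FC-ReLU-FC-sigmoid) -> scale."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),  # W1: C -> C/r
            nn.ReLU(inplace=True),                                   # delta
            nn.Linear(channels // reduction, channels, bias=False),  # W2: C/r -> C
            nn.Sigmoid(),                                            # sigma -> weights in (0, 1)
        )

    def forward(self, u):                        # u: N x C x H x W
        z = u.mean(dim=(2, 3))                   # squeeze: N x C
        s = self.fc(z)                           # excitation: one weight per channel
        return u * s.view(u.size(0), -1, 1, 1)   # scale: channel-wise multiplication

x = torch.randn(2, 64, 32, 32)
print(SELayer(64)(x).shape)                      # torch.Size([2, 64, 32, 32])
```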

Instantiations

On the left, the SE block is integrated into an Inception module, and its scaled output is passed directly to the next layer. On the right, it is integrated into a residual block, with the scaling applied to the residual branch before the addition.

[Figure: SE-Inception module (left) and SE-ResNet module (right)]

Global average pooling is used as the Squeeze operation. Two Fully Connected layers then form a bottleneck structure to model the correlation between channels and output the same number of weights as there are input channels: the feature dimension is first reduced to 1/16 of the input and then restored to the original dimension by a second Fully Connected layer after ReLU activation. The advantages of this over using a single Fully Connected layer are:

  1. With more nonlinearity, it can better fit the complex correlation between channels;
  2. The number of parameters and the amount of computation are greatly reduced.

A Sigmoid gate is then used to obtain normalized weights between 0 and 1, and finally a Scale operation applies these weights to the features of each channel.

If the features on the main branch were recalibrated after the Addition, the 0~1 scaling on the main (identity) path would make gradient vanishing likely near the input layers when backpropagating through a deep network. (The impact of the placement of the SE block on the results is tested later.)
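A minimal sketch of the SE-ResNet integration, assuming a standard bottleneck block (the layer configuration, bias settings and shortcut handling are our simplifications, not the paper's released code); the point it illustrates is that the SE gate acts on the residual branch, and the unscaled identity path is added afterwards:

```python
import torch
import torch.nn as nn

class SEBottleneck(nn.Module):
    """Sketch of a ResNet bottleneck block with the SE gate on the residual branch."""
    def __init__(self, in_ch, mid_ch, reduction=16):
        super().__init__()
        out_ch = mid_ch * 4
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.fc = nn.Sequential(                  # SE excitation; the squeeze happens in forward()
            nn.Linear(out_ch, out_ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(out_ch // reduction, out_ch), nn.Sigmoid(),
        )
        self.shortcut = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        r = self.residual(x)
        w = self.fc(r.mean(dim=(2, 3))).view(r.size(0), -1, 1, 1)
        out = r * w                                # recalibrate the residual BEFORE the addition,
        return torch.relu(out + self.shortcut(x))  # so the identity path is never scaled by (0, 1)
```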

MODEL AND COMPUTATIONAL COMPLEXITY

The SE block offers a good trade-off between accuracy improvement and computational cost.


The added parameters come mainly from the two FC layers. Their weight matrices have dimensions C/r * C and C * C/r, so the parameter count of the two fully connected layers is 2*C^2/r. Taking ResNet as an example, assuming ResNet contains S stages in total and stage s contains N_s repeated residual blocks with C_s output channels, the number of parameters added by the SE blocks is:

$$\frac{2}{r} \sum_{s=1}^{S} N_{s} \cdot C_{s}^{2}$$
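As a worked example, assuming the standard ResNet-50 configuration (block counts N_s = 3, 4, 6, 3 and output widths C_s = 256, 512, 1024, 2048) and r = 16, the formula gives roughly the ~2.5M additional parameters reported for SE-ResNet-50:

```python
# Worked example of the formula above for SE-ResNet-50 with r = 16.
r = 16
stages = [(3, 256), (4, 512), (6, 1024), (3, 2048)]       # (N_s, C_s) per stage
extra = sum(2 / r * n * c ** 2 for n, c in stages)
print(f"total extra params: {extra / 1e6:.2f}M")          # ~2.51M

# Most of this comes from the final stage (C_s = 2048), which is why removing the
# SE blocks of the last stage (see the Role of Excitation section) saves the most.
last_stage = 2 / r * 3 * 2048 ** 2
print(f"final-stage share: {last_stage / 1e6:.2f}M")      # ~1.57M
```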

Experiments

ImageNet test

Model:

[Table 1]

Single-crop accuracy and GFLOPs of SENets of different depths and types: accuracy is higher, with only a slight increase in computational complexity.

"original" is the result reported in the corresponding original paper, and "re-implementation" is the best result from the authors' own reruns.


Training curves of SE-ResNet and ResNet at different depths


Convergence curves of ResNeXt and SE-ResNeXt, Inception-ResNet-v2 and SE-Inception-ResNet-v2


Scene classification test

The comparison on the Places365 dataset is between ResNet-152 and SE-ResNet-152. SENet still shows an advantage on datasets other than ImageNet.


Other tests


Ablation study

Squeeze Operator

Global average pooling (GAP) and global max pooling (GMP) were compared; GAP performed better, and other aggregation operators were not considered (Table 11).

Excitation Operator

The sigmoid was replaced with ReLU and tanh (Table 12).

We see that exchanging the sigmoid for tanh slightly worsens performance, while using ReLU is dramatically worse and in fact causes the performance of SE-ResNet-50 to drop below that of the ResNet-50 baseline.

Different stages

Adding SE blocks at different stages of ResNet improves accuracy, and adding them at all stages gives the largest improvement.


Integration strategy

Varying the position where the SE block is fused into the residual block (Table 14).


ROLE OF SE BLOCKS

Effect of Squeeze

We experiment with a variant of the SE block that adds an equal number of parameters but does not perform global average pooling.

The pooling is simply removed and replaced with two 1*1 convolutional layers that produce the same output dimensions as the original; the result is slightly worse than SE.

The SE block allows this global information to be used in a computationally parsimonious manner.

If the pooling is removed, FC layers can no longer be used (hence the 1*1 convolutions in the variant above).
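A hedged sketch of this "NoSqueeze" variant as described above (the exact layer settings used in the paper are our assumption): the pooling is gone and the two FC layers become two 1*1 convolutions, so each position is gated from local information only.

```python
import torch
import torch.nn as nn

class NoSqueezeLayer(nn.Module):
    """Sketch of the NoSqueeze ablation: no global pooling; the two FC layers
    become two 1x1 convolutions, so the gating sees no global context."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, u):            # u: N x C x H x W
        return u * self.gate(u)      # per-position, per-channel gating
```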

Role of Excitation

The excitation is not removed here; instead, the distribution of the scale weights it produces is examined.

Distribution of scales at different levels: the per-category curves at the earlier levels (SE_2_3 and SE_3_4) differ very little, which suggests that the scale distribution in lower layers is largely independent of the input category;

As depth increases, the curves of different categories begin to diverge (SE_4_6 and SE_5_1), which shows that the scales in later layers are strongly related to the input category. After SE_5_2, almost all scales saturate with an output of 1 (only one channel is 0), and in the last SE layer, SE_5_3, the scales of the channels are basically the same.

Since the scales of the last two SE layers are basically constant, they contribute little and can be removed to save computation.

My Conclusion

  1. The idea of dynamically adapting to the input is important.
  2. Besides depthwise separable convolution, inter-channel correlations can also be reweighted dynamically according to the input category.

Origin blog.csdn.net/qq_52038588/article/details/128189674