Attention modules

 At present, mainstream attention mechanisms can be divided into the following three types: channel attention, spatial attention, and self-attention.

  • The channel domain aims to model the correlations between different channels, automatically learn the importance of each feature channel through the network, and finally assign a different weight coefficient to each channel, strengthening the important features while suppressing the unimportant ones.
  • The spatial domain aims to improve the feature representation of key regions. Essentially, a spatial transformation module maps the spatial information of the original image into another space while retaining the key information; a weight mask is generated for each position and used to weight the output, enhancing the specific target regions of interest while weakening irrelevant background regions (CBAM is a representative example).
  • The hybrid domain combines channel attention, spatial attention, and other forms of attention to obtain a more comprehensive feature attention scheme. A minimal shape sketch of the channel and spatial cases is shown below.
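
As a quick illustration (this sketch is not from the original post), channel attention produces one weight per channel while spatial attention produces one weight per spatial position; both are applied to the feature map by broadcast multiplication:

import torch

feats = torch.rand(4, 32, 16, 16)           # feature map [batch, channels, height, width]
channel_weights = torch.rand(4, 32, 1, 1)   # channel attention: one weight per channel
spatial_mask = torch.rand(4, 1, 16, 16)     # spatial attention: one weight per position

print((feats * channel_weights).shape)      # torch.Size([4, 32, 16, 16])
print((feats * spatial_mask).shape)         # torch.Size([4, 32, 16, 16])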

1. ECA attention module
The ECA attention module is a channel attention module that is widely used in vision models. It is plug-and-play: it enhances the channel features of the input feature map, and the output of the ECA module has exactly the same shape as its input.

Background: ECA-Net argues that the dimensionality reduction used in SENet has a negative impact on predicting channel attention, and that capturing the dependencies among all channels at once is inefficient and unnecessary.

Design: ECA builds on the SE module, but replaces the fully connected (FC) layers that SE uses to learn channel attention with a 1D convolution (Conv1D). The 1D convolution carries out the information interaction between channels, and its kernel size is adapted through a function of the channel count, so that layers with more channels perform more cross-channel interaction.

Function: use the 1D convolution to capture information between neighboring channels, avoid channel dimensionality reduction when learning channel attention, and reduce the number of parameters (the FC layers carry many parameters, while the small 1D convolution kernel carries only a few). A rough parameter comparison is sketched below.
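
As a back-of-the-envelope comparison (not from the original post; it assumes C = 512 channels and the SE reduction ratio r = 16 from the SE paper, whereas the SE code later in this article defaults to ratio = 4), the two bias-free FC layers in SE carry tens of thousands of weights, while the ECA kernel for C = 512 has only k = 5 weights:

# parameter count of the SE excitation branch vs. the ECA 1D convolution (assumed C and r)
C, r = 512, 16
se_fc_params = C * (C // r) + (C // r) * C   # two bias-free fully connected layers
eca_conv_params = 5                          # one 1D kernel, k = 5 for C = 512 (see the formula below)
print(se_fc_params, eca_conv_params)         # 32768 5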
Module structure:

 The processing flow of the ECA module is as follows:

  • First, input a feature map of dimension H*W*C;
  • Apply spatial feature compression to the input feature map: global average pooling (GAP) over the spatial dimensions produces a 1*1*C feature map;
  • Perform channel feature learning on the compressed feature map: a 1D convolution learns the importance of the different channels, and the output dimension is still 1*1*C;
  • Finally, apply the channel attention: the 1*1*C channel-attention map and the original H*W*C input feature map are multiplied channel by channel, and the feature map weighted by channel attention is output.

In the FC fully connected layers, the input channel descriptor is processed globally, so every channel interacts with every other channel.

If a 1D convolution is used instead, only the information among neighboring (local) channels is learned.

The kernel size of a convolution determines its receptive field. To handle input feature maps with different numbers of channels and capture interactions over different ranges, ECA uses a dynamically sized 1D convolution kernel to learn the importance of the different channels.

  • A dynamic convolution kernel means that the kernel size adapts through a function of the channel count;
  • In layers with many channels, a larger kernel is used, so more cross-channel interaction is performed;
  • In layers with few channels, a smaller kernel is used, so less cross-channel interaction is performed. The contrast with a fully connected layer is illustrated in the short snippet below.
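
To make the local-versus-global contrast concrete (a small illustration, not from the original post), an FC layer mapping C channels to C channels holds a C*C weight matrix, so every output depends on all channels, whereas the 1D convolution used by ECA holds only k weights and each output depends only on its k neighboring channels:

import torch
from torch import nn

C, k = 64, 3
fc = nn.Linear(C, C, bias=False)                                    # global interaction: C*C weights
conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)   # local interaction: k weights

print(fc.weight.shape)    # torch.Size([64, 64])
print(conv.weight.shape)  # torch.Size([1, 1, 3])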
     

 The kernel-size adaptation function is defined as follows:

  k = \psi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{odd}

  where k is the size of the convolution kernel, C is the number of channels, and |t|_{odd} denotes the nearest odd number to t; \gamma and b are set to 2 and 1 in the paper and control the ratio between the number of channels C and the kernel size k.
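
For a quick sanity check (this snippet is not from the original post), the kernel sizes produced by this rule with gamma = 2 and b = 1 for some common channel counts are:

import math

# adaptive kernel size: k = |log2(C)/gamma + b/gamma|_odd with gamma = 2, b = 1
def eca_kernel_size(C, gamma=2, b=1):
    t = int(abs((math.log(C, 2) + b) / gamma))
    return t if t % 2 else t + 1   # round an even result up to the nearest odd number

for C in (64, 128, 256, 512):
    print(C, eca_kernel_size(C))   # 64 -> 3, 128 -> 5, 256 -> 5, 512 -> 5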

# --------------------------------------------------------- #
# (2) ECANet channel attention
# Replaces the fully connected layers of the SE module with a 1D convolution
# --------------------------------------------------------- #
 
import torch
from torch import nn
import math
from torchstat import stat  # for inspecting the network parameters
 
# definition of the ECA module
class eca_block(nn.Module):
    # in_channel is the number of input channels; b and gama are the two coefficients of the kernel-size formula
    def __init__(self, in_channel, b=1, gama=2):
        # initialize the parent class
        super(eca_block, self).__init__()
        
        # adapt the kernel size to the number of input channels
        kernel_size = int(abs((math.log(in_channel, 2) + b) / gama))
        # if the kernel size is odd, keep it
        if kernel_size % 2:
            kernel_size = kernel_size
        # if the kernel size is even, make it odd
        else:
            kernel_size = kernel_size + 1
        
        # amount of zero padding needed to keep the size unchanged after the convolution
        padding = kernel_size // 2
        
        # global average pooling; the output feature map has height = width = 1
        self.avg_pool = nn.AdaptiveAvgPool2d(output_size=1)
        # 1D convolution with one input and one output channel and an adaptive kernel size
        self.conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=kernel_size,
                              bias=False, padding=padding)
        # sigmoid activation to normalize the weights
        self.sigmoid = nn.Sigmoid()
    
    # forward pass
    def forward(self, inputs):
        # shape of the input feature map
        b, c, h, w = inputs.shape
        
        # global average pooling [b,c,h,w] ==> [b,c,1,1]
        x = self.avg_pool(inputs)
        # reshape into a sequence [b,c,1,1] ==> [b,1,c]
        x = x.view([b,1,c])
        # 1D convolution [b,1,c] ==> [b,1,c]
        x = self.conv(x)
        # normalize the weights
        x = self.sigmoid(x)
        # reshape [b,1,c] ==> [b,c,1,1]
        x = x.view([b,c,1,1])
        
        # multiply the input feature map by the channel weights [b,c,h,w]*[b,c,1,1] ==> [b,c,h,w]
        outputs = x * inputs
        return outputs
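
By analogy with the SE usage example later in this article (this snippet itself is not from the original post), a quick forward pass confirms that the ECA module preserves the shape of its input:

# build an input tensor of shape [batch, channels, height, width]
inputs = torch.rand(4, 32, 16, 16)
# number of input channels
in_channel = inputs.shape[1]
# instantiate the ECA module
model = eca_block(in_channel=in_channel)

# forward pass; the output shape matches the input shape
outputs = model(inputs)
print(outputs.shape)  # torch.Size([4, 32, 16, 16])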

2. SENet
The SE attention mechanism (Squeeze-and-Excitation Networks) adds attention in the channel dimension; its key operations are squeeze and excitation.

Through automatic learning, a small auxiliary network obtains the importance of each channel of the feature map and then uses this importance to assign a weight to each channel, so that the network can focus on certain channels: channels of the feature map that are useful for the current task are promoted, and channels that are of little use for the current task are suppressed.

As shown in the figure below, before entering the SE attention mechanism (the left feature map C), every channel of the feature map has the same importance; after SENet (the colored feature map C on the right), different colors represent different weights, so the channels are no longer equally important, and the network focuses on the channels with large weights.


The implementation steps of the SE attention mechanism are as follows:

  • (1) Squeeze: through global average pooling, the two-dimensional features (H*W) of each channel are compressed into a single real number, and the feature map goes from [h,w,c] ==> [1,1,c];
  • (2) Excitation: a weight is generated for each feature channel. In the paper, the correlation between channels is modeled by two fully connected layers, and the number of output weights equals the number of channels of the input feature map. [1,1,c] ==> [1,1,c];
  • (3) Scale: the normalized weights obtained above are applied to the features of each channel. In the paper this is done by channel-wise multiplication. [h,w,c]*[1,1,c] ==> [h,w,c]

 Summary:

  1. The core idea of SENet is to learn the feature weights automatically from the loss through a small fully connected network, instead of judging them directly from the numerical distribution of the feature channels, so that effective feature channels receive large weights. The SE attention mechanism inevitably adds some parameters and computation, but the cost-effectiveness is still quite good.
  2. The paper argues that using two fully connected layers in the excitation step is better than using a single fully connected layer directly: it provides more nonlinearity and can better fit the complex correlations between channels.
# -------------------------------------------- #
# (1) SE channel attention
# -------------------------------------------- #
import torch
from torch import nn
from torchstat import stat  # for inspecting the network parameters
 
# definition of the SE attention module
class se_block(nn.Module):
    # in_channel is the number of input channels; ratio is the reduction factor of the first fully connected layer
    def __init__(self, in_channel, ratio=4):
        # initialize the parent class
        super(se_block, self).__init__()
        
        # layer definitions
        # global average pooling; the output feature map has height = width = 1
        self.avg_pool = nn.AdaptiveAvgPool2d(output_size=1)
        # the first fully connected layer reduces the number of channels by a factor of `ratio`
        self.fc1 = nn.Linear(in_features=in_channel, out_features=in_channel//ratio, bias=False)
        # ReLU activation
        self.relu = nn.ReLU()
        # the second fully connected layer restores the number of channels
        self.fc2 = nn.Linear(in_features=in_channel//ratio, out_features=in_channel, bias=False)
        # sigmoid activation to normalize the weights to the range 0-1
        self.sigmoid = nn.Sigmoid()
        
    # forward pass
    def forward(self, inputs):  # inputs is the input feature map
    
        # shape of the input feature map
        b, c, h, w = inputs.shape
        # global average pooling [b,c,h,w] ==> [b,c,1,1]
        x = self.avg_pool(inputs)
        # reshape [b,c,1,1] ==> [b,c]
        x = x.view([b,c])
        
        # the first fully connected layer reduces the channels [b,c] ==> [b,c//ratio]
        x = self.fc1(x)
        x = self.relu(x)
        # the second fully connected layer restores the channels [b,c//ratio] ==> [b,c]
        x = self.fc2(x)
        # normalize the channel weights
        x = self.sigmoid(x)
        
        # reshape [b,c] ==> [b,c,1,1]
        x = x.view([b,c,1,1])
        
        # multiply the input feature map by the channel weights
        outputs = x * inputs
        return outputs

 Construct an input tensor, run a forward pass to check the output, and print the network structure and parameters:

# build an input tensor with shape [4,32,16,16], i.e. [batch, channels, height, width]
inputs = torch.rand(4,32,16,16)
# number of input channels
in_channel = inputs.shape[1]
# instantiate the model
model = se_block(in_channel=in_channel)
 
# forward pass to check the output shape
outputs = model(inputs)
print(outputs.shape)  # torch.Size([4, 32, 16, 16])
 
print(model)  # print the model structure
stat(model, input_size=[32,16,16])  # inspect the parameters; the batch dimension is not specified

3. Atrous (dilated) convolution

 Ordinary convolution: (figure omitted)

 Atrous convolution: (figure omitted)

 Atrous (dilated) convolution is one of the keys to the DeepLab models. It enlarges the receptive field without changing the size of the feature map, which helps extract multi-scale information. The rate r controls the size of the receptive field: the larger r is, the larger the receptive field.

A standard CNN classification backbone has output_stride = 32. To obtain the output_stride = 16 used in DilatedFCN, it is enough to set the stride of the last downsampling layer to 1 and set r = 2 for all subsequent convolutional layers, which keeps the receptive field unchanged. For output_stride = 8, the strides of the last two downsampling layers are changed to 1, and the rates of the corresponding convolutional layers are set to 2 and 4, respectively. In addition, DeepLabv3 introduces the multi-grid method: for the ResNet backbone, the last three cascaded blocks use different rates. If output_stride = 16 and multi_grid = (1, 2, 4), then the rates of the last three blocks are 2 · (1, 2, 4) = (2, 4, 8). This works slightly better than using (1, 1, 1) directly, but the difference is not large. A minimal PyTorch illustration of how dilation preserves the feature-map size is given below.
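
As a minimal sketch (not from the original post; the channel counts and input size are arbitrary), a 3x3 convolution with stride 1, dilation r, and padding r keeps the feature-map size unchanged while enlarging the receptive field:

import torch
from torch import nn

x = torch.rand(1, 64, 32, 32)  # [batch, channels, height, width]

# 3x3 convolutions with increasing dilation rates; padding = r keeps the spatial size fixed
for r in (1, 2, 4):
    conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3,
                     stride=1, padding=r, dilation=r, bias=False)
    print(r, conv(x).shape)  # torch.Size([1, 64, 32, 32]) for every r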


Origin blog.csdn.net/weixin_64043217/article/details/129062631