[Attention Mechanism Collection] Channel Attention: Network Structure and Source Code Interpretation, Series 1


SE-Net, SK-Net and CBAM

1 SENet

Original text link: SENet original text
Source code link: SENet source code

Squeeze-and-Excitation Networks (SENet) is an image recognition architecture announced by the autonomous-driving company Momenta in 2017. It models the correlations between feature channels and strengthens the important features to improve accuracy. This architecture won the 2017 ILSVRC classification competition: the authors report a top-5 error rate of 2.251%, a relative improvement of about 25% over the 2016 winner, which was a remarkable result at the time.

1.1 Squeeze-and-Excitation Blocks

(Figure: the SE block structure, showing the Squeeze and Excitation operations)
The SE block consists of a Squeeze operation and an Excitation operation: Squeeze performs global pooling over the spatial dimensions (e.g. 7 x 7 --> 1 x 1), while Excitation learns the dependencies between channels from the pooled features and produces a weight for each channel. The structure in the figure above summarizes the core idea of SENet very well; below I explain the two parts, Squeeze and Excitation, in detail.

1.1.1 Squeeze: Global Information Embedding

The initial part of the structure, Ftr: X -> U, is an ordinary convolution block; everything after U is the novel part of SENet. Global average pooling squeezes U over the H and W dimensions, encoding the entire spatial feature on each channel into a single global value and producing an intermediate output of size 1 x 1 x C. Put simply, a two-dimensional pooling kernel reduces the feature map from the three dimensions H, W and C down to the single dimension C, which makes the subsequent channel weighting feasible. The formula is:
z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)
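
To make the shape change concrete, here is a minimal sketch (the tensor sizes are my own illustrative values, not from the paper):

import torch
import torch.nn as nn

x = torch.randn(8, 256, 7, 7)      # a batch of feature maps: B x C x H x W
squeeze = nn.AdaptiveAvgPool2d(1)  # global average pooling over H and W
z = squeeze(x)                     # shape: 8 x 256 x 1 x 1, one scalar per channel
print(z.shape)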

1.1.2 Excitation: Adaptive Recalibration

To exploit the information aggregated by the Squeeze operation, the author uses the Excitation operation to capture the dependencies between channels. To achieve this, the function must meet two criteria: (1) it must be flexible (in particular, able to learn nonlinear interactions between channels); (2) it must learn a non-mutually-exclusive relationship (because multiple channels should be allowed to be emphasized at the same time). The author therefore uses two fully connected (FC) layers to learn the channel dependencies, followed by a sigmoid function that maps each channel weight into the range 0-1 (unlike a softmax, the weights do not need to sum to 1). The formula is:
s = F_{ex}(z, W) = \sigma(W_2 \, \delta(W_1 z)), \quad W_1 \in \mathbb{R}^{(C/r) \times C}, \; W_2 \in \mathbb{R}^{C \times (C/r)}

where \delta is the ReLU activation and \sigma is the sigmoid.

1.1.3 An example: the SE-ResNet Module

(Figure: the SE-ResNet module)
The figure above shows the structure of SE-ResNet. For a residual stage, the SE block first applies global pooling to reduce the feature map to a C-dimensional vector (calling this "dimensionality reduction" is a slight abuse of terminology), and then passes it through two FC layers. The first FC layer further compresses the C dimension, controlled by the hyperparameter r (the compression ratio; the author tried several values and concluded that r = 16 gives the best balance between overall performance and computational cost). After an activation, the second FC layer maps the compressed channels back to the original dimension, and finally a sigmoid assigns a weight to each channel.
Scale denotes multiplying these weights with the features to be re-weighted; after the Scale operation, the channel-wise weights are incorporated into the features.

1.2 Code implementation

1.2.1 SE module

The implementation of SE is shown in the code below, with detailed comments for each step. If the formulas above were unclear, the corresponding operations here may help you understand them.

import torch
import torch.nn as nn

class SELayer(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)  # definition of the Squeeze operation
        self.fc = nn.Sequential(  # definition of the Excitation operation
            nn.Linear(channel, channel // reduction, bias=False),  # compress the channel dimension
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),  # restore the original channel dimension
            nn.Sigmoid()  # map each channel weight into (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.size()  # batch size and channel count; H and W are pooled away below
        y = self.avg_pool(x).view(b, c)  # the Squeeze operation: global average pooling over H and W
        y = self.fc(y).view(b, c, 1, 1)  # the Excitation operation
        # broadcast y to the same shape as x and re-weight each channel
        return x * y.expand_as(x)
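
As a quick sanity check (a minimal sketch using the SELayer defined above), the layer keeps the input shape unchanged and only rescales the channels:

import torch

se = SELayer(channel=64, reduction=16)
x = torch.randn(2, 64, 32, 32)  # B x C x H x W
y = se(x)
print(y.shape)  # torch.Size([2, 64, 32, 32]): same shape, channels re-weighted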

1.2.2 SE-ResNet

The code below shows SENet added before the residual connection of a ResNet block. In fact, the SE layer can be inserted earlier in the block (e.g. before conv1) or later (after bn2); where to add it should be decided according to your task. If your network relies more on shallow features, such as textures, add it in the shallow position; conversely, if it relies more on deep features, such as contours and structure, add it in the deep position. Analyze your specific problem accordingly.

class SEBasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None,
                 *, reduction=16):
        super(SEBasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes, 1)
        self.bn2 = nn.BatchNorm2d(planes)
        self.se = SELayer(planes, reduction)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.se(out)  # apply the channel attention

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out
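
The block above relies on a conv3x3 helper that is not shown in this snippet; below is a minimal sketch of such a helper (the standard 3x3 convolution used in ResNet implementations) together with a quick usage check. The tensor sizes are illustrative only.

import torch
import torch.nn as nn

def conv3x3(in_planes, out_planes, stride=1):
    # 3x3 convolution with padding, as in standard ResNet code
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, padding=1, bias=False)

block = SEBasicBlock(inplanes=64, planes=64, reduction=16)
x = torch.randn(2, 64, 56, 56)
out = block(x)
print(out.shape)  # torch.Size([2, 64, 56, 56])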

2 SKNet

Original text link: SKNet original text
Source code link: SKNet source code

Selective Kernel Networks (SKNet) is a CVPR 2019 paper that builds on the idea of SENet. Where SENet proposed the Squeeze-and-Excitation block, SKNet proposes Selective Kernel Convolution. Both can easily be embedded into existing network architectures, such as ResNet, Inception or ShuffleNet, to improve accuracy.

2.1 Selective Kernel Convolution

The paper starts from the observation that receptive fields of different sizes suit targets of different scales, and asks what method allows the network to automatically choose an effective receptive field for classification. To solve this, the author proposes a dynamic selection mechanism for convolution kernels, which lets each neuron adaptively adjust the size of its receptive field (convolution kernel) according to the multi-scale input information.
(Figure: the Selective Kernel Convolution module, with its Split, Fuse and Select operations)
The figure above shows the Selective Kernel convolution module, which consists of three operations: Split, Fuse and Select. Split generates different feature maps with kernels of different sizes; the figure shows only two kernel sizes, but more branches with more kernel sizes can be designed. Fuse combines and aggregates the information from the multiple paths to obtain a global, comprehensive representation used to compute the selection weights. Select aggregates the feature maps of the different kernel sizes according to these selection weights.

2.1.1 Split

Different convolution kernels are applied to the input X to generate different feature outputs. The figure shows convolutions with 3x3 and 5x5 kernels; to improve efficiency, the 5x5 convolution is implemented as a 3x3 dilated (atrous) convolution with dilation rate 2, and grouped/depthwise convolutions, BatchNorm and ReLU are used throughout.
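
The "dilated 3x3 instead of 5x5" trick can be checked with a small sketch (arbitrary tensor sizes, not from the paper): a 3x3 kernel with dilation rate 2 covers the same 5x5 window while using far fewer weights, and with suitable padding both keep the spatial size unchanged.

import torch
import torch.nn as nn

x = torch.randn(1, 32, 28, 28)
conv5x5 = nn.Conv2d(32, 32, kernel_size=5, padding=2)                 # plain 5x5 kernel
conv3x3_d2 = nn.Conv2d(32, 32, kernel_size=3, padding=2, dilation=2)  # 3x3 kernel, dilation 2, same 5x5 receptive field
print(conv5x5(x).shape, conv3x3_d2(x).shape)  # both torch.Size([1, 32, 28, 28])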

2.1.2 Fuse

The multiple feature outputs are fused by element-wise summation (a simple sum in PyTorch) to obtain a new feature map U, which is formula (1) below. Then the same operation as Squeeze generates the channel-wise statistics, formula (2). Finally, a single fully connected layer learns the dependencies between channels, followed by BatchNorm and ReLU, which is formula (3). The relevant formulas are:
(1) U = \tilde{U} + \hat{U}
(2) s_c = F_{gp}(U_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j)
(3) z = F_{fc}(s) = \delta(\mathcal{B}(W s)), \quad W \in \mathbb{R}^{d \times C}, \; d = \max(C/r, L)

where \delta is the ReLU activation and \mathcal{B} denotes batch normalization.

2.1.3 Select

Along the channel dimension, the feature maps from the multiple branches are weighted using a softmax over the branches (for two branches this reduces to a sigmoid-style gating). Finally, the weighted feature maps of all branches are summed to obtain the final output.
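
In tensor terms, Select is just a softmax over the branch dimension followed by a weighted sum; a minimal sketch with assumed shapes (B samples, M branches, C channels):

import torch

B, M, C, H, W = 2, 2, 64, 28, 28
feas = torch.randn(B, M, C, H, W)  # stacked outputs of the M branches
logits = torch.randn(B, M, C)      # per-branch, per-channel scores from the FC layers
weights = logits.softmax(dim=1)    # normalize across the branches
out = (feas * weights.unsqueeze(-1).unsqueeze(-1)).sum(dim=1)  # B x C x H x W
print(out.shape)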

2.2 Code implementation

Combined with the explanation above, the code should be clear; the comments describe the specific definitions and operations, so please refer to them.

class SKConv(nn.Module):
    def __init__(self, features, WH, M, G, r, stride=1, L=32):
        super(SKConv, self).__init__()
        d = max(int(features/r), L)
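        # note: WH (the input spatial size) is accepted here but not used in this implementation;
        # d is the reduced dimension of the channel-dependency FC layer, with a lower bound of L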
        self.M = M
        self.features = features
        self.convs = nn.ModuleList([])
        # build M branches and add them to convs; each branch uses a different kernel size with matching
        # padding so that all branches produce feature maps of the same spatial size
        # (this implementation uses larger kernels directly instead of dilated 3x3 convolutions)
        for i in range(M):
            self.convs.append(nn.Sequential(
                nn.Conv2d(features, features, kernel_size=3+i*2, stride=stride, padding=1+i, groups=G),
                nn.BatchNorm2d(features),
                nn.ReLU(inplace=False)
            ))
        # fully connected layer that learns the dependencies between channels (reduces C to d)
        self.fc = nn.Linear(features, d)
        self.fcs = nn.ModuleList([])
        for i in range(M):
            self.fcs.append(
                nn.Linear(d, features)
            )
        self.softmax = nn.Softmax(dim=1)
        
    def forward(self, x):
        for i, conv in enumerate(self.convs):
            fea = conv(x).unsqueeze_(dim=1)
            if i == 0:
                feas = fea
            else:
                feas = torch.cat([feas, fea], dim=1)
        fea_U = torch.sum(feas, dim=1)  # Fuse: element-wise sum of the feature maps from the M branches
        fea_s = fea_U.mean(-1).mean(-1)  # global average pooling over H and W, keeping the channel dimension
        fea_z = self.fc(fea_s)  # learn the dependencies between channels
        # weighting step: the weights are per branch and per channel, so it looks a bit more involved than SENet
        for i, fc in enumerate(self.fcs):
            vector = fc(fea_z).unsqueeze_(dim=1)
            if i == 0:
                attention_vectors = vector
            else:
                attention_vectors = torch.cat([attention_vectors, vector], dim=1)
        attention_vectors = self.softmax(attention_vectors)
        attention_vectors = attention_vectors.unsqueeze(-1).unsqueeze(-1)
        fea_v = (feas * attention_vectors).sum(dim=1)
        return fea_v
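
A quick usage sketch (the argument values are illustrative choices, not taken from the paper):

import torch

sk = SKConv(features=64, WH=28, M=2, G=8, r=2, L=32)
x = torch.randn(2, 64, 28, 28)
out = sk(x)
print(out.shape)  # torch.Size([2, 64, 28, 28])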

3 CBAM

Original text link: CBAM original text
Source code link: CBAM source code

CBAM (Convolutional Block Attention Module) is a lightweight and widely used visual attention module proposed at ECCV 2018. It applies both channel attention and spatial attention, and the authors found that connecting the two attentions in series (channel first) works best.

3.1 Convolutional Block Attention Module

The following figure is the network structure diagram of CBAM.
(Figure: the overall structure of CBAM)

As shown, CBAM contains two sub-modules applied in sequence, the Channel Attention Module (CAM) and the Spatial Attention Module (SAM), which re-weight the features along the channel and spatial dimensions respectively. This keeps the parameter and computation overhead small and makes CBAM a plug-and-play module that can be integrated into existing network architectures.
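
The serial connection can be written as a small wrapper module. This is only a sketch, using the ChannelAttention and SpatialAttention classes defined in Section 3.2.1 below:

import torch.nn as nn

class CBAM(nn.Module):
    # minimal sketch: channel attention first, then spatial attention, as in the paper
    def __init__(self, channels, ratio=16, kernel_size=7):
        super(CBAM, self).__init__()
        self.ca = ChannelAttention(channels, ratio)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = self.ca(x) * x  # re-weight the channels
        x = self.sa(x) * x  # re-weight the spatial locations
        return x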

3.1.1 Channel attention module

(Figure: the channel attention module of CBAM)
The basic idea of the channel attention module is the same as SENet, with a few differences in the specific operations (highlighted in red in the figure above). First, the input feature map F (H x W x C) is passed through both global max pooling (MaxPool) and global average pooling (AvgPool) over the H and W dimensions, giving two 1 x 1 x C feature maps. Then, both feature maps are fed through a shared two-layer MLP that learns the dependencies between channels, with the hidden dimension reduced by the compression ratio r. Finally, the two MLP outputs are summed element-wise and passed through a sigmoid to produce the channel weights M_c. The formula is:

M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)

3.1.2 Spatial attention module

This series mainly focuses on channel attention, and I had planned to leave the rest of CBAM for a later article on spatial attention mechanisms; but since the module is right here, I will cover it briefly now.

(Figure: the spatial attention module of CBAM)
The spatial attention module takes the feature map F' output by the channel attention module as its input. First, max pooling (MaxPool) and average pooling (AvgPool) are applied along the channel dimension, giving two H x W x 1 feature maps. These two maps are then concatenated along the channel dimension (a concat operation). Next, a 7 x 7 convolution (the authors verified experimentally that 7 x 7 works better than other kernel sizes) reduces the result to a single-channel feature map of size H x W x 1. Finally, a sigmoid captures the dependencies between spatial positions and produces the spatial weights M_s. The formula is:

M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big)

where f^{7 \times 7} denotes a convolution with a 7 x 7 kernel.

3.2 Code implementation

3.2.1 CA&SA

For the specific network definitions and operations, please refer to the code comments.

class ChannelAttention(nn.Module):
    def __init__(self, in_planes, ratio=16):
        super(ChannelAttention, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.max_pool = nn.AdaptiveMaxPool2d(1)  # global max pooling
        # shared MLP that learns the channel dependencies; note that it is implemented
        # with 1x1 convolutions rather than fully connected layers
        self.fc = nn.Sequential(nn.Conv2d(in_planes, in_planes // ratio, 1, bias=False),
                               nn.ReLU(),
                               nn.Conv2d(in_planes // ratio, in_planes, 1, bias=False))
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_out = self.fc(self.avg_pool(x))  # average-pooled branch through the shared MLP
        max_out = self.fc(self.max_pool(x))  # max-pooled branch through the shared MLP
        out = avg_out + max_out  # fuse the two branches by element-wise summation
        # sigmoid produces the channel weights
        return self.sigmoid(out)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super(SpatialAttention, self).__init__()
        # 7x7 convolution that learns the spatial dependencies (2 input channels -> 1 output channel)
        self.conv1 = nn.Conv2d(2, 1, kernel_size, padding=kernel_size//2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_out = torch.mean(x, dim=1, keepdim=True)  # average pooling along the channel dimension
        max_out, _ = torch.max(x, dim=1, keepdim=True)  # max pooling along the channel dimension
        x = torch.cat([avg_out, max_out], dim=1)  # concatenate the two pooled maps along the channel dimension
        x = self.conv1(x)  # learn the spatial dependencies
        # sigmoid produces the spatial weights
        return self.sigmoid(x)
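
A quick shape check for the two modules (a minimal sketch with arbitrary sizes), showing that CA produces one weight per channel and SA one weight per spatial position:

import torch

ca = ChannelAttention(in_planes=64)
sa = SpatialAttention(kernel_size=7)
x = torch.randn(2, 64, 32, 32)
print(ca(x).shape)  # torch.Size([2, 64, 1, 1])
print(sa(x).shape)  # torch.Size([2, 1, 32, 32])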

3.2.2 CBAM_ResNet

Due to space limitations, only a BasicBlock with CA & SA added is shown here.

class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)
        # define ca and sa; note that ChannelAttention depends on the number of channels, so this argument must be passed in
        self.ca = ChannelAttention(planes)
        self.sa = SpatialAttention()
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.ca(out) * out  # re-weight the channels
        out = self.sa(out) * out  # re-weight the spatial locations
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)

        return out
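
As with the SE version, a short usage check (a sketch assuming the conv3x3 helper sketched in the SE-ResNet section above):

import torch

block = BasicBlock(inplanes=64, planes=64)
x = torch.randn(2, 64, 56, 56)
out = block(x)
print(out.shape)  # torch.Size([2, 64, 56, 56])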
This installment of the series ends here; the more likes it gets, the faster the updates!


Reprinted from: blog.csdn.net/weixin_43427721/article/details/124652525