[Attention Mechanism Collection 2] BAM&SGE&DAN Detailed Explanation of Original Text, Structure and Source Code


Previously, we systematically introduced the concept, classification, and recent development of the attention mechanism (portal: [Introduction to Attention Mechanism]), and covered the channel attention mechanisms SENet, SKNet, and CBAM (portal: [Channel Attention Mechanism Series 1]). In this article, we continue to follow the development of visual attention mechanisms and interpret the original papers, structures, and source code of several well-known attention algorithms. Talk is cheap, let's look at the code.

1 BAM: Bottleneck Attention Module

Original Link: https://arxiv.org/pdf/1807.06514.pdf
Source Link: https://github.com/Jongchan/attention-module
Single cat detection process

This paper comes from the same team that created CBAM and is regarded as CBAM's sister algorithm; its mechanism and principle are very similar. In the figure above, the authors place BAM between the stages of a ResNet. Interestingly, visualization shows that the stacked BAM modules form a hierarchical attention mechanism, a bit like human perception: between stages, BAM suppresses low-level features such as background clutter and gradually focuses on high-level semantic information, as in the focusing process on the single cat in the figure.

1.1 Interpretation of BAM

BAM structure diagram

For the input feature map, BAM computes attention weights along two branches, channel and spatial, adds the feature maps obtained from the two branches to form a combined attention weight, and finally activates it with a sigmoid function. Three points are worth noting:
(1) BAM as a whole acts as a residual structure, so unlike earlier modules that are multiplied directly into the feature map, the paper gives its mathematical definition as shown in formula (1).
(2) CBAM fuses the channel-level and spatial-level attention weights in series, while BAM separates the two into parallel branches and directly adds the attention weights they produce; the mathematical definition is shown in formula (2).
(3) What if the weight produced by the channel branch has a different size from the weight produced by the spatial branch? In fact, the dimensions differ not only at this addition but also when the internal weight map of the spatial branch is fused. The authors do not discuss this much in the paper; in the source code they simply rely on the broadcasting mechanism to map the outputs to the same size as the input.

Explanation of formula 1 and formula 2 in the original text of BAM
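Written out for reference (F is the input feature map, σ the sigmoid function, ⊗ element-wise multiplication):

$$F' = F + F \otimes M(F) \tag{1}$$

$$M(F) = \sigma\big(M_c(F) + M_s(F)\big) \tag{2}$$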

1.1.1 Channel attention branch

Channel attention branch
Unlike the channel attention in CBAM, BAM uses only global average pooling to generate the channel-dimension features, learns the weights of the different input channels through two fully connected (FC) layers, and finally obtains the channel weight M_c of size C×1×1. The mathematical definition is shown in formula (3):
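As given in the paper (AvgPool is global average pooling, MLP the two FC layers, BN batch normalization):

$$M_c(F) = \mathrm{BN}\big(\mathrm{MLP}(\mathrm{AvgPool}(F))\big) = \mathrm{BN}\big(W_1(W_0\,\mathrm{AvgPool}(F) + b_0) + b_1\big) \tag{3}$$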
As the formula shows, in the two fully connected layers the authors use the reduction ratio r introduced in SENet to reduce the dimension while learning the weights, then map it back to the original dimension, and finally apply BatchNorm to the feature vector. It can be said that BAM slims down the channel-attention part of the network and puts more of its effort into spatial attention.

1.1.2 Spatial attention branch

Spatial attention branch
At the beginning of this section the authors state the motivation for this branch: a large receptive field helps the network exploit contextual information effectively, so they introduce dilated (atrous) convolution to enlarge the receptive field efficiently.
For the input feature map F, the branch first uses a 1×1 convolution to reduce the channel dimension, with the exact number of channels determined by the reduction ratio r (you can recall how CBAM reduces dimensionality and compare). Then, repeated dilated convolutions learn the spatial weights of the feature map; the paper implements this with two convolution layers with kernel_size=3 and dilation_val=4, and you can also reproduce it with a different number of layers and different dilation rates. Finally, a 1×1 convolution reduces the number of channels of the feature weight to 1. Its mathematical definition is shown in formula (4).
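As given in the paper (f^{k×k} denotes a convolution with kernel size k; the two middle 3×3 convolutions are the dilated ones):

$$M_s(F) = \mathrm{BN}\Big(f_3^{1\times1}\big(f_2^{3\times3}(f_1^{3\times3}(f_0^{1\times1}(F)))\big)\Big) \tag{4}$$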

1.1.3 One more thing

Finally, a few thoughts from reading the paper and the code:
(1) First, for the feature weights obtained in the channel and spatial branches, the author uses the broadcasting mechanism to map both of them, indiscriminately, to the size of the original input. Is this approach really conducive to learning the weights?
(2) Second, in the spatial branch the author learns the spatial weights through convolution layers that shrink the feature map. If the goal is to learn spatial weights with atrous convolution, why not use convolutions that keep the H and W of the feature map constant? As written, once the downsampling is done the dimensions have to be restored by broadcasting, which makes the motivation for the downsampling hard to understand.
(3) Finally, also in the spatial branch, the network compresses the channel number to 1 after learning, yet the subsequent channel + spatial addition broadcasts the channel dimension back again. Is this redundant? On reflection, perhaps the extra layers of MLP/convolution operations can learn more non-linear relations among the features and achieve a better fit.

1.2 Code Interpretation

The two branches of BAM are implemented mainly as module sequences, so when reading the code, first note the hyperparameters r (reduction) and num_layers you set, and then notice that at the end of the forward pass the weights are broadcast back to the feature-map size via .expand_as(x) (this operation matters; it is worth checking how it behaves). The internal details follow the interpretation above, so I won't repeat them.
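Since .expand_as does the heavy lifting here, a tiny standalone PyTorch check of its behavior (independent of the BAM code below): dimensions of size 1 are broadcast up to match the target tensor.

import torch

x = torch.randn(2, 512, 7, 7)   # a feature map of size (B, C, H, W)
w = torch.randn(2, 512, 1, 1)   # e.g. a channel weight of size (B, C, 1, 1)
print(w.expand_as(x).shape)     # torch.Size([2, 512, 7, 7]); size-1 dims are repeated as a view, no data copy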

import torch
from torch import nn
from torch.nn import init

class Flatten(nn.Module):
    def forward(self,x):
        return x.view(x.shape[0],-1)

class ChannelAttention(nn.Module):
    def __init__(self,channel,reduction=16,num_layers=3):
        super().__init__()
        self.avgpool=nn.AdaptiveAvgPool2d(1)
        gate_channels=[channel]
        gate_channels+=[channel//reduction]*num_layers
        gate_channels+=[channel]
        self.ca=nn.Sequential()
        self.ca.add_module('flatten',Flatten())  # flatten the feature map
        for i in range(len(gate_channels)-2):  # build the fully connected layers
            self.ca.add_module('fc%d'%i,nn.Linear(gate_channels[i],gate_channels[i+1]))
            self.ca.add_module('bn%d'%i,nn.BatchNorm1d(gate_channels[i+1]))
            self.ca.add_module('relu%d'%i,nn.ReLU())
        self.ca.add_module('last_fc',nn.Linear(gate_channels[-2],gate_channels[-1]))
        
    def forward(self, x):
        res=self.avgpool(x)
        res=self.ca(res)
        return res.unsqueeze(-1).unsqueeze(-1).expand_as(x)


class SpatialAttention(nn.Module):
    def __init__(self,channel,reduction=16,num_layers=3,dia_val=2):
        super().__init__()
        self.sa=nn.Sequential()
        self.sa.add_module('conv_reduce1',nn.Conv2d(kernel_size=1,in_channels=channel,out_channels=channel//reduction))
        self.sa.add_module('bn_reduce1',nn.BatchNorm2d(channel//reduction))
        self.sa.add_module('relu_reduce1',nn.ReLU())
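        # note: with kernel_size=3, padding=1 and dilation=dia_val, each dilated conv below shrinks H and W by 2*(dia_val-1);
        # for the 7x7 test input (dia_val=2, num_layers=3) the map ends up 1x1, and forward() broadcasts it back via expand_as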
        for i in range(num_layers):
            self.sa.add_module('conv_%d'%i,nn.Conv2d(kernel_size=3,in_channels=channel//reduction,out_channels=channel//reduction,padding=1,dilation=dia_val))
            self.sa.add_module('bn_%d'%i,nn.BatchNorm2d(channel//reduction))
            self.sa.add_module('relu_%d'%i,nn.ReLU())
        self.sa.add_module('last_conv',nn.Conv2d(channel//reduction,1,kernel_size=1))
    def forward(self, x):
        res=self.sa(x)
        return res.expand_as(x)
        
class BAMBlock(nn.Module):

    def __init__(self, channel=512,reduction=16,dia_val=2):
        super().__init__()
        self.ca=ChannelAttention(channel=channel,reduction=reduction)
        self.sa=SpatialAttention(channel=channel,reduction=reduction,dia_val=dia_val)
        self.sigmoid=nn.Sigmoid()

    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                init.constant_(m.weight, 1)
                init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                init.normal_(m.weight, std=0.001)
                if m.bias is not None:
                    init.constant_(m.bias, 0)

    def forward(self, x):
        b, c, _, _ = x.size()
        sa_out=self.sa(x)
        ca_out=self.ca(x)
        weight=self.sigmoid(sa_out+ca_out)
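        # residual combination: out = x + x * sigmoid(M_c + M_s), i.e. formulas (1) and (2) of the paper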
        out=(1+weight)*x
        return out

if __name__ == '__main__':
    input=torch.randn(50,512,7,7)
    bam = BAMBlock(channel=512,reduction=16,dia_val=2)
    output=bam(input)
    print(output.shape)

2 DANet: Dual Attention Network

Original Link: https://arxiv.org/abs/1809.02983
Source Link: https://github.com/junfu1115/DANet

DANet is an attention mechanism built for the semantic segmentation task. It abandons the earlier encoder-decoder structure and instead uses dilated convolutions plus attention to capture long-range contextual information in both the spatial and channel dimensions, achieving SOTA results on several benchmarks in 2019.

2.1 Interpretation of DANet

Since the receptive field produced by a convolution operation is local, the local features of the same car may differ across receptive fields; in other words, features corresponding to pixels with the same label can still differ to some extent, and such intra-class differences hurt recognition accuracy. To solve this problem, the authors use an attention mechanism to build associations between features in the network and thus exploit global contextual information, improving the feature representation ability for scene segmentation.
Schematic diagram of DANet structure
The structure of DANet is shown in the figure above; the specific pipeline is:
(1) For the input image, DANet uses a ResNet with the downsampling operations removed from its last two stages as the backbone for feature extraction, and applies dilated convolution in those two stages, so the final feature map is enlarged to 1/8 of the input image size. This modification retains more low-level detail without adding extra parameters.
(2) A convolution is used to reduce the dimensionality of the feature map obtained from the backbone (the two gray rectangles after ResNet), producing the features fed into the Position Attention Module and the Channel Attention Module, which extract the correlations between pixels and between channels, respectively.
(3) The outputs of the two modules are fused to obtain a better feature representation for pixel-level prediction.

2.1.1 Position Attention Module

Position Attention Module
The figure above shows the structural details of the Position Attention Module. The original input A goes through three convolution operations of the same configuration to obtain B, C, and D. (Doesn't the mapping of A into B, C, D here feel a lot like mapping the input into Q, K, V in a Transformer?)
For B and C, the original size is Channel×H×W; the network reshapes them from 3-D to 2-D features of size Channel×N (N = H×W). B is additionally transposed, since otherwise two identical non-square matrices could not be multiplied. So B finally becomes an N×Channel feature, while C remains Channel×N. Matrix-multiplying the two and applying a softmax then gives the S in the figure above, whose size is N×N. The mathematical definition is shown below:
S definition
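Written out from the paper (s_{ji} measures the impact of the i-th position on the j-th position):

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}$$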
A is also mapped by a convolution to D, and D is reshaped into a Channel×N feature. The N×N map S is then combined with D by matrix multiplication (with D transposed to N×Channel so the shapes match), and the resulting features are reshaped back to Channel×H×W (to be honest, I was also confused by this sequence of operations, so I won't go into the fine details). Finally, this result is fused with the original A by addition, giving the feature map E. The formula for computing E is shown below:
Definition of final output E
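Written out from the paper (α is a learnable scale factor, initialized to 0):

$$E_j = \alpha \sum_{i=1}^{N} (s_{ji} D_i) + A_j$$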

Q: Why do it this way?
Looking at the operations on A, B, C, and D, the resulting feature E at each position is a weighted sum of the features at all positions (the result of the B, C, D operations) plus the original feature A. It therefore has a broad, global semantic view and can selectively aggregate context according to the position attention map, letting similar semantic features reinforce each other, which improves intra-class compactness and semantic consistency.

2.1.2 Channel Attention Module

Channel Attention Module
The strategy of the Channel Attention Module is similar to position attention. The difference is that B, C, and D are not produced by intermediate convolution mappings; the attention is computed directly from A. The details are easy to work out by combining the position attention mechanism with the figure above, so I won't repeat them here.
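For completeness, the corresponding definitions in the paper are (x_{ji} measures the impact of the i-th channel on the j-th channel; β is a learnable scale factor, initialized to 0):

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)}, \qquad E_j = \beta \sum_{i=1}^{C} (x_{ji} A_i) + A_j$$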

2.2 Code Interpretation

Looking at the code, for the convolution-mapping part of the position attention module, the author of this implementation really does seem to have borrowed from the Transformer, since the three feature maps are named q, k, and v directly! To be honest, this part is rather convoluted and I haven't fully worked through the code, so I only post the main code here and leave the details as a pit to fill in when I have time...
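The snippet below relies on torch/nn and on two attention classes, ScaledDotProductAttention and SimplifiedScaledDotProductAttention, that live elsewhere in the repository and are not shown here. So that the code can run on its own, here is a minimal sketch of what they might look like, assuming plain single-head scaled dot-product attention with the constructor signatures used below (this is my simplification, not the repository's exact implementation):

import torch
from torch import nn

class ScaledDotProductAttention(nn.Module):
    # minimal sketch: single-head scaled dot-product attention over (bs, n, d_model) inputs
    def __init__(self, d_model, d_k, d_v, h=1):
        super().__init__()
        self.fc_q = nn.Linear(d_model, d_k)
        self.fc_k = nn.Linear(d_model, d_k)
        self.fc_v = nn.Linear(d_model, d_v)
        self.fc_o = nn.Linear(d_v, d_model)
        self.scale = d_k ** 0.5

    def forward(self, q, k, v):
        q, k, v = self.fc_q(q), self.fc_k(k), self.fc_v(v)
        att = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)  # (bs, n, n)
        return self.fc_o(att @ v)                                          # (bs, n, d_model)


class SimplifiedScaledDotProductAttention(nn.Module):
    # same idea without the q/k/v projections; d_model here is the token length (H*W)
    def __init__(self, d_model, h=1):
        super().__init__()
        self.scale = d_model ** 0.5

    def forward(self, q, k, v):
        att = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)  # (bs, c, c)
        return att @ v                                                     # (bs, c, h*w)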

class PositionAttentionModule(nn.Module):

    def __init__(self,d_model=512,kernel_size=3,H=7,W=7):
        super().__init__()
        self.cnn=nn.Conv2d(d_model,d_model,kernel_size=kernel_size,padding=(kernel_size-1)//2)
        self.pa=ScaledDotProductAttention(d_model,d_k=d_model,d_v=d_model,h=1)
    
    def forward(self,x):
        bs,c,h,w=x.shape
        y=self.cnn(x)
        y=y.view(bs,c,-1).permute(0,2,1) #bs,h*w,c
        y=self.pa(y,y,y) #bs,h*w,c
        return y


class ChannelAttentionModule(nn.Module):
    
    def __init__(self,d_model=512,kernel_size=3,H=7,W=7):
        super().__init__()
        self.cnn=nn.Conv2d(d_model,d_model,kernel_size=kernel_size,padding=(kernel_size-1)//2)
        self.pa=SimplifiedScaledDotProductAttention(H*W,h=1)
    
    def forward(self,x):
        bs,c,h,w=x.shape
        y=self.cnn(x)
        y=y.view(bs,c,-1) #bs,c,h*w
        y=self.pa(y,y,y) #bs,c,h*w
        return y




class DAModule(nn.Module):

    def __init__(self,d_model=512,kernel_size=3,H=7,W=7):
        super().__init__()
        # pass the constructor arguments through instead of hard-coding them
        self.position_attention_module=PositionAttentionModule(d_model=d_model,kernel_size=kernel_size,H=H,W=W)
        self.channel_attention_module=ChannelAttentionModule(d_model=d_model,kernel_size=kernel_size,H=H,W=W)
    
    def forward(self,input):
        bs,c,h,w=input.shape
        p_out=self.position_attention_module(input)
        c_out=self.channel_attention_module(input)
        p_out=p_out.permute(0,2,1).view(bs,c,h,w)
        c_out=c_out.view(bs,c,h,w)
        return p_out+c_out


if __name__ == '__main__':
    input=torch.randn(50,512,7,7)
    danet=DAModule(d_model=512,kernel_size=3,H=7,W=7)
    print(danet(input).shape)

Finally, a favorite passage of mine from a literary memoir:

A formula: knowledge and love are always proportional. The more you know, the more you love.
And in the other direction: the more you love, the more you know.
The order cannot be reversed: knowing must come first.
Love without knowledge is not love.


Origin blog.csdn.net/weixin_43427721/article/details/124766242