Attentional Feature Fusion (AFF)

I recently came across a rather good attention-based feature fusion method, AFF. It is similar in spirit to earlier attention modules such as SENet and SKNet, but it performs better and applies to a wider range of scenarios, including short and long skip connections and fusion within Inception layers. This is the Attentional Feature Fusion (AFF) proposed by researchers at Nanjing University of Aeronautics and Astronautics, and it is plug and play!

This post is largely based on the Zhihu article by OucQxw: https://zhuanlan.zhihu.com/p/424031096


Paper download address: https://arxiv.org/pdf/2009.14082.pdf

Github code address: https://github.com/YimianDai/open-aff

1. Motivation

Feature fusion, the combination of features from different layers or branches, is a ubiquitous part of modern neural network architectures. It is usually implemented with simple linear operations (e.g., summation or concatenation), but this may not be optimal. This paper proposes a unified general scheme, Attentional Feature Fusion (AFF), which applies to most common scenarios, including short and long skip connections and the fusion induced within Inception layers.
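For reference, the two simple linear fusion operations mentioned above can be sketched in a few lines (toy tensor shapes of my choosing):

```python
import torch

# Two common linear fusion operations for feature maps of shape (N, C, H, W):
x = torch.randn(2, 64, 8, 8)
y = torch.randn(2, 64, 8, 8)
fused_sum = x + y                      # summation: shape unchanged
fused_cat = torch.cat([x, y], dim=1)   # concatenation: channel count doubles
print(fused_sum.shape, fused_cat.shape)
```

Both treat every element of X and Y equally; AFF replaces this fixed combination with learned, input-dependent weights.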

To better fuse features with inconsistent semantics and scales, the authors propose the Multi-Scale Channel Attention Module (MS-CAM), which addresses the problems that arise when fusing features of different scales. They also show that the initial feature integration can itself be a bottleneck, and propose the iterative Attentional Feature Fusion module (iAFF) to alleviate it.

  1. Problems with recently developed attention-based feature fusion methods such as SKNet and ResNeSt:
  • Limited scenarios: SKNet and ResNeSt only perform feature selection within the same layer and cannot fuse features across layers.
  • Naive initial integration: to feed features into the attention module, SKNet fuses them by summation; these features may be highly inconsistent in scale and semantics, which strongly degrades the quality of the fusion weights and limits the model's performance.
  • Biased context aggregation scale: the fusion weights in SKNet and ResNeSt are generated by a global channel attention mechanism, which favors globally distributed information and works poorly for small objects. Can features of different scales instead be fused dynamically by the network?
  2. To address the three problems above, this paper makes the following contributions:
  • The Attentional Feature Fusion module (AFF), suitable for most common scenarios, including the fusion induced by short and long skip connections and within Inception layers.
  • The iterative Attentional Feature Fusion module (iAFF), which replaces the naive initial integration with another attention module.
  • The Multi-Scale Channel Attention Module (MS-CAM), which extracts channel attention through two branches at different scales.

2. Method

  1. Multi-scale Channel Attention Module (MS-CAM)

MS-CAM largely follows the idea of SENet, but combines the local and global feature contexts of the CNN and uses attention to fuse multi-scale information.

Compared with SENet, MS-CAM has two main differences:

  • MS-CAM attends to channel scale through point-wise (1×1) convolutions rather than convolution kernels of different sizes, in order to keep MS-CAM as lightweight as possible.
  • MS-CAM aggregates local and global feature contexts inside the channel attention module rather than in the backbone.

(Figure: MS-CAM structure diagram)

The figure above shows the structure of MS-CAM, where X is the input feature and X' is the fused feature. The two branches represent the channel attention of the global feature and of the local feature, respectively. The channel attention of the local feature, L(X), is computed as follows:

L(X) = B(PWConv2(δ(B(PWConv1(X)))))

where PWConv denotes point-wise (1×1) convolution, B is Batch Normalization, and δ is ReLU. The global branch g(X) applies the same bottleneck after global average pooling:

g(X) = B(PWConv2(δ(B(PWConv1(GAP(X))))))

The refined feature X' is then obtained by weighting X with the sigmoid of the sum of the two branches:

X' = X ⊗ M(X) = X ⊗ σ(L(X) ⊕ g(X))
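As a standalone sketch of the local branch L(X), using the same defaults (channels=64, r=4) as the implementation below, the 1×1-convolution bottleneck preserves the spatial size while squeezing channels by a factor of r internally:

```python
import torch
import torch.nn as nn

# Local-attention branch L(X): a 1x1-convolution bottleneck that keeps the
# spatial resolution and squeezes channels by a factor of r internally.
channels, r = 64, 4
inter_channels = channels // r  # 16
local_att = nn.Sequential(
    nn.Conv2d(channels, inter_channels, kernel_size=1),
    nn.BatchNorm2d(inter_channels),
    nn.ReLU(inplace=True),
    nn.Conv2d(inter_channels, channels, kernel_size=1),
    nn.BatchNorm2d(channels),
)
x = torch.randn(2, channels, 8, 8)
print(local_att(x).shape)  # torch.Size([2, 64, 8, 8])
```

Because no pooling is involved, this branch keeps per-position (local) context; the global branch differs only in applying global average pooling first.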

The implemented code is as follows:

import torch.nn as nn


class MS_CAM(nn.Module):
    '''
    Channel-attention weighting of a single feature, similar to an SE block.
    '''

    def __init__(self, channels=64, r=4):
        super(MS_CAM, self).__init__()
        inter_channels = int(channels // r)

        # local attention
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )

        # global attention
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )

        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        xl = self.local_att(x)
        xg = self.global_att(x)
        xlg = xl + xg
        wei = self.sigmoid(xlg)
        return x * wei
  2. Attentional Feature Fusion (AFF)

(Figure: AFF structure diagram)

Two features X and Y are given for fusion (Y denotes the feature with the larger receptive field).

AFF is computed as follows:

Z = M(X ⊎ Y) ⊗ X + (1 − M(X ⊎ Y)) ⊗ Y

where X ⊎ Y denotes the initial integration (element-wise summation here), M(·) is the MS-CAM attention after the sigmoid, so M(X ⊎ Y) ∈ (0, 1), and ⊗ denotes element-wise multiplication.

The two input features X and Y are first combined by an initial integration; the result is passed through the MS-CAM module and a sigmoid activation, producing values between 0 and 1. The authors weight X by this fusion weight and Y by one minus it, which amounts to a soft selection between the two features; training lets the network determine the respective weights.
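The soft selection can be illustrated with a toy example (plain constant tensors standing in for feature maps; the zero logits are my stand-in for the MS-CAM output):

```python
import torch

# Soft selection: a weight wei in (0, 1) from a sigmoid blends X and Y as
# Z = wei * X + (1 - wei) * Y, i.e. a learned convex combination per element.
x = torch.full((1, 2, 2, 2), 2.0)   # feature X
y = torch.full((1, 2, 2, 2), 4.0)   # feature Y (larger receptive field)
logits = torch.zeros_like(x)        # stand-in attention logits; sigmoid(0) = 0.5
wei = torch.sigmoid(logits)
z = x * wei + y * (1 - wei)
print(z[0, 0, 0, 0].item())  # 3.0 — equal weights average the two features
```

Because the weight for Y is exactly 1 minus the weight for X, every output element is a convex combination of the two inputs; summation and plain averaging are the special cases wei = 1 and wei = 0.5.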

The implemented code is as follows:

import torch.nn as nn


class AFF(nn.Module):
    '''
    Attentional fusion (AFF) of two features.
    '''

    def __init__(self, channels=64, r=4):
        super(AFF, self).__init__()
        inter_channels = int(channels // r)

        # local attention
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )

        # global attention
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )

        self.sigmoid = nn.Sigmoid()

    def forward(self, x, residual):
        xa = x + residual
        xl = self.local_att(xa)
        xg = self.global_att(xa)
        xlg = xl + xg
        wei = self.sigmoid(xlg)

        xo = x * wei + residual * (1 - wei)
        return xo
  3. Iterative Attentional Feature Fusion (iAFF)

(Figure: iAFF structure diagram)

In the attentional feature fusion module, the initial fusion of X and Y is a simple element-wise addition, and using that sum as the input of the attention module affects the final fusion weights. The authors argue that for the network to fully perceive the input feature maps, the initial feature fusion should itself use an attention mechanism; an intuitive approach is to use another attention module to fuse the input features.

X ⊎ Y = M(X + Y) ⊗ X + (1 − M(X + Y)) ⊗ Y

Z = M(X ⊎ Y) ⊗ X + (1 − M(X ⊎ Y)) ⊗ Y

The first line replaces the simple addition as the initial integration X ⊎ Y; the second line is the AFF step applied on top of it.

The formulation is the same as that of AFF; it simply adds one more layer of attention for the initial integration.
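The two-stage idea can be sketched numerically (constant stand-in logits instead of MS-CAM, my simplification):

```python
import torch

# iAFF sketch: stage 1 forms the initial integration xi = M(x+y)*x + (1-M(x+y))*y,
# stage 2 recomputes attention from xi and blends x and y again.
def fuse(x, y, logits):
    wei = torch.sigmoid(logits)          # fusion weight in (0, 1)
    return x * wei + y * (1 - wei)

x = torch.full((1, 1, 2, 2), 1.0)
y = torch.full((1, 1, 2, 2), 3.0)
xi = fuse(x, y, torch.zeros_like(x))     # stage 1: 0.5*x + 0.5*y = 2.0 everywhere
z = fuse(x, y, xi - xi.mean())           # stage 2: toy logits derived from xi
print(z[0, 0, 0, 0].item())  # 2.0 — zero logits again give equal weighting
```

Note that both stages blend the original x and y; the intermediate xi is only used to compute the second set of weights, which is exactly what the iAFF forward pass below does.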

The implemented code is as follows:

import torch.nn as nn


class iAFF(nn.Module):
    '''
    Iterative attentional fusion (iAFF) of two features.
    '''

    def __init__(self, channels=64, r=4):
        super(iAFF, self).__init__()
        inter_channels = int(channels // r)

        # local attention
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )

        # global attention
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )

        # second local attention
        self.local_att2 = nn.Sequential(
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )
        # second global attention
        self.global_att2 = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )

        self.sigmoid = nn.Sigmoid()

    def forward(self, x, residual):
        # first attention pass: weights from the naive sum
        xa = x + residual
        xl = self.local_att(xa)
        xg = self.global_att(xa)
        xlg = xl + xg
        wei = self.sigmoid(xlg)
        xi = x * wei + residual * (1 - wei)

        # second attention pass: weights from the attended integration xi
        xl2 = self.local_att2(xi)
        xg2 = self.global_att2(xi)
        xlg2 = xl2 + xg2
        wei2 = self.sigmoid(xlg2)
        xo = x * wei2 + residual * (1 - wei2)
        return xo

3. Experiments

Some experimental results are shown here. For detailed experimental results, please refer to the original paper.

  1. To verify that the multi-scale design is effective, the authors set up two ablations, Global + Global and Local + Local, and compared them with Global + Local; the Global + Local combination still performs best.


  2. In various mainstream networks, applying the proposed feature fusion to short skip connections, long skip connections, and same-layer feature fusion outperforms the previous models.


  3. On different image classification datasets, adding the proposed feature fusion to the original network models improves accuracy considerably while keeping the network's parameter count comparable.


Origin blog.csdn.net/L28298129/article/details/126521418