Attentional Feature Fusion (AFF)
I recently came across a fairly good feature fusion method based on the attention mechanism. It is similar in spirit to earlier work such as SENet and SKNet, but it performs better and applies to a wider range of scenarios, including short and long skip connections and feature fusion within Inception layers. This is Attentional Feature Fusion (AFF), proposed by Nanjing University of Aeronautics and Astronautics, and it is plug and play!
This blog mainly refers to the Zhihu post by OucQxw: https://zhuanlan.zhihu.com/p/424031096
Paper download address: https://arxiv.org/pdf/2009.14082.pdf
Github code address: https://github.com/YimianDai/open-aff
1. Motivation
Feature fusion refers to the combination of features from different layers or branches and is a ubiquitous part of modern neural network architectures. It is usually implemented with simple linear operations (e.g. summation or concatenation), but this may not be optimal. This paper proposes a unified general scheme, Attentional Feature Fusion (AFF), which is suitable for most common scenarios, including short and long skip connections and feature fusion within Inception layers.
To better fuse features with inconsistent semantics and scales, the authors propose the Multi-Scale Channel Attention Module (MS-CAM), which addresses the problems that arise when fusing features of different scales. They also show that the initial feature fusion can itself be a bottleneck, and propose an iterative Attentional Feature Fusion module (iAFF) to alleviate this problem.
- Problems with recent attention-based feature fusion methods such as SKNet and ResNeSt:
  - Limited scenarios: SKNet and ResNeSt only perform feature selection within the same layer and cannot fuse features across layers.
  - Naive initial fusion: to provide features to the attention module, SKNet fuses them by summation. These features may have large inconsistencies in scale and semantics, which strongly affects the quality of the fusion weights and limits the performance of the model.
  - Biased context aggregation scale: the fusion weights in SKNet and ResNeSt are generated by a global channel attention mechanism, which favors information with a more global distribution but does not work well for small objects.
Is it possible to dynamically fuse features of different scales through neural networks?
- Aiming at the above three problems, this paper makes the following contributions:
  - An Attentional Feature Fusion module (AFF), suitable for most common scenarios, including feature fusion in short and long skip connections and within Inception layers.
  - An iterative Attentional Feature Fusion module (iAFF), which fuses the initial features with another attention module.
  - A Multi-Scale Channel Attention Module (MS-CAM), which extracts channel attention through two branches with different scales.
2. Method
- Multi-Scale Channel Attention Module (MS-CAM)
MS-CAM largely continues the idea of SENet, combining the local and global feature contexts of the CNN and using attention to fuse multi-scale information in space.
MS-CAM has two big differences from SENet:
  - MS-CAM attends to channel scale through point-wise (1x1) convolutions rather than convolution kernels of different sizes, in order to keep MS-CAM as lightweight as possible.
  - MS-CAM aggregates local and global feature contexts inside the channel attention module, not in the backbone network.
The figure above shows the structure of MS-CAM, where X is the input feature and X' is the fused feature. The two branches on the right compute the channel attention of the global feature and of the local feature, respectively. The channel attention of the local feature, L(X), is computed as

L(X) = B(PWConv2(δ(B(PWConv1(X)))))

where PWConv denotes a point-wise (1x1) convolution, B batch normalization, and δ the ReLU activation; the global branch applies the same sequence after global average pooling.
The implemented code is as follows:
```python
import torch
import torch.nn as nn


class MS_CAM(nn.Module):
    '''
    Channel attention weighting of a single feature, similar to the SE module
    '''
    def __init__(self, channels=64, r=4):
        super(MS_CAM, self).__init__()
        inter_channels = int(channels // r)
        # local attention
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )
        # global attention
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        xl = self.local_att(x)
        xg = self.global_att(x)
        xlg = xl + xg
        wei = self.sigmoid(xlg)
        return x * wei
```
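The two branches can be summed directly because the global branch's (B, C, 1, 1) output broadcasts over the local branch's (B, C, H, W) map. A minimal self-contained sketch of that shape arithmetic (the tensor sizes here are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 8, 8)           # (batch, channels, H, W)
pw = nn.Conv2d(64, 16, kernel_size=1)  # one point-wise conv, shared here for illustration

local = pw(x)                               # (2, 16, 8, 8): one value per spatial position
global_ = pw(nn.AdaptiveAvgPool2d(1)(x))    # (2, 16, 1, 1): one value per channel

# broadcasting expands the global branch across H and W before the elementwise sum
summed = local + global_
assert summed.shape == (2, 16, 8, 8)
```

This is why MS-CAM can mix per-position (local) and per-channel (global) context without any explicit upsampling.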
- Attentional Feature Fusion (AFF)
Given two features X and Y to fuse (Y denoting the feature with the larger receptive field), AFF is computed as follows:
The two input features X and Y first undergo initial feature fusion; the resulting initial feature is then passed through the MS-CAM module and a sigmoid activation, producing weights between 0 and 1. To take a weighted average of X and Y, the fusion weight and its complement (1 minus the weight) are used:

Z = M(X ⊎ Y) ⊗ X + (1 − M(X ⊎ Y)) ⊗ Y

where ⊎ denotes the initial fusion (elementwise sum), M the MS-CAM attention, and ⊗ elementwise multiplication. This amounts to a soft selection: through training, the network determines the respective weights.
The implemented code is as follows:
```python
import torch
import torch.nn as nn


class AFF(nn.Module):
    '''
    Multi-feature fusion, AFF
    '''
    def __init__(self, channels=64, r=4):
        super(AFF, self).__init__()
        inter_channels = int(channels // r)
        # local attention
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )
        # global attention
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, residual):
        xa = x + residual        # initial fusion by elementwise sum
        xl = self.local_att(xa)
        xg = self.global_att(xa)
        xlg = xl + xg
        wei = self.sigmoid(xlg)
        # soft selection between the two inputs
        xo = x * wei + residual * (1 - wei)
        return xo
```
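The soft selection in the last line of forward is plain convex blending: with a per-element weight wei in (0, 1), the output interpolates between the two inputs. A self-contained numeric sketch (the constant values are chosen purely for illustration):

```python
import torch

x = torch.full((1, 2, 2, 2), 2.0)  # stands in for feature X
y = torch.full((1, 2, 2, 2), 4.0)  # stands in for feature Y (larger receptive field)

# a zero pre-activation gives sigmoid(0) = 0.5, i.e. an even blend
wei = torch.sigmoid(torch.zeros_like(x))
out = x * wei + y * (1 - wei)

# an even blend of 2.0 and 4.0 is 3.0 everywhere
assert torch.allclose(out, torch.full_like(out, 3.0))
```

In AFF the pre-activation comes from MS-CAM instead of a constant, so the blend varies per channel and per spatial position.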
- Iterative Attentional Feature Fusion (iAFF)
In the attentional feature fusion module, the initial fusion of X and Y is simply elementwise addition, and using it as the input of the attention module affects the final fusion weights. The authors argue that for the network to have a full perception of the input feature maps, the initial feature fusion should itself use an attention mechanism; an intuitive approach is to use another attention module to fuse the input features.
The formula is the same as for AFF, just with an extra layer of attention.
The implemented code is as follows:
```python
import torch
import torch.nn as nn


class iAFF(nn.Module):
    '''
    Multi-feature fusion, iAFF
    '''
    def __init__(self, channels=64, r=4):
        super(iAFF, self).__init__()
        inter_channels = int(channels // r)
        # first local attention
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )
        # first global attention
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )
        # second local attention
        self.local_att2 = nn.Sequential(
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )
        # second global attention
        self.global_att2 = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, inter_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, residual):
        # first round: attention over the initial elementwise-sum fusion
        xa = x + residual
        xl = self.local_att(xa)
        xg = self.global_att(xa)
        xlg = xl + xg
        wei = self.sigmoid(xlg)
        xi = x * wei + residual * (1 - wei)
        # second round: attention over the intermediate fusion xi
        xl2 = self.local_att2(xi)
        xg2 = self.global_att2(xi)
        xlg2 = xl2 + xg2
        wei2 = self.sigmoid(xlg2)
        xo = x * wei2 + residual * (1 - wei2)
        return xo
```
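Stripped of the attention networks, iAFF is two rounds of the same soft selection: round one forms an intermediate fusion xi, and round two re-weights the original inputs using attention computed on xi. A minimal attention-free sketch of that data flow (random weights stand in for the two MS-CAM passes):

```python
import torch

x = torch.randn(1, 4, 3, 3)
y = torch.randn(1, 4, 3, 3)

wei1 = torch.sigmoid(torch.randn_like(x))   # in iAFF this comes from MS-CAM(x + y)
xi = x * wei1 + y * (1 - wei1)              # intermediate fusion

wei2 = torch.sigmoid(torch.randn_like(xi))  # in iAFF this comes from MS-CAM(xi)
out = x * wei2 + y * (1 - wei2)             # final output re-weights the ORIGINAL x and y

assert out.shape == x.shape
```

Note that the second round blends the original x and y, not xi; xi only serves to produce better-informed fusion weights.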
3. Experiments
Some experimental results are shown here. For detailed experimental results, please refer to the original paper.
- To verify whether the multi-scale design is effective, the authors set up two ablation variants, Global + Global and Local + Local, and compare them with Global + Local; the Global + Local combination performs best.
- Across various mainstream networks, using the proposed fusion method for short skip connections, long skip connections, and same-layer feature fusion outperforms the previous models.
- On different image classification datasets, adding the proposed fusion method to existing network models improves accuracy over the original models while adding relatively few parameters.