[Deep learning] (8) Channel attention mechanisms (SENet, ECANet) in CNNs, with complete TensorFlow code

Hello classmates, today I will share with you the application of attention mechanisms in convolutional neural networks (CNNs), focusing on the channel attention mechanisms SENet and ECANet and their code implementations.

I have also used attention mechanisms in earlier articles of my neural network column; for example, the SE attention mechanism appears in the MobileNetV3 and EfficientNet networks. If you are interested, you can take a look: https://blog.csdn.net/dgvv4/category_11517910.html. Today I will walk you through the attention mechanism itself.


1. Introduction

The attention mechanism stems from the study of human vision. In cognitive science, due to bottlenecks in information processing, humans selectively focus on a portion of all information while ignoring other visible information. In order to rationally utilize the limited visual information processing resources, humans need to select a specific part of the visual area and then focus on it.

There is no strict mathematical definition of the attention mechanism; traditional techniques such as local image feature extraction and sliding-window methods can all be regarded as forms of attention. In neural networks, an attention mechanism is usually an additional network that either hard-selects certain parts of the input or assigns different weights to different parts of the input. In this way, important information can be filtered out of a large amount of information.

There are many ways to introduce attention into a neural network. Taking convolutional neural networks as an example, attention can be added in the spatial dimension, in the channel dimension (SE), or in both at once (CBAM), i.e., a mixed mechanism that applies attention in both the spatial and channel dimensions.


2. SENet attention mechanism

2.1 Method introduction

The SE attention mechanism (Squeeze-and-Excitation Networks) adds attention in the channel dimension; its key operations are squeeze and excitation.

Through automatic learning, i.e., with a small additional neural network, the importance of each channel of the feature map is learned, and this importance is then used to assign a weight to each channel, so that the network can focus on certain feature channels: channels that are useful for the current task are boosted, while channels that are not useful are suppressed.

As shown in the figure below, before the SE block (the white feature map C2 on the left) every channel of the feature map has the same importance. After SENet (the colored feature map C2 on the right), different colors represent different weights, so each feature channel has a different importance and the network pays more attention to the channels with large weights.


2.2 Implementation process

(1) Squeeze (Fsq): Global average pooling compresses the two-dimensional feature (H*W) of each channel into a single real number, changing the feature map from [h,w,c] ==> [1,1,c]

(2) Excitation (Fex): A weight value is generated for each feature channel. In the paper, the correlation between channels is modeled with two fully connected layers, and the number of output weights equals the number of channels of the input feature map. [1,1,c] ==> [1,1,c]

(3) Scale (Fscale): The normalized weights obtained above are applied to the features of each channel. The paper uses multiplication, applying the weight coefficients channel by channel. [h,w,c]*[1,1,c] ==> [h,w,c]

Below I use the SE attention mechanism in EfficientNet to illustrate this process.

Squeeze operation: The feature map goes through global average pooling and is compressed into a feature vector [1,1,c].

Excitation operation: FC1 layer + Swish activation + FC2 layer + Sigmoid activation. The first fully connected layer (FC1) reduces the channel dimension of the feature vector to 1/r of the original, i.e. [1,1,c] ==> [1,1,c/r]; a Swish activation follows; the second fully connected layer (FC2) raises the channel dimension back to the original c, i.e. [1,1,c/r] ==> [1,1,c]; finally a sigmoid converts the result into a normalized weight vector between 0 and 1.

Scale operation: The normalized weights are multiplied channel by channel with the original input feature map to produce a weighted feature map.
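
To tie the three steps together, here is a minimal sketch of this EfficientNet-style SE block (my own illustration, written in the same style as the reference code in section 2.3; it differs only in using the Swish activation between the two fully connected layers):

import tensorflow as tf
from tensorflow.keras import layers

def se_block_swish(inputs, ratio=4):
    # Number of channels of the input feature map
    in_channel = inputs.shape[-1]

    # Squeeze: global average pooling [h,w,c]==>[None,c], then reshape to [1,1,c]
    x = layers.GlobalAveragePooling2D()(inputs)
    x = layers.Reshape((1, 1, in_channel))(x)

    # Excitation: FC1 reduces the channels to c/ratio, Swish, FC2 restores c channels, Sigmoid
    x = layers.Dense(in_channel // ratio)(x)
    x = tf.nn.swish(x)
    x = layers.Dense(in_channel)(x)
    x = tf.nn.sigmoid(x)

    # Scale: multiply the normalized weights with the input feature map channel by channel
    return layers.multiply([inputs, x])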

Summary:

(1) The core idea of SENet is to let a fully connected network learn the feature weights automatically from the loss, instead of judging directly from the numerical distribution of the feature channels, so that effective feature channels receive large weights. Of course, the SE attention mechanism inevitably adds some parameters and computation (see the quick parameter count below), but the cost-performance ratio is still quite good.

(2) The paper argues that using two fully connected layers in the excitation operation, compared with a single fully connected layer, adds more nonlinearity and better fits the complex correlations between channels, while the bottleneck also keeps the parameter count down.
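
To make the cost concrete, here is a quick back-of-the-envelope parameter count (my own arithmetic, using the c=24, ratio=4 configuration from the example in section 2.3; the total matches the model summary printed there):

# Parameters of the two-fully-connected-layer bottleneck for c=24, ratio=4
c, ratio = 24, 4
fc1 = c * (c // ratio) + (c // ratio)   # Dense(c/ratio): 24*6 + 6  = 150
fc2 = (c // ratio) * c + c              # Dense(c):       6*24 + 24 = 168
print(fc1 + fc2)                        # 318, as in the model summary below
print(c * c + c)                        # 600 for a single Dense(c) layer, with less nonlinearity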


2.3 Code reproduction

import tensorflow as tf
from tensorflow.keras import layers, Model, Input

# SE attention mechanism
def se_block(inputs, ratio=4):  # ratio is the reduction factor of the first fully connected layer
    
    # Number of channels of the input feature map
    in_channel = inputs.shape[-1]
    
    # Global average pooling [h,w,c]==>[None,c]
    x = layers.GlobalAveragePooling2D()(inputs)
    
    # [None,c]==>[1,1,c]
    x = layers.Reshape(target_shape=(1,1,in_channel))(x)
    
    # [1,1,c]==>[1,1,c/ratio]
    x = layers.Dense(in_channel//ratio)(x)  # fully connected layer, reduce channels
    
    # ReLU activation
    x = tf.nn.relu(x)
    
    # [1,1,c/ratio]==>[1,1,c]
    x = layers.Dense(in_channel)(x)  # fully connected layer, restore channels
    
    # Sigmoid activation, normalize the weights
    x = tf.nn.sigmoid(x)
    
    # [h,w,c]*[1,1,c]==>[h,w,c]
    outputs = layers.multiply([inputs, x])  # multiply the normalized weights with the input feature map channel by channel
    
    return outputs


# Test the SE attention mechanism
if __name__ == '__main__':
    
    # Build the input
    inputs = Input([56,56,24])
    
    x = se_block(inputs)  # output of the SE block
    
    model = Model(inputs, x)  # build the model
    
    print(x.shape)  # (None, 56, 56, 24)
    model.summary()  # print the structure of the SE module

The structure of the SE module:

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 56, 56, 24)] 0                                            
__________________________________________________________________________________________________
global_average_pooling2d (Globa (None, 24)           0           input_1[0][0]                    
__________________________________________________________________________________________________
reshape (Reshape)               (None, 1, 1, 24)     0           global_average_pooling2d[0][0]   
__________________________________________________________________________________________________
dense (Dense)                   (None, 1, 1, 6)      150         reshape[0][0]                    
__________________________________________________________________________________________________
tf.nn.relu (TFOpLambda)         (None, 1, 1, 6)      0           dense[0][0]                      
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 1, 1, 24)     168         tf.nn.relu[0][0]                 
__________________________________________________________________________________________________
tf.math.sigmoid (TFOpLambda)    (None, 1, 1, 24)     0           dense_1[0][0]                    
__________________________________________________________________________________________________
multiply (Multiply)             (None, 56, 56, 24)   0           input_1[0][0]                    
                                                                  tf.math.sigmoid[0][0]            
==================================================================================================
Total params: 318
Trainable params: 318
Non-trainable params: 0
__________________________________________________________________________________________________
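
As a usage sketch (my own toy example, not taken from the original network designs, assuming the imports and the se_block from section 2.3 are in scope), the SE block can be dropped in after any convolutional layer of a backbone:

# Toy backbone showing where se_block is typically inserted
inputs = Input([224, 224, 3])
x = layers.Conv2D(24, kernel_size=3, strides=2, padding='same', activation='relu')(inputs)
x = se_block(x)   # re-weight the 24 feature channels
x = layers.Conv2D(48, kernel_size=3, strides=2, padding='same', activation='relu')(x)
x = se_block(x)   # re-weight the 48 feature channels
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation='softmax')(x)
model = Model(inputs, outputs)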

3. ECANet attention mechanism

3.1 Method introduction

ECANet is another implementation of the channel attention mechanism and can be regarded as an improved version of SENet.

The authors show that the dimensionality reduction in SENet has side effects on channel attention, and that capturing dependencies between all channels is inefficient and unnecessary.

The ECA module therefore replaces the fully connected layers with a one-dimensional convolution applied directly after global average pooling. This avoids dimensionality reduction and effectively captures cross-channel interaction, and ECA achieves good results with only a handful of parameters.

ECANet performs the cross-channel information interaction with a one-dimensional convolution (Conv1D). The kernel size k is adapted to the number of channels C through a function, so that layers with more channels perform more cross-channel interaction. The adaptive function is $k = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{odd}$ with $\gamma = 2$ and $b = 1$, where $|\cdot|_{odd}$ denotes rounding to the nearest odd number.
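
As a quick check of the formula (my own sketch, mirroring the kernel-size logic used in the code in section 3.3, with the result forced to an odd number):

import math

def eca_kernel_size(channels, gamma=2, b=1):
    # k = |log2(C)/gamma + b/gamma|, rounded to an odd number
    k = int(abs((math.log(channels, 2) + b) / gamma))
    return k if k % 2 else k + 1

for c in (64, 128, 256, 512):
    print(c, eca_kernel_size(c))   # 64->3, 128->5, 256->5, 512->5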


3.2 Implementation process

(1) The input feature map is subjected to global average pooling, and the feature map changes from a matrix of [h,w,c] to a vector of [1,1,c]

(2) Calculate the adaptive one-dimensional convolution kernel size kernel_size

(3) Apply a one-dimensional convolution with this kernel_size, followed by a sigmoid, to obtain a normalized weight for each channel of the feature map

(4) Multiply the normalized weight and the original input feature map channel by channel to generate a weighted feature map


3.3 Code Implementation

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Model, layers
import math

def eca_block(inputs, b=1, gamma=2):
    
    # Number of channels of the input feature map
    in_channel = inputs.shape[-1]
    
    # Compute the adaptive kernel size from the formula
    kernel_size = int(abs((math.log(in_channel, 2) + b) / gamma))
    
    # The kernel size must be odd; if it comes out even, add 1
    if kernel_size % 2 == 0:
        kernel_size = kernel_size + 1
    
    # [h,w,c]==>[None,c] global average pooling
    x = layers.GlobalAveragePooling2D()(inputs)
    
    # [None,c]==>[c,1]
    x = layers.Reshape(target_shape=(in_channel, 1))(x)
    
    # [c,1]==>[c,1] 1D convolution across neighboring channels
    x = layers.Conv1D(filters=1, kernel_size=kernel_size, padding='same', use_bias=False)(x)
    
    # Sigmoid activation, normalize the weights
    x = tf.nn.sigmoid(x)
    
    # [c,1]==>[1,1,c]
    x = layers.Reshape((1,1,in_channel))(x)
    
    # Multiply the weights with the input feature map channel by channel
    outputs = layers.multiply([inputs, x])
    
    return outputs


# Test the ECA attention mechanism
if __name__ == '__main__':
    
    # Build the input layer
    inputs = keras.Input(shape=[26,26,512])
    
    x = eca_block(inputs)  # output of the ECA block
    
    model = Model(inputs, x)  # build the model
    model.summary()  # print the network structure

Looking at the ECA module, the number of parameters is greatly reduced compared with SENet; it equals the kernel_size of the one-dimensional convolution (see the quick comparison after the summary below).

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_2 (InputLayer)            [(None, 26, 26, 512) 0                                            
__________________________________________________________________________________________________
global_average_pooling2d_1 (Glo (None, 512)          0           input_2[0][0]                    
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, 512, 1)       0           global_average_pooling2d_1[0][0] 
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 512, 1)       5           reshape_1[0][0]                  
__________________________________________________________________________________________________
tf.math.sigmoid_1 (TFOpLambda)  (None, 512, 1)       0           conv1d[0][0]                     
__________________________________________________________________________________________________
reshape_2 (Reshape)             (None, 1, 1, 512)    0           tf.math.sigmoid_1[0][0]          
__________________________________________________________________________________________________
multiply_1 (Multiply)           (None, 26, 26, 512)  0           input_2[0][0]                    
                                                                  reshape_2[0][0]                  
==================================================================================================
Total params: 5
Trainable params: 5
Non-trainable params: 0
__________________________________________________________________________________________________
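
For a like-for-like comparison on the same 26x26x512 input (a quick check of my own, assuming se_block from section 2.3 and eca_block from section 3.3 are defined in the same script):

inputs = keras.Input(shape=[26, 26, 512])

se_model = Model(inputs, se_block(inputs))    # two fully connected layers
eca_model = Model(inputs, eca_block(inputs))  # one 1D convolution

print(se_model.count_params())    # 131712 = (512*128 + 128) + (128*512 + 512)
print(eca_model.count_params())   # 5, the kernel_size of the Conv1D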
