Summary of attention in computer vision

Introduction:

The earliest application of the attention mechanism was in machine translation. In recent years it has become popular in computer vision (CV) tasks. The main purpose of attention in CV is to let the neural network focus its learning on the regions that are most relevant to the task.

Foreword:

There are two kinds of attention: soft attention and hard attention.
1. Soft attention pays more attention to regions or channels. It is deterministic: once training is finished, the attention weights are produced directly by a forward pass of the network. Most importantly, soft attention is differentiable, which is a crucial property: because it is differentiable, the attention weights can be learned through ordinary forward propagation and back-propagation of gradients. A minimal sketch of soft attention is shown after this list.
2. Hard attention, by contrast, focuses on discrete points: any location in the image may become the focus of attention. Hard attention is a stochastic prediction process that emphasizes dynamic change. Most importantly, it is non-differentiable, so it is usually trained with reinforcement learning (RL).
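As a rough illustration of the differentiable-weighting idea behind soft attention (this is not code from any specific paper; the layer and tensor names are made up for the example), the sketch below scores each spatial location with a 1×1 convolution, squashes the scores to [0, 1] with a sigmoid, and multiplies the feature map by the resulting mask, so the attention weights can be learned by ordinary back-propagation:

import torch
import torch.nn as nn

class SimpleSpatialAttention(nn.Module):
    """Illustrative soft attention: score each spatial location, gate it
    to [0, 1], and reweight the feature map. Fully differentiable."""
    def __init__(self, channels):
        super(SimpleSpatialAttention, self).__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        mask = torch.sigmoid(self.score(x))   # B x 1 x H x W, values in [0, 1]
        return x * mask                       # broadcast over all channels

# toy usage
feat = torch.randn(2, 64, 32, 32)
out = SimpleSpatialAttention(64)(feat)        # same shape as feat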

Attention in spatial domain:

The idea is to apply a spatial transformation to the spatial information in the image so that the key information can be extracted: a mask is generated over spatial locations and each location is scored. The representative is the Spatial Attention Module.
1. The first example is the STN (Spatial Transformer Network) proposed by Google DeepMind:
[Figure: STN network structure]
The Localization Net here generates the spatial transformation coefficients. The input is a C×H×W image and the output is a set of transformation parameters whose size depends on the type of transformation to be learned; for an affine transformation it is a 6-dimensional vector. In other words, the network locates the target and then applies operations such as rotation (an affine transform also covers translation, scaling, and shearing), making the input samples easier for the subsequent network to learn.
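A minimal sketch of this step in PyTorch, assuming a tiny made-up localisation network (only the use of affine_grid and grid_sample reflects the standard STN recipe; the layer sizes are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSTN(nn.Module):
    """Illustrative Spatial Transformer: a localisation net predicts the
    6 affine coefficients, which are used to resample the input."""
    def __init__(self, channels):
        super(SimpleSTN, self).__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),                    # shrink to a fixed 8x8 grid
            nn.Flatten(),
            nn.Linear(channels * 8 * 8, 32),
            nn.ReLU(inplace=True),
            nn.Linear(32, 6),                           # the 6 affine coefficients
        )
        # start from the identity transform
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)              # B x 2 x 3 affine matrices
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)   # spatially transformed x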
2. Dynamic Capacity Networks use two sub-networks: a low-performance sub-network (the coarse model) and a high-performance sub-network (the fine model).
[Figure: Dynamic Capacity Networks, with the coarse model fc applied to the full image and the fine model ff applied to selected regions]
The low-performance sub-network processes the whole image and locates the regions of interest (the operation fc in the figure). The high-performance sub-network then refines those regions of interest (the operation ff). Used together, the two achieve lower computational cost with higher accuracy.
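As a rough, simplified sketch of this coarse-to-fine idea (not the exact selection procedure of the Dynamic Capacity Networks paper, which uses gradient-based saliency to pick patches; the function and variable names here are illustrative), the code below runs a cheap model on a downsampled copy of the image, takes the most active location of its output as the region of interest, and applies the expensive model only to that crop:

import torch
import torch.nn.functional as F

def coarse_to_fine(image, coarse_model, fine_model, crop=64):
    """Illustrative coarse-to-fine processing of a 1 x C x H x W image."""
    # 1. Cheap pass over a downsampled copy of the whole image.
    small = F.interpolate(image, scale_factor=0.25, mode='bilinear', align_corners=False)
    saliency = coarse_model(small).mean(dim=1, keepdim=True)   # 1 x 1 x h x w activity map

    # 2. Find the most active position and map it back to full resolution.
    _, _, h, w = saliency.shape
    idx = saliency.view(-1).argmax()
    cy, cx = int(idx // w) * 4, int(idx % w) * 4               # undo the 0.25 downscale

    # 3. Crop a window around that position and refine it with the expensive model.
    H, W = image.shape[-2:]
    y0 = max(0, min(cy - crop // 2, H - crop))
    x0 = max(0, min(cx - crop // 2, W - crop))
    region = image[:, :, y0:y0 + crop, x0:x0 + crop]
    return fine_model(region)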

Channel domain attention (SENet):

Channel attention is like adding a weight to the signal on each channel to represent how relevant that channel is to the key information: the larger the weight, the higher the relevance. A mask is generated over the channels and each channel is scored. The representatives are SENet and the Channel Attention Module.
1. SENet won the ImageNet 2017 classification competition. SE stands for Squeeze-and-Excitation, and it is a module: embed the SE module into an existing network and that network becomes an SENet. It can be embedded into almost any mainstream network.
[Figure: the SE (Squeeze-and-Excitation) block]
The figure above shows the block unit of SENet. Ftr is an ordinary convolutional structure, and X and U are the input (C'×H'×W') and output (C×H×W) of Ftr; this part already exists in the original network. What SENet adds is the structure after U. First, U goes through a global average pooling (Fsq(·) in the figure, which the authors call the Squeeze step). The resulting 1×1×C vector then passes through two fully connected layers (Fex(·) in the figure, the Excitation step), and finally a sigmoid (the self-gating mechanism in the paper) limits the values to the range [0, 1]. These values are used as a per-channel scale and multiplied onto the C channels of U, and the result becomes the input to the next stage. The principle of this structure is that, by controlling the size of the scale, important features are strengthened and unimportant ones are weakened, making the extracted features more targeted.
In plain terms: the convolved feature map is processed to obtain a one-dimensional vector with as many entries as there are channels, which serves as the importance score of each channel. Multiplying each channel's importance score by that channel's original values gives the feature map we actually want, in which different channels carry different importance.
An example of embedding the SE module into the Inception structure and the ResNet structure is given below:
[Figure: SE-Inception module and SE-ResNet module]
PyTorch code implementation of the SE layer:

from torch import nn


class SELayer(nn.Module):
    """Squeeze-and-Excitation layer: global average pooling (squeeze),
    a two-layer fully connected bottleneck (excitation), and a sigmoid
    gate whose output rescales each channel of the input."""
    def __init__(self, channel, reduction=16):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # squeeze: B x C x H x W -> B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()                          # per-channel weights in [0, 1]
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)           # B x C channel descriptor
        y = self.fc(y).view(b, c, 1, 1)           # B x C x 1 x 1 channel weights
        return x * y.expand_as(x)                 # rescale each channel of x
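To make the embedding concrete, below is a hedged sketch of inserting the SELayer above into a ResNet-style basic block (a simplified stand-in rather than the exact SENet reference implementation): following the paper's placement, the SE rescaling is applied to the residual branch before the shortcut addition.

import torch
from torch import nn

class SEBasicBlock(nn.Module):
    """Simplified residual block with an SE layer on the residual branch."""
    def __init__(self, channels, reduction=16):
        super(SEBasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.se = SELayer(channels, reduction)   # the SELayer defined above

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.se(out)                       # rescale the channels of the residual branch
        return self.relu(out + x)                # shortcut addition

# toy usage
x = torch.randn(2, 64, 56, 56)
y = SEBasicBlock(64)(x)                          # same shape as x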

Source: blog.csdn.net/qq_42823043/article/details/107785329