Deep learning basics: the attention mechanism (in computer vision)

I have read many explanations of the attention mechanism online; below is my own summary. Experts can skip this post.
Links to a few related articles:
  • Attention model in deep learning
  • Attention mechanism in computer vision
  • Attention mechanism in image processing
  • Attention summary
  • Attention mechanism
  • Detailed attention mechanism
  • Summary of spatial attention mechanism and channel attention mechanism
  • Detailed (very comprehensive) overview: attention mechanism in image processing


The attention mechanism is a resource allocation scheme: with limited computing power, it directs computing resources to the more important parts of a task and so eases information overload. In neural networks, the more parameters a model has, the more information it stores, which can lead to information overload. By introducing an attention mechanism, the model can focus on key information, pay less attention to the rest, or filter out irrelevant information entirely, which eases the overload problem and can improve both the efficiency and the accuracy of the task.
In essence, the attention mechanism is a set of weight coefficients that the network learns on its own, applied as a "dynamic weighting" that emphasizes regions of interest while suppressing irrelevant background regions. In recent years attention mechanisms have been applied in image processing, speech recognition, and natural language processing. This article focuses on their application to images.

1. An intuitive understanding of the attention mechanism

The attention mechanism takes its name from human attention.

1.1 Image example 1

[Figure: regions that people habitually attend to, highlighted in red]
As shown in the figure, the red regions are the parts people habitually pay more attention to, such as a person's face, the title of an article, and the beginning of a paragraph.
Human vision quickly scans the whole image to find the target region that deserves focus, generally called the focus of attention, then invests more attention resources in that region to obtain more detail about the target while suppressing other, useless information.
The attention mechanism in deep learning is essentially similar to this selective visual attention in humans. Its core goal is to select, from a large amount of information, the information most critical to the current task.

1.2 Image example 2

[Figure: a bird flying across the sky]
Attention is a very common but often overlooked fact. For example, when a bird flies across the sky, your attention tends to follow the bird, and the sky naturally becomes background information in your visual system.
To a neural network, by contrast, all the features it extracts are alike; it pays no special attention to any one of them. In this picture, if nothing tells the network to attend to the bird, then the sky, which takes up most of the image, dominates, and the network will treat this as a photo of the sky rather than of a bird.

1.3 Attention mechanism in computer vision

The basic idea of the attention mechanism in computer vision is to let the system learn to pay attention: to ignore irrelevant information and focus on key information.
For example, suppose we are sitting in a coffee shop playing with our phones. While we focus on the phone, we have little idea what the people around us are saying. Once we start focusing on one person's voice, we can hear the conversation clearly.

The same is true for vision. When you merely glance at something, it is hard to notice much, but if you focus on it, its details form an impression in your mind.

The attention mechanism in a neural network is likewise a resource allocation scheme that directs limited computing resources to the more important parts of a task. Generally speaking, the more parameters a model has, the stronger its expressive power and the more information it stores, but this brings the problem of information overload. By focusing on the inputs that are most critical to the current task, paying less attention to the others, and even filtering out irrelevant ones, the attention mechanism eases information overload and improves the efficiency and accuracy of task processing.

This is similar to human visual attention: scanning the global image identifies the target region to focus on, more attention resources are then invested in that region to obtain more detail about the target, and other, irrelevant information is ignored. In this way, limited attention resources can quickly pick high-value information out of a large amount of information.

As deep learning develops, building neural networks with attention mechanisms becomes more important. On the one hand, such networks can learn where to attend on their own; on the other hand, the attention mechanism in turn helps us understand the world as the neural network sees it.

2. Classification of attention mechanisms

Here is a brief introduction to the basic classification of attention mechanisms.
In recent years, most research combining deep learning with visual attention has focused on using masks to form the attention mechanism. The principle of a mask is to mark the key features in the image data through an additional layer of weights. Through training, the deep neural network learns which regions need attention in each new image, and this is what forms the attention.

This idea has evolved into two types of attention: soft attention and hard attention.
Attention can also be classified by the domain it acts on. Starting from different dimensions (such as channel, space, time, and category), it can be divided into the following types:

  • Attention domains of soft attention: spatial domain, channel domain, mixed domain, and self-attention
  • Attention domain of hard attention: time domain

Each of the following categories is a large topic in its own right. Here I only give a basic, surface-level introduction.

3. Hard attention and its corresponding attention domain

Put simply, hard attention is a 0/1 decision: each region is either attended to or not. One application of hard attention to images has been known for many years: image cropping.
Hard attention differs from soft attention in several ways. First, hard attention focuses on points: every point in the image may give rise to attention. Second, hard attention is a stochastic prediction process that emphasizes dynamic change. Most importantly, hard attention is non-differentiable, so training is usually done through reinforcement learning.

In short, the hard attention mechanism is non-differentiable and is usually trained with reinforcement learning: driven by a reward function, the model learns to pay more attention to the details of certain parts.
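To make the 0/1 nature of hard attention concrete, here is a minimal sketch in plain Python. It treats hard attention as a randomly sampled crop: exactly one window of the image is kept and everything else is discarded. The function names, the toy image, and the hand-written location probabilities are all assumptions for illustration; in a trained model the probabilities would come from a learned policy, and the reinforcement learning update itself is not shown.

```python
import random


def crop(image, top, left, size):
    """Return the size x size window of a 2-D image whose corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def sample_glimpse(image, location_probs, size, rng):
    """Hard attention: keep exactly one window and discard everything else.

    location_probs maps (top, left) -> probability. Supplying it by hand is
    an assumption of this sketch; a real model would predict it.
    """
    locations = list(location_probs)
    weights = [location_probs[loc] for loc in locations]
    # Discrete random choice: this step has no useful gradient, which is why
    # hard attention is usually trained with reinforcement learning.
    chosen = rng.choices(locations, weights=weights, k=1)[0]
    return chosen, crop(image, chosen[0], chosen[1], size)


# Toy 4 x 4 image whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
probs = {(0, 0): 0.1, (0, 2): 0.2, (2, 0): 0.2, (2, 2): 0.5}
location, glimpse = sample_glimpse(image, probs, size=2, rng=random.Random(0))
print(location, glimpse)
```

Because the output depends on a discrete choice, a small change in the probabilities usually leaves the selected crop unchanged, so there is no gradient signal to follow directly.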

3.1 Time attention

This concept is actually fairly broad. If computer vision only recognizes single images, there is no notion of a time domain. However, the Recurrent Attention Model work proposes a recognition model whose attention mechanism is based on a recurrent neural network (RNN).

RNNs are best suited to data with temporal structure. For example, RNN-based attention works well in natural language processing, because text has a temporal dependency behind it: one word is followed by another.

Image data has no natural temporal structure; an image is usually a sample at a single point in time. Video data does have it, so an RNN is a better fit there and can be used to generate attention for recognition.

The RNN-based model is deliberately called time-domain attention because it adds a new dimension, time, to the spatial, channel, and mixed domains covered in Section 4. This dimension arises from the temporal order of the sampling points.

In the Recurrent Attention Model, attention is treated as sampling a region on the image, and the sampled point is the point that needs attention. Because this attention is no longer differentiable, it is a hard attention model. It has to be trained with reinforcement learning, and training takes longer.

4. Soft attention and its corresponding attention domain

The soft attention mechanism, put simply, assigns continuous values in [0, 1]: the degree of attention paid to each region is expressed by a score between 0 and 1.
The key points of soft attention are that it attends to regions or channels and that it is deterministic: once learned, it can be generated directly by the network. Most importantly, soft attention is differentiable. Differentiable attention lets gradients be computed through the neural network, so the attention weights are learned through forward propagation and backpropagation.

In short, the soft attention mechanism is continuous and differentiable, so its weights can be learned by gradient descent through forward and backward propagation.
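The differentiability claim can be checked on a tiny example. The sketch below is an illustration only, with hypothetical names and toy numbers: it uses a sigmoid to turn raw scores into soft weights between 0 and 1, multiplies the features by those weights, and computes the gradient of each weighted feature with respect to its score in closed form.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_attend(features, scores):
    """Weight each feature by sigmoid(score), a soft weight in (0, 1)."""
    weights = [sigmoid(s) for s in scores]
    attended = [w * f for w, f in zip(weights, features)]
    return attended, weights


def grad_wrt_scores(features, scores):
    """Derivative of each attended value with respect to its own score.

    d/ds [sigmoid(s) * f] = f * sigmoid(s) * (1 - sigmoid(s))
    """
    return [f * sigmoid(s) * (1.0 - sigmoid(s))
            for f, s in zip(features, scores)]


features = [2.0, -1.0, 0.5]   # toy feature values
scores = [3.0, 0.0, -3.0]     # toy attention scores (learned in practice)
attended, weights = soft_attend(features, scores)
grads = grad_wrt_scores(features, scores)
print(weights)
print(grads)
```

Since the gradient exists for every score, an optimizer can nudge the scores up or down by ordinary backpropagation, which is exactly what the discrete choice in hard attention does not allow.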

4.1 Channel attention

Channel attention models the correlation between different channels (feature maps). The network learns the importance of each feature channel automatically and assigns each channel its own weight coefficient, strengthening important features and suppressing unimportant ones.

The principle of attention in the channel domain is simple and can be understood from the perspective of basic signal transforms. In signal analysis, a signal can be written as a linear combination of sine waves. After a time-frequency transform, a continuous sine wave in the time domain can be represented by a single frequency value.
In a convolutional neural network, each image is initially represented by three channels (R, G, B). After passing through different convolution kernels, each channel generates new signals. For example, applying a convolution with 64 kernels to the image features produces a feature tensor with 64 new channels, of shape (H, W, 64), where H and W are the height and width of the feature map.

The feature in each channel represents the component of the image on a particular convolution kernel. This is analogous to a time-frequency transform, with convolution by a kernel playing a role similar to the Fourier transform of a signal, so the information in one feature channel is decomposed into signal components on the 64 kernels.

Since each signal can be decomposed into components on the kernels, the 64 new channels each contribute more or less to the key information. If we attach a weight to the signal on each channel to represent how relevant that channel is to the key information, then the larger the weight, the higher the relevance, and the more attention that channel deserves.

The representative work in this area is SE-Net, which adaptively adjusts the feature responses between channels through feature recalibration. Another well-known work is SK-Net, which is inspired by the Inception block and the SE block. From the perspective of multi-scale feature representation, it introduces multiple convolution kernel branches to learn attention over feature maps at different scales, so that the network focuses more on the important scales. There is also ECA-Net, which replaces the fully connected layers of the SE module with a 1-dimensional sparse convolution, greatly reducing the number of parameters while maintaining comparable performance. To limit parameters and improve computational efficiency, SE-Net adopts a "reduce dimension first, then restore it" strategy, using two fully connected layers to learn the correlation between channels; every feature map interacts with every other feature map, which is a dense connection. ECA-Net simplifies this so that each channel exchanges information only with its k neighboring channels.
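The idea that each channel only exchanges information with its k neighboring channels can be sketched as a 1-D convolution over the pooled channel vector. This is a rough illustration in plain Python, not the reference ECA-Net code: the function names, the toy input, and the fixed kernel values are assumptions, and a real layer would learn the k kernel values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def global_average_pool(feature_maps):
    """Reduce C channels (each an H x W list of rows) to C channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_maps]


def eca_weights(feature_maps, kernel):
    """Channel weights from a 1-D convolution over neighbouring channels.

    Each channel's weight depends only on itself and its nearest
    neighbours (zero padding at both ends), not on all C channels as with
    fully connected layers. The layer has len(kernel) parameters in total.
    """
    pooled = global_average_pool(feature_maps)
    k = len(kernel)            # assumed odd, so the window is centred
    pad = k // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    mixed = [sum(kernel[j] * padded[c + j] for j in range(k))
             for c in range(len(pooled))]
    return [sigmoid(m) for m in mixed]


def apply_channel_weights(feature_maps, weights):
    """Multiply every value in channel c by that channel's weight."""
    return [[[w * v for v in row] for row in ch]
            for ch, w in zip(feature_maps, weights)]


# Toy input: C = 4 channels of size 2 x 2, channel c filled with c + 1.
features = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(4)]
kernel = [0.25, 0.5, 0.25]     # k = 3; learned in a real layer
weights = eca_weights(features, kernel)
output = apply_channel_weights(features, weights)
print(weights)
```

For comparison, the two fully connected layers of an SE block with C channels and reduction ratio r hold about 2C²/r weights, while this convolution holds only k.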

4.1.1 SENet

SENet (Squeeze-and-Excitation Network) won the 2017 ImageNet classification competition. It is essentially a channel-based attention model: it models the importance of each feature channel and then enhances or suppresses individual channels depending on the task. The schematic is as follows.
[Figure: schematic of the SE block]

On the far left is the feature X of the original input. A transformation, such as a convolution, turns it into a new feature signal U. U has C channels, and the goal is to learn a weight for each channel through the attention module, which yields attention in the channel domain.

The middle module is the innovative part of SENet: the attention module. It consists of three parts: squeeze, excitation, and scale.

  1. Squeeze
    $z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$
    This function is a global average: it sums all the feature values in each channel and takes their mean, which is the mathematical expression of global average pooling.

  2. Excitation
    $s = F_{ex}(z, W) = \sigma(W_2 \, \delta(W_1 z))$
    Here $\delta$ is ReLU and $\sigma$ is the sigmoid activation function. The dimensions of the weights are $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, where $r$ is the reduction ratio. These two weights are learned through training, and the result is a one-dimensional excitation weight vector with one entry per channel.

  3. Scale
    $\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$
    This step is a scaling process: the values of each channel are multiplied by that channel's weight, which strengthens attention on the key channels.
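The three steps above can be sketched in a few lines of plain Python. This is an illustrative forward pass only, under stated assumptions: the weight matrices `w1` and `w2` are hand-picked toy values (they are learned in a real network), biases are omitted, and nested lists stand in for the tensors a deep learning framework would provide.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def squeeze(u):
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]


def excitation(z, w1, w2):
    """sigmoid(W2 · relu(W1 · z)); W1 is (C/r) x C and W2 is C x (C/r)."""
    hidden = [relu(h) for h in matvec(w1, z)]
    return [sigmoid(o) for o in matvec(w2, hidden)]


def scale(u, s):
    """Multiply every value in channel c by the channel weight s[c]."""
    return [[[s_c * v for v in row] for row in ch] for ch, s_c in zip(u, s)]


def se_block(u, w1, w2):
    s = excitation(squeeze(u), w1, w2)
    return scale(u, s), s


# Toy feature U with C = 4 channels of size 2 x 2, reduction ratio r = 2.
u = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[-1.0, 1.0], [1.0, -1.0]],
]
w1 = [[0.5, -0.5, 0.5, -0.5], [0.1, 0.2, 0.3, 0.4]]       # 2 x 4 (toy values)
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0], [0.5, 0.5]]    # 4 x 2 (toy values)
recalibrated, s = se_block(u, w1, w2)
print(s)
```

The output has the same shape as U; only the relative strength of the channels changes.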

4.2 Spatial attention

Spatial attention aims to strengthen the feature representation of key regions. In essence, a spatial transformation module maps the spatial information of the original image into another space while retaining the key information, generates a weight mask for each position, and weights the output with it, so that specific regions of interest are enhanced and irrelevant background regions are weakened.

Not all regions of an image contribute equally to the task. Only task-relevant regions, such as the main subject in a classification task, need attention. A spatial attention model finds the parts of the image that matter most and processes those.

An outstanding work in this area is CBAM, which adds a spatial attention module (SAM) after the channel attention. SAM applies average pooling and max pooling along the channel axis to produce two feature maps carrying different information. The two maps are concatenated, fused by a 7×7 convolution with a large receptive field, and passed through a sigmoid to produce a weight map, which is applied back to the input feature map so that the target region is enhanced. In general, spatial attention treats the features in every channel equally and so ignores the interaction between channels, while channel attention pools the information within each channel globally and so tends to ignore the interaction across space. The CBAM authors verified experimentally that applying channel attention first and spatial attention second works better than the reverse order or the parallel arrangement. Similar improved modules include the Double Attention module proposed in A2-Net and scSE, a variant attention module inspired by SE-Net.
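The spatial attention module described above (pool across channels, fuse with a 7×7 convolution, apply a sigmoid, reweight the input) can be sketched as follows. This is a simplified illustration in plain Python rather than CBAM's actual implementation: the helper names are hypothetical, and the two 7×7 kernels are fixed by hand so that only their centre value is non-zero, whereas a real module learns them.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_pool(features):
    """Collapse C channels into two H x W maps: per-pixel mean and max."""
    c, h, w = len(features), len(features[0]), len(features[0][0])
    avg = [[sum(features[k][i][j] for k in range(c)) / c for j in range(w)]
           for i in range(h)]
    mx = [[max(features[k][i][j] for k in range(c)) for j in range(w)]
          for i in range(h)]
    return avg, mx


def conv2d_same(maps, kernels):
    """Zero-padded 2-D cross-correlation over several input maps, summed
    into one H x W output. One square, odd-sized kernel per input map."""
    h, w = len(maps[0]), len(maps[0][0])
    k = len(kernels[0])
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for m, kern in zip(maps, kernels):
        for i in range(h):
            for j in range(w):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:
                            out[i][j] += kern[di][dj] * m[y][x]
    return out


def spatial_attention(features, kernels):
    """Return (reweighted features, H x W mask with values in (0, 1))."""
    avg, mx = channel_pool(features)
    logits = conv2d_same([avg, mx], kernels)
    mask = [[sigmoid(v) for v in row] for row in logits]
    attended = [[[mask[i][j] * v for j, v in enumerate(row)]
                 for i, row in enumerate(ch)] for ch in features]
    return attended, mask


def center_kernel(size, value):
    """Hypothetical hand-made kernel: zero everywhere except the centre."""
    kern = [[0.0] * size for _ in range(size)]
    kern[size // 2][size // 2] = value
    return kern


# Toy input: C = 2 channels, H = W = 3, with a strong response in the centre.
features = [
    [[0.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 0.0]],
    [[1.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 1.0]],
]
kernels = [center_kernel(7, 0.5), center_kernel(7, 0.5)]
attended, mask = spatial_attention(features, kernels)
print(mask[1][1], mask[0][0])
```

The same mask is shared by all channels, which is the sense in which spatial attention treats every channel equally.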

4.2.1 STN network

The STN (Spatial Transformer Network), proposed by Google DeepMind, learns a deformation of the input that acts as a preprocessing step suited to the task. It is a space-based attention model. The network structure is as follows:
[Figure: STN architecture]
Here, the localization net generates the affine transformation coefficients. Its input is a C×H×W image, and its output is the vector of spatial transformation coefficients, whose size depends on the type of transformation to be learned. For an affine transformation it is a 6-dimensional vector.
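To show what the 6 affine coefficients are used for, here is a small sketch of the two steps that follow the localization net: building a sampling grid from the coefficients and sampling the input bilinearly at the grid points. It is a plain-Python illustration under assumptions of its own: the coefficients are arranged as a 2×3 matrix `theta` and given by hand rather than predicted, coordinates are normalised to [-1, 1] with -1 on the first pixel and +1 on the last, and both image dimensions are assumed to be greater than 1.

```python
import math


def affine_grid(theta, height, width):
    """For every output pixel, compute the source coordinate (x_s, y_s)."""
    grid = []
    for i in range(height):
        y = 2.0 * i / (height - 1) - 1.0
        row = []
        for j in range(width):
            x = 2.0 * j / (width - 1) - 1.0
            x_s = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            y_s = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid


def bilinear_sample(image, grid):
    """Sample a 2-D image at the grid coordinates; outside pixels count as 0."""
    h, w = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    out = []
    for grid_row in grid:
        out_row = []
        for x_s, y_s in grid_row:
            col = (x_s + 1.0) * (w - 1) / 2.0   # back to pixel units
            row = (y_s + 1.0) * (h - 1) / 2.0
            c0, r0 = math.floor(col), math.floor(row)
            fc, fr = col - c0, row - r0
            top = (1 - fc) * pixel(r0, c0) + fc * pixel(r0, c0 + 1)
            bottom = (1 - fc) * pixel(r0 + 1, c0) + fc * pixel(r0 + 1, c0 + 1)
            out_row.append((1 - fr) * top + fr * bottom)
        out.append(out_row)
    return out


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # leaves the image unchanged
flip_x = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # mirrors left and right
same = bilinear_sample(image, affine_grid(identity, 3, 3))
mirrored = bilinear_sample(image, affine_grid(flip_x, 3, 3))
print(same)
print(mirrored)
```

Both steps are built from additions and multiplications, so gradients can flow back to the coefficients, which is what lets the localization net be trained with the rest of the network.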

The effect such a network achieves is as follows:
[Figure: example results of the spatial transformer]

That is, it locates the target and then applies operations such as rotation so that the input sample is easier to learn. This is a one-step adjustment approach.

Whereas Spatial Transformer Networks complete target localization and affine adjustment in a single step, Dynamic Capacity Networks use two sub-networks: a low-capacity sub-network (coarse model) and a high-capacity sub-network (fine model).

The low-capacity sub-network (coarse model) processes the whole image and locates the region of interest (operation f_c in the figure below).
The high-capacity sub-network (fine model) refines the region of interest (operation f_f in the figure below).
Used together, the two achieve lower computational cost with higher accuracy.
[Figure: Dynamic Capacity Networks architecture]

Since the region of interest is, in most cases, only a small part of the image, the essence of spatial attention is to locate the target and then either transform it or compute weights for it.

For a more detailed explanation, see the article "Attention mechanism in computer vision (Visual Attention)".

Experimental results of the spatial transformer model
Take, for example, the result figure from the spatial transformer experiments:

  • Column (a) is the original image: the first handwritten digit, 7, is untransformed; the second, 5, is rotated; the third, 6, has noise added.
  • Column (b): the colored boxes are the bounding boxes of the learned spatial transformers. Each box is the spatial transformer learned from the corresponding image.
  • Column (c) is the feature map after the spatial transformer. The key region of the 7 is selected, the 5 is rotated upright, and the noise around the 6 is left out.

Finally, these transformed feature maps are used to predict the value of the handwritten digits, shown in column (d).

The spatial transformer is in effect an implementation of the attention mechanism: a trained spatial transformer finds the regions of the image that need attention, and it can also rotate and scale, so the important local information in the image is transformed and extracted by the box.

4.3 Mixed attention

With the design ideas of the two previous attention domains in mind, we can make a simple comparison.
First, spatial-domain attention ignores the information in the channel domain and treats the image features in every channel equally. This restricts spatial-domain transformation methods to the original image feature extraction stage; applied to other layers of the network, they are not very interpretable.

Channel-domain attention pools the information within each channel globally and ignores the local information inside the channel, which is a rather crude approach. Combining the two ideas leads to attention models in the mixed domain.

CBAM stands for Convolutional Block Attention Module, one of the representative works on attention, published at ECCV 2018. In this paper, the authors study attention in network architectures: attention not only tells the network where to focus but also improves the representation of the regions of interest. The goal is to increase representational power by focusing on important features and suppressing unnecessary ones. To emphasize meaningful features along both the channel and the spatial dimensions, the authors apply a channel attention module and a spatial attention module in sequence, so the network learns what to focus on along the channel dimension and where to focus along the spatial dimension. Knowing which information to emphasize or suppress also helps information flow within the network. The result is improved performance with only a small parameter overhead.

The main architecture is simple: a channel attention module followed by a spatial attention module. CBAM integrates the two in sequence.

4.4 Self attention

The self-attention mechanism is an improvement on the attention mechanism that reduces the dependence on external information and is better at capturing the internal correlations of data or features.

In a neural network, a convolution layer obtains its output features as a linear combination of the convolution kernel and the input features. Since convolution kernels are usually local, convolution layers are often stacked to enlarge the receptive field, which is not an efficient approach. At the same time, in many computer vision tasks insufficient semantic information hurts the final performance. The self-attention mechanism captures global information to obtain a larger receptive field and more context.

Self-attention has made great progress in sequence models. Context information, in turn, is critical for many vision tasks, such as semantic segmentation and object detection. Self-attention provides an effective way to capture global context through the (key, query, value) triplet.
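One common way to realise the (key, query, value) triplet is scaled dot-product attention, sketched below in plain Python. The scaled dot-product form and the softmax normalisation are assumptions of this sketch, since the text above does not fix a particular formula; the projection matrices are toy identity matrices, and the input is a feature map flattened to N positions with D features each.

```python
import math


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    m = max(row)                       # subtract the max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """x: N positions x D features. Returns (N x D_v output, N x N weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]           # transpose of K
    scores = matmul(q, k_t)                        # N x N similarities
    attn = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(attn, v), attn


# Toy input: N = 4 positions (for example a 2 x 2 feature map), D = 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]    # identity projections (learned in practice)
out, attn = self_attention(x, eye, eye, eye)
print(attn)
print(out)
```

Every output position is a weighted mix of all N positions, which gives the global receptive field in a single layer. The N × N weight matrix is also where the cost comes from: it grows quadratically with the number of positions.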

As an effective way to model context, the self-attention mechanism has achieved good results in many vision tasks. Its shortcomings are also clear: it does not take channel information into account, and its computational cost is still very high. The corresponding improvements are, on the one hand, combining spatial and channel information effectively and, on the other hand, capturing information sparsely. Sparsity can make the model more robust while keeping computation and GPU memory low. Finally, graph convolution has been a hot research direction in recent years; how to connect self-attention with graph convolution, and a deeper understanding of self-attention itself, are important directions for future work.

Origin blog.csdn.net/m0_47146037/article/details/126260922