Attention Mechanism


 When we humans look at something, we tend to focus our attention on one spot rather than take in all the information at once. For example, when we see the picture of the cat below, we mainly focus on the cat's face and torso, while the grass behind it is ignored as background. In other words, our attention is distributed differently across spatial locations.
 
In this way, humans invest more attention in the target area to obtain more detailed information while suppressing information from other areas. This lets us use limited attention resources to quickly extract high-value information from a large amount of input, greatly improving the efficiency with which the brain processes information.

So can this "attention mechanism" of humans be used in AI?
Let's look at the effect of introducing the "attention mechanism" into image captioning (Image Caption). Image captioning is a typical application of deep learning: given an input picture, the AI system outputs a text description of its content. In the example below, the left side is the input image, the sentence at the bottom is the description automatically generated by the AI system, and the right side shows, for each underlined word as it is generated, the region of the picture the system focuses on:
 
It can be seen that when outputting words such as "frisbee" and "dog", the AI system allocates more attention to the corresponding positions of the frisbee and the dog in the picture, producing a more accurate output. Amazing, isn't it? How does this happen?

1. What is the "attention mechanism"?
The Attention Mechanism in deep learning is similar to human visual attention: among a large amount of information, it focuses on the important points, selects the key information, and ignores the rest.

2. Encoder-Decoder framework (encoding-decoding framework)
At present, most attention models are built on the Encoder-Decoder framework, so let's first understand this framework. The Encoder-Decoder framework can be regarded as a general model in the field of text processing; its abstract representation is as follows:
 
Given an input X, the Encoder-Decoder framework generates a target Y. The Encoder encodes the input X, converting it through a nonlinear transformation into an intermediate semantic representation C; the Decoder then generates the target information based on C and the previously generated history.
The Encoder-Decoder framework is general and appears in many scenarios: it is widely used in text processing, image processing, speech recognition, and other fields, and the Encoder and Decoder can be built from various models such as CNN/RNN/BiRNN/LSTM. For example, in automatic question answering, X is a question and Y is the answer; in machine translation, X is one language and Y is another; in automatic summarization, X is an article and Y is the abstract; in image captioning, X is a picture and Y is its text description...
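The abstract framework above can be sketched in plain Python. This is a toy illustration only: the "encoder" and "decoder" below are stand-in functions (averaging token lengths, building labeled strings), not real neural networks, and all names are assumptions for the sketch.

```python
# Minimal sketch of the Encoder-Decoder idea:
# "encode" compresses the whole input X into one semantic representation C;
# "decode" emits each output conditioned on C and the outputs so far.

def encode(tokens):
    # Toy encoder: map each token to a number (its length, standing in
    # for an embedding) and average them into a single representation C.
    vecs = [float(len(t)) for t in tokens]
    return sum(vecs) / len(vecs)

def decode(C, max_len=3):
    outputs = []
    for i in range(max_len):
        # Each step sees C and the history of previously generated outputs.
        y_i = f"y{i+1}(C={C:.1f}, history={len(outputs)})"
        outputs.append(y_i)
    return outputs

X = ["Tom", "chase", "Jerry"]
print(decode(encode(X)))
```

Note that every decoding step sees the same single C; this is exactly the limitation the attention model addresses below.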

3. Attention model
As mentioned at the beginning of this article, the human visual attention mechanism distributes attention unevenly when processing information. The basic Encoder-Decoder framework, however, encodes the entire input X into a single semantic representation C, so every part of the input is processed with the same weight; it does not reflect any concentration of attention and can therefore be regarded as a "distraction model".
To reflect the attention mechanism, the semantic representation C is extended: a different Ci is used at each output step to represent a different concentration of attention, with different weights inside each Ci. The expanded Encoder-Decoder framework then becomes:

The following English-to-Chinese translation example illustrates the "attention model".
Suppose the input English sentence is "Tom chase Jerry" and the target translation is "汤姆追逐杰瑞" (Tom chases Jerry). In the translation, the three words Tom, chase, and Jerry influence the result to different degrees: Tom and Jerry are the subject and object (two names), while chase is the predicate (the action). The order of influence might then be Jerry > Tom > chase, for example (Tom, 0.3), (Chase, 0.2), (Jerry, 0.5). These different degrees of influence represent the amount of attention the AI model allocates to each word during translation, i.e., the assigned probability.

Using the Ci-extended Encoder-Decoder framework shown above, the process of translating "Tom chase Jerry" is as follows.
The target sentence words are generated in the following form:

y1 = f1(C1)
y2 = f1(C2, y1)
y3 = f1(C3, y1, y2)
where f1 is the nonlinear transformation function of the Decoder.
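This generation process can be sketched in Python. Here f1 is a toy string-builder standing in for the real decoder, and the per-step context values are illustrative numbers, not outputs of any real model.

```python
# Sketch of y_i = f1(C_i, y_1..y_{i-1}): each target word is generated
# from its own context vector C_i plus the previously generated words.

def f1(C_i, history):
    # Toy "decoder": records which context it saw and how much history.
    return f"word({C_i:.1f}|prev={len(history)})"

C = [1.6, 1.9, 2.2]      # illustrative per-step context values C_1..C_3
ys = []
for C_i in C:
    ys.append(f1(C_i, ys))
print(ys)
```

The key difference from the basic framework is that each step receives its own C_i rather than one shared C.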
Each Ci corresponds to an attention probability distribution over the source words. For example, for the target word "汤姆" (Tom), with the weights given below:

C(汤姆) = g(0.6 · f2("Tom"), 0.2 · f2("Chase"), 0.2 · f2("Jerry"))
Here the f2 function represents the Encoder's transformation of an input English word, and the g function represents how the Encoder synthesizes the intermediate semantic representation of the whole sentence. A weighted sum is generally used:

Ci = Σ (j = 1..Tx) aij · hj
where aij represents the attention weight, hj is the Encoder's representation of the j-th source word, i.e. h1 = f2("Tom"), h2 = f2("Chase"), h3 = f2("Jerry"), and Tx is the length of the input sentence.
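The weighted sum Ci = Σ aij · hj can be sketched as follows. The hj values are toy scalars (real encoder states are vectors), the first weight row is the 0.6/0.2/0.2 example from the text, and the other two rows are purely illustrative assumptions.

```python
# Sketch of Ci = sum_j a_ij * h_j for the "Tom chase Jerry" example.

h = {"Tom": 1.0, "Chase": 2.0, "Jerry": 3.0}   # h_j = f2(word), toy values

# a_ij: one row of attention weights per target word (each row sums to 1).
a = {
    "汤姆 (Tom)":   [0.6, 0.2, 0.2],   # row from the text
    "追逐 (chase)": [0.2, 0.7, 0.1],   # illustrative
    "杰瑞 (Jerry)": [0.3, 0.2, 0.5],   # illustrative
}

def context(weights, states):
    # Weighted sum of encoder states: the context vector Ci.
    return sum(w * s for w, s in zip(weights, states))

states = list(h.values())
for target, weights in a.items():
    print(target, context(weights, states))
```

Each target word thus gets its own context built from the same encoder states but with different mixing weights.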
When the target word i is "汤姆" (Tom), the attention model weights aij are 0.6, 0.2, and 0.2 respectively. So how are these weights obtained?
aij can be regarded as a probability reflecting the importance of hj to Ci, and can be computed with a softmax:

aij = exp(eij) / Σ (k = 1..Tx) exp(eik)

where

eij = F(hj, h'i-1)
Here F is a scoring function measuring the degree of match; it can be a simple similarity calculation or the output of a more complex neural network. Since h'i is not yet available when Ci is computed, the most recent decoder state h'i-1 is used instead. The higher the match score, the larger the probability aij.
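The score-then-softmax step can be sketched as follows. F is assumed here to be a simple product (a scalar stand-in for a dot-product similarity; the text allows anything from similarity measures to a small neural network), and all state values are toy numbers.

```python
import math

# Sketch: turn match scores e_ij = F(h_j, h'_{i-1}) into attention
# weights a_ij via a softmax.

def F(h_j, h_prev):
    # Toy scoring function: scalar product as a stand-in for similarity.
    return h_j * h_prev

def softmax(scores):
    m = max(scores)                         # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

h = [1.0, 2.0, 3.0]                         # encoder states h_1..h_3
h_prev = 0.5                                # previous decoder state h'_{i-1}
a_i = softmax([F(hj, h_prev) for hj in h])
print(a_i)                                  # weights sum to 1
```

The softmax guarantees the weights form a probability distribution, so a higher match score directly translates into a larger share of attention.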
Therefore, the process of obtaining aij is as follows:
 
where hj represents the Encoder's representation of the j-th source word, and F(hj, h'i-1) is the matching score between that source word and the current decoding target.

Stringing the above processes together, the structure of the attention model is shown in the following figure:
 
where hi represents the Encoder-stage hidden states, Ci the semantic encoding, and h'i the Decoder-stage hidden states.

The above is the classic Soft-Attention model; attention models can be divided into many other categories along different dimensions.

4. Classification of attention models
According to the differentiability of attention, it can be divided into:

  • Hard-Attention: a 0/1 problem, in which each region is either attended to or not; this attention is non-differentiable;
  • Soft-Attention: a continuous distribution over [0, 1], using scores between 0 and 1 to indicate how much attention each region receives; this attention is differentiable.
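The contrast between the two can be shown in a few lines. The `harden` helper below is a hypothetical illustration (taking the argmax of soft weights), not how hard attention is actually trained in practice, where sampling is typically used.

```python
# Soft attention keeps a full distribution over regions in [0, 1];
# hard attention collapses it to a 0/1 choice of a single region.

soft = [0.6, 0.2, 0.2]            # soft weights over 3 regions (sum to 1)

def harden(weights):
    # Pick the single most-attended region; everything else gets 0.
    best = weights.index(max(weights))
    return [1 if j == best else 0 for j in range(len(weights))]

print(harden(soft))               # -> [1, 0, 0]
```

The argmax step is what breaks differentiability: small changes in the soft weights usually leave the 0/1 output unchanged, so no gradient flows through it.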

According to the domain that attention operates on, it can be divided into:

  • spatial domain
  • channel domain
  • layer domain
  • mixed domain
  • time domain

 

Welcome to follow my WeChat public account "Big Data and Artificial Intelligence Lab" (BigdataAILab) for more information

 

