[Study Notes] Classification and Summary of Attention Mechanism



The attention mechanism was first used for machine translation and appeared in the paper "Neural Machine Translation by Jointly Learning to Align and Translate".

Motivation

In the traditional Seq2Seq model, the encoder compresses the source sequence into a single fixed-length vector, no matter how long the sequence is. When the sequence is long, this fixed-length vector is not enough to summarize all the important information, which inevitably hurts the quality of decoding and generation.

By introducing the attention mechanism, every time the Decoder computes an output it can access the hidden state of every token in the Encoder, compute an attention weight for each hidden state, and use the context vector obtained by weighting and summing these hidden states as part of the Decoder input. The Decoder therefore recomputes a context vector at every time step. The advantage is that, when computing each output, the Decoder can concentrate on the information in the source sentence that is most useful for producing the current output correctly.

The attention weight $\alpha_{t,i}$ measures the alignment between the $i$-th token of the source sentence and the $t$-th output of the Decoder.
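To make this concrete, here is a minimal NumPy sketch of one decoder step: the decoder state acts as the query, a weight is computed for every encoder hidden state, and the context vector is their weighted sum. Names, shapes, and the dot-product score are illustrative choices, not fixed by the original paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_context(decoder_state, encoder_states, score_fn):
    """Compute attention weights alpha_{t,i} over all encoder hidden states
    and return the weighted-sum context vector for one decoder time step.

    decoder_state:  (hidden,)         query for the current output step t
    encoder_states: (src_len, hidden) one hidden state per source token
    score_fn:       callable mapping (query, key) -> scalar score
    """
    scores = np.array([score_fn(decoder_state, h_i) for h_i in encoder_states])
    alpha = softmax(scores)               # alignment weights, sum to 1
    context = alpha @ encoder_states      # weighted sum of encoder states
    return context, alpha

# Example with a simple dot-product score.
rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))             # 6 source tokens, hidden size 8
dec = rng.normal(size=(8,))
ctx, alpha = attention_context(dec, enc, lambda q, k: q @ k)
print(alpha.round(3), ctx.shape)
```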

Bahdanau Attention vs. Luong Attention

The main difference between the two lies in the query vector used by the attention scoring function. Bahdanau uses the Decoder's previous state $h_{t-1}$ as the query in the scoring function, while Luong uses the Decoder's current state $h_t$. In both cases, K and V are the hidden states of the Encoder. In addition, in the original paper Luong also proposed three scoring functions: dot product, bilinear, and additive.
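A rough sketch of this difference, with a toy tanh RNN cell and dot-product scoring standing in for the real models; `dec_prev` and `dec_t` play the roles of $h_{t-1}$ and $h_t$, and all parameter names are assumptions of the sketch.

```python
import numpy as np

H, E, L = 8, 4, 6                        # hidden size, embedding size, source length
rng = np.random.default_rng(0)
enc = rng.normal(size=(L, H))            # encoder hidden states (K and V)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def make_rnn_cell(in_dim, hid_dim):
    """Toy tanh RNN cell, only here to make the sketch self-contained."""
    Wx = rng.normal(size=(hid_dim, in_dim)) * 0.1
    Wh = rng.normal(size=(hid_dim, hid_dim)) * 0.1
    return lambda x, h: np.tanh(Wx @ x + Wh @ h)

# Bahdanau: query = previous decoder state; the context is an *input* to the cell.
bahdanau_cell = make_rnn_cell(E + H, H)
def bahdanau_step(dec_prev, y_emb):
    alpha = softmax(enc @ dec_prev)                  # dot-product scores against h_{t-1}
    context = alpha @ enc
    return bahdanau_cell(np.concatenate([y_emb, context]), dec_prev)

# Luong: query = current decoder state; the context is combined *after* the cell.
luong_cell = make_rnn_cell(E, H)
Wc = rng.normal(size=(H, 2 * H)) * 0.1               # combines [context; h_t]
def luong_step(dec_prev, y_emb):
    dec_t = luong_cell(y_emb, dec_prev)
    alpha = softmax(enc @ dec_t)                     # dot-product scores against h_t
    context = alpha @ enc
    return np.tanh(Wc @ np.concatenate([context, dec_t]))   # attentional vector

dec0, y0 = np.zeros(H), rng.normal(size=(E,))
print(bahdanau_step(dec0, y0).shape, luong_step(dec0, y0).shape)
```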

Soft Attention vs. Hard Attention

Our commonly used Attention is Soft Attention: each weight lies in $[0, 1]$, and the weights are learned from the relationships between the features.

In Hard Attention, the attention on each key is either 0 or 1; that is, only a few specific keys are attended to, each with weight 1. It is used more often in computer vision tasks. Like Local Attention, it selects only part of the source information, but Hard Attention makes that selection in a non-differentiable way, so it cannot be optimized together with the rest of the model by backpropagation. Hard Attention mainly performs random cropping over local feature regions.
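The contrast can be sketched in a few lines of NumPy: soft attention keeps the full weight vector and returns a differentiable weighted sum, while hard attention samples a single position. This is illustrative code, not any particular paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))           # 5 source features
values = keys                            # values == keys here for simplicity
query = rng.normal(size=(8,))

scores = keys @ query
weights_soft = np.exp(scores - scores.max())
weights_soft /= weights_soft.sum()

# Soft attention: every weight lies in [0, 1] and the context is a
# differentiable weighted sum over all values.
context_soft = weights_soft @ values

# Hard attention: sample (or argmax) a single position, weight 1 there and 0
# elsewhere; the sampling step is non-differentiable, so training typically
# needs REINFORCE-style estimators rather than plain backpropagation.
idx = rng.choice(len(values), p=weights_soft)
context_hard = values[idx]

print(weights_soft.round(3), idx)
```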

Global Attention vs. Local Attention vs. Hierarchical Attention

Global Attention

Generally, unless otherwise specified, the Attention we use is Global Attention. Following the original attention mechanism, no restriction is placed on which encoder states may be used at each decoding step: all hidden states are considered, that is, all Encoder hidden states serve as Keys and are weighted.

Local Attention

For long texts, aligning against the entire encoder sequence can leave the attention unfocused, so we make the attention mechanism more effective by limiting its scope. In Local Attention, each decoder state $h_t$ corresponds to an encoder position $p_t$; a window size $D$ is chosen (usually empirically), and attention is applied only to the encoder positions in $[p_t - D, p_t + D]$. Depending on how $p_t$ is chosen, Local Attention is divided into Local-m and Local-p. Local-m simply sets $p_t$ to the position corresponding to $h_t$, i.e. $p_t = t$; Local-p uses $h_t$ to predict $p_t$ and then applies a Gaussian distribution so that the Local Attention weights peak around $p_t$, as in the sketch below.
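A hedged sketch of Local-p: $p_t$ is predicted from the decoder state with parameters $W_p$ and $v_p$ (names assumed here), attention is restricted to $[p_t - D, p_t + D]$, and the weights are reshaped with a Gaussian centred at $p_t$. The final renormalisation is a simplification of the original formulation.

```python
import numpy as np

def local_p_attention(query, encoder_states, D, Wp, vp):
    """Local-p attention sketch: predict an aligned position p_t from the
    current decoder state, attend only inside [p_t - D, p_t + D], and
    multiply the weights by a Gaussian centred at p_t (sigma = D / 2).
    Wp and vp are the learnable parameters used to predict p_t."""
    S = len(encoder_states)
    # Predicted (real-valued) centre position in (0, S).
    p_t = S * float(1.0 / (1.0 + np.exp(-(vp @ np.tanh(Wp @ query)))))
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    window = encoder_states[lo:hi]
    scores = window @ query                          # dot-product scores
    alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()
    # Gaussian penalty favouring positions close to p_t.
    positions = np.arange(lo, hi)
    alpha = alpha * np.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))
    alpha /= alpha.sum()
    return alpha @ window, p_t

rng = np.random.default_rng(0)
H, S, D = 8, 20, 3
enc = rng.normal(size=(S, H))
q = rng.normal(size=(H,))
Wp = rng.normal(size=(H, H)) * 0.1
vp = rng.normal(size=(H,)) * 0.1
ctx, p_t = local_p_attention(q, enc, D, Wp, vp)
print(round(p_t, 2), ctx.shape)
```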

Hierarchical Attention

Hierarchical Attention can also be used to address the problem of unfocused attention on long texts (such as documents). Unlike Local Attention, which forcibly limits the scope of the attention mechanism and ignores the remaining positions, Hierarchical Attention takes a hierarchical view and applies attention over all states.

The network can be seen as two parts: a "word attention" part and a "sentence attention" part. The whole network splits a document into sentences; for each sentence, a bidirectional RNN combined with an attention mechanism maps that sentence into a vector. Then, over the resulting sequence of sentence vectors, another bidirectional RNN combined with an attention mechanism is used to classify the text.
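A compact sketch of the two-level pooling, with random vectors standing in for the bidirectional-RNN word states so the example stays short; the `attention_pool` helper and all parameter names are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
H, C = 8, 3                              # hidden size, number of classes

def attention_pool(states, W, b, u):
    """Attention pooling used at both levels of the hierarchy:
    score each state with a one-layer MLP against a context vector u,
    softmax the scores, and return the weighted sum."""
    scores = np.tanh(states @ W + b) @ u
    alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()
    return alpha @ states

# Parameters (random here; normally trained end to end).
Ww, bw, uw = rng.normal(size=(H, H)), np.zeros(H), rng.normal(size=(H,))   # word level
Ws, bs, us = rng.normal(size=(H, H)), np.zeros(H), rng.normal(size=(H,))   # sentence level
Wc = rng.normal(size=(C, H))                                               # classifier

# A "document": 4 sentences, each a sequence of word states. In the real
# model these come from bidirectional RNN encoders; random vectors stand
# in for them here.
document = [rng.normal(size=(n_words, H)) for n_words in (5, 7, 4, 6)]

sentence_vecs = np.stack([attention_pool(s, Ww, bw, uw) for s in document])
doc_vec = attention_pool(sentence_vecs, Ws, bs, us)
logits = Wc @ doc_vec
print(logits.shape)                      # class scores for the whole document
```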

Intra Attention (Self Attention) vs. Inter Attention

Intra Attention is also called Self-Attention. Inter Attention is Encoder-Decoder Attention.

Self Attention is the Attention used in the Transformer. It relates different positions within a sentence to one another to compute a feature representation of the sentence, which amounts to the sentence encoding itself into a feature vector.
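A minimal scaled dot-product self-attention sketch, where queries, keys and values are projections of the same input sequence; the projection matrices are random stand-ins for learned parameters.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: queries, keys and values are all
    projections of the same sequence X, so every position attends to every
    other position in the sentence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len)
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores); A = A / A.sum(axis=-1, keepdims=True)
    return A @ V                                     # new representation of X

rng = np.random.default_rng(0)
d_model, seq_len = 16, 5
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 16)
```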

Attention Scoring Models

Location (align only against the target position), Cosine (cosine similarity between Q and K), the additive model, the dot product, the scaled dot product, and the bilinear model.

Mainly these four:

$$
\begin{aligned}
\text{additive:} \quad & s(\mathbf{q}, \mathbf{x}_i) = \mathbf{v}^{\top} \tanh(W\mathbf{q} + U\mathbf{x}_i) \\
\text{dot product:} \quad & s(\mathbf{q}, \mathbf{x}_i) = \mathbf{q}^{\top} \mathbf{x}_i \\
\text{scaled dot product:} \quad & s(\mathbf{q}, \mathbf{x}_i) = \frac{\mathbf{q}^{\top} \mathbf{x}_i}{\sqrt{d}} \\
\text{bilinear:} \quad & s(\mathbf{q}, \mathbf{x}_i) = \mathbf{q}^{\top} W \mathbf{x}_i
\end{aligned}
$$

where $W$, $U$ and $\mathbf{v}$ are learnable network parameters, $d$ is the dimension of the input information, $\mathbf{q}$ is the query vector, and $\mathbf{x}_i$ is the key vector.
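The four scoring models can be written directly from the formulas above; the snippet below is only an illustration with random parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
q, x_i = rng.normal(size=(d,)), rng.normal(size=(d,))
W = rng.normal(size=(d, d))
U = rng.normal(size=(d, d))
v = rng.normal(size=(d,))

def additive(q, x):    return v @ np.tanh(W @ q + U @ x)
def dot(q, x):         return q @ x
def scaled_dot(q, x):  return (q @ x) / np.sqrt(d)
def bilinear(q, x):    return q @ W @ x

for name, fn in [("additive", additive), ("dot", dot),
                 ("scaled dot", scaled_dot), ("bilinear", bilinear)]:
    print(f"{name:>10}: {fn(q, x_i):+.3f}")
```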


Origin blog.csdn.net/m0_47779101/article/details/129191780