Attention is All You Need (Introduction to Transformer)

1 Introduction

In 2018 the Google team proposed BERT, a model for learning word representations that achieved excellent results on 11 NLP tasks, arguably the most exciting news in deep learning that year. BERT is built on the Transformer, and the Transformer is built on the attention mechanism.

2 Encoder-Decoder

2.1 Introduction to Encoder-Decoder

At present, most attention models are implemented within the Encoder-Decoder framework. In NLP, the Encoder-Decoder framework is mainly used for sequence-to-sequence problems, for example:

  • Text summarization: input an article (sequence data), generate a summary of it (sequence data)
  • Text translation: input an English sentence (sequence data), generate the translated Chinese sentence (sequence data)
  • Question answering: input a question (sequence data), generate an answer (sequence data)

2.2 Encoder-Decoder structure principle

[Figure: the Encoder-Decoder framework]
Encoder: encodes the input sequence <x1, x2, x3…xn> into a semantic code C that stores the information of the whole sequence. The encoder can be an RNN/LSTM/GRU/BiRNN/BiLSTM/BiGRU.

Decoder: takes the semantic code C and decodes it back into sequence data. The decoder can likewise be an RNN/LSTM/GRU/BiRNN/BiLSTM/BiGRU.
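
As a concrete illustration, here is a minimal sketch of such an Encoder-Decoder in Python with PyTorch, using GRUs on both sides. The dimensions and variable names are illustrative assumptions, not from the original post.

```python
# Minimal Encoder-Decoder sketch with GRUs (illustrative, not a full model).
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 1000, 64, 128   # assumed toy sizes

embed   = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
proj    = nn.Linear(hid_dim, vocab_size)

src = torch.randint(0, vocab_size, (1, 7))   # source sequence <x1..x7>
tgt = torch.randint(0, vocab_size, (1, 5))   # target sequence <y1..y5>

# Encoder: compress the whole input sequence into one semantic code C
_, C = encoder(embed(src))                   # C: (1, 1, hid_dim)

# Decoder: start from C and unfold it into the output sequence
out, _ = decoder(embed(tgt), C)              # (1, 5, hid_dim)
logits = proj(out)                           # per-step vocabulary scores
print(logits.shape)                          # torch.Size([1, 5, 1000])
```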

3 Attention Model

3.1 Introduction to Attention attention mechanism

Looking at the same picture, different people may observe and notice different places; this is the human attention mechanism. Attention in deep learning is designed to imitate it.

3.2 Attention principle

[Figure: the Encoder-Decoder framework with the Attention mechanism]
The figure above shows the Encoder-Decoder framework with the Attention mechanism added. There is no longer a single semantic code C, but several codes C1, C2, C3, and so on. When predicting Y1, the model's attention may be on C1, so C1 is used as the semantic code; when predicting Y2, attention is focused on C2, so C2 is used; and so on, simulating the human attention mechanism.

In other words, the fixed intermediate semantic representation C is replaced by a Ci that the attention model adjusts according to the word currently being generated.

The aggregation function g is a weighted summation, and αij is the attention weight distribution:

$$C_i = \sum_{j=1}^{n} \alpha_{ij} h_j$$
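
A small numeric sketch of this weighted summation in NumPy; the hidden states and weights below are made-up values for illustration.

```python
# C_i = sum_j alpha_ij * h_j, computed for two output steps at once.
import numpy as np

n, d = 4, 3                       # 4 encoder steps, hidden size 3
h = np.random.randn(n, d)         # encoder hidden states h_1..h_4

# One row of attention weights per output step; each row sums to 1
alpha = np.array([[0.7, 0.1, 0.1, 0.1],    # weights when predicting y_1
                  [0.1, 0.7, 0.1, 0.1]])   # weights when predicting y_2

C = alpha @ h                     # C[i] = sum_j alpha[i, j] * h[j]
print(C.shape)                    # (2, 3): one context vector per output step
```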

3.3 The essential idea of the Attention mechanism

[Figure: the essence of Attention: a Query against the <Key, Value> pairs in Source]
Referring to the figure above, we can understand Attention as follows. Think of the elements in Source as a series of <Key, Value> pairs (in our example above, Key and Value are equal: both are the encoder output values h). Given an element Query in Target (the decoder state hi in the example above), we compute the similarity or correlation between the Query and each Key to obtain a weight coefficient for each Key's Value, and then take the weighted sum of the Values to get the final Attention value. So in essence, the Attention mechanism is a weighted summation of the Value elements in Source, where Query and Key are used to compute each Value's weight coefficient. This essential idea can be written as the following formula:

$$\operatorname{Attention}(\text{Query}, \text{Source}) = \sum_{i=1}^{L_x} \operatorname{Similarity}(\text{Query}, \text{Key}_i) \cdot \text{Value}_i$$

Here Lx is the length of Source. Similarity(Query, Key_i) can be computed in several ways:

Dot product: the dot product of Query and Key_i; this is the method used in the Transformer:

$$\operatorname{Similarity}(\text{Query}, \text{Key}_i) = \text{Query} \cdot \text{Key}_i$$

Cosine similarity: the numerator is the dot product of Query and Key_i, and the denominator is the product of their L2 norms (the L2 norm of a vector is the square root of the sum of the squares of its elements):

$$\operatorname{Similarity}(\text{Query}, \text{Key}_i) = \frac{\text{Query} \cdot \text{Key}_i}{\|\text{Query}\| \cdot \|\text{Key}_i\|}$$
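
The two similarity functions above can be sketched in a few lines of NumPy; the Query and Key values are made up for illustration.

```python
# Dot-product and cosine similarity between a Query and one Key.
import numpy as np

q = np.array([1.0, 2.0, 0.5])     # Query (illustrative values)
k = np.array([0.5, 1.0, 1.0])     # one Key

dot = q @ k                                              # dot product (used in Transformer)
cos = (q @ k) / (np.linalg.norm(q) * np.linalg.norm(k))  # cosine similarity

print(dot, cos)
```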

After the similarity between the Query and each Key_i is computed, the second stage applies SoftMax to the first-stage scores. This serves two purposes: it normalizes the scores into a probability distribution that sums to 1, and, through SoftMax's internal mechanism, it makes the weights of the important elements stand out more. That is, the weights are generally computed as:

$$a_i = \operatorname{SoftMax}(s_i) = \frac{e^{s_i}}{\sum_{j=1}^{L_x} e^{s_j}}$$
The result ai of the second stage is the weight coefficient of Value_i; a weighted summation then yields the Attention value:

$$\operatorname{Attention}(\text{Query}, \text{Source}) = \sum_{i=1}^{L_x} a_i \cdot \text{Value}_i$$

[Figure: the three-stage computation of Attention]
Stage 1: compute the similarity between the Query and each Key to obtain a similarity score s
Stage 2: apply SoftMax to the scores s to turn them into a probability distribution in [0, 1]
Stage 3: use [a1, a2, a3…an] as the weights for a weighted sum of the Values, giving the final Attention value
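
Putting the three stages together, a minimal NumPy sketch might look as follows. Q, K and V are random illustrative values; as in the seq2seq example above, the Keys and Values are both taken to be the encoder outputs.

```python
# The three stages in one pass: score, softmax, weighted sum.
import numpy as np

Lx, d = 5, 8                        # source length, hidden size
np.random.seed(0)
Q = np.random.randn(d)              # one decoder Query
K = np.random.randn(Lx, d)          # Keys: encoder outputs h_1..h_Lx
V = K                               # Values equal the Keys here

s = K @ Q                           # stage 1: similarity scores s_i = Query . Key_i
e = np.exp(s - s.max())             # subtract max for numerical stability
a = e / e.sum()                     # stage 2: softmax -> weights in [0, 1], summing to 1
attention = a @ V                   # stage 3: weighted sum of the Values

print(a.round(3))                   # the weight coefficients a_1..a_Lx
print(attention.shape)              # (8,): the final Attention value
```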

3.4 Attention advantages and disadvantages

Advantages:

  • High speed. The Attention mechanism no longer depends on step-by-step RNN recurrence, which removes RNN's obstacle to parallel computation.
  • Good results. This is mainly because the attention mechanism can pick out the locally important information and grasp the key points.

Disadvantages:

  • Parallel computation is only achieved in the Decoder stage. The Encoder part still encodes sequentially with RNN/LSTM models and still cannot be parallelized, which is not ideal.
  • Because the Encoder part still relies on an RNN, relationships between words at medium and long distances still cannot be captured well.

4 Self-Attention

In the Encoder-Decoder framework of a typical task, the input Source and the output Target are different. For English-to-Chinese machine translation, for example, Source is an English sentence, Target is the corresponding translated Chinese sentence, and the Attention mechanism operates between the elements of Target and all the elements of Source. Self Attention, as the name suggests, is not the Attention between Target and Source, but the Attention among the internal elements of Source, or among the internal elements of Target. It can also be understood as the Attention mechanism in the special case Target = Source. The computation process is exactly the same as ordinary Attention; only the objects change, which amounts to Query = Key = Value.

[Figure: Self Attention linking words within one sentence]
With Self Attention, long-distance dependencies within a sentence become much easier to capture. With an RNN or LSTM, computation proceeds step by step, so relating two distant words requires information to accumulate over many time steps, and the farther apart they are, the less likely the dependency can be captured effectively. Self Attention, in contrast, directly connects any two words in the sentence in a single computation step, so the distance between long-range dependent features is drastically shortened, which helps the model use these features effectively. In addition, Self Attention directly increases the parallelism of the computation. It makes up for exactly the two shortcomings of the Attention mechanism listed above, which is the main reason Self Attention has gradually come into wide use.
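
A minimal Self Attention sketch in NumPy: Query, Key and Value all come from the same sentence, so every word attends to every other word in one matrix multiplication. The scaling by the square root of d follows the Transformer's scaled dot product; the data is made up.

```python
# Self-attention: all pairwise word-to-word connections in one step.
import numpy as np

n, d = 6, 8                          # 6 words, embedding size 8
np.random.seed(0)
X = np.random.randn(n, d)            # word representations of one sentence

# In practice X is projected by learned matrices W_q, W_k, W_v;
# here we use X directly (Query = Key = Value) to keep the sketch short.
Q = K = V = X

scores = Q @ K.T / np.sqrt(d)        # all pairwise similarities at once
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)   # row-wise softmax
out = weights @ V                    # each word: weighted sum over all words

print(weights.shape, out.shape)      # (6, 6) attention map, (6, 8) outputs
```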

5 Reference

https://blog.csdn.net/Tink1995/article/details/105012972
