Attention and Self-Attention

1.Attention

Attention first comes from a paper by Bengio's team: Neural Machine Translation by Jointly Learning to Align and Translate, published at ICLR 2015.

The common practice in encoder-decoder models is to encode the input sentence into a fixed-size state vector, which is then fed to the decoder at every time step. This is harmful for long sentences: as the sentence length increases, translation quality drops rapidly.

Paper motivation: the encoder-decoder performs poorly when translating long sentences.

Solution principle: imitate the human brain, which can focus on different parts of a picture or a sentence.

Paper solution: when generating the current word, fuse the decoder state with all of the input words and compute a weight for each of them. Words generated this way are targeted at the relevant inputs, and the effect is especially noticeable for longer sentences. A sketch of this weight computation is shown below.
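
To make the weight computation concrete, here is a minimal NumPy sketch in the style of the paper's additive (Bahdanau) attention; the function name, the toy shapes, and the random parameters are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """Context vector = attention-weighted sum of encoder states.

    decoder_state:  (d,)    previous decoder hidden state s_{i-1}
    encoder_states: (T, d)  encoder hidden states h_1 .. h_T
    W_s, W_h:       (a, d)  projection matrices of the scoring MLP
    v:              (a,)    scoring vector
    """
    # Score every encoder position against the current decoder state:
    # e_j = v^T tanh(W_s s_{i-1} + W_h h_j)
    scores = np.tanh(encoder_states @ W_h.T + decoder_state @ W_s.T) @ v  # (T,)

    # Softmax turns scores into attention weights that sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Weighted sum of encoder states, recomputed at every decoding step
    context = weights @ encoder_states  # (d,)
    return context, weights

# Toy usage: 5 source words, hidden size 8, scoring size 6 (assumed values)
rng = np.random.default_rng(0)
T, d, a = 5, 8, 6
context, weights = additive_attention(
    rng.normal(size=d), rng.normal(size=(T, d)),
    rng.normal(size=(a, d)), rng.normal(size=(a, d)), rng.normal(size=a))
print(weights)  # one weight per source word, summing to 1
```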

The overall framework:

(Figure from the paper, omitted.)

2.Self-attention

Self-attention comes from the Google team's paper: Attention Is All You Need, published at NIPS 2017.

Paper motivation: the structure of the RNN itself hinders parallelization; RNNs also suffer from long-distance dependency problems, which degrade performance.

Solution: multiply the word vectors by matrices to obtain the degree of similarity between each pair of words, so there is no distance limitation.
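
A minimal NumPy sketch of this matrix-multiplication view, following the paper's scaled dot-product attention softmax(QK^T / sqrt(d_k)) V; the toy sentence length and the reuse of the same matrix X as Q, K, and V are assumptions made only for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Word-to-word similarity via matrix multiplication, Transformer style.

    Q, K, V: (T, d_k) query / key / value matrices, one row per word.
    """
    d_k = Q.shape[-1]
    # Every query is compared with every key, so distance in the sentence is irrelevant
    scores = Q @ K.T / np.sqrt(d_k)                       # (T, T) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # (T, d_k) new word representations

# Toy usage: a 4-word sentence, dimension 64 (assumed values)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64))
out = scaled_dot_product_attention(X, X, X)  # self-attention: Q, K, V from the same sentence
print(out.shape)  # (4, 64)
```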

The overall framework:

(Figures from the paper, omitted.)

multi-head attention:

The word vector is split into h parts, and attention similarity is computed separately in each part. Since words are mapped as vectors in a high-dimensional space, each subspace can learn different characteristics, and results computed within a subspace are more similar to each other, so splitting the space is more reasonable than computing attention over all dimensions at once. For example, with a word vector size of 512 and h = 8, attention is computed in each 64-dimensional subspace, and the learned result is more refined. A sketch of this splitting is given below.
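
The sketch below shows how a 512-dimensional vector can be split into 8 subspaces of 64 dimensions with attention computed independently in each; the learned per-head Q/K/V projections and the output projection of the real Transformer are omitted here as a simplifying assumption.

```python
import numpy as np

def multi_head_attention(X, num_heads=8):
    """Split each word vector into num_heads subspaces and attend per subspace.

    X: (T, d_model) word vectors; per-head W_q/W_k/W_v and the final output
    projection are left out to keep the sketch short.
    """
    T, d_model = X.shape
    d_head = d_model // num_heads                           # 512 / 8 = 64
    # (T, d_model) -> (num_heads, T, d_head)
    heads = X.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    outputs = []
    for h in heads:                                         # independent attention per subspace
        scores = h @ h.T / np.sqrt(d_head)                  # (T, T)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ h)                         # (T, d_head)

    # Concatenate the heads back into a (T, 512) result
    return np.concatenate(outputs, axis=-1)

# Toy usage: 4 words, vector size 512, 8 heads of 64 dims each
rng = np.random.default_rng(0)
out = multi_head_attention(rng.normal(size=(4, 512)))
print(out.shape)  # (4, 512)
```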

self-attention:

  Each word, regardless of direction and distance, has the chance to be encoded directly against every other word in the sentence. For example, in the sentence in the figure below, each word has an edge connecting it to every other word; the darker the edge, the stronger the connection, and words with vaguer meanings generally have darker edges. For example: law, application, missing, opinion.

(Figure: attention edges between the words of the example sentence, omitted.)

Origin www.cnblogs.com/AntonioSu/p/12019534.html