Transformer: Attention Is All You Need

https://www.cnblogs.com/rucwxb/p/10277217.html


The Transformer is a new architecture for machine translation, proposed in the 2017 paper "Attention Is All You Need", that can replace the traditional RNN and CNN. Both RNNs and CNNs have drawbacks when dealing with NLP tasks: the convolution inherent to a CNN is not well suited to sequential text, while an RNN cannot be parallelized and easily runs into memory limits (for example, a sentence of 50 tokens already occupies a large amount of memory).

The figure below (left) shows the structure of the Transformer: the left Nx box is the Encoder and the right Nx box is the Decoder. Compared with the common RNN + attention encoder-decoder, which only has attention between the encoder and the decoder (the upper orange box), the Transformer additionally has self-attention inside the encoder and inside the decoder (the two lower orange boxes). Every attention module is multi-head. Finally, position encoding adds the positional information that attention alone does not consider. The model is described below from three angles: multi-head attention, self-attention, and position encoding.

multi-head attention:   
the word vector's dimensions are cut into h parts, and the attention similarity is computed separately in each of the h subspaces. Since a word is mapped into a high-dimensional vector space, each group of dimensions can learn different characteristics, and the similarities found within adjacent dimensions are more meaningful than lumping the whole space together. For example, with a word vector of size 512 and h = 8, attention is computed in each 64-dimensional subspace, and the learned result is more refined.
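As a concrete illustration, here is a minimal NumPy sketch of this head-splitting. The projection matrices are random stand-ins for the learned W_Q, W_K, W_V, W_O of the paper, and the shapes follow the vector-size = 512, h = 8 example above:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, h=8):
    """Split d_model into h heads and run scaled dot-product attention per head."""
    seq_len, d_model = x.shape
    d_k = d_model // h                         # e.g. 512 / 8 = 64 dims per head
    rng = np.random.default_rng(0)
    W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * d_model ** -0.5
                          for _ in range(4))   # random stand-ins for learned weights

    def split(t):                              # (seq, d_model) -> (h, seq, d_k)
        return t.reshape(seq_len, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(x @ W_q), split(x @ W_k), split(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, seq, seq) similarities
    heads = softmax(scores) @ V                        # attention in each subspace
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                # recombine the h heads

out = multi_head_attention(np.random.default_rng(1).standard_normal((5, 512)))
print(out.shape)   # (5, 512)
```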

self-attention:   
each word position can ignore direction and distance and attend directly to every other word in the sentence when it is encoded. For example, in the sentence shown on the right of the figure above, every word has an edge connecting it to every other word in the sentence; the darker the edge, the stronger the link, and words with vaguer, more general meanings tend to have darker edges, for example: law, application, missing, opinion...
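A toy sketch of what those edges are numerically. Random vectors stand in for real word embeddings, and the learned query/key/value projections are omitted for brevity:

```python
import numpy as np

# Toy self-attention: every position attends to every other position directly,
# regardless of distance or direction. The (seq_len x seq_len) weight matrix
# plays the role of the weighted "edges" in the figure described above.
rng = np.random.default_rng(0)
tokens = ["the", "law", "will", "never", "be", "perfect"]   # illustrative sentence
d = 16
X = rng.standard_normal((len(tokens), d))        # stand-in word vectors

scores = X @ X.T / np.sqrt(d)                    # pairwise similarities in one pass
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1

# weights[i, j] is how strongly word i links to word j when it is encoded;
# a darker edge in the figure corresponds to a larger weight here.
print(np.round(weights[tokens.index("law")], 2))
```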

position encoding:   
Because the Transformer has neither the convolution of a CNN nor the recurrence of an RNN, while the order of the sequence still carries important information (for example, reordering the words of "you owe me one million, pay it back tomorrow" gives a sentence with a very different meaning), the Transformer uses sine waves to compute a token's position information, similar to a periodically varying analog signal. This periodic function can, to some extent, increase the generalization ability of the model.
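The sine wave referred to here is the fixed encoding from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A NumPy version, assuming the usual d_model = 512:

```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """Fixed position encoding: each dimension is a sine/cosine wave with a
    different period, so every position gets a unique, deterministic vector."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]             # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)      # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                       # even dims: sine
    pe[:, 1::2] = np.cos(angle)                       # odd dims: cosine
    return pe

pe = sinusoidal_position_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512): one position vector per token position
```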

BERT, by contrast, trains a position embedding directly to retain position information: a randomly initialized vector for each position is added to the model and trained along with it, and the embedding finally obtained contains the position information (simple and crude..). As for how this position embedding is combined with the word embedding, BERT simply adds the two together.
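A minimal PyTorch sketch of this BERT-style alternative (a hypothetical module with illustrative sizes, not BERT's actual code): the position table is randomly initialized, trained with everything else, and added element-wise to the word embedding.

```python
import torch
import torch.nn as nn

class LearnedPositionEmbedding(nn.Module):
    """Learned position information: one trainable vector per position."""
    def __init__(self, vocab_size=30522, max_len=512, d_model=768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)   # word embedding table
        self.pos = nn.Embedding(max_len, d_model)      # learned position table

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # combine by simple element-wise addition
        return self.tok(token_ids) + self.pos(positions)[None, :, :]

emb = LearnedPositionEmbedding()
out = emb(torch.randint(0, 30522, (2, 10)))
print(out.shape)   # torch.Size([2, 10, 768])
```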
