Transformer and Multi-Head Attention Notes: Attention Is All You Need

Original paper

Central idea of the paper: the authors propose a model built solely on the attention mechanism. The model does not incorporate any RNN or CNN; an attention-only encoder-decoder turns out to be both a powerful and an efficient model.

 

Introduction and Background

After the attention mechanism was introduced, many improved models followed. These models generally combine the attention mechanism with recurrent neural networks (including variants such as LSTM). Their drawback is weak parallelism, since recurrent computation must proceed step by step. To solve this problem, the paper presents a model based only on attention, which offers strong parallel computation as well as good results.

Model structure

The overall structure of the model is an encoder-decoder. The encoder maps an input symbol sequence (x_1, x_2, ..., x_n) to a continuous representation z = (z_1, ..., z_n). Given z, the decoder then generates an output symbol sequence (y_1, ..., y_m), one element at a time.
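As a rough illustration of this encode-then-generate loop, here is a minimal Python sketch. The encode and decode_step functions are hypothetical stand-ins for the encoder and decoder stacks described below; they are not code from the paper.

```python
# Minimal sketch of the encoder-decoder generation loop described above.
# `encode` and `decode_step` are hypothetical placeholder functions.

def generate(encode, decode_step, x, bos_id, eos_id, max_len=50):
    """Greedy autoregressive decoding: the encoder maps the input sequence x
    to a continuous representation z, then the decoder emits one symbol at a
    time, feeding its previous outputs back in."""
    z = encode(x)                    # z = (z_1, ..., z_n)
    y = [bos_id]                     # start-of-sequence symbol
    for _ in range(max_len):
        next_id = decode_step(z, y)  # predict the next symbol from z and y so far
        y.append(next_id)
        if next_id == eos_id:
            break
    return y[1:]
```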

Model structure diagram:

Encoder-Decoder:

Encoder: the encoder consists of a stack of 6 identical layers, and each layer has two sub-layers. The first is a multi-head self-attention mechanism; the second is a simple position-wise fully connected feed-forward network. A residual connection is applied around each of the two sub-layers, followed by layer normalization, so the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To make these residual connections work, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension 512.
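A minimal PyTorch sketch of this residual-plus-LayerNorm wrapper (the class name SublayerConnection and the dropout rate are illustrative choices; the paper applies dropout of 0.1 to each sub-layer output before the residual addition):

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Computes LayerNorm(x + Sublayer(x)) as described above, with dropout
    applied to the sub-layer output before the residual addition."""
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # `sublayer` is the self-attention or feed-forward function of this layer
        return self.norm(x + self.dropout(sublayer(x)))
```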

Decoder: the decoder is likewise composed of a stack of 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer that performs multi-head attention over the output of the encoder stack. As in the encoder, each sub-layer is wrapped in a residual connection followed by layer normalization. The self-attention sub-layer of the decoder is also masked, which ensures that the prediction for a given position can only depend on the outputs at earlier positions, covering up later sequence information.
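The mask mentioned above can be implemented as a lower-triangular matrix over positions; a small sketch (the function name subsequent_mask is an illustrative choice, not from the paper):

```python
import torch

def subsequent_mask(size):
    """Boolean mask where entry [i, j] is True if position i may attend to
    position j, i.e. only positions j <= i are visible."""
    return torch.tril(torch.ones(size, size)) == 1

# For a sequence of length 4, row 0 sees only position 0, row 3 sees 0..3.
print(subsequent_mask(4))
```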

Attention

Attention can be described as a function that maps a query and a set of key-value pairs to an output (for background on query, key, and value, see the paper Key-Value Memory Networks for Directly Reading Documents), where the query, keys, values, and output are all vectors. The output is a weighted sum of the values, and the weight assigned to each value is computed by a compatibility function measuring the relevance of the current query to the corresponding key.

Scaled dot-product attention

The input consists of queries and keys of dimension d_k, and values of dimension d_v.

The dot product of the query with every key is computed and divided by \sqrt{d_k} (to keep the softmax from saturating and its gradients from vanishing), and a softmax function then yields the weights on the values.

In practice, the attention is computed for a whole set of queries at once, packed together with the keys and values into matrices, so the computation becomes parallel matrix multiplication:

Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V

Dot-product attention is chosen because it can be implemented with highly optimized matrix multiplication code, which makes it very fast in practice.
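A minimal PyTorch sketch of the formula above, assuming shapes Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v); the optional mask argument is how the decoder's masking can be plugged in.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, computed for all
    queries at once as matrix products."""
    d_k = Q.size(-1)
    # compatibility scores between every query and every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # blocked positions get -inf so their softmax weight becomes 0
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # weights over the keys sum to 1
    return torch.matmul(weights, V)          # weighted sum of the values
```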

Multi-head attention

Multi-head attention simply runs several attention functions in parallel on linearly projected queries, keys, and values, and concatenates their outputs:

MultiHead(Q,K,V)=Concat(head_1,...,head_h)W^o

head_i=Attention(QW_i^Q,KW_i^K,VW_i^V)

At the same time, because the dimension of each head is reduced (d_k = d_v = d_model / h), the total computational cost is similar to that of single-head attention with full dimensionality.
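Putting the two formulas together, a PyTorch sketch of multi-head attention with the paper's defaults h = 8 and d_model = 512, so d_k = d_v = 64; fusing the per-head projections into single linear layers is an implementation choice for the sketch, not the paper's code.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where
    head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h  # reduced per-head dimension: 512 / 8 = 64
        self.h = h
        # W_i^Q, W_i^K, W_i^V for all heads fused into one d_model x d_model map each
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # the output projection W^O

    def forward(self, Q, K, V, mask=None):
        B = Q.size(0)
        # project, then split into h heads: (B, h, seq_len, d_k)
        q = self.w_q(Q).view(B, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(K).view(B, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(V).view(B, -1, self.h, self.d_k).transpose(1, 2)
        # scaled dot-product attention inside every head, in parallel
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        heads = torch.matmul(torch.softmax(scores, dim=-1), v)
        # concatenate heads back to (B, seq_len, d_model) and apply W^O
        concat = heads.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.w_o(concat)
```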

Fully connected feedforward network

The fully connected feed-forward networks in the Transformer all have the same form: two linear transformations with a ReLU activation in between.

FFN(x)=max(0,xW_1+b_1)W_2+b_2
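A short PyTorch sketch of this feed-forward network, using the paper's sizes d_model = 512 and inner dimension d_ff = 2048; it is applied to each position separately and identically.

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2: two linear transformations
    with a ReLU in between, applied identically at every position."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w_2(torch.relu(self.w_1(x)))
```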

Positional encoding

Because the model contains no recurrence and no convolution, some information about the relative or absolute position of the tokens must be injected for the model to make use of the order of the sequence. The paper therefore adds the following positional encodings to the inputs at the bottom of the encoder and decoder stacks.

PE_{(pos,2i)}=sin(pos/10000^{2i/d_{model}})

PE_{(pos,2i+1)}=cos(pos/10000^{2i/d_{model}})

where pos is the position and i is the dimension.
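A small sketch that builds this sinusoidal encoding table in PyTorch (the function name and the max_len parameter are illustrative); the resulting table is added to the input embeddings.

```python
import math
import torch

def positional_encoding(max_len, d_model=512):
    """Returns a (max_len, d_model) tensor with
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    # 10000^(-2i / d_model), computed in log space for numerical stability
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe
```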

 

 

 

 

 
