Central idea of the paper: the authors propose a model built entirely on the attention mechanism, with no recurrent or convolutional components. The resulting encoder-decoder architecture is both powerful and efficient.
Introduction and Background
Since attention mechanisms first appeared, many improved models have been built on them. These models generally combine attention with recurrent neural networks (including variants such as LSTM), and they share a drawback: they parallelize poorly, because recurrence forces sequential computation. To address this, the paper proposes a model based solely on attention, which offers strong parallelism as well as good results.
Model structure
The overall structure is an encoder-decoder. The encoder maps an input sequence of symbols to a sequence of continuous representations z. Given z, the decoder then generates an output sequence of symbols, one element at a time.
Model structure diagram:
Encoder and decoder:
Encoder: the encoder is a stack of 6 identical layers, each with two sub-layers. The first is a multi-head self-attention mechanism; the second is a simple position-wise fully connected feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization: the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers and the embedding layers produce outputs of dimension 512.
Decoder: the decoder is also a stack of 6 identical layers. In addition to the two sub-layers found in each encoder layer, each decoder layer inserts a third sub-layer that performs multi-head attention over the output of the encoder. As in the encoder, each sub-layer is wrapped in a residual connection followed by layer normalization. The decoder's self-attention sub-layer is also masked so that each position can only attend to earlier positions, ensuring the prediction at position i depends only on outputs already generated.
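The residual-plus-normalization pattern around each sub-layer can be sketched in a few lines of NumPy. This is a minimal illustration of LayerNorm(x + Sublayer(x)), not the paper's code; the function names and the toy sub-layer are my own.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's feature vector to zero mean, unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    # output of each sub-layer: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

# toy input: 3 positions, model dimension 4; toy sub-layer doubles its input
x = np.ones((3, 4))
out = sublayer_connection(x, lambda t: t * 2.0)
```

Because the residual connection adds the sub-layer's output to its input, every sub-layer (and the embeddings) must share the same output dimension, which is why the paper fixes it at 512.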
Attention
Attention is a function that maps a query and a collection of key-value pairs (for background on queries, keys, and values, see the paper Key-Value Memory Networks for Directly Reading Documents) to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of the values, and the weight assigned to each value is computed by a compatibility function measuring the relevance of the current query to the corresponding key.
Scaled dot-product attention
The inputs are queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all keys, divide each by sqrt(d_k) (to keep large dot products from pushing the softmax into regions with vanishingly small gradients), and apply a softmax to obtain the weights on the values: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
In practice, the attention function is computed for a whole set of queries at once by packing them into a matrix, so the computation is fully parallel.
Dot-product attention is chosen because it can be implemented with highly optimized matrix-multiplication routines, making it very fast in practice.
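The whole computation above fits in a few lines of NumPy. This is a minimal sketch of scaled dot-product attention (function names and the toy matrices are my own, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) compatibilities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of the values

# toy example: 2 queries, 3 key-value pairs, d_k = 2, d_v = 1
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0], [2.0], [3.0]])
out = scaled_dot_product_attention(Q, K, V)
```

Note that the only heavy operations are two matrix multiplications and a softmax, which is exactly why this formulation parallelizes so well.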
Multi-head attention
Multi-head attention is essentially several attention functions run in parallel: the queries, keys, and values are linearly projected h times with different learned projections, attention is applied to each projection, and the outputs are concatenated and projected once more.
Because each head works in a reduced dimension, the total computational cost is similar to that of single-head attention with full dimensionality.
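The project-attend-concatenate pattern can be sketched as follows. This is a simplified illustration with random (untrained) projection matrices and self-attention only; the names and toy dimensions are my own, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(X, h=4, d_model=8):
    d_k = d_model // h  # reduced per-head dimension keeps total cost similar
    heads = []
    for _ in range(h):
        # each head gets its own learned projections (random here for the sketch)
        Wq = rng.standard_normal((d_model, d_k))
        Wk = rng.standard_normal((d_model, d_k))
        Wv = rng.standard_normal((d_model, d_k))
        heads.append(attention(X @ Wq, X @ Wk, X @ Wv))
    Wo = rng.standard_normal((h * d_k, d_model))  # final output projection
    return np.concatenate(heads, axis=-1) @ Wo

X = rng.standard_normal((5, 8))  # 5 positions, d_model = 8
out = multi_head_attention(X)
```

Each head can learn to attend to a different kind of relationship between positions, which a single averaged attention distribution cannot.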
Position-wise feed-forward network
In the Transformer, the fully connected networks are all the same: two linear transformations with a ReLU activation in between, applied identically at every position.
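The two-linear-layers-with-ReLU structure is a one-liner in NumPy. This is a minimal sketch (names and toy dimensions are my own, not the paper's; the paper uses an inner dimension larger than the model dimension):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # two linear transformations with a ReLU in between:
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# toy dimensions: model dim 4, inner dim 8, applied to 3 positions
rng = np.random.default_rng(1)
d_model, d_ff = 4, 8
W1 = rng.standard_normal((d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model))
b2 = np.zeros(d_model)
x = rng.standard_normal((3, d_model))
out = feed_forward(x, W1, b1, W2, b2)
```

"Position-wise" means the same weights are applied independently at every position, so the output has the same shape as the input.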
Positional encoding
Because the model uses no convolutional or recurrent structure, relative or absolute position information must be injected for the model to make use of the order of the sequence. The paper therefore adds positional encodings to the inputs at the bottom of the encoder and decoder stacks.
The encodings are sinusoids of different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), where pos is the position and i is the dimension index.