[Paper notes] Attention is all you need

Before reading this paper, it helps to go through a detailed description of self-attention and a more comprehensive summary of the Transformer. With that foundation of self-attention, the paper feels easy to follow.
Paper: Attention is all you need.

1-2 Introduction & Background

RNN: the inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
Workarounds (only partial solutions, because the fundamental constraint of sequential computation remains):

  1. factorization tricks.
  2. conditional computation.

CNNs have also been used as the basic building block, e.g. ByteNet and ConvS2S, but this makes it more difficult to learn dependencies between distant positions (the amount of computation grows with the distance between input and output positions).

History:

  • seq2seq / encoder-decoder: traditionally implemented with RNNs.
  • RNN / LSTM / GRU: direction can be unidirectional or bidirectional; depth can be single- or multi-layer. Limits: RNNs struggle with long sequences, cannot be parallelized, and have alignment problems; all necessary information from the source sentence must be compressed into a fixed-length vector.
  • CNN: parallel computation. Limits: longer sequences require a lot of memory and many tricks; with large amounts of data, parameter tuning is not easy.
  • Attention mechanism: attends to a subset of the input and solves the alignment problem.

Points mentioned:

  • Self-attention
  • Recurrent attention mechanism
  • Transduction models

3 Model Architecture

Most competitive sequence transduction models have an encoder-decoder structure:
input sequence: x = (x1, ..., xn), n elements
encoder output, a continuous representation: z = (z1, ..., zn), n elements
decoder output: y = (y1, ..., ym), m elements
The output is generated one element at a time; the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
(Figure: Transformer model architecture)

The overall architecture of the Transformer follows this pattern, using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.

3.1 Encoder and Decoder Stacks

(Figure: Transformer structure)

Encoder: a stack of N = 6 identical layers. Each layer has two sub-layers (from bottom to top):

  1. multi-head self-attention mechanism.
  2. a simple, position-wise fully connected feed-forward network (referred to as FFNN below).

(Figure: encoder layer)
We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)).

(Figure: residual connection)

All sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.
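As a minimal NumPy sketch of the sub-layer wrapper described above, assuming a generic `sublayer` function and omitting the learned layer-norm gain and bias for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    # Output of each sub-layer: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

# Example: wrap a placeholder sub-layer around a (seq_len, d_model) input.
x = np.random.randn(10, 512)
print(residual_block(x, lambda h: h).shape)  # (10, 512)
```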

Decoder: also a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer (from bottom to top):

  1. masked multi-head self-attention sub-layer
  2. encoder-decoder attention (multi-head attention over the output of the encoder stack)
  3. FFNN

(Figure: decoder, hand-drawn sketch)

masking ensures that the predictions for position i can depend only on the known outputs at positions less than i.

Each sub-layer again uses a residual connection followed by layer normalization (residual connections around each of the sub-layers).

3.2 Attention

The attention mechanism maps a query and a set of key-value pairs to an output (all of them vectors). The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

The attention used in this paper:

  • Scaled Dot-Product Attention
  • Multi-Head Attention

(Figure: scaled dot-product attention)

3.2.1 Scaled Dot-Product Attention

The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the query with all keys, divide each by sqrt(dk), and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V.
Attention(Q, K, V) = softmax(QK^T / sqrt(dk)) V

PS: the two most commonly used attention functions are additive attention and dot-product attention. Dot-product attention is faster and more space-efficient in practice.

When dk is small the two perform similarly, but for large dk the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients; therefore we scale the dot products by 1/sqrt(dk).
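As a minimal NumPy sketch of scaled dot-product attention as described above (the shapes and the optional boolean mask argument are my own illustrative choices, not prescribed by the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # compatibility of each query with each key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # mask out illegal connections
    weights = softmax(scores, axis=-1)         # attention weights over the values
    return weights @ V                         # weighted sum of the values

# Example: 4 queries, 6 keys/values, d_k = d_v = 64.
Q, K, V = np.random.randn(4, 64), np.random.randn(6, 64), np.random.randn(6, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)
```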

3.2.2 Multi-Head Attention

With multi-head attention, we maintain a separate set of Q/K/V weight matrices for each head, resulting in different Q/K/V matrices per head. As before, each Q/K/V matrix is obtained by multiplying the input X by the corresponding WQ/WK/WV matrix.
This applies to the queries, keys, and values alike.
If we perform the self-attention calculation outlined above eight times with eight different sets of weight matrices, we end up with eight different Z matrices.
In other words, each attention head produces a different Z matrix.

The paper projects the queries, keys, and values h times with different learned projections to dk, dk, and dv dimensions, applies scaled dot-product attention to each projection in parallel, concatenates the results, and produces the final output with one more linear projection. Multi-head attention lets the model attend to information from different representation subspaces.

These are concatenated and once again projected, resulting in the final values.
MultiHead(Q, K, V) = Concat(head1, ..., headh) WO, where headi = Attention(Q WiQ, K WiK, V WiV)
The 8 different Z matrices are concatenated into a single matrix, which is multiplied by an additional weight matrix WO to obtain the result matrix Z. This matrix captures information from all the attention heads, and it is this matrix that is fed to the FFNN.
(Figure: multi-head attention)

multi-head attention points:

  • employ h = 8 parallel attention layers, or heads.
  • For each of these we use dk = dv = dmodel / h = 64 (dmodel is the overall model dimension).
  • Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
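As a small sketch of multi-head attention, reusing the `scaled_dot_product_attention` function from the 3.2.1 sketch (the random parameter initialization and shapes are illustrative assumptions, not the paper's setup):

```python
import numpy as np

d_model, h = 512, 8
d_k = d_v = d_model // h   # 64 per head

rng = np.random.default_rng(0)
# One projection matrix per head for Q, K, V, plus the final output projection W_O.
W_Q = rng.standard_normal((h, d_model, d_k)) * 0.02
W_K = rng.standard_normal((h, d_model, d_k)) * 0.02
W_V = rng.standard_normal((h, d_model, d_v)) * 0.02
W_O = rng.standard_normal((h * d_v, d_model)) * 0.02

def multi_head_attention(Q_in, K_in, V_in):
    heads = []
    for i in range(h):
        # Project into head i's subspace, then run scaled dot-product attention.
        heads.append(scaled_dot_product_attention(Q_in @ W_Q[i], K_in @ W_K[i], V_in @ W_V[i]))
    # Concatenate the h head outputs and project once again with W_O.
    return np.concatenate(heads, axis=-1) @ W_O

x = rng.standard_normal((10, d_model))      # (seq_len, d_model)
print(multi_head_attention(x, x, x).shape)  # self-attention: (10, 512)
```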

3.2.3 Applications of attention in the model

The Transformer uses multi-head attention in three different ways:

  1. In “encoder-decoder attention” layers

    The queries come from the previous decoder layer, while the keys and values come from the output of the encoder.
    The encoder-decoder attention layer works like multi-head self-attention, except that it creates its query matrix from the layer below it and takes the key and value matrices from the output of the encoder stack.

  2. The encoder's self-attention layers.

    Each position in the encoder can attend to all positions in the previous layer of the encoder
    The input Q, K, and V are all the same (input embedding plus positional embedding).

  3. The decoder's self-attention layers.

    We implement this inside scaled dot-product attention by masking out (setting to -∞) all values in the input of the softmax which correspond to illegal connections.
    In the decoder's self-attention layer, each position may only attend to positions up to and including the current one.

Those are the three ways multi-head attention is used in the model.
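As a small sketch of the causal (look-ahead) mask used in the decoder's self-attention, in the boolean-mask convention of the 3.2.1 sketch (True = attention allowed):

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may attend to positions <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```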

3.3 Position-wise Feed-Forward Networks

Each layer of the encoder and decoder contains a fully connected feed-forward network, applied to each position separately and identically. It consists of two linear transformations (two dense layers) with a ReLU activation in between, and can also be viewed as two 1D convolutions with kernel size 1. The hidden size changes 512 -> 2048 -> 512.

FFN(x) = max(0, xW1 + b1)W2 + b2

The parameters differ from layer to layer.
The position-wise feed-forward network is essentially an MLP.
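As a minimal NumPy sketch of the position-wise feed-forward network above (random parameters, purely for illustration):

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    # Two linear transformations with a ReLU in between,
    # applied to each position independently: 512 -> 2048 -> 512.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((10, d_model))  # (seq_len, d_model)
print(ffn(x).shape)                     # (10, 512)
```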

3.4 Embeddings and Softmax

We use learned embeddings to convert the input tokens and output tokens to vectors of dimension dmodel. We share the same weight matrix between the two embedding layers and the pre-softmax linear transformation. In the embedding layers, we multiply those weights by sqrt(dmodel).
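As a tiny sketch of the shared embedding / pre-softmax weight matrix and the sqrt(dmodel) scaling (the vocabulary size and token ids below are made up for illustration):

```python
import numpy as np

vocab_size, d_model = 37000, 512
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_model)) * 0.02  # the single shared weight matrix

def embed(token_ids):
    # Embedding lookup, scaled by sqrt(d_model).
    return E[token_ids] * np.sqrt(d_model)

def pre_softmax_logits(decoder_output):
    # The same matrix E is reused as the pre-softmax linear transformation.
    return decoder_output @ E.T

tokens = np.array([3, 17, 42])
print(pre_softmax_logits(embed(tokens)).shape)  # (3, 37000)
```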

3.5 Positional Encoding

We must inject some information about the relative or absolute position of the tokens in the sequence, so "positional encodings" are added to the input embeddings at the bottoms of the encoder and decoder stacks.

The Transformer adds a positional encoding vector to each input embedding. These vectors follow a specific pattern that the model learns to exploit, which helps it determine the position of each word, or the distance between different words in the sequence.

(Figure: positional encoding)

The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed.

Positional encodings can be either learned or fixed.

Using the sine and cosine functions of different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))

where pos is the position and i is the dimension.
We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
Because the sine and cosine functions are periodic, for a fixed offset k, PE(pos + k) can be expressed as a linear transformation of PE(pos); this makes it easy for the model to learn the relative positional relationships between words.
An advantage of this method is that it can extrapolate to sequence lengths not seen during training, for example if the trained model is asked to translate sentences longer than any in the training data.

In other NLP papers, the position embedding is usually a learned vector, but there it is only an extra feature: adding it brings in some useful information, and removing it does not hurt performance much, because RNNs and CNNs can capture positional information on their own. In the Transformer, however, the positional encoding is the sole source of position information, so it is a core component of the model rather than an auxiliary feature.
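As a minimal sketch of the sinusoidal positional encoding defined by the formulas above (the maximum length and dmodel are chosen for illustration):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]       # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)            # odd dimensions: cosine
    return pe

pe = positional_encoding(100, 512)
print(pe.shape)  # (100, 512): same dimension as the embeddings, so the two can be summed
```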

4 Why Self-Attention

  1. the total computational complexity per layer.
  2. the amount of computation that can be parallelized.
  3. the path length between long-range dependencies in the network.

5 Training

5.1 Training Data and Batching

Sentence pairs; sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens.
Sentence pairs were batched together by approximate sequence length.

5.2 Hardware and Schedule

8 NVIDIA P100 GPUs.
Base models: 100,000 training steps (12 hours).
Big models: 300,000 training steps (3.5 days).

5.3 Optimizer

Adam optimizer with β1 = 0.9, β2 = 0.98, and ε = 10^-9.
The learning rate is varied over the course of training according to the following formula:
lrate = dmodel^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5)), with warmup_steps = 4000
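As a sketch of this schedule as a plain function (warmup_steps = 4000 as in the paper; the step values below are just examples):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 100000):
    print(s, transformer_lr(s))  # rises linearly during warmup, then decays as 1/sqrt(step)
```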

5.4 Regularization

Residual dropout: dropout is applied to the output of each sub-layer, and to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. This paper uses Pdrop = 0.1.
(That is, residual dropout is used at the output of each sub-layer and at the embedding + positional encoding sums of the encoder and decoder stacks.)

Label smoothing is also applied to the targets during training (εls = 0.1).
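As a small sketch of what label smoothing does to the target distribution (the helper name and shapes are illustrative; εls = 0.1):

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    # Instead of a one-hot target, put (1 - eps) on the true token and
    # spread eps uniformly over the remaining vocabulary entries.
    n = len(target_ids)
    dist = np.full((n, vocab_size), eps / (vocab_size - 1))
    dist[np.arange(n), target_ids] = 1.0 - eps
    return dist

print(smooth_labels(np.array([2, 0]), vocab_size=5))
```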
(Table: Transformer results)

6 Results

6.1 Machine Translation

Model variations for machine translation:
(Table: machine translation model variations and parameters)

We vary the number of attention heads and the attention key and value dimensions. Findings:

  1. reducing the attention key size dk hurts model quality
  2. a more sophisticated compatibility function than dot product may be beneficial
  3. bigger models are better
  4. dropout is very helpful in avoiding over-fitting
  5. replacing the sinusoidal positional encoding with learned positional embeddings gives nearly identical results.