Paper Notes: Attention Is All You Need

The structure and characteristics of the Transformer

References:

1. Step-by-step to Transformer: In-depth analysis of the working principle (taking Pytorch machine translation as an example)

2. How do Transformers Work in NLP? A Guide to the Latest State-of-the-Art Models

1. The self-attention mechanism of the Transformer is the encoder and the decoder each modeling their own sequence internally: every position computes an attention distribution over the other positions of the same sequence to build its hidden representation.

2. The context attention is the same idea as the attention in classic Seq2Seq: it sits between encoder and decoder, with the decoder's hidden states attending to the encoder's hidden states.
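A minimal sketch of the difference, using torch.nn.MultiheadAttention purely for illustration (real models use separate attention layers with their own weights): in self-attention Q, K, V all come from one sequence; in context attention Q comes from the decoder and K, V from the encoder output.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

enc_x = torch.randn(2, 10, d_model)   # encoder states (batch, src_len, d_model)
dec_x = torch.randn(2, 7, d_model)    # decoder states (batch, tgt_len, d_model)

# Self-attention: query, key and value all come from the same sequence.
enc_self, _ = attn(enc_x, enc_x, enc_x)
# Context (encoder-decoder) attention: query from decoder, key/value from encoder output.
context, _ = attn(dec_x, enc_x, enc_x)
```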

3. Mask:

encoder: padding mask in self-attention
decoder: padding mask and sequence mask (which hides future positions) in self-attention
decoder: padding mask in context-attention (both masks are sketched below)
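A minimal sketch of the two masks, assuming pad id 0 and boolean masks that are used to fill the attention scores with -inf before the softmax (mask conventions vary between implementations):

```python
import torch

def padding_mask(seq, pad_id=0):
    # True where the token is padding; shape (batch, 1, 1, seq_len) so it
    # broadcasts over heads and query positions in the attention scores.
    return (seq == pad_id)[:, None, None, :]

def sequence_mask(size):
    # Upper-triangular mask that hides future positions (True = masked),
    # used only in decoder self-attention.
    return torch.triu(torch.ones(size, size), diagonal=1).bool()

seq = torch.tensor([[5, 7, 9, 0, 0]])          # pad id assumed to be 0
scores = torch.randn(1, 1, 5, 5)               # (batch, heads, q_len, k_len)
masked = scores.masked_fill(padding_mask(seq) | sequence_mask(5), float("-inf"))
```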

4. Embedding:

  1. word embedding
  2. position embedding: encodes the position of each word in the sequence (sinusoidal in the paper; sketched below)
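A sketch of the sinusoidal position encoding from the paper, added to the word embeddings before the first layer (d_model = 512 here is just the paper's base setting):

```python
import math
import torch

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Input to the first layer: word embedding + position embedding.
emb = torch.nn.Embedding(10000, 512)
tokens = torch.randint(0, 10000, (2, 20))
x = emb(tokens) + positional_encoding(20, 512)
```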

5. LayerNorm: the mean and variance are computed over the d_model dimension of each position and used to normalize it (example below).
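A small check of what that means, comparing a manual normalization over the last dimension with torch.nn.LayerNorm (which also has a learnable scale and bias, initialized to 1 and 0):

```python
import torch

x = torch.randn(2, 10, 512)                    # (batch, seq_len, d_model)

# Mean and variance are taken over the last (d_model) dimension of each position.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + 1e-5)

ln = torch.nn.LayerNorm(512)
assert torch.allclose(ln(x), manual, atol=1e-5)
```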

6. multi-head self-attention layer:

scaled dot-product attention: the QK^T scores are divided by sqrt(d_k) so the softmax is not pushed into its saturated region, which alleviates the vanishing-gradient problem
multi-head: the Q, K, V mapping matrices are initialized as several independent linear projections, and attention is run on each projection in parallel (sketch below)
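A minimal sketch of both pieces, assuming d_model = 512 and h = 8 as in the base model (hypothetical class names, not the paper's reference code):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    # The 1/sqrt(d_k) scaling keeps the logits small so the softmax does not saturate.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.h, self.d_k = h, d_model // h
        # One linear map per role (Q, K, V) plus the output projection;
        # the h heads are a reshape of these projections, run in parallel.
        self.w_q, self.w_k, self.w_v, self.w_o = (
            torch.nn.Linear(d_model, d_model) for _ in range(4))

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        split = lambda x: x.view(b, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        out = scaled_dot_product_attention(q, k, v, mask)
        return self.w_o(out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k))
```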

7. Position-wise feed forward: a fully connected network (two linear layers) applied to every position independently, with ReLU as the activation function (example below).
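A sketch with the paper's base sizes (d_model = 512, inner dimension 2048):

```python
import torch.nn as nn

# FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently.
feed_forward = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)
```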

8. Residual connection: its purpose is to alleviate the vanishing-gradient problem, since the identity path lets gradients flow straight back (sketch below).
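A sketch of the residual-plus-LayerNorm wrapper used around every sub-layer, following the paper's post-norm order LayerNorm(x + Sublayer(x)); the dropout rate is the paper's default, and some later implementations normalize before the sub-layer instead:

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    # output = LayerNorm(x + Dropout(Sublayer(x)))
    # The identity path x + ... is what mitigates vanishing gradients.
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))
```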

9. Structurally:

encoder: multi-head self-attention + feed forward, each sub-layer wrapped in a residual connection and LayerNorm
decoder: multi-head self-attention + multi-head context attention + feed forward, each sub-layer wrapped in a residual connection and LayerNorm
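Putting the previous sketches together (this reuses the hypothetical MultiHeadAttention and SublayerConnection classes from above), one encoder layer and one decoder layer could look like:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    # self-attention -> feed forward, each wrapped in residual + LayerNorm.
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.sub1, self.sub2 = SublayerConnection(d_model), SublayerConnection(d_model)

    def forward(self, x, pad_mask=None):
        x = self.sub1(x, lambda x: self.self_attn(x, x, x, pad_mask))
        return self.sub2(x, self.ff)

class DecoderLayer(nn.Module):
    # masked self-attention -> context attention over the encoder output ->
    # feed forward, each wrapped in residual + LayerNorm.
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)
        self.context_attn = MultiHeadAttention(d_model, h)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.subs = nn.ModuleList(SublayerConnection(d_model) for _ in range(3))

    def forward(self, x, enc_out, self_mask=None, ctx_mask=None):
        x = self.subs[0](x, lambda x: self.self_attn(x, x, x, self_mask))
        x = self.subs[1](x, lambda x: self.context_attn(x, enc_out, enc_out, ctx_mask))
        return self.subs[2](x, self.ff)
```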

10. Limitation:

When the data is segmented, the sequence is cut into fixed-length segments, so contextual information may be lost across segment boundaries.
Transformer-XL improves on this by reusing the hidden states of the previous segment as extra context when training on the current segment. My understanding, however, is that this makes Transformer-XL behave partly like an RNN again: the segments can only be processed in temporal order, and the next segment cannot start training before the previous one has finished, which reduces the parallelism of the model (a simplified sketch of the idea follows).
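A very simplified sketch of that segment-level recurrence, reusing the hypothetical MultiHeadAttention sketch above; the cached states are detached so no gradient crosses segment boundaries, and Transformer-XL's relative positional encoding is omitted entirely:

```python
import torch

def attend_with_memory(attn, x, memory=None):
    # Hidden states cached from the previous segment are prepended as extra
    # keys/values; .detach() stops gradients from crossing the segment boundary.
    context = x if memory is None else torch.cat([memory.detach(), x], dim=1)
    out = attn(x, context, context)   # queries come from the current segment only
    return out, x                     # x is cached as memory for the next segment
```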

Source: blog.csdn.net/jxsdq/article/details/105817669