Self-attention mechanism and transformer


1. Self-attention

Attention is essentially a set of weights: the matrix of weights that maps the source states S to the encoding vector C!
Simplifying further:
Compute the attention scores and normalize them with a Soft-max layer (other activation functions can also be used). The next step is to use the attention scores to extract information, producing for each position a vector that takes the entire context into account.
Now let's look at the same process written as matrix operations.
The process of computing the attention scores in matrix form (each input vector is a column of the input matrix I): the queries, keys and values are Q = W^q I, K = W^k I, V = W^v I; the score matrix is A = K^T Q, which is normalized with softmax to give A'; the output is then O = V A'.
Summary: the only parameters that need to be learned during training are the matrices W^q, W^k and W^v.
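For concreteness, here is a minimal single-head sketch of this computation in PyTorch. The dimensions and variable names are illustrative assumptions, and tokens are stored as rows, so the matrices appear transposed relative to the column-per-token notation used above.

```python
# Minimal single-head self-attention sketch (illustrative sizes and names).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, d_k = 4, 8, 8          # 4 input vectors a^1..a^4
I = torch.randn(seq_len, d_model)        # input matrix I (one row per token)

# The only learned parameters: W^q, W^k, W^v
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

Q, K, V = I @ W_q, I @ W_k, I @ W_v      # queries, keys, values
A = Q @ K.T / d_k ** 0.5                 # attention scores (scaled dot products)
A_prime = F.softmax(A, dim=-1)           # normalize each row with softmax
O = A_prime @ V                          # each output row mixes the whole context

print(O.shape)                           # torch.Size([4, 8])
```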

2. Positional Encoding

The vectors obtained above do not take word position into account (for example, a verb usually appears in the middle of a sentence). Therefore a positional vector e^i is introduced for each position and added to the corresponding input vector.
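A minimal sketch of this idea, assuming a learned embedding table that supplies one e^i per position (sizes and names are illustrative):

```python
# Sketch: add a positional vector e^i to each input vector a^i before self-attention.
import torch
import torch.nn as nn

seq_len, d_model = 4, 8
a = torch.randn(seq_len, d_model)             # a^1 ... a^4
pos_emb = nn.Embedding(seq_len, d_model)      # one learnable e^i per position
e = pos_emb(torch.arange(seq_len))            # e^1 ... e^4
x = a + e                                     # x^i = a^i + e^i is fed to self-attention
```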

3. Self-attention and CNN


4. Self-attention and RNN

RNN consumes more memory and is more complex; self-attention can be computed in parallel.

Similarities and differences:

  1. Sequence modeling ability: RNN is a classic sequence model that can handle variable-length input sequences and is suitable for sequence tasks such as natural language processing (NLP). Transformer is also suitable for sequential tasks, but it works better for long sequences, thanks to the parallelism of the self-attention mechanism and the ability to process long-range dependencies.
  2. Structure: RNN is a progressively iterative network structure that accepts the current input and the hidden state of the previous time step at each time step, and calculates the current state based on the current input and the previous state. Transformer is an attention-based network that does not depend on the order of time steps and can process the entire sequence in parallel. Transformer uses a self-attention mechanism to capture dependencies between different positions in a sequence.
  3. Parameter sharing: In RNNs, parameter sharing is achieved by using the same parameters at different time steps to capture patterns and long-term dependencies in sequences. The self-attention mechanism in Transformer also allows information to be shared between different positions, and Transformer further enhances the ability of parameter sharing through multi-head attention.
  4. Handling long-range dependencies: RNNs can suffer from vanishing or exploding gradients when dealing with long-range dependencies, which leads to poor modeling of longer sequences. Transformer introduces a self-attention mechanism, which enables it to better handle long-range dependencies without being affected by the gradient problem.
  5. Parallelism: Due to the sequential computation nature of RNN, it is difficult to parallelize it. In contrast, the self-attention mechanism in Transformer can compute the attention weights between different positions in parallel, thus improving the efficiency of training and inference.
  6. Pre-trained models: Transformer's attention mechanism has achieved remarkable success in NLP and has given rise to various pre-trained models, such as BERT and GPT. These pre-trained models perform well across NLP tasks and have become the mainstream models in the field today.

Overall, RNN and Transformer are two different approaches to sequence modeling. RNN is a classic sequence model suitable for shorter sequences, but it can be limited when dealing with long sequences and long-range dependencies. Transformer introduces the self-attention mechanism, handles long sequences and long-range dependencies better, and offers much higher parallelism. Its success rests on its remarkable results in NLP and the variety of powerful pre-trained models derived from it.
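To make the parallelism point concrete, here is a small sketch contrasting the two computation patterns (PyTorch, with illustrative sizes; the attention part uses no learned projections, just to show the shape of the computation):

```python
# RNN walks through the sequence step by step; self-attention scores every
# pair of positions with one matrix product.
import torch
import torch.nn as nn

seq_len, d = 16, 32
x = torch.randn(1, seq_len, d)                # (batch, time, features)

# RNN: a Python-level loop over time steps (sequential by nature)
rnn_cell = nn.RNNCell(d, d)
h = torch.zeros(1, d)
for t in range(seq_len):
    h = rnn_cell(x[:, t, :], h)               # step t depends on the result of step t-1

# Self-attention: all pairwise scores at once, no dependence between positions
scores = x @ x.transpose(1, 2) / d ** 0.5     # (1, seq_len, seq_len)
out = torch.softmax(scores, dim=-1) @ x       # every position computed in parallel
```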

5. GNN


Two. Transformer

1. Seq2seq: applicable to almost everything


1) Encoder

(1) Attention mechanism

Then the multi-head attention mechanism is added, that is, several sets of Q, K, V matrices (which, among other things, reduces the influence of the initial values of the Q, K, V projections); the Z produced by each head is concatenated and passed through a linear projection to obtain the final Z.
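A hedged sketch of multi-head self-attention along these lines (each head gets its own slice of larger W^q, W^k, W^v projections, and a final linear layer W^o combines the heads; all sizes and names are illustrative assumptions):

```python
# Multi-head self-attention sketch: split into heads, attend per head,
# concatenate, and mix with a final linear projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=64, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)   # combines the heads

    def forward(self, x):                     # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        # project, then split the last dimension into (heads, d_k)
        q = self.W_q(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        k = self.W_k(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        v = self.W_v(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (b, h, n, n)
        z = F.softmax(scores, dim=-1) @ v                    # (b, h, n, d_k)
        z = z.transpose(1, 2).reshape(b, n, self.h * self.d_k)
        return self.W_o(z)                                   # final Z

mha = MultiHeadSelfAttention()
print(mha(torch.randn(2, 10, 64)).shape)      # torch.Size([2, 10, 64])
```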

The Transformer Encoder uses exactly this self-attention structure.
The overall structure:
Explaining each block in detail: the "Norm" here is layer normalization, which works better than batch normalization for this kind of sequence model.
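One encoder block as described above might look like the following sketch: self-attention with a residual connection and layer normalization, followed by a position-wise feed-forward network with its own residual and normalization (post-norm variant; sizes are illustrative assumptions):

```python
# One Transformer encoder block: self-attention + Add & Norm, then FFN + Add & Norm.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=64, num_heads=8, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)   # layer norm, not batch norm
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)     # self-attention: q = k = v = x
        x = self.norm1(x + attn_out)         # residual ("add") + norm
        x = self.norm2(x + self.ffn(x))      # residual + norm around the FFN
        return x

layer = EncoderLayer()
print(layer(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```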


(2) Positional encoding

Transformer's positional encoding is a special encoding used to add the position information of the input sequence to the embedding vectors. Without positional encoding, the Transformer model cannot distinguish words at different positions in the input sequence, because the embedding vectors contain only the semantic information of the words, not their order. The role of positional encoding is to provide a unique encoding for each position in the input sequence, so that the Transformer can distinguish words at different positions and capture the order information of the sequence.
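The original Transformer paper uses a fixed sinusoidal scheme, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), which gives every position a unique pattern. A small sketch (function name and sizes are illustrative):

```python
# Fixed sinusoidal positional encoding; it is added to the word embeddings.
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angle = pos / (10000 ** (i / d_model))                          # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                                  # even dims: sin
    pe[:, 1::2] = torch.cos(angle)                                  # odd dims: cos
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=64)
# embeddings = word_embeddings + pe[:seq_len]   # added, not concatenated
```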

2) Decoder

(1) Autoregressive

Specific network structure:
Note: the first attention block in the decoder is not plain self-attention but masked self-attention, because decoding produces the translated vectors one at a time, so each position is only allowed to attend to the positions before it.
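A sketch of the masking step: the scores for future positions are set to minus infinity before the softmax, so each query can only attend to itself and earlier positions (shapes and names are illustrative assumptions):

```python
# Masked (causal) self-attention sketch.
import torch
import torch.nn.functional as F

seq_len, d_k = 4, 8
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)

scores = q @ k.T / d_k ** 0.5
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float('-inf'))   # hide future positions
out = F.softmax(scores, dim=-1) @ v                       # masked self-attention output
```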

(2) Non-autoregressive (NAT)

The advantage of NAT is that it decodes the entire output sequence in a single parallel step.

3) Connection between Encoder and Decoder

To expand: in the cross-attention block, the query q comes from the Decoder, while the keys k and values v come from the Encoder.
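A sketch of this cross-attention wiring using PyTorch's nn.MultiheadAttention (the module choice and the sizes are illustrative assumptions):

```python
# Cross-attention: q from the decoder, k and v from the encoder output.
import torch
import torch.nn as nn

d_model, num_heads = 64, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

encoder_output = torch.randn(2, 12, d_model)   # (batch, src_len, d_model)
decoder_state  = torch.randn(2, 5, d_model)    # (batch, tgt_len, d_model)

out, attn_weights = cross_attn(query=decoder_state,
                               key=encoder_output,
                               value=encoder_output)
print(out.shape)                               # torch.Size([2, 5, 64])
```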

Training: to avoid the slowdown of the decoder having to produce word vectors one by one, the correct answer (ground truth) is fed to the decoder as its input during training. This method is called "Teacher Forcing".
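A sketch of one teacher-forcing training step: the ground-truth target, shifted right behind a start token, becomes the decoder input. The model call, token ids and tensor sizes here are illustrative assumptions:

```python
# Teacher forcing: the decoder input is the ground truth shifted right by one.
import torch
import torch.nn.functional as F

BOS = 1                                          # assumed id of the start token
src = torch.randint(2, 100, (8, 12))             # (batch, src_len) source token ids
tgt = torch.randint(2, 100, (8, 10))             # (batch, tgt_len) ground-truth ids

decoder_input = torch.cat([torch.full((8, 1), BOS), tgt[:, :-1]], dim=1)  # shift right

# In a real model (hypothetical `model`):
# logits = model(src, decoder_input)             # (8, 10, vocab_size)
# loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
# loss.backward()
```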

2. Training tips

(1) Copy Mechanism: copy specific tokens (e.g. proper nouns) directly from the input
(2) Summarization
(3) Guided Attention
(4) Beam Search

Three. Code details

1. Input and output

Note that "I LOVE YOU E" is the label for the entire output, while "S I LOVE YOU" is the input to the Decoder (S and E being the start and end tokens); the mask shown below is needed so that each word cannot see the words that come after it!


mask:
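Putting the example together as a sketch (the vocabulary and ids are illustrative assumptions): "S I LOVE YOU" is the decoder input, "I LOVE YOU E" is the label, and the subsequent mask hides every word that comes after the current position:

```python
# Decoder input, label, and subsequent (look-ahead) mask for the example above.
import torch

vocab = {'S': 0, 'E': 1, 'I': 2, 'LOVE': 3, 'YOU': 4}
decoder_input = torch.tensor([[vocab['S'], vocab['I'], vocab['LOVE'], vocab['YOU']]])
target        = torch.tensor([[vocab['I'], vocab['LOVE'], vocab['YOU'], vocab['E']]])

seq_len = decoder_input.size(1)
subsequent_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
print(subsequent_mask)
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```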

Four. Summary

Transformer has the following outstanding advantages:

  1. Self-attention mechanism: Transformer introduces a self-attention mechanism, which allows the model to dynamically pay attention to information at different positions when processing sequence data. This enables the model to better handle long sequences, capture global dependencies, and avoid the gradient vanishing and gradient explosion problems that exist in traditional recurrent neural networks (RNN).

  2. Parallel computation: The attention computation in Transformer can be processed in parallel, because the output at every position is computed directly from the inputs at all positions rather than from the output of a previous time step. This enables Transformer to better utilize the parallelism of hardware devices and accelerates training and inference.

  3. Multi-head attention: Transformer uses a multi-head attention mechanism, allowing the model to learn multiple representations from different attention heads simultaneously. Each attention head can focus on different parts of the input sequence, which enhances the expressiveness and generalization ability of the model.

  4. Positional encoding: To handle the sequential information of sequence data, Transformer uses positional encoding to encode each position in the input sequence. This enables the model to not only focus on the semantic information of words when processing sequences, but also understand the position and order of words in the sequence.

  5. Scaled dot-product attention: Transformer uses scaled dot-product attention, dividing the attention scores by the square root of the key dimension before the softmax. This keeps the scores in a numerically well-behaved range and stabilizes gradients, which helps the model handle longer sequences reliably (the cost of attention itself still grows roughly quadratically with sequence length).

  6. Transfer learning: Due to Transformer's excellent performance in NLP, pre-trained Transformer models (such as BERT, GPT, etc.) have become powerful tools for learning language representations. These pre-trained models can be adapted to various NLP tasks through fine-tuning, achieving good performance with only a small amount of labeled data.
