[Notes] Transformer architecture (Attention is all you need)

Attention Is All You Need



Development line: MLP → RNN → Seq2Seq / encoder-decoder architecture → attention mechanism → self-attention → Transformer

MLP (Multi-Layer Perceptron): a basic feedforward neural network consisting of multiple fully connected layers with activation functions between them. It plays a key role in deep learning and is used for a variety of problems, including image classification and speech recognition.

RNN (Recurrent Neural Network): a network with a recurrent structure that is particularly suited to sequence data such as natural language text or time series. The recurrence lets information be carried along the sequence, but it also suffers from vanishing and exploding gradients.

Seq2Seq (Sequence to Sequence): a neural network architecture that maps input sequences to output sequences. It is commonly used for tasks such as machine translation and natural language generation: an encoder encodes the input sequence into a context vector, and a decoder generates the output sequence from that context vector.

Encoder-Decoder Architecture: This is a common neural network architecture used for sequence-to-sequence tasks. The encoder is responsible for encoding the input sequence into a fixed-length context vector, while the decoder generates an output sequence based on the context vector. This architecture has been successfully applied to tasks such as machine translation and dialogue generation.

Attention Mechanism: The attention mechanism is a mechanism used to focus on specific parts of the input sequence while the decoder generates the output sequence. It allows the model to better handle long sequences and capture key information. Attention mechanisms play an important role in improving translation and generation quality.

Self-attention (Transformer): the Transformer is a neural network architecture built around the self-attention mechanism. It abandons the recurrent structure of RNNs and processes sequence data in parallel. The Transformer has become the standard architecture in natural language processing; models such as BERT and the GPT series are based on it.

The key development in this progression is how these components build on one another, from MLP and RNN through Seq2Seq to the Transformer. With its performance and capabilities, the Transformer has pushed the field of natural language processing forward and set new standards for many automated text processing tasks. The self-attention mechanism lets the model capture dependencies within a sequence better, improving performance on many NLP tasks.


Introduction

  1. RNNs (LSTM and gated recurrent networks) are still the dominant paradigm for sequence modeling (language modeling) and transduction problems (machine translation). Substantial effort continues to push the boundaries of recurrent language models and encoder-decoder architectures.
  2. RNN characteristic (disadvantage): computation proceeds step by step from left to right. The t-th hidden state h_t is computed from h_{t-1} (the historical information) and the current input at step t. This inherently sequential nature precludes parallelization within a training example, which becomes critical at longer sequence lengths because memory constraints limit batching across examples (see the sketch after this list).
  3. Application of attention in RNNs: attention is used to pass the encoder's information to the decoder effectively, allowing dependencies between input or output positions to be modeled regardless of their distance.
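
As a concrete illustration of point 2, here is a minimal sketch of the recurrence (plain PyTorch, illustrative names): the loop over time steps cannot be parallelized because h_t needs h_{t-1}.

```python
# Minimal sketch of the sequential recurrence described in point 2 (illustrative names).
import torch

def rnn_forward(x, W_xh, W_hh, b_h):
    """x: (seq_len, d_in). Returns the hidden states, shape (seq_len, d_hidden)."""
    h = torch.zeros(W_hh.shape[0])
    states = []
    for x_t in x:  # strictly left to right: step t cannot start before h_{t-1} is done
        h = torch.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return torch.stack(states)
```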

Background

(The background section lays out the connections and differences between this paper and prior work.)

In a CNN with 3×3 convolutions, two positions that are far apart can only interact after many stacked layers, which makes it difficult to learn dependencies between distant locations. (If two pixels are far apart, many 3×3 convolution layers are needed, layer by layer, before the two pixels are connected.)

The Transformer's attention mechanism sees every position at once: a single layer can see the entire sequence.

A convolution has multiple output channels, and each channel can recognize a different pattern.

Transformer's multi-head self-attention simulates the effect of CNNs' multi-channel output.

Difference: the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

Model architecture


At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.

Inputs -> Input Embedding: each input token passes through an embedding layer, i.e., a word comes in and is represented as a vector. The resulting vector is added to the positional encoding (Section 3.5).
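
A minimal sketch of this input step (PyTorch, illustrative names and vocabulary size; the paper also scales the embedding by sqrt(d_model), see question 8 below):

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000                    # illustrative sizes; d_model = 512 as below
embedding = nn.Embedding(vocab_size, d_model)

def embed_inputs(token_ids, positional_encoding):
    """token_ids: (batch, seq_len); positional_encoding: (seq_len, d_model)."""
    x = embedding(token_ids) * math.sqrt(d_model)   # the paper scales embeddings by sqrt(d_model)
    return x + positional_encoding                  # inject position information
```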

A Transformer block consists of:
Multi-Head Attention
Add & Norm: residual connection + LayerNorm

Encoder’s core architecture

The encoder stacks N identical blocks. Each block has two sub-layers: the first is multi-head self-attention, and the second is a simple, position-wise fully connected feed-forward network (in short, an MLP).

Each sub-layer's output is wrapped in a residual connection followed by LayerNorm:
LayerNorm( x + Sublayer(x) )
where Sublayer(x) is either the self-attention or the MLP.

Residual connections require the input and output dimensions to match; if they do not, a projection is needed. For simplicity, the output dimension of every layer is fixed to d_model = 512.

Simple design: only two hyperparameters need to be tuned, d_model (the width of each layer) and N (the number of layers). This simplicity shaped the design of a series of later networks, such as BERT and GPT.
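
A minimal PyTorch sketch of one encoder block as described above, applying LayerNorm(x + Sublayer(x)) around both sub-layers; d_ff = 2048 follows the paper, and dropout placement is simplified away:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: LayerNorm(x + Sublayer(x)) around each of the two sub-layers."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)   # sub-layer 1: multi-head self-attention
        x = self.norm1(x + attn_out)            # residual connection + LayerNorm
        x = self.norm2(x + self.ffn(x))         # sub-layer 2: position-wise feed-forward (MLP)
        return x
```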

Remark: this is different from CNNs and MLPs. An MLP usually shrinks the feature dimension layer by layer; a CNN shrinks the spatial dimensions while increasing the channel dimension. The Transformer instead keeps d_model constant across layers.

For two-dimensional data (batch × feature): BN normalizes each feature across the batch, while LN normalizes each sample across its own features.
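
A small sketch of the difference for the two-dimensional case (ignoring the learned scale and shift parameters):

```python
import torch

x = torch.randn(4, 8)   # a batch of 4 samples, each with 8 features

# BatchNorm-style: normalize each feature (column) across the batch
bn = (x - x.mean(dim=0)) / (x.std(dim=0, unbiased=False) + 1e-5)

# LayerNorm-style: normalize each sample (row) across its own features
ln = (x - x.mean(dim=1, keepdim=True)) / (x.std(dim=1, unbiased=False, keepdim=True) + 1e-5)
```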

Decoder’s core architecture

The output of the encoder is fed to the decoder (it provides the keys and values for the encoder-decoder attention).
The decoder has one additional sub-layer: Masked Multi-Head Attention.


How is the mask implemented? When computing self-attention for position t, the scores of positions after t are replaced with a very large negative number (effectively -inf) before the softmax, so their weights become approximately zero and the decoder cannot attend to future tokens.
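
A minimal sketch of such a causal (sequence) mask applied to the attention scores, assuming a (seq_len, seq_len) score matrix:

```python
import torch

def apply_causal_mask(scores):
    """scores: (seq_len, seq_len) attention logits. Future positions (j > i) get -inf,
    so after the softmax their weights are effectively zero."""
    seq_len = scores.size(-1)
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    return scores.masked_fill(future, float("-inf"))

weights = torch.softmax(apply_causal_mask(torch.randn(5, 5)), dim=-1)  # lower-triangular weights
```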

The output of the decoder goes through a Linear layer followed by a softmax to produce the output distribution.
Linear + softmax: a standard neural-network output layer.
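
A minimal sketch of this output layer (PyTorch, illustrative names and vocabulary size):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000                 # illustrative vocabulary size
generator = nn.Linear(d_model, vocab_size)       # project decoder output to vocabulary logits

def output_distribution(decoder_out):
    """decoder_out: (batch, seq_len, d_model) -> per-position token probabilities."""
    return torch.softmax(generator(decoder_out), dim=-1)
```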

Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output.

The output is a weighted sum of the values, so the output has the same dimension as the values. The weight assigned to each value is the similarity (the compatibility function) between the query and the corresponding key. Although the keys and values do not change, as the query changes the weight distribution changes, and so the output changes. This is the attention mechanism.
Source: bilibili, https://www.bilibili.com/read/cv13759416/?jump_opus=1
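
The weighted-sum description above is exactly the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal sketch:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (..., n_q, d_k), k: (..., n_k, d_k), v: (..., n_k, d_v).
    The output is a weighted sum of v, so it has shape (..., n_q, d_v)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # query-key similarity
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)             # weights sum to 1 over the keys
    return weights @ v
```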


Summary: Transformer is a relatively standard encoder-decoder architecture.
Difference: The internal structures of encoder and decoder are different, and there are some differences in how the output of encoder is used as the input of decoder.

Multi-Head Attention

Instead of performing a single attention function with d_model-dimensional keys, values, and queries, it is beneficial to linearly project the queries, keys, and values h times, with different learned linear projections, to d_k, d_k, and d_v dimensions respectively. The attention function is then applied in parallel to each projected version of the queries, keys, and values, producing d_v-dimensional outputs; these are concatenated and projected once more to obtain the final values.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

Multi-head attention gives the model h opportunities to learn different projections, so that the similarity functions needed by different patterns can be matched in the projected metric spaces; the h head outputs are then concatenated and projected once more.
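
A minimal PyTorch sketch of this project, attend, concatenate, project pattern (illustrative class, no mask or dropout):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Sketch: project Q, K, V h times, attend in every head in parallel, concatenate, project."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads   # d_k = d_v = d_model / h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)                    # final output projection

    def _split(self, x):                                          # (b, n, d_model) -> (b, h, n, d_head)
        b, n, _ = x.shape
        return x.view(b, n, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, q, k, v):
        q, k, v = self._split(self.w_q(q)), self._split(self.w_k(k)), self._split(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head) # per-head scaled dot product
        out = torch.softmax(scores, dim=-1) @ v                   # all h heads attend in parallel
        b, h, n, d = out.shape
        out = out.transpose(1, 2).reshape(b, n, h * d)            # concatenate the heads
        return self.w_o(out)
```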

In CNN, a convolutional layer usually includes multiple filters or convolution kernels, each filter is responsible for detecting different features or patterns of the input data. Each filter performs a convolution operation on the input data, producing a feature map (channel) that captures different aspects of the input data.

Similarly, in the attention mechanism, the query, key and value are linearly projected multiple times into different dimensions, which is like learning different features in different "channels". Each linear projection maps the input data into a different representation subspace that captures different aspects or relationships of the input data. Each subspace can be viewed as an attention head, similar to different output channels in CNN.

Just like multiple output channels in CNN can capture different image features, the multi-head attention in the attention mechanism can capture different aspects of information in the input sequence, thereby improving the model's ability to understand the input. This method of processing different subspaces in parallel also helps improve the generalization ability of the model, because different heads can learn different features.

Applications of Attention in our Model

The Transformer uses multi-head attention in three ways: (1) encoder self-attention, where queries, keys, and values all come from the output of the previous encoder layer; (2) decoder (masked) self-attention, where each position may only attend to positions up to and including itself; (3) encoder-decoder attention, where the queries come from the previous decoder layer and the keys and values come from the encoder output.

Positional Encoding

Since self-attention itself contains no notion of token order, the paper adds sinusoidal positional encodings to the input embeddings:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
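
A minimal sketch of this sinusoidal encoding:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model=512):
    """Returns (seq_len, d_model): sine on even dimensions, cosine on odd dimensions."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # even dimension indices 2i
    inv_freq = torch.exp(-two_i * math.log(10000.0) / d_model)      # 1 / 10000^(2i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)
    pe[:, 1::2] = torch.cos(pos * inv_freq)
    return pe
```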

1. Why does the Transformer use multi-head attention instead of a single head?
2. Why does the Transformer use different weight matrices to generate Q and K? Why not use the same matrix and take its dot product with itself? (Note the difference from question 1.)
3. Why does the Transformer use dot-product attention instead of additive attention? How do the two compare in computational complexity and effectiveness?
4. Why are the attention scores scaled (divided by the square root of d_k) before the softmax? Explain with a derivation.
5. How is padding masked when computing the attention scores?
6. Why is the dimension of each head reduced in multi-head attention? (See the question above.)
7. Briefly describe the Transformer's Encoder module.
8. Why are the input embeddings multiplied by the square root of the embedding dimension? What is the purpose?
9. Briefly describe the Transformer's positional encoding. What is its purpose, and what are its advantages and disadvantages?
10. What other positional encoding techniques do you know, and what are their respective pros and cons?
11. Briefly describe the residual structure in the Transformer and its purpose.
12. Why does the Transformer block use LayerNorm instead of BatchNorm? Where is LayerNorm placed in the Transformer?
13. Briefly describe BatchNorm and its advantages and disadvantages.
14. Briefly describe the feed-forward network in the Transformer. Which activation function does it use? What are the related pros and cons?
15. How do the Encoder side and the Decoder side interact? (This can lead into seq2seq attention.)
16. What is the difference between the decoder's multi-head self-attention and the encoder's multi-head self-attention? (Why does decoder self-attention need a sequence mask?)
17. Where does the Transformer's parallelism come from? Can the Decoder side be parallelized?
19. How is the learning rate scheduled during Transformer training? How and where is Dropout applied? What needs attention regarding Dropout at test time?
20. Does the residual structure on the decoder side leak information about tokens that should not yet be visible?



Origin blog.csdn.net/weixin_45751396/article/details/132740929