"Hands-on Deep Learning"-68Transformer

Study notes for Mushen's (Mu Li's) "Hands-on Deep Learning" course, recording the learning process; please buy the book for the full details.

Bilibili video link
Open-source tutorial link
Mushen's paragraph-by-paragraph Transformer paper reading [paper reading series]

Transformer

Transformer's paper: Attention Is All You Need!
The Transformer model architecture
Parts of the encoder and decoder blocks are the same; the decoder additionally has a masked multi-head attention sub-layer. Because of the residual connections, the output of each sub-layer (including the MLP) must have the same dimension as its input, so the model's output dimension is fixed at 512. This design influenced later work such as BERT and GPT; the only tunable hyperparameters are the number of layers N and the model's output dimension.
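As a minimal sketch (assuming PyTorch; not the book's exact code), the residual "Add & Norm" sub-layer below shows why every sub-layer output must keep the model dimension of 512:

```python
from torch import nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization ("Add & Norm")."""
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_out):
        # x + sublayer_out only works if both tensors share the same last
        # dimension, which is why every sub-layer outputs d_model = 512.
        return self.norm(x + self.dropout(sublayer_out))
```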
If you use someone else's work in the relevant part of your article, it is best to explain what it is; you cannot expect every reader to know all the details, and being able to explain it clearly in a few sentences is valuable.

Layer Norm
Layer normalization is used in the Transformer. Why not use Batch Norm in these variable-length applications? Consider the simplest two-dimensional input: LayerNorm normalizes each sample, while BatchNorm normalizes each feature. With sequences of unequal length, BatchNorm's mean and variance estimates jitter easily, whereas LayerNorm computes the mean and variance within each sample itself, needs no global statistics, and is relatively stable.

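A small sketch of the difference on a (batch, sequence, feature) tensor (assuming PyTorch; the shapes are hypothetical):

```python
import torch
from torch import nn

# Toy input: batch of 2 sequences, length 4, feature dimension 8.
x = torch.randn(2, 4, 8)

# LayerNorm normalizes over the feature dimension of each token, so the
# statistics come from the sample itself and are independent of the batch.
layer_norm = nn.LayerNorm(8)
y_ln = layer_norm(x)  # shape (2, 4, 8)

# BatchNorm1d expects (batch, channels, length) and normalizes each feature
# using statistics gathered across the batch and sequence positions, which is
# why padding and unequal sequence lengths make its estimates jitter.
batch_norm = nn.BatchNorm1d(8)
y_bn = batch_norm(x.permute(0, 2, 1)).permute(0, 2, 1)  # back to (2, 4, 8)
```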

Scaled dot product attention
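The figure referenced here shows scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal sketch (assuming PyTorch; not the book's implementation):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # (..., queries, d_k) @ (..., d_k, keys) -> (..., queries, keys)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions get a very large negative score, so their softmax
        # weight is effectively 0 (used for the masked attention below).
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V
```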

The attention calculation with a mask
Masking ensures that when the decoder computes the output at time t, it can only see the information at time t-1 and before. The implementation sets the scores of the masked positions to a very large negative value, so that after the softmax in the attention computation these weights tend to 0.

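One way such a mask could be built (a sketch reusing the hypothetical scaled_dot_product_attention above): a lower-triangular matrix that hides future positions from each query.

```python
import torch

seq_len = 5
# Position t may attend to positions up to t; future positions are masked out.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

Q = K = V = torch.randn(1, seq_len, 64)
out = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(out.shape)  # torch.Size([1, 5, 64])
```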

Multi-head attention
Scaled dot-product attention has no learnable parameters, so V, K, and Q are first projected through linear layers; attention is then computed in each head, and the results are concatenated and projected once more. The overall effect is comparable to the multiple channels of a convolution.
The Transformer uses 8 heads, and each head is projected to 512/8 = 64 dimensions.
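A sketch of how the 8 heads of dimension 512/8 = 64 could be organized (assuming PyTorch and the scaled_dot_product_attention helper above; not the book's exact code):

```python
from torch import nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads  # 512 / 8 = 64
        # Learnable projections for Q, K, V plus the final output projection.
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch = q.shape[0]

        def split(x):  # (batch, len, 512) -> (batch, heads, len, 64)
            return x.view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.W_q(q)), split(self.W_k(k)), split(self.W_v(v))
        # Each head computes scaled dot-product attention in parallel.
        out = scaled_dot_product_attention(q, k, v, mask)
        # Concatenate the heads and apply the final projection.
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.num_heads * self.d_head)
        return self.W_o(out)
```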
The use of attention in the Transformer
The third multi-head attention layer uses the output of the encoder as K and V, and the output of the decoder's masked attention block as Q for the attention calculation.
The feed-forward layer has two layers: the first expands the dimension to 2048, and the second projects it back to 512.
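As a sketch (assuming PyTorch), the position-wise feed-forward network described above is just two linear layers with a ReLU in between:

```python
from torch import nn

# Position-wise feed-forward network: expand 512 -> 2048, then project back to 512.
ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)
```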

The difference between attention and RNN
Attention captures the global information of the entire sequence and aggregates it, while an RNN passes the information from the previous time step to the next as part of its input. The two models differ in how they propagate sequence information.
Positional encoding
Attention itself contains no order information, so the order information (positional encoding) is added to the input of the Transformer.
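A sketch of the sinusoidal positional encoding from the paper (assuming PyTorch), which is added to the input embeddings to inject order information:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# The encoding is added to the token embeddings at the input of the Transformer.
embeddings = torch.randn(1, 10, 512)
x = embeddings + sinusoidal_positional_encoding(10)
```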


Origin blog.csdn.net/cjw838982809/article/details/132175946