Classic model - Transformer

The fourth major architecture, after the MLP, CNN, and RNN.

Abstract

Sequence transduction models have mainly been built on RNNs or CNNs, usually with an encoder-decoder structure.

The Transformer relies only on the attention mechanism.

The paper targets machine translation; the architecture was later applied to many other fields.

Introduction

Problems with RNNs:

  1. RNN computation proceeds step by step and cannot be parallelized, so computational performance is poor;
  2. Information from early time steps is gradually lost as the sequence grows longer.

The attention mechanism was combined with RNNs early on to improve the flow of information between the encoder and decoder.

This paper, however, abandons the RNN structure entirely and relies solely on the attention mechanism.

Background

It is difficult to model long sequences with convolutional neural networks; many stacked convolutional layers are needed to expand the receptive field. The advantage of convolution is that it has multiple output channels, each of which can learn a different pattern.

Therefore, this paper proposes a multi-head attention model.

Model Architecture

For sequence tasks, the encoder-decoder structure performs well.

In the decoder, as in a recurrent neural network, words are output one at a time, and the outputs of past time steps are used as input at the current time step; this is called autoregression.

The encoder, by contrast, can see the entire sentence at once, and the sequence representation it produces is handed to the decoder in its entirety.
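A minimal sketch of this autoregressive behavior (greedy decoding; `model.encode`, `model.decode`, `bos_id`, and `eos_id` are hypothetical placeholders, not an API from the paper):

```python
import torch

def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    """Greedy autoregressive decoding: each step feeds back the previously generated tokens."""
    memory = model.encode(src)                      # encoder sees the whole source sentence at once
    ys = torch.tensor([[bos_id]])                   # start-of-sequence token
    for _ in range(max_len):
        logits = model.decode(ys, memory)           # decoder sees only the tokens generated so far
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_token], dim=1)     # past output becomes current input
        if next_token.item() == eos_id:
            break
    return ys
```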

Encoder and Decoder Stacks

Encoder: six identical stacked blocks, each with two sub-layers. In each block, every sub-layer has a residual connection around it, followed by layer normalization (LayerNorm).

To avoid dimension mismatches in the residual connections (which would otherwise require projections), the paper fixes the dimension of every layer to 512.
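A minimal sketch of the residual-plus-LayerNorm wrapper around a sub-layer, assuming the post-norm order described above and d_model = 512:

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection followed by LayerNorm, wrapped around each sub-layer."""
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Every layer keeps dimension 512, so x and sublayer(x) can be added directly.
        return self.norm(x + self.dropout(sublayer(x)))
```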

BatchNorm vs. LayerNorm

Internal Covariate Shift: During the training process, the distribution of data is constantly changing, which brings difficulties to the learning of the next layer of the network.

During training, for a two-dimensional input matrix where rows are samples and columns are features, BatchNorm standardizes each column (computing its mean and standard deviation and taking the z-score).

Generally, the learnable parameters $\gamma, \beta$ are applied at the end: a linear transformation of the standardized result, which is equivalent to changing the mean and variance of the distribution.

This is because, if every feature were forced into a standard normal distribution, the feature distributions the model has learned would be wiped out. The model therefore needs room to fine-tune them.

In my view, the role of the BN layer is, on the one hand, to keep the distribution from drifting too far so that it keeps a basic shape, and on the other hand, not to force every feature into exactly the same mold.

At test time, the mean and variance used are the running statistics accumulated during training.

The update formula is $\mu \leftarrow m\mu + (1-m)\mu_{batch}$, and $\sigma$ is updated in the same way.

If the input has shape [B, C, H, W], BatchNorm computes the statistics over the B, H, and W dimensions, so there is one mean and variance per channel (shape [C]).


LayerNorm, by contrast, works within a single sample in the batch: it computes the mean and variance over that sample's features (across the channel/embedding dimension) and normalizes with them.

The formula is $y=\frac{x-\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}\cdot\gamma+\beta$.

LN is usually preferred in NLP, because a sample is a sentence and each position in the sentence is a word: words at the same position in different sentences share no common feature, and padding further distorts the batch statistics, so BN works poorly there.

Therefore, the object being normalized is each individual word (token) vector.

Because normalization is done within each sample, there is no need to maintain running global mean and variance statistics during training.
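A small sketch contrasting what the two layers normalize for an NLP-style input of shape [batch, seq_len, d_model] (the shapes here are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 10, 512)                   # [batch, seq_len, d_model]

# LayerNorm: statistics over the last (feature) dimension, per token, per sample.
ln = nn.LayerNorm(512)
y_ln = ln(x)                                  # each of the 4*10 token vectors is normalized on its own

# BatchNorm1d expects [batch, channels, length], so the feature dimension must be moved.
bn = nn.BatchNorm1d(512)
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)  # statistics shared across the whole batch, per feature
```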

Decoder: the same structure, plus a mask, which keeps the behavior of training and prediction consistent. It also receives the entire output of the encoder as input.

Attention

An attention function maps a query and a set of key-value pairs to an output.

Scaled Dot-Product Attention

There are two common attention mechanisms: additive and dot-product. Dot-product attention is used here because it is more efficient to compute.

There is also a scaling factor of $\sqrt{d_k}$; its purpose is to keep the softmax from saturating during training (some saturation at the end of training is fine).

This scaling is also where the name "Scaled Dot-Product Attention" comes from.

The attention function is $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$.

Q, K, and V each have shape $(num\_sample, num\_feature)$, so element $(i, j)$ of $QK^T$ expresses how much the $i$-th sample (the query) should attend to the $j$-th sample (the key).

The key and value must have the same sequence length; the query's length can differ from theirs.

How to make a mask?
In the decoder, a position cannot see subsequent content, so the query at time $t$ may only attend to keys at positions up to $t$. The computation itself is unchanged: the scores for positions $t+1$ and later are simply replaced with a large negative number, which becomes 0 after the softmax.
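A minimal sketch of scaled dot-product attention with this causal mask (future positions are filled with a very negative value before the softmax):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    # q, k, v: [..., seq_len, d_k]
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # [..., seq_len_q, seq_len_k]
    if causal:
        seq_len = scores.size(-1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float('-inf'))  # future positions -> -inf -> 0 after softmax
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```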

Multi-Head Attention

Multi-head attention simulates multiple output channels: the input is split into several equal-sized subspaces (heads), attention is computed in each, and the results are concatenated, as in the sketch below.

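A minimal sketch of multi-head attention under these assumptions (d_model = 512 split into 8 heads of 64, reusing the `scaled_dot_product_attention` sketch above):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_head = d_model // num_heads            # 512 / 8 = 64 per head
        self.num_heads = num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, causal=False):
        b = q.size(0)
        # Project, then split the feature dimension into heads: [b, heads, seq, d_head]
        def split(x, w):
            return w(x).view(b, -1, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        out = scaled_dot_product_attention(q, k, v, causal=causal)
        out = out.transpose(1, 2).contiguous().view(b, -1, self.num_heads * self.d_head)
        return self.w_o(out)                          # concatenate heads, then final projection
```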

Applications of Attention in our Model

There are three different attention layers.


The first attention layer of the decoder is masked.

In the second attention layer of the decoder, the key and value come from the encoder output, and the query comes from the previous (masked) attention layer.
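A small sketch of how the two decoder attention layers could be wired, reusing the `MultiHeadAttention` sketch above (residual connections and LayerNorm are omitted for brevity):

```python
# x: decoder input so far, memory: full encoder output
self_attn = MultiHeadAttention()
cross_attn = MultiHeadAttention()

def decoder_attention(x, memory):
    x = self_attn(x, x, x, causal=True)   # masked self-attention over the generated tokens
    x = cross_attn(x, memory, memory)     # query from decoder, key/value from encoder
    return x
```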

Position-wise Feed-Forward Networks

To put it bluntly, it is an MLP that acts on the last dimension, applied to each position independently.

It is a single-hidden-layer MLP: the hidden layer expands the dimension from 512 to 2048, and the output layer projects it back to 512.

The formula is $\mathrm{FFN}(x)=\max(0,\, xW_1+b_1)W_2+b_2$.

In PyTorch, if the input is 3D, nn.Linear operates on the last dimension by default.
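A minimal sketch of the position-wise feed-forward network (512 → 2048 → 512, with the ReLU from the formula above); because nn.Linear acts on the last dimension, the same MLP is applied at every position independently:

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand 512 -> 2048
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # project back 2048 -> 512
        )

    def forward(self, x):               # x: [batch, seq_len, 512]
        return self.net(x)              # applied independently at every position
```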

The role of attention is to gather information from across the sequence and aggregate it; the role of the MLP is to map each position into the desired semantic space. Because after attention each word already carries the full sequence information, the MLP can be applied to each position on its own.

RNNs also use an MLP for this transformation; to carry sequence information forward, the output of the previous time step is fed into the MLP of the next time step.

Embedding and Softmax

The embedding layer maps words (tokens) into vectors.
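A minimal sketch, assuming a hypothetical vocabulary size of 32000:

```python
import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=32000, embedding_dim=512)  # vocabulary size is a placeholder
tokens = torch.tensor([[5, 120, 7]])     # token ids for one sentence
vectors = embed(tokens)                  # [1, 3, 512]: each word becomes a 512-d vector
```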

Positional Encoding

Attention by itself carries no order information, so positional encoding is required.

$PE_{(pos,2i)}=\sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
$PE_{(pos,2i+1)}=\cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
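A minimal sketch of these sinusoidal positional encodings:

```python
import math
import torch

def positional_encoding(max_len, d_model=512):
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)      # [max_len, 1]
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions: sin
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions: cos
    return pe                            # added to the word embeddings
```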

Why Self-Attention

[Table 1 of the paper compares self-attention, recurrent, and convolutional layers by per-layer computational complexity, number of sequential operations, and maximum path length.]
The first column is computational complexity, the second is the number of sequential operations (a measure of parallelism), and the third is the maximum path length (how many steps information needs to travel from the first position of the sequence to the last, reflecting how easily information can be combined).

For $QK^T$, n queries are each multiplied against n keys, and each dot product costs d multiplications, so the complexity is $O(n^2 \cdot d)$.

For a recurrent neural network, a d-dimensional input arrives at each step and the MLP performs d operations for each of its d dimensions; this is repeated n times, giving $O(n \cdot d^2)$.

When n and d are of similar size, there is little difference in computational complexity between the two. The real story is in the other columns: attention does not easily lose information over long distances and has a high degree of parallelism.
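A quick back-of-the-envelope check of the complexity column, assuming the illustrative values n = d = 512:

```python
n, d = 512, 512                   # sequence length and feature dimension (illustrative values)
self_attention = n * n * d        # QK^T: n x n dot products of length d
recurrent      = n * d * d        # n time steps, each a d x d matrix-vector product
print(self_attention, recurrent)  # identical when n == d: 134,217,728 multiplications each
```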

However, attention imposes fewer assumptions (inductive biases) on the model, so it needs a larger model and more data to be trained well.

Conclusion

The encoder-decoder structure is kept, but the recurrent layers are replaced with multi-head self-attention.

Source: blog.csdn.net/weixin_46365033/article/details/125487985