[Self-Attention Neural Networks] Transformer Architecture

I. Overview

        The biggest difference between the Transformer architecture and traditional CNNs and RNNs is that it relies solely on the self-attention mechanism, with no convolution or recurrence operations. Compared with an RNN, it does not have to process the sequence step by step, so it can be parallelized much better; compared with a CNN, it can attend to the entire input at once and is not limited by the size of a receptive field.

II. Model Architecture

        1. Functional modules

                The functional-module structure is shown in the figure below:

                Inputs: encoder input

                Outputs: decoder input (the output of the decoder at the previous time step is fed back in as input)

                Positional Encoding: adds position information to the token embeddings

                Transformer Block (encoder): consists of a multi-head attention layer with a residual connection and a feed-forward network with a residual connection. The output of the encoder is used as (part of) the input of the decoder.

                Transformer Block (decoder): compared with the encoder, it additionally contains a Masked Multi-Head Attention mechanism.

         2. Network structure

                ① Encoder

                        6 Transformer blocks are stacked, and each block has two sublayers: a multi-head self-attention mechanism and an MLP (a position-wise feed-forward network), each followed by a residual connection and a Layer Normalization.

                        Its formula can be expressed as: LayerNorm(x+Sublayer(x)) (with a residual connection)

                        Layer Norm is similar to Batch Norm; both normalize by computing means (and variances). The difference is that Batch Norm computes the statistics per feature across a batch (over columns), while Layer Norm computes them per sample across its features (over rows), as in the sketch below.
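A minimal PyTorch sketch contrasting the two normalizations and showing the LayerNorm(x + Sublayer(x)) pattern from the formula above (shapes and the MLP stand-in are illustrative assumptions, not from the original post):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 8)            # (batch, features): 4 samples (rows), 8 features (columns)

batch_norm = nn.BatchNorm1d(8)   # Batch Norm: statistics per feature, across the batch (columns)
layer_norm = nn.LayerNorm(8)     # Layer Norm: statistics per sample, across its features (rows)
bn_out, ln_out = batch_norm(x), layer_norm(x)

# The sublayer pattern from the formula above: LayerNorm(x + Sublayer(x))
def sublayer_connection(x, sublayer, norm):
    return norm(x + sublayer(x))   # residual connection followed by Layer Norm

mlp = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))  # stand-in for a sublayer
out = sublayer_connection(x, mlp, nn.LayerNorm(8))
print(out.shape)                   # torch.Size([4, 8])
```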

                ② Decoder

                        6 Transformer blocks are stacked, and each block has three sublayers. Decoding is autoregressive (the input at the current time step is the output of the previous time step). To guarantee that at time t the model cannot see outputs from later time steps, a masking mechanism is added to the first multi-head attention sublayer.

                ③ Attention mechanism

                        Attention function: a function that maps a query and a set of key-value pairs to an output, where the weight of each value is obtained from the similarity between its corresponding key and the query.

                        Its formula can be written as: Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V

                        The queries and keys both have length d_k, and the values have length d_v. The inner product of each query-key pair is used as the similarity (the larger the value, the higher the similarity, as with the cosine); the result is then divided by \sqrt{d_k} (the square root of the vector length); finally, a softmax turns the scores into weights.

                        After the weights are obtained, they are multiplied with the values to get the output.

                        In actual computation, the queries, keys, and values are each packed into a matrix and the calculation is carried out as shown in the figure below (see the sketch after this paragraph).
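A minimal PyTorch sketch of this matrix form of the computation (the shapes, variable names, and the `mask` argument are illustrative assumptions):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, in batched matrix form."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # similarity of every query with every key
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # masked scores vanish after softmax
    weights = torch.softmax(scores, dim=-1)            # one weight distribution over keys per query
    return weights @ V                                 # weighted sum of the values

# Illustrative shapes: one sequence of 5 tokens, d_k = d_v = 16.
Q = K = V = torch.randn(1, 5, 16)
print(scaled_dot_product_attention(Q, K, V).shape)     # torch.Size([1, 5, 16])
```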

                        Masking mechanism: for the query Q_t at time t, only the keys k_1, ..., k_{t-1} should be visible during the calculation, but in practice the attention computation operates over all keys. A masking mechanism is therefore introduced: the scores computed between Q_t and k_t (and later keys) are replaced with a large negative number, which becomes 0 after the softmax, as in the sketch below.
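A small sketch of that trick using a lower-triangular (causal) mask; the sequence length is an arbitrary example:

```python
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))   # 1 = may attend, 0 = future position

scores = torch.randn(seq_len, seq_len)                   # raw attention scores Q K^T / sqrt(d_k)
masked = scores.masked_fill(causal_mask == 0, -1e9)      # future scores -> large negative number
weights = torch.softmax(masked, dim=-1)
print(weights)   # row t has non-zero weight only on columns 0..t
```

The same mask could be passed as the `mask` argument of the attention sketch above.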

                        Multi-head mechanism: project the queries, keys, and values to lower dimensions h times (h=8 in the original paper) with different learned projections, run the attention function on each of the h projections, concatenate the outputs, and project the result back to the original dimension. As shown below:

                                The Linear layers in the figure perform the low-dimensional projections; Scaled Dot-Product Attention is the attention mechanism described above; Concat combines the results of the heads.

                                Its formula is: MultiHead(Q,K,V)=Concat(head_1,...,head_h)W^O

                                                                where  head_i=Attention(QW_i^Q,KW_i^K,VW_i^V)
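A compact PyTorch sketch of this projection / attention / concatenation pipeline (d_model=512 and h=8 follow the original paper; the class name and everything else are illustrative assumptions):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """MultiHead(Q,K,V) = Concat(head_1, ..., head_h) W^O with h parallel heads."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # W^Q, W^K, W^V project to the h low-dimensional subspaces; W^O projects back.
        self.w_q, self.w_k, self.w_v, self.w_o = (nn.Linear(d_model, d_model) for _ in range(4))

    def forward(self, q, k, v, mask=None):
        B, T, _ = q.shape
        def split(x):                                   # (B, T, d_model) -> (B, h, T, d_k)
            return x.view(B, -1, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        heads = torch.softmax(scores, dim=-1) @ V       # h outputs of Scaled Dot-Product Attention
        concat = heads.transpose(1, 2).reshape(B, T, self.h * self.d_k)  # Concat(head_1..head_h)
        return self.w_o(concat)                         # project back to d_model with W^O

x = torch.randn(2, 10, 512)                             # (batch, sequence length, d_model)
print(MultiHeadAttention()(x, x, x).shape)              # torch.Size([2, 10, 512])
```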

III. ViT (Vision Transformer)

         There is a further difficulty in applying the Transformer to image processing: if the pixels of the image are used directly as the input sequence, the sequence becomes far too long. ViT solves this problem by dividing the image into small patches.

        ViT treats each 16x16 region as one patch. Taking the standard 224x224 input image as an example, this yields (224/16)^2 = 196 patches, which greatly reduces the length of the input sequence.
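A trivial worked calculation of that patch count (the numbers are the standard ViT settings quoted above):

```python
image_size = 224     # standard input image: 224 x 224 pixels
patch_size = 16      # each patch covers 16 x 16 pixels

patches_per_side = image_size // patch_size   # 224 / 16 = 14
num_patches = patches_per_side ** 2           # 14 * 14 = 196 patches
print(num_patches, "tokens instead of", image_size * image_size, "pixels")  # 196 vs 50176
```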

         1. Model structure

                First, the image is split into multiple patches. These patches are flattened and passed through the linear projection layer (Linear Projection of Flattened Patches, i.e. a fully connected layer), and a position encoding is added to each patch embedding (so that the spatial order is not lost); the position encoding is added directly rather than concatenated, giving one token per patch. These tokens are then fed into the Transformer encoder. In addition, a cls token (classification token) is prepended at the head of the sequence to represent the category; the cls token output by the self-attention layers carries the classification information. A sketch of this input pipeline is given below.
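A minimal PyTorch sketch of that input pipeline (the class name, d_model=768, and the use of a strided convolution to implement the per-patch linear projection are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Flatten patches, project them linearly, prepend a cls token, add position embeddings."""
    def __init__(self, image_size=224, patch_size=16, in_chans=3, d_model=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2               # 196
        # Linear Projection of Flattened Patches, written as a strided convolution.
        self.proj = nn.Conv2d(in_chans, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))        # classification token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, d_model))

    def forward(self, x):                         # x: (B, 3, 224, 224)
        B = x.size(0)
        x = self.proj(x)                          # (B, d_model, 14, 14)
        x = x.flatten(2).transpose(1, 2)          # (B, 196, d_model) patch tokens
        cls = self.cls_token.expand(B, -1, -1)    # cls token placed at the head of the sequence
        x = torch.cat([cls, x], dim=1)            # (B, 197, d_model)
        return x + self.pos_embed                 # position encoding is added, not concatenated

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 197, 768]) -- ready for the Transformer encoder
```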

         2. Transformer Encoder
