Neural Network Study Notes 3 - Transformer, VIT and BoTNet Networks

Series Article Directory

Neural network study notes 1 - ResNet residual network, Batch Normalization understanding and code
Neural network study notes 2 - VGGNet neural network structure and receptive field understanding and code



A. Transformer model

Recently, the Transformer has become very popular in the CV field. The Transformer was published by Google in 2017 (the paper "Attention Is All You Need", under Computation and Language), originally aimed at natural language processing and, in particular, at machine translation. Before that, the standard models for such tasks were sequence networks such as RNN and LSTM, but these models inevitably suffer from limited memory length, meaning only a limited amount of sentence information can be used, as well as the gradient explosion and vanishing gradient problems that come with longer sequences. Although LSTM alleviates these problems to some extent compared with the plain RNN, another serious drawback of this family of RNN-based models is that they cannot be parallelized: to compute the state at time tn, you must first compute the state at time tn-1, which makes computation extremely inefficient. In response to these problems, the Google team proposed the Transformer. The Transformer is now regarded as the fourth major family of basic models alongside the MLP, CNN and RNN; perhaps that is the real weight behind "Attention Is All You Need". The core of the Transformer is a model built on the attention mechanism.


1. Supplementary details

1. Parallel Computing

In data-parallel model training, the training task is divided into multiple processes (devices), and each process maintains the same model parameters and the same computing tasks, but processes different data (batch data). Data parallelism can improve training throughput by adding parallel training devices.

The input data is obtained by splitting the entire training dataset, in one of two ways:

  1. The data is divided according to the number of parallel processes, and each process reads only its own split
  2. A single designated process is in charge of reading; it splits the data according to the number of parallel processes and then sends each data block to the corresponding process

RNN-series models must compute from X1 to Xn in order. This makes their computational performance poor even on GPUs, and with very long sequences the early hidden history ht is gradually lost as it is iterated forward.
The reason an RNN cannot be parallelized is that, even though X0, X1 and X2 are all known in advance, computing step i requires not only Xi but also the hidden state ht-1 produced at the previous step. So h0 must first be computed from X0, then h1 from h0 and X1, and so on, one step at a time. The whole batch ([X0, X1, X2, ..., Xn]) cannot be processed as one complete matrix at the same time, the way the Transformer does; it has to be handled element by element.

The parallelization of the Transformer refers to parallelization over the data in the training phase. In the inference phase, only the encoder can be parallelized; the decoder cannot, because it still generates tokens step by step, so at the code level the decoder looks a little different.
The parallelism of the Transformer is mainly reflected in the self-attention module. On the encoder side, the Transformer can process the entire sequence in parallel and obtain the output for the whole input sequence at once. In the self-attention module, for a sequence X1, X2, ..., Xn, the dot product of any pair Xi, Xj can be computed directly. Ignoring batching for the moment, suppose a time series is fed in where Xi is the value of X at time i (or think of a video, where Xi is the i-th frame). During training, all the Xi enter the model at the same time, rather than X0, X1, X2 arriving one by one, so the Transformer can process the whole batch with matrix operations, and this is exactly what makes it parallelizable.
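To make the contrast concrete, here is a minimal PyTorch sketch with toy shapes (not the actual Transformer code): the RNN cell has to loop over time steps because each step needs the previous hidden state, while self-attention handles the whole sequence with a few batched matrix multiplications.

```python
import torch
import torch.nn as nn

seq_len, batch, d = 10, 4, 32
x = torch.randn(batch, seq_len, d)           # the whole sequence is available up front

# RNN: h_t depends on h_{t-1}, so the time steps must be processed one by one
rnn_cell = nn.RNNCell(d, d)
h = torch.zeros(batch, d)
for t in range(seq_len):                     # sequential loop, cannot be parallelized over t
    h = rnn_cell(x[:, t, :], h)

# Self-attention: every pair (Xi, Xj) is just a dot product, so the whole
# sequence is handled with a few batched matrix multiplications
q = k = v = x                                # self-attention: q, k, v all come from x
scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, seq_len, seq_len) computed in one shot
out = scores.softmax(dim=-1) @ v             # (batch, seq_len, d)
```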

2. BatchNorm and LayerNorm

BatchNorm is generally used in the CV field, while LayerNorm is generally used in the NLP field. This comes from the essential difference between the two kinds of tasks: visual features are objectively existing features, while semantic features are more like statistics determined by contextual semantics, so the appropriate normalization method also differs.

BatchNorm (batch normalization): normalization of a single neuron in an intermediate layer along the batch direction. With a batch size of N, take the N outputs that a given neuron in layer l produces over the batch, compute the mean and variance of those N outputs, and use them to normalize the N outputs. The normalized dimension is therefore the batch dimension.

LayerNorm (layer normalization): normalization over all neurons of an intermediate layer, i.e. along the channel/feature direction. For each individual input sample, take the outputs of all neurons in layer l for that sample, compute their mean and variance, and use them to normalize that sample's outputs. The normalized dimension is therefore the layer itself.

Both exist to obtain a good estimate of the mean and variance used for normalization.
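A small PyTorch sketch of the difference, assuming a typical CV feature map and an NLP token sequence; it only illustrates which dimensions the mean and variance are computed over.

```python
import torch
import torch.nn as nn

# CV-style feature map: (N, C, H, W)
feat = torch.randn(8, 64, 32, 32)
bn = nn.BatchNorm2d(64)              # statistics per channel, across the batch (and spatial) dims
out_bn = bn(feat)

# NLP-style token sequence: (N, num_tokens, d_model)
tokens = torch.randn(8, 196, 768)
ln = nn.LayerNorm(768)               # statistics per sample / per token, across the feature dim
out_ln = ln(tokens)

# The same statistics "by hand", just to show which dimensions are normalized
mu_bn = feat.mean(dim=(0, 2, 3))     # one mean per channel, computed over the whole batch
mu_ln = tokens.mean(dim=-1)          # one mean per token, computed over that token's features
```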


2. Encoder-decoder (encode-decode)

The encoder is responsible for mapping the natural language sequence into a hidden layer (containing the mathematical expression of the natural language sequence), and then the decoder maps the hidden layer into a natural language sequence.

Autoregressive steps:
1. Input the natural language sequence to the encoder
2. The hidden layer output by the encoder is input to the decoder
3. Start the decoder
4. Get the first word
5. Feed the first word back into the decoder and get the second word
6. Repeat this process until the decoder outputs a terminator; sequence generation is then complete (a code sketch of this loop follows the list)
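A minimal sketch of this autoregressive loop. The `model`, its `encode`/`decode` methods and the `bos_id`/`eos_id` token ids are hypothetical names used only for illustration, not an actual API.

```python
import torch

def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=50):
    """Hypothetical autoregressive loop; encode()/decode() are assumed interfaces."""
    memory = model.encode(src_tokens)              # steps 1-2: encoder output (hidden representation)
    ys = torch.tensor([[bos_id]])                  # step 3: start the decoder with a begin token
    for _ in range(max_len):
        logits = model.decode(ys, memory)          # the decoder sees everything generated so far
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_token], dim=1)    # step 5: feed the new word back in
        if next_token.item() == eos_id:            # step 6: stop at the terminator
            break
    return ys
```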

The model diagram of the original paper:
This is actually an encoder-decoder architecture, which can be simply divided into two halves:


The area on the left is the encoder:
Inputs: Encoder input
Positional Encoding: position encoding
Nx: n such layers stacked
Multi-Head Attention: Multi-head attention mechanism layer
Add & Norm: residual connection followed by normalization
Feed Forward: feedforward neural network

The first sublayer connection structure consists of a multi-head self-attention sublayer (q, k, v all come from the encoder input, q=k=v), a normalization layer, and a residual connection.
The second sublayer connection structure consists of a feed-forward fully connected sublayer, a normalization layer, and a residual connection.

Encoder: consists of N=6 identical layers, each containing two sub-layers. The first sub-layer is the multi-head attention layer, followed by a simple fully connected (feed-forward) layer. Each sub-layer has a residual connection and normalization added.


The area on the right is the decoder:
Outputs: Decoder input
Positional Encoding: position encoding
Nx: n such layers are stacked
Masked Multi-Head Attention: multi-head attention layer with a mask that hides the positions after the current one, so that their attention weights become 0
Multi-Head Attention: multi-head attention mechanism layer
Add & Norm: residual connection followed by normalization
Feed Forward: feedforward neural network

The first sublayer connection structure includes a masked multi-head self-attention sublayer (q, k, v all come from the decoder input, q=k=v, with a mask applied), a normalization layer, and a residual connection.
The second sublayer connection structure consists of a multi-head attention sublayer (k and v come from the encoder output, q comes from the output of the first decoder sublayer), a normalization layer, and a residual connection.
The third sublayer connection structure includes a feed-forward fully connected sublayer, a normalization layer, and a residual connection.

Decoder: consists of N=6 identical layers, but each layer here differs from the encoder's: it contains three sub-layers, namely a (masked) self-attention layer, an encoder-decoder attention layer, and finally a fully connected layer. The first two sub-layers are based on multi-head attention. The special point here is masking. Its role is to prevent future output words from being used during training: for example, the first word must not see the generated result of the second word. The mask drives this information's weight to zero, ensuring that the prediction for position i can only depend on outputs at positions smaller than i.
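A small sketch of how such a mask is commonly built, assuming the raw attention scores have shape seq_len x seq_len; positions after the current one are set to -inf so their softmax weight becomes 0.

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)            # raw attention scores q @ k^T / sqrt(d)

# Upper-triangular mask: position i may only attend to positions <= i
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))  # -inf becomes weight 0 after softmax
weights = scores.softmax(dim=-1)                  # future positions receive zero weight
```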


3. Attention mechanism Attention

The so-called attention mechanism is a mechanism for focusing on local information, for example a particular region of an image. The attended region tends to change as the task changes. In essence it is a set of weight coefficients learned by the network itself, a "dynamic weighting" that emphasizes the region of interest while suppressing the irrelevant background.

In the field of computer vision, attention mechanisms can be roughly divided into two categories: hard attention and soft attention. Hard attention is a kind of stochastic prediction that emphasizes dynamic change; although it can work well, its non-differentiability severely limits its application. Soft attention, on the contrary, is differentiable everywhere, so it can be learned by gradient-descent training of a neural network, which makes it much more widely used. Soft attention can act along different dimensions, such as channel, space, time, and category.

The current mainstream attention mechanism can be divided into the following three types: channel attention, spatial attention, and self-attention.

According to how attention is triggered, it can be divided into two categories: involuntary cues and voluntary cues. An involuntary cue is an attentional tendency caused by the salient features of the object itself, while a voluntary cue is an attentional tendency toward an object given prior weight under the intervention of prior knowledge. In other words, involuntary cues come from the object itself, while voluntary cues come from a subjective tendency.

For a more detailed understanding of the difference between the attention and self-attention mechanisms, refer to the blog post linked in the original article.


4. Self-Attention

  1. q stands for query, which will later be matched against each k
  2. k stands for key, which will be matched by each q
  3. v stands for value, the information extracted from the input
  4. The matching of q and k can be understood as computing the correlation between the two; the greater the correlation, the greater the weight given to the corresponding v
  5. When q=k=v, i.e. they all come from the same input, it is called the self-attention mechanism


Imagine that the elements of the input source consist of a series of <Key, Value> pairs. Given some Query element of the target, the similarity or correlation between the Query and each Key is computed to obtain a weight coefficient for each Key's corresponding Value, and the Values are then weighted and summed to produce the final attention value. So in essence, the attention mechanism is a weighted sum of the Values of the input elements, with Query and Key used only to compute the weight coefficients for the Values.

There is another way to understand query, key and value: as a kind of soft addressing. Value can be regarded as the content stored in memory and Key as its address. Taking out the Value stored at the address where Key == Query is hard addressing. Soft addressing instead addresses by computing the similarity between Key and Query: rather than retrieving the Value at a single Key address, it obtains a weighted sum over the Values at all addresses, where each Value's weight (importance) is computed from the similarity between its Key and the Query. The final output is the weighted sum of all Values with their weights.

Abstracting most current methods, the specific computation of the attention mechanism can be summarized as two processes: first, compute the weight coefficients from the Query and the Keys; second, take a weighted sum of the Values according to those coefficients. The first process can be further subdivided into two stages: stage one computes the similarity or correlation between the Query and each Key; stage two normalizes the raw scores from stage one.
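A minimal sketch of these two processes (scaled dot-product attention), assuming q, k, v of shape (batch, seq_len, d).

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d)."""
    d = q.size(-1)
    # Stage 1a: similarity between the Query and every Key (dot product), scaled by sqrt(d)
    scores = q @ k.transpose(-2, -1) / d ** 0.5    # (batch, seq_len, seq_len)
    # Stage 1b: normalize the raw scores into weight coefficients
    weights = scores.softmax(dim=-1)
    # Stage 2: weighted sum of the Values
    return weights @ v                             # (batch, seq_len, d)
```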


5. Multi-Head Attention

According to the model diagram, the original V, K and Q are first passed through separate linear layers that project them to a lower dimension; Scaled Dot-Product Attention is then performed h times to obtain h outputs, which are concatenated and passed through a final linear projection.

Through multiple parallel runs of an attention mechanism, the independent attention outputs are concatenated and transformed linearly into the desired dimension. Intuitively, multiple attention heads allow attention operations on different parts of the sequence, allowing the model to simultaneously attend to information from different representation subspaces at different locations.

Multi-head attention works by implementing multiple attention modules in parallel utilizing multiple different versions of the same query. The idea is to linearly transform the query using different weight matrices to get multiple queries. Each newly formed query essentially requires different types of relevant information, allowing the attention model to introduce more information in the context vector computation.

Each head creates its own representation of the query and the input matrix, allowing the model to learn more information. For example, when training a language model, one attention head can learn to attend to the relationship of certain verbs (e.g., walk, drive, buy) to nouns (e.g., student, car, apple), while another attention head learns to attend to the relationship of pronouns (e.g., he, she, it) to nouns; in this sense the heads behave somewhat like multiple convolution kernels.
Each head also creates its own vector of attention scores and a corresponding vector of attention weights.
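A minimal multi-head attention sketch along these lines (not the exact code of any particular library): project q, k, v, split them into h lower-dimensional heads, run scaled dot-product attention in parallel, concatenate, and apply a final linear projection.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal sketch: h parallel heads, each a lower-dimensional scaled dot-product attention."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # linear projections, later split across heads
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # final linear projection after concatenation

    def forward(self, q, k, v):
        b, n, _ = q.shape
        def split(x, w):
            # project, then reshape to (batch, heads, seq_len, d_head)
            return w(x).view(b, -1, self.h, self.d_head).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        out = scores.softmax(dim=-1) @ v                    # h attentions computed in parallel
        out = out.transpose(1, 2).reshape(b, n, -1)         # concatenate the heads
        return self.w_o(out)
```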


B. VIT (Vision Transformer) model

Since the outbreak of deep learning, CNN has been the mainstream model in the CV field, and has achieved good results. In contrast, the Transformer based on the self-attention structure shines in the NLP field. Although the Transformer structure has become the standard in the NLP field, its application in the field of computer vision is still very limited.

ViT (Vision Transformer) is a model proposed by Google in 2020 that applies the Transformer directly to image classification. According to the experiments in the paper, the best model can reach 88.55% accuracy on ImageNet-1K (after first being pre-trained on Google's own JFT dataset), showing that the Transformer is indeed effective in the CV field, and the effect is quite striking. ViT promotes the unification of NLP and CV and has pushed forward the multimodal field.

1. transformer and image

The sequence length supported by the hardware is generally a few hundred to a few thousand, and the Transformer needs to convert a 2D image into a 1D sequence. For example, with an input image of 224x224, flattening at the pixel level gives a sequence of length 224x224 = 50176, and the quadratic complexity of self-attention over such a sequence is far too large.
How to reduce the sequence length:
1. Use a feature map as the input. For example, the later-stage feature maps of a ResNet are relatively small and can be used as the Transformer input.
2. Stand-Alone Attention: isolated self-attention, restricting attention to small local windows to control the input size.
3. Axial Attention: split the height and width (H and W) into two 1-D sequences and apply self-attention along each axis separately.
4. ViT divides the image into N patches, and each patch is one input element. A picture then corresponds to a sentence, and a patch corresponds to a word.

2. Model Analysis


  1. Take an image (lower left corner of the figure) and divide it into N patches
  2. Treat the patches as elements and arrange them into a sequence
  3. Each patch is processed by the Linear Projection of Flattened Patches layer, which is a fully connected layer
  4. Add a position code to each patch to get the Patch + Position Embedding tokens
  5. Add an extra learnable class token (0*) with its own position code; it gathers the information it needs as the tokens exchange information with each other
  6. Apply DropOut/DropPath regularization
  7. Pass the tokens into the Transformer Encoder module
  8. Classify from the class token embedding (0*) through the MLP Head classification head
  9. Train the model with a cross-entropy loss


1. Embedding layer

As shown in the figure, the Transformer Encoder reads the image one small block at a time. The reason is that each small image block is treated as a token (in NLP, a token can be understood as a word or character), and the Transformer then computes the correlation between each pair of tokens.

This is very different from a convolutional neural network. A CNN keeps downsampling via convolution + pooling, so in theory the model can enlarge its receptive field by going deeper. However, this has two disadvantages:
1. In practice, a CNN's response near image edges is weak. This is easy to understand: pixels closer to the edge are covered by fewer convolutions, so they naturally contribute less when gradients are updated.
2. A CNN can only compute correlations with neighboring pixels. Because of its sliding-window convolution, it cannot jointly compute non-neighboring pixels; for example, a pixel in the upper-left corner can never be convolved together with a pixel in the lower-right corner, so some spatial information is unavailable. At the same time, according to the MAE paper, natural images are redundant, i.e. adjacent pixels carry similar information, so computing only over a local neighborhood does not make maximal use of image features.

Back to ViT: just splitting the image into small patches is not enough. What the Transformer Encoder needs is a tensor of shape [num_token, token_dim]; image data of shape [H, W, C] does not meet this requirement, so it has to be converted, and the image must be turned into tokens by this Embedding layer.

Take ViT-B/16 as an example: suppose the input image is 224x224x3 and each patch of the original image is 16x16x3. The image can then be split into (224/16)^2 = 196 patches, i.e. 196 image blocks, and each block is linearly mapped to a one-dimensional vector of length 16x16x3 = 768. The final result is 196 tokens with shape [196, 768].

It should also be noted that a positional encoding is added to the tokens. Because of the self-attention mechanism, without position information the patches cut from the picture would effectively be unordered, so adding a position code gives higher classification accuracy. The positional encoding can be 1-D, 2-D or relative, but the difference between them is not large.

In addition, the input also needs an extra class token (0*). Whether in NLP or with images, the class token (0*) mainly serves as an overall feature of the input; its shape is [1, 768], which makes it easy to concatenate directly with the other tokens. This is because the classification information needs to be taken out later for the prediction, and the shape therefore changes from [196, 768] to [197, 768].
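A sketch of this Embedding layer for ViT-B/16. The linear projection of flattened patches is written here as a 16x16, stride-16 convolution, which is a common implementation trick; the module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Sketch of the ViT-B/16 Embedding layer: 224x224x3 -> 196 tokens of dim 768,
    plus a class token and learnable position embeddings -> [197, 768]."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2                  # (224/16)^2 = 196
        # Linear projection of flattened patches as a stride-16, 16x16 convolution:
        # each 16*16*3 = 768-value patch becomes one 768-d token
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))              # the extra class token (0*)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # learnable 1-D position codes

    def forward(self, x):                                            # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)                  # (B, 196, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                               # (B, 197, 768)
        return x + self.pos_embed                                    # add the positional encoding
```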

2. Transformer Encoder module

The Transformer Encoder stacks the encoder block several times, similar to a conventional CNN. It mainly consists of the following parts (a sketch of one block follows the list):

  1. Layer Normalization:
    Compared with Batch Normalization: BN computes the mean and variance of a given feature channel across all samples and then normalizes the neurons of that channel, while LN computes the mean and variance over all features of a single sample and then normalizes that sample. BN is suited to situations where the data distribution of different mini-batches does not vary much, it needs extra variables to store the mean and variance of each node (so its space consumption is slightly larger), and it requires mini-batches. LN only needs a single sample to perform normalization, which avoids BN's sensitivity to the mini-batch data distribution and does not need extra space to store per-node means and variances.

  2. K, Q, V:
    K = Q = V, each of shape [197, 768]; inside the module each head works on a dimension of 197 x (768 / number of heads), and after the head outputs are concatenated the dimension is restored to 768.

  3. Multi-Head Attention:
    This structure is a kind of self-attention, which is described in detail above.

  4. DropOut/DropPath:
    The ViT-B/16 model uses DropOut, but the actual reproduced code uses DropPath. The two have little impact on the final result, so I won't go into details here.

  5. MLP Block:
    It consists of a fully connected layer + GELU activation function + DropOut, arranged as an inverted bottleneck: the first fully connected layer expands the channels to 4 times the input width, and the second fully connected layer restores the original width. The tensor shape is therefore unchanged after the MLP Block.

  6. Lx:
    the block is stacked L times
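A sketch of a single encoder block built from the parts above (LayerNorm, multi-head self-attention, residual connections, and the 4x MLP block), using PyTorch's nn.MultiheadAttention for brevity; the real ViT code differs in details such as DropPath.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Sketch of one ViT encoder block (stacked L times); shapes stay [B, 197, 768]."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4, drop=0.):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=drop, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                       # inverted-bottleneck MLP block
            nn.Linear(dim, dim * mlp_ratio),            # expand channels to 4x
            nn.GELU(),
            nn.Dropout(drop),
            nn.Linear(dim * mlp_ratio, dim),            # restore the original width
            nn.Dropout(drop),
        )

    def forward(self, x):                               # x: (B, 197, 768)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # q = k = v (self-attention) + residual
        x = x + self.mlp(self.norm2(x))                      # MLP block + residual
        return x
```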

3. MLP Head module

After the Transformer Encoder, the output tensor has the same shape as the input tensor. Taking ViT-B/16 as an example, the input is [197, 768] and the output is still [197, 768]. If the downstream task is classification, the corresponding class token (0*) needs to be extracted to obtain the classification result. The Vision Transformer paper describes the MLP Head as fully connected layer + Tanh activation + fully connected layer, but in practice a single fully connected layer is enough for classification; if the probability of each class is needed afterwards, a softmax activation is appended.
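A minimal sketch of this simplified head: take the class token from the encoder output and apply a single fully connected layer, with softmax only when probabilities are wanted.

```python
import torch
import torch.nn as nn

encoder_out = torch.randn(8, 197, 768)       # output of the stacked encoder blocks
head = nn.Linear(768, 1000)                  # a single fully connected layer is enough in practice

cls_token = encoder_out[:, 0]                # extract the class token (0*) -> (8, 768)
logits = head(cls_token)                     # (8, 1000)
probs = logits.softmax(dim=-1)               # softmax only if per-class probabilities are needed
```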

C. Bottleneck Transformer (BoTNet) network hybrid model

A simple but powerful backbone that incorporates self-attention into a variety of computer vision tasks, including image classification, object detection and instance segmentation; it is a combination of CNN + attention. The method significantly improves the baselines for instance segmentation and object detection while also reducing parameters, so the extra latency is minimal.

One reason to combine convolution and self-attention is that a pure ViT-style structure is particularly sensitive to the input size, which cannot be changed (it is fixed at 224 x 224). However, the inputs for tasks such as object detection and instance segmentation may be large images of, say, 1024 x 1024; if a pure Transformer were trained directly on such inputs, the computation could be too much for the machine.

BoTNet integrates the attention module into the original CNN backbone: within ResNet only, it replaces the 3x3 convolution with Multi-Head Self-Attention (MHSA) and makes no other changes. The ResNet model is covered in my previous blog post.

1. Multi-Head Self-Attention module


This MHSA block is the core innovation. The relative position codes for height and width are Rh and Rw respectively; q, k, v denote query, key and value; the ⊕ and ⊗ symbols denote element-wise summation and matrix multiplication respectively, and 1x1 denotes a pointwise convolution. The blue parts are the position encodings and the value projection.

Specific steps:

  1. Pass in the input X with format H × W × d, where H and W are the height and width of the input feature matrix and d is the dimension of a single token
  2. Project X with WQ, WK and WV to obtain q, k and v respectively
  3. Initialize two learnable parameter vectors Rh and Rw, representing the position codes along the height and width respectively, and add them via the broadcast mechanism, i.e. the two-dimensional position code at (i, j) is the sum of the two d-dimensional vectors Rhi + Rwj (this differs from ViT's one-dimensional positional encoding). The output r represents the position code.
  4. There are now four quantities, q, k, v and r, each with matrix format H × W × d. Matrix-multiply q with r to obtain the content-position term qrT, and matrix-multiply q with k to obtain the content-content term qkT.
  5. Add qrT and qkT, and apply softmax normalization to the resulting matrix; the processed output has matrix format HW × HW.
  6. Finally, matrix-multiply this output with the values v to obtain the output Z (a code sketch of these steps follows the list).
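A single-head sketch that follows the steps above (the actual BoTNet block uses 4 heads and differs in some implementation details, such as score scaling); shapes are as described in the list.

```python
import torch
import torch.nn as nn

class MHSA2d(nn.Module):
    """Single-head sketch of the MHSA block described above (the real block uses 4 heads)."""
    def __init__(self, dim, height, width):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)            # 1x1 pointwise convolutions produce q, k, v
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        # learnable position codes for height and width (step 3)
        self.rel_h = nn.Parameter(torch.randn(1, dim, height, 1))
        self.rel_w = nn.Parameter(torch.randn(1, dim, 1, width))

    def forward(self, x):                          # x: (B, d, H, W)
        B, d, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (B, HW, d)
        k = self.k(x).flatten(2)                   # (B, d, HW)
        v = self.v(x).flatten(2).transpose(1, 2)   # (B, HW, d)
        r = (self.rel_h + self.rel_w).flatten(2)   # broadcast: r[i, j] = Rh_i + Rw_j -> (1, d, HW)
        content_content = q @ k                    # q k^T  (B, HW, HW)        (step 4)
        content_position = q @ r                   # q r^T  (B, HW, HW)
        # (step 5) add the two terms and softmax-normalize; scaling by sqrt(d)
        # is omitted here to mirror the steps in the text
        attn = (content_content + content_position).softmax(dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, d, H, W)    # (step 6) weighted sum -> Z
```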

It is different from the MHSA in the previous Transformer:

  1. The processing object is not one-dimensional but, as in a CNN model, a 2-D feature map, so it carries many CNN-related characteristics.
  2. Normalization here uses Batch Norm rather than Layer Norm, consistent with CNNs.
  3. Non-linear activations: BoTNet uses three non-linear activations in the block
  4. The content-position module on the left introduces two-dimensional position encoding, which is the biggest difference from Transformer.


BoTNet replaces the 3x3 convolution in the bottlenecks of the fourth block group (C5) of ResNet with the MHSA (Multi-Head Self-Attention) module, forming a new module named Bottleneck Transformer (BoT). With only the ResNet-50 C5 structure replaced, the final model reaches 44.4% Mask AP and 49.7% Box AP for Mask R-CNN instance segmentation on the COCO dataset. On the ImageNet classification task it reaches a top-1 accuracy of 84.7%, while being 2.33 times faster than EfficientNet.

Since MHSA is so powerful, why not replace all the 3x3 convolutional layers from C2 to C5 with MHSA? The paper notes that replacing all of them would increase the computation geometrically or even exponentially, while the benefit would be nowhere near proportional. So in the end only the C5 structure is modified, and even within C5 not every bottleneck necessarily uses MHSA.

2. Bottleneck Transformer

The Bottleneck Transformer is obtained by adding a 1 × 1 convolution before and after the Multi-Head Self-Attention structure. The Bottleneck Transformer and the Transformer block in ViT are in fact related and not very different structurally. The authors note in the paper that a ResNet bottleneck block with MHSA can be regarded as a Transformer block with a bottleneck structure plus some minor differences (residual structure, normalization layers, etc.).
The rightmost one is a single C5 layer. A 1 × 1 contraction reduces the previous layer's output from 2048 to 512 dimensions (÷4) as the input, an H × W × 512 output is obtained through the MHSA module, and a 1 × 1 expansion then enlarges 512 back to 2048 (×4) as the output.
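A sketch of this C5 block under those assumptions; it reuses the MHSA2d class sketched in the previous section and keeps the usual ResNet-style BatchNorm/ReLU and residual connection, so the exact layer placement is illustrative rather than the paper's code.

```python
import torch
import torch.nn as nn

class BoTBlock(nn.Module):
    """Sketch of a C5 Bottleneck Transformer block: 1x1 contraction -> MHSA -> 1x1 expansion.
    Assumes the MHSA2d sketch defined earlier in this post is in scope."""
    def __init__(self, in_ch=2048, mid_ch=512, height=14, width=14):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 1, bias=False),
                                    nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.mhsa = MHSA2d(mid_ch, height, width)        # replaces the 3x3 convolution
        self.expand = nn.Sequential(nn.Conv2d(mid_ch, in_ch, 1, bias=False),
                                    nn.BatchNorm2d(in_ch))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                                # x: (B, 2048, H, W)
        out = self.expand(self.mhsa(self.reduce(x)))     # 2048 -> 512 -> MHSA -> 2048
        return self.act(out + x)                         # residual connection, as in ResNet
```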

D. Summary

This concludes the basic study of the principles and model structures of the Transformer, ViT and BoTNet. I have not put them to actual use yet, for lack of a suitable opportunity and because of hardware limitations: although the experimental results reached SOTA at the time, these models need far more data than CNNs. Their accuracy when trained on small datasets is worse than CNNs, while ViT is more accurate on large datasets; more data and a better training platform are exactly what I lack. An intuitive explanation for this difference: because of the unique self-attention mechanism, ViT makes more use of the information between tokens across pixels, while a CNN only computes over neighboring pixels, so with the same number of parameters ViT obtains more information and can to some extent be regarded as a deeper model. ViT therefore underfits on small datasets. The common practice in real development is to pre-train on a dataset with tens of millions of images to obtain pre-trained weights, and then transfer to the small dataset. Although the paper describes ViT as relatively cheap in terms of training cost, that is only relative to such large datasets; at the small and medium data scale, CNNs are still very practical.


Origin blog.csdn.net/qq_45848817/article/details/127111460