Transformer variant—Swin Transformer

Paper name: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Original paper address: https://arxiv.org/abs/2103.14030

Official open source code address: https://github.com/microsoft/Swin-Transformer

This is a very impressive paper: it won the ICCV 2021 Best Paper award.

This article draws heavily on the CSDN blog of "Sunflower's Little Mung Bean" (a blogger covering deep learning, TensorFlow, and software installation); it is intended for self-study only. The blogger also has a video walkthrough that explains things very clearly.

1. Overall framework of the network

First, compare Swin Transformer with the earlier Vision Transformer, as shown below.

Difference 1: Swin Transformer uses a hierarchical construction (hierarchical feature maps) similar to a convolutional neural network: as the network deepens, the height and width of the feature maps shrink. For example, the feature maps are downsampled by 4x, 8x, and 16x relative to the input image. Such a backbone is convenient for building object detection, instance segmentation, and other downstream tasks on top of it. In the earlier Vision Transformer, the input is downsampled by 16x from the very beginning, and all subsequent feature maps keep that same downsampling rate.
Difference 2: Swin Transformer applies Windows Multi-Head Self-Attention (W-MSA) on its feature maps. For example, at the 4x and 8x downsampling levels in the figure below, the feature map is divided into multiple disjoint windows, and Multi-Head Self-Attention is computed only within each window. Compared with running Multi-Head Self-Attention over the entire feature map as Vision Transformer does, this greatly reduces the amount of computation, especially for large shallow feature maps. However, it also cuts off information exchange between different windows, so the paper further proposes Shifted Windows Multi-Head Self-Attention (SW-MSA), which allows information to be passed between adjacent windows.
 

 Architecture diagram of Swin Transformer (Swin-T) network:

First, a three-channel image is fed into the Patch Partition module, which splits it into patches: every 4x4 block of adjacent pixels forms one patch, which is then flattened along the channel direction. For an RGB image, each patch contains 4x4=16 pixels, and each pixel has R, G, and B values, so flattening yields 16x3=48 values. The image shape therefore changes from [H, W, 3] to [H/4, W/4, 48] after Patch Partition.

Next, the Linear Embedding layer applies a linear transformation to the channel data of each patch, mapping it from 48 to C, so the shape changes from [H/4, W/4, 48] to [H/4, W/4, C]. In the source code, Patch Partition and Linear Embedding are actually implemented together as a single convolutional layer, exactly like the Embedding layer described for Vision Transformer earlier.
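Below is a minimal sketch, assuming PyTorch, of how Patch Partition and Linear Embedding can be fused into one 4x4 convolution with stride 4; the class and variable names are illustrative, not the official ones.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Fused Patch Partition + Linear Embedding: a 4x4 conv with stride 4
    projects each 4x4x3 = 48-value patch directly to C channels."""
    def __init__(self, in_chans=3, embed_dim=96, patch_size=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: [B, 3, H, W]
        x = self.proj(x)                       # [B, C, H/4, W/4]
        x = x.flatten(2).transpose(1, 2)       # [B, (H/4)*(W/4), C]
        return self.norm(x)

# quick shape check with a 224x224 image and C = 96 (the Swin-T setting)
print(PatchEmbed()(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 3136, 96])
```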

Feature maps of different sizes are then built through four stages. Apart from Stage 1, which contains the Linear Embedding layer, the remaining three stages each start with a Patch Merging layer for downsampling, followed by a stack of Swin Transformer Blocks. The Block comes in two variants, shown in figure (b) above: one uses the W-MSA structure and the other uses the SW-MSA structure. The two are always used in pairs, so the number of stacked Swin Transformer Blocks is always even.

Finally, for the classification network, a Layer Norm layer, a global average pooling layer, and a fully connected layer are appended to produce the final output.
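As a rough sketch of that classification head (assuming PyTorch; the dimensions are the Swin-T values, where Stage 4 outputs 8C = 768 channels for a 224x224 input):

```python
import torch
import torch.nn as nn

norm = nn.LayerNorm(768)                 # Layer Norm over the Stage 4 token features
head = nn.Linear(768, 1000)              # fully connected layer producing class logits

tokens = torch.randn(1, 49, 768)         # [B, (H/32)*(W/32), 8C] from Stage 4
logits = head(norm(tokens).mean(dim=1))  # global average pooling over tokens, then FC
print(logits.shape)                      # torch.Size([1, 1000])
```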

2. Patch Merging

In each stage except Stage 1, a Patch Merging layer is first used for downsampling. After Patch Merging, the height and width of the feature map are halved and the number of channels is doubled. Suppose the input to Patch Merging is a 4x4 single-channel feature map. Patch Merging groups every 2x2 block of adjacent pixels into a patch, then gathers the pixels at the same position (same color in the figure) from every patch, producing 4 feature maps. These four feature maps are concatenated along the depth direction and passed through a LayerNorm layer. Finally, a fully connected layer applies a linear transformation along the depth direction, halving the concatenated depth (from 4C to 2C, where C is the input depth). This simple example shows that after Patch Merging, the height and width of the feature map are halved and the depth is doubled.
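A minimal sketch of this Patch Merging step, assuming PyTorch and a [B, H, W, C] layout; the class name is illustrative:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Gather the four pixels of every 2x2 neighbourhood, concatenate them
    along the channel dimension (C -> 4C), apply LayerNorm, then a linear
    layer reduces 4C -> 2C, so H and W are halved and the depth doubles."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: [B, H, W, C], H and W even
        x0 = x[:, 0::2, 0::2, :]                 # top-left pixel of each 2x2 patch
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # [B, H/2, W/2, 4C]
        return self.reduction(self.norm(x))      # [B, H/2, W/2, 2C]

print(PatchMerging(dim=96)(torch.randn(1, 56, 56, 96)).shape)  # torch.Size([1, 28, 28, 192])
```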

3. Windows Multi-Head Self-Attention (W-MSA)

Purpose: to reduce the amount of calculation

Disadvantage: information cannot be exchanged between windows

Ordinary Multi-Head Self-Attention computes q, k, and v for every pixel and then matches each q against every k. Swin Transformer instead divides the feature map into several windows and performs Self-Attention within each window separately, which requires much less computation. At the same time, there is no connection between windows, which shrinks the effective receptive field and can hurt accuracy.
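The paper quantifies this saving. For a feature map with h x w tokens, channel dimension C, and window size M (M = 7 by default), the computational costs are:

$$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C$$

$$\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC$$

The second term drops from quadratic in the number of tokens hw to linear once M is fixed, which is exactly where the saving on large shallow feature maps comes from.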

4. Shifted Windows Multi-Head Self-Attention (SW-MSA)

Purpose: to solve the problem that information cannot be transferred between windows

The windows of the previous layer are shifted to the right and down by two pixels each, i.e. shifted diagonally by half the window size, giving the layout in the figure below.

After re-partitioning, regions that previously belonged to different windows are now mixed together, which is exactly what we want. But a new problem appears: the previous layer had 4 windows, and now there are 9. To compute attention, the smaller windows around the border would all have to be padded up to 4x4, which increases the amount of computation again. The paper then solves this problem as follows.

How to solve it: with the method called Efficient batch computation for shifted configuration. The three windows in the first row are moved as a whole to the last row, then the three windows in the first column are moved as a whole to the rightmost column, after which the feature map can again be divided into four windows of the original size, so the attention computation costs the same as before. This, however, introduces yet another problem: some of the four new windows are stitched together from pieces that were not adjacent in the original feature map, so regions that should not attend to each other end up sharing one large window.
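This row/column move amounts to a cyclic shift of the whole feature map, which can be expressed with torch.roll; a minimal sketch (assuming PyTorch, with illustrative sizes):

```python
import torch

# Cyclic shift for "Efficient batch computation for shifted configuration":
# roll the feature map up and to the left by half a window, compute the
# (masked) window attention as usual, then roll back.
window_size, shift_size = 4, 2                 # shift by half the window size
x = torch.randn(1, 8, 8, 96)                   # [B, H, W, C]

shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
# ... partition `shifted` into window_size x window_size windows and run masked W-MSA ...
restored = torch.roll(shifted, shifts=(shift_size, shift_size), dims=(1, 2))
print(torch.allclose(x, restored))             # True: the reverse roll restores every position
```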

So the solution is masked MSA, i.e. MSA with a mask: by setting a mask, the information from different regions can be computed separately even though they share one large window. Every pixel in the window still computes q, k, and v, and its q is matched against the k of every pixel in the window. Suppose \alpha(0,0) denotes the result of matching q of pixel 0 against k of pixel 0; then pixel 0 produces \alpha(0,0) through \alpha(0,15). For masked MSA, if pixel 0 belongs to region 1 of the large window, we only want it to attend to the pixels in region 1, so 100 is subtracted from the matching results between pixel 0 and all pixels in other regions. The \alpha values are typically small fractions, so after subtracting 100 they become large negative numbers, and after SoftMax the corresponding weights are essentially zero. Finally, after the computation, the data is rolled back to its original position.
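A sketch of how such a mask can be built (assuming PyTorch; sizes and names are illustrative): label every pixel of the shifted feature map with its region id, flatten each window, and add -100 wherever the query and key labels differ.

```python
import torch

H = W = 8
window_size, shift_size = 4, 2

# 1) label each pixel of the shifted feature map with the id of its region
img_mask = torch.zeros(1, H, W, 1)
cnt = 0
for hs in (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None)):
    for ws in (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None)):
        img_mask[:, hs, ws, :] = cnt
        cnt += 1

# 2) partition into non-overlapping windows, flatten each to M*M region labels
mask_windows = img_mask.view(1, H // window_size, window_size, W // window_size, window_size, 1)
mask_windows = mask_windows.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size)

# 3) attn_mask[w, i, j] = 0 if pixels i and j of window w share a region, else -100;
#    it is added to the q.k scores before SoftMax so cross-region weights vanish
diff = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
attn_mask = torch.zeros_like(diff).masked_fill(diff != 0, -100.0)
print(attn_mask.shape)   # torch.Size([4, 16, 16]): 4 windows, 16x16 q/k pairs each
```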

The figure above is from the blogger referenced at the beginning.

5. Relative Position Bias

 

Simply put, the relative position index is obtained by taking the absolute position index of the current pixel and subtracting the absolute position indices of the other pixels, as shown in the figure, which makes it very clear. Each relative position index matrix is then flattened row by row, and the flattened rows are stacked to give the 4x4 matrix below.

The corresponding bias parameters are then looked up according to the relative position index. For example, the yellow pixel is to the right of the blue pixel, so its relative position index with respect to the blue pixel is (0, −1). The green pixel is to the right of the red pixel, so its relative position index with respect to the red pixel is also (0, −1). Since the two relative position indices are the same, they share the same relative position bias parameter.

For convenience, the two-dimensional index is converted into a one-dimensional one. The problem is that a naive conversion by simply adding the two components would map both (0, −1) and (−1, 0) to −1, so a different scheme is used:
(1) First add M−1 to every component of the relative position index (M is the window size, here M=2), i.e. add 1 to each component, so no negative numbers remain.

(2) Then multiply all row indices by 2M−1.

(3) Finally, add the row index and the column index together.

After these steps, the ambiguity described above is resolved.

The relative position bias parameter B in the formula is then obtained by looking up the relative position bias table with the relative position index computed above; the table has (2M − 1) × (2M − 1) entries.
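A minimal sketch of this index construction and table lookup, assuming PyTorch, with M = 2 as in the example above (the variable names are illustrative):

```python
import torch
import torch.nn as nn

M, num_heads = 2, 3
# learnable relative position bias table with (2M-1) x (2M-1) entries per head
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))

coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M),
                                    indexing="ij")).flatten(1)         # [2, M*M] absolute (row, col)
relative = (coords[:, :, None] - coords[:, None, :]).permute(1, 2, 0)  # [M*M, M*M, 2] pairwise offsets
relative[:, :, 0] += M - 1        # (1) shift row offsets so there are no negatives
relative[:, :, 1] += M - 1        #     shift column offsets the same way
relative[:, :, 0] *= 2 * M - 1    # (2) multiply the row labels by 2M-1
index = relative.sum(-1)          # (3) add row and column labels -> unique 1-D index

B = bias_table[index.view(-1)].view(M * M, M * M, num_heads)  # relative position bias B
print(index)       # the 4x4 relative position index matrix from the example
print(B.shape)     # torch.Size([4, 4, 3])
```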

6. Summary

The end of the universe is mathematics, I love mathematics
