Intensive reading of the Swin Transformer paper: notes from Li Mu's AI paper-reading series on Bilibili

Intensive reading of the Swin Transformer paper

https://www.bilibili.com/video/BV13L4y1475U

Swin covers almost all downstream tasks in CV (downstream tasks are the tasks solved by the head attached after the backbone network, such as classification, detection, and semantic segmentation), and it has refreshed the leaderboards of multiple datasets.

Title: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, i.e., a hierarchical vision Transformer that uses shifted windows. Swin hopes that, like a convolutional network, the Transformer can be divided into several stages and perform hierarchical feature extraction, so that the extracted features have a notion of hierarchy.

Abstract

This paper proposes a new vision Transformer, called Swin Transformer, which can serve as a general-purpose backbone for computer vision. The challenges in transferring Transformers from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities (the same object may appear at different sizes in different images) and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted-window scheme improves efficiency (because self-attention is computed within each window, the sequence length is greatly reduced): self-attention is limited to non-overlapping local windows while still allowing cross-window connections (through shifting, adjacent windows interact, so cross-window connections are established between successive layers, which in effect yields a global modeling capability). The benefits of this hierarchical structure: ① it can flexibly provide features at various scales (modeling flexibility); ② its computational complexity is linear in the image size (self-attention is computed in small windows, so the cost grows linearly rather than quadratically with image size). These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state of the art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.

Introduction

The gist of the first two paragraphs: in vision, convolutional networks used to be dominant, while Transformers work very well in NLP, so Transformers can also be applied to vision; ViT has already demonstrated this. Swin's starting point is to prove that the Transformer can serve as a general-purpose backbone network in the vision domain.


[Figure 1 of the paper: Swin's hierarchical feature maps vs. ViT's single low-resolution feature map]

Although ViT can achieve global modeling through global self-attention, its grasp of multi-scale features is weak. For specific vision tasks such as detection and segmentation, multi-scale features are especially important. ViT only processes features at a single low resolution, namely the feature map obtained after a 16x downsampling rate, so ViT may not be well suited to dense prediction tasks. On the other hand, ViT's self-attention is always performed over the whole image; it is a global operation, and its complexity grows quadratically with image size.

Swin draws on a lot of experience and prior knowledge from convolutional networks. For example, to reduce the sequence length and the computational complexity, Swin proposes computing self-attention within small windows, unlike ViT, which computes it over the whole image. As long as the window size is fixed, the computational cost per window is fixed, and the cost over the entire image grows linearly with the image size. This can also be seen as exploiting the inductive bias of locality used in convolutional networks: different parts of the same object, or different objects with similar semantics, will most likely appear in nearby locations.

How does Swin generate multi-scale features? In convolutional networks, multi-scale features are mainly produced by pooling: the pooling operation enlarges the receptive field seen by each convolution kernel, so that the pooled features can capture objects of different sizes. Correspondingly, Swin proposes a pooling-like operation, patch merging: adjacent small patches are merged into a large patch. The large patch can see everything the small patches saw before, so the receptive field increases, as shown in the figure above (left).

(third paragraph)


[Figure 2 of the paper: the shifted window approach]

(The fourth paragraph introduces the shifting operation; it is explained here in conjunction with Fig. 2.)


The fifth paragraph presents the results achieved by Swin.

The author's outlook in the sixth paragraph: The unified framework of CV and NLP can promote the common development of the two fields.

(Swin makes good use of visual priors and sweeps the board on vision tasks. But in terms of grand unification, ViT is better, because it changes nothing and adds no prior information, so the Transformer is used in exactly the same way in both fields. In that setting the model can share parameters, and the outputs of multiple modalities can even be concatenated into one long input and fed directly to the Transformer, without regard for the differences between the modalities.)

Conclusion

Swin is a new vision Transformer that produces hierarchical feature representations and has computational complexity linear in the input image size. Swin achieves state-of-the-art performance on COCO object detection and ADE20K semantic segmentation, outperforming previous best methods by a wide margin. We hope that Swin's strong performance on a variety of vision problems will encourage unified modeling of visual and linguistic signals. (first paragraph)

The authors state that the most critical contribution of Swin is the shifted-window-based self-attention, which is especially important for downstream vision tasks, in particular dense prediction. They suggest applying shifted-window-based self-attention to NLP as future work. (second paragraph)

Related Work

The authors first briefly discuss convolutional neural networks, then how self-attention or Transformers can help convolutional networks, and finally pure Transformers as vision backbones.
(Swin's related work largely overlaps with ViT's, so it is not repeated here.)

Method

The authors divide this chapter into two parts: 3.1 describes the overall pipeline of Swin, mainly walking through the forward pass and how the patch merging operation is implemented; 3.2 discusses how Swin turns shifted-window-based self-attention into a Transformer block for computation.

Overall Architecture



[Figure 3 of the paper: overall architecture of Swin Transformer (Swin-T)]

(21:06—29:00 According to the model overview Fig. 3, the forward calculation process of Swin is described)

Suppose the input image is the standard ImageNet size $224 \times 224 \times 3$. Because Swin's patch size is $4 \times 4$, the image after patch partition becomes $56 \times 56 \times 48$ (where $56 = 224 / 4$, and $48 = 4 \times 4 \times 3$ is the dimension of each patch vector).

Through Linear Embedding, the vector dimension is projected to a preset size (that is, the size the Transformer accepts). In the Swin paper this hyperparameter is called $C$ (for the Swin-T shown in Fig. 3, $C = 96$). The output of Linear Embedding therefore has size $3136 \times 96$ (where $3136 = 56 \times 56$); that is, the sequence length is $3136$ and the dimension of each token vector is $96$.

The two-step operation of Patch Partition and Linear Embedding in Swin is equivalent to the Patch Projection operation in ViT. In the Swin code, the authors implement it with a single convolution.
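A minimal sketch of this one-step implementation (assuming PyTorch; the layer and variable names are illustrative, and the shapes follow the Swin-T numbers above):

```python
import torch
import torch.nn as nn

# Patch Partition + Linear Embedding in one step: a 4x4 convolution with stride 4
# projects every non-overlapping 4x4x3 patch to a C = 96 dimensional token.
patch_embed = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)             # (B, 3, 224, 224) input image
tokens = patch_embed(x)                     # (1, 96, 56, 56)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 3136, 96): 3136 tokens of dimension 96
print(tokens.shape)                         # torch.Size([1, 3136, 96])
```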

At this point, the sequence entering the Transformer has length $3136$, far longer than ViT's sequence of length $196$, and global self-attention over such a long sequence is not affordable. Swin therefore introduces self-attention based on shifted windows: by default each window contains 49 patches, so the sequence length within a window is only 49. The Swin Transformer Block computes self-attention over these shifted windows.


Next, let us see how Patch Merging is implemented.

[Figure: illustration of the Patch Merging operation]

It should be clear that the input of Patch Merging is $H \times W \times C$, and the output after the four intermediate steps Ⅰ, Ⅱ, Ⅲ, Ⅳ is $\frac{H}{2} \times \frac{W}{2} \times 2C$. That is, Patch Merging realizes the effect of "halving the width and height and doubling the channels".

  • Step Ⅰ divides the given feature map into 4 windows, each containing 4 patches.
  • Step Ⅱ gathers the patches at the same position of every window into a separate feature map, achieving the effect of "halving the width and height".
  • Step Ⅲ stacks these feature maps along the channel dimension, giving "half the width and height, and quadruple the channels".
  • Step Ⅳ uses a $1 \times 1$ convolution to reduce the channel dimension, realizing "halving the width and height and doubling the channels" (a code sketch of the whole operation follows below).
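A minimal sketch of Patch Merging (assuming PyTorch; it follows the slicing-and-concatenation idea described above, with a linear layer playing the role of the $1 \times 1$ convolution):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Halve the spatial resolution and double the channels: (B, H, W, C) -> (B, H/2, W/2, 2C)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)  # plays the role of the 1x1 conv

    def forward(self, x):                        # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                 # patches at position (0, 0) of every 2x2 window
        x1 = x[:, 1::2, 0::2, :]                 # position (1, 0)
        x2 = x[:, 0::2, 1::2, :]                 # position (0, 1)
        x3 = x[:, 1::2, 1::2, :]                 # position (1, 1)
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C): "quadruple the channels"
        x = self.norm(x)
        return self.reduction(x)                 # (B, H/2, W/2, 2C): "double the channels"

x = torch.randn(1, 56, 56, 96)
print(PatchMerging(96)(x).shape)                 # torch.Size([1, 28, 28, 192])
```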

Look back at Fig 3 again.

After the Patch Merging of each stage, the width and height are halved and the channels are doubled (these changes are marked in Fig. 3).

So far, the forward propagation (calculation) of the Swin backbone network is completed.

Swin has four stages; the last three stages start with a pooling-like Patch Merging operation, and self-attention is computed within small windows.



Shifted Window based Self-Attention

Why introduce shifted-window-based self-attention? Global self-attention has quadratic complexity; for downstream vision tasks, especially dense prediction tasks, or when the image size is large, computing global self-attention is very, very expensive! Therefore, the authors propose shifted-window-based self-attention. (first paragraph)


How did the author realize the window division? ( Start 20:45 - End 30:54 ) (The following figure uses the input of Stage 1 as an example to explain the division of windows)

[Figure: window partition of the Stage 1 input]

The input size is $56 \times 56 \times 96$. The feature map is evenly divided into non-overlapping windows, indicated in orange in the figure above. These windows are not the smallest computing unit; the smallest computing unit is the patch. That is, each window contains $m \times m$ patches; in the Swin paper $m = 7$, so each window has $49$ ($= 7 \times 7$) patches. Because self-attention is computed within each window, the sequence length is always $49$. The feature map above is therefore divided into $64$ ($= \frac{56}{7} \times \frac{56}{7}$) windows, and self-attention is computed separately within each of these $64$ windows.
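A minimal sketch of this window partition (assuming PyTorch; `window_partition` is an illustrative helper, not code from the paper, using the Stage 1 sizes $56 \times 56 \times 96$ and $M = 7$):

```python
import torch

def window_partition(x, window_size=7):
    """Split (B, H, W, C) into non-overlapping windows of shape (num_windows*B, M*M, C)."""
    B, H, W, C = x.shape
    M = window_size
    x = x.view(B, H // M, M, W // M, M, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)
    return windows

x = torch.randn(1, 56, 56, 96)     # Stage 1 feature map
windows = window_partition(x)      # (64, 49, 96): 64 windows, each a sequence of 49 tokens
print(windows.shape)
# Self-attention is then computed independently within each length-49 sequence.
```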


Explanation of how the complexity of window-based self-attention is calculated, and how it compares with the complexity of global self-attention. ( start 30:54 - end 34:41 )

(This part of the notes is omitted. I will add it later when needed)
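For reference, the two complexity formulas given in the paper (with $h \times w$ patches, channel dimension $C$, and window size $M$) are:

$$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C$$
$$\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC$$

The first term (the projections of $Q$, $K$, $V$ and the output) is the same in both; the attention term grows quadratically with $hw$ for global MSA but only linearly once attention is restricted to $M \times M$ windows.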


How does the author implement the shifted window? ( start 35:07 - end 36:95 )

[Figure: the window shifting process (left) and two successive Swin Transformer blocks (right)]

The left part of the figure above illustrates the shifting process that enables communication between windows. The arrangement of the Swin Transformer blocks is also deliberate, and consistent with that shifting process: each time, a window-based multi-head self-attention is done first, followed by a shifted-window-based multi-head self-attention (as shown in the right part of the figure above).
$$\hat{\mathbf{z}}^{l} = \text{W-MSA}\big(\text{LN}(\mathbf{z}^{l-1})\big) + \mathbf{z}^{l-1}$$
$$\mathbf{z}^{l} = \text{MLP}\big(\text{LN}(\hat{\mathbf{z}}^{l})\big) + \hat{\mathbf{z}}^{l}$$
$$\hat{\mathbf{z}}^{l+1} = \text{SW-MSA}\big(\text{LN}(\mathbf{z}^{l})\big) + \mathbf{z}^{l}$$
$$\mathbf{z}^{l+1} = \text{MLP}\big(\text{LN}(\hat{\mathbf{z}}^{l+1})\big) + \hat{\mathbf{z}}^{l+1}$$
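A minimal sketch of this pair of blocks (assuming PyTorch; plain multi-head self-attention over each window's tokens is used here as a stand-in for the paper's W-MSA/SW-MSA, and the window shifting and relative position bias are omitted):

```python
import torch
import torch.nn as nn

class SwinBlock(nn.Module):
    """One block of the equations above: z_hat = Attn(LN(z)) + z;  z = MLP(LN(z_hat)) + z_hat."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Stand-in for (S)W-MSA: ordinary multi-head self-attention over each window's tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):                                    # z: (num_windows*B, 49, C)
        h = self.norm1(z)
        z = self.attn(h, h, h, need_weights=False)[0] + z    # first residual connection
        z = self.mlp(self.norm2(z)) + z                      # second residual connection
        return z

# Two successive blocks: in Swin the first uses W-MSA on regular windows and the second
# uses SW-MSA on shifted windows; the shift itself is not shown in this sketch.
blocks = nn.Sequential(SwinBlock(96, num_heads=3), SwinBlock(96, num_heads=3))
tokens = torch.randn(64, 49, 96)               # 64 windows of 49 tokens, dimension 96
print(blocks(tokens).shape)                    # torch.Size([64, 49, 96])
```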


Explanation of how Swin uses masks to make the attention computation over shifted windows more efficient. ( start 37:04 - end 51:00 )

(This part of the notes is omitted. I will add it later when needed)
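(For reference, the efficiency trick is based on a cyclic shift of the feature map; below is a minimal sketch of just that shift, assuming PyTorch and the Stage 1 sizes, with the window partition and the attention-mask construction omitted.)

```python
import torch

x = torch.randn(1, 56, 56, 96)                  # (B, H, W, C) feature map
shift = 7 // 2                                  # shift by half the window size M = 7

# Cyclic shift: roll the feature map so the shifted windows become regular windows again;
# a mask (not shown) then stops tokens that were not originally adjacent from attending
# to each other inside the same window.
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

# ... window partition + masked window attention would run here ...

# Reverse the cyclic shift afterwards to restore the original spatial layout.
restored = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))
print(torch.equal(restored, x))                 # True
```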


Architecture Variants

( start 51:00 - end 51:43 )

In this section the authors propose four variants of Swin: Swin-T, Swin-S, Swin-B, and Swin-L, so as to make a fair comparison with ResNet. The four variants differ in two hyperparameters: ① the vector dimension $C$, and ② the number of Transformer blocks in each stage. (This mirrors the structure of ResNet, which is also divided into four stages with different numbers of residual blocks per stage.)
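For reference, the configurations of the four variants given in the paper can be summarized as follows (a plain summary written here for convenience, not code from the official repository):

```python
# (C, blocks per stage) for the four Swin variants, as reported in the paper.
swin_variants = {
    "Swin-T": {"C": 96,  "blocks": (2, 2, 6, 2)},
    "Swin-S": {"C": 96,  "blocks": (2, 2, 18, 2)},
    "Swin-B": {"C": 128, "blocks": (2, 2, 18, 2)},
    "Swin-L": {"C": 192, "blocks": (2, 2, 18, 2)},
}
```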
