[Paper Notes] Swin Transformer Series Reading Notes

I. Swin Transformer V1

Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

GitHub: https://github.com/microsoft/Swin-Transformer

This paper proposes a general-purpose backbone for computer vision tasks: the Swin Transformer. Swin restricts self-attention to non-overlapping local windows, which greatly reduces the computational cost of attention, and uses a shifted-window mechanism to establish connections between different windows. Swin tops the leaderboards across vision tasks (in a word: strong).

1. Network structure

The network structure of Swin-T is shown below and mainly consists of the Patch Partition, Linear Embedding, Swin Transformer Block, and Patch Merging modules.

(a) Patch Partition / Linear Embedding

Patch Partition and Linear Embedding are combined into a single Patch Embedding, which downsamples the input to 1/4 resolution with a convolution of stride 4 and kernel size 4.
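
A minimal sketch of such a module is shown below, assuming the Swin-T defaults (3 input channels, embed_dim = 96, patch size 4); the class and variable names are illustrative, not the repository's.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Sketch of Patch Partition + Linear Embedding as a single strided
    convolution (kernel = stride = 4), as described above."""
    def __init__(self, in_chans=3, embed_dim=96, patch_size=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                 # x: [N, 3, H, W]
        x = self.proj(x)                  # [N, C, H/4, W/4]
        x = x.flatten(2).transpose(1, 2)  # [N, (H/4)*(W/4), C]
        return self.norm(x)

# x = torch.randn(1, 3, 224, 224); PatchEmbed()(x).shape -> [1, 3136, 96]
```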

(b) Patch Merging

All downsampling after the Patch Partition is implemented by Patch Merging. Assume the input has shape [N, C, H, W]. Each group of 2x2 adjacent embeddings is concatenated along the channel dimension, giving a feature map of shape [N, 4C, H/2, W/2]; a linear projection then produces [N, 2C, H/2, W/2]. This halves the resolution while increasing the feature dimension, as illustrated in the figure below. (The figure is from Dr. Zhu Yi's [Learning Visual Transformer from Scratch]; the link is at the end of the article.)
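
A minimal sketch of Patch Merging under the [N, C, H, W] convention used above; the names are illustrative and the official repository works on a flattened [N, L, C] layout instead.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring embeddings (C -> 4C,
    resolution halved), then reduce the channel dimension to 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                       # x: [N, C, H, W], H and W even
        x0 = x[:, :, 0::2, 0::2]                # top-left of each 2x2 block
        x1 = x[:, :, 1::2, 0::2]                # bottom-left
        x2 = x[:, :, 0::2, 1::2]                # top-right
        x3 = x[:, :, 1::2, 1::2]                # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=1)  # [N, 4C, H/2, W/2]
        x = x.permute(0, 2, 3, 1)               # [N, H/2, W/2, 4C]
        x = self.reduction(self.norm(x))        # [N, H/2, W/2, 2C]
        return x.permute(0, 3, 1, 2)            # [N, 2C, H/2, W/2]
```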

(c) Shifted Window based Self-Attention

Assume the input has shape [batch_size, hw, C]. The computational cost of standard global multi-head self-attention is

\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C

When the feature-map resolution is high, hw is large and the quadratic term makes this very expensive.

To reduce the cost, self-attention is instead computed within non-overlapping windows of size M x M (M is the window size), giving

\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC

Comparing the two, W-MSA replaces the term quadratic in hw with one that is linear in hw (M is fixed, e.g. M = 7), so the saving grows with resolution.
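
A quick back-of-the-envelope check of the two formulas, assuming the Swin-T stage-1 setting (56x56 feature map, C = 96, window size M = 7):

```python
# Plug the Swin-T stage-1 numbers into both complexity formulas.
h = w = 56
C = 96
M = 7

msa   = 4 * h * w * C**2 + 2 * (h * w) ** 2 * C   # global self-attention
w_msa = 4 * h * w * C**2 + 2 * M**2 * h * w * C   # window self-attention

print(f"MSA  : {msa / 1e9:.2f}e9 multiply-adds")   # ~2.00e9
print(f"W-MSA: {w_msa / 1e9:.2f}e9 multiply-adds")  # ~0.15e9, roughly 14x cheaper
```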

Partitioning into windows significantly reduces computation, but it also removes the connections between windows. To restore them, shifted windows are introduced: as shown in the figure below, shifting the window partition in alternating blocks creates connections between neighboring windows. However, shifting the partition directly introduces a new problem: the number of windows increases, and some windows contain fewer than M x M patches.

The solution is relatively simple: instead of padding the extra windows, the whole feature map is cyclically shifted (the torch.roll function) before the regular window partition is applied, and an attention mask prevents patches that are not actually adjacent from attending to each other. In the paper the offset is ⌊M/2⌋, where M is the window size; see the sketch below.
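
A small sketch of the cyclic shift, assuming an [N, H, W, C] layout and Swin-T stage-1 sizes (H = W = 56, C = 96, M = 7); the attention itself and the shifted-window mask are omitted, and the helper name is illustrative.

```python
import torch

def window_partition(x, M):
    """Split a feature map [N, H, W, C] into non-overlapping MxM windows."""
    N, H, W, C = x.shape
    x = x.view(N, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)  # [N*nW, M*M, C]

M = 7
x = torch.randn(1, 56, 56, 96)                         # [N, H, W, C]
# Cyclic shift before SW-MSA, offset = M // 2 as in the paper.
shifted = torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))
windows = window_partition(shifted, M)                 # attention runs per window here
# After attention, roll back with shifts=(M // 2, M // 2) to restore positions.
# Windows that mix content from opposite borders are handled with an attention mask.
```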

(d) Relative position bias

In the window self-attention, a relative position bias B is added to the attention logits: \text{Attention}(Q,K,V)=\text{SoftMax}(QK^{T}/\sqrt{d}+B)V. Because the relative position along each axis lies in [-M+1, M-1], a learnable parameter matrix \widehat{B} \in \mathbb{R}^{(2M-1)\times (2M-1)} is defined, and the values of B \in \mathbb{R}^{M^2 \times M^2} are taken from \widehat{B}. Experiments show that using the relative position bias is better than using no position information or using an absolute position embedding.
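
A minimal single-head sketch of how B can be gathered from \widehat{B} for one M x M window; the variable names are illustrative, and the repository handles multiple heads.

```python
import torch
import torch.nn as nn

M = 7
table = nn.Parameter(torch.zeros((2 * M - 1) ** 2))    # flattened \hat{B}

# Relative coordinates of every pair of positions inside the window.
coords = torch.stack(torch.meshgrid(
    torch.arange(M), torch.arange(M), indexing="ij"))  # [2, M, M]
coords = coords.flatten(1)                              # [2, M*M]
rel = coords[:, :, None] - coords[:, None, :]           # [2, M*M, M*M], in [-M+1, M-1]
rel = rel.permute(1, 2, 0).contiguous()                 # [M*M, M*M, 2]
rel[:, :, 0] += M - 1                                   # shift both axes to [0, 2M-2]
rel[:, :, 1] += M - 1
index = rel[:, :, 0] * (2 * M - 1) + rel[:, :, 1]       # unique id per 2-D offset

B = table[index]                                        # [M*M, M*M], added to QK^T/sqrt(d)
```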

(e) Swin Transformer Block

Swin Transformer blocks come in consecutive pairs: the first block uses W-MSA (window multi-head self-attention) on the regular partition, and the second uses SW-MSA (shifted-window multi-head self-attention), so the two window partitions alternate; a schematic sketch follows.
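
A schematic, runnable sketch of one such block, assuming an [N, H, W, C] input; it substitutes nn.MultiheadAttention for the paper's windowed attention with relative position bias and omits the shifted-window mask for brevity.

```python
import torch
import torch.nn as nn

class TinySwinBlock(nn.Module):
    """LayerNorm -> window attention -> residual, then LayerNorm -> MLP ->
    residual. shift > 0 turns W-MSA into SW-MSA (mask omitted)."""
    def __init__(self, dim=96, M=7, shift=0, heads=3):
        super().__init__()
        self.M, self.shift = M, shift
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                                     # x: [N, H, W, C]
        N, H, W, C = x.shape
        M = self.M
        h = self.norm1(x)
        if self.shift:
            h = torch.roll(h, (-self.shift, -self.shift), dims=(1, 2))
        w = h.reshape(N, H // M, M, W // M, M, C)
        w = w.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)  # windows
        w, _ = self.attn(w, w, w)                               # attention per window
        w = w.reshape(N, H // M, W // M, M, M, C)
        w = w.permute(0, 1, 3, 2, 4, 5).reshape(N, H, W, C)
        if self.shift:
            w = torch.roll(w, (self.shift, self.shift), dims=(1, 2))
        x = x + w
        return x + self.mlp(self.norm2(x))

# A stage stacks these in pairs: TinySwinBlock(shift=0) then TinySwinBlock(shift=7 // 2).
```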

2. Experimental results

1. Image classification on ImageNet-1K

Pre-trained on ImageNet-22K, Swin reaches 87.3% top-1 accuracy on ImageNet-1K (too strong).

2. Object detection on COCO

3. Semantic segmentation on ADE20K

II. Swin Transformer V2

to be completed

Reference link:

1. [Learning Visual Transformer from Scratch]


Origin: blog.csdn.net/qq_40035462/article/details/123939058