MyDLNote-Transformer: Swin Transformer, a Hierarchical Vision Transformer using Shifted Windows

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

https://arxiv.org/pdf/2103.14030.pdf

Code is available at https://github.com/microsoft/Swin-Transformer.

Contents

Abstract

Introduction

Method


Abstract

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.

Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.

These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.

1. Straight to the point: this paper proposes Swin Transformer, a general-purpose backbone for computer vision.

2. Key challenges: the difficulty of moving the Transformer from NLP to CV lies in the large variations in the scale of visual entities and the much higher resolution of image pixels compared with words in text.

3. Solution: a hierarchical Transformer whose representation is computed with shifted windows.

    Advantage of shifted windows: restricting self-attention computation to non-overlapping local windows while still allowing cross-window connections brings higher efficiency.

    Advantage of the hierarchical structure: it offers the flexibility to model at various scales and has linear computational complexity with respect to image size.

4. Results: strong performance on image classification, object detection, and semantic segmentation, with state-of-the-art results on the latter two, demonstrating the potential of Transformer-based models as vision backbones.

The logic of the abstract flows very smoothly.

Introduction

Modeling in computer vision has long been dominated by convolutional neural networks (CNNs). Beginning with AlexNet [38] and its revolutionary performance on the ImageNet image classification challenge, CNN architectures have evolved to become increasingly powerful through greater scale [29, 73], more extensive connections [33], and more sophisticated forms of convolution [67, 17, 81]. With CNNs serving as backbone networks for a variety of vision tasks, these architectural advances have led to performance improvements that have broadly lifted the entire field.

Paragraph 1 introduces the first keyword: backbone.

It mainly reviews CNNs as backbones: with CNNs serving as the backbone networks for a variety of vision tasks, these architectural advances have brought performance improvements that lifted the entire field.

The real intent of this paragraph is to highlight the importance of the backbone: today most networks use a CNN as the backbone, and this backbone is crucial to performance. It foreshadows the paper's ambition to replace the CNN and provide a new general-purpose backbone for CV.

On the other hand, the evolution of network architectures in natural language processing (NLP) has taken a different path, where the prevalent architecture today is instead the Transformer [61]. Designed for sequence modeling and transduction tasks, the Transformer is notable for its use of attention to model long-range dependencies in the data. Its tremendous success in the language domain has led researchers to investigate its adaptation to computer vision, where it has recently demonstrated promising results on certain tasks, specifically image classification [19] and joint vision-language modeling [46].

Paragraph 2 introduces the second keyword: Transformer.

The Transformer has become the prevalent architecture in NLP and has demonstrated strong capability. This performance drives researchers to apply it to CV, in the hope of making it a general-purpose architecture for vision as well.

Paragraph 3 presents the core research content. The paragraph is long, so this post breaks it into three parts.

(1)In this paper, we seek to expand the applicability of Transformer such that it can serve as a general-purpose backbone for computer vision, as it does for NLP and as CNNs do in vision.

(2)We observe that significant challenges in transferring its high performance in the language domain to the visual domain can be explained by differences between the two modalities.

    One of these differences involves scale. Unlike the word tokens that serve as the basic elements of processing in language Transformers, visual elements can vary substantially in scale, a problem that receives attention in tasks such as object detection [41, 52, 53]. In existing Transformer-based models [61, 19], tokens are all of a fixed scale, a property unsuitable for these vision applications.

    Another difference is the much higher resolution of pixels in images compared to words in passages of text. There exist many vision tasks such as semantic segmentation that require dense prediction at the pixel level, and this would be intractable for Transformer on high-resolution images, as the computational complexity of its self-attention is quadratic to image size.

(3)To overcome these issues, we propose a general-purpose Transformer backbone, called Swin Transformer, which constructs hierarchical feature maps and has linear computational complexity to image size. As illustrated in Figure 1(a), Swin Transformer constructs a hierarchical representation by starting from small-sized patches (outlined in gray) and gradually merging neighboring patches in deeper Transformer layers.

    With these hierarchical feature maps, the Swin Transformer model can conveniently leverage advanced techniques for dense prediction such as feature pyramid networks (FPN) [41] or U-Net [50].

    The linear computational complexity is achieved by computing self-attention locally within non-overlapping windows that partition an image (outlined in red). The number of patches in each window is fixed, and thus the complexity becomes linear to image size.

    These merits make Swin Transformer suitable as a general-purpose backbone for various vision tasks, in contrast to previous Transformer based architectures [19] which produce feature maps of a single resolution and have quadratic complexity.

(1) Goal and motivation: expand the applicability of the Transformer so that it can serve as a general-purpose backbone for computer vision, just as it does for NLP and as CNNs do in vision.

(2) Difficulties and challenges: two challenges.

    First, scale. Unlike the word tokens that serve as the basic processing elements in language Transformers, visual elements can vary substantially in scale, an issue that receives attention in tasks such as object detection. In existing Transformer-based models, tokens are all of a fixed scale, a property unsuited to these vision applications.

    Second, resolution. The resolution of pixels in images is much higher than that of words in passages of text. Many vision tasks, such as semantic segmentation, require dense prediction at the pixel level, which is intractable for the Transformer on high-resolution images because the computational complexity of its self-attention is quadratic in image size.

(3) Method and advantages:

    Method: a general-purpose Transformer backbone, Swin Transformer, which constructs hierarchical feature maps and has computational complexity linear in image size.

    Hierarchical feature maps: Swin Transformer builds a hierarchical representation by starting from small patches (outlined in gray in Figure 1) and gradually merging neighboring patches in deeper Transformer layers. With these hierarchical feature maps, the model can conveniently leverage advanced techniques for dense prediction, such as feature pyramid networks (FPN) or U-Net.

    Linear computational complexity: this is achieved by computing self-attention locally within non-overlapping windows that partition the image (outlined in red in Figure 1). The number of patches in each window is fixed, so the complexity is linear in image size.

    These two merits make Swin Transformer suitable as a general-purpose backbone for various vision tasks.

A key design element of Swin Transformer is its shift of the window partition between consecutive self-attention layers, as illustrated in Figure 2. The shifted windows bridge the windows of the preceding layer, providing connections among them that significantly enhance modeling power (see Table 4). This strategy is also efficient in regards to real-world latency: all query patches within a window share the same key set, which facilitates memory access in hardware. In contrast, earlier sliding window based self-attention approaches [32, 49] suffer from low latency on general hardware due to different key sets for different query pixels. Our experiments show that the proposed shifted window approach has much lower latency than the sliding window method, yet is similar in modeling power (see Tables 5 and 6).

Paragraph 4 focuses on a key property: the latency advantage of the shifted window design on real hardware, demonstrated mainly through a comparison between the newly designed shifted window and the existing sliding window. A very important selling point.

In short: a key design element of Swin Transformer is shifting the window partition between consecutive self-attention layers (Figure 2). The shifted windows bridge the windows of the preceding layer, providing connections among them that significantly enhance modeling power (Table 4). The strategy is also efficient in terms of real-world latency: all query patches within a window share the same key set, which facilitates memory access in hardware, whereas earlier sliding-window self-attention methods are inefficient on general hardware because different query pixels use different key sets. Experiments show that the shifted window approach has much lower latency than the sliding window method while being similar in modeling power (Tables 5 and 6).

The proposed Swin Transformer achieves strong performance on the recognition tasks of image classification, object detection and semantic segmentation. It outperforms the ViT / DeiT [19, 60] and ResNe(X)t models [29, 67] significantly with similar latency on the three tasks. Its 58.7 box AP and 51.1 mask AP on the COCO test-dev set surpass the previous state-of-the-art results by +2.7 box AP (Copy-paste [25] without external data) and +2.6 mask AP (DetectoRS [45]). On ADE20K semantic segmentation, it obtains 53.5 mIoU on the val set, an improvement of +3.2 mIoU over the previous state-of-the-art (SETR [78]). It also achieves a top-1 accuracy of 86.4% on ImageNet-1K image classification.

Paragraph 5, experimental conclusions: comparisons with existing ViT/DeiT on image classification, object detection, and semantic segmentation highlight the advantages of the proposed method. Note the vocabulary used to express superior performance: "strong performance", "outperforms", "surpass", "an improvement of +3.2 mIoU over".

It is our belief that a unified architecture across computer vision and natural language processing could benefit both fields, since it would facilitate joint modeling of visual and textual signals and the modeling knowledge from both domains can be more deeply shared. We hope that Swin Transformer’s strong performance on various vision problems can drive this belief deeper in the community and encourage unified modeling of vision and language signals.

Paragraph 6, significance: a unified architecture across computer vision and natural language processing would benefit both fields, since it would facilitate joint modeling of visual and textual signals and allow modeling knowledge from both domains to be shared more deeply. The authors hope that Swin Transformer's strong performance on various vision problems can push this belief further in the community and encourage unified modeling of vision and language signals.

A rather lofty ambition.

Method

Overall Architecture

An overview of the Swin Transformer architecture is presented in Figure 3, which illustrates the tiny version (Swin-T). It first splits an input RGB image into non-overlapping patches by a patch splitting module, like ViT. Each patch is treated as a “token” and its feature is set as a concatenation of the raw pixel RGB values. In our implementation, we use a patch size of 4 × 4 and thus the feature dimension of each patch is 4 × 4 × 3 = 48. A linear embedding layer is applied on this raw-valued feature to project it to an arbitrary dimension (denoted as C).

Several Transformer blocks with modified self-attention computation (Swin Transformer blocks) are applied on these patch tokens. The Transformer blocks maintain the number of tokens (H/4 × W/4), and together with the linear embedding are referred to as "Stage 1".

  • First, the basic building blocks.

Figure 3 gives an overview of the Swin Transformer architecture, illustrated with the tiny version (Swin-T).

(1) Patch embedding: the input RGB image is first split into non-overlapping patches by a patch splitting module (as in ViT). Each patch is treated as a "token", and its feature is the concatenation of the raw pixel RGB values. The paper uses a 4 × 4 patch size, so the feature dimension of each patch is 4 × 4 × 3 = 48.

(2) Linear embedding layer: a linear embedding layer is applied to this raw-valued feature to project it to an arbitrary dimension (denoted C).

(3) Transformer blocks: several Swin Transformer blocks are applied on these patch tokens. They keep the number of tokens at H/4 × W/4 and, together with the linear embedding layer, are referred to as "Stage 1". A minimal sketch of this patch-embedding step is given below.
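To make Stage 1 concrete, here is a minimal PyTorch sketch of the patch splitting plus linear embedding step. The names and the conv-as-projection shortcut are illustrative, not the official API: a stride-4, 4 × 4 convolution is equivalent to flattening each 4 × 4 × 3 patch into 48 values and applying a linear layer.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into non-overlapping 4x4 patches and project each to dimension C."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # stride-4 conv == "take each 4x4x3 patch (48 raw values) and apply a linear projection"
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)    # (B, H/4 * W/4, C): one token per patch

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))   # torch.Size([1, 3136, 96])
```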

To produce a hierarchical representation, the number of tokens is reduced by patch merging layers as the network gets deeper. The first patch merging layer concatenates the features of each group of 2 × 2 neighboring patches, and applies a linear layer on the 4C-dimensional concatenated features. This reduces the number of tokens by a multiple of 2×2 = 4 (2× downsampling of resolution), and the output dimension is set to 2C. Swin Transformer blocks are applied afterwards for feature transformation, with the resolution kept at H/8 × W/8 . This first block of patch merging and feature transformation is denoted as “Stage 2”. The procedure is repeated twice, as “Stage 3” and “Stage 4”, with output resolutions of H/16 × W/16 and H/32 × W/32 , respectively. These stages jointly produce a hierarchical representation, with the same feature map resolutions as those of typical convolutional networks, e.g., VGG [51] and ResNet [29]. As a result, the proposed architecture can conveniently replace the backbone networks in existing methods for various vision tasks.

  • Next, the hierarchical representation.

To produce a hierarchical representation, the number of tokens is reduced by patch merging layers as the network gets deeper. The first patch merging layer concatenates the features of each group of 2 × 2 neighboring patches and applies a linear layer on the 4C-dimensional concatenated features. This reduces the number of tokens by a factor of 2 × 2 = 4 (2× downsampling of resolution), and the output dimension is set to 2C. Swin Transformer blocks are then applied for feature transformation, with the resolution kept at H/8 × W/8; this first block of patch merging and feature transformation is denoted "Stage 2". The procedure is repeated twice, as "Stage 3" and "Stage 4", with output resolutions of H/16 × W/16 and H/32 × W/32, respectively. Together, these stages produce a hierarchical representation with the same feature-map resolutions as typical convolutional networks such as VGG [51] and ResNet [29], so the architecture can conveniently replace the backbone network in existing methods for various vision tasks. A minimal sketch of a patch merging layer is given below.
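A minimal sketch of such a patch merging layer, assuming the input is a flattened token sequence of shape (B, H*W, C) with H and W even (illustrative, following the description above; not the official module):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring patches (C -> 4C),
    then linearly project to 2C, halving the spatial resolution."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                # x: (B, H*W, C)
        B, _, C = x.shape
        x = x.view(B, H, W, C)
        # gather the four patches of every 2x2 neighborhood
        x0 = x[:, 0::2, 0::2, :]               # (B, H/2, W/2, C)
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], -1)    # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)               # (B, H/2 * W/2, 4C)
        return self.reduction(self.norm(x))    # (B, H/2 * W/2, 2C)
```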

Swin Transformer is built by replacing the standard multi-head self attention (MSA) module in a Transformer block by a module based on shifted windows (described in Section 3.2), with other layers kept the same. As illustrated in Figure 3(b), a Swin Transformer block consists of a shifted window based MSA module, followed by a 2-layer MLP with GELU nonlinearity in between. A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module.

  • Finally, the Swin Transformer block.

Swin Transformer is built by replacing the standard multi-head self-attention (MSA) module in a Transformer block with a module based on shifted windows (Section 3.2), keeping the other layers unchanged. As shown in Figure 3(b), a Swin Transformer block consists of a shifted-window-based MSA module followed by a 2-layer MLP with GELU non-linearity in between. A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module. A minimal skeleton is sketched below.
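In pseudo-PyTorch, a block therefore looks roughly like the following. The attention module is left as a placeholder for W-MSA or SW-MSA, and the names are illustrative rather than the official implementation:

```python
import torch.nn as nn

class SwinBlock(nn.Module):
    """LN -> (S)W-MSA -> residual, then LN -> 2-layer MLP (GELU) -> residual."""
    def __init__(self, dim, attn_module, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = attn_module                # a (shifted-)window attention module, not defined here
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))       # pre-norm attention with residual connection
        x = x + self.mlp(self.norm2(x))        # pre-norm MLP with residual connection
        return x
```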

Shifted Window based Self-Attention

The standard Transformer architecture [61] and its adaptation for image classification [19] both conduct global self-attention, where the relationships between a token and all other tokens are computed. The global computation leads to quadratic complexity with respect to the number of tokens, making it unsuitable for many vision problems requiring an immense set of tokens for dense prediction or to represent a high-resolution image.

Motivation:

The standard Transformer architecture and its adaptation to image classification both perform global self-attention, where the relationship between a token and all other tokens is computed. This global computation leads to complexity quadratic in the number of tokens, making it unsuitable for many vision problems that require an immense set of tokens for dense prediction or to represent a high-resolution image.

  • Self-attention in non-overlapped windows

For efficient modeling, we propose to compute self-attention within local windows. The windows are arranged to evenly partition the image in a non-overlapping manner. Supposing each window contains M × M patches, the computational complexity of a global MSA module and a window based one on an image of h × w patches are:
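\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C,    (1)

\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC,    (2)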

where the former is quadratic to patch number hw, and the latter is linear when M is fixed (set to 7 by default). Global self-attention computation is generally unaffordable for a large hw, while the window based self-attention is scalable.

Approach:

For efficient modeling, self-attention is computed within local windows, which are arranged to evenly partition the image in a non-overlapping manner. Supposing each window contains M × M patches, the computational complexities of a global MSA module and a window-based one on an image of h × w patches are given by Eqs. (1) and (2): the former is quadratic in the patch number hw, while the latter is linear when M is fixed (set to 7 by default). Global self-attention is generally unaffordable for a large hw, whereas window-based self-attention is scalable. A sketch of the window partitioning follows below.
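A small helper showing how the even, non-overlapping window partition is typically implemented (a sketch; assumes H and W are divisible by M, and the function name is illustrative):

```python
import torch

def window_partition(x, M):
    """Partition a feature map into non-overlapping M x M windows.
    x: (B, H, W, C) -> (num_windows * B, M, M, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, M, M, C)

# Self-attention is then computed independently inside each of the (H/M)*(W/M)
# windows, so the attention cost per window is constant and the total cost is
# linear in the number of patches h*w.
```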

  • Shifted window partitioning in successive blocks

The window-based self-attention module lacks connections across windows, which limits its modeling power. To introduce cross-window connections while maintaining the efficient computation of non-overlapping windows, we propose a shifted window partitioning approach which alternates between two partitioning configurations in consecutive Swin Transformer blocks.

Sub-motivation: a new sub-problem arising from the solution above.

The window-based self-attention module lacks connections across windows, which limits its modeling power. To introduce cross-window connections while maintaining the efficient computation of non-overlapping windows, the paper proposes a shifted window partitioning approach that alternates between two partitioning configurations in consecutive Swin Transformer blocks.

As illustrated in Figure 2, the first module uses a regular window partitioning strategy which starts from the top-left pixel, and the 8 × 8 feature map is evenly partitioned into 2 × 2 windows of size 4 × 4 (M = 4). Then, the next module adopts a windowing configuration that is shifted from that of the preceding layer, by displacing the windows by (⌊M/2⌋, ⌊M/2⌋) pixels from the regularly partitioned windows. With the shifted window partitioning approach, consecutive Swin Transformer blocks are computed as
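\hat{z}^l = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1},

z^l = \text{MLP}(\text{LN}(\hat{z}^l)) + \hat{z}^l,

\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^l)) + z^l,

z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1},    (3)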

where \hat{z}^l and z^l denote the output features of the (S)W-MSA module and the MLP module for block l, respectively; W-MSA and SW-MSA denote window-based multi-head self-attention using regular and shifted window partitioning configurations, respectively.

Sub-approach: the method for the sub-problem above.

As illustrated in Figure 2, the first module uses a regular window partitioning strategy starting from the top-left pixel, and the 8 × 8 feature map is evenly partitioned into 2 × 2 windows of size 4 × 4 (M = 4). The next module then adopts a windowing configuration shifted from that of the preceding layer, displacing the windows by (⌊M/2⌋, ⌊M/2⌋) pixels from the regularly partitioned windows. With this shifted window partitioning, consecutive Swin Transformer blocks are computed as in Eq. (3), where \hat{z}^l and z^l denote the output features of the (S)W-MSA module and the MLP module of block l, respectively; W-MSA and SW-MSA denote window-based multi-head self-attention using the regular and shifted window partitioning configurations.

  • Efficient batch computation for shifted configuration

An issue with shifted window partitioning is that it will result in more windows, from \left \lceil h/M \right \rceil \times \left \lceil w/M \right \rceil to (\left \lceil h/M \right \rceil + 1) \times (\left \lceil w/M \right \rceil + 1) in the shifted configuration, and some of the windows will be smaller than M × M. A naive solution is to pad the smaller windows to a size of M × M and mask out the padded values when computing attention. When the number of windows in regular partitioning is small, e.g. 2 × 2, the increased computation with this naive solution is considerable (2 × 2 → 3 × 3, which is 2.25 times greater).

Sub-sub-motivation: a further problem introduced by the sub-solution above.

One issue with shifting the window partition is that it results in more windows, from \left \lceil h/M \right \rceil \times \left \lceil w/M \right \rceil to (\left \lceil h/M \right \rceil + 1) \times (\left \lceil w/M \right \rceil + 1), and some windows will be smaller than M × M. A naive solution is to pad the smaller windows to M × M and mask out the padded values when computing attention. When the number of windows in the regular partitioning is small, e.g. 2 × 2, the extra computation of this naive solution is considerable (2 × 2 → 3 × 3, i.e. 2.25×).

Here, we propose a more efficient batch computation approach by cyclic-shifting toward the top-left direction, as illustrated in Figure 4. After this shift, a batched window may be composed of several sub-windows that are not adjacent in the feature map, so a masking mechanism is employed to limit self-attention computation to within each sub-window. With the cyclic-shift, the number of batched windows remains the same as that of regular window partitioning, and thus is also efficient. The low latency of this approach is shown in Table 5.

Sub-sub-approach: the method for this further problem.

The paper therefore proposes a more efficient batch computation approach: cyclically shifting the feature map toward the top-left, as illustrated in Figure 4. After this shift, a batched window may be composed of several sub-windows that are not adjacent in the feature map, so a masking mechanism is employed to limit self-attention computation to within each sub-window. With the cyclic shift, the number of batched windows remains the same as with regular window partitioning, so it is also efficient. The low latency of this approach is shown in Table 5.

As for how this masked MSA is implemented, it is best to look at the code; a rough sketch is given below.
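A rough, self-contained sketch of the cyclic shift and the attention mask, following the idea described above. This is not the official implementation, and the helper name is made up:

```python
import torch

def cyclic_shift_and_mask(x, M, shift):
    """x: (B, H, W, C); M: window size; shift: usually M // 2.
    Returns the shifted feature map and an additive attention mask of shape
    (num_windows, M*M, M*M) that blocks attention between different sub-windows."""
    B, H, W, C = x.shape
    # 1) roll the whole feature map toward the top-left
    shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

    # 2) label every position by the original sub-window it came from
    img_mask = torch.zeros(1, H, W, 1)
    cnt = 0
    for hs in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
        for ws in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
            img_mask[:, hs, ws, :] = cnt
            cnt += 1

    # 3) partition the label map into M x M windows
    mask_windows = img_mask.view(1, H // M, M, W // M, M, 1)
    mask_windows = mask_windows.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M)

    # 4) positions from different sub-windows must not attend to each other:
    #    add a large negative value to those attention logits before the softmax
    attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
    attn_mask = attn_mask.masked_fill(attn_mask != 0, -100.0).masked_fill(attn_mask == 0, 0.0)
    return shifted, attn_mask
```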

  • Relative position bias

In computing self-attention, we follow [48, 1, 31, 32] by including a relative position bias B\in \mathbb{R}^{M^2\times M^2} to each head in computing similarity:
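\text{Attention}(Q, K, V) = \text{SoftMax}(QK^T/\sqrt{d} + B)V,    (4)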

here Q, K, V \in \mathbb{R}^{M^2\times d} are the query, key and value matrices; d is the query/key dimension, and M^2 is the number of patches in a window. Since the relative position along each axis lies in the range [−M + 1, M −1], we parameterize a smaller-sized bias matrix \hat{B}\in \mathbb{R}^{(2M-1)\times (2M-1)}, and values in B are taken from \hat{B}.

When computing self-attention, a relative position bias B\in \mathbb{R}^{M^2\times M^2} is included for each head, as in Eq. (4), where d is the query/key dimension and M^2 is the number of patches in a window. Since the relative position along each axis lies in the range [−M + 1, M − 1], a smaller bias matrix \hat{B}\in \mathbb{R}^{(2M-1)\times (2M-1)} is parameterized, and the values in B are taken from \hat{B}. A sketch of the usual indexing construction is given below.
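A sketch of how the M^2 × M^2 bias B can be gathered from the (2M−1) × (2M−1) table \hat{B}: each query–key pair inside a window is mapped to a single index into the flattened table. This mirrors a common construction; the names are illustrative:

```python
import torch

def relative_position_index(M):
    """For an M x M window, map every (query, key) position pair to an index
    into a flattened (2M-1) * (2M-1) bias table."""
    coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
    coords = coords.flatten(1)                         # (2, M*M)
    rel = coords[:, :, None] - coords[:, None, :]      # (2, M*M, M*M), values in [-(M-1), M-1]
    rel = rel.permute(1, 2, 0).contiguous()            # (M*M, M*M, 2)
    rel[:, :, 0] += M - 1                              # shift both axes to start from 0
    rel[:, :, 1] += M - 1
    rel[:, :, 0] *= 2 * M - 1                          # row-major index into (2M-1) x (2M-1)
    return rel.sum(-1)                                 # (M*M, M*M)

# bias_table: learnable tensor of shape ((2M-1) * (2M-1), num_heads)
# B = bias_table[relative_position_index(M).view(-1)].view(M*M, M*M, -1)
```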

We observe significant improvements over counterparts without this bias term or that use absolute position embedding, as shown in Table 4. Further adding absolute position embedding to the input as in [19] drops performance slightly, thus it is not adopted in our implementation.

The paper further observes that the relative position bias brings significant improvements, while adding absolute position embedding slightly hurts performance, as shown in Table 4.

The learnt relative position bias in pre-training can be also used to initialize a model for fine-tuning with a different window size through bi-cubic interpolation [19, 60].

The relative position bias learned in pre-training can also be used, via bi-cubic interpolation, to initialize a model for fine-tuning with a different window size [19, 60]. For background on this, refer to the ViT and DeiT papers.

  • Architecture Variants

We build our base model, called Swin-B, to have a model size and computation complexity similar to ViT-B/DeiT-B. We also introduce Swin-T, Swin-S and Swin-L, which are versions of about 0.25×, 0.5× and 2× the model size and computational complexity, respectively. Note that the complexities of Swin-T and Swin-S are similar to those of ResNet-50 (DeiT-S) and ResNet-101, respectively. The window size is set to M = 7 by default. The query dimension of each head is d = 32, and the expansion layer of each MLP is α = 4, for all experiments. The architecture hyper-parameters of these model variants are:

• Swin-T: C = 96, layer numbers = {2, 2, 6, 2}

• Swin-S: C = 96, layer numbers ={2, 2, 18, 2}

• Swin-B: C = 128, layer numbers ={2, 2, 18, 2}

• Swin-L: C = 192, layer numbers ={2, 2, 18, 2}

where C is the channel number of the hidden layers in the first stage. The model size, theoretical computational complexity (FLOPs), and throughput of the model variants for ImageNet image classification are listed in Table 1.

This subsection lists the configurations of the model variants. They are largely self-explanatory, so they are not expanded on in detail here.

Note that the complexities of Swin-T and Swin-S are similar to those of ResNet-50 (DeiT-S) and ResNet-101, respectively. The window size is set to M = 7 by default. For all experiments, the query dimension of each head is d = 32 and the expansion ratio of each MLP is α = 4.

Table 1 lists the FLOPs of each model variant for image classification; a simple configuration sketch follows below.
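For reference, the variant hyper-parameters above can be written as a small configuration table (a sketch; only the values stated in the text are included, and the dictionary name is illustrative):

```python
# Window size M = 7, per-head dim d = 32, MLP expansion alpha = 4 for all variants.
SWIN_VARIANTS = {
    "Swin-T": dict(C=96,  depths=(2, 2, 6, 2)),
    "Swin-S": dict(C=96,  depths=(2, 2, 18, 2)),
    "Swin-B": dict(C=128, depths=(2, 2, 18, 2)),
    "Swin-L": dict(C=192, depths=(2, 2, 18, 2)),
}
```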


Reposted from blog.csdn.net/u014546828/article/details/118498536