[Deep Learning] Semantic Segmentation Paper Reading (I don't understand much of it): (2022-1) Lawin Transformer: Large Window Attention Improves Multi-Scale Semantic Segmentation


Name: Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention
Unit: Beijing University of Posts and Telecommunications
Paper
Code

Summary

Multi-scale representations are crucial for semantic segmentation. The field has witnessed the vigorous development of convolutional neural networks (CNNs) for semantic segmentation that exploit multi-scale contextual information. Motivated by the power of the Vision Transformer (ViT) for image classification, several ViTs for semantic segmentation have been proposed recently; most of them achieve impressive results, but at the cost of computational economy.

  • A multi-scale representation is introduced into the semantic segmentation ViT via a window attention mechanism, which further improves performance and efficiency.
    To this end, large window attention is introduced, which allows a local window to query a context window covering a larger region with little computational overhead.

  • Large window attention captures contextual information at multiple scales by adjusting the ratio of the context region to the query region.

  • Furthermore, adopting the spatial pyramid pooling framework in collaboration with large-window attention, a novel decoder named large-window attention Spatial Pyramid Pooling (LawinASPP) is proposed for semantic segmentation ViT.

Lawin Transformer (the proposed ViT)

  • Encoder: an efficient Hierarchical Vision Transformer (HVT)
  • Decoder: composed of LawinASPP

1. Introduction

Previous approaches

Main work of CNNs: exploit multi-scale representations.
Method: apply filters or pooling operations, such as atrous convolution and adaptive pooling, within a spatial pyramid pooling (SPP) module.
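
To make the CNN-style approach concrete, here is a minimal sketch of an ASPP-like SPP module with atrous convolution and adaptive pooling branches (the dilation rates, channel widths, and class name are illustrative choices, not the configuration of any specific paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleASPP(nn.Module):
    """Minimal ASPP-style SPP sketch: parallel atrous (dilated) convolutions
    plus an adaptive global-pooling branch, concatenated and reduced."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=1)] +
            [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
             for r in rates])
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, kernel_size=1))
        # 1x1 branch + len(rates) atrous branches + pooling branch
        self.reduce = nn.Conv2d((len(rates) + 2) * out_ch, out_ch, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = [branch(x) for branch in self.branches]
        outs.append(F.interpolate(self.pool(x), size=(h, w),
                                  mode='bilinear', align_corners=False))
        return self.reduce(torch.cat(outs, dim=1))
```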

ViT
Disadvantages: high computational cost, especially when the input image is large.
Solution:
methods based purely on the Hierarchical Vision Transformer (HVT).

Swin Transformer is one of the most representative HVTs, but it uses a heavy decoder to classify pixels.

SegFormer improves the encoder and decoder design, resulting in a very efficient semantic segmentation ViT.
Disadvantages: performance is improved only incrementally by increasing the model capacity of the encoder, which may lower the efficiency ceiling.

The main current problem: the lack of multi-scale contextual information, which affects performance and efficiency.
Proposed method: a new window attention mechanism, large window attention.

The proposed approach

Large-window attention
In large window attention, as shown in Figure 1, each uniformly partitioned patch queries a context patch covering a larger area, whereas a patch in local window attention only queries itself. Since attention becomes computationally unacceptable as the context patch grows larger, a simple but effective strategy is devised to alleviate this dilemma.
Specifically,
1. First, the large context patch is pooled down to the spatial dimension of the corresponding query patch, which keeps the original computational complexity (a quick check of this claim follows the list).
2. Then the multi-head mechanism is enabled within large window attention, and the number of heads is strictly set equal to the square of the downsampling ratio R used when pooling the context, mainly to recover the dependencies between the query and the discarded context.
3. Finally, inspired by the token-mixing MLP in MLP-Mixer [37], R² position-mixing operations are applied separately on the R² head subspaces, which enhances the spatial representation ability of multi-head attention.
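
A quick check of the complexity claim in step 1 (assuming a query window of $P \times P$ tokens, a context ratio $R$, and channel dimension $C$; these symbols follow the notation used later in Section 3): without pooling, computing the attention scores between the $P^2$ query tokens and the $(RP)^2$ context tokens costs on the order of

$$P^2 \cdot (RP)^2 \cdot C = R^2 P^4 C$$

multiply-accumulates, growing quadratically with $R$. After pooling the context down to $P \times P$, the cost becomes $P^2 \cdot P^2 \cdot C = P^4 C$, independent of $R$, so enlarging the context window no longer increases the attention cost.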

Therefore, a patch in the proposed large window attention can capture contextual information at any scale, with only a small computational overhead caused by the position-mixing operations. By combining large window attentions with different ratios R, the SPP module evolves into large window attention spatial pyramid pooling (LawinASPP), which, like ASPP (Atrous Spatial Pyramid Pooling) [9] and PPM (Pyramid Pooling Module) [50], exploits multi-scale representations for semantic segmentation.

2. Related Work

Exploration of ViTs

ViT is the first end-to-end vision transformer for image classification; it projects an input image into a sequence of tokens and appends a class token.

The efficiency of PVT and Swin Transformer sparked interest in **Hierarchical Vision Transformer (HVT)**.

1. SETR deploys ViT as an encoder and upsamples the output patch embeddings to classify pixels.

2. Swin Transformer extends itself to a semantic segmentation ViT by attaching UPerNet (Unified Perceptual Parsing for Scene Understanding).

3. Segmenter relies on ViT/DeiT as the backbone and proposes a mask transformer decoder.

4. SegFormer shows a simple, effective, yet powerful encoder and decoder design for semantic segmentation.

5. MaskFormer redefines semantic segmentation as a mask classification problem, with fewer FLOPs and parameters than Swin-UperNet.

In this paper, we take a new step towards a more efficient ViT design for semantic segmentation by introducing multi-scale representations into the HVT.

MLP-Mixer

MLP-Mixer [37] is a new type of neural network that is much simpler than ViT. Similar to ViT, MLP-Mixer first applies a linear projection to obtain a ViT-like token sequence.

MLP-Mixer is based entirely on multi-layer perceptrons (MLPs): it replaces the self-attention of the transformer layer with a token-mixing MLP. The channel-mixing MLP works along the channel dimension, while the token-mixing MLP works along the spatial (token) dimension to learn spatial representations. In the proposed large window attention, a token-mixing MLP is applied to the pooled context patches, which we call position mixing, to improve the spatial representation of multi-head attention.

3. Method

In this part, we first briefly introduce multi-head attention and the token-mixing MLP, then elaborate on large window attention and describe the architecture of LawinASPP. Finally, the overall structure of the Lawin Transformer is given.
[Figure: overall architecture of the Lawin Transformer]
Overall architecture:
1. The image is fed into the encoder, which is a MiT.
2. The features of the last three stages are then aggregated and fed into the decoder, which is a LawinASPP.
3. Finally, the features from the first stage of the encoder are used to enhance the low-level information of the resulting features. "MLP" stands for multilayer perceptron, "CAT" denotes feature concatenation, "Lawin" denotes large window attention, and "R" is the ratio of the size of the context patch to the size of the query patch.

3.1. Background

Token-mixing MLP
The token-mixing MLP is the core of MLP-Mixer; it aggregates spatial information by allowing spatial locations to communicate with each other.
Given an input 2D feature map $x_{2d} \in \mathbb{R}^{C \times H \times W}$, the operation of the token-mixing MLP can be expressed as:

$$\mathrm{TM}(x) = \sigma(x W_1)\, W_2, \qquad x = \mathrm{flatten}(x_{2d}) \in \mathbb{R}^{C \times HW}$$

where $W_1 \in \mathbb{R}^{HW \times D_{mlp}}$ and $W_2 \in \mathbb{R}^{D_{mlp} \times HW}$ are learnable linear transformations and $\sigma$ is a nonlinear activation function.
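
A minimal PyTorch sketch of the token-mixing MLP written above (the hidden width `d_mlp` and the GELU activation are illustrative choices; the residual connection and layer normalization of a full MLP-Mixer block are omitted):

```python
import torch
import torch.nn as nn

class TokenMixingMLP(nn.Module):
    """TM(x) = sigma(x W1) W2, applied along the spatial (token) dimension."""
    def __init__(self, num_tokens, d_mlp):
        super().__init__()
        self.w1 = nn.Linear(num_tokens, d_mlp)   # W1: HW -> D_mlp
        self.act = nn.GELU()                     # sigma
        self.w2 = nn.Linear(d_mlp, num_tokens)   # W2: D_mlp -> HW

    def forward(self, x2d):
        # x2d: (B, C, H, W) -> flatten spatial dims into tokens: (B, C, H*W)
        b, c, h, w = x2d.shape
        x = x2d.flatten(2)
        x = self.w2(self.act(self.w1(x)))        # mix along the token axis
        return x.view(b, c, h, w)
```

For a 2D feature map of size H x W, `num_tokens` is H*W.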

3.2. Large Window Attention

With some regularization on the head subspace, multi-head attention can learn the desired diverse representations [12, 16, 18]. Considering that the spatial information becomes abstract after downsampling, we hope to enhance the spatial representation ability of multi-head attention.
[Figure 2: Large window attention. The red patch Q is the query patch and the purple patch C is the context patch. The context is reshaped and fed into token-mixing MLPs; the output context C^P is named the position-mixing context. Best viewed in color.]

In MLP-Mixer, the token-mixing MLP complements the channel-mixing MLP by collecting spatial knowledge. Analogously, we define a set of head-specific position-mixing MLPs {MLP_1, MLP_2, ..., MLP_H}. As shown in Figure 2, each head of the pooled context patch is fed into its corresponding token-mixing (position-mixing) MLP, so that spatial positions within the same head communicate with each other. We call the obtained context the position-mixing context patch, denoted $C^P$, computed as:
$$\hat{C} = \eta(C), \qquad C^{P}_{h} = \mathrm{MLP}_{h}\big(\hat{C}_{h}\big), \quad h = 1, \dots, H$$
where $\hat{C}_h$ denotes the h-th head of the pooled context $\hat{C}$, $\mathrm{MLP}_h \in \mathbb{R}^{P^2 \times P^2}$ is the h-th transformation, which strengthens the spatial representation of the h-th head, and $\eta$ is the average pooling operation. With the position-mixing context $C^P$, the multi-head attention equations (3) and (4) can be reformulated as:
$$\mathrm{head}_h = \mathrm{Attention}\big(Q W^{Q}_{h},\; C^{P}_{h} W^{K}_{h},\; C^{P}_{h} W^{V}_{h}\big)$$

$$\mathrm{MHA}\big(Q, C^{P}\big) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^{O}$$
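
Putting the pieces together, here is a minimal PyTorch sketch of large window attention for a single (query window, context window) pair, written from the description above. It is illustrative rather than the authors' implementation: window partitioning of the full feature map is omitted, and full-width key/value projections are used for simplicity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LargeWindowAttention(nn.Module):
    """Illustrative sketch of large window attention for one window pair.

    A P x P query window attends to an (R*P) x (R*P) context window that is
    average-pooled back to P x P. The number of heads is set to R**2, and a
    head-specific position-mixing MLP (a token-mixing MLP over the P*P
    positions) is applied to each head of the pooled context.
    """
    def __init__(self, dim, window_size=8, ratio=2):
        super().__init__()
        self.num_heads = ratio ** 2              # number of heads = R^2
        assert dim % self.num_heads == 0
        self.head_dim = dim // self.num_heads
        self.scale = self.head_dim ** -0.5

        n = window_size * window_size            # P^2 tokens per query window
        # one position-mixing MLP (a P^2 x P^2 linear map) per head
        self.pos_mix = nn.ModuleList([nn.Linear(n, n) for _ in range(self.num_heads)])

        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, query_win, context_win):
        """query_win:   (B, C, P, P)      local window
           context_win: (B, C, R*P, R*P)  enclosing context window"""
        B, C, P, _ = query_win.shape
        H, d = self.num_heads, self.head_dim

        # 1. pool the large context down to the query's spatial size (P x P)
        ctx = F.adaptive_avg_pool2d(context_win, (P, P))

        q = query_win.flatten(2).transpose(1, 2)             # (B, P*P, C)
        ctx = ctx.flatten(2).transpose(1, 2)                 # (B, P*P, C)

        # 2. split the pooled context into heads and apply the head-specific
        #    position-mixing MLP along the spatial (token) axis
        ctx = ctx.view(B, P * P, H, d).permute(0, 2, 3, 1)    # (B, H, d, P*P)
        ctx = torch.stack([self.pos_mix[h](ctx[:, h]) for h in range(H)], dim=1)
        ctx = ctx.permute(0, 3, 1, 2).reshape(B, P * P, C)    # position-mixing context C^P

        # 3. multi-head attention: queries from the local window, keys/values
        #    from the position-mixing context (full-width projections here)
        q = self.q(q).view(B, -1, H, d).transpose(1, 2)       # (B, H, P*P, d)
        k = self.k(ctx).view(B, -1, H, d).transpose(1, 2)
        v = self.v(ctx).view(B, -1, H, d).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, -1, C)
        return self.proj(out)                                 # (B, P*P, C)
```

For example, with `dim=256`, `window_size=8`, and `ratio=4` (so 16 heads), calling `LargeWindowAttention(256, 8, 4)(torch.randn(1, 256, 8, 8), torch.randn(1, 256, 32, 32))` returns a tensor of shape `(1, 64, 256)`.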

3.3. LawinASPP

To capture multi-scale representations, we employ a spatial pyramid pooling (SPP) architecture in collaboration with large-window attention, resulting in a new SPP module, LawinASPP.
LawinASPP consists of five parallel branches: a shortcut connection, three large window attention branches (R = 2, 4, 8), and an image pooling branch.
As shown in Figure 3, the large window attention branches provide three levels of receptive field for the local window. Following previous work on the window attention mechanism [30], the patch size of the local window is set to 8, which yields receptive fields of (16, 32, 64).
The image pooling branch uses a global pooling layer to obtain global context information, then applies a linear transformation followed by bilinear upsampling to match the feature resolution.
The shortcut branch simply copies the input features. All generated features are first concatenated, and a learned linear transformation reduces the dimensionality before the final segmentation map is generated.
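
As a branch-level sketch of this decoder module (illustrative only, not the official code), the composition could look like the following; `lawin_branches` is assumed to be three modules that apply large window attention with R = 2, 4, 8 over non-overlapping 8x8 windows and return a feature map with the input's shape:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LawinASPP(nn.Module):
    """Branch-level sketch of LawinASPP: a shortcut, three large window
    attention branches (R = 2, 4, 8; local window size 8, i.e. receptive
    fields 16/32/64), and an image pooling branch, concatenated and reduced."""
    def __init__(self, dim, lawin_branches):
        super().__init__()
        assert len(lawin_branches) == 3          # R = 2, 4, 8
        self.lawin_branches = nn.ModuleList(lawin_branches)
        self.shortcut = nn.Conv2d(dim, dim, kernel_size=1)
        # image pooling: global context -> linear -> upsample back
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim, kernel_size=1),
        )
        # concatenate the 5 branches, then reduce the dimensionality
        self.reduce = nn.Conv2d(5 * dim, dim, kernel_size=1)

    def forward(self, x):
        B, C, H, W = x.shape
        outs = [self.shortcut(x)]
        for branch in self.lawin_branches:       # R = 2, 4, 8
            outs.append(branch(x))
        pooled = self.image_pool(x)              # (B, C, 1, 1)
        outs.append(F.interpolate(pooled, size=(H, W), mode='bilinear',
                                  align_corners=False))
        return self.reduce(torch.cat(outs, dim=1))
```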

3.4. Lawin Transformer

After surveying advanced HVTs, MiT and Swin Transformer are selected as the encoders of the Lawin Transformer.

MiT is the encoder designed for SegFormer [43], which is a simple, effective, yet powerful semantic segmentation ViT.

Swin-Transformer [30] is a very successful HVT based on local window attention.

Before applying LawinASPP, the multi-level features with output stride = (8, 16, 32) are concatenated after being resized to the size of the stride-8 features, and a linear layer transforms the concatenation. The transformed stride-8 features are then fed into LawinASPP to obtain features with multi-scale contextual information.

In state-of-the-art semantic segmentation ViTs, the features used to predict the final segmentation logits always come from all four levels of the encoder.
Therefore, the first-level features with output stride = 4 are used to compensate for low-level information.
The output of LawinASPP is up-sampled to a quarter of the size of the input image and then fused with the first-stage features using a linear layer.
Finally, the segmentation logits are predicted on the low-level-enhanced features. More details are shown in Figure 3.
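
The decoder data flow described in this subsection could be sketched as follows (a minimal, hypothetical layout: `lawin_aspp` is any module that preserves the stride-8 feature shape, e.g. the LawinASPP sketch above, and the channel widths are placeholders rather than the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LawinDecoder(nn.Module):
    """Sketch of the decoder flow: aggregate stride-8/16/32 features, run
    LawinASPP, fuse with stride-4 low-level features, predict logits."""
    def __init__(self, in_dims, dim, num_classes, lawin_aspp):
        super().__init__()
        c1, c2, c3, c4 = in_dims                 # channels at strides 4, 8, 16, 32
        # linear layer transforming the concatenated multi-level features
        self.proj = nn.Conv2d(c2 + c3 + c4, dim, kernel_size=1)
        self.lawin_aspp = lawin_aspp
        self.low_proj = nn.Conv2d(c1, dim, kernel_size=1)   # stride-4 features
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)
        self.cls = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, feats):
        f1, f2, f3, f4 = feats                   # strides 4, 8, 16, 32
        size8 = f2.shape[-2:]
        # resize stride-16/32 features to stride 8 and concatenate
        f3 = F.interpolate(f3, size=size8, mode='bilinear', align_corners=False)
        f4 = F.interpolate(f4, size=size8, mode='bilinear', align_corners=False)
        x = self.proj(torch.cat([f2, f3, f4], dim=1))
        x = self.lawin_aspp(x)                   # multi-scale context at stride 8
        # upsample to stride 4 and fuse with the low-level first-stage features
        x = F.interpolate(x, size=f1.shape[-2:], mode='bilinear', align_corners=False)
        x = self.fuse(torch.cat([x, self.low_proj(f1)], dim=1))
        return self.cls(x)                       # segmentation logits at stride 4
```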

4. Experiments


5. Conclusion

An efficient semantic segmentation transformer, called the Lawin Transformer, is proposed.
The decoder part of the Lawin Transformer is able to capture rich contextual information at multiple scales, which is based on the proposed large window attention. Compared with existing efficient semantic segmentation transformers, the Lawin Transformer can achieve higher performance with less computational overhead. Finally, experiments on the Cityscapes, ADE20K, and COCO-Stuff datasets yield state-of-the-art results on these benchmarks. We hope the Lawin Transformer can inspire future creativity in semantic segmentation ViTs.


Origin blog.csdn.net/zhe470719/article/details/124846038