[Deep Learning] Semantic Segmentation - Paper Reading: (CVPR 2021) SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective

0. Details

Name: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Affiliation: Published in March 2021, it is a semantic segmentation model based on the ViT architecture, jointly proposed by Fudan University and Tencent Youtu.
Paper: Paper
Code: Code
Reference Notes:
1. The summary point is clear
2. Concise version

The FCN encoder uses a CNN to extract features, i.e., it trades spatial resolution for feature-map depth.
The transformer used by SETR neither increases the feature-map depth nor sacrifices resolution.

This paper introduces the first representative ViT-based semantic segmentation model, the SEgmentation TRansformer (SETR), and proposes replacing the CNN encoder with a pure Transformer encoder, changing the existing semantic segmentation model architecture.

1. Summary

Previous practice of semantic segmentation:
Models are designed on top of FCN with an Encoder-Decoder structure. The Encoder gradually reduces the spatial resolution while using the progressively larger receptive field to learn more abstract semantic features.

Recent practice:
Given the importance of context modeling for semantic segmentation, some recent research focuses on enlarging the receptive field with dilated convolutions or inserted attention modules. However, these studies are still based on the Encoder-Decoder FCN architecture.

The purpose of this paper is
to provide an alternative by treating semantic segmentation as a sequence-to-sequence prediction task.
Specifically, an image is encoded as a set of patches using a pure transformer structure (i.e., without convolution and downsampling). **Because every layer of the transformer models global context, the Encoder can be combined with a simple decoder to form a powerful semantic segmentation model.** This model is called SETR.

2. Introduction

2.1 The original model: FCN

A standard FCN semantic segmentation model has an Encoder-Decoder structure:
The encoder is a stack of many convolutional layers. Its role is to extract richer semantic features, and it generally keeps reducing the spatial resolution (size) of the feature map to obtain a larger receptive field.
The decoder upsamples the high-level features extracted by the encoder back to the original input resolution for pixel-level classification.

	1. The size of the receptive field determines whether features can capture a wide enough range of surrounding information, or even global information. But for semantic segmentation, losing resolution means a large spatial loss, and segmentation quality may degrade.
	2. Context information is the most critical factor for improving semantic segmentation performance, while the receptive field roughly determines how much information the network can exploit.
Since the effective receptive field in a network is limited, this severely constrains the model's representational capacity.

Advantages: properties such as translation invariance give the network a certain generalization ability, and locality reduces model complexity through parameter sharing.
Disadvantage: CNNs have difficulty learning long-range dependencies.

Solutions:

  • Directly modify the convolution operation: large convolution kernels, dilated convolutions, image/feature pyramids, etc.;
  • Introduce attention modules to model global context information for each pixel in the feature map.
    Both approaches still keep the Encoder-Decoder FCN structure.

A property of the Transformer is its ability to preserve the spatial resolution from input to output while effectively capturing global contextual information. Therefore, the author uses a ViT-like structure for feature extraction and combines it with a Decoder to restore the resolution.

2.2 Using transformers

Use an Encoder that contains only transformers to replace the original stacked convolutions for feature extraction. This method is called SEgmentation TRansformer (SETR).
SETR's Encoder regards an image as a sequence of image patches via a learned patch embedding, and uses global self-attention to model this sequence. Specifically:
first, the image is decomposed into a grid of fixed-size patches, forming a series of patches;
then, each patch is flattened and passed through a linear embedding layer, yielding a sequence of feature embedding vectors that serves as the input to the transformers;
next, the transformer Encoder produces highly abstract learned feature maps;
finally, a simple decoder recovers the segmentation map at the original resolution.
Throughout SETR, the key point is that there is no downsampling, which distinguishes it from the traditional convolution-based backbone used for feature extraction.
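As a quick worked example of this serialization (the 512 × 512 input size here is only an illustrative assumption): with the 16 × 16 patch size used in the paper, the sequence length is L = (H/16) × (W/16) = HW/256, so a 512 × 512 image becomes 32 × 32 = 1024 patch tokens, each of which is then projected to a C-dimensional embedding.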

2.3 Main Contributions

(1) Redefining the semantic segmentation task as a sequence-to-sequence problem, providing an alternative to FCN models based on the Encoder-Decoder structure;
(2) Using a pure transformer as the Encoder to extract features from the serialized image;
(3) Designing three decoders of different complexity to study the self-attention feature representations in depth.

3. Related Work - Semantic Segmentation Model Development

3.1 Previous FCN-based improvements

FCN removes the fully connected layers of the classification model and adds a decoder, which opened the era of deep-learning-based semantic segmentation.

However, the prediction results of FCN are relatively coarse, so CRF/MRF post-processing methods were developed to refine the results.

To make up for the inherent tension between semantics and spatial detail, deep and shallow layers are fused between the Encoder and Decoder, which has led to a large number of variants with different fusion strategies.

Recent research: all of it focuses on solving the problems of the limited receptive field and of context-information modeling.

In order to increase the receptive field:
DeepLab and Dilation introduce dilated (atrous) convolution, while PSPNet and DeepLabv2 aim at better context modeling: PSPNet proposes the PPM module to capture context information from different regions, and DeepLabv2 proposes ASPP, a pyramid of atrous convolutions with different dilation rates.

In addition, GCN obtains a large receptive field by decomposing a large convolution kernel; PSANet develops a point-wise spatial attention module to dynamically capture long-range context; DANet embeds both spatial attention and channel attention; CCNet focuses on reducing the computation introduced by global attention; DGMN builds a dynamic graph message-passing network for scene modeling, which greatly reduces computational complexity.

It should be noted that these methods are all improvements based on FCN: the feature-extraction part (the Encoder) comes from classification models such as VGG and ResNet with the fully connected layers removed.

The SETR proposed in this paper is completely different from these methods, and a new solution is given.

3.2 Transformer-based improvement

Transformers and self-attention models have revolutionized machine translation and natural language processing. In recent years, some explorations have been made on the application of transformer structure in image recognition.
The Non-local network attaches the transformer's attention to the convolutional backbone.
AANet mixes convolution and self-attention for backbone training.
LRNet and stand-alone networks explore local self-attention to avoid the heavy computation brought by global self-attention.
SAN explores two kinds of self-attention modules. Axial-Attention decomposes global spatial attention into two separate axial attentions, which greatly reduces the amount of computation.
There are also CNN-transformer hybrid models.
DETR and its later variants use transformers for object detection, attached inside the detection head.
STTR and LSTR use transformers for disparity estimation and lane shape prediction, respectively.
ViT is the first work to demonstrate that a pure-transformer image classification model can achieve state-of-the-art results. It provides direct inspiration for the pure transformer-based encoder design in semantic segmentation models.

Axial-Attention also uses attention for image segmentation, but its model still follows the traditional FCN design, in which the spatial resolution of feature maps gradually decreases during encoding, whereas SETR always maintains the same spatial resolution.

4. Method

4.1 Semantic Segmentation Based on FCN

The Encoder of an FCN semantic segmentation model is a stack of convolutional layers. The first layer accepts an input image of size H × W × 3, and subsequent convolutional layers take inputs of size h × w × d, where h and w are the height and width of the feature map and d is the number of channels.

The value at each position of a deep-layer tensor is computed, layer by layer, from all the preceding shallow layers; the region it depends on is the receptive field.
Due to the locality of the convolution operation, the receptive field grows only linearly as layers are stacked, and the growth rate depends on the convolution kernel size.
As a result, in the FCN architecture only the deep layers have a large receptive field and can model long-range dependencies.
However, some studies have shown that as the network gets deeper, the benefit from the increasing receptive field diminishes.
Therefore, for common FCN architectures, the limited effective receptive field becomes an inherent limitation for context modeling.
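As a concrete illustration of this linear growth (standard receptive-field arithmetic, not a result from the paper): stacking l plain 3 × 3 convolutions with stride 1 gives a receptive field of only r = 2l + 1 pixels, so covering even a 224-pixel extent would need more than 100 such layers, which is why FCNs fall back on downsampling to enlarge the receptive field.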

Recently, some SOTA methods have also shown that combining FCNs with attention mechanisms is a more effective strategy for capturing long-range contextual information.
However, because of the quadratic complexity of the attention mechanism, the higher the resolution of the feature map, the larger the required computation, so these methods all place the attention mechanism on top of deep layers with small resolution, exploiting the small input size to reduce computational cost.
This also means that attention-based dependency learning is missing in the shallow layers. SETR is designed to address this problem.
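To make this cost concrete (illustrative numbers, not from the paper): self-attention over N tokens builds an N × N affinity matrix. Applied per pixel to a 512 × 512 feature map, N = 512 × 512 = 262,144 and N² ≈ 6.9 × 10¹⁰; applied to 16 × 16-patch tokens of the same image, N = (512/16)² = 1,024 and N² ≈ 1.0 × 10⁶. This is why attention is usually attached only to low-resolution deep features.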

4.2 SETR

The entire SETR structure: input → conversion → output
Figure 1 Schematic illustration of the proposed SEgmentation TRansformer (SETR).  (a) Input preprocessing and feature extraction; (b) Progressive upsampling; (c) Multi-level feature aggregation.
1. Image to sequence (image serialization)
First, the original input image must be processed into a format the Transformer can accept. Here the author follows ViT: the input image is sliced into patches, and each 2D image patch is treated as an element of a 1D sequence that is fed to the network. In general, the Transformer expects a 1D sequence of feature embeddings Z ∈ R^(L×C), where L is the sequence length and C is the hidden channel size. Therefore, the image input x ∈ R^(H×W×3) also has to be transformed into such a Z.
The article uses 16 × 16 patches, so a 256 × 256 image is sliced into 256 patches. To encode the spatial information of each patch, a specific embedding is learned for every position and added to the linear projection of the patch to form the final input sequence. In this way, even though the Transformer itself is order-agnostic, the original spatial position information is preserved because it is associated with each token.
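A minimal sketch of this image-to-sequence step in PyTorch (the class name, the conv-based patch projection, and the 256 × 256 / 16 × 16 settings are illustrative assumptions, not the official SETR code):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches, project each patch to C dims,
    and add a learned position embedding (illustrative sketch, not the official code)."""
    def __init__(self, img_size=256, patch_size=16, in_channels=3, embed_dim=1024):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # L = HW / P^2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # One learnable embedding per patch position, added to keep spatial information.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                       # x: (B, 3, H, W)
        z = self.proj(x)                        # (B, C, H/P, W/P)
        z = z.flatten(2).transpose(1, 2)        # (B, L, C) with L = (H/P)*(W/P)
        return z + self.pos_embed               # position-aware patch tokens

tokens = PatchEmbedding()(torch.rand(1, 3, 256, 256))
print(tokens.shape)                             # torch.Size([1, 256, 1024])
```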

2. Transformer
Features are extracted by feeding the sequence into the Transformer architecture, whose layers mainly consist of two parts: Multi-head Self-Attention (MSA) and Multilayer Perceptron (MLP) blocks.
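A compact sketch of one such encoder block, written in the common pre-norm ViT style (a generic illustration under that assumption, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    """One encoder block: LayerNorm -> multi-head self-attention -> residual,
    then LayerNorm -> MLP -> residual (generic ViT-style sketch)."""
    def __init__(self, dim=1024, num_heads=16, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):                                   # z: (B, L, C)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # global self-attention (MSA)
        z = z + self.mlp(self.norm2(z))                     # per-token MLP
        return z                                            # same (B, L, C): no downsampling

z = torch.rand(1, 256, 1024)
print(TransformerEncoderLayer()(z).shape)                   # torch.Size([1, 256, 1024])
```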

3. Decoder
For the decoder, the article gives three structures (a code sketch of the Naive and PUP heads follows the list below). The features extracted by the Transformer have the same input and output dimensions; to perform the final pixel-level segmentation, they must be reshaped back to the original spatial resolution.

  • Naive upsampling (Naive)
    reduces the feature dimension of the Transformer output to the number of classes and restores the original resolution by bilinear upsampling, i.e., a 2-layer head: 1 × 1 conv + sync batch norm (w/ ReLU) + 1 × 1 conv.

  • Progressive UPsampling (PUP)
    needs 4 operations to go from H/16 × W/16 × 1024 back to H × W × 19 (19 is the number of Cityscapes classes), alternating convolutional layers and 2× upsampling operations until the native resolution is restored.

  • Multi-Level feature Aggregation (MLA)
    first divides the Transformer outputs {Z1, Z2, ..., ZLe} into M equal groups and takes one feature from each group. As shown in the figure below, the outputs of the 24 transformer layers are divided into 4 groups and the last layer of each group is taken, namely {Z6, Z12, Z18, Z24}; the subsequent Decoder processes only these selected features.
    Specifically, each selected Zl is first reshaped from the 2D shape (H × W)/256 × C back to the 3D shape H/16 × W/16 × C, then passed through a 3-layer convolution (1 × 1, 3 × 3, 3 × 3) and 4× bilinear upsampling, with top-down fusion across streams. To strengthen the interaction between the different Zl, the last stream in the figure below in principle carries the information of all three features above it; after fusion, a 3 × 3 convolution and 4× bilinear upsampling restore the original resolution.

(Figure: the MLA decoder, selecting {Z6, Z12, Z18, Z24} and fusing them top-down.)
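Below is a minimal sketch of the Naive and PUP decoder heads described above (the channel widths and normalization details are assumptions; MLA is omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveDecoder(nn.Module):
    """Naive head: 1x1 conv -> BN/ReLU -> 1x1 conv to num_classes, then one
    bilinear upsample back to the input resolution (sketch of the description above)."""
    def __init__(self, in_dim=1024, num_classes=19):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_dim, in_dim, 1), nn.BatchNorm2d(in_dim), nn.ReLU(inplace=True),
            nn.Conv2d(in_dim, num_classes, 1),
        )

    def forward(self, feat, out_size):             # feat: (B, C, H/16, W/16)
        return F.interpolate(self.head(feat), size=out_size,
                             mode="bilinear", align_corners=False)

class PUPDecoder(nn.Module):
    """PUP head: alternate 3x3 conv and 2x bilinear upsampling four times,
    so H/16 x W/16 is restored progressively to H x W (channel widths are assumptions)."""
    def __init__(self, in_dim=1024, num_classes=19, width=256):
        super().__init__()
        dims = [in_dim, width, width, width, num_classes]
        self.stages = nn.ModuleList(
            [nn.Conv2d(dims[i], dims[i + 1], 3, padding=1) for i in range(4)]
        )

    def forward(self, feat):                       # feat: (B, C, H/16, W/16)
        for conv in self.stages:
            feat = F.interpolate(conv(feat), scale_factor=2,
                                 mode="bilinear", align_corners=False)
        return feat                                # (B, num_classes, H, W)

feat = torch.rand(1, 1024, 16, 16)                 # patch tokens reshaped back to a 2D map
print(NaiveDecoder()(feat, (256, 256)).shape)      # torch.Size([1, 19, 256, 256])
print(PUPDecoder()(feat).shape)                    # torch.Size([1, 19, 256, 256])
```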

4.3 Decoder Design

(Figure: schematic of the three decoder designs: Naive, PUP, and MLA.)

5. Summary

Overall, several important contributions of SETR are as follows:

  • It treats the segmentation task as a sequence-to-sequence prediction task, offering a new design perspective for semantic segmentation models.
    Unlike existing FCN-based models that use dilated convolutions and attention modules to enlarge the receptive field, SETR uses the transformer as the encoder and performs global context modeling in every encoder layer, completely removing the FCN's dependence on gradually accumulating the receptive field through stacked convolutions.

  • Combined with the three designed decoders of different complexity, this forms a powerful segmentation model without the fancy operations found in some recent models.

  • Extensive experiments also show that SETR achieves new SOTA or competitive results on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU), and Cityscapes, and ranks first on the ADE20K test server leaderboard.

However, SETR also has shortcomings: like ViT, it depends heavily on pre-training and on dataset size to achieve good results.

6. Experiment-Reproduction

paper experiments

The official Segmentation Transformer source code is based on the MMSegmentation framework, which is not easy to read and learn from. If you want to use the official version, you do not need to refer to this blog.
1. Note: steps for training on the CityScapes dataset with the Segmentation Transformer (SETR) (PyTorch version)
Code: a third-party reproduction (non-MMSegmentation, incomplete)

2. Concise version: third-party reproduction (non-MMSegmentation)
Notes:
Original notes of that reproduction
Model modification: edge detection based on the SETR model

The given code keeps the size of the input and output images unchanged. **The building blocks of the entire SETR have essentially been written already; you only need to set the corresponding parameters to use it directly.** Replace the entire model part of the Dense Extreme Inception Network with the SETR model above; the other parameters can remain unchanged, i.e., still the training settings of the Dense Extreme Inception Network.

from SETR.transformer_seg import SETRModel
import torch

if __name__ == "__main__":
    # Build a SETR model: 32x32 patches, 3 input channels, 1 output channel,
    # 8 transformer layers with 16 attention heads, and a 4-stage decoder.
    net = SETRModel(patch_size=(32, 32),
                    in_channels=3,
                    out_channels=1,
                    hidden_size=1024,
                    num_hidden_layers=8,
                    num_attention_heads=16,
                    decode_features=[512, 256, 128, 64])
    t1 = torch.rand(1, 3, 256, 256)   # a dummy 256x256 RGB input
    print("input: " + str(t1.shape))

    # print(net)
    print("output: " + str(net(t1).shape))

Then, for the model part, you only need to replace the original model with the SETR model; nothing else needs to be modified. Since there are no pre-trained parameters, no parameters have to be loaded, and training starts from scratch.

Original code:
# Instantiate model and move it to the computing device
model = DexiNedVit().to(device)

After modification:
model = SETRModel(patch_size=(32, 32),
                 in_channels=3,
                 out_channels=1,
                 hidden_size=1024,
                 sample_rate=5,
                 num_hidden_layers=1,
                 num_attention_heads=16,
                 decode_features=[512, 256, 128, 64, 32]).to(device)

Origin blog.csdn.net/zhe470719/article/details/124301829