[Deep Learning] Semantic Segmentation - Overview (Convolution)

0. Note reference

1. Introductory summary
2. The most comprehensive review of semantic segmentation in history in July 2019 (FCN, UNet, SegNet, Deeplab, ASPP...)
3. Neural network - review of semantic segmentation network
4. A review of the latest semantic segmentation that can be understood after reading it
5. FCN Detailed Explanation
6. Semantic Segmentation—SegNet: A Deep Convolutional Encoder-Decoder Architecture... (Paper Interpretation 1)
7. Semantic Segmentation Overview
8. Image Semantic Segmentation Overview
9. DeepLab Series Summary
10. Research Progress in Image Semantic Segmentation Methods
11. 2021 Summary of Semantic Segmentation Guide
12. Semantic Segmentation Model Architecture Evolution and Related Papers Reading -- strongly recommended!
13. Research progress on image semantic segmentation methods

1. Purpose

Given an image, we need to classify each pixel in the image individually; the result is shown in the figure below.

2. Difficulties

1. Data problem: unlike detection and similar tasks, which only need box- or image-level labels, segmentation requires precise pixel-level annotation, including the outline of every target, which makes building datasets very expensive;

2. Computing resources: to obtain a high-accuracy semantic segmentation model, a deep network such as ResNet-101 is usually needed. At the same time, segmentation predicts every pixel, so the feature-map resolution should be kept as high as possible, which strains computing resources. Lightweight networks exist, but their accuracy is still too low;

3. Fine segmentation: current methods classify large objects in the image quite well, but small objects have outlines that are too small to localize precisely, which lowers accuracy;

4. Context information: context is very important in segmentation; without it, a single target may be split into multiple parts, or targets of different categories may be assigned the same class;

3. Datasets and Evaluation Indicators

3.1 Dataset

VOC2012: contains 20 object classes, including people, motor vehicles and others, plus background; it can be used for object-class or background segmentation.

MSCOCO: a large-scale dataset for image recognition, segmentation and annotation. It hosts several competitions; the track most relevant to this field is the detection part, since part of it is dedicated to the segmentation problem. The competition covers more than 80 object categories.

Cityscapes: a dataset for semantic understanding of urban street scenes, collected in 50 cities.

Stanford Background Dataset: a set of outdoor scenes containing at least one foreground object.

Pascal Context: indoor and outdoor scenes with more than 400 categories.

3.2 Evaluation indicators

1. Execution time: speed or running time is a valuable metric, because most systems must guarantee that inference time meets hard real-time requirements. However, its influence is rarely evident in common experiments, and this metric depends heavily on the hardware and the backend implementation, which makes some comparisons meaningless.

2. Memory usage: for the same running time, it is extremely valuable to record the peak and average memory usage of the system while it runs.

3. Accuracy: this refers to pixel-wise labeling accuracy. Assume there are k+1 classes in total (L0 to Lk, one of which is the background). Pij denotes the number of pixels that belong to class i but are predicted as class j; Pii is the number of correctly predicted pixels; Pij and Pji are the false positives and false negatives, respectively.

1) Pixel Accuracy (PA): the proportion of correctly labeled pixels among all pixels. This is the most basic and simplest metric: PA is the ratio of correctly predicted pixels to the total number of pixels.
PA = Σi Pii / Σi Σj Pij
2) Mean Pixel Accuracy (MPA): compute, for each class, the proportion of its pixels that are correctly classified, then average over all classes.
MPA = (1 / (k+1)) Σi Pii / Σj Pij
3) Mean Intersection over Union (MIoU): the standard metric for semantic segmentation. It computes the ratio of intersection to union between two sets, the ground truth and the predicted segmentation, evaluates the IoU for every class, and then averages over the classes.

For one class, IoU = true positives / (true positives + false positives + false negatives). This metric reflects both how completely the target is captured (the prediction should overlap the label as much as possible) and how precise the model is (the union should coincide with the intersection as much as possible).
IoU is usually computed per class, but can also be computed per image; always check the evaluation protocol of the dataset.
The class-based variant accumulates the IoU of each class and then averages, giving a global evaluation; this is the mean intersection over union (mean IoU):
MIoU = (1 / (k+1)) Σi Pii / (Σj Pij + Σj Pji − Pii)
4) Frequency Weighted Intersection over Union (FWIoU): an improvement over MIoU that weights the IoU of each class by its pixel frequency.
FWIoU = (1 / Σi Σj Pij) Σi (Σj Pij) · Pii / (Σj Pij + Σj Pji − Pii)
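
As a concrete illustration (not part of the original post), the sketch below computes PA, MPA, MIoU and FWIoU from a confusion matrix with NumPy; the function names `confusion_matrix` and `segmentation_metrics`, and the inputs `label` and `pred` (integer class maps), are illustrative assumptions.

```python
import numpy as np

def confusion_matrix(label, pred, num_classes):
    # hist[i, j] counts pixels of true class i predicted as class j
    mask = (label >= 0) & (label < num_classes)          # ignore invalid pixels
    hist = np.bincount(
        num_classes * label[mask].astype(int) + pred[mask].astype(int),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    return hist

def segmentation_metrics(hist):
    eps = 1e-10
    pa = np.diag(hist).sum() / (hist.sum() + eps)                        # Pixel Accuracy
    cpa = np.diag(hist) / (hist.sum(axis=1) + eps)                       # per-class accuracy
    mpa = np.nanmean(cpa)                                                # Mean Pixel Accuracy
    iou = np.diag(hist) / (hist.sum(axis=1) + hist.sum(axis=0)
                           - np.diag(hist) + eps)                        # per-class IoU
    miou = np.nanmean(iou)                                               # Mean IoU
    freq = hist.sum(axis=1) / (hist.sum() + eps)                         # class frequencies
    fwiou = (freq * iou).sum()                                           # Frequency Weighted IoU
    return pa, mpa, miou, fwiou
```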

4. Implementation architectures

(1). Encoder-decoder architecture (FCN, SegNet, U-Net)

  • The encoder is generally a network pre-trained for image classification; its repeated max pooling and strided convolutions are beneficial for capturing long-range context and therefore for better classification results.
    However, in this process the feature resolution keeps shrinking and image detail is lost, which is a major challenge for segmentation. A decoder is therefore needed after the encoder to restore the image resolution.
  • The task of the decoder is to semantically project the discriminative features (low resolution) learned by the encoder onto the pixel space (high resolution), producing a dense classification.

(2). Context module (Multi-scale context aggregation, DeepLab V1, V2 and CRF-RNN, etc.)
The context module is generally cascaded after the model to capture long-range context information. A typical example is cascading DenseCRF after DeepLab: DenseCRF can model relationships between pixels over arbitrary distances, improving the segmentation obtained by per-pixel classification.

(3). Pyramid pooling method (PSPNet, DeepLab V2, V3, V3+, etc.)
The pyramid pooling method acts on convolutional features, and can obtain corresponding context information at any scale . Generally, parallel multi-scale atrous convolution (ASPP) or multi-region pooling (PSPNet) is used to obtain the features of the corresponding scale context information, and finally they are fused to form a feature vector that integrates multiple scale contexts.

The first method uses image pyramids to extract features of each scale, then puts images of all scales into CNN to obtain segmentation results of different scales, and finally fuses the segmentation results of different resolutions to obtain the original resolution segmentation results.

The second approach is an encoder-decoder structure, which extracts multi-scale information from the encoder structure and recovers the spatial resolution in the decoder structure. Thus longer range information can be extracted in deeper encoder outputs.

The third method is to insert an additional structure on top of the original network structure to capture long-range information. Generally, dense CRF is used for information extraction.

The fourth method is the spatial pooling pyramid , which uses multiple convolution kernels or pooling layers of different sizes and receptive fields to scan the input image to capture the multi-scale information

5. Model Development

In the traditional semantic segmentation problem, there are three challenges:

  • Continuous pooling and downsampling in traditional classification CNNs reduce the spatial resolution. (Solution: remove the downsampling and max pooling of the last few layers and use upsampled (atrous) filters to obtain denser feature maps.)
  • For detecting objects at multiple scales, rescaling and aggregating feature maps works but is computationally heavy. (Solution: resample the feature layers to obtain multi-scale context; use multiple parallel atrous convolutions for multi-scale sampling, known as ASPP.)
  • For object-centric classification, spatial transformation invariance has to be guaranteed. (Solution: skip-layer structures that extract high-level features from multiple layers for prediction; a fully connected conditional random field to refine boundary prediction.)
    Link: https://www.jianshu.com/p/9184455a4bd3

5.1 Symmetric Semantic Segmentation Model Based on Full Convolution

When solving image classification problems, a fully connected layer is often added at the end of the convolutional neural network model, and the output feature map of the convolutional layer is mapped into a feature vector, which can only complete the classification of the entire image.
When solving the semantic segmentation problem, the convolutional layer is used to replace the fully connected layer in the convolutional neural network model to obtain a fully convolutional network , which can complete the classification of each pixel in the image.

5.1.1 FCN (2014/11/14)

Fully Convolutional Networks for Semantic Segmentation

Replace the fully connected layer in the convolutional neural network model with 1×1 convolution , and then obtain the probability value of each pixel belonging to each category through the Softmax layer. The true category of each pixel is the category with the largest corresponding probability value. Finally, a segmented image with the same resolution as the original image is obtained. Therefore, all layers in the FCN network are convolutional layers, so it is called a fully convolutional network.

Main contribution

Extends end-to-end convolutional networks to semantic segmentation.
1. Fully convolutional: solves the pixel-wise prediction problem. By replacing the last fully connected layers of a base network (such as VGG) with convolutional layers, the network accepts inputs of any size, and the output size corresponds to the input.
2. Upsampling through deconvolution layers: restores the spatial size of the feature maps, enabling subsequent pixel-by-pixel prediction.
3. Skip connections: improve the coarseness of the upsampled output by fusing high-level and low-level feature information. Through cross-layer connections, the fine-grained information of shallow layers is combined with the coarse semantic information of deep layers to achieve accurate segmentation.

5.1.1.1 Specific process

1. An RGB image is fed into the convolutional network and passes through multiple convolution and pooling stages, producing a series of feature maps.
2. A deconvolution (transposed convolution) layer then upsamples the feature map of the last convolutional layer so that it matches the size of the original image; this allows a prediction for every pixel while preserving its spatial position in the original image.
3. Finally, pixel-by-pixel classification is performed on the upsampled feature map, computing the softmax classification loss per pixel.

Upsampling restores the feature map to the same size as the input image, so that every pixel can be classified.

After conv1 and the first pooling, the pool1 feature map is obtained; height and width are reduced to 1/2 of the original.
After conv2 and the second pooling of pool1, the pool2 feature map is obtained; height and width are reduced to 1/4.
After conv3 and the third pooling of pool2, the pool3 feature map is obtained; height and width are reduced to 1/8.
After conv4 and the fourth pooling of pool3, the pool4 feature map is obtained; height and width are reduced to 1/16.
After conv5 and the fifth pooling of pool4, the pool5 feature map is obtained; height and width are reduced to 1/32.

However, due to the loss of part of the information in the pooling operation, even the upsampling operation with the deconvolution layer will produce a rough segmentation map. Therefore, this paper also introduces skip connections from high-resolution feature maps.

   There are three final segmentation variants:
   The pool5 feature map is upsampled 32x to the input resolution and passed through a Softmax layer to obtain the FCN-32s segmentation map.
   The pool5 feature map is upsampled 2x and fused with the pool4 feature map, then upsampled 16x to the input resolution and passed through a Softmax layer to obtain the FCN-16s segmentation map.
   The pool5 feature map is upsampled 2x and fused with pool4; the result is upsampled 2x again and fused with pool3, then upsampled 8x to the input resolution and passed through a Softmax layer to obtain the FCN-8s segmentation map.

Since multi-feature fusion is beneficial to improve the accuracy of semantic segmentation, FCN-8s performs better than FCN-16s, which in turn performs better than FCN-32s.
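
A minimal PyTorch sketch of the FCN-8s fusion just described, assuming a backbone that already yields pool3 (1/8), pool4 (1/16) and pool5 (1/32) feature maps. Bilinear interpolation stands in here for the paper's learned transposed convolutions, and the layer names and channel arguments are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCN8sHead(nn.Module):
    def __init__(self, c3, c4, c5, num_classes):
        super().__init__()
        # 1x1 "scoring" convolutions that map features to per-class scores
        self.score5 = nn.Conv2d(c5, num_classes, 1)
        self.score4 = nn.Conv2d(c4, num_classes, 1)
        self.score3 = nn.Conv2d(c3, num_classes, 1)

    def forward(self, pool3, pool4, pool5, out_size):
        s5 = self.score5(pool5)
        # fuse by element-wise addition after 2x upsampling, as in FCN-16s / FCN-8s
        s4 = self.score4(pool4) + F.interpolate(s5, scale_factor=2,
                                                mode="bilinear", align_corners=False)
        s3 = self.score3(pool3) + F.interpolate(s4, scale_factor=2,
                                                mode="bilinear", align_corners=False)
        # final 8x upsampling back to the input resolution: per-pixel class scores
        return F.interpolate(s3, size=out_size, mode="bilinear", align_corners=False)
```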
As shown in the figure below, FCN converts the fully connected layers of a traditional CNN into convolutional layers. For the corresponding CNN, FCN converts the last three fully connected layers into three convolutional layers. In the traditional CNN structure, the first five layers are convolutional layers, the 6th and 7th layers are 1-D vectors of length 4096, and the 8th layer is a 1-D vector of length 1000 corresponding to the probabilities of 1000 classes.
FCN represents these three layers as convolutional layers whose kernel sizes (channels, width, height) are (4096, 1, 1), (4096, 1, 1) and (1000, 1, 1). Numerically nothing seems to change, but convolution and full connection differ in concept and computation: the previously trained CNN weights and biases are reused, but each weight now has its own local scope and belongs to its own convolution kernel. Because every layer in FCN is a convolutional layer, it is called a fully convolutional network.
The figure below shows the fully convolutional version; the difference from the figure above is the size annotation of each feature map.
In the CNN, the input image size is fixed and resized to 227x227. After the first pooling layer it is 55x55, after the second pooling layer 27x27, and after the fifth pooling layer 13x13.
The image input to FCN is of size H x W. After the first pooling layer it becomes 1/4 of the original size, after the second layer 1/8, after the fifth layer 1/16, and after the eighth layer 1/32 (errata: in the actual code the first layer gives 1/2, and so on).

After multiple convolutions and poolings the feature maps become smaller and smaller and their resolution lower and lower. The smallest map, of size H/32 x W/32, is called a heatmap. The heatmap is the most important high-dimensional feature map; after obtaining it, the final and most important step is to upsample it, enlarging it back to the size of the original image.
The final output is 1000 heatmaps upsampled to the size of the original image. To classify each pixel and produce the segmented image, a small trick is used: for every pixel position, take the class whose heatmap has the maximum value (probability) at that position among the 1000 maps as the classification of that pixel. This yields a classified image, such as the dog-and-cat picture on the right in the figure below.

5.1.1.2 CNN and FCN

The difference between FCN and CNN is that the last fully connected layer of CNN is replaced by a convolutional layer, and the output is a picture that has been labeled.

The function of the fully connected layer:
Usually, the CNN network will be connected with several fully connected layers after the convolutional layer, and map the feature map generated by the convolutional layer into a fixed-length feature vector .
The classic CNN structure represented by AlexNet is suited to image-level classification and regression tasks, since they all expect a single numerical description (probability) of the entire input image; for example, AlexNet's ImageNet model outputs the probability of the image belonging to each class (softmax-normalized).

Example: the cat in the figure below is fed into AlexNet, which outputs a vector of length 1000 giving the probability of the input image belonging to each class; the probability of "tabby cat" is the highest.
FCN classifies images at the pixel level, thereby solving the problem of semantic segmentation at the semantic level.
Unlike the classic CNN, which appends fully connected layers after the convolutional layers to obtain a fixed-length feature vector for classification (fully connected layers + softmax output),
FCN accepts input images of any size. The feature maps of the last convolutional layer are upsampled back to the size of the input image, so that a prediction is produced for every pixel while the spatial information of the original input is preserved, and pixel-by-pixel classification is finally performed on the upsampled feature map.

Finally, the loss of softmax classification is calculated pixel by pixel, which is equivalent to each pixel corresponding to a training sample.

5.1.1.3 Fully connected layer -> Convolutional layer

The only difference between a fully connected layer and a convolutional layer:
neurons in a convolutional layer are connected only to a local region of the input, and neurons within a convolutional column share parameters. In both types of layers, however, the neurons compute dot products, so their functional form is identical.
Therefore, it is possible to convert the two into each other:

  • For any convolutional layer, there is a fully connected layer that implements the same forward propagation function as it. The weight matrix is ​​a huge matrix that is zero except for some specific blocks. In most of these blocks, the elements are equal.
  • Conversely, any fully connected layer can be converted into a convolutional layer .
    For example, take a fully connected layer with K=4096 whose input volume is 7∗7∗512. This fully connected layer can be equivalently viewed as a convolutional layer with F=7, P=0, S=1, K=4096. In other words, the filter size is set to match the size of the input volume. Since only a single depth column covers the input volume as the filter slides, the output is 1∗1∗4096, identical to that of the original fully connected layer.

Converting fully-connected layers to convolutional layers:
Of the two transformations, converting fully-connected layers to convolutional layers is more useful in practice.
Assuming that the input of a convolutional neural network is a 224x224x3 image , a series of convolutional and downsampling layers transform the image data into an activation data volume of size 7x7x512. AlexNet uses two fully connected layers of size 4096, and the last fully connected layer with 1000 neurons is used to calculate the classification score.
We can convert any of these 3 fully connected layers into a convolutional layer :

For the first fully connected layer, whose input volume is [7x7x512], set the filter size to F=7, so that the output volume is [1x1x4096].
For the second fully connected layer, let its filter size be F=1, so that the output data volume is [1x1x4096].
Do the same for the last fully connected layer, make it F=1, and the final output is [1x1x1000]

In practice, each such transformation requires reshaping the weight W of the fully connected layer into a filter of the convolutional layer .

So what is such a transformation good for?
It is more efficient in the following situation: we want to slide the convolutional network over a larger input image to obtain multiple outputs, and the transformation lets us do all of this in a single forward pass.

For example: suppose we want a 224×224 window to slide over a 384×384 image with a stride of 32, feeding each stop position into the convolutional network, finally obtaining the class scores for 6×6 positions.
The above method of converting the fully connected layer into a convolutional layer will be more convenient.
If the 224×224 input picture gets an array of [7x7x512] after going through the convolutional layer and the downsampling layer, then the 384×384 large picture will get [12x12x512] directly after the same convolutional layer and downsampling layer array. Then go through the 3 convolutional layers converted from the 3 fully connected layers above, and finally get the output of [6x6x1000] ((12 – 7)/1 + 1 = 6). This result is exactly the score of the 6×6 positions where the floating window stopped in the original image!
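
A hedged sketch of this FC-to-conv conversion (AlexNet/VGG-style shapes are assumed for illustration; this is not the original code). Running the converted head on the larger 384x384 features yields the 6x6 score map mentioned above in a single forward pass.

```python
import torch
import torch.nn as nn

# A 7x7x512 input to a 4096-way FC layer is equivalent to a 7x7 convolution
# with 4096 output channels; the later FC layers become 1x1 convolutions.
fc_as_conv = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7),   # was: Linear(7*7*512, 4096)
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),  # was: Linear(4096, 4096)
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 1000, kernel_size=1),  # was: Linear(4096, 1000)
)

x224 = torch.randn(1, 512, 7, 7)     # features of a 224x224 input
x384 = torch.randn(1, 512, 12, 12)   # features of a 384x384 input
print(fc_as_conv(x224).shape)        # torch.Size([1, 1000, 1, 1])
print(fc_as_conv(x384).shape)        # torch.Size([1, 1000, 6, 6]) -> 6x6 score map
```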

5.1.1.4 upsampling

Compared with iteratively evaluating all 36 positions with the original (untransformed) convolutional network, a single forward pass of the transformed network is far more efficient, because the 36 computations share intermediate results. This trick is often used in practice to get better results: for instance, an image is resized to be larger, the transformed convolutional network evaluates many spatial positions to obtain classification scores, and the scores are then averaged.

Finally, what if we want a sliding window with a stride smaller than 32? This can be handled with multiple forward passes. For example, to use a stride of 16: first run the original image through the converted convolutional network, then shift the image by 16 pixels along the width, then along the height, and finally along both, and run each shifted image through the network as well.

As shown in the figure below, as the image shrinks while passing through the network, its features become more salient, as indicated by the colors in the figure. Of course, the map in the last layer is not literally a single pixel; it is a map of size H/32 x W/32, drawn as one pixel for simplicity.
As shown in the figure below, after conv1 and pool1 the image is reduced to 1/2 of the original; after conv2 and pool2, to 1/4; after conv3 and pool3, to 1/8 (the pool3 feature map is kept); after conv4 and pool4, to 1/16 (the pool4 feature map is kept); and after conv5 and pool5, to 1/32. The fully connected layers of the original CNN are then converted into the convolution operations conv6 and conv7; the number of feature maps changes but the spatial size remains 1/32 of the original image. At this point the map is no longer called a feature map but a heatmap.

Now we have the 1/32-size heatmap, the 1/16-size feature map and the 1/8-size feature map. Upsampling the 1/32 heatmap alone cannot restore the image well, because it contains only the features captured by conv5 and its precision is limited. The network therefore iterates forward: the upsampled map is fused with the conv4 (pool4) features to add detail (similar to an interpolation/refinement step), and that result is fused with the conv3 (pool3) features to add further detail, finally completing the restoration of the whole image.

5.1.1.5 Limitations

1. The result obtained is still not fine enough. Although the effect of 8 times upsampling is much better than that of 32 times, the result of upsampling is still relatively blurred and smooth, and it is not sensitive to the details in the image.
2. Classify each pixel without fully considering the relationship between pixels. The spatial regularization step used in the usual pixel classification based segmentation methods is ignored and lacks spatial consistency.

personal comment

This paper is regarded as a pioneering work in semantic segmentation, and its fully convolutional structure is still used in the most advanced segmentation models.

5.1.2 SegNet (2015/11/2)

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
PAMI2017(IEEE Transactions on Pattern Analysis and Machine Intelligence)

encoder-decoder architecture
encoder-decoder is based on FCN architecture.
The encoder gradually reduces the spatial dimension due to pooling, and the decoder gradually restores the spatial dimension and detail information. Usually there is a shortcut connection (shortcut connection, that is, a cross-layer connection) from the encoder to the decoder.

Problem addressed:
In FCN the receptive field is fixed, and the details of segmented objects are easily lost or smoothed. Max pooling and downsampling reduce the resolution of the feature maps, and the lost resolution means lost boundary information.

Main contribution:
pooling indices are used to preserve position information and reduce the number of parameters.
The novelty of SegNet lies in the upsampling method of the decoder stage. Specifically, the decoder upsamples using the max-pooling indices recorded during downsampling in the encoder; the upsampled maps are sparse and are then convolved with trainable filters to produce the final segmentation map.

5.1.2.1 Structure

The ideas of SegNet and FCN are very similar.
The encoding part:
mainly consists of the first 13 convolutional layers and 5 pooling layers of the VGG16 network.
Function: this part extracts features from the input for classification, which is why a pre-trained VGG is used; the FC layers are discarded to keep a higher resolution and to reduce the number of parameters.

Decoding part:
It likewise consists of 13 convolutional layers and 5 upsampling layers.
Each encoder corresponds to a decoder, so the decoder also has 13 layers; it maps the low-resolution feature maps back to the input resolution for classification (producing the mask).

Pixel-wise classification layer:
The output of the decoder (high-dimensional features) is fed into a trainable softmax classifier, which independently produces a class probability for each pixel.

Encoder network:
The encoder network is divided into 5 blocks; each block consists of Conv + BN followed by MaxPooling. MaxPooling performs the downsampling with a kernel of size 2 and a stride of 2.

Because the first 13 layers of the pre-trained VGG16 model are used, the number of parameters drops considerably (the FC layers are gone, which removes most of the parameters). This also differs from the original VGG16, as shown in the figure above: the convolutional layers use a Conv + Batch Norm + ReLU structure.

The indices of each max pooling layer in the encoder are stored for later use in the decoder to unpool the corresponding feature maps using those stored indices.

Decoder network:
Whenever the encoder applies pooling, the model records the pooling indices (the positions of the maxima before pooling), and the decoder restores the feature maps using these recorded positions; this is the innovation of the paper. The decoder network also has 5 blocks, each consisting of Upsampling + Conv + BN. Note that the decoder stage has no nonlinear activation (no ReLU).

Classification layer:
a convolutional layer is added on top of the decoder output; the number of kernels equals the number of classes, so each channel represents the segmentation result of one class.

5.1.2.2 decoder variant SegNet-Basic

Smaller version of SegNet, 4 encoders and 4 decoders,

1. The encoder stage is LRN + (Conv + BN + ReLU + MaxPool) x 4; the paper specifies that the convolutions use no bias.

2. The decoder stage is (UpPool + Conv + BN) x 4 + Conv (segmentation layer).

The convolution kernel size is always 7x7, and the receptive field of the top-layer feature map covers a 106x106 region of the original image.

The receptive field of the pooling and convolution layers is computed iteratively from the top layer down to the bottom as
Srf_out = (Srf_in − 1) × Stride + Ksize,
where Srf is the receptive field, Stride is the stride, and Ksize is the kernel size.

5.1.2.3 Comparing SegNet and FCN to implement decoder

Upsampling is the inverse process of pooling.
A: Upsampling makes the feature map 2x larger, but each pooling window discarded 3 of its 4 values, and those values cannot be recovered.
B: Deconvolution is used to fill in the missing content, which helps maintain the integrity of high-frequency information.
SegNet uses the stored indices during unpooling: each value is put back at its recorded position, and the result is followed by convolutions that are trained. The upsampling itself requires no training (it only costs some extra storage).
FCN adopts the transposed-convolution strategy: upsampling is obtained by deconvolving the features, and this process must be learned. At the same time, the corresponding encoder feature map is reduced in channel dimension so that its channels match the upsampled map, and the two are added element-wise to obtain the decoder output.
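
A small sketch contrasting the two upsampling strategies (names and shapes are illustrative, not from either paper's code): SegNet-style max-unpooling with stored indices has no learned weights, while FCN-style transposed convolution is learned.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# SegNet: pooling returns indices, unpooling puts each value back in place
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)
y, idx = pool(x)          # 1x64x16x16 plus the max indices
x_rec = unpool(y, idx)    # 1x64x32x32, sparse; trainable convs would densify it

# FCN: learned transposed convolution (its weights are trained)
deconv = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2)
x_up = deconv(y)          # 1x64x32x32, dense
```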

5.1.2.4 Conclusion and evaluation

The key of the model is that the decoder upsampling reuses the pooling indices recorded in the encoder stage, which improves memory efficiency and sharpens the segmentation along boundaries.
However, SegNet's inference is not significantly better than FCN's, so the end-to-end modeling ability still needs improvement.

Although SegNet is not as good as FCN-8s in evaluation indicators, its encoding-decoding idea has influenced many subsequent models.
Both FCN and SegNet are encoder-decoder architectures.
SegNet's benchmark performance is too poor, it is not recommended to use this network.

5.1.3 Unet and its variants (2015/5/18)

U-Net: Convolutional Networks for Biomedical Image Segmentation

U-Net was designed for the segmentation of medical images. Because of the particularities of medical imaging, the amount of data available for training is relatively small. The method proposed in this paper makes effective use of a small training set, and also proposes an efficient way of processing large-scale images.

  • overlap-tile strategy
  • Random Elastic Deformation for Data Augmentation
  • weighted loss is used

structure

The Unet network is very simple. The first half is feature extraction, and the second half is upsampling. Such a structure is also called an encoder-decoder structure in some literatures. Because the overall structure of this network is similar to the uppercase English letter U, it is named U-net.

(1). U-Net adopts a different feature-fusion method: concatenation.
A: Features are concatenated along the channel dimension to form thicker features.
B: In FCN, skip connections are fused by summing corresponding pixels, whereas U-Net concatenates along the channel dimension.

(2). The upsampling path fuses the outputs of the feature-extraction path; this is essentially multi-scale feature fusion (see the sketch after this list).
A: Taking the last upsampling step as an example, its input comes both from the output of the first convolution block (same-scale features) and from the previous upsampling output.
B: The skip-and-concatenate architecture allows the decoder, at each stage, to recover relevant features that were lost by pooling in the encoder.
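
A minimal sketch of the two skip-fusion styles (tensor names and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

enc = torch.randn(1, 64, 128, 128)   # shallow encoder feature
dec = torch.randn(1, 64, 128, 128)   # upsampled decoder feature of the same size

# FCN-style fusion: element-wise addition (channel counts must match)
fused_fcn = enc + dec                            # 1x64x128x128

# U-Net-style fusion: concatenate along channels, then a conv mixes them
fused_unet = torch.cat([enc, dec], dim=1)        # 1x128x128x128
mix = nn.Conv2d(128, 64, kernel_size=3, padding=1)
out = mix(fused_unet)                            # 1x64x128x128
```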

Combination with Swin: Swin-UNet

Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation
Notes:
1. A plain-language explanation of Swin-Unet
2. A more detailed set of paper notes

Overview
Unlike previous uses of transformers in image segmentation, where the transformer only replaced the encoder of UNet (no prior work stepped outside that range), this work also uses a transformer in the decoder.

The main point
is how the Swin Transformer becomes a decoder: the patch expanding layer.

Contributions
1. Based on the Swin Transformer block, a symmetric encoder-decoder structure with skip connections is built. The encoder realizes self-attention from local to global; the decoder upsamples the global features to the input resolution to perform pixel-level segmentation prediction.
2. A patch expanding layer is developed to achieve upsampling and feature-dimension increase without convolution or interpolation operations. [This can be reused elsewhere.]
3. Experiments show that skip connections remain effective for Transformers, so a pure-transformer U-shaped encoder-decoder with skip connections is finally realized, called Swin-Unet.

Abstract
CNN Disadvantages:
Convolutional Neural Networks (CNNs) have achieved landmark progress in medical image analysis. Especially, deep neural networks based on U-shaped architecture and skip connections have been widely used in various medical image tasks. However, although CNN achieves excellent performance, it cannot learn global and long-range semantic information interactions well due to the limitations of convolution operations.

Proposed:
Swin-Unet is proposed, which is a Unet-like pure Transformer for medical image segmentation. Tokenized image patches are fed into a Transformer-based U-shaped En-Decoder architecture via skip connections for local global semantic feature learning.

Specifically, a hierarchical Swin Transformer with shifted windows is used as the encoder to extract contextual features.
And a symmetric Swin Transformer-based decoder with patch expansion layers is designed to perform upsampling operations to recover the spatial resolution of feature maps.

Network structure
Specific process:

  • The input medical image is segmented into non-overlapping image patches. Each block is regarded as a token, which is sent to the Transformer-based encoder to learn deep feature representation.
  • Then, the extracted contextual features are up-sampled by a decoder with a block expansion layer, and the multi-scale features from the encoder are fused by skip connections in order to recover the spatial resolution of the feature map for further segmentation prediction.

As shown in Figure 1, Swin-UNet consists of Encoder, Bottleneck, Decoder and skip connections.

Encoder
The encoder aggregates image features while downsampling: width and height are halved and the channel count doubles at each stage (a Swin block outputs the same resolution it receives, so the downsampling is realized by a linear layer). The input image first goes through patch partition with a patch size of 4x4, giving a tensor of size H/4 x W/4 x 48. After linear embedding the feature map is H/4 x W/4 x C, and it is fed into two consecutive Swin Transformer blocks, where the feature dimension and resolution remain unchanged.
Patch merging then downsamples, halving the resolution and doubling the feature dimension. This is repeated three times to form the encoder.

Bottleneck uses two consecutive Swin Transformer blocks. Here, in order to prevent the network from being too deep to converge, only two blocks are used. In Bottleneck, the feature size remains unchanged at H/32 x W/32 x 8C.

Decoder
The decoder upsamples the features back to the original image size so that each pixel can be classified. The Swin-UNet decoder implements upsampling mainly through patch expanding. As a completely symmetric structure, the decoder upsamples by a factor of 2 at each stage; its core module consists of a Swin Transformer block and a patch expanding layer.

Skip connections
The deeper a layer is, the larger the receptive field of its feature map; shallow layers focus on texture features while deep layers capture more abstract features. Through skip connections the feature vector carries both deep and shallow features (via concatenation): the skip connections fuse the encoder's multi-scale features with the upsampled features, concatenating shallow and deep features to reduce the spatial information lost by downsampling.
Finally, a linear layer keeps the dimensionality of the concatenated features the same as that of the upsampled features.

Patch merging layer
The input patches are divided into 4 groups and concatenated. This reduces the feature resolution from H/4 x W/4 to H/8 x W/8; the concatenation multiplies the feature dimension by 4, and a linear layer then reduces it to 2x the original. (Same as in Swin.)

Patch expanding layer
Taking the first one in the decoder as an example: before upsampling, a linear layer is applied to the input features (W/32 x H/32 x 8C) to double the feature dimension (to W/32 x H/32 x 16C). Then a rearrange operation expands the spatial resolution to 2x the input resolution and reduces the feature dimension to a quarter of the expanded dimension (W/32 x H/32 x 16C → W/16 x H/16 x 4C). (This is the inverse of patch merging.)
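
Below is a sketch of a patch expanding layer in the spirit of the description above, written with PyTorch and the einops package; the class name, the `input_resolution` argument and the exact layout are assumptions, not the authors' original code. It doubles H and W and halves C: (B, H*W, C) -> (B, 2H*2W, C/2).

```python
import torch
import torch.nn as nn
from einops import rearrange

class PatchExpanding(nn.Module):
    def __init__(self, input_resolution, dim):
        super().__init__()
        self.h, self.w = input_resolution
        self.expand = nn.Linear(dim, 2 * dim, bias=False)  # C -> 2C with a linear layer
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x):                  # x: (B, H*W, C)
        x = self.expand(x)                 # (B, H*W, 2C)
        # redistribute the channel dimension into a 2x2 spatial neighborhood
        x = rearrange(x, 'b (h w) (p1 p2 c) -> b (h p1 w p2) c',
                      h=self.h, w=self.w, p1=2, p2=2, c=x.shape[-1] // 4)
        return self.norm(x)                # (B, 2H*2W, C/2)
```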

Swin Transformer block
The Swin Transformer block is based on sliding windows, as shown in Figure 2, showing two consecutive Swin Transformer blocks.

  • Each swin Transformer is composed of layer normalization (LN), multi-head self-attention module, residual connection, two-layer MLP with GELU nonlinearity.
  • A window-based multi-head self-attention module (W-MSA) and a sliding window-based multi-head self-attention module are used in two continuous transformer modules, respectively.
  • Based on this window-partitioning mechanism, two consecutive Swin Transformer blocks can be formulated as:
    ẑ^l = W-MSA(LN(z^(l-1))) + z^(l-1)
    z^l = MLP(LN(ẑ^l)) + ẑ^l
    ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l
    z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)

Summary:
Swin-Unet simply replaces the 2D convolutions of Unet with Swin blocks in each feature-extraction module; the Swin structure and the Unet structure are basically unchanged, and the loss function is unchanged. It demonstrates once again the strong feature-extraction ability of the Swin module (the novelty feels limited, but the code is quite clean).

5.2 Dilated Convolution Semantic Segmentation Model Based on Full Convolution

The segmentation results based on the fully convolutional symmetric semantic segmentation model are rough, ignoring the spatial consistency relationship between pixels. So Google proposed a new dilated convolution semantic segmentation model, which considers the spatial consistency relationship between pixels and pixels, and can increase the receptive field without increasing the number of parameters.

5.2.1 Dilated convolution (2015/11/23)

Multi-Scale Context Aggregation by Dilated Convolutions

Innovation

  1. Use dilated convolutions for dense prediction.
  2. A context module is proposed , and Dilated Convolutions are used to integrate multi-scale information.

Solve the problem
(1). The common approach performs convolution followed by pooling on the image, reducing the resolution while enlarging the receptive field. Since segmentation predicts pixel by pixel, the smaller post-pooling feature maps must be upsampled back to the original image size for prediction.

(2). Adding pooling layers loses information and reduces accuracy, but removing them shrinks the receptive field so that global features cannot be learned. If pooling is removed and the convolution kernel is simply enlarged instead, the amount of computation inevitably grows.

Key:
The pooling operation increases the receptive field, which is very helpful for image classification, but the resolution it sacrifices hurts semantic segmentation. The authors therefore propose dilated convolution (called atrous convolution in DeepLab) to solve this problem: it enlarges the receptive field while maintaining the spatial resolution.

Dilated Convolution
A: Dilated convolution can integrate multi-scale context information without losing resolution and without analyzing rescaled images.
B: Dilated convolution needs no pooling or other downsampling; it supports exponential growth of the receptive field without loss of resolution.
When l = 1, it is a standard convolution.
When l > 1, it is a dilated convolution.

(a) The graph corresponds to a 3x3 convolution with an expansion rate of 1, which is the same as the normal convolution operation

(b) The picture corresponds to a 3x3 convolution with an expansion rate of 2, and the actual convolution kernel is still 3x3. That is, for a 7x7 image block, only 9 red points, that is, 3x3 convolution kernels, undergo convolution operations, and the rest of the points are skipped, that is, the weights are filled with 0. Although the size of the convolution kernel is only 3x3, the receptive field of this convolution has been increased to 7x7.

(c) The figure shows a 4-dilated convolution, which reaches a receptive field of 15x15.

Atrous convolution is suitable for dense prediction because it enlarges the receptive field without losing resolution. Its known issues are A: the gridding problem, where the receptive field samples the input discontinuously, and B: difficulty with small-scale objects.
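
A small PyTorch sketch of dilated convolution (shapes and the chosen rates are illustrative): with a 3x3 kernel and dilation r, setting padding = r keeps the spatial resolution, and stacking rates 1, 2, 4 grows the receptive field 3 → 7 → 15 as in the figure.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 65, 65)

for r in (1, 2, 4):
    # padding = r keeps the output the same spatial size as the input
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=r, dilation=r)
    y = conv(x)
    k_eff = 2 * r + 1                 # effective kernel size: 3, 5, 9
    print(r, tuple(y.shape), k_eff)   # resolution is preserved for every rate
```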

Network architecture
There are two types of network architecture, one is the front-end network, and the other is the front-end network + context module, which are introduced as follows:

  • Front-end network:
    The last two pooling layers of the VGG network are removed, and the subsequent convolutional layers are replaced by dilated convolutions: the dilation rate between pool3 and pool4 is 2, and after pool4 it is 4. The author calls this architecture the front end.

  • Front-end network + context module:
    In addition to the front-end network, the author also designed an architecture called a context module, which is added after the front-end network. A variety of dilated convolutions with different dilated rates are cascaded in the context block , so that multi-scale context information can be integrated, thereby improving the effect of front-end network prediction. It should be noted that the front-end network and the context block are trained separately, because the author found in the experiment that if they are combined for end-to-end training, the performance cannot be improved.

Applications
Dilated convolution introduces another parameter to the convolutional layer, the dilation rate. It enlarges the receptive field without adding computational cost. Dilated convolutions are now widely used; some of the most important works include the DeepLab family, Multi-Scale Context Aggregation, Dense Upsampling Convolution and Hybrid Dilated Convolution (DUC-HDC), DenseASPP, and ENet.

5.2.2 DeepLab series

  • Semantic segmentation is an intensive segmentation task for images, segmenting each pixel into a specified category;
  • Segment the image into several meaningful objects;
  • Assigns the specified class label to the object.

main contribution

Dilated (atrous) convolution is used to enlarge the receptive field while maintaining resolution.

Atrous spatial pyramid pooling (ASPP) is proposed to build a pyramid of dilated convolutions at different rates and integrate multi-scale information.

A fully connected CRF is used for post-processing to improve segmentation results.

Summary
DeepLabv1
is a semantic segmentation model formed by cascading a deep convolutional network with a probabilistic graphical model.
1. Since the deep convolutional network loses much detail through repeated max pooling and downsampling, the dilated (atrous) convolution algorithm is used to enlarge the receptive field and obtain more contextual information.
2. Since the spatial invariance of deep convolutional networks limits their localization accuracy in labeling tasks, a fully connected Conditional Random Field (CRF) is used to improve the model's ability to capture details.

The DeepLabv2
semantic segmentation model adds an ASPP (Atrous spatial pyramid pooling) structure, which uses multiple dilated convolutions with different sampling rates to extract features, and then fuses the features to capture contextual information of different sizes .

The DeepLabv3
semantic segmentation model adds global average pooling to ASPP, and adds batch normalization after parallel expansion convolution (ASPP) , which effectively captures global context information.

The DeepLabv3+
semantic segmentation model adds an encoder-decoder module and an Xception backbone on top of DeepLabv3.
The main purpose of the encoder-decoder module is to recover pixel-level detail so that segmentation boundaries are better preserved, while still encoding rich context information.
The purpose of the Xception backbone is to use depthwise separable convolutions to further improve both the accuracy and the speed of the algorithm.
In the Inception-style structure, a 1x1 convolution is first applied to the input, the channels are then split into groups, different 3x3 convolutions extract features from each group, and the group outputs are finally concatenated as the output.

5.2.2.1 DeepLabV1 (2014/12/22)

Original: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs
Backbone : VGG16
Contributions:
Atrous convolution (hole convolution)
CRF (conditional random field)

Since semantic segmentation is pixel-level classification, highly abstract spatial features are not well suited to this low-level task, so the size of the feature maps and their spatial invariance must both be considered.

  • The feature map shrinks because of strides. Stride > 1 is used to enlarge the receptive field; with stride = 1, keeping the same receptive field would require a larger kernel. The paper therefore uses the hole algorithm to enlarge the effective kernel and achieve the same receptive field, i.e. dilated convolution.

  • After an image is fed into a CNN it goes through progressive abstraction, and the original position information diminishes or even disappears with depth. In traditional image processing, a conditional random field performs smoothing: when deciding the value at one position, it also considers the values of the neighboring pixels, wiping out some noise.

  • Specific operations:
    DeepLabV1 modifies VGG16 as follows:
    remove the last two pooling layers of the original network and sample with dilated convolution at rate = 2. A standard convolution would only gather information from 1/4 of the original image, while the new dilated convolution can gather information over the whole image.
    1. First, the final fully connected layers are removed; using a fully convolutional network for semantic segmentation is the general trend.
    2. Then, the last two pooling layers are removed.
    The role of pooling:

    • shrink the size of the feature maps;
    • quickly enlarge the receptive field, so that more context can be used for analysis.

    Why remove pooling:
    Semantic segmentation is an end-to-end problem that requires precise classification of every pixel and is very sensitive to pixel positions; it is delicate work. Pooling, however, keeps throwing away position information that segmentation needs, so there is a conflict and the pooling layers have to go. Could all of them be removed? In theory yes, but in practice the GPU memory is not large enough and it would be too slow, so only two layers are removed.

3. After removing two pooling layers, the receptive field is no longer large enough, so atrous convolution is borrowed; this is the last modification to VGG16. Atrous convolution (also called dilated convolution) can enlarge the receptive field without increasing the amount of computation compared with traditional convolution.
The difference between atrous and traditional convolution:
traditional convolution samples three consecutive positions, giving a receptive field of 3, while atrous convolution samples with gaps, i.e. with the rate shown in the figure, expanding the receptive field to 5 (rate = 2), equivalent to two stacked traditional convolutions; by adjusting the rate, the receptive field can be chosen freely. This solves the receptive-field problem.
In addition, the paper points out that another advantage of dilated convolution is that it increases the density of the features. At first glance this is puzzling: convolution is one-to-one, the output is as large as the input, so how can the features become denser? The answer is that the two rows of the figure should not be viewed in isolation: the traditional convolution on top is the first convolution after pooling, while the light-pink inputs of the dilated convolution below are the pixels before pooling. The lower output is therefore twice as large as the upper one, and the features are accordingly twice as dense.

dilated convolution

In a fully convolutional network, the receptive field of a pixel on the feature map depends on the convolution and pooling operations. The receptive field of an ordinary convolution grows by only two pixels per layer, which is too slow. In traditional convolutional networks the receptive field is usually enlarged by pooling, but pooling reduces the image resolution while enlarging the receptive field and therefore loses information. Moreover, upsampling a pooled feature map cannot recover much of the lost detail, which ultimately limits segmentation accuracy.

So how can the receptive field be enlarged without pooling?
Dilated convolution was created for exactly this. As the name implies, it inserts "holes" (positions with value 0) into the convolution to enlarge the receptive field.
Dilated convolution introduces the dilation-rate hyperparameter, which specifies the spacing between effective values: with dilation rate r there are r-1 holes between two effective values, as shown in Figure 2, where the red dots are effective values and the green dots are holes. As shown in Figure 2(a), when r = 1 the dilated convolution reduces to an ordinary convolution.
Although the number of parameters of a 3x3 kernel stays at 9, a larger rate covers a much larger receptive field (for example, r = 3 gives 7x7 and r = 7 gives 15x15). Current deep-learning frameworks support dilated convolution well; only the dilation-rate hyperparameter needs to be set.

Fully Connected Conditional Random Field
A conditional random field (CRF) is used to improve classification accuracy. The effect is shown in the figure below, and the improvement is clearly visible. As for the detailed principle of CRF, the author did not study it further, since CRF was dropped anyway from V3 onward.

5.2.2.2 DeepLabV2 (2016/6/2)

Original: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
Published in: TPAMI 2017 (IEEE Transactions on Pattern Analysis and Machine Intelligence)
Backbone: ResNet-101
Contributions: ASPP

Change:
Deeplabv2 is improved on v1:

  • Use multiple scales for better segmentation (using Atrous Spatial Pyramid Pooling (ASPP))
  • The base layer is converted from VGG16 to ResNet
  • use different learning strategies
Atrous Spatial Pyramid Pooling (ASPP)

Core idea: Gather receptive fields of different scales
Function: solves the problem that different segmentation targets appear at different scales, since the same kind of object can differ in scale within one image or across images. Taking this picture as an example, the trees in it come in many sizes, and ASPP helps classify all of them better.
How is ASPP integrated into the network? As the figure shows: replace conv6 (the dilated convolution) of VGG16 with several dilated convolutions of different rates, follow each with conv7 and conv8, and finally fuse the branches (element-wise addition or 1x1 convolution).

5.2.2.3 DeepLabV3 (2017/6/17)

Original text: Rethinking Atrous Convolution for Semantic Image Segmentation
Backbone: ResNet-101
Contributions:
Improved ASPP structure, including the addition of BN, etc.
A module that cascades multiple dilated convolutions (going deeper with atrous convolution)
Removal of the CRF (conditional random field)

Specific improvements:
1. The CRF is abandoned, because classification accuracy has improved to the point where the CRF is no longer needed (or no longer helps).

The other two contributions are the improved ASPP and the use of dilated convolution to deepen the network. They can be seen as two alternatives: widen the network or deepen it. Generally speaking, "the DeepLab V3 model" refers to the former, because from the reported results and the subsequent development it is clearly the better choice.

2. For ASPP, two improvements have been made.

  • One is to apply batch normalization after the dilated convolutions;
  • the second is to add a 1x1 convolution branch and an image pooling branch.

These two branches are added to solve a problem caused by dilated convolution:
as the rate increases, the number of effective pixels covered by a dilated convolution (the pixels of the feature map itself; the zero-padded positions are not effective) gradually shrinks toward 1 (no figure here, so this has to be imagined).
As the dilation rate grows, more and more positions cannot use all of the kernel weights; when the rate is large enough, only the central weight remains effective and the dilated convolution degenerates into a 1x1 convolution. Losing weights is the lesser issue; the important loss is the global information of the image.

This defeats the original purpose (obtaining features over a wider range). To solve this problem:

  • One is to use 1 1 convolution, that is, when the rate increases , the degenerated form of 3 3 convolution replaces 3*3 convolution and reduces the number of parameters;
  • Another point is to increase image pooling, which can be called global pooling, to supplement global features . The specific method is to average the pixels of each channel, and then upsample to the original resolution.
    insert image description here
    3. Going deeper with atrous convolution. Why deepen the network? My understanding is that it is done to obtain a larger receptive field, and enlarging the receptive field naturally brings atrous convolution into the picture. Looking at the figure, with too much pooling the feature map becomes almost too small to see, so the authors instead keep deepening the network with atrous convolutions, which enlarge the receptive field without further downsampling.
    insert image description here
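To make the two new branches concrete, here is a minimal V3-style ASPP sketch (my own illustration, not the official code): three atrous branches with BN, plus the 1×1 branch and the image-pooling branch, fused by concatenation and a 1×1 projection. The rates (6, 12, 18) and the 256-channel width follow the values commonly quoted for DeepLabV3 at output stride 16.

import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, k, dilation=1):
    pad = 0 if k == 1 else dilation
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

class ASPPv3(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.conv1x1 = conv_bn_relu(in_ch, out_ch, 1)              # 1x1 branch
        self.atrous = nn.ModuleList(
            [conv_bn_relu(in_ch, out_ch, 3, dilation=r) for r in rates])
        self.image_pool = nn.Sequential(                            # global-average-pooling branch
            nn.AdaptiveAvgPool2d(1), conv_bn_relu(in_ch, out_ch, 1))
        self.project = conv_bn_relu(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.conv1x1(x)] + [branch(x) for branch in self.atrous]
        # upsample the pooled (1x1) feature back to the feature-map resolution
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

# usage (eval() so the BN in the pooling branch works with a single sample)
aspp = ASPPv3().eval()
print(aspp(torch.randn(1, 2048, 32, 32)).shape)  # torch.Size([1, 256, 32, 32])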

5.2.2.4 DeepLabV3+(CVPR2018)

Original text : Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
paper interpretation
code : pytorch
Backbone: Xception
Contributions:
Xception
Encoder-decoder structure

Motivation : Improve segmentation accuracy

Problem:
In order to obtain contextual information at multiple scales, DeepLab V3 uses several parallel atrous convolutions with different rates (Atrous Spatial Pyramid Pooling, ASPP). Although the final feature map contains rich multi-scale semantic information, the strided pooling and convolution layers in the backbone network cause it to lose the detailed information related to object boundaries.
Therefore, as shown in Figure 1, DeepLab V3+ adds a decoder module on top of the ASPP structure.
insert image description here
Innovations:

  • A novel encoder-decoder model: DeepLab v3 is used as a powerful encoder, and a simple yet effective decoder module is attached to refine the segmentation results, especially object boundaries.
  • The backbone is changed to Xception to speed up computation and reduce computational cost.
  • Depthwise separable convolutions are introduced in both the ASPP and the decoder to build a faster, more efficient encoder-decoder model.
brief description

The biggest improvement of v3+ is to treat the DCNN part of DeepLab as the Encoder, and the part that recovers the full-resolution prediction from the DCNN feature map as the Decoder, forming an Encoder+Decoder system. Plain bilinear upsampling is a very simple Decoder; strengthening the Decoder lets the model do much better on object edges in semantic segmentation.
Specifically, DeepLabV3+ upsamples the stride-16 output of the DeepLabv3 model by 4×, reduces the channels of the stride-4 (1/4-resolution) low-level feature from the DCNN with a 1×1 convolution, concatenates the two, applies 3×3 convolutions, and then bilinearly upsamples by another 4×, giving finer results than DeepLabv3.

network structure

Spatial pyramid pooling (SPP): capture rich context information by pooling at different resolutions.
Encoder-decoder architecture: gradually recover clear object boundaries.
1) an encoder module that gradually reduces the feature-map resolution while capturing higher-level semantic information;
2) a decoder module that progressively recovers spatial information.
Based on this, as shown in the figure, DeepLab V3 is used as the encoder module, applying atrous convolutions of different rates to encode multi-scale context information; at the same time, a decoder module refines the prediction along object boundaries to obtain sharper segmentation results.
insert image description here
Encoder
1. Backbone (backbone network): upgraded from ResNet-101 to Xception

Corresponding to the DCNN (deep convolutional neural network) part in the network structure diagram above

Backbone is mainly for extracting features

Improvements:
(1) A deeper Xception structure, except that the entry flow is not modified, for fast computation and efficient use of memory

(2) All max-pooling layers are replaced with strided depthwise separable convolutions, the Sep Conv in the figure below

(3) Each 3×3 depthwise convolution is followed by BN and ReLU
insert image description here
Xception
Core: depthwise separable convolution.
The idea of depthwise separable convolution comes from the Inception structure; it can be seen as an extreme case of Inception.
Inception starts from a hypothesis:
cross-channel correlation and spatial correlation in a convolutional layer can be decoupled, and mapping them separately gives better results.
In the Inception structure, a 1×1 convolution is first applied to the input, the channels are then split into groups, different 3×3 convolutions extract features for each group, and finally the group outputs are concatenated.
insert image description here

Depthwise separable convolution: treat each channel as its own group. First apply a 3×3 convolution to each input channel separately, concatenate the per-channel results, and then adjust to the target number of channels with a 1×1 convolution.
insert image description here
Advantages:
It greatly reduces the number of parameters. By how much?
A simple example: with 64 input and 64 output channels and a 3×3 kernel, a standard convolution has 3×3×64×64 = 36864 parameters,
while a depthwise separable convolution has 3×3×64 + 1×1×64×64 = 4672.
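A minimal PyTorch sketch of this (my own illustration): a depthwise 3×3 convolution (groups = in_channels) followed by a 1×1 pointwise convolution, with the parameter counts from the example above verified in code (bias terms omitted to match the hand count).

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        # depthwise: one 3x3 filter per input channel (groups = in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # pointwise: 1x1 convolution mixes the channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(nn.Conv2d(64, 64, 3, bias=False)))   # 36864 = 3*3*64*64
print(count(DepthwiseSeparableConv(64, 64)))     # 4672  = 3*3*64 + 64*64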

The Xception architecture
has three flows, Entry/Middle/Exit, each built from its own repeating modules. The core is the Middle flow, which keeps analyzing and refining features.
The Entry flow mainly keeps downsampling to reduce the spatial dimensions;
the Middle flow keeps learning relationships and optimizing the features;
the Exit flow summarizes and organizes the features, which are then fed to the fully connected layers for the final prediction.

With the backbone covered, let's look at the overall structure of V3+. In the first three versions the backbone (ASPP) output is simply bilinearly upsampled to the original resolution, a very crude approach, shown as (a) in the figure below. After three versions of this the authors also felt it was too rough,
so they adopted the Encoder-Decoder structure, (b) in the figure below, and
added a skip connection from a shallow layer to the output, (c) in the figure below.
insert image description here
Now the specific skip connection.

  • First, take the output of the second convolution in block2 (fixed in the code), use a 1×1 convolution to adjust the number of channels to 48 (reducing the channels lowers its weight in the final result), and resize it to the target output stride.
  • Then resize the ASPP output to the same output stride. Concatenate the two parts and apply two 3×3 convolutions, followed by a 1×1 convolution to get the classification result. Finally, resize the classification result to the original resolution, again with the familiar bilinear upsampling.

2. High-level features from ASPP (Atrous Spatial Pyramid Pooling). ASPP applies 5 different operations and produces 5 different outputs: one 1×1 convolution, three dilated convolutions with different rates, and one ImagePooling branch (global average pooling followed by upsampling back to the original size).
The convolutions extract features locally while ImagePooling extracts them globally, which gives multi-scale features. Feature fusion is done by concatenation rather than direct addition.

3. The final outputs of the Encoder.
Looking at the "Upsample by 4" and "Concat" in the Decoder part of the network structure diagram, the backbone provides two outputs:
one is the low-level feature, at output stride 4;
the other is the high-level feature fed into ASPP, at output stride 16.

Decoder
Low-level features (from the backbone network) are passed through a 1×1 convolution to adjust the dimension (it reduces the number of channels; output stride = 4; the paper shows that adjusting the low-level features to 48 channels works best).

High-level features (the encoder output) are upsampled 4× (bilinear interpolation), so the output stride goes from 16 to 4.

The two stride-4 features are then concatenated and followed by a few 3×3 convolutions to refine the features (the paper shows that 2 3×3 convolutions with 256 output channels work best), and finally upsampled 4× (bilinear interpolation) to obtain the dense prediction output.
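As a minimal sketch of this decoder (my own illustration; the 48/256 channel choices are taken from the description above, and the backbone/ASPP are assumed to exist elsewhere):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepLabV3PlusDecoder(nn.Module):
    def __init__(self, low_level_ch=256, aspp_ch=256, num_classes=21):
        super().__init__()
        # reduce the low-level feature to 48 channels so it does not dominate
        self.reduce = nn.Sequential(
            nn.Conv2d(low_level_ch, 48, 1, bias=False),
            nn.BatchNorm2d(48), nn.ReLU(inplace=True))
        # two 3x3 convolutions with 256 channels refine the concatenated features
        self.refine = nn.Sequential(
            nn.Conv2d(aspp_ch + 48, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(256, num_classes, 1)

    def forward(self, low_level, aspp_out, out_size):
        low = self.reduce(low_level)                                 # stride 4, 48 channels
        high = F.interpolate(aspp_out, size=low.shape[2:],
                             mode='bilinear', align_corners=False)   # stride 16 -> 4
        x = self.classifier(self.refine(torch.cat([low, high], dim=1)))
        # final 4x bilinear upsampling back to the input resolution
        return F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)

# usage with dummy tensors: 512x512 input, low-level at stride 4, ASPP output at stride 16
dec = DeepLabV3PlusDecoder()
pred = dec(torch.randn(2, 256, 128, 128), torch.randn(2, 256, 32, 32), (512, 512))
print(pred.shape)  # torch.Size([2, 21, 512, 512])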

5.2.3 PSPNet pyramid-type scene analysis network (2016/12/4 CVPR 2017)

PSPNet: Pyramid Scene Parsing Network
PSPNet Model Study Notes
Semantic Segmentation - PSPNET
Motivation:
Improvement Based on the Insufficiency of the FCN Model

  • In the first row, FCN mistakenly segments the boat as a car; the probability of a car appearing on water is obviously tiny, so this is a mis-segmentation caused by mismatched context.
  • In the second row, FCN mistakenly segments a skyscraper as a building; skyscraper and building are very close categories, so this is confusion between similar classes.
  • In the third row, FCN mistakenly segments the pillow as part of the bed; the pillow is small and its texture is close to the bed's, so this is mis-segmentation of an inconspicuous object.

The author argues that these mis-segmentations can be reduced by introducing more context and multi-scale information: when the segmentation layer has more global information and more reasonable scale information, the probability of the errors above drops considerably.

Innovation: a pyramid pooling module (PPM)
is proposed to aggregate context information. In simple terms, the encoder produces a feature map X; average pooling with different kernel sizes is applied to X, the resulting maps are upsampled back to the size of X and concatenated with it, and the prediction map is then obtained through a convolution. The difference from classic FCN is that this PSP module is inserted between the encoder and the decoder.

Network Architecture

In segmentation tasks, the size of the receptive field roughly indicates how much contextual information we can use.
Based on these problems, the paper designs the model architecture shown in the figure below, including a pyramid pooling module and a more effective auxiliary training loss for the ResNet backbone:
insert image description here

The model extracts feature maps with a dilated-convolution ResNet as the backbone network; the feature map is 1/8 the size of the input image.
Context is then gathered by the pyramid pooling module: four pyramid levels are pooled, upsampled, and fused back to the size of the original feature map, and the prediction is produced by a final convolutional layer.

(Different from spatial pyramid pooling for image classification, which flattens and stitches feature maps of different levels, the pyramid pooling module here fuses features of four scales.

Suppose the input feature map is H×W×C. The red part is the coarsest, global-scale pooling; the feature map is then divided proportionally into sub-regions to express information about different regions (in the figure above, the orange part is 2×2, followed by 3×3 and 6×6).

After pooling, 1×1 convolution is performed, and the number of channels is adjusted to 1/4 of the number of input channels. Then, the features of different levels are bilinearly up-sampled to the size of the input feature map. Then the features of all levels are concatenated with the original feature map to form a global prior representation for subsequent segmentation prediction.

Analyzing this structure: scale information comes from the operations at different levels; context information is gathered because the different scales cover larger receptive-field areas; the 1×1 convolutions mix channel information. The whole representation therefore combines local and global features.
)
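A minimal sketch of such a pyramid pooling module (my own illustration following the description above: pool sizes 1/2/3/6, channels reduced to C/4 per level, bilinear upsampling, concatenation with the input; BN is omitted to keep the sketch short):

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    def __init__(self, in_ch=2048, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        reduced = in_ch // len(pool_sizes)   # each level outputs C/4 channels
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(size),                 # pool to size x size
                nn.Conv2d(in_ch, reduced, 1, bias=False),   # 1x1 conv reduces channels
                nn.ReLU(inplace=True))
            for size in pool_sizes])

    def forward(self, x):
        h, w = x.shape[2:]
        # upsample every pooled level back to the input size and concat with the input
        pyramids = [F.interpolate(stage(x), size=(h, w), mode='bilinear',
                                  align_corners=False) for stage in self.stages]
        return torch.cat([x] + pyramids, dim=1)   # C + 4*(C/4) = 2C channels

feat = torch.randn(1, 2048, 60, 60)        # 1/8-resolution feature map from the backbone
print(PyramidPoolingModule()(feat).shape)  # torch.Size([1, 4096, 60, 60])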

The second point, the more effective training loss, is shown in the figure below; it is an improvement on top of ResNet-101. In addition to the main softmax classification loss at the end, an auxiliary loss is attached at the fourth stage. The two losses are backpropagated together with different weights to jointly optimize the parameters. Later experiments show this helps the network converge faster.
insert image description here
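A minimal sketch of this weighted joint loss (my own illustration; the 0.4 auxiliary weight is the value commonly used in PSPNet implementations and also appears on the auxiliary head in the mmsegmentation config later in this post):

import torch
import torch.nn.functional as F

def pspnet_loss(main_logits, aux_logits, target, aux_weight=0.4):
    """Main loss from the final prediction plus a weighted auxiliary loss
    from the stage-4 branch; both are ordinary pixel-wise cross-entropy."""
    main_loss = F.cross_entropy(main_logits, target, ignore_index=255)
    aux_loss = F.cross_entropy(aux_logits, target, ignore_index=255)
    return main_loss + aux_weight * aux_loss

# usage with dummy predictions (N, num_classes, H, W) and labels (N, H, W)
main = torch.randn(2, 19, 64, 64)
aux = torch.randn(2, 19, 64, 64)
labels = torch.randint(0, 19, (2, 64, 64))
print(pspnet_loss(main, aux, labels))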

5.2.4 UPerNet (ECCV2018)

Original text: Unified Perceptual Parsing for Scene Understanding
notes reference:
1. UPerNet study notes

Problem Solved
Considering that visual recognition may involve multiple tasks simultaneously, a new task (Unified Perceptual Parsing, UPP) is proposed, together with a new learning method to solve it.

The UPP task mainly has the following problems:

1. There is no single image dataset that contains annotations for all levels of visual information; most datasets are designed for one type of task, such as texture detection, scene parsing, or surface (material) recognition.
2. Different recognition tasks come with different kinds of annotations, e.g. pixel-level versus image-level labels.

how to solve

1. We propose a new parsing task, Unified Perceptual Parsing, which requires the system to parse multiple visual concepts at once

2. We propose a novel network named UPerNet, which has a hierarchical structure and can learn heterogeneous data from multiple image datasets.

3. The model shows the ability to jointly infer and discover rich visual knowledge underlying images.

Define the task:
Unified Perceptual Parsing, which refers to identifying as many different visual concepts as possible from a given image.

structural analysis

insert image description here

  • The network is based on the FPN architecture (Feature Pyramid Network); note that this is different from FCN. FPN was briefly mentioned in BiSeNet before: it uses skip connections to fuse high-, mid-, and low-level semantic features. FPN still has a problem, though: although its theoretical receptive field is large enough, the effective receptive field is often much smaller.
  • The pyramid pooling module from PSPNet is applied to the last layer of the backbone before the features are sent into the FPN branch, to overcome this receptive-field limitation.
  • The backbone of the network is ResNet. The feature maps output by each ResNet stage are denoted {C2, C3, C4, C5}, and the feature maps output by the FPN are denoted {P2, P3, P4, P5}, where P5 is the feature map output directly after the PPM. The downsampling rates are {4, 8, 16, 32}. Note: the figure draws the feature map of each stage and marks its downsampling ratio.

The paper uses features at multiple semantic levels. Since image-level information is better suited to scene classification,
the Scene head is attached directly to the feature map output by the PPM module.
The Object head and Part head are attached to the feature map fused from all FPN levels.
The Material head is attached to the highest-resolution feature map in the FPN.
The Texture head is attached to the Res-2 block of ResNet and is fine-tuned only after the whole network has been trained on the other tasks.

There are three reasons behind this design:
texture is the lowest-level perceptual attribute, so it is based purely on apparent features and needs no high-level information;
the key features for predicting texture correctly are learned implicitly while training the other tasks;
the receptive field of this branch needs to be small enough, so that when a normally sized image is fed into the network it can predict different labels for different regions.

How each task's classification is implemented:

  • Scene classification: scene labels (the highest-level semantic attribute) are annotated at the image level; they are predicted by applying global average pooling to the P5 feature maps followed by a linear classifier.
    Note: P5 has a relatively large downsampling rate, so after global average pooling the features concentrate more on high-level semantic information.
  • Object detection: experiments show that fusing all FPN feature maps works better than using only the highest-resolution feature map (P2).
  • Object separation:

About semantic segmentation

The network structure is shown below; the part I framed in red does semantic segmentation, while the top and bottom branches do scene classification and texture parsing. I went through the code of the semantic segmentation part.

Typical semantic segmentation, like PSPNet, uses a ResNet-50 with 8× downsampling, applies PPM (pyramid pooling) on conv5, and fuses the result; this way the rich features from conv2-conv4 are not exploited.

UPerNet can be seen as an improvement on PSPNet: the PPM-fused feature is combined four times with conv2-conv5 in an FPN-like manner, the repeatedly fused maps are merged into one fused feature map, and a convolution produces the segmentation result. The feature fusion is very thorough, and PSPNet's auxiliary loss is removed.
insert image description here
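A minimal sketch of this UPerNet-style fusion head (my own illustration, not the official code): FPN-like top-down fusion over C2-C5, then upsampling all levels to the C2 resolution and merging them with one convolution. In UPerNet, P5 comes from a PPM applied to C5; here a plain 1×1 lateral convolution stands in for it to keep the sketch short, and the channel sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class UPerHeadSketch(nn.Module):
    """FPN-style fusion over C2-C5; the segmentation result comes from the fused map."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), channels=256, num_classes=150):
        super().__init__()
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, channels, 1) for c in in_channels])        # C2-C5 -> common width
        self.smooth = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in in_channels])
        self.fuse = nn.Conv2d(channels * len(in_channels), channels, 3, padding=1)
        self.classifier = nn.Conv2d(channels, num_classes, 1)

    def forward(self, feats):                    # feats = [C2, C3, C4, C5]
        laterals = [l(f) for l, f in zip(self.laterals, feats)]
        # top-down pathway: add the upsampled higher level into each lower level
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[2:],
                mode='bilinear', align_corners=False)
        outs = [s(l) for s, l in zip(self.smooth, laterals)]          # P2-P5
        # upsample every level to the P2 resolution and fuse them with one conv
        size = outs[0].shape[2:]
        outs = [outs[0]] + [F.interpolate(o, size=size, mode='bilinear',
                                          align_corners=False) for o in outs[1:]]
        return self.classifier(self.fuse(torch.cat(outs, dim=1)))

# usage with dummy C2-C5 feature maps (strides 4, 8, 16, 32 for a 512x512 input)
feats = [torch.randn(1, c, s, s) for c, s in zip((256, 512, 1024, 2048), (128, 64, 32, 16))]
print(UPerHeadSketch()(feats).shape)   # torch.Size([1, 150, 128, 128])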

Feature map pyramid network FPN (Feature Pyramid Networks)

Detailed explanation of FPN network

application:

1 Dry goods | UperNet-SwinTransformer model upload practice based on PIE-Engine AI - Extraction of photovoltaic targets
2. Combination with swin
3. mmsegmentation tutorial 2: How to modify the loss function, specify the training strategy, modify the evaluation index, specify iterators for val Indicator output
This part mainly concerns configs/_base_/models/upernet_swin.py, which combines Swin with UPerNet.
The final segmentation head is UPerHead: building on PSPNet, the PPM-fused feature is fused four times with conv2-conv5 in an FPN-like way, and the repeatedly fused maps are merged into a single fused feature map.

Specifically, the Swin backbone has 4 stages. After each stage, the output feature map is appended to a list.

The feature maps output by the four stages are, for example:

1x128x128x128

1x256x64x64

1x512x32x32

1x1024x16x16

The four feature maps are fed into the corresponding UPerNet branches for feature fusion, as shown in the figure below. You can see that it is very similar to the FPN used in detection.

# model settings
norm_cfg = dict(type='BN', requires_grad=True)
backbone_norm_cfg = dict(type='LN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    pretrained=None,
    backbone=dict(
        type='SwinTransformer',
        pretrain_img_size=224,
        embed_dims=96,
        patch_size=4,
        window_size=7,
        mlp_ratio=4,
        depths=[2, 2, 6, 2],
        num_heads=[3, 6, 12, 24],
        strides=(4, 2, 2, 2),
        out_indices=(0, 1, 2, 3),
        qkv_bias=True,
        qk_scale=None,
        patch_norm=True,
        drop_rate=0.,
        attn_drop_rate=0.,
        drop_path_rate=0.3,
        use_abs_pos_embed=False,
        act_cfg=dict(type='GELU'),
        norm_cfg=backbone_norm_cfg),
    decode_head=dict(
        type='UPerHead',
        in_channels=[96, 192, 384, 768],
        in_index=[0, 1, 2, 3],
        pool_scales=(1, 2, 3, 6),
        channels=512,
        dropout_ratio=0.1,
        num_classes=19,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=dict(
        type='FCNHead',
        in_channels=384,
        in_index=2,
        channels=256,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=19,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
    # model training and testing settings
    train_cfg=dict(),
    test_cfg=dict(mode='whole'))

Overriding is actually easy to understand. upernet_swin.py is just the initial definition of the model; when you actually use the model you usually need to modify a few things. For example, your number of classes num_classes may be different, or you may want to train with a different loss, in which case you override the corresponding parts of upernet_swin.py in your own config. Pay attention to the format of the override: the hierarchy of every component you copy must match that of upernet_swin.py. For example, backbone sits inside the model dict, and decode_head is at the same level as backbone.
Usually only three places need to be overridden:

pretrained: whether to load weights pre-trained on another dataset
num_classes under decode_head
the loss function in type under loss_decode
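As a hedged sketch of what such an override looks like (my own example config following mmsegmentation's usual _base_ inheritance mechanism; the base file path, dataset-specific values, and loss choice are illustrative):

# my_upernet_swin.py -- hypothetical user config that inherits the base model
# definition and only overrides the three places mentioned above
_base_ = ['../_base_/models/upernet_swin.py']

model = dict(
    # 1) pretrained weights: point to a checkpoint, or keep None to train from scratch
    pretrained=None,
    decode_head=dict(
        # 2) number of classes of your own dataset (e.g. 2 for a binary task)
        num_classes=2,
        # 3) swap the training loss, keeping the same hierarchy as the base config
        loss_decode=dict(type='DiceLoss', loss_weight=1.0)),
    auxiliary_head=dict(num_classes=2))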

6. Summary diagrams

insert image description here
insert image description here


Origin blog.csdn.net/zhe470719/article/details/124592765