MMSegmentation
Open-source code repository: https://github.com/open-mmlab/mmsegmentation
Rich algorithm library: 600+ pre-trained models, 40+ algorithm reproductions
Modular design: easy to configure and easy to extend
Unified hyperparameters: extensive ablation experiments support fair comparisons
Ease of use: training tools, debugging tools, inference APIs
Semantic segmentation
Basic idea
Segmentation by color
Colors within an object are similar, while color changes sharply at object boundaries
Classical image-processing methods segment the image by color
Pixel-by-pixel classification
Classify each pixel by running an image classifier on the window centered on it
Advantage: makes full use of existing image classification models
Problem: inefficient, since overlapping windows recompute the same convolutions
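The pixel-by-pixel idea can be sketched as a sliding window over the image. This is a minimal NumPy sketch; `classify_patch` is a hypothetical stand-in for a real image classifier, here just a threshold on mean intensity:

```python
import numpy as np

def classify_patch(patch):
    # Hypothetical stand-in for an image classification model:
    # threshold the mean intensity into two classes.
    return int(patch.mean() > 0.5)

def segment_pixel_by_pixel(image, patch_size=7):
    """Classify every pixel from the patch centered on it (sliding window)."""
    pad = patch_size // 2
    padded = np.pad(image, pad, mode="reflect")
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.int64)
    for i in range(h):
        for j in range(w):
            # Overlapping windows: neighbouring pixels reprocess almost
            # the same patch, which is why this approach is slow.
            out[i, j] = classify_patch(padded[i:i + patch_size, j:j + patch_size])
    return out

mask = segment_pixel_by_pixel(np.random.rand(16, 16))
```

The double loop makes the inefficiency explicit: every window repeats nearly all of its neighbour's computation.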
Upsampling the prediction map
Problem:
Image classification models use downsampling layers (strided convolution or pooling) to obtain high-level features, so a fully convolutional network's output is smaller than the original image, while segmentation needs an output of the same size
Solution:
Upsample the predicted segmentation map to restore the original image's resolution; upsampling schemes:
- Bilinear interpolation
- Transposed convolution: a learnable upsampling layer
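The two upsampling schemes, sketched in PyTorch (shapes here are illustrative, assuming 21 classes and a 4x upsampling factor):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Coarse per-class score map: batch 1, 21 classes, 8x8 spatial size
scores = torch.randn(1, 21, 8, 8)

# (a) Bilinear interpolation: fixed, no learnable parameters
up_bilinear = F.interpolate(scores, scale_factor=4, mode="bilinear",
                            align_corners=False)

# (b) Transposed convolution: a learnable 4x upsampling layer
# (kernel 8, stride 4, padding 2 gives exactly 4x the spatial size)
up_layer = nn.ConvTranspose2d(21, 21, kernel_size=8, stride=4, padding=2)
up_learned = up_layer(scores)
```

Both produce a 32x32 map; the transposed convolution additionally learns how to upsample from data.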
Upsampling based on multi-level features
Problem: predicting only from the top-level features and upsampling the result 32x gives a rather coarse prediction map
Analysis: after repeated downsampling, high-level features have lost most spatial detail
Solution idea: combine low-level and high-level feature maps
FCN's solution:
Generate category predictions from both low-level and high-level feature maps, upsample each to the original image size, then average them to obtain the final result
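The fusion step above can be sketched as follows; the per-level score maps are hypothetical placeholders for backbone outputs at strides 8, 16, and 32:

```python
import torch
import torch.nn.functional as F

img_size = (64, 64)
num_classes = 21

# Hypothetical per-level score maps for a 64x64 image
# at strides 8, 16, and 32 (so 8x8, 4x4, and 2x2 spatially)
level_scores = [
    torch.randn(1, num_classes, 8, 8),
    torch.randn(1, num_classes, 4, 4),
    torch.randn(1, num_classes, 2, 2),
]

# Upsample every level's prediction to the original resolution, then average
upsampled = [F.interpolate(s, size=img_size, mode="bilinear",
                           align_corners=False) for s in level_scores]
fused = torch.stack(upsampled).mean(dim=0)
```

The coarse high-level predictions contribute semantics, the low-level predictions contribute spatial detail, and averaging combines them.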
Context information
Original image → backbone network → feature map → prediction map
PSPNet
Original image → feature map → multi-scale pooling → feature concatenation → category prediction
Pooling resizes the feature map to several scales, yielding contextual features at different scales
After channel compression and spatial upsampling, the contextual features are concatenated back onto the original feature map, which then contains both local and contextual features
Prediction maps are generated from the fused features
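A minimal sketch of PSPNet's pyramid pooling module (the bin sizes 1/2/3/6 follow the paper; batch norm and other details are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Sketch: pool the feature map to several scales, compress channels
    with 1x1 convolutions, upsample back, and concatenate with the input."""
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        # Each branch: adaptive pooling to a bin size, then channel compression
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, in_ch // len(bins), 1))
            for b in bins)

    def forward(self, x):
        h, w = x.shape[2:]
        ctx = [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                             align_corners=False) for stage in self.stages]
        # Local features (x) plus multi-scale context, along the channel axis
        return torch.cat([x] + ctx, dim=1)

feat = torch.randn(1, 512, 16, 16)
fused = PyramidPooling(512)(feat)
```

With 512 input channels and four bins, the fused output has 512 + 4 × 128 = 1024 channels: local and contextual features side by side.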
DeepLab series
DeepLab is another line of work on semantic segmentation
Main contributions:
- Atrous (dilated) convolution to counteract the resolution loss from downsampling in the network
- Conditional random field (CRF) post-processing to refine the segmentation map
- Multi-scale atrous convolution (the ASPP module) to capture context information
DeepLab V1 was published in 2014; the V2, V3, and V3+ versions followed in 2016, 2017, and 2018
Atrous convolution solves the downsampling problem
The downsampling layers of an image classification model shrink the output
If the strides in the pooling and convolutional layers are removed:
- The amount of downsampling is reduced
- The feature maps become larger, so the convolution kernels must grow accordingly to keep the same receptive field, which adds many parameters
- Dilated (atrous) convolution instead enlarges the receptive field without adding parameters
Standard convolution:
feature map → downsampling → convolution with the kernel → result
Atrous convolution:
the feature map is kept at full resolution, and the kernel is dilated (zeros inserted between its weights) before convolving; the dilated kernel adds no extra parameters, and at corresponding positions it computes the same results as downsampling followed by standard convolution
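The "dilated kernel, no extra parameters" view can be verified numerically: a 3x3 convolution with dilation 2 equals a standard convolution with a 5x5 kernel whose off-grid entries are zero, so both use the same 9 parameters:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 12, 12)
w = torch.randn(1, 1, 3, 3)          # 9 parameters either way

# Atrous convolution: dilation=2 samples the input on a spread-out grid
y_atrous = F.conv2d(x, w, dilation=2)

# Equivalent view: insert zeros between the kernel weights (3x3 -> 5x5),
# then run a standard convolution; no extra parameters are introduced
w_big = torch.zeros(1, 1, 5, 5)
w_big[..., ::2, ::2] = w
y_standard = F.conv2d(x, w_big)

assert torch.allclose(y_atrous, y_standard, atol=1e-6)
```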
Atrous convolution vs. downsampling
With the downsample-then-upsample scheme, the feature map has responses at only 1/4 of the original image's positions, and the remaining positions must be interpolated
With dilated convolutions, feature maps of the same resolution are obtained directly, without extra interpolation
DeepLab model
DeepLab modifies an image classification network:
- Remove the later downsampling layers of the classification model
- Convert the subsequent convolutional layers to dilated convolutions, gradually increasing the dilation rate to preserve the original network's receptive field
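The conversion above amounts to trading stride for dilation, sketched here on a single layer (channel counts and sizes are illustrative):

```python
import torch
import torch.nn as nn

# A classification backbone's later stage might downsample with stride 2:
clf_conv = nn.Conv2d(256, 256, 3, stride=2, padding=1)

# DeepLab-style conversion (a sketch): drop the stride and dilate the
# convolution instead, so the receptive field is preserved while the
# feature map keeps its higher resolution
seg_conv = nn.Conv2d(256, 256, 3, stride=1, padding=2, dilation=2)

x = torch.randn(1, 256, 32, 32)
y_clf = clf_conv(x)   # spatial size halved
y_seg = seg_conv(x)   # spatial size unchanged
```

Stacking several such layers with growing dilation rates (e.g. 2, then 4) mimics the receptive-field growth that the removed downsampling layers used to provide.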
Conditional random field (CRF)
The segmentation map output directly by the model is coarse, especially at object boundaries, and does not give good segmentation results
DeepLab V1 & V2 use a conditional random field (CRF) as post-processing, combining the original image's colors with the network's predicted categories to obtain a refined segmentation
A CRF is a probabilistic model. DeepLab models the segmentation result with a CRF whose energy function scores segmentation quality; minimizing this energy yields a better segmentation
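As a sketch of the energy being minimized (the fully connected CRF that DeepLab V1/V2 adopt, with x the label assignment, p pixel positions, and I pixel colors):

```latex
E(x) = \sum_i \theta_i(x_i) + \sum_{i<j} \theta_{ij}(x_i, x_j)
% Unary term: taken from the network's per-pixel class probabilities
\theta_i(x_i) = -\log P(x_i)
% Pairwise term: \mu(x_i,x_j) = [x_i \ne x_j] penalizes assigning
% different labels to pixels close in position and similar in color
\theta_{ij}(x_i, x_j) = \mu(x_i, x_j)\Big[
   w_1 \exp\!\Big(-\tfrac{\lVert p_i - p_j\rVert^2}{2\sigma_\alpha^2}
                  -\tfrac{\lVert I_i - I_j\rVert^2}{2\sigma_\beta^2}\Big)
 + w_2 \exp\!\Big(-\tfrac{\lVert p_i - p_j\rVert^2}{2\sigma_\gamma^2}\Big)\Big]
```

Minimizing E(x) trades the network's per-pixel confidence (unary term) against color- and position-based smoothness (pairwise term), which is what sharpens the boundaries.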
Spatial pyramid pooling
PSPNet uses pooling at different scales to obtain contextual information at different scales
DeepLab V2 & V3 achieve a similar effect with atrous convolutions at different dilation rates (the ASPP module)
A larger dilation rate gives a larger receptive field and therefore captures more contextual features
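A simplified ASPP sketch (the rates 6/12/18 follow the DeepLab papers; batch norm and the image-level pooling branch of the full module are omitted):

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Sketch of Atrous Spatial Pyramid Pooling: parallel convolutions with
    increasing dilation rates see increasingly large receptive fields over
    the same feature map, capturing context at multiple scales."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +                      # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)  # atrous branches
             for r in rates])
        # Fuse the concatenated branches back to out_ch channels
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

feat = torch.randn(1, 512, 16, 16)
out = ASPP(512, 256)(feat)
```

Setting `padding` equal to the dilation rate keeps every branch at the input's spatial size, so the branches can be concatenated directly.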
DeepLab V3+
- DeepLab V2/V3 use ASPP to capture context features
- Encoder-decoder structures (such as U-Net) fuse in low-level feature maps during upsampling to obtain finer segmentation maps
- DeepLab V3+ combines the two ideas, adding a simple decoder on top of the original model structure
The encoder generates multi-scale high-level semantic features through ASPP
The decoder fuses low-level features to produce refined segmentation results
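The encoder-decoder fusion can be sketched as below; shapes, channel counts, and the class count are illustrative assumptions, and the real V3+ decoder uses more convolutions and batch norm:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical encoder outputs: ASPP features at stride 16,
# low-level backbone features at stride 4 (for a 256x256 image)
aspp_out = torch.randn(1, 256, 16, 16)    # high-level semantics
low_level = torch.randn(1, 256, 64, 64)   # fine spatial detail

# Simple decoder sketch: compress low-level channels, upsample the encoder
# output 4x, concatenate, refine, then upsample 4x to the image size
reduce = nn.Conv2d(256, 48, 1)
refine = nn.Conv2d(256 + 48, 21, 3, padding=1)   # 21 = assumed class count

x = F.interpolate(aspp_out, size=low_level.shape[2:], mode="bilinear",
                  align_corners=False)
x = refine(torch.cat([x, reduce(low_level)], dim=1))
seg = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
```

Reducing the low-level features to few channels (48 here) keeps them from drowning out the semantic features during concatenation, which is the decoder's key design choice.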