[cv notes] image segmentation

The image segmentation task aims at pixel-level classification: the output is a two-dimensional matrix of the same size as the input image, in which each value is the category of the corresponding pixel.

1 Overview

1.1 Task Type

 

  • Semantic segmentation: predicts the category of each pixel (both stuff and object categories); adjacent instances of the same category are not distinguished

  • Instance segmentation: predicts the category and mask of each object (object categories only)

  • Panoptic segmentation: predicts the category of every pixel plus an instance id for object categories, so adjacent instances of the same category can be distinguished

1.2 Evaluation metrics

  • mIoU: the mean, over all categories, of the intersection-over-union between predicted and ground-truth segmentation

  • mAcc: the mean per-category pixel classification accuracy (see the sketch below)
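A minimal sketch of both metrics computed from a confusion matrix; the label maps, the class count, and the helper name segmentation_metrics are assumptions made for illustration.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    # Confusion matrix: rows = ground-truth class, cols = predicted class.
    cm = np.bincount(
        gt.flatten() * num_classes + pred.flatten(),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes).astype(float)
    tp = np.diag(cm)
    # Per-class IoU = TP / (TP + FP + FN); per-class accuracy = TP / (TP + FN).
    iou = tp / np.maximum(cm.sum(0) + cm.sum(1) - tp, 1)
    acc = tp / np.maximum(cm.sum(1), 1)
    return iou.mean(), acc.mean()  # mIoU, mAcc

pred = np.random.randint(0, 3, (64, 64))
gt = np.random.randint(0, 3, (64, 64))
miou, macc = segmentation_metrics(pred, gt, num_classes=3)
```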

 

2 Semantic Segmentation Model

2.1 FCN

FCN stands for Fully Convolutional Networks.

 

The segmentation output is a two-dimensional matrix of the same size as the input image, whose values are the classes of the corresponding pixels. FCN replaces the fully connected (fc) layers of a classification network with 1x1 convolutions, which adjust the output features to the required number of channels, as in the sketch below.
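A minimal sketch of the 1x1-conv-as-classifier idea, assuming a backbone that outputs 512-channel features and a 21-class (PASCAL VOC-style) setting; the tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

num_classes = 21
features = torch.randn(1, 512, 16, 16)       # backbone feature map (N, C, H/32, W/32)
classifier = nn.Conv2d(512, num_classes, 1)  # 1x1 conv in place of an fc layer
scores = classifier(features)                # (1, 21, 16, 16): class scores per location
```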

During segmentation, the image is first down-sampled to extract features, so the feature map keeps shrinking; to produce an output of the same size as the original image, up-sampling is required. There are generally three ways to upsample (see the sketch after this list): interpolation, transposed convolution, and un-pooling.

  • Interpolation: e.g. bilinear interpolation

 

  • Transposed convolution (also called deconvolution)

 

 

  • Un-pooling: reverses a max-pooling step by placing values back at the recorded pooling indices
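A minimal sketch of the three options in PyTorch; the tensor sizes are arbitrary examples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 8, 16, 16)

# 1) Interpolation: no learnable parameters.
up1 = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

# 2) Transposed convolution: learnable upsampling.
up2 = nn.ConvTranspose2d(8, 8, kernel_size=2, stride=2)(x)

# 3) Un-pooling: reuses the indices recorded by a max-pooling layer.
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)
y, idx = pool(x)
up3 = unpool(y, idx)  # back to (1, 8, 16, 16)
```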

 

2.1.1 Model structure

Upsampling strategy: upsample feature maps from multiple stages (multi-scale feature maps) and fuse them by element-wise addition, as in the sketch below.
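A sketch of this fusion under assumed shapes, in the style of FCN-8s: score maps at 1/32, 1/16 and 1/8 resolution are upsampled step by step and added element-wise.

```python
import torch
import torch.nn.functional as F

score32 = torch.randn(1, 21, 8, 8)    # class scores at 1/32 resolution
score16 = torch.randn(1, 21, 16, 16)  # class scores at 1/16 resolution
score8 = torch.randn(1, 21, 32, 32)   # class scores at 1/8 resolution

fuse16 = score16 + F.interpolate(score32, scale_factor=2, mode="bilinear", align_corners=False)
fuse8 = score8 + F.interpolate(fuse16, scale_factor=2, mode="bilinear", align_corners=False)
out = F.interpolate(fuse8, scale_factor=8, mode="bilinear", align_corners=False)  # full resolution
```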

 

2.1.2 Advantages and disadvantages

  • Advantages

    • Accepts input of any size

    • Fuses shallow information

  • Disadvantages

    • The segmentation result is not fine enough (shallow information is not fully exploited)

    • Contextual information is not considered effectively (small receptive field)

2.2 U-Net

2.2.1 Model structure

U-Net adopts a U-shaped encoder-decoder structure. During upsampling, feature fusion takes the form of concatenation; if the sizes do not match, a crop operation extracts a feature map of the corresponding size, as in the sketch below.
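A minimal sketch of the concatenate-with-crop fusion, assuming an encoder feature map slightly larger than the upsampled decoder feature map (as happens with U-Net's unpadded convolutions); shapes and the helper name are illustrative.

```python
import torch

def center_crop(t: torch.Tensor, h: int, w: int) -> torch.Tensor:
    # Cut a centered h x w window out of a larger feature map.
    dh, dw = (t.shape[2] - h) // 2, (t.shape[3] - w) // 2
    return t[:, :, dh:dh + h, dw:dw + w]

enc = torch.randn(1, 64, 68, 68)  # encoder (skip) feature, larger
dec = torch.randn(1, 64, 56, 56)  # upsampled decoder feature
fused = torch.cat([center_crop(enc, 56, 56), dec], dim=1)  # (1, 128, 56, 56)
```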

 

 

2.2.2 Advantages and disadvantages

  • Advantages: each upsampling step concatenates the feature map of the corresponding encoder layer, which makes full use of shallow information and helps improve edge accuracy

  • Disadvantages: large memory usage

2.3 PSP-Net

PSP-Net (Pyramid Scene Parsing Network) builds on FCN and, in order to better consider global information, introduces dilated convolution and a Spatial Pyramid Pooling module to improve model performance.

2.3.1 Model structure

  • overall architecture

 

  • backbone

Dilated ResNet: dilated convolution is introduced into the original ResNet to enlarge the receptive field

 

Atrous (dilated) convolution: a hyper-parameter called the "dilation rate" is introduced, which defines the spacing between the positions sampled by the convolution kernel. Its properties (see the sketch after this list):

1. Enlarges the receptive field

2. Does not reduce the resolution (with stride=1 and matching padding)

3. Introduces no additional parameters or computation
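A minimal sketch of these properties, with assumed channel and image sizes: padding equal to the dilation rate keeps a 3x3 convolution resolution-preserving while its receptive field grows.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)
# Same parameter count as an ordinary 3x3 conv, but samples with gaps of 2.
conv = nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=2, dilation=2)
assert conv(x).shape == x.shape  # resolution preserved
```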

 

  • Spatial Pyramid Pooling module: pools the input feature map at several scales, fuses the multi-scale feature information, and concatenates the result with the original feature map, thereby combining local and global features (see the sketch after this list). The module mainly consists of the following operations:

    • Adaptive Pool

    • 1x1 Conv

    • Upsample

    • Concat
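A sketch of such a module; the bin sizes (1, 2, 3, 6) follow the PSP-Net paper, while channel counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch: int, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(in_ch, out_ch, 1))
            for b in bins
        )

    def forward(self, x):
        h, w = x.shape[2:]
        # Pool to each grid size, project, upsample back, then concat with input.
        outs = [x] + [
            F.interpolate(s(x), size=(h, w), mode="bilinear", align_corners=False)
            for s in self.stages
        ]
        return torch.cat(outs, dim=1)

feat = torch.randn(1, 2048, 60, 60)
out = PyramidPooling(2048)(feat)  # (1, 4096, 60, 60): local + global features
```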

 

2.4 DeepLab series

The DeepLab series comprises semantic segmentation algorithms proposed by the Google team.

 

  • DeepLab V1

    • overall architecture

       

       

  • DeepLab V2

    • overall architecture

     

     

    • ASPP

     

     

The purpose of the ASPP module is similar to that of the SPP module in PSP-Net: fuse feature information at different scales and jointly consider local and global features. The difference is that ASPP uses dilated convolutions with different dilation rates followed by element-wise addition, whereas SPP uses pooling at different sizes followed by concatenation. A sketch follows.
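A sketch of a DeepLab V2-style ASPP head; the dilation rates (6, 12, 18, 24) follow the paper, while channel counts are assumptions.

```python
import torch
import torch.nn as nn

class ASPPv2(nn.Module):
    def __init__(self, in_ch: int, num_classes: int, rates=(6, 12, 18, 24)):
        super().__init__()
        # Parallel 3x3 convolutions with different dilation rates.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, num_classes, 3, padding=r, dilation=r) for r in rates
        )

    def forward(self, x):
        # Element-wise addition fuses the multi-scale branches.
        return sum(branch(x) for branch in self.branches)

feat = torch.randn(1, 2048, 33, 33)
scores = ASPPv2(2048, num_classes=21)(feat)  # (1, 21, 33, 33)
```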

  • DeepLab V3

    • overall architecture

     

    • upgraded ASPP module

The upgraded ASPP module adjusts the V2 version to better integrate multi-scale information: the branch outputs are concatenated rather than summed, batch normalization is added, and an image-level pooling branch supplies global context.

     

    • multi-grid

The Multi-Grid strategy of DeepLab V3 borrows the idea of HDC (hybrid dilated convolution): use several dilated convolutions with different dilation rates consecutively within one block. HDC was proposed to solve the gridding problem that dilated convolution can produce. When the dilation rate of an upper-layer dilated convolution becomes large, its sampling of the input becomes very sparse, so some local information is lost, local correlations are broken, and semantically irrelevant long-distance information may be captured instead.

Gridding arises when consecutive dilated convolutions use the same dilation rate. In Figure (a), three dilated convolutions with the same rate are applied in a row, so the pixels that influence the center point's classification form a sparse, checkerboard-like pattern rather than a continuous neighborhood. The principle of HDC is to use different dilation rates for consecutive dilated convolutions: with the rates used in Figure (b), the area that influences the center point's category is a continuous region, which makes it easier to produce continuous segmentation results.

       

DeepLab V3 keeps the original residual network structure from Block-1 to Block-4, then copies Block-4 three times to obtain Blocks 5-7; these use different dilation rates to enlarge the receptive field while avoiding the gridding problem, as in the sketch below.
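A minimal sketch of such a multi-grid block; the unit rates (1, 2, 4) follow the DeepLab V3 paper's Multi-Grid setting, while channel sizes and the block structure are simplified assumptions.

```python
import torch
import torch.nn as nn

def multigrid_block(ch: int, base_rate: int, unit_rates=(1, 2, 4)) -> nn.Sequential:
    # Consecutive dilated convs inside one block use different rates.
    layers = []
    for u in unit_rates:
        r = base_rate * u
        layers += [nn.Conv2d(ch, ch, 3, padding=r, dilation=r), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

x = torch.randn(1, 256, 33, 33)
block5 = multigrid_block(256, base_rate=2)  # effective rates (2, 4, 8)
out = block5(x)  # spatial size unchanged
```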

       

  • DeepLab V3+

    The overall architecture of the DeepLabv3+ model is shown in the figure below. The main body of its Encoder is a backbone network with atrous convolution, followed by an Atrous Spatial Pyramid Pooling (ASPP) module that extracts multi-scale information. Compared with DeepLabv3, v3+ introduces a Decoder module, which further fuses shallow and deep information to improve the accuracy of segmentation boundaries.

    • overall architecture

       

    • backbone: Dilated Xception

    • decoder

      In DeepLabv3, the feature map produced by the ASPP module has an output_stride of 8 or 16 and, after a 1x1 classification layer, is directly bilinearly interpolated to the original image size. This is a very crude decoding step, especially at output_stride=16, and it is not conducive to fine segmentation results. The v3+ model therefore borrows from the encoder-decoder structure and introduces a new Decoder module. First, the encoder features are bilinearly upsampled by 4x and concatenated with low-level encoder features of the corresponding size, such as the Conv2 stage in ResNet. Since the encoder output has only 256 channels while the low-level features may be high-dimensional, a 1x1 convolution first reduces the low-level feature dimension (to 48 in the paper) so that the encoder's high-level features are not overwhelmed. After the two feature maps are concatenated, a 3x3 convolution fuses them further, and a final bilinear interpolation yields a segmentation prediction of the same size as the original image; a sketch follows.
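A sketch of this decoder step, assuming 256 ASPP output channels, 48 reduced low-level channels, and 21 classes; the module name and tensor sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStep(nn.Module):
    def __init__(self, low_ch: int, num_classes: int):
        super().__init__()
        self.reduce = nn.Conv2d(low_ch, 48, 1)           # shrink low-level features
        self.fuse = nn.Conv2d(256 + 48, 256, 3, padding=1)
        self.classify = nn.Conv2d(256, num_classes, 1)

    def forward(self, aspp_out, low_level):
        # 4x upsample encoder output, concat with reduced low-level features.
        x = F.interpolate(aspp_out, size=low_level.shape[2:], mode="bilinear", align_corners=False)
        x = torch.cat([x, self.reduce(low_level)], dim=1)
        x = self.classify(F.relu(self.fuse(x)))
        # Final 4x bilinear upsampling back to input resolution.
        return F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)

aspp_out = torch.randn(1, 256, 32, 32)  # encoder output at stride 16
low = torch.randn(1, 256, 128, 128)     # e.g. ResNet Conv2 features at stride 4
pred = DecoderStep(256, num_classes=21)(aspp_out, low)  # (1, 21, 512, 512)
```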

2.5 HRNet series

HRNet is a neural network proposed by Microsoft Research Asia in 2019. Unlike previous convolutional neural networks, it maintains high-resolution representations even in its deep layers, so both the predicted semantic information and the spatial information are more precise.

The segmentation architectures described so far consist of two parts, an Encoder and a Decoder. The Encoder performs semantic aggregation mainly through resolution compression (downsampling), obtaining rich semantic features well suited to classification; however, continuous downsampling loses a great deal of spatial information, which hurts position-sensitive tasks such as segmentation. To improve segmentation accuracy, the Decoder gradually increases the resolution and finally produces a high-resolution feature map; such features retain more spatial information and are friendlier to position-sensitive tasks. Still, because the feature-map resolution first decreases and then increases, spatial information is lost along the way. Based on this, HRNet is designed to maintain high-resolution feature maps throughout the network, thereby obtaining better and more accurate location information.
  • Recover high resolution (encoder-decoder, e.g. PSP-Net, DeepLab)

 

  • Maintain high resolution (HRNet)

 

Fusion of feature maps at different resolutions (a sketch follows):
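A minimal sketch of HRNet-style cross-resolution fusion, assuming two parallel branches: the low-resolution branch is projected with a 1x1 conv and bilinearly upsampled, the high-resolution branch is downsampled with a strided 3x3 conv, and matching resolutions are fused by addition. Channel counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

high = torch.randn(1, 32, 64, 64)  # high-resolution branch
low = torch.randn(1, 64, 32, 32)   # low-resolution branch

up = nn.Conv2d(64, 32, 1)                         # match channels before upsampling
down = nn.Conv2d(32, 64, 3, stride=2, padding=1)  # strided conv to downsample

fused_high = high + F.interpolate(up(low), scale_factor=2, mode="bilinear", align_corners=False)
fused_low = low + down(high)
```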

 

Head structure diversification:

 

2.5.1 MScaleOCR

MscaleOCRNet belongs to the HRNet family. Compared with the plain HRNet structure, it computes, on top of HRNet's segmentation result, relation weights between each pixel and the object-region contexts in the image, and superimposes the resulting contextual features on the original features to form OCRNet; multi-scale training on top of OCRNet then yields the final MscaleOCRNet. A simplified sketch of the OCR step follows.
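A heavily simplified sketch of that pixel-context step: soft object regions are estimated from coarse segmentation scores, region representations are aggregated, and each pixel's relation weights to the regions produce a contextual feature that is concatenated onto the original one. All shapes and class counts here are assumptions.

```python
import torch
import torch.nn.functional as F

feats = torch.randn(1, 512, 64, 64)   # HRNet output features
coarse = torch.randn(1, 19, 64, 64)   # coarse per-pixel class scores

n, c, h, w = feats.shape
pix = feats.flatten(2)                          # (1, 512, HW)
region_w = F.softmax(coarse.flatten(2), dim=2)  # (1, K, HW): soft object regions
regions = region_w @ pix.transpose(1, 2)        # (1, K, 512): region representations

# Relation weights between every pixel and every region, then the
# region-weighted context for each pixel.
rel = F.softmax(pix.transpose(1, 2) @ regions.transpose(1, 2), dim=2)  # (1, HW, K)
context = (rel @ regions).transpose(1, 2).reshape(n, c, h, w)
augmented = torch.cat([feats, context], dim=1)  # (1, 1024, 64, 64)
```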

2.6 Transformer series

2.6.1 SegFormer

3 Instance Segmentation / Panoptic Segmentation Models

 

4 Model performance overview

 

Origin blog.csdn.net/j_river6666/article/details/125507098