PaddlePaddle image segmentation 7-day camp learning summary


Course address: Course

1. FCN

FCN paper address: https://arxiv.org/abs/1411.4038
Code address: https://github.com/shelhamer/fcn.berkeleyvision.org

1.1 What is FCN?

A typical classification CNN ends with fully connected layers and a softmax that outputs class probabilities. That output is one-dimensional: it can only give a label for the whole image, not for each pixel. Semantic segmentation, however, is a pixel-level classification task, so the classification network has to be modified. FCN (Fully Convolutional Networks) replaces the fully connected layers with convolutions, so the network can output a two-dimensional prediction map. But the convolution and pooling stages shrink the feature map, so how do we recover an output the same size as the original image? This is done by upsampling, described below. Let's first look at the FCN network structure.
[Figure: FCN network structure]

1.2 FCN network structure

After the input image passes through the conv1–conv5 downsampling stages, the feature map shrinks to 1/32 of the original size. FCN-32s directly upsamples the pool5 feature by 32x and applies softmax at each position to obtain the segmentation map. Feature maps from multiple layers can also be fused by element-wise addition (as in FCN-16s and FCN-8s). The figure below is a schematic of the FCN structure. Note that, to keep the 1/32-resolution feature map from becoming too small for small input images, the original FCN authors added pad=100 to the first conv1_1 layer.
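As a shape-level sketch of this multi-scale fusion (using nearest-neighbor upsampling in place of the learned deconvolution, and 21 classes as in PASCAL VOC; both choices are illustrative assumptions, not the paper's exact layers):

```python
import numpy as np

def upsample_nn(x, factor):
    """Nearest-neighbor upsampling of an (H, W, C) feature map."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

# Toy score maps from a hypothetical backbone on a 224x224 input:
# pool5 is at 1/32 resolution, pool4 at 1/16 resolution.
pool5 = np.random.rand(7, 7, 21)    # 224/32 = 7
pool4 = np.random.rand(14, 14, 21)  # 224/16 = 14

# FCN-32s: upsample pool5 straight back to the input resolution.
fcn32 = upsample_nn(pool5, 32)

# FCN-16s: upsample pool5 by 2, add element-wise to pool4, then 16x up.
fcn16 = upsample_nn(upsample_nn(pool5, 2) + pool4, 16)

print(fcn32.shape, fcn16.shape)  # both (224, 224, 21)
```

Softmax over the 21 channels at each of the 224x224 positions then yields the per-pixel prediction.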
[Figure: FCN-32s/16s/8s structure]

1.3 Upsampling the feature map

After the backbone output, FCN needs to upsample the feature map back to the original input size. Common upsampling methods include resize (interpolation), transposed convolution, and un-pooling.

  • Resize
    The original paper uses bilinear interpolation for resizing. The interpolation principle is shown in the figure below.
    [Figure: bilinear interpolation]
  • Transposed convolution (deconvolution)
    This reverses an ordinary convolution: with a 2x2 input, a 3x3 kernel, and a suitable padding value, it produces a 4x4 output, as shown in the figure below. The kernel used is equivalent to the original kernel flipped vertically and then horizontally.
    [Figure: transposed convolution]

  • Un-pooling
    The reverse operation of pooling; it is not used much. The schematic diagram makes it easy to understand.
    [Figure: un-pooling]
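Of these, transposed convolution is the one FCN learns end-to-end. A minimal NumPy sketch of its scatter-add formulation (the helper name and the all-ones kernel are illustrative):

```python
import numpy as np

def transpose_conv2d(x, k, stride=1, pad=0):
    """Transposed convolution: scatter each input pixel, scaled by the
    kernel, into the output, then trim `pad` cells from each border."""
    ih, iw = x.shape
    kh, kw = k.shape
    oh = (ih - 1) * stride + kh - 2 * pad   # standard output-size formula
    ow = (iw - 1) * stride + kw - 2 * pad
    out = np.zeros((oh + 2 * pad, ow + 2 * pad))
    for i in range(ih):
        for j in range(iw):
            out[i*stride:i*stride+kh, j*stride:j*stride+kw] += x[i, j] * k
    return out[pad:pad+oh, pad:pad+ow]

x = np.arange(4).reshape(2, 2).astype(float)
k = np.ones((3, 3))
y = transpose_conv2d(x, k, stride=1, pad=0)
print(y.shape)  # (4, 4): a 2x2 input grows to 4x4, as in the figure
```

Setting stride and padding differently gives other magnification factors; FCN's 32x upsampling is a (learned) transposed convolution with a large stride.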

2. U-Net

Paper: https://arxiv.org/pdf/1505.04597.pdf
A weakness of FCN is that it does not make full use of contextual information, so its segmentation results are not detailed enough. U-Net adopts an encoder-decoder structure and fuses features from different levels to capture context, as shown in the figure below. The encoder downsamples with conv+conv+pooling blocks; the decoder then upsamples with transposed convolutions. At each decoder stage, the corresponding low-level encoder feature map is cropped and concatenated in, and the final segmentation map is obtained through softmax.
There are two common feature-fusion methods in semantic segmentation: FCN-style element-wise addition, corresponding to TensorFlow's add(); and U-Net-style concatenation along the channel dimension, corresponding to TensorFlow's concat(). According to the course instructor, the two do not differ much in practice; concat costs more computation but tends to work slightly better.
[Figure: U-Net architecture]
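The two fusion styles can be illustrated with NumPy stand-ins for add() and concat() (NHWC layout and the specific sizes are illustrative assumptions):

```python
import numpy as np

# Two feature maps of the same spatial size, 64 channels each (NHWC).
a = np.random.rand(1, 16, 16, 64)
b = np.random.rand(1, 16, 16, 64)

fused_add = a + b                            # FCN style: channels stay 64
fused_cat = np.concatenate([a, b], axis=-1)  # U-Net style: channels double

print(fused_add.shape)  # (1, 16, 16, 64)
print(fused_cat.shape)  # (1, 16, 16, 128)
```

The doubled channel count after concat is why the following convolution is more expensive, matching the instructor's remark about its higher computational cost.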

3. PSPNet

Paper: https://arxiv.org/pdf/1612.01105.pdf, https://hszhao.github.io/projects/pspnet/
PSPNet takes a different approach to acquiring contextual information: enlarging the receptive field. Concretely, a feature-extraction network such as ResNet first extracts features. A pyramid pooling module then aggregates context from regions of different scales. This module uses adaptive pooling: given an input feature map of size HxWxC and a bin size, it produces an output feature map of exactly that bin size; the paper uses bin sizes [1, 2, 3, 6]. Each pooled map goes through a 1x1 convolution that reduces the channels to C/4, then upsampling restores the spatial size to HxW. The pyramid pooling outputs are concatenated with the backbone output, giving a feature map of size HxWx2C. Finally, a 1x1 convolution and softmax produce the prediction map.
It is worth mentioning that the backbone can be replaced with a dilated ResNet, which may work better.
[Figure: PSPNet architecture]
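A minimal sketch of the adaptive average pooling step (one common bin-edge scheme; the helper name is illustrative, and a single-channel map is used for brevity):

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    """Average-pool an (H, W) map into a (bins, bins) grid, splitting
    rows/cols at floor(i*H/bins) as adaptive pooling typically does."""
    h, w = x.shape
    out = np.zeros((bins, bins))
    for i in range(bins):
        for j in range(bins):
            r0, r1 = i * h // bins, (i + 1) * h // bins
            c0, c1 = j * w // bins, (j + 1) * w // bins
            out[i, j] = x[r0:r1, c0:c1].mean()
    return out

x = np.random.rand(60, 60)
for b in [1, 2, 3, 6]:                 # the bin sizes used in the paper
    assert adaptive_avg_pool(x, b).shape == (b, b)
```

Whatever the input size, the output is exactly bins x bins, which is what lets PSPNet fix the pyramid levels at [1, 2, 3, 6] independently of the input resolution.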

4. DeepLab

DeepLab has four versions. The table below compares them; the model structure diagrams drawn by the instructor are included directly.
[Figure: DeepLab series comparison]
[Figure: DeepLab model structures]
Let's look directly at ASPP (Atrous Spatial Pyramid Pooling), shown in the figure below. Unlike PSPNet's adaptive pooling, ASPP uses dilated (atrous) convolutions. The red question mark in the figure indicates that the feature map keeps the same height and width after the dilated convolution, so no upsampling is needed: the padding of the dilated convolution is set to the dilation rate and the stride to 1. Kernel denotes the convolution kernel size.
[Figure: ASPP module]
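This same-size property can be checked numerically: a dilated convolution with kernel k has effective footprint k + (k-1)(dilation-1), so with stride=1 and padding=dilation a 3x3 kernel preserves the spatial size for any dilation rate (the rates below are just examples):

```python
def conv_out_size(n, k, stride=1, pad=0, dilation=1):
    """Output size of a 1-D conv axis under the standard formula."""
    k_eff = k + (k - 1) * (dilation - 1)   # dilated kernel footprint
    return (n + 2 * pad - k_eff) // stride + 1

# With pad = dilation, stride = 1, kernel 3: size is unchanged.
for d in [1, 6, 12, 18]:
    assert conv_out_size(64, 3, stride=1, pad=d, dilation=d) == 64

print(conv_out_size(64, 3, stride=1, pad=6, dilation=6))  # 64
```

This is exactly why the ASPP branches need no upsampling before being combined.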
Now look at the upgraded ASPP module, which adds a 1x1 convolution branch and an adaptive pooling + interpolation branch.
[Figure: upgraded ASPP module]
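At the shape level, the upgraded ASPP concatenates its branches along the channel dimension before a final 1x1 convolution fuses them. A rough sketch (random arrays stand in for the real conv branches; all sizes, branch counts, and dilation rates here are assumptions for illustration):

```python
import numpy as np

H, W, C = 16, 16, 256          # hypothetical backbone output (NHWC, N=1)
x = np.random.rand(1, H, W, C)

def branch(x, c_out=256):
    """Stand-in for a conv branch: only the channel count matters here."""
    return np.random.rand(1, H, W, c_out)

branches = [
    branch(x),   # 1x1 conv
    branch(x),   # 3x3 dilated conv, rate 6
    branch(x),   # 3x3 dilated conv, rate 12
    branch(x),   # 3x3 dilated conv, rate 18
    branch(x),   # adaptive pool -> 1x1 conv -> interpolate back to HxW
]
aspp = np.concatenate(branches, axis=-1)   # concat along channels
print(aspp.shape)  # (1, 16, 16, 1280) -> then fused by a 1x1 conv
```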
In class, the instructor presented the implementation flow of a DeepLabv3 model. I found it very clear; you can write the code yourself by following it. It is posted here.
[Figure: DeepLabv3 implementation flow]

5. Summary

I learned a lot from this 7-day camp; thanks to the official PaddlePaddle course. At the same time, seeing the gap between myself and others, I still need to keep working hard!


Origin blog.csdn.net/qq_43265072/article/details/109284584