Semantic segmentation (FCN, UNET, DEEPLAB)

The purpose of semantic segmentation is to classify every pixel in an image, i.e. to determine the category each pixel belongs to (classification at the pixel level).

1. FCN (Fully Convolutional Networks)

CNN classification networks such as VGG and ResNet add fully connected layers at the end of the network and obtain category probabilities through softmax; the resulting one-dimensional probability vector identifies the category of the whole image. FCN proposes to replace all the fully connected layers at the end of the network with convolutions, obtaining a two-dimensional feature map, and to use deconvolution layers to upsample that feature map back to the same size as the input image, so that a prediction is produced for every pixel and the segmentation problem is solved (FCN recovers the per-pixel category from the abstract feature map).
The structure and operation process of FCN are as follows:
[Figure: FCN structure and operation]

  • First, features are extracted with full convolution (the part above the dotted line). The blue blocks in the figure are convolution blocks and the green blocks are max-pooling blocks. The input can be a color image of any size, and the output image is restored to the same size as the input.
  • Then, classification results are predicted from different stages of the convolutional network (the part below the dotted line). After the original image passes through multiple convolution and pooling operations, the resulting feature map becomes smaller and its resolution lower; this map is called a heatmap (i.e. the feature map). A deconvolution operation is used to upsample the feature map until it is restored to the same size as the input image, thereby generating a prediction for every pixel. If the input image is n×n×c and there are C categories, the restored output is n×n×C; taking, pixel by pixel, the channel with the largest value (interpreted as a probability) among the C maps gives that pixel's class. In other words, the finally restored image is already classified.
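The pixel-by-pixel classification step above can be sketched directly: given the restored n×n×C score map, each pixel's class is simply the channel with the largest score. A minimal numpy sketch (the array values and shapes are toy illustrations, not actual FCN tensors):

```python
import numpy as np

# Restored score map: C category channels, each the size of the input image.
# Here C = 3 classes on a 2x2 "image" (made-up numbers, not real network output).
scores = np.array([
    [[0.1, 0.8], [0.3, 0.2]],   # channel 0
    [[0.7, 0.1], [0.5, 0.1]],   # channel 1
    [[0.2, 0.1], [0.2, 0.7]],   # channel 2
])  # shape (C, H, W) = (3, 2, 2)

# Per-pixel class = index of the maximal score across the C channels.
label_map = np.argmax(scores, axis=0)
print(label_map)  # [[1 0]
                  #  [1 2]]
```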

About the up-sampling operation
Up-sampling can be done in two ways: resizing and deconvolution.
Resizing is usually implemented directly with bilinear interpolation.
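As a concrete illustration of resize-based up-sampling, here is a minimal bilinear interpolation in plain numpy, using the align-corners convention (the function name and shapes are illustrative, not a library API):

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Bilinear up-sampling of a 2-D array (align-corners convention)."""
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)   # sample positions in input coords
    xs = np.linspace(0, in_w - 1, out_w)
    out = np.empty((out_h, out_w))
    for i, y in enumerate(ys):
        y0 = int(y); y1 = min(y0 + 1, in_h - 1); wy = y - y0
        for j, x in enumerate(xs):
            x0 = int(x); x1 = min(x0 + 1, in_w - 1); wx = x - x0
            # Interpolate horizontally on the two rows, then vertically.
            top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
            bot = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
            out[i, j] = (1 - wy) * top + wy * bot
    return out

print(bilinear_resize(np.array([[0.0, 2.0],
                                [4.0, 6.0]]), 3, 3))
# [[0. 1. 2.]
#  [2. 3. 4.]
#  [4. 5. 6.]]
```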
The principle of deconvolution is similar to that of convolution; it is equivalent to ordinary convolution run in reverse. As shown in the figure below, the convolution operation uses a 3×3 kernel to turn a 4×4 input into a 2×2 feature map, and deconvolution restores the 2×2 feature map to the original 4×4 size. Together they form an encode-decode process.
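One common way to implement deconvolution (transposed convolution) is zero-insertion between input elements followed by an ordinary convolution with a flipped kernel. A plain-numpy sketch (helper names are made up; note that this minimal "full-padding" version maps a 2×2 input with a 3×3 kernel and stride 2 to 5×5, while frameworks additionally crop or pad to hit an exact target size such as 4×4):

```python
import numpy as np

def conv2d(x, k):
    """Plain 'valid' 2-D cross-correlation."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def transposed_conv2d(x, k, stride=2):
    """Transposed conv = insert zeros between inputs, pad, then convolve."""
    h, w = x.shape
    kh, kw = k.shape
    z = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1))
    z[::stride, ::stride] = x                             # zero-insertion up-sampling
    zp = np.pad(z, ((kh - 1, kh - 1), (kw - 1, kw - 1)))  # "full" padding
    return conv2d(zp, k[::-1, ::-1])                      # flipped kernel = adjoint op

small = np.array([[1.0, 0.0],
                  [0.0, 0.0]])
restored = transposed_conv2d(small, np.ones((3, 3)), stride=2)
print(restored.shape)  # (5, 5): output side = (2 - 1) * 2 + 3 = 5
```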
[Figure: convolution vs. deconvolution]
Regarding the skip structure
(see the lower part of the dotted line in the network structure diagram): the prediction results from three different depths are fused by element-wise addition. The shallower results are more refined, while the deeper results are more robust.

2. UNet

[Figure: UNet architecture]
The UNet structure is shown in the figure above; its shape resembles the letter U. The overall structure first down-samples, then up-samples back to the same size as the input image. Specifically:

  • The left half is the encoder part, built by repeating two 3×3 convolutional layers (ReLU) followed by a 2×2 max-pooling layer (stride = 2)
  • The right half is the decoder part, built by repeating a 2×2 up-sampling convolutional layer, a concatenation (the feature map output by the corresponding encoder layer is cropped and combined with the decoder's up-sampled result), and two 3×3 convolutional layers
  • The last layer converts the number of channels into the desired number of classes with a 1×1 convolution
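The final 1×1 convolution is just a per-pixel linear map over channels. A quick numpy sketch (the channel counts and weights here are made-up placeholders):

```python
import numpy as np

num_classes, in_ch, h, w = 2, 64, 16, 16
feat = np.random.rand(in_ch, h, w)           # final decoder feature map (C, H, W)
weight = np.random.rand(num_classes, in_ch)  # 1x1 conv = one linear layer per pixel

# Contract over the channel axis: the result has one channel per class.
logits = np.tensordot(weight, feat, axes=([1], [0]))
print(logits.shape)  # (2, 16, 16)
```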

UNet performs four up-sampling steps in total and uses skip connections at each corresponding stage, instead of supervising and back-propagating the loss on the high-level semantic features alone; this ensures that the final feature map fuses in more low-level features.
UNet uses a feature-fusion method different from FCN's: FCN adds corresponding feature values element-wise, while UNet concatenates features along the channel dimension to form a thicker feature map.
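The difference between the two fusion styles shows up directly in array shapes (toy arrays, not actual network tensors):

```python
import numpy as np

c, h, w = 4, 8, 8
shallow = np.random.rand(c, h, w)   # encoder (skip) features
deep_up = np.random.rand(c, h, w)   # up-sampled decoder features

fcn_fused = shallow + deep_up                            # FCN: element-wise sum
unet_fused = np.concatenate([shallow, deep_up], axis=0)  # UNet: channel concat

print(fcn_fused.shape)   # (4, 8, 8)  - channel count unchanged
print(unet_fused.shape)  # (8, 8, 8)  - "thicker" feature
```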

3. Deeplab

Deeplabv1

Deeplab combines deep convolutional neural networks (DCNNs) with probabilistic graphical models (DenseCRF). Because the high-level features of DCNNs are translation-invariant, DCNNs alone are not accurate enough for semantic segmentation tasks. Deeplab therefore combines the DCNN response with a fully connected conditional random field, and innovatively applies hole (atrous) convolution in the DCNN model.

DCNNs face two technical obstacles in image labeling tasks. The first is signal down-sampling: repeated max pooling and down-sampling in a DCNN reduce resolution and lose detail, so Deeplab uses hole convolution to enlarge the receptive field and obtain more context information. The second is spatial insensitivity: a classifier needs invariance to spatial transformations to make object-centric decisions, which limits the localization accuracy of a DCNN, so Deeplab uses DenseCRF to improve the model's ability to capture details.

Hole convolution

Traditional CNNs use convolution and pooling to shrink the image while enlarging the receptive field (the region of the input image that a pixel on a layer's output feature map corresponds to). However, information is lost in the process of shrinking the image with pooling and then enlarging it again. Hole convolution makes it possible to enlarge the receptive field without pooling.

[Figure: ordinary vs. hole (atrous) convolution]
As shown in the figure above, (a) is an ordinary 3×3 convolution; (b) is a 3×3 convolution with hole size 1, where only the red points take part in the convolution and the remaining points are skipped; (c) is a 3×3 convolution with hole size 3. The advantage of hole convolution is that it enlarges the receptive field without the information loss of pooling, so that each convolution output covers a larger range of information.
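A minimal numpy sketch of hole convolution: the kernel samples the input with gaps between taps, so the receptive field grows without pooling. The function name and the `rate` parameter (spacing between sampled taps; rate 1 is an ordinary convolution) are illustrative choices, not a library API:

```python
import numpy as np

def dilated_conv2d(x, k, rate):
    """Hole (atrous) convolution: kernel taps are spaced `rate` apart."""
    kh, kw = k.shape
    eff_h = (kh - 1) * rate + 1     # effective (enlarged) kernel extent
    eff_w = (kw - 1) * rate + 1
    oh, ow = x.shape[0] - eff_h + 1, x.shape[1] - eff_w + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + eff_h:rate, j:j + eff_w:rate] * k)
    return out

x = np.ones((5, 5))
k = np.ones((3, 3))
print(dilated_conv2d(x, k, rate=1).shape)  # (3, 3): same as ordinary conv
print(dilated_conv2d(x, k, rate=2).shape)  # (1, 1): the 3x3 kernel now spans 5x5
```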

DenseCRF

The result produced by the convolutional network can only represent the rough position of an object and cannot describe its boundary well; using a fully connected CRF restores the object boundary nicely.

For each pixel i, let its class label be xi and its observed value (the network output) be yi. Taking every pixel as a node and the relation between each pair of pixels as an edge forms a conditional random field, and the label xi of pixel i can be inferred from the observed variable yi.
The CRF energy function E(x) is composed of a unary potential function and a binary (pairwise) potential function. The unary potential function gives the probability that the pixel observed as yi carries the label xi. The binary potential function describes the correlation between variables and the influence of the observation sequence on them. The CRF is applied as a post-processing step.
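Written out, the fully connected CRF energy in the standard DenseCRF formulation (Krähenbühl and Koltun's, which Deeplab adopts; the exact kernel features below follow that paper, not this post) is:

```latex
E(x) = \sum_i \theta_i(x_i) + \sum_{i<j} \theta_{ij}(x_i, x_j),
\qquad \theta_i(x_i) = -\log P(x_i),
\qquad \theta_{ij}(x_i, x_j) = \mu(x_i, x_j) \sum_m w_m\, k^m(\mathbf{f}_i, \mathbf{f}_j)
```

Here P(xi) is the label probability the DCNN assigns to pixel i (the unary term), μ is a label-compatibility function (e.g. the Potts model, which penalizes neighboring pixels taking different labels), and the k^m are Gaussian kernels over pixel features f (positions and colors), weighted by w_m (the pairwise term).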

Deeplabv2

Compared with the v1 version, the hole-convolution scheme is adjusted, the ASPP module is proposed, and the backbone is changed from VGG16 to ResNet.

The main point here is ASPP. ASPP (atrous spatial pyramid pooling) builds on Kaiming He's SPP idea; it mainly addresses the multi-scale problem in image processing with a pyramid structure.
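The ASPP idea can be sketched in numpy: the same feature map passes through parallel hole convolutions with different rates (inputs padded so the outputs align), and the branches are then combined. All names, sizes, and rates here are illustrative, not Deeplab's actual configuration:

```python
import numpy as np

def atrous_conv2d_same(x, k, rate):
    """Atrous conv with 'same' padding: output size equals input size."""
    kh, kw = k.shape
    ph, pw = (kh - 1) * rate // 2, (kw - 1) * rate // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    eff_h, eff_w = (kh - 1) * rate + 1, (kw - 1) * rate + 1
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + eff_h:rate, j:j + eff_w:rate] * k)
    return out

x = np.random.rand(16, 16)
k = np.random.rand(3, 3)
# Parallel branches with different hole rates capture different scales.
branches = [atrous_conv2d_same(x, k, rate) for rate in (1, 2, 4)]
aspp_out = np.stack(branches)   # (num_branches, H, W), later fused/concatenated
print(aspp_out.shape)  # (3, 16, 16)
```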
[Figure: ASPP (atrous spatial pyramid pooling) module]

Deeplabv3

The v3 version mainly improves the ASPP of v2, and good results are obtained even after the CRF is dropped.
[Figure: Deeplabv3 parallel atrous convolutions]
As shown in the figure above, parallel hole convolutions are added: different hole rates are used within the same block to capture multiple scales.


Origin blog.csdn.net/space_dandy/article/details/107905194