[Study Notes] Overview of Semantic Segmentation

Semantic segmentation is pixel-level image classification: every pixel in the image is assigned a class label. A closely related concept is instance segmentation, which is roughly semantic segmentation plus object detection: semantic segmentation can only group all pixels of the same class together, while object detection separates the different individuals, as in Mask R-CNN.

Evaluation metrics

Before looking at the evaluation metrics, let's first look at the confusion matrix. The confusion matrix comes from classification evaluation: it records, for each class, how many samples of class A were (or were not) correctly classified as class A.

For example, our model predicts 15 samples, and the results are as follows.

True value: 0 1 1 0 1 1 0 0 1 0 1 0 1 0 0

Predicted value: 1 1 1 1 1 0 0 0 0 0 1 1 1 0 1

The confusion matrix is:

                 Predicted 0   Predicted 1
    True 0            4             4
    True 1            2             5

Obviously, with the confusion matrix you can easily calculate accuracy, recall, the F1 score, and other metrics.
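Below is a minimal numpy sketch (not from the original post) that builds this confusion matrix from the true and predicted values above:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1])

num_classes = 2
# confusion[i, j] counts samples whose true label is i and whose prediction is j
confusion = np.zeros((num_classes, num_classes), dtype=np.int64)
for t, p in zip(y_true, y_pred):
    confusion[t, p] += 1

print(confusion)
# [[4 4]
#  [2 5]]
```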

1. Pixel Accuracy

Pixel accuracy is the simplest metric for image segmentation: the number of correctly classified pixels divided by the total number of pixels, that is, the sum of the diagonal of the confusion matrix divided by the sum of all its elements. That is:

Pixel Accuracy = \frac{TP+TN}{TP+TN+FP+FN}=\frac{diag(confusion)}{sum(confusion)}
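As a quick check, the same formula on the numpy confusion matrix built above (a continuation of that sketch):

```python
# correctly classified pixels / all pixels = trace of the confusion matrix / its sum
pixel_accuracy = np.diag(confusion).sum() / confusion.sum()
print(pixel_accuracy)  # (4 + 5) / 15 = 0.6
```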

2. IOU(Intersection Over Union)

As the name implies, IOU is the intersection-over-union ratio: the ratio between the intersection and the union of the Ground Truth region and the Prediction region.

Suppose the person's pixel class is 1 and the background's pixel class is 0. In terms of the confusion matrix, for class 1 the intersection is the diagonal element, and the union is (all pixels whose true label is 1) + (all pixels predicted as 1) - (the intersection). That is:

IOU = \frac{diag(confusion)}{sum(confusion, axis=1)+sum(confusion, axis=0)-diag(confusion)}
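Continuing the numpy sketch, the per-class IoU can be read directly off the confusion matrix:

```python
# intersection = diagonal; union = row sum + column sum - diagonal
intersection = np.diag(confusion)
union = confusion.sum(axis=1) + confusion.sum(axis=0) - intersection
iou = intersection / union
print(iou)  # class 0: 4 / (8 + 6 - 4) = 0.4, class 1: 5 / (7 + 9 - 5) ≈ 0.455
```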

3. mIOU(mean IOU) 

As the name implies, mIOU is the IOU averaged over all categories.
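A one-line continuation of the sketch (classes that never appear, i.e. have an empty union, are skipped to avoid dividing by zero):

```python
miou = iou[union > 0].mean()
print(miou)  # (0.4 + 5/11) / 2 ≈ 0.427
```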

4. Dice Score(F1 Score)

F_1=2\frac{precision\cdot recall}{precision+recall}
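For a given class, the Dice score can also be written as 2TP / (2TP + FP + FN), which is algebraically the same as the F1 score above. Continuing the numpy sketch:

```python
tp = np.diag(confusion)
fp = confusion.sum(axis=0) - tp  # predicted as this class but actually another class
fn = confusion.sum(axis=1) - tp  # belongs to this class but predicted as another class
dice = 2 * tp / (2 * tp + fp + fn)
print(dice)  # class 0: 8/14 ≈ 0.571, class 1: 10/16 = 0.625
```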

Introduction to network models

  1. FCN
  2. UNet
  3. SegNet
  4. deeplab(V1 V2)
  5. RefineNet
  6. PSPNet
  7. Deeplab v3
  8. EncNet
  9. ......

1. FCN

https://arxiv.org/abs/1411.4038   (2014)

FCN marked the beginning of deep learning applied to image segmentation and laid the foundation for how semantic segmentation is done.

  • Fully Convolutional: the fully connected layers are converted into convolutional layers, and the feature map produced by the VGG backbone is restored to the input size by upsampling.
  • Transpose Convolution: upsampling is done with transposed convolution, which resembles the inverse process of convolution. It is not a true inverse; it only restores the spatial size, and its parameters are learned (see the sketch after this list).
  • Skip Architecture: fuses the feature map of a convolutional layer with the feature map of a transposed-convolution layer at the same scale, to improve the fineness of the segmentation.
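A minimal PyTorch sketch (not the original FCN code) of these two ideas: a learned transposed convolution doubles the resolution of a coarse score map, which is then fused with a same-scale score map from a shallower layer. The shapes and the 21-class setting are illustrative assumptions.

```python
import torch
import torch.nn as nn

coarse = torch.randn(1, 21, 16, 16)  # e.g. 21-class score map at 1/32 resolution
skip   = torch.randn(1, 21, 32, 32)  # score map from a shallower layer at 1/16

# learned 2x upsampling: output size = (16 - 1) * 2 - 2 * 1 + 4 = 32
upsample_2x = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1)
fused = upsample_2x(coarse) + skip   # element-wise sum, FCN-style skip fusion
print(fused.shape)                   # torch.Size([1, 21, 32, 32])
```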

2. UNet

 https://arxiv.org/abs/1505.04597v1 (2015)

UNet was designed for segmentation applications on medical images. Because medical image datasets are small, the method proposed in the paper effectively improves the results obtained when training and testing on a small dataset, and it also proposes an effective way of processing large images.

The network architecture of UNet is inherited from FCN, with some changes made on that basis. It formulates the Encoder-Decoder concept, which is essentially FCN's idea of convolving first and then upsampling.

  • U-shaped structure: a completely symmetrical encoder-decoder.
  • Skip Architecture: encoder feature maps are concatenated with the decoder feature maps at the same scale (see the sketch after this list).
  • Transpose Convolution
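A minimal PyTorch sketch (not the original U-Net code) of one decoder step: upsample with a transposed convolution, concatenate the same-scale encoder feature map along the channel axis, then convolve. The channel counts are illustrative, and "same" padding is used here for simplicity, whereas the original paper uses unpadded convolutions.

```python
import torch
import torch.nn as nn

encoder_feat = torch.randn(1, 128, 64, 64)  # feature map saved on the way down
decoder_feat = torch.randn(1, 256, 32, 32)  # feature map coming up from below

up = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
conv = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)

x = torch.cat([encoder_feat, up(decoder_feat)], dim=1)  # concat skip -> (1, 256, 64, 64)
x = conv(x)                                             # -> (1, 128, 64, 64)
print(x.shape)
```

Note the contrast with FCN: U-Net concatenates the skip feature map, while FCN adds it.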

3. SegNet

https://arxiv.org/abs/1511.00561 (2015)

SegNet is also derived from FCN, with some changes made on that basis, and the network has a smaller memory footprint.

  • Encoder-Decoder concept: encoding is the convolution (downsampling) stage, and decoding is the upsampling stage.
  • Skip Architecture: uses concat to fuse the connected feature maps.
  • Max-pooling Indices: also called unpooling. The indices of the maxima are stored during pooling in the Encoder; when upsampling in the Decoder, the feature map is enlarged by placing values back at those indices and filling the other positions with 0, and a following convolution ("same" conv) then smooths the sparse feature map. As shown below:

[Figure: max-pooling indices / unpooling in SegNet]

As can be seen from the max-pooling indices structure, SegNet has fewer parameters and is therefore relatively fast. However, like interpolation, max-pooling indices requires no learned parameters, so it is not as accurate as transpose convolution; as a result, SegNet does not perform as well as FCN-8s.
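A minimal PyTorch sketch of this unpooling idea (an illustration, not the original SegNet code): the encoder pooling returns the argmax indices, the decoder's MaxUnpool2d puts values back at those positions and fills the rest with zeros, and a following "same" convolution smooths the sparse result.

```python
import torch
import torch.nn as nn

pool   = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
smooth = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # "same" conv after unpooling

x = torch.randn(1, 64, 32, 32)
pooled, indices = pool(x)           # (1, 64, 16, 16) plus the positions of the maxima
restored = unpool(pooled, indices)  # (1, 64, 32, 32), zeros everywhere else
out = smooth(restored)
print(out.shape)                    # torch.Size([1, 64, 32, 32])
```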

4. Deeplab

Deeplab v1: https://arxiv.org/abs/1412.7062 (2014)

Deeplab v2:  https://arxiv.org/abs/1606.00915  (2016)

Before introducing Deeplab, let's first introduce Dilated Convolution and the Conditional Random Field (CRF).

Dilated ("hole"/atrous) convolution fills zeros between the weights of the convolution kernel, or equivalently samples the input map at equally spaced positions; apart from that, it is computed in the same way as a standard convolution.

[Figure: standard convolution]

[Figure: dilated convolution]

The figures above show standard convolution and dilated convolution. So why use dilated convolutions?

The explanation in the paper is that FCN follows the traditional CNN approach of first convolving and pooling the image, which reduces the image size while increasing the receptive field, and then enlarges the image again by upsampling. However, some information is inevitably lost while shrinking and re-enlarging, which leads to coarse segmentation. Dilated convolution is an operation that obtains a larger receptive field without pooling. At the same time, by setting different dilation rates, dilated convolution can also extract features at different scales. (Take such theoretical explanations with a grain of salt: much of this is empirical work for which a plausible story has been found after the fact, but the black box is still a black box, and these explanations are hard to generalize across deep learning.)
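A minimal PyTorch sketch of the receptive-field point (shapes are illustrative): a 3x3 convolution with dilation=2 covers a 5x5 window, the same receptive field as a 5x5 kernel, while still using only 9 weights and no pooling.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)              # 3x3 window
dilated  = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # 5x5 window

print(standard(x).shape, dilated(x).shape)  # both torch.Size([1, 64, 32, 32])
```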

Conditional Random Field (CRF)

For CRFs, see: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs (Taylor Guo, CSDN blog on image semantic segmentation based on conditional random fields).

Overall, Deeplab does two things:

  • The last two Pooling layers are removed, and Dilated Convolution is used to expand the receptive field and avoid information loss caused by pooling.
  • For the heatmap produced by the model, a CRF is applied as post-processing: it incorporates prior knowledge about segmentation edges in the image (such as gradient changes and color) to refine the heatmap and further improve segmentation accuracy.

5. RefineNet

RefineNet: https://arxiv.org/pdf/1611.06612.pdf  (2016)

A variant of UNet that uses the Encoder-Decoder structure. The Encoder uses ResNet101, and the Decoder uses the RefineNet structure.

In the overall architecture of the network, the role of the RefineNet block is to fuse feature maps of different resolutions. The leftmost column is ResNet: the pretrained ResNet is first divided into four ResNet blocks according to the resolution of the feature maps, and the four blocks then serve as four paths that are fused, going rightwards, through RefineNet blocks, finally producing a refined feature map (which is fed to a softmax layer and then bilinearly interpolated to the output). Except for RefineNet-4, every RefineNet block takes two inputs and fuses features from different levels for refinement; the single-input RefineNet-4 can be regarded as a task adaptation of ResNet.

[Figure: structural details of the RefineNet block]

The figure above shows the structural details of the RefineNet block. The details are as follows:

  • RCU: the Residual Conv Unit, a unit structure extracted from the residual network (see the sketch after this list).
  • Multi-resolution fusion: first a convolutional layer adapts each of the input feature maps (to the smallest feature dimension among the inputs), then they are upsampled, and finally they are added element-wise. A single-input block such as RefineNet-4 skips this part.
  • Chained residual pooling: the convolutional layers act as weights for the subsequent weighted summation. The ReLU is important for the effectiveness of the following pooling and makes the model less sensitive to changes in the learning rate. This chained structure can capture background context from a large region. In addition, the structure uses many identity-mapping connections, so whether the distance is long or short, gradients can propagate directly from one block to any other block.
  • Output convolutions: an RCU is added before the output.
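A minimal PyTorch sketch of an RCU as described above (an illustration of the idea, not the authors' implementation): a small ReLU-conv-ReLU-conv residual block with an identity shortcut, adapted from ResNet.

```python
import torch
import torch.nn as nn

class RCU(nn.Module):
    """Residual Conv Unit: ReLU -> conv -> ReLU -> conv, plus an identity shortcut."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # identity mapping lets gradients flow directly

print(RCU(256)(torch.randn(1, 256, 32, 32)).shape)  # torch.Size([1, 256, 32, 32])
```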

6. PSPNet

[Paper Notes] PSPNet: Pyramid Scene Parsing Network

7. Deeplab V3 

..... 
