Image Segmentation Based on Deep Learning Methods

Author: WeisongZhao
Source: CSDN

Link: https://blog.csdn.net/weixin_41923961/article/details/80946586
Editor: Wang Meng (深度学习冲鸭 public account)
Copyright belongs to the author. This article is shared for academic purposes only; in case of infringement, please contact us for removal.

CNN image semantic segmentation is basically this routine:

1. Downsampling + upsampling: convolution + deconvolution/resize

2. Multi-scale feature fusion: point-wise feature addition / concatenation along the channel dimension

3. Obtain a pixel-level segmentation map: classify each pixel

Even the more complex DeepLab v3+ is still this basic routine.
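As a concrete illustration, here is a minimal PyTorch sketch of this routine (the network, its name, and all layer sizes are made up for illustration, not taken from any paper):

import torch
import torch.nn as nn

# 1. Downsampling: strided convolutions shrink the feature map.
# 2. Upsampling: transposed convolutions restore the resolution.
# 3. Per-pixel classification: argmax over the class channels.
class TinySegNet(nn.Module):          # hypothetical toy network
    def __init__(self, num_classes=21):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1))

    def forward(self, x):
        return self.decoder(self.encoder(x))

net = TinySegNet()
logits = net(torch.randn(1, 3, 64, 64))  # (1, 21, 64, 64) class scores
labels = logits.argmax(dim=1)            # (1, 64, 64) per-pixel labels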

[Figure 13: DeepLab v3+]


Comparison of image segmentation network structures

[Figure: side-by-side comparison of segmentation network structures]


The FCN image segmentation family tree

FCN

  • DeepLab

  • DeconvNet

    • SegNet

  • PSPNet

  • Mask-RCNN


By segmentation purpose

Ordinary segmentation

Separate the pixel regions belonging to different objects. 
For example, separate the foreground from the background, or the dog's region from the cat's region and the background.

Semantic segmentation

On top of ordinary segmentation, classify the semantics of each region (that is, what object the region is). 
For example, point out the category of every object in the picture.

Instance segmentation

On top of semantic segmentation, number each individual object. 
For example: this is dog A in the picture, and that is dog B.

Paper recommendation

Semantic segmentation of images is a very important task in computer vision. Its goal is to classify each pixel in the image. If image segmentation can be done quickly and accurately, many problems will be solved easily. Therefore, its application areas include but are not limited to: autonomous driving, image beautification, 3D reconstruction, etc.


Semantic segmentation is a very difficult problem, especially before deep learning. Deep learning has improved the accuracy of image segmentation a lot. Below we summarize the most representative methods and papers in recent years.


Fully Convolutional Networks (FCN)

The first paper we introduce is Fully Convolutional Networks for Semantic Segmentation, or FCN for short.

This paper is the first paper to successfully use deep learning for image semantic segmentation. There are two main contributions of the paper:

A fully convolutional network is proposed: the fully connected layers are replaced with convolutional layers, so the network can accept an image of any size and output a segmentation map of the same size as the original. Only then can each pixel be classified.

Deconvolution layers are used. The feature map of the classification neural network is generally only a fraction of the size of the original image. To map back to the original image size, the feature map must be upsampled, which is the role of the deconvolution layer.

Although it is called a deconvolution layer, it is not actually the inverse operation of convolution. A more appropriate name is transposed convolution; its job is to expand small feature maps into larger ones.
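A minimal sketch of transposed convolution in PyTorch (the sizes here are arbitrary illustrative choices):

import torch
import torch.nn as nn

x = torch.randn(1, 256, 8, 8)    # a small feature map
# Transposed convolution "rolls out" a larger map from a smaller one.
up = nn.ConvTranspose2d(256, 21, kernel_size=4, stride=2, padding=1)
print(up(x).shape)               # torch.Size([1, 21, 16, 16])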

This is the pioneering work of neural network for semantic segmentation, which needs to be thoroughly understood.


DeepLab

DeepLab has three versions (v1, v2, v3). The v2 paper is titled DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs.

This series of papers introduces the following important methods:

The first is convolution with holes, known in English as dilated convolution or atrous convolution. It inserts holes (zeros) between the elements of an ordinary convolution kernel, as shown in the following figure.

[Figure: atrous convolution, with holes inserted into the kernel]

Its computation cost is the same as ordinary convolution; the advantage is a larger field of view. For example, an ordinary 3x3 convolution sees a 3x3 region, but after inserting one hole it sees 5x5. The benefit is that, at the same downsampling factor of the feature map, the network captures more global information of the image, which is very important in semantic segmentation.

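A quick PyTorch check of this claim, with arbitrary toy sizes: a dilated 3x3 convolution has exactly as many weights as an ordinary one, but covers a larger region:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 9, 9)
plain  = nn.Conv2d(1, 1, kernel_size=3, dilation=1)  # 3x3 field of view
atrous = nn.Conv2d(1, 1, kernel_size=3, dilation=2)  # 5x5 field of view
print(sum(p.numel() for p in plain.parameters()))    # 10 parameters
print(sum(p.numel() for p in atrous.parameters()))   # 10 -- same cost
print(plain(x).shape, atrous(x).shape)  # (1,1,7,7) vs (1,1,5,5)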


Pyramid Scene Parsing Network

The core contribution of the Pyramid Scene Parsing Network (PSPNet) is global pyramid pooling. It scales the feature map to several different sizes so that the features carry better global and multi-scale information, which is very useful for improving accuracy.

[Figure: the PSPNet pyramid pooling module]

In fact, pyramid-style multi-scale features are very useful not only for semantic segmentation but for all kinds of vision problems.
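A minimal sketch of a pyramid pooling module, assuming PSPNet-style bin sizes (1, 2, 3, 6); this is an illustrative PyTorch reimplementation, not the authors' code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        # One 1x1 conv per pyramid level to reduce channels.
        self.stages = nn.ModuleList(
            nn.Conv2d(in_ch, in_ch // len(bins), 1) for _ in bins)
        self.bins = bins

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x]
        for bin_size, conv in zip(self.bins, self.stages):
            y = conv(F.adaptive_avg_pool2d(x, bin_size))  # pool to bin x bin
            # Upsample back to the input size and collect.
            feats.append(F.interpolate(y, size=(h, w), mode='bilinear',
                                       align_corners=False))
        return torch.cat(feats, dim=1)   # concat along channels

ppm = PyramidPooling(64)
print(ppm(torch.randn(1, 64, 32, 32)).shape)  # (1, 64 + 4*16, 32, 32)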

Mask R-CNN

Mask R-CNN is Kaiming He's well-known work combining object detection and semantic segmentation. Its main contributions are as follows.

First, the neural network has multiple output branches. Mask R-CNN uses a framework similar to Faster R-CNN: Faster R-CNN outputs each object's bounding box and category, while Mask R-CNN adds a branch that predicts the object's segmentation mask.

In other words, the network learns the two tasks simultaneously, and they promote each other.

Second, use a binary mask for segmentation. Ordinarily, semantic segmentation must predict a class index (0, 1, 2, 3, 4, ...) at each pixel. In Mask R-CNN, the detection branch predicts the class, so the segmentation branch only needs to predict a 0/1 mask of the object's shape.

Third, Mask R-CNN proposes RoIAlign to replace the RoIPooling of Faster R-CNN. The idea of RoIPooling is to map an arbitrary region of the input image to the corresponding region of the feature map.

RoIPooling finds the corresponding region by rounding, so the correspondence deviates from the true one. This offset is tolerable for classification but has a much larger impact on pixel-accurate segmentation.

To solve this, RoIAlign drops the rounding and instead uses bilinear interpolation to sample the corresponding region accurately, yielding a much better correspondence.
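torchvision ships a ready-made RoIAlign op; a minimal usage sketch (the feature map and box coordinates are arbitrary illustrative values):

import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 50, 50)   # a feature map
# One box per row: (batch_index, x1, y1, x2, y2); fractional
# coordinates are kept as-is -- no rounding.
boxes = torch.tensor([[0, 10.3, 15.7, 40.2, 45.9]])
crop = roi_align(feat, boxes, output_size=(7, 7), spatial_scale=1.0)
print(crop.shape)                    # torch.Size([1, 256, 7, 7])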

Experiments also confirm the benefit. A comparison with the previous method is shown below; the Mask R-CNN result is visibly much finer.

[Figure: segmentation masks, previous method vs. Mask R-CNN]

U-Net

U-Net is a segmentation network its authors proposed for the ISBI challenge; it can be trained on a very small training set (about 30 images). Both U-Net and FCN are small segmentation networks: neither uses atrous convolution or a trailing CRF, and the structure is simple.

[Figure 9: U-Net network structure]

The overall U-Net structure, shown in Figure 9, resembles a large letter U: first Conv + Pooling downsampling; then Deconv upsampling, fusing in the cropped low-level feature map from the encoder; then upsampling again.

This repeats until the 388x388x2 output feature map is reached, and a final softmax produces the output segmentation map. Overall it is very similar in spirit to FCN.

Why bring up U-Net?

It is because U-Net adopts a feature fusion method completely different from FCN's: concatenation!

[Figure 10: U-Net concat feature fusion]

Unlike FCN's point-wise addition, U-Net concatenates features along the channel dimension, forming "thicker" features.

Therefore, the semantic segmentation network also has two methods for feature fusion:

1. FCN-style point-wise addition: caffe's EltwiseLayer, tensorflow's tf.add()

2. U-Net-style channel concatenation: caffe's ConcatLayer, tensorflow's tf.concat()
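The two fusion styles side by side, sketched in PyTorch (shapes are arbitrary):

import torch

a = torch.randn(1, 64, 56, 56)    # decoder feature
b = torch.randn(1, 64, 56, 56)    # skip feature from the encoder

fcn_style  = a + b                   # point-wise add: still 64 channels
unet_style = torch.cat([a, b], 1)    # channel concat: now 128 channels
print(fcn_style.shape, unet_style.shape)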

Overview

Image semantic segmentation, in simple terms: given a picture, classify each pixel in the image.

In other words, we need to turn an actual scene image into segmentation maps like the following:

[Figure: a scene image and its segmentation map]

Different colors represent different categories. After reading "a lot" of papers and checking the PASCAL VOC Challenge evaluation server, I found that since deep learning entered this task (with FCN), a rough general framework has been settled on:

  • FCN - Fully Convolutional Network

  • CRF - Conditional Random Field

  • MRF - Markov Random Field

[Figure: pipeline of FCN front end followed by CRF/MRF back end]

The front end uses an FCN to extract coarse features, and the back end uses a CRF/MRF to refine the front end's output, finally producing the segmentation map.

Front end

Why do we need FCN?

A classification network usually ends with several fully connected layers, which flatten the two-dimensional matrix (the image) into one dimension, losing the spatial information; training finally outputs a scalar, the classification label.

The output of image semantic segmentation, however, must be a segmentation map: whatever its size, it is at least two-dimensional.

Therefore, we discard the fully connected layers and replace them with convolutional layers, and this gives a fully convolutional network.

For specific definitions, please refer to the paper:

Fully Convolutional Networks for Semantic Segmentation

Front-end structure

FCN

The FCN here refers specifically to the structure proposed in the Fully Convolutional Networks for Semantic Segmentation paper, rather than a generalized fully convolutional network.

The author's FCN mainly uses three techniques:

  • Convolutionalization

  • Upsampling

  • Skip connections

Convolutionalization

Convolutionalization means discarding the fully connected layers of an ordinary classification network such as VGG16 or ResNet50/101 and replacing them with the corresponding convolutional layers.

[Figure: replacing fully connected layers with convolutions]

Upsampling

The upsampling here is deconvolution. Of course, different frameworks use different names: it is called Deconvolution in Caffe and Keras, and conv2d_transpose in TensorFlow.

The CS231n course says that transposed convolution is the more appropriate name.

As we all know, ordinary pooling (see below for why I say "ordinary") shrinks the image; for example, after VGG16's five poolings the image is reduced 32x. To get a segmentation map as large as the original image, we need upsampling/deconvolution.

Like convolution, deconvolution is a multiply-and-add operation, but convolution is many-to-one while deconvolution is one-to-many. The forward and backward passes of deconvolution are just the backward and forward passes of convolution, swapped.

So neither optimization nor backpropagation poses any problem. The diagram is as follows:

[Figure: deconvolution diagram]

However, although the paper says the deconvolution is learnable, the author's actual code does not let it learn, probably because of this one-to-many relationship. The code is as follows:

layer {
  name: "upscore"
  type: "Deconvolution"
  bottom: "score_fr"
  top: "upscore"
  param {
    lr_mult: 0  # learning-rate multiplier 0: the kernel is frozen, not learned
  }
  convolution_param {
    num_output: 21   # 21 classes (PASCAL VOC: 20 + background)
    bias_term: false
    kernel_size: 64
    stride: 32       # upsamples the 1/32-resolution score map 32x
  }
}

You can see that lr_mult is set to 0.
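For comparison, here is a PyTorch sketch of the same idea: a frozen deconvolution initialized with a standard bilinear upsampling kernel (the bilinear-weight recipe is the usual one; treat this as an illustration, not the authors' code):

import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    # Standard bilinear interpolation weights for a transposed conv.
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size, dtype=torch.float32)
    filt = 1 - (og - center).abs() / factor
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = filt[:, None] * filt[None, :]  # one filter per channel
    return weight

up = nn.ConvTranspose2d(21, 21, kernel_size=64, stride=32, bias=False)
up.weight.data.copy_(bilinear_kernel(21, 64))
up.weight.requires_grad = False   # the equivalent of lr_mult: 0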

Skip structure

(The name is my own translation; it is generally called a skip connection.) This structure refines the result: directly upsampling the output of the fully convolutional part gives a very coarse result, so the author also upsamples the outputs of different pooling layers and fuses them to refine the output. The specific structure is as follows:

[Figure: FCN skip-connection structure]

The results obtained with different upsampling structures are compared as follows:

[Figure: comparison of results with different upsampling structures]

Of course, you could also upsample the outputs of pool1 and pool2, but the authors say the resulting improvement is small.
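A sketch of this skip fusion in the spirit of FCN-8s, with made-up shapes (PyTorch, illustration only):

import torch
import torch.nn.functional as F

def fuse(score32, score16, score8):
    # score32: coarse class scores at 1/32 resolution;
    # score16, score8: 1x1-conv score maps of pool4 (1/16) and pool3 (1/8).
    x = F.interpolate(score32, scale_factor=2, mode='bilinear',
                      align_corners=False) + score16    # now 1/16
    x = F.interpolate(x, scale_factor=2, mode='bilinear',
                      align_corners=False) + score8     # now 1/8
    # Final 8x upsampling back to the input resolution.
    return F.interpolate(x, scale_factor=8, mode='bilinear',
                         align_corners=False)

out = fuse(torch.randn(1, 21, 7, 7),
           torch.randn(1, 21, 14, 14),
           torch.randn(1, 21, 28, 28))
print(out.shape)    # torch.Size([1, 21, 224, 224])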

This was the first structure, the pioneering work applying deep learning to image semantic segmentation, and it earned a best-paper honorable mention at CVPR 2015. Still, it handles some things roughly, as the comparison with later methods will show.

SegNet/DeconvNet

These structures are grouped together here; I find them more elegant, though their results are not necessarily better than the previous one's.

SegNet

[Figure: SegNet structure]

DeconvNet

[Figure: DeconvNet structure]

Such a symmetric structure has the flavor of an auto-encoder: encode first, then decode. These structures mainly rely on deconvolution and unpooling, namely:

[Figure: deconvolution and max unpooling]

Deconvolution works as above. Unpooling is implemented by remembering the position of each maximum during pooling, then, during unpooling, writing each value back to its original position and filling the other positions with 0.
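PyTorch has built-in ops for exactly this pairing; a small sketch:

import torch
import torch.nn as nn

pool   = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
y, idx = pool(x)        # idx remembers where each max came from
x_rec = unpool(y, idx)  # maxima restored in place, zeros elsewhere
print(x_rec)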

DeepLab

Next comes a very mature and elegant structure, one that many current improvements build on.

First, let us point out a rough spot in FCN: to keep the output from becoming too small, the FCN author pads the original image with 100 pixels directly in the first layer, which inevitably introduces noise.

So how can we keep the output from being too small without adding that 100-pixel padding?

One might suggest simply removing some pooling layers. That is theoretically possible, but it directly alters a structure that already works, and, most importantly, the pretrained parameters could then no longer be reused for fine-tuning.

Therefore, DeepLab does something very elegant here: change the pooling stride to 1 and add padding of 1. The pooled image then keeps its size while still retaining pooling's feature-integrating effect.

But that is not the end of it. Because the pooling layers have changed, the receptive fields of the subsequent convolutions change too, so fine-tuning is again impossible. DeepLab therefore proposes a new convolution, convolution with holes: atrous convolution. That is:

[Figure: atrous convolution]

The specific receptive-field changes are as follows:

[Figure: receptive fields after ordinary vs. stride-1 pooling]

(a) is the result of ordinary pooling; (b) is the result of the "elegant" stride-1 pooling. Imagine applying an ordinary 3x3 convolution on (a): the corresponding receptive field is 7. The same operation on (b) gives a receptive field of only 5. The receptive field has shrunk.

But if we instead use an atrous convolution with one hole, the receptive field on (b) is still 7.

So atrous convolution keeps the receptive field unchanged after the modified pooling, which allows fine-tuning while making the output more refined. That is:

[Figure: finer output obtained with atrous convolution]
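A sketch of this trick with toy sizes (PyTorch, for illustration): set the pool stride to 1 (padding 1) to keep the resolution, then give the following 3x3 convolution dilation 2 so its receptive field is unchanged:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 28, 28)

# Original: stride-2 pooling halves the map, then an ordinary 3x3 conv.
orig = nn.Sequential(nn.MaxPool2d(3, stride=2, padding=1),
                     nn.Conv2d(64, 64, 3, padding=1))

# DeepLab-style: stride-1 pooling keeps the size; the conv becomes atrous.
atrous = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                       nn.Conv2d(64, 64, 3, padding=2, dilation=2))

print(orig(x).shape)    # torch.Size([1, 64, 14, 14]) -- resolution lost
print(atrous(x).shape)  # torch.Size([1, 64, 28, 28]) -- resolution kept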

Summary

Three structures have been introduced here: FCN, SegNet/DeconvNet, and DeepLab. Of course there are other structural approaches, such as RNN-based methods, and the more practical weakly-supervised methods, and so on.

Back end

At last, the back end. This part touches on several random-field models and involves some mathematics; my understanding is not especially deep, so corrections are welcome.

Fully Connected Conditional Random Field (DenseCRF)

For each pixel $i$ there is a class label $x_i$ and a corresponding observation $y_i$. Taking each pixel as a node and the relationship between each pair of pixels as an edge constitutes a conditional random field, and we infer the label $x_i$ of each pixel from the observed variables $y_i$. The conditional random field looks like this:

[Figure: the conditional random field over image pixels]

The conditional random field obeys a Gibbs distribution (here $I$ is the global observation mentioned above):

$$P(X = x \mid I) = \frac{1}{Z(I)} \exp\big(-E(x \mid I)\big)$$

where $E(x \mid I)$ is the energy function. For simplicity, the global observation $I$ is omitted below:

$$E(x) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j)$$

The unary potential $\psi_u(x_i)$ comes directly from the front-end FCN's output. The pairwise potential is:

$$\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{K} w^{(m)} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)$$

where the $k^{(m)}$ are Gaussian kernels over the pixel features $\mathbf{f}_i$ (color and position).

The pairwise potential describes the relationship between pixels: it encourages similar pixels to take the same label and very different pixels to take different labels, where the "difference" is defined in terms of color values and actual relative distance.

The CRF therefore tends to split the image at its boundaries.

What distinguishes the fully connected conditional random field is that its pairwise potential relates each pixel to every other pixel, hence the name "fully connected".

Digest this pile of formulas at your leisure... Evaluating them directly is cumbersome (I find it cumbersome too), so inference is generally done with the mean-field approximation.

The mean-field approximation is yet another pile of formulas that I will not reproduce here (I doubt everyone wants to read it); interested readers can consult the paper directly.
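In practice this back end is often run through the popular pydensecrf wrapper around DenseCRF; a usage sketch (array shapes and kernel parameters are illustrative defaults, not tuned values):

import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def densecrf_refine(probs, img, n_iters=5):
    # probs: (n_labels, H, W) softmax output of the front-end FCN
    # img:   (H, W, 3) uint8 RGB image
    n_labels, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, n_labels)
    d.setUnaryEnergy(unary_from_softmax(probs))   # unary term from the FCN
    # Pairwise terms: a smoothness kernel and an appearance kernel.
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=80, srgb=13, rgbim=img, compat=10)
    q = d.inference(n_iters)                      # mean-field iterations
    return np.argmax(q, axis=0).reshape(h, w)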

CRFasRNN

At first, DenseCRF was simply applied to the output of the FCN, which is a fairly crude combination.

In deep learning we pursue end-to-end systems, and the CRFasRNN paper truly merges DenseCRF into the FCN.

This paper also uses the mean-field approximation. Because each step of the decomposed update consists of multiply-adds and ordinary additions and subtractions (see the paper for the exact formulas), each step can conveniently be described as a convolution-like layer.

The whole thing can then be embedded in the neural network, and forward and backward propagation pose no problem.

The authors also iterate the update; different iteration counts refine the result to different degrees (generally fewer than 10 iterations are used), and this unrolled iteration is why the paper describes the CRF "as RNN".

The optimization results are as follows:

[Figure: results refined by CRFasRNN]

Markov Random Field (MRF)

An MRF is used in the Deep Parsing Network. Its formulation resembles the CRF above, but the authors modify the pairwise potential:

$$\Psi(y_i^u, y_j^v) = \sum_{k=1}^{K} \lambda_k \, u_k(i, u, j, v) \sum_{\forall z \in \mathcal{N}_j} d(j, z)\, p_z^v$$

Here the authors add the label context $u_k(i, u, j, v)$: the plain $\mu(x_i, x_j)$ only captures how often two labels co-occur, whereas the label context can also penalize particular configurations. For example, a person may well be next to a table, but is unlikely to be under it; this term can therefore learn the probabilities of different configurations.

Likewise, the original distance term $d(i, j)$ only relates two pixels. The authors add a triple penalty: the pixels $z$ near $j$ are brought in as well, so this three-way relationship captures richer local context. The specific structure is as follows:

[Figure: Deep Parsing Network structure]

The advantages of this structure are:

  • The mean-field computation is built as a CNN

  • Joint training with one-pass inference, no iteration needed

Gaussian Conditional Random Field (G-CRF)

This structure uses CNNs to learn the unary potential and the pairwise potential separately. It is the kind of structure I like best:

[Figure: Gaussian CRF network structure]

The energy function here differs from before:

$$E(\mathbf{x}) = \frac{1}{2}\mathbf{x}^{\top}(A + \lambda I)\mathbf{x} - B\mathbf{x}$$

When $(A + \lambda I)$ is symmetric positive definite, minimizing $E(\mathbf{x})$ is equivalent to solving the linear system:

$$(A + \lambda I)\mathbf{x} = B$$
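A toy NumPy check that the minimization really reduces to a linear solve when the matrix is symmetric positive definite:

import numpy as np

rng = np.random.default_rng(0)
n, lam = 5, 1.0
M = rng.standard_normal((n, n))
A = M @ M.T                        # symmetric positive semi-definite
B = rng.standard_normal(n)

x = np.linalg.solve(A + lam * np.eye(n), B)   # the energy minimizer

grad = (A + lam * np.eye(n)) @ x - B          # gradient of E at x
print(np.allclose(grad, 0))                   # True: x is the minimum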

The advantages of G-CRF are:

  • The quadratic energy has a well-defined global minimum

  • Solving a linear system is much easier than iterative inference

My understanding

The FCN front end is more of an engineering technique; it keeps improving as the backbone networks (e.g. VGG, ResNet) improve.

Deep learning + probabilistic graphical models (PGMs) is a trend. Essentially, DL extracts features, while PGMs explain the intrinsic relationships between things well from mathematical theory.

Turning probabilistic graphical models into networks. A PGM is usually awkward to attach to a DL model, but once the PGM is expressed as network layers, its parameters can be learned automatically and the whole system becomes end-to-end.

References

[1]Fully Convolutional Networks for Semantic Segmentation

[2]Learning Deconvolution Network for Semantic Segmentation

[3]Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials

[4]Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

[5]Conditional Random Fields as Recurrent Neural Networks

[6]DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

[7]Semantic Image Segmentation via Deep Parsing Network

[8]Fast, Exact and Multi-Scale Inference for Semantic Image Segmentation with Deep Gaussian CRFs

[9] SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
