Segmentation Methods for Deep Learning

FCN: A Semantic Segmentation Model Based on Deep Learning

Definition of semantic segmentation: fine-grained classification at the pixel level.

Using deep learning for semantic segmentation, the main problems are:

  • Early deep models were designed for classification and output one-dimensional vectors, which cannot represent a per-pixel segmentation

  • The output of deep models is not fine-grained enough

Motivation

  1. How can the network be made usable for segmentation?

Simply make the network output two-dimensional features.

How can early neural networks be made to output two-dimensional maps?

Remove the fully connected layers.

  2. How can the model's output be made fine enough?

Why is the output imprecise?

After multiple layers of convolution and pooling, the resolution of the feature map is low.

For example, a 224×224 input image yields only a 7×7 feature map, which obviously cannot give fine results.

A feasible method is to enlarge the 7×7 feature map.

The specific method is deconvolution (transposed convolution).
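A minimal PyTorch sketch of this enlargement, assuming the 7×7 map has 512 channels and 21 output classes (both assumptions, not stated in the text):

```python
import torch
import torch.nn as nn

# Hypothetical sizes from the text: a 224x224 input reduced to a 7x7 feature map.
feat = torch.randn(1, 512, 7, 7)  # (batch, channels, H, W)

# A transposed convolution ("deconvolution") with stride 32 upsamples 7x7 to 224x224:
# out = (in - 1) * stride - 2 * padding + kernel_size = 6*32 - 32 + 64 = 224
deconv = nn.ConvTranspose2d(512, 21, kernel_size=64, stride=32, padding=16)
out = deconv(feat)
print(out.shape)  # torch.Size([1, 21, 224, 224])
```

The kernel size, stride, and padding together determine the upsampling factor; here they are chosen so that 7×7 maps exactly to 224×224.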

Model structure

Since 1/32 resolution is too low and segmenting directly from it is very rough, the 1/32 map is first upsampled and then fused with the 1/16 map, and so on (the multi-scale idea in YOLOv3 is inspired by this).
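The "upsample then fuse" step can be sketched in PyTorch as follows (the class count of 21 and the feature sizes are assumptions):

```python
import torch
import torch.nn.functional as F

# Hypothetical per-class score maps at 1/32 and 1/16 of a 224x224 input.
score32 = torch.randn(1, 21, 7, 7)    # 1/32 resolution: too coarse on its own
score16 = torch.randn(1, 21, 14, 14)  # 1/16 resolution: finer detail

# FCN-16s-style fusion: upsample the coarse map 2x, then add it to the finer map.
fused = F.interpolate(score32, scale_factor=2, mode="bilinear", align_corners=False) + score16
print(fused.shape)  # torch.Size([1, 21, 14, 14])
```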

PyTorch implementation of FCN

The front part is a regular CNN that turns the image into a feature map; a deconvolution then restores the original image size.

| Layer | Output size |
| --- | --- |
| Input image | 224×224 |
| Convolution 1 | 224×224 |
| Pooling 1 | 112×112 |
| Convolution 2 | 112×112 |
| Pooling 2 | 56×56 |
| Convolution 3 | 56×56 |
| Pooling 3 | 28×28 |
| Convolution 4 | 28×28 |
| Pooling 4 | 14×14 |
| Convolution 5 | 14×14 |
| Deconvolution 6 | 224×224 |
| Output | 224×224 |
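The layer table above can be sketched as a minimal PyTorch module (channel widths are assumptions; the real FCN uses a VGG backbone):

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        def block(cin, cout):  # 3x3 conv keeps the size; 2x2 max-pooling halves it
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.features = nn.Sequential(
            block(3, 64),     # conv 1 + pool 1: 224 -> 112
            block(64, 128),   # conv 2 + pool 2: 112 -> 56
            block(128, 256),  # conv 3 + pool 3: 56 -> 28
            block(256, 512),  # conv 4 + pool 4: 28 -> 14
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),  # conv 5: stays 14x14
        )
        # deconvolution 6: 14x14 -> 224x224 (upsampling factor 16)
        self.up = nn.ConvTranspose2d(512, num_classes, kernel_size=32, stride=16, padding=8)

    def forward(self, x):
        return self.up(self.features(x))

out = TinyFCN()(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 21, 224, 224])
```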

Implementation details

Deconvolution

Results

Unpooling and deconvolution only restore the size; the information lost during convolution (or pooling) is not recovered.
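A small PyTorch example makes this concrete: max-unpooling puts each max value back at its recorded position and fills everything else with zeros, so the discarded values are gone for good.

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2.], [3., 4.]]]])  # one 2x2 single-channel image
pool = nn.MaxPool2d(2, return_indices=True)
unpool = nn.MaxUnpool2d(2)

y, idx = pool(x)           # y keeps only the max (4.0) plus its position index
restored = unpool(y, idx)  # same 2x2 size, but 1, 2, 3 cannot be recovered
print(restored)            # tensor([[[[0., 0.], [0., 4.]]]])
```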

Downsampling (encoding) / Upsampling (decoding)

UNet

DeepLab Series Semantic Segmentation

Problems with FCN

  1. Pooling layers cause the loss of detail information

  2. Spatial invariance (shifting an object left or right does not change the classification result) is unfriendly to segmentation tasks

Difficulty 1: Loss of detailed information

Essentially, semantic segmentation is a task that depends on low-level features.

It is sensitive to information such as edges, textures, and colors.

The existence of the pooling layer leads to the loss of these details, which cannot be recovered even with upsampling.

As discussed before, the point of pooling is to concentrate information, thereby enlarging the receptive field and raising the level of abstraction.

So how can we maximize the receptive field without losing detailed information?

Solution 1: Dilated (Atrous) Convolution

Dilated convolution can enlarge the receptive field as much as possible without pooling, thereby quickly concentrating information.
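A minimal PyTorch sketch of a dilated convolution (the 3×3 kernel and input size are assumptions): a kernel of size k with dilation d covers an effective window of k + (k-1)(d-1) pixels, so the receptive field grows without any pooling.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)
# dilation=2 spreads the 3x3 taps apart, covering an effective 5x5 window;
# padding=2 keeps the spatial resolution unchanged (no pooling involved).
conv_d2 = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)
y = conv_d2(x)
print(y.shape)  # torch.Size([1, 1, 32, 32]) -- resolution preserved

# Effective window size of a 3x3 kernel at several dilation rates:
for d in (1, 2, 4, 8):
    print(d, 3 + 2 * (d - 1))  # 1->3, 2->5, 4->9, 8->17
```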

Exercise: Calculate the final receptive field of the above model

Advantages of dilated convolution: enlarges the receptive field while preserving detail information

Disadvantages of dilated convolution: not robust enough for small objects

Difficulty 2: Spatial invariance

A significant advantage of CNNs is spatial invariance.

The same object should produce the same output regardless of its position, shape, or angle in the image.

For example, an image of a cat always yields the one-hot vector of the cat category.

But for segmentation, this spatial invariance is inconvenient.

Corresponding solution: CRF

A fully connected CRF is introduced: it takes the original image as input and combines it with the feature map to refine the result.

CRF

Conditional Random Field

Preliminaries

CRF is a discriminative model: it uses a conditional probabilistic graphical model to model the conditional probability $P(X \mid I)$ and thereby performs the discrimination task.

In other words, CRF is an estimate of a conditional probability.

CRF for Semantic Segmentation

Motivation

The core purpose of the image segmentation task is to assign a label to each pixel.

However, because of the pooling and convolution operations, the edges of the target regions are blurred.

A CRF is then needed to supply extra information at the edges, so as to obtain better boundaries.

Create a random field

Given an image:

* Define $X=\{X_1, X_2, ..., X_N\}$, where $X_i$ is the predicted label of the $i$-th pixel;
* Define $L=\{L_1, L_2, ..., L_N\}$, where $L_i$ is the ground-truth label of the $i$-th pixel;
* Define $I=\{I_1, I_2, ..., I_N\}$, where $I_i$ is the data of the $i$-th pixel.
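With these definitions, the fully connected CRF is usually formulated with the standard Gibbs energy below (this formula comes from the dense-CRF literature; it is not stated explicitly above):

```latex
E(X) = \sum_i \psi_u(X_i) + \sum_{i<j} \psi_p(X_i, X_j)
```

Here $\psi_u$ is the unary potential (from the network's per-pixel scores) and $\psi_p$ is the pairwise potential, which compares pixel positions and colors to encourage consistent labels across smooth regions and label changes at edges.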

Intuitive explanation

Each pixel (its RGB values form a vector) serves as an observation, and we infer a label for each pixel from these observations.

Therefore, current mainstream semantic segmentation models basically follow the framework below.

How many iterations does the CRF loop run? As many as there are categories.

Shortcoming

However, this method has a rather big problem: it is slow.

The slowness mainly comes from the CRF.

Solving the CRF has high computational complexity, so optimizing the result takes a long time.

Solution

Approximately solve the CRF by training.

That is, the CRF inference is decomposed into a series of convolution operations and solved as an RNN.

Step 2

Secondly, in the message-passing step, m Gaussian filters are applied to Q.

This amounts to blurring the feature map, which is equivalent to a convolution operation.

get the following result

Step 3

The third step: compatibility transform

Step 4

After that, the unary potentials are added in (i.e., a subtraction of potentials).

That is, the previous result is combined with this iteration's increment.

Finally, normalization is performed using softmax.

CRF as RNN

Treating the above steps as one RNN iteration, CRF processing is applied to each class's feature map to obtain better results.

Implementation

Sample code
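As a stand-in for the missing sample code, here is a toy mean-field iteration in the CRF-as-RNN spirit (the single fixed Gaussian kernel and the simple compatibility matrix are simplifications; the paper uses learned filters and weights):

```python
import torch
import torch.nn.functional as F

def mean_field_step(unary, Q, compat):
    # unary: (C, H, W) class logits (negative unary potentials);
    # Q: current per-pixel label distribution, same shape.
    blur = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
    kernel = blur.expand(Q.shape[0], 1, 3, 3)
    # step 2: message passing -- Gaussian filtering of Q, one channel per class
    msg = F.conv2d(Q.unsqueeze(0), kernel, padding=1, groups=Q.shape[0]).squeeze(0)
    # step 3: compatibility transform -- mix class channels with a CxC matrix
    mixed = torch.einsum("cd,dhw->chw", compat, msg)
    # step 4: add the unary potentials (a subtraction), then normalize via softmax
    return torch.softmax(unary - mixed, dim=0)

C, H, W = 3, 8, 8
unary = torch.randn(C, H, W)
Q = torch.softmax(unary, dim=0)
compat = 1.0 - torch.eye(C)          # Potts-style: penalize differing labels
for _ in range(5):                   # the "RNN" loop: a fixed number of iterations
    Q = mean_field_step(unary, Q, compat)
print(Q.shape)                       # each pixel's class distribution sums to 1
```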

Overall block diagram of DeepLab v1

Segmentation must produce a two-dimensional output; therefore the fully connected layers are converted to convolutions (fc ---> conv2d):
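The fc ---> conv2d conversion can be sketched like this (the 512×7×7 feature size and 21 classes are assumptions): a fully connected layer over a flattened feature map is equivalent to a convolution whose kernel covers the whole map.

```python
import torch
import torch.nn as nn

fc = nn.Linear(512 * 7 * 7, 21)           # expects a flattened 512x7x7 feature map
conv = nn.Conv2d(512, 21, kernel_size=7)  # same weights, viewed as a 7x7 convolution
conv.weight.data = fc.weight.data.view(21, 512, 7, 7)
conv.bias.data = fc.bias.data

feat = torch.randn(1, 512, 7, 7)
a = fc(feat.flatten(1))        # (1, 21)
b = conv(feat).flatten(1)      # (1, 21, 1, 1) flattened to (1, 21)
print(torch.allclose(a, b, atol=1e-4))  # identical up to float rounding
```

Unlike the fc layer, the conv version also accepts larger inputs and then emits a 2-D grid of class scores, which is exactly what segmentation needs.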

Code

DeepLabV2

DeepLabV2 improves on DeepLabV1; the main improvements are:

  • A new bottleneck-based backbone (ResNet)

  • ASPP

  • An improved CRF

ASPP

ASPP (Atrous Spatial Pyramid Pooling) is inspired by SPPNet.

It aims to fuse features of different scales.
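A minimal ASPP sketch in the DeepLabV2 style: parallel 3×3 convolutions with different dilation rates over the same feature map, fused by summation. The rates follow the paper's common setting; the channel counts are assumptions.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, cin, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        # padding=r with dilation=r keeps the spatial size for a 3x3 kernel
        self.branches = nn.ModuleList(
            nn.Conv2d(cin, num_classes, 3, padding=r, dilation=r) for r in rates
        )

    def forward(self, x):
        # each branch sees a different receptive field; summing fuses the scales
        return sum(b(x) for b in self.branches)

out = ASPP(512, 21)(torch.randn(1, 512, 28, 28))
print(out.shape)  # torch.Size([1, 21, 28, 28])
```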

network structure

DeepLab V3

Improvements over V2:

  1. Previous versions stacked convolution blocks serially with fixed rates; V3 cascades dilated convolutions with progressively increasing rates to obtain better detail

  2. An improved ASPP

1×1 convolution: reduces parameters and changes the number of channels.

A better way to fuse features is not to add them directly, but to concatenate them and learn the weights of the new channels through a 1×1 convolution.
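A tiny sketch of this concat-then-1×1 fusion (the two branch outputs and channel counts are assumptions):

```python
import torch
import torch.nn as nn

a = torch.randn(1, 256, 28, 28)  # hypothetical output of one parallel branch
b = torch.randn(1, 256, 28, 28)  # hypothetical output of another branch

# Concatenate along the channel axis, then let a 1x1 conv learn how to weight
# the combined channels -- cheap in parameters, and learned rather than a fixed sum.
fuse = nn.Conv2d(512, 256, kernel_size=1)
out = fuse(torch.cat([a, b], dim=1))
print(out.shape)  # torch.Size([1, 256, 28, 28])
```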

Overall structure: the dilation rate r increases progressively

 

60×60 --> 480×480: achieved by bilinear interpolation, which replaces the CRF used in V2


Origin blog.csdn.net/qq_54809548/article/details/130992115