FCN: A Semantic Segmentation Model Based on Deep Learning
Definition of semantic segmentation: assigning a class label to every pixel, i.e. pixel-level classification.
When using deep learning for semantic segmentation, the main problems are:
- Early deep models were designed for classification and output one-dimensional vectors, so they cannot produce a spatial (per-pixel) prediction.
- The output of deep models is not fine-grained enough: repeated downsampling leaves only a low-resolution feature map.
motivation
- How do we make the network usable for segmentation?
  Let the network output a two-dimensional feature map.
  How do we make early classification networks output two-dimensional maps?
  Remove the fully connected layers.
- How do we make the model's output fine enough?
  Why is the output coarse?
  After many layers of convolution and pooling, the resolution of the feature map is low.
  For example, a 224×224 image ends up as a 7×7 feature map, which obviously cannot give fine boundaries.
  A feasible fix is to enlarge the 7×7 map back up.
  The specific method is deconvolution (transposed convolution).
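As a sketch of how deconvolution (transposed convolution) enlarges a feature map: a minimal single-channel NumPy version that upsamples by zero-insertion and then applies the kernel. The 4×4 kernel and stride 2 here are illustrative choices, not FCN's exact configuration.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride):
    """Upsample x by inserting (stride-1) zeros between pixels, then
    slide the kernel over the zero-padded result ('full' correlation).
    Single-channel sketch; real frameworks also handle channels,
    padding options, and gradients."""
    h, w = x.shape
    k = kernel.shape[0]
    # zero-insertion upsampling: (h-1)*stride + 1 rows/cols
    up = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1))
    up[::stride, ::stride] = x
    # pad by k-1 on each side, then do a 'valid' sliding window
    up = np.pad(up, k - 1)
    out_h = up.shape[0] - k + 1
    out_w = up.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(up[i:i + k, j:j + k] * kernel)
    return out

x = np.arange(49.0).reshape(7, 7)              # a 7x7 feature map
y = transposed_conv2d(x, np.ones((4, 4)), stride=2)
print(y.shape)  # (16, 16): output size (7-1)*2 + 4 = 16
```

The output size follows `(in - 1) * stride + kernel` (with no padding), which is how a small map grows back toward the input resolution.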
model structure
Since a 1/32-resolution prediction is too coarse to segment directly, it is first upsampled and then fused with the 1/16 feature map, then 1/8, and so on. (The multi-scale idea behind YOLOv3 was inspired by this.)
The front part is a regular CNN that turns the image into a feature map; deconvolution then restores the original image size.
| Layer           | Output size |
|-----------------|-------------|
| Input image     | 224×224 |
| Convolution 1   | 224×224 |
| Pooling 1       | 112×112 |
| Convolution 2   | 112×112 |
| Pooling 2       | 56×56 |
| Convolution 3   | 56×56 |
| Pooling 3       | 28×28 |
| Convolution 4   | 28×28 |
| Pooling 4       | 14×14 |
| Convolution 5   | 14×14 |
| Deconvolution 6 | 224×224 |
| Output          | 224×224 |
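The sizes in the table can be checked with simple arithmetic, assuming size-preserving ("same") convolutions and 2×2 pooling with stride 2:

```python
# Verify the table: 'same' convolutions keep the size, each 2x2/stride-2
# pooling halves it, and the final deconvolution must undo the total factor.
size = 224
sizes = [size]
for _ in range(4):      # pooling 1..4 (convolutions are size-preserving)
    size //= 2          # each pooling halves the resolution
    sizes.append(size)
print(sizes)            # [224, 112, 56, 28, 14]
print(224 // 14)        # 16: the deconvolution must upsample by x16
```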
implementation details
deconvolution
Result display
Unpooling and deconvolution only restore the spatial size; the information lost during convolution and pooling is not recovered.
Downsampling (encoding) → Upsampling (decoding)
UNet
DeepLab Series Semantic Segmentation
Problems with FCN
- Pooling layers cause the loss of detail information.
- Spatial invariance (shifting an object left or right does not change the classification result) is unfriendly to segmentation tasks.
Difficulty 1: Loss of detailed information
Essentially, semantic segmentation is a task that also depends on low-level features:
it is sensitive to information such as edges, textures, and colors.
Pooling layers discard these details, and they cannot be recovered even with upsampling.
As discussed before, the point of pooling is to condense information, expanding the receptive field and raising the level of abstraction.
So how can we enlarge the receptive field as much as possible without losing detail?
Solution 1: Dilated (atrous) convolution
Dilated convolution enlarges the receptive field without pooling, so information can be condensed quickly while keeping full resolution.
Exercise: Calculate the final receptive field of the above model
Advantages of dilated convolution: enlarges the receptive field while preserving detail information.
Disadvantages of dilated convolution: not robust enough for small objects.
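For the receptive-field exercise above, a small helper (illustrative, not part of the original notes) computes the receptive field of a layer stack; the effective kernel of a dilated convolution is $d(k-1)+1$:

```python
def receptive_field(layers):
    """Receptive field of a stack of layers.
    Each layer is (kernel, stride, dilation); a dilated convolution's
    effective kernel size is dilation*(kernel-1) + 1."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1
        rf += (k_eff - 1) * jump   # growth scaled by accumulated stride
        jump *= s
    return rf

# Three plain 3x3 convs (stride 1) vs. three 3x3 convs with rates 1, 2, 4:
plain = receptive_field([(3, 1, 1)] * 3)
dilated = receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)])
print(plain, dilated)  # 7 15
```

With the same parameter count, the dilated stack more than doubles the receptive field.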
Difficulty 2: Spatial invariance
A significant advantage of CNNs for classification is spatial invariance:
the same object should produce the same output regardless of its position, shape, or angle in the image.
For example, an image of a cat always yields the one-hot vector of the cat category.
But for segmentation, this spatial invariance becomes a hindrance.
Corresponding solution: CRF
A fully connected CRF is introduced: it takes the original image as additional input and refines the network's output feature map.
conditional random field
Preliminaries
CRF is a discriminative model: it uses a conditional probabilistic graphical model to model the conditional probability $P(Y|X)$ and perform discrimination.
In other words, a CRF is an estimate of a conditional probability.
CRF for Semantic Segmentation
motivation
The core goal of image segmentation is to assign a label to each pixel.
However, because of the convolution and pooling operations, the edges of target regions become blurred.
A CRF is then used to inject new information (from the raw image) at the edges, so as to obtain sharper boundaries.
Create a random field
Given an image:
* Define $X=\{X_1, X_2, ..., X_N\}$, where $X_i$ is the predicted label of the $i$-th pixel;
* Define $L=\{L_1, L_2, ..., L_N\}$, where $L_i$ is the ground-truth label of the $i$-th pixel;
* Define $I=\{I_1, I_2, ..., I_N\}$, where $I_i$ is the data (e.g. the RGB value) of the $i$-th pixel.
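With $X$ and $I$ defined above, the fully connected CRF (the DenseCRF of Krähenbühl & Koltun, which DeepLab adopts) minimizes an energy made of unary and pairwise potentials:

$$E(X) = \sum_i \psi_u(X_i) + \sum_{i<j} \psi_p(X_i, X_j)$$

$$\psi_p(X_i, X_j) = \mu(X_i, X_j)\left[w_1 \exp\left(-\frac{\|p_i-p_j\|^2}{2\theta_\alpha^2} - \frac{\|I_i-I_j\|^2}{2\theta_\beta^2}\right) + w_2 \exp\left(-\frac{\|p_i-p_j\|^2}{2\theta_\gamma^2}\right)\right]$$

Here $p_i$ is the position of pixel $i$: the first (appearance) kernel encourages nearby pixels with similar color to share a label, and the second (smoothness) kernel removes small isolated regions.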
intuitive explanation
Each pixel (its RGB value, a vector) serves as an observation, and we infer a label for each pixel from these observations.
Therefore, current mainstream semantic segmentation models basically follow the framework below.
How many iterations does the CRF loop run? One pass per category: as many loops as there are categories.
shortcoming
However, this method has a relatively big problem: it is slow.
The slowness mainly comes from the CRF.
CRF inference has high complexity, so optimizing the result takes a long time.
solution
Approximate the CRF solution through learning:
decompose the CRF inference into a series of convolution operations and unroll it as an RNN.
step2
Second, in the message-passing step, $m$ Gaussian filters are applied to $Q$ (the current label distribution).
This blurs each label's probability map, which is equivalent to a convolution operation,
giving the result below.
step3
The third step: compatibility transform
step4
After that, the unary potentials are added back, i.e. the pairwise message term is subtracted from the unary term (comparing the previous estimate with this iteration's increment).
Finally, the result is normalized with a softmax.
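The steps above can be sketched as a single mean-field loop. This is a toy NumPy version: the box blur stands in for the Gaussian/bilateral message-passing filters, and all shapes and names are illustrative.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def blur(q):
    """Stand-in for the Gaussian message-passing filters: a 3x3 box blur
    per label channel (CRF-as-RNN really uses spatial + bilateral
    Gaussian filters over position and color)."""
    p = np.pad(q, ((0, 0), (1, 1), (1, 1)), mode='edge')
    out = np.zeros_like(q)
    for di in range(3):
        for dj in range(3):
            out += p[:, di:di + q.shape[1], dj:dj + q.shape[2]]
    return out / 9.0

def mean_field(unary, mu, n_iter=5):
    """init -> message passing -> compatibility transform ->
    add unary (subtract the pairwise term) -> softmax normalization.
    unary: (L, H, W) unary scores; mu: (L, L) label compatibility."""
    q = softmax(unary)                        # initialization
    for _ in range(n_iter):
        m = blur(q)                           # step 2: message passing
        pw = np.einsum('lk,khw->lhw', mu, m)  # step 3: compatibility transform
        q = softmax(unary - pw)               # step 4: subtract, then normalize
    return q

L, H, W = 3, 8, 8
rng = np.random.default_rng(0)
unary = rng.normal(size=(L, H, W))
mu = 1.0 - np.eye(L)                          # Potts model: penalize disagreement
q = mean_field(unary, mu)
print(q.shape)  # (3, 8, 8); each pixel's label distribution sums to 1
```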
CRF as RNN
Treat the steps above as one RNN cell: each iteration applies the CRF update to the per-category feature maps, yielding progressively better results.
Implementation
Overall block diagram of DeepLab v1
Segmentation needs a two-dimensional output; therefore the fully connected layers are converted to convolutions (fc ---> conv2d):
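A minimal sketch of the fc → conv2d conversion: the FC weight matrix is reshaped into convolution kernels, and on a 7×7 input both views give identical outputs. Sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
C_in, C_out, K = 4, 3, 7
W_fc = rng.normal(size=(C_out, C_in * K * K))   # fully connected weights
x = rng.normal(size=(C_in, K, K))               # a 7x7 feature map

# FC view: flatten the feature map, then matrix-multiply
y_fc = W_fc @ x.reshape(-1)

# Conv view: reshape the same weights into C_out kernels of shape (C_in, 7, 7)
W_conv = W_fc.reshape(C_out, C_in, K, K)
y_conv = np.array([(w * x).sum() for w in W_conv])  # 'valid' conv, one position

print(np.allclose(y_fc, y_conv))  # True: the conversion changes nothing here
```

On inputs larger than 7×7, the convolutional version simply slides, producing a 2D score map instead of a single vector.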
DeepLabV2
DeepLabV2 improves on DeepLabV1; the main changes are:
- A new backbone (ResNet, built from bottleneck blocks)
- ASPP
- An improved CRF
ASPP
ASPP (Atrous Spatial Pyramid Pooling) is inspired by SPPNet.
It aims to fuse features of different scales together.
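A toy single-channel sketch of the ASPP idea: parallel dilated convolutions with different rates see different scales, and their outputs are fused. The rates and shapes here are illustrative.

```python
import numpy as np

def dilated_conv3x3(x, w, rate):
    """Single-channel 3x3 dilated convolution with 'same' padding."""
    pad = rate
    xp = np.pad(x, pad)
    h, wd = x.shape
    out = np.zeros_like(x)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            out += w[di + 1, dj + 1] * xp[pad + di * rate: pad + di * rate + h,
                                          pad + dj * rate: pad + dj * rate + wd]
    return out

rng = np.random.default_rng(2)
x = rng.normal(size=(16, 16))
rates = (1, 2, 4, 8)                       # parallel branches, different rates
branches = [dilated_conv3x3(x, rng.normal(size=(3, 3)), r) for r in rates]
fused = sum(branches)   # v2-style fusion by summing; v3 concatenates + 1x1 conv
print(fused.shape)      # (16, 16): same resolution, multi-scale context
```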
network structure
DeepLab V3
Improvements over V2:
- The earlier designs used serially connected plain CNN blocks without dilation throughout; V3 cascades dilated convolutions with progressively increasing rates to capture better detail.
- Improved ASPP
1×1 convolution: reduces parameters and changes the number of channels.
A better way to fuse features than direct addition: concatenate them, then learn the channel-mixing weights of the new channels with a 1×1 convolution.
Overall structure: the dilation rate r increases progressively.
60×60 → 480×480: realized by bilinear interpolation, which replaces the CRF used in V2.
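A minimal sketch of the bilinear interpolation that replaces the CRF (align-corners convention; the tiny 3×3 → 9×9 shapes stand in for 60×60 → 480×480):

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Minimal 2D bilinear interpolation (align_corners=True convention)."""
    in_h, in_w = x.shape
    ys = np.linspace(0, in_h - 1, out_h)        # sample positions in input coords
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]  # fractional weights
    return ((1 - wy) * (1 - wx) * x[np.ix_(y0, x0)]
            + (1 - wy) * wx * x[np.ix_(y0, x1)]
            + wy * (1 - wx) * x[np.ix_(y1, x0)]
            + wy * wx * x[np.ix_(y1, x1)])

score = np.arange(9.0).reshape(3, 3)   # a tiny score map
up = bilinear_resize(score, 9, 9)      # upsample to the output resolution
print(up.shape)  # (9, 9); corner values are preserved exactly
```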