[Deep Learning] Semantic Segmentation with Convolutional Neural Networks | FCN, DeepLab v1 v2 v3, U-Net, Transposed Convolution, Dilated Convolution


1. Introduction to Semantic Segmentation

1. Common segmentation tasks
From semantic segmentation to instance segmentation to panoptic segmentation, the granularity of the task increases, and so does the difficulty.

| Task | Meaning | Typical network |
| --- | --- | --- |
| Semantic segmentation | Assigns a class to every pixel, but does not distinguish different objects of the same class | FCN |
| Instance segmentation | Separates individual object instances within a category | Mask R-CNN |
| Panoptic segmentation | Segments the background as well as the individual objects | Panoptic FPN |

2. Common dataset formats

PASCAL VOC: this dataset contains 20 object categories. The segmentation labels are stored as PNG images in palette (P) mode, with pixel value 255 marking object borders / ignore regions that are excluded from the loss.

For example, pixel value 0 corresponds to (0, 0, 0), black; pixel value 1 corresponds to (127, 0, 0), dark red; pixel value 255 corresponds to (224, 224, 119).

MS COCO: a very large and widely used dataset covering object detection, segmentation, image captioning and more; each object in an image is annotated with polygon coordinates.


3. The form of the semantic segmentation result
The result is a mask in which each pixel value is a category index, and a palette maps the indices to colors. For example, background pixels with value 0 are shown as black after applying the palette, aircraft pixels with value 1 are shown in red, and person pixels with value 15 are shown in light red.


Q: Why not display the result directly as a grayscale image?
A: If it is displayed directly as a grayscale image, the differences between pixel values are too small to observe. Coloring the mask with a palette gives each class a distinct color, which makes the prediction easy to observe and visualize.


4. Common semantic segmentation evaluation metrics

  • Pixel Accuracy (Global Acc): $\frac{\sum_{i} n_{ii}}{\sum_{i} t_i}$

The numerator is the total number of correctly predicted pixels, and the denominator is the total number of pixels over all classes.

  • mean Accuracy: $\frac{1}{n_{cls}} \sum_{i} \frac{n_{ii}}{t_i}$
  • mean IoU: $\frac{1}{n_{cls}} \sum_{i} \frac{n_{ii}}{t_i + \sum_{j} n_{ji} - n_{ii}}$
    where
    $n_{ij}$: the number of pixels of class $i$ predicted as class $j$
    $n_{cls}$: the number of target classes (including background)
    $t_i = \sum_{j} n_{ij}$: the total number of pixels whose ground-truth label is class $i$

In PyTorch, the mean IoU is calculated by building a confusion matrix.

The rows of the matrix are the ground-truth labels and the columns are the predicted labels. Looking at the first element of the matrix: 16 pixels have ground-truth label 0 and are predicted as 0; the rest of the matrix is filled in the same way.
  • Accuracy over the whole image: $global\_accuracy = \frac{16+3+16+12+8}{64} \approx 0.859$
    The numerator is the sum of the diagonal elements of the matrix (the number of correctly predicted pixels), and the denominator is the total number of pixels.

  • Per-class accuracy:
    $cls0\_acc = \frac{16}{20},\quad cls1\_acc = \frac{3}{4},\quad cls2\_acc = \frac{16}{16},\quad cls3\_acc = \frac{12}{12},\quad cls4\_acc = \frac{8}{8}$
    The numerator is the number of correctly predicted pixels of each class, and the denominator is the total number of pixels whose ground-truth label is that class (the corresponding row sum of the confusion matrix).

  • Per-class IoU:
    $cls0\_iou = \frac{16}{20+18-16},\quad cls1\_iou = \frac{3}{4+4-3},\quad cls2\_iou = \frac{16}{16+18-16},\quad cls3\_iou = \frac{12}{16+12-12},\quad cls4\_iou = \frac{8}{8+12-8}$
    That is, for each class: correct count / (row sum + column sum − correct count).
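The calculation above can be reproduced with a small PyTorch sketch that accumulates a confusion matrix and derives the three metrics from it (the function names here are illustrative, not from a particular library):

```python
import torch

def build_confusion_matrix(gt, pred, num_classes, ignore_index=255):
    """Rows are ground-truth labels, columns are predicted labels."""
    mask = (gt != ignore_index) & (gt < num_classes)
    # map each (gt, pred) pair to a single bin index gt * num_classes + pred
    inds = num_classes * gt[mask].to(torch.int64) + pred[mask]
    return torch.bincount(inds, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_metrics(conf):
    conf = conf.float()
    global_acc = torch.diag(conf).sum() / conf.sum()
    per_class_acc = torch.diag(conf) / conf.sum(dim=1)                          # correct / row sum
    iou = torch.diag(conf) / (conf.sum(dim=1) + conf.sum(dim=0) - torch.diag(conf))
    return global_acc, per_class_acc, iou.mean()                                # last value is mean IoU

# toy usage: random 5-class labels for an 8x8 "image"
gt = torch.randint(0, 5, (8, 8))
pred = torch.randint(0, 5, (8, 8))
print(segmentation_metrics(build_confusion_matrix(gt, pred, num_classes=5)))
```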


5. Semantic segmentation labeling tools
Labelme: manual labeling;
EISeg: a semi-automatic labeling tool open-sourced by Baidu, based on the PaddlePaddle framework.


2. Two convolutions commonly used in semantic segmentation

In semantic segmentation, two convolution variants that differ from ordinary convolution are frequently used: transposed convolution and dilated convolution.

1. Transposed convolution

Transposed convolution, also known as fractionally-strided convolution or deconvolution, plays an upsampling role.

First, be clear about two points:

  • Transposed convolution is not the inverse operation of convolution
  • Transposed convolution is also convolution

Transposed convolution operation steps:

  1. Insert $s-1$ rows and columns of zeros between the elements of the input feature map
  2. Pad $k-p-1$ rows and columns of zeros around the input feature map
  3. Flip the convolution kernel parameters up-down and left-right
  4. Perform an ordinary convolution (padding 0, stride 1)

$k$: the kernel size
$s$: the stride of the transposed convolution, not the stride of the ordinary convolution in step 4
$p$: the padding of the transposed convolution

$H_{out}=(H_{in}-1)\times stride[0]-2\times padding[0]+kernel\_size[0]$
$W_{out}=(W_{in}-1)\times stride[1]-2\times padding[1]+kernel\_size[1]$
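A quick check of this output-size formula with PyTorch's nn.ConvTranspose2d (the numbers are just an example):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 2, 2)   # 2x2 input
tconv = nn.ConvTranspose2d(in_channels=1, out_channels=1,
                           kernel_size=3, stride=2, padding=0, bias=False)
y = tconv(x)

# H_out = (H_in - 1) * stride - 2 * padding + kernel_size = (2 - 1) * 2 - 0 + 3 = 5
print(y.shape)                # torch.Size([1, 1, 5, 5])
```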


To understand transposed convolution in detail, let's first look at the calculation process of ordinary convolution.

Ordinary convolution
When first learning convolution, we picture the kernel sliding over the input image position by position, but this is not how it is computed in practice, because it would be very inefficient. Instead, the computer converts the convolution kernel into an equivalent matrix and the input into a vector; multiplying the input vector by the kernel matrix gives the output vector, which is then reshaped into the two-dimensional output feature map. The specific steps are as follows:

  1. Convert the convolution kernel into a matrix that is equivalent with respect to the input image (a sparse matrix);
  2. Flatten the input image into a vector, and flatten each of the four 4×4 equivalent kernels into a vector and stack them;
  3. Multiply the input vector by the kernel matrix to obtain the output vector, then reshape it into the two-dimensional output image.

We multiply the 1×16 row vector by the 16×4 matrix to obtain a 1×4 row vector.

What if, conversely, we multiply a 1×4 vector by a 4×16 matrix to obtain a 1×16 row vector? That is exactly transposed convolution!
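A small sketch of this matrix view, assuming a 4×4 input, a 3×3 kernel, stride 1 and no padding (so the kernel matrix is 16×4); it simply checks that the flattened product matches an ordinary convolution:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 4)        # input image
k = torch.randn(3, 3)        # convolution kernel

# build the 16x4 sparse kernel matrix: one column per output position,
# each column is the 3x3 kernel embedded in a zero 4x4 canvas and flattened
cols = []
for i in range(2):           # output height = 2
    for j in range(2):       # output width = 2
        canvas = torch.zeros(4, 4)
        canvas[i:i + 3, j:j + 3] = k
        cols.append(canvas.reshape(-1))
C = torch.stack(cols, dim=1)                       # shape (16, 4)

out_vec = x.reshape(1, -1) @ C                     # (1, 16) @ (16, 4) -> (1, 4)
out_ref = F.conv2d(x[None, None], k[None, None])   # ordinary convolution, (1, 1, 2, 2)
print(torch.allclose(out_vec.reshape(2, 2), out_ref[0, 0]))   # True
```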


transposed convolution

Corresponding to the ordinary convolution above, the transposed-convolution relation is $O^{T}\times C^{T} = I^{T}$, as shown in the figure below:

Compared with ordinary convolution, we can decompose this matrix product to understand it more vividly:

  1. Reshape the input back into a 2×2 tensor, and reshape each of the 16 column vectors of the resulting 4×16 matrix into a small kernel, convolving the input with each of them in turn;
  • a single such convolution operation;
  • all the results obtained;
  • viewed as a whole, it looks as if a larger convolution kernel is sliding over the 2×2 input, with each convolution covering part of it; completing the picture:
  2. It turns out that this large kernel is the original kernel rotated by 180°. That is why, when visualizing transposed convolution, we rotate the kernel 180 degrees and then perform an ordinary convolution.
  * Since the input image is too small, zero padding is applied according to the kernel size.

2. Dilated convolution

Dilated convolution, also known as atrous or hole convolution, was proposed to address a problem in image segmentation. Like a traditional CNN, FCN first convolves and then pools the image, which reduces the feature-map size while enlarging the receptive field. But because semantic segmentation is a pixel-wise prediction, the pooled feature map must be upsampled back to the original image size; the earlier pooling is what lets each predicted pixel see a large receptive field. So FCN-style segmentation hinges on two operations: pooling to shrink the feature map and enlarge the receptive field, and upsampling to restore the size. In this shrink-then-enlarge process some information is inevitably lost. Is there an operation that provides a larger receptive field and sees more context without pooling? The answer is dilated convolution.

Its main functions are:

  1. Increasing the receptive field: this can be verified with the receptive-field formula $r_l = r_{l-1} + (k_l - 1)\times j_{l-1}$, where $r_l$ is the receptive field of layer $l$, $k_l$ is the kernel size of layer $l$, and $j_{l-1}$ is the cumulative stride (jump) of the layers before layer $l$; the receptive field of the first layer is simply the kernel size of `Conv1`.
    Take the network structure in the following table as an example and compute the receptive field:
| No. | Layer | Kernel Size | Stride |
| --- | --- | --- | --- |
| 1 | Conv1 | 3×3 | 1 |
| 2 | Conv2 | 3×3 | 1 |
| 3 | Conv3 | 3×3 | 1 |

With an initial receptive field of $r_0 = 1$, the receptive field of each layer is computed as follows:
$r_0 = 1$
$r_1 = 1 + (3-1) = 3$
$r_2 = 3 + (3-1)\times 1 = 5$
$r_3 = 5 + (3-1)\times 1 = 7$

  2. Keeping the width and height of the input feature map: dilated convolution enlarges the receptive field without using Max Pooling layers that shrink the feature map (both points are checked in the code sketch below).
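A small sketch, under the assumptions above (3×3 kernels, stride 1), that walks through the receptive-field recursion and also verifies that a dilated convolution with padding equal to its dilation rate keeps the spatial size:

```python
import torch
import torch.nn as nn

def receptive_fields(layers):
    """layers: list of (kernel_size, stride). Returns r_l for each layer."""
    r, j = 1, 1                      # receptive field and cumulative stride (jump)
    fields = []
    for k, s in layers:
        r = r + (k - 1) * j          # r_l = r_{l-1} + (k_l - 1) * j_{l-1}
        j = j * s
        fields.append(r)
    return fields

print(receptive_fields([(3, 1), (3, 1), (3, 1)]))   # [3, 5, 7]

# a 3x3 convolution with dilation d and padding d leaves H and W unchanged
x = torch.randn(1, 64, 32, 32)
conv = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
print(conv(x).shape)                                # torch.Size([1, 64, 32, 32])
```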

Gridding effect:
When dilated convolutions are stacked, the gridding effect shown in the figure below often occurs.

  • (a) From left to right, three 3×3 convolutions are applied, all with a dilation rate fixed at 2. Looking at the receptive field after the three convolutions are stacked, its coverage is not continuous: only a discrete subset of the pixels actually contributes.
  • (b) The same three 3×3 convolutions, but with the dilation rates set to 1, 2, 3. The receptive field has the same size as in the first case but now covers all pixels within it.

Hybrid Dilated Convolution (HDC):
In "Understanding Convolution for Semantic Segmentation", a series of methods are proposed to avoid Gridding effect :

  1. Propose the calculation formula of the maximum distance between two non-zero elements in the convolution, and select the expansion factor through the calculation formula;
  2. The inflation factor is set to a sawtooth distribution;
  3. The common divisor of the expansion coefficient cannot be greater than 1;
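A minimal sketch of how a set of dilation rates could be checked against these rules. The max-distance recursion used here, $M_i = \max[M_{i+1} - 2r_i,\ 2r_i - M_{i+1},\ r_i]$ with $M_n = r_n$ and the design goal $M_2 \le K$, is my reading of the HDC paper and should be treated as an assumption rather than a verbatim reproduction:

```python
from functools import reduce
from math import gcd

def max_distance_m2(rates):
    """Compute M_2 via M_i = max(M_{i+1} - 2*r_i, 2*r_i - M_{i+1}, r_i), with M_n = r_n."""
    m = rates[-1]
    for r in reversed(rates[1:-1]):      # walk back from r_{n-1} down to r_2
        m = max(m - 2 * r, 2 * r - m, r)
    return m

def hdc_ok(rates, kernel_size=3):
    no_common_factor = reduce(gcd, rates) == 1          # rule 3
    return no_common_factor and max_distance_m2(rates) <= kernel_size

print(hdc_ok([1, 2, 5]))   # True:  gcd is 1 and M_2 = 2 <= 3
print(hdc_ok([2, 4, 8]))   # False: common factor 2, gridding remains
```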

3. FCN network

FCN (from the paper "Fully Convolutional Networks for Semantic Segmentation") is the first end-to-end fully convolutional network for pixel-level prediction; it replaces all fully connected layers with convolutional layers. It is a classic semantic segmentation network, simple and effective.

Network structure:

The FCN network structure is shown below. The last three fully connected layers of the VGG16 network are replaced with convolutional layers, so the output becomes a heat map; upsampling the heat map finally gives the segmentation result we need:

There are three specific variants, FCN-32s, FCN-16s and FCN-8s, which select different feature layers and upsample them by 32×, 16× and 8× respectively, as shown in the figure below:

(figure: FCN-32s, FCN-16s and FCN-8s network structures)


Convolutionalization

  • The original VGG network uses three fully connected layers at the end and finally outputs prediction scores for 1000 categories. After softmax we obtain the probability of each category; the 1000 values can be visualized as a histogram, where a higher bar means a higher probability.

  • In a traditional network the input image size is fixed (when there is no global pooling layer). If we replace the fully connected layers with convolutional layers, the input size is no longer restricted; this is what convolutionalization means. If the input is larger than 224×224, the output height and width become greater than 1, so the output is a 2D map that can be visualized as a heat map.

  • When the last three layers of VGG16 are fully connected layers, the 7×7×512 input feature map is flattened into 25088 values; since every input node is connected to every output node, `FC1` has $25088\times4096=102760448$ parameters (ignoring the bias).

  • When the last three layers of VGG16 are convolutional layers, after passing through Conv(7×7, s1, 4096) (padding adjustable), the convolution has $7\times7\times512\times4096=102760448$ parameters. The two parameter counts are the same, but the convolution preserves the height and width information, whereas flatten discards it.
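A quick sanity check of these parameter counts with PyTorch (bias disabled to match the calculation above):

```python
import torch.nn as nn

fc1 = nn.Linear(7 * 7 * 512, 4096, bias=False)                # fully connected version
conv_fc6 = nn.Conv2d(512, 4096, kernel_size=7, bias=False)    # convolutionalized version

print(sum(p.numel() for p in fc1.parameters()))       # 102760448
print(sum(p.numel() for p in conv_fc6.parameters()))  # 102760448
```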


Cross Entropy Loss
The cross-entropy loss is computed for every pixel, and the final loss is the average of the cross-entropy over all pixels.
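In PyTorch this per-pixel loss can be written with nn.CrossEntropyLoss, typically with ignore_index=255 to skip the VOC border pixels (a minimal sketch with illustrative shapes):

```python
import torch
import torch.nn as nn

num_classes = 21                                        # PASCAL VOC: 20 classes + background
logits = torch.randn(2, num_classes, 480, 480)          # network output, [N, C, H, W]
target = torch.randint(0, num_classes, (2, 480, 480))   # ground-truth mask, [N, H, W]
target[:, :10, :10] = 255                               # pretend these pixels are "ignore"

criterion = nn.CrossEntropyLoss(ignore_index=255)       # averages over the non-ignored pixels
loss = criterion(logits, target)
print(loss.item())
```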


1. FCN-32s

(figure: FCN-32s structure)

  • The VGG16 backbone corresponds to the part before the three fully connected layers; it downsamples the image by 32×, giving an output of size $\frac{h}{32} \times \frac{w}{32} \times 512$. Because `fc6` is a 7×7 convolution with padding = 3, it keeps the spatial size $\frac{h}{32} \times \frac{w}{32}$ (with 4096 channels); `fc7` is a 1×1 convolution with stride 1, so it does not change the spatial size either; `fc8` outputs $\frac{h}{32} \times \frac{w}{32} \times num\_cls$, where the channel count is determined by the number of classes. Finally, a transposed convolution upsamples by 32× and restores the original image size, $h \times w \times num\_cls$ (in this implementation the parameters of the upsampling layer are frozen and set by bilinear interpolation).
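A minimal sketch of an FCN-32s-style head on top of VGG16 features, with the 32× transposed convolution initialized as a bilinear kernel and frozen; the layer names and exact settings are illustrative assumptions, not the paper's reference code:

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Build a (channels, 1, k, k) bilinear upsampling kernel for a grouped ConvTranspose2d."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size).float()
    filt = 1 - torch.abs(og - center) / factor
    filt = filt[:, None] * filt[None, :]
    return filt.expand(channels, 1, kernel_size, kernel_size).clone()

num_cls = 21
head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7, padding=3), nn.ReLU(inplace=True),   # fc6
    nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True),             # fc7
    nn.Conv2d(4096, num_cls, kernel_size=1),                                  # fc8 / score layer
)
# 32x upsampling: grouped transposed conv, bilinear-initialized and frozen
up32 = nn.ConvTranspose2d(num_cls, num_cls, kernel_size=64, stride=32,
                          padding=16, groups=num_cls, bias=False)
up32.weight.data.copy_(bilinear_kernel(num_cls, 64))
up32.weight.requires_grad_(False)

feat = torch.randn(1, 512, 15, 15)     # VGG16 features for a 480x480 input (480 / 32 = 15)
print(up32(head(feat)).shape)          # torch.Size([1, 21, 480, 480])
```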

2. FCN-16s

(figure: FCN-16s structure)

  • The backbone, `fc6`, `fc7` and the final score `Conv2d` of FCN-16s are the same as in FCN-32s. The first transposed convolution, however, no longer upsamples by 32× but by 2×, giving an output of $\frac{h}{16} \times \frac{w}{16} \times num\_cls$. The output of `Max pooling4` in the VGG-16 network, of size $\frac{h}{16} \times \frac{w}{16} \times 512$, is passed through another convolution layer to obtain a $\frac{h}{16} \times \frac{w}{16} \times num\_cls$ map; the two maps are added element-wise (matrix addition), and the result is upsampled by 16× to obtain the $h \times w \times num\_cls$ segmentation map.

3. FCN-8s

(figure: FCN-8s structure)

  • On the basis of FCN-16s, the output of `Max pooling3` from VGG-16, of size $\frac{h}{8} \times \frac{w}{8} \times 256$, is passed through a convolution layer to obtain a $\frac{h}{8} \times \frac{w}{8} \times num\_cls$ map; this is added to the FCN-16s branch brought to 1/8 of the original resolution (its fused map upsampled by 2×), and the result is finally upsampled by 8× to obtain the $h \times w \times num\_cls$ segmentation map.
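A shape-level sketch of the FCN-8s fusion described above; the kernel sizes of the upsampling layers and the 1×1 score layers are assumptions for illustration (bilinear initialization is omitted for brevity):

```python
import torch
import torch.nn as nn

num_cls = 21
score_32 = torch.randn(1, num_cls, 15, 15)     # h/32 score map from the fc6-fc8 head
pool4 = torch.randn(1, 512, 30, 30)            # Max pooling4 output, h/16
pool3 = torch.randn(1, 256, 60, 60)            # Max pooling3 output, h/8

score_pool4 = nn.Conv2d(512, num_cls, 1)(pool4)     # 1x1 score layer on pool4
score_pool3 = nn.Conv2d(256, num_cls, 1)(pool3)     # 1x1 score layer on pool3
up2_a = nn.ConvTranspose2d(num_cls, num_cls, kernel_size=4, stride=2, padding=1)
up2_b = nn.ConvTranspose2d(num_cls, num_cls, kernel_size=4, stride=2, padding=1)
up8 = nn.ConvTranspose2d(num_cls, num_cls, kernel_size=16, stride=8, padding=4)

fuse16 = up2_a(score_32) + score_pool4   # h/16 fusion (this is the FCN-16s branch)
fuse8 = up2_b(fuse16) + score_pool3      # h/8 fusion
print(up8(fuse8).shape)                  # torch.Size([1, 21, 480, 480])
```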

4. DeepLab

DeepLab has three versions, V1, V2 and V3. Versions 1 and 2 are no longer mainstream and version 3 is the one commonly used, so we briefly introduce V1 and V2 and focus on V3.

1. DeepLab V1

The DeepLab v1 paper points out two difficulties: signal downsampling and spatial insensitivity.

  • Signal downsampling means that repeated downsampling in the network reduces the resolution of the feature maps; to address this, the authors propose dilated convolution;
  • Spatial insensitivity (spatial invariance) refers to the fact that classification networks are built to be insensitive to spatial transformations, while semantic segmentation needs spatial sensitivity: different views of the same object should yield correspondingly different, location-aware results.

The DeepLab v1 network is mainly an improved and upgraded VGG-16. The structure before the fully connected layers is similar, with three additional modules: the CRF module, the MSc module and the Large FOV module.

1.1 Dilated convolution

To counter the low resolution caused by downsampling, the authors propose dilated convolution to obtain a larger receptive field.

1.2 FC-CRF(Conditional Random Field)

Fully connected CRF. At the end, DeepLab V1 uses fully connected CRF post-processing to further refine the segmentation mask. The CRF is a probabilistic model that predicts a target pixel given the surrounding pixels, taking all pixels of the image as conditional input; by exploiting the correlation between pixels, it is useful for refining the boundaries of the segmentation mask.

1.3 MSc (Multi-Scale):

Multi-scale feature aggregation. The input image and the feature maps after each of the first four Max pooling layers are passed through two-layer multi-layer perceptrons and aggregated with the final output of the network. This module slightly improves the mean IoU but increases the number of parameters, so the authors do not recommend it.

1.4 LargeFOV (Large Field Of View):

Larger field of view. Like the FCN network, DeepLab converts the fully connected layers of VGG into convolutional layers, and the maximum downsampling rate is 8×. `fc6` originally used 4096 convolution kernels of size 7×7, which became the bottleneck of the network; the authors changed it to a 3×3 dilated convolution with a dilation rate of 12, which reduces the number of network parameters while keeping the mean IoU at 67.64% and speeding up training.

The DeepLab-LargeFOV structure is shown in the following table:

Differences from VGG16:

  1. The kernel size of the Maxpool layers is changed from 2×2 to 3×3;
  2. The strides of the last two Maxpool layers are changed to 1, so the maximum downsampling rate of the network is 8×;
  3. `Conv5` is changed to 3×3 dilated convolutions with a dilation rate of 2;
  4. `fc6` is changed to a 3×3 dilated convolution with a dilation rate of 12;
  5. The remaining fully convolutional layers are similar to FCN

Features:

  • The Max pooling layers use overlapping pooling (3×3 kernels)
  • The strides of the last two downsampling (pooling) layers are set to 1
  • Dilated convolution is used
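A rough sketch of the modified tail of the network as described above; the layer names, channel widths and input size are illustrative assumptions rather than the released configuration:

```python
import torch
import torch.nn as nn

num_cls = 21
pool4 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)   # stride 1 keeps the 1/8 resolution
conv5 = nn.Sequential(                                     # 3x3 dilated convs, dilation rate 2
    nn.Conv2d(512, 512, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
)
pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
fc6 = nn.Conv2d(512, 1024, 3, padding=12, dilation=12)     # LargeFOV: 3x3, dilation rate 12
fc7 = nn.Conv2d(1024, 1024, 1)
fc8 = nn.Conv2d(1024, num_cls, 1)

x = torch.randn(1, 512, 60, 60)     # 1/8-resolution VGG features, e.g. for a 480x480 input
y = fc8(fc7(fc6(pool5(conv5(pool4(x))))))
print(y.shape)                      # torch.Size([1, 21, 60, 60])
```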

2. DeepLab V2

Problems and solutions raised by DeepLab V2:

  1. Reduced resolution: the strides of the last few Max Pooling layers are set to 1 (so the resolution no longer shrinks) and dilated convolution is used;
  2. Objects at multiple scales: scaling the image to several scales, pushing each through the network and then fusing the results works, but the amount of computation is large; to address this, the ASPP module is proposed;
  3. The invariance of the DCNN reduces localization accuracy: similar to DeepLab V1, CRFs are used, but V2 uses a fully connected pairwise CRF, which is more efficient than the fully connected CRF in V1.
    In addition to the above, DeepLab V2 also replaces the backbone with ResNet, which makes the network faster and more accurate with a simpler model structure (a DCNN cascaded with a CRF).
2.1 ASPP (Atrous Spatial Pyramid Pooling):
ASPP adopts the idea of SPP: SPP pools the feature map at multiple scales and concatenates the results into a fixed-length feature vector, while ASPP applies dilated convolutions with different rates in parallel and fuses the resulting feature maps.

The comparison of DeepLab-LargeFOV and DeepLab-ASPP network structure is shown in the figure below:

2.2 Network structure

The input of DeepLab V2 is also processed at multiple scales (1.0, 0.75, 0.5) so that features from more scales are learned and aggregated. In addition, it uses ResNet as the basic feature-extraction network, which improves performance and convergence speed.


3. DeepLab V3

The differences between DeepLab V3 and DeepLab V2 are:

  1. The ASPP module is improved
  2. A cascaded module is introduced
  3. Multi-Grid is introduced
  4. The CRF is removed
3.1 Improved ASPP

(figure: the improved ASPP module in DeepLab V3)
Compared with the ASPP module in DeepLab V2, a global average pooling branch (Image Pooling) is added here, and a 1×1 convolutional layer (with BN and ReLU) is added afterwards for further fusion.
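A condensed sketch of such an ASPP head, assuming 256 output channels and dilation rates (12, 24, 36); torchvision ships a similar module, but the version below is a simplified illustration, not its exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(12, 24, 36)):
        super().__init__()
        def branch(k, d):                      # conv + BN + ReLU
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=0 if k == 1 else d,
                          dilation=d, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList([branch(1, 1)] + [branch(3, r) for r in rates])
        self.image_pool = nn.Sequential(       # the added global pooling branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.project = nn.Sequential(          # 1x1 conv (+ BN + ReLU) fusing all branches
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        feats.append(F.interpolate(self.image_pool(x), size=(h, w),
                                   mode='bilinear', align_corners=False))
        return self.project(torch.cat(feats, dim=1))

aspp = ASPP(2048).eval()                 # eval mode so BatchNorm accepts this toy batch of 1
x = torch.randn(1, 2048, 30, 30)         # e.g. ResNet features at 1/16 resolution
print(aspp(x).shape)                     # torch.Size([1, 256, 30, 30])
```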

3.2 Cascaded module

(figure: the cascaded module, Block1 to Block7)
Here Block1–Block3 are the blocks of the original ResNet, while Block4–Block7 have the same block structure as ResNet except that all their convolutions are dilated convolutions, and the dilation rate differs from layer to layer; this is the Multi-Grid structure. Note the parameter rate in the figure: the final dilation rate of each convolution layer is the product of this rate and the layer's own dilation coefficient. The rates of Block4–Block7 are 2, 4, 8, 16, increasing successively.

3.3 Multi-Grid

In V3 a new hyperparameter, Multi-Grid (MG), is introduced to adjust the dilation rates within a block. For example, with MG = {1, 2, 4} and a base rate of 2, the dilation rates of the three convolutional layers in the block are set to {2, 4, 8}.
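A tiny sketch of how the Multi-Grid coefficients combine with the base rate, using the example values above:

```python
def multi_grid_rates(base_rate, multi_grid=(1, 2, 4)):
    """Dilation rate of each 3x3 conv in a block = base rate x its Multi-Grid coefficient."""
    return [base_rate * g for g in multi_grid]

print(multi_grid_rates(2))   # [2, 4, 8]  -> e.g. Block4 with base rate 2
print(multi_grid_rates(4))   # [4, 8, 16] -> the next cascaded block with base rate 4
```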

3.4 Cancellation of CRFs

The latest segmentation networks are powerful enough that CRF post-processing no longer brings any performance improvement; moreover, it cannot be learned end-to-end and takes a long time to compute.


5. U-Net network

U-Net

U-Net was published at MICCAI in 2015; the original paper is titled "U-Net: Convolutional Networks for Biomedical Image Segmentation". U-Net is mainly used in the field of medical imaging.

The network structure can be analyzed as follows:

  1. U-Net is mainly composed of two parts: the downsampling half, used for feature extraction, is called the `contracting path`, while the upsampling half, which restores the resolution and fuses features, is called the `expansive path`;
  2. In the original U-Net the authors crop the feature maps to reduce the amount of computation, so the segmentation result output by the network is smaller than the original image; mainstream implementations today skip the crop and add padding instead, so that the input and output images have the same size;
  3. Upsampling is implemented with transposed convolution, as described above.
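A compact sketch of the two paths and the skip connection between them, following the padded "same size" convention of the mainstream implementations mentioned above (channel numbers are illustrative, and only one level of the U is shown):

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """Two 3x3 convolutions with padding=1 so the spatial size is kept (no cropping needed)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

enc1 = double_conv(3, 64)                                  # contracting path, level 1
down = nn.MaxPool2d(2)
enc2 = double_conv(64, 128)                                # contracting path, level 2
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)  # expansive path: 2x upsampling
dec1 = double_conv(128, 64)                                # after concatenating the skip connection
head = nn.Conv2d(64, 2, kernel_size=1)                     # per-pixel class scores

x = torch.randn(1, 3, 96, 96)
f1 = enc1(x)                                   # (1, 64, 96, 96)
f2 = enc2(down(f1))                            # (1, 128, 48, 48)
u = up(f2)                                     # (1, 64, 96, 96)
out = head(dec1(torch.cat([f1, u], dim=1)))    # skip connection by channel concatenation
print(out.shape)                               # torch.Size([1, 2, 96, 96])
```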

Reference articles

This article is based on notes from the video series by the Bilibili uploader Thunderbolt; the images belong to their original authors and will be removed upon request.
https://blog.csdn.net/tsyccnh/article/details/87357447
https://blog.csdn.net/qq_27586341/article/details/103131674
https://blog.csdn.net/u012862372/article/details/81045593
https://blog.csdn.net/LawGeorge/article/details/111655984
https://towardsdatascience.com/witnessing-the-progression-in-semantic-segmentation-deeplab-series-from-v1-to-v3-4f1dd0899e6e
https://zhuanlan.zhihu.com/p/75333140
