[Semantic Segmentation] DeepLab v1 network (semantic segmentation, signal downsampling, spatial insensitivity, LargeFOV, dilated convolution, MSc, Multi-Scale)

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs


  • DeepLab v1 is a convolutional neural network model for semantic segmentation. Its core ideas are to use dilated (atrous) convolution to enlarge the receptive field without losing feature resolution, and to refine the DCNN output with a fully connected CRF, so that the model incorporates broader contextual information and better understands the semantic content of images.
  • Published at ICLR 2015 (first released on arXiv at the end of 2014)

Abstract

Deep convolutional neural networks (DCNNs) have recently demonstrated state-of-the-art performance in high-level vision tasks such as image classification and object detection. This study combines the methods of DCNNs and probabilistic graphical models to solve pixel-level classification tasks (also known as "semantic image segmentation"). We found that the responses at the last layer of DCNNs are not sufficiently localized for accurate object segmentation. This is due to the very invariance properties that make DCNNs perform well in high-level tasks. We overcome this poor localization property of deep networks by combining the responses of the last DCNN layer with a fully connected conditional random field (CRF). Qualitatively, our "DeepLab" system is able to localize segmentation boundaries with an accuracy that exceeds previous methods. Quantitatively, on the PASCAL VOC-2012 semantic image segmentation task, our method achieves 71.6% IOU accuracy, setting a new state of the art. We show how to obtain these results efficiently: careful network re-purposing and a novel application of the "hole" algorithm from the wavelet community make it possible to densely compute neural network responses at 8 frames per second on a modern GPU.

DCNNs is the abbreviation of "Deep Convolutional Neural Networks", a deep learning model and an extended form of convolutional neural networks (CNNs). A deep convolutional neural network is a neural network composed of multiple convolutional layers together with other types of layers, and it is widely used in computer vision tasks such as image classification, object detection, and semantic segmentation. These networks are called "deep" because they typically consist of many layers, which allows the model to learn more complex, abstract feature representations: the network has a large number of parameters, learns higher-level features from the raw input, and extracts progressively higher-order features layer by layer, enabling it to model complex tasks effectively. The key components of a deep convolutional neural network are the convolutional layers, which extract features from the input through convolution operations; DCNNs usually also include pooling layers, fully connected layers, and activation functions to achieve dimensionality reduction, spatial downsampling, and nonlinear transformation of features. Trained on large-scale data with backpropagation, DCNNs learn features automatically and have achieved remarkable success in computer vision, surpassing traditional image processing methods on many vision tasks.

Simply put, DCNNs ↔ CNNs (the two terms are used interchangeably in this post).
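As a minimal illustration of these building blocks (a sketch only, with arbitrary layer sizes — this is not the DeepLab model), a tiny DCNN in PyTorch could look like this:

```python
import torch
import torch.nn as nn

# A deliberately tiny DCNN: convolutional layers extract features, pooling
# downsamples them, and a fully connected layer maps the result to class scores.
class TinyDCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                  # 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                  # 112 -> 56
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

logits = TinyDCNN()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 10])
```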

1. Problems existing in semantic segmentation tasks

There are two technical hurdles in the application of DCNNs to image labeling tasks: signal down-sampling, and spatial ‘insensitivity’ (invariance).

When applying DCNNs to image annotation tasks, there are two technical obstacles: ① signal downsampling and ② spatial "insensitivity" (invariance).

1.1 Signal downsampling

In DCNNs, signal downsampling is usually performed through pooling layers in order to shrink the feature maps and reduce the amount of subsequent computation. However, pooling lowers the spatial resolution of the feature maps and therefore loses some detailed information. In image annotation tasks, pixel-level detail is essential for accurate labeling, so signal downsampling can degrade the quality of the annotation.

The main point here is that downsampling reduces the spatial resolution of the feature maps (and thus of the prediction).
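A quick shape check makes this concrete (a sketch using PyTorch and VGG-style 2 × 2, stride-2 pooling; the channel count is arbitrary): after five pooling stages a 224 × 224 input shrinks to 7 × 7, i.e., 32× downsampling, so each remaining "pixel" has to summarize a 32 × 32 region of the original image.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 224, 224)            # a feature map at the input resolution
pool = nn.MaxPool2d(kernel_size=2, stride=2)

for i in range(5):                          # five VGG-style pooling stages
    x = pool(x)
    print(f"after pool{i + 1}: {tuple(x.shape[-2:])}")
# after pool1: (112, 112) ... after pool5: (7, 7)  -> 32x total downsampling
```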

1.2 Spatial "insensitivity" (invariance)

One reason DCNNs perform well in high-level vision tasks is that they have a certain degree of spatial invariance to translation, rotation, scaling, etc. However, for pixel-level labeling tasks (such as semantic segmentation or pixel-level classification), we hope that the network can accurately label each pixel, which requires the network to have high spatial sensitivity. However, the invariant property of DCNNs may cause some spatial information to be lost during the feature extraction process, making the network not sensitive enough for pixel-level annotation tasks.

CNNs owe their degree of spatial invariance to translation, rotation, and scaling to: ① convolutional layers; ② pooling layers; ③ weight sharing; ④ data augmentation.


Q : Why is it said that "the invariance characteristics of CNNs may lead to the loss of some spatial information during the feature extraction process"?
A : This is mainly due to the following reasons:

  1. Pooling operations: The pooling layers commonly used in CNNs (such as max pooling or average pooling) shrink the spatial size of the feature maps to reduce computation and enhance spatial invariance. However, this downsampling also discards some spatial information: when the feature map shrinks, fine spatial structure and position information in the original image may be blurred or lost, so fine-grained spatial information is partially sacrificed.

  2. Convolution kernel size: The kernels used in convolution are usually small and only attend to features within a local receptive field, which means larger spatial structures may be missed during feature extraction. Although stacking convolutional layers gradually enlarges the receptive field, a degree of locality remains.

  3. Weight sharing: Weight sharing strengthens the model's translation invariance, but it also discards some spatial information. Because the same kernel is applied across the entire image, the network produces the same response to a given feature wherever it appears, so information about where each feature is located is weakened.

1.3 Solution

In order to overcome these technical obstacles, some strategies can be adopted in the pixel-level annotation task, such as:

  • Avoid excessive signal downsampling : reduce the use of pooling layers appropriately, or use a smaller stride for pooling, so that more spatial information is retained.

  • Combined with upsampling techniques : Transposed convolution or other upsampling techniques can be used to restore the spatial resolution of feature maps to better handle pixel-level annotation tasks.

  • Combining multi-scale features : Multi-scale feature representations can be introduced into the network to capture information at different scales and improve the perception of targets of different sizes.

  • Use an appropriate loss function : For pixel-level annotation tasks, an appropriate loss function, such as cross-entropy loss or Dice loss, can be used to optimize the network and encourage more accurate pixel-level annotation results.

By comprehensively utilizing these strategies, DCNNs can achieve better performance in pixel-level annotation tasks and overcome technical obstacles such as signal downsampling and spatial "insensitivity."
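The first two strategies can be sketched in a few lines (illustrative only, not DeepLab's exact recipe): a stride-1 pooling layer keeps the spatial size, and a transposed convolution or bilinear interpolation brings a coarse score map back to the input resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat = torch.randn(1, 256, 28, 28)

# (1) Less aggressive downsampling: stride-1 pooling keeps the 28x28 resolution.
gentle_pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
print(gentle_pool(feat).shape)              # torch.Size([1, 256, 28, 28])

# (2) Upsampling a coarse per-class score map back to the image size, either with
#     a learned transposed convolution or with bilinear interpolation.
scores = torch.randn(1, 21, 28, 28)         # 21 = 20 PASCAL VOC classes + background
up_conv = nn.ConvTranspose2d(21, 21, kernel_size=16, stride=8, padding=4)
print(up_conv(scores).shape)                # torch.Size([1, 21, 224, 224])
print(F.interpolate(scores, scale_factor=8, mode="bilinear",
                    align_corners=False).shape)   # torch.Size([1, 21, 224, 224])
```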


In this paper, the solutions we mainly use are:

  1. 'atrous' (with holes) algorithm: i.e., dilated convolution (also called atrous convolution)
  2. fully-connected CRF(Conditional Random Field): Fully connected conditional random fields, used to post-process images to improve segmentation or annotation results. It is often used to refine and optimize the output of neural networks in image segmentation tasks.

Note❗️ : CRF was a very commonly used method in the field of semantic segmentation at the time (2014), but the DeepLab series no longer uses CRF after v3, so CRF does not require too much attention.

2. Advantages of DeepLab v1 network

Compared with some previous networks, the DeepLab v1 network proposed in this article has the following advantages:

  1. Faster
  2. Higher accuracy
  3. A relatively simple model

2.1 Faster

The paper attributes the speed-up to the use of dilated convolution; however, the fully connected CRF is still quite time-consuming, and network inference takes about 0.5 s.

2.2 Higher accuracy

[Figure: mean IoU comparison on PASCAL VOC 2012 between DeepLab variants and prior methods]

Where:

  • DeepLab: The semantic segmentation model proposed in this article
  • MSc: Multi-Scale, i.e., multi-scale prediction (fusing outputs from several feature scales)
  • CRF: Fully connected conditional random field, used to post-process images to improve segmentation or annotation results. It is often used to refine and optimize the output of neural networks in image segmentation tasks.
  • LargeFOV: Large field of view. Here it does not refer to a camera's field of view, but to the large receptive field DeepLab obtains by replacing the first fully connected layer with a dilated convolution, which lets the network take more of the scene into account when labeling each pixel.

It can be clearly seen from the figure that DeepLab v1 has improved the mean IoU indicator by about 7.2% compared to the previous best network (TTI-Zoomout-16).

2.3 The model is relatively simple

[Figure: DeepLab v1 pipeline — a DCNN score map cascaded with a fully connected CRF]

As can be seen from the figure, DeepLab v1 is mainly composed of DCNN and CRF cascades.

The DCNN here mainly refers to the Backbone of the classification network

3. Detailed explanation of network structure

DeepLab v1 uses VGG16 (from the Visual Geometry Group) as its Backbone, i.e., its main convolutional neural network architecture.

VGG16 contains 16 weight layers: 13 convolutional layers and 3 fully connected layers. The model was trained on the ImageNet dataset and performs well on image classification tasks.

DeepLab v1 uses a pre-trained VGG16 as the Backbone and builds a fully convolutional network on top of it for the semantic segmentation task. In DeepLab v1, the fully connected layers of VGG16 are converted into convolutional layers, and atrous convolution (Atrous Convolution) is used to enlarge the receptive field, so that the network can capture broader context in the image.

3.1 LargeFOV (Field of View, receptive field)

3.1.1 LargeFOV Overview

In DeepLab v1, LargeFOV (Field of View) refers to the operation of using atrous convolution (Atrous Convolution) to expand the receptive field.

In traditional convolutional neural networks, as the number of network layers increases, the receptive field also increases. However, as the receptive field increases, the computational and storage overhead of the network will also increase significantly. In order to increase the receptive field without adding additional computing and storage burdens, DeepLab v1 introduces the concept of LargeFOV (Field of View), which uses atrous convolution to increase the receptive field and help the network better understand the semantic information of the entire image.

For details on dilated (atrous) convolution, please see: Related knowledge and usage suggestions for dilated convolution (HDC principle)

By using the atrous convolution operation of LargeFOV, DeepLab v1 can achieve better performance in semantic segmentation tasks, and plays a positive role in identifying and segmenting objects and scenes in images.

The main purpose of LargeFOV proposed by the author of DeepLab v1 is to reduce the number of parameters of the model to speed up the model while ensuring that the mean IoU does not decrease.
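A short sketch of the idea (the sizes here are illustrative, not taken from the paper): a 3 × 3 convolution with dilation rate r covers an effective extent of 2r + 1 pixels while keeping exactly the same number of weights, so the receptive field grows without extra parameters and without lowering the output resolution.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 28, 28)

for r in (1, 2, 12):
    conv = nn.Conv2d(512, 512, kernel_size=3, dilation=r, padding=r)
    n_params = sum(p.numel() for p in conv.parameters())
    extent = 2 * r + 1                      # effective extent of a dilated 3x3 kernel
    print(f"dilation={r:2d}  output={tuple(conv(x).shape[-2:])}  "
          f"params={n_params}  effective extent={extent}x{extent}")
# All three variants produce the same 28x28 output and have the same parameter
# count; only the spatial extent each kernel "sees" changes.
```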

3.1.2 LargeFOV effect analysis

After converting the network to a fully convolutional one, the first fully connected layer has 4,096 filters of large 7×7 spatial size and becomes the computational bottleneck in our dense score map computation. We have addressed this practical problem by spatially subsampling (by simple decimation) the first FC layer to 4×4 (or 3×3) spatial size.

After converting the network to a fully convolutional one, the original first fully connected layer becomes a convolutional layer with a kernel size of 7 × 7 and 4,096 kernels (output channels). Used directly, this convolutional layer becomes the computational bottleneck. To solve this, the author spatially subsampled the layer, reducing the kernel size from 7 × 7 to 4 × 4 (or 3 × 3).

In plain words, the "downsampling" of the convolutional layer here simply means shrinking the convolution kernel, e.g., from the original kernel_size = (7, 7) to kernel_size = (4, 4) or kernel_size = (3, 3).

We can take a look at the effect of this conversion:

[Figure: effect of kernel size and input stride (dilation rate) on mean IoU, number of parameters, and training speed]

Note❗️:

  1. The convolutional layer that replaces the fully connected layer here is not an ordinary convolutional layer but a dilated convolution, with a dilation rate r that enlarges the receptive field.
  2. In the figure, input stride actually refers to the dilation rate r.

Let’s analyze them one by one:

  • DeepLab-CRF-7×7 : simply replaces the fully connected layer with a dilated convolution (7 × 7 kernel, dilation rate r = 4); the resulting metrics serve as the baseline;
  • DeepLab-CRF : downsamples the kernel parameters (the kernel size is reduced from 7 × 7 to 4 × 4). Because the kernel is smaller, the receptive field shrinks; the number of model parameters is roughly halved; mean IoU drops noticeably; training speed roughly doubles. The drop in mean IoU is not what the author wanted (as noted above, the purpose of LargeFOV is to keep mean IoU while speeding up the model);
  • DeepLab-CRF-4×4 : compared with the previous variant, the dilation rate is doubled; the receptive field returns to its original size; the parameter count stays halved; mean IoU quickly recovers to its original level; speed is unchanged. This shows that a large receptive field is very important for semantic segmentation;
  • DeepLab-CRF-LargeFOV : a smaller kernel with a larger dilation rate; the receptive field is unchanged; the parameters are reduced about 6×; mean IoU stays at the original level; speed is more than 3× higher. This shows that dilated convolution can enlarge the receptive field, cut parameters, and speed up the model with little impact on accuracy (a rough parameter count supporting these figures is sketched below).
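To see roughly where the "halved" and "about 6×" figures come from, here is a back-of-the-envelope count (my own arithmetic, ignoring biases and assuming the standard VGG-16 widths: conv5 outputs 512 channels, the convolutionalized FC2 keeps the same width as FC1, and the classifier has 21 outputs for PASCAL VOC):

```python
# Rough parameter counts for the different FC1-replacement heads.
VGG_CONV_PARAMS = 14_710_464          # weights of the 13 VGG-16 conv layers
NUM_CLASSES = 21                      # 20 PASCAL VOC classes + background

def head_params(kernel, fc1_channels):
    fc1 = kernel * kernel * 512 * fc1_channels   # convolutionalized FC1
    fc2 = fc1_channels * fc1_channels            # convolutionalized FC2 (1x1)
    cls = fc1_channels * NUM_CLASSES             # final 1x1 classifier
    return fc1 + fc2 + cls

for name, kernel, fc1_channels in [("7x7 kernel, 4096 filters", 7, 4096),
                                   ("4x4 kernel, 4096 filters", 4, 4096),
                                   ("3x3 kernel, 1024 filters (LargeFOV)", 3, 1024)]:
    total = VGG_CONV_PARAMS + head_params(kernel, fc1_channels)
    print(f"{name:38s} ~{total / 1e6:5.1f}M parameters")
# ~134.3M, ~65.1M, ~20.5M -> roughly halved, then roughly 6x smaller overall
```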

3.1.3 DeepLab v1-LargeFOV model architecture

PILIBALA WZ drew the DeepLab v1 model with LargeFOV added, as shown below.

Backbone is the same as FCN, still VGG-16


DeepLab-LargeFOV model architecture

After upsampling, the 224 × 224 × num_classes feature map is not the model's final output; it still has to pass through a Softmax layer to produce the final result.

The function of the Softmax layer is to convert each pixel's class predictions into class probabilities. It normalizes the num_classes predictions for each pixel so that every value falls between 0 and 1 and the probabilities over all classes sum to 1. For each pixel we can then read off the probability of every class and assign the pixel to the class with the highest probability. The final output is therefore a Softmax-processed map in which each pixel carries the probabilities of the num_classes categories.
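In code, this final step is simply a softmax over the channel (class) dimension followed by a per-pixel argmax (a sketch; the shapes assume the 224 × 224 case above):

```python
import torch

logits = torch.randn(1, 21, 224, 224)    # per-pixel class scores, 21 = num_classes
probs = torch.softmax(logits, dim=1)     # per-pixel class probabilities, summing to 1
print(probs.sum(dim=1).allclose(torch.ones(1, 224, 224)))   # True

pred = probs.argmax(dim=1)               # final label map: one class index per pixel
print(pred.shape)                        # torch.Size([1, 224, 224])
```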

LargeFOV essentially uses dilated convolution

Through analysis, we find that although the Backbone is VGG-16, the max pooling used is slightly different. In the VGG paper it is kernel=2, stride=2, but in DeepLab v1 it is kernel=3, stride=2, padding=1. The strides of the last two max pooling layers are then set to 1 (so the overall downsampling rate changes from the original 32 to 8). The last three 3 × 3 convolutional layers use dilated convolution with a dilation rate r = 2.

Then, regarding the convolutionalization of the fully connected layers: in the FCN network, the first fully connected layer (FC1) is converted directly into a convolutional layer with a 7 × 7 kernel and 4,096 kernels (ordinary convolution). In DeepLab v1, however, the author subsampled its parameters, ending up with a convolutional layer with a 3 × 3 kernel and 1,024 kernels (dilated convolution), which reduces both the number of parameters and the amount of computation (see Table 2 of the paper for details). For the second fully connected layer (FC2), the number of kernels is likewise reduced from 4,096 to 1,024 (ordinary convolution).

A dilation rate is also set for the convolutionalized FC1 (making it a dilated convolution): Section 3.1 of the paper says r = 4, but the Large FOV part of the Experimental Evaluation chapter uses r = 12, which corresponds to LargeFOV. The convolutionalized FC2 is an ordinary convolutional layer with a 1 × 1 kernel and 1,024 kernels. It is followed by another ordinary 1 × 1 convolutional layer whose number of kernels is num_classes (background included). Finally, 8× upsampling restores the output to the original image size.

Note❗️: Bilinear interpolation (Bilinear Interpolation) is used to perform the upsampling.

Bilinear interpolation is a commonly used image interpolation method that estimates the value of a target pixel from the values of the known surrounding pixels. During upsampling, bilinear interpolation computes the pixel value at each target position from the pixel values in the existing feature map, thereby enlarging the spatial size of the feature map.

Specifically, 8x upsampling means expanding the height and width of the feature map by 8x respectively. For each pixel at the target position, bilinear interpolation will consider the four nearest pixels around it and perform interpolation calculations based on distance weights. This can effectively restore the feature map to the size of the original input image, making the network's output and input consistent in spatial size.
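Putting the pieces of this section together, a simplified PyTorch sketch of the DeepLab v1-LargeFOV forward path might look as follows. It follows the description above but is not the official implementation: dropout, loading the pre-trained VGG-16 weights, and the exact padding choices are omitted or assumed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def vgg_block(in_ch, out_ch, n_convs, pool_stride, dilation=1):
    # n_convs 3x3 convolutions followed by the DeepLab-style 3x3 max pooling.
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3,
                             padding=dilation, dilation=dilation),
                   nn.ReLU(inplace=True)]
    layers += [nn.MaxPool2d(kernel_size=3, stride=pool_stride, padding=1)]
    return nn.Sequential(*layers)

class DeepLabV1LargeFOV(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        self.backbone = nn.Sequential(
            vgg_block(3,   64,  2, pool_stride=2),              # 1/2
            vgg_block(64,  128, 2, pool_stride=2),              # 1/4
            vgg_block(128, 256, 3, pool_stride=2),              # 1/8
            vgg_block(256, 512, 3, pool_stride=1),              # stays at 1/8
            vgg_block(512, 512, 3, pool_stride=1, dilation=2),  # dilated conv5, 1/8
        )
        self.head = nn.Sequential(
            nn.Conv2d(512, 1024, 3, padding=12, dilation=12),   # FC1 -> 3x3 dilated conv, r=12
            nn.ReLU(inplace=True),
            nn.Conv2d(1024, 1024, 1),                           # FC2 -> 1x1 conv
            nn.ReLU(inplace=True),
            nn.Conv2d(1024, num_classes, 1),                    # classifier, 1x1 conv
        )

    def forward(self, x):
        size = x.shape[-2:]
        x = self.head(self.backbone(x))                         # 1/8-resolution score map
        return F.interpolate(x, size=size, mode="bilinear",     # 8x bilinear upsampling
                             align_corners=False)

out = DeepLabV1LargeFOV()(torch.randn(1, 3, 224, 224))
print(out.shape)   # torch.Size([1, 21, 224, 224])
```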

3.2 MSc (Multi-Scale Prediction)

In fact, Multi-Scale Prediction is also mentioned in Section 4.3 of the paper; it fuses the outputs of multiple feature layers. This is how the paper describes the MSc (Multi-Scale) structure:

Specifically, we attach to the input image and the output of each of the first four max pooling layers a two-layer MLP (first layer: 128 3x3 convolutional filters, second layer: 128 1x1 convolutional filters) whose feature map is concatenated to the main network’s last layer feature map. The aggregate feature map fed into the softmax layer is thus enhanced by 5 * 128 = 640 channels.

Specifically, the author attaches a two-layer MLP (first layer: 128 convolution kernels of size 3 × 3; second layer: 128 convolution kernels of size 1 × 1) to the input image and to the outputs of each of the first four max pooling layers, and then concatenates their feature maps with the last-layer feature map of the main network. The aggregate feature map fed into the Softmax layer is therefore enhanced by 5 × 128 = 640 channels.

MLP is the abbreviation of Multilayer Perceptron, also known as Feedforward Neural Network. It is a common artificial neural network model used to solve various machine learning tasks, especially widely used in supervised learning.

That is, in addition to using the output on the previous main branch, DeepLab v1 also integrates the output from the original image scale and the first four Maxpool layers. For a more detailed structure, refer to the figure below.


DeepLab-LargeFOV-MSc model architecture

After upsampling, the 224 × 224 × num_classes feature map is not the model's final output; it still has to pass through a Softmax layer to produce the final result.
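A hedged sketch of the MSc branches (illustrative only: the exact way the branch outputs are spatially aligned with the main branch is not spelled out above, so this sketch simply resizes them with bilinear interpolation, and the main feature map is assumed here to be the 1,024-channel output of the convolutionalized FC2):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MScBranch(nn.Module):
    # The two-layer "MLP" attached to the input image / a pooling output:
    # 128 filters of size 3x3, then 128 filters of size 1x1.
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 1), nn.ReLU(inplace=True),
        )

    def forward(self, x, target_size):
        # Resize each branch output to the main branch's spatial size (simplification).
        return F.interpolate(self.net(x), size=target_size,
                             mode="bilinear", align_corners=False)

# Taps: the input image and the first four max pooling outputs (channels/sizes
# follow the modified VGG-16 backbone sketched earlier).
main_feat = torch.randn(1, 1024, 28, 28)                    # main branch, 1/8 resolution
taps = [torch.randn(1, c, s, s) for c, s in
        [(3, 224), (64, 112), (128, 56), (256, 28), (512, 28)]]
branches = [MScBranch(t.shape[1]) for t in taps]
fused = torch.cat([main_feat] + [b(t, main_feat.shape[-2:])
                                 for b, t in zip(branches, taps)], dim=1)
print(fused.shape)   # torch.Size([1, 1664, 28, 28])  -> 1024 + 5*128 = 1664 channels
```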

The paper says that using MSc brings an improvement of about 1.5 points, and using the fully connected CRF brings about 4 points. However, in the released code the author recommends the version without MSc, and some open-source implementations on GitHub also omit it. My guess is that MSc is not only time-consuming but also uses a lot of GPU memory.


Table 1: (a) Performance of our proposed model on the ‘val’ set of the PASCAL VOC 2012 dataset (trained on the augmented ‘train’ set). The best performance is achieved by simultaneously exploiting multi-scale features and a large field of view. (b) Performance comparison of our proposed model with other state-of-the-art methods on the 'test' set of the PASCAL VOC 2012 dataset (trained on the augmented 'trainval' set)

Knowledge sources

  1. https://www.bilibili.com/video/BV1SU4y1N7Ao
  2. https://blog.csdn.net/qq_37541097/article/details/121692445


Origin: blog.csdn.net/weixin_44878336/article/details/131961813