My understanding of DeepLab V3 (based on V1 and V2)

I. Overview

1 Introduction

1.1 DeepLab v1


Innovation:

  1. Atrous convolution (dilated convolution);
    <Addresses the loss of detail caused by the signal being repeatedly down-sampled during encoding>

  2. Fully-connected Conditional Random Field (CRF).
    <Features extracted by conv layers are translation-invariant, which limits localization accuracy. The fully connected CRF is therefore introduced to improve the model's ability to capture structural information and produce finer segmentation.>

1.2 DeepLab v2

Differences from v1:

  1. Atrous Spatial Pyramid Pooling (ASPP)

  2. Replaced the VGG-16 backbone used in v1 with ResNet-101


ASPP addresses the problem that detection targets come in different sizes: applying dilated convolutions with several dilation rates to the same feature map resamples it effectively, building kernels with different receptive fields and capturing multi-scale object information.

1.3 DeepLab v3

Innovation:

Improved ASPP modules for v2:

  1. Added BN layer;

  2. Replaced the 3×3, dilation=24 atrous convolution in v2's ASPP with an ordinary 1×1 convolution, retaining the effective weight at the center of the filter (as the dilation rate increases, the number of filter weights that land on valid feature positions shrinks);

  3. Added global average pooling to better capture global information.
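The second change above follows from the geometry of dilated kernels. A tiny illustrative helper (the function name is mine, not from the paper) computes the span a dilated kernel covers:

```python
def effective_kernel_size(k, rate):
    """Span covered by a k-tap kernel with dilation `rate`:
    k + (k - 1) * (rate - 1) input positions."""
    return k + (k - 1) * (rate - 1)

for rate in (1, 6, 12, 18, 24):
    print(rate, effective_kernel_size(3, rate))
# 1 -> 3, 6 -> 13, 12 -> 25, 18 -> 37, 24 -> 49
```

At rate 24 a 3×3 kernel spans 49×49 positions, comparable to the whole feature map at small output strides, so most taps fall outside the valid region and only the center weight remains effective; hence the replacement with a 1×1 convolution.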


2. Overall structure

1. Pre-knowledge

1.1 Atrous (dilated) convolution
  • Atrous convolution is exactly what the name suggests: holes (zeros) are inserted between the taps of a
    standard convolution kernel, enlarging the receptive field without adding parameters.

  • Compared with ordinary convolution, dilated convolution has one extra hyper-parameter, the dilation
    rate: the spacing between kernel taps (an ordinary convolution has dilation rate 1).
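A minimal 1-D sketch in pure Python (the function name is my own) makes the "kernel with gaps" idea concrete: the taps stay the same, only their spacing changes:

```python
def dilated_conv1d(x, w, rate):
    """Valid-mode 1-D convolution with dilation `rate`:
    tap i of the kernel reads the input at offset i * rate."""
    k = len(w)
    span = (k - 1) * rate + 1  # effective receptive field of the kernel
    return [sum(w[i] * x[j + i * rate] for i in range(k))
            for j in range(len(x) - span + 1)]

x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 1, 1]
print(dilated_conv1d(x, w, 1))  # rate 1, ordinary conv -> [6, 9, 12, 15, 18]
print(dilated_conv1d(x, w, 2))  # rate 2, taps 2 apart  -> [9, 12, 15]
```

With rate 2 the same 3-tap kernel covers 5 input positions instead of 3: a larger receptive field at no extra parameter cost.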

The sliding of a dilated kernel is easiest to grasp from an animation (one is included in the original post).

1.2 Atrous Spatial Pyramid Pooling (ASPP)

As described in Section 1.2 of the overview above.

2. The structure proposed by Deeplab v3

DeepLab v3 is designed to solve the problem of capturing multi-scale context.


Fig2 in the paper draws several common methods of capturing multi-scale context:

(a) Image pyramid. The input image is rescaled to several resolutions, each scale is passed through the CNN to obtain a segmentation at that scale, and the results are finally fused back to the original resolution. DeepMedic uses a similar approach;

(b) Encoder-decoder. Structures such as FCN and U-Net;

(c) The cascaded (series) structure proposed in this paper;

(d) The DeepLab v3 (ASPP) structure proposed in this paper. The right side of the last two structures actually still needs an 8×/16× upsample, which is addressed in DeepLab v3+. Sec. 4.1 of the paper also notes that downsampling the ground truth loses details during backpropagation, so upsampling the feature map is the better choice.

DeepLab v3 comes in two forms, cascaded and parallel:

1.1 Cascaded ("series") structure

DeepLab v3 applies dilated convolutions in cascaded modules. Specifically, the last block of ResNet (block4 in the figure below) is duplicated and the copies are appended after it, each run with a larger dilation rate so that resolution is preserved.
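A rough PyTorch sketch of the idea (simplified residual blocks with my own names, not the paper's exact ResNet bottlenecks): copies of the last block are stacked with doubling dilation rates, so the receptive field grows while the spatial resolution stays fixed:

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Simplified residual block whose 3x3 conv runs at a given
    dilation rate; padding = rate keeps the spatial size unchanged."""
    def __init__(self, ch, rate):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=rate, dilation=rate, bias=False)
        self.bn = nn.BatchNorm2d(ch)

    def forward(self, x):
        return torch.relu(x + self.bn(self.conv(x)))

# Analogues of the extra blocks 5-7, with dilation doubling at each stage
cascade = nn.Sequential(*[DilatedBlock(256, r) for r in (2, 4, 8)])
y = cascade(torch.randn(1, 256, 33, 33))
print(y.shape)  # spatial size preserved: torch.Size([1, 256, 33, 33])
```

Because every block uses padding equal to its dilation rate, the output stride never increases; only the receptive field does.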

1.2 Improvement of ASPP

Because making the cascaded model deeper eventually degrades performance, the ASPP module is adopted instead.

Improved ASPP includes:

  1. One 1×1 convolution and three 3×3 atrous convolutions with rates = {6, 12, 18}; each branch has 256 filters and includes a BN layer (rates given for output_stride=16). This is the Atrous Spatial Pyramid Pooling shown in part (a) of the figure;

  2. Image-level features: global average pooling of the feature map, a convolution, then upsampling and fusion with the other branches (the Image Pooling branch in part (b) of the figure). The improved ASPP module is shown in the figure above.
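Putting the two items together, a hedged PyTorch sketch of the improved ASPP head (variable names are my own; the final 1×1 projection after concatenation follows common implementations):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Sketch of the DeepLab v3 ASPP head for output_stride=16:
    one 1x1 conv, three 3x3 atrous convs (rates 6/12/18), plus an
    image-level pooling branch; every branch has 256 filters and BN."""
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        def branch(k, r):
            pad = 0 if k == 1 else r  # padding = rate keeps spatial size
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList(
            [branch(1, 1)] + [branch(3, r) for r in rates])
        self.image_pool = nn.Sequential(          # image-level features
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.project = nn.Sequential(             # fuse the 5 branches
            nn.Conv2d(out_ch * 5, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

aspp = ASPP(2048).eval()  # eval(): BN on a 1x1 pooled map needs no batch stats
out = aspp(torch.randn(1, 2048, 33, 33))
print(out.shape)  # torch.Size([1, 256, 33, 33])
```

All four convolutional branches preserve the 33×33 spatial size; the pooled branch is bilinearly upsampled back to it before the five outputs are concatenated and projected to 256 channels.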

Subsequent experiments showed that combining the two structures brings no further improvement; the parallel ASPP structure works better, so "DeepLab v3" generally refers to the ASPP variant.

3. Summary

  1. To avoid the information loss caused by pooling, DeepLab v1 introduced atrous convolution, which enlarges the receptive field without increasing the number of parameters while preserving resolution. To further refine segmentation accuracy, a CRF (Conditional Random Field) is applied as post-processing.

  2. Building on this, DeepLab v2 added the multi-scale parallel ASPP module to segment objects of different sizes simultaneously.

  3. DeepLab v3 applies dilated convolutions in a cascaded module and improves the ASPP module.


Source: blog.csdn.net/m0_58770526/article/details/125873104