Understanding Dilated Convolution (dilated convolution, also called atrous convolution) in a simple way

If you review the past and learn the new, you can become a teacher!

1. Reference materials

Github repository: Multi-Scale Context Aggregation by Dilated Convolutions
Image source: Convolution arithmetic
Understanding dilated convolution
Dilated Convolution: dilated (expanded) convolution
Dilated convolution study notes

2. Introduction to Dilated Convolution

1. Introduction

1.1 Increase the receptive field

When a CNN is used for image processing, multiple convolution and pooling operations are usually required. Pooling reduces the image size, so a subsequent convolution kernel covers a larger region of the original image, which increases the receptive field; stacking several convolution layers also increases the receptive field. However, in image segmentation tasks such as FCN [3], the prediction is a pixel-wise output, so the smaller feature map obtained after pooling must be restored to the original image size by upsampling (for example, a transposed convolution such as Conv2DTranspose) before the prediction is made. As shown below:
[Figure: FCN pipeline, pooling shrinks the feature map, then upsampling restores it to the original size for pixel-wise prediction]

Therefore, FCN-style image segmentation has two key operations: pooling, which reduces the image size and increases the receptive field, and upsampling, which enlarges the image back. In this shrink-then-enlarge process, some detail information is lost. So can we design a new operation that has a larger receptive field and sees more information without any pooling operation? The answer is yes: Dilated Convolution.
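To make the shrink-then-enlarge process concrete, here is a minimal PyTorch sketch (not code from the FCN paper; the channel counts and sizes are arbitrary example values): pooling halves the feature map, and a transposed convolution (the PyTorch counterpart of Keras' Conv2DTranspose) restores the original size for pixel-wise prediction.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)                 # (batch, channels, H, W)

pooled = nn.MaxPool2d(kernel_size=2)(x)        # (1, 16, 32, 32): size shrinks, receptive field grows
restored = nn.ConvTranspose2d(16, 16, kernel_size=2, stride=2)(pooled)

print(pooled.shape, restored.shape)            # restored is (1, 16, 64, 64), but the detail
                                               # inside each pooled 2x2 block has been lost
```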

1.2 Problems with up-sampling and pooling layers

In image segmentation tasks, up-sampling and pooling layers are the best-known designs, but they have some fatal flaws. The main problems are:

  • Parameters are not learnable: up-sampling / pooling operations (e.g. bilinear interpolation) are deterministic.
  • Internal data structure is lost; spatial hierarchical information is lost.
  • Small object information cannot be reconstructed: if an object occupies about 4x4 pixels, its information cannot be reconstructed after 4 pooling operations. In other words, with 4 pooling layers, any object smaller than 2^4 = 16 pixels can, in theory, no longer be reconstructed.
  • Pooling is irreversible, so information is lost when the feature map is restored to the original image size by upsampling.

With these problems present, image segmentation was stuck in a bottleneck and its accuracy could not be significantly improved; the design of Dilated Convolution avoids these problems nicely.

2. The concept of Dilated Convolution

Dilated Convolution (also called atrous convolution or hole convolution) adds holes during the convolution process. As shown in the figure below, all three convolution kernels are 3x3.
[Figure: (a) standard convolution, (b) dilated convolution with dilation rate 2, (c) dilated convolution with dilation rate 4]

(a) The convolution in figure (a) is a standard convolution, i.e. dilation rate=1; the receptive field of the kernel is 3x3.
(b) Figure (b) shows a Dilated Convolution with dilation rate=2, i.e. one hole between sampled points. Each sampled point now has a 3x3 receptive field from the previous layer, and the receptive field of the whole kernel is 7x7.
(c) Figure (c) shows a Dilated Convolution with dilation rate=4, i.e. three holes between sampled points. Each sampled point has a 7x7 receptive field, and the receptive field of the whole kernel is 15x15.
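The receptive-field numbers in (a), (b) and (c) can be reproduced with a short helper. This is my own sketch, not code from [4]: the effective kernel size of a dilated kernel is k + (k - 1)(r - 1), and with stride 1 each layer adds (effective kernel size - 1) to the receptive field.

```python
def receptive_field(kernel_size, dilation_rates):
    """Receptive field of stacked stride-1 convolutions with the given dilation rates."""
    rf = 1
    for r in dilation_rates:
        k_eff = kernel_size + (kernel_size - 1) * (r - 1)   # effective kernel size
        rf += k_eff - 1
    return rf

print(receptive_field(3, [1]))        # 3  -> figure (a)
print(receptive_field(3, [1, 2]))     # 7  -> figure (b)
print(receptive_field(3, [1, 2, 4]))  # 15 -> figure (c)
```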

3. The dilation rate

Dilated Convolution injects holes into the standard convolution map to enlarge the receptive field (reception field). Compared with standard convolution, Dilated Convolution therefore has one extra hyper-parameter, the dilation rate, which specifies the spacing between the sampled points of the convolution kernel.

As shown in the figures below, both the standard convolution and the Dilated Convolution use a 3x3 kernel; the standard convolution has a dilation rate of 1, while the Dilated Convolution has a dilation rate of 2.
[Figure: 3x3 standard convolution vs. 3x3 dilated convolution]

Standard convolution, dilation rate=1, as shown in the figure below:
[Figure: standard convolution, dilation rate=1]

Dilated Convolution, dilation rate=2, as shown in the figure below:
[Figure: dilated convolution, dilation rate=2]
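As a minimal PyTorch sketch of the dilation rate hyper-parameter (the channel counts and input size are arbitrary example values): with a 3x3 kernel and padding set equal to the dilation rate, the output keeps the input's spatial size while the kernel samples points that are `dilation` pixels apart.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)

standard = nn.Conv2d(8, 8, kernel_size=3, padding=1, dilation=1)   # dilation rate = 1
dilated  = nn.Conv2d(8, 8, kernel_size=3, padding=2, dilation=2)   # dilation rate = 2

print(standard(x).shape)   # (1, 8, 32, 32)
print(dilated(x).shape)    # (1, 8, 32, 32): same size, larger receptive field
```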

4. The role of Dilated Convolution

Dilated Convolution replaces the traditional max-pooling and strided convolution: it can increase the receptive field while keeping the feature map the same size as the original image.

The advantage of Dilated Convolution is that it enlarges the receptive field without losing information through pooling, so each convolution output contains information from a larger range. It is well suited to problems where images require global context or where speech and text require long-range dependencies, such as image segmentation [3], speech synthesis with WaveNet [2], and machine translation with ByteNet [1].

5. The difference between Conv2DTranspose and Dilated Convolution

One use of Conv2DTranspose is upsampling to increase the image size. Dilated Convolution, on the other hand, does not perform upsampling; its purpose is to increase the receptive field.

Dilated Convolution does not pad blank pixels between existing pixels; rather, it skips some pixels of the input, or equivalently, keeps the input unchanged and inserts zero weights between the kernel parameters of the convolution, so that a single convolution covers a larger spatial range.
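The zero-insertion view can be checked numerically. This is a small sketch of my own, assuming a single-channel input: a 3x3 convolution with dilation rate 2 gives the same output as an ordinary convolution whose 5x5 kernel has zeros inserted between the original 3x3 weights.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 16, 16)
w = torch.randn(1, 1, 3, 3)              # the original 3x3 kernel

w_expanded = torch.zeros(1, 1, 5, 5)     # equivalent 5x5 kernel with zeros inserted
w_expanded[:, :, ::2, ::2] = w

out_dilated  = F.conv2d(x, w, dilation=2)
out_expanded = F.conv2d(x, w_expanded)
print(torch.allclose(out_dilated, out_expanded))   # True
```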

Of course, setting the stride of a standard convolution to a value greater than 1 also increases the receptive field, but a stride greater than 1 downsamples, so the image size becomes smaller.

6. Problems with Dilated Convolution

  • The Gridding Effect

    If we stack multiple identical Dilated Convolutions, many pixels in the receptive field are never used, leaving a large number of holes. The continuity and completeness of the data are then lost, which is bad for learning. The figure below shows the effect of applying the same Dilated Convolution three times in a row (kernel size 3x3, dilation rate=2); a small numerical illustration follows this list.
    [Figure: gridding effect of three stacked 3x3 dilated convolutions with dilation rate=2]

  • Long-range information might not be relevant

    Dilated Convolution is designed to capture long-range information, but some long-distance information is completely irrelevant to the current point and harms the consistency of the data. Moreover, using only a large dilation rate may give better segmentation for large objects while hurting small objects. Handling large and small objects at the same time is the key to designing a good network with Dilated Convolution.
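The gridding effect mentioned in the first point above can be illustrated with a few lines of NumPy (my own sketch, not from any paper): trace which input pixels contribute to a single output pixel after three stacked 3x3 convolutions, all with dilation rate 2. Many positions inside the 13x13 receptive field stay zero; these are the holes.

```python
import numpy as np

footprint = np.zeros((13, 13))
footprint[6, 6] = 1                        # start from one output pixel
for _ in range(3):                         # three identical dilated convolutions
    spread = np.zeros_like(footprint)
    for dy in (-2, 0, 2):                  # offsets sampled by a 3x3 kernel
        for dx in (-2, 0, 2):              # with dilation rate 2
            spread += np.roll(np.roll(footprint, dy, axis=0), dx, axis=1)
    footprint = spread

print((footprint > 0).astype(int))         # 1 = pixel used, 0 = hole (gridding)
```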

7. Hybrid Dilated Convolution (HDC)

Hybrid Dilated Convolution (HDC) is designed to alleviate the problems above. Its main ideas are:

  1. Different convolutional layers use different dilation rates. For a group of Dilated Convolutions, set different dilation rates that increase gradually; for example, three convolution kernels can use dilation rates of [1, 2, 4] respectively. In this way, the last layer has a relatively large receptive field without losing much local information. As shown below:
    [Figure: stacked dilated convolutions with dilation rates 1, 2, 4]

  2. Make sure the receptive field of the stacked Dilated Convolutions has no holes. Suppose there are n dilated convolution kernels with dilation rates [r1, r2, ..., rn]. If [r1, r2, ..., rn] satisfies the following condition, the receptive field will have no holes:

    M_i = max[ M_{i+1} - 2*r_i, M_{i+1} - 2*(M_{i+1} - r_i), r_i ], with M_n = r_n, and the design goal is M_2 <= K

    where M_i is the largest dilation rate that can be used at the i-th layer and K is the convolution kernel size. A small sketch of this check is given below.
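The condition above can be turned into a small helper function. This is my own sketch, not code from a published implementation: compute M_i backwards from M_n = r_n and require M_2 <= K.

```python
def hdc_no_holes(rates, K=3):
    """True if dilation rates [r1, ..., rn] leave no holes in the receptive field
    of stacked KxK convolutions, according to the condition above."""
    M = rates[-1]                                   # M_n = r_n
    for r in reversed(rates[1:-1]):                 # i = n-1, ..., 2
        M = max(M - 2 * r, M - 2 * (M - r), r)      # M_i from M_{i+1}
    return M <= K                                   # require M_2 <= K

print(hdc_no_holes([1, 2, 5]))   # True: no holes
print(hdc_no_holes([1, 2, 9]))   # False: holes appear in the receptive field
```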

3. Relevant experience

1. GrabAR

GrabAR: Occlusion-aware Grabbing Virtual Objects in AR
Attachment: Supplementary material, A. GrabAR-Net architecture details

4. References

[1] Kalchbrenner N, Espeholt L, Simonyan K, et al. Neural machine translation in linear time[J]. arXiv preprint arXiv:1610.10099, 2016.
[2] Oord A, Dieleman S, Zen H, et al. WaveNet: A generative model for raw audio[J]. arXiv preprint arXiv:1609.03499, 2016.
[3] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 3431-3440.
[4] Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions[J]. arXiv preprint arXiv:1511.07122, 2015.
