If you review the past and learn the new, you can become a teacher!
1. Reference materials
- GitHub repository: Multi-Scale Context Aggregation by Dilated Convolutions
- Image source: Convolution arithmetic
- Understanding dilated convolution
- Dilated Convolution (atrous / expanded convolution)
- Dilated convolution study notes
2. Dilated Convolution
Introduction to dilated convolution
1 Introduction
1.1 Increase the receptive field
When a CNN is used for image processing, multiple convolution and pooling operations are usually required. Pooling reduces the image size so that a subsequent convolution kernel covers a larger receptive field; stacking several convolutional layers also increases the receptive field. However, in image segmentation tasks such as FCN [3], the prediction is a pixel-wise output, so the feature map that has been shrunk by pooling must be restored to the original image size through upsampling (e.g. Conv2DTranspose, transposed convolution) before the prediction is made. As shown below:
Therefore, there are two key operations in FCN-style image segmentation: pooling, which reduces the image size and enlarges the receptive field, and upsampling, which enlarges the image size again. In this shrink-then-enlarge process, some detailed information is lost. Can we design a new operation that obtains a larger receptive field, and therefore sees more context, without any pooling? The answer is yes: Dilated Convolution.
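The shrink-then-upsample pattern described above can be sketched with a minimal Keras model. The layer sizes, channel counts, and number of classes below are illustrative only, not taken from the FCN paper:

```python
import tensorflow as tf

# Illustrative encoder-decoder for pixel-wise prediction: pooling shrinks
# the feature map, Conv2DTranspose restores the original spatial size.
# All sizes and channel counts here are made up for the sketch.
inputs = tf.keras.Input(shape=(128, 128, 3))
x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.MaxPooling2D(2)(x)                                     # 128 -> 64
x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.MaxPooling2D(2)(x)                                     # 64 -> 32
x = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same")(x)   # 32 -> 64
x = tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same")(x)   # 64 -> 128
outputs = tf.keras.layers.Conv2D(21, 1, activation="softmax")(x)           # per-pixel classes
model = tf.keras.Model(inputs, outputs)
model.summary()
```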
1.2 Problems with up-sampling and pooling layers
The best-known designs for image segmentation rely on up-sampling and pooling layers, but these designs have some fatal flaws. The main problems are:
- Parameters are not learnable: up-sampling and pooling operations (e.g. bilinear interpolation) are deterministic and have no trainable parameters.
- The internal data structure is lost, and so is the spatial hierarchical information.
- Small-object information cannot be reconstructed: if an object occupies only 4x4 pixels, its information cannot be recovered after 4 pooling operations. In other words, with 4 pooling layers, any object smaller than 2^4 = 16 pixels can, in theory, no longer be reconstructed.
- Pooling is irreversible: when the feature map is restored to the original size by upsampling, information has already been lost.
With such problems present, image segmentation accuracy remained in a bottleneck and could not be improved significantly, and the design of Dilated Convolution avoids these problems well.
2. The concept of Dilated Convolution
Dilated Convolution, also called atrous or expanded convolution, inserts holes into the kernel during the convolution process. As shown in the figure below, all three convolution kernels are 3x3.
(a) The convolution in the figure is a standard convolution, i.e. dilation rate = 1; the receptive field of the kernel is 3x3.
(b) The figure shows a Dilated Convolution with dilation rate = 2, i.e. one hole between adjacent kernel elements. Each point sampled by the kernel already has a 3x3 receptive field, so the receptive field of the whole kernel is 7x7.
(c) The figure shows a Dilated Convolution with dilation rate = 4, i.e. three holes between adjacent kernel elements. Each point sampled by the kernel already has a 7x7 receptive field, so the receptive field of the whole kernel is 15x15.
3. dilation rate
Dilated Convolution injects holes into the standard convolution kernel in order to enlarge the receptive field of the convolution map. Compared with standard convolution, Dilated Convolution therefore has one additional hyper-parameter, the dilation rate, which specifies the spacing between the elements of the convolution kernel.
As shown in the figure below, both the standard convolution and the Dilated Convolution use a 3x3 kernel; the standard convolution has a dilation rate of 1, while the Dilated Convolution has a dilation rate of 2.
Standard convolution, dilation rate = 1, as shown in the figure below:
Dilated Convolution, dilation rate = 2, as shown in the figure below:
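In frameworks such as Keras this hyper-parameter is exposed directly. A small sketch (layer sizes are illustrative) shows that changing only dilation_rate leaves both the output shape and the parameter count unchanged:

```python
import tensorflow as tf

# The only difference between the two layers is the dilation_rate
# hyper-parameter; the kernel stays 3x3, so the parameter counts are equal.
standard = tf.keras.layers.Conv2D(16, 3, dilation_rate=1, padding="same")
dilated = tf.keras.layers.Conv2D(16, 3, dilation_rate=2, padding="same")

x = tf.random.normal((1, 32, 32, 3))
print(standard(x).shape, dilated(x).shape)               # both (1, 32, 32, 16)
print(standard.count_params(), dilated.count_params())   # both 448
```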
4. The role of Dilated Convolution
Dilated Convolution can replace the traditional max-pooling and strided convolution: it enlarges the receptive field while keeping the feature map at the original image size.
The advantage of Dilated Convolution is that it enlarges the receptive field without losing information through pooling, so every convolution output contains information from a larger range. It is well suited to problems where images need global context or where speech and text need long-range sequence dependencies, such as image segmentation [3], speech synthesis with WaveNet [2], and machine translation with ByteNet [1].
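As a rough illustration of the long-sequence case (a sketch only, not the actual WaveNet architecture from [2]), stacking 1D dilated causal convolutions grows the receptive field roughly exponentially with depth while keeping the sequence length unchanged:

```python
import tensorflow as tf

# Sketch: a stack of 1D dilated causal convolutions. Filter counts and the
# input length are illustrative; the point is that the sequence length is
# preserved while the receptive field grows with the dilation rates.
inputs = tf.keras.Input(shape=(16000, 1))            # e.g. 1 s of 16 kHz audio
x = inputs
for rate in [1, 2, 4, 8, 16]:
    x = tf.keras.layers.Conv1D(32, 2, dilation_rate=rate,
                               padding="causal", activation="relu")(x)
outputs = tf.keras.layers.Conv1D(1, 1)(x)
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)                            # (None, 16000, 1)
```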
5. The difference between Conv2DTranspose and Dilated Convolution
One use of Conv2DTranspose is upsampling, i.e. increasing the image size. Dilated Convolution, in contrast, does not perform upsampling at all; its purpose is to enlarge the receptive field.
Dilated Convolution does not pad blank pixels between existing pixels; instead it skips over some pixels of the input, or equivalently keeps the input unchanged and inserts zero weights into the convolution kernel, so that a single convolution covers a larger spatial range.
Of course, setting the stride of a standard convolution to a value greater than 1 also enlarges the receptive field, but a stride greater than 1 is a downsampling operation, so the image size becomes smaller.
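The difference is easy to see from the output shapes. The following sketch (shapes and channel counts are illustrative) compares a strided convolution, a dilated convolution, and a transposed convolution on the same input:

```python
import tensorflow as tf

x = tf.random.normal((1, 64, 64, 3))

# Strided convolution: larger receptive field, but downsamples (64 -> 32).
strided = tf.keras.layers.Conv2D(8, 3, strides=2, padding="same")(x)
# Dilated convolution: larger receptive field, spatial size unchanged (64).
dilated = tf.keras.layers.Conv2D(8, 3, dilation_rate=2, padding="same")(x)
# Transposed convolution: upsampling, spatial size grows (64 -> 128).
upsampled = tf.keras.layers.Conv2DTranspose(8, 3, strides=2, padding="same")(x)

print(strided.shape, dilated.shape, upsampled.shape)
# (1, 32, 32, 8) (1, 64, 64, 8) (1, 128, 128, 8)
```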
6. Existing problems of Dilated Convolution
- The Gridding Effect
If we stack several identical Dilated Convolutions, many pixels inside the receptive field are never used, which leaves a large number of holes. The local continuity and completeness of the data are then lost, which is harmful to learning. The figure below shows the effect of applying the same Dilated Convolution three times in a row (kernel size 3x3, dilation rate = 2); a small sketch that traces which input positions are actually used is given after this list.
- Long-range information might not be relevant
Dilated Convolution is designed to capture long-range information, but some of that long-distance information is completely irrelevant to the current point and harms the consistency of the data. Moreover, using only large dilation rates may give better segmentation of large objects while being harmful to small objects. How to handle large objects and small objects at the same time is the key to designing a good network with Dilated Convolution.
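A minimal 1D sketch of the gridding effect (my own illustration, not taken from the cited papers): it traces back which input positions contribute to a single output unit after stacking stride-1 dilated convolutions. Repeating the same dilation rate leaves unused positions, whereas gradually increasing rates cover the receptive field contiguously:

```python
import numpy as np

def used_positions(dilation_rates, kernel_size=3, size=25):
    """Trace (in 1D, as a simplification of the 3x3 case) which input
    positions contribute to the centre output unit after stacking
    stride-1 dilated convolutions with the given dilation rates."""
    used = np.zeros(size, dtype=bool)
    used[size // 2] = True                         # the output unit we trace back
    offsets = range(-(kernel_size // 2), kernel_size // 2 + 1)
    for rate in reversed(dilation_rates):
        expanded = np.zeros_like(used)
        for i in np.flatnonzero(used):
            for off in offsets:
                j = i + off * rate
                if 0 <= j < size:
                    expanded[j] = True
        used = expanded
    return used.astype(int)

print(used_positions([2, 2, 2]))   # only every other position is used: holes
print(used_positions([1, 2, 3]))   # contiguous coverage around the centre
```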
7. Hybrid Dilated Convolution (HDC)
Hybrid Dilated Convolution (HDC) was proposed to alleviate these problems.
- Different convolutional layers use different dilation rates. For a group of Dilated Convolutions, set different dilation rates and let them increase gradually; for example, three convolution kernels can use dilation rates of [1, 2, 4] respectively. In this way the last layer has a relatively large receptive field while not losing much local information. As shown below:
- Choose the dilation rates so that the receptive field of the stacked Dilated Convolutions contains no holes. Suppose there are n dilated convolution kernels with dilation rates [r1, r2, ..., rn]. If [r1, r2, ..., rn] satisfy the following condition, there will be no holes in the receptive field:
M_i = max[ M_{i+1} - 2*r_i, M_{i+1} - 2*(M_{i+1} - r_i), r_i ], with M_n = r_n, and the design goal is M_2 <= K,
where M_i is the maximum dilation rate that layer i can use and K is the convolution kernel size.
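This condition can be checked with a short helper; the function below is a sketch that implements the recursion above for a given list of dilation rates:

```python
def hdc_max_gap(dilation_rates, kernel_size=3):
    """Evaluate the HDC condition for a group of stacked dilated convolutions:
    M_i = max(M_{i+1} - 2*r_i, M_{i+1} - 2*(M_{i+1} - r_i), r_i), with M_n = r_n.
    The design goal is M_2 <= kernel_size (no holes in the receptive field)."""
    M = dilation_rates[-1]                    # M_n = r_n
    for r in reversed(dilation_rates[1:-1]):  # compute M_{n-1}, ..., M_2
        M = max(M - 2 * r, M - 2 * (M - r), r)
    return M

print(hdc_max_gap([1, 2, 5]), hdc_max_gap([1, 2, 5]) <= 3)   # 2 True: no holes
print(hdc_max_gap([1, 2, 9]), hdc_max_gap([1, 2, 9]) <= 3)   # 5 False: holes remain
```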
3. Relevant experience
1. GrabAR
GrabAR: Occlusion-aware Grabbing Virtual Objects in AR
Attachment: Supplementary material, A. GrabAR-Net architecture details
4. References
[1] Kalchbrenner N, Espeholt L, Simonyan K, et al. Neural machine translation in linear time[J]. arXiv preprint arXiv:1610.10099, 2016.
[2] Oord A, Dieleman S, Zen H, et al. WaveNet: A generative model for raw audio[J]. arXiv preprint arXiv:1609.03499, 2016.
[3] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 3431-3440.
[4] Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions[J]. arXiv preprint arXiv:1511.07122, 2015.