[Machine Learning] Dilated / Atrous Convolution in Detail

Table of contents

1. Origin of dilated convolution

2. Definition of dilated convolution

2.1 Receptive Field

2.2 Dilation rate

2.3 Examples

3. The characteristics of dilated convolution

3.1 Advantages

3.2 Disadvantages

3.3 Improvements


1. Origin of dilated convolution

        Dilated convolution, also called atrous convolution (hereinafter simply dilated convolution), was originally proposed to address problems in image segmentation. Early approaches stacked convolutional and pooling layers to enlarge the receptive field, but this also shrank the feature map (reduced its resolution), so upsampling was then needed to restore the image size.

        However, such stacked deep CNNs suffer from several problems, for example:

  1. The results of the pooling/upsampling layers (e.g., bilinear interpolation) are fixed and not learnable
  2. Internal data structure is lost, along with spatial hierarchy information
  3. Information about small objects is hard to reconstruct (with four consecutive pooling layers, any object smaller than 2^4 = 16 pixels is theoretically impossible to reconstruct)

        Against this background, semantic segmentation was stuck at a bottleneck, and dilated convolution, which enlarges the receptive field while keeping the size of the feature map unchanged, neatly avoids these problems.

        Of course, even without dilated convolution there is another way to compensate for the information lost by downsampling: skip connections. Typical networks such as FCN and U-Net (downsampling + upsampling + skip connections) use skip connections to restore information during upsampling.


2. Definition of dilated convolution

2.1 Receptive Field

Receptive field: the region of the input image that a point on a feature map can perceive

        The receptive field is the input region "perceived/seen" by a neuron in a neural network. In a CNN, the value of an element on a feature map is influenced by a region of the input image, and that region is the element's receptive field.

        In a CNN, the deeper a neuron is, the larger the input region it sees, i.e., the larger its receptive field. Stacking convolutional layers is therefore a common way to enlarge the receptive field.

        For example, the figure below applies two regular convolutions (kernel size=3×3, stride=1, padding=0) from left to right. The green area marks the region that each neuron in Layer 2 perceives in Layer 1; the yellow area marks the region that each neuron in Layer 3 perceives in Layer 2 and Layer 1.

        More specifically, each neuron in Layer 2 sees a 3×3 region of Layer 1; each neuron in Layer 3 sees a 3×3 region of Layer 2, which in turn corresponds to a 5×5 region of Layer 1.

Changes in the receptive field of a two-layer CNN

        For another example, the figure below shows that the receptive field obtained by two 3×3 convolutions (s=1, p=0) is equivalent to that of one 5×5 convolution (s=1, p=0), and the receptive field obtained by three 3×3 convolutions (s=1, p=0) is equivalent to that of one 7×7 convolution (s=1, p=0).

The receptive field obtained by two 3×3 convolutions (s=1, p=0) is equivalent to a 5×5 convolution (s=1, p=0)
The receptive field obtained by three 3×3 convolutions (s=1, p=0) is equivalent to one 7×7 convolution (s=1, p=0)

        Therefore, the receptive field is a relative concept: an element of a given layer's feature map sees regions of different sizes on different earlier layers. By default, "receptive field" refers to the region seen on the input image.
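        This equivalence can be checked numerically. Below is a small PyTorch sketch (assuming torch is available; it is not part of the original post): with stride 1 and no padding, an input exactly as large as the receptive field collapses to a single output pixel.

```python
import torch
import torch.nn as nn

# Two stacked 3x3 convolutions (stride=1, padding=0): a single output pixel
# "sees" a 5x5 input patch, so a 5x5 input collapses to a 1x1 output.
two_convs = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0),
    nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0),
)
print(two_convs(torch.randn(1, 1, 5, 5)).shape)    # torch.Size([1, 1, 1, 1])

# Three stacked 3x3 convolutions correspond to a 7x7 receptive field.
three_convs = nn.Sequential(*[nn.Conv2d(1, 1, 3) for _ in range(3)])
print(three_convs(torch.randn(1, 1, 7, 7)).shape)  # torch.Size([1, 1, 1, 1])
```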


2.2 Dilation rate

Three dilated convolutions with kernel size = 3, stride = 1 and dilation rates 1, 2 and 4

         As the previous section showed, stacking 3×3 convolutions can produce the same receptive field as a single 5×5 or 7×7 convolution. Dilated convolution goes further: it enlarges the receptive field without increasing the number of parameters (parameters = kernel weights + bias).

        Suppose a dilated convolution has kernel size k and dilation rate d. Then its equivalent kernel size k' is given by:

k' = k + (k-1) \times (d-1)

        Suppose the receptive field of layer i is RF_i and its equivalent kernel size is k'_i. Then the receptive field of the next (deeper) layer satisfies the recursion:

RF_{i+1} = RF_i + (k'_i - 1) \times S_i

        where S_i is the product of the strides of all layers before layer i, i.e., the spacing (in input pixels) between adjacent elements of the feature map that layer i reads:

S_i = \prod_{j=1}^{i-1} Stride_j = S_{i-1} \times Stride_{i-1}, \qquad S_1 = 1

        Note that the stride of layer i itself does not appear in the expression for RF_{i+1}; only the strides of earlier layers matter. The receptive field is also independent of padding.
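        The two formulas above translate directly into a few lines of Python. The helper below is our own illustration (names such as receptive_field are not from any library); it computes the equivalent kernel size and accumulates the receptive field layer by layer.

```python
def equivalent_kernel(k, d):
    """Equivalent kernel size of a k x k convolution with dilation rate d."""
    return k + (k - 1) * (d - 1)

def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation) tuples, input-first order.
    Returns the receptive field on the input image after the last layer."""
    rf = 1      # RF_1: a pixel of the input image sees only itself
    jump = 1    # S_i: product of the strides of all layers *before* layer i
    for k, s, d in layers:
        rf += (equivalent_kernel(k, d) - 1) * jump
        jump *= s
    return rf

# Two plain 3x3 convolutions (s=1, d=1) -> RF 5; three of them -> RF 7
print(receptive_field([(3, 1, 1)] * 2))  # 5
print(receptive_field([(3, 1, 1)] * 3))  # 7
```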

Standard Convolution with a 3 x 3 kernel (and padding)
Dilated Convolution with a 3 x 3 kernel and dilation rate 2

2.3 Examples

Three dilated convolutions with kernel size = 3, stride = 1 and dilation rates 1, 2 and 4

         Using the figure above again as an example, the receptive field is computed as follows:

        1. Input image :

RF_1 = 1

        2. After  the first layer of convolution (kernel size=3, stride=1, padding=0, dilation rate=1) :

k'_1 = k_1 + (k_1-1) \times (d_1 - 1) = 3 + (3-1) \times (1-1) = 3

S_1 = 1 \quad (\text{layer 1 has no preceding convolution layers})

RF_{2} = RF_1 + (k'_1 - 1) \times S_1 = 1 + (3 - 1) \times 1 = 3

         3. After  the second layer of convolution (kernel size=3, stride=1, padding=0, dilation rate=2) :

k'_2 = k_2 + (k_2-1) \times (d_2 - 1) = 3 + (3-1) \times (2-1) = 5

S_2 = \prod_{j=1}^{1}Stride_j = Stride_1 = 1

RF_{3} = RF_2 + (k'_2 - 1) \times S_2 = 3 + (5 - 1) \times 1 = 7

        4. After  the third layer of convolution (kernel size=3, stride=1, padding=0, dilation rate=4) :

k'_3 = k_3 + (k_3-1) \times (d_3 - 1) = 3 + (3-1) \times (4-1) = 9

S_3 = \prod_{j=1}^{2}Stride_j = Stride_1 \times Stride_2 = 1 \times 1 = 1

RF_{4} = RF_3 + (k'_3 - 1) \times S_3 = 7 + (9 - 1) \times 1 = 15
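        The same numbers can be verified empirically: with padding = 0 and stride = 1, the output shrinks to 1×1 exactly when the input is as large as the receptive field. A minimal PyTorch check of the three-layer example (assuming torch is installed):

```python
import torch
import torch.nn as nn

# The three layers from the example: kernel 3, stride 1, padding 0,
# dilation rates 1, 2 and 4.
net = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0, dilation=1),
    nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0, dilation=2),
    nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0, dilation=4),
)

# A 15x15 input (the computed receptive field) collapses to a single pixel,
# while a 16x16 input still leaves a 2x2 output.
print(net(torch.randn(1, 1, 15, 15)).shape)  # torch.Size([1, 1, 1, 1])
print(net(torch.randn(1, 1, 16, 16)).shape)  # torch.Size([1, 1, 2, 2])
```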


3. The characteristics of dilated convolution

3.1 Advantages

        On the one hand, dilated convolution can be used to enlarge the receptive field without losing resolution to downsampling (pooling or strided convolution). This is very useful in detection and segmentation tasks: a larger receptive field helps detect and segment large objects, while keeping the feature map at full resolution helps localize objects precisely.

        On the other hand, the dilation rate means that dilation rate - 1 zeros are inserted between adjacent elements of the convolution kernel. Setting different dilation rates therefore yields receptive fields of different sizes, i.e., multi-scale information, which is very important in vision tasks. As the ASPP module in DeepLab shows, dilated convolution can enlarge the receptive field arbitrarily without introducing extra parameters. However, because dilated convolution usually keeps the feature map at full resolution, it does not reduce the overall amount of computation.
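        As a concrete illustration (a sketch, not DeepLab's actual code), in PyTorch the dilation rate is simply the dilation argument of nn.Conv2d; with padding = dilation, a 3×3 convolution keeps the feature-map size for any rate, and the parameter count does not change:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
for d in (1, 2, 4):
    # padding = d keeps a 3x3 kernel "same"-sized for any dilation rate
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=d, dilation=d)
    n_params = sum(p.numel() for p in conv.parameters())
    print(d, conv(x).shape, n_params)
# dilation 1, 2, 4 -> output is always (1, 64, 32, 32); parameters stay at 36928
```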


3.2 Disadvantages

        Gridding Effect

        When 3 x 3 kernels with dilation rate = 2 are stacked multiple times, the following problem occurs:

         Because dilated convolution samples the input in a checkerboard-like pattern, the outputs of the current layer are computed from mutually independent subsets of the previous layer. Neighboring outputs therefore lack correlation, and local information is lost. This is fatal for pixel-level dense prediction tasks.

        Long-range information may not be relevant.

         Dilated convolution samples the input signal sparsely, so the information aggregated from far-apart locations may be uncorrelated, which degrades the results.
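        The gridding effect can be made visible with a short experiment (our own sketch, not the experiment from the paper): stack three dilation-2 convolutions with all-ones kernels and check which input pixels receive gradient from a single output pixel. The contributing pixels form a sparse grid rather than a solid patch.

```python
import torch
import torch.nn as nn

convs = nn.Sequential(*[
    nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2, bias=False)
    for _ in range(3)
])
for m in convs:
    nn.init.constant_(m.weight, 1.0)   # all-ones kernels: a pure "who contributes" test

x = torch.zeros(1, 1, 15, 15, requires_grad=True)
y = convs(x)
y[0, 0, 7, 7].backward()               # a single output pixel at the center

mask = (x.grad[0, 0] != 0).int()
print(mask)                             # non-zero entries form a sparse grid, not a solid block
print(int(mask.sum()), "of", 15 * 15, "input pixels actually contribute")
```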


3.3 Improvements

         Hybrid Dilated Convolution (HDC)

         Researchers at TuSimple proposed an HDC structure with the following properties:

  1. The dilation rates of the stacked convolutions must not share a common divisor greater than 1. For example, the combination [2, 4, 6] has a greatest common divisor of 2, so the gridding effect still appears
  2. Design the dilation rates as a zigzag (sawtooth) structure, e.g., repeating cycles such as [1, 2, 5, 1, 2, 5]
  3. The dilation rates must satisfy:

M_i = max[M_{i+1} - 2r_i, M_{i+1} - 2 (M_{i+1} - r_i), r_i]

        Here r_i is the dilation rate of layer i, and M_i is the maximum distance between two non-zero values reachable from layer i; assuming n layers in total, M_n = r_n by definition. For a k \times k kernel, the design goal is M_2 \le k, so that the stacked layers leave no holes and cover the receptive field at least as densely as a standard convolution (dilation rate = 1). A simple example: kernel size = 3 × 3 with dilation rates [1, 2, 5] is a feasible solution.
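        This condition is easy to check programmatically. The helper below is our own sketch (hdc_check is not from the HDC paper's code); note that M_{i+1} - 2(M_{i+1} - r_i) simplifies to 2 r_i - M_{i+1}.

```python
def hdc_check(rates, k):
    """Compute M_2 for dilation rates [r_1, ..., r_n] and test the
    HDC design goal M_2 <= k, where k is the kernel size."""
    M = rates[-1]                    # M_n = r_n by definition
    for r in reversed(rates[1:-1]):  # apply the recursion down to M_2
        M = max(M - 2 * r, 2 * r - M, r)
    return M, M <= k

print(hdc_check([1, 2, 5], 3))   # (2, True)  -> feasible combination
print(hdc_check([2, 4, 6], 3))   # (4, False) -> gridding effect remains
```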

        Such a sawtooth arrangement of dilation rates serves the segmentation of large and small objects at the same time: small dilation rates attend to local, short-range information, while large dilation rates capture more global, long-range information.

        Comparative experiments show that a properly designed set of dilation rates effectively avoids the gridding effect.

        Atrous Spatial Pyramid Pooling (ASPP)

        When segmenting objects at multiple scales, several approaches are commonly used; ASPP, described below, is one of them.

        However, relying on a single branch of dilated convolutions to extract multi-scale objects is awkward. For example, an HDC configuration tuned to capture a large (near) vehicle is no longer suitable for a small (far) vehicle, and re-running small-dilation convolutions just to capture the small vehicle would be highly redundant.

        ASPP applies dilated convolutions with different dilation rates in parallel to extract multi-scale information. Each scale is an independent branch; the branches are merged at the end of the module and passed through a convolutional layer to produce the output. This effectively avoids gathering redundant information in the encoder and focuses directly on correlations between and within objects.
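        To make the structure concrete, here is a minimal ASPP-style module sketch; the dilation rates and channel sizes are illustrative and do not reproduce the exact DeepLab configuration.

```python
import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Minimal ASPP-style head: parallel dilated convolutions with different
    rates over the same feature map, concatenated and fused by a 1x1 conv."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # every branch keeps the spatial size (padding == dilation for a 3x3 kernel)
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

aspp = SimpleASPP(256, 64)
print(aspp(torch.randn(1, 256, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```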


