Deformable Convolutional Networks: Notes


Abstract
A CNN's ability to model geometric transformations is limited by its fixed structure. This paper proposes two new modules that enhance that ability: deformable convolution (DC) and deformable RoI pooling (DRP). Both augment the spatial sampling locations in the module with additional offsets and learn those offsets from the target task. Experiments show that deformable ConvNets (DCN) can replace their plain CNN counterparts in complex vision tasks such as object detection and semantic segmentation.
Contributions of the paper:
1. It proposes an adaptive sampling method;
2. It improves the performance of object detection and semantic segmentation models without significantly increasing the number of parameters or the amount of computation;
3. It can be easily integrated into CNN-based computer vision models.

1. Introduction

A deformable convolutional network consists of two modules: deformable convolution and deformable RoI pooling.

The convolution operation on a feature map in a CNN is three-dimensional: the spatial plane plus the channel dimension. Deformable convolution and deformable RoI pooling operate in the two-dimensional spatial domain. They change where the convolution samples on the plane, that is, the position of the receptive field, while the channel dimension is left unchanged. This improves feature extraction, as shown in the following figure.
[Figure]

2. Deformable Convolutional Networks

Deformable Convolution

A 2D convolution has two steps. The first step samples the feature map at the positions covered by the convolution kernel. The second step multiplies the sampled values by their weights and sums them.
The sampling points of a regular convolution lie on a fixed grid. For example, a 3×3 convolution kernel with a dilation of 1 is expressed as:
$$R = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$$
If $p_0$ is a point on the output feature map $y$, the convolution operation is defined as:
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n) \tag{1}$$
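The two steps of a regular convolution can be sketched in plain Python for a single output point. This is only an illustration under assumptions that are not in the paper: coordinates are stored as (row, column) pairs, positions outside the feature map count as zero, and the names `R`, `conv_at`, `x`, and `w` are chosen for this example.

```python
# Regular 3x3 sampling grid R with dilation 1, stored as (dy, dx) pairs.
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def conv_at(x, w, p0):
    """y(p0) = sum over p_n in R of w(p_n) * x(p0 + p_n)."""
    height, width = len(x), len(x[0])
    total = 0.0
    for n, (dy, dx) in enumerate(R):
        r, c = p0[0] + dy, p0[1] + dx          # step 1: sampling position
        if 0 <= r < height and 0 <= c < width:
            total += w[n] * x[r][c]            # step 2: weight and sum
    return total

x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
w = [1.0 / 9] * 9                  # averaging kernel, one weight per grid point
y_center = conv_at(x, w, (1, 1))   # mean of all nine values
```

With the averaging kernel, the output at the centre is the mean of the nine inputs, which makes the result easy to check by hand.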

Deformable convolution changes the first step of the convolution operation: it modifies the positions of the sampling points by adding a displacement offset to each of them.
The principle is as follows:
First, a convolution layer with the same kernel size and stride as the regular convolution is applied to the input feature map, producing an offset field with the same spatial size as the output feature map. The offset field has 2N channels: the 2 stands for the offsets of each point in the x and y directions, and N is the number of sampling points of the kernel in two-dimensional space, N = k×k. For each point on the output feature map, the sampling positions of its convolution are determined by the 2N offset values stored in the 2N channels at that point of the offset field. Once the sampling positions are determined, the output value at that point is the sum of the sampled values weighted by w.
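$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \tag{2}$$
In this formula the regular grid $R$ is augmented with the offsets $\{\Delta p_n \mid n = 1, \ldots, N\}$, so sampling takes place at the irregular positions $p_n + \Delta p_n$.

$\Delta p_n$ are the learned offsets, N in total (N = k×k).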

One more detail: $\Delta p_n$ is usually fractional, so $p = p_0 + p_n + \Delta p_n$ is fractional as well, and $x(p)$ has no value stored on the integer grid. It is computed by bilinear interpolation from the four integer points $q$ around $p$:

$$x(p) = \sum_q G(q, p) \cdot x(q) \tag{3}$$

Here $G$ is the two-dimensional interpolation kernel, which separates into two one-dimensional kernels:
$$G(q, p) = g(q_x, p_x) \cdot g(q_y, p_y) \tag{4}$$

where $g(a, b) = \max(0, 1 - |a - b|)$.

The above formula is bilinear interpolation. Simply put, the four points are divided into two groups and first linearly interpolated in the x direction to obtain two points, and then these two points are linearly interpolated in the y direction to obtain the final value.
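The interpolation can be written directly from the definition of g(a, b) above. The sketch below is a minimal illustration, assuming a feature map stored as a list of rows, (row, column) coordinates, and zero values outside the map; the function names are chosen for this example.

```python
import math

def g(a, b):
    """One-dimensional interpolation kernel: max(0, 1 - |a - b|)."""
    return max(0.0, 1.0 - abs(a - b))

def bilinear(x, p):
    """x(p) as a weighted sum over the four integer neighbours q of p."""
    py, px = p
    height, width = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < height and 0 <= qx < width:
                # G(q, p) is the product of the two one-dimensional kernels.
                value += g(qy, py) * g(qx, px) * x[qy][qx]
    return value

x = [[0, 10],
     [20, 30]]
center = bilinear(x, (0.5, 0.5))   # equal weight 0.25 on all four corners
```

Only the four neighbours are visited because the kernel is zero for every integer point farther than one pixel from p in either direction.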
Schematic diagram (image source: https://blog.csdn.net/u013010889/article/details/78803240):
[Figure: bilinear interpolation]
To emphasize: the convolution kernel that generates the sampling offsets has the same size and stride as the convolution kernel that generates the final feature map. Both act on the same input feature map, and the generated offset field has the same spatial size as the generated feature map. The two kernels are learned at the same time. Because the learned offsets lead to fractional coordinates, the gradients are back-propagated through the bilinear interpolation described above.
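Putting the pieces together, a deformable convolution at one output point samples at the grid positions plus the offsets and interpolates bilinearly. The sketch below is an illustration only: the offsets are supplied by hand rather than predicted by an offset convolution, coordinates are (row, column) pairs, points outside the map count as zero, and all names are chosen for this example.

```python
import math

# Regular 3x3 grid as (dy, dx) pairs.
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def bilinear(x, py, px):
    """Interpolate x at the fractional position (py, px)."""
    height, width = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < height and 0 <= qx < width:
                weight = max(0.0, 1 - abs(qy - py)) * max(0.0, 1 - abs(qx - px))
                value += weight * x[qy][qx]
    return value

def deform_conv_at(x, w, offsets, p0):
    """Weighted sum of x sampled at p0 + p_n + offsets[n] for each grid point."""
    total = 0.0
    for n, (dy, dx) in enumerate(R):
        oy, ox = offsets[n]
        total += w[n] * bilinear(x, p0[0] + dy + oy, p0[1] + dx + ox)
    return total

x = [[5 * r + c for c in range(5)] for r in range(5)]   # x[r][c] = 5r + c
w = [1.0 / 9] * 9
zero = [(0.0, 0.0)] * 9        # zero offsets reduce to a regular convolution
shifted = [(0.0, 0.5)] * 9     # hand-picked offsets: half a pixel to the right
y_plain = deform_conv_at(x, w, zero, (2, 2))
y_shift = deform_conv_at(x, w, shifted, (2, 2))
```

With zero offsets the result equals the regular convolution. Shifting every sample by half a pixel along the columns raises the average by 0.5 on this linear test map.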

Deformable RoI Pooling

RoI pooling converts an input rectangular region of arbitrary size into fixed-size features.

How RoI pooling works: first map the RoI onto the feature map. Given the input feature map x and an RoI of size w×h with top-left corner p0, RoI pooling divides the RoI into k×k bins. Each bin outputs one value through the pooling operation, so the final output is a k×k feature map y.
For the (i, j)-th bin (0 ≤ i, j < k):
$$y(i, j) = \sum_{p \in bin(i, j)} x(p_0 + p) / n_{ij} \tag{5}$$

where $n_{ij}$ is the number of pixels in bin(i, j). The coordinates $p_x$ and $p_y$ of a pixel $p$ in bin(i, j) span the following ranges:
$$\lfloor i \tfrac{w}{k} \rfloor \le p_x < \lceil (i+1) \tfrac{w}{k} \rceil, \qquad \lfloor j \tfrac{h}{k} \rfloor \le p_y < \lceil (j+1) \tfrac{h}{k} \rceil$$
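The bin layout and the per-bin average can be sketched as follows. The sketch assumes average pooling, a feature map stored as a list of rows, an RoI given by its top-left corner in (row, column) order, and bins indexed as (i, j) with i along the width and j along the height; the function name is hypothetical.

```python
import math

def roi_avg_pool(x, top_left, w, h, k):
    """Average-pool a w-by-h RoI of x into k*k bins, keyed by (i, j)."""
    top, left = top_left
    y = {}
    for i in range(k):          # bin index along the width
        for j in range(k):      # bin index along the height
            xs = range(math.floor(i * w / k), math.ceil((i + 1) * w / k))
            ys = range(math.floor(j * h / k), math.ceil((j + 1) * h / k))
            values = [x[top + py][left + px] for py in ys for px in xs]
            y[(i, j)] = sum(values) / len(values)   # len(values) is n_ij
    return y

x = [[4 * r + c for c in range(4)] for r in range(4)]   # 4x4 test map
pooled = roi_avg_pool(x, (0, 0), w=4, h=4, k=2)
```

For the 4×4 test map and k = 2, each bin covers a 2×2 block, so every output is the mean of four values.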

As in the deformable convolution of equation (2), offsets $\{\Delta p_{ij} \mid 0 \le i, j < k\}$ are added to the bin positions, with one offset shared by all pixels of a bin, so equation (5) becomes equation (6):
$$y(i, j) = \sum_{p \in bin(i, j)} x(p_0 + p + \Delta p_{ij}) / n_{ij} \tag{6}$$
$\Delta p_{ij}$ is usually fractional, so equation (6) is evaluated with the bilinear interpolation of equations (3) and (4).
Figure 3 shows how the offsets are obtained. First, RoI pooling generates the pooled feature map. A fully connected layer then produces the normalized offsets $\Delta \hat{p}_{ij}$ from this feature map. These are multiplied element-wise by the width and height of the RoI to obtain the $\Delta p_{ij}$ used in equation (6): $\Delta p_{ij} = \gamma \cdot \Delta \hat{p}_{ij} \circ (w, h)$. Here $\gamma$ is a predefined scalar that modulates the magnitude of the offsets, typically $\gamma = 0.1$. Normalizing the offsets makes offset learning invariant to the RoI size. The fully connected layer is learned through back-propagation, as shown in Appendix A.
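The scaling from normalized offsets to pixel offsets is a one-line computation. The sketch below assumes the normalized offsets are stored in a dictionary keyed by bin index, each as an (x, y) pair; the function name and the sample values are made up for illustration.

```python
def scale_offsets(norm_offsets, w, h, gamma=0.1):
    """Pixel offset per bin: gamma * normalized offset, scaled by (w, h)."""
    return {ij: (gamma * dx * w, gamma * dy * h)
            for ij, (dx, dy) in norm_offsets.items()}

norm = {(0, 0): (0.5, -1.0)}                 # hypothetical fc-layer output
small = scale_offsets(norm, w=40, h=20)      # about (2.0, -2.0) pixels
large = scale_offsets(norm, w=80, h=40)      # same RoI at twice the scale
```

Doubling the RoI size doubles the pixel offset while the normalized offset stays the same, which is the scale invariance mentioned above.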

[Figure 3: deformable RoI pooling]

Position-Sensitive (PS) RoI Pooling. As a variant of RoI pooling, position-sensitive RoI pooling also has a deformable form, namely deformable PS RoI pooling. Its architecture is shown below:
[Figure 4: deformable PS RoI pooling]
The lower branch is roughly the same as RoI pooling, except that each bin is pooled from its own score map: bin (i, j) corresponds to the (i, j)-th score map (the individual classes are not discussed separately here). In deformable PS RoI pooling, the only change to equation (6) is therefore that $x$ becomes $x_{i,j}$. Offset learning, however, is different. Deformable PS RoI pooling follows the fully convolutional idea. The upper branch in Figure 4 learns the offsets: the input feature map is passed through a convolution layer that outputs offset fields with 2×(C+1)×k×k channels and the same spatial size as the score maps. The offsets learned here are normalized offsets, and $\Delta p_{ij}$ is obtained from them by the same transformation as in deformable RoI pooling. C+1 is the number of categories plus background, 2 stands for the x and y dimensions, and k×k is the number of bins (parts). Each part of each class has its own offset channels, just as each part has its own score channel in the lower branch.
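The channel counts of the two branches follow directly from the description above. The short sketch below only does this arithmetic; the function name and the example values C = 20 and k = 7 are assumptions for illustration.

```python
def ps_roi_channels(num_classes, k):
    """Return (score map channels, offset field channels) for C classes."""
    score = (num_classes + 1) * k * k   # one score map per class and bin
    offset = 2 * score                  # an x and a y offset for each of them
    return score, offset

score_ch, offset_ch = ps_roi_channels(num_classes=20, k=7)
```

For 20 classes and a 7×7 grid this gives 1029 score channels and 2058 offset channels.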

Deformable ConvNets

The deformable convolution and deformable RoI pooling modules have the same input and output as their plain (non-deformable) versions, so they can readily replace the plain versions in existing CNNs. The conv and fc layers added for offset learning are initialized with zero weights. Their learning rate is set to β times that of the existing layers (β = 1 by default, and β = 0.01 for the fc layer in Faster R-CNN). They are trained by back-propagation through the bilinear interpolation. The resulting CNNs are called deformable ConvNets.
How are deformable ConvNets combined with existing SOTA CNN architectures? Note first that these architectures consist of two stages: in the first stage, a deep fully convolutional network extracts feature maps from the entire input image; in the second stage, a shallow task-specific network generates results from these feature maps. The two stages are detailed below.
Deformable Convolution for Feature Extraction. The paper uses two SOTA models for feature extraction, ResNet-101 and a modified version of Inception-ResNet. Both consist of several convolutional blocks, an average pooling layer, and a 1000-way fc layer for ImageNet classification. The average pooling and the 1000-way fc layer are removed, and a 1×1 convolution is added at the end to reduce the number of channels to 1024. As is common practice, the stride at the beginning of the last convolutional block is changed from 2 to 1, which reduces the effective stride of the last block from 32 to 16 and increases the resolution of the output feature map. To compensate, the dilation of all convolution kernels in the last block is changed from 1 to 2.
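The change in effective stride is simple arithmetic: it is the product of the per-stage strides. The sketch below assumes a backbone with five stride-2 stages, which is an assumption made for illustration and is not spelled out in the text above.

```python
from math import prod

def effective_stride(stage_strides):
    """Effective stride of a backbone: product of the per-stage strides."""
    return prod(stage_strides)

plain = [2, 2, 2, 2, 2]         # assumed: five stages, each with stride 2
modified = plain[:-1] + [1]     # last block: stride changed from 2 to 1
stride_before = effective_stride(plain)
stride_after = effective_stride(modified)
```

An effective stride of 16 instead of 32 doubles the output resolution along each spatial axis.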
Deformable convolution can optionally be applied to the last few convolutional layers. Experiments show that using 3 deformable convolution layers gives a good trade-off.
Segmentation and Detection Networks. As mentioned above, a task-specific network is built on the feature maps output by the feature extraction network. The paper uses several head networks for segmentation and object detection. For the Faster R-CNN and R-FCN detection heads, deformable RoI pooling and deformable position-sensitive RoI pooling replace their plain counterparts.

3. Understanding Deformable ConvNets

The idea of this work is to augment the spatial sampling positions of convolution and RoI pooling with offsets, and to learn the offsets from the target task.
When deformable convolutions are stacked, the effect of the deformation compounds. As Figure 5 shows, the receptive field and sampling positions of the standard convolution on the left are fixed, while those of the deformable convolution on the right change adaptively.
[Figure 5]
Figure 5 uses two convolutional layers for the illustration: (a) is the fixed receptive field of standard convolution, and (b) is the adaptive receptive field of deformable convolution.

3.1. In Context of Related Works

This part introduces the work related to the paper from several angles; the content refers to (13) Paper Reading | Target Detection DCN.
Spatial Transformer Networks (STN). STN is the pioneering work on learning spatial transformations from data in deep learning.
Active Convolution. This work also learns offsets for the sampling positions of convolution and updates them through back-propagation. The differences are: (1) it shares the offsets across all spatial locations; (2) the offsets are static model parameters, learned once per task or per training run.
Effective Receptive Field. This work shows that the pixels in a receptive field do not contribute equally. The effective region occupies only a small part of the receptive field, and the contributions of the pixels follow a Gaussian distribution. This motivates an adaptive sampling method.

Atrous Convolution (dilated convolution). Atrous convolution first appeared in semantic segmentation. Its purpose is to enlarge the receptive field during convolution, so that the original features of the image are retained deep in the network. Multi-scale context information can also be captured by using different dilation rates. The following figure is a schematic diagram of one-dimensional atrous convolution:
[Figure: one-dimensional atrous convolution, (a) standard convolution, (b) atrous convolution]
In the figure, standard convolution corresponds to sparse feature extraction and atrous convolution corresponds to dense feature extraction. Atrous convolution inserts zeros (holes) between the kernel weights to enlarge the receptive field of the output feature map. The receptive field in (b) is 5, and the receptive field in (a) is 3. Atrous convolution was used for semantic segmentation in DeepLab.
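A one-dimensional atrous convolution is short enough to write out. The sketch below is a minimal illustration, assuming no padding and stride 1; the function names are chosen for this example.

```python
def atrous_conv1d(x, w, rate=1):
    """Valid 1-D convolution whose kernel taps are spaced `rate` apart."""
    k = len(w)
    span = (k - 1) * rate + 1   # input positions covered by one output
    return [sum(w[n] * x[i + n * rate] for n in range(k))
            for i in range(len(x) - span + 1)]

def span_of(kernel_size, rate):
    """Receptive field of a single layer on its input."""
    return (kernel_size - 1) * rate + 1

x = [0, 1, 2, 3, 4, 5, 6]
w = [1, 1, 1]
standard = atrous_conv1d(x, w, rate=1)   # each output sees 3 inputs
atrous = atrous_conv1d(x, w, rate=2)     # each output sees a span of 5
```

The kernel has three weights in both cases. Only the spacing of the taps changes, which is why the receptive field grows from 3 to 5 without extra parameters.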

Deformable Part Models (DPM). The idea of deformable RoI pooling is similar to DPM. DPM is a traditional object detection method based on image gradient features. Its core idea is to model the target object as a combination of several parts. It cannot be trained end-to-end and requires a lot of prior knowledge, such as the choice of parts and their sizes.

DeepID-Net. Its idea is similar to deformable RoI pooling, but the implementation is more complicated. DeepID-Net is based on R-CNN and is difficult to integrate into object detection models for end-to-end training. The remaining part of the section covers related work such as spatial pyramid pooling, SIFT, and ORB.

4. Experiments

[Figures: experimental results]
The classic models improve after deformable convolution is applied.

5. Conclusion

The paper proposes an adaptive sampling method and, based on it, the deformable convolutional network (DCN). According to the authors, it is the first flexible and effective method for learning dense spatial transformations in CNNs for complex vision tasks. DCN can be combined with any CNN-based model and improves accuracy without adding many parameters or much computation.

Dilated Conv and Deconvolution

The paper also mentions dilated convolution, also known as atrous convolution; here is a short note.
What are the ways to enlarge the receptive field?
Pooling, or a convolution stride greater than 1, comes to mind first: both enlarge the receptive field of every pixel on the resulting feature map. However, both also reduce the resolution of the feature map, and sometimes we need a larger receptive field while keeping the resolution high. In semantic segmentation, for example, classification is done at the pixel level, and the final classification map has the same size as the input image. With the two methods above, the features must be downsampled and then upsampled, and precision is inevitably lost along the way.
So is there a way to enlarge the receptive field of each pixel on the output feature map without reducing the resolution? Yes: dilated convolution (dilated conv).

The following quotes Tan Xu's explanation of dilated convolution and deconvolution on Zhihu:
link: https://www.zhihu.com/question/54149221/answer/192025860
[Figure: (a) 1-dilated conv, (b) 2-dilated conv, (c) 4-dilated conv]

Panel (a) is a 3x3 1-dilated conv, which is the same as an ordinary convolution. Panel (b) is a 3x3 2-dilated conv. The kernel still has 3x3 weights, but there is a hole of 1 between neighbouring weights: within the image patch, only the 9 red points are convolved with the 3x3 kernel, and the remaining points are skipped. Equivalently, the kernel can be seen as a 5x5 kernel in which only the weights at the 9 red points are non-zero and all others are 0. Although the kernel has only 3x3 weights, the receptive field of this convolution grows to 7x7: the layer before the 2-dilated conv is a 1-dilated conv, so each red point is itself a 1-dilated output with a 3x3 receptive field, and the 1-dilated and 2-dilated layers together reach 7x7. Panel (c) is a 4-dilated conv. Placed after the 1-dilated and 2-dilated convs, it reaches a receptive field of 15x15. By comparison, three stacked ordinary 3x3 convolutions with stride 1 only reach a receptive field of (kernel-1)*layer+1 = 7, which grows linearly with the number of layers, whereas the receptive field of dilated convs grows exponentially. The advantage of dilation is that the receptive field is enlarged without the information loss of pooling, so each convolution output covers a larger range of information.
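The receptive field sizes quoted above can be checked with a short calculation. The sketch below assumes stride 1 in every layer and the same kernel size throughout; the function name is chosen for this example.

```python
def receptive_field(kernel, dilations):
    """Receptive field of stacked stride-1 convolutions with given dilations."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d   # each layer adds (kernel - 1) * dilation
    return rf

after_two = receptive_field(3, [1, 2])      # 1-dilated then 2-dilated
after_three = receptive_field(3, [1, 2, 4]) # followed by 4-dilated
ordinary = receptive_field(3, [1, 1, 1])    # three ordinary 3x3 convs
```

With dilations that double at every layer, the receptive field roughly doubles per layer, while ordinary convolutions add a constant 2 per layer.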

The difference between deconv and dilated conv:

For a detailed explanation of deconv, see "How to understand deconvolution networks in deep learning?". One use of deconv is upsampling, that is, increasing the image size. Dilated conv does not upsample; it enlarges the receptive field. Visually: for a standard k*k convolution with stride s, there are three cases:
(1) s>1: downsampling is performed during convolution, and the image is smaller after convolution;
(2) s=1: ordinary convolution with a stride of 1. For example, with padding=SAME in tensorflow, the input and output of the convolution have the same size;
(3) 0<s<1: fractionally strided convolution, which is equivalent to upsampling the image. For example, s=0.5 means padding one blank pixel between every two neighbouring pixels of the image and then convolving with stride 1, so the resulting feature map is about twice as large.
Dilated conv does not pad blank pixels between pixels. Instead, it skips some of the existing pixels or, equivalently, leaves the input unchanged and inserts zero weights into the conv kernel, which enlarges the field of view of a single convolution. Setting the stride of an ordinary convolution greater than 1 also enlarges the receptive field, but it downsamples, so the image becomes smaller. This shows how deconv, dilated conv, pooling/downsampling, and upsampling are related and how they differ.


Origin blog.csdn.net/yanghao201607030101/article/details/110309771