DCN-v2 for Deep Learning


This paper was published at CVPR 2019. It presents DCN-v2, an upgraded version of the Deformable Convolution Network (DCN), which improves the adaptability and flexibility of convolutional networks through two changes: the perception range is enlarged by stacking multiple deformable convolution layers, and a modulation mechanism gives the network more freedom in selecting sampling regions, acting as a gate that applies attention to each sampling point.

Abstract

The problem:
DCN has become a popular replacement for ordinary CNN layers in recent years because it adapts to geometric variation in images while keeping performance stable. However, it exposes a major defect: after the learned offsets are applied, the new sampling positions can drift outside the ideal sampling region, so some of the deformable convolution's sampling points may land on content unrelated to the object.

The solution:

  1. A modulation module is introduced to control the magnitude of the offsets, alleviating the dilemma above in which the offset range is either too small or too large. This module closely resembles the modulation branch of the temporally adaptive network in the super-resolution method Robust-LTD: both train an attention weight that steers the new sampling points toward the region of interest.
  2. In addition, to strengthen the model's expressiveness, the authors stack multiple DCN layers in DCN-v2 and verify the gain experimentally.
  3. DCN-v2 plays a vital role in alignment for video super-resolution. Since its publication in 2019, methods such as EDVR and BasicVSR++ have used DCN-v2 in place of the original DCN.

Efficacy of DCN-v2

  1. In terms of expressiveness, stacking modulated DCN blocks yields a stronger model than stacking plain DCN blocks, and both are far better than stacking ordinary CNN layers, thanks to DCN's robustness to spatial transformations and its enlarged sampling range.
  2. With the modulation module, deformable convolution also controls its sampling range more precisely.

1 Introduction

On top of DCN, the authors add two innovations, a modulation module and the stacking of multiple modulated DCN blocks, forming the upgraded DCN-v2.
① Modulation module:
Besides the offset $\Delta p$, the network also learns a modulation scalar $\Delta m$, which further and more reasonably constrains how far each new sampling point contributes. We denote a single modulated DCN as mDCN (modulated DCN).

② Stacking multiple modulated DCNs:
Stacking several mDCN blocks increases the achievable offset range; at the same time, the cascade also corrects and refines the offsets, further strengthening robustness to spatial variation in a coarse-to-fine fashion.

2 Focus point

2.1 Stacking More Deformable Conv Layers

The idea here is simple: not only replace ordinary CNN layers with DCN layers, but stack several of them to strengthen the model's robustness to spatial transformations.
The expressiveness gain produced by this cascade can be verified directly from the experiments; see Section 3 for details.

2.2 Modulated Deformable Modules

Understanding this part requires familiarity with the original DCN. Before reading about the modulation mechanism, readers unfamiliar with DCN should first review the earlier post on DCN, so that the symbols and principles below are clear.


For convenience, we again assume a $3 \times 3$ deformable convolution kernel. Let $p_k$ and $w_k$, with $k \in \{1, \cdots, K\}$, denote the positions and the corresponding weights of the kernel's sampling points, where $p_k \in \{(-1,-1), \cdots, (0,0), \cdots, (1,1)\}$ and $K = 9$ is the number of sampling points in the kernel.

Let $p_0$ be a position on the output feature map. The DCN-v1 deformable convolution is

$$Y(p_0) = \sum^K_{k=1} w_k\, X(\underbrace{p_0 + p_k + \Delta p_k}_{p}).$$

Here $p = p_0 + p_k + \Delta p_k$ is the new sampling position to be convolved, generally a floating-point location; $X$ and $Y$ are the input and output feature maps respectively (of course $Y$ may also be an image); $\Delta p_k$ is the learned offset.
Note:

  1. Here we ignore the bias parameter in the convolution.
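The formula above can be sketched directly, computing $Y(p_0)$ for a single channel in pure NumPy; the helper names are illustrative, not from the paper.

```python
import numpy as np

def bilinear(X, y, x):
    """Sample X at the fractional location (y, x) with bilinear interpolation,
    returning 0 for neighbors outside the feature map."""
    H, W = X.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for yy, xx in [(y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)]:
        if 0 <= yy < H and 0 <= xx < W:
            val += X[yy, xx] * (1 - abs(y - yy)) * (1 - abs(x - xx))
    return val

def dcn_v1_point(X, w, offsets, p0):
    """Y(p0) = sum_k w_k * X(p0 + p_k + dp_k) for a 3x3 kernel (K = 9)."""
    grid = [(-1, -1), (-1, 0), (-1, 1),
            (0, -1),  (0, 0),  (0, 1),
            (1, -1),  (1, 0),  (1, 1)]
    return sum(
        w[k] * bilinear(X, p0[0] + pk[0] + offsets[k][0],
                           p0[1] + pk[1] + offsets[k][1])
        for k, pk in enumerate(grid)
    )
```

With all offsets set to zero, this reduces to an ordinary 3×3 correlation, which is a quick sanity check for the sampling logic.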

Next, we derive the modulated DCN, DCN-v2, from DCN-v1:

$$Y(p_0) = \sum^K_{k=1} w_k\, X(\underbrace{p_0 + p_k + \Delta p_k}_{p}) \cdot \Delta m_k.$$

  1. Like the offset, $\Delta m_k$ is learned by an extra CNN. As in the offset-learning process, the input is $X$ and the output is a feature map; the only difference is the channel count: the offset branch outputs $2K$ channels per input (a vertical and a horizontal component per sampling point), while the modulation branch outputs $K$ channels of modulation scalars.
  2. $\Delta m_k$ is a scalar with $\Delta m \in [0, 1]$, so the output of the modulation network must pass through a bounding function such as a sigmoid. Because $\Delta m$ combines with the preceding term multiplicatively and is squashed by such a threshold function, the modulation mechanism is essentially a form of gating.
  3. By using $\Delta m$ to down-weight poorly placed sampling points, the effective sampling concentrates on the region of interest, so modulation is also a kind of attention mechanism.
  4. $\Delta m$ exists precisely because $\Delta p$ is an unrestricted variable.
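The channel counts and the sigmoid bound from items 1 and 2 can be written out concretely (the raw values below are illustrative):

```python
import numpy as np

K = 3 * 3                 # sampling points in a 3x3 kernel
offset_channels = 2 * K   # one (vertical, horizontal) offset pair per point -> 18
modulation_channels = K   # one scalar Delta m_k per point -> 9

def sigmoid(z):
    # bounds the modulation branch's raw outputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

raw = np.array([-5.0, 0.0, 5.0])  # unbounded branch outputs (illustrative)
dm = sigmoid(raw)                 # gates near 0, at 0.5, and near 1
```

Unbounded offsets get an unbounded branch; the modulation gates, by contrast, must live in $[0, 1]$, which is exactly what the sigmoid enforces.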

Note:

  1. The deformation in deformable convolution happens in the sampling points on the input feature map; the convolution kernel itself is unchanged!
  2. The input $X$, the offsets, and the modulation scalars are feature maps with the same height and width.
  3. $w_k$ are the deformable convolution's weights, while the convolutions that learn the offsets and the modulation scalars are ordinary CNNs.
  4. As in DCN-v1, bilinear interpolation is required to obtain $X(p)$ at fractional positions, which completes the warp.

3 Experiments

To verify DCN-v2's performance improvement over DCN-v1 and conventional CNNs, the authors replaced the convolutional layers in stages 3 through 5 of the network with deformable convolutions and evaluated on the COCO 2017 dataset. The experimental results are as follows:
The conclusion of the experiment is as follows:

  1. Either version of DCN gains a certain amount of expressiveness over conventional CNNs.
  2. Even without mDCN, stacking DCN layers improves performance, showing that using multiple DCN layers effectively strengthens the model's expressiveness.
  3. Stacking mDCN blocks improves expressiveness more than stacking DCN-v1 blocks, confirming the effectiveness of the modulation mechanism.

The visualization results are as follows:

There are three scenes, with the green point marking the position being convolved: in the first scene the target is a small object, in the second a larger object, and in the third the background. For each convolution type, the rows show, from top to bottom, the sampling positions, the receptive field, and the error-bounded saliency regions.
Note:

  1. Error-bounded saliency regions come from visual saliency (the Visual Attention Mechanism, VA): when facing a scene, humans automatically attend to regions of interest and selectively ignore the rest, and those attended regions are called salient regions. As the original paper explains, an error-bounded saliency region is the minimal image region that can be kept without (meaningfully) changing the network's output: if only part of an area actually affects the output, the redundant part is deleted and what remains is the error-bounded saliency region. It amounts to simplifying the input down to the parts the network really uses.

Experimental results:

  1. Compared with conventional convolution, deformable convolution enlarges the sampling range; the sampling ranges of DCN-v2 and DCN-v1 are similar.
  2. For small objects in particular, DCN-v1's sampling range is too large: besides the small object we care about, it covers other content unrelated to it. Since the purpose of convolution is to exploit local spatial correlation for tasks such as classification, DCN-v1's excessive offsets hurt final performance, whereas DCN-v2 clearly reins in the offset magnitudes.
  3. The error-bounded saliency regions show that DCN-v2 concentrates on the areas that genuinely matter to network performance, thanks to the modulation mechanism being, in essence, an attention mechanism.

4 Conclusion

  1. The article proposes DCN-v2, an upgraded DCN, by adding two ingredients on top of DCN: the stacking of multiple DCN layers and a modulation mechanism.
  2. Roles of DCN-v2: ① improve the model's robustness to spatial transformations; ② control the offset range to avoid extracting irrelevant feature information; ③ enlarge the convolution's range to improve the model's expressiveness.


Origin blog.csdn.net/MR_kdcon/article/details/124552878