Paper Reading: DCNv1: Deformable Convolutional Networks

1. Paper Overview


While reading recent anchor-free object detection papers, I noticed that several of them, such as RepPoints and Region Proposal by Guided Anchoring, independently use deformable convolution layers (Deformable Convolution), so I went back to this paper. It turns out there are already two versions: DCNv2 has been out for over a year and performs better, while v1 mainly introduces the idea.

This paper proposes two modules: deformable convolution and deformable RoI pooling.
In a deformable convolution layer, the offsets are obtained by convolving the feature map (note: the offsets are produced by a regular convolution). The offset field has the same spatial size as the input feature map and 2N channels, where N is the number of sampling points in the kernel (e.g., N = 9 for a 3×3 kernel). In other words, one ordinary convolution yields N offsets for every location on the feature map; there are 2N channels because each offset has both an x and a y component. When convolving at a given location, the kernel's N sampling points are then shifted by these offsets. Deformable RoI pooling works analogously; details below.

With this, the convolution kernel is no longer a rigid rectangular grid; it can deform according to the object's content (concretely, the sampling locations become more flexible and move to wherever the features are). That is the theory, at least; whether the learned offsets truly achieve this during convolution is unclear, and the performance gain of v1 feels fairly small.
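The tensor layout described above can be sketched in a few lines of NumPy (a hypothetical illustration of the shapes only, not the paper's code; `H`, `W`, and `k` are made-up values):

```python
import numpy as np

# For a k x k kernel, an auxiliary plain convolution predicts 2*N offset
# channels (N = k*k): one (dx, dy) pair per kernel sampling point, at every
# spatial location of the feature map.
H, W = 8, 8                          # feature-map size (illustrative)
k = 3                                # kernel size
N = k * k                            # sampling points per kernel -> 9
offsets = np.zeros((2 * N, H, W))    # same H x W as the input feature map

# e.g. the (dx, dy) offset of kernel point n at spatial location (y, x):
n, y, x = 4, 3, 3
dx, dy = offsets[2 * n, y, x], offsets[2 * n + 1, y, x]
print(offsets.shape)  # (18, 8, 8): 2N = 18 channels for a 3x3 kernel
```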

In this work, we introduce two new modules that greatly
enhance CNNs’ capability of modeling geometric transformations.
The first is deformable convolution. It adds 2D offsets to the regular grid sampling locations in the standard convolution. It enables free form deformation of the sampling grid. (the sampling locations are freer)
It is illustrated in Figure 1 (below).
The offsets are learned from the preceding feature maps, via additional
convolutional layers. Thus, the deformation is conditioned
on the input features in a local, dense, and adaptive manner.
The second is deformable RoI pooling. It adds an offset
to each bin position in the regular bin partition of the previous RoI pooling [15, 7]. Similarly, the offsets are learned
from the preceding feature maps and the RoIs, enabling
adaptive part localization for objects with different shapes.

[Figure 1]

2. Previous Approaches to Geometric Deformation and Their Drawbacks

Common ways of handling deformation in traditional methods and in CNNs:

In general, there are two ways.
The first is to build the training datasets with sufficient desired variations. (augment the training set with deformed data) This is
usually realized by augmenting the existing data samples,
e.g., by affine transformation. Robust representations can
be learned from the data, but usually at the cost of expensive training and complex model parameters.
The second is to use transformation-invariant features and algorithms.
This category subsumes many well known techniques, such
as SIFT (scale invariant feature transform) [42] and sliding
window based object detection paradigm. (hand-craft some transformation-invariant features)

Drawbacks of the above approaches:

There are two drawbacks in above ways.
First, the geometric transformations are assumed fixed and known. Such
prior knowledge is used to augment the data, and design the
features and algorithms. This assumption prevents generalization to new tasks possessing unknown geometric transformations, which are not properly modeled. (对形变的数据增广手段少,泛化能力差)
Second, handcrafted design of invariant features and algorithms could be difficult or infeasible for overly complex transformations, even when they are known. (hand-crafted invariant features cannot handle complex deformations)

3. Why CNNs Cannot Handle Complex, Unknown Deformations

I quote this passage mainly because I find it a particularly good summary.

In short, CNNs are inherently limited to model large, unknown transformations. The limitation originates from the fixed geometric structures of CNN modules: a convolution unit samples the input feature map at fixed locations; a pooling layer reduces the spatial resolution at a fixed ratio; a RoI (region-of-interest) pooling layer separates a RoI into fixed spatial bins, etc. (fixed sampling locations, fixed pooling ratio, fixed RoI pooling bins) There lacks internal mechanisms to handle the geometric transformations. This causes noticeable problems.
For one example, the receptive field sizes of all activation units in the same CNN layer are the same. This is undesirable for high level CNN layers that encode the semantics over spatial locations. Because different locations may correspond to objects with different scales or deformation, adaptive determination of scales or receptive field sizes is desirable for visual recognition with fine localization, e.g., semantic segmentation using fully convolutional networks [41].
For another example, while object detection has seen significant and rapid progress [16, 52, 15, 47, 46, 40, 7] recently, all approaches still rely on the primitive bounding box based feature extraction. This is clearly sub-optimal, especially for non-rigid objects.

The passage above raises a point: different locations on the same layer's feature map may correspond to objects of different scales or deformations, yet that feature map is produced by one shared kernel, so in principle it cannot handle multiple deformations at once. DCN assigns every location on the feature map its own offsets, so when the shared kernel convolves at a location it first samples at the adapted positions, which gives it the capacity to handle deformation and scale variation.
My guess is that a plain CNN distinguishes object categories through different kernels, i.e., different categories respond on different channels of the feature map.

4. Mathematical Formulation of Deformable Convolution

$$\mathbf{y}\left(\mathbf{p}_{0}\right)=\sum_{\mathbf{p}_{n} \in \mathcal{R}} \mathbf{w}\left(\mathbf{p}_{n}\right) \cdot \mathbf{x}\left(\mathbf{p}_{0}+\mathbf{p}_{n}\right) \quad (1)$$

$$\mathbf{y}\left(\mathbf{p}_{0}\right)=\sum_{\mathbf{p}_{n} \in \mathcal{R}} \mathbf{w}\left(\mathbf{p}_{n}\right) \cdot \mathbf{x}\left(\mathbf{p}_{0}+\mathbf{p}_{n}+\Delta \mathbf{p}_{n}\right) \quad (2)$$

(Here $\mathcal{R}$ is the regular sampling grid, e.g., $\mathcal{R}=\{(-1,-1),(-1,0),\ldots,(1,1)\}$ for a 3×3 kernel.)

Eq. (1) is standard convolution; Eq. (2) is deformable convolution.
$\Delta \mathbf{p}_{n}$ is generally fractional, so bilinear interpolation is used to obtain the value at the offset sampling point. Moreover, $\Delta \mathbf{p}_{n}$ has no range constraint and may drift outside the region of interest; this is exactly the starting point of DCNv2.
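As a toy illustration of Eq. (2), here is a minimal NumPy sketch of deformable convolution at a single output location, using bilinear interpolation for the fractional offsets (an assumed simplification with one channel and a 3×3 kernel, not the paper's implementation; with all-zero offsets it reduces to the standard convolution of Eq. (1)):

```python
import numpy as np

def bilinear(x, py, px):
    """Bilinearly interpolate feature map x at fractional location (py, px)."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = py - y0, px - x0
    def at(yy, xx):  # treat locations outside the map as zero
        return x[yy, xx] if 0 <= yy < H and 0 <= xx < W else 0.0
    return ((1 - wy) * (1 - wx) * at(y0, x0) + (1 - wy) * wx * at(y0, x1)
            + wy * (1 - wx) * at(y1, x0) + wy * wx * at(y1, x1))

def deform_conv_at(x, w, p0, offsets):
    """Eq. (2) at one output location p0: sum_n w(pn) * x(p0 + pn + dpn)."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # R for 3x3
    out = 0.0
    for n, (dy, dx) in enumerate(grid):
        off_y, off_x = offsets[n]          # learned fractional offset dpn
        out += w[dy + 1, dx + 1] * bilinear(x, p0[0] + dy + off_y,
                                            p0[1] + dx + off_x)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0                  # averaging kernel for illustration
zero = [(0.0, 0.0)] * 9
print(deform_conv_at(x, w, (2, 2), zero))  # ~ 12.0: plain 3x3 averaging
```

Shifting every sampling point by the same fractional offset, e.g. `[(0.5, 0.5)] * 9`, moves the whole sampling grid smoothly; the learned offsets in DCN simply give each of the N points its own such shift.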

5. Mathematical Formulation of Deformable RoI Pooling

$$\mathbf{y}(i, j)=\sum_{\mathbf{p} \in \operatorname{bin}(i, j)} \mathbf{x}\left(\mathbf{p}_{0}+\mathbf{p}\right) / n_{i j} \quad (5)$$

$$\mathbf{y}(i, j)=\sum_{\mathbf{p} \in \operatorname{bin}(i, j)} \mathbf{x}\left(\mathbf{p}_{0}+\mathbf{p}+\Delta \mathbf{p}_{i j}\right) / n_{i j} \quad (6)$$

($n_{ij}$ is the number of pixels in bin $(i, j)$.)

Here $\mathbf{p}_0$ is the top-left corner of the RoI.
Eq. (5) is standard RoI pooling; Eq. (6) is DCN's RoI pooling, where each bin can have its own offset.

where:
$\Delta \mathbf{p}_{i j}=\gamma \cdot \Delta \hat{\mathbf{p}}_{i j} \circ(w, h)$

The paper says this makes offset learning invariant to RoI size: the predicted normalized offset $\Delta \hat{\mathbf{p}}_{i j}$ is scaled element-wise by the RoI's width and height $(w, h)$, with a predefined scalar $\gamma$ (set to 0.1 in the paper).
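Numerically, the scaling can be sketched as follows (the offset values and RoI size here are made up; γ = 0.1 follows the paper's setting):

```python
import numpy as np

# The fc branch predicts a normalized offset dp_hat; multiplying element-wise
# by the RoI size (w, h) and a small gamma makes the same learned dp_hat
# translate into a larger pixel shift for a larger RoI (scale invariance).
gamma = 0.1
dp_hat = np.array([0.2, -0.4])     # normalized (x, y) offset for one bin
w, h = 120.0, 60.0                 # RoI width and height in pixels
dp = gamma * dp_hat * np.array([w, h])
print(dp)  # [ 2.4 -2.4]: actual pixel offset applied to this bin
```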

6. The Adaptive Receptive Field in Deformable Convolution

[Figure 5]

When the deformable convolutions are stacked, the effect
of composited deformation is profound. This is exemplified
in Figure 5. The receptive field and the sampling locations
in the standard convolution are fixed all over the top feature
map (left). They are adaptively adjusted according to the
objects’ scale and shape in deformable convolution (right).

DCN can adaptively adjust the receptive field and the sampling locations.

7. Details of Aligned-Inception-ResNet

[Table 6]

First, the motivating question: why is alignment needed?

In the original Inception-ResNet [51] architecture, multiple layers of valid convolution/pooling are utilized, which brings feature alignment issues for dense prediction tasks. To remedy this issue, the network architecture is modified [20], called "Aligned-Inception-ResNet" and shown in Table 6.

The main differences between the aligned and the original (unaligned) networks:

There are two main differences between Aligned-Inception-ResNet
and the original Inception-ResNet [51].
Firstly, Aligned-Inception-ResNet does not have the feature alignment problem, by proper padding in convolutional and pooling layers.
Secondly, Aligned-Inception-ResNet consists of repetitive modules, whose design is simpler than the original Inception-ResNet architectures.
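As a rough arithmetic sketch (my own illustration, not from the paper) of why valid convolutions cause misalignment for dense prediction while "same" padding does not:

```python
# With no padding ("valid"), output index i is centered on input coordinate
# i + (k-1)//2, so the feature grid drifts relative to the image after every
# layer; with "same" padding p = (k-1)//2, output i stays centered on input i.
def out_size(n, k, p, s=1):
    """Output spatial size of a conv with kernel k, padding p, stride s."""
    return (n + 2 * p - k) // s + 1

def center_of(i, k, p, s=1):
    """Input coordinate at the center of output index i's receptive field."""
    return i * s - p + (k - 1) // 2

n, k = 224, 3
print(out_size(n, k, p=0), center_of(0, k, p=0))  # 222 1 -> grid shifted
print(out_size(n, k, p=1), center_of(0, k, p=1))  # 224 0 -> aligned
```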

References

1. Object Detection Paper Reading: Deformable Convolutional Networks

2. Deformable Convolution v1, v2 Summary



Reposted from blog.csdn.net/j879159541/article/details/102683842