1. Paper overview


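This is another anchor-free object detection paper. Unlike CornerNet, which encodes a box by a pair of corner points, it encodes a box by its center point plus width and height. This encoding makes better use of the features inside the object for classification. The idea reminds me of DenseBox, a 2015 paper from Baidu whose box encoding is very similar: it also starts from a center location and adds the distances from that location to the two corner points (four coordinate values), so five values encode one box. CenterNet nominally encodes a box with three items (center point, w, h), but to improve accuracy the authors follow CornerNet and add an offset for the center point (two values, x and y), so in the end CenterNet also uses five values per box.
With a simple ResNet-18 plus up-convolution (transposed convolution) layers it runs at 142 FPS with 28.1% COCO AP. With DLA-34, a carefully designed keypoint detection network, it reaches 37.4% AP at 52 FPS. Finally, with the SOTA keypoint estimation network Hourglass-104 (and multi-scale testing), it reaches 45.1% COCO AP.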

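Extending the idea to other vision tasks: detect the object's center point, then regress the object's other properties from the features at and around that center, e.g. width and height in 2D detection, depth and orientation in 3D detection, and keypoints in human pose estimation (as offsets of the other keypoints relative to the center). The approach therefore extends easily to other tasks. The authors ran the corresponding experiments and report good results; the numbers are in the paper. The figure below shows the properties regressed for each detection task.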

[Figure: properties regressed for each detection task]

In this paper, we provide a much simpler and more efficient alternative. We represent objects by a single point at their bounding box center (see Figure 2). Other properties, such as object size, dimension, 3D extent, orientation, and pose are then regressed directly from image features at the center location. Object detection is then a standard keypoint estimation problem [3,39,60]. (That is, object detection becomes a keypoint detection task.) We simply feed the input image to a fully convolutional network [37, 40] that generates a heatmap. Peaks in this heatmap correspond to object centers. Image features at each peak predict the objects bounding box height and weight. (Here "weight" is presumably a typo for "width".) The model trains using standard dense supervised learning [39,60]. Inference is a single network forward-pass, without non-maximal suppression for post-processing.

[Note]: two papers named CenterNet appeared in 2019: "CenterNet: Keypoint Triplets for Object Detection" and "Objects as Points". These notes cover the latter.

2. How the center point differs from anchor-based detection

A center point can be seen as a single shape-agnostic anchor (see Figure 3). However, there are a few important differences.
First, our CenterNet assigns the “anchor” based solely on location, not box overlap [18]. We have no manual thresholds [18] for foreground and background classification. (No hand-set IoU threshold is needed.)
Second, we only have one positive “anchor” per object, and hence do not need Non-Maximum Suppression (NMS) [2]. We simply extract local peaks in the keypoint heatmap [4, 39]. (No NMS is needed.)
Third, CenterNet uses a larger output resolution (output stride of 4) compared to traditional object detectors [21, 22] (output stride of 16). This eliminates the need for multiple anchors [47]. (The output resolution is high, so multiple anchors are unnecessary.)
Compared with CornerNet, there is also no need to group keypoints.
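To make "extract local peaks in the keypoint heatmap" concrete, here is a minimal NumPy sketch. It is not the authors' code: the function name `extract_peaks`, the toy heatmap, and the 0.3 score threshold are assumptions for illustration. It keeps every location whose value is greater than or equal to its 8 neighbours and returns the top `k` of them, which is how I understand the paper's peak extraction.

```python
import numpy as np

def extract_peaks(heatmap, k=100):
    """Return (score, y, x) for up to k local maxima of a 2D heatmap.

    A location counts as a peak if its value is >= all 8 neighbours.
    """
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Collect the 9 shifted views of the 3x3 neighbourhood and take their maximum.
    windows = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    local_max = np.max(windows, axis=0)
    ys, xs = np.nonzero(heatmap >= local_max)
    scores = heatmap[ys, xs]
    order = np.argsort(-scores)[:k]          # highest scores first
    return [(float(scores[i]), int(ys[i]), int(xs[i])) for i in order]

# Toy single-class heatmap (values are made up).
heat = np.zeros((5, 5))
heat[1, 1] = 0.9
heat[1, 2] = 0.6   # adjacent to the stronger response, so not a peak
heat[3, 4] = 0.7

# Flat background also satisfies ">= neighbours", so filter by score afterwards.
peaks = [p for p in extract_peaks(heat) if p[0] > 0.3]
```

The weaker response next to the strong one is dropped without any box-overlap computation, which is the role NMS plays in anchor-based detectors.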

3. Network structures of Hourglass-104, ResNet, and DLA-34

[Figure: network structures of Hourglass-104, ResNet, and DLA-34]

4. Labeling positive and negative center-point samples, and the modified focal loss

[Note]: the backbone uses deformable convolution.

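The positive/negative assignment mainly follows CornerNet: a Gaussian kernel spreads a ring of soft "positives", with values between 0 and 1, around the single positive point on the feature map. Note that these points are placed on the downsampled feature map, so the positive center on the feature map deviates from the true center in the original image. This discretization error is where the offset comes from. See the figure below: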
[Figure: positive samples on the downsampled feature map and the resulting center offset]
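A minimal sketch of how such a target heatmap could be built, assuming an isotropic Gaussian around the low-resolution center and an element-wise maximum where Gaussians overlap (my reading of the paper). The helper name `draw_gaussian`, the example box, and the fixed `sigma` are assumptions; in the paper the standard deviation adapts to object size.

```python
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    """Splat a Gaussian centred at integer (cx, cy) onto heatmap, in place."""
    h, w = heatmap.shape
    cx, cy = center
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    # Overlapping Gaussians of the same class keep the element-wise maximum.
    np.maximum(heatmap, g, out=heatmap)
    return heatmap

R = 4                                   # output stride
box = (30.0, 50.0, 90.0, 130.0)         # x1, y1, x2, y2 in input pixels (made up)
p = ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)    # true centre
p_tilde = (int(p[0] // R), int(p[1] // R))            # centre on the low-res grid

heatmap = np.zeros((32, 32))            # one class channel for a 128x128 input
draw_gaussian(heatmap, p_tilde, sigma=2.0)   # fixed sigma, for illustration only
```

Here the true center is (60, 90), which is (15, 22.5) at stride 4, but the positive lands on cell (15, 22). The lost 0.5 in y is exactly what the offset branch has to recover.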
Modified focal loss (for a detailed walkthrough, see the article “扔掉anchor！真正的CenterNet”):

$L_{k}=\frac{-1}{N} \sum_{x y c}\left\{\begin{array}{ll}{\left(1-\hat{Y}_{x y c}\right)^{\alpha} \log \left(\hat{Y}_{x y c}\right)} & {\text { if } Y_{x y c}=1} \\ {\left(1-Y_{x y c}\right)^{\beta}\left(\hat{Y}_{x y c}\right)^{\alpha} \log \left(1-\hat{Y}_{x y c}\right)} & {\text { otherwise }}\end{array}\right.$
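The modified focal loss can be written out in a few lines of NumPy. This is a sketch under assumptions, not the reference implementation: the function name `center_focal_loss`, the clipping constant `eps`, and the toy arrays are mine; `alpha=2` and `beta=4` are the values I recall the paper using, and N is taken to be the number of positive locations.

```python
import numpy as np

def center_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Penalty-reduced focal loss over a predicted and a target heatmap."""
    pred = np.clip(pred, eps, 1.0 - eps)          # avoid log(0)
    pos = target == 1.0                           # exact object centres
    pos_term = ((1.0 - pred) ** alpha) * np.log(pred)
    # Negatives close to a centre (target near 1) are down-weighted by (1 - Y)^beta.
    neg_term = ((1.0 - target) ** beta) * (pred ** alpha) * np.log(1.0 - pred)
    total = pos_term[pos].sum() + neg_term[~pos].sum()
    n = max(int(pos.sum()), 1)                    # number of centres
    return -total / n

target = np.array([[1.0, 0.5], [0.0, 0.0]])       # one centre plus a soft neighbour
good = np.array([[0.9, 0.1], [0.1, 0.1]])         # confident and mostly right
bad = np.array([[0.1, 0.9], [0.9, 0.9]])          # confident and wrong
loss_good = center_focal_loss(good, target)
loss_bad = center_focal_loss(bad, target)
```

The soft neighbour with target 0.5 contributes only 0.5^4 = 1/16 of the weight a plain negative would, so predictions just next to a true center are penalised lightly.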

Offset loss:

$L_{o f f}=\frac{1}{N} \sum_{p}\left|\hat{O}_{\tilde{p}}-\left(\frac{p}{R}-\tilde{p}\right)\right|$

[Note]: all classes share a single offset prediction, i.e. the offset output has only two channels.
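As a worked example of the offset term, the sketch below computes the target p/R minus its floor for each center and takes the L1 distance to the prediction at that cell. The function name `offset_l1_loss`, the (2, H, W) layout, and the example center are assumptions for illustration.

```python
import numpy as np

def offset_l1_loss(pred_offset, centers, stride=4):
    """L1 offset loss, supervised only at object-centre cells.

    pred_offset: (2, H, W) array holding (x-offset, y-offset), shared by all classes.
    centers: list of true (x, y) centres in input-image pixels.
    """
    total = 0.0
    for x, y in centers:
        px, py = x / stride, y / stride           # p / R
        ix, iy = int(px), int(py)                 # floor for non-negative coordinates
        target = np.array([px - ix, py - iy])     # p / R - p_tilde
        total += np.abs(pred_offset[:, iy, ix] - target).sum()
    return total / max(len(centers), 1)

pred = np.zeros((2, 32, 32))
loss_before = offset_l1_loss(pred, [(60.0, 90.0)])   # target offset is (0.0, 0.5)
pred[:, 22, 15] = [0.0, 0.5]                         # a perfect prediction at the centre cell
loss_after = offset_l1_loss(pred, [(60.0, 90.0)])
```

With an output stride of 4 each offset component lies in [0, 1), which corresponds to at most just under 4 input pixels of correction per axis.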

5. Size (width/height) loss and the overall loss

[Figure: size (width/height) loss]
The overall training objective is:
$L_{d e t}=L_{k}+\lambda_{s i z e} L_{s i z e}+\lambda_{o f f} L_{o f f}$

We set $\lambda_{size} = 0.1$ and $\lambda_{off} = 1$ in all our experiments unless specified otherwise. We use a single network to predict the keypoints $\hat{Y}$, offset $\hat{O}$, and size $\hat{S}$. The network predicts a total of $C + 4$ outputs at each location. All outputs share a common fully-convolutional backbone network.
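The weighting and the C + 4 output count can be checked with a short sketch. The size loss is assumed here to be an L1 loss on (w, h) at each object center, which is how I understand the paper; the function names, the (2, H, W) layout, the example numbers, and `num_classes = 80` (a COCO-sized example) are assumptions.

```python
import numpy as np

LAMBDA_SIZE = 0.1   # weight of the size loss, from the quoted text
LAMBDA_OFF = 1.0    # weight of the offset loss, from the quoted text

def size_l1_loss(pred_size, centers_lowres, sizes):
    """L1 loss on (w, h), evaluated only at object-centre cells.

    pred_size: (2, H, W) array; centers_lowres: list of (ix, iy); sizes: list of (w, h).
    """
    total = 0.0
    for (ix, iy), (w, h) in zip(centers_lowres, sizes):
        total += np.abs(pred_size[:, iy, ix] - np.array([w, h])).sum()
    return total / max(len(sizes), 1)

def detection_loss(l_k, l_size, l_off):
    """L_det = L_k + lambda_size * L_size + lambda_off * L_off."""
    return l_k + LAMBDA_SIZE * l_size + LAMBDA_OFF * l_off

num_classes = 80                                   # example value for C
head_channels = {"heatmap": num_classes, "offset": 2, "size": 2}
total_channels = sum(head_channels.values())       # C + 4

pred_size = np.zeros((2, 32, 32))
l_size = size_l1_loss(pred_size, [(15, 22)], [(60.0, 80.0)])
l_det = detection_loss(1.0, l_size, 0.5)
```

Because the size target is a raw length rather than a normalised value, its loss is numerically much larger than the other two terms, which is one plausible reason for the small weight of 0.1.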

6. From points to bounding boxes

$\begin{aligned}\left(\hat{x}_{i}+\delta \hat{x}_{i}-\hat{w}_{i} / 2,\right.& \hat{y}_{i}+\delta \hat{y}_{i}-\hat{h}_{i} / 2 \\ \hat{x}_{i}+\delta \hat{x}_{i}+\hat{w}_{i} / 2, &\left.\hat{y}_{i}+\delta \hat{y}_{i}+\hat{h}_{i} / 2\right) \end{aligned}$
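The decoding formula above maps directly to code. In this sketch the function name `decode_box` and the example numbers are assumptions, and so is the final multiplication by the stride, which presumes that the center, offset, and size are all expressed on the stride-4 output grid.

```python
def decode_box(x, y, dx, dy, w, h):
    """Turn a peak (x, y), its offset (dx, dy) and size (w, h) into (x1, y1, x2, y2)."""
    cx, cy = x + dx, y + dy                 # refined centre
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Peak at cell (15, 22), predicted offset (0.0, 0.5), predicted size 15 x 20.
box = decode_box(15, 22, 0.0, 0.5, 15.0, 20.0)

# Assumption: everything above is in output-grid units, so scale back by the stride.
STRIDE = 4
box_in_pixels = tuple(STRIDE * v for v in box)
```

No anchor matching or box-delta transform is involved; the box follows from one heatmap peak and four regressed numbers.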

7. Performance with different backbones and augmentations

[Figure: performance with different backbones and augmentations]

8. State-of-the-art comparison

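[Figure: comparison with state-of-the-art detectors]
There are quite a few papers here that I have not read yet and need to catch up on, TridentNet for example.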

9. Center point collision in CenterNet

In unlucky circumstances, two different objects might share
the same center, if they perfectly align. In this scenario,
CenterNet would only detect one of them.

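This is the overlapping-center problem: objects of different classes have their centers at the same location, or nearby objects of the same class fall into the same cell after downsampling.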
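A small sketch of how such collisions could be counted on a set of ground-truth boxes. The function names `lowres_cell` and `find_collisions` and the example boxes are assumptions, and class labels are ignored for brevity.

```python
def lowres_cell(box, stride=4):
    """Map a box (x1, y1, x2, y2) to the output-grid cell containing its centre."""
    x1, y1, x2, y2 = box
    return (int((x1 + x2) / 2 // stride), int((y1 + y2) / 2 // stride))

def find_collisions(boxes, stride=4):
    """Return index pairs (first, later) of boxes whose centres share a cell."""
    seen = {}
    collisions = []
    for i, box in enumerate(boxes):
        cell = lowres_cell(box, stride)
        if cell in seen:
            collisions.append((seen[cell], i))   # only one of the two can be detected
        else:
            seen[cell] = i
    return collisions

boxes = [
    (10, 10, 50, 50),      # centre (30, 30)    -> cell (7, 7)
    (12, 11, 50, 50),      # centre (31, 30.5)  -> cell (7, 7), collides with box 0
    (100, 100, 140, 160),  # centre (120, 130)  -> cell (30, 32)
]
collisions = find_collisions(boxes)
```

The two nearly identical boxes end up in the same cell even though their centers differ in the input image, so a smaller output stride would reduce such collisions at the cost of a larger output map.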