Paper Translation | CenterNet: Objects as Points

Paper: https://arxiv.org/pdf/1904.07850.pdf
Code: https://github.com/xingyizhou/CenterNet
Citation: Zhou X, Wang D, Krähenbühl P. Objects as Points[J]. arXiv preprint arXiv:1904.07850, 2019.

Abstract

       Detection identifies objects in an image as axis-aligned bounding boxes. Most successful object detectors enumerate a nearly exhaustive list of potential object locations and then classify each one, which is wasteful, inefficient, and requires additional post-processing. In this paper we take a different approach: we model an object as a single point, the center point of its bounding box. Our detector uses keypoint estimation to find the center points and regresses all other object properties, such as size, 3D location, orientation, and even pose. This center-point-based method, called CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than the corresponding bounding-box-based detectors. CenterNet achieves the best speed-accuracy trade-off; its performance is as follows:

On the MS COCO dataset: 28.1% AP at 142 FPS, 37.4% AP at 52 FPS, and 45.1% AP with multi-scale testing at 1.4 FPS.

With the same approach we also estimate 3D bounding boxes on the KITTI benchmark and human poses on the COCO keypoint dataset. Compared with sophisticated multi-stage methods, our method achieves competitive results and runs in real time.


Introduction

       Object detection powers many vision-based tasks such as instance segmentation, pose estimation, tracking, and action recognition, as well as downstream applications such as surveillance, autonomous driving, and visual question answering. Current detectors represent each object with an axis-aligned bounding box that tightly encloses it, and for each candidate box a classifier decides whether it is a specific object category or background.

One-stage detectors slide a complex arrangement of candidate boxes (anchors) over the image and classify them directly, without inspecting the box content again.

Two-stage detectors recompute image features for each potential box and then classify those features.

Post-processing, namely non-maximum suppression (NMS), removes duplicate detections of the same object by computing the IoU between bounding boxes. This post-processing is hard to differentiate and train, so most conventional detectors are not end-to-end trainable.

 

       This paper represents an object by a single point, the center of its bounding box (see Figure 2), and regresses the other object properties, such as size, dimension, 3D extent, orientation, and pose, at the center location. Object detection thus becomes a standard keypoint-estimation problem: we simply feed the image through a fully convolutional network to obtain a heatmap whose peaks are the object centers, and the image features at each peak predict the width and height of the corresponding object.

The model is trained with standard dense supervised learning, and inference is a single forward pass of the network, with no NMS-like post-processing.

With a few simple extensions (see Figure 4), the model can also output a 3D bounding box or a human-pose estimate at each center point.

For 3D bounding-box detection, we directly regress the object depth, the 3D box dimensions, and the object orientation;

For human pose estimation, we treat the 2D joint locations as offsets from the center point and regress these offsets directly at the center location.

Thanks to its simplified design, the model runs at high speed (see Figure 1).


Related work

        Our approach is closely related to anchor-based one-stage methods: the center point can be viewed as a single shape-agnostic anchor (see Figure 3). However, there are several important differences (the innovations of this paper):

First, our "anchors" are assigned purely by location, not by box overlap: there is no manually set IoU threshold for separating foreground from background (Faster R-CNN, for example, treats anchors with IoU > 0.7 against a ground-truth box as foreground, those with IoU < 0.3 as background, and ignores the rest);

Second, each object has only one positive "anchor", so NMS is not needed; we simply extract the local peaks of the keypoint heatmap;

Third, compared with traditional detectors (which typically downsample by a factor of 16), CenterNet uses a higher-resolution output feature map (downsampled by a factor of 4), so multiple feature maps with anchors of different sizes are unnecessary;

Object detection via keypoint estimation:

       We are not the first to use keypoint estimation for object detection. CornerNet detects the two corners of the bounding box as keypoints; ExtremeNet detects the top-most, bottom-most, left-most, right-most, and center points of every object. Like ours, these methods build on a robust keypoint-estimation network, but they all require a keypoint grouping stage, which slows down the overall algorithm. Our method only extracts a single center point per object, with no grouping or post-processing of keypoints.

Monocular 3D object detection:

3D bounding-box detection enables autonomous driving. Deep3DBox uses a slow-RCNN-style framework: it first detects 2D objects and then feeds each of them to a 3D estimation network. 3D RCNN adds an extra head to Faster R-CNN for 3D projection. Deep MANTA uses a coarse-to-fine Faster R-CNN trained on multiple tasks. Our model is similar to a one-stage version of Deep3DBox or 3D RCNN, and again CenterNet is simpler and faster than both.


Preliminary

        Let I \in R^{W\times H\times 3} be the input image of width W and height H. Our goal is to produce a keypoint heatmap \hat{Y} \in [0,1]^{\frac{W}{R}\times \frac{H}{R}\times C}, where R is the output stride (the downsampling factor) and C is the number of keypoint types (the number of output channels). The keypoint types are the C = 17 human joints for pose estimation, or the C = 80 object categories for detection. We use the default output stride R = 4. A prediction \hat{Y}_{x,y,c}=1 corresponds to a detected keypoint, while \hat{Y}_{x,y,c}=0 is background. We experiment with several fully convolutional encoder-decoder networks to predict \hat{Y} from the image: the stacked hourglass network, up-convolutional residual networks (ResNet), and deep layer aggregation (DLA).

       We train the keypoint prediction network following Law and Deng (H. Law and J. Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018). For each ground-truth (GT) keypoint of class c at location p \in R^{2}, we compute its low-resolution (downsampled) equivalent \tilde{p}=\left\lfloor \frac{p}{R} \right\rfloor. Each GT keypoint is splatted onto the heatmap Y \in [0,1]^{\frac{W}{R}\times \frac{H}{R}\times C} with a Gaussian kernel Y_{xyc}=\exp\left(-\frac{(x-\tilde{p}_x)^2+(y-\tilde{p}_y)^2}{2\sigma_p^2}\right), where \sigma_p is an object-size-adaptive standard deviation. If two Gaussians of the same class c overlap, we take the element-wise maximum. The training objective is a pixel-wise logistic regression with focal loss:

L_k = \frac{-1}{N}\sum_{xyc}\begin{cases}(1-\hat{Y}_{xyc})^{\alpha}\log(\hat{Y}_{xyc}) & \text{if } Y_{xyc}=1\\(1-Y_{xyc})^{\beta}(\hat{Y}_{xyc})^{\alpha}\log(1-\hat{Y}_{xyc}) & \text{otherwise}\end{cases}

where \alpha and \beta are hyper-parameters of the focal loss, set to 2 and 4 respectively in our experiments, and N is the number of keypoints in image I; dividing by N normalizes all the focal-loss terms.
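To make the target construction and the loss concrete, here is a minimal PyTorch sketch (not the authors' code; the function names and the fixed sigma argument are illustrative assumptions) of Gaussian splatting onto a per-class heatmap and the pixel-wise focal loss above:

```python
import torch

def draw_gaussian(heatmap, center, sigma):
    """Splat one keypoint at `center` = (cx, cy) onto a 2D heatmap with an
    unnormalized Gaussian, taking the element-wise maximum on overlap."""
    h, w = heatmap.shape
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    cx, cy = center
    g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return torch.maximum(heatmap, g)

def focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced pixel-wise logistic focal loss; pred and gt are in [0, 1]."""
    pos = gt.eq(1).float()
    pos_term = pos * (1 - pred) ** alpha * torch.log(pred + eps)
    neg_term = (1 - pos) * (1 - gt) ** beta * pred ** alpha * torch.log(1 - pred + eps)
    num_pos = pos.sum().clamp(min=1)  # N = number of keypoints in the image
    return -(pos_term.sum() + neg_term.sum()) / num_pos
```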

        Because the output is downsampled, the GT keypoint locations incur a discretization error. We therefore additionally predict a local offset \hat{O} \in R^{\frac{W}{R}\times \frac{H}{R}\times 2} for each center point. All classes c share the same offset prediction, which is trained with an L1 loss:

L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}}-\left(\frac{p}{R}-\tilde{p}\right)\right|

The supervision acts only at the keypoint locations \tilde{p}; all other locations are ignored. The following sections show how this keypoint estimator is extended to a general-purpose object detector.
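A minimal sketch of the masked L1 offset loss, assuming the ground-truth offsets and a binary keypoint mask have already been rasterized onto the output grid (the tensor layout is an assumption of this sketch):

```python
import torch

def offset_loss(pred_offset, gt_offset, keypoint_mask):
    """pred_offset, gt_offset: (2, H, W); keypoint_mask: (1, H, W) with 1 at keypoints.
    Only the keypoint locations contribute to the loss; everything else is ignored."""
    num_kp = keypoint_mask.sum().clamp(min=1)
    return (keypoint_mask * (pred_offset - gt_offset).abs()).sum() / num_kp
```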


Objects as Points

        Let (x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)}) be the bounding box of object k with category c_k. Its center lies at p_k = \left(\frac{x_1^{(k)}+x_2^{(k)}}{2}, \frac{y_1^{(k)}+y_2^{(k)}}{2}\right). We use the keypoint estimator \hat{Y} to predict all center points and, in addition, regress the object size s_k = (x_2^{(k)}-x_1^{(k)}, y_2^{(k)}-y_1^{(k)}) for each object k. To reduce the computational burden, we use a single size prediction \hat{S} \in R^{\frac{W}{R}\times \frac{H}{R}\times 2} for all object categories, trained with an L1 loss at the center locations:

L_{size} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_k}-s_k\right|

We do not normalize the scale and directly use the raw pixel coordinates. To balance its influence, this loss is multiplied by a constant weight, and the overall training objective becomes:

L_{det} = L_k + \lambda_{size}L_{size} + \lambda_{off}L_{off}

In our experiments we set \lambda_{size}=0.1 and \lambda_{off}=1. The network predicts C + 4 values at every location (the C keypoint classes, the x, y offset, and the w, h size), and all outputs share a single fully convolutional backbone;
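A minimal sketch of the shared backbone feature map feeding three heads (C-channel heatmap, 2-channel size, 2-channel offset) and the weighted total loss; the head structure used here (a 3x3 convolution with 256 channels followed by a 1x1 convolution, as in the DLA head description below) is for illustration only:

```python
import torch
import torch.nn as nn

class CenterNetHeads(nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        def make_head(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, out_channels, kernel_size=1))
        self.heatmap = make_head(num_classes)  # C keypoint classes
        self.size = make_head(2)               # (w, h)
        self.offset = make_head(2)             # (dx, dy)

    def forward(self, features):               # features: (B, in_channels, H/R, W/R)
        return torch.sigmoid(self.heatmap(features)), self.size(features), self.offset(features)

def detection_loss(l_k, l_size, l_off, lambda_size=0.1, lambda_off=1.0):
    return l_k + lambda_size * l_size + lambda_off * l_off
```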

 

From points to bounding boxes

       At inference time we extract the heatmap peaks for each class independently. To find these peaks, we compare every response on the heatmap with its 8 connected neighbors and keep it only if it is greater than or equal to all of them, then retain the top 100 such peaks. Let \hat{P}_c=\{(\hat{x}_i,\hat{y}_i)\}_{i=1}^{n} be the set of n detected center points of class c; each keypoint location is given as integer coordinates (x_i, y_i). Using the keypoint values \hat{Y}_{x_i y_i c} as the detection confidence, we produce the bounding boxes:

\left(\hat{x}_i+\delta\hat{x}_i-\hat{w}_i/2,\ \hat{y}_i+\delta\hat{y}_i-\hat{h}_i/2,\ \hat{x}_i+\delta\hat{x}_i+\hat{w}_i/2,\ \hat{y}_i+\delta\hat{y}_i+\hat{h}_i/2\right)

where (\delta\hat{x}_i,\delta\hat{y}_i)=\hat{O}_{\hat{x}_i,\hat{y}_i} is the offset prediction and (\hat{w}_i,\hat{h}_i)=\hat{S}_{\hat{x}_i,\hat{y}_i} is the size prediction. All outputs are produced directly from the keypoint estimates, without IoU-based NMS or any other post-processing.
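The peak extraction can be implemented with a 3x3 max pooling that keeps only locations equal to the pooled value. A minimal decoding sketch, assuming per-image tensors heatmap (C, H, W), offset (2, H, W), and size (2, H, W) on the stride-R output grid:

```python
import torch
import torch.nn.functional as F

def decode_detections(heatmap, offset, size, top_k=100):
    c, h, w = heatmap.shape
    # keep a response only if it is the maximum of its 3x3 (8-connected) neighborhood
    pooled = F.max_pool2d(heatmap[None], kernel_size=3, stride=1, padding=1)[0]
    heatmap = heatmap * (pooled == heatmap).float()
    scores, inds = heatmap.view(-1).topk(top_k)
    classes = torch.div(inds, h * w, rounding_mode='floor')
    ys = torch.div(inds % (h * w), w, rounding_mode='floor')
    xs = inds % w
    cx = xs.float() + offset[0, ys, xs]
    cy = ys.float() + offset[1, ys, xs]
    bw, bh = size[0, ys, xs], size[1, ys, xs]
    boxes = torch.stack([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], dim=1)
    return boxes, scores, classes  # box coordinates are on the 1/R output map
```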

 

3D detection

       3D detection estimates a three-dimensional bounding box for each object. Each center point requires three additional quantities: depth, 3D dimensions, and orientation, and we add a separate head for each of them.

        The depth d is a single scalar per center point, but it is hard to regress directly. Following [D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.], we apply an output transformation d = 1/\sigma(\hat{d}) - 1, where \sigma is the sigmoid function. The depth is computed by an extra output channel added on top of the keypoint estimator, using two convolutional layers separated by a ReLU, and the depth estimator is trained with an L1 loss.
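A minimal sketch of the depth transform and its L1 loss, evaluated only at annotated center points; whether the loss is taken in the transformed or the original depth domain is not specified in the text above, and this sketch assumes the original depth in meters:

```python
import torch

def decode_depth(d_hat):
    """Map the unconstrained network output d_hat to a positive depth."""
    return 1.0 / torch.sigmoid(d_hat) - 1.0

def depth_loss(d_hat, d_gt, center_mask):
    """d_hat, d_gt, center_mask: (1, H, W); center_mask is 1 at annotated centers."""
    num = center_mask.sum().clamp(min=1)
    return (center_mask * (decode_depth(d_hat) - d_gt).abs()).sum() / num
```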

       The 3D dimensions of an object are three scalars. We directly regress their absolute values (length, width, height) in meters, using a separate head \hat{\Gamma} \in R^{\frac{W}{R}\times \frac{H}{R}\times 3} and an L1 loss;

       Orientation is a single scalar by default, but it is also hard to regress directly. Following [A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka. 3D bounding box estimation using deep learning and geometry. In CVPR, 2017.], we represent the orientation with two bins and perform in-bin regression. Specifically, the orientation is encoded with 8 scalars, 4 per bin: for each bin, two values are used for softmax classification (whether the angle falls in this bin) and the remaining two regress the angle within the bin.
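A minimal sketch of decoding this 8-scalar orientation encoding: each bin contributes two classification scores ("is the angle in this bin?") and a (sin, cos) pair for the in-bin angle. The bin centers at -π/2 and +π/2 are assumptions of this sketch, not values stated in the text above:

```python
import math
import torch

def decode_orientation(rot):
    """rot: tensor of 8 values =
    [bin1 cls (2), bin1 sin, bin1 cos, bin2 cls (2), bin2 sin, bin2 cos]."""
    in_bin1 = torch.softmax(rot[0:2], dim=0)[1]
    in_bin2 = torch.softmax(rot[4:6], dim=0)[1]
    if in_bin1 > in_bin2:
        # assumed bin center at -pi/2
        return math.atan2(rot[2].item(), rot[3].item()) - 0.5 * math.pi
    # assumed bin center at +pi/2
    return math.atan2(rot[6].item(), rot[7].item()) + 0.5 * math.pi
```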

 

Human pose estimation

       Human pose estimation aims to locate the k 2D joints of every person in the image (k = 17 in COCO, i.e. 17 joints per person). We therefore make the pose a k×2-dimensional property of the center point and parameterize each joint as an offset from the center. We directly regress the joint offsets (in pixels) with an L1 loss, and ignore invisible joints by masking the loss, similar in spirit to slow-RCNN.

        To refine the joints, we further estimate k human-joint heatmaps \hat{\Phi} \in R^{\frac{W}{R}\times \frac{H}{R}\times k} using standard bottom-up multi-person pose estimation [4, 39, 41]; these joint heatmaps are trained with a focal loss and local pixel offsets, in the same way as the center points above. We then snap the initial regressed predictions to the closest joint detected on these heatmaps, using the center offsets as a grouping cue that assigns each detected joint to its closest person. Concretely, let (\hat{x},\hat{y}) be a detected center point. The joints obtained from the first (offset-based) regression are:

l_j = (\hat{x},\hat{y}) + \hat{J}_{\hat{x}\hat{y}j}, \quad j \in \{1,\dots,k\}

We also extract all joint locations L_j=\{\tilde{l}_{ji}\}_{i=1}^{n_j} with confidence greater than 0.1 for each joint type j from the corresponding heatmap \hat{\Phi}_{\cdot\cdot j} (these are obtained by heatmap peak extraction, just like the center points; responses below 0.1 are discarded).

Finally, each regressed location l_j (from the first, offset-based regression) is assigned to its closest detected joint \arg\min_{l\in L_j}(l-l_j)^2, considering only joint detections that lie inside the bounding box of the detected object.
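A minimal sketch (assuming a single detected person and precomputed heatmap peaks) of this assignment step: each regressed joint is snapped to the nearest sufficiently confident joint of the same type that lies inside the detected box:

```python
import torch

def snap_joints(regressed, joint_peaks, box, score_thresh=0.1):
    """regressed: (K, 2) joint locations from center + offset regression;
    joint_peaks: list of K tensors of shape (M_j, 3) holding (x, y, score) heatmap
    peaks for each joint type j; box: (x1, y1, x2, y2) of the detected person."""
    refined = regressed.clone()
    x1, y1, x2, y2 = box
    for j, peaks in enumerate(joint_peaks):
        if peaks.numel() == 0:
            continue
        keep = ((peaks[:, 2] > score_thresh)
                & (peaks[:, 0] >= x1) & (peaks[:, 0] <= x2)
                & (peaks[:, 1] >= y1) & (peaks[:, 1] <= y2))
        candidates = peaks[keep]
        if candidates.numel() == 0:
            continue  # no confident heatmap joint in the box: keep the regressed location
        dist2 = ((candidates[:, :2] - regressed[j]) ** 2).sum(dim=1)
        refined[j] = candidates[dist2.argmin(), :2]
    return refined
```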


Implementation details

       We experiment with four architectures: ResNet-18, ResNet-101, DLA-34, and Hourglass-104. We modify ResNet and DLA-34 with deformable convolution layers, and use the Hourglass network as-is.

Hourglass

The stacked Hourglass network [30, 40] first downsamples the input by a factor of 4 and then applies two sequential hourglass modules. Each hourglass module is a symmetric 5-stage down- and up-convolutional network with skip connections. This network is large, but it usually produces the best keypoint estimates.

ResNet

Xiao et al. [55] append three up-convolutional stages to a standard ResNet to obtain a higher-resolution output (final stride 4). To save computation, we change the output channels of these three up-convolutional layers to 256, 128, and 64, and initialize the up-convolution kernels as bilinear interpolation.
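A minimal sketch of the three up-convolutional stages appended to the ResNet backbone with output channels 256, 128, 64; the 4x4 deconvolution kernel and the BatchNorm/ReLU placement are assumptions of this sketch, and the bilinear kernel initialization and deformable convolutions are omitted for brevity:

```python
import torch.nn as nn

def upsample_head(in_channels=512):
    """512 assumes a ResNet-18 backbone; ResNet-101 would end with 2048 channels."""
    layers = []
    for out_channels in (256, 128, 64):
        layers += [
            nn.ConvTranspose2d(in_channels, out_channels, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        ]
        in_channels = out_channels
    return nn.Sequential(*layers)  # overall upsampling factor of 8: stride 32 -> stride 4
```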

DLA

Deep Layer Aggregation (DLA) is an image-classification network with hierarchical skip connections. We use the fully convolutional upsampling variant of DLA, add deformable convolutions in the skip connections from lower layers to the output, and replace the convolutions in the original upsampling layers with 3x3 deformable convolutions. Before each output head we add a 3x3 convolution with 256 channels, followed by a 1x1 convolution that produces the desired output.

Training

Training input resolution: 512x512; output resolution: 128x128 (i.e. stride 4). Data augmentation: random flip, random scaling (from 0.6 to 1.3), cropping, and color jittering. We use the Adam optimizer.

No data augmentation is used for the 3D estimation branch (scaling and cropping would distort the 3D measurements);

See the paper for more detailed training settings (learning rate, number of GPUs, initialization strategy, etc.).

Inference

We use three levels of test-time augmentation: no augmentation, flip augmentation, and flip plus multi-scale (0.5, 0.75, 1.25, 1.5) augmentation. For flip, we average the network outputs before decoding the bounding boxes. For multi-scale, we use NMS to merge the results.
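A minimal sketch of the flip augmentation, assuming the model returns (heatmap, size, offset) and averaging only the class heatmap here; multi-scale merging with NMS is omitted:

```python
import torch

def flip_averaged_heatmap(model, image):
    """image: (1, 3, H, W). Average the class heatmap of the image and its horizontal flip."""
    heatmap, _, _ = model(image)
    heatmap_flip, _, _ = model(torch.flip(image, dims=[3]))
    return 0.5 * (heatmap + torch.flip(heatmap_flip, dims=[3]))
```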




Experiments


Detailed information on the network architectures used in the paper is shown in the architecture figure of the paper:

 

Origin blog.csdn.net/weixin_39875161/article/details/90552609