
CVPR-2017

Contents

1 Background and Motivation
2 Related Work
3 Advantages / Contributions
4 Method
- 4.1 Person Box Detection
- 4.2 Person Pose Estimation
5 Experiments
6 Conclusion

1 Background and Motivation

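Human pose estimation is defined as the 2-D localization of human joints on the arms and legs, and of keypoints on the torso and the face.

As the paper's title suggests, the motivation is straightforward: the authors are aiming squarely at accuracy on public benchmarks, and there is not much more to say.

"In the wild" means the scenes are more complex, so person detection and keypoint detection have to be handled together.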

2 Related Work

Part-based models: Pictorial Structures
Single-person pose: the stacked hourglass method performs best on the MPII and FLIC datasets (at the time of the paper)
Top-down multi-person pose, in which a pose estimator is applied to the output of a bounding-box person detector
Bottom-up multi-person pose, in which keypoint proposals are grouped together into person instances

3 Advantages / Contributions

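Using heatmaps plus offsets, combined with a voting mechanism inside a local region, the method reaches state-of-the-art results, well ahead of the 2016 COCO keypoints challenge winner (CMU-Pose).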

4 Method

Faster RCNN + ResNet-101 for person detection

heatmap + offset

OKS-NMS replaces box-level IoU NMS

A keypoint-based confidence score estimator is proposed, which re-scores each detection based on the estimated keypoints

4.1 Person Box Detection

Faster RCNN + ResNet-101

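Atrous convolution is used to preserve resolution. According to the paper, the lowest feature resolution is 1/8 of the input, which amounts to keeping only three downsampling stages.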

4.2 Person Pose Estimation

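Regression approach: directly regressing coordinates, as in DeepPose ("DeepPose: Human Pose Estimation via Deep Neural Networks"), is rather hard, and it cannot properly handle overlap, where a single patch contains several instances of the same keypoint. For example, in a crowded crop with many shoulders, regressing to any one of them is defensible, but the chosen shoulder may not belong to the same person, which makes the result wrong.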


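Classification approach (i.e., heatmaps): the accuracy is limited by the resolution of the output features.

The authors combine classification and regression: the heatmap predicts the rough location, and a regressed offset refines the keypoint position.

The authors say the idea was inspired by two-stage object detection (refining on top of a first-stage result). Amusingly, things have come full circle: the anchor-free object detectors that became popular later, such as FCOS, CenterNet, and CornerNet, all carry traces of this heatmap + offset design.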


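1) Image Cropping

All input bounding boxes are first brought to the same aspect ratio and the crop is then resized to 353×257. During training the box is enlarged by a random factor between 1.0 and 1.5 (to include more context); during evaluation the rescaling factor is fixed at 1.25.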
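The box preparation described above can be sketched roughly as follows. This is only an illustration, not the authors' code: the helper names `expand_box` and `training_scale` are made up here, and the choice to reach the target ratio by extending the shorter side (never shrinking) is an assumption. The actual resize to 353×257 would be done afterwards with an image library and is not shown.

```python
import random

CROP_H, CROP_W = 353, 257  # crop size quoted in the text
EVAL_SCALE = 1.25          # fixed rescaling factor at evaluation time


def expand_box(x0, y0, x1, y1, scale, aspect=CROP_H / CROP_W):
    """Pad a box to a fixed height/width ratio, then enlarge it about its centre."""
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = x1 - x0, y1 - y0
    # Assumption: extend whichever side is too short; never shrink the box.
    if h < aspect * w:
        h = aspect * w
    else:
        w = h / aspect
    w, h = w * scale, h * scale
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def training_scale(rng=random):
    """Random enlargement factor in [1.0, 1.5] used as training augmentation."""
    return rng.uniform(1.0, 1.5)
```

At evaluation time one would call `expand_box(x0, y0, x1, y1, EVAL_SCALE)`; at training time the last argument would be `training_scale()`.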

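2) Heatmap and Offset Prediction with CNN

The network outputs 3K channels, where K is the number of keypoints in the dataset; the 3 comes from 1 heatmap channel plus 2 coordinate-offset channels per keypoint.

The output resolution is 1/8 of the input and is then bilinearly upsampled back to the input size of 353×257 (the now-common practice is to output at 1/4 resolution and build the ground truth at 1/4 as well).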


Next, the possible forms of the localization target.

Option 1: an ideal delta function. The heatmap hits the keypoint exactly and no offset is needed.


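$f_k(x_i) = 1$ if the k-th keypoint ( $k \in \{1,...,K\}$ ) is located at position $x_i$ ( $i \in \{1,...,N\}$ , N = 353×257 ), and 0 otherwise

Directly regressing such a pinpoint target is difficult, so the authors use the following area-based formulation instead.

Option 2: approximate the unit-mass delta function with a disk-shaped heatmap that covers an area around the keypoint, refined by offsets (like Sasuke and Naruto combining the transformation jutsu with the shadow windmill).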


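The heatmap is $h_k(x_i) = 1$ if $\|x_i - l_k\| \le R$ and 0 otherwise; that is, every point $x_i$ within radius $R$ of the keypoint location $l_k$ counts as positive (R = 25 in the experiments).

This form of heatmap is a binary classification target: 1 inside the disk around the keypoint, 0 elsewhere (the network predicts, for each position, the probability that it lies near the joint).

With the heatmap defined, the offset target is $F_k(x_i) = l_k - x_i$

Finally, the heatmap and the offsets are aggregated into highly localized activation maps $f_k(x_i)$.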
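To make the two targets concrete, here is a small pure-Python sketch that builds the disk heatmap and the offset field for a single keypoint. The function name `make_targets` and the nested-list layout are illustrative choices, not something taken from the paper; a real pipeline would build these as tensors on the 353×257 crop with R = 25.

```python
import math


def make_targets(height, width, keypoint, radius):
    """Return (heatmap, offset_x, offset_y) as nested lists for one keypoint.

    keypoint is (x, y) in pixel coordinates of the crop.
    """
    lx, ly = keypoint
    heat = [[0.0] * width for _ in range(height)]
    off_x = [[0.0] * width for _ in range(height)]
    off_y = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Offset target F_k(x_i) = l_k - x_i, defined at every pixel.
            off_x[y][x] = lx - x
            off_y[y][x] = ly - y
            # Heatmap target h_k(x_i) = 1 inside the disk of radius R.
            if math.hypot(x - lx, y - ly) <= radius:
                heat[y][x] = 1.0
    return heat, off_x, off_y
```

For instance, `make_targets(9, 9, (4, 4), 2)` marks the 13 grid points within distance 2 of the centre as positive, and every pixel's offset points back at (4, 4).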


The aggregation is given by

$f_k(x_i) = \sum_j \frac{1}{\pi R^2}G(x_j + F_k(x_j) - x_i)h_k(x_j)$

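$G(\cdot)$ is the bilinear interpolation kernel
$F_k(x_i) = l_k - x_i$
$i$ indexes the position being scored, and $j$ ranges over the positions in the crop that cast votes

Suppose the offsets $F_k(x_j)$ learned by the network are perfect, so that $F_k(x_j) = l_k-x_j$ . The term $x_j + F_k(x_j) - x_i$ in the formula then becomes $l_k-x_i$ , which is the same for every $j$ : every voter points at $l_k$ , so summing and normalizing by $\frac{1}{\pi R^2}$ , together with the heatmap values, yields the final fused activation map concentrated at $l_k$ .

In practice the learned offsets are biased, i.e. $F_k(x_j) ≠ l_k-x_j$ . The kernel $G$ then acts as an interpolation weight: the farther the voted position $x_j + F_k(x_j)$ lands from $x_i$ , the smaller the weight applied to $h_k(x_j)$ , so a vote whose predicted offset deviates strongly from the true vector $l_k-x_j$ contributes little at the true keypoint. The heatmap predictions $h_k(x_j)$ are combined with these weights (see the comments under the post "Towards Accurate Multi-person Pose Estimation in the Wild 论文阅读").

This aggregation is a form of Hough voting:

each point $j$ in the image crop grid (353×257) casts a vote with its estimate for the position of every keypoint, with the vote being weighted by the probability that it is in the disk of influence of the corresponding keypoint (every point inside the disk plays its part in finding the best keypoint location; note the form of the formula: all the hot points inside the circle contribute, and their contributions are combined using the interpolation weights)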
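A minimal sketch of this voting step, written as a scatter rather than a double sum over $i$ and $j$, may help. It assumes $G$ is the standard bilinear kernel $G(d) = \max(0, 1-|d_x|)\max(0, 1-|d_y|)$, so each voter only touches the four grid points around its voted position; the function name `hough_aggregate` and the nested-list inputs are illustrative.

```python
import math


def hough_aggregate(heat, off_x, off_y, radius):
    """Fuse a heatmap and an offset field into a localized activation map."""
    height, width = len(heat), len(heat[0])
    fused = [[0.0] * width for _ in range(height)]
    norm = 1.0 / (math.pi * radius ** 2)
    for y in range(height):
        for x in range(width):
            weight = heat[y][x] * norm
            if weight == 0.0:
                continue
            # Position this pixel votes for: x_j + F_k(x_j).
            vx, vy = x + off_x[y][x], y + off_y[y][x]
            x0, y0 = math.floor(vx), math.floor(vy)
            # Bilinear kernel G: split the vote over the 4 nearest grid points.
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    g = max(0.0, 1 - abs(vx - xx)) * max(0.0, 1 - abs(vy - yy))
                    if g > 0.0 and 0 <= xx < width and 0 <= yy < height:
                        fused[yy][xx] += g * weight
    return fused
```

With a perfect disk heatmap and perfect offsets, every vote lands on the keypoint itself, so the fused map is zero everywhere except at $l_k$; noisy offsets spread the mass over neighbouring pixels instead.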

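Hough voting can be illustrated with the following simple example (from "【OpenCV学习笔记】之霍夫变换（Hough Transform）").

Take line detection in an image. From the line equation $y = a x + b$ , fixing the slope $a$ and the intercept $b$ determines the line. A point $(x, y)$ in image space corresponds to a line $b = y - a x$ in parameter space; two points give two lines, n points give n lines. Voting then picks the maximum (the parameter-space point where the most lines overlap), which gives the intersection $a_0,b_0$ and hence the line equation.

The simplest case is two points: their two parameter-space lines intersect at a single $(a_0, b_0)$ .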
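The line example can be written as a toy accumulator. This sketch assumes a small, hand-picked list of candidate slopes and exact arithmetic on the intercept; `hough_line` is an illustrative name. Practical implementations typically vote in a $(\rho, \theta)$ parameterization instead, which also handles vertical lines.

```python
from collections import Counter


def hough_line(points, slopes):
    """Vote in (a, b) space for lines y = a*x + b; return the most voted pair."""
    votes = Counter()
    for x, y in points:
        for a in slopes:
            # Each image point maps to the line b = y - a*x in parameter space.
            b = round(y - a * x, 6)
            votes[(a, b)] += 1
    (a0, b0), count = votes.most_common(1)[0]
    return a0, b0, count
```

Given four points on $y = 2x + 1$ plus one outlier, the cell $(a, b) = (2, 1)$ collects four votes while every other cell collects one, so the outlier does not disturb the result.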

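3) Model Training

The heatmap branch (the classification branch) uses logistic losses (binary classification): the label is 1 inside $\|x_i-l_k\| \le R$ and 0 outside (unlike the Gaussian labels that are common today).

The offset branch uses the Huber robust loss (similar to smooth L1).
See "Huber robust error function" for reference.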

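The offset loss is:

$L_o(\theta) = \sum_{k=1}^{K} \sum_{i:\ \|l_k - x_i\| \le R} H(\|F_k(x_i) - (l_k - x_i)\|)$

where $H(\cdot)$ is the Huber robust loss; only positions inside the disk of radius $R$ around each keypoint contribute.

The overall loss is:

$L(\theta) = \lambda_h L_h(\theta) + \lambda_o L_o(\theta)$

$L_h(\theta)$ is the loss of the heatmap branch and $L_o(\theta)$ the loss of the offset branch.

$\lambda_h$ and $\lambda_o$ are the loss weights, set to 4 and 1 respectively.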
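As a rough sketch of how the two branches could be combined for one keypoint on one crop: the code below sums a logistic loss over all pixels and a Huber loss over the pixels inside the disk, then weights them by 4 and 1. Several details are assumptions rather than facts from the paper: the Huber threshold `delta=1.0`, summing instead of averaging, restricting the offset loss to the disk, and the names `huber` and `keypoint_loss`.

```python
import math


def huber(r, delta=1.0):
    """Huber function: quadratic near zero, linear for large residuals."""
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)


def keypoint_loss(logits, pred_off, keypoint, radius, lam_h=4.0, lam_o=1.0):
    """Heatmap + offset loss for one keypoint on one crop.

    logits[y][x] is the raw heatmap score; pred_off[y][x] is a (dx, dy) pair.
    """
    lx, ly = keypoint
    loss_h = loss_o = 0.0
    for y, row in enumerate(logits):
        for x, z in enumerate(row):
            inside = math.hypot(x - lx, y - ly) <= radius
            label = 1.0 if inside else 0.0
            # Numerically stable logistic (sigmoid cross-entropy) loss.
            loss_h += max(z, 0.0) - z * label + math.log1p(math.exp(-abs(z)))
            if inside:
                dx, dy = pred_off[y][x]
                # Residual between predicted offset and the target l_k - x_i.
                loss_o += huber(math.hypot(dx - (lx - x), dy - (ly - y)))
    return lam_h * loss_h + lam_o * loss_o
```

Because the Huber term grows only linearly for large residuals, a badly wrong offset prediction at one pixel does not dominate the gradient the way a squared error would.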

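4) Pose Rescoring

$score = \frac{1}{K} \sum_{k=1}^{K} \max_{x_i} f_k(x_i)$

The score takes the highest response in each keypoint's map and averages these maxima over the K keypoints; it is used to assess the quality of the detected keypoints.

Unfortunately, the authors do not ablate this component in the later experiments.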
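The rescoring rule is simple enough to state in a few lines. This sketch assumes the input is a list with one 2-D map per keypoint (nested lists); `pose_score` is an illustrative name.

```python
def pose_score(activation_maps):
    """Average, over keypoints, of the maximum value of each keypoint's map."""
    peaks = [max(max(row) for row in fmap) for fmap in activation_maps]
    return sum(peaks) / len(peaks)
```

A pose whose keypoints all have a sharp, confident peak scores high, while a pose with several weak or missing keypoints is pulled down, which is what makes the score useful for ranking detections.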

5) OKS-Based Non Maximum Suppression

Person detection uses NMS as post-processing, suppressing boxes whose IoU overlap is too high.

On top of IoU-NMS, the authors apply a second, OKS-based NMS to the final pose estimates: the overlap between two candidate pose detections is measured by their object keypoint similarity (OKS).
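A greedy OKS-based NMS might look like the sketch below. It uses a COCO-style similarity, the mean over keypoints of $\exp(-d^2 / (2 \cdot area \cdot \kappa^2))$, and ignores visibility flags. The choice of which detection's area to use, the per-keypoint constants `kappas`, and the `threshold` are all left as parameters because the notes above do not specify them; `oks` and `oks_nms` are illustrative names.

```python
import math


def oks(pose_a, pose_b, area, kappas):
    """Object keypoint similarity between two poses given as lists of (x, y)."""
    total = 0.0
    for (xa, ya), (xb, yb), k in zip(pose_a, pose_b, kappas):
        d2 = (xa - xb) ** 2 + (ya - yb) ** 2
        total += math.exp(-d2 / (2.0 * area * k * k))
    return total / len(kappas)


def oks_nms(poses, scores, areas, kappas, threshold):
    """Greedy NMS: keep the best pose, drop poses too similar to a kept one."""
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Assumption: similarity is scaled by the area of the already-kept pose.
        if all(oks(poses[i], poses[j], areas[j], kappas) < threshold for j in keep):
            keep.append(i)
    return keep
```

Compared with box IoU, this criterion can separate two people whose boxes overlap heavily but whose keypoints are clearly different, and it can merge duplicates whose boxes differ but whose poses coincide.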

5 Experiments

5.1 Datasets

MS COCO
MS COCO+internal

5.2 COCO Keypoints Detection State-of-the-Art


In the example results, even small mannequin-like figures get keypoints, which is genuinely funny.


5.3 Ablation Study

1) Box Detection Module

The two person detectors reach the following AP on COCO:

0.466 and 0.500 for mini-val
0.456 and 0.487 for test-dev

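The quality of the person detector does have some effect, but even with ground-truth boxes the AP is only about 70 (I am writing this three years later), so the keypoint detection algorithm itself still has room to improve.

PS: to measure how much the person detector affects keypoint detection, evaluate the keypoint AP using ground-truth person boxes.

2) Pose Estimation Module

Ablations over the backbone and the input size.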


5.4 OKS-Based Non Maximum Suppression

The IoU-NMS threshold of the person box detector is set to 0.6.

6 Conclusion

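COCO: 17 keypoints (12 body joints and 5 face landmarks)
Top-down human pose estimation resembles two-stage object detection: the first stage detects people, and the second stage detects keypoints on top of the first-stage result
The Huber robust loss is similar to smooth L1
Pose rescoring assesses the quality of the detected keypoints
To measure how much the person detector affects keypoint detection, evaluate the keypoint AP using ground-truth person boxes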

【Heatmap+offset】《Towards Accurate Multi-person Pose Estimation in the Wild》
