论文阅读:Associative Embedding: End-to-End Learning for Joint Detection and Grouping

1、Paper overview

The network proposed in this paper is applied mainly to multi-person pose estimation (I came to it after reading CornerNet, and it turned out to be quite a rabbit hole). What the authors most want to convey is "End-to-End Learning for Joint Detection and Grouping": many vision tasks fit the framework of first detecting individual parts and then combining them according to some rule. The authors realize this idea end to end and apply it to multi-person pose estimation, achieving state-of-the-art results. The paper also applies it to instance segmentation, with only mediocre results, and the authors note that the same idea could be used for multi-object tracking as well.

Concrete pipeline: a Stacked Hourglass architecture first extracts features and detects the keypoints of all people in the image, producing K channels of detection heatmaps (where K is the number of keypoint types, typically 17). Alongside these, it produces K corresponding channels of "tag" maps, trained so that the keypoints belonging to the same person have similar tag values; keypoints are then grouped into individual people according to their tags. This is the so-called bottom-up approach: in essence, the tags turn a pixel-level (segmentation-style) problem into a detection problem.
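The tag-based grouping step can be sketched roughly as follows. This is a simplified greedy version, not the paper's exact decoder (which processes joints in a fixed order and also uses detection scores when matching); all names here are illustrative:

```python
import numpy as np

def group_by_tags(detections, tag_thresh=0.5):
    """Greedy bottom-up grouping. `detections` is a list of length K (one
    entry per joint type); each entry is a list of (x, y, score, tag)
    candidates for that joint. A candidate joins the existing person whose
    mean tag is closest (within tag_thresh), otherwise starts a new person."""
    people = []  # each person: {'mean_tag': float, 'tags': [...], 'joints': {k: (x, y, score)}}
    for k, candidates in enumerate(detections):
        for (x, y, score, tag) in candidates:
            best, best_d = None, tag_thresh
            for p in people:
                d = abs(p['mean_tag'] - tag)
                # a person can hold at most one instance of each joint type
                if k not in p['joints'] and d < best_d:
                    best, best_d = p, d
            if best is None:
                best = {'mean_tag': tag, 'tags': [], 'joints': {}}
                people.append(best)
            best['joints'][k] = (x, y, score)
            best['tags'].append(tag)
            best['mean_tag'] = float(np.mean(best['tags']))
    return people
```

With two joint types and two people whose tags cluster around 0.1 and 2.0, the two clusters come out as two separate people.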

Multi-person pose systems must
scan the whole image detecting all people and their corresponding keypoints.
For this task, we integrate associative embedding with a stacked hourglass network [31], which
produces a detection heatmap and a tagging heatmap for each
body joint, and then groups body joints with similar tags
into individual people. Experiments demonstrate that our
approach outperforms all recent methods and achieves state
of the art results on MS-COCO [27] and MPII Multiperson
Pose [3, 35].

2、Many tasks can be viewed as joint detection and grouping

Many computer vision tasks can be viewed as joint detection and grouping: detecting smaller visual units and grouping them into larger structures. For example, multi-person pose estimation can be viewed as detecting body joints and grouping them into individual people; instance segmentation can be viewed as detecting relevant pixels and grouping them into object instances; multi-object tracking can be viewed as detecting object instances and grouping them into tracks. In all of these cases, the output is a variable number of visual units and their assignment into a variable number of visual groups.

3、Common grouping approaches

Common grouping approaches include spectral clustering [51, 46], conditional random fields (e.g. [31]), and generative probabilistic models (e.g. [21]). These grouping approaches all assume pre-detected basic visual units and pre-computed affinity measures between them but differ among themselves in the process of converting affinity measures into groups. In contrast, our approach performs detection and grouping in one stage using a generic network that includes no special design for grouping.

4、The dimension of the embeddings

Note that the dimension of the embeddings is not critical. If a network can successfully predict high-dimensional
embeddings to separate the detections into groups, it should
also be able to learn to project those high-dimensional embeddings to lower dimensions, as long as there is enough
network capacity. In practice we have found that 1D embedding is sufficient for multiperson pose estimation, and higher
dimensions do not lead to significant improvement. Thus
throughout this paper we assume 1D embeddings.

In practice, 1D embeddings already work very well.

5、The hourglass network


In this work we combine associative embedding with the
stacked hourglass architecture [40], a model for dense pixelwise prediction that consists of a sequence of modules each
shaped like an hourglass (Fig. 2). Each “hourglass” has a
standard set of convolutional and pooling layers that process
features down to a low resolution capturing the full context of the image. Then, these features are upsampled and
gradually combined with outputs from higher and higher resolutions until reaching the final output resolution. Stacking
multiple hourglasses enables repeated bottom-up and topdown inference to produce a more accurate final prediction.

The stacked hourglass model was originally developed
for single-person human pose estimation. The model outputs a heatmap for each body joint of a target person. Then,
the pixel with the highest heatmap activation is used as the
predicted location for that joint. The network is designed to
consolidate global and local features which serves to capture information about the full structure of the body while
preserving fine details for precise localization. This balance
between global and local features is just as important in
other pixel-wise prediction tasks, and we therefore apply the
same network towards both multiperson pose estimation and
instance segmentation.
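The bottom-up/top-down wiring of a single hourglass can be illustrated with a toy recursion. This is only a structural sketch: identity features stand in for the residual convolution blocks of the real model, and everything is single-channel:

```python
import numpy as np

def pool2(x):
    # 2x2 max pooling (H and W assumed even)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def up2(x):
    # nearest-neighbour 2x upsampling
    return x.repeat(2, axis=0).repeat(2, axis=1)

def hourglass(x, depth):
    """Toy hourglass: pool down `depth` times to capture full-image context,
    then upsample and merge with the skip branch kept at each resolution."""
    if depth == 0:
        return x                  # lowest resolution
    skip = x                      # features preserved at this resolution
    low = hourglass(pool2(x), depth - 1)
    return skip + up2(low)        # combine upsampled features with the skip

feat = np.random.rand(16, 16)
out = hourglass(feat, depth=3)
assert out.shape == feat.shape    # dense pixel-wise output, same resolution
```

Stacking several such modules (with intermediate supervision between them) gives the repeated bottom-up/top-down inference the quote describes.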

6、The two training losses

To train the network, we impose a detection loss and a grouping loss on the output heatmaps. The detection loss computes mean square error between each predicted detection heatmap and its "ground truth" heatmap which consists of a 2D gaussian activation at each keypoint location. This loss is the same as the one used by Newell et al. [40].

Two losses: a detection loss and a grouping loss.

Detection loss: the ground truth is rendered similarly to CornerNet, with each positive keypoint splatted onto its heatmap as a 2D Gaussian; the loss is then the mean squared error between the predicted and ground-truth heatmaps.
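A minimal sketch of this detection loss, assuming one person per joint channel for simplicity (in the multi-person case each channel contains a Gaussian at every visible person's keypoint):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Render a 2D Gaussian centred on a keypoint, as in the 'ground truth'
    detection heatmaps."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def detection_loss(pred, keypoints, sigma=2.0):
    """Mean squared error between predicted heatmaps (K x H x W) and the
    rendered Gaussian ground truth; `keypoints` holds one (cx, cy) per joint."""
    k, h, w = pred.shape
    gt = np.stack([gaussian_heatmap(h, w, cx, cy, sigma) for cx, cy in keypoints])
    return float(np.mean((pred - gt) ** 2))
```

A perfect prediction gives zero loss; any deviation from the Gaussian targets is penalized quadratically.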

Grouping loss: this is the important one!

The loss function shown in the paper's figure embodies the idea of small intra-class distance and large inter-class distance. The exact value of the reference embedding may drift during training, but a person's ground-truth joint locations do not change: the ground truth contains only the joint locations of each person, not specific 'tag' values. As the paper points out, the particular tag values do not matter; what matters is that the tags of one person's joints are close to each other, so that similar tags can be grouped into a single person. The reference embedding is the average of the predicted tags at a person's ground-truth joint locations. Training therefore pulls the tags of one person's joints toward a common value; that value is not known in advance, but the joints of the same group learn to produce similar values.
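The loss in that figure can be reconstructed, up to normalization constants, as follows. With $h_k(x)$ the value at pixel $x$ of the $k$-th tag map and $x_{nk}$ the ground-truth location of joint $k$ of person $n$, the reference embedding of person $n$ is the mean tag over that person's joints:

```latex
\bar{h}_n = \frac{1}{K}\sum_{k} h_k(x_{nk})
```

and the grouping loss combines a "pull" term (each person's tags toward their reference embedding) and a "push" term (different people's reference embeddings apart):

```latex
L_g(h, T) = \frac{1}{N}\sum_{n}\sum_{k}\bigl(\bar{h}_n - h_k(x_{nk})\bigr)^2
          + \frac{1}{N^2}\sum_{n}\sum_{n'}
            \exp\Bigl\{-\frac{1}{2\sigma^2}\bigl(\bar{h}_n - \bar{h}_{n'}\bigr)^2\Bigr\}
```

The exponential push term decays quickly once two reference embeddings are a few $\sigma$ apart, which is why the absolute tag values are irrelevant and only their separation matters.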

To train a network to predict the tags, we use a loss function that encourages pairs of tags to have similar values if
the corresponding detections belong to the same group in the
ground truth or dissimilar values otherwise. It is important to
note that we have no “ground truth” tags for the network to
predict, because what matters is not the particular tag values,
only the differences between them. The network has the
freedom to decide on the tag values as long as they agree
with the ground truth grouping.
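Up to normalization constants, this grouping loss on scalar tags can be sketched as below (for simplicity the push term here also includes the n = n' pairs, which only contribute a constant):

```python
import numpy as np

def grouping_loss(tags_per_person, sigma=1.0):
    """Associative-embedding grouping loss on 1D tags (a sketch).
    tags_per_person[n] is the array of predicted tag values sampled at
    person n's ground-truth joint locations."""
    # reference embedding: mean tag of each person's joints
    refs = np.array([t.mean() for t in tags_per_person])
    # pull: each person's tags toward their own reference embedding
    pull = np.mean([np.mean((t - r) ** 2)
                    for t, r in zip(tags_per_person, refs)])
    # push: penalize pairs of people whose reference embeddings are close
    diff = refs[:, None] - refs[None, :]
    push = np.mean(np.exp(-diff ** 2 / (2 * sigma ** 2)))
    return float(pull + push)
```

Two people with identical within-person tags but well-separated means incur a lower loss than two people whose means nearly coincide, which is exactly the gradient signal that spreads the groups apart.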

As the paper's visualization shows, the tag values produced at the keypoint locations of the same person are very close to one another.

7、Multi-scale at training and test time

While it is feasible to train a network to make pose predictions for people of all scales, there are some drawbacks. Extra capacity is required of the network to learn the necessary scale invariance, and the precision of predictions for small people will suffer due to issues of low resolution after pooling. To account for this, we evaluate images at test time at multiple scales. There are a number of potential ways to use the output from each scale to produce a final set of pose predictions. For our purposes, we take the produced heatmaps and average them together. Then, to combine tags across scales, we concatenate the set of tags at a pixel location into a vector v ∈ R^m (assuming m scales). The decoding process does not change from the method described with scalar tag values; we now just compare vector distances.

Multi-scale is used only at test time, not during training. To cover people of all scales in an image, the paper averages the heatmaps obtained at the different scales, concatenates the tags at corresponding pixel locations across scales into a vector, and then computes distances between these vectors during grouping.
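The test-time fusion can be sketched as follows, assuming the per-scale outputs have already been resized to a common resolution (function names here are illustrative):

```python
import numpy as np

def fuse_scales(heatmaps, tagmaps):
    """Multi-scale test-time fusion (sketch). `heatmaps` and `tagmaps` are
    lists of K x H x W arrays, one per scale, at a common resolution.
    Detection heatmaps are averaged; tags are concatenated per pixel into an
    m-dimensional vector (m scales)."""
    fused_heat = np.mean(np.stack(heatmaps), axis=0)   # K x H x W
    fused_tags = np.stack(tagmaps, axis=-1)            # K x H x W x m
    return fused_heat, fused_tags

def tag_distance(v1, v2):
    # decoding is unchanged except scalar differences become vector distances
    return float(np.linalg.norm(v1 - v2))
```

Grouping then proceeds exactly as in the single-scale case, with `tag_distance` in place of the absolute difference of scalar tags.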

8、Performance comparison

[Results tables from the original paper: state-of-the-art results on MS-COCO and MPII Multi-Person.]



Reposted from blog.csdn.net/j879159541/article/details/102644005