Associative Embedding: End-to-End Learning for Joint Detection and Grouping (Paper Translation)

Abstract:

We introduce associative embedding, a novel method for supervising convolutional neural networks for the task of detection and grouping.

A number of computer vision problems can be framed in this manner including multi-person pose estimation, instance segmentation, and multi-object tracking.

This technique can be easily integrated into any state-of-the-art network architecture that produces pixel-wise predictions.

We show how to apply this method to both multi-person pose estimation and instance segmentation and report state-of-the-art performance for multi-person pose on the MPII and MS-COCO datasets.

1. Introduction

Many computer vision tasks can be viewed as joint detection and grouping: detecting smaller visual units and grouping them into larger structures. For example, multi-person pose estimation can be viewed as detecting body joints and grouping them into individual people; instance segmentation can be viewed as detecting relevant pixels and grouping them into object instances; multi-object tracking can be viewed as detecting object instances and grouping them into tracks. In all of these cases, the output is a variable number of visual units and their assignment into a variable number of visual groups.

Such tasks are often approached with two-stage pipelines that perform detection first and grouping second. But such approaches may be suboptimal because detection and grouping are usually tightly coupled: for example, in multiperson pose estimation, a wrist detection is likely a false positive if there is not an elbow detection nearby to group with.

In this paper we ask whether it is possible to jointly perform detection and grouping using a single-stage deep network trained end-to-end. We propose associative embedding, a novel method to represent the output of joint detection and grouping. The basic idea is to introduce, for each detection, a real number that serves as a “tag” to identify the group the detection belongs to. In other words, the tags associate each detection with other detections in the same group.

Consider the special case of detections in 2D and embeddings in 1D (real numbers). The network outputs both a heatmap of per-pixel detection scores and a heatmap of per-pixel identity tags. The detections and groups are then decoded from these two heatmaps.

To train a network to predict the tags, we use a loss function that encourages pairs of tags to have similar values if the corresponding detections belong to the same group in the ground truth or dissimilar values otherwise. It is important to note that we have no “ground truth” tags for the network to predict, because what matters is not the particular tag values, only the differences between them. The network has the freedom to decide on the tag values as long as they agree with the ground truth grouping.

We apply our approach to multiperson pose estimation, an important task for understanding humans in images. Concretely, given an input image, multi-person pose estimation seeks to detect each person and localize their body joints. Unlike single-person pose there are no prior assumptions of a person’s location or size. Multi-person pose systems must scan the whole image detecting all people and their corresponding keypoints. For this task, we integrate associative embedding with a stacked hourglass network [31], which produces a detection heatmap and a tagging heatmap for each body joint, and then groups body joints with similar tags into individual people. Experiments demonstrate that our approach outperforms all recent methods and achieves state of the art results on MS-COCO [27] and MPII Multiperson Pose.

We further demonstrate the utility of our method by applying it to instance segmentation, showing that it is straightforward to apply associative embedding to a variety of vision tasks that fit under the umbrella of detection and grouping.

Our contributions are twofold: (1) we introduce associative embedding, a new method for single-stage, end-to-end joint detection and grouping. This method is simple and generic; it works with any network architecture that produces pixel-wise predictions; (2) we apply associative embedding to multiperson pose estimation and achieve state-of-the-art results on two standard benchmarks.

2. Related Work

Vector Embeddings: Our method is related to many prior works that use vector embeddings. Works in image retrieval have used vector embeddings to measure similarity between images [17,53]. Works in image classification, image captioning, and phrase localization have used vector embeddings to connect visual features and text features by mapping them to the same vector space [16,20,30]. Works in natural language processing have used vector embeddings to represent the meaning of words, sentences, and paragraphs [39,32]. Our work differs from these prior works in that we use vector embeddings as identity tags in the context of joint detection and grouping.

Perceptual Organization: Work in perceptual organization aims to group the pixels of an image into regions, parts, and objects. Perceptual organization encompasses a wide range of tasks of varying complexity from figure-ground segmentation [37] to hierarchical image parsing [21]. Prior works typically use a two stage pipeline [38], detecting basic visual units (patches, superpixels, parts, etc.) first and grouping them second. Common grouping approaches include spectral clustering [51,46], conditional random fields (e.g. [31]), and generative probabilistic models (e.g. [21]). These grouping approaches all assume predetected basic visual units and precomputed affinity measures between them but differ among themselves in the process of converting affinity measures into groups. In contrast, our approach performs detection and grouping in one stage using a generic network that includes no special design for grouping.

It is worth noting a close connection between our approach and those using spectral clustering. Spectral clustering (e.g. normalized cuts [46]) techniques take as input precomputed affinities (such as predicted by a deep network) between visual units and solve a generalized eigenproblem to produce embeddings (one per visual unit) that are similar for visual units with high affinity. Angular Embedding [37, 47] extends spectral clustering by embedding depth ordering as well as grouping. Our approach differs from spectral clustering in that we have no intermediate representation of affinities nor do we solve any eigenproblems. Instead our network directly outputs the final embeddings.

Our approach is also related to the work by Harley et al. on learning dense convolutional embeddings [24], which trains a deep network to produce pixel-wise embeddings for the task of semantic segmentation. Our work differs from theirs in that our network produces not only pixel-wise embeddings but also pixel-wise detection scores. Our novelty lies in the integration of detection and grouping into a single network; to the best of our knowledge such an integration has not been attempted for multiperson human pose estimation.

Multiperson Pose Estimation

Recent methods have made great progress improving human pose estimation in images in particular for single person pose estimation [50, 48, 52,40, 8, 5, 41, 4, 14, 19, 34, 26, 7, 49, 44]. For multiperson pose, prior and concurrent work can be categorized as either top-down or bottom-up. Top-down approaches [42, 25, 15] first detect individual people and then estimate each person’s pose. Bottom-up approaches [45, 28, 29, 6] instead detect individual body joints and then group them into individuals. Our approach more closely resembles bottom-up approaches but differs in that there is no separation of a detection and grouping stage. The entire prediction is done at once by a single-stage, generic network. This does away with the need for complicated post-processing steps required by other methods [6, 28].

Instance Segmentation

Most existing instance segmentation approaches employ a multi-stage pipeline to do detection followed by segmentation [23,18,22,11]. Dai et al. [12] made such a pipeline differentiable through a special layer that allows backpropagation through spatial coordinates.

Two recent works have sought tighter integration of detection and segmentation using fully convolutional networks. DeepMask [43] densely scans subwindows and outputs a detection score and a segmentation mask (reshaped to a vector) for each subwindow. Instance-Sensitive FCN [10] treats each object as composed of a set of object parts in a regular grid, and outputs a per-pixel heatmap of detection scores for each object part. Instance-Sensitive FCN (IS-FCN) then detects object instances where the part detection scores are spatially coherent, and assembles object masks from the heatmaps of object parts. Compared to DeepMask and IS-FCN, our approach is substantially simpler: for each object category we output only two values at each pixel location, a score representing foreground versus background, and a tag representing the identity of an object instance, whereas both DeepMask and IS-FCN produce much higher dimensional output.

3. Approach

3.1 Overview

To introduce associative embedding for joint detection and grouping, we first review the basic formulation of visual detection. Many visual tasks involve detection of a set of visual units. These tasks are typically formulated as scoring of a large set of candidates. For example, single-person human pose estimation can be formulated as scoring candidate body joint detections at all possible pixel locations. Object detection can be formulated as scoring candidate bounding boxes at various pixel locations, scales, and aspect ratios.

The idea of associative embedding is to predict an embedding for each candidate in addition to the detection score. The embeddings serve as tags that encode grouping: detections with similar tags should be grouped together. In multiperson pose estimation, body joints with similar tags should be grouped to form a single person. It is important to note that the absolute values of the tags do not matter, only the distances between tags. That is, a network is free to assign arbitrary values to the tags as long as the values are the same for detections belonging to the same group.

Note that the dimension of the embeddings is not critical. If a network can successfully predict high-dimensional embeddings to separate the detections into groups, it should also be able to learn to project those high-dimensional embeddings to lower dimensions, as long as there is enough network capacity. In practice we have found that 1D embedding is sufficient for multiperson pose estimation, and higher dimensions do not lead to significant improvement. Thus throughout this paper we assume 1D embeddings.

To train a network to predict the tags, we enforce a loss that encourages similar tags for detections from the same group and different tags for detections across different groups. Specifically, this tagging loss is enforced on candidate detections that coincide with the ground truth. We compare pairs of detections and define a penalty based on the relative values of the tags and whether the detections should be from the same group.

3.2. Stacked Hourglass Architecture

In this work we combine associative embedding with the stacked hourglass architecture [40], a model for dense pixel-wise prediction that consists of a sequence of modules each shaped like an hourglass (Fig. 2). Each “hourglass” has a standard set of convolutional and pooling layers that process features down to a low resolution capturing the full context of the image. Then, these features are upsampled and gradually combined with outputs from higher and higher resolutions until reaching the final output resolution. Stacking multiple hourglasses enables repeated bottom-up and top-down inference to produce a more accurate final prediction. We refer the reader to [40] for more details of the network architecture.

The stacked hourglass model was originally developed for single-person human pose estimation. The model outputs a heatmap for each body joint of a target person. Then, the pixel with the highest heatmap activation is used as the predicted location for that joint. The network is designed to consolidate global and local features which serves to capture information about the full structure of the body while preserving fine details for precise localization. This balance between global and local features is just as important in other pixel-wise prediction tasks, and we therefore apply the same network towards both multiperson pose estimation and instance segmentation.

We make some slight modifications to the network architecture. We increase the number of output features at each drop in resolution (256 -> 386 -> 512 -> 768). In addition, individual layers are composed of 3x3 convolutions instead of residual modules; the shortcut effect that eases training is still present from the residual links across each hourglass as well as the skip connections at each resolution.

Figure 3. An overview of our approach for producing multi-person pose estimates. For each joint of the body, the network simultaneously produces detection heatmaps and predicts associative embedding tags. We take the top detections for each joint and match them to other detections that share the same embedding tag to produce a final set of individual pose predictions.

3.3. Multiperson Pose Estimation

To apply associative embedding to multiperson pose estimation, we train the network to detect joints as performed in single-person pose estimation [40]. We use the stacked hourglass model to predict a detection score at each pixel location for each body joint (“left wrist”, “right shoulder”, etc.) regardless of person identity. The difference from single-person pose is that an ideal heatmap for multiple people should have multiple peaks (e.g. to identify multiple left wrists belonging to different people), as opposed to just a single peak for a single target person.

In addition to producing the full set of keypoint detections, the network automatically groups detections into individual poses. To do this, the network produces a tag at each pixel location for each joint. In other words, each joint heatmap has a corresponding “tag” heatmap. So, if there are m body joints to predict then the network will output a total of 2m channels, m for detection and m for grouping. To parse detections into individual people, we use non-maximum suppression to get the peak detections for each joint and retrieve their corresponding tags at the same pixel location (illustrated in Fig. 3). We then group detections across body parts by comparing the tag values of detections and matching up those that are close enough. A group of detections now forms the pose estimate for a single person.
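The peak-and-tag lookup described above can be sketched in a few lines of plain Python. This is only an illustration for a single joint type: the function name `find_peaks`, the 3x3 suppression window, and the `threshold` value are assumptions, not details given in the paper.

```python
def find_peaks(det, tag, threshold=0.1):
    """Return (score, (row, col), tag) for local maxima of `det` above `threshold`.

    `det` and `tag` are 2D lists of equal shape for one joint type:
    the detection heatmap and its corresponding tag heatmap.
    """
    rows, cols = len(det), len(det[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = det[r][c]
            if score < threshold:
                continue
            # 3x3 non-maximum suppression (assumed window size):
            # keep a pixel only if it beats every neighbour.
            neighbours = [
                det[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(score > n for n in neighbours):
                # Read the tag at the same pixel location as the peak.
                peaks.append((score, (r, c), tag[r][c]))
    # Strongest detections first.
    return sorted(peaks, reverse=True)
```

Each returned tuple pairs a detection with the tag stored at the same pixel, which is the input the grouping step needs.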

To train the network, we impose a detection loss and a grouping loss on the output heatmaps. The detection loss computes the mean squared error between each predicted detection heatmap and its “ground truth” heatmap, which consists of a 2D Gaussian activation at each keypoint location. This loss is the same as the one used by Newell et al. [40].
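As a rough illustration of the detection loss, the sketch below builds a ground truth heatmap with one Gaussian per keypoint and compares it to a prediction. The value of `sigma`, the unit peak height, and the use of a per-pixel maximum where Gaussians overlap are assumptions; the text does not specify them.

```python
import math

def gaussian_heatmap(height, width, keypoints, sigma=1.0):
    """Ground truth heatmap with a 2D Gaussian at each (row, col) keypoint."""
    heat = [[0.0] * width for _ in range(height)]
    for kr, kc in keypoints:
        for r in range(height):
            for c in range(width):
                g = math.exp(-((r - kr) ** 2 + (c - kc) ** 2) / (2 * sigma ** 2))
                # Overlapping Gaussians are merged with a max (assumption).
                heat[r][c] = max(heat[r][c], g)
    return heat

def mse(pred, target):
    """Mean squared error between two equally sized 2D heatmaps."""
    n = len(pred) * len(pred[0])
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ) / n
```

With several people in the image, the target for one joint type simply contains several such peaks.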

The grouping loss assesses how well the predicted tags agree with the ground truth grouping. Specifically, we retrieve the predicted tags for all body joints of all people at their ground truth locations; we then compare the tags within each person and across people. Tags within a person should be the same, while tags across people should be different.

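{Translator's note: in statistics and machine learning, "ground truth" refers to the correct labels of the training set in supervised learning, against which classification accuracy is measured and hypotheses are confirmed or refuted. Supervised learning relies on labeled training data; if those labels are wrong, predictions on test data suffer. The correctly labeled data is therefore called the ground truth.}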

Rather than enforce the loss across all possible pairs of keypoints, we produce a reference embedding for each person. This is done by taking the mean of the output embeddings of the person’s joints. Within an individual, we compute the squared distance between the reference embedding and the predicted embedding for each joint. Then, between pairs of people, we compare their reference embeddings to each other with a penalty that drops exponentially to zero as the distance between the two tags increases.
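A minimal sketch of this two-part loss in plain Python follows. The function name `grouping_loss`, the Gaussian form of the exponential penalty with bandwidth `sigma`, and the normalization by the number of people are assumptions made for illustration. Pairs of a person with themselves are skipped here, since they would only contribute a constant.

```python
import math

def grouping_loss(tags_per_person, sigma=1.0):
    """tags_per_person: one list of predicted joint tags per annotated person."""
    n = len(tags_per_person)
    # Reference embedding: mean of the tags of a person's joints.
    refs = [sum(tags) / len(tags) for tags in tags_per_person]
    # "Pull" term: squared distance of each joint tag to its person's reference.
    pull = sum(
        (ref - t) ** 2
        for ref, tags in zip(refs, tags_per_person)
        for t in tags
    ) / n
    # "Push" term: penalty between references of different people that
    # decays exponentially as the references move apart.
    push = sum(
        math.exp(-((refs[a] - refs[b]) ** 2) / (2 * sigma ** 2))
        for a in range(n)
        for b in range(n)
        if a != b
    ) / (n * n)
    return pull + push
```

For example, two people tagged `[1.0, 1.0]` and `[5.0, 5.0]` give a loss close to zero, while tags `[1.0, 5.0]` for both people are penalized by both terms.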

Formally, let h_k ∈ R^{W×H} be the predicted tagging heatmap for the k-th body joint, where h(x) is a tag value at pixel location x. Given N people, let the ground truth body joint locations be T = {(x_nk)}, n = 1, ..., N, k = 1, ..., K, where x_nk is the ground truth pixel location of the k-th body joint of the n-th person.

Assuming all K joints are annotated, the reference embedding for the nth person would be
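Following the verbal description (the mean of the K predicted tags read at that person's ground truth joint locations), the reference embedding can be written as below. The notation \bar{h}_n is chosen here for illustration.

```latex
\bar{h}_n = \frac{1}{K} \sum_{k=1}^{K} h_k(x_{nk})
```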

The grouping loss Lg is then defined as
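A form consistent with the description (a squared-distance term within each person plus an exponentially decaying penalty between people) is given below. Here \bar{h}_n denotes the reference embedding of the n-th person, i.e. the mean of that person's joint tags, and \sigma is a bandwidth parameter whose value is not stated in this text; treat the exact normalization as a reconstruction rather than a verbatim quote.

```latex
L_g(h, T) = \frac{1}{N} \sum_{n} \sum_{k} \left( \bar{h}_n - h_k(x_{nk}) \right)^2
          + \frac{1}{N^2} \sum_{n} \sum_{n'} \exp\left\{ -\frac{1}{2\sigma^2} \left( \bar{h}_n - \bar{h}_{n'} \right)^2 \right\}
```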

To produce a final set of predictions we iterate through each joint one by one. An ordering is determined by first considering joints around the head and torso and gradually moving out to the limbs. We start with our first joint and take all activations above a certain threshold after non-maximum suppression. These form the basis for our initial pool of detected people.

We then consider the detections of a subsequent joint. We compare the tags from this joint to the tags of our current pool of people, and try to determine the best matching between them. Two tags can only be matched if they fall within a specific threshold. In addition, we want to prioritize matching of high confidence detections. We thus perform a maximum matching where the weighting is determined by both the tag distance and the detection score. If any new detection is not matched, it is used to start a new person instance. This accounts for cases where perhaps only a leg or hand is visible for a particular person.
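The step for one joint type can be sketched as follows. Note that this sketch swaps in a simple greedy assignment as a stand-in for the maximum matching described above, and the weighting `dist - score`, the `tag_threshold` value, and the function name `assign_joint` are all hypothetical choices for illustration.

```python
def assign_joint(people, detections, tag_threshold=1.0):
    """Greedily attach detections of one joint type to existing people.

    people: list of dicts, each with a "tags" list (tags of joints assigned so far).
    detections: list of (score, tag) tuples for the joint being processed.
    """
    # Collect candidate (detection, person) pairs whose tags are close enough.
    pairs = []
    for d, (score, tag) in enumerate(detections):
        for p, person in enumerate(people):
            ref = sum(person["tags"]) / len(person["tags"])
            dist = abs(tag - ref)
            if dist < tag_threshold:
                # Hypothetical weighting: small tag distance and high score win.
                pairs.append((dist - score, d, p))
    pairs.sort()
    used_d, used_p = set(), set()
    for _, d, p in pairs:
        if d not in used_d and p not in used_p:
            people[p]["tags"].append(detections[d][1])
            used_d.add(d)
            used_p.add(p)
    # Any unmatched detection starts a new person instance.
    for d, (_, tag) in enumerate(detections):
        if d not in used_d:
            people.append({"tags": [tag]})
    return people
```

Calling this once per joint, in the head-to-limbs order, grows the pool of people until every detection is assigned.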

We loop through each joint of the body until every detection has been assigned to a person. No steps are taken to ensure anatomical correctness or reasonable spatial relationships between pairs of joints. To give an impression of the types of tags produced by the network and the trivial nature of grouping we refer to Figure 4.

Figure 4. Tags produced by our network on a held-out validation image from the MS-COCO training set. The tag values are already well separated and decoding the groups is straightforward.

While it is feasible to train a network to make pose predictions for people of all scales, there are some drawbacks. Extra capacity is required of the network to learn the necessary scale invariance, and the precision of predictions for small people will suffer due to issues of low resolution after pooling. To account for this, we evaluate images at test time at multiple scales. There are a number of potential ways to use the output from each scale to produce a final set of pose predictions. For our purposes, we take the produced heatmaps and average them together. Then, to combine tags across scales, we concatenate the set of tags at a pixel location into a vector v ∈ R^m (assuming m scales). The decoding process does not change from the method described with scalar tag values; we now just compare vector distances.
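The multi-scale combination can be illustrated with a small sketch. It assumes the per-scale outputs have already been resized to a common resolution, and it uses Euclidean distance between tag vectors; both points, and the helper names, are assumptions rather than details from the paper.

```python
import math

def average_heatmaps(heatmaps):
    """Average m equally sized 2D detection heatmaps, one per scale."""
    m = len(heatmaps)
    rows, cols = len(heatmaps[0]), len(heatmaps[0][0])
    return [
        [sum(h[r][c] for h in heatmaps) / m for c in range(cols)]
        for r in range(rows)
    ]

def tag_vector(tag_maps, r, c):
    """Concatenate the tags predicted at pixel (r, c) across m scales."""
    return [t[r][c] for t in tag_maps]

def tag_distance(u, v):
    """Euclidean distance between two tag vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

With a single scale, `tag_vector` returns a one-element list and `tag_distance` reduces to the absolute difference of scalar tags, so the decoding logic stays the same.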

Figure 5. To produce instance segmentations we decode the network output as follows: first we threshold on the detection heatmap, and the resulting binary mask is used to get a set of tag values. By looking at the distribution of tags we can determine identifier tags for each instance and match the tag of each activated pixel to the closest identifier.

3.4. Instance Segmentation

The goal of instance segmentation is to detect and classify object instances while providing a segmentation mask for each object. As a proof of concept we show how to apply our approach to this problem, and demonstrate preliminary results. Like multi-person pose estimation, instance segmentation is a problem of joint detection and grouping. Pixels belonging to an object class are detected, and then those associated with a single object are grouped together. For simplicity the following description of our approach assumes only one object category.

Given an input image, we use a stacked hourglass network to produce two heatmaps, one for detection and one for tagging. The detection heatmap gives a detection score at each pixel indicating whether the pixel belongs to any instance of the object category, that is, the detection heatmap segments the foreground from background. At the same time, the tagging heatmap tags each pixel such that pixels belonging to the same object instance have similar tags.

To train the network, we supervise the detection heatmap by comparing the predicted heatmap with the ground truth heatmap (the union of all instance masks). The loss is the mean squared error between the two heatmaps. We supervise the tagging heatmap by imposing a loss that encourages the tags to be similar within an object instance and different across instances. The formulation of the loss is similar to that for multiperson pose. There is no need to do a comparison of every pixel in an instance segmentation mask. Instead we randomly sample a small set of pixels from each object instance and do pairwise comparisons across the group of sampled pixels.

Formally, let h ∈ R^{W×H} be a predicted W × H tagging heatmap. Let x denote a pixel location and h(x) the tag at the location, and let S_n = {x_kn}, k = 1, ..., K be a set of locations randomly sampled within the n-th object instance. The grouping loss L_g is defined as
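A form consistent with the description above (pairwise comparisons among sampled pixels, pulling tags together within an instance and pushing them apart across instances) is the following. The bandwidth \sigma and the absence of a normalizing factor are part of this reconstruction and should be checked against the original paper.

```latex
L_g(h, T) = \sum_{n} \sum_{x \in S_n} \sum_{x' \in S_n} \left( h(x) - h(x') \right)^2
          + \sum_{n} \sum_{n'} \sum_{x \in S_n} \sum_{x' \in S_{n'}} \exp\left\{ -\frac{1}{2\sigma^2} \left( h(x) - h(x') \right)^2 \right\}
```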

To decode the output of the network, we first threshold on the detection channel heatmap to produce a binary mask. Then, we look at the distribution of tags within this mask. We calculate a histogram of the tags and perform non-maximum suppression to determine a set of values to use as identifiers for each object instance. Each pixel from the detection mask is then assigned to the object with the closest tag value. See Figure 5 for an illustration of this process.
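The four decoding steps can be sketched in plain Python as below. The threshold, the histogram bin width, the minimum bin count, and the use of bin centers as identifiers are all assumed values chosen for illustration.

```python
def decode_instances(det, tag, det_threshold=0.5, bin_width=1.0, min_count=2):
    """Map each foreground pixel (row, col) to an instance index."""
    # 1. Threshold the detection heatmap to get the foreground mask.
    fg = [
        (r, c)
        for r, row in enumerate(det)
        for c, score in enumerate(row)
        if score > det_threshold
    ]
    if not fg:
        return {}
    # 2. Histogram of tag values inside the mask.
    lo = min(tag[r][c] for r, c in fg)
    hist = {}
    for r, c in fg:
        b = int((tag[r][c] - lo) // bin_width)
        hist[b] = hist.get(b, 0) + 1
    # 3. 1D non-maximum suppression: keep bins that are local maxima.
    ids = [
        lo + (b + 0.5) * bin_width  # bin center used as the identifier
        for b, n in sorted(hist.items())
        if n >= min_count and n >= hist.get(b - 1, 0) and n > hist.get(b + 1, 0)
    ]
    if not ids:
        return {}
    # 4. Assign each foreground pixel to the closest identifier.
    return {
        (r, c): min(range(len(ids)), key=lambda i: abs(tag[r][c] - ids[i]))
        for r, c in fg
    }
```

The returned dictionary is a sparse instance mask: pixels sharing an index belong to the same object.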

Note that it is straightforward to generalize from one object category to multiple: we simply output a detection heatmap and a tagging heatmap for each object category. As with multi-person pose, the issue of scale invariance is worth consideration. Rather than train a network to recognize the appearance of an object instance at every possible scale, we evaluate at multiple scales and combine predictions in a similar manner to that done for pose estimation.

Figure 6. Qualitative pose estimation results on MSCOCO validation images

4. Experiments

4.1 Multiperson Pose Estimation

Dataset: We evaluate on two datasets: MS-COCO [35] and MPII Human Pose [3]. MPII Human Pose consists of about 25k images and contains around 40k total annotated people (three-quarters of which are available for training). Evaluation is performed on MPII Multi-Person, a set of 1758 groups of multiple people taken from the test set as outlined in [45]. The groups for MPII Multi-Person are usually a subset of the total people in a particular image, so some information is provided to make sure predictions are made on the correct targets. This includes a general bounding box and scale term used to indicate the occupied region. No information is provided on the number of people or the scales of individual figures. We use the evaluation metric outlined by Pishchulin et al. [45], calculating average precision of joint detections.

Figure 7. Here we visualize the associative embedding channels for different joints. The change in embedding predictions across joints is particularly apparent in these examples where there is significant overlap of the two target figures.

MS-COCO [35] consists of around 60K training images with more than 100K people with annotated keypoints. We report performance on two test sets, a development test set (test-dev) and a standard test set (test-std). We use the official evaluation metric that reports average precision (AP) and average recall (AR) in a manner similar to object detection except that a score based on keypoint distance is used instead of bounding box overlap. We refer the reader to the MS-COCO website for details [1].

MS-COCO [35]由大约60K的训练图像组成，超过10万人具有注释关键点。我们报告了两个测试集的性能，一个开发测试集（test-dev）和一个标准测试集（test-std）。我们使用官方评估指标，以类似于对象检测的方式报告平均精度（AP）和平均召回（AR），除了使用基于关键点距离的分数而不是边界框重叠。我们将读者推荐到MS-COCO网站了解详情[1]。
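The keypoint-distance score mentioned above is the object keypoint similarity (OKS) used by the MS-COCO keypoint evaluation. The following is a minimal sketch of that score for a single predicted/ground-truth pair, assuming the commonly stated form in which each labeled keypoint contributes exp(-d² / (2·s²·k²)), with s² taken as the object area. The function name `oks` and the constants passed as `k` are illustrative placeholders, not the official per-joint values.

```python
import math


def oks(pred, gt, visible, area, k):
    """Object keypoint similarity between one predicted and one ground-truth pose.

    pred, gt : lists of (x, y) keypoint coordinates
    visible  : ground-truth visibility flags (> 0 means the keypoint is labeled)
    area     : ground-truth object area, used here as the squared scale s**2
    k        : per-keypoint falloff constants (placeholder values, see lead-in)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visible, k):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute to the score
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * ki ** 2))
        count += 1
    return total / count if count else 0.0
```

OKS plays the role that bounding-box IoU plays in object detection: a prediction counts as correct at a given threshold when its OKS with a ground-truth person exceeds that threshold.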

Implementation The network used for this task consists of four stacked hourglass modules, with an input size of 512×512 and an output resolution of 128×128. We train the network using a batch size of 32 with a learning rate of 2e-4 (dropped to 1e-5 after 100k iterations) using TensorFlow [2]. The associative embedding loss is weighted by a factor of 1e-3 relative to the MSE loss of the detection heatmaps. The loss is masked to ignore crowds with sparse annotations. At test time an input image is run at multiple scales; the output detection heatmaps are averaged across scales, and the tags across scales are concatenated into higher dimensional tags. Since the metrics of MPII and MS-COCO are both sensitive to the precise localization of keypoints, following prior work [6], we apply a single-person pose model [40] trained on the same dataset to further refine predictions.

实现用于此任务的网络由四个堆叠沙漏模块组成，输入大小为512×512，输出分辨率为128×128。我们使用Tensorflow [2]使用32的批量训练网络，学习率为2e-4（在100k迭代后降至1e-5）。相关嵌入损耗相对于检测热图的MSE损失加权1e-3倍。掩盖损失以忽略具有稀疏注释的人群。在测试时，输入图像以多个比例运行;输出检测热图按比例平均，并且跨比例的标签被连接成更高维度的标签。由于MPII和MS-COCO的指标对关键点的精确定位都很敏感，因此在之前的工作[6]之后，我们应用在同一数据集上训练的单人姿势模型[40]来进一步完善预测。
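The test-time combination described above (average the detection heatmaps, concatenate the tags) can be sketched as follows. This is an illustration only: it assumes the per-scale outputs for one joint have already been resized to a common resolution and are stored as nested lists, and the function name `merge_scales` is hypothetical.

```python
def merge_scales(heatmaps, tagmaps):
    """Combine per-scale network outputs for a single joint.

    heatmaps : list over scales of H x W detection heatmaps (nested lists),
               assumed already resized to a common resolution
    tagmaps  : list over scales of H x W tag maps (one scalar tag per pixel)
    Returns the mean heatmap and an H x W map of tag vectors that hold
    one entry per scale.
    """
    n = len(heatmaps)
    h, w = len(heatmaps[0]), len(heatmaps[0][0])
    # Detection scores are averaged across scales.
    mean_heat = [[sum(hm[y][x] for hm in heatmaps) / n for x in range(w)]
                 for y in range(h)]
    # Tags are not averaged: each pixel keeps its tag from every scale.
    tag_vectors = [[[tm[y][x] for tm in tagmaps] for x in range(w)]
                   for y in range(h)]
    return mean_heat, tag_vectors
```

Concatenating rather than averaging the tags matters because tag values are only meaningful relative to one another within a single forward pass; the value assigned to a given person at one scale need not match the value at another scale.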

MPII Results Average precision results can be seen in Table 1 demonstrating an improvement over state-of-the-art methods in overall AP. Associative embedding proves to be an effective method for teaching the network to group keypoint detections into individual people. It requires no assumptions about the number of people present in the image, and also offers a mechanism for the network to express confusion of joint assignments. For example, if the same joint of two people overlaps at the exact same pixel location, the predicted associative embedding will be a tag somewhere between the respective tags of each person.

MPII结果平均精确度结果可以在表1中看到，表明对整个AP中的最新方法的改进。关联嵌入被证明是教导网络将关键点检测分组到个人中的有效方法。它不需要对图像中存在的人数进行假设，也提供了一种网络表达联合任务混淆的机制。例如，如果两个人的相同关节在完全相同的像素位置处重叠，则预测的关联嵌入将是每个人的相应标签之间某处的标签。

We can get a better sense of the associative embedding output with visualizations of the embedding heatmap (Figure 7). We put particular focus on the difference in the predicted embeddings when people overlap heavily as the severe occlusion and close spacing of detected joints make it much more difficult to parse out the poses of individual people.

我们可以通过嵌入热图的可视化更好地了解关联嵌入输出（图7）。当人们严重遮挡时，我们特别关注预测嵌入的差异，因为严重的遮挡和检测到的关节的紧密间距使得解析个体姿势变得更加困难。

MS-COCO Results Table 2 and Table 3 report our results on MS-COCO. We report results on both test-std and test-dev because not all recent methods report on test-std. We see that on both sets we achieve state-of-the-art performance. An illustration of the network’s predictions can be seen in Figure 6. Typical failure cases of the network stem from overlapping and occluded joints in cluttered scenes. Table 4 reports performance of ablated versions of our full pipeline, showing the contributions from applying our model at multiple scales and from further refinement using a single-person pose estimator. We see that simply applying our network at multiple scales already achieves competitive performance against prior state-of-the-art methods, demonstrating the effectiveness of our end-to-end joint detection and grouping.

MS-COCO结果表2和表3报告了我们在MS-COCO上的结果。我们在test-std和test-dev上报告结果，因为并非所有最近的方法都报告了test-std。我们看到，在两套设备上，我们都达到了最先进的性能。网络预测的图示可以在图6中看到。网络的典型故障情况源于杂乱场景中的重叠和闭塞关节。表4报告了我们完整管道的消融版本的性能，显示了在多个尺度上应用我们的模型以及使用单人姿势估计器进一步细化的贡献。我们看到，简单地在多个尺度上应用我们的网络已经达到了与先前技术方法相比的竞争性能，证明了我们的端到端联合检测和分组的有效性。

We also perform an additional experiment on MS-COCO to gauge the relative difficulty of detection versus grouping, that is, which part is the main bottleneck of our system. We evaluate our system on a held-out set of 500 training images. In this evaluation, we replace the predicted detections with the ground truth detections but still use the predicted tags. Using the ground truth detections improves AP from 59.2 to 94.0. This shows that keypoint detection is the main bottleneck of our system, whereas the network has learned to produce high quality grouping. This fact is also supported by qualitative inspection of the predicted tag values, as shown in Figure 4, from which we can see that the tags are well separated and decoding the grouping is straightforward.

我们还对MS-COCO进行了额外的实验，以衡量检测与分组的相对难度，即哪个部分是我们系统的主要瓶颈。我们在一组500个训练图像上评估我们的系统。在此评估中，我们将预测的检测结果替换为地面实况检测，但仍使用预测的标签。使用地面实况检测可将AP从59.2提高到94.0。这表明关键点检测是我们系统的主要瓶颈，而网络已经学会了产生高质量的分组。如图4所示，对预测标签值的定性检查也支持这一事实，从中可以看出标签分离良好，解码分组很简单。

4.2. Instance Segmentation

Dataset For evaluation we use the val split of PASCAL VOC 2012 [13] consisting of 1,449 images. Additional pretraining is done with images from MS COCO [35]. Evaluation is done using mean average precision of instance segments at different IOU thresholds [22, 10, 36].

数据集对于评估，我们使用PASCAL VOC 2012 [13]的val分割，包括1,449张图像。使用来自MS COCO [35]的图像进行额外的预训练。使用不同IOU阈值处的实例段的平均精度来完成评估。 [22,10,36]。
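To make the metric concrete, here is a rough sketch of average precision for instance segments of one category at a single IOU threshold. It assumes masks are represented as sets of (row, col) pixels and uses a plain uninterpolated AP; the exact interpolation used by the benchmark protocols cited above may differ, and both function names are illustrative.

```python
def mask_iou(a, b):
    """IOU of two instance masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def average_precision(preds, gts, thr):
    """Uninterpolated AP for one category.

    preds : list of (score, mask) predictions
    gts   : list of ground-truth masks
    thr   : IOU threshold for counting a prediction as a true positive
    """
    matched = set()
    tp, ap = 0, 0.0
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    for rank, (_, mask) in enumerate(ranked, start=1):
        # Find the best still-unmatched ground-truth instance.
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            iou = mask_iou(mask, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= thr:
            matched.add(best_j)
            tp += 1
            ap += tp / rank  # precision at this recall step
    return ap / len(gts) if gts else 0.0
```

Mean average precision is then the mean of this value over object categories, reported separately for each IOU threshold.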

Implementation The network is trained in Torch [9] with an input resolution of 256 × 256 and output resolution of 64 × 64. The weighting of the associative embedding loss is lowered to 1e-4. During training, to account for scale, only objects that appear within a certain size range are supervised, and a loss mask is used to ignore objects that are too big or too small. In PASCAL VOC ignore regions are also defined at object boundaries, and we include these in the loss mask. Training is done from scratch on MS COCO for three days, and then fine-tuned on PASCAL VOC train for 12 hours. At test time the image is evaluated at 3 scales (x0.5, x1.0, and x1.5). Rather than average heatmaps we generate instance proposals at each scale and do non-maximum suppression to remove overlapping proposals across scales. A more sophisticated approach for multi-scale evaluation is worth further exploration.

实现网络在Torch [9]中训练，输入分辨率为256×256，输出分辨率为64×64。关联嵌入损耗的权重降低到1e-4。在训练期间，为了考虑比例，只有在特定大小范围内出现的对象受到监督，并且丢失掩码用于忽略太大或太小的对象。在PASCAL VOC中，忽略区域也在对象边界处定义，并且我们将这些区域包含在损失掩码中。在MS COCO上从头开始培训三天，然后在PASCAL VOC列车上进行微调12小时。在测试时，图像以3级（x0.5，x1.0和x1.5）进行评估。我们不是通过平均热图来生成每个比例的实例提案，而是进行非最大限制抑制，以便跨比例删除重叠提案。更复杂的多尺度评估方法值得进一步探索。
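The cross-scale non-maximum suppression step can be sketched as a greedy procedure over the pooled proposals. This is a sketch under stated assumptions, not the authors' implementation: masks are sets of (row, col) pixels already mapped back to the original image coordinates, the 0.5 overlap threshold is an assumed value that the text does not specify, and the function names are hypothetical.

```python
def _iou(a, b):
    """IOU of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nms_across_scales(proposals_per_scale, iou_thr=0.5):
    """Greedy NMS over instance proposals pooled from every scale.

    proposals_per_scale : list over scales of lists of (score, mask)
    iou_thr             : overlap above which a lower-scoring proposal is
                          dropped (assumed value, not taken from the paper)
    """
    pooled = [p for props in proposals_per_scale for p in props]
    pooled.sort(key=lambda p: p[0], reverse=True)
    kept = []
    for score, mask in pooled:
        # Keep a proposal only if it does not overlap a better one too much.
        if all(_iou(mask, kept_mask) <= iou_thr for _, kept_mask in kept):
            kept.append((score, mask))
    return kept
```

Because proposals from all scales compete in one ranked list, an object detected at several scales is kept only once, at whichever scale scored it highest.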

Results We show mAP results on the val set of PASCAL VOC 2012 in Table 4.2 along with some qualitative examples in Figure 8. We offer these results as a proof of concept that associative embeddings can be used in this manner. We achieve reasonable instance segmentation predictions using the same supervision as we use for multi-person pose. Tuning of training and postprocessing will likely improve performance, but the main takeaway is that associative embedding serves well as a general technique for disparate computer vision tasks that fall under the umbrella of detection and grouping problems.

结果我们在表4.2中显示了PASCAL VOC 2012的val组的mAP结果以及图8中的一些定性示例。我们提供这些结果作为概念证明，可以以这种方式使用关联嵌入。我们使用监督来实现合理的实例分割预测，因为我们用于多人姿势。调整训练和后处理可能会提高性能，但主要的一点是，关联嵌入很适合作为不同计算机视觉任务的一般技术，这些任务属于检测和分组问题的范畴。

Figure 8. Example instance predictions produced by our system on the PASCAL VOC 2012 validation set.

5. Conclusion

In this work we introduce associative embeddings to supervise a convolutional neural network such that it can simultaneously generate and group detections. We apply this method to two vision problems: multi-person pose and instance segmentation. We demonstrate the feasibility of training for both tasks, and for pose we achieve state-of-the-art performance. Our method is general enough to be applied to other vision problems as well, for example multi-object tracking in video. The associative embedding loss can be implemented given any network that produces pixelwise predictions, so it can be easily integrated with other state-of-the-art architectures.

在这项工作中，我们引入了关联嵌入来监督卷积神经网络，以便它可以同时生成和分组检测。我们将此方法应用于两个视觉问题：多人姿势和实例分割。我们展示了对这两项任务进行培训的可行性，并为我们提供了最先进的性能。我们的方法足够通用，也可以应用于其他视觉问题，例如视频中的多目标跟踪。在任何产生像素预测的网络中都可以实现关联嵌入损耗，因此可以轻松地与其他最先进的架构集成。

References

See the original paper.

arXiv:1611.05424
