Contents

1. Motivation
2. Contribution
3. EmbedMask

1. Motivation

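Two-stage methods rely on RoIPool/RoIAlign, which loses image information (the masks are low-resolution, and RoIPool/RoIAlign resamples features during alignment). They also have many parameters and are relatively complex.
Existing (as of 2019) one-stage instance segmentation methods are less accurate than two-stage methods. For the segmentation-based one-stage methods, the bottleneck is the clustering procedure: it is hard to determine the number of clusters and the positions of the cluster centers, so their results are not comparable to those of proposal-based methods.
Mask R-CNN review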

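In Mask R-CNN the mask branch outputs a [28, 28, C] tensor for each proposal. The pipeline is detect first, then segment: the regression and classification branches give the bbox coordinates and the bbox class, and the class selects one of the C channels. For supervision, the ground-truth mask marks foreground pixels with 1 and background pixels with 0. Once the bbox is obtained, the part of the gt mask that falls inside it serves as the target, and a BCE loss is computed against it.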

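At test time, the class predicted by the classification branch picks the corresponding channel c of the mask output. A threshold is then applied (0.5 in the paper): $mask[x, y, c]$ is set to 1 if it exceeds the threshold and to 0 otherwise, which yields the binary mask. Together with the regressed bbox coordinates, the final output for each candidate box is its class, its bounding box, and the mask inside it.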
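The test-time step above (pick the channel of the predicted class, then threshold at 0.5) can be sketched as follows. This is a toy illustration with nested lists, not Mask R-CNN's actual code; the function name `select_class_mask` and the 2x2 example values are assumptions made for the example.

```python
def select_class_mask(mask_probs, class_index, threshold=0.5):
    """Pick one class channel from an [H][W][C] mask output and binarize it."""
    return [
        [1 if pixel[class_index] > threshold else 0 for pixel in row]
        for row in mask_probs
    ]


# Toy 2x2 output with C = 2 classes (the real mask head outputs 28x28xC).
mask_probs = [
    [[0.9, 0.1], [0.4, 0.8]],
    [[0.2, 0.7], [0.6, 0.3]],
]
binary_mask = select_class_mask(mask_probs, class_index=0)
print(binary_mask)
```

Only the channel of the predicted class is read; the other channels are ignored for that box.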

2. Contribution

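The authors divide current instance segmentation methods into proposal-based and segmentation-based methods. The paper proposes EmbedMask, a one-stage instance segmentation model built on embedding coupling. Like proposal-based methods, EmbedMask is built on top of a detection model (FCOS). It adds an embedding module that produces pixel embeddings and proposal embeddings. Pixel embeddings are guided by the proposal embedding of the instance they belong to, so each pixel is assigned to the mask of the corresponding proposal.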

As shown in Figure 1, panel (c) shows the pixel embeddings and panel (b) shows the instance proposals (the proposal embeddings are drawn in different colors).

Figure 1. Proposal embedding (b); pixel embedding (c)

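The authors summarize two contributions:

The paper proposes EmbedMask, a new framework that combines proposal-based and segmentation-based methods. It introduces pixel embeddings and proposal embeddings and assigns pixels to instance proposals according to their embedding similarity.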

We propose a framework that unites the proposal-based and segmentation-based methods, by introducing the concepts of proposal embedding and pixel embedding so that pixels are assigned to instance proposals according to their embedding similarity.
As a one-stage instance segmentation method, EmbedMask reaches accuracy comparable to Mask R-CNN (37.7 vs. 38.1), while producing higher-resolution masks and running faster.

As a one-stage instance segmentation method, our method can achieve comparable scores as Mask R-CNN in the COCO benchmark, and meanwhile it provides masks with a higher quality than Mask R-CNN, running at a higher speed.

3. EmbedMask

3.1 Overview

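Figure 2. The EmbedMask network

Figure 2 shows the network structure. EmbedMask adopts the FCOS architecture: the red parts are the original FCOS structure and the blue parts are new. EmbedMask adds three modules: the Proposal Embedding and Proposal Margin in the Proposal Head, and the Pixel Embedding in the Pixel Head.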

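The Pixel Embedding feature map has size $H \times W \times D$ and is denoted p. It is computed from P3, the largest FPN level, through five 3x3 convs.
The Proposal Embedding feature map has size $H \times W \times D$ and is denoted q. It shares convolution weights with the center-ness and box regression branches. (Note that in FCOS the regression and classification heads share weights across FPN levels, giving two groups of shared convolution kernels.)
The Proposal Margin feature map has size $H \times W \times 1$ and is denoted $\sigma$. It is produced after the box regression branch.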

The proposal head produces the proposal features. For location $x_j$, the complete set of outputs $proposal_j$ can be written as the tuple $\{class_j, box_j, center_j, q_j, \sigma_j \}$.

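For every proposal and every pixel, the distance between the proposal embedding and the pixel embedding is computed to describe the likelihood that the pixel belongs to the proposal. The proposal margin sets the explicit boundary on this likelihood that determines the final mask.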

3.2 Embedding Definition

The proposal embedding represents object-level features, while the pixel embedding represents pixel-level features.

The proposal embedding represents the object-level context features for the object instance, which is a good representation of entire instance, while the pixel embedding represents the pixel-level context features for each location on the image, which learns the relation between each pixel with corresponding instance.

Compared with earlier segmentation-based methods, EmbedMask avoids having to find the locations and the number of cluster centers.

Proposal embeddings are used as cluster centers of instances to do pixel clustering among the pixel embeddings, so that the difficulties appeared in segmentation-based methods, such as finding the locations and counts of cluster centers, are avoided.

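At inference, the pixel embedding of each pixel $x_i$ is denoted $p_i$, and the proposal embedding of an instance proposal $S_k$ that survives NMS is denoted $Q_k$. If $p_i$ is close enough to $Q_k$, the pixel is assigned to instance $S_k$. With a margin $\delta$, the pixel assignment for the binary mask $Mask_k(x_i)$ of $S_k$ is given by Equation 1:

$Mask_k(x_i) = \begin{cases} 1, & \|p_i - Q_k\| \le \delta \\ 0, & \|p_i - Q_k\| > \delta \end{cases}$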
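The fixed-margin assignment described above (a pixel joins $S_k$ when its embedding lies within $\delta$ of $Q_k$) can be sketched as follows. This is an illustrative sketch in plain Python, not the authors' implementation; the helper names `l2_distance` and `assign_pixels` and the 2-D toy embeddings are assumptions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_pixels(pixel_embeddings, proposal_embedding, delta):
    """Return 1 for each pixel whose embedding is within delta of the proposal embedding."""
    return [
        1 if l2_distance(p, proposal_embedding) <= delta else 0
        for p in pixel_embeddings
    ]


# Three pixels with 2-D embeddings; distances to Q_k are 0, 0.5 and 5.
pixel_embeddings = [[0.0, 0.0], [0.3, 0.4], [3.0, 4.0]]
Q_k = [0.0, 0.0]
mask_k = assign_pixels(pixel_embeddings, Q_k, delta=1.0)
print(mask_k)
```

The result depends entirely on the hand-picked `delta`: shrinking it to 0.1 would drop the second pixel from the mask.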

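During training, the pixel embeddings and proposal embeddings are used jointly to produce the mask. Here $S_k$ is the ground-truth mask, and $Q_k$ is the mean of the positive-sample proposal embeddings that correspond to gt instance $S_k$ (the rule for selecting positive samples is given in 3.5). Hence, if pixel $x_i$ belongs to the gt mask of instance $S_k$, $p_i$ is pulled closer to $Q_k$.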

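To separate foreground and background pixel embeddings, an intuitive idea is a hinge loss with two fixed margins, as in Equation 2:

$L_{mask} = \frac{1}{K} \sum\limits_{k=1}^{K} \frac{1}{N_k} \sum\limits_{p_i \in B_k} \left( \mathbb{I}_{x_i \in S_k} [\|p_i - Q_k\| - \delta_a]_+ + \mathbb{I}_{x_i \notin S_k} [\delta_b - \|p_i - Q_k\|]_+ \right)$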

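> For reference, the standard hinge loss is shown below, with labels $y_i \in \{-1, +1\}$ and raw scores $\hat y_i$. It enforces a margin (here 1): when $y_i \hat y_i \ge 1$ the loss is 0, and the loss grows as $y_i \hat y_i$ falls below the margin.

$L_{hinge} = \sum \limits_{i=1}^N \max(0, 1-y_i \hat y_i)$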

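The variables in Equation 2 are:

K is the number of gt instances.
$B_k$ is the set of pixel embeddings that need to be supervised for $S_k$, that is, the pixel embeddings inside the bbox of $S_k$ (these split into pixels inside the mask and the remaining background pixels of the bbox).
$N_k$ is the number of pixel embeddings in $B_k$. $[x]_+=\max(0, x)$, which follows the original hinge loss.
$\mathbb{I}_{x\in S_k}$ is the indicator function: it is 1 if pixel x lies inside the gt mask of $S_k$ (the gt mask, not the gt box) and 0 otherwise.
$\delta_a$ and $\delta_b$ are the two margins for the pull and push strategies, respectively.
- The first term pulls the distance $\|p_i -Q_k\|$ to within the margin $\delta_a$: when $\|p_i -Q_k\|<\delta_a$, $[\|p_i -Q_k\|-\delta_a]_+=\max(0,\|p_i -Q_k\|-\delta_a)=0$, so once $p_i$ is close enough to $Q_k$ the loss is 0.
- The second term pushes the distance $\|p_i -Q_k\|$ beyond the margin $\delta_b$: when $\|p_i -Q_k\|>\delta_b$, $[\delta_b-\|p_i -Q_k\|]_+=0$.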
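The pull and push terms above can be sketched for a single instance as follows. This is a minimal illustration, assuming the per-instance loss is the mean of the two hinge terms over the pixels in $B_k$; the function name `fixed_margin_loss` and the toy values are hypothetical, and the full loss would further average over the K instances.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fixed_margin_loss(pixel_embeddings, in_mask, Q_k, delta_a, delta_b):
    """Mean pull/push hinge term over the pixels in B_k for one instance."""
    total = 0.0
    for p_i, inside in zip(pixel_embeddings, in_mask):
        d = l2_distance(p_i, Q_k)
        if inside:
            total += max(0.0, d - delta_a)  # pull: zero once d <= delta_a
        else:
            total += max(0.0, delta_b - d)  # push: zero once d >= delta_b
    return total / len(pixel_embeddings)


Q_k = [0.0, 0.0]
pixel_embeddings = [[0.3, 0.4], [3.0, 4.0], [0.0, 1.0], [0.0, 3.0]]
in_mask = [True, True, False, False]  # gt mask membership of each pixel
loss = fixed_margin_loss(pixel_embeddings, in_mask, Q_k, delta_a=1.0, delta_b=2.0)
print(loss)
```

In this toy case only two pixels contribute: the foreground pixel at distance 5 (pull term 4) and the background pixel at distance 1 (push term 1).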

However, the authors find that fixed margins cause the problems described in 3.3, so they replace them with learnable margins.

3.3 Learnable Margin

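The authors give two reasons why fixed margins are ill-suited to training:

$\delta_a$ and $\delta_b$ (used in training) and $\delta$ (used at test time) are all set by hand, which makes it hard to reach the optimum.
Fixed margins do not suit multi-scale training: the pixel embeddings of large objects are more scattered, while those of small objects are more concentrated.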

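To avoid these problems, the authors give every instance proposal its own margin $\sigma_j$ that adapts to object scale. $\sigma_j$ is not set by hand but learned directly during training. A Gaussian function maps the distance between the pixel embedding and the proposal embedding into $(0, 1]$, as in Equation 3:

$\phi(x_i, S_k) = \exp\left(-\frac{\|p_i - Q_k\|^2}{2\Sigma_k^2}\right)$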

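$\phi(x_i,S_k)$ is the probability that pixel $x_i$ belongs to the mask of $S_k$. If $p_i$ is close to $Q_k$, the output of Equation 3 approaches 1; otherwise it approaches 0. The new variable $\Sigma_k$ in Equation 3 is derived from $\sigma_j$ in the same way that $Q_k$ is derived from $q_j$, and it plays the role of the margin for each instance. In this way $\sigma_j$ gives every instance proposal a learnable margin.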

The additional introduced variant $\Sigma_k$ comes from $\sigma_j$ just like how $Q_k$ comes from $q_j$.
As what is introduced in [26], the $\Sigma_k$ plays a role of margin for instance $S_k$.
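To make the role of the learnable margin concrete, here is a minimal sketch of the Gaussian mapping, assuming it takes the common form $\exp(-\|p_i - Q_k\|^2 / (2\Sigma_k^2))$. The function name `phi` and the toy values are illustrative only.

```python
import math


def phi(p_i, Q_k, sigma_k):
    """Gaussian membership score of a pixel embedding for one instance."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(p_i, Q_k))
    return math.exp(-squared_distance / (2.0 * sigma_k ** 2))


Q_k = [0.0, 0.0]
far_pixel = [3.0, 4.0]  # distance 5 from Q_k

score_at_center = phi(Q_k, Q_k, sigma_k=1.0)            # zero distance
score_small_margin = phi(far_pixel, Q_k, sigma_k=1.0)   # exp(-12.5)
score_large_margin = phi(far_pixel, Q_k, sigma_k=5.0)   # exp(-0.5)
print(score_at_center, score_small_margin, score_large_margin)
```

The same pixel is rejected under a small margin and accepted under a large one (using 0.5 as the cut-off), which is how a per-instance margin can accommodate both concentrated and scattered embeddings.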

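The mask loss can then be written as a binary classification loss, Equation 4:

$L_{mask} = \frac{1}{K} \sum\limits_{k=1}^{K} \frac{1}{N_k} \sum\limits_{p_i \in B_k} L\left(\mathbb{G}(x_i, S_k), \phi(x_i, S_k)\right)$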

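Here $L(\cdot)$ is a binary classification loss; the paper uses the lovasz-hinge loss. $\mathbb{G}(x_i, S_k)$ is the gt label of $x_i$, indicating whether the pixel belongs to the mask of $S_k$, and takes values in $\{0,1\}$. This loss involves both embeddings and the margin, so the proposal margin is learned automatically, and the per-instance margin obtained through Equation 4 is more flexible than the fixed margins of the hinge loss.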

3.4 Smooth Loss

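During training, $S_k$ is a gt instance, so for each instance $S_k$, $Q_k$ and $\Sigma_k$ are the means of $q_j$ and $\sigma_j$ over the instance's positive samples. Denoting this positive-sample set by $\mathcal M_k$, $Q_k$ and $\Sigma_k$ are computed as in Equations 5 and 6, where $N_k$ is the number of positive samples of proposal embedding and margin for $S_k$. (In other words, each $q_j$ is a positive example, and $Q_k$ is the average of all positive examples that belong to the instance.)

$Q_k = \frac{1}{N_k} \sum\limits_{j \in \mathcal M_k} q_j$

$\Sigma_k = \frac{1}{N_k} \sum\limits_{j \in \mathcal M_k} \sigma_j$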
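The averaging over positive samples described above can be sketched as follows. This is a plain-Python illustration; the helper names `mean_vector` and `instance_parameters` and the toy values are assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def instance_parameters(positive_q, positive_sigma):
    """Average the positive samples' embeddings and margins for one instance."""
    Q_k = mean_vector(positive_q)
    Sigma_k = sum(positive_sigma) / len(positive_sigma)
    return Q_k, Sigma_k


# Two positive samples for one gt instance, with 2-D proposal embeddings.
Q_k, Sigma_k = instance_parameters(
    positive_q=[[1.0, 2.0], [3.0, 4.0]],
    positive_sigma=[0.5, 1.5],
)
print(Q_k, Sigma_k)
```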

At test time, however, $S_k$ is an instance proposal that survives NMS, so $Q_k$ and $\Sigma_k$ change accordingly.

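Because $S_k$ means different things in training and testing, $Q_k$ and $\Sigma_k$ differ between the two. To address this, the authors add a smooth loss during training that keeps each positive sample's $q_j$ and $\sigma_j$ close to $Q_k$ and $\Sigma_k$, as in Equation 7:

$L_{smooth} = \frac{1}{K} \sum\limits_{k=1}^{K} \frac{1}{N_k} \sum\limits_{j \in \mathcal M_k} \|q_j - Q_k\|^2 + \frac{1}{K} \sum\limits_{k=1}^{K} \frac{1}{N_k} \sum\limits_{j \in \mathcal M_k} |\sigma_j - \Sigma_k|^2$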
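The idea of the smooth loss can be sketched for one instance as below. The squared-difference penalty used here is an assumption made for illustration, and `smooth_loss_single` is a hypothetical name; the sketch only shows that the loss is zero when every positive sample already equals the instance average and grows as the samples spread out.

```python
def smooth_loss_single(positive_q, positive_sigma, Q_k, Sigma_k):
    """Mean squared deviation of positive samples from the instance averages."""
    n = len(positive_q)
    embedding_term = sum(
        sum((a - b) ** 2 for a, b in zip(q_j, Q_k)) for q_j in positive_q
    ) / n
    margin_term = sum((s - Sigma_k) ** 2 for s in positive_sigma) / n
    return embedding_term + margin_term


positive_q = [[1.0, 2.0], [3.0, 4.0]]
positive_sigma = [0.5, 1.5]
Q_k, Sigma_k = [2.0, 3.0], 1.0  # means of the samples above

loss = smooth_loss_single(positive_q, positive_sigma, Q_k, Sigma_k)
print(loss)
```

With such a term in training, any single positive $q_j$ and $\sigma_j$ picked after NMS at test time should be close to the averages the mask loss was trained with.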

3.5 Training

Objective

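In addition to the regression, classification and center-ness losses of FCOS, EmbedMask adds two losses for the mask prediction branch, $L_{mask}$ and $L_{smooth}$. The total loss is given by Equation 8, where $\lambda_1$ and $\lambda_2$ weight the two new terms:

$L = L_{cls} + L_{center} + L_{box} + \lambda_1 L_{mask} + \lambda_2 L_{smooth}$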

Training Samples for Box and Classification

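For box regression and classification, the authors define the positive samples as $\{box_j, class_j, center_j\}$ whose locations, mapped back to the original image, fall in the center region of the gt bounding box (essentially the center sampling later adopted by FCOS) and also lie inside the gt mask (this is the difference). This is stricter than the original FCOS sampling strategy.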

We define the positive samples as the parameters $\{box_j, class_j, center_j\}$ whose real locations mapped back to the original image locate on the center region of the ground-truth bounding box, and at the meantime the locations are in the mask of the ground-truth instances.
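The two conditions above (inside the central region of the gt box, and on the gt mask) can be sketched as follows. The way the central region is defined here, as a box shrunk about its center by a hypothetical `center_ratio`, is a simplification chosen for the example and not necessarily the rule used by FCOS or EmbedMask.

```python
def is_positive_location(x, y, gt_box, gt_mask, center_ratio=0.5):
    """True if (x, y) is in the central region of gt_box and on the gt mask."""
    x1, y1, x2, y2 = gt_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * center_ratio / 2.0
    half_h = (y2 - y1) * center_ratio / 2.0
    in_center = abs(x - cx) <= half_w and abs(y - cy) <= half_h
    return in_center and gt_mask[y][x] == 1


gt_box = (0, 0, 4, 4)
gt_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(is_positive_location(1, 1, gt_box, gt_mask))  # central and on the mask
print(is_positive_location(2, 2, gt_box, gt_mask))  # central but off the mask
```

The second call shows the stricter part of the rule: a location at the box center is still rejected when it falls on background.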
Training Samples for Proposal Embedding and Margin

Selection is based on IoU: the IoU between the predicted box and the gt box must exceed 0.5.

that the Intersection over Union (IoU) between the corresponding predicted $box_j$ in the sampled location and the ground-truth box for instance $S_k$ should be more than 0.5
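The IoU criterion can be sketched as follows, with boxes given as (x1, y1, x2, y2). The helper names `box_iou` and `positive_sample_indices` are illustrative; only the 0.5 threshold comes from the text.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positive_sample_indices(pred_boxes, gt_box, iou_threshold=0.5):
    """Indices j whose predicted box overlaps the gt box with IoU above the threshold."""
    return [
        j for j, box in enumerate(pred_boxes)
        if box_iou(box, gt_box) > iou_threshold
    ]


gt_box = (0, 0, 10, 10)
pred_boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (1, 1, 10, 10)]
positives = positive_sample_indices(pred_boxes, gt_box)
print(positives)
```

The $q_j$ and $\sigma_j$ at the selected indices would then form the positive set used for the per-instance averages.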
Training Samples for Pixel Embedding

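As noted above, originally only the pixel embeddings in $B_k$ are used to supervise $S_k$. In practice, the authors find that slightly enlarging the box to include more training examples gives better results, so they enlarge the bbox manually.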

$B_k$ represents the set of pixel embeddings that need to be supervised for the instance $S_k$.
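Enlarging the box that defines $B_k$ can be sketched as follows. The text does not state how much the box is enlarged, so the `ratio` parameter and the value 1.5 below are purely hypothetical, as is the function name `expand_box`.

```python
def expand_box(box, ratio, image_width, image_height):
    """Scale an (x1, y1, x2, y2) box about its center by `ratio`, clipped to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * ratio / 2.0
    half_h = (y2 - y1) * ratio / 2.0
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(image_width), cx + half_w),
        min(float(image_height), cy + half_h),
    )


# Hypothetical expansion ratio of 1.5 on a 100x100 image.
expanded = expand_box((10, 10, 30, 30), ratio=1.5, image_width=100, image_height=100)
clipped = expand_box((0, 0, 20, 20), ratio=1.5, image_width=100, image_height=100)
print(expanded, clipped)
```

Pixels inside the enlarged box but outside the gt mask act as extra background examples for the instance.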

3.6 Inference

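Given an input image, the object detection procedure (FCOS) first produces the classification, regression and center-ness predictions. After NMS, the remaining predicted instances serve as the instance proposals $S_k$. Each $S_k$ has bbox coordinates, a category score, a proposal embedding $q_j$ and a margin $\sigma_j$; at test time $q_j$ and $\sigma_j$ are used directly as $Q_k$ and $\Sigma_k$. The pixel embeddings $p_i$ are also computed for each image. For each pixel $x_i$ of $S_k$, Equation 3 gives the probability that $x_i$ belongs to $S_k$, and thresholding turns it into a binary mask value: 1 if the probability exceeds 0.5, and 0 otherwise.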
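The mask-generation step at inference can be sketched as follows, assuming detection and NMS have already produced a list of (proposal embedding, margin) pairs and assuming a Gaussian of the form $\exp(-\|p_i - Q_k\|^2 / (2\Sigma_k^2))$. The function name `instance_masks` is hypothetical, and for simplicity the sketch scores every pixel rather than only those associated with each proposal.

```python
import math


def instance_masks(pixel_embeddings, proposals, threshold=0.5):
    """Binary mask (flat list over pixels) for each proposal kept after NMS."""
    masks = []
    for q_k, sigma_k in proposals:
        mask = []
        for p_i in pixel_embeddings:
            squared_distance = sum((a - b) ** 2 for a, b in zip(p_i, q_k))
            prob = math.exp(-squared_distance / (2.0 * sigma_k ** 2))
            mask.append(1 if prob > threshold else 0)
        masks.append(mask)
    return masks


# Four pixels whose embeddings form two groups, and two proposals after NMS.
pixel_embeddings = [[0.0, 0.0], [0.5, 0.0], [4.0, 0.0], [4.2, 0.0]]
proposals = [([0.0, 0.0], 1.0), ([4.0, 0.0], 1.0)]  # (q_j, sigma_j) pairs
masks = instance_masks(pixel_embeddings, proposals)
print(masks)
```

Each proposal claims the pixels whose embeddings lie near its own embedding, so no separate clustering step is needed.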

[EmbedMask] EmbedMask: Embedding Coupling for One-stage Instance Segmentation