1. Paper overview

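The idea behind this instance segmentation paper matches how a beginner would approach the problem: to do instance segmentation (detection + segmentation), cascade several networks (chained in series here, whereas Mask R-CNN runs its branches in parallel). 1. An RPN first produces box-level proposals. 2. Fully connected layers then segment each RoI at the pixel level, extracting the object's foreground and turning the boxes into mask-level proposals. 3. A third stage adds further fully connected layers to classify these proposals, which completes instance segmentation. The paper also mentions that the cascade can be extended to 5 stages, i.e. one more round of segmentation and classification.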

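Note: chaining the stages in series also means the per-stage losses are not independent; each loss has to be back-propagated through the outputs of the earlier stages. To make end-to-end training possible, the paper replaces the RoI pooling operation with an RoI warping operation; see the paper for details.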

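Back in 2015, when multi-task learning was not yet that popular, this is the design most people would have come up with. It won the COCO 2015 instance segmentation challenge, was later beaten soundly by the COCO 2016 winner FCIS (my next blog post summarizes that paper), and then both MNC and FCIS were beaten soundly by Mask R-CNN in 2017.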

In this work, we address instance-aware semantic segmentation solely based on CNNs, without using external
modules (e.g., [1]). We observe that the instance-aware semantic segmentation task can be decomposed into three different and related sub-tasks. 1) Differentiating instances. In
this sub-task, the instances can be represented by bounding
boxes that are class-agnostic. 2) Estimating masks. In this
sub-task, a pixel-level mask is predicted for each instance.
3) Categorizing objects. In this sub-task, the category-wise
label is predicted for each mask-level instance. We expect
that each sub-task is simpler than the original instance segmentation task, and is more easily addressed by convolutional networks.

2. How Regressing Mask-level Instances differs from DeepMask

As a related method, DeepMask [25] also regresses discretized masks. DeepMask applies the regression layers
to dense sliding windows (fully-convolutionally), but our
method only regresses masks from a few proposed boxes (from the RPN)
and so reduces computational cost. Moreover, mask regression is only one stage in our network cascade that shares
features among multiple stages, so the marginal cost of the
mask regression layers is very small.

3. A detail of the classification stage

Following [13], we also use another
box-based pathway, where the RoI pooled features directly
fed into two 4096-d fc layers (this pathway is not illustrated
in Fig. 2). The mask-based and box-based pathways are
concatenated. On top of the concatenation, a softmax classifier of N+1 ways is used for predicting N categories plus
one background category. The box-level pathway may address the cases when the feature is mostly masked out by
the mask-level pathway (e.g., on background).

Stage 3 receives not only the masked features coming from stage 2 but also the RoI-pooled features from stage 1, which improves robustness.
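
To make the two-pathway idea concrete, here is a minimal pure-Python sketch. It is a toy under my own assumptions, not the paper's implementation: the feature vector has 4 values instead of a real RoI feature map, the two 4096-d fc layers of each pathway are left out, and the names `classify`, `weights` and `bias` are hypothetical. It only illustrates that when the mask zeroes out the mask-based pathway, the box-based pathway still carries signal into the (N+1)-way softmax.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def classify(roi_feat, mask, weights, bias):
    """Toy stage-3 classifier (per-pathway fc layers omitted for brevity)."""
    mask_path = [f * m for f, m in zip(roi_feat, mask)]  # masked features (stage 2)
    box_path = list(roi_feat)                            # RoI-pooled features (stage 1)
    joint = mask_path + box_path                         # concatenate both pathways
    return softmax(linear(joint, weights, bias))         # N categories + background


# Toy inputs: a mask that removes everything, as can happen on background.
roi_feat = [0.5, 1.0, -0.2, 0.8]
empty_mask = [0.0, 0.0, 0.0, 0.0]
num_classes = 2  # N, so the classifier has N + 1 = 3 outputs
weights = [[0.1 * (k + 1)] * 8 for k in range(num_classes + 1)]
bias = [0.0] * (num_classes + 1)

probs = classify(roi_feat, empty_mask, weights, bias)
print(probs)
```

With the mask-based half of `joint` all zeros, the scores still differ across classes because the box-based half is untouched, which is the robustness argument the paper makes.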

4. Trying out a handy formula tool (Mathpix Snip):

The loss term L3 of stage 3 exhibits the following form:

$L_{3}=L_{3}(C(\Theta) | B(\Theta), M(\Theta))$
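
For context, the conditioning in $L_{3}$ is the last link of a chain that runs through the whole cascade. Written in the same notation, and as far as I can reconstruct it from the paper (so check the forms of $L_{1}$ and $L_{2}$ against the original), the three stage losses and their sum look like this:

$L_{1}=L_{1}(B(\Theta))$

$L_{2}=L_{2}(M(\Theta) | B(\Theta))$

$L(\Theta)=L_{1}(B(\Theta))+L_{2}(M(\Theta) | B(\Theta))+L_{3}(C(\Theta) | B(\Theta), M(\Theta))$

Here $B$ is the set of predicted boxes, $M$ the predicted masks and $C$ the category predictions. Because $B(\Theta)$ appears inside $L_{2}$ and $L_{3}$, gradients have to flow through the predicted box coordinates, which is why the pooling step must be differentiable with respect to the box.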

5. Differentiable RoI Warping Layers.

The RoI pooling
layer [9, 15] performs max pooling on a discrete grid based
on a box. To derive a form that is differentiable w.r.t. the
box position, we perform RoI pooling by a differentiable
RoI warping layer followed by standard max pooling.
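
A minimal pure-Python sketch of this idea follows: crop a box from a feature map by bilinear interpolation onto a fixed grid, then apply ordinary max pooling. The sampling rule (`x0 + u'/W' * w`), the function names `roi_warp` and `max_pool`, and the toy 4x4 feature map are my assumptions for illustration, not the paper's code. The point is that each warped value is a smooth function of the box coordinates, so a small shift of the box changes the output continuously.

```python
def bilinear_kernel(d):
    """Bilinear interpolation weight: max(0, 1 - |d|)."""
    return max(0.0, 1.0 - abs(d))


def roi_warp(feat, box, out_h, out_w):
    """Warp region box=(x0, y0, w, h) of a 2-D feature map to out_h x out_w."""
    x0, y0, w, h = box
    height, width = len(feat), len(feat[0])
    out = []
    for vp in range(out_h):
        y = y0 + (vp / out_h) * h          # continuous sampling row
        row = []
        for up in range(out_w):
            x = x0 + (up / out_w) * w      # continuous sampling column
            val = 0.0
            for v in range(height):
                gy = bilinear_kernel(y - v)
                if gy == 0.0:
                    continue
                for u in range(width):
                    gx = bilinear_kernel(x - u)
                    if gx != 0.0:
                        val += gx * gy * feat[v][u]
            row.append(val)
        out.append(row)
    return out


def max_pool(grid, k):
    """Non-overlapping k x k max pooling (grid sides must be multiples of k)."""
    return [[max(grid[i + di][j + dj] for di in range(k) for dj in range(k))
             for j in range(0, len(grid[0]), k)]
            for i in range(0, len(grid), k)]


# Toy feature map with feat[v][u] = 4 * v + u, so interpolation is easy to check.
feat = [[4.0 * v + u for u in range(4)] for v in range(4)]

warped = roi_warp(feat, (0.5, 0.5, 2.0, 2.0), 4, 4)
pooled = max_pool(warped, 2)

# Shifting the box by 0.1 in x moves the output smoothly instead of snapping
# to a different discrete cell, which is what makes gradients w.r.t. the box usable.
shifted = roi_warp(feat, (0.6, 0.5, 2.0, 2.0), 4, 4)
print(warped[0][0], shifted[0][0], pooled)
```

Plain RoI pooling would round the box to the discrete grid first, so the same 0.1 shift would either change nothing or jump to a different cell.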

6. Ablation experiments

This baseline has an mAP^r of
60.2% using VGG-16. We note that this baseline result is
competitive (see also Table 2), suggesting that decomposing
the task into three sub-tasks is an effective solution.

7. Performance comparison with other methods

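Note: this paper reports an inference time of 0.36 s, but the FCIS paper lists 1.37 s for it in its comparison. The gap is probably due to different base networks: VGG here, ResNet in the FCIS paper.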

8. A suggestion from the authors

Our method is designed with fast inference in mind, and
is orthogonal to some other successful strategies developed
previously for semantic segmentation. For example, one
may consider exploiting a CRF [5] to refine the boundaries
of the instance masks. This is beyond the scope of this paper
and will be investigated in the future.

Note: the FCIS paper points out three shortcomings of this paper and improves on them; I will cover the details in the FCIS blog post.