Paper Reading: InstanceFCN: Instance-sensitive Fully Convolutional Networks

1. Paper Overview


After the advent of deep learning, instance segmentation was first tackled with two-stage pipelines similar to R-CNN: extract segmentation proposals, then classify them, as in SDS, MNC, DeepMask and InstanceFCN. Reading the paper makes it clear that InstanceFCN (2016) only generates segmentation proposals (a binary, class-agnostic task); it is the step before full instance segmentation rather than instance segmentation itself. The FCN covered in the previous post does only semantic segmentation: it assigns classes but cannot separate individual objects of the same class, whereas this paper cannot assign classes but can separate the individual objects.

A 2015 paper, Instance-aware Semantic Segmentation via Multi-task Network Cascades (MNC, to be covered later), does perform instance segmentation and predates InstanceFCN, though it still had many problems. The stronger instance segmentation methods came in 2017 with FCIS (Fully Convolutional Instance-aware Semantic Segmentation) and Mask R-CNN. FCIS borrows the instance-sensitive score maps introduced by InstanceFCN, which builds on FCN and relies on these score maps to tell different object instances apart. The instance-sensitive score maps appear to be the paper's main contribution; the idea was also borrowed by the object detection network R-FCN.

2. How the instance-sensitive score maps work

In conventional FCN-style segmentation, each pixel is trained with a per-pixel cross-entropy loss, so each pixel can carry only one semantic prediction. For a region where two objects overlap, this makes instance segmentation very difficult. This paper instead predicts k × k instance-sensitive score maps (similar to feature maps), so every pixel produces k × k scores, and each score map is responsible for one relative position within an ROI: the first score map scores the top-left part of an ROI, the second the top-middle part, and so on. Since every score map only scores one relative position of an instance, the maps become instance-sensitive. An example follows:

As illustrated in the paper's figure, although the red point is the same pixel in the image, its relative position differs across ROIs: in the first ROI it sits at the middle-right, while in the second ROI it sits at the top-left. So when the two (overlapping) ROIs both extract this pixel, they read from different score maps: the first ROI reads the 6th score map, while the second reads the 1st. This is also why the responses shown in the two maps differ: the two maps are responsible for different relative positions, so the first map responds strongly at the middle-right of each instance, while the second responds strongly at the top-left.

One-sentence summary: even though different instances may share a common region, that shared region occupies a different relative position within each instance.

[Note]: this paper generates ROIs with a sliding window, whereas FCIS uses an RPN.
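To make the relative-position lookup concrete, here is a minimal sketch (my own illustration, not code from the paper) assuming k = 3 and row-major ordering of the relative positions; the pixel and ROI coordinates are made up to reproduce the red-point example above.

```python
# Minimal sketch of the relative-position lookup, assuming k = 3 and
# row-major ordering of the k*k cells (0 = top-left, ..., 8 = bottom-right).

def relative_position_index(pixel_xy, roi_xyxy, k=3):
    """Return the (0-based) index of the score map that scores this pixel
    for the given ROI."""
    x, y = pixel_xy
    x0, y0, x1, y1 = roi_xyxy
    col = min(int((x - x0) / (x1 - x0) * k), k - 1)  # which of the k columns
    row = min(int((y - y0) / (y1 - y0) * k), k - 1)  # which of the k rows
    return row * k + col

# The red-point example: the same pixel lies middle-right in one ROI but
# top-left in an overlapping ROI, so it is read from different score maps.
pixel = (60, 50)
roi_a = (0, 0, 70, 90)       # hypothetical ROI; the pixel falls at its middle-right
roi_b = (55, 45, 150, 140)   # hypothetical ROI; the pixel falls at its top-left
print(relative_position_index(pixel, roi_a))  # 5 -> the 6th score map
print(relative_position_index(pixel, roi_b))  # 0 -> the 1st score map
```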

3. Comparison with DeepMask

Most related to our method, DeepMask [8] is an instance segment proposal method driven by convolutional networks. DeepMask learns a function that maps an image sliding window to an m²-d vector representing an m×m-resolution mask (e.g., m = 56). This is computed by an m²-d fully-connected (fc) layer. See Fig. 2. Even though DeepMask can be implemented in a fully convolutional way (as at inference time in [8]) by recasting this fc layer into a convolutional layer with m²-d outputs, it fundamentally differs from the FCNs in [1] where each output pixel is a low-dimensional classifier. Unlike DeepMask, our method has no layer whose size is related to the mask size m², and each pixel in our method is a low-dimensional classifier. This is made possible by exploiting local coherence [9] of natural images for generating per-window pixel-wise predictions. We will discuss and compare with DeepMask in depth.
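A rough back-of-the-envelope comparison of the two output layers. The 512-d per-window feature and the 1×1 convolution are assumptions made here for illustration, not numbers taken from either paper; the point is only that DeepMask's output layer grows with m² while InstanceFCN's stays at k².

```python
# Rough size comparison (illustrative assumptions: 512-d per-window feature,
# a plain fc layer for DeepMask, a 1x1 conv producing k^2 maps for InstanceFCN).

m = 56          # DeepMask's mask resolution
k = 3           # InstanceFCN's grid of relative positions
feat_dim = 512  # assumed feature dimensionality per window / per pixel

deepmask_outputs = m * m                    # 3136 values per window
deepmask_fc_params = feat_dim * m * m       # fc weights scale with m^2 -> 1,605,632

instancefcn_outputs = k * k                 # 9 score maps, independent of m
instancefcn_params = feat_dim * k * k       # 1x1-conv weights -> 4,608

print(deepmask_outputs, deepmask_fc_params)
print(instancefcn_outputs, instancefcn_params)
```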

4、Instance assembling module

The instance-sensitive score maps have not yet produced object instances. But we can simply assemble instances from these maps. We slide a window of resolution m×m on the set of instance-sensitive score maps (Fig. 1 (bottom)). In this sliding window, each m/k × m/k sub-window directly copies values from the same sub-window in the corresponding score map. The k² sub-windows are then put together (according to their relative positions) to assemble a new window of resolution m×m. This is the instance assembled from this sliding window.
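A minimal NumPy sketch of this assembling step, under assumptions of my own (k = 3, row-major ordering of the relative positions, m divisible by k, score maps given as a (k², H, W) array); it illustrates the copy-and-tile idea, not the authors' implementation.

```python
# Assemble an m x m instance from k*k instance-sensitive score maps:
# each m/k x m/k sub-window copies values from the same sub-window of the
# score map in charge of that relative position.

import numpy as np

def assemble_instance(score_maps, x0, y0, m, k=3):
    """score_maps: (k*k, H, W) array. Returns the m x m instance assembled
    for the sliding window whose top-left corner is (x0, y0)."""
    s = m // k                                   # side of each sub-window
    instance = np.empty((m, m), dtype=score_maps.dtype)
    for row in range(k):
        for col in range(k):
            idx = row * k + col                  # which score map to read
            ys, xs = y0 + row * s, x0 + col * s  # sub-window location in the image
            instance[row * s:(row + 1) * s,
                     col * s:(col + 1) * s] = score_maps[idx, ys:ys + s, xs:xs + s]
    return instance

# Usage: 9 score maps over a 120x160 image, one 21x21 window at (40, 30).
score_maps = np.random.rand(9, 120, 160).astype(np.float32)
mask_logits = assemble_instance(score_maps, x0=40, y0=30, m=21)
print(mask_logits.shape)  # (21, 21)
```

Note that assembling is pure copying; no weights are involved in this step.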

5. Local Coherence

By local coherence we mean that for a pixel in a natural image, its prediction is most likely the same when evaluated in two neighboring windows. One does not need to completely re-compute the predictions when a window is shifted by a small step.

(Fig. 3 of the paper omitted.)

The local coherence property has been exploited by our method. For a window that slides by one stride (Fig. 3 (bottom)), the same pixel in the image coordinate system will have the same prediction because it is copied from the same score map (except for a few pixels near the partitioning of relative positions). This allows us to conserve a large number of parameters when the mask resolution m² is high. (In other words, shifting the sliding window by a small step does not change the prediction for a pixel the two windows share, thanks to the local coherence of natural images; this is what lets each output pixel remain a low-dimensional classifier and saves a large number of parameters compared with DeepMask.)
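A small self-contained check of this property, under the same illustrative assumptions as the sketch in section 4 (k = 3, m = 21, row-major cells): when the window shifts by one position, a pixel shared by both windows almost always stays in the same relative-position cell, and therefore copies its value from the same score map.

```python
# Check how often a shared pixel keeps the same relative-position cell
# (and hence the same score map) when the window shifts by one column.

import numpy as np

k, m = 3, 21
s = m // k
x0_a, x0_b = 40, 41              # two horizontally neighboring windows

cols = np.arange(x0_b, x0_a + m)  # image columns covered by both windows
cell_in_a = (cols - x0_a) // s    # column of the k*k grid within window A
cell_in_b = (cols - x0_b) // s    # ... and within window B

same = cell_in_a == cell_in_b
print(f"{same.mean():.0%} of shared columns read from the same score map")
# -> 90%: only the columns sitting right on a cell boundary change maps,
#    matching "except for a few pixels near the partitioning".
```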

6. Performance comparison

(Performance comparison tables from the paper omitted.)

