1. Paper overview

The figure below is roughly a history of how these methods developed:

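This paper presents the first fully convolutional, end-to-end (e2e) network for instance segmentation. It improves on InstanceFCN by adding outside position-sensitive score maps (so that classification and segmentation are done jointly) and by using an RPN to extract proposals, whereas InstanceFCN works in a sliding-window fashion.

What is new in this paper is mainly how the position-sensitive inside/outside score maps are produced and how they are used for segmentation and classification, so the notes below focus on these two points.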

2. Producing the position-sensitive inside/outside score maps

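First, the ResNet backbone produces the feature map. Segmentation networks generally need a higher-resolution feature map, so the stride of conv5 is changed from 2 to 1 and dilated ("hole") convolution is added; the resulting feature map is w/16 × h/16 of the input image.
Next, further convolution on this feature map produces 2k²(C+1) position-sensitive inside/outside score maps, half of them inside maps and half outside maps.
Finally, each ROI is split into k × k cells. Each cell copies its values from the one score map, out of the k × k, that corresponds to its relative position, and the cells are then assembled. After assembling, each ROI is left with only 2(C+1) score maps, half inside and half outside.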
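The assembling step above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the paper's implementation: the function name `assemble_roi`, the nested-list layout `score_maps[m][cell][y][x]`, the half-open ROI convention, and the integer binning rule are all assumptions made for the example.

```python
def assemble_roi(score_maps, roi, k):
    """Assemble per-ROI score maps from position-sensitive score maps.

    score_maps[m][cell][y][x]: m indexes the 2*(C+1) inside/outside sets,
    cell indexes the k*k relative positions in row-major order (assumed layout).
    roi = (x0, y0, x1, y1), half-open, in feature-map coordinates.
    """
    x0, y0, x1, y1 = roi
    w, h = x1 - x0, y1 - y0
    assembled = []
    for per_cell in score_maps:
        out = []
        for y in range(y0, y1):
            row_bin = (y - y0) * k // h          # which of the k rows of cells
            row = []
            for x in range(x0, x1):
                col_bin = (x - x0) * k // w      # which of the k columns of cells
                # Copy the value from the score map that belongs to this cell.
                row.append(per_cell[row_bin * k + col_bin][y][x])
            out.append(row)
        assembled.append(out)
    return assembled


# Toy data: one set of k*k maps, where map number `cell` is filled with `cell`.
k, H, W = 2, 4, 4
toy_maps = [[[[cell] * W for _ in range(H)] for cell in range(k * k)]]
assembled = assemble_roi(toy_maps, (0, 0, 4, 4), k)
for row in assembled[0]:
    print(row)
```

Because every ROI only reads from the same shared score maps, the per-ROI work is just this lookup; no per-ROI layers with weights are involved.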

3. Using the position-sensitive inside/outside score maps


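The inside map scores how likely a pixel is to lie inside the object's boundary within the ROI, and the outside map scores how likely it is to lie outside the object's boundary within the ROI. If either of the two scores is high, the pixel belongs to a box that contains the object, so the ROI should be kept and classified; classification therefore takes the max of the two maps. Segmentation, on the other hand, has to pick out only what is inside the object, so a softmax is applied over the two maps to turn them into foreground/background probabilities.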

Our joint formulation fuses the two answers into two scores: inside and outside. There are three cases: 1) high inside score and low outside score: detection+, segmentation+; 2) low inside score and high outside score: detection+, segmentation-; 3) both scores are low: detection-, segmentation-.

The two scores answer the two questions jointly via softmax and max operations. For detection, we use max to differentiate cases 1)-2) (detection+) from case 3) (detection-). The detection score of the whole ROI is then obtained via average pooling over all pixels' likelihoods (followed by a softmax operator across all the categories). For segmentation, we use softmax to differentiate cases 1) (segmentation+) from 2) (segmentation-), at each pixel. The foreground mask (in probabilities) of the ROI is the union of the per-pixel segmentation scores (for each category). Similarly, the two sets of scores are from two 1 × 1 conv layer. The inside/outside classifiers are trained jointly as they receive the back-propagated gradients from both segmentation and detection losses.
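To make the max/softmax description concrete, here is a small plain-Python sketch of how the assembled inside/outside maps of one ROI could be turned into category probabilities and masks. The function names (`softmax`, `roi_outputs`) and the `inside[c][y][x]` layout are assumptions for illustration, and the toy scores are made up.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def roi_outputs(inside, outside):
    """inside[c][y][x] and outside[c][y][x] are the assembled ROI score maps
    for each of the C+1 categories (assumed layout).
    Returns (category probabilities, per-category foreground masks)."""
    det_scores, masks = [], []
    for ins_c, out_c in zip(inside, outside):
        pixel_max, mask = [], []
        for ins_row, out_row in zip(ins_c, out_c):
            mask_row = []
            for i, o in zip(ins_row, out_row):
                pixel_max.append(max(i, o))          # detection: max of the two scores
                mask_row.append(softmax([i, o])[0])  # segmentation: P(inside) at this pixel
            mask.append(mask_row)
        # Average pooling over all pixels gives the ROI's score for this category.
        det_scores.append(sum(pixel_max) / len(pixel_max))
        masks.append(mask)
    # Softmax across categories gives the ROI's classification probabilities.
    return softmax(det_scores), masks


# Toy 1x2 ROI with two categories. Category 0: both scores low at every pixel
# (case 3). Category 1: first pixel is case 1, second pixel is case 2.
inside = [[[-4.0, -4.0]], [[4.0, -4.0]]]
outside = [[[-4.0, -4.0]], [[-4.0, 4.0]]]
probs, masks = roi_outputs(inside, outside)
print(probs)
print(masks[1])
```

In this toy case both pixels of category 1 count towards detection, but only the first pixel ends up in the foreground mask.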

4. Bounding box regression

Following the modern object detection systems, bounding box (bbox) regression [13, 12] is used to refine the initial input ROIs. A sibling 1×1 convolutional layer with 4k² channels is added on the conv5 feature maps to estimate the bounding box shift in location and size.
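As a rough sketch of what "shift in location and size" can look like in code, the example below refines an ROI with four regression values. Note the assumptions: the excerpt does not spell out the parameterization, so the (dx, dy, dw, dh) form commonly used in R-CNN-style detectors is assumed here, and averaging the k × k per-cell estimates into one 4-vector (as R-FCN does) is likewise an assumption. The names `average_votes` and `apply_deltas` are hypothetical.

```python
import math


def average_votes(per_cell_deltas):
    """Average k*k per-cell (dx, dy, dw, dh) estimates into one 4-tuple
    (assumed aggregation, in the style of R-FCN voting)."""
    n = len(per_cell_deltas)
    return tuple(sum(d[i] for d in per_cell_deltas) / n for i in range(4))


def apply_deltas(roi, deltas):
    """Refine roi = (x0, y0, x1, y1) with deltas = (dx, dy, dw, dh).
    dx, dy shift the centre relative to the box size; dw, dh scale the
    width and height in log space (assumed parameterization)."""
    x0, y0, x1, y1 = roi
    dx, dy, dw, dh = deltas
    w, h = x1 - x0, y1 - y0
    cx, cy = x0 + 0.5 * w, y0 + 0.5 * h
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


# Toy example with k = 2: four cells that all vote for a small shift to the right.
votes = [(0.1, 0.0, 0.0, 0.0)] * 4
deltas = average_votes(votes)
refined = apply_deltas((10, 20, 50, 60), deltas)
print(refined)
```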

5. Three drawbacks of MNC

Such methods have several drawbacks. First, the ROI pooling step loses spatial details due to feature warping and resizing, which however, is necessary to obtain a fixed-size representation (e.g., 14 × 14 in [8]) for fc layers. Such distortion and fixed-size representation degrades the segmentation accuracy, especially for large objects. Second, the fc layers over-parametrize the task, without using regularization of local weight sharing. For example, the last fc layer has high dimensional 784-way output to estimate a 28 × 28 mask. Last, the per-ROI network computation in the last step is not shared among ROIs. As observed empirically, a considerably complex sub-network in the last step is necessary to obtain good accuracy [36, 9]. It is therefore slow for a large number of ROIs (typically hundreds or thousands of region proposals). For example, in the MNC method [8], which won the 1st place in COCO segmentation challenge 2015 [25], 10 layers in the ResNet-101 model [18] are kept in the per-ROI sub-network. The approach takes 1.4 seconds per image, where more than 80% of the time is spent on the last per-ROI step. These drawbacks motivate us to ask the question that, can we exploit the merits of FCNs for end-to-end instance-aware semantic segmentation?

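A note on the third drawback: when there are fully connected layers, the ROIs can only be forwarded one at a time. In this paper the network is fully convolutional, and every ROI reads its values from the same fixed set of 2k²(C+1) score maps, so all ROIs can be processed together. This is one of the benefits of going fully convolutional.
The R-FCN detection paper makes the same point.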

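6. FCN is really just a classification network

After reading these segmentation papers, I realized that although FCN is a semantic segmentation network, it really just replaces the last few fully connected layers of a classification network with convolutional layers, so that classifying a region becomes classifying every pixel.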
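The "fc layer becomes a conv layer" idea can be illustrated with a tiny plain-Python sketch: the same weights that classify one feature vector are applied at every spatial position, which is what a 1 × 1 convolution does. The names `fc` and `conv1x1`, the `feature_map[y][x]` layout, and the toy weights are assumptions for illustration only.

```python
def fc(features, weights, bias):
    """Fully connected layer: one score per class for a single feature vector."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def conv1x1(feature_map, weights, bias):
    """1x1 convolution: apply the same fc weights at every pixel.
    feature_map[y][x] is the channel vector at that pixel (assumed layout)."""
    return [[fc(pixel, weights, bias) for pixel in row] for row in feature_map]


# Toy setup: 2 channels, 2 classes, a 1x2 feature map.
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
feature_map = [[[2.0, 1.0], [0.5, 3.0]]]

scores = conv1x1(feature_map, weights, bias)
# Per-pixel label = index of the highest class score.
labels = [[max(range(len(s)), key=s.__getitem__) for s in row] for row in scores]
print(labels)
```

The classifier itself is unchanged; only the input it is applied to changes from one pooled vector to every position of the feature map.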

7. Object Segment Proposal

The task is to generate category-agnostic object segments. Traditional approaches, e.g., MCG [1] and Selective Search [41], use low level image features. Recently, the task is achieved by deep learning approaches, such as DeepMask [32] and SharpMask [33]. Recently, a fully convolutional approach is proposed in [5] (InstanceFCN), which inspires this work.

8. Performance comparison with other networks

[Figure: performance comparison with other methods]

Note on multi-scale testing:

Following [17, 18] (SPPnet), the position sensitive score maps are computed on a pyramid of testing images, where the shorter sides are of {480, 576, 688, 864, 1200, 1400} pixels. For each ROI, we obtain its result from the scale where the ROI has a number of pixels closest to 224 × 224. Note that RPN proposals are still computed from a single scale (shorter side 600). Multi-scale testing improves the accuracy by 2.8%.
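The scale selection rule can be sketched as follows. This is a minimal illustration under stated assumptions: the ROI size is given in pixels of the original image, "closest" is measured as the absolute difference in pixel count, and the helper name `best_scale` and its parameters are made up for the example.

```python
SCALES = (480, 576, 688, 864, 1200, 1400)  # shorter sides of the test pyramid
TARGET_AREA = 224 * 224                    # desired number of pixels per ROI


def best_scale(roi_w, roi_h, image_short_side, scales=SCALES):
    """Pick the pyramid scale at which the ROI's pixel count is closest to
    224 x 224. roi_w and roi_h are measured in the original image, whose
    shorter side is image_short_side (assumed convention)."""
    def area_at(scale):
        factor = scale / image_short_side   # resize factor for this pyramid level
        return roi_w * roi_h * factor * factor

    return min(scales, key=lambda s: abs(area_at(s) - TARGET_AREA))


# Small ROIs are read from large scales, large ROIs from small scales.
print(best_scale(100, 100, 600))
print(best_scale(300, 300, 600))
```

Small objects are thus effectively upsampled before their scores are read, and large objects are downsampled.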