Paper Reading: FCN: Fully Convolutional Networks for Semantic Segmentation

1. Paper Overview

This is a paper on semantic segmentation of images. Before it, semantic segmentation methods did not seem to work particularly well; once this method appeared it had a large impact: the network is efficient, the results are quite good, and it is a fully convolutional network.

Semantic segmentation means separating the objects of each class in an image at the pixel level; in typical visualizations, all pixels belonging to the same class are shown in one color. Note that this is pixel-level classification.
Instance segmentation is stricter: even different individuals of the same class must have their pixels shown in different colors. Mask R-CNN, published a few years after this paper, implements exactly that.

Network principle: in short, start from an existing classification network (e.g., VGG) and replace the fully connected layers at the end with corresponding convolutional layers. A fully connected layer can only produce a one-dimensional prediction, while semantic segmentation needs a two-dimensional map as output, so an important step in FCN is converting all fully connected layers into equivalent convolutional layers. An upsampling stage then brings the output back to the original image size; upsampling can be done with bilinear interpolation (a resize) or with deconvolution (transposed convolution). Finally, a softmax classifies each pixel. To improve the results, features from different layers can be fused before the softmax (elementwise addition of corresponding pixels, or concatenation).
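
To make the pipeline concrete, here is a minimal PyTorch sketch of the FCN-32s idea under the assumptions above (a VGG-16 backbone, a 1×1 per-class score layer, bilinear upsampling). The layer sizes and names are illustrative, not the authors' original Caffe implementation.

```python
import torch
import torch.nn as nn
import torchvision


class FCN32s(nn.Module):
    """Minimal FCN-32s-style sketch: convolutionalized VGG head + upsampling."""

    def __init__(self, num_classes=21):
        super().__init__()
        vgg = torchvision.models.vgg16()
        self.features = vgg.features  # conv/pool stack, overall stride 32
        # fc6/fc7 replaced by convolutions, plus a 1x1 per-class score layer
        self.classifier = nn.Sequential(
            nn.Conv2d(512, 4096, kernel_size=7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(4096, num_classes, kernel_size=1),
        )

    def forward(self, x):
        h, w = x.shape[2:]
        score = self.classifier(self.features(x))  # coarse score map (stride 32)
        # Upsample back to input resolution. Bilinear interpolation is used here;
        # a learned transposed convolution (deconvolution) is the other option.
        return nn.functional.interpolate(score, size=(h, w),
                                         mode="bilinear", align_corners=False)


# The output is a per-pixel score map: softmax (or CrossEntropyLoss when training)
# is applied independently at every pixel location.
logits = FCN32s()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 21, 224, 224])
```

The skip-connected FCN-16s/FCN-8s variants additionally fuse scores from shallower pooling layers before the final upsampling, which is the topic of section 6 below.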


Note: this paper has two versions, the CVPR 2015 conference version and the 2016 IEEE journal version. I'm not sure exactly how they differ; I only noticed there were two versions while reading, and I read the journal one, still somewhat in a fog. I don't have much of my own understanding of this paper and mainly relied on this Zhihu article: 全卷积网络 FCN 详解, so I won't explain the paper itself at length here. What follows mainly records a few points rather than my own interpretation.

2. The Inherent Tension in Semantic Segmentation

Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. What can be done to navigate this spectrum from location to semantics? How can local decisions respect global structure? It is not immediately clear that deep networks for image classification yield representations sufficient for accurate, pixelwise recognition.

3. Converting Between Fully Connected and Convolutional Layers

Reference: 全卷积网络 FCN 详解

The only difference between a fully connected layer and a convolutional layer is that the neurons in a convolutional layer connect only to a local region of the input, and neurons within a convolutional column share parameters. In both kinds of layer, however, the neurons compute dot products, so their functional form is the same. Converting between the two is therefore possible:

i) Conv → fully connected
For any convolutional layer there exists a fully connected layer that implements the same forward function. Its weight matrix would be a huge matrix that is zero everywhere except in certain blocks, and within most of those blocks the elements are equal (because of weight sharing).

ii) Fully connected → conv
Conversely, any fully connected layer can be converted into a convolutional layer. For example, a fully connected layer with K = 4096 whose input volume has size 7×7×512 can be viewed equivalently as a convolutional layer with F = 7, P = 0, S = 1, K = 4096. In other words, the filter size is set to match the size of the input volume. Because only a single depth column then covers and slides over the input volume, the output becomes 1×1×4096, the same result the original fully connected layer would give.
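
As a quick sanity check of this equivalence, here is a hedged PyTorch sketch using the same hypothetical shapes (a 7×7×512 input volume, K = 4096): copying the fully connected weights into a 7×7 convolution reproduces its output exactly.

```python
import torch
import torch.nn as nn

# Hypothetical shapes from the example above: 7x7x512 input volume, K = 4096.
fc = nn.Linear(7 * 7 * 512, 4096)
conv = nn.Conv2d(512, 4096, kernel_size=7, stride=1, padding=0)

# Reuse the fc weights as convolution filters: each of the 4096 output units
# becomes one 512x7x7 filter.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(4096, 512, 7, 7))
    conv.bias.copy_(fc.bias)

x = torch.randn(1, 512, 7, 7)
out_fc = fc(x.flatten(1))    # shape (1, 4096)
out_conv = conv(x)           # shape (1, 4096, 1, 1)
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-5))  # True
```

The payoff of the convolutional form is that a larger input (say 14×14×512) no longer causes a shape mismatch; it simply produces a grid of outputs, which is exactly what lets a classification network emit a spatial score map.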

4. FCN extensions

Following the conference version of this paper [17], FCNs have been extended to new tasks and data. Tasks include region proposals [31], contour detection [32], depth regression [33], optical flow [34], and weakly supervised semantic segmentation [35], [36], [37], [38].

5. Image-to-image learning (two regimes for batch size)

We consider two regimes for batch size. In the first, gradients are accumulated over 20 images. Accumulation reduces the memory required and respects the different dimensions of each input by reshaping the network. We picked this batch size empirically to result in reasonable convergence. Learning in this way is similar to standard classification training: each minibatch contains several images and has a varied distribution of class labels. The nets compared in Table 1 are optimized in this fashion.

However, batching is not the only way to do image-wise learning. In the second regime, batch size one is used for online learning. Properly tuned, online learning achieves higher accuracy and faster convergence in both number of iterations and wall clock time. Additionally, we try a higher momentum of 0.99, which increases the weight on recent gradients in a similar way to batching. See Table 2 for the comparison of accumulation, online, and high momentum or “heavy” learning (discussed further in Section 6.2).
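
As a rough illustration of the first regime, here is a sketch (in PyTorch, not the paper's Caffe setup) of accumulating gradients over 20 single-image passes before each update; `model`, `criterion`, and `loader` are placeholders.

```python
import torch

# model, criterion and loader stand in for an FCN, a per-pixel loss, and a stream
# of single (image, label-map) pairs; accum_steps=20 mirrors the first regime.
def train_accumulated(model, criterion, loader, accum_steps=20, lr=1e-4, momentum=0.9):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    optimizer.zero_grad()
    for step, (image, target) in enumerate(loader, start=1):
        loss = criterion(model(image), target)
        loss.backward()                # gradients from each image accumulate in .grad
        if step % accum_steps == 0:
            optimizer.step()           # one parameter update per 20 images
            optimizer.zero_grad()

# The online / "heavy" regime is the same loop with accum_steps=1 (an update after
# every image) and a higher momentum such as 0.99.
```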


6. Fusing Features from Different Layers

Layer fusion is essentially an elementwise operation (i.e., pixelwise addition). However, the correspondence of elements across layers is complicated by resampling and padding. Thus, in general, layers to be fused must be aligned by scaling and cropping. We bring two layers into scale agreement by upsampling the lower-resolution layer, doing so in-network as explained in Section 3.3. Cropping removes any portion of the upsampled layer which extends beyond the other layer due to padding. This results in layers of equal dimensions in exact alignment. The offset of the cropped region depends on the resampling and padding parameters of all intermediate layers.
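
The sketch below illustrates this fusion step in PyTorch under assumed shapes: the coarser score map is upsampled 2× with a transposed convolution, cropped to the finer map's size, and added elementwise. It only mirrors the align-and-sum logic described above, not the paper's exact crop offsets.

```python
import torch
import torch.nn as nn

num_classes = 21

# 2x upsampling of the coarser score map with a transposed convolution.
upsample2x = nn.ConvTranspose2d(num_classes, num_classes,
                                kernel_size=4, stride=2, bias=False)

score_pool4 = torch.randn(1, num_classes, 34, 34)  # finer layer (e.g. stride 16)
score_fc7 = torch.randn(1, num_classes, 17, 17)    # coarser layer (e.g. stride 32)

up = upsample2x(score_fc7)                         # -> (1, 21, 36, 36), slightly too big
# Crop the upsampled map so both layers have identical dimensions and alignment;
# in the real network the crop offset depends on the padding/resampling of all
# intermediate layers, here a fixed offset of 1 is used purely for illustration.
h, w = score_pool4.shape[2:]
up = up[:, :, 1:1 + h, 1:1 + w]

fused = score_pool4 + up                           # elementwise (per-pixel) sum
print(fused.shape)                                 # torch.Size([1, 21, 34, 34])
```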

References

1. 全卷积网络 FCN 详解

2. 图像语义分割入门+FCN/U-Net网络解析

3. FCN 的简单实现
