Feature Pyramid Networks for Object Detection

Abstract

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A topdown architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art singlemodel results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.

本文的核心思路就是，将提取特征过程中先后提取到的特征都进行二次应用——第一次向上提取称为bottom-top，再用upsampling处理顶层特征向下处理称为top-bottom，融合两次得到的特征层就是FPN的核心。

1. Introduction

这张图相信已经见过很多次了，有一份很好的说明，来自FPN详解

识别不同大小的物体是计算机视觉中的一个基本挑战，我们常用的解决方案是构造多尺度金字塔。如上图a所示，这是一个特征图像金字塔，整个过程是先对原始图像构造图像金字塔，然后在图像金字塔的每一层提出不同的特征，然后进行相应的预测（BB的位置）。这种方法的缺点是计算量大，需要大量的内存；优点是可以获得较好的检测精度。它通常会成为整个算法的性能瓶颈，由于这些原因，当前很少使用这种算法。如上图b所示，这是一种改进的思路，学者们发现我们可以利用卷积网络本身的特性，即对原始图像进行卷积和池化操作，通过这种操作我们可以获得不同尺寸的feature map，这样其实就类似于在图像的特征空间中构造金字塔。实验表明，浅层的网络更关注于细节信息，高层的网络更关注于语义信息，而高层的语义信息能够帮助我们准确的检测出目标，因此我们可以利用最后一个卷积层上的feature map来进行预测。这种方法存在于大多数深度网络中，比如VGG、ResNet、Inception，它们都是利用深度网络的最后一层特征来进行分类。这种方法的优点是速度快、需要内存少。它的缺点是我们仅仅关注深层网络中最后一层的特征，却忽略了其它层的特征，但是细节信息可以在一定程度上提升检测的精度。因此有了图c所示的架构，它的设计思想就是同时利用低层特征和高层特征，分别在不同的层同时进行预测，这是因为我的一幅图像中可能具有多个不同大小的目标，区分不同的目标可能需要不同的特征，对于简单的目标我们仅仅需要浅层的特征就可以检测到它，对于复杂的目标我们就需要利用复杂的特征来检测它。整个过程就是首先在原始图像上面进行深度卷积，然后分别在不同的特征层上面进行预测。它的优点是在不同的层上面输出对应的目标，不需要经过所有的层才输出对应的目标（即对于有些目标来说，不需要进行多余的前向操作），这样可以在一定程度上对网络进行加速操作，同时可以提高算法的检测性能。它的缺点是获得的特征不鲁棒，都是一些弱特征（因为很多的特征都是从较浅的层获得的）。讲了这么多终于轮到我们的FPN啦，它的架构如图d所示，整个过程如下所示，首先我们在输入的图像上进行深度卷积，然后对Layer2上面的特征进行降维操作（即添加一层1x1的卷积层），对Layer4上面的特征就行上采样操作，使得它们具有相应的尺寸，然后对处理后的Layer2和处理后的Layer4执行加法操作（对应元素相加），将获得的结果输入到Layer5中去。其背后的思路是为了获得一个强语义信息，这样可以提高检测性能。认真的你可能观察到了，这次我们使用了更深的层来构造特征金字塔，这样做是为了使用更加鲁棒的信息；除此之外，我们将处理过的低层特征和处理过的高层特征进行累加，这样做的目的是因为低层特征可以提供更加准确的位置信息，而多次的降采样和上采样操作使得深层网络的定位信息存在误差，因此我们将其结合其起来使用，这样我们就构建了一个更深的特征金字塔，融合了多层特征信息，并在不同的特征进行输出。这就是上图的详细解释。（个人观点而已）

在这个过程中提到了上采样（upsampling），在计算机视觉中upsampling(上采样)的三种方式中同样也有说明。举例说明的话可以参考我的另一篇文章Mask R-CNN，里面提到的RoIAlign就用到了插值的方法对齐像素，而这里则采用了类似的方法扩展特征样本以便继续向下传播。

作者的算法结构可以分为三个部分：自下而上的卷积神经网络（上图左），自上而下过程（上图右）和特征与特征之间的侧边连接。
自下而上的部分其实就是卷积神经网络的前向过程。在前向过程中，特征图的大小在经过某些层后会改变，而在经过其他一些层的时候不会改变，作者将不改变特征图大小的层归为一个阶段，因此每次抽取的特征都是每个阶段的最后一个层的输出，这样就能构成特征金字塔。具体来说，对于ResNets，作者使用了每个阶段的最后一个残差结构的特征激活输出。将这些残差模块输出表示为{C2, C3, C4, C5}，对应于conv2，conv3，conv4和conv5的输出。
自上而下的过程采用上采样进行。上采样几乎都是采用内插值方法，即在原有图像像素的基础上在像素点之间采用合适的插值算法插入新的元素，从而扩大原图像的大小。通过对特征图进行上采样，使得上采样后的特征图具有和下一层的特征图相同的大小。
侧边之间的横向连接的过程在下图中展示。根本上来说，是将上采样的结果和自下而上生成的特征图进行融合。我们将卷积神经网络中生成的对应层的特征图进行1×1的卷积操作，将之与经过上采样的特征图融合，得到一个新的特征图，这个特征图融合了不同层的特征，具有更丰富的信息。这里1×1的卷积操作目的是改变channels，要求和后一层的channels相同。在融合之后还会再采用3*3的卷积核对每个融合结果进行卷积，目的是消除上采样的混叠效应，如此就得到了一个新的特征图。这样一层一层地迭代下去，就可以得到多个新的特征图。假设生成的特征图结果是P2，P3，P4，P5，它们和原来自底向上的卷积结果C2，C3，C4，C5一一对应。金字塔结构中所有层级共享分类层（回归层）。

2. Related Work

Hand-engineered features and early neural networks. SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching. HOG features [5], and later SIFT features as well, were computed densely over entire image pyramids. These HOG and SIFT pyramids have been used in numerous works for image classification, object detection, human pose estimation, and more. There has also been significant interest in computing featurized image pyramids quickly. Doll´ar et al. [6] demonstrated fast pyramid computation by first computing a sparsely sampled (in scale) pyramid and then interpolating missing levels. Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales.

手工设计特征和早期神经网络。SIFT特征[25]最初是从尺度空间极值中提取的，用于特征点匹配。HOG特征[5]，以及后来的SIFT特征，都是在整个图像金字塔上密集计算的。这些HOG和SIFT金字塔已在许多工作中得到了应用，用于图像分类，目标检测，人体姿势估计等。这对快速计算特征化图像金字塔也很有意义。Dollar等人[6]通过先计算一个稀疏采样（尺度）金字塔，然后插入缺失的层级，从而演示了快速金字塔计算。在HOG和SIFT之前，使用ConvNet[38，32]的早期人脸检测工作计算了图像金字塔上的浅网络，以检测跨尺度的人脸。

Deep ConvNet object detectors. With the development of modern deep ConvNets [19], object detectors like Over- Feat [34] and R-CNN [12] showed dramatic improvements in accuracy. OverFeat adopted a strategy similar to early neural network face detectors by applying a ConvNet as a sliding window detector on an image pyramid. R-CNN adopted a region proposal-based strategy [37] in which each proposal was scale-normalized before classifying with a ConvNet. SPPnet [15] demonstrated that such region-based detectors could be applied much more efficiently on feature maps extracted on a single image scale. Recent and more accurate detection methods like Fast R-CNN [11] and Faster R-CNN [29] advocate using features computed from a single scale, because it offers a good trade-off between accuracy and speed. Multi-scale detection, however, still performs better, especially for small objects.

Deep ConvNet目标检测器。随着现代深度卷积网络[19]的发展，像OverFeat[34]和R-CNN[12]这样的目标检测器在精度上显示出了显著的提高。OverFeat采用了一种类似于早期神经网络人脸检测器的策略，通过在图像金字塔上应用ConvNet作为滑动窗口检测器。R-CNN采用了基于区域提议的策略[37]，其中每个提议在用ConvNet进行分类之前都进行了尺度归一化。SPPnet[15]表明，这种基于区域的检测器可以更有效地应用于在单个图像尺度上提取的特征映射。最近更准确的检测方法，如Fast R-CNN[11]和Faster R-CNN[29]提倡使用从单一尺度计算出的特征，因为它提供了精确度和速度之间的良好折衷。然而，多尺度检测性能仍然更好，特别是对于小型目标。

Methods using multiple layers. A number of recent approaches improve detection and segmentation by using different layers in a ConvNet. FCN [24] sums partial scores for each category over multiple scales to compute semantic segmentations. Hypercolumns [13] uses a similar method for object instance segmentation. Several other approaches (HyperNet [18], ParseNet [23], and ION [2]) concatenate features of multiple layers before computing predictions, which is equivalent to summing transformed features. SSD [22] and MS-CNN [3] predict objects at multiple layers of the feature hierarchy without combining features or scores. There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and Sharp- Mask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation. Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine
segmentation. Although these methods adopt architectures with pyramidal shapes, they are unlike featurized image pyramids [5, 7, 34] where predictions are made independently at all levels, see Fig. 2. In fact, for the pyramidal architecture in Fig. 2 (top), image pyramids are still needed to recognize objects across multiple scales [28].

使用多层的方法。一些最近的方法通过使用ConvNet中的不同层来改进检测和分割。FCN[24]将多个尺度上的每个类别的部分分数相加以计算语义分割。Hypercolumns[13]使用类似的方法进行目标实例分割。在计算预测之前，其他几种方法（HyperNet[18]，ParseNet[23]和ION[2]）将多个层的特征连接起来，这相当于累加转换后的特征。SSD[22]和MS-CNN[3]可预测特征层级中多个层的目标，而不需要组合特征或分数。

3. Feature Pyramid Networks

Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. The resulting Feature Pyramid Network is generalpurpose and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and
region-based detectors (Fast R-CNN) [11]. We also generalize FPNs to instance segmentation proposals in Sec. 6.
Our method takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion. This process is independent of the backbone convolutional architectures (e.g., [19, 36, 16]), and in this paper we present results using ResNets [16]. The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced in the following.

我们的目标是利用ConvNet的金字塔特征层级，该层次结构具有从低到高的语义，并在整个过程中构建具有高级语义的特征金字塔。由此产生的特征金字塔网络是通用的，在本文中，我们侧重于滑动窗口提议（Region Proposal Network，简称RPN）[29]和基于区域的检测器（Fast R-CNN）[11]。在第6节中我们还将FPN泛化到实例细分提议。

我们的方法以任意大小的单尺度图像作为输入，并以全卷积的方式输出多层适当大小的特征映射。这个过程独立于主卷积体系结构（例如[19，36，16]），在本文中，我们呈现了使用ResNets[16]的结果。如下所述，我们的金字塔结构包括自下而上的路径，自上而下的路径和横向连接。

Bottom-up pathway. The bottom-up pathway is the feedforward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. There are often many layers producing output maps of the same size and we say these layers are in the same network stage. For our feature pyramid, we define one pyramid level for each stage. We choose the output of the last layer of each stage as our reference set of feature maps, which we will enrich to create our pyramid. This choice is natural since the deepest layer of each stage should have the strongest features.
Specifically, for ResNets [16] we use the feature activations output by each stage’s last residual block. We denote the output of these last residual blocks as fC2;C3;C4;C5g for conv2, conv3, conv4, and conv5 outputs, and note that they have strides of f4, 8, 16, 32g pixels with respect to the input image. We do not include conv1 into the pyramid due to its large memory footprint.

自下而上的路径。自下向上的路径是主ConvNet的前馈计算，其计算由尺度步长为2的多尺度特征映射组成的特征层级。通常有许多层产生相同大小的输出映射，并且我们认为这些层位于相同的网络阶段。对于我们的特征金字塔，我们为每个阶段定义一个金字塔层。我们选择每个阶段的最后一层的输出作为我们的特征映射参考集，我们将丰富它来创建我们的金字塔。这种选择是自然的，因为每个阶段的最深层应具有最强大的特征。

具体而言，对于ResNets[16]，我们使用每个阶段的最后一个残差块输出的特征激活。对于conv2，conv3，conv4和conv5输出，我们将这些最后残差块的输出表示为{C2,C3,C4,C5}，并注意相对于输入图像它们的步长为{4，8，16，32}个像素。由于其庞大的内存占用，我们不会将conv1纳入金字塔。

Top-down pathway and lateral connections. The topdown pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. These features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.
Fig. 3 shows the building block that constructs our topdown feature maps. With a coarser-resolution feature map, we upsample the spatial resolution by a factor of 2 (using nearest neighbor upsampling for simplicity). The upsam- pled map is then merged with the corresponding bottom-up map (which undergoes a 11 convolutional layer to reduce channel dimensions) by element-wise addition. This process is iterated until the finest resolution map is generated. To start the iteration, we simply attach a 11 convolutional layer on C5 to produce the coarsest resolution map. Finally, we append a 33 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling. This final set of feature maps is called fP2; P3; P4; P5g, corresponding to fC2;C3;C4;C5g that are respectively of the same spatial sizes.

自顶向下的路径和横向连接。自顶向下的路径通过上采样空间上更粗糙但在语义上更强的来自较高金字塔等级的特征映射来幻化更高分辨率的特征。这些特征随后通过来自自下而上路径上的特征经由横向连接进行增强。每个横向连接合并来自自下而上路径和自顶向下路径的具有相同空间大小的特征映射。自下而上的特征映射具有较低级别的语义，但其激活可以更精确地定位，因为它被下采样的次数更少。

图3显示了建造我们的自顶向下特征映射的构建块。使用较粗糙分辨率的特征映射，我们将空间分辨率上采样为2倍（为了简单起见，使用最近邻上采样）。然后通过按元素相加，将上采样映射与相应的自下而上映射（其经过1×1卷积层来减少通道维度）合并。迭代这个过程，直到生成最佳分辨率映射。为了开始迭代，我们只需在C5上添加一个1×1卷积层来生成最粗糙分辨率映射。最后，我们在每个合并的映射上添加一个3×3卷积来生成最终的特征映射，这是为了减少上采样的混叠效应。这个最终的特征映射集称为{P2,P3,P4,P5}，对应于{C2,C3,C4,C5}，分别具有相同的空间大小。

Because all levels of the pyramid use shared classifiers/ regressors as in a traditional featurized image pyramid, we fix the feature dimension (numbers of channels, denoted as d) in all the feature maps. We set d = 256 in this paper and thus all extra convolutional layers have 256-channel outputs. There are no non-linearities in these extra layers, which we have empirically found to have minor impacts.
Simplicity is central to our design and we have found that our model is robust to many design choices. We have experimented with more sophisticated blocks (e.g., using multilayer residual blocks [16] as the connections) and observed marginally better results. Designing better connection modules is not the focus of this paper, so we opt for the simple design described above.

由于金字塔的所有层都像传统的特征图像金字塔一样使用共享分类器/回归器，因此我们在所有特征映射中固定特征维度（通道数记为d）。我们在本文中设置d=256，因此所有额外的卷积层都有256个通道的输出。在这些额外的层中没有非线性，我们在实验中发现这些影响很小。

简洁性是我们设计的核心，我们发现我们的模型对许多设计选择都很鲁棒。我们已经尝试了更复杂的块（例如，使用多层残差块[16]作为连接）并观察到稍微更好的结果。设计更好的连接模块并不是本文的重点，所以我们选择上述的简单设计。

FPN论文阅读笔记

Feature Pyramid Networks for Object Detection

Abstract

1. Introduction

2. Related Work

3. Feature Pyramid Networks

4. Applications

4.1. Feature Pyramid Networks for RPN

4.2. Feature Pyramid Networks for Fast RCNN

5. Experiments on Object Detection

6. Extensions: Segmentation Proposals

7. Conclusion

References

猜你喜欢