R-CNN Paper Close Reading (Paper Translation)

Abstract

Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years.

The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context.

In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012—achieving a mAP of 53.3%.

Our approach combines two key insights:
(1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects, and
(2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.

Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/˜rbg/rcnn.

1. Introduction

Features matter. The last decade of progress on various visual recognition tasks has been based considerably on the use of SIFT and HOG. But if we look at performance on the canonical visual recognition task, PASCAL VOC object detection, it is generally acknowledged that progress has been slow during 2010-2012, with small gains obtained by building ensemble systems and employing minor variants of successful methods.

SIFT and HOG are blockwise orientation histograms, a representation we could associate roughly with complex cells in V1, the first cortical area in the primate visual pathway. But we also know that recognition occurs several stages downstream, which suggests that there might be hierarchical, multi-stage processes for computing features that are even more informative for visual recognition.

Fukushima’s “neocognitron”, a biologically-inspired hierarchical and shift-invariant model for pattern recognition, was an early attempt at just such a process. The neocognitron, however, lacked a supervised training algorithm. Building on Rumelhart et al., LeCun et al. showed that stochastic gradient descent via backpropagation was effective for training convolutional neural networks (CNNs), a class of models that extend the neocognitron.

CNNs saw heavy use in the 1990s, but then fell out of fashion with the rise of support vector machines. In 2012, Krizhevsky et al. rekindled interest in CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Their success resulted from training a large CNN on 1.2 million labeled images, together with a few twists on LeCun’s CNN (e.g., max(x, 0) rectifying non-linearities and “dropout” regularization).

The significance of the ImageNet result was vigorously debated during the ILSVRC 2012 workshop. The central issue can be distilled to the following: To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?

We answer this question by bridging the gap between image classification and object detection. This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features. To achieve this result, we focused on two problems: localizing objects with a deep network and training a high-capacity model with only a small quantity of annotated detection data.

Unlike image classification, detection requires localizing (likely many) objects within an image. One approach frames localization as a regression problem. However, work from Szegedy et al., concurrent with our own, indicates that this strategy may not fare well in practice (they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method). An alternative is to build a sliding-window detector. CNNs have been used in this way for at least two decades, typically on constrained object categories, such as faces and pedestrians. In order to maintain high spatial resolution, these CNNs typically only have two convolutional and pooling layers. We also considered adopting a sliding-window approach. However, units high up in our network, which has five convolutional layers, have very large receptive fields (195 × 195 pixels) and strides (32 × 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.

Instead, we solve the CNN localization problem by operating within the “recognition using regions” paradigm, which has been successful for both object detection and semantic segmentation. At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region’s shape. Figure 1 presents an overview of our method and highlights some of our results. Since our system combines region proposals with CNNs, we dub the method R-CNN: Regions with CNN features.

In this updated version of this paper, we provide a head-to-head comparison of R-CNN and the recently proposed OverFeat detection system by running R-CNN on the 200-class ILSVRC2013 detection dataset. OverFeat uses a sliding-window CNN for detection and until now was the best performing method on ILSVRC2013 detection. We show that R-CNN significantly outperforms OverFeat, with a mAP of 31.4% versus 24.3%.

A second challenge faced in detection is that labeled data is scarce and the amount currently available is insufficient for training a large CNN. The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning. The second principal contribution of this paper is to show that supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for learning high-capacity CNNs when data is scarce. In our experiments, fine-tuning for detection improves mAP performance by 8 percentage points. After fine-tuning, our system achieves a mAP of 54% on VOC 2010 compared to 33% for the highly-tuned, HOG-based deformable part model (DPM). We also point readers to contemporaneous work by Donahue et al., who show that Krizhevsky’s CNN can be used (without fine-tuning) as a blackbox feature extractor, yielding excellent performance on several recognition tasks including scene classification, fine-grained sub-categorization, and domain adaptation.

Our system is also quite efficient. The only class-specific computations are a reasonably small matrix-vector product and greedy non-maximum suppression. This computational property follows from features that are shared across all categories and that are also two orders of magnitude lower-dimensional than previously used region features.

Understanding the failure modes of our approach is also critical for improving it, and so we report results from the detection analysis tool of Hoiem et al. As an immediate consequence of this analysis, we demonstrate that a simple bounding-box regression method significantly reduces mislocalizations, which are the dominant error mode.

Before developing technical details, we note that because R-CNN operates on regions it is natural to extend it to the task of semantic segmentation. With minor modifications, we also achieve competitive results on the PASCAL VOC segmentation task, with an average segmentation accuracy of 47.9% on the VOC 2011 test set.

Figure 1: Object detection system overview. Our system
(1) takes an input image,
(2) extracts around 2000 bottom-up region proposals,
(3) computes features for each proposal using a large convolutional neural network (CNN), and then
(4) classifies each region using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010.
For comparison, a mAP of 35.1% is reported using the same region proposals but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at 33.4%. On the 200-class ILSVRC2013 detection dataset, R-CNN achieves a mAP of 31.4%, a large improvement over OverFeat, which obtains 24.3%.


2. Object detection with R-CNN

Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs. In this section, we present our design decisions for each module, describe their test-time usage, detail how their parameters are learned, and show detection results on PASCAL VOC 2010-12 and on ILSVRC2013.

2.1 Module design
Region proposals. A variety of recent papers offer methods for generating category-independent region proposals.

Examples include: objectness, selective search, category-independent object proposals, constrained parametric min-cuts (CPMC), multi-scale combinatorial grouping, and Cireşan et al., who detect mitotic cells by applying a CNN to regularly-spaced square crops, which are a special case of region proposals. While R-CNN is agnostic to the particular region proposal method, we use selective search to enable a controlled comparison with prior detection work.

Feature extraction. We extract a 4096-dimensional feature vector from each region proposal using the Caffe implementation of the CNN described by Krizhevsky et al. Features are computed by forward propagating a mean-subtracted 227 × 227 RGB image through five convolutional layers and two fully connected layers.

In order to compute features for a region proposal, we must first convert the image data in that region into a form that is compatible with the CNN (its architecture requires inputs of a fixed 227 × 227 pixel size). Of the many possible transformations of our arbitrary-shaped regions, we opt for the simplest. Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size. Prior to warping, we dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box (we use p = 16). Figure 2 shows a random sampling of warped training regions.
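
As a concrete illustration, the sketch below dilates a proposal box so that roughly p = 16 pixels of context (measured at the warped scale) surround it, and then anisotropically resizes the crop to 227 × 227. It is a minimal sketch assuming OpenCV is available; regions extending beyond the image are simply clamped here, whereas the paper fills missing pixels with the image mean, and the exact dilation arithmetic in the released code may differ.

```python
import cv2  # assumption: OpenCV is available for cropping and resizing

def warp_proposal(image, box, out_size=227, padding=16):
    """Crop a region proposal with context padding and warp it to a square CNN input.

    `image` is an HxWx3 array, `box` is (x1, y1, x2, y2) in pixel coordinates.
    The box is dilated so that, after warping, roughly `padding` pixels of context
    surround the original box on each side.
    """
    x1, y1, x2, y2 = box
    h, w = image.shape[:2]
    content = out_size - 2 * padding      # warped pixels left for the box itself
    sx = (x2 - x1) / float(content)       # original pixels per warped pixel (x)
    sy = (y2 - y1) / float(content)       # original pixels per warped pixel (y)
    x1 = int(max(0, round(x1 - padding * sx)))
    y1 = int(max(0, round(y1 - padding * sy)))
    x2 = int(min(w, round(x2 + padding * sx)))
    y2 = int(min(h, round(y2 + padding * sy)))
    crop = image[y1:y2, x1:x2]
    return cv2.resize(crop, (out_size, out_size))  # anisotropic ("warp") resize
```
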
2.2 Test-time detection
At test time, we run selective search on the test image to extract around 2000 region proposals (we use selective search’s “fast mode” in all experiments). We warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, we score each extracted feature vector using the SVM trained for that class. Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold.
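
For concreteness, here is a minimal NumPy sketch of per-class greedy non-maximum suppression; the 0.3 IoU threshold is an illustrative placeholder for the learned threshold mentioned above.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes; all boxes are (x1, y1, x2, y2)."""
    ix1 = np.maximum(box[0], boxes[:, 0])
    iy1 = np.maximum(box[1], boxes[:, 1])
    ix2 = np.minimum(box[2], boxes[:, 2])
    iy2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, thresh=0.3):
    """Keep the highest-scoring box, drop boxes overlapping it by more than `thresh` IoU, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps <= thresh]
    return keep
```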

Run-time analysis. Two properties make detection efficient. First, all CNN parameters are shared across all categories. Second, the feature vectors computed by the CNN are low-dimensional when compared to other common approaches, such as spatial pyramids with bag-of-visual-word encodings. The features used in the UVA detection system, for example, are two orders of magnitude larger than ours (360k vs. 4k-dimensional).

The result of such sharing is that the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) is amortized over all classes. The only class-specific computations are dot products between features and SVM weights and non-maximum suppression. In practice, all dot products for an image are batched into a single matrix-matrix product. The feature matrix is typically 2000×4096 and the SVM weight matrix is 4096×N, where N is the number of classes.
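
In code, the class-specific scoring step is a single matrix product plus biases; the shapes below mirror the numbers in the text (2000 proposals, 4096-dimensional features, N classes). This is a schematic NumPy sketch with random data, not the authors' implementation.

```python
import numpy as np

num_proposals, feat_dim, num_classes = 2000, 4096, 20  # illustrative sizes

features = np.random.randn(num_proposals, feat_dim).astype(np.float32)   # CNN features per proposal
svm_weights = np.random.randn(feat_dim, num_classes).astype(np.float32)  # one linear SVM per class
svm_bias = np.zeros(num_classes, dtype=np.float32)

# all per-class dot products for the image, batched into one matrix-matrix product
scores = features @ svm_weights + svm_bias   # shape: (2000, N)
```

The same single product would still dominate the cost with far more classes, which is what the scaling argument two paragraphs below relies on.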

This analysis shows that R-CNN can scale to thousands of object classes without resorting to approximate techniques, such as hashing. Even if there were 100k classes, the resulting matrix multiplication takes only 10 seconds on a modern multi-core CPU. This efficiency is not merely the result of using region proposals and shared features. The UVA system, due to its high-dimensional features, would be two orders of magnitude slower while requiring 134GB of memory just to store 100k linear predictors, compared to just 1.5GB for our lower-dimensional features.

It is also interesting to contrast R-CNN with the recent work from Dean et al. on scalable detection using DPMs and hashing. They report a mAP of around 16% on VOC 2007 at a run-time of 5 minutes per image when introducing 10k distractor classes. With our approach, 10k detectors can run in about a minute on a CPU, and because no approximations are made mAP would remain at 59% (Section 3.2).

2.3 Training
Supervised pre-training. We discriminatively pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding-box labels are not available for this data). Pre-training was performed using the open source Caffe CNN library. In brief, our CNN nearly matches the performance of Krizhevsky et al., obtaining a top-1 error rate 2.2 percentage points higher on the ILSVRC2012 classification validation set. This discrepancy is due to simplifications in the training process.

Domain-specific fine-tuning. To adapt our CNN to the new task (detection) and the new domain (warped proposal windows), we continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals. Aside from replacing the CNN’s ImageNet-specific 1000-way classification layer with a randomly initialized (N + 1)-way classification layer (where N is the number of object classes, plus 1 for background), the CNN architecture is unchanged. For VOC, N = 20 and for ILSVRC2013, N = 200. We treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives. We start SGD at a learning rate of 0.001 (1/10th of the initial pre-training rate), which allows fine-tuning to make progress while not clobbering the initialization. In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128. We bias the sampling towards positive windows because they are extremely rare compared to background.
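
A minimal sketch of the biased minibatch sampling described above (32 positive and 96 background windows per SGD iteration); the proposal record format and label convention are hypothetical placeholders, not the released training code.

```python
import random

def sample_minibatch(proposals, pos_per_batch=32, neg_per_batch=96):
    """Sample a 128-window minibatch biased towards positives.

    `proposals` is a list of dicts with a 'label' field (hypothetical layout):
    1..N for proposals with >= 0.5 IoU against a ground-truth box of that class,
    0 for background windows.
    """
    positives = [p for p in proposals if p["label"] > 0]
    negatives = [p for p in proposals if p["label"] == 0]
    batch = (random.sample(positives, min(pos_per_batch, len(positives)))
             + random.sample(negatives, min(neg_per_batch, len(negatives))))
    random.shuffle(batch)
    return batch
```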

Object category classifiers. Consider training a binary classifier to detect cars. It’s clear that an image region tightly enclosing a car should be a positive example. Similarly, it’s clear that a background region, which has nothing to do with cars, should be a negative example. Less clear is how to label a region that partially overlaps a car. We resolve this issue with an IoU overlap threshold, below which regions are defined as negatives. The overlap threshold, 0.3, was selected by a grid search over {0, 0.1, . . . , 0.5} on a validation set. We found that selecting this threshold carefully is important. Setting it to 0.5 decreased mAP by 5 points. Similarly, setting it to 0 decreased mAP by 4 points. Positive examples are defined simply to be the ground-truth bounding boxes for each class.
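
The SVM labeling rule can be made concrete with a small helper (reusing the iou function from the NMS sketch in Section 2.2): ground-truth boxes themselves serve as positives, proposals below 0.3 IoU with every instance of the class are negatives, and everything in between is ignored. This is an illustrative sketch of the stated rule, not the original training code.

```python
import numpy as np

def svm_label(proposal_box, gt_boxes, neg_thresh=0.3):
    """Label one proposal for a single class's SVM: -1 (negative) or None (ignored).

    Ground-truth boxes themselves are used directly as the positive examples;
    this helper only decides whether a proposal counts as a negative.
    """
    if len(gt_boxes) == 0:
        return -1                      # no instance of this class in the image
    overlaps = iou(np.asarray(proposal_box), np.asarray(gt_boxes))
    if overlaps.max() < neg_thresh:
        return -1                      # clearly not this class: negative example
    return None                        # grey zone (>= 0.3 IoU but not ground truth): ignored
```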

Once features are extracted and training labels are applied, we optimize one linear SVM per class. Since the training data is too large to fit in memory, we adopt the standard hard negative mining method [17, 37]. Hard negative mining converges quickly and in practice mAP stops increasing after only a single pass over all images.
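
Standard hard negative mining can be sketched as an alternating loop: train on the current negatives, score the remaining candidate negatives, and add the high-scoring ones back into the training set. The sketch below uses scikit-learn's LinearSVC as a stand-in for the original SVM solver; the pool size, C value, and hardness threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC  # assumption: scikit-learn stands in for the original solver

def mine_hard_negatives(pos_feats, candidate_neg_feats, rounds=2, hard_thresh=-1.0):
    """Alternate between training a linear SVM and adding high-scoring negatives.

    pos_feats: (P, D) features of ground-truth boxes for one class.
    candidate_neg_feats: (M, D) features of proposals already labeled negative.
    Negatives scoring above `hard_thresh` (near or inside the margin) are added;
    in practice mAP stops improving after roughly one pass.
    """
    start = min(500, len(candidate_neg_feats))
    neg_pool = candidate_neg_feats[np.random.choice(len(candidate_neg_feats), start, replace=False)]
    for _ in range(rounds):
        X = np.vstack([pos_feats, neg_pool])
        y = np.concatenate([np.ones(len(pos_feats)), -np.ones(len(neg_pool))])
        svm = LinearSVC(C=1e-3).fit(X, y)
        scores = svm.decision_function(candidate_neg_feats)
        hard = candidate_neg_feats[scores > hard_thresh]       # false positives to add back
        neg_pool = np.unique(np.vstack([neg_pool, hard]), axis=0)
    return svm
```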

In Appendix B we discuss why the positive and negative examples are defined differently in fine-tuning versus SVM training. We also discuss the trade-offs involved in training detection SVMs rather than simply using the outputs from the final softmax layer of the fine-tuned CNN.

2.4 Results on PASCAL VOC 2010-12
Following the PASCAL VOC best practices, we validated all design decisions and hyperparameters on the VOC 2007 dataset (Section 3.2). For final results on the VOC 2010-12 datasets, we fine-tuned the CNN on VOC 2012 train and optimized our detection SVMs on VOC 2012 trainval. We submitted test results to the evaluation server only once for each of the two major algorithm variants (with and without bounding-box regression).

Table 1 shows complete results on VOC 2010. We compare our method against four strong baselines, including SegDPM, which combines DPM detectors with the output of a semantic segmentation system and uses additional inter-detector context and image-classifier rescoring. The most germane comparison is to the UVA system from Uijlings et al., since our systems use the same region proposal algorithm. To classify regions, their method builds a four-level spatial pyramid and populates it with densely sampled SIFT, Extended OpponentSIFT, and RGB-SIFT descriptors, each vector quantized with 4000-word codebooks. Classification is performed with a histogram intersection kernel SVM. Compared to their multi-feature, non-linear kernel SVM approach, we achieve a large improvement in mAP, from 35.1% to 53.7% mAP, while also being much faster (Section 2.2). Our method achieves similar performance (53.3% mAP) on VOC 2011/12 test.

2.5 Results on ILSVRC2013 detection
We ran R-CNN on the 200-class ILSVRC2013 detection dataset using the same system hyperparameters that we used for PASCAL VOC. We followed the same protocol of submitting test results to the ILSVRC2013 evaluation server only twice, once with and once without bounding-box regression.

Figure 3 compares R-CNN to the entries in the ILSVRC 2013 competition and to the post-competition OverFeat result. R-CNN achieves a mAP of 31.4%, which is significantly ahead of the second-best result of 24.3% from OverFeat. To give a sense of the AP distribution over classes, box plots are also presented and a table of per-class APs follows at the end of the paper in Table 8. Most of the competing submissions (OverFeat, NEC-MU, UvA-Euvision, Toronto A, and UIUC-IFP) used convolutional neural networks, indicating that there is significant nuance in how CNNs can be applied to object detection, leading to greatly varying outcomes.
In Section 4, we give an overview of the ILSVRC2013 detection dataset and provide details about choices that we made when running R-CNN on it.

3. Visualization, ablation, and modes of error

3.1 Visualizing learned features
First-layer filters can be visualized directly and are easy to understand. They capture oriented edges and opponent colors. Understanding the subsequent layers is more challenging. Zeiler and Fergus present a visually attractive deconvolutional approach in [42]. We propose a simple (and complementary) non-parametric method that directly shows what the network learned.

The idea is to single out a particular unit (feature) in the network and use it as if it were an object detector in its own right. That is, we compute the unit’s activations on a large set of held-out region proposals (about 10 million), sort the proposals from highest to lowest activation, perform non-maximum suppression, and then display the top-scoring regions. Our method lets the selected unit “speak for itself” by showing exactly which inputs it fires on. We avoid averaging in order to see different visual modes and gain insight into the invariances computed by the unit.

We visualize units from layer pool5, which is the max-pooled output of the network’s fifth and final convolutional layer. The pool5 feature map is 6 × 6 × 256 = 9216-dimensional. Ignoring boundary effects, each pool5 unit has a receptive field of 195×195 pixels in the original 227×227 pixel input. A central pool5 unit has a nearly global view, while one near the edge has a smaller, clipped support.

Each row in Figure 4 displays the top 16 activations for a pool5 unit from a CNN that we fine-tuned on VOC 2007 trainval. Six of the 256 functionally unique units are visualized (Appendix D includes more). These units were selected to show a representative sample of what the network learns. In the second row, we see a unit that fires on dog faces and dot arrays. The unit corresponding to the third row is a red blob detector. There are also detectors for human faces and more abstract patterns such as text and triangular structures with windows. The network appears to learn a representation that combines a small number of class-tuned features together with a distributed representation of shape, texture, color, and material properties. The subsequent fully connected layer fc6 has the ability to model a large set of compositions of these rich features.
3.2 Ablation studies
Performance layer-by-layer, without fine-tuning. To understand which layers are critical for detection performance, we analyzed results on the VOC 2007 dataset for each of the CNN’s last three layers. Layer pool5 was briefly described in Section 3.1. The final two layers are summarized below.

Layer fc6 is fully connected to pool5. To compute features, it multiplies a 4096×9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases. This intermediate vector is component-wise half-wave rectified (x ← max(0, x)).

Layer fc7 is the final layer of the network. It is implemented by multiplying the features computed by fc6 by a 4096 × 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification.
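
In NumPy terms, the two fully connected layers amount to the following (shapes match the numbers in the text; random values stand in for the trained weights):

```python
import numpy as np

pool5 = np.random.randn(6, 6, 256).astype(np.float32)   # pool5 feature map
x = pool5.reshape(-1)                                    # 6*6*256 = 9216-dimensional vector

W6, b6 = np.random.randn(4096, 9216), np.zeros(4096)    # fc6 parameters
W7, b7 = np.random.randn(4096, 4096), np.zeros(4096)    # fc7 parameters

fc6 = np.maximum(0, W6 @ x + b6)    # half-wave rectification: x <- max(0, x)
fc7 = np.maximum(0, W7 @ fc6 + b7)
```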

We start by looking at results from the CNN without fine-tuning on PASCAL, i.e. all CNN parameters were pre-trained on ILSVRC 2012 only. Analyzing performance layer-by-layer (Table 2 rows 1-3) reveals that features from fc7 generalize worse than features from fc6. This means that 29%, or about 16.8 million, of the CNN’s parameters can be removed without degrading mAP. More surprising is that removing both fc7 and fc6 produces quite good results even though pool5 features are computed using only 6% of the CNN’s parameters. Much of the CNN’s representational power comes from its convolutional layers, rather than from the much larger densely connected layers. This finding suggests potential utility in computing a dense feature map, in the sense of HOG, of an arbitrary-sized image by using only the convolutional layers of the CNN. This representation would enable experimentation with sliding-window detectors, including DPM, on top of pool5 features.
Performance layer-by-layer, with fine-tuning. We now look at results from our CNN after having fine-tuned its parameters on VOC 2007 trainval. The improvement is striking (Table 2 rows 4-6): fine-tuning increases mAP by 8.0 percentage points to 54.2%. The boost from fine-tuning is much larger for fc6 and fc7 than for pool5, which suggests that the pool5 features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them.

Comparison to recent feature learning methods. Relatively few feature learning methods have been tried on PASCAL VOC detection. We look at two recent approaches that build on deformable part models. For reference, we also include results for the standard HOG-based DPM.

The first DPM feature learning method, DPM ST, augments HOG features with histograms of “sketch token” probabilities. Intuitively, a sketch token is a tight distribution of contours passing through the center of an image patch. Sketch token probabilities are computed at each pixel by a random forest that was trained to classify 35×35 pixel patches into one of 150 sketch tokens or background.

The second method, DPM HSC, replaces HOG with histograms of sparse codes (HSC). To compute an HSC, sparse code activations are solved for at each pixel using a learned dictionary of 100 7 × 7 pixel (grayscale) atoms. The resulting activations are rectified in three ways (full and both half-waves), spatially pooled, unit ℓ2-normalized, and then power transformed (x ← sign(x)|x|^α).

All R-CNN variants strongly outperform the three DPM baselines (Table 2 rows 8-10), including the two that use feature learning. Compared to the latest version of DPM, which uses only HOG features, our mAP is more than 20 percentage points higher: 54.2% vs. 33.7%—a 61% relative improvement. The combination of HOG and sketch tokens yields 2.5 mAP points over HOG alone, while HSC improves over HOG by 4 mAP points (when compared internally to their private DPM baselines—both use non-public implementations of DPM that underperform the open source version). These methods achieve mAPs of 29.1% and 34.3%, respectively.

3.3 Network architectures
Most results in this paper use the network architecture from Krizhevsky et al. [25]. However, we have found that the choice of architecture has a large effect on R-CNN detection performance. In Table 3 we show results on VOC 2007 test using the 16-layer deep network recently proposed by Simonyan and Zisserman [43]. This network was one of the top performers in the recent ILSVRC 2014 classification challenge. The network has a homogeneous structure consisting of 13 layers of 3 × 3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers. We refer to this network as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet.
To use O-Net in R-CNN, we downloaded the publicly available pre-trained network weights for the VGG ILSVRC 16 layers model from the Caffe Model Zoo. We then fine-tuned the network using the same protocol as we used for T-Net. The only difference was to use smaller minibatches (24 examples) as required in order to fit within GPU memory. The results in Table 3 show that R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%. However there is a considerable drawback in terms of compute time, with the forward pass of O-Net taking roughly 7 times longer than T-Net.

3.4 Detection error analysis
We applied the excellent detection analysis tool from Hoiem et al. in order to reveal our method’s error modes, understand how fine-tuning changes them, and to see how our error types compare with DPM. A full summary of the analysis tool is beyond the scope of this paper and we encourage readers to consult [23] to understand some finer details (such as “normalized AP”). Since the analysis is best absorbed in the context of the associated plots, we present the discussion within the captions of Figure 5 and Figure 6.
3.5 Bounding-box regression
Based on the error analysis, we implemented a simple method to reduce localization errors. Inspired by the bounding-box regression employed in DPM [17], we train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal. Full details are given in Appendix C. Results in Table 1, Table 2, and Figure 5 show that this simple approach fixes a large number of mislocalized detections, boosting mAP by 3 to 4 points.
3.6 Qualitative results
Qualitative detection results on ILSVRC2013 are presented in Figure 8 and Figure 9 at the end of the paper. Each image was sampled randomly from the val2 set and all detections from all detectors with a precision greater than 0.5 are shown. Note that these are not curated and give a realistic impression of the detectors in action. More qualitative results are presented in Figure 10 and Figure 11, but these have been curated. We selected each image because it contained interesting, surprising, or amusing results. Here, also, all detections at precision greater than 0.5 are shown.

4. The ILSVRC2013 detection dataset

In Section 2 we presented results on the ILSVRC2013 detection dataset. This dataset is less homogeneous than PASCAL VOC, requiring choices about how to use it. Since these decisions are non-trivial, we cover them in this section.

4.1 Dataset overview
The ILSVRC2013 detection dataset is split into three sets: train (395,918), val (20,121), and test (40,152), where the number of images in each set is in parentheses. The val and test splits are drawn from the same image distribution. These images are scene-like and similar in complexity (number of objects, amount of clutter, pose variability, etc.) to PASCAL VOC images. The val and test splits are exhaustively annotated, meaning that in each image all instances from all 200 classes are labeled with bounding boxes. The train set, in contrast, is drawn from the ILSVRC2013 classification image distribution. These images have more variable complexity with a skew towards images of a single centered object. Unlike val and test, the train images (due to their large number) are not exhaustively annotated. In any given train image, instances from the 200 classes may or may not be labeled. In addition to these image sets, each class has an extra set of negative images. Negative images are manually checked to validate that they do not contain any instances of their associated class. The negative image sets were not used in this work. More information on how ILSVRC was collected and annotated can be found in [11, 36].

The nature of these splits presents a number of choices for training R-CNN. The train images cannot be used for hard negative mining, because annotations are not exhaustive. Where should negative examples come from? Also, the train images have different statistics than val and test. Should the train images be used at all, and if so, to what extent? While we have not thoroughly evaluated a large number of choices, we present what seemed like the most obvious path based on previous experience.

Our general strategy is to rely heavily on the val set and use some of the train images as an auxiliary source of positive examples. To use val for both training and validation, we split it into roughly equally sized “val1” and “val2” sets. Since some classes have very few examples in val (the smallest has only 31 and half have fewer than 110), it is important to produce an approximately class-balanced partition. To do this, a large number of candidate splits were generated and the one with the smallest maximum relative class imbalance was selected (relative imbalance is measured as |a − b|/(a + b), where a and b are the class counts in each half of the split). Each candidate split was generated by clustering val images using their class counts as features, followed by a randomized local search that may improve the split balance. The particular split used here has a maximum relative imbalance of about 11% and a median relative imbalance of 4%. The val1/val2 split and code used to produce them will be publicly available to allow other researchers to compare their methods on the val splits used in this report.
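
The balance criterion and a randomized local search in the spirit of the description above can be sketched roughly as follows; the clustering initialization is omitted and all names are illustrative, so this only demonstrates the |a − b| / (a + b) imbalance measure being minimized, not the released splitting code.

```python
import random
import numpy as np

def max_relative_imbalance(split_mask, class_counts):
    """Largest per-class |a - b| / (a + b), where a and b are counts in each half."""
    a = class_counts[split_mask].sum(axis=0).astype(float)
    b = class_counts[~split_mask].sum(axis=0).astype(float)
    return np.max(np.abs(a - b) / np.maximum(a + b, 1.0))

def random_local_search(class_counts, iters=10000):
    """class_counts: (num_images, num_classes) array of instance counts per image."""
    n = class_counts.shape[0]
    mask = np.random.rand(n) < 0.5                 # random initial val1/val2 assignment
    best = max_relative_imbalance(mask, class_counts)
    for _ in range(iters):
        i = random.randrange(n)                    # try moving one image to the other side
        mask[i] = ~mask[i]
        score = max_relative_imbalance(mask, class_counts)
        if score <= best:
            best = score                           # keep the improving (or equal) move
        else:
            mask[i] = ~mask[i]                     # revert
    return mask, best
```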

4.2 Region proposals
We followed the same region proposal approach that was used for detection on PASCAL. Selective search [39] was run in “fast mode” on each image in val1, val2, and test (but not on images in train). One minor modification was required to deal with the fact that selective search is not scale invariant and so the number of regions produced depends on the image resolution. ILSVRC image sizes range from very small to a few that are several mega-pixels, and so we resized each image to a fixed width (500 pixels) before running selective search. On val, selective search resulted in an average of 2403 region proposals per image with a 91.6% recall of all ground-truth bounding boxes (at 0.5 IoU threshold). This recall is notably lower than in PASCAL, where it is approximately 98%, indicating significant room for improvement in the region proposal stage.
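
The recall figure quoted above (the fraction of ground-truth boxes matched by at least one proposal at IoU ≥ 0.5) can be computed with a few lines, reusing the iou helper from the NMS sketch in Section 2.2; the data layout is an assumption for illustration.

```python
import numpy as np

def proposal_recall(gt_boxes_per_image, proposals_per_image, thresh=0.5):
    """Fraction of ground-truth boxes covered by at least one proposal with IoU >= thresh."""
    covered, total = 0, 0
    for gt_boxes, proposals in zip(gt_boxes_per_image, proposals_per_image):
        proposals = np.asarray(proposals)
        for gt in gt_boxes:
            total += 1
            if len(proposals) and iou(np.asarray(gt), proposals).max() >= thresh:
                covered += 1
    return covered / max(total, 1)
```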

4.3 Training data
For training data, we formed a set of images and boxes that includes all selective search and ground-truth boxes from val1 together with up to N ground-truth boxes per class from train (if a class has fewer than N ground-truth boxes in train, then we take all of them). We’ll call this dataset of images and boxes val1+trainN. In an ablation study, we show mAP on val2 for N ∈ {0, 500, 1000} (Section 4.5).

Training data is required for three procedures in R-CNN: (1) CNN fine-tuning, (2) detector SVM training, and (3) bounding-box regressor training. CNN fine-tuning was run for 50k SGD iterations on val1+trainN using the exact same settings as were used for PASCAL. Fine-tuning on a single NVIDIA Tesla K20 took 13 hours using Caffe. For SVM training, all ground-truth boxes from val1+trainN were used as positive examples for their respective classes. Hard negative mining was performed on a randomly selected subset of 5000 images from val1. An initial experiment indicated that mining negatives from all of val1, versus a 5000 image subset (roughly half of it), resulted in only a 0.5 percentage point drop in mAP, while cutting SVM training time in half. No negative examples were taken from train because the annotations are not exhaustive. The extra sets of verified negative images were not used. The bounding-box regressors were trained on val1.

4.4 Validation and evaluation
Before submitting results to the evaluation server, we validated data usage choices and the effect of fine-tuning and bounding-box regression on the val2 set using the training data described above. All system hyperparameters (e.g., SVM C hyperparameters, padding used in region warping, NMS thresholds, bounding-box regression hyperparameters) were fixed at the same values used for PASCAL. Undoubtedly some of these hyperparameter choices are slightly suboptimal for ILSVRC, however the goal of this work was to produce a preliminary R-CNN result on ILSVRC without extensive dataset tuning. After selecting the best choices on val2, we submitted exactly two result files to the ILSVRC2013 evaluation server. The first submission was without bounding-box regression and the second submission was with bounding-box regression. For these submissions, we expanded the SVM and bounding-box regressor training sets to use val+train1k and val, respectively. We used the CNN that was fine-tuned on val1+train1k to avoid re-running fine-tuning and feature computation.

4.5 Ablation study
Table 4 shows an ablation study of the effects of different amounts of training data, fine-tuning, and bounding-box regression. A first observation is that mAP on val2 matches mAP on test very closely. This gives us confidence that mAP on val2 is a good indicator of test set performance. The first result, 20.9%, is what R-CNN achieves using a CNN pre-trained on the ILSVRC2012 classification dataset (no fine-tuning) and given access to the small amount of training data in val1 (recall that half of the classes in val1 have between 15 and 55 examples). Expanding the training set to val1+trainN improves performance to 24.1%, with essentially no difference between N = 500 and N = 1000. Fine-tuning the CNN using examples from just val1 gives a modest improvement to 26.5%, however there is likely significant overfitting due to the small number of positive training examples. Expanding the fine-tuning set to val1+train1k, which adds up to 1000 positive examples per class from the train set, helps significantly, boosting mAP to 29.7%. Bounding-box regression improves results to 31.0%, which is a smaller relative gain than what was observed in PASCAL.
4.6 Relationship to OverFeat
There is an interesting relationship between R-CNN and OverFeat: OverFeat can be seen (roughly) as a special case of R-CNN. If one were to replace selective search region proposals with a multi-scale pyramid of regular square regions and change the per-class bounding-box regressors to a single bounding-box regressor, then the systems would be very similar (modulo some potentially significant differences in how they are trained: CNN detection fine-tuning, using SVMs, etc.). It is worth noting that OverFeat has a significant speed advantage over R-CNN: it is about 9x faster, based on a figure of 2 seconds per image quoted from [34]. This speed comes from the fact that OverFeat’s sliding windows (i.e., region proposals) are not warped at the image level and therefore computation can be easily shared between overlapping windows. Sharing is implemented by running the entire network in a convolutional fashion over arbitrary-sized inputs. Speeding up R-CNN should be possible in a variety of ways and remains as future work.

5. Semantic segmentation (omitted)

6. Conclusion

In recent years, object detection performance had stagnated. The best performing systems were complex ensembles combining multiple low-level image features with high-level context from object detectors and scene classifiers. This paper presents a simple and scalable object detection algorithm that gives a 30% relative improvement over the best previous results on PASCAL VOC 2012.

We achieved this performance through two insights. The first is to apply high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects. The second is a paradigm for training large CNNs when labeled training data is scarce. We show that it is highly effective to pre-train the network—with supervision—for an auxiliary task with abundant data (image classification) and then to fine-tune the network for the target task where data is scarce (detection). We conjecture that the “supervised pre-training/domain-specific fine-tuning” paradigm will be highly effective for a variety of data-scarce vision problems.

We conclude by noting that it is significant that we achieved these results by using a combination of classical tools from computer vision and deep learning (bottom-up region proposals and convolutional neural networks). Rather than opposing lines of scientific inquiry, the two are natural and inevitable partners.

Acknowledgments. This research was supported in part by DARPA Mind’s Eye and MSEE programs, by NSF awards IIS-0905647, IIS-1134072, and IIS-1212798, MURI N000014-10-1-0933, and by support from Toyota. The GPUs used in this research were generously donated by the NVIDIA Corporation.

Appendix

A. Object proposal transformations
The convolutional neural network used in this work requires a fixed-size input of 227 × 227 pixels. For detection, we consider object proposals that are arbitrary image rectangles. We evaluated two approaches for transforming object proposals into valid CNN inputs.

The first method (“tightest square with context”) encloses each object proposal inside the tightest square and then scales (isotropically) the image contained in that square to the CNN input size. Figure 7 column (B) shows this transformation. A variant on this method (“tightest square without context”) excludes the image content that surrounds the original object proposal. Figure 7 column (C) shows this transformation. The second method (“warp”) anisotropically scales each object proposal to the CNN input size. Figure 7 column (D) shows the warp transformation.
For each of these transformations, we also consider including additional image context around the original object proposal. The amount of context padding (p) is defined as a border size around the original object proposal in the transformed input coordinate frame. Figure 7 shows p = 0 pixels in the top row of each example and p = 16 pixels in the bottom row. In all methods, if the source rectangle extends beyond the image, the missing data is replaced with the image mean (which is then subtracted before inputting the image into the CNN). A pilot set of experiments showed that warping with context padding (p = 16 pixels) outperformed the alternatives by a large margin (3-5 mAP points). Obviously more alternatives are possible, including using replication instead of mean padding. Exhaustive evaluation of these alternatives is left as future work.

B. Positive vs. negative examples and softmax
Two design choices warrant further discussion. The first is: Why are positive and negative examples defined differently for fine-tuning the CNN versus training the object detection SVMs? To review the definitions briefly, for fine-tuning we map each object proposal to the ground-truth instance with which it has maximum IoU overlap (if any) and label it as a positive for the matched ground-truth class if the IoU is at least 0.5. All other proposals are labeled “background” (i.e., negative examples for all classes). For training SVMs, in contrast, we take only the ground-truth boxes as positive examples for their respective classes and label proposals with less than 0.3 IoU overlap with all instances of a class as a negative for that class. Proposals that fall into the grey zone (more than 0.3 IoU overlap, but are not ground truth) are ignored.

Historically speaking, we arrived at these definitions because we started by training SVMs on features computed by the ImageNet pre-trained CNN, and so fine-tuning was not a consideration at that point in time. In that setup, we found that our particular label definition for training SVMs was optimal within the set of options we evaluated (which included the setting we now use for fine-tuning). When we started using fine-tuning, we initially used the same positive and negative example definition as we were using for SVM training. However, we found that results were much worse than those obtained using our current definition of positives and negatives.

Our hypothesis is that this difference in how positives and negatives are defined is not fundamentally important and arises from the fact that fine-tuning data is limited. Our current scheme introduces many “jittered” examples (those proposals with overlap between 0.5 and 1, but not ground truth), which expands the number of positive examples by approximately 30x. We conjecture that this large set is needed when fine-tuning the entire network to avoid overfitting. However, we also note that using these jittered examples is likely suboptimal because the network is not being fine-tuned for precise localization.

This leads to the second issue: Why, after fine-tuning, train SVMs at all? It would be cleaner to simply apply the last layer of the fine-tuned network, which is a 21-way softmax regression classifier, as the object detector. We tried this and found that performance on VOC 2007 dropped from 54.2% to 50.9% mAP. This performance drop likely arises from a combination of several factors including that the definition of positive examples used in fine-tuning does not emphasize precise localization and the softmax classifier was trained on randomly sampled negative examples rather than on the subset of “hard negatives” used for SVM training.

This result shows that it’s possible to obtain close to the same level of performance without training SVMs after fine-tuning. We conjecture that with some additional tweaks to fine-tuning the remaining performance gap may be closed. If true, this would simplify and speed up R-CNN training with no loss in detection performance.

C. Bounding-box regression
We use a simple bounding-box regression stage to improve localization performance. After scoring each selective search proposal with a class-specific detection SVM, we predict a new bounding box for the detection using a class-specific bounding-box regressor. This is similar in spirit to the bounding-box regression used in deformable part models [17]. The primary difference between the two approaches is that here we regress from features computed by the CNN, rather than from geometric features computed on the inferred DPM part locations.
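
Appendix C of the original paper parameterizes the regression targets relative to the proposal box (center offsets normalized by the proposal size, log-scale width and height changes). The sketch below shows that parameterization and its inverse; the ridge-regression fit on pool5 features is omitted, and all names here are illustrative.

```python
import numpy as np

def bbox_regression_targets(proposal, gt):
    """Regression targets for one (proposal, ground-truth) pair of (x1, y1, x2, y2) boxes."""
    px, py = (proposal[0] + proposal[2]) / 2.0, (proposal[1] + proposal[3]) / 2.0
    pw, ph = proposal[2] - proposal[0], proposal[3] - proposal[1]
    gx, gy = (gt[0] + gt[2]) / 2.0, (gt[1] + gt[3]) / 2.0
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    # center offsets normalized by proposal size, log-scale width/height changes
    return np.array([(gx - px) / pw, (gy - py) / ph, np.log(gw / pw), np.log(gh / ph)])

def apply_bbox_regression(proposal, t):
    """Invert the parameterization: predict a refined box from a proposal and targets t."""
    px, py = (proposal[0] + proposal[2]) / 2.0, (proposal[1] + proposal[3]) / 2.0
    pw, ph = proposal[2] - proposal[0], proposal[3] - proposal[1]
    gx, gy = px + t[0] * pw, py + t[1] * ph
    gw, gh = pw * np.exp(t[2]), ph * np.exp(t[3])
    return np.array([gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2])
```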


Vocabulary:
canonical: classic, standard, authoritative
plateaued: leveled off, stopped improving
propose: to put forward, suggest
algorithm: a step-by-step computational procedure
insights: key observations, deep understanding
high-capacity: large-capacity
bottom-up: built up from low-level elements
scarce: lacking, insufficient
supervised: trained with labels
auxiliary: supplementary, supporting
histograms: bar-style frequency charts
extract: to pull out, derive
rekindle: to reignite, revive
vigorously: intensely, energetically
pedestrians: people on foot
receptive fields: the input region a unit responds to
semantic segmentation: pixel-level class labeling
outperforms: performs better than
insufficient: not enough
contemporaneous: from the same period
property: attribute; (also) possession
non-maximum suppression: discarding overlapping, lower-scoring detections
demonstrate: to show, prove
region: area, extent
category-independent: not tied to a particular class
region proposals: candidate regions
forward propagating: running a forward pass
properties: attributes, characteristics
scale: size; to scale up
contrast: comparison
scalable: able to grow to larger sizes
annotations: labels, notes
single out: to pick out
hyperparameters: settings chosen outside of training
critical: crucially important
homogeneous: uniform
overview: summary
evaluation: assessment
ablation: removing components to measure their contribution
thresholds: cutoff values
multi-scale: at multiple scales
pyramid: a multi-resolution stack
stagnated: stopped making progress
abundant: plentiful
conjecture: to hypothesize, guess
arbitrary: of any size or kind
valid: effective, legitimate
warrant: to justify, merit


Reposted from blog.csdn.net/m0_50127633/article/details/119904396