[Translation] R-CNN (Rich feature hierarchies for accurate object detection and semantic segmentation)

Rich feature hierarchies for accurate object detection and semantic segmentation

Paper: https://arxiv.org/abs/1311.2524

Corrections and comments are welcome~

Abstract

Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012—achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/˜rbg/rcnn.

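The detection pipeline summarized in the abstract—bottom-up region proposals, a CNN feature extractor per warped region, then per-class scoring—can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: `propose_regions`, `cnn_features`, and `svm_scores` are hypothetical stand-ins for selective search, the fine-tuned CNN (which in R-CNN extracts 4096-d fc7 features from 227×227 warped patches), and the per-class linear SVMs.

```python
import numpy as np

def propose_regions(image):
    # Stand-in for selective search: return a few bottom-up box proposals
    # (x1, y1, x2, y2). Real R-CNN uses ~2000 category-independent proposals.
    h, w = image.shape[:2]
    return [(0, 0, w // 2, h // 2), (w // 4, h // 4, w, h)]

def cnn_features(patch):
    # Stand-in for the CNN: a fixed-length feature vector per region.
    # Real R-CNN warps each proposal to 227x227 and extracts fc7 features.
    return np.array([patch.mean(), patch.std(), float(patch.size)])

def svm_scores(features, num_classes=3):
    # Stand-in for the per-class linear SVMs (random weights here).
    rng = np.random.default_rng(0)
    W = rng.normal(size=(num_classes, features.shape[0]))
    return W @ features

def rcnn_detect(image):
    # One CNN forward pass and one set of SVM scores per region proposal.
    detections = []
    for (x1, y1, x2, y2) in propose_regions(image):
        patch = image[y1:y2, x1:x2]
        scores = svm_scores(cnn_features(patch))
        detections.append(((x1, y1, x2, y2), int(scores.argmax())))
    return detections

image = np.zeros((128, 128))
dets = rcnn_detect(image)  # one (box, class) pair per proposal
```

In the real system a non-maximum suppression step follows the scoring, rejecting boxes that heavily overlap a higher-scoring box of the same class.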
1. Introduction

Features matter. The last decade of progress on various visual recognition tasks has been based considerably on the use of SIFT [29] and HOG [7]. But if we look at performance on the canonical visual recognition task, PASCAL VOC object detection [15], it is generally acknowledged that progress has been slow during 2010-2012, with small gains obtained by building ensemble systems and employing minor variants of successful methods.

SIFT and HOG are blockwise orientation histograms, a representation we could associate roughly with complex cells in V1, the first cortical area in the primate visual pathway. But we also know that recognition occurs several stages downstream, which suggests that there might be hierarchical, multi-stage processes for computing features that are even more informative for visual recognition.

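The "blockwise orientation histogram" idea behind HOG can be sketched in a few lines: tile the image into blocks and, in each block, histogram the gradient orientations weighted by gradient magnitude. This is a simplified illustration (no cell/block normalization or overlapping blocks as in full HOG):

```python
import numpy as np

def orientation_histogram(block, bins=9):
    # Per-pixel gradient magnitude and unsigned orientation in [0, pi),
    # accumulated into a magnitude-weighted histogram.
    gy, gx = np.gradient(block.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, np.pi), weights=mag)
    return hist

def blockwise_histograms(image, block_size=8, bins=9):
    # Tile the image into non-overlapping blocks; concatenate one
    # orientation histogram per block into a single feature vector.
    h, w = image.shape
    feats = []
    for y in range(0, h - block_size + 1, block_size):
        for x in range(0, w - block_size + 1, block_size):
            block = image[y:y + block_size, x:x + block_size]
            feats.append(orientation_histogram(block, bins))
    return np.concatenate(feats)

# A horizontal ramp: all gradient energy lands in the 0-radian bin.
img = np.tile(np.arange(16, dtype=float), (16, 1))
feat = blockwise_histograms(img)  # 4 blocks x 9 bins = 36-d vector
```

The resulting fixed-length vector is what a linear classifier (e.g., an SVM) would consume; the paper's point is that such hand-designed, single-stage features are less informative than learned multi-stage ones.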
Fukushima’s “neocognitron” [19], a biologically-inspired hierarchical and shift-invariant model for pattern recognition, was an early attempt at just such a process. The neocognitron, however, lacked a supervised training algorithm. Building on Rumelhart et al. [33], LeCun et al. [26] showed that stochastic gradient descent via backpropagation was effective for training convolutional neural networks (CNNs), a class of models that extend the neocognitron.

CNNs saw heavy use in the 1990s (e.g., [27]), but then fell out of fashion with the rise of support vector machines. In 2012, Krizhevsky et al. [25] rekindled interest in CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [9, 10]. Their success resulted from training a large CNN on 1.2 million labeled images, together with a few twists on LeCun’s CNN (e.g., max(x,0) rectifying non-linearities and “dropout” regularization).

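The two "twists" named above are simple to state in code. A small sketch of both, assuming the now-standard "inverted" dropout formulation (the original AlexNet instead scaled activations at test time; the two are equivalent in expectation):

```python
import numpy as np

def relu(x):
    # max(x, 0) rectifying non-linearity
    return np.maximum(x, 0.0)

def dropout(x, p=0.5, train=True, rng=None):
    # Inverted dropout: zero each unit with probability p during training
    # and rescale survivors by 1/(1-p) so the expected activation is
    # unchanged; at test time this is the identity.
    if not train:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
a = relu(x)                # [0, 0, 0, 1.5, 3.0]
d = dropout(a, p=0.5)      # surviving units doubled, the rest zeroed
```

ReLU avoids the saturation of tanh/sigmoid units and speeds up training; dropout discourages co-adaptation of features, acting as a regularizer for the very large fully-connected layers.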
The significance of the ImageNet result was vigorously debated during the ILSVRC 2012 workshop. The central issue can be distilled to the following: To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?

We answer this question by bridging the gap between image classification and object detection. This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features. To achieve this result, we focused on two problems: localizing objects with a deep network and training a high-capacity model with only a small quantity of annotated detection data.

Reposted from blog.csdn.net/jn10010537/article/details/82764426