Object Detection: R-CNN

Paper: Rich feature hierarchies for accurate object detection and semantic segmentation

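I am formally starting to study object detection algorithms. Beginning with R-CNN, I plan to work through both families of detection frameworks, one-stage and two-stage.
The goals are:

Apply state-of-the-art techniques from object detection to face detection, and try to propose a face detection framework that is both more accurate and faster than MTCNN;
Apply the concept of semantic information to face recognition.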

Paper walkthrough

1 Abstract

Original	Paraphrase
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context.	Progress on PASCAL VOC detection has stalled over the last few years. The strongest systems are complex ensembles that mix several low-level image features with high-level context.
In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012—achieving a mAP of 53.3%.	The paper proposes a simple, scalable detection algorithm that raises mAP by more than 30% relative to the previous best result on VOC 2012, reaching 53.3%.
Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features.	The approach rests on two ideas: (1) apply high-capacity CNNs to bottom-up region proposals to localize and segment objects; (2) when labeled data is scarce, supervised pre-training on an auxiliary task followed by domain-specific fine-tuning gives a significant performance boost. Because region proposals are combined with CNN features, the method is named R-CNN.

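2 The R-CNN model

The R-CNN model has three parts:

Generate category-independent region proposals: step one produces candidate boxes that do not depend on the object category.
Extract features: step two extracts a feature vector from every candidate box.
Linear SVMs: step three classifies the features and outputs a result for every candidate box.

Region proposals: finding the candidate regions of an image that may contain an object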

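The paper uses selective search, following "Selective Search for Object Recognition" (IJCV 2013), instead of a sliding window. Selective search yields about 2000 candidate boxes per image, which greatly reduces the feature-extraction work in step two compared with a sliding window, with little loss of object coverage.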
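To get a rough feel for the saving, the sketch below counts how many windows an exhaustive sliding-window scan would visit. The image size, window scales, aspect ratios, and stride are made-up values for illustration only; none of them come from the paper.

```python
def count_sliding_windows(img_w, img_h, win_sizes, stride):
    """Count the windows an exhaustive sliding-window scan would visit."""
    total = 0
    for win_w, win_h in win_sizes:
        if win_w > img_w or win_h > img_h:
            continue  # this window shape does not fit in the image
        nx = (img_w - win_w) // stride + 1
        ny = (img_h - win_h) // stride + 1
        total += nx * ny
    return total


# Hypothetical setup: a 500x375 image, 3 scales x 3 aspect ratios, stride 8.
sizes = [(int(s * r), int(s / r)) for s in (64, 128, 256) for r in (0.5, 1.0, 2.0)]
n_windows = count_sliding_windows(500, 375, sizes, stride=8)
n_proposals = 2000  # approximate number of selective search proposals per image

print(f"sliding windows: {n_windows}, proposals: {n_proposals}")
```

Even with this coarse stride and only nine window shapes, the scan visits several times more boxes than selective search proposes, and every box costs one CNN forward pass in R-CNN.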

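Feature extraction: extracting features from each region

R-CNN uses AlexNet to extract a 4096-dimensional feature vector from each region proposal; in step three these features are classified by a set of SVMs.
Note that the proposals produced by selective search in step one vary in size and aspect ratio, whereas AlexNet takes a fixed-size input, so each proposal has to be transformed first. The authors found experimentally that warping with context padding (p = 16 pixels), that is, scaling the box directly after enlarging it with surrounding image context, worked best.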
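A minimal sketch of warping with context padding, in pure Python. It assumes the 227x227 input size that R-CNN's AlexNet uses, boxes given as (x1, y1, x2, y2), and an image stored as a nested list of pixel values. The function names are my own, and nearest-neighbour sampling stands in for whatever interpolation the original code used. The box is enlarged before warping so that exactly p pixels of context surround it at the warped size.

```python
import math


def pad_box_for_warp(box, out_size=227, pad=16):
    """Enlarge a box so that, after warping to out_size x out_size,
    the original box is surrounded by `pad` pixels of context."""
    x1, y1, x2, y2 = box
    inner = out_size - 2 * pad          # pixels the original box occupies after warping
    dx = pad * (x2 - x1) / inner        # context width in source pixels
    dy = pad * (y2 - y1) / inner        # context height in source pixels
    return (x1 - dx, y1 - dy, x2 + dx, y2 + dy)


def warp_region(image, box, out_size=227, pad=16, fill=0):
    """Anisotropically scale the padded box to out_size x out_size.
    Pixels that fall outside the image are set to `fill`
    (the paper fills them with the image mean)."""
    x1, y1, x2, y2 = pad_box_for_warp(box, out_size, pad)
    img_h, img_w = len(image), len(image[0])
    sx = (x2 - x1) / out_size
    sy = (y2 - y1) / out_size
    out = []
    for r in range(out_size):
        src_y = math.floor(y1 + (r + 0.5) * sy)   # nearest source row
        row = []
        for c in range(out_size):
            src_x = math.floor(x1 + (c + 0.5) * sx)
            inside = 0 <= src_y < img_h and 0 <= src_x < img_w
            row.append(image[src_y][src_x] if inside else fill)
        out.append(row)
    return out
```

Because the horizontal and vertical scale factors are computed independently, the aspect ratio of the proposal is not preserved, which is what "warping" means here.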

SVM

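There is not much to say about the SVMs: one linear SVM is trained for each class to be detected. Each image has about 2000 proposals and each proposal yields a 4096-dimensional feature, so every SVM takes a 4096-dimensional input. For one image the feature matrix is 2000x4096 and the classifier weight matrix is 4096xN, where N is the number of classes, so a single matrix product scores every proposal against every class.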
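The scoring step is just a matrix product plus a bias per class. The sketch below spells it out in pure Python with tiny stand-in sizes (5 proposals, 8 feature dimensions, 3 classes) in place of 2000, 4096, and N; the random features and weights are placeholders, not trained values.

```python
import random


def score_proposals(features, weights, biases):
    """features: P x D, weights: D x N, biases: length N.
    Returns a P x N table of per-class SVM scores."""
    n_classes = len(biases)
    scores = []
    for feat in features:
        row = []
        for k in range(n_classes):
            s = biases[k]
            for d, value in enumerate(feat):
                s += value * weights[d][k]
            row.append(s)
        scores.append(row)
    return scores


rng = random.Random(0)
P, D, N = 5, 8, 3  # stand-ins for 2000 proposals, 4096 dims, N classes
features = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(P)]
weights = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(D)]
biases = [0.0] * N

scores = score_proposals(features, weights, biases)
# One row per proposal, one column per class.
print(len(scores), len(scores[0]))
```

In practice this would be one call to a matrix-multiply routine; the explicit loops are only there to make the shapes visible.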

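3 Model training

Supervised pre-training

The CNN is pre-trained on the ImageNet dataset. Because bounding-box labels are not available for that data, only classification training with image-level labels is possible. The resulting model nearly matches the original AlexNet, with a top-1 error rate 2.2 percentage points higher on the ILSVRC2012 classification validation set.

Domain-specific fine-tuning

This part is worth reading closely alongside the original text.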

Original	Paraphrase
To adapt our CNN to the new task (detection) and the new domain (warped proposal windows), we continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals.	To adapt the CNN to the new task (detection rather than ImageNet classification) and the new domain (warped proposal windows), SGD training continues using only warped region proposals.
Aside from replacing the CNN’s ImageNet-specific 1000-way classification layer with a randomly initialized (N + 1)-way classification layer (where N is the number of object classes, plus 1 for background), the CNN architecture is unchanged.	The 1000-way ImageNet classification layer is replaced by a randomly initialized (N + 1)-way layer, where N is the number of object classes and the extra 1 is background; the rest of the architecture is unchanged.
For VOC, N = 20 and for ILSVRC2013, N = 200.	N = 20 for VOC and N = 200 for ILSVRC2013.
We treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives.	Proposals whose IoU with a ground-truth box is ≥ 0.5 are positives for that box's class; all others are negatives (background).
We start SGD at a learning rate of 0.001 (1/10th of the initial pre-training rate), which allows fine-tuning to make progress while not clobbering the initialization.	Fine-tuning starts at a learning rate of 0.001, one tenth of the initial pre-training rate, so training makes progress without destroying the pre-trained initialization.
In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128. We bias the sampling towards positive windows because they are extremely rare compared to background.	Each mini-batch of 128 contains 32 positive windows (across all classes) and 96 background windows. Sampling is biased toward positives because they are extremely rare compared with background.
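The labeling rule and the mini-batch composition quoted above can be sketched as follows. Boxes are assumed to be (x1, y1, x2, y2) tuples and class ids run from 1 to N with 0 reserved for background. The function names are my own. Sampling with replacement is also my assumption, chosen so the batch is always full when an image has fewer than 32 positives; the paper only says the sampling is uniform.

```python
import random


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def finetune_label(proposal, gt_boxes, gt_classes, thresh=0.5):
    """Class id of the best-overlapping ground-truth box if IoU >= thresh,
    otherwise 0 (background)."""
    best_iou, best_cls = 0.0, 0
    for box, cls in zip(gt_boxes, gt_classes):
        overlap = iou(proposal, box)
        if overlap > best_iou:
            best_iou, best_cls = overlap, cls
    return best_cls if best_iou >= thresh else 0


def sample_minibatch(labels, n_pos=32, n_bg=96, rng=random):
    """Return 128 proposal indices: 32 positives followed by 96 background.
    Assumes at least one positive and one background proposal exist."""
    pos = [i for i, label in enumerate(labels) if label > 0]
    bg = [i for i, label in enumerate(labels) if label == 0]
    return ([rng.choice(pos) for _ in range(n_pos)]
            + [rng.choice(bg) for _ in range(n_bg)])
```

With a 1:3 ratio, positives make up a quarter of every batch even though they are a tiny fraction of the roughly 2000 proposals per image.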

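Object category classifiers

Take training a binary classifier to detect cars as an example. A region that tightly encloses a car is clearly a positive, and a background region containing no car is clearly a negative.
But how should a region that only partially overlaps a car be labeled?
The authors resolve this with an IoU threshold t: regions whose overlap is below t are negatives. The positives are simply the ground-truth boxes of each class, and proposals above the threshold that are not ground truth are ignored. The threshold 0.3 was chosen by a grid search over {0, 0.1, ..., 0.5} on a validation set.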
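As I read the paper, the SVM labels differ from the fine-tuning labels: positives are only the ground-truth boxes, proposals with IoU below 0.3 against every ground-truth box of the class are negatives, and everything in between is left out of training. A small sketch of that rule, with boxes as (x1, y1, x2, y2) and function names of my own choosing:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def svm_label(box, gt_boxes_of_class, is_ground_truth=False, neg_thresh=0.3):
    """Label one box for the SVM of a single class.
    Returns +1 (positive), -1 (negative) or None (ignored)."""
    if is_ground_truth:
        return 1
    max_overlap = max((iou(box, gt) for gt in gt_boxes_of_class), default=0.0)
    return -1 if max_overlap < neg_thresh else None


car_gt = [(0, 0, 10, 10)]
print(svm_label((0, 0, 10, 10), car_gt, is_ground_truth=True))  # ground-truth box
print(svm_label((100, 100, 110, 110), car_gt))                  # far from any car
print(svm_label((0, 0, 10, 8), car_gt))                         # large partial overlap
```

Images without any box of the class contribute only negatives, since the maximum overlap defaults to zero.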
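Once the CNN has extracted features for the region proposals, one linear SVM is trained per class. Because the training data is too large to fit in memory, the authors use the standard hard negative mining method. Hard negative mining converges quickly, and in practice mAP stops increasing after only a single pass over all images.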
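The idea of hard negative mining is to train on a small set of negatives, score the rest, and add only the negatives the current model gets wrong or places inside the margin. The toy sketch below illustrates that loop on 2-D points. The plain SGD hinge-loss trainer, the hyperparameters, and the synthetic data are all my own stand-ins; the paper's actual solver and caching scheme are not reproduced here.

```python
import random


def svm_score(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(samples, labels, epochs=50, lr=0.1, reg=0.01, seed=0):
    """Tiny SGD trainer for an L2-regularized hinge loss (illustrative only)."""
    rng = random.Random(seed)
    w, b = [0.0] * len(samples[0]), 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            violated = y * svm_score(w, b, x) < 1
            for j in range(len(w)):
                grad = reg * w[j] - (y * x[j] if violated else 0.0)
                w[j] -= lr * grad
            if violated:
                b += lr * y
    return w, b


def mine_hard_negatives(positives, negative_pool, n_initial=10, max_rounds=5, seed=0):
    """Alternate between training and adding negatives that score above -1."""
    pool = list(negative_pool)
    random.Random(seed).shuffle(pool)
    active, remaining = pool[:n_initial], pool[n_initial:]

    def fit():
        data = positives + active
        target = [1] * len(positives) + [-1] * len(active)
        return train_linear_svm(data, target)

    w, b = fit()
    for _ in range(max_rounds):
        hard = [x for x in remaining if svm_score(w, b, x) > -1]
        if not hard:
            break  # no remaining negative violates the margin
        remaining = [x for x in remaining if svm_score(w, b, x) <= -1]
        active = active + hard
        w, b = fit()
    return w, b, active


rng = random.Random(1)
pos = [[2 + rng.uniform(-0.5, 0.5), 2 + rng.uniform(-0.5, 0.5)] for _ in range(20)]
neg = [[-2 + rng.uniform(-0.5, 0.5), -2 + rng.uniform(-0.5, 0.5)] for _ in range(200)]
w, b, used = mine_hard_negatives(pos, neg)
print(f"negatives used for training: {len(used)} of {len(neg)}")
```

The benefit is that the solver only ever holds the positives plus the currently hard negatives in memory, rather than every background proposal from every image.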

Object Detection Series (1): R-CNN