Paper Reading: Fast R-CNN

1. Paper Overview

This paper improves on R-CNN. Its most important change, borrowed from SPPnet, is RoI pooling: the whole image is run through the feature-extraction layers once, instead of feeding each RoI through them separately. The other key point is the multi-task loss, which merges the classification loss and the box-regression loss into a single objective, making training end-to-end except for the selective-search proposal step; experiments show this works better. Classification uses a softmax directly, dropping the SVMs, and accuracy holds up, which seems somewhat at odds with the earnest analysis in the R-CNN paper.
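
To make the RoI pooling idea concrete, here is a minimal NumPy sketch (the function name, the grid rounding, and the omission of the spatial-scale factor between image and feature-map coordinates are my own simplifications, not the paper's code): each RoI on the shared conv feature map is divided into a fixed grid of sub-windows, and each sub-window is max-pooled, so any RoI yields the same fixed-size output.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one RoI on a conv feature map into a fixed-size grid.

    feature_map: (C, H, W) array; roi: (x1, y1, x2, y2) in feature-map coords.
    """
    x1, y1, x2, y2 = roi
    oh, ow = output_size
    out = np.zeros((feature_map.shape[0], oh, ow), dtype=feature_map.dtype)
    # Boundaries of an oh x ow grid of sub-windows covering the RoI.
    ys = np.linspace(y1, y2 + 1, oh + 1).astype(int)
    xs = np.linspace(x1, x2 + 1, ow + 1).astype(int)
    for i in range(oh):
        for j in range(ow):
            y_hi = max(ys[i + 1], ys[i] + 1)  # keep every sub-window non-empty
            x_hi = max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = feature_map[:, ys[i]:y_hi, xs[j]:x_hi].max(axis=(1, 2))
    return out

fm = np.random.randn(256, 40, 60)       # e.g. conv5 features of one image
pooled = roi_pool(fm, (5, 8, 34, 29))   # any RoI -> fixed 256 x 7 x 7 output
print(pooled.shape)                     # (256, 7, 7)
```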

In this paper, we streamline the training process for state-of-the-art ConvNet-based object detectors [9, 11]. We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.

The resulting method can train a very deep detection network (VGG16 [20]) 9× faster than R-CNN [9] and 3× faster than SPPnet [11]. At runtime, the detection network processes images in 0.3s (excluding object proposal time) while achieving top accuracy on PASCAL VOC 2012 [7] with a mAP of 66% (vs. 62% for R-CNN).

The Fast R-CNN method has several advantages:

1. Higher detection quality (mAP) than R-CNN and SPPnet
2. Training is single-stage, using a multi-task loss
3. Training can update all network layers

2. Drawbacks of R-CNN and SPPnet

Drawbacks of R-CNN:

R-CNN, however, has notable drawbacks:

1. Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.
2. Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
3. Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).

Drawbacks of SPPnet:

SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks (the parameters of the conv layers before the SPP layer cannot be updated, which caps performance).

3. Why SPPnet Cannot Update the Parameters Before the SPP Layer

The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).

A key point: when SPPnet and R-CNN are trained, each image contributes only one RoI to a mini-batch, with every RoI drawn from a different image; for example, 64 RoIs come from 64 different images.

Here comes the crucial part: this paper proposes a new RoI sampling strategy for training:

We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).

One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue, and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.
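
A minimal Python sketch of this hierarchical sampling (the `rois_by_image` data structure and the function name are hypothetical; the real recipe also balances foreground and background RoIs, which is omitted here):

```python
import random

def sample_minibatch(rois_by_image, N=2, R=128):
    """Sample N images, then R/N RoIs from each (assumes >= R/N RoIs per image).

    All RoIs drawn from the same image share one forward/backward pass through
    the conv layers, which is where the roughly 64x speedup over sampling each
    RoI from a different image comes from.
    """
    image_ids = random.sample(list(rois_by_image), N)
    return {img: random.sample(rois_by_image[img], R // N) for img in image_ids}

# Example: 2 images x 64 RoIs = a 128-RoI mini-batch from only 2 conv passes.
rois_by_image = {f"img_{i}": list(range(2000)) for i in range(10)}
batch = sample_minibatch(rois_by_image, N=2, R=128)
print({k: len(v) for k, v in batch.items()})
```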

4. Multi-task Loss

The encoding of $t_i$ and $v_i$ here follows R-CNN: the regression targets are bounding-box offsets (a parameterized transform), not the raw coordinates themselves.
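
Concretely, the multi-task loss defined in the paper combines a log-loss classification term with a smooth L1 box-regression term that is switched off for background RoIs (the paper uses $\lambda = 1$):

$$
L(p, u, t^u, v) = L_{cls}(p, u) + \lambda\,[u \ge 1]\,L_{loc}(t^u, v),
\qquad L_{cls}(p, u) = -\log p_u
$$

$$
L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i^u - v_i),
\qquad
\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^2 & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
$$

The smooth L1 loss is less sensitive to outliers than the L2 loss used for regression in R-CNN and SPPnet.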

5. Truncated SVD for Faster Detection

Singular value decomposition is used to speed up the computation of the fully connected layers; the figure below (from the paper) compares the time spent in each stage after applying SVD.

[Figure: per-stage timing of detection with and without truncated SVD]
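
A minimal NumPy sketch of the trick (the dimensions and variable names are hypothetical): a fully connected layer with weight matrix $W$ of shape $u \times v$ is replaced by two smaller layers built from the top $t$ singular values, cutting the multiply-adds from $uv$ to $t(u + v)$:

```python
import numpy as np

u, v, t = 1024, 512, 64            # t = number of singular values kept
W = np.random.randn(u, v)          # original FC weights: y = W @ x + b
b = np.random.randn(u)
x = np.random.randn(v)

# Truncated SVD: W ~ U_t diag(s_t) V_t^T.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W1 = np.diag(s[:t]) @ Vt[:t]       # (t, v): first new FC layer, no bias
W2 = U[:, :t]                      # (u, t): second new FC layer, keeps bias b

y_full = W @ x + b                 # u*v multiply-adds
y_svd = W2 @ (W1 @ x) + b          # t*(u+v) multiply-adds
# Approximation error: small only when W's spectrum is dominated by its top t
# singular values (true for trained FC layers, not for this random demo W).
print(np.max(np.abs(y_full - y_svd)))
```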

6. Which Layers to Fine-tune? (from which layer should fine-tuning start for detection)

[Table: effect of restricting which layers of VGG16 are fine-tuned]

As the table above shows, for the VGG network it works best to fine-tune from conv3_1 upward; there is no need to update the weights of the earlier layers.

The smaller the network, the earlier the layer from which fine-tuning should start; heavier networks can start from later layers. A sketch of what this looks like in practice follows below.
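
As an illustration of what "fine-tune from conv3_1 upward" means in practice, here is a hedged PyTorch sketch (it assumes torchvision's VGG16 layer ordering, in which features[0:10] are the conv1/conv2 blocks and features[10] is conv3_1):

```python
import torchvision

# Load an ImageNet-pretrained VGG16 and freeze everything before conv3_1.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
for layer in vgg.features[:10]:        # conv1_* and conv2_* blocks
    for p in layer.parameters():
        p.requires_grad = False        # frozen: excluded from gradient updates

# Only the still-trainable parameters would be handed to the optimizer.
trainable = [p for p in vgg.parameters() if p.requires_grad]
```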

7. Does Multi-task Training Help?

[Table: ablation of multi-task training versus stage-wise training]

In the paper's ablation, multi-task training consistently improves detection mAP over training the classification and regression stages separately, so the answer is yes.

References

1. R. Girshick. Fast R-CNN. In ICCV, 2015.

2. Fast R-CNN学习总结 ("Fast R-CNN Study Summary", blog post)
