Paper Reading: Fast R-CNN

1. Paper Overview

This paper improves on R-CNN. Its most important change, borrowed from SPPnet, is RoI pooling: the whole image is run through the feature-extraction layers once, instead of feeding each RoI through them separately. The other key point is the multi-task loss, which merges the classification loss and the box-regression loss into a single objective, making training end-to-end except for the selective-search proposal step; experiments show this works better. Classification uses a softmax directly, dropping the SVMs, and accuracy holds up, which seems somewhat at odds with the earnest analysis in the R-CNN paper.
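
To make the RoI pooling idea concrete, here is a minimal NumPy sketch (the function name, the grid rounding, and the omission of the spatial-scale factor between image and feature-map coordinates are my own simplifications, not the paper's code): each RoI on the shared conv feature map is divided into a fixed grid of sub-windows, and each sub-window is max-pooled, so any RoI yields the same fixed-size output.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one RoI on a conv feature map into a fixed-size grid.

    feature_map: (C, H, W) array; roi: (x1, y1, x2, y2) in feature-map coords.
    """
    x1, y1, x2, y2 = roi
    oh, ow = output_size
    out = np.zeros((feature_map.shape[0], oh, ow), dtype=feature_map.dtype)
    # Boundaries of an oh x ow grid of sub-windows covering the RoI.
    ys = np.linspace(y1, y2 + 1, oh + 1).astype(int)
    xs = np.linspace(x1, x2 + 1, ow + 1).astype(int)
    for i in range(oh):
        for j in range(ow):
            y_hi = max(ys[i + 1], ys[i] + 1)  # keep every sub-window non-empty
            x_hi = max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = feature_map[:, ys[i]:y_hi, xs[j]:x_hi].max(axis=(1, 2))
    return out

fm = np.random.randn(256, 40, 60)       # e.g. conv5 features of one image
pooled = roi_pool(fm, (5, 8, 34, 29))   # any RoI -> fixed 256 x 7 x 7 output
print(pooled.shape)                     # (256, 7, 7)
```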

In this paper, we streamline the training process for state-of-the-art ConvNet-based object detectors [9, 11]. We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.

The resulting method can train a very deep detection network (VGG16 [20]) 9× faster than R-CNN [9] and 3× faster than SPPnet [11]. At runtime, the detection network processes images in 0.3s (excluding object proposal time) while achieving top accuracy on PASCAL VOC 2012 [7] with a mAP of 66% (vs. 62% for R-CNN).

The Fast R-CNN method has several advantages:

1. Higher detection quality (mAP) than R-CNN and SPPnet
2. Training is single-stage, using a multi-task loss
3. Training can update all network layers

2. Drawbacks of R-CNN and SPPnet

Drawbacks of R-CNN:

R-CNN, however, has notable drawbacks:

1. Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.
2. Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
3. Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).

Drawbacks of SPPnet:

SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks (the parameters of the conv layers before the SPP layer cannot be updated, which caps performance).

3. Why SPPnet Cannot Update the Parameters Before the SPP Layer

The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).

A key point: when SPPnet and R-CNN are trained, each image contributes only one RoI to a mini-batch, with every RoI drawn from a different image; for example, 64 RoIs come from 64 different images.

Here comes the crucial part: this paper proposes a new RoI sampling strategy for training:

We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).

One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue, and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.
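
A minimal Python sketch of this hierarchical sampling (the `rois_by_image` data structure and the function name are hypothetical; the real recipe also balances foreground and background RoIs, which is omitted here):

```python
import random

def sample_minibatch(rois_by_image, N=2, R=128):
    """Sample N images, then R/N RoIs from each (assumes >= R/N RoIs per image).

    All RoIs drawn from the same image share one forward/backward pass through
    the conv layers, which is where the roughly 64x speedup over sampling each
    RoI from a different image comes from.
    """
    image_ids = random.sample(list(rois_by_image), N)
    return {img: random.sample(rois_by_image[img], R // N) for img in image_ids}

# Example: 2 images x 64 RoIs = a 128-RoI mini-batch from only 2 conv passes.
rois_by_image = {f"img_{i}": list(range(2000)) for i in range(10)}
batch = sample_minibatch(rois_by_image, N=2, R=128)
print({k: len(v) for k, v in batch.items()})
```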

4. Multi-task Loss

The encoding of $t_i$ and $v_i$ here follows R-CNN: the regression targets are bounding-box offsets (a parameterized transform), not the raw coordinates themselves.
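
Concretely, the multi-task loss defined in the paper combines a log-loss classification term with a smooth L1 box-regression term that is switched off for background RoIs (the paper uses $\lambda = 1$):

$$
L(p, u, t^u, v) = L_{cls}(p, u) + \lambda\,[u \ge 1]\,L_{loc}(t^u, v),
\qquad L_{cls}(p, u) = -\log p_u
$$

$$
L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i^u - v_i),
\qquad
\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^2 & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
$$

The smooth L1 loss is less sensitive to outliers than the L2 loss used for regression in R-CNN and SPPnet.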

5. Truncated SVD for Faster Detection

Singular value decomposition is used to speed up the computation of the fully connected layers; the figure below (from the paper) compares the time spent in each stage after applying SVD.

[Figure: per-stage timing of detection with and without truncated SVD]
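
A minimal NumPy sketch of the trick (the dimensions and variable names are hypothetical): a fully connected layer with weight matrix $W$ of shape $u \times v$ is replaced by two smaller layers built from the top $t$ singular values, cutting the multiply-adds from $uv$ to $t(u + v)$:

```python
import numpy as np

u, v, t = 1024, 512, 64            # t = number of singular values kept
W = np.random.randn(u, v)          # original FC weights: y = W @ x + b
b = np.random.randn(u)
x = np.random.randn(v)

# Truncated SVD: W ~ U_t diag(s_t) V_t^T.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W1 = np.diag(s[:t]) @ Vt[:t]       # (t, v): first new FC layer, no bias
W2 = U[:, :t]                      # (u, t): second new FC layer, keeps bias b

y_full = W @ x + b                 # u*v multiply-adds
y_svd = W2 @ (W1 @ x) + b          # t*(u+v) multiply-adds
# Approximation error: small only when W's spectrum is dominated by its top t
# singular values (true for trained FC layers, not for this random demo W).
print(np.max(np.abs(y_full - y_svd)))
```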

6. Which Layers to Fine-tune? (from which layer should fine-tuning start for detection)

[Table: effect of restricting which layers of VGG16 are fine-tuned]

As the table above shows, for the VGG network it works best to fine-tune from conv3_1 upward; there is no need to update the weights of the earlier layers.

The smaller the network, the earlier the layer from which fine-tuning should start; heavier networks can start from later layers. A sketch of what this looks like in practice follows below.
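
As an illustration of what "fine-tune from conv3_1 upward" means in practice, here is a hedged PyTorch sketch (it assumes torchvision's VGG16 layer ordering, in which features[0:10] are the conv1/conv2 blocks and features[10] is conv3_1):

```python
import torchvision

# Load an ImageNet-pretrained VGG16 and freeze everything before conv3_1.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
for layer in vgg.features[:10]:        # conv1_* and conv2_* blocks
    for p in layer.parameters():
        p.requires_grad = False        # frozen: excluded from gradient updates

# Only the still-trainable parameters would be handed to the optimizer.
trainable = [p for p in vgg.parameters() if p.requires_grad]
```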

7. Does Multi-task Training Help?

[Table: ablation of multi-task training versus stage-wise training]

In the paper's ablation, multi-task training consistently improves detection mAP over training the classification and regression stages separately, so the answer is yes.

References

1. R. Girshick. Fast R-CNN. In ICCV, 2015.

2. Fast R-CNN学习总结 ("Fast R-CNN Study Summary", blog post)
