Fast R-CNN Reading Notes


Background

Recently, deep ConvNets have significantly improved image classification and object detection accuracy.
Deep ConvNets have greatly improved image-classification accuracy, but object detection still faces two major challenges:
- First, numerous candidate object locations (often called “proposals”) must be processed.
A large number of candidate boxes (proposals) must be generated and processed.
- Second, these candidates provide only rough localization that must be refined to achieve precise localization.
These proposals give only rough localization and must be refined to achieve precise localization.

Contributions of this paper

We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.
A single-stage training method is proposed that jointly learns to classify object proposals and refine their locations.

Comparison with other models:

Speed:
  training: 9x faster than R-CNN, 3x faster than SPPnet
  test: 213x faster than R-CNN, 10x faster than SPPnet
Accuracy (mAP):
  R-CNN: 62%
  Fast R-CNN: 66%

Drawbacks of the R-CNN model
  1. Training is a multi-stage pipeline. R-CNN first finetunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.
    A multi-stage training pipeline: the ConvNet is first fine-tuned on object proposals with log loss, then the feature vectors extracted for the proposals are fed to SVMs for classification, and finally bounding-box regressors are trained.
  2. Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk.
    Training is expensive in space and time: for SVM and bounding-box regressor training, the feature vectors extracted from every proposal must be written to disk.
  3. Object detection is slow. At test-time, features are extracted from each object proposal in each test image.
    Detection is slow: every proposal is passed through the network separately to extract features (redundant computation).
The SPPnet model
  • Advantage
    Features are extracted for a proposal by maxpooling the portion of the feature map inside the proposal into a fixed-size output.
    Features for each proposal are obtained by max-pooling the portion of the shared feature map inside the proposal into a fixed-size output.
  • Drawback
    Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors.
    Training is still a multi-stage pipeline.
Fast R-CNN model architecture

first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map.
Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.
Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers:
one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes.
The whole image is first passed through several convolutional layers to produce a feature map. For each proposal, an RoI pooling layer then extracts a fixed-size feature vector from this map; the feature vectors are fed through fully connected layers into two sibling branches, one for classification and one for bounding-box regression.
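
To make the two-branch structure concrete, here is a minimal PyTorch-style sketch of the detection head on top of the shared feature map. The sizes (7x7 RoI output, 4096-d FC layers, 1/16 feature stride) follow the VGG16 setup described in the paper; the torchvision `roi_pool` call and the default class count are convenient assumptions, not the original Caffe implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool


class FastRCNNHead(nn.Module):
    """Two sibling outputs on top of RoI-pooled features (VGG16-style sizes)."""

    def __init__(self, in_channels=512, roi_size=7, num_classes=20):
        super().__init__()
        self.roi_size = roi_size
        self.fc = nn.Sequential(
            nn.Linear(in_channels * roi_size * roi_size, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        )
        # Sibling branches: (K+1)-way class scores and 4 box offsets per class.
        self.cls_score = nn.Linear(4096, num_classes + 1)
        self.bbox_pred = nn.Linear(4096, 4 * num_classes)

    def forward(self, feature_map, rois, spatial_scale=1.0 / 16):
        # rois: (R, 5) tensor of [batch_index, x1, y1, x2, y2] in image coordinates.
        pooled = roi_pool(feature_map, rois, output_size=self.roi_size,
                          spatial_scale=spatial_scale)   # (R, C, 7, 7)
        x = self.fc(pooled.flatten(start_dim=1))         # (R, 4096)
        return self.cls_score(x), self.bbox_pred(x)
```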

ROI Pooling

RoI max pooling works by dividing the h * w RoI window into an H * W grid of sub-windows of approximate size h/H * w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel
RoI pooling divides the (h, w) feature-map window corresponding to a proposal into an (H, W) grid of sub-windows, each of approximate size (h/H, w/W), and max-pools each sub-window into the corresponding output cell. Pooling is applied to each feature-map channel independently.
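
The grid pooling itself is easy to sketch. The NumPy snippet below is a simplified single-channel illustration; it uses floor/ceil rounding for the sub-window boundaries, which only approximates the rounding rules of the original implementation.

```python
import numpy as np


def roi_max_pool_channel(roi_feat, H, W):
    """Max-pool an (h, w) RoI window into an (H, W) output grid."""
    h, w = roi_feat.shape
    out = np.zeros((H, W), dtype=roi_feat.dtype)
    for i in range(H):
        y0, y1 = int(np.floor(i * h / H)), int(np.ceil((i + 1) * h / H))
        for j in range(W):
            x0, x1 = int(np.floor(j * w / W)), int(np.ceil((j + 1) * w / W))
            # Each sub-window is roughly (h/H, w/W); take its maximum.
            out[i, j] = roi_feat[y0:y1, x0:x1].max()
    return out


# Example: pool a 14x21 RoI window into a fixed 7x7 output.
pooled = roi_max_pool_channel(np.random.rand(14, 21), H=7, W=7)
```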

Sampling strategy

minibatches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image.
A hierarchical sampling strategy is used: first N images are sampled, then R/N RoIs are sampled from each image.
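
As a rough sketch, hierarchical sampling with the settings used in the paper (N = 2, R = 128, about 25% foreground RoIs with IoU >= 0.5) might look like the following; the `fg_rois`/`bg_rois` fields on each image record are hypothetical placeholders for precomputed proposals.

```python
import random


def sample_minibatch(dataset, N=2, R=128, fg_fraction=0.25):
    """Hierarchically sample a minibatch: N images, then R/N RoIs per image.

    Each image dict is assumed (hypothetically) to carry precomputed
    'fg_rois' (IoU >= 0.5 with a ground-truth box) and 'bg_rois' lists.
    """
    rois_per_image = R // N
    fg_per_image = int(rois_per_image * fg_fraction)
    batch = []
    for image in random.sample(dataset, N):
        fg = random.sample(image['fg_rois'], min(fg_per_image, len(image['fg_rois'])))
        bg_needed = rois_per_image - len(fg)
        bg = random.sample(image['bg_rois'], min(bg_needed, len(image['bg_rois'])))
        batch.append((image, fg + bg))
    return batch
```

Because all RoIs from the same image share one forward pass through the conv layers, this sampling makes training much cheaper than drawing the R RoIs from R different images.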

Reposted from blog.csdn.net/archervin/article/details/80613799