Fast R-CNN: A Detailed Paper Reading

Copyright notice: this is the blogger's original article and may not be reproduced without permission. https://blog.csdn.net/Maybemust/article/details/83594386

I have only just started studying the R-CNN series of papers, so if my understanding is off anywhere, corrections are very welcome!

Fast R-CNN

Abstract

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

Fast R-CNN is a region-based convolutional network method for detection, built on top of well-studied network architectures. It improves both speed and accuracy.

1. Introduction

Recently, deep ConvNets [14, 16] have significantly improved image classification [14] and object detection [9, 19] accuracy. Compared to image classification, object detection is a more challenging task that requires more complex methods to solve. Due to this complexity, current approaches
(e.g., [9, 11, 19, 25]) train models in multi-stage pipelines that are slow and inelegant.

Object detection is harder than classification, and the multi-stage pipelines currently used are both slow and not very accurate.

Complexity arises because detection requires the accurate localization of objects, creating two primary challenges. First, numerous candidate object locations (often called "proposals") must be processed. Second, these candidates provide only rough localization that must be refined to achieve precise localization. Solutions to these problems often compromise speed, accuracy, or simplicity.

The current difficulties in object localization are: (1) the number of proposals is very large; (2) these proposals provide only rough localization and must be refined, and solutions often compromise speed, accuracy, or simplicity.

1.1 R-CNN and SPPnet

The Region-based Convolutional Network method (R-CNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:

The drawbacks of the original R-CNN are:

  1. Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.

R-CNN first fine-tunes a ConvNet on the proposals with a log loss; then it feeds the ConvNet features into SVMs that act as the object detectors, replacing the softmax classifier learned during fine-tuning; only in a third stage are the bounding-box regressors learned. This introduces many redundant steps and unnecessary computation.

  2. Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.

Training is expensive in both time and space.

  3. Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).

Object detection is slow.

R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) [11] were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max-pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6×6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling [15]. SPPnet accelerates R-CNN by 10× to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction.

R-CNN is slow because it runs a full ConvNet forward pass for every proposal, so regions where proposals overlap are computed many times over, wasting a lot of computation. SPPnet instead computes the convolutional feature map for the entire image once and shares it, so all subsequent processing builds on this shared result; max pooling converts the feature map inside each proposal into a fixed-size output that is fed into the rest of the network.

SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.

Of course, SPPnet has its own drawbacks. Although sharing the conv computation saves work, training is still a multi-stage pipeline: extracting features, fine-tuning the network with a log loss, training SVMs, and fitting the bounding-box regressors, with features still written to disk. Moreover, the fine-tuning algorithm cannot update the convolutional layers that precede the spatial pyramid pooling, which limits the accuracy achievable with very deep networks.

1.2. Contributions

We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet, while improving on their speed and accuracy. We call this method Fast R-CNN because it's comparatively fast to train and test. The Fast R-CNN method has several advantages:

  1. Higher detection quality (mAP) than R-CNN, SPPnet
  2. Training is single-stage, using a multi-task loss
  3. Training can update all network layers
  4. No disk storage is required for feature caching
    Fast R-CNN is written in Python and C++ (Caffe [13]) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

Fast R-CNN addresses the problems above and has the following advantages:
1. Higher detection accuracy (mAP) than R-CNN and SPPnet;
2. Training is single-stage instead of multi-stage, using a multi-task loss;
3. Training can update all network layers;
4. No disk storage is needed to cache the extracted features.

2. Fast R-CNN architecture and training

Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all "background" class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.

The input is an entire image together with a set of proposals. After several conv and max pooling layers a conv feature map is produced. For each object's RoI, an RoI pooling layer extracts a fixed-size feature vector from the feature map. These feature vectors then pass through fully connected layers and branch into two outputs: one gives softmax probabilities over the object classes, the other gives bounding-box regression offsets.
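
To make the data flow concrete, here is a minimal shape-level sketch of that forward pass. The layer arguments (`backbone`, `roi_pool`, `fc`, `cls_head`, `bbox_head`) are placeholder callables I introduce for illustration; they are not the actual Caffe layer names used in the released code.

```python
import numpy as np

def fast_rcnn_forward(image, rois, backbone, roi_pool, fc, cls_head, bbox_head):
    """Shape-level sketch of the Fast R-CNN forward pass.

    image : (H, W, 3) input image
    rois  : list of proposals, each (r, c, h, w) in feature-map coordinates
    """
    # 1. One convolutional pass over the whole image -> shared conv feature map.
    feat_map = backbone(image)                        # (H', W', C)

    # 2. RoI pooling: each proposal becomes a fixed-size feature, e.g. 7 x 7 x C.
    pooled = np.stack([roi_pool(feat_map, roi, out_hw=(7, 7)) for roi in rois])

    # 3. Fully connected layers shared by both output heads.
    fc_feat = fc(pooled.reshape(len(rois), -1))       # (R, D)

    # 4. Two sibling outputs per RoI.
    cls_probs = cls_head(fc_feat)                     # (R, K + 1) softmax probabilities
    bbox_deltas = bbox_head(fc_feat)                  # (R, 4 * K) per-class box offsets
    return cls_probs, bbox_deltas
```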

2.1. The RoI pooling layer

The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).

The RoI pooling layer uses max pooling to convert the features inside any valid RoI into a small feature map with a fixed spatial extent of H×W (e.g., 7×7), the fixed-size output mentioned above, where H and W are layer hyper-parameters independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map; each RoI is defined by a four-tuple (r, c, h, w) specifying its top-left corner (r, c) and its height and width (h, w).

Figure 1. The Fast R-CNN architecture.

As described above, the conv features are computed once for the whole image; the RoI pooling layer then extracts a fixed-size feature for each RoI, and the resulting feature vectors are sent to two fully connected heads to learn object classification and box regression.

RoI max pooling works by dividing the h × w RoI window into an H × W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].

RoI max pooling divides the h×w RoI window into an H×W grid of sub-windows of approximate size h/H × w/W, and then max-pools the values in each sub-window into the corresponding output grid cell. As with standard max pooling, pooling is applied independently to each feature-map channel. The RoI layer is simply the special case of the spatial pyramid pooling layer used in SPPnets with only one pyramid level.
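
Below is a minimal NumPy sketch of this sub-window max pooling, assuming the RoI is already given in feature-map coordinates; the rounding of sub-window boundaries here is illustrative and may differ slightly from the official Caffe implementation.

```python
import numpy as np

def roi_max_pool(feat_map, roi, out_hw=(7, 7)):
    """Pool one RoI (r, c, h, w) on a (H', W', C) feature map into out_hw x C."""
    r, c, h, w = roi
    H, W = out_hw
    out = np.zeros((H, W, feat_map.shape[2]), dtype=feat_map.dtype)
    for i in range(H):
        for j in range(W):
            # Sub-window of approximate size (h/H) x (w/W).
            r0, r1 = r + int(np.floor(i * h / H)), r + int(np.ceil((i + 1) * h / H))
            c0, c1 = c + int(np.floor(j * w / W)), c + int(np.ceil((j + 1) * w / W))
            # Max over the sub-window, independently per channel.
            out[i, j] = feat_map[r0:max(r1, r0 + 1), c0:max(c1, c0 + 1)].max(axis=(0, 1))
    return out
```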

2.2. Initializing from pre-trained networks

We experiment with three pre-trained ImageNet [4] networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.

The authors experiment with three ImageNet pre-trained networks, each with five max pooling layers and between five and thirteen conv layers. When a pre-trained network is used to initialize a Fast R-CNN network, it undergoes three transformations.

First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).

First, the last max pooling layer is replaced by an RoI pooling layer, whose output size (H and W) is set to be compatible with the network's first fully connected layer (e.g., H = W = 7 for VGG16).

Second, the network's last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors).

Second, the network's last fully connected layer and softmax (originally trained for 1000-way ImageNet classification) are replaced by the two sibling output layers described above: the classifier over K+1 categories and the category-specific bounding-box regressors.

Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.

Third, the network is modified to take two inputs: a list of images and a list of RoIs in those images.

2.3. Fine-tuning for detection

Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.

Training all network weights with back-propagation is an important capability of Fast R-CNN. First, why SPPnet cannot update the weights below the spatial pyramid pooling layer:

The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).

The root cause is that when the training samples (RoIs) come from different images, each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the whole receptive field, the training inputs become very large (often the entire image).

We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).

The improvement comes from fully sharing the extracted features. In Fast R-CNN training, SGD mini-batches are sampled hierarchically: first N images are sampled, then R/N RoIs are sampled from each image. Importantly, RoIs from the same image share computation and memory in the forward and backward passes, so making N small reduces the computation per mini-batch. For example, with N = 2 and R = 128, this scheme is roughly 64× faster than sampling one RoI from each of 128 different images.
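
A minimal sketch of this hierarchical sampling, assuming `dataset` is a list of dicts each holding an image and its proposal RoIs (this mirrors the idea only, not the actual Fast R-CNN data layer):

```python
import random

def sample_minibatch(dataset, N=2, R=128):
    """Hierarchical sampling: N images, then R/N RoIs from each image."""
    images = random.sample(dataset, N)
    rois_per_image = R // N                      # e.g. 64 RoIs per image
    batch = []
    for entry in images:
        rois = random.sample(entry["rois"], min(rois_per_image, len(entry["rois"])))
        # All RoIs taken from the same image share one conv forward/backward pass.
        batch.append({"image": entry["image"], "rois": rois})
    return batch
```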

One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.

In theory, RoIs from the same image are correlated, so convergence could be slower; in practice this problem does not arise.

In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, mini-batch sampling strategy, back-propagation through RoI pooling layers, and SGD hyper-parameters) are described below.

Besides hierarchical sampling, Fast R-CNN also streamlines training: a single fine-tuning stage jointly optimizes the softmax classifier and the bounding-box regressors, instead of the original three separate stages for the softmax classifier, the SVMs, and the regression model. The components of this procedure (the loss, the mini-batch sampling strategy, back-propagation through the RoI pooling layer, and the SGD hyper-parameters) are described below.

Multi-task loss. A Fast R-CNN network has two sibling output layers. The first outputs a discrete probability distribution (per RoI), p = (p0, ..., pK), over K + 1 categories. As usual, p is computed by a softmax over the K + 1 outputs of a fully connected layer. The second sibling layer outputs bounding-box regression offsets, t^k = (t^k_x, t^k_y, t^k_w, t^k_h), for each of the K object classes, indexed by k. We use the parameterization for t^k given in [9], in which t^k specifies a scale-invariant translation and log-space height/width shift relative to an object proposal.

Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression:

L(p, u, t^u, v) = Lcls(p, u) + λ[u ≥ 1] Lloc(t^u, v),    (1)

in which Lcls(p, u) = −log p_u is log loss for true class u.

This describes the two final output layers: one for classification and one for box regression. The first outputs a discrete probability distribution over the K+1 categories (per RoI), p = (p0, ..., pK), computed as usual by a softmax over the K+1 outputs of a fully connected layer. The second outputs bounding-box regression offsets; for the k-th class it outputs t^k = (t^k_x, t^k_y, t^k_w, t^k_h). The parameterization of t^k follows [9] (R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014), in which t^k specifies a scale-invariant translation and a log-space height/width shift relative to an object proposal.

Every training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v, and a single multi-task loss L is used to train both tasks jointly, as in Eq. (1) above.

The second task loss, Lloc, is defined over a tuple of true bounding-box regression targets for class u, v = (vx, vy, vw, vh), and a predicted tuple t^u = (t^u_x, t^u_y, t^u_w, t^u_h), again for class u. The Iverson bracket indicator function [u ≥ 1] evaluates to 1 when u ≥ 1 and 0 otherwise. By convention the catch-all background class is labeled u = 0. For background RoIs there is no notion of a ground-truth bounding box and hence Lloc is ignored. For bounding-box regression, we use the loss
Lloc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i − v_i),    (2)

in which

smooth_L1(x) = 0.5x²  if |x| < 1,  |x| − 0.5  otherwise    (3)
is a robust L1 loss that is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet. When the regression targets are unbounded, training with L2 loss can require careful tuning of learning rates in order to prevent exploding gradients. Eq. 3 eliminates this sensitivity.

The hyper-parameter λ in Eq. 1 controls the balance between the two task losses. We normalize the ground-truth regression targets vi to have zero mean and unit variance. All experiments use λ = 1.

We note that [6] uses a related loss to train a class-agnostic object proposal network. Different from our approach, [6] advocates for a two-network system that separates localization and classification. OverFeat [19], R-CNN [9], and SPPnet [11] also train classifiers and bounding-box localizers, however these methods use stage-wise training, which we show is suboptimal for Fast R-CNN (Section 5.1).

For class u, the second loss Lloc is defined over the tuple of true bounding-box regression targets v = (vx, vy, vw, vh) and the predicted tuple t^u = (t^u_x, t^u_y, t^u_w, t^u_h). The Iverson bracket [u ≥ 1] equals 1 when u ≥ 1 and 0 otherwise. By convention, the background class is labeled u = 0; background RoIs have no ground-truth box, so Lloc is ignored for them. For bounding-box regression, the loss is Lloc(t^u, v) = Σ_{i∈{x,y,w,h}} smooth_L1(t^u_i − v_i) (Eq. 2), where smooth_L1(x) = 0.5x² if |x| < 1 and |x| − 0.5 otherwise (Eq. 3). This L1-style loss is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet. When the regression targets are unbounded, training with an L2 loss can require careful tuning of learning rates to prevent exploding gradients; Eq. (3) removes this sensitivity.

The hyper-parameter λ in Eq. (1) controls the balance between the two task losses. The ground-truth regression targets vi are normalized to zero mean and unit variance, and all experiments use λ = 1.
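
A small NumPy sketch of Eqs. (1)-(3) for a single RoI; the function names are mine and the inputs are assumed to be already-computed softmax probabilities and box offsets:

```python
import numpy as np

def smooth_l1(x):
    """Eq. (3): 0.5 x^2 if |x| < 1, |x| - 0.5 otherwise."""
    absx = np.abs(x)
    return np.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

def multitask_loss(p, u, t_u, v, lam=1.0):
    """Eq. (1) for one RoI: L = Lcls + lam * [u >= 1] * Lloc.

    p   : (K+1,) softmax probabilities
    u   : ground-truth class index (0 = background)
    t_u : (4,) predicted box offsets for class u
    v   : (4,) ground-truth regression targets (normalized to zero mean, unit variance)
    """
    l_cls = -np.log(p[u])                                # log loss for the true class
    l_loc = smooth_l1(t_u - v).sum() if u >= 1 else 0.0  # Eq. (2), skipped for background
    return l_cls + lam * l_loc
```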

The authors note that another work uses a related loss to train a class-agnostic proposal network, and that a different work advocates a two-network system separating localization and classification. OverFeat, R-CNN, and SPPnet also train classifiers and bounding-box localizers, but with stage-wise training, which turns out to be suboptimal for Fast R-CNN.

Mini-batch sampling. During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image. As in [9], we take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u ≥ 1. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), following [11]. These are the background examples and are labeled with u = 0. The lower threshold of 0.1 appears to act as a heuristic for hard example mining [8]. During training, images are horizontally flipped with probability 0.5. No other data augmentation is used.

Mini-batch sampling. During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, the dataset is actually iterated over in permutations). Mini-batches of size R = 128 are used, sampling 64 RoIs from each image. As in [9], 25% of the RoIs are taken from proposals whose IoU with a ground-truth box is at least 0.5; these are the examples labeled with a foreground class, i.e. u ≥ 1. The remaining RoIs are sampled from proposals whose maximum IoU with ground truth lies in [0.1, 0.5); these are the background examples and are labeled u = 0. The lower threshold of 0.1 appears to act as a heuristic for hard example mining. During training, images are horizontally flipped with probability 0.5; no other data augmentation is used.
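
A sketch of the foreground/background split for one image, assuming the best IoU of each proposal with any ground-truth box (and the class of that box) has already been computed; the helper name and inputs are hypothetical:

```python
import numpy as np

def label_and_sample_rois(max_ious, gt_classes, rois_per_image=64, fg_fraction=0.25):
    """25% foreground RoIs (IoU >= 0.5), rest background with max IoU in [0.1, 0.5).

    max_ious[i]   : proposal i's best IoU with any ground-truth box
    gt_classes[i] : class label of that best-matching ground-truth box
    """
    fg_inds = np.where(max_ious >= 0.5)[0]
    bg_inds = np.where((max_ious >= 0.1) & (max_ious < 0.5))[0]

    n_fg = min(int(fg_fraction * rois_per_image), len(fg_inds))
    n_bg = min(rois_per_image - n_fg, len(bg_inds))
    fg_inds = np.random.choice(fg_inds, n_fg, replace=False)
    bg_inds = np.random.choice(bg_inds, n_bg, replace=False)

    keep = np.concatenate([fg_inds, bg_inds]).astype(int)
    labels = gt_classes[keep].copy()
    labels[n_fg:] = 0           # background RoIs get u = 0
    return keep, labels
```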

Back-propagation through RoI pooling layers. Back-propagation routes derivatives through the RoI pooling layer. For clarity, we assume only one image per mini-batch (N = 1), though the extension to N > 1 is straightforward because the forward pass treats all images independently.

Let x_i ∈ ℝ be the i-th activation input into the RoI pooling layer and let y_rj be the layer's j-th output from the r-th RoI. The RoI pooling layer computes y_rj = x_{i*(r,j)}, in which i*(r,j) = argmax_{i' ∈ R(r,j)} x_{i'}. R(r,j) is the index set of inputs in the sub-window over which the output unit y_rj max pools. A single x_i may be assigned to several different outputs y_rj.

The RoI pooling layer's backwards function computes the partial derivative of the loss function with respect to each input variable x_i by following the argmax switches:

∂L/∂x_i = Σ_r Σ_j [i = i*(r,j)] ∂L/∂y_rj    (4)

Back-propagation through RoI pooling layers. The backward pass routes derivatives through the RoI pooling layer. For clarity, assume only one image per mini-batch (N = 1); the extension to N > 1 is straightforward because the forward pass treats all images independently.

Let x_i ∈ ℝ be the i-th activation input into the RoI pooling layer and let y_rj be the layer's j-th output from the r-th RoI. The RoI pooling layer computes y_rj = x_{i*(r,j)}, where i*(r,j) = argmax_{i' ∈ R(r,j)} x_{i'} and R(r,j) is the index set of inputs in the sub-window over which the output unit y_rj max pools. A single x_i may be assigned to several different outputs y_rj.

The RoI pooling layer's backward function computes the partial derivative of the loss with respect to each input x_i by following the argmax switches, as in Eq. (4). In other words, for each mini-batch RoI r and each pooled output unit y_rj, the partial derivative ∂L/∂y_rj is accumulated into ∂L/∂x_i whenever i is the argmax selected for y_rj by max pooling. In back-propagation, the partial derivatives ∂L/∂y_rj have already been computed by the backward function of the layer on top of the RoI pooling layer.

SGD hyper-parameters. The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001. When training on VOC07 or VOC12 trainval we run SGD for 30k mini-batch iterations, and then lower the learning rate to 0.0001 and train for another 10k iterations. When we train on larger datasets, we run SGD for more iterations, as described later. A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.

SGD hyper-parameters. The fully connected layers used for softmax classification and box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively; biases are initialized to 0. All layers use a per-layer learning rate of 1× the global rate for weights and 2× for biases, with a global learning rate of 0.001. When training on VOC07 or VOC12 trainval, SGD is run for 30k mini-batch iterations, then the learning rate is lowered to 0.0001 for another 10k iterations. Larger datasets use more iterations, as described later. Momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.
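
The same settings collected in one place as a reference sketch (the dict layout and helper name are mine, not the solver format used by the released Caffe code):

```python
import numpy as np

# Hyper-parameters from Section 2.3, gathered into a plain dict for reference.
SOLVER = {
    "base_lr": 0.001,            # global learning rate
    "lr_mult_weights": 1.0,      # per-layer rate multiplier for weights
    "lr_mult_biases": 2.0,       # per-layer rate multiplier for biases
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "steps": [(30000, 0.001), (10000, 0.0001)],  # (iterations, lr) for VOC07/12 trainval
}

def init_new_fc(shape, std):
    """New fc heads: zero-mean Gaussian weights (std 0.01 cls / 0.001 bbox), zero biases."""
    weights = np.random.normal(0.0, std, size=shape)
    biases = np.zeros(shape[0])
    return weights, biases
```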

2.4. Scale invariance

We explore two ways of achieving scale invariant object detection: (1) via “brute force” learning and (2) by using image pyramids. These strategies follow the two approaches in [11]. In the brute-force approach, each image is processed at a pre-defined pixel size during both training and testing. The network must directly learn scale-invariant object detection from the training data.
The multi-scale approach, in contrast, provides approximate scale-invariance to the network through an image pyramid. At test-time, the image pyramid is used to approximately scale-normalize each object proposal. During multi-scale training, we randomly sample a pyramid scale each time an image is sampled, following [11], as a form of data augmentation. We experiment with multi-scale training for smaller networks only, due to GPU memory limits.

Two ways of achieving scale-invariant object detection are explored: (1) "brute force" learning and (2) image pyramids. In the brute-force approach, each image is processed at a pre-defined pixel size during both training and testing, so the network must learn scale-invariant detection directly from the training data.

The multi-scale approach provides approximate scale invariance through an image pyramid. At test time, the pyramid is used to approximately scale-normalize each proposal. During multi-scale training, a pyramid scale is randomly sampled each time an image is sampled, as a form of data augmentation. Due to GPU memory limits, multi-scale training is only used for the smaller networks.
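
A tiny sketch of the scale sampling used as data augmentation; the specific scale values below are illustrative assumptions, not necessarily the paper's exact pyramid.

```python
import random

# Candidate pyramid scales (shorter-side lengths in pixels); assumed values.
PYRAMID_SCALES = [480, 576, 688, 864, 1200]

def sample_training_scale():
    """Randomly pick a pyramid scale each time an image is sampled."""
    return random.choice(PYRAMID_SCALES)
```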

3. Fast R-CNN detection

Once a Fast R-CNN network is fine-tuned, detection amounts to little more than running a forward pass (assuming object proposals are pre-computed). The network takes as input an image (or an image pyramid, encoded as a list of images) and a list of R object proposals to score. At test-time, R is typically around 2000, although we will consider cases in which it is larger (≈ 45k). When using an image pyramid, each RoI is assigned to the scale such that the scaled RoI is closest to 224² pixels in area.

For each test RoI r, the forward pass outputs a class posterior probability distribution p and a set of predicted bounding-box offsets relative to r (each of the K classes gets its own refined bounding-box prediction). We assign a detection confidence to r for each object class k using the estimated probability Pr(class = k | r) = pk. We then perform non-maximum suppression independently for each class using the algorithm and settings from R-CNN.

Once a Fast R-CNN network has been fine-tuned, detection amounts to little more than running a forward pass (assuming the proposals are pre-computed). The network takes as input an image (or an image pyramid, encoded as a list of images) and a list of R proposals to score. At test time R is typically around 2000, although cases where it is much larger (about 45k) are also considered. When an image pyramid is used, each RoI is assigned to the scale at which the scaled RoI is closest to 224² pixels in area.

For each test RoI r, the forward pass outputs a class posterior probability distribution p and a set of predicted bounding-box offsets relative to r (each of the K classes gets its own refined box prediction). A detection confidence is assigned to r for each class k using the estimated probability Pr(class = k | r) = pk. Non-maximum suppression is then performed independently for each class, using the algorithm and settings from R-CNN.

3.1. Truncated SVD for faster detection

For whole-image classification, the time spent computing the fully connected layers is small compared to the conv layers. On the contrary, for detection the number of RoIs to process is large and nearly half of the forward pass time is spent computing the fully connected layers (see Fig. 2). Large fully connected layers are easily accelerated by compressing them with truncated SVD [5, 23].
In this technique, a layer parameterized by the u × v weight matrix W is approximately factorized as

W ≈ U Σ_t V^T    (5)

using SVD. In this factorization, U is a u × t matrix comprising the first t left-singular vectors of W, Σ_t is a t × t diagonal matrix containing the top t singular values of W, and V is a v × t matrix comprising the first t right-singular vectors of W. Truncated SVD reduces the parameter count from uv to t(u + v), which can be significant if t is much smaller than min(u, v). To compress a network, the single fully connected layer corresponding to W is replaced by two fully connected layers, without a non-linearity between them. The first of these layers uses the weight matrix Σ_t V^T (and no biases) and the second uses U (with the original biases associated with W). This simple compression method gives good speedups when the number of RoIs is large.

For whole-image classification, the time spent in the fully connected layers is small compared to the conv layers. For detection, by contrast, the number of RoIs to process is large, and nearly half of the forward-pass time is spent in the fully connected layers (see Fig. 2). Large fully connected layers are easily accelerated by compressing them with truncated SVD.

In this technique, the layer's u × v weight matrix W is approximately factorized by SVD as W ≈ U Σ_t V^T (Eq. 5), where U is a u × t matrix of the first t left-singular vectors of W, Σ_t is a t × t diagonal matrix of the top t singular values, and V is a v × t matrix of the first t right-singular vectors. Truncated SVD reduces the parameter count from uv to t(u + v), a significant saving when t is much smaller than min(u, v). To compress the network, the single fully connected layer corresponding to W is replaced by two fully connected layers with no non-linearity between them: the first uses the weight matrix Σ_t V^T (and no biases), the second uses U (with the original biases associated with W). This simple compression gives good speedups when the number of RoIs is large.

4. Main results

Three main results support this paper’s contributions:

  1. State-of-the-art mAP on VOC07, 2010, and 2012
  2. Fast training and testing compared to R-CNN, SPPnet
  3. Fine-tuning conv layers in VGG16 improves mAP

Three results support the paper's contributions:
1. State-of-the-art mAP on VOC07, 2010, and 2012.
2. Fast training and testing compared to R-CNN and SPPnet.
3. Fine-tuning the conv layers in VGG16 improves mAP.

5. Design evaluation

The paper also uses experiments to study how various design factors affect training, including:
5.1. Does multitask training help?
5.2. Scale invariance: to brute force or finesse?
5.3. Do we need more training data?
5.4. Do SVMs outperform softmax?
5.5. Are more proposals always better?
5.6. Preliminary MS COCO results
