Paper notes — R-CNN: Rich feature hierarchies for accurate object detection and semantic segmentation (tech report)

1. Paper overview


First, thanks to two excellent write-ups (one on Zhihu, one on CSDN) that explain this paper very clearly:

RCNN – the pioneering work that brought CNNs into object detection

R-CNN paper explained

Since they cover the details, these notes only record the points I marked while reading the paper.

The overall test-time pipeline: first use Selective Search to propose roughly 2000 regions where objects might exist; feed each region through the CNN to extract a 4096-dimensional feature vector; pass these vectors to 20 binary SVM classifiers to score each RoI per class; then, for each class, apply NMS and send the surviving high-scoring RoIs through bounding-box regression to refine their locations.
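The pipeline above can be sketched as follows. The callable arguments (`propose`, `extract_features`, `nms`, `refine`) are hypothetical stand-ins for Selective Search, the CNN, per-class NMS, and the bounding-box regressor, not the paper's actual code:

```python
import numpy as np

def rcnn_detect(image, propose, extract_features, svm_weights, nms, refine):
    """Test-time R-CNN pipeline sketch. All component functions are
    illustrative placeholders for the real modules."""
    rois = propose(image)                      # ~2000 class-agnostic boxes
    feats = np.stack([extract_features(image, r) for r in rois])  # (R, 4096)
    scores = feats @ svm_weights               # (R, 20) one-vs-rest SVM scores
    detections = []
    for cls in range(svm_weights.shape[1]):    # per-class NMS + refinement
        for i in nms(rois, scores[:, cls]):
            detections.append((cls, refine(rois[i], feats[i]), scores[i, cls]))
    return detections
```

Note that the CNN runs once per proposal; only the final dot product and NMS are class-specific.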

Note: this pioneering work may be the first to bring transfer learning into detection. Because classification datasets are far larger than detection datasets, the feature-extraction network is first trained on the ImageNet ILSVRC 2012 classification data and then fine-tuned on the VOC detection data; the fine-tuning learning rate is typically ten times smaller (so as not to clobber the features already learned).

One point the two posts above do not make clear: the 4096-dimensional positive-sample features used by the SVMs come from different sources at training and test time. At training time, the positives are the ground-truth boxes passed through the feature-extraction network; since SVMs need far fewer samples than CNNs, this yields higher classification accuracy. At test time there are no annotations, so the SVMs simply classify the ~2000 Selective Search proposals, after which non-maximum suppression is applied and the survivors go to bounding-box refinement.
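The differing sample definitions can be made concrete with IoU. Below is a sketch of the SVM-stage labeling: ground-truth boxes are the positives, proposals with IoU under 0.3 against every GT are negatives, and the rest are ignored. The 0.3 threshold is from the paper; the function names are mine:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def svm_label(proposal, gt_boxes, neg_thresh=0.3):
    """SVM training label for one proposal: -1 = negative, 0 = ignored.
    Positives are not drawn from proposals at all; they are the GT boxes."""
    best = max((iou(proposal, g) for g in gt_boxes), default=0.0)
    return -1 if best < neg_thresh else 0
```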

Instead, we solve the CNN localization problem by operating within the “recognition using regions” paradigm [21], which has been successful for both object detection [39] and semantic segmentation [5]. At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region’s shape. Figure 1 presents an overview of our method and highlights some of our results. Since our system combines region proposals with CNNs, we dub the method R-CNN: Regions with CNN features.
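A minimal sketch of the fixed-size warping step, using nearest-neighbor sampling as a simplification of true affine interpolation (the 16-pixel context border matches the paper's padded-warp variant; the function name is mine):

```python
import numpy as np

def warp_region(image, box, out_size=227, context=16):
    """Crop `box` = (x1, y1, x2, y2) from `image` (H, W, C), dilate it by
    `context` pixels of surrounding image, and stretch it to a fixed
    out_size x out_size CNN input regardless of aspect ratio."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    x1, y1 = max(0, x1 - context), max(0, y1 - context)
    x2, y2 = min(w, x2 + context), min(h, y2 + context)
    # Nearest-neighbor sample a uniform out_size grid over the padded box.
    ys = np.clip(np.linspace(y1, y2 - 1, out_size).round().astype(int), 0, h - 1)
    xs = np.clip(np.linspace(x1, x2 - 1, out_size).round().astype(int), 0, w - 1)
    return image[np.ix_(ys, xs)]
```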

2. A 2012 controversy around AlexNet

The significance of the ImageNet result was vigorously debated during the ILSVRC 2012 workshop. The central issue can be distilled to the following:
To what extent do the CNN classification results on ImageNet generalize to
object detection results on the PASCAL VOC Challenge?

We answer this question by bridging the gap between
image classification and object detection. This paper is the
first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared
to systems based on simpler HOG-like features. To achieve
this result, we focused on two problems: localizing objects
with a deep network and training a high-capacity model
with only a small quantity of annotated detection data.

The controversy: to what extent do AlexNet's classification results on ImageNet generalize to detection results?

In this paper, the authors use transfer learning to train a high-capacity model despite having only a small amount of detection data.

Note: a thought on model capacity.

Our lab has been holding weekly paper-sharing sessions, and after a few of them I formed a new view of CNN capacity. For example: in detection networks, introducing anchors effectively introduces a space of all regions where a target could exist; in Siamese tracking networks, the score map introduces the space of all positions the target could occupy in the current frame; in stereo-disparity networks, the cost volume introduces the space of all possible disparities between the two views. In general, a CNN exploits large data and its large parameter capacity to build, for each training objective, a search space over candidate solutions, and then the rest of the architecture plus a well-designed loss function finds an optimal or near-optimal solution within that space.

3. Why not sliding windows?

Unlike image classification, detection requires localizing (likely many) objects within an image. One approach
frames localization as a regression problem. However, work
from Szegedy et al. [38], concurrent with our own, indicates that this strategy may not fare well in practice (they
report a mAP of 30.5% on VOC 2007 compared to the
58.5% achieved by our method). An alternative is to build a
sliding-window detector. CNNs have been used in this way
for at least two decades, typically on constrained object categories, such as faces [32, 40] and pedestrians [35]. In order
to maintain high spatial resolution, these CNNs typically
only have two convolutional and pooling layers. We also
considered adopting a sliding-window approach.
However, units high up in our network, which has five convolutional
layers, have very large receptive fields (195 × 195 pixels)
and strides (32×32 pixels) in the input image, which makes
precise localization within the sliding-window paradigm an
open technical challenge. (With five convolutional layers, a 32-pixel stride, and a large receptive field, precise localization within the sliding-window paradigm becomes difficult.)
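The 195-pixel receptive field and 32-pixel stride quoted above can be checked with the standard accumulation over (kernel, stride) pairs. The AlexNet-style layer list below is my reconstruction of the five conv layers plus the three max-pool layers, not taken from the paper:

```python
def receptive_field(layers):
    """Accumulate receptive field and effective stride through a stack of
    (kernel_size, stride) layers: rf grows by (k - 1) * jump at each layer,
    and jump multiplies by the layer stride."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf, jump

# conv1, pool1, conv2, pool2, conv3, conv4, conv5, pool5 as (kernel, stride):
ALEXNET_STACK = [(11, 4), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1), (3, 2)]
```

Running `receptive_field(ALEXNET_STACK)` reproduces the (195, 32) figures from the quote.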

4. Run-time analysis (the authors argue their method is more efficient at test time than prior approaches)

Two properties make detection efficient.
First, all CNN parameters are shared across all categories.
Second, the feature vectors computed by the CNN are low-dimensional when compared to other common approaches, such as spatial pyramids with bag-of-visual-word
encodings. The features used in the UVA detection system
[39], for example, are two orders of magnitude larger than
ours (360k vs. 4k-dimensional).
The result of such sharing is that the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) is amortized over all classes. The
only class-specific computations are dot products between
features and SVM weights and non-maximum suppression.
In practice, all dot products for an image are batched into
a single matrix-matrix product. The feature matrix is typically 2000×4096 and the SVM weight matrix is 4096×N, where N is the number of classes.

The authors' point: although the RoIs do not share the CNN feature computation with one another, all categories share the same features; the computation only becomes class-specific at the SVM stage.
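The single matrix-matrix product described in the quote, sketched with random data in place of real CNN features and trained SVM weights:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((2000, 4096)).astype(np.float32)  # one row per proposal
svm_w = rng.standard_normal((4096, 20)).astype(np.float32)       # one column per class

# All class scores for all proposals in one batched product:
scores = features @ svm_w   # shape (2000, 20)
```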

5. Why add SVMs instead of classifying directly with softmax?

Our hypothesis is that this difference in how positives
and negatives are defined is not fundamentally important
and arises from the fact that fine-tuning data is limited.
Our current scheme introduces many “jittered” examples
(those proposals with overlap between 0.5 and 1, but not
ground truth), which expands the number of positive examples by approximately 30x. We conjecture that this large
set is needed when fine-tuning the entire network to avoid
overfitting. However, we also note that using these jittered
examples is likely suboptimal because the network is not
being fine-tuned for precise localization.

This leads to the second issue: Why, after fine-tuning,
train SVMs at all? It would be cleaner to simply apply the
last layer of the fine-tuned network, which is a 21-way softmax regression classifier, as the object detector.
We tried this and found that performance on VOC 2007 dropped
from 54.2% to 50.9% mAP. This performance drop likely
arises from a combination of several factors including that
the definition of positive examples used in fine-tuning does
not emphasize precise localization and the softmax classifier was trained on randomly sampled negative examples
rather than on the subset of “hard negatives” used for SVM
training.

The main reason: CNN fine-tuning and SVM training define positives and negatives differently. The CNN uses a loose definition (to get enough training data), while the SVM uses a strict one, so classifying directly with softmax hurts localization precision and therefore mAP. The SVMs also benefit from being trained on hard negatives.
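A minimal sketch of the hard-negative mining used for the SVM stage: rescore the negative pool with the current linear model and keep the highest-scoring negatives (the worst false positives) to add back into training. The function name and `top_k` parameter are illustrative:

```python
import numpy as np

def mine_hard_negatives(neg_features, weights, bias, top_k=128):
    """Return indices of the `top_k` negatives the current SVM scores
    highest, i.e. the hardest false positives."""
    scores = neg_features @ weights + bias
    order = np.argsort(-scores)          # descending by score
    return order[:top_k]
```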

6. Ablation studies

From Table 2 one can see that fc6 and fc7 only pay off after fine-tuning; without fine-tuning, the CNN acts like a fixed HOG-style feature extractor.

We start by looking at results from the CNN without
fine-tuning on PASCAL, i.e. all CNN parameters were
pre-trained on ILSVRC 2012 only. Analyzing performance
layer-by-layer (Table 2 rows 1-3) reveals that features from
fc7 generalize worse than features from fc6. This means
that 29%, or about 16.8 million, of the CNN’s parameters
can be removed without degrading mAP. More surprising is
that removing both fc7 and fc6 produces quite good results even though pool5 features are computed using only 6% of the CNN’s parameters. Much of the CNN’s representational power comes from its convolutional layers, rather than from the much larger densely connected layers. This finding suggests potential utility in computing a dense feature map, in the sense of HOG, of an arbitrary-sized image by using only the convolutional layers of the CNN. This representation would enable experimentation with sliding-window detectors, including DPM, on top of pool5 features.
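The 16.8M / 29% figures can be reproduced by counting fully-connected parameters; the conv-stack count below assumes AlexNet-style grouped convolutions and a 6×6×256 pool5 output, so treat it as approximate:

```python
# fc6 maps the flattened 6x6x256 pool5 output to 4096 units; fc7 is 4096 -> 4096.
fc6_params = (6 * 6 * 256) * 4096 + 4096   # weights + biases, ~37.8M
fc7_params = 4096 * 4096 + 4096            # ~16.8M: the parameters dropped with fc7

# Rough conv-stack total (grouped convs as in AlexNet), for the share estimate:
conv_params = (11*11*3*96 + 96) + (5*5*48*256 + 256) + (3*3*256*384 + 384) \
            + (3*3*192*384 + 384) + (3*3*192*256 + 256)

total = conv_params + fc6_params + fc7_params
fc7_share = fc7_params / total             # ~0.29, matching the quoted 29%
```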

7. Relationship to OverFeat

There is an interesting relationship between R-CNN and
OverFeat: OverFeat can be seen (roughly) as a special case
of R-CNN. If one were to replace selective search region
proposals with a multi-scale pyramid of regular square regions and change the per-class bounding-box regressors to a single bounding-box regressor, then the systems would
be very similar (modulo some potentially significant differences in how they are trained: CNN detection fine-tuning, using SVMs, etc.). It is worth noting that OverFeat has a significant speed advantage over R-CNN: it is about 9x
faster, based on a figure of 2 seconds per image quoted from
[34]. This speed comes from the fact that OverFeat’s sliding windows (i.e., region proposals) are not warped at the
image level and therefore computation can be easily shared
between overlapping windows. Sharing is implemented by
running the entire network in a convolutional fashion over
arbitrary-sized inputs. Speeding up R-CNN should be possible in a variety of ways and remains as future work.

8. Bounding-box regression

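For reference, the bounding-box regression transforms from the paper's appendix: targets are scale-invariant center offsets plus log-space width/height offsets, applied in reverse at test time. A sketch with boxes in (cx, cy, w, h) form:

```python
import numpy as np

def regression_targets(p, g):
    """Targets t = (tx, ty, tw, th) for regressing proposal p onto GT box g."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)])

def apply_regression(p, t):
    """Inverse transform: refine proposal p with predicted offsets t."""
    px, py, pw, ph = p
    tx, ty, tw, th = t
    return np.array([px + pw * tx, py + ph * ty,
                     pw * np.exp(tw), ph * np.exp(th)])
```

Dividing the center offsets by the proposal's size and taking logs of the size ratios keeps the targets well-scaled regardless of how large the proposal is.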

References

1. RCNN – the pioneering work that brought CNNs into object detection

2. R-CNN paper explained


Reposted from blog.csdn.net/j879159541/article/details/102860385