Rich feature hierarchies for accurate object detection and semantic segmentation

Introduction

This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features. To achieve this result, we focused on two problems: localizing objects with a deep network and training a high-capacity model with only a small quantity of annotated detection data.

Detection requires localizing (likely many) objects within an image:

  • One approach frames localization as a regression problem.
  • An alternative is to build a sliding window detector.

We solve the CNN localization problem by operating within the “recognition using regions” paradigm.

At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs.
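A minimal Python sketch may help fix this data flow; `propose_regions`, `warp_to_227`, and `cnn_features` are hypothetical stand-ins for selective search, the warping step, and the CNN forward pass:

```python
import numpy as np

def detect(image, svm_weights, svm_biases):
    """R-CNN test time: score class-agnostic proposals with per-class linear SVMs."""
    proposals = propose_regions(image)  # ~2000 boxes from selective search
    # One fixed-length feature vector per warped 227x227 crop.
    feats = np.stack([cnn_features(warp_to_227(image, box)) for box in proposals])
    # Per-class SVM scores for every proposal in one matrix product.
    scores = feats @ svm_weights.T + svm_biases  # (num_proposals, num_classes)
    return proposals, scores
```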

The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning.

Object detection with R-CNN

Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs.

Module design

Region proposals. We use selective search to enable a controlled comparison with prior detection work.

Feature extraction

Test-time detection

Run-time analysis

Training

Supervised pre-training

Domain-specific fine-tuning

Aside from replacing the CNN’s ImageNet-specific 1000-way classification layer with a randomly initialized (N + 1)-way classification layer (where N is the number of object classes, plus 1 for background), the CNN architecture is unchanged.
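As a rough modern illustration (the paper used Caffe; torchvision’s AlexNet is assumed here as a stand-in for the paper’s network), the head swap amounts to:

```python
import torch.nn as nn
from torchvision import models

N = 20  # e.g., the 20 PASCAL VOC object classes
model = models.alexnet(weights="IMAGENET1K_V1")  # ImageNet-pretrained backbone
# Replace the 1000-way ImageNet classifier (the final Linear layer) with a
# randomly initialized (N + 1)-way layer; everything else is kept as-is.
model.classifier[6] = nn.Linear(4096, N + 1)
```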

We treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives.
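Concretely, the labeling rule can be written as follows (a sketch; boxes are assumed to be (x1, y1, x2, y2) tuples):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def fine_tuning_label(proposal, gt_boxes, gt_classes, background=0):
    """Class of the best-overlapping ground-truth box if IoU >= 0.5, else background."""
    best_iou, label = 0.5, background
    for box, cls in zip(gt_boxes, gt_classes):
        o = iou(proposal, box)
        if o >= best_iou:
            best_iou, label = o, cls
    return label
```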

In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128.
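A sketch of that biased sampling (positives are rare relative to background, so they are heavily oversampled):

```python
import random

def sample_minibatch(positives, backgrounds, n_pos=32, n_bg=96):
    """32 positive + 96 background windows per 128-window SGD mini-batch.
    Sampling is with replacement, since positives can be scarce."""
    batch = random.choices(positives, k=n_pos) + random.choices(backgrounds, k=n_bg)
    random.shuffle(batch)
    return batch
```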

Object category classifiers
The overlap threshold, 0.3, was selected by a grid search over {0, 0.1, ..., 0.5} on a validation set. We found that selecting this threshold carefully is important.
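In pseudocode, with `train_svms` and `validation_map` as assumed helpers, the search is just:

```python
# Grid search the SVM negative-overlap threshold on a validation set.
candidates = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
best_threshold = max(candidates,
                     key=lambda t: validation_map(train_svms(neg_iou_threshold=t)))
```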

Results on PASCAL VOC 2010-12

Results on ILSVRC2013 detection

Visualization, ablation, and modes of error

Visualizing learned features

The idea is to single out a particular unit (feature) in the network and use it as if it were an object detector in its own right. That is, we compute the unit’s activations on a large set of held-out region proposals (about 10 million), sort the proposals from highest to lowest activation, perform non-maximum suppression, and then display the top-scoring regions.
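The non-maximum suppression step is the standard greedy procedure; a sketch, reusing an `iou` helper like the one above:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: repeatedly keep the highest-scoring remaining box and
    drop every other box that overlaps it by more than the threshold."""
    order = np.argsort(scores)[::-1]  # indices, best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        overlaps = np.array([iou(boxes[i], boxes[j]) for j in rest])
        order = rest[overlaps <= iou_threshold]
    return keep
```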

Ablation studies

Performance layer-by-layer, without fine-tuning.
Much of the CNN’s representational power comes from its convolutional layers, rather than from the much larger densely connected layers.

Performance layer-by-layer, with fine-tuning.
The boost from fine-tuning is much larger for fc6 and fc7 than for pool5, which suggests that the pool5 features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them.

Comparison to recent feature learning methods.

Network architectures

Detection error analysis

Bounding-box regression

We train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal.

Qualitative results

The ILSVRC2013 detection dataset

Dataset overview

Region proposals

Selective search was run in “fast mode” on each image.

Training data

Training data is required for three procedures in R-CNN: (1) CNN fine-tuning, (2) detector SVM training, and (3) bounding-box regressor training.

Validation and evaluation

Ablation study

Relationship to OverFeat

OverFeat can be seen (roughly) as a special case of R-CNN: if one were to replace selective search region proposals with a multi-scale pyramid of regular square regions and change the per-class bounding-box regressors to a single bounding-box regressor, then the two systems would be very similar.

Semantic segmentation

CNN features for segmentation.

Results on VOC 2011.

Conclusion

We achieved this performance through two insights.

  • The first is to apply high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects.
  • The second is a paradigm for training large CNNs when labeled training data is scarce.

We show that it is highly effective to pre-train the network, with supervision, for an auxiliary task with abundant data (image classification) and then to fine-tune the network for the target task where data is scarce (detection).

Appendix

A. Object proposal transformations

  • tightest square with context;
  • tightest square without context;
  • warp (sketched below).
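A minimal sketch of the warp variant, assuming OpenCV for resizing (the paper dilates the box so that 16 pixels of the warped 227x227 input are surrounding context; the padding arithmetic below is a simplification):

```python
import cv2

def warp_proposal(image, box, out_size=227, context=16):
    """Anisotropically warp a proposal (x1, y1, x2, y2) to out_size x out_size,
    after dilating it so the crop carries some surrounding context."""
    x1, y1, x2, y2 = box
    pad_x = context * (x2 - x1) / float(out_size)
    pad_y = context * (y2 - y1) / float(out_size)
    h, w = image.shape[:2]
    x1 = max(0, int(x1 - pad_x)); y1 = max(0, int(y1 - pad_y))
    x2 = min(w, int(x2 + pad_x)); y2 = min(h, int(y2 + pad_y))
    crop = image[y1:y2, x1:x2]
    return cv2.resize(crop, (out_size, out_size))  # aspect ratio not preserved
```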

B. Positive vs. negative examples and softmax

  1. Why are positive and negative examples defined differently for fine-tuning the CNN versus training the object detection SVMs?

Our hypothesis is that this difference in how positives and negatives are defined is not fundamentally important and arises from the fact that fine-tuning data is limited.

  2. Why, after fine-tuning, train SVMs at all?

The definition of positive examples used in fine-tuning does not emphasize precise localization, and the softmax classifier was trained on randomly sampled negative examples rather than on the subset of “hard negatives” used for SVM training.

C. Bounding-box regression

\hat G_x = P_w d_x(P) + P_x
\hat G_y = P_h d_y(P) + P_y
\hat G_w = P_w \exp(d_w(P))
\hat G_h = P_h \exp(d_h(P))

\mathbf{w}_\star = \underset{\hat{\mathbf{w}}_\star}{\arg\min} \sum_{i=1}^{N} \left( t_\star^i - \hat{\mathbf{w}}_\star^{\mathrm{T}} \phi_5(P^i) \right)^2 + \lambda \lVert \hat{\mathbf{w}}_\star \rVert^2

t_x = (G_x - P_x)/P_w
t_y = (G_y - P_y)/P_h
t_w = \log(G_w/P_w)
t_h = \log(G_h/P_h)
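Putting the four targets and their inverse together (a sketch; boxes are (center_x, center_y, width, height) tuples, and the ridge solve corresponds to the objective above with λ = 1000 as in the paper):

```python
import numpy as np

def regression_targets(P, G):
    """t = (t_x, t_y, t_w, t_h) for proposal P and ground-truth box G."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    return np.array([(Gx - Px) / Pw, (Gy - Py) / Ph,
                     np.log(Gw / Pw), np.log(Gh / Ph)])

def apply_regression(P, d):
    """Invert the transform: map predicted offsets d = (dx, dy, dw, dh) to a box."""
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    return np.array([Pw * dx + Px, Ph * dy + Py,
                     Pw * np.exp(dw), Ph * np.exp(dh)])

def fit_ridge(Phi, t, lam=1000.0):
    """Closed-form ridge regression, one weight vector per target coordinate:
    w = (Phi^T Phi + lam I)^{-1} Phi^T t, with Phi the pool5 features."""
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)
```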

D. Additional feature visualizations

E. Per-category segmentation results

F. Analysis of cross-dataset redundancy

Reposted from blog.csdn.net/Tifa_Best/article/details/88081449