Rich feature hierarchies for accurate object detection and semantic segmentation
R-CNN object detection
Model structure
The overall model consists of three sub-models:
* The first generates category-independent region proposals; these proposals define the set of candidate detections available to the detector.
* The second is a large convolutional neural network that extracts a fixed-length feature vector from each region.
* The third is a set of class-specific linear SVMs.
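The three modules above can be sketched as a single detection loop. Here `propose`, `cnn_features`, and `svms` are hypothetical stand-ins for selective search, the CNN feature extractor, and the per-class SVMs; this only illustrates the data flow, not the paper's implementation:

```python
def detect(image, propose, cnn_features, svms):
    """High-level flow of the three-module R-CNN pipeline.
    `propose`, `cnn_features`, and `svms` (class name -> scoring
    function) are hypothetical callables standing in for selective
    search, the CNN, and the class-specific linear SVMs."""
    detections = []
    for box in propose(image):            # 1. category-independent proposals
        feat = cnn_features(image, box)   # 2. fixed-length feature vector
        for cls, svm in svms.items():     # 3. class-specific linear SVMs
            detections.append((cls, box, svm(feat)))
    return detections
```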
Region proposals
selective search
Feature extraction
extract a 4096-dimensional feature vector from each region proposal using Caffe
To give each region the required 227×227 input size, we warp all pixels in a tight bounding box around it to that size, regardless of the region's aspect ratio.
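The warp can be sketched in NumPy. The paper's best-performing variant also dilates the tight box with p = 16 pixels of image context before the anisotropic resize; the nearest-neighbour sampling here is an illustration, not the paper's exact interpolation:

```python
import numpy as np

def warp_region(image, box, out_size=227, context=16):
    """Crop a proposal with a small context border (p = 16 pixels, as in
    the paper's best variant) and warp it anisotropically to
    out_size x out_size. `image` is an HxWxC array, `box` is
    (x1, y1, x2, y2). A minimal sketch, not the authors' code."""
    x1, y1, x2, y2 = box
    h, w = image.shape[:2]
    # Expand the tight box by the context padding, clipped to the image.
    x1, y1 = max(0, x1 - context), max(0, y1 - context)
    x2, y2 = min(w, x2 + context), min(h, y2 + context)
    crop = image[y1:y2, x1:x2]
    ch, cw = crop.shape[:2]
    # Aspect-ratio-ignoring warp via nearest-neighbour index mapping.
    rows = np.arange(out_size) * ch // out_size
    cols = np.arange(out_size) * cw // out_size
    return crop[rows[:, None], cols]
```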
Test-time detection
At test time, roughly 2000 region proposals are extracted with selective search, and each warped region is fed into the CNN to extract features. Then, for each class, the SVM trained for that class scores the extracted features. Given all scored regions in an image, we apply greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher-scoring selected region larger than a learned threshold.
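The greedy per-class NMS step can be sketched as follows (a minimal NumPy implementation, not the authors' code):

```python
import numpy as np

def nms(boxes, scores, iou_threshold):
    """Greedy non-maximum suppression, run per class: keep the
    highest-scoring box, reject every remaining box whose IoU with it
    exceeds the threshold, and repeat. `boxes` is (N, 4) as
    (x1, y1, x2, y2); returns the indices of the kept boxes."""
    order = np.argsort(scores)[::-1]   # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the kept box against all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]
    return keep
```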
Training
Supervised pre-training
the CNN is pre-trained on the large auxiliary ILSVRC 2012 classification dataset (image-level annotations only)
Domain-specific fine-tuning
To adapt the CNN to the new task (detection) and the new domain (warped proposal windows), training continues with SGD using only warped region proposals. The CNN's classification layer is replaced with an (N+1)-way layer, where N is the number of object classes and the extra 1 is the background.
We treat all region proposals with >= 0.5 IoU overlap with a ground-truth box as positives and the rest as negatives.
In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128.
We bias the sampling towards positive windows because they are extremely rare compared to background.
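The biased mini-batch construction can be sketched like this; the with-replacement handling of images with fewer than 32 available positives is my assumption, as the paper only specifies the 32/96 split:

```python
import random

def sample_minibatch(positives, negatives, n_pos=32, n_neg=96, rng=random):
    """One fine-tuning SGD mini-batch: 32 positive windows (over all
    classes) and 96 background windows, batch size 128. Sampling is
    biased towards positives because they are rare. If fewer than 32
    positives exist, they are sampled with replacement (an assumption)."""
    if len(positives) < n_pos:
        pos = [rng.choice(positives) for _ in range(n_pos)]
    else:
        pos = rng.sample(positives, n_pos)
    neg = rng.sample(negatives, n_neg)
    batch = pos + neg
    rng.shuffle(batch)
    return batch
```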
Object category classifiers
A region that tightly encloses an object is clearly a positive example, and a region with no overlap with the object is clearly a negative one. But how should a region be labeled when it only partially overlaps the object?
We resolve this issue with an IoU overlap threshold, below which regions are defined as negatives. The overlap threshold, 0.3, was selected by a grid search over {0, 0.1, …, 0.5} on a validation set.
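The IoU computation and the resulting SVM labeling rule can be sketched as follows; treating intermediate overlaps as "ignored" is my reading of the paper's grey zone (positives for the SVMs are the ground-truth boxes themselves):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)

def svm_label(proposal, gt_boxes, neg_threshold=0.3):
    """SVM training label for a proposal: proposals with IoU < 0.3
    against every ground-truth box are negatives (-1); proposals above
    the threshold are ignored (0) -- the positives are the ground-truth
    boxes themselves, not proposals."""
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    return -1 if best < neg_threshold else 0
```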
Since the training data is too large to fit in memory, we adopt the standard hard negative mining method
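The mining step itself is simple to sketch: rescore the current negatives and keep the highest-scoring (most confusing) ones for the next training round. `num_hard` is an illustrative parameter, not from the paper:

```python
def hard_negative_mining(scores, labels, num_hard):
    """Standard hard-negative mining sketch: among windows labelled
    negative (-1), pick the ones the current classifier scores highest
    (the worst false positives) to add to the next training round."""
    negatives = [i for i, lab in enumerate(labels) if lab == -1]
    hardest = sorted(negatives, key=lambda i: scores[i], reverse=True)
    return hardest[:num_hard]
```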
Visualization, ablation and modes of error
Visualizing learned features
We propose a simple (and complementary) non-parametric method that directly shows what the network learned.
We compute the unit’s activations on a large set of held-out region proposals, perform non-maximum suppression, and then display the top-scoring regions.
Bounding-box regression
After scoring each selective-search proposal with a class-specific SVM, we predict a new bounding box for the detection using a class-specific bounding-box regressor.
We regress from features computed by the CNN, rather than from geometric features computed on the inferred DPM part locations.
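The regression targets from the paper (a scale-invariant shift of the box centre plus a log-space scaling of width and height) can be computed as:

```python
import math

def bbox_regression_targets(proposal, gt):
    """R-CNN bounding-box regression targets: the linear regressor maps
    CNN features of the proposal to (tx, ty, tw, th), a scale-invariant
    centre translation and log-space width/height scaling. Boxes are
    (x1, y1, x2, y2)."""
    px, py = (proposal[0] + proposal[2]) / 2, (proposal[1] + proposal[3]) / 2
    pw, ph = proposal[2] - proposal[0], proposal[3] - proposal[1]
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))
```

An identical proposal and ground-truth box yields all-zero targets, which is the sanity check for the parameterization.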