RCNN paper notes

Rich feature hierarchies for accurate object detection and semantic segmentation

RCNN object detection

Model structure

The overall model consists of three sub-models:
* The first generates category-independent region proposals; these proposals define the set of candidate detections available to the detector.
* The second is a large convolutional neural network that extracts a fixed-length feature vector from each region.
* The third is a set of class-specific linear SVMs (the sketch after this list shows how the stages fit together).
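A minimal sketch, assuming hypothetical stand-ins `selective_search`, `cnn_features`, and per-class SVM weights (none of these names come from the paper), of how the three stages combine at inference time:

```python
import numpy as np

def detect(image, cnn_features, svms, selective_search):
    """Sketch of the R-CNN pipeline: proposals -> CNN features -> per-class SVM scores.

    `selective_search`, `cnn_features`, and `svms` are illustrative placeholders
    for the components described in these notes."""
    proposals = selective_search(image)                                 # ~2000 class-agnostic boxes
    feats = np.stack([cnn_features(image, box) for box in proposals])   # (R, 4096) feature matrix
    scores = {cls: feats @ w + b for cls, (w, b) in svms.items()}       # one score per class per region
    return proposals, scores
```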

Region proposals

selective search
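The notes do not implement Selective Search; as an illustration only, the OpenCV contrib module (`opencv-contrib-python`) ships an implementation, which is not what the paper used but produces comparable class-agnostic proposals:

```python
import cv2

def selective_search(image_bgr, fast=True, max_proposals=2000):
    """Generate class-agnostic region proposals with OpenCV's Selective Search
    (opencv-contrib); returns boxes as (x, y, w, h)."""
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image_bgr)
    if fast:
        ss.switchToSelectiveSearchFast()      # faster, coarser proposal set
    else:
        ss.switchToSelectiveSearchQuality()   # slower, more proposals
    boxes = ss.process()
    return boxes[:max_proposals]
```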

Feature extraction

A 4096-dimensional feature vector is extracted from each region proposal using Caffe.
To make each region match the required 227×227 input size, we warp all pixels in a tight bounding box around it to that size.
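A sketch of that warping step, assuming OpenCV and boxes given as (x, y, w, h): crop the tight bounding box and anisotropically resize it to 227×227, ignoring the aspect ratio.

```python
import cv2

def warp_region(image, box, out_size=227):
    """Warp a proposal to the fixed CNN input size (227x227) by cropping the
    tight bounding box and resizing it without preserving aspect ratio."""
    x, y, w, h = box
    crop = image[y:y + h, x:x + w]
    return cv2.resize(crop, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
```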

Test-time detection

At test time, roughly 2000 region proposals are extracted with selective search, each proposal is warped, and the warped regions are fed through the CNN to extract features. Then, for each category, the SVM scores every region, and over all scored regions in an image we apply greedy non-maximum suppression (for each class independently), which rejects a region if it has an intersection-over-union (IoU) overlap with a higher-scoring selected region larger than a learned threshold.
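A standard greedy NMS sketch matching that description (boxes are (x1, y1, x2, y2); `thresh` stands for the learned per-class IoU threshold):

```python
import numpy as np

def nms(boxes, scores, thresh):
    """Greedy non-maximum suppression: keep the highest-scoring box, drop any
    remaining box whose IoU with it exceeds `thresh`, and repeat."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the current best box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= thresh]    # keep only boxes that overlap less than the threshold
    return keep
```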

Training

Supervised pre-training

The CNN is first pre-trained with supervision on a large auxiliary dataset (ILSVRC 2012 image-level classification) using Caffe.

Domain-specific fine-tuning

To adapt the CNN to the new task (detection) and the new domain (warped proposal windows), training is continued with SGD on warped region proposals. The classification layer is replaced with a randomly initialized (N+1)-way layer, where N is the number of object classes and the extra 1 is the background.

We treat all region proposals with >= 0.5 IoU overlap with a ground-truth box as positives and the rest as negatives.

In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128.

We bias the sampling towards positive windows because they are extremely rare compared to background.
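A sketch of that biased sampling (the 32/96 split and batch size 128 are from the paper; the helper itself and its index-based interface are illustrative):

```python
import numpy as np

def sample_minibatch(positive_idx, background_idx, n_pos=32, n_bg=96, rng=None):
    """Uniformly sample 32 positive and 96 background windows (with replacement
    when a pool is smaller than requested) to form a 128-window mini-batch."""
    rng = rng or np.random.default_rng()
    pos = rng.choice(positive_idx, size=n_pos, replace=len(positive_idx) < n_pos)
    bg = rng.choice(background_idx, size=n_bg, replace=len(background_idx) < n_bg)
    batch = np.concatenate([pos, bg])
    rng.shuffle(batch)
    return batch
```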

Object category classifiers

A region that tightly encloses an object is clearly a positive example, and a region that contains no part of the object is clearly a negative. But how should a region that only partially overlaps the object be labeled?
We resolve this issue with an IoU overlap threshold, below which regions are defined as negatives. The overlap threshold, 0.3, was selected by a grid search over {0, 0.1, …, 0.5} on a validation set.
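As an illustration of that labeling rule: compute the proposal's best IoU against the class's ground-truth boxes and mark it negative below 0.3 (in the paper, only the ground-truth boxes themselves serve as SVM positives, so in-between proposals are ignored). The helper names are assumptions, not the paper's code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    iw, ih = max(0, ix2 - ix1 + 1), max(0, iy2 - iy1 + 1)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0] + 1) * (box_a[3] - box_a[1] + 1)
    area_b = (box_b[2] - box_b[0] + 1) * (box_b[3] - box_b[1] + 1)
    return inter / float(area_a + area_b - inter)

def svm_label(proposal, gt_boxes, neg_thresh=0.3):
    """Label a proposal for SVM training: negative if its best IoU with any
    ground-truth box of the class is below 0.3, otherwise ignored
    (positives are the ground-truth boxes themselves)."""
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    return -1 if best < neg_thresh else 0   # -1: negative, 0: ignore
```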

Since the training data is too large to fit in memory, we adopt the standard hard negative mining method.
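A rough sketch of a standard hard negative mining loop, assuming scikit-learn's `LinearSVC` as the classifier (the paper trains its SVMs differently; this only illustrates the mining idea of repeatedly adding negatives the current model gets wrong):

```python
import numpy as np
from sklearn.svm import LinearSVC

def hard_negative_mining(pos_feats, neg_feats, rounds=3, initial_neg=5000):
    """Train an SVM on positives plus an initial negative pool, then repeatedly
    add negatives the current model scores near or above the margin ("hard"
    negatives) and retrain."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(neg_feats), size=min(initial_neg, len(neg_feats)), replace=False)
    active = set(idx.tolist())
    clf = LinearSVC(C=0.001)
    for _ in range(rounds):
        X = np.vstack([pos_feats, neg_feats[sorted(active)]])
        y = np.concatenate([np.ones(len(pos_feats)), -np.ones(len(active))])
        clf.fit(X, y)
        scores = clf.decision_function(neg_feats)
        hard = np.flatnonzero(scores > -1.0)   # negatives inside or beyond the margin
        active.update(hard.tolist())
    return clf
```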

Visualization, ablation and modes of error

Visualizing learned features

We propose a simple (and complementary) non-parametric method that directly shows what the network learned.

We compute the unit’s activations on a large set of held-out region proposals, perform non-maximum suppression, and then display the top-scoring regions.
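A sketch of that procedure, reusing the `nms` helper above (the `nms_thresh` value and function name are assumptions): score held-out proposals by one unit's activation, suppress near-duplicates, and keep the top survivors.

```python
def top_activating_regions(activations, boxes, top_k=16, nms_thresh=0.3):
    """Rank held-out region proposals by a single unit's activation, remove
    near-duplicate boxes with NMS, and return the top-scoring survivors."""
    keep = nms(boxes, activations, nms_thresh)   # `keep` comes back in descending-score order
    return keep[:top_k]
```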

Bounding-box regression

After scoring each proposal with the class-specific SVM, a class-specific bounding-box regressor predicts a new, refined bounding box for the detection.
We regress from features computed by the CNN, rather than from geometric features computed on the inferred DPM part location.
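A sketch of applying the regression transform from the paper: the regressor (trained on CNN features) outputs offsets (dx, dy, dw, dh) that shift the proposal's center by a width/height-scaled amount and rescale its size by exp(dw), exp(dh). How the offsets are predicted from features is omitted here.

```python
import numpy as np

def apply_bbox_regression(box, deltas):
    """Refine a proposal box (x1, y1, x2, y2) with predicted offsets
    (dx, dy, dw, dh), following the R-CNN box-regression transform."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1 + 1.0, y2 - y1 + 1.0
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    new_cx, new_cy = cx + dx * w, cy + dy * h        # shift center
    new_w, new_h = w * np.exp(dw), h * np.exp(dh)    # rescale width/height
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w - 1.0, new_cy + 0.5 * new_h - 1.0)
```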
