[Object Detection] A Detailed Explanation of the RCNN Algorithm

Reprinted from: https://blog.csdn.net/shenxiaolu1984/article/details/51066975#fnref:5

Girshick, Ross, et al. “Rich feature hierarchies for accurate object detection and semantic segmentation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.

Region CNN (RCNN) can be considered the pioneering work in applying deep learning to object detection. Its author, Ross Girshick, has won the PASCAL VOC object detection competition multiple times; in 2010 he led a team that received the competition's Lifetime Achievement Award. He now works at Facebook AI Research (FAIR).
The idea of the paper is concise, and it delivered a significant improvement after the multi-year plateau of DPM-based methods. The line of detectors that begins with this paper (RCNN, Fast RCNN, Faster RCNN) represented the frontier of object detection at the time, and Caffe-based source code is available on GitHub.

Idea

This paper addresses two key issues in object detection.

Problem 1: Speed

Classical object detection algorithms use a sliding window to examine every possible region in turn. This paper instead extracts, in advance, a set of candidate regions that are more likely to contain objects, and then extracts features only on these candidate regions for classification.

Problem 2: Training Data

Classical object detection algorithms extract hand-crafted features (Haar, HOG) from each region. This paper instead trains a deep network for feature extraction, and two datasets are available: 
- A larger recognition dataset (ImageNet ILSVRC 2012), which labels the object class in each image: about 1.2 million images, 1000 categories. 
- A smaller detection dataset (PASCAL VOC 2007), which labels the class and the location of the objects in each image: about ten thousand images, 20 categories. 
This paper pre-trains on the recognition dataset, then fine-tunes the parameters on the detection dataset, and finally evaluates on the detection dataset.

Pipeline

The RCNN algorithm consists of four steps (a minimal code sketch follows the figure below): 
- Generate 1K~2K candidate regions from the input image 
- Extract features from each candidate region with a deep network 
- Feed the features into one SVM classifier per class to decide whether the region belongs to that class 
- Use a regressor to finely correct the position of the candidate box 
[Figure: the four-step RCNN pipeline]
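
As a minimal sketch of these four steps (not the paper's actual Caffe code; `selective_search`, `extract_features`, `svms`, and `regressors` are hypothetical stand-ins for the components detailed in the rest of this post):

```python
def rcnn_detect(image, selective_search, extract_features, svms, regressors):
    """Run the four RCNN stages on a single image (illustrative sketch)."""
    # 1. Generate ~2000 class-agnostic candidate regions.
    proposals = selective_search(image)

    detections = []
    for box in proposals:
        # 2. Warp the region to 227x227 and extract a 4096-d CNN feature.
        feature = extract_features(image, box)

        # 3. Score the feature with one binary SVM per class.
        for cls, svm in svms.items():
            score = svm.decision_function([feature])[0]
            if score > 0:
                # 4. Refine the box with the class-specific regressor.
                refined = regressors[cls].predict(feature, box)
                detections.append((cls, refined, score))

    # Per-class non-maximum suppression would be applied here.
    return detections
```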

Candidate region generation

The Selective Search [1] method is used to generate about 2000~3000 candidate regions per image. The basic idea is as follows: 
- Use an over-segmentation method to divide the image into many small regions 
- Among the existing regions, merge the two that are most likely to belong to the same object; repeat until the entire image has been merged into a single region 
- Output all regions that existed at any point during this process; these are the candidate regions

Candidate region generation is relatively independent of the subsequent steps; in practice any proposal algorithm could be used.
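
A skeleton of the hierarchical grouping loop might look like the following sketch; `oversegment`, `neighbours`, `similarity`, `merge`, and `bbox` are hypothetical caller-supplied components (this is not the released MATLAB code):

```python
def selective_search(image, oversegment, neighbours, similarity, merge, bbox):
    """Hierarchical grouping skeleton; every argument except `image`
    is a caller-supplied component."""
    regions = oversegment(image)      # initial over-segmented small regions
    candidates = list(regions)        # every region ever formed is a proposal

    # Similarities between all neighbouring pairs (dedup by object id).
    sims = {(a, b): similarity(a, b)
            for a in regions for b in neighbours(a, regions) if id(a) < id(b)}

    while sims:
        a, b = max(sims, key=sims.get)    # most similar neighbouring pair
        merged = merge(a, b)
        regions = [r for r in regions if r is not a and r is not b] + [merged]
        candidates.append(merged)

        # Invalidate similarities involving a or b; add ones for the new region.
        sims = {p: s for p, s in sims.items() if a not in p and b not in p}
        for r in neighbours(merged, regions):
            if r is not merged:
                sims[(merged, r)] = similarity(merged, r)

    return [bbox(r) for r in candidates]
```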

Merge rules

Pairs of regions satisfying the following four criteria are merged first: 
- Similar in color (color histogram) 
- Similar in texture (gradient histogram) 
- Small in total area after merging 
- After merging, the total area occupies a large proportion of the merged bounding box (BBox)

The third rule ensures that merging proceeds at a relatively uniform scale, preventing one large region from "eating up" the surrounding small regions one after another.

Example: suppose there are regions a through h. A good merge order is: ab-cd-ef-gh -> abcd-efgh -> abcdefgh. 
A bad merge order, in which one region keeps swallowing the others, is: ab-cd-ef-gh -> abcd-ef-gh -> abcdef-gh -> abcdefgh.

The fourth rule ensures that the merged regions have regular shapes.

Example: the regions on the left are suitable for merging, while those on the right are not. 
[Figure: regions suitable vs. unsuitable for merging]

The above four rules depend only on the color histogram, texture histogram, area, and position of the regions. The features of a merged region can be computed directly from the features of its sub-regions, which makes merging fast. A sketch of these similarity measures follows.
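
As an illustration of the four measures, here is a sketch assuming each region is a dict carrying an L1-normalized color histogram, an L1-normalized texture histogram, its pixel count `size`, and its bounding box `bbox` as (x1, y1, x2, y2); the formulas follow the Selective Search paper [1]:

```python
import numpy as np

def joint_bbox_area(r1, r2):
    # Area of the tight bounding box enclosing both regions.
    x1 = min(r1["bbox"][0], r2["bbox"][0]); y1 = min(r1["bbox"][1], r2["bbox"][1])
    x2 = max(r1["bbox"][2], r2["bbox"][2]); y2 = max(r1["bbox"][3], r2["bbox"][3])
    return (x2 - x1) * (y2 - y1)

def s_color(r1, r2):
    # Histogram intersection of the color histograms.
    return np.minimum(r1["color_hist"], r2["color_hist"]).sum()

def s_texture(r1, r2):
    # Histogram intersection of the gradient (texture) histograms.
    return np.minimum(r1["texture_hist"], r2["texture_hist"]).sum()

def s_size(r1, r2, image_size):
    # High when the pair is small: prefer merging small regions first.
    return 1.0 - (r1["size"] + r2["size"]) / image_size

def s_fill(r1, r2, image_size):
    # High when the merged pair fills its joint bounding box tightly.
    gap = joint_bbox_area(r1, r2) - r1["size"] - r2["size"]
    return 1.0 - gap / image_size

def similarity(r1, r2, image_size, weights=(1, 1, 1, 1)):
    # Different 0/1 weightings give the diversified strategies
    # described in the next subsection.
    w1, w2, w3, w4 = weights
    return (w1 * s_color(r1, r2) + w2 * s_texture(r1, r2)
            + w3 * s_size(r1, r2, image_size) + w4 * s_fill(r1, r2, image_size))
```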

Diversification and Postprocessing

To miss as few candidate regions as possible, the above operations are carried out simultaneously in multiple color spaces (RGB, HSV, Lab, etc.). Within each color space, different combinations of the four rules are used for merging. The results from all color spaces and all rule combinations are deduplicated and output as the final candidate regions, as sketched below.
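
Schematically (a sketch only; `converters`, `strategies`, and `run_grouping` are hypothetical hooks into the grouping loop sketched earlier):

```python
def diversified_proposals(image, converters, strategies, run_grouping):
    """Pool proposals from several color spaces and rule combinations.

    converters:   functions mapping the image into e.g. RGB, HSV, Lab.
    strategies:   0/1 weight tuples selecting subsets of the four rules.
    run_grouping: the hierarchical grouping above, bound to one weighting.
    """
    boxes = set()
    for convert in converters:
        converted = convert(image)
        for weights in strategies:
            for box in run_grouping(converted, weights):
                boxes.add(tuple(box))     # deduplicate across all runs
    return [list(b) for b in boxes]
```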

The author provides the source code of Selective Search, but it contains many .p and .mex files, which makes the concrete implementation difficult to inspect.

Feature extraction

Preprocessing

Before features are extracted with the deep network, each candidate region is first warped to a uniform size of 227×227. 
Several details here can be varied: how much the box is expanded to include context, whether the aspect ratio is preserved during warping, and whether regions outside the box are cropped directly or padded with gray. These choices slightly affect performance.
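
A sketch of one common choice, expanding the box so that about 16 pixels of context surround the object after warping (the variant the paper reports working best); OpenCV is assumed, and where the paper replaces context falling outside the image with the image mean, this sketch simply clips:

```python
import cv2

def warp_region(image, box, out_size=227, padding=16):
    """Crop a proposal with surrounding context and warp it to 227x227.

    After warping, roughly `padding` pixels of context surround the object
    on each side (the paper reports p = 16 working best).
    """
    x1, y1, x2, y2 = box
    # Expand the box so the object occupies (out_size - 2*padding) pixels.
    scale = out_size / (out_size - 2 * padding)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    x1, x2 = int(cx - w / 2), int(cx + w / 2)
    y1, y2 = int(cy - h / 2), int(cy + h / 2)

    # Clip to the image, crop, and warp anisotropically to the target size.
    H, W = image.shape[:2]
    crop = image[max(0, y1):min(H, y2), max(0, x1):min(W, x2)]
    return cv2.resize(crop, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
```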

Pre-training

Network structure 
The network basically follows Hinton's group's 2012 ImageNet classification network [2], slightly simplified [3]. 
[Figure: the network architecture] 
The features extracted by this network are 4096-dimensional; they are fed into a 4096->1000 fully connected (fc) layer for classification. 
The learning rate is 0.01.

Training data 
All of the ILSVRC 2012 data is used for training; the input is an image and the output is a 1000-dimensional category label.
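
For illustration only (the paper used a Caffe implementation; this sketch substitutes torchvision's pretrained AlexNet as a stand-in), extracting the 4096-dimensional feature from a warped region might look like:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# AlexNet pretrained on ImageNet, standing in for the paper's Caffe model.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.eval()

# Drop the final 4096->1000 classification layer to expose 4096-d features.
alexnet.classifier = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])

preprocess = T.Compose([
    T.ToTensor(),
    T.Resize((227, 227)),   # the warped proposal size used by RCNN
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def fc_feature(warped_region):
    """4096-d feature for one warped proposal (an HxWx3 uint8 array)."""
    with torch.no_grad():
        x = preprocess(warped_region).unsqueeze(0)
        return alexnet(x).squeeze(0).numpy()
```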

Fine-tuning

Network structure 
The same network as above is used, with the last layer replaced by a 4096->21 fully connected layer. 
The learning rate is 0.001; each mini-batch contains 32 positive samples (spanning the 20 classes) and 96 background samples.

Training data 
The training set of PASCAL VOC 2007 is used; the input is an image and the output is a 21-dimensional category label (20 classes + background). 
For each candidate box, consider the ground-truth box on the current image with which it overlaps most. If this overlap ratio is greater than 0.5, the candidate box takes the class of that ground-truth box; otherwise the candidate box is treated as background.
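
A sketch of this label assignment, with boxes as (x1, y1, x2, y2) tuples:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

BACKGROUND = 20  # index of the background class among the 21 outputs

def finetune_label(proposal, gt_boxes, gt_classes, threshold=0.5):
    """21-way class label for one proposal during fine-tuning."""
    if not gt_boxes:
        return BACKGROUND
    overlaps = [iou(proposal, gt) for gt in gt_boxes]
    best = max(range(len(overlaps)), key=overlaps.__getitem__)
    return gt_classes[best] if overlaps[best] > threshold else BACKGROUND
```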

Category judgment

Classifier 
One linear binary SVM per target class is used for the decision. The input is the 4096-dimensional feature output by the deep network; the output is whether the region belongs to that class. 
Since there are many negative samples, hard negative mining is used. 
Positive samples: the ground-truth boxes of this class. 
Negative samples: every candidate box whose overlap with all ground-truth boxes of this class is below 0.3.
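
A sketch of per-class training with hard negative mining; scikit-learn's LinearSVC and the C value are stand-ins, not the paper's exact liblinear setup:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_class_svm(pos_feats, neg_feats, rounds=3, initial=10000):
    """Train one binary SVM with hard negative mining (a sketch).

    pos_feats: (P, 4096) features of ground-truth boxes of this class.
    neg_feats: (N, 4096) features of candidate boxes with IoU < 0.3
               against every ground-truth box of this class.
    """
    rng = np.random.default_rng(0)
    # Start from a random subset of the (very large) negative pool.
    active = rng.choice(len(neg_feats), size=min(initial, len(neg_feats)),
                        replace=False)

    svm = LinearSVC(C=0.001)  # C is an assumed value, not from the paper
    for _ in range(rounds):
        X = np.vstack([pos_feats, neg_feats[active]])
        y = np.concatenate([np.ones(len(pos_feats)), np.zeros(len(active))])
        svm.fit(X, y)

        # Hard negatives: negatives the current model scores above -1
        # (inside or on the wrong side of the margin).
        scores = svm.decision_function(neg_feats)
        hard = np.where(scores > -1)[0]
        active = np.union1d(active, hard)
    return svm
```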

 

Location refinement

The evaluation measure in object detection is the overlap area: many detections that look accurate fail simply because the candidate box is not precise enough and the overlap area is too small. A position-refinement step is therefore needed. 
Regressor 
A linear ridge regressor is trained for each class of targets, with regularization term λ = 10000. 
The input is the pool5-layer feature of the deep network, and the output is the scaling and translation of the box in the x and y directions. 
Training samples 
Candidate boxes that have been assigned to this class and whose overlap area with the ground truth is greater than 0.6.
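
For reference, the regression targets follow the standard RCNN parameterization (translation normalized by the proposal size, log-space scaling), and ridge regression has a closed-form solution; a sketch:

```python
import numpy as np

def regression_targets(P, G):
    """RCNN box-regression targets.

    P, G: (cx, cy, w, h) of the proposal and its matched ground truth.
    """
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)])

def fit_ridge(features, targets, lam=10000.0):
    """Closed-form ridge regression: W = (X^T X + lam*I)^(-1) X^T T.

    features: (N, D) pool5 features of proposals with IoU > 0.6.
    targets:  (N, 4) regression targets from above.
    """
    X, T = features, targets
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ T)

def apply_regression(P, W, feature):
    """Refine one proposal with the learned weights."""
    tx, ty, tw, th = feature @ W
    px, py, pw, ph = P
    return (px + pw * tx, py + ph * ty, pw * np.exp(tw), ph * np.exp(th))
```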

Results

When the paper was published in 2014, DPM had entered a bottleneck period: even complex features and structures brought very limited improvement. This paper introduced deep learning into the detection field and raised the detection performance (mAP) on PASCAL VOC from 35.1% to 53.7% in one stroke. 
The first two steps of the pipeline (candidate region extraction + feature extraction) are independent of the classes to be detected and can be shared across categories; they take about 13 seconds per image on a GPU. 
When detecting many classes simultaneously, only the last two steps (classification + refinement) scale with the number of classes, and these are simple, fast linear operations: even for 100K categories they take only about 10 seconds.

Building on this paper, the subsequent Fast RCNN [4] and Faster RCNN [5] made rapid progress in speed, essentially solving the object detection problem on PASCAL VOC.


1. J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.
2. A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
3. All layers are serial; the ReLU layers are in-place operations and are drawn to the left in the figure.
4. Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2015.
5. Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
