[Deep learning] Object detection network structure: R-CNN

The algorithm is divided into 4 steps:

1. Generate 1k~2k candidate regions

        Use the selective search method to generate roughly 2000 candidate regions from an image:

                 (1) Use an over-segmentation method to divide the image into small areas

                 (2) Examine the existing small regions, merge the two most similar ones, and repeat until the entire image has been merged into a single region.

                            Merging is prioritized by four rules:
                            - regions with similar colors (compared via color histograms);
                            - regions with similar textures (compared via gradient histograms);
                            - pairs whose combined area is small (this keeps the merging uniform and prevents one large region from swallowing its small neighbors one after another);
                            - pairs whose combined area fills most of its bounding box (this keeps the merged shape regular).

                            These four rules involve only the color histogram, texture histogram, area, and position of a region, so the features of a merged region can be computed directly from the features of its sub-regions. This makes the merging fast.

                 (3) Output every region that ever existed during the merging; these are the candidate regions.

                            To miss as few candidate regions as possible, the procedure above is run in several color spaces at once (RGB, HSV, Lab, etc.). Within each color space, different combinations of the four rules are used for merging. The results from all color spaces and all rule combinations are deduplicated and output together as the candidate regions.

         Candidate-region generation is relatively independent of the subsequent steps, so in practice any proposal algorithm could be used in its place.
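The greedy hierarchical merging described above can be sketched in a few lines. This is a toy illustration, not the full selective search algorithm: it assumes regions are represented only by a color histogram and an area, and it uses just two of the four similarity rules (color and size); the region tuples and function names are invented for the example.

```python
import numpy as np

def color_similarity(h1, h2):
    # histogram intersection of L1-normalised color histograms
    return np.minimum(h1, h2).sum()

def size_similarity(s1, s2, image_size):
    # favors merging small regions first, keeping growth uniform
    return 1.0 - (s1 + s2) / image_size

def merge_hierarchy(regions, image_size):
    """Greedily merge the most similar pair until one region remains;
    every intermediate region is a candidate. Each region is a
    (histogram, size) tuple."""
    candidates = list(regions)
    regions = list(regions)
    while len(regions) > 1:
        best, bi, bj = -np.inf, 0, 1
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                s = (color_similarity(regions[i][0], regions[j][0])
                     + size_similarity(regions[i][1], regions[j][1], image_size))
                if s > best:
                    best, bi, bj = s, i, j
        hi, si = regions[bi]
        hj, sj = regions[bj]
        # the merged histogram is the size-weighted average of the parts,
        # so it is computed directly from the sub-region features
        merged = ((hi * si + hj * sj) / (si + sj), si + sj)
        regions = [r for k, r in enumerate(regions) if k not in (bi, bj)]
        regions.append(merged)
        candidates.append(merged)
    return candidates
```

Note how N initial regions always yield 2N−1 candidates: the N originals plus N−1 merge results.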

2. For each candidate region, use a deep network to extract features

        Preprocessing: before the deep network extracts features, each candidate region is first normalized to a fixed size of 227×227. Several details here can be varied: the amount of context padding, whether to preserve the aspect ratio when warping, and whether to crop the out-of-box area directly or fill it with gray. Each choice slightly affects performance.
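The preprocessing step might look like the sketch below. It is an illustration only: it assumes the "add context padding, then warp anisotropically" variant, clips the expanded box at the image border instead of gray-filling, and uses a nearest-neighbour resize to stay dependency-free; the function name and signature are invented.

```python
import numpy as np

def warp_region(image, box, out_size=227, padding=16):
    """Crop a candidate box with `padding` pixels of surrounding context,
    then warp it (ignoring aspect ratio) to out_size x out_size.
    `box` is (x1, y1, x2, y2) in pixel coordinates."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    # expand the box; clip at the image border (a simple alternative
    # to gray-filling the out-of-image area)
    x1, y1 = max(0, x1 - padding), max(0, y1 - padding)
    x2, y2 = min(w, x2 + padding), min(h, y2 + padding)
    crop = image[y1:y2, x1:x2]
    # nearest-neighbour resize, to keep the sketch self-contained
    ys = (np.arange(out_size) * crop.shape[0] / out_size).astype(int)
    xs = (np.arange(out_size) * crop.shape[1] / out_size).astype(int)
    return crop[ys][:, xs]
```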

        Network pre-training: the network borrows from the 2012 ImageNet classification network from Hinton's group (AlexNet), slightly simplified. It extracts 4096-dimensional features, which are fed into a 4096→1000 fully connected layer for classification; the learning rate is 0.01. [Training data: all of ILSVRC 2012. Input an image, output a 1000-dimensional category label.]

        Network fine-tuning: the same network is used, with the last layer replaced by a 4096→21 fully connected layer. The learning rate is 0.001, and each batch contains 32 positive samples (across all classes) and 96 background (negative) samples. [Training data: the PASCAL VOC 2007 training set. Input an image, output a 21-dimensional category label covering 20 classes plus background. For each candidate box, consider the ground-truth (calibration) box on the image with which it overlaps most: if the overlap ratio is greater than 0.5, the candidate is labeled with that ground-truth class; otherwise it is labeled background.]
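The overlap measure and the fine-tuning label assignment can be written down concretely. The IoU formula below is standard; the labeling helper is a sketch of the rule just described (IoU > 0.5 with the best ground-truth box → that class, else background), with invented names.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_for_finetuning(proposal, gt_boxes, gt_labels, background=0):
    """Assign the fine-tuning label: the class of the best-overlapping
    ground-truth box if IoU > 0.5, otherwise background."""
    if not gt_boxes:
        return background
    overlaps = [iou(proposal, g) for g in gt_boxes]
    best = int(np.argmax(overlaps))
    return gt_labels[best] if overlaps[best] > 0.5 else background
```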

3. The features are fed to a per-category SVM classifier to decide whether they belong to that category

        For each object class, a linear two-class SVM does the discrimination. The input is the 4096-dimensional feature output by the deep network; the output is whether the region belongs to the class. Because there are far more negative samples than positives, hard negative mining is used: after each training round, the negatives the classifier still gets wrong (the stubborn, tricky ones) are collected and fed back into the next round, repeating until performance no longer improves. This process is called "hard negative mining".

        Positive samples: the ground-truth calibration boxes of this class. 
        Negative samples: every candidate box whose overlap with all calibration boxes of this class is less than 0.3.
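One round of hard negative mining reduces to a simple filter over SVM scores. This is a minimal sketch under assumed conventions (a linear model scored as X·w + b, and a score threshold standing in for the margin test); the function name and threshold are illustrative, not from the paper.

```python
import numpy as np

def mine_hard_negatives(w, b, negative_feats, margin=-1.0):
    """Keep only the negatives the current linear SVM fails to reject
    confidently, i.e. those scoring above the negative margin. These
    are fed back into the next training round; the loop stops when the
    validation score no longer improves."""
    scores = negative_feats @ w + b
    return negative_feats[scores > margin]
```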

        Finally, the NMS (non-maximum suppression) algorithm selects the final few boxes: when two boxes overlap heavily, the one with the higher confidence is kept.
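Greedy NMS is short enough to show in full. This is the standard algorithm the text refers to; the 0.3 threshold is a common default, not a value stated here.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop every remaining box overlapping it by more than the threshold,
    and repeat. Boxes are (x1, y1, x2, y2); returns kept indices."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the kept box against all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        overlap = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][overlap <= iou_threshold]
    return keep
```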

4. Use the regression to fine-tune the position of the candidate frame

        This step does not change the number of final boxes, only their size and shape.

        Object detection is measured by overlap area: many seemingly accurate detections score poorly simply because the candidate box is not precise enough and the overlap is small. A position-refinement step is therefore needed. 
        Regressor: for each object class, a linear ridge regressor is used for the refinement (in linear regression, adding an L2-norm penalty to the model gives ridge regression; adding an L1-norm penalty gives lasso regression; see https://blog.csdn.net/aoulun/article/details/78688572 ). The regularization term is λ=10000. The input is the 4096-dimensional feature of the deep network's pool5 layer, and the output is a scaling and a translation in the x and y directions. 
        Training samples: the candidate boxes assigned to this class whose overlap with a ground-truth box is greater than 0.6.
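The refinement is an ordinary regression problem: learn to map a proposal's features to a correction of its box. The sketch below uses the scale-invariant parameterisation from the R-CNN paper (centre offsets scaled by the proposal size, log width/height ratios) together with closed-form ridge regression; feeding it pool5 features is assumed, and the function names are invented.

```python
import numpy as np

def regression_targets(p, g):
    """R-CNN box-regression targets for proposal p and ground truth g,
    both given as (cx, cy, w, h): translation of the centre scaled by
    the proposal size, plus log scaling of width and height."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)])

def fit_ridge(X, T, lam=10000.0):
    """Closed-form ridge regression, one weight vector per target
    dimension: W = (X^T X + lam*I)^-1 X^T T."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ T)

def apply_regression(p, t):
    """Invert the parameterisation: refine proposal p by predicted t."""
    px, py, pw, ph = p
    tx, ty, tw, th = t
    return np.array([px + tx * pw, py + ty * ph,
                     pw * np.exp(tw), ph * np.exp(th)])
```

Applying a proposal's own targets recovers the ground-truth box exactly, which is a quick sanity check on the parameterisation.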

 

The training and testing processes are completely separate: the CNN and the SVM classifiers are all trained on the training set. At test time, an image goes in and boxes come out.

 

This article reference: https://blog.csdn.net/shenxiaolu1984/article/details/51066975

Recommended reference (more detailed): https://www.cnblogs.com/zyber/p/6672144.html


Origin: blog.csdn.net/Sun7_She/article/details/90292652