The pioneering object detection algorithm - RCNN study notes

1. Background and introduction

Object Detection

Before studying object detection algorithms, let's first understand what object detection is.
For me (and, I believe, for most people), the entry project for learning neural networks is handwritten digit recognition. Both the training set and the test data consist of images that each contain a single digit; such an image is fed to the model as input for prediction. The same holds for cat-vs-dog recognition (see Andrew Ng's course for details): each image in the dataset contains either a dog or a cat, and the model's prediction is a single result, cat or dog. Such classification tasks are usually called object recognition tasks.
(Given a cat picture as input, the model's prediction is simply the label "cat".) The task of object detection is to locate the position of the target in the picture while also classifying the object.

The birth of RCNN

The full name of RCNN is Regions with CNN features. It is the pioneering work that applied deep learning to object detection, proposed by Ross Girshick in the paper "Rich feature hierarchies for accurate object detection and semantic segmentation".

RCNN raised the detection rate (mAP) from 35.1% to 53.7% on the PASCAL VOC 2010 dataset, made CNNs the norm in object detection, and prompted everyone to explore the great potential of CNNs in other computer-vision fields. It should also be mentioned that RCNN is a two-stage object detection algorithm.

The source code given by the author: RCNN

2. Detailed explanation of RCNN algorithm

2.1 First understanding of RCNN network structure

(Figure: the RCNN pipeline, from Ross Girshick's paper.)
From the figure we can clearly see the flow: the input image first goes through a region-proposal method to obtain about 2000 candidate regions; each region proposal that may contain a target is then sent to a CNN for feature extraction; finally, the extracted features are sent to a classifier for object classification.

At this point you probably have many questions, as I did at first: how are these candidate boxes generated? How are suitable target-containing boxes selected and sent to the CNN for training? How is the final classification done? In what follows, we will gradually uncover the mystery of the RCNN object detection algorithm.

2.2 Training process of RCNN algorithm

Pipeline: input image --> generate candidate boxes --> resize (warp) each candidate box to 227*227 (because the chosen feature-extraction network is AlexNet) --> feed it into the feature-extraction network --> train one SVM classifier per class to classify the CNN's output features --> bbox regression
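As a rough Python sketch of the pipeline above (every function body here is an illustrative placeholder, not the author's code), the stages fit together like this:

```python
# Hypothetical skeleton of the RCNN pipeline described above.
# selective_search and cnn_features are placeholders, not real implementations.
import numpy as np

def selective_search(image):
    # Placeholder: return ~2000 candidate boxes as (x1, y1, x2, y2).
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    tl = rng.integers(0, min(h, w) // 2, size=(2000, 2))
    wh = rng.integers(16, min(h, w) // 2, size=(2000, 2))
    return np.concatenate([tl, tl + wh], axis=1)

def warp(image, box, size=227):
    # Crop the region and resize it (nearest neighbour) to 227x227.
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    ys = np.linspace(0, crop.shape[0] - 1, size).astype(int)
    xs = np.linspace(0, crop.shape[1] - 1, size).astype(int)
    return crop[ys][:, xs]

def cnn_features(region):
    # Placeholder for AlexNet's fc7 output: a 4096-d vector per region.
    return np.zeros(4096)

image = np.zeros((500, 500, 3), dtype=np.uint8)
proposals = selective_search(image)                      # ~2000 boxes
feats = np.stack([cnn_features(warp(image, b)) for b in proposals[:5]])
print(feats.shape)  # (5, 4096) -- one 4096-d vector per warped proposal
```

The SVM classification and bbox-regression stages then consume these per-proposal feature vectors.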

2.2.1 Generation of Region proposals (candidate boxes)

There are many ways to generate region proposals, for example:
objectness
selective search
category-independent object proposals
constrained parametric min-cuts (CPMC)
multi-scale combinatorial grouping
Ciresan et al.
Among them, RCNN uses the selective search method. Its advantages are speed and a high recall rate, which makes it the most commonly used region-generation method. The selective search algorithm proceeds in the following steps:

Step 1: Generate the initial region set R (from which ~2k region proposals are produced).
Step 2: Compute the similarity S = {s1, s2, ...} of every pair of adjacent regions in R.
Step 3: Find the two regions with the highest similarity, merge them into a new region, and add it to R.
Step 4: Remove from S all similarities involving the regions merged in Step 3.
Step 5: Compute the similarity between the new region and its neighbouring regions and add these to S.
Step 6: Repeat from Step 3 until S is empty, then exit the loop.
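The six steps can be sketched as a toy greedy merging loop. This is only an illustration: the similarity here is a placeholder over 1-D intervals, whereas real selective search combines colour, texture, size and fill similarities over an initial image segmentation.

```python
# Toy version of the selective-search merging loop (Steps 1-6).
import itertools

def similarity(a, b):
    # Placeholder similarity: overlap of 1-D intervals (start, end).
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def merge(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))

def selective_search_loop(regions):
    R = list(regions)                          # every region ever created
    active = set(range(len(R)))
    S = {frozenset(p): similarity(R[p[0]], R[p[1]])
         for p in itertools.combinations(active, 2)}  # Step 2
    while S:                                   # Step 6: loop until S is empty
        i, j = max(S, key=S.get)               # Step 3: most similar pair...
        R.append(merge(R[i], R[j]))            # ...merge and add to R
        new = len(R) - 1
        active -= {i, j}
        S = {k: v for k, v in S.items()
             if i not in k and j not in k}     # Step 4: drop stale pairs
        for k in active:                       # Step 5: sims to new region
            S[frozenset((k, new))] = similarity(R[k], R[new])
        active.add(new)
    return R                                   # all regions become proposals

R = selective_search_loop([(0, 2), (1, 3), (10, 12)])
print(R[-1])  # the final fully merged region: (0, 12)
```

Every intermediate region produced by a merge is kept in R, which is how one image yields proposals at many scales.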

In Step 2, the similarity between regions combines several measures, such as colour distance and texture distance; the various distances are weighted and combined to give the final inter-region similarity. The specific implementation can be learned from this blog.

2.2.2 Selection and pre-training of CNN convolutional network

For the CNN feature-extraction network, RCNN chooses the classic AlexNet (VGG16 is an alternative). The network is first trained on a large open dataset such as ImageNet ILSVRC 2012, yielding a 1000-class pre-trained model.
With these weights saved, we can fine-tune the network via transfer learning to fit the classification task we need; that is, the fully connected layers are retrained so the network generalises to the new classes. This is the fine-tuning technique.
Since AlexNet is used here, each input warped region (each candidate-region crop must be resized to 227*227 before being fed in) produces a 4096-dimensional feature vector at the network's output layer; with 2000 candidate boxes, this yields a 2000*4096-dimensional output.
Note that pre-training on ImageNet gives the network the ability to recognise categories, not the ability to regress bounding boxes.
In addition, ImageNet contains 1000 categories, but when RCNN performs transfer learning on the VOC dataset, it must classify 21 categories (20 object classes plus a background class).

RCNN computes the IoU between each candidate box and the labelled ground-truth box. If IoU > 0.5, the overlap between the candidate and the labelled box is large, so the candidate is treated as a positive example; otherwise, it is treated as a negative example.
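The IoU used for this positive/negative split is straightforward to compute; a minimal sketch with boxes given as (x1, y1, x2, y2) corner coordinates:

```python
# Intersection-over-Union between two axis-aligned boxes (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

score = iou((0, 0, 10, 10), (5, 0, 15, 10))   # intersection 50, union 150
print(score)                                   # 0.3333333333333333
label = "positive" if score > 0.5 else "negative"
print(label)                                   # negative
```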
The fine-tuning strategy uses SGD (stochastic gradient descent) with a learning rate of 0.001 and a mini-batch size of 128.

2.2.3 Extract and save the feature vectors of the RPs

After the network training in Section 2.2.2, we can load the saved model weights and feed in the candidate regions; the seventh layer (fc7) of AlexNet outputs a 4096-dimensional feature vector for each region, so 2000 candidate regions yield a 2000*4096 feature matrix. These features are sent to the next stage to help us recognise and classify objects.
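A sketch of this extract-and-cache step (the CNN call is a stand-in for the fine-tuned AlexNet truncated at fc7; the cache filename is illustrative):

```python
# Extract a 4096-d fc7 feature vector for each of the 2000 proposals
# and stack them into the 2000x4096 matrix used by the SVM stage.
import numpy as np

def fc7_features(warped_region):
    # Placeholder for AlexNet up to the seventh layer (fc7).
    return np.random.default_rng(0).standard_normal(4096)

# Generator of warped 227x227 crops (placeholders here).
proposals = (np.zeros((227, 227, 3), dtype=np.uint8) for _ in range(2000))
features = np.stack([fc7_features(p) for p in proposals])
print(features.shape)                  # (2000, 4096)
np.save("rp_features.npy", features)   # cache to disk for the SVM step
```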

2.2.4 SVM classifies the content of RP

From Section 2.2.3 we obtain a 2000*4096 feature matrix (2000 is the number of RP boxes, which is not necessarily exactly 2000 in practice, and 4096 is the dimension of each RP's feature vector). The author uses SVMs for classification. (SVMs were chosen because they support training on small sample sets and hold up well when the classes are imbalanced.)
Take the VOC dataset as an example: besides the 20 object categories there is also a background category, so there are 21 categories in total. Therefore, we need to train 21 SVM classifiers, each of which contains 4096 weights that are applied to the feature vector of each RP.
In the SVM training, RP regions whose IoU with the labelled box is below the 0.3 threshold are treated as negative, regions whose IoU is above the positive threshold of 0.7 are treated as positive, and the rest are all discarded. (In the original paper, only the ground-truth boxes themselves serve as SVM positives.) The SVM gives a predicted label, which is compared with the true label to compute the loss and adjust each class's weights.
The author uses a hard-negative mining training method to improve the accuracy of the SVM classifier. At the start of training, the sampled negatives usually have little in common with the positives; they may be "complete" negatives containing no part of the target at all. A classifier trained only on such samples is likely to misjudge many near-positive negatives as positives. These misclassified negatives are collected, sorted by predicted score, added to the training set, and the classifier is retrained; the process repeats until a termination condition is met, e.g. the classifier's accuracy reaches a certain threshold. In simple terms, hard-negative mining finds the samples that are easiest to misclassify and repeatedly adds them to the negative set, strengthening the classifier.
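A toy sketch of hard-negative mining, using scikit-learn's LinearSVC as a stand-in for the per-class SVM (the data is synthetic; the round count and pool sizes are illustrative, not the paper's settings):

```python
# Hard-negative mining: repeatedly add the negatives the current
# classifier gets wrong, then retrain on the enlarged negative set.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
positives = rng.normal(1.0, 1.0, size=(50, 16))         # few positives
negative_pool = rng.normal(-1.0, 1.0, size=(5000, 16))  # large negative pool

negatives = negative_pool[:100]                # start from a small subset
for _ in range(3):                             # a few mining rounds
    X = np.vstack([positives, negatives])
    y = np.array([1] * len(positives) + [0] * len(negatives))
    clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
    hard = negative_pool[clf.predict(negative_pool) == 1]  # misclassified
    if len(hard) == 0:
        break                                  # termination condition
    negatives = np.vstack([negatives, hard])   # add hard negatives, retrain

acc = clf.score(np.vstack([positives, negative_pool]),
                np.array([1] * 50 + [0] * 5000))
print(acc)
```

Because the negative pool in detection is enormous (almost every RP is background), mining hard negatives is far cheaper than training on all negatives at once.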

2.2.5 Regression of predicted boxes

Prediction-box regression and NMS column
That column is written very clearly, listing the translation and scaling formulas of the predicted box, as well as the loss used during training.
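The translation and scaling targets mentioned there can be written out concretely; a minimal sketch with boxes parameterised as (centre_x, centre_y, width, height):

```python
# Bounding-box regression targets: the regressor learns (tx, ty, tw, th)
# mapping a proposal P toward the ground-truth box G.
import math

def bbox_targets(P, G):
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    tx = (gx - px) / pw          # translation, normalised by proposal size
    ty = (gy - py) / ph
    tw = math.log(gw / pw)       # scale change in log space
    th = math.log(gh / ph)
    return tx, ty, tw, th

def apply_deltas(P, t):
    # Invert the transform: predicted deltas move P onto the target box.
    px, py, pw, ph = P
    tx, ty, tw, th = t
    return (px + pw * tx, py + ph * ty, pw * math.exp(tw), ph * math.exp(th))

P = (10.0, 10.0, 20.0, 20.0)
G = (12.0, 11.0, 24.0, 18.0)
out = apply_deltas(P, bbox_targets(P, G))   # recovers G up to float error
print(out)
```

Normalising the translation by the proposal's size and predicting scale in log space makes the targets invariant to the proposal's absolute size.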


Origin blog.csdn.net/ycx_ccc/article/details/128100583