The most complete explanation of R-CNN in history

One: Getting to know R-CNN first

The R-CNN series of papers (R-CNN, Fast R-CNN, Faster R-CNN) is the origin of object detection with deep learning; Fast R-CNN and Faster R-CNN both follow the core idea of R-CNN.

The full name of R-CNN is Regions with CNN features, and the name explains the method well: use a CNN to extract features from Region Proposals, then perform SVM classification and bbox regression on those features.

[Figure: R-CNN network structure]

Next, I will explain the core ideas in R-CNN from the training phase and the testing phase.

Two: Training steps

1. Determining the Region Proposals (RPs)

First, let's introduce the Selective Search algorithm, which is used to extract about 2000 Region Proposals from each input image during training. The main steps of Selective Search are:

  • Use an over-segmentation method to divide the image into many small regions (1k–2k)
  • Calculate the similarity (color, texture, scale, etc.) between all neighboring regions
  • Merge the regions with the highest similarity
  • Recalculate the similarity between the merged region and its neighbors
  • Repeat steps 3 and 4 until the whole image becomes a single region

At each iteration, larger regions are formed and added to the list of Region Proposals. This bottom-up approach creates Region Proposals at scales from small to large, as shown in the figure (a code sketch follows it):
[Figure: Region Proposals at multiple scales generated by Selective Search]
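As a concrete illustration, here is a minimal sketch using the Selective Search implementation shipped with opencv-contrib-python (a stand-in for the paper's original implementation; the image path is a placeholder):

```python
# Minimal Selective Search sketch (requires opencv-contrib-python).
import cv2

img = cv2.imread("example.jpg")  # placeholder path
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # faster, lower-recall mode
rects = ss.process()               # array of (x, y, w, h) boxes
proposals = rects[:2000]           # keep roughly 2000 proposals
print(f"{len(proposals)} region proposals")
```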

2. Model pre-training

At test time the model needs a CNN to extract features from each RP for the subsequent classification and regression, so how to train this CNN is the key question.

Because labeled object-detection data is scarce, training a good CNN from scratch with randomly initialized parameters is not feasible. Instead, supervised pre-training is used: AlexNet is trained on a large dataset (ImageNet ILSVRC 2012), yielding a 1000-class pre-trained model.

3. Fine-tuning

When R-CNN actually runs, Selective Search extracts about 2000 Region Proposals from each input image in the VOC set, and the CNN extracts features from each of them. The RPs come in different sizes, while AlexNet requires a 227×227 input, so each RP must be resized to 227×227. Before warping, a 16-pixel context padding is added around the candidate box, and then anisotropic scaling is applied; this warping scheme raises mAP by 3 to 5 points.
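Here is a simplified sketch of this warping step, following the blog's description (the paper's exact context-padding scheme differs in detail); the box format and clipping behavior are illustrative assumptions:

```python
import cv2

def warp_proposal(image, box, out_size=227, padding=16):
    """Warp one region proposal to out_size x out_size with context padding.

    box is (x1, y1, x2, y2) in integer pixel coordinates. The box is
    dilated by `padding` pixels on every side (clipped to the image),
    then anisotropically scaled to the CNN input size.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    x1 = max(0, x1 - padding)
    y1 = max(0, y1 - padding)
    x2 = min(w, x2 + padding)
    y2 = min(h, y2 + padding)
    crop = image[y1:y2, x1:x2]
    return cv2.resize(crop, (out_size, out_size))  # anisotropic scaling
```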

The original CNN extracted features from undistorted ImageNet images, but it must now extract features from warped images of the VOC detection dataset. So, to adapt the CNN to the new task (detection) and the new domain (warped proposal windows), its parameters must be fine-tuned on the detection domain. The fine-tuning uses the Region Proposals that Selective Search finds on each VOC training image.

(Remark: there is another reason. If you skip task-specific fine-tuning and simply use the CNN as a feature extractor, the convolutional layers only learn basic, shared features, much like the SIFT descriptor, usable for any image, while fc6 and fc7 learn task-specific features. For example, in face gender recognition, the convolutional layers of a CNN learn features common to all faces, and the fully connected layers learn features specific to gender classification.)

First, Selective Search is run on the PASCAL VOC dataset to obtain about 2000 Region Proposals per image, which are used to fine-tune the pre-trained model. The last 1000-way fully connected (classification) layer of the pre-trained model is replaced with a 21-way classification layer (20 object classes + background). Then the IoU between each region proposal and the ground truth is computed: a proposal with IoU > 0.5 is treated as a positive sample, otherwise as a negative sample (i.e., background). Since a picture contains many candidate regions, negatives far outnumber positives, so positives are oversampled to keep the sample distribution balanced. Each iteration uses hierarchical sampling: two images are sampled per mini-batch, from which 32 positive and 96 negative windows are randomly drawn, forming a mini-batch of 128 (positive:negative = 1:3). Training uses SGD with a learning rate of 0.001. Sketches of the IoU computation and the mini-batch sampling follow.
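Below is a minimal sketch of the two ingredients just described: the IoU computation used for labeling, and the 32/96 positive/negative sampling. Function names are illustrative:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def sample_minibatch(pos_idx, neg_idx, n_pos=32, n_neg=96):
    """Randomly draw 32 positives and 96 negatives (1:3) for one batch."""
    pos = np.random.choice(pos_idx, n_pos, replace=len(pos_idx) < n_pos)
    neg = np.random.choice(neg_idx, n_neg, replace=len(neg_idx) < n_neg)
    return np.concatenate([pos, neg])
```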

4. Extracting and saving the RP feature vectors

After pre-training and fine-tuning, the feature-extraction CNN is no longer trained; it is frozen and used purely as a feature-extraction tool. Although the CNN was fine-tuned to classify region proposals, in practice its only role here is to extract a feature vector for each region proposal.
So we take the VOC training set, run Selective Search to obtain 2000 RPs per image, forward each RP through the CNN, and save the 4096-dimensional features from AlexNet's fc7 layer for the subsequent SVM classification.
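A minimal sketch of this caching step, using torchvision's AlexNet as a stand-in for the fine-tuned model (in real R-CNN the fine-tuned weights would be loaded instead):

```python
import torch
import torchvision

# fc7 activations are the output of everything up to the last Linear layer.
model = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()
fc7 = torch.nn.Sequential(model.features, model.avgpool,
                          torch.nn.Flatten(),
                          model.classifier[:-1])

@torch.no_grad()
def extract_features(warped_batch):
    """warped_batch: (N, 3, 227, 227) tensor of warped proposals."""
    return fc7(warped_batch)  # (N, 4096) fc7 features

# feats = extract_features(batch)
# np.save("voc_rp_features.npy", feats.numpy())  # cache to disk for the SVMs
```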

5. SVM training

The authors use SVMs for classification: one SVM classifier is trained for each class, giving N (21) classifiers in total. Let's look at how the SVM classifiers are trained and used.

In SVM training, proposals with IoU < 0.3 against the ground truth are used as negatives, the ground-truth boxes themselves (i.e., boxes that frame the object completely, loosely IoU > 0.7) are used as positives, and everything in between is discarded. The SVM classifier then outputs predicted labels, which are compared with the true labels to compute the loss used to train the SVM.

There is a detail here: because the SVM is trained on small samples, negatives will far outnumber positives. To handle this, the author uses hard negative mining: train on an initial sample set; after a round of training, add the highest-scoring negatives (the ones most likely to be misclassified) into the new training set; repeat until a stopping condition is met (for example, classifier performance no longer improves). In this way the SVM remains suitable for small-sample training and avoids overfitting even when the samples are unbalanced.

Why mine hard negatives at all? Because the number of negative windows is huge and positives are an extremely small fraction. Too many negatives slow down optimization, since most of them lie far from the decision boundary and contribute almost nothing (in a max-margin SVM, only the support vectors matter). Hard-negative mining discards easy examples that barely affect optimization and keeps the hard examples near the boundary. A small sketch of the mining loop follows.
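Here is a sketch of the mining loop with scikit-learn's LinearSVC; the number of rounds, the pool size, and C are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_with_hard_negatives(pos_feats, neg_feats, rounds=3, keep=5000):
    """Sketch of hard-negative mining for one per-class SVM.

    Start from a random subset of negatives; after each round, add the
    highest-scoring (most confidently wrong) negatives and retrain.
    """
    keep = min(keep, len(neg_feats))
    neg_pool = neg_feats[np.random.choice(len(neg_feats), keep, replace=False)]
    svm = None
    for _ in range(rounds):
        X = np.vstack([pos_feats, neg_pool])
        y = np.hstack([np.ones(len(pos_feats)), np.zeros(len(neg_pool))])
        svm = LinearSVC(C=0.001).fit(X, y)
        scores = svm.decision_function(neg_feats)     # score every negative
        hard = neg_feats[np.argsort(-scores)[:keep]]  # hardest negatives
        neg_pool = np.vstack([neg_pool, hard])
    return svm
```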


Hot take ahead:

Some people may wonder why a separate SVM is used for classification instead of just using the CNN's final 21-way softmax layer. Let me focus on this; the following is only my personal understanding:
A careful reader will notice that when the CNN was fine-tuned for feature extraction, the threshold was IoU > 0.5 for positives and below 0.5 for negatives, whereas for the SVM only boxes that completely surround the object (roughly IoU > 0.7) are positives and IoU < 0.3 boxes are negatives. The former is large-sample training, the latter small-sample training. Training a CNN requires a large amount of data, otherwise it easily overfits, so the threshold is set low: a bounding box that contains only part of the object is still labeled positive for CNN training. The SVM, by contrast, is well suited to small-sample training, so the IoU requirements on its training data are strict: a box is labeled with an object class only when it encloses the whole object (IoU > 0.7), boxes with IoU < 0.3 are labeled negative, and the SVM is trained on those. Because of this, the softmax classifier can only be trained with the low-IoU samples, which very likely makes the final bbox positions inaccurate (if the gap between a bbox and the ground truth is too large, linear regression fails to converge) and lowers classification accuracy; according to the experiments, mAP drops by several points. If you instead raise the IoU threshold for the softmax, there are too few training samples and the CNN overfits. You really "can't have both"; the real culprit is that the VOC training set is too small, which rules out many optimizations. Therefore the SVM was finally chosen to do the classification, and the CNN is used only to extract features.


With this design the classification accuracy improves considerably, and the accuracy of the subsequent bbox regression improves as well.

6. Training the bbox regression

RPs whose IoU with the ground truth exceeds 0.6 are used as positive samples for regression training; for details, see the referenced article. In essence, training learns a transformation d(P) = (d_x, d_y, d_w, d_h) of each proposal box P that approximates the regression targets t = (t_x, t_y, t_w, t_h) computed from the ground-truth box. A sketch of the target computation follows.
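For reference, the regression targets t are computed from a proposal P and its ground-truth box G as defined in the R-CNN paper; a minimal sketch:

```python
import numpy as np

def regression_targets(P, G):
    """Targets t for one proposal P and its ground-truth box G,
    both given as (cx, cy, w, h), following the R-CNN definition:
        t_x = (G_x - P_x) / P_w,  t_y = (G_y - P_y) / P_h,
        t_w = log(G_w / P_w),     t_h = log(G_h / P_h)
    """
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)])
```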

Three: Testing steps

Having explained the training process of R-CNN, let's now walk through the testing process in five steps:

Step 1: Determining the Region Proposals

Given an input VOC test image, Selective Search is run, the candidate regions are sorted by similarity from large to small, and 2000 Region Proposals are kept.

Step 2: Feature extraction for each RP

Each RP is resized to 227×227 and fed into the CNN feature-extraction network, yielding 2000 feature vectors of 4096 dimensions.

Step 3: SVM classification

The (2000, 4096) feature matrix is fed into the SVM classifiers, producing a (2000, 21) score matrix; the 21 values in each row represent how likely that RP is to belong to each class. By pre-setting a background threshold α and a class-membership threshold β, m RP regions are screened out. A sketch of this screening follows.
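A small numpy sketch of this screening, assuming column 0 holds the background score (the column layout and the threshold semantics are assumptions, and the values are illustrative only):

```python
import numpy as np

alpha, beta = 0.5, 0.5                       # illustrative thresholds
scores = np.random.rand(2000, 21)            # stand-in for real SVM scores
n = scores.shape[0]
best_cls = scores[:, 1:].argmax(axis=1) + 1  # best non-background class
keep = (scores[:, 0] < alpha) & (scores[np.arange(n), best_cls] > beta)
m_idx = np.where(keep)[0]                    # indices of the m kept RPs
```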

Step 4: Bounding-box regression

The (2000, 4096) feature matrix is multiplied by the (4096, 4) regression matrix d, outputting a (2000, 4) offset matrix that encodes the position offset of each RP's center point and the scale change of its bbox. In practice, only the feature vectors of the m RP regions kept by the SVM are used: they form an (m, 4096) matrix, which is multiplied by the (4096, 4) matrix d to output an (m, 4) offset matrix. The offsets are applied as in the sketch below.
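A minimal sketch of applying the predicted offsets, inverting the target definition from the training section:

```python
import numpy as np

def apply_offsets(P, d):
    """Apply predicted offsets d = (d_x, d_y, d_w, d_h) to proposals P.

    P: (m, 4) proposals as (cx, cy, w, h); d: (m, 4) offsets.
    Inverts the target definition used during training.
    """
    cx = P[:, 0] + P[:, 2] * d[:, 0]
    cy = P[:, 1] + P[:, 3] * d[:, 1]
    w = P[:, 2] * np.exp(d[:, 2])
    h = P[:, 3] * np.exp(d[:, 3])
    return np.stack([cx, cy, w, h], axis=1)
```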

Step 5: Non-maximum suppression (NMS)

Only the corrected detection boxes of the m RP regions kept by the SVM are drawn. Since these boxes overlap heavily, non-maximum suppression (NMS) is applied to obtain the final detection result; a minimal implementation follows.
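A minimal greedy NMS sketch (the IoU threshold is illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS. boxes: (m, 4) as (x1, y1, x2, y2); returns kept indices."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]           # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[np.where(iou <= iou_thresh)[0] + 1]  # drop overlaps
    return keep
```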

Attach a sample diagram of each step of processing, as follows:
[Figure: example results of each processing step]

Four: Problems with R-CNN

  • Long training time: training proceeds in multiple separate stages, and CNN features must be computed independently for every region proposal, so the whole pipeline is slow.
  • Large storage footprint: the feature vector of every region proposal must be written to disk for the later stages.
  • Multi-stage pipeline: the model consists of several modules (CNN, SVMs, bbox regressors) that are independent and trained separately. Without joint, end-to-end training the modules cannot be optimized together, so the crucial CNN feature extraction cannot be tuned for the final detection objective, which lowers accuracy.
  • Long test time: since computation is not shared, features are computed separately for every proposal of a test image, so inference is also very slow.

  So far, I have given an in-depth explanation of the entire process and details of R-CNN. I hope it helps. If anything is unclear or you have suggestions, please leave a comment below.

I'm a salted fish from Jiangnan struggling in the CV quagmire; let's work hard together and leave no regrets!


Origin: blog.csdn.net/weixin_43702653/article/details/123973629