[RCNN Series] Summary of RCNN Papers

Object detection paper summary

[RCNN Series]
RCNN
Fast RCNN
Faster RCNN



Foreword

A summary of some classic object detection papers.


1. Pipeline

(Figure: the R-CNN pipeline)
First, the input image is passed to the Selective Search algorithm (an older, hand-crafted method), which searches for regions of the image that may contain objects; these are saved to local disk. The resulting proposals (about 2000) are then fed into the CNN, i.e. the convolutional network, to extract features (features only, no classification prediction), and finally SVMs are attached to predict the classification (20 object classes + 1 background class).
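To make the stages concrete, here is a minimal sketch of the inference flow as a single function; the propose, warp, and extract callables stand in for selective search, the warp step, and the CNN feature extractor, and are assumptions rather than the paper's actual code.

```python
# A minimal sketch of R-CNN inference. The propose/warp/extract callables
# are hypothetical stand-ins for selective search, the 227x227 warp, and
# the CNN feature extractor -- not the paper's actual code.
import numpy as np

def rcnn_detect(image, propose, warp, extract, svm_W):
    boxes = propose(image)                # ~2000 region proposals (x, y, w, h)
    feats = np.stack([extract(warp(image, b)) for b in boxes])  # (N, 4096)
    return boxes, feats @ svm_W           # (N, 21) class scores per proposal
```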

2. Model design

1. Warp

When obtaining region proposals, the author adds 16 pixels of context around each original proposal. For example, the red frame is the proposal computed by SS (selective search); the author extracts 16 more pixels around it, and the final result is the yellow frame. This captures more edge information and prevents some features from being truncated.
(Figure: the SS proposal (red frame) expanded by 16 pixels of context (yellow frame))

After obtaining the region proposals, they must all be resized to a uniform 227×227 before being input to the CNN, because AlexNet contains fully connected layers. For this reshape (warp) step, the author tried several methods.
(Figure: the warping strategies A–D compared in the paper)
Column A shows the proposal itself. Column B crops a 227×227 region directly from the original image around the proposal. Column C scales proportionally: the longer side is scaled to 227 and the shorter side by the same factor (if the shorter side then falls short of 227, the remainder is padded with the pixel mean). Column D scales anisotropically: height and width are each scaled to 227 independently. Each sub-figure has two rows: the first row is without the extra 16 context pixels, the second row is with them. In the end the author uses D, anisotropic scaling, which gave the best results.
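A rough sketch of strategy D with the 16-pixel context, under my own assumptions about boundary handling (the paper fills missing context with the image mean, while this sketch simply clips to the image):

```python
# Strategy D: anisotropic warp to 227x227 with 16 warped pixels of context.
# Assumes an HxWx3 uint8 numpy image and a proposal in (x, y, w, h) form.
import numpy as np
import cv2

def warp_with_context(image, box, out_size=227, context=16):
    x, y, w, h = box
    # After warping, the original box should occupy out_size - 2*context
    # pixels, so 'context' warped pixels of surroundings appear on each side.
    scale_x = w / (out_size - 2 * context)
    scale_y = h / (out_size - 2 * context)
    x0 = int(round(x - context * scale_x))
    y0 = int(round(y - context * scale_y))
    x1 = int(round(x + w + context * scale_x))
    y1 = int(round(y + h + context * scale_y))
    # Clip to the image; the paper instead fills missing context with the mean.
    H, W = image.shape[:2]
    crop = image[max(0, y0):min(H, y1), max(0, x0):min(W, x1)]
    return cv2.resize(crop, (out_size, out_size))  # anisotropic resize
```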

2. SVM

The author did not run the 21 SVM classifiers one by one, but wrote them as a single matrix operation.
(x1, x2, ..., x4096) denotes the feature vector obtained from the CNN, i.e. the 4096-dimensional output of the last fully connected layer.
(w1, w2, ..., w4096) denotes the parameters of one SVM.
Since there are about 2000 candidate regions in total, they form a 2000×4096 feature matrix, and since the final output covers 21 classes, it is multiplied by a 4096×21 parameter matrix, producing a 2000×21 matrix of classification scores.
(Figure: SVM classification as a 2000×4096 by 4096×21 matrix multiplication)
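In code this is a single matrix multiplication (the shapes follow the text; the random data is only for illustration):

```python
import numpy as np

feats = np.random.randn(2000, 4096)   # one 4096-d CNN feature per proposal
W = np.random.randn(4096, 21)         # one weight column per SVM (20 + bg)
scores = feats @ W                    # one row of 21 class scores per proposal
print(scores.shape)                   # (2000, 21)
```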

3. Threshold setting

More than 2000 proposals are obtained through the SS algorithm, but many of them are not the object boxes we want, which raises the question of how to divide positive and negative samples.
For CNN fine-tuning: positive samples are proposals whose IoU with a GT (ground-truth box) is ≥ 0.5; all others are negatives (i.e. assigned to the background class). Each CNN input batch of 128 consists of 32 positives and 96 negatives, to balance the positive and negative samples.

Reasons:
The 0.5 threshold worked best experimentally.
Most regions extracted by the SS algorithm contain no target, i.e. most should be negatives. The CNN needs a lot of data for training, and the 0.5 threshold increases the number of positive samples by roughly 30×.

For the SVMs: the positive samples are the GT boxes themselves, and the negative samples are proposals whose IoU with GT is ≤ 0.3 (obtained from threshold experiments; the author found that setting the threshold to 0.5 made results worse). Proposals with IoU greater than 0.3 (but that are not GT) are discarded, counting as neither positive nor negative, because such samples are relatively easy to distinguish. For an SVM, samples far from the support vectors do not affect the learned parameters; the key is distinguishing the support vectors. This is connected to the hard negative mining mentioned in the paper.
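A sketch of the two labeling rules with a plain IoU helper; boxes are assumed to be in (x1, y1, x2, y2) form, and the function names are mine:

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def cnn_label(proposal, gt_boxes):
    # Fine-tuning rule: IoU >= 0.5 with any GT -> positive, else background.
    return "positive" if any(iou(proposal, g) >= 0.5 for g in gt_boxes) else "background"

def svm_label(proposal, gt_boxes):
    # SVM rule: positives are the GT boxes themselves (used directly);
    # proposals with IoU <= 0.3 are negatives, the rest are ignored.
    best = max(iou(proposal, g) for g in gt_boxes)
    return "negative" if best <= 0.3 else "ignored"
```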

Note: the author originally did not plan to fine-tune AlexNet at all, intending to use the AlexNet trained on ImageNet directly, so the SVM's positive/negative division thresholds were fixed first. Fine-tuning was introduced later and required dividing positives and negatives for retraining, but using the SVM's division for the CNN turned out to work poorly, so a different division was defined.

4. Box regression

Box regression is an improvement the author added later to obtain a more accurate predicted box: it shifts and scales the predicted box so that it lies closer to the GT.
(Figure: shifting the predicted box toward the GT)
Why not directly predict the coordinates of the box, but instead predict an offset? Because with normalized offsets, targets of different sizes can produce the same loss; the regression becomes scale-invariant.
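These are the standard R-CNN regression targets: center offsets normalized by the proposal size, and log-scale ratios for width and height.

```python
import numpy as np

def regression_targets(P, G):
    """R-CNN box-regression targets for proposal P and ground truth G,
    both in (x_center, y_center, w, h) form."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    tx = (gx - px) / pw          # x offset, normalized by proposal width
    ty = (gy - py) / ph          # y offset, normalized by proposal height
    tw = np.log(gw / pw)         # log-scale width ratio
    th = np.log(gh / ph)         # log-scale height ratio
    return tx, ty, tw, th
```

Because tx, ty are divided by the proposal size and tw, th are log ratios, a large box and a small box that are off by the same relative amount produce the same targets, and therefore the same loss.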

3. Discussion

1. Why use an additional SVM?

1. The author found that using the CNN's own classification directly reduced accuracy.

2. The positive/negative sample divisions of the CNN and the SVM differ. The CNN's definition of a positive sample does not require an accurate target box, which introduces error.

3. The CNN's negative samples are randomly sampled, and most of them are easily recognized negatives. The SVM can instead mine hard negatives: for an SVM, samples far from the support vectors do not affect the result; the key is distinguishing the support vectors.

2. Hard negative mining

Hard negative mining sends difficult negative samples back into the network for retraining, repeating until there is no further improvement. Hard negatives are proposals that the network easily misclassifies as positives, i.e. false positives. For example, an RoI containing half of a target is still a negative sample but is easily judged positive; such an RoI is a hard negative. Training on hard negatives greatly helps the network's classification performance, because it amounts to a collection of the model's "wrong answers". How are hard negatives found? It is simple: first train on the initial sample set, then use the trained model to score the samples in the negative set, select the highest-scoring ones (those most likely to be judged positive) as hard negatives, add them to the training negatives, retrain the network, and repeat the cycle.
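A schematic of the mining loop; train and score are stand-in callables (assumptions, not the paper's code):

```python
def hard_negative_mining(train, score, positives, negatives, rounds=3, k=1000):
    # train: (pos, neg) -> model        (stand-in for SVM training)
    # score: (model, sample) -> float   (higher = more "positive-looking")
    train_neg = list(negatives[:k])            # initial negative set
    model = train(positives, train_neg)
    for _ in range(rounds):
        # Rank all candidate negatives by how positive the model thinks
        # they are; the top scorers are the hard negatives (false positives).
        ranked = sorted(negatives, key=lambda s: score(model, s), reverse=True)
        hard = [s for s in ranked[:k] if s not in train_neg]
        if not hard:                           # converged: no new hard negatives
            break
        train_neg += hard                      # add the "wrong answers"
        model = train(positives, train_neg)    # and retrain
    return model
```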

3. Classic architecture

After AlexNet won the classification competition in 2012, CNNs became popular. The author started from this question: how can a classification network be extended to the VOC object detection dataset? Two problems have to be solved: localization and the small dataset. For localization, the author uses the SS (selective search) algorithm to first filter out regions that may contain objects, i.e. the RoIs. As for the dataset problem: the detection dataset is tiny compared with ImageNet (VOC2007 and VOC2012 together contain just over 20,000 images), and such a small amount of data is not enough to train a relatively deep CNN, so the author turned to pre-training and fine-tuning.

4. Disadvantages

RCNN is the first two-stage detector of the RCNN family and is certainly imperfect:

1. Staged training. The pipeline is cumbersome: every region proposal has to be sent through the CNN individually, and the SVM classifiers are trained separately (Fast RCNN's improvement point).
2. The SS algorithm is too weak, and its results must be saved to local disk (Faster RCNN's improvement point).

Origin: blog.csdn.net/m0_46412065/article/details/128176894