[Object Detection] A detailed explanation of the Faster R-CNN algorithm

Ren, Shaoqing, et al. “Faster R-CNN: Towards real-time object detection with region proposal networks.” Advances in Neural Information Processing Systems. 2015.

This paper is another 2015 masterpiece from Ross Girshick's team, a leader in the object detection field, following R-CNN [1] and Fast R-CNN [2]. On PASCAL VOC, the simple network (ZF) reaches a detection speed of 17 fps with 59.9% mAP, while the complex network (VGG-16) reaches 5 fps with 78.8% mAP.

Key idea

From R-CNN to Fast R-CNN, and now to the Faster R-CNN of this paper, the four basic steps of object detection (candidate region generation, feature extraction, classification, and location refinement) are finally unified into a single deep network framework. No computation is repeated, everything runs on the GPU, and the running speed improves greatly.

Faster R-CNN can simply be regarded as the combination "region proposal network + Fast R-CNN", where the region proposal network replaces the Selective Search method used in Fast R-CNN. This paper focuses on solving three problems in this system:

  1. How to design a region proposal network
  2. How to train a region proposal network
  3. How to let the region proposal network and the Fast R-CNN network share the feature extraction network

Region Proposal Network

The core idea of RPN is to use a convolutional neural network to generate region proposals directly, and the method used is essentially a sliding window. The design of RPN is ingenious: RPN only needs to slide once over the last convolutional layer, because the anchor mechanism combined with bounding-box regression yields region proposals at multiple scales and aspect ratios.

[Figure: RPN network structure (ZF model)]

Consider the RPN structure diagram above (using the ZF model). Given an input image (assume a resolution of 600*1000), the convolutional layers produce a final feature map of about 40*60. A 3*3 convolution kernel (the sliding window) is applied to this feature map. Since the last convolutional layer has 256 feature maps, each 3*3 region is convolved into a 256-dimensional feature vector, which is then fed to a cls layer and a reg layer for classification and bounding-box regression respectively (similar to Fast R-CNN, except that the only categories here are target and background). For each 3*3 sliding-window position, the network simultaneously predicts region proposals at 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1) of the input image; this mapping mechanism is called an anchor. So for this 40*60 feature map there are about 20,000 (40*60*9) anchors in total, that is, about 20,000 region proposals are predicted.
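As an illustration of the anchor mechanism, here is a minimal numpy sketch (not the paper's code) of how the 9 anchors at each feature-map position could be enumerated; the helper name `anchors_at` and the feature stride of 16 are assumptions of this example.

```python
import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return the 9 anchor boxes (x1, y1, x2, y2) centred at (cx, cy).

    Each anchor has area scale**2; ratio is interpreted as height / width.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)   # choose w, h so that w * h == s**2 and h / w == r
            h = s * np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

# One set of 9 anchors per feature-map cell; with a stride of 16, a 40*60
# feature map yields 40 * 60 * 9 = 21,600 anchors (the "about 20,000" above).
stride = 16
all_anchors = np.concatenate([
    anchors_at(x * stride + stride / 2, y * stride + stride / 2)
    for y in range(40) for x in range(60)
])
print(all_anchors.shape)  # (21600, 4)
```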

What are the benefits of this design? Although a sliding-window strategy is still used, the window now slides over the convolutional feature map, whose resolution is reduced by a factor of 16 in each dimension (16*16 in area, after four 2*2 pooling operations) compared with the original image. Scales are handled by the 9 anchors, corresponding to three scales and three aspect ratios, and bounding-box regression follows, so even windows whose shapes fall outside these 9 anchors can still produce a region proposal close to the target.

The detection framework used by the NIPS 2015 version of Faster R-CNN keeps the RPN network and the Fast R-CNN network separate. The overall pipeline is the same as Fast R-CNN, except that the region proposals are now produced by the RPN network instead of Selective Search. To let the RPN network and the Fast R-CNN network share the weights of the convolutional layers, the authors adopt a 4-step training method (a sketch follows the list):

(1) Initialize the network with a model pre-trained on ImageNet and fine-tune the RPN network;

(2) Use the RPN from (1) to extract region proposals and train the Fast R-CNN network with them;

(3) Use the Fast R-CNN network from (2) to re-initialize the RPN, fix the shared convolutional layers, and fine-tune the RPN-specific layers;

(4) Keep the convolutional layers of the Fast R-CNN from (2) fixed, and fine-tune the Fast R-CNN-specific layers using the region proposals extracted by the RPN in (3).
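The schedule above can be summarized with the following schematic sketch; `train_rpn`, `train_fast_rcnn` and `generate_proposals` are placeholder names for this illustration, not functions from the released code.

```python
# Schematic outline of the 4-step alternating training (placeholder functions).

# (1) initialise from ImageNet weights and fine-tune the RPN
rpn = train_rpn(init="imagenet")

# (2) train a separate Fast R-CNN detector on proposals from that RPN,
#     also initialised from ImageNet (conv layers are not shared yet)
proposals = rpn.generate_proposals(train_images)
detector = train_fast_rcnn(init="imagenet", proposals=proposals)

# (3) re-initialise the RPN from the detector, freeze the shared conv
#     layers, and fine-tune only the RPN-specific layers
rpn = train_rpn(init=detector, freeze_shared_conv=True)

# (4) keep the shared conv layers fixed and fine-tune only the
#     Fast R-CNN-specific layers on proposals from the new RPN
proposals = rpn.generate_proposals(train_images)
detector = train_fast_rcnn(init=detector, proposals=proposals,
                           freeze_shared_conv=True)
```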

With the weights shared in this way, the combination of RPN and Fast R-CNN improves object detection accuracy.

With the trained RPN network, given a test image, region proposals are obtained directly after bounding-box regression. The proposals are sorted by their objectness scores, and the top 300 windows are passed to Fast R-CNN for detection. Training on the VOC07+12 training set and testing on the VOC2007 test set, the mAP reaches 73.2% (Selective Search + Fast R-CNN gives 70%), and detection runs at about 5 frames per second (Selective Search + Fast R-CNN takes 2-3 s per image).

It should be noted that the latest version combines the RPN network and the Fast R-CNN network into one: the proposals produced by the RPN are fed directly into the RoI pooling layer, truly realizing an end-to-end object detection framework with a single CNN.

Region Proposal Network: Structure

The basic idea is to score all possible candidate boxes on the extracted feature map. Since a position refinement step follows, the candidate boxes can be relatively sparse.
[Figure: structure of the region proposal network; the gray box is the feature extraction network]

Feature extraction

Key points:
The original feature extraction network (the gray box above) consists of several conv+relu layers and can directly reuse common classification networks trained on ImageNet. The paper tests two such networks: the 5-layer ZF [3] and the 16-layer VGG-16 [4]; their detailed structure is not repeated here.
An additional conv+relu layer is added on top, producing 51*39*256-dimensional features.

Candidate region (anchor)

The feature map can be regarded as a 256-channel image of size 51*39. For each position of this image, 9 candidate windows are considered: three areas {128², 256², 512²} combined with three aspect ratios {1:1, 1:2, 2:1}. These candidate windows are called anchors. The figure below shows the 51*39 anchor centers together with examples of the 9 anchors.
[Figure: 51*39 anchor centers and 9 anchor examples]

In the entire Faster R-CNN algorithm, three scales are involved.
Original image scale: the size of the raw input. It is not subject to any restriction and does not affect performance.
Normalized scale: the size of the input to the feature extraction network, set at test time (opts.test_scale=600 in the source code). Anchors are defined at this scale; the relative size of this parameter and of the anchors determines the range of object sizes that can be detected.
Network input scale: the size of the input to the feature detection network, set during training (224*224 in the source code).
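For the normalized scale, the usual preprocessing rescales the image so that its shorter side becomes 600 pixels. Below is a minimal sketch using OpenCV; the function name `rescale_to_target` and the 1000-pixel cap on the longer side are illustrative choices, not code from the original release.

```python
import cv2

def rescale_to_target(img, target_size=600, max_size=1000):
    """Rescale img so its shorter side is target_size, capping the longer side."""
    h, w = img.shape[:2]
    scale = target_size / min(h, w)
    if scale * max(h, w) > max_size:   # keep very elongated images bounded
        scale = max_size / max(h, w)
    resized = cv2.resize(img, None, fx=scale, fy=scale)
    return resized, scale
```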

Window classification and position refinement

At each position, the classification layer (cls_score) outputs the probabilities that each of the 9 anchors belongs to the foreground or the background; the window regression layer (bbox_pred) outputs the translation and scaling parameters that should be applied to each of the 9 anchors.
For each location, the classification layer computes these foreground/background probabilities from the 256-dimensional feature, and the window regression layer computes the 4 translation/scaling parameters from the same 256-dimensional feature.

Locally, these two layers behave like fully connected networks; globally, since the parameters are shared across all 51*39 locations, they are actually implemented as 1*1 convolutional layers.
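A minimal PyTorch-style sketch of this head (an illustration, not the original Caffe prototxt): a 3*3 convolution produces the 256-d feature at every position, and two sibling 1*1 convolutions play the role of the cls and reg layers.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        # 3*3 "sliding window" over the last shared conv feature map
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        # cls_score: 2 scores (foreground / background) per anchor
        self.cls_score = nn.Conv2d(256, num_anchors * 2, kernel_size=1)
        # bbox_pred: 4 translation/scaling parameters per anchor
        self.bbox_pred = nn.Conv2d(256, num_anchors * 4, kernel_size=1)

    def forward(self, feat):               # feat: (N, 256, H, W), e.g. H*W = 39*51
        x = torch.relu(self.conv(feat))
        return self.cls_score(x), self.bbox_pred(x)

scores, deltas = RPNHead()(torch.zeros(1, 256, 39, 51))
# scores: (1, 18, 39, 51), deltas: (1, 36, 39, 51)
```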

In the actual code, the 51*39*9 candidate positions are sorted by score, the highest-scoring ones are kept, and about 2,000 candidates are then obtained through Non-Maximum Suppression before being fed into the classifier and regressor.
Faster R-CNN, like R-CNN and Fast R-CNN, is therefore a two-stage detection algorithm.
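For reference, here is a minimal numpy sketch of greedy Non-Maximum Suppression as it is commonly implemented (an illustration of the idea, not the exact code used with the paper):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the current best box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # keep only boxes with low overlap
    return keep
```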

Region Proposal Network: Training

Samples

Examine each image in the training set:
a. For each annotated ground-truth box, mark the anchor with the largest overlap (IoU) as a foreground sample.
b. For the anchors remaining after a), mark an anchor as a foreground sample if its overlap with some ground-truth box is greater than 0.7; mark it as a background sample if its overlap with every ground-truth box is less than 0.3.
c. Anchors not covered by a) or b) are discarded.
d. Anchors that cross the image boundary are discarded.
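A simplified sketch of rules a)-c), assuming the cross-boundary anchors of rule d) have already been removed and the pairwise IoU matrix between anchors and ground-truth boxes has been precomputed (`label_anchors` is a hypothetical helper, not code from the release):

```python
import numpy as np

def label_anchors(overlaps, pos_thresh=0.7, neg_thresh=0.3):
    """overlaps: (num_anchors, num_gt) IoU matrix for anchors inside the image."""
    labels = np.full(overlaps.shape[0], -1)       # rule c: ignored by default
    max_per_anchor = overlaps.max(axis=1)
    labels[max_per_anchor < neg_thresh] = 0       # rule b: background (IoU < 0.3 with every gt)
    labels[max_per_anchor >= pos_thresh] = 1      # rule b: foreground (IoU > 0.7 with some gt)
    labels[overlaps.argmax(axis=0)] = 1           # rule a: best anchor per ground-truth box
    return labels
```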

Cost function

Two costs are minimized simultaneously:
a. the classification error;
b. the window position deviation of foreground samples.
For details, see the "Classification and Position Adjustment" section in the Fast R-CNN article.
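For reference, the multi-task loss minimized by the RPN in the original paper has the form

$$
L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)
$$

where $p_i$ is the predicted objectness probability of anchor $i$, $p_i^*$ is its 0/1 label, $L_{cls}$ is the log loss over the two classes, and $L_{reg}$ is the smooth-$L_1$ loss of Fast R-CNN applied only to foreground anchors ($p_i^* = 1$). The regression targets use the usual parameterization $t_x = (x - x_a)/w_a$, $t_y = (y - y_a)/h_a$, $t_w = \log(w/w_a)$, $t_h = \log(h/h_a)$, with $x_a, y_a, w_a, h_a$ the anchor's center and size.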

Hyperparameters

The original feature extraction network is initialized from ImageNet classification training; the remaining new layers are randomly initialized.
Each mini-batch contains 256 anchors drawn from a single image, with foreground and background samples in a 1:1 ratio.
The learning rate is 0.001 for the first 60k iterations and 0.0001 for the last 20k iterations.
Momentum is set to 0.9 and weight decay to 0.0005. [5]
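A minimal sketch of these settings expressed with a standard SGD optimizer (the original implementation is in Caffe; `labels` and `rpn_head` refer to the earlier sketches, and the milestone scheduler is simply an equivalent way to express the learning-rate drop):

```python
import numpy as np
import torch

# mini-batch: 256 anchors from one image, foreground : background = 1 : 1
fg_inds = np.where(labels == 1)[0]
bg_inds = np.where(labels == 0)[0]
fg_sample = np.random.choice(fg_inds, size=min(128, len(fg_inds)), replace=False)
bg_sample = np.random.choice(bg_inds, size=min(256 - len(fg_sample), len(bg_inds)),
                             replace=False)

# SGD: lr 0.001 for the first 60k iterations, 0.0001 for the last 20k,
# momentum 0.9, weight decay 0.0005 (scheduler stepped once per iteration)
optimizer = torch.optim.SGD(rpn_head.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60000], gamma=0.1)
```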

Shared features

Both the Region Proposal Network (RPN) and Fast R-CNN require the same raw feature extraction network (the grey box in the figure below). This network obtains its initial parameters W0 from ImageNet classification, but how should the parameters be fine-tuned to satisfy both networks at the same time? The paper describes three methods.
[Figure: RPN and Fast R-CNN sharing the feature extraction network (grey box)]

Alternating training

a. Starting from W0, train the RPN, then use the RPN to extract candidate regions on the training set.
b. Starting from W0, train Fast R-CNN with those candidate regions, and denote the resulting parameters W1.
c. Starting from W1, train the RPN again, and so on.
In practice only two iterations of this loop are performed, and some layers are frozen during training. The experiments in the paper use this method.
As Ross Girshick explained in his ICCV '15 lecture "Training R-CNNs of various velocities", there is no fundamental reason for this approach; it was chosen mainly because of "implementation issues, and deadlines".

Approximate joint training

Train directly on the combined structure above. During the backward pass, the extracted RoI coordinates are treated as fixed values; when updating the parameters, the gradient contributions from the RPN and from Fast R-CNN are summed at the shared feature extraction layers.
This method gives results similar to the previous one but reduces training time by 20%-25%. It is included in the published Python code.
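In modern frameworks, "treating the RoIs as fixed values" corresponds to detaching the proposal coordinates from the computation graph, so that only the two losses propagate gradients back into the shared conv layers. A minimal PyTorch-style sketch of the idea (`decode_and_nms`, `rpn_loss` and `fast_rcnn_head` are hypothetical helpers, not the released code):

```python
# shared_features: output of the common conv layers for one image
scores, deltas = rpn_head(shared_features)
proposals = decode_and_nms(scores, deltas, anchors)   # hypothetical helper

# approximate joint training: ignore the gradient through the proposal
# coordinates, but let the RPN loss and the Fast R-CNN loss both flow back
# into the shared conv layers, where their gradients are simply summed
rois = proposals.detach()
rcnn_cls_loss, rcnn_reg_loss = fast_rcnn_head(shared_features, rois)
loss = rpn_loss(scores, deltas) + rcnn_cls_loss + rcnn_reg_loss
loss.backward()
```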

Joint training

Train directly on the structure above, but during the backward pass also take into account the effect of changes in the RoI coordinates. The derivation is beyond the scope of this paper; see the NIPS 2015 paper [6].

Experiments

In addition to the basic performance figures mentioned at the beginning, some conclusions are worth noting:

  • Compared with the Selective Search method (black curve), the recall of the RPN method proposed here (red and blue curves) drops only slightly when the number of candidate regions generated per image is reduced from 2,000 to 300, showing that RPN proposals are more targeted.
  • When trained on the larger Microsoft COCO dataset [7] and tested directly on PASCAL VOC, accuracy improves by about 6%, showing that Faster R-CNN transfers well and does not overfit.
