Fast R-CNN paper summary

1. Region proposal methods produce higher-quality candidate boxes than the traditional sliding-window approach.
The most commonly used region proposal methods are Selective Search (SS) and Edge Boxes (EB).
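A minimal sketch of generating Selective Search proposals, using the implementation in opencv-contrib-python (a tooling assumption for illustration; the image path is hypothetical):

    # Selective Search via OpenCV's contrib module (pip install opencv-contrib-python)
    import cv2

    img = cv2.imread("example.jpg")  # hypothetical input image

    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()      # the faster, lower-quality mode

    rects = ss.process()                  # N x 4 array of (x, y, w, h) proposals
    print(len(rects), rects[:3])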

2. Fast R-CNN object detection pipeline

The first step is to pass the entire image through several convolutional and max-pooling layers to obtain a feature map.

The second step is to use the Selective Search algorithm to extract object proposals, i.e. RoIs (regions of interest), from the original image.

In the third step, each object proposal is projected onto the feature map via the coordinate mapping between the image and the feature map, yielding the feature-map region corresponding to each proposal.

In the fourth step, each region obtained in the third step is passed through the RoI pooling layer to produce a fixed-size feature map (a downsampled version of the region).

In the fifth step, two fully connected (fc) layers turn the pooled features into a fixed-length RoI feature vector.

In the sixth step, the feature vector is fed into two sibling fc output layers, producing two outputs: softmax class probabilities, and per-class bounding-box regression offsets.
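Put together, the pipeline can be sketched in a few lines of PyTorch. This is an illustrative approximation rather than the reference implementation: the VGG16 backbone, the 1/16 spatial scale, the 7x7 pooling grid, and the 4096-d fc layers follow the paper's VGG16 setup, and two random boxes stand in for Selective Search proposals.

    import torch
    import torchvision
    from torchvision.ops import roi_pool

    # step 1: shared conv backbone (VGG16 features, last max pool dropped -> stride 16)
    backbone = torchvision.models.vgg16(weights=None).features[:-1]

    image = torch.randn(1, 3, 600, 800)       # whole image in
    feature_map = backbone(image)             # (1, 512, 37, 50) feature map

    # step 2: proposals as (batch_index, x1, y1, x2, y2) in image coordinates;
    # random boxes here stand in for Selective Search output
    rois = torch.tensor([[0,  50,  60, 250, 300],
                         [0, 100, 120, 400, 350]], dtype=torch.float)

    # steps 3-4: project each RoI onto the feature map and pool to a fixed 7x7
    pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)

    # step 5: two fc layers give a fixed-length feature vector per RoI
    fc = torch.nn.Sequential(
        torch.nn.Flatten(),
        torch.nn.Linear(512 * 7 * 7, 4096), torch.nn.ReLU(),
        torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
    )
    feat = fc(pooled)

    # step 6: sibling output layers for classification and per-class regression
    num_classes = 20                                            # e.g. PASCAL VOC
    cls_scores = torch.nn.Linear(4096, num_classes + 1)(feat)   # (N+1)-way softmax logits
    bbox_deltas = torch.nn.Linear(4096, 4 * num_classes)(feat)  # 4 offsets per class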


3. Fast R-CNN framework
The framework ties the steps above together: a shared convolutional backbone, an RoI pooling layer, fc layers, and two sibling output heads (softmax classification and per-class bounding-box regression).
4. Advantages of Fast R-CNN
(1) Higher accuracy (mAP) than R-CNN and SPPnet
(2) Training is single-stage, using a multi-task loss
(3) All network layers can be updated during training (SPPnet can only update its fc layers, which limits mAP)
(4) No disk storage is needed for feature caching

5. RoI pooling layer
    Function: (1) Locate each RoI in the image to the corresponding patch in the feature map
              (2) Downsample that feature-map patch into a fixed-size feature and pass it to the fully connected layers
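A toy implementation of these two functions, assuming a stride-16 backbone; cell boundaries are computed with simple rounding, whereas the paper's quantization differs in minor details:

    import numpy as np

    def roi_max_pool(feature_map, roi, output_size=(7, 7), stride=16):
        """feature_map: (C, H, W); roi: (x1, y1, x2, y2) in image coordinates."""
        # (1) project the RoI from image coordinates to feature-map coordinates
        x1, y1, x2, y2 = [int(round(c / stride)) for c in roi]
        patch = feature_map[:, y1:y2 + 1, x1:x2 + 1]

        # (2) divide the patch into an output_size grid and max-pool each cell
        C, h, w = patch.shape
        out_h, out_w = output_size
        ys = np.linspace(0, h, out_h + 1).astype(int)   # cell boundaries
        xs = np.linspace(0, w, out_w + 1).astype(int)
        out = np.zeros((C, out_h, out_w), dtype=feature_map.dtype)
        for i in range(out_h):
            for j in range(out_w):
                cell = patch[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                xs[j]:max(xs[j + 1], xs[j] + 1)]
                out[:, i, j] = cell.max(axis=(1, 2))
        return out

    fmap = np.random.randn(512, 38, 50)            # e.g. a 608x800 image at stride 16
    pooled = roi_max_pool(fmap, (64, 48, 400, 320))
    print(pooled.shape)                            # (512, 7, 7)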


6. Multi-task loss

There are two losses: the classification loss (L_cls) is computed over an (N+1)-way softmax output, where N is the number of object categories and the extra 1 is the background class;

the regression loss (L_loc) is computed over a 4×N-dimensional regressor output, meaning a separate bounding-box regressor is trained for each category.

          

The total loss combines them as

    L(p, k*, t, t*) = L_cls(p, k*) + λ [k* ≥ 1] L_loc(t, t*)

where p is the predicted class distribution, k* is the true class label, and t and t* are the predicted and ground-truth box offsets; λ = 1 in the paper, used to adjust the balance between the two losses;

When the RoI is background: k* = 0 → [k* ≥ 1] = 0, so the regression loss is switched off;

When the RoI is not background: k* ≥ 1 → [k* ≥ 1] = 1.
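A minimal sketch of this combined loss in PyTorch; the tensor names and shapes are illustrative assumptions (R RoIs, N object classes, label 0 = background):

    import torch
    import torch.nn.functional as F

    def multi_task_loss(cls_scores, bbox_deltas, labels, bbox_targets, lam=1.0):
        # cls_scores: (R, N+1) logits; bbox_deltas: (R, 4N) per-class offsets;
        # labels: (R,) true classes k* in {0..N}; bbox_targets: (R, 4)
        cls_loss = F.cross_entropy(cls_scores, labels)       # L_cls (softmax log loss)

        deltas = bbox_deltas.view(labels.shape[0], -1, 4)    # (R, N, 4), classes 1..N
        fg = labels >= 1                                     # the indicator [k* >= 1]
        if fg.any():
            # pick each foreground RoI's offsets for its true class k*
            picked = deltas[fg, labels[fg] - 1]
            loc_loss = F.smooth_l1_loss(picked, bbox_targets[fg])
        else:
            loc_loss = cls_scores.new_zeros(())              # all background: no L_loc
        return cls_loss + lam * loc_loss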

Here the regressor's loss is not L2 but smooth L1 (less sensitive to outliers, which prevents exploding gradients), defined as:

    smooth_L1(x) = 0.5 x²       if |x| < 1
                   |x| - 0.5    otherwise
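Written out explicitly, with its gradient bounded to ±1 for large errors (equivalent to torch.nn.functional.smooth_l1_loss with the default beta=1):

    import torch

    def smooth_l1(x):
        absx = x.abs()
        return torch.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

    x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0], requires_grad=True)
    smooth_l1(x).sum().backward()
    print(x.grad)    # tensor([-1.0000, -0.5000, 0.0000, 0.5000, 1.0000])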
7. Design evaluation
(1) Multi-task training works better than training the tasks separately
(2) Training with a single image scale performs about as well as multi-scale training
(3) More training data significantly improves results (the paper's only data augmentation: horizontal flip with 50% probability; see the sketch after this list)
(4) Do not blindly increase the number of proposals; too many can actually lower mAP
(5) It is not necessary to fine-tune all conv layers
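A one-line version of the flip augmentation from (3), shown with torchvision transforms (a tooling assumption; the original implementation is Caffe-based, and the image path is hypothetical):

    from PIL import Image
    import torchvision.transforms as T

    # horizontal flip with probability 0.5; note that flipping the image
    # also requires mirroring the ground-truth box x-coordinates
    augment = T.RandomHorizontalFlip(p=0.5)
    img = Image.open("example.jpg")
    maybe_flipped = augment(img)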




