CPNDet: Corner Proposal Network for Anchor-free, Two-stage Object Detection (paper reading)

Paper: https://arxiv.org/abs/2007.13816

Code: https://github.com/Duankaiwen/CPNDet (expected to be open-sourced in August 2020)

1. Paper and abstract

                                  《Corner Proposal Network for Anchor-free, Two-stage Object Detection》

      Abstract: The goal of object detection is to determine the category and location of each object in an image. This paper proposes an anchor-free, two-stage detection framework. The first stage detects corners and combines them to extract regions of interest; the second stage assigns a class label to each region. Experiments show that this two-stage detector can be trained end-to-end and improves both the recall and the precision of detection. The method, called CPNDet, can detect objects of arbitrary shape and effectively avoids false detections. It achieves 49.2% mAP on the COCO dataset, which is competitive with state-of-the-art detectors. Inference is also efficient: the method reaches 41.6%/39.7% mAP at 26.2/43.3 FPS. Open source code address: https://github.com/Duankaiwen/CPNDet .

2. Method and implementation

      2.1 Anchor-based or Anchor-free?

      The author's core point is that anchor-free methods can better locate objects of arbitrary shape and therefore achieve a higher recall rate.

      Anchor-based methods require anchors to be set empirically. They are not flexible enough and tend to miss objects with unusual shapes. As shown in the following table, the anchor-based Faster R-CNN has a low recall rate, especially for large objects and for objects whose aspect ratio exceeds 5:1 (because no anchor is defined for such aspect ratios).

       [Table image not available: recall of anchor-based and anchor-free detectors by object size and aspect ratio]
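To make the aspect-ratio argument concrete, here is a minimal sketch that computes the best IoU an elongated box can reach against a small anchor set. The anchor scales (128, 256, 512) and ratios (1:2, 1:1, 2:1) are assumed here as typical Faster R-CNN defaults, and every anchor is assumed to be perfectly centered on the object, which is the best case. The function names are hypothetical and are not taken from the paper or its code.

```python
import math


def centered_iou(w1, h1, w2, h2):
    """IoU of two axis-aligned boxes that share the same center."""
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union


def best_anchor_iou(box_w, box_h, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Best IoU between a box and any anchor in the (assumed) anchor set."""
    best = 0.0
    for s in scales:
        for r in ratios:
            # Anchor with area s*s and width/height ratio r.
            anchor_w = s * math.sqrt(r)
            anchor_h = s / math.sqrt(r)
            best = max(best, centered_iou(box_w, box_h, anchor_w, anchor_h))
    return best


if __name__ == "__main__":
    print(best_anchor_iou(256, 256))  # square box
    print(best_anchor_iou(640, 80))   # 8:1 box
```

Under these assumptions, a 640x80 box (aspect ratio 8:1) reaches a best IoU of only about 0.33, while a 256x256 box is matched exactly, which illustrates why elongated objects are hard to cover with a fixed anchor set.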

       2.2 One-stage or Two-stage?

       To address the inaccurate class labels predicted at CornerNet's corners, the author adopts a two-stage approach. The proposals are first passed through a binary classifier to remove false detections, and the surviving proposals are then assigned a category by a multi-class classifier. Since about 80% of the proposals are removed by the binary classifier, detection efficiency improves. The second stage uses the information inside each proposal, so its classification is more accurate.

       The loss function of the binary classifier is as follows:

       [Equation image not available]

       N: the number of positive samples;

       pm: the objectness score of the m-th proposal;

       IOUm: the maximum IoU between the m-th proposal and the ground-truth boxes.

       The loss function of the multi-class classifier is as follows:

       [Equation image not available]

         q(m,c): the score of the m-th proposal belonging to the c-th category.

      2.3 The overall architecture of the network

           The framework of the algorithm is shown in the figure below:

           [Figure image not available: overall architecture of CPNDet]
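The first stage, which builds proposals by combining corners, might be sketched as follows. This is only an illustration under stated assumptions: corners are represented as hypothetical `(x, y, class_id, score)` tuples, a pair is kept only if the two corners share a class (an assumption not spelled out in this post), and the bottom-right corner must lie strictly below and to the right of the top-left corner. The proposal score is the average of the two corner scores.

```python
def corner_proposals(top_lefts, bottom_rights):
    """Enumerate candidate boxes from top-left and bottom-right corners.

    Each corner is a tuple (x, y, class_id, score).
    Returns a list of (x1, y1, x2, y2, class_id, corner_score).
    """
    proposals = []
    for x1, y1, c1, s1 in top_lefts:
        for x2, y2, c2, s2 in bottom_rights:
            if c1 != c2:
                continue  # assumption: both corners must predict the same class
            if x2 <= x1 or y2 <= y1:
                continue  # geometrically invalid box
            proposals.append((x1, y1, x2, y2, c1, 0.5 * (s1 + s2)))
    return proposals


if __name__ == "__main__":
    tls = [(10, 10, 0, 0.9), (50, 50, 1, 0.8)]
    brs = [(40, 40, 0, 0.7), (5, 5, 0, 0.6), (90, 90, 1, 0.5)]
    for p in corner_proposals(tls, brs):
        print(p)
```

Because every valid pair becomes a proposal, the candidate set grows quickly with the number of corners, which is why the later filtering steps matter.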

             The overall loss function:             

              Note: The first two terms are the corner (key point) localization loss and the offset loss; the last two terms are the binary classification loss and the multi-class classification loss.
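The equation image for the overall loss did not survive. Based only on the note above, a four-term sum consistent with that description could be written as below. The symbol names and the equal weighting of the terms are assumptions made for illustration, not a transcription of the paper's equation.

```latex
% Assumed notation: corner detection loss, corner offset loss,
% binary (proposal) classification loss, multi-class classification loss.
L = L_{\mathrm{det}}^{\mathrm{corner}}
  + L_{\mathrm{offset}}^{\mathrm{corner}}
  + L_{\mathrm{prop}}
  + L_{\mathrm{class}}
```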

3. Inference stage and implementation details

      1. Inference stage

          Inference follows the same pipeline as training, except that a threshold is applied to filter out low-quality proposals. Since the predicted scores of the proposals are generally low, the author chooses a relatively low threshold (0.2 in the paper) so that more proposals are retained. With a threshold of 0.2, about 20% of the proposals are retained on average, and the computation spent on binary classification is about 10% of that spent on multi-class classification, which keeps the second stage efficient.
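The thresholding step can be sketched in a few lines. This is a minimal illustration, assuming proposals and their binary (objectness) scores are available as parallel lists; the function name is hypothetical, and whether the comparison is inclusive is an assumption.

```python
def keep_confident(proposals, objectness_scores, threshold=0.2):
    """Keep proposals whose binary-classifier score reaches the threshold."""
    return [p for p, s in zip(proposals, objectness_scores) if s >= threshold]


if __name__ == "__main__":
    boxes = ["a", "b", "c", "d", "e"]
    scores = [0.9, 0.1, 0.05, 0.19, 0.2]
    kept = keep_confident(boxes, scores)
    print(kept, len(kept) / len(boxes))
```

Only the proposals that pass this filter are sent on to the more expensive multi-class classifier.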

          For each proposal that survives to the second stage, S1 denotes the corner classification score (the two corners have different scores, so their average is taken as S1), and S2 denotes the multi-class classification score. The final category score is computed as Sc = (S1 + 0.5)(S2 + 0.5) and then normalized to [0,1]. Finally, the author selects the 100 highest-scoring proposals for evaluation. As the following table shows, the two classifiers provide complementary information, which greatly improves detection accuracy.

          [Table image not available: effect of the two classifiers on detection accuracy]
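The score fusion described above can be sketched as follows. The formula Sc = (S1 + 0.5)(S2 + 0.5) is taken from the text; how the result is normalized to [0,1] is not stated, so the sketch assumes a linear rescaling based on the theoretical range [0.25, 2.25] that the product takes when both scores lie in [0,1]. Function names and the detection tuple layout are hypothetical.

```python
def fuse_scores(corner_tl, corner_br, class_score):
    """Combine corner and multi-class scores into one score in [0, 1]."""
    s1 = 0.5 * (corner_tl + corner_br)  # average of the two corner scores
    s2 = class_score
    sc = (s1 + 0.5) * (s2 + 0.5)
    # Assumption: rescale linearly from [0.25, 2.25] to [0, 1].
    return (sc - 0.25) / 2.0


def top_k(detections, k=100):
    """Keep the k highest-scoring detections; each item is (box, score)."""
    return sorted(detections, key=lambda d: d[1], reverse=True)[:k]


if __name__ == "__main__":
    dets = [
        ((0, 0, 10, 10), fuse_scores(0.9, 0.7, 0.8)),
        ((5, 5, 20, 20), fuse_scores(0.3, 0.2, 0.9)),
    ]
    for box, score in top_k(dets, k=100):
        print(box, round(score, 3))
```

Adding 0.5 to each score before multiplying means that a low value from one classifier reduces the final score without driving it to zero, so either classifier can partly compensate for the other.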

            2. Implementation details

            The author uses PyTorch, and the code builds on CornerNet, mmdetection and CenterNet. CornerNet and CenterNet serve as baselines, and HG-104, HG-52 and DLA-34 are used as backbone networks to extract corner information. The changes to the backbone mainly follow CenterNet, but the deformable convolutional layers are replaced with ordinary convolutional layers (in my personal opinion, possibly out of consideration for deployment).

             In training, all models are randomly initialized except DLA-34, which uses a pre-trained model. Cascade corner pooling is used to help the network detect corners. The input image is 512x512, and the feature map obtained after 4× downsampling is 128x128. Training uses the Adam optimizer with a batch size of 48 and an initial learning rate of 2.5 × 10^-4; the learning rate is multiplied by 0.1 after 5K iterations of training.
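The numbers in this paragraph can be checked with a small sketch. The values (512 input, 4× downsampling, base learning rate 2.5 × 10^-4, decay factor 0.1, 5K iterations) come from the text above; reading the schedule as a single step decay is an assumption, and the function names are hypothetical rather than part of the released code.

```python
def feature_map_size(input_size=512, stride=4):
    """Spatial size of the feature map after downsampling by `stride`."""
    return input_size // stride


def learning_rate(iteration, base_lr=2.5e-4, decay_at=5000, factor=0.1):
    """Single step decay: multiply the base rate by `factor` from `decay_at` on."""
    return base_lr * factor if iteration >= decay_at else base_lr


if __name__ == "__main__":
    print(feature_map_size())   # 512 // 4
    print(learning_rate(0))     # before the decay point
    print(learning_rate(5000))  # after the decay point
```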

             Testing uses horizontal flipping as test-time augmentation. For multi-scale testing, the image is resized to 0.6×, 1×, 1.2×, 1.5×, and 1.8× of its original resolution, soft-NMS is used to suppress redundant boxes, and the 100 highest-scoring boxes are kept for evaluation.
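For readers unfamiliar with soft-NMS, here is a minimal pure-Python sketch of its Gaussian variant, where the score of each remaining box is decayed by exp(-IoU^2 / sigma) instead of the box being discarded outright. The post does not say which soft-NMS variant or parameters the author used, so `sigma`, `score_threshold`, and the choice of the Gaussian form are assumptions for illustration only.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001, max_keep=100):
    """Gaussian soft-NMS; returns a list of (box, score), best first."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining and len(kept) < max_keep:
        # Pick the box with the highest current score.
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best)
        kept.append((best_box, best_score))
        # Decay the scores of the others according to their overlap.
        rescored = []
        for box, score in remaining:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_threshold:
                rescored.append((box, score))
        remaining = rescored
    return kept


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30)]
    for box, score in soft_nms(boxes, [0.9, 0.8, 0.7]):
        print(box, round(score, 3))
```

In this example the duplicate box is not removed but its score drops sharply, so it ends up ranked below the non-overlapping box.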

4. Experiment and effect

      Accuracy comparison:

      [Image not available]

      Speed comparison:

      [Image not available]

      Comparison of detection results:

      [Image not available]

5. Summary

      In this paper, the author proposes an anchor-free, two-stage object detection framework. The method first extracts key points (corners) and generates regions of interest by combining them. Each region of interest is then sent through two classifiers, a binary one followed by a multi-class one, to filter out false detections. The algorithm greatly improves both recall and accuracy, and its final results rank among the best of existing detection algorithms.

       More importantly, the anchor-free approach extracts proposals more flexibly, and the two classifiers improve detection accuracy. With this two-stage framework, detection can still be completed efficiently, so the choice between a one-stage and a two-stage detector no longer seems so important.

 

The above is the blogger's understanding of the paper. If you would like to discuss it, please leave a message or contact QQ: 1151583746.


Origin blog.csdn.net/Guo_Python/article/details/107715384