Object Detection - Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

0. Preface

  • Relevant information:
  • Basic information of the paper
    • Field: object detection
    • Author unit: University of Hong Kong & Tongji University & ByteDance
    • Posting time: 2020.11
  • One-sentence summary: replace anchors with a fixed number of learnable boxes/features (independent of the backbone), turning the usual one/two-stage detection pipeline into a set prediction problem.

1. What problem to solve

  • Today's mature object detectors are built on dense priors (such as anchors or reference points).
  • But dense priors cause several problems:
    • Many near-duplicate results are detected, so post-processing (such as NMS) is required to filter them.
    • Many-to-one label assignment (the authors describe it as many-to-one positive and negative sample assignment). My guess at the meaning: predictions and ground-truth boxes are generally not matched one-to-one. Several predictions may correspond to the same ground-truth box, and we have to see which one fits it best.
    • The detection result depends heavily on the prior (the number and size of anchors, the density of reference points, the number of proposals generated).
  • DETR analysis
    • It is a sparse detector.
    • Existing problems: each object query interacts with the global information of the whole image, training converges slowly, and the overall pipeline is more complicated.
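
To make the duplicate-filtering problem concrete, here is a minimal sketch of greedy NMS, the post-processing step that dense detectors typically rely on. It is plain Python for illustration only, not the code of any particular detector; the box coordinates, scores and the 0.5 IoU threshold are made-up example values.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Three near-duplicate detections of one object plus one separate object
# (illustrative numbers only).
boxes = [(10, 10, 110, 110), (12, 8, 112, 108), (8, 12, 108, 112), (200, 200, 260, 260)]
scores = [0.90, 0.85, 0.80, 0.75]
kept = greedy_nms(boxes, scores)
print(kept)
```

Only the best-scoring box of the overlapping group and the separate box survive. A detector that predicts a set with one prediction per object does not need this step.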

2. What method was used

  • Method comparison

    • One-stage methods such as RetinaNet/YOLO/SSD are regarded as dense methods.
    • Earlier two-stage methods are dense-to-sparse: the RPN is dense, and the filtered RoIs are sparse.
    • The paper proposes a purely sparse method that works from a small set of learned proposals.
    • image-20201126102927309
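
A rough back-of-the-envelope count shows how different the two regimes are. The sketch below assumes an 800x1333 input, five feature-pyramid levels with strides 8 to 128 and 9 anchors per location (a RetinaNet-style setup), and 100 learned proposals on the sparse side. All of these numbers are assumptions chosen for illustration; the function name is hypothetical.

```python
import math


def dense_candidate_count(height, width, strides, anchors_per_location):
    """Count anchors placed on every location of every pyramid level."""
    total = 0
    for stride in strides:
        # Each level has one grid cell per `stride` pixels in each direction.
        cells = math.ceil(height / stride) * math.ceil(width / stride)
        total += cells * anchors_per_location
    return total


# Assumed RetinaNet-style configuration (illustrative, not from the blog post).
dense = dense_candidate_count(800, 1333, (8, 16, 32, 64, 128), 9)
sparse = 100  # assumed number of learned proposals

print("dense candidates:", dense)
print("sparse candidates:", sparse)
print("ratio:", dense // sparse)
```

Under these assumptions the dense detector scores on the order of a couple of hundred thousand candidates per image, while the sparse one scores a fixed, small set.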
  • Sparse R-CNN overall structure

    • The inputs are an image, a set of proposal boxes, and a set of proposal features.
    • An FPN backbone processes the image.
    • In the figure below, Proposal Boxes: N*4 is a set of learnable parameters that has nothing to do with the backbone.
    • The Proposal Features in the figure below are likewise independent of the backbone.
    • image-20201126103439702
  • Learnable proposal box

    • Independent of the backbone, so the same boxes are used for every input image.
    • Can be regarded as statistics of where objects are likely to be located.
    • They are parameters, so they are updated during training.
    • The fact that such a small learned set works suggests that the dense priors used by earlier one-stage methods are wasteful.
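
The idea of "boxes as parameters" can be sketched as a plain N x 4 table that does not depend on the image. The sketch below assumes a normalized (cx, cy, w, h) format and an initialization where every box covers the whole image; both are assumptions for illustration, as are the helper names. The manual overwrite of row 0 only stands in for what an optimizer would do during training.

```python
def init_proposal_boxes(num_proposals):
    """N x 4 table of (cx, cy, w, h) in [0, 1]; each row starts as the whole image."""
    return [[0.5, 0.5, 1.0, 1.0] for _ in range(num_proposals)]


def to_pixel_xyxy(boxes, image_height, image_width):
    """Scale normalized (cx, cy, w, h) rows to pixel (x1, y1, x2, y2) boxes."""
    out = []
    for cx, cy, w, h in boxes:
        out.append((
            (cx - w / 2) * image_width,
            (cy - h / 2) * image_height,
            (cx + w / 2) * image_width,
            (cy + h / 2) * image_height,
        ))
    return out


# The table is created once and shared by every image; no backbone is involved.
proposal_boxes = init_proposal_boxes(100)

# Hand-written stand-in for a training update: a real optimizer would move
# these four numbers by gradient descent.
proposal_boxes[0] = [0.25, 0.5, 0.5, 0.5]

pixel_boxes = to_pixel_xyxy(proposal_boxes, image_height=800, image_width=1333)
print(pixel_boxes[0])
print(pixel_boxes[1])
```

The only image-dependent step is the final scaling to pixel coordinates; the table itself is the same for every input.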
  • Learnable proposal feature

    • Independent of the backbone.
    • A proposal box is a concise, explicit way to describe an object, but it leaves out a lot of information, such as the object's shape and pose.
    • The proposal feature is used to carry this richer object information.
  • Dynamic instance interactive head

    • Extract the features of each object from the proposal boxes with an RoI operation, then combine them with the corresponding proposal feature to obtain the final prediction.
    • The number of heads equals the number of learnable boxes, i.e. head, learnable proposal box and learnable proposal feature are in one-to-one correspondence.
    • image-20201126112747592
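
One way to picture the one-to-one interaction is to let each proposal feature generate the weights that filter its own RoI feature. The sketch below is a simplified, pure-Python illustration under that assumption: the two generated matrices, the toy sizes, the random weights and every function name are hypothetical, and normalization layers and the final class/box predictors are left out. It is not the authors' implementation.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def dynamic_interaction(proposal_feature, roi_feature, w_gen, w_out, hidden):
    """proposal_feature: C floats; roi_feature: S*S rows of C floats."""
    c = len(proposal_feature)
    # The proposal feature generates the weights applied to its own RoI feature.
    params = matmul([proposal_feature], w_gen)[0]            # length 2 * C * hidden
    split = c * hidden
    param1 = [params[i * hidden:(i + 1) * hidden] for i in range(c)]            # C x hidden
    param2 = [params[split + i * c:split + (i + 1) * c] for i in range(hidden)]  # hidden x C
    feats = relu(matmul(roi_feature, param1))                # S*S x hidden
    feats = relu(matmul(feats, param2))                      # S*S x C
    flat = [v for row in feats for v in row]                 # length S*S*C
    return matmul([flat], w_out)[0]                          # length C


rng = random.Random(0)
C, HIDDEN, S, N = 8, 2, 3, 4  # toy sizes, chosen only to keep the example small
w_gen = random_matrix(C, 2 * C * HIDDEN, rng)   # shared weight generator
w_out = random_matrix(S * S * C, C, rng)        # shared output projection
proposal_features = random_matrix(N, C, rng)    # one learnable feature per proposal
roi_features = [random_matrix(S * S, C, rng) for _ in range(N)]  # stand-in for RoI output

# Proposal k only ever interacts with RoI feature k.
object_features = [dynamic_interaction(p, r, w_gen, w_out, HIDDEN)
                   for p, r in zip(proposal_features, roi_features)]
print(len(object_features), len(object_features[0]))
```

Each of the N proposals yields one object feature of length C, from which a class score and a box would then be predicted.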

3. How effective is it

  • image-20201126113016200

4. What are the problems & what can be learned

  • It feels like DETR with the entire transformer removed, which is quite interesting.
  • It also feels like RetinaNet with the anchors removed and replaced by learnable proposal boxes/features, and it still works this well.

Origin blog.csdn.net/irving512/article/details/110181911