0. Preface
- Relevant information:
- Basic information of the paper
- Field: object detection
- Author affiliations: University of Hong Kong, Tongji University, ByteDance
- Publication date: 2020.11
- One-sentence summary: replace anchors with a fixed number of learnable boxes/features (independent of the backbone), thereby turning the original one-/two-stage detection pipeline into a set prediction problem
1. What problem to solve
- Today's mature object detection algorithms are built on dense priors (such as anchors and reference points)
- But dense priors cause several problems
- Many near-duplicate results are detected, so post-processing (such as NMS) is required to filter them.
- Many-to-one label assignment (the authors describe it as many-to-one positive and negative sample assignment). My guess at the meaning: predictions and ground truths are generally not in one-to-one correspondence; several predictions may be candidates for the same gt, and one has to decide which of them match it best.
- The detection result is tightly coupled to the prior (the number and size of anchors, the density of reference points, the number of proposals generated)
- DETR analysis
- It is a sparse detector.
- Remaining problems: each object query interacts with the global context of the image, training converges slowly, and the overall pipeline is more complicated.
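To make the post-processing cost of dense priors concrete, here is a minimal sketch of greedy NMS in plain Python. It is not taken from the paper; the function names, the (x1, y1, x2, y2) box format, and the 0.5 threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, drop any box that
    overlaps an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep  # indices into boxes, highest score first
```

A detector that emits one prediction per object (set prediction) does not need this step, which is the motivation behind the sparse design discussed in the next section.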
2. What method was used
- Method comparison
- One-stage methods such as RetinaNet/YOLO/SSD are considered dense methods.
- Earlier two-stage methods are dense-to-sparse: the RPN is dense, while the filtered RoIs are sparse.
- The paper aims for a purely sparse method, in which the proposals are learned directly.
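A rough back-of-the-envelope count shows the gap between dense and sparse candidates. The numbers below are assumptions for illustration, not values from these notes: a RetinaNet-style setup with FPN strides 8 to 128 and 9 anchors per location, an 800x1344 input, and a hypothetical budget of 100 learned proposals.

```python
import math


def count_dense_anchors(height, width, strides=(8, 16, 32, 64, 128),
                        anchors_per_location=9):
    """Total anchors over all pyramid levels for one image."""
    total = 0
    for s in strides:
        # Feature map size at this level, rounded up.
        total += math.ceil(height / s) * math.ceil(width / s) * anchors_per_location
    return total


dense = count_dense_anchors(800, 1344)
sparse = 100  # assumed number of learned proposals
print(dense, sparse, dense // sparse)  # 201600 100 2016
```

Under these assumptions the dense detector scores roughly two thousand times more candidates per image than the sparse one.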
- Sparse R-CNN overall structure
- The inputs are an image, a set of proposal boxes, and a set of proposal features.
- An FPN backbone processes the image
- In the structure figure, Proposal Boxes: N*4 is a set of learnable parameters that has nothing to do with the backbone
- The proposal features in the figure likewise have nothing to do with the backbone
- Learnable proposal box
- Has nothing to do with the backbone
- Can be regarded as statistics of where objects are likely to appear
- The parameters are updated during training
- That such a structure works at all suggests that the dense priors used by earlier one-stage methods are wasteful.
- Learnable proposal feature
- Has nothing to do with the backbone
- The proposal box above is a relatively concise and explicit way to describe an object, but it loses a lot of information, such as the object's shape and pose.
- The proposal feature is used to carry this richer object information.
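The two learnable inputs can be pictured as plain parameter arrays that are shared by every image. The following NumPy sketch is only an illustration: the class name, the normalized (cx, cy, w, h) layout, the whole-image initialization, and the sizes 100 and 256 are all assumptions, not details confirmed by these notes.

```python
import numpy as np


class LearnableProposals:
    """Hypothetical container for N proposal boxes and N proposal features."""

    def __init__(self, num_proposals=100, feature_dim=256, seed=0):
        rng = np.random.default_rng(seed)
        # Normalized (cx, cy, w, h); assumed init: each box covers the whole image.
        self.boxes = np.tile(np.array([0.5, 0.5, 1.0, 1.0]), (num_proposals, 1))
        # One feature vector per box, randomly initialized.
        self.features = rng.normal(size=(num_proposals, feature_dim))

    def for_batch(self, image_sizes):
        """Scale the shared boxes to each image; image_sizes holds (width, height)."""
        cx, cy, w, h = self.boxes.T
        xyxy = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
        scale = np.array([[wd, ht, wd, ht] for wd, ht in image_sizes], dtype=float)
        boxes = xyxy[None, :, :] * scale[:, None, :]  # (B, N, 4)
        # The same features are reused for every image in the batch.
        feats = np.broadcast_to(self.features,
                                (len(image_sizes),) + self.features.shape)
        return boxes, feats

    def sgd_step(self, grad_boxes, grad_features, lr=0.01):
        """Both arrays are ordinary trainable parameters updated by gradients."""
        self.boxes -= lr * grad_boxes
        self.features -= lr * grad_features
```

Nothing in `for_batch` looks at pixel values or backbone outputs, which is what "has nothing to do with the backbone" means here: the image only enters later, when RoI features are extracted inside the boxes.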
- Dynamic instance interactive head
- Extract the features of each object from the proposal boxes via RoI pooling, then combine them with the corresponding proposal feature to obtain the final prediction.
- The number of heads equals the number of learnable boxes, i.e. head / learnable proposal box / learnable proposal feature are in one-to-one correspondence.
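One way to read the one-to-one correspondence is that each proposal feature generates the weights used to process its own RoI feature. The NumPy sketch below follows that reading under several assumptions: the function and argument names are hypothetical, the two generated kernels act like per-proposal 1x1 convolutions, and normalization layers and the final classification/regression layers are left out for brevity.

```python
import numpy as np


def dynamic_instance_interaction(proposal_feats, roi_feats, w_gen, w_out):
    """
    proposal_feats: (N, C)      one learnable feature per proposal
    roi_feats:      (N, S*S, C) RoI-pooled features, one grid per proposal box
    w_gen:          (C, 2*C*C_mid) static weights that generate dynamic kernels
    w_out:          (S*S*C, C)  static projection back to an object feature
    """
    n, ss, c = roi_feats.shape
    c_mid = w_gen.shape[1] // (2 * c)
    # Each proposal feature produces its own pair of kernels.
    params = proposal_feats @ w_gen                    # (N, 2*C*C_mid)
    k1 = params[:, : c * c_mid].reshape(n, c, c_mid)   # (N, C, C_mid)
    k2 = params[:, c * c_mid:].reshape(n, c_mid, c)    # (N, C_mid, C)
    # Batched matmul: proposal i only ever touches RoI feature i.
    x = np.maximum(roi_feats @ k1, 0.0)                # (N, S*S, C_mid)
    x = np.maximum(x @ k2, 0.0)                        # (N, S*S, C)
    return x.reshape(n, ss * c) @ w_out                # (N, C)
```

Because the interaction is a batched product over the proposal axis, changing the RoI feature of one proposal leaves the outputs of all other proposals untouched, unlike DETR-style queries that attend to the whole image.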
3. How effective is it
4. What are the problems & what can be learned
- It feels like taking DETR and removing the entire transformer, which is quite interesting.
- It also feels like removing the anchors from RetinaNet and replacing them with learnable proposal boxes/features, and it is surprising that this still works so well.