Detailed explanation of the paper "DETRs Beat YOLOs on Real-time Object Detection"

Paper title: DETRs Beat YOLOs on Real-time Object Detection

Paper address: https://arxiv.org/abs/2304.08069

Paper code: https://github.com/lyuwenyu/RT-DETR

Blogger's aside: please wait until I graduate before beating YOLO; graduate students everywhere object.

1. Summary

        Recently, end-to-end Transformer-based detectors (DETRs) have achieved remarkable performance. However, the high computational cost of DETRs has not been effectively addressed, limiting their practical application and preventing them from fully exploiting the advantage of needing no post-processing such as non-maximum suppression (NMS). This paper first analyzes the impact of NMS on the inference speed of modern real-time object detectors and establishes an end-to-end speed benchmark. To avoid the inference delay caused by NMS, we propose the Real-Time DEtection TRansformer (RT-DETR), to our knowledge the first real-time end-to-end object detector. Specifically, we design an efficient hybrid encoder that handles multi-scale features by decoupling intra-scale interaction from cross-scale fusion, and propose IoU-aware query selection to improve the initialization of object queries. Furthermore, the proposed detector supports flexible adjustment of inference speed by using different numbers of decoder layers without retraining, which facilitates the practical deployment of real-time object detectors. RT-DETR-L achieves 53.0% AP and 114 FPS on a T4 GPU on COCO val2017, and RT-DETR-X achieves 54.8% AP and 74 FPS, outperforming all YOLO detectors of the same scale in both speed and accuracy. In addition, RT-DETR-R50 achieves a 2.2% AP improvement over DINO-Deformable-DETR-R50 and roughly 21 times higher FPS.

2. Main contributions

The main contributions of this paper are summarized as follows:

(1) We propose the first real-time end-to-end object detector, which not only outperforms current state-of-the-art real-time detectors in both accuracy and speed, but also requires no post-processing, so its inference speed is free of delays and remains stable

(2) We analyze the impact of NMS on real-time detectors in detail and draw conclusions about CNN-based real-time detectors from a post-processing perspective

(3) Our proposed IoU-aware query selection brings a clear performance improvement to our model, and provides a new idea for improving the initialization scheme of object queries

(4) Our work provides a feasible solution for the real-time implementation of an end-to-end detector, and the proposed detector can flexibly adjust the model size and inference speed by using different decoder layers without retraining

3. Related work

3.1 Real-time object detector
       After years of continuous development, the YOLO series has become synonymous with real-time object detectors, which can be roughly divided into two categories: anchor-based and anchor-free. Judging from the performance of these detectors, anchors are no longer the main factor limiting the development of YOLO. However, these detectors produce a large number of redundant bounding boxes, which need to be filtered by NMS in the post-processing stage. Unfortunately, this leads to a performance bottleneck, and the hyperparameters of NMS have a significant impact on the accuracy and speed of the detector. We argue that this is incompatible with the design philosophy of real-time object detectors.

3.2 End-to-End Object Detectors
       End-to-end object detectors are famous for their simplified pipeline. DETR eliminates the manually designed anchor and NMS components in traditional detection pipelines. Instead, it adopts bipartite graph matching and directly predicts one-to-one object sets. By adopting this strategy, DETR simplifies the detection process and alleviates the performance bottleneck caused by NMS. Despite DETR's obvious advantages, it suffers from two major problems: slow training convergence and hard-to-optimize queries. Many DETR variants have been proposed to address these issues. Specifically, Deformable-DETR accelerates the training convergence of multi-scale features by enhancing the efficiency of the attention mechanism. Conditional DETR and Anchor DETR reduce the difficulty of query optimization. DAB-DETR introduces a 4D reference point and iteratively optimizes the prediction box layer by layer. DN-DETR speeds up training convergence by introducing query denoising. DINO improves on previous work and achieves state-of-the-art results. Although we are continuously improving the components of DETR, our goal is not only to further improve the performance of the model, but also to create a real-time end-to-end object detector.

3.3 Multi-scale features for object detection
       Modern object detectors have demonstrated the importance of exploiting multi-scale features to improve performance, especially for small objects. FPN introduces a feature pyramid network that fuses features from adjacent scales. Subsequent work extended and enhanced this structure, and it has been widely used in real-time object detectors. Although the deformable attention mechanism alleviates the computational cost to some extent, the fusion of multi-scale features still incurs a high computational burden. To solve this problem, some works try to design computationally efficient DETRs. Efficient DETR reduces the number of encoder and decoder layers by using dense priors to initialize object queries. Sparse DETR selectively updates encoder tokens to reduce the computational overhead of the decoder. Lite DETR enhances the efficiency of the encoder by reducing the update frequency of low-level features in an interleaved way. Although these studies reduce the computational cost of DETR, none of them aims to turn DETR into a real-time detector.

4. Real-time DETR

4.1 Model Overview
       The proposed RT-DETR consists of a backbone, an efficient hybrid encoder, and a transformer decoder with auxiliary prediction heads. An overview of the model architecture is shown in the figure.

Specifically:

(1) First, the output features of the last three stages of the backbone, S3, S4, and S5, are used as the input to the encoder;

(2) Then, a hybrid encoder converts multi-scale features into a sequence of image features via intra-scale interaction and cross-scale fusion (as described in Section 4.2);

(3) Subsequently, IoU-aware query selection is used to select a fixed number of image features from the encoder output sequence as the initial object queries for the decoder;

(4) Finally, the decoder with an auxiliary prediction head iteratively refines the object query to generate boxes and confidence scores.
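
To make the four steps above concrete, here is a minimal PyTorch-style sketch of the overall data flow. The module names, argument names, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RTDETRSketch(nn.Module):
    """Illustrative forward pass of RT-DETR (hypothetical module names)."""

    def __init__(self, backbone, hybrid_encoder, query_selector, decoder):
        super().__init__()
        self.backbone = backbone              # CNN producing S3, S4, S5
        self.hybrid_encoder = hybrid_encoder  # AIFI + CCFM (Section 4.2)
        self.query_selector = query_selector  # IoU-aware query selection (Section 4.3)
        self.decoder = decoder                # transformer decoder + auxiliary heads

    def forward(self, images):
        # (1) last three backbone stages as encoder input
        s3, s4, s5 = self.backbone(images)

        # (2) intra-scale interaction + cross-scale fusion -> sequence of image features
        memory = self.hybrid_encoder(s3, s4, s5)          # [B, L, C]

        # (3) select a fixed number of features as initial object queries
        queries, ref_points = self.query_selector(memory, num_queries=300)

        # (4) decoder iteratively refines the queries into boxes and scores
        boxes, scores = self.decoder(queries, memory, ref_points)
        return boxes, scores
```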

4.2 High-efficiency hybrid encoder
       (1) Computational bottleneck analysis. To speed up training convergence and improve performance, Zhu et al. [43] introduce multi-scale features and propose the deformable attention mechanism to reduce computation. However, although the improved attention mechanism reduces the computational overhead, the sharp increase in the length of the input sequence still makes the encoder a computational bottleneck, hindering the real-time implementation of DETR. As reported in [17], the encoder accounts for 49% of the GFLOPs of Deformable-DETR [43] but contributes only 11% of the AP. To overcome this obstacle, we analyze the computational redundancy present in the multi-scale transformer encoder and design a set of variants to demonstrate that simultaneous intra-scale and cross-scale feature interaction is computationally inefficient.

        High-level features, which are extracted from low-level features, contain rich semantic information about the objects in an image. Intuitively, performing feature interaction on the concatenated multi-scale features is therefore redundant. As shown in Figure 5, to test this idea the authors rethink the encoder structure and design a series of variants with different encoders.

       This group of variants gradually improves model accuracy while significantly reducing computational cost by decoupling multi-scale feature interaction into the two-step operations of intra-scale interaction and cross-scale fusion. We first remove the multi-scale transformer encoder from DINO-R50 as baseline A. Next, different forms of encoders are plugged in to produce a series of variants based on baseline A, as follows:

• A → B: Variant B inserts a single-scale transformer encoder that uses one layer of transformer blocks. The features of each scale share the encoder for intra-scale feature interaction, and the output multi-scale features are then concatenated.
• B → C: Variant C introduces cross-scale feature fusion on the basis of B, feeding the concatenated multi-scale features into the encoder for feature interaction.
• C → D: Variant D decouples the intra-scale interaction and cross-scale fusion of multi-scale features. First, a single-scale transformer encoder is used for intra-scale interaction, and then a PANet-like structure is used for cross-scale fusion.

• D → E: Variant E further optimizes the intra-scale interaction and cross-scale fusion of multi-scale features on the basis of D, adopting an efficient hybrid encoder designed by us.

       (2) Hybrid design

       Based on the above analysis, the authors rethink the encoder structure and propose a new efficient hybrid encoder. As shown in Fig. 3, the proposed encoder consists of two modules: the attention-based intra-scale feature interaction (AIFI) module and the CNN-based cross-scale feature fusion module (CCFM). AIFI further reduces computational redundancy relative to variant D by performing intra-scale interaction only on S5. The authors argue that applying self-attention to high-level features, which carry richer semantic concepts, captures the connections between conceptual entities in the image, which helps subsequent modules detect and recognize objects. Meanwhile, intra-scale interaction for lower-level features is unnecessary, due both to their lack of semantic concepts and to the risk of duplication and confusion when they interact with high-level features. To test this idea, intra-scale interaction is performed only on S5 in variant D; the experimental results are shown in Table 3, row DS5. Compared with variant D, DS5 significantly reduces latency (35% faster) while improving accuracy (0.4% higher AP). This conclusion is crucial for the design of real-time detectors.
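
To see why restricting self-attention to S5 pays off, a rough token-count calculation helps. The numbers below assume a 640×640 input and the typical S3/S4/S5 strides of 8, 16, and 32; these assumptions are for illustration only and are not stated in the paragraph above.

```python
# Token counts per scale for a 640x640 input, assuming strides 8, 16, 32.
tokens_s3 = (640 // 8) ** 2    # 6400 tokens
tokens_s4 = (640 // 16) ** 2   # 1600 tokens
tokens_s5 = (640 // 32) ** 2   # 400 tokens

# Self-attention cost grows quadratically with sequence length, so attending
# over the concatenated multi-scale sequence (as in variant C) involves far
# more token pairs than attending over S5 alone (as in AIFI / variant DS5).
concat_len = tokens_s3 + tokens_s4 + tokens_s5   # 8400 tokens
print((concat_len ** 2) / (tokens_s5 ** 2))      # 441.0: ~441x more attention pairs
```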

        CCFM is also optimized based on variant D, inserting several fusion blocks consisting of convolutional layers in the fusion path. The role of the fusion block is to fuse adjacent features into a new feature, and its structure is shown in Figure 4. The fusion block contains N RepBlocks, and the two path outputs are fused by element-wise addition.
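
As a concrete illustration of the fusion block described above, here is a minimal PyTorch-style sketch. The channel handling and the RepBlock definition (approximated here by plain conv-BN-activation blocks in the spirit of RepVGG) are assumptions made for this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Conv + BatchNorm + activation, standing in for a RepBlock at train time."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class FusionBlock(nn.Module):
    """Fuses two adjacent-scale features: concatenation, 1x1 projections,
    N RepBlock-style conv blocks on one path, then element-wise addition."""

    def __init__(self, channels, num_blocks=3):
        super().__init__()
        self.proj1 = ConvBNAct(2 * channels, channels, k=1)
        self.proj2 = ConvBNAct(2 * channels, channels, k=1)
        self.blocks = nn.Sequential(
            *[ConvBNAct(channels, channels, k=3) for _ in range(num_blocks)]
        )

    def forward(self, feat_a, feat_b):
        x = torch.cat([feat_a, feat_b], dim=1)              # adjacent features
        return self.blocks(self.proj1(x)) + self.proj2(x)   # element-wise add
```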
 

This process can be expressed as follows:

Q = K = V = \mathrm{Flatten}(S_5)

F_5 = \mathrm{Reshape}(\mathrm{Attn}(Q, K, V))

\mathrm{Output} = \mathrm{CCFM}(\{S_3, S_4, F_5\})

where Attn denotes multi-head self-attention, and Reshape restores the feature to the same shape as S5, i.e. it is the inverse of Flatten.
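
The three equations above translate almost directly into code. Below is a minimal sketch of the AIFI step, using PyTorch's standard transformer encoder layer as a stand-in for the attention block; the positional encoding and layer hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AIFI(nn.Module):
    """Intra-scale interaction on S5 only: Flatten -> self-attention -> Reshape."""

    def __init__(self, channels, num_heads=8, ffn_dim=1024):
        super().__init__()
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, dim_feedforward=ffn_dim,
            batch_first=True)

    def forward(self, s5):                        # s5: [B, C, H, W]
        b, c, h, w = s5.shape
        q = s5.flatten(2).permute(0, 2, 1)        # Q = K = V = Flatten(S5): [B, H*W, C]
        f5 = self.encoder_layer(q)                # Attn(Q, K, V) plus FFN
        return f5.permute(0, 2, 1).reshape(b, c, h, w)   # Reshape back to [B, C, H, W]

# F5 produced this way is then passed to CCFM together with S3 and S4:
#   output = ccfm([s3, s4, aifi(s5)])
```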

4.3 IoU-aware query selection

       Object queries in DETR are a set of learnable embeddings that are optimized by the decoder and mapped to classification scores and bounding boxes by the prediction head. DINO initializes its object queries by using the classification scores to select the top K features from the encoder. However, because the distributions of classification score and localization confidence are inconsistent, some predicted boxes have high classification scores but are not close to the GT boxes; as a result, boxes with high classification scores but low IoU are selected, while boxes with low classification scores but high IoU are discarded. This impairs the performance of the detector. To address this issue, this paper proposes IoU-aware query selection, which constrains the model during training to produce high classification scores for features with high IoU and low classification scores for features with low IoU. Consequently, the predicted boxes corresponding to the top K encoder features selected by classification score have both high classification scores and high IoU. This paper reformulates the optimization objective used in DETR's bipartite matching as follows:

\mathcal{L}(\hat{y}, y) = \mathcal{L}_{box}(\hat{b}, b) + \mathcal{L}_{cls}(\hat{c}, c, \mathrm{IoU})

        where \hat{y} and y denote the prediction and the ground truth, respectively, and c and b denote categories and bounding boxes, respectively. We introduce the IoU score into the objective function of the classification branch to enforce consistency between the classification and localization of positive samples. With IoU-aware query selection, the model achieves a significant improvement in COCO AP.
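
The exact form of L_cls is not spelled out in this passage, but a common way to inject the IoU into the classification target is a binary cross-entropy whose positive targets are the IoU scores (in the spirit of VariFocal/quality-focal losses). The sketch below assumes that formulation purely for illustration, together with the top-K selection by classification score; names and shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def iou_aware_cls_loss(cls_logits, target_labels, target_ious, num_classes):
    """BCE-style classification loss whose positive targets are IoU scores.

    cls_logits:    [N, num_classes] raw logits for N queries
    target_labels: [N] class indices, or num_classes for unmatched (background)
    target_ious:   [N] IoU between each predicted box and its matched GT box
                   (0 for unmatched queries)
    """
    targets = torch.zeros_like(cls_logits)
    pos = target_labels < num_classes                     # matched (positive) queries
    targets[pos, target_labels[pos]] = target_ious[pos]   # IoU score as soft target
    return F.binary_cross_entropy_with_logits(cls_logits, targets)

# Query selection: take the top K (K = 300) encoder features by classification
# score; after training with the loss above, a high score also implies high IoU.
def select_topk_queries(enc_logits, enc_feats, k=300):
    scores = enc_logits.sigmoid().max(dim=-1).values      # [B, L]
    topk = scores.topk(k, dim=1).indices                  # [B, K]
    return torch.gather(enc_feats, 1,
                        topk.unsqueeze(-1).expand(-1, -1, enc_feats.shape[-1]))
```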

       Specifically, we first select the top K (K = 300 in the experiments) encoder features according to the classification score, and then visualize the scatterplot of features with classification scores greater than 0.5. Red and blue points are computed from models trained with vanilla query selection and IoU-aware query selection, respectively. The closer a point is to the upper right corner of the plot, the higher the quality of the corresponding feature, i.e. the more likely its classification label and bounding box describe a real object in the image. In the visualization, the most striking observation is that a large number of blue points are concentrated in the upper right corner of the plot, while the red points are concentrated in the lower right corner. This shows that models trained with IoU-aware query selection produce more high-quality encoder features.

        In addition, the distributions of the two types of points are analyzed quantitatively. There are 138% more blue points than red points in the figure, that is, more red points have a classification score less than or equal to 0.5 and can be regarded as low-quality features. Analyzing the IoU scores of the features with classification scores greater than 0.5, we further find that among those with IoU greater than 0.5 there are 120% more blue points than red points. The quantitative results further demonstrate that IoU-aware query selection provides object queries with more encoder features that have accurate classification (high classification scores) and precise localization (high IoU scores), thereby improving the accuracy of the detector.
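
For reference, the kind of quantitative comparison described above could be reproduced with a few lines of analysis code; the 0.5 thresholds match the text, while the variable names and data layout are assumptions.

```python
import numpy as np

def quality_counts(cls_scores, ious, thr=0.5):
    """Count selected features with a high classification score, and those that
    additionally have high IoU with their matched ground-truth box."""
    cls_scores, ious = np.asarray(cls_scores), np.asarray(ious)
    high_cls = cls_scores > thr
    high_both = high_cls & (ious > thr)
    return int(high_cls.sum()), int(high_both.sum())

# Comparing the two trained models (vanilla vs. IoU-aware query selection):
# more features surviving both thresholds indicates more encoder features with
# consistent classification and localization quality.
```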
 

Source: https://blog.csdn.net/Zosse/article/details/131488928