Baidu RT-DETR algorithm principle analysis | A new level of object detection beyond YOLO?


0. Preface


Paper address: https://arxiv.org/abs/2304.08069

Code address: https://github.com/PaddlePaddle/PaddleDetection

Chinese translation: https://blog.csdn.net/weixin_43694096/article/details/131353118


This blog post introduces Baidu's RT-DETR. First, take a look at the title of the paper, which claims that RT-DETR "beats YOLO in real-time object detection." While RT-DETR does appear to surpass YOLO in some respects judging from the data, there is still much to research and explore compared to the time-tested YOLO. Of course, now that Transformer technology is so hot, RT-DETR is certainly an eye-catching direction. Next, let's dive into RT-DETR.


Compared with the state-of-the-art YOLOv8, RT-DETR is less demanding in terms of training time: only about 75 to 80 training epochs (whereas YOLOv8 typically needs 300 to 400). Additionally, RT-DETR relies less on data augmentation. Under the same test conditions, RT-DETR performed better, with a better accuracy/speed balance, while being just as fast as YOLO.

Although YOLO detectors perform well in object detection, they face an important problem: the need to use NMS (non-maximum suppression) to process the many overlapping detection boxes, which introduces latency and makes effective optimization difficult.

To overcome this problem, researchers turned their attention to DETR (DEtection TRansformer), an end-to-end object detector based on the Transformer architecture. Unlike YOLO, DETR needs no NMS post-processing and completes the entire object detection pipeline directly within the network.

However, despite DETR's advantage of eliminating NMS, its processing speed is significantly slower than that of YOLO-series detectors. In other words, although NMS is no longer required, DETR shows no clear advantage in speed. This problem prompted researchers to look for a way to design a real-time end-to-end object detector that overcomes the latency NMS imposes.

As a result, Baidu officially launched RT-DETR (Real-Time DEtection TRansformer), a real-time end-to-end detector based on the DETR architecture, which achieves SOTA performance in both speed and accuracy.

Whenever DETR is mentioned, NMS must be discussed. NMS is an important post-processing technique in object detection that resolves the multiple overlapping detection boxes a detector produces. It hinges on two key hyperparameters: the confidence threshold and the IoU (intersection over union) threshold.

First, detection boxes with confidence below the confidence threshold are filtered out directly, excluding the low-confidence boxes. Then, wherever the IoU between two remaining detection boxes exceeds the IoU threshold, the box with the higher confidence is kept while the lower-confidence one is suppressed. This process is carried out iteratively until all detection boxes of every target category have been processed.
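
To make the procedure concrete, here is a minimal NumPy sketch of class-agnostic NMS with exactly those two hyperparameters (in practice NMS is run per category); the function name and default thresholds are illustrative, not code from RT-DETR or YOLO:

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray,
        score_thr: float = 0.25, iou_thr: float = 0.5):
    """boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns indices
    (into the score-filtered arrays) of the boxes that survive."""
    # Step 1: drop boxes below the confidence threshold.
    mask = scores >= score_thr
    boxes, scores = boxes[mask], scores[mask]
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        # IoU of the best box against all remaining boxes.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_b = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_b + area_r - inter)
        # Step 2: suppress overlapping boxes with lower confidence.
        order = rest[iou <= iou_thr]
    return keep
```

Note how the loop's work scales with the number of boxes that survive the confidence filter — which is exactly why the two thresholds drive NMS execution time.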

The execution time of the NMS algorithm is mainly affected by two factors: the number of prediction boxes and the settings of the two thresholds above. To study this effect in detail, the authors conducted experiments using YOLOv5 (an anchor-based method) and YOLOv8 (an anchor-free method), examining the number of detection boxes retained under different confidence thresholds, as well as each detector's accuracy and NMS execution time on the COCO validation set under different hyperparameter combinations.

Experimental results show that NMS not only slows down the detector's inference but also requires careful selection of its hyperparameters to achieve optimal performance. These results highlight the importance of designing a real-time end-to-end object detector that overcomes the performance bottleneck and drawbacks introduced by NMS.

To verify this point, we conduct experiments using YOLOv5 (anchor-based) and YOLOv8 (anchor-free). We first count the number of remaining prediction boxes after filtering the output boxes through different score thresholds under the same input image. We randomly select some scores from 0.001 to 0.25 as the threshold, count the remaining prediction boxes of the two detectors and plot them into a histogram, which intuitively reflects the sensitivity of NMS to its hyperparameters, as shown in Figure 2.
[Figure 2: histograms of the number of prediction boxes remaining under different confidence thresholds for YOLOv5 and YOLOv8]

In addition, we took YOLOv8 as an example to evaluate the model's accuracy on the COCO val2017 dataset and tested the execution time of the NMS operation under different NMS hyperparameters. Note that the NMS post-processing operation used in the experiment is the TensorRT EfficientNMSPlugin, which contains multiple CUDA kernels, including EfficientNMSFilter, RadixSort, EfficientNMS, etc.; we report only the execution time of the EfficientNMS kernel. Speed tests were conducted on a T4 GPU, with the input images and preprocessing kept consistent with the experiments above. The hyperparameters used and the corresponding results are shown in Table 1.
[Table 1: YOLOv8 accuracy and EfficientNMS kernel execution time under different NMS hyperparameters]
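
For intuition, here is a rough sketch of such a measurement using torchvision.ops.nms on random boxes as a stand-in for the TensorRT EfficientNMS plugin the authors actually timed; box counts and thresholds are illustrative:

```python
import time
import torch
from torchvision.ops import nms

torch.manual_seed(0)
N = 8400                                    # box count of a 640x640 YOLOv8 head
xy = torch.rand(N, 2) * 600
wh = torch.rand(N, 2) * 40 + 1
boxes = torch.cat([xy, xy + wh], dim=1)     # (x1, y1, x2, y2)
scores = torch.rand(N)

for score_thr in (0.001, 0.05, 0.25):
    for iou_thr in (0.5, 0.7):
        m = scores >= score_thr             # confidence filtering first
        t0 = time.perf_counter()
        keep = nms(boxes[m], scores[m], iou_thr)
        dt = (time.perf_counter() - t0) * 1e3
        print(f"score_thr={score_thr:<5} iou_thr={iou_thr}: "
              f"{int(m.sum()):>5} in, {len(keep):>5} kept, {dt:.2f} ms")
```

The lower the score threshold, the more boxes enter NMS and the longer it takes, matching the trend the paper reports in Table 1.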


1. RT-DETR structural design

[Figure: RT-DETR overall architecture]

Next, let's introduce the structure of RT-DETR. Structurally, RT-DETR can be divided into three parts: the backbone network, the neck network, and the head network. Let's discuss these three parts one by one.

1.1 Backbone network


For the backbone, two classic and scalable backbones are used: ResNet and HGNetv2, each trained in two versions. With ResNet, there are RT-DETR-R50 and RT-DETR-R101; with HGNetv2, there are RT-DETR-HGNet-L and RT-DETR-HGNet-X. The ResNet backbones (RT-DETR-R50/101) are used to facilitate comparison with existing DETR variants, while the HGNetv2 backbones (L/X) are used for comparison with existing real-time detectors. It is worth noting that HGNetv2 is a backbone structure developed by Baidu itself.

Similar to YOLO, RT-DETR ultimately outputs feature maps of three different sizes, with downsampling factors of 8x, 16x, and 32x relative to the input image resolution, just like mainstream YOLO algorithms. Apart from this, there is nothing special about the rest of RT-DETR's backbone structure.
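
As a quick illustration of these three scales, here is a hedged PyTorch sketch that pulls stride-8/16/32 features from a torchvision ResNet-50 (RT-DETR itself is implemented in PaddleDetection, so this is only an analogy):

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Map ResNet stages to the S3/S4/S5 names used in this post.
backbone = create_feature_extractor(
    resnet50(weights=None),
    return_nodes={"layer2": "S3", "layer3": "S4", "layer4": "S5"},
)
feats = backbone(torch.randn(1, 3, 640, 640))
for name, f in feats.items():
    print(name, tuple(f.shape))   # S3: 80x80 (8x), S4: 40x40 (16x), S5: 20x20 (32x)
```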

1.2 Neck network


For the neck network, RT-DETR uses a single layer of Transformer encoder. This neck network is called the Efficient Hybrid Encoder in the paper. It consists of two parts: Attention-based Intra-scale Feature Interaction (AIFI) and the CNN-based Cross-scale Feature-fusion Module (CCFM). One thing worth noting is that the AIFI module only processes the S5 feature map.

The AIFI module first flattens the two-dimensional S5 feature map into a sequence of vectors, processes it with multi-head self-attention and an FFN, and then reshapes the output back to two dimensions, denoted F5, ready for the subsequent "cross-scale feature fusion".
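
A minimal sketch of that flatten → attend → reshape flow (layer sizes are illustrative, and the real model also adds 2D positional embeddings to the flattened sequence before attention):

```python
import torch
import torch.nn as nn

class AIFI(nn.Module):
    """Toy intra-scale interaction on S5: MHSA + FFN over the flattened map."""
    def __init__(self, dim: int = 256, heads: int = 8, ffn_dim: int = 1024):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=ffn_dim, batch_first=True)

    def forward(self, s5: torch.Tensor) -> torch.Tensor:
        b, c, h, w = s5.shape
        seq = s5.flatten(2).permute(0, 2, 1)    # (B, H*W, C): 2D map -> sequence
        seq = self.encoder(seq)                 # self-attention + FFN
        return seq.permute(0, 2, 1).reshape(b, c, h, w)   # back to 2D: F5

f5 = AIFI()(torch.randn(1, 256, 20, 20))   # S5 of a 640x640 input
print(f5.shape)                            # torch.Size([1, 256, 20, 20])
```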


As for the CCFM module, viewed from a YOLO perspective it is essentially an FPN/PAN structure. The paper also gives a detailed diagram of the Fusion block inside CCFM: it is composed of 2 1×1 convolutions and N RepBlocks, written as N because RT-DETR is scalable. By adjusting the number of RepBlocks in CCFM and the embedding dimension of the Encoder, the depth and width of the Hybrid Encoder can be controlled respectively; with corresponding adjustments to the backbone, the whole detector can be scaled. A sketch of the fusion block follows below.
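
Here is a hedged sketch of that fusion block — two 1×1 convolutions plus N RepBlocks with an element-wise-add shortcut, mirroring the CSPRepLayer in PaddleDetection. The RepBlocks are shown in plain conv-BN-act form; a real RepVGG-style block also carries a parallel 1×1 branch that is re-parameterized away at inference:

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in: int, c_out: int, k: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU())

class FusionBlock(nn.Module):
    def __init__(self, c_in: int, c_out: int, n: int = 3):
        super().__init__()
        self.conv1 = conv_bn_act(c_in, c_out, 1)    # 1x1 projection, main path
        self.conv2 = conv_bn_act(c_in, c_out, 1)    # 1x1 projection, shortcut
        self.rep_blocks = nn.Sequential(            # N controls the depth
            *[conv_bn_act(c_out, c_out, 3) for _ in range(n)])

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        x = torch.cat([a, b], dim=1)                # fuse two adjacent scales
        return self.rep_blocks(self.conv1(x)) + self.conv2(x)

out = FusionBlock(c_in=512, c_out=256)(torch.randn(1, 256, 40, 40),
                                       torch.randn(1, 256, 40, 40))
print(out.shape)   # torch.Size([1, 256, 40, 40])
```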


The paper explains that RT-DETR's AIFI only processes the last S5 feature for two reasons:

  1. Previous DETR models, e.g. Deformable DETR, typically flatten the feature maps of multiple scales into one very long sequence. This lets features at different scales interact with each other, but results in enormous computation and long compute time. RT-DETR considers this one of the main reasons current DETR models are slow.
  2. RT-DETR considers that S5 carries deeper, higher-level, and richer semantic features than the shallower S3 and S4. These semantic features are more important to the Transformer because they are most useful for distinguishing different objects, while shallow features are not very informative due to their lack of good semantics.

In summary, the RT-DETR authors believe that applying the encoder only to the S5 feature map, rather than to feature maps of all scales, helps significantly reduce computation and increase speed without obviously hurting the model's performance.

Regarding this idea, the author also conducted detailed experiments.


Computational bottleneck analysis. To speed up training convergence and improve performance, Zhu et al. [43] suggested introducing multi-scale features and proposed a deformable attention mechanism to reduce computation. However, although the improved attention mechanism reduces the computational overhead, the substantial lengthening of the input sequence still causes the encoder to become a computational bottleneck, hindering the real-time implementation of DETR. As reported in [17], the encoder accounts for 49% of the GFLOPs but contributes only 11% of the AP in Deformable-DETR [43]. To overcome this obstacle, we analyze the computational redundancy present in the multi-scale transformer encoder and design a series of variants to demonstrate that simultaneous intra-scale and cross-scale feature interaction is computationally inefficient.

High-level features, which contain rich semantic information about the objects in an image, are extracted from low-level features. Intuitively, performing feature interaction on the concatenated multi-scale features is redundant. To verify this view, we rethought the structure of the encoder and designed a series of variants with different encoders, as shown in Figure 5. These variants gradually improve the accuracy of the model while significantly reducing the computational cost by decomposing the multi-scale feature interaction into the two-step operations of intra-scale interaction and cross-scale fusion (see Table 3 for detailed metrics). We first remove the multi-scale transformer encoder from DINO-R50 [40] as baseline A. Then, different forms of encoders are inserted to produce a series of variants based on baseline A, as described below:

A → B: Variant B inserts a single-scale Transformer encoder consisting of one layer of Transformer blocks. The features of each scale share the encoder for intra-scale feature interaction, and the output multi-scale features are then concatenated.
B → C: Variant C introduces cross-scale feature fusion on top of B, feeding the concatenated multi-scale features into the encoder for feature interaction.
C → D: Variant D decouples the intra-scale interaction and cross-scale fusion of multi-scale features. First, a single-scale Transformer encoder is used for intra-scale interaction, and then a structure similar to PANet [21] is used for cross-scale fusion.
D → E: Variant E further optimizes the intra-scale interaction and cross-scale fusion of multi-scale features based on D, using the efficient hybrid encoder we designed.

1.3 Data augmentation and training strategy

For data augmentation and training strategy: RT-DETR's data augmentation uses basic random color jitter, random flipping, cropping, and resizing, and the input size during validation and inference is unified to 640×640. This differs considerably from the DETR series and is mainly done to meet real-time requirements. RT-DETR's training strategy is basically the same as the DETR series'; the optimizer is likewise AdamW. By default it is trained on COCO train2017 with the 6x schedule, i.e., 72 epochs.


2. Query Selection and Decoder

To further improve RT-DETR's accuracy, the authors turned their attention to two other key components of the DETR architecture: Query Selection and the Decoder.

The function of Query Selection is to select a fixed number of features from the encoder's output feature sequence as object queries; after passing through the Decoder, these are mapped to confidence scores and bounding boxes by the prediction head. Existing DETR variants all use the classification scores of these features to directly select the Top-K features. However, because the distributions of classification scores and IoU scores are inconsistent, a prediction box with a high classification score is not necessarily the box with the highest IoU against the GT, which harms detector performance.

To solve this problem, the authors propose IoU-aware Query Selection, which constrains the detector during training to produce high classification scores for features with high IoU and low classification scores for features with low IoU. As a result, the prediction boxes corresponding to the Top-K features that the model selects by classification score have both high classification scores and high IoU scores.

The corresponding optimization objective is:

$$\mathcal{L}(\hat{y}, y) = \mathcal{L}_{box}(\hat{b}, b) + \mathcal{L}_{cls}(\hat{c}, c, \mathrm{IoU})$$

where $\hat{y}$ and $y$ denote the prediction and ground truth respectively, $\hat{y} = \{\hat{c}, \hat{b}\}$ and $y = \{c, b\}$, with $c$ and $b$ representing the category and bounding box. The IoU score is introduced into the objective function of the classification branch (similar to VFL) to impose a consistency constraint between the classification and localization of positive samples.
By visualizing the classification scores of the selected encoder features against their IoU with the GT, the authors show that IoU-aware Query Selection yields more features with both high classification scores and high IoU.
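
A sketch of the idea (in the spirit of Varifocal Loss, which the paper cites as similar): the classification target of a positive sample is its IoU with the matched GT instead of a hard 1, so a high score also implies good localization. All names and shapes below are illustrative, not the official implementation:

```python
import torch
import torch.nn.functional as F

def iou_aware_cls_loss(cls_logits: torch.Tensor,
                       gt_labels: torch.Tensor,
                       ious: torch.Tensor) -> torch.Tensor:
    """cls_logits: (N, C) logits of N positive features; gt_labels: (N,)
    matched class indices; ious: (N,) IoU with the matched GT boxes."""
    targets = torch.zeros_like(cls_logits)
    targets[torch.arange(len(gt_labels)), gt_labels] = ious  # IoU-valued target
    return F.binary_cross_entropy_with_logits(cls_logits, targets)

print(iou_aware_cls_loss(torch.randn(5, 80),
                         torch.tensor([1, 3, 7, 2, 0]),
                         torch.rand(5)))

# Query selection itself remains a plain Top-K over classification scores,
# which after such training also reflect localization quality:
enc_scores = torch.rand(1000, 80)                     # hypothetical encoder scores
topk_idx = enc_scores.amax(dim=1).topk(300).indices   # K = 300 in the paper
```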

As for the Decoder, the authors did not adjust its structure. The paper states that the purpose is to facilitate using high-precision large DETR models to distill lightweight DETR detectors.



3. Experimental results

3.1 Settings

Dataset: We conduct extensive experiments on the Microsoft COCO dataset to validate the proposed detector. In the ablation study, we train on COCO train2017 and validate on COCO val2017. We use the standard COCO AP metric with single-scale images as input.

Implementation details: We use the ResNet and HGNetv2 series models pre-trained on ImageNet as our backbone networks; these models come from PaddleClas. AIFI consists of 1 Transformer layer, and the fusion block in CCFM consists of 3 RepBlocks by default. In IoU-aware query selection, we select the top 300 encoder features to initialize the object queries of the decoder. The training strategy and hyperparameters of the decoder almost follow the settings of DINO. We train with the AdamW optimizer, with a base learning rate of 0.0001, weight decay of 0.0001, a global gradient clipping norm of 0.1, and 2000 linear warm-up steps. The learning rate setting of the backbone network follows the method of [4]. We also use an exponential moving average (EMA) with a decay rate of 0.9999. Unless otherwise specified, the 1x configuration means training for a total of 12 epochs; the final reported results use the 6x configuration (72 epochs). Data augmentation includes random {color distortion, expansion, cropping, flipping, resizing} operations, following the settings of [36].
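
A PyTorch-flavored sketch of that optimization setup (the official implementation is in PaddleDetection, so treat the names here as analogies; the EMA helper needs torch >= 2.0):

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 80)                  # stand-in for the detector
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=1e-3, total_iters=2000)    # 2000 linear warm-up steps
ema = torch.optim.swa_utils.AveragedModel(       # EMA with decay 0.9999
    model, multi_avg_fn=torch.optim.swa_utils.get_ema_multi_avg_fn(0.9999))

for step in range(10):                      # skeleton of a training loop
    loss = model(torch.randn(4, 256)).pow(2).mean()   # placeholder loss
    opt.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)  # clip norm 0.1
    opt.step()
    sched.step()
    ema.update_parameters(model)
```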

3.2 Comparison with SOTA

[Table 2: comparison of RT-DETR with other real-time and end-to-end object detectors]

Table 2 compares our proposed RT-DETR with other real-time and end-to-end object detectors. Our proposed RT-DETR-L achieves 53.0% AP at 114 FPS and RT-DETR-X achieves 54.8% AP at 74 FPS, outperforming YOLO detectors of the same size in both speed and accuracy. Furthermore, our proposed RT-DETR-R50 achieves 53.1% AP at 108 FPS and RT-DETR-R101 achieves 54.3% AP at 74 FPS, outperforming state-of-the-art end-to-end detectors with equivalent backbone networks in both speed and accuracy.

Compared to real-time detectors. For a fair comparison, we compare the speed and accuracy of scaled RT-DETR with current real-time detectors in an end-to-end setting (see Section 3.2 for the speed testing method). In Table 2, we compare scaled RT-DETR with YOLOv5 [10], PP-YOLOE [36], YOLOv6v3.0 [14], YOLOv7 [33], and YOLOv8 [11]. Compared with YOLOv5-L / PP-YOLOE-L / YOLOv7-L, RT-DETR-L significantly improves accuracy by 4.0% / 1.6% / 1.8% AP, increases FPS by 111.1% / 21.3% / 107.3%, and reduces the number of parameters by 30.4% / 38.5% / 11.1%. Compared with YOLOv5-X / PP-YOLOE-X / YOLOv7-X, RT-DETR-X improves accuracy by 4.1% / 2.5% / 1.9% AP, increases FPS by 72.1% / 23.3% / 64.4%, and reduces the number of parameters by 22.1% / 31.6% / 5.6%. Compared with YOLOv6-L / YOLOv8-L, RT-DETR-L improves accuracy by 0.2% / 0.1% AP, increases FPS by 15.2% / 60.6%, and reduces the number of parameters by 45.8% / 25.6%. Compared with YOLOv8-X, RT-DETR-X improves accuracy by 0.9% AP, increases FPS by 48.0%, and reduces the number of parameters by 1.5%.

Compared to end-to-end detectors. Table 2 shows that RT-DETR achieves state-of-the-art performance among all end-to-end detectors using the same backbone network. Compared with DINO-Deformable-DETR-R50 [40], RT-DETR-R50 significantly improves accuracy by 2.2% AP (53.1% AP vs. 50.9% AP), is 21 times faster (108 FPS vs. 5 FPS), and reduces the number of parameters by 10.6%. Compared with SMCA-DETR-R101 [6], RT-DETR-R101 significantly improves accuracy by 8.0% AP.

3.3 Ablation research on hybrid encoders


Table 3: Analytical experimental results of splitting multi-scale feature fusion into two-step operations of intra-scale interaction and cross-scale fusion.


To verify the correctness of our analysis of the encoder and the effectiveness of the proposed hybrid encoder, we evaluate the set of designed variants on a T4 GPU, reporting AP, number of parameters, and latency. The experimental results are shown in Table 3.

Compared with variant A, variant B improves AP by 1.9%, while the number of parameters increases by 3% and the latency increases by 54%. This demonstrates the importance of intra-scale feature interaction, but also that the vanilla Transformer encoder is computationally expensive.

Variant C improves AP by 0.7% over variant B with the number of parameters unchanged, but latency increases by 20%. This suggests that cross-scale feature fusion is also necessary.

Compared with variant C, variant D improves AP by 0.8% and increases the number of parameters by 9%, but latency decreases by 8%. This demonstrates that decoupling intra-scale interaction and cross-scale fusion improves accuracy while reducing computation.

Compared with the original variant D, D_S5 reduces latency by 35% while improving AP by 0.4%. This demonstrates that intra-scale interaction on lower-level features is unnecessary.

Finally, variant E, equipped with our proposed hybrid encoder, improves AP by 1.5% over variant D. Although the number of parameters increases by 20%, latency is reduced by 24%, making the encoder more computationally efficient.

3.4 Ablation research on IoU-aware query selection

We conduct an ablation study on IoU-aware query selection and present the quantitative results in Table 4. The query selection we adopt picks the top K (K = 300) encoder features by classification score as content queries, and the corresponding bounding boxes as initial position queries. We compare the encoder features selected by the two kinds of query selection on val2017 and compute the proportion whose classification score exceeds 0.5 and the proportion whose classification and IoU scores both exceed 0.5, corresponding to the "Prop_cls" and "Prop_both" columns respectively. The results show that the encoder features selected by IoU-aware query selection not only contain a higher proportion of high classification scores (0.82 vs. 0.35) but also provide more features with both high classification and high IoU scores (0.67 vs. 0.30). We also evaluate on val2017 the accuracy of detectors trained with the two kinds of query selection; IoU-aware query selection achieves a 0.8% AP improvement (48.7% AP vs. 47.9% AP).



Table 4: Results of the IoU-aware query selection ablation study. Prop_cls and Prop_both denote the proportion of selected features whose classification score exceeds 0.5, and whose classification and IoU scores both exceed 0.5, respectively.
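
For clarity, a small sketch of how those two proportions could be computed for the K selected encoder features (names and inputs are illustrative):

```python
import torch

def proportions(cls_scores: torch.Tensor, ious: torch.Tensor):
    """cls_scores, ious: (K,) values for the Top-K selected features."""
    prop_cls = (cls_scores > 0.5).float().mean().item()
    prop_both = ((cls_scores > 0.5) & (ious > 0.5)).float().mean().item()
    return prop_cls, prop_both

p_cls, p_both = proportions(torch.rand(300), torch.rand(300))
print(f"Prop_cls={p_cls:.2f}, Prop_both={p_both:.2f}")
```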


3.5 Ablation study on the decoder


Table 5: Ablation study results for the decoder. ID indicates the index of the decoder layer, and AP indicates the model accuracy obtained at different decoder layers. Det_k denotes a detector with k decoder layers. Results are reported on RT-DETR-R50 with the 6x schedule setting.


Table 5 shows the accuracy and speed of RT-DETR with different numbers of decoder layers. The detector achieves its best accuracy, 53.1% AP, with 6 decoder layers. We also analyze the impact of each decoder layer on inference speed and conclude that each decoder layer consumes approximately 0.5 ms. Furthermore, we find that the accuracy difference between adjacent decoder layers gradually decreases as the layer index increases. Taking the 6-layer decoder as an example, using only 5 layers for inference loses merely 0.1% AP (53.1% AP vs. 53.0% AP) while reducing latency by 0.5 ms (9.3 ms vs. 8.8 ms). Therefore, RT-DETR supports flexible adjustment of inference speed by using different numbers of decoder layers without retraining, which facilitates the practical application of real-time detectors.
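
This flexibility comes from deep supervision: each decoder layer has its own prediction head, so at inference time one can simply read the output of an earlier layer. A hedged toy sketch (names and sizes are illustrative, not the RT-DETR decoder):

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(num_layers))
        # One prediction head per layer (deep supervision).
        self.heads = nn.ModuleList(nn.Linear(dim, 4) for _ in range(num_layers))

    def forward(self, queries, memory, use_layers: int = 6):
        out = None
        for layer, head in zip(self.layers[:use_layers], self.heads[:use_layers]):
            queries = layer(queries, memory)
            out = head(queries)            # boxes predicted by this layer
        return out                         # prediction of the last layer used

dec = TinyDecoder()
q, mem = torch.randn(1, 300, 256), torch.randn(1, 400, 256)
boxes_full = dec(q, mem, use_layers=6)   # full depth
boxes_fast = dec(q, mem, use_layers=5)   # one layer fewer, no retraining
```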


Summary

In this blog, we have only scratched the surface of RT-DETR, discussing its core principles and potential applications. More details remain to be discovered in the code~

Thanks for reading this article. My name is Diffie Herman; if you find it useful, please like and follow for more updates about deep learning and computer vision. If you have any questions or suggestions, feel free to leave a message, and let's explore the wonderful world of deep learning together.

If you are interested in improving YOLO, welcome to follow my column!

"YOLOv8 Improved Practical Combat"


References

Beyond YOLOv8: PaddlePaddle launches RT-DETR, the most accurate real-time detector!

"Target Detection"-Chapter 33-A Brief Analysis of RT-DETR


Reprinted from: blog.csdn.net/weixin_43694096/article/details/133183315