"DETRs Beat YOLOs on Real-time Object Detection" accelerates DETR to real-time SOTA

Recently I came across a fairly good DETR paper from Baidu, which accelerates the model to real-time speed by simplifying DINO's encoder. I translated and summarized it here for the record.
Paper address: https://arxiv.org/pdf/2304.08069.pdf
Open source address: https://github.com/PaddlePaddle/PaddleDetection

Model structure

RT-DETR first takes the features from the last three stages of the backbone, {S3, S4, S5}, as input to the encoder. The hybrid encoder converts these multi-scale features into a sequence of image features through intra-scale feature interaction (AIFI, which according to the paper is essentially a standard transformer encoder layer) and a cross-scale feature fusion module (CCFM). IoU-aware query selection then picks a fixed number of encoder features to serve as the initial object queries for the decoder. Finally, a decoder with auxiliary prediction heads (the same decoder as DINO's) iteratively refines the object queries to produce boxes and confidence scores.
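To make that data flow concrete, here is a minimal PyTorch-style sketch of the pipeline, assuming the usual backbone strides of 8/16/32 for S3/S4/S5. Everything here is an illustrative simplification, not the official implementation (which lives in PaddleDetection): the crude top-down sum stands in for CCFM, and query selection is done with a plain classification-score top-K.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RTDETRSketch(nn.Module):
    """Illustrative skeleton of the RT-DETR data flow, not the official implementation."""
    def __init__(self, dim=256, num_queries=300, num_classes=80):
        super().__init__()
        # AIFI: a single standard transformer encoder layer, applied only to flattened S5
        self.aifi = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        # Stand-in for CCFM: 1x1 convs plus a crude top-down sum (the paper uses a PAN-style block)
        self.lateral = nn.ModuleList([nn.Conv2d(dim, dim, 1) for _ in range(3)])
        # Encoder-side classification head used for query selection
        self.enc_cls = nn.Linear(dim, num_classes)
        self.num_queries = num_queries

    def forward(self, s3, s4, s5):
        # Intra-scale feature interaction (AIFI) on S5 only
        b, c, h5, w5 = s5.shape
        f5 = self.aifi(s5.flatten(2).transpose(1, 2))            # (B, H5*W5, C)
        f5 = f5.transpose(1, 2).reshape(b, c, h5, w5)

        # Cross-scale feature fusion (CCFM), here just a top-down sum as a placeholder
        p5 = self.lateral[2](f5)
        p4 = self.lateral[1](s4) + F.interpolate(p5, scale_factor=2)
        p3 = self.lateral[0](s3) + F.interpolate(p4, scale_factor=2)

        # Flatten the fused features and take the top-K tokens as initial object queries
        memory = torch.cat([p.flatten(2).transpose(1, 2) for p in (p3, p4, p5)], dim=1)
        scores = self.enc_cls(memory).max(-1).values              # (B, N) per-token class score
        topk = scores.topk(self.num_queries, dim=1).indices
        queries = memory.gather(1, topk.unsqueeze(-1).expand(-1, -1, c))
        # A DINO-style decoder (omitted here) would iteratively refine these queries
        # into boxes and confidence scores.
        return queries

# Toy shapes: strides 8/16/32 on a 640x640 input give 80x80, 40x40 and 20x20 feature maps
s3, s4, s5 = (torch.randn(1, 256, s, s) for s in (80, 40, 20))
print(RTDETRSketch()(s3, s4, s5).shape)  # torch.Size([1, 300, 256])
```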

Main innovations

1. Simplify the DINO encoder

The paper argues that the main reason DINO is slow is its six-layer encoder, which uses deformable attention for multi-scale feature fusion. The encoder accounts for 49% of the model's FLOPs but contributes only 11% of the AP. To overcome this, the paper analyzes the computational redundancy in the multi-scale deformable-attention encoder and designs a set of variants to show that performing intra-scale interaction and cross-scale fusion simultaneously is computationally inefficient.
The paper designs five encoder variants, A through E, to replace the original encoder:
A: no encoder at all, serving as the baseline;
B: a single-scale transformer encoder that applies self-attention to the features of each scale;
C: a multi-scale encoder with deformable attention, where intra-scale and cross-scale interaction happen simultaneously;
D: intra-scale self-attention on each scale followed by a separate cross-scale fusion step (a PANet-style FPN);
E: the efficient hybrid encoder designed in this paper, which further streamlines D's decoupled intra-scale interaction and cross-scale fusion.
The paper compares COCO mAP and latency for these five variants.
From this comparison, D and E, which decouple intra-scale interaction from cross-scale fusion, achieve the better trade-offs between mAP and latency. E further reduces computational redundancy on top of D by performing intra-scale interaction only on the S5 features. The paper argues that applying self-attention to high-level features, which carry richer semantic concepts, captures the relationships between conceptual entities in the image and helps subsequent modules detect and recognize objects. Meanwhile, letting lower-level features, which lack such semantic concepts, interact with the high-level features risks duplicated and confused computation.
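A quick back-of-the-envelope calculation shows why restricting plain self-attention to S5 is attractive (assuming the usual backbone strides of 8/16/32 and a 640x640 input): the cost of standard self-attention grows quadratically with the number of tokens.

```python
# Token counts for a 640x640 input, assuming strides 8/16/32 for S3/S4/S5
tokens = {"S3": (640 // 8) ** 2, "S4": (640 // 16) ** 2, "S5": (640 // 32) ** 2}
print(tokens)  # {'S3': 6400, 'S4': 1600, 'S5': 400}

# Standard self-attention scales with N^2, so attending over all scales jointly
# versus over S5 alone differs by roughly:
all_tokens = sum(tokens.values())                  # 8400
print(f"~{all_tokens**2 / tokens['S5']**2:.0f}x")  # ~441x
```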
The efficient encoder proposed in this paper can be expressed as:
$$Q = K = V = \mathrm{Flatten}(S_5)$$
$$F_5 = \mathrm{Reshape}(\mathrm{Attn}(Q, K, V))$$
$$\mathrm{Output} = \mathrm{CCFM}(F_5, S_4, S_3)$$
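Read literally, the formula says: flatten S5 into a token sequence, run standard multi-head self-attention over it (the usual FFN part of the encoder layer is omitted here), reshape back to a feature map, and hand the result together with S4 and S3 to CCFM. A minimal sketch of that mapping, with `ccfm` left as a hypothetical callable:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

def aifi(s5: torch.Tensor) -> torch.Tensor:
    """Intra-scale interaction on S5 only: Q = K = V = Flatten(S5)."""
    b, c, h, w = s5.shape
    q = k = v = s5.flatten(2).transpose(1, 2)      # (B, H*W, C)
    f5, _ = attn(q, k, v)                          # Attn(Q, K, V)
    return f5.transpose(1, 2).reshape(b, c, h, w)  # Reshape back to (B, C, H, W)

# Output = CCFM(F5, S4, S3); `ccfm` is a placeholder for the PAN-style fusion block.
# output = ccfm(aifi(s5), s4, s3)
```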

2. IoU-aware query selection

Object queries in DETR are a set of learnable embeddings that are optimized by the decoder and mapped to classification scores and bounding boxes by the prediction head. DINO instead initializes the object queries by selecting the top-K encoder features according to their classification scores. However, because classification score and localization confidence are not consistently distributed, some predicted boxes have high classification scores but are not close to any GT box: boxes with high classification scores but low IoU get selected, while boxes with low classification scores but high IoU get discarded, which hurts detector performance. To address this, the paper proposes IoU-aware query selection, which constrains the model during training to produce high classification scores for features with high IoU and low classification scores for features with low IoU. As a result, the predicted boxes corresponding to the top-K encoder features selected by classification score have both high classification scores and high IoU. The paper reformulates DETR's bipartite matching objective as follows:
$$L(\hat{y}, y) = L_{\mathrm{bbox}}(\hat{b}, b) + L_{\mathrm{cls}}(\hat{c}, c, \mathrm{IoU})$$
COCO mAP is significantly improved by IoU-aware query selection.
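One common way to realize $L_{\mathrm{cls}}(\hat{c}, c, \mathrm{IoU})$ is to use the IoU between each matched prediction and its GT box as a soft classification target, in the spirit of VariFocal/Quality Focal losses. The sketch below illustrates that idea; it is an assumption for clarity, not necessarily the exact loss used in the official code.

```python
import torch
import torch.nn.functional as F

def iou_aware_cls_loss(cls_logits, target_labels, target_ious):
    """Sketch of an IoU-aware classification loss: the target score for the matched
    class is the IoU of the predicted box with its GT box, so a high classification
    score is only rewarded when localization quality is also high.

    cls_logits:    (N, num_classes) raw logits of the predictions
    target_labels: (N,) matched GT class index, or -1 for unmatched predictions
    target_ious:   (N,) IoU of each predicted box with its matched GT box
    """
    targets = torch.zeros_like(cls_logits)
    matched = target_labels >= 0
    targets[matched, target_labels[matched]] = target_ious[matched]
    return F.binary_cross_entropy_with_logits(cls_logits, targets, reduction="mean")

# With such a loss, encoder-side query selection can simply keep the top-K
# features by classification score, e.g.:
# topk_idx = cls_scores.max(-1).values.topk(300, dim=-1).indices
```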

Results

RT-DETR is trained and evaluated at 640x640 resolution for 72 epochs. At first glance, the comparison with DINO (which uses 1333x800 inputs) is not entirely fair. I also wonder whether this simplified encoder would be as effective if applied to Mask DINO.

Source: blog.csdn.net/blanokvaffy/article/details/130230385