ICCV 2023 | A collaborative hybrid training strategy breaks through the upper limit of large object detection models, setting new SOTA on COCO and LVIS

Summary · Highlights

The SenseTime large-model team proposed Co-DETR, a training framework for DETR-style detectors that greatly improves model performance without changing the inference structure or speed. It is the first detector to reach 66.0 AP on COCO using only a ViT-L backbone with 304M parameters. Co-DETR achieves the best results across the board on several important object detection benchmarks. In addition, it attains a substantial lead on the long-tailed LVIS dataset, surpassing the previous SOTA method by +2.7 AP and +6.1 AP on the val and minival validation sets, respectively.

Paper name: DETRs with Collaborative Hybrid Assignments Training


Leaderboard: https://paperswithcode.com/paper/detrs-with-collaborative-hybrid-assignments


Overview

How does a sparse supervision signal affect a detector's ability to learn? Is the slow convergence of DETR detectors caused by sparse supervision and insufficient learning?

Current DETR detectors achieve end-to-end detection by using bipartite matching for label assignment, so each ground-truth box is assigned to exactly one positive sample.

In this case, only a very small fraction of the sparse queries serve as positive samples and receive regression supervision. Exactly which aspects of the detector's learning are affected by this sparse supervision signal is currently unknown, and there are no quantitative metrics to measure the extent of the impact.
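To make the assignment structure concrete, here is a minimal sketch of one-to-one matching on a toy cost matrix, using SciPy's Hungarian solver. The random cost is only a stand-in for DETR's actual classification-plus-box matching cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows are 3 ground-truth boxes, columns are 100 queries.
# In DETR the cost combines classification and box terms; random values
# here only illustrate the structure of the assignment.
rng = np.random.default_rng(0)
cost = rng.random((3, 100))

gt_idx, query_idx = linear_sum_assignment(cost)
print(query_idx)  # exactly 3 of the 100 queries become positive samples
# The remaining 97 queries are negatives and receive no regression supervision.
```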

To further explore these questions, we first visualize the feature maps output by the encoder of Deformable-DETR with an R50 backbone.

As the figure shows, the Deformable-DETR feature visualization is noisy: it is essentially impossible to relate it to the objects in the original image, and strange high-activation patterns appear at the edges of the feature maps.

[Figure: encoder feature map visualizations]

In contrast to the bipartite matching above, traditional detectors (such as Faster R-CNN and ATSS) assign one ground-truth box to multiple anchors as positive samples according to their positional relationship (for convenience, this article collectively refers to priors such as anchors, proposals, and points as anchors).

Since anchors are densely tiled over the feature map, a single location may correspond to multiple anchors of different sizes and aspect ratios, and objects of different sizes are matched to anchors of different scales. This one-to-many assignment therefore provides dense, scale-sensitive supervision, so we conjecture that it supervises the localization of more regions on the feature map and lets the detector learn better features.
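For comparison, here is a minimal sketch of IoU-based one-to-many assignment, a simplified stand-in for the Faster R-CNN/ATSS assignment rules (ATSS actually uses adaptive statistics rather than a fixed threshold), showing several anchors matched to a single ground-truth box:

```python
import torch

def iou(boxes1, boxes2):
    """Pairwise IoU between two sets of xyxy boxes."""
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    lt = torch.max(boxes1[:, None, :2], boxes2[None, :, :2])
    rb = torch.min(boxes1[:, None, 2:], boxes2[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area1[:, None] + area2[None, :] - inter)

# Dense anchors vs. one ground-truth box: every anchor whose IoU exceeds
# the threshold becomes a positive, so one GT supervises many anchors.
anchors = torch.tensor([[0., 0., 50., 50.], [10., 10., 60., 60.],
                        [15., 15., 55., 55.], [200., 200., 250., 250.]])
gt = torch.tensor([[12., 12., 58., 58.]])
pos_mask = iou(anchors, gt).max(dim=1).values > 0.5
print(pos_mask)  # tensor([False,  True,  True, False]) -> multiple positives
```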

To compare how these two label assignment methods affect the feature maps, we directly replaced the decoder of Deformable-DETR with an ATSS head and applied the same visualization.

As the figure shows, the high-activation regions in the ATSS feature visualization cover the foreground of the image well, while the background is barely activated. Combining these visualizations, we believe the difference between the two assignment methods is what weakens the encoder's feature representation in DETR models.

Beyond visualization, we also constructed a metric, the discriminability score, to quantify how distinguishable the feature maps and attention are, putting the visualization results on a quantitative footing. Simply put, the L2 norm of the features at each scale is computed, normalized, and then averaged across scales:

$$\mathcal{S}(x,y)=\frac{1}{J}\sum_{j=1}^{J}\frac{\big\|\mathcal{F}_j(x,y)\big\|_2}{\max_{x',y'}\big\|\mathcal{F}_j(x',y')\big\|_2}$$

where $\mathcal{F}_j$ is the encoder feature at scale $j$ (resized to a common resolution) and $J$ is the number of scales.

Given the discriminability score, we compute its response on the foreground and background and analyze it quantitatively with the IoF-IoB curve. IoF and IoB are computed analogously, as in the following formula:

$$\mathrm{IoF}(t)=\frac{\sum_{x,y}\mathbb{1}\big[\mathcal{S}(x,y)>t\big]\,M^{\mathrm{fg}}(x,y)}{\sum_{x,y}M^{\mathrm{fg}}(x,y)},\qquad \mathrm{IoB}(t)=\frac{\sum_{x,y}\mathbb{1}\big[\mathcal{S}(x,y)>t\big]\,M^{\mathrm{bg}}(x,y)}{\sum_{x,y}M^{\mathrm{bg}}(x,y)}$$

where $M^{\mathrm{fg}}$ and $M^{\mathrm{bg}}$ are the foreground and background masks and $t$ is a threshold on the score.

Simply put, pixels inside the ground-truth boxes are treated as foreground and pixels outside as background, which yields the corresponding foreground and background masks. From these masks and the discriminability score, IoF and IoB can be computed at each threshold.
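The sketch below shows one way to compute the discriminability score and an IoF-IoB curve under the assumptions stated above (per-scale channel L2 norm, min-max normalization, box-interior foreground masks); the function names and details are illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def discriminability(feats):
    """Per-scale channel L2 norm, min-max normalized, averaged over scales
    after resizing to the highest resolution. feats: list of (C, H, W)."""
    H, W = feats[0].shape[-2:]
    scores = []
    for f in feats:
        s = f.norm(p=2, dim=0)                          # (H_j, W_j)
        s = (s - s.min()) / (s.max() - s.min() + 1e-6)  # normalize to [0, 1]
        s = F.interpolate(s[None, None], size=(H, W), mode='bilinear',
                          align_corners=False)[0, 0]
        scores.append(s)
    return torch.stack(scores).mean(dim=0)              # (H, W)

def iof_iob(score, fg_mask, thr):
    """IoF/IoB: fraction of foreground / background pixels with score > thr."""
    hit = score > thr
    iof = (hit & fg_mask).sum() / fg_mask.sum()
    iob = (hit & ~fg_mask).sum() / (~fg_mask).sum()
    return iof.item(), iob.item()

# Toy example: two feature scales and one ground-truth box as foreground.
feats = [torch.randn(256, 64, 64), torch.randn(256, 32, 32)]
score = discriminability(feats)
fg = torch.zeros(64, 64, dtype=torch.bool)
fg[10:40, 10:40] = True                                 # pixels inside the box
curve = [iof_iob(score, fg, t) for t in torch.linspace(0.0, 1.0, steps=20)]
```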

[Figure: IoF-IoB curves for encoder features and decoder attention]

From the IoF-IoB curves, we find that one-to-one matching impairs both the learning of encoder features and the learning of attention in the decoder. This raises the question: can a DETR model keep the end-to-end inference that one-to-one matching enables, while also learning features and attention as well as one-to-many matching allows? Guided by the visualization and metric analysis, this work explores these questions from two directions.

[Figure: overview of the Co-DETR training framework]

To let DETR detectors benefit from one-to-many matching, we introduce two improvements to the DETR training framework, corresponding to the encoder feature learning and decoder attention learning discussed above. The newly added modules are discarded after training, so the inference structure is unchanged.

(1) The analysis above showed that inserting a traditional ATSS detection head after the encoder makes the encoder features more discriminative.

Inspired by this, to strengthen the encoder's learning, we first use a multi-scale adapter to convert the encoder's output into multi-scale features.

For DETRs that use single-scale features, this adapter is structured like a simple feature pyramid; for DETRs that already use multi-scale features, it is the identity mapping. We then feed the multi-scale features into several different auxiliary detection heads, all of which use one-to-many label assignment.

Because traditional detection heads are lightweight, they add little extra training overhead.
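A schematic sketch of this design follows; the module structure and layer choices are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramidAdapter(nn.Module):
    """For single-scale DETRs: expand the encoder's single feature map into a
    small pyramid via transposed / strided convolutions (illustrative choice).
    Multi-scale DETRs would use the identity mapping instead."""
    def __init__(self, dim=256):
        super().__init__()
        self.up = nn.ConvTranspose2d(dim, dim, 2, stride=2)      # 2x resolution
        self.same = nn.Identity()
        self.down = nn.Conv2d(dim, dim, 3, stride=2, padding=1)  # 0.5x

    def forward(self, feat):
        return [self.up(feat), self.same(feat), self.down(feat)]

# Each auxiliary head (e.g. an ATSS or Faster R-CNN head) consumes the pyramid
# and trains with its own one-to-many assignment; all auxiliary heads are
# dropped after training, so the deployed model is unchanged.
adapter = SimpleFeaturePyramidAdapter()
pyramid = adapter(torch.randn(1, 256, 32, 32))
print([p.shape for p in pyramid])  # (1,256,64,64), (1,256,32,32), (1,256,16,16)
```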

(2) To strengthen the decoder's attention learning, we propose customized positive-query generation.

The analysis above showed that the densely tiled anchors of traditional detectors provide dense, scale-sensitive supervision.

Can these anchors then be used as queries to provide sufficient supervision for attention learning? Indeed they can. In the previous step, each auxiliary detection head has already been assigned its positive anchors and their matched ground-truths.

We directly inherit the auxiliary heads' label assignment results, convert their positive anchors into positive queries, and feed them to the decoder. No bipartite matching is needed when computing the loss; the earlier assignment results are reused directly.

Unlike other methods that introduce auxiliary queries, which inevitably bring in a large number of negative queries, we add only positive samples to the decoder, so the extra training cost is also small.
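A minimal sketch of the idea, with an assumed box-to-query embedding (a hypothetical MLP, not the paper's exact design):

```python
import torch

def positive_queries_from_anchors(pos_anchors, pos_gt_boxes, embed):
    """Turn the auxiliary heads' positive anchors into decoder queries
    (illustrative sketch). The inherited assignment supplies the regression
    targets directly, so no bipartite matching is needed for these queries.
    pos_anchors: (N, 4) normalized boxes from one-to-many assignment.
    embed: maps a box to a query embedding (assumed, e.g. an MLP)."""
    queries = embed(pos_anchors)   # (N, C) queries fed to the decoder
    targets = pos_gt_boxes         # loss is computed against these directly
    return queries, targets

# Toy usage with a hypothetical 2-layer MLP as the box-to-query embedding.
embed = torch.nn.Sequential(torch.nn.Linear(4, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 256))
anchors = torch.rand(7, 4)         # 7 positives inherited from auxiliary heads
gts = torch.rand(7, 4)
q, t = positive_queries_from_anchors(anchors, gts, embed)
print(q.shape)  # torch.Size([7, 256]) -> only positive queries are added
```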


Results

[Table: Co-DETR gains on single-scale and multi-scale DETR variants]

We first ran experiments on several single-scale and multi-scale DETR models. Co-DETR brings large improvements: notably, the SOTA model DINO-5scale improves from 49.4 to 51.2 AP, a gain of nearly 2 points. Experiments with a larger backbone such as Swin-L show a further improvement of 1.7 points.

[Table: comparison with state-of-the-art detectors on COCO]

Applying Co-DETR to DINO with R50 and Swin-L backbones, we achieve the best performance among models of comparable size.

We also verified the effectiveness and scaling ability of Co-DETR on a large model, since the sheer parameter count of large models tends to smooth away the differences between methods. Using ViT-L with 304M parameters as the backbone, we pre-trained on the Objects365 dataset and then fine-tuned downstream. After fine-tuning on COCO, Co-DETR pushed the upper limit of object detection performance further with the support of the large model, becoming the first detector to reach 66.0 AP.

In addition, we fine-tuned on the long-tailed LVIS dataset, using only detection boxes for supervision during training. Co-DETR achieved 67.9 AP and 71.9 AP on LVIS val and minival respectively, +2.7 AP and +6.1 AP higher than the previous SOTA method, a clear performance lead.

[Table: results on LVIS val and minival]

This study also presents ablation experiments on the proposed method, such as the criteria for selecting auxiliary heads and the conflicts that arise when multiple auxiliary heads use different label assignment strategies.

We observed that as the number of distinct auxiliary heads increases, model performance first rises and then falls. A quantitative analysis attributes this to conflicts between the auxiliary heads; we propose a metric to measure the degree of conflict, and use it to quantify how much conflict each type of auxiliary head introduces and to derive an optimal selection strategy.


Relevant information

The project has been open sourced; everyone is welcome to use it and exchange ideas.

Paper:

https://arxiv.org/pdf/2211.12860v4.pdf

Code:

https://github.com/Sense-X/Co-DETR
