Co-DETR: DETRs with Collaborative Hybrid Assignments Training paper study notes

Paper: https://arxiv.org/pdf/2211.12860.pdf

Code: GitHub - Sense-X/Co-DETR: [ICCV 2023] DETRs with Collaborative Hybrid Assignments Training

Summary

The authors propose a new collaborative hybrid assignments training scheme, Co-DETR, to learn more efficient and effective DETR-based detectors from versatile label assignment manners. The scheme trains multiple parallel auxiliary heads, such as ATSS and Faster R-CNN, under the supervision of one-to-many label assignments, which easily improves the learning ability of the encoder in end-to-end detectors. In addition, the authors construct extra customized positive queries by extracting positive coordinates from these auxiliary heads, improving the training efficiency of positive samples in the decoder. During inference the auxiliary heads are discarded, so the method introduces no additional parameters or computational cost to the original detector, while still requiring no hand-crafted non-maximum suppression (NMS).

Too long; didn't read (TL;DR) version:

This work builds on DINO. The memory output by the encoder of DINO's deformable transformer is expanded back into per-level feature maps according to the size of each level (plus one extra downsampled level, giving five feature maps in total). On these five feature maps, one-to-many label assignment heads, namely ATSS and Faster R-CNN, are trained with supervision; apart from the feature maps coming from the encoder, their training is the same as standard ATSS and Faster R-CNN. The positive outputs of ATSS and Faster R-CNN are then encoded into queries and used as input to an auxiliary decoder that is identical to the main decoder. These queries skip bipartite (Hungarian) matching when the loss is computed. The structure diagram in the paper illustrates the algorithm flow very well; a shape-level sketch of the same flow is given below.
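To make the data flow above concrete, here is a minimal, shape-level PyTorch sketch of the training-time pipeline. All module names, shapes and the toy auxiliary head are illustrative stand-ins (my assumptions), not the classes used in the official Sense-X/Co-DETR repository.

```python
# Shape-level sketch of the Co-DETR training flow summarized above.
import torch
import torch.nn as nn

C, H, W = 256, 32, 32                 # single-scale encoder memory
num_query = 300                       # learnable object queries of the main branch

# 1) Memory produced by the deformable-transformer encoder (random stand-in).
memory = torch.randn(1, C, H, W)

# 2) Multi-scale adapter: downsample the memory into a small feature pyramid.
down = nn.Conv2d(C, C, 3, stride=2, padding=1)
pyramid = [memory, down(memory), down(down(memory))]

# 3) Auxiliary one-to-many heads (ATSS / Faster R-CNN style); here a toy head
#    predicting 80 class logits + 4 box offsets per location on every level.
aux_head = nn.Conv2d(C, 80 + 4, 1)
aux_preds = [aux_head(f) for f in pyramid]   # trained with ATSS / R-CNN assignment

# 4) Positive coordinates returned by the one-to-many assigners are encoded into
#    extra "customized positive queries" (here: a linear map of M random boxes).
M = 20
pos_boxes = torch.rand(M, 4)                 # normalized (cx, cy, w, h)
pos_queries = nn.Linear(4, C)(pos_boxes)

# 5) The decoder consumes the main queries plus the customized positive queries;
#    the auxiliary query group skips Hungarian matching at loss time.
layer = nn.TransformerDecoderLayer(d_model=C, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)
queries = torch.cat([torch.randn(1, num_query, C), pos_queries[None]], dim=1)
hs = decoder(queries, memory.flatten(2).transpose(1, 2))
print(hs.shape)                              # torch.Size([1, 320, 256])
```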


If you don’t know much about DINO, you can read my series of blog posts:

DINO code study notes (1)_athrunsunny's blog-CSDN blog

DINO code study notes (2)_athrunsunny's blog-CSDN blog

DINO code study notes (3)-CSDN blog

DINO code study notes (4)-CSDN blog

Or, the original author of DINO also explained his article on Zhihu: DINO

And you can deepen your understanding of the deformable transformer through Deformable-DETR code study notes_athrunsunny's blog-CSDN blog

Introduction to ATSS:

ATSS (Adaptive Training Sample Selection) is an object detection method used to accurately detect and localize objects in images. It is a single-stage approach designed to address the difficulties traditional single-stage methods have when dealing with dense objects and objects at very different scales.

The core idea of ATSS is to select training samples adaptively for each ground-truth object: candidate anchors are first shortlisted by the distance between their centers and the object center, and positives are then chosen with an IoU threshold computed from statistics of those candidates. Traditional single-stage detectors use a fixed positive/negative sampling strategy (for example, a fixed IoU threshold), which can cause sample imbalance when dealing with dense objects and objects at extreme scales. ATSS improves performance by introducing an adaptive sample selection mechanism that dynamically decides positive and negative samples for each object.

Specifically, for each ground-truth box, ATSS first selects on every feature pyramid level the top-k anchors whose centers are closest to the ground-truth center as candidates. It then computes the IoU between these candidates and the ground-truth box, and uses the mean plus the standard deviation of these IoUs as an adaptive threshold: candidates whose IoU exceeds the threshold (and whose centers lie inside the ground-truth box) become positive samples, while the remaining anchors are negatives. This avoids hand-tuning a fixed IoU threshold, keeps the number of positives roughly balanced across object scales, and improves the model's ability to learn from difficult samples. A minimal sketch of this selection procedure is given below.
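A minimal PyTorch sketch of this selection for a single ground-truth box; it is illustrative only (the real ATSS also requires candidate centers to lie inside the ground-truth box and handles all ground truths at once).

```python
# Toy ATSS positive-sample selection for one ground-truth box.
import torch

def atss_positives(ious, centers, levels, gt_box, topk=9):
    """ious: (N,) anchor-vs-GT IoU; centers: (N, 2); levels: (N,) pyramid level id;
    gt_box: (4,) tensor as (x1, y1, x2, y2). Returns indices of positive anchors."""
    gt_center = torch.stack([(gt_box[0] + gt_box[2]) / 2, (gt_box[1] + gt_box[3]) / 2])
    dist = (centers - gt_center).norm(dim=1)

    # 1) On every pyramid level, take the top-k anchors closest to the GT center.
    candidates = []
    for lvl in levels.unique():
        idx = (levels == lvl).nonzero(as_tuple=True)[0]
        k = min(topk, idx.numel())
        candidates.append(idx[dist[idx].topk(k, largest=False).indices])
    candidates = torch.cat(candidates)

    # 2) Adaptive threshold = mean + std of the candidates' IoUs with this GT.
    cand_ious = ious[candidates]
    thr = cand_ious.mean() + cand_ious.std()

    # 3) Candidates whose IoU exceeds the adaptive threshold become positives.
    return candidates[cand_ious >= thr]
```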


1 Introduction

        In this paper, the authors try to make DETR-based detectors superior to traditional detectors while maintaining their end-to-end advantage. To address this challenge, the authors focus on an intuitive drawback of one-to-one set matching: it explores too few positive queries, which leads to serious training inefficiency. The authors analyze this in detail from two aspects: the latent representation generated by the encoder and the attention learning of the decoder. They first compare the discriminability score of the latent features between Deformable-DETR and a one-to-many label assignment method, obtained by simply replacing the decoder with the ATSS head. The discriminability score is the ℓ2 norm of the feature at each spatial coordinate: given the encoder output F ∈ R^{C×H×W}, a discriminability score map S ∈ R^{1×H×W} is obtained. The higher the score in a region, the more easily the object there can be detected. As shown in Figure 2, the authors plot the IoF-IoB curve (IoF: intersection of foreground, IoB: intersection of background) by applying different thresholds to the discriminability score (see Section 3.4 for details). The higher IoF-IoB curve of ATSS indicates that its encoder features distinguish foreground from background more easily.

        The authors further visualize the discriminability score map S in Figure 3. It is obvious that the one-to-many label assignment method fully activates the features in salient regions, which are far less explored under one-to-one set matching. To examine decoder training, the authors also show the IoF-IoB curves of cross-attention scores for decoders based on Deformable-DETR and Group-DETR, the latter introducing more positive queries into the decoder. The inset in Figure 2 shows that too few positive queries also hurt attention learning, and that adding more positive queries in the decoder can slightly alleviate this.

        These observations motivated the authors to propose a simple yet effective method: the collaborative hybrid assignments training scheme (Co-DETR). The key idea of Co-DETR is to use versatile one-to-many label assignments to improve the training efficiency and effectiveness of both the encoder and the decoder. More specifically, the authors attach auxiliary heads to the output of the transformer encoder. These heads can be supervised by versatile one-to-many label assignments such as ATSS, FCOS, and Faster R-CNN. Different label assignments enrich the supervision of the encoder output, which forces it to be discriminative enough to support the training convergence of these heads. To further improve the training efficiency of the decoder, the authors carefully encode the coordinates of positive samples, including positive anchors and positive proposals, from these auxiliary heads. They are sent to the original decoder as extra groups of positive queries to predict the pre-assigned classes and bounding boxes. The positive coordinates from each auxiliary head form an independent group, isolated from the other groups. The versatile one-to-many label assignments introduce a large number of (positive query, ground truth) pairs, improving the training efficiency of the decoder. Note that only the original decoder is used during inference, so the proposed training scheme only introduces extra overhead during training.

        The authors conducted extensive experiments to evaluate the efficiency and effectiveness of the proposed method. As shown in Figure 3, Co-DETR greatly alleviates the encoder's feature-learning problem under one-to-one set matching. As a plug-and-play method, it can easily be combined with different DETR variants, including DAB-DETR, Deformable-DETR and DINO-Deformable-DETR.

As shown in Figure 1, Co-DETR achieves faster training convergence and even higher performance. Specifically, the authors improve the basic Deformable-DETR by 5.8% AP with 12-epoch training and 3.2% AP with 36-epoch training. The state-of-the-art DINO-Deformable-DETR with Swin-L can still be improved from 58.5% to 59.5% AP on COCO val. Surprisingly, with a ViT-L backbone it achieves 66.0% AP on COCO test-dev and 67.9% AP on LVIS val, establishing a new state-of-the-art detector with a smaller model size.

2. Related work

One-to-many label assignment

For one-to-many label assignment in object detection, multiple candidate boxes can be assigned to the same ground-truth box as positive samples during training. In classic anchor-based detectors such as Faster R-CNN and RetinaNet, sample selection is guided by a predefined IoU threshold and the IoU between anchors and annotated boxes. The anchor-free FCOS uses a center prior to assign spatial positions near the center of each bounding box as positives. Adaptive mechanisms were later introduced into one-to-many label assignment to overcome the limitations of fixed assignment rules: ATSS performs adaptive anchor selection using an IoU threshold computed from the statistics of the k anchors closest to each ground truth, and PAA adaptively divides anchors into positive and negative samples in a probabilistic manner. In this paper, the authors propose a collaborative hybrid assignments scheme to improve the encoder representation through auxiliary heads with one-to-many label assignment.

One-to-one set matching 

The groundbreaking transformer-based detector DETR incorporates a one-to-one set matching scheme into object detection to achieve fully end-to-end detection. The one-to-one set matching strategy first computes a global matching cost via Hungarian matching and assigns only one positive sample, the one with the smallest matching cost, to each ground-truth box. DN-DETR showed that the instability of one-to-one set matching causes slow convergence and introduced denoising training to alleviate it. DINO inherits DAB-DETR's advanced query formulation and incorporates improved contrastive denoising to achieve state-of-the-art performance. Group-DETR constructs group-wise one-to-many label assignment that leverages multiple positive object queries, similar to the hybrid matching scheme in H-DETR. Unlike these follow-up works, this paper offers a new perspective on collaboratively optimizing one-to-one set matching with conventional one-to-many assignments.

3. Method

        Input images are fed into the backbone and encoder to generate latent features. Multiple predefined object queries then interact with these features in the decoder via cross-attention. The authors introduce Co-DETR to improve the feature learning of the encoder and the attention learning of the decoder through a collaborative hybrid assignments training scheme and customized positive query generation. These modules are described in detail below, along with insights into why they work well.

3.2. Collaborative Hybrid Assignments Training

        To alleviate the sparse supervision of the encoder output caused by the small number of positive queries in the decoder, we attach versatile auxiliary heads with different one-to-many label assignment paradigms, such as ATSS and Faster R-CNN. Different label assignments enrich the supervision of the encoder output, which forces it to be discriminative enough to support training convergence of these heads. Specifically, given the latent feature F of the encoder, we first convert it into a feature pyramid {F1, ···, FJ} through a multi-scale adapter, where Fj is the feature map with downsampling stride 2^(2+j). Similar to ViTDet, for a single-scale encoder the feature pyramid is built from a single feature map, using 3×3 convolutions with stride 2 for downsampling and bilinear interpolation followed by a 3×3 convolution for upsampling. For a multi-scale encoder, we only downsample the coarsest feature in F to complete the pyramid. With K collaborative heads and corresponding label assignment manners Ak, the pyramid {F1, ···, FJ} is sent to the i-th head to obtain the predictions P̂i, and Ai is used to compute the supervision targets for the positive and negative samples in P̂i. Denoting the ground-truth set as G, this process can be expressed as:
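The equation image from the original post is missing here; based on the definitions that follow, the assignment step has roughly this form (my reconstruction, notation may differ slightly from the paper):

$$
P_{i}^{pos},\; B_{i}^{pos},\; P_{i}^{neg} \;=\; \mathcal{A}_{i}\big(\hat{P}_{i},\, G\big)
$$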

where {pos} and {neg} denote the sets of pairs (j, positive or negative coordinates in Fj) determined by Ai, and j is the feature index in {F1, ···, FJ}. B^{pos}_i is the set of positive spatial coordinates. P^{pos}_i and P^{neg}_i are the supervision targets at the corresponding coordinates, including categories and regression offsets. Table 1 describes each variable in detail.
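A rough PyTorch sketch of the multi-scale adapter described above for the single-scale case, with stride-2 3×3 convolutions for downsampling and bilinear interpolation plus a 3×3 convolution for upsampling; channel width, level counts and class names are my assumptions, not the repository code.

```python
# Toy multi-scale adapter: single-scale encoder memory -> 5-level feature pyramid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAdapter(nn.Module):
    def __init__(self, channels=256, num_down=2, num_up=2):
        super().__init__()
        self.down = nn.ModuleList(nn.Conv2d(channels, channels, 3, 2, 1) for _ in range(num_down))
        self.up = nn.ModuleList(nn.Conv2d(channels, channels, 3, 1, 1) for _ in range(num_up))

    def forward(self, memory):                      # memory: (B, C, H, W)
        feats, x = [memory], memory
        for conv in self.down:                      # coarser pyramid levels
            x = conv(x)
            feats.append(x)
        x = memory
        for conv in self.up:                        # finer pyramid levels
            x = conv(F.interpolate(x, scale_factor=2.0, mode="bilinear", align_corners=False))
            feats.insert(0, x)
        return feats                                # ordered finest -> coarsest

pyramid = MultiScaleAdapter()(torch.randn(1, 256, 32, 32))
print([tuple(f.shape[-2:]) for f in pyramid])       # [(128, 128), (64, 64), (32, 32), (16, 16), (8, 8)]
```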

The loss function can be defined as:
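Reconstructing the missing formula from the surrounding text, the loss of the i-th collaborative head is roughly:

$$
\mathcal{L}_{i}^{enc} \;=\; \mathcal{L}\big(\hat{P}_{i}^{pos},\, P_{i}^{pos}\big) \;+\; \mathcal{L}\big(\hat{P}_{i}^{neg},\, P_{i}^{neg}\big)
$$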

 Note that the regression loss is discarded for negative samples. The training objective for optimizing the K auxiliary heads is:
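Again reconstructed from the text, this is simply the sum over the K heads:

$$
\mathcal{L}^{enc} \;=\; \sum_{i=1}^{K} \mathcal{L}_{i}^{enc}
$$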

3.3. Customized Positive Queries Generation

        In the one-to-one set matching paradigm, each ground-truth box is assigned to only one specific query as its supervision target. Too few positive queries lead to inefficient cross-attention learning in the transformer decoder, as shown in Figure 2. To mitigate this, we carefully generate sufficiently many customized positive queries according to the label assignment Ai of each auxiliary head. Specifically, given the positive coordinate set B^{pos}_i ∈ R^{Mi×4} of the i-th auxiliary head, where Mi is the number of positive samples, the extra customized positive queries Qi ∈ R^{Mi×C} can be generated as follows:

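The formula image is missing; in the paper's notation the customized queries are built roughly as follows (my reconstruction), where E(·) selects encoder features at the positive coordinates:

$$
Q_{i} \;=\; E\big(\mathrm{PE}(B_{i}^{pos}),\, \{F_{1},\cdots,F_{J}\}\big) \;\in\; \mathbb{R}^{M_{i}\times C}
$$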
where PE(·) denotes positional encoding, and the corresponding features are selected from E(·) according to the index pairs (j, positive coordinates in Fj). A rough sketch of this query construction is shown below.
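A rough PyTorch sketch of this construction, simplified to a single feature level: each positive box gets a sine positional encoding plus the encoder feature at its center. Function names and details are my assumptions, not the official implementation.

```python
# Toy customized positive query generation from positive boxes and encoder memory.
import math
import torch

def sine_box_encoding(boxes, dim=256, temperature=10000):
    """boxes: (M, 4) normalized (cx, cy, w, h) -> (M, dim) sine embedding."""
    scale = 2 * math.pi
    d = dim // 4
    freqs = temperature ** (torch.arange(d, dtype=torch.float32) / d)
    pos = boxes[:, :, None] * scale / freqs                  # (M, 4, d)
    pos = torch.cat([pos.sin(), pos.cos()], dim=-1)          # (M, 4, 2d)
    return pos.flatten(1)[:, :dim]                           # (M, dim)

def customized_positive_queries(pos_boxes, memory):
    """pos_boxes: (M, 4) normalized boxes; memory: (C, H, W) encoder feature."""
    C, H, W = memory.shape
    pe = sine_box_encoding(pos_boxes, dim=C)                  # PE(B_pos)
    cx = (pos_boxes[:, 0] * (W - 1)).long().clamp(0, W - 1)
    cy = (pos_boxes[:, 1] * (H - 1)).long().clamp(0, H - 1)
    content = memory[:, cy, cx].t()                           # feature at box centers, (M, C)
    return pe + content                                       # (M, C) customized queries

queries = customized_positive_queries(torch.rand(20, 4), torch.randn(256, 32, 32))
print(queries.shape)   # torch.Size([20, 256])
```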

        Therefore, during training there are K+1 groups of queries: a single one-to-one set matching branch and K branches with one-to-many label assignments. The auxiliary one-to-many label assignment branches share the same parameters as the L decoder layers of the original main branch. All queries in the auxiliary branches are regarded as positive queries, so Hungarian matching is skipped for them. Specifically, the loss of the l-th decoder layer in the i-th auxiliary branch can be expressed as:
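Reconstruction of the missing formula, using the P̃i,l defined just below:

$$
\tilde{\mathcal{L}}_{i,l}^{dec} \;=\; \tilde{\mathcal{L}}\big(\tilde{P}_{i,l},\, P_{i}^{pos}\big)
$$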

Here P̃i,l is the prediction output by the l-th decoder layer of the i-th auxiliary branch. Finally, the overall training objective of Co-DETR is:
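A plausible reconstruction of the missing objective; the exact grouping of λ1 and λ2 should be checked against the paper:

$$
\mathcal{L}^{global} \;=\; \sum_{l=1}^{L}\Big(\tilde{\mathcal{L}}_{l}^{dec} \;+\; \lambda_{1}\sum_{i=1}^{K}\tilde{\mathcal{L}}_{i,l}^{dec}\Big) \;+\; \lambda_{2}\,\mathcal{L}^{enc}
$$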

 Here, L̃^{dec}_l denotes the loss of the original one-to-one set matching branch at the l-th decoder layer, and λ1 and λ2 are coefficients that balance the losses.

3.4. Why Co-DETR works

        Co-DETR brings a significant improvement to DETR-based detectors. Next, we attempt to examine its effectiveness both qualitatively and quantitatively. The analysis is based on Deformable-DETR with a ResNet-50 backbone under the 36-epoch setting.

Enrich the encoder’s supervisions 

        Intuitively, too few positive queries lead to sparse supervision, since only one query per ground-truth box is supervised by the regression loss. Positive samples under one-to-many label assignment receive more localization supervision, which helps latent feature learning. To further explore how sparse supervision hinders model training, the authors examine in detail the latent features produced by the encoder. An IoF-IoB curve is introduced to quantify the discriminability of the encoder output. Specifically, given the latent features F of the encoder and inspired by the feature visualization in Figure 3, the authors compute IoF (intersection of foreground) and IoB (intersection of background). Given the feature Fj ∈ R^{C×Hj×Wj} from the j-th encoder level, its channel-wise ℓ2 norm, a map in R^{1×Hj×Wj}, is first computed and resized to the image size H×W. The discriminability score D(F) is then obtained by averaging the scores of all levels:
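Reconstructed from the description above:

$$
\mathcal{D}(F) \;=\; \frac{1}{J}\sum_{j=1}^{J}\lVert F_{j}\rVert_{2}, \qquad \lVert F_{j}\rVert_{2}\in\mathbb{R}^{1\times H_{j}\times W_{j}}
$$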

The resize operation is omitted from the formula for simplicity. We visualize the discriminability scores of ATSS, Deformable-DETR and Co-Deformable-DETR in Figure 3. Compared with Deformable-DETR, both ATSS and Co-Deformable-DETR highlight key object regions much more strongly, whereas Deformable-DETR barely separates them from the background. Accordingly, we define the indicators of foreground and background as 1(D(F) > S) ∈ R^{H×W} and 1(D(F) < S) ∈ R^{H×W} respectively, where S is a predefined score threshold and 1(x) is 1 if x is true and 0 otherwise. For the foreground mask M^{fg} ∈ R^{H×W}, the element M^{fg}_{h,w} is 1 if the point (h, w) lies inside the foreground and 0 otherwise. The intersection-of-foreground (IoF) can then be calculated as:
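Reconstructed from the description: at a threshold S, the IoF is the fraction of the foreground mask covered by high-scoring points:

$$
\mathrm{IoF} \;=\; \frac{\sum_{h,w}\mathbb{1}\big(\mathcal{D}(F)_{h,w}>S\big)\, M^{fg}_{h,w}}{\sum_{h,w} M^{fg}_{h,w}}
$$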

The intersection-of-background (IoB) is computed in a similar manner, and the IoF-IoB curves in Figure 2 are plotted by varying S. Clearly, ATSS and Co-Deformable-DETR obtain higher IoF values than Deformable-DETR and Group-DETR at the same IoB value, which indicates that the encoder representation benefits from one-to-many label assignment.

Improve the cross-attention learning by reducing the instability of Hungarian matching 

        Hungarian matching is the core scheme of one-to-one set matching. Cross-attention is an important operation that helps positive queries encode rich object information, and it requires adequate training. The authors observed that Hungarian matching introduces uncontrollable instability, because the ground-truth object assigned to a specific positive query in the same image changes during training. The authors compare this instability in Figure 5 and find that their method leads to a more stable matching process. In addition, to quantify how well cross-attention is optimized, the IoF-IoB curve of the attention scores is also computed. Similar to the feature discriminability score, different thresholds are applied to the attention scores to obtain multiple IoF-IoB pairs. The comparison between Deformable-DETR, Group-DETR and Co-Deformable-DETR is shown in Figure 2. The authors find that the IoF-IoB curves of DETR variants with more positive queries generally lie above that of Deformable-DETR, which is consistent with their motivation.

3.5. Comparison with other methods

Differences between our method and other counterparts

        Group-DETR, H-DETR and SQR [2] implement one-to-many assignment through one-to-one matching with repeated groups or repeated ground-truth boxes. Co-DETR instead explicitly assigns multiple spatial coordinates as positives for each ground-truth box. These dense supervision signals are applied directly to the latent feature map, giving it stronger discriminative power; Group-DETR, H-DETR and SQR lack this mechanism. Although more positive queries are introduced in these counterparts, their one-to-many assignment implemented by Hungarian matching still suffers from the instability of one-to-one matching. The authors' approach benefits from the stability of off-the-shelf one-to-many assignments and inherits their specific way of matching positive queries to ground-truth boxes. Group-DETR and H-DETR do not reveal the complementarity between one-to-one matching and traditional one-to-many assignment; to the best of the authors' knowledge, this is the first quantitative and qualitative analysis of detectors with traditional one-to-many assignment and one-to-one matching. This helps to better understand their differences and complementarity, so that DETR's learning ability can be naturally improved by leveraging off-the-shelf one-to-many assignment designs, without requiring extra specialized one-to-many designs.

No negative queries are introduced in the decoder 

        Repeated object queries inevitably bring a large number of negative queries into the decoder and significantly increase GPU memory consumption. The authors' method only handles positive coordinates in the decoder and therefore consumes less memory, as shown in Table 7.

4. Experiment

Implementation details 

        Co-DETR is incorporated into current DETR-like pipelines, keeping the training settings consistent with the baseline. When K = 2, ATSS and Faster R-CNN are used as auxiliary heads; when K = 1, only ATSS is retained. More details about the auxiliary heads can be found in the supplementary material. The number of learnable object queries is set to 300, and the default {λ1, λ2} is {1.0, 2.0}. For Co-DINO-Deformable-DETR++, the authors use copy-paste augmentation with large-scale jittering.

4.2. Main Results

        In this section, the effectiveness and generalization ability of Co-DETR on different DETR variants are empirically analyzed in Table 2 and Table 3. All results are reproduced using mmdetection. Collaborative hybrid assignments training is first applied to single-scale DETR models with C5 features. Surprisingly, Conditional-DETR and DAB-DETR achieve 2.4% and 2.3% AP gains over the baseline under longer training schedules. For Deformable-DETR with multi-scale features, the detection performance improves significantly from 37.1% to 42.9% AP. When training is extended to 36 epochs, the overall improvement (+3.2% AP) still holds. Furthermore, experiments on the improved Deformable-DETR (denoted Deformable-DETR++) following [16] show an AP gain of +2.4%. The state-of-the-art DINO-Deformable-DETR with this method achieves 51.2% AP, +1.8% higher than the competitive baseline.

        Based on two state-of-the-art baselines, the backbone capacity is further extended from ResNet-50 to Swin-L. As shown in Table 3, Co-DETR achieves 56.9% AP and greatly exceeds the Deformable-DETR++ baseline (+1.7% AP). The performance of DINO-Deformable-DETR with Swin-L can still be improved from 58.5% AP to 59.5% AP.

4.3. Comparisons with the state-of-the-art

        The authors apply the K = 2 method to Deformable-DETR++ and DINO. In addition, Co-DINO-Deformable-DETR adopts the quality focal loss and NMS. A comparison on COCO val is reported in Table 4. The method converges much faster than other competitors: for example, Co-DINO-Deformable-DETR reaches 52.1% AP with only 12 epochs using a ResNet-50 backbone. With Swin-L, it achieves 58.9% AP on the 1× schedule, even surpassing other state-of-the-art frameworks trained with a 3× schedule. More importantly, the best model Co-DINO-Deformable-DETR++ achieves 54.8% AP with ResNet-50 and 60.7% AP with Swin-L under 36-epoch training, significantly outperforming all existing detectors with the same backbones.

        To further explore the scalability of the method, the authors expand the backbone capacity to 304 million parameters. This large-scale ViT-L backbone is pre-trained with a self-supervised learning method (EVA-02). Co-DINO-Deformable-DETR with ViT-L is first pre-trained on Objects365 for 26 epochs and then fine-tuned on COCO for 12 epochs. During fine-tuning, the input resolution is randomly selected between 480×2400 and 1536×2400. Detailed settings are in the supplementary material. The results are evaluated with test-time augmentation. Table 5 compares with the state of the art on the COCO test-dev benchmark. With a smaller model size (304M parameters), Co-DETR sets a new record of 66.0% AP on COCO test-dev, +0.5% AP higher than the previous best model, InternImage-G.

        The authors also report the best results of Co-DETR on the long-tailed LVIS detection dataset. In particular, the same Co-DINO-Deformable-DETR++ model as on COCO is used, but FedLoss is chosen as the classification loss to compensate for the imbalanced data distribution. Only bounding-box supervision is applied and object detection results are reported. Table 6 gives the comparison results. Co-DETR with Swin-L reaches 56.9% and 62.3% AP on LVIS val and minival, surpassing ViTDet with an MAE-pretrained ViT-H backbone and GLIPv2 by 3.5% and 2.5% AP respectively. The authors further fine-tune the Objects365-pretrained Co-DETR on this dataset. Without test-time augmentation, the method achieves the best detection performance of 67.9% and 71.9% AP on LVIS val and minival respectively. Compared to InternImage-G with 3 billion parameters and test-time augmentation, this yields +4.7% and +6.1% AP gains on LVIS val and minival while reducing the model size to about 1/10.

4.4. Ablation Studies

        Unless otherwise stated, all ablation experiments are performed on Deformable-DETR with a ResNet-50 backbone. By default, the number of auxiliary heads K is set to 1 and the total batch size to 32. More ablations and analyses can be found in the supplementary material.

Criteria for choosing auxiliary heads 

        The authors further examine the criteria for choosing auxiliary heads in Tables 7 and 8. The results in Table 8 show that any auxiliary head with one-to-many label assignment consistently improves the baseline, with ATSS achieving the best performance. When K is less than 3, accuracy keeps improving as K increases. Notably, performance degrades at K = 6, which the authors speculate is caused by severe conflicts between auxiliary heads: if feature learning is inconsistent across auxiliary heads, the improvement is destroyed as K keeps growing. The optimization consistency among heads is analyzed below and in the supplementary material. In summary, any head can be chosen as an auxiliary head; to achieve the best performance with K ≤ 2, ATSS and Faster R-CNN are usually chosen as the auxiliary heads, and too many different heads (for example, 6) should be avoided to prevent optimization conflicts.

Conflicts analysis

        When the same spatial coordinates are assigned to different foreground boxes or as background in different auxiliary heads, conflicts arise that confuse the training of the detector. We first define the distance between head Hi and head Hj, and the average distance of Hi to measure the optimization conflict as:
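The formula image is missing; one plausible form consistent with the description (my reconstruction, not necessarily the paper's exact definition), where C(H, I) is the class activation map of head H on image I:

$$
\mathrm{dist}(\mathcal{H}_{i},\mathcal{H}_{j}) \;=\; \frac{1}{|D|}\sum_{I\in D}\mathrm{KL}\big(\mathcal{C}(\mathcal{H}_{i},I)\,\big\|\,\mathcal{C}(\mathcal{H}_{j},I)\big),
\qquad
\overline{\mathrm{dist}}(\mathcal{H}_{i}) \;=\; \frac{1}{K-1}\sum_{j\neq i}\mathrm{dist}(\mathcal{H}_{i},\mathcal{H}_{j})
$$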

        Here KL, D, I, and C denote the KL divergence, the dataset, an input image, and class activation maps (CAM) respectively. As shown in Figure 6, the average distance between auxiliary heads is computed when K > 1, and the distance between the DETR head and the single auxiliary head when K = 1. The distance metric is insignificant when K = 1, an observation consistent with the results in Table 8: with K = 1, the DETR head improves synergistically with any auxiliary head. When K increases to 2, the distance metric increases slightly and the method reaches its best performance, as shown in Table 7. When K increases from 3 to 6, the distance surges, indicating serious optimization conflicts between the auxiliary heads that lead to performance degradation. However, a baseline with 6 identical ATSS heads reaches 49.5% AP, while replacing them with 6 different heads reduces it to 48.9% AP; the authors therefore speculate that too many different auxiliary heads (for example, more than 3) exacerbate the conflicts. To summarize, the optimization conflict is affected both by the number of different auxiliary heads and by the relationships between them.

Should the added heads be different?

        In the authors' analysis, co-training with two ATSS heads (49.2% AP) still improves over one ATSS head (48.7% AP), because ATSS is complementary to the DETR head. Moreover, introducing a complementary auxiliary head that differs from the existing one, such as Faster R-CNN, brings a larger gain (49.5% AP). Note that this does not contradict the conclusion above: since the conflicts are not significant, the best performance can be obtained with a small number of different heads (K ≤ 2), whereas using many different heads (K > 3) leads to serious conflicts.

The effect of each component.

        The authors perform component-wise ablation to analyze the impact of each component in Table 9. Since dense spatial supervision makes the encoder features more discriminative, incorporating the auxiliary heads yields significant gains. Meanwhile, introducing the customized positive queries also contributes significantly to the final results while improving the training efficiency of one-to-one set matching. Both techniques speed up convergence and improve performance. Overall, the improvement results from more discriminative encoder features and more efficient attention learning in the decoder.

Comparisons to the longer training schedule

        As shown in Table 10, longer training schedules do not benefit Deformable-DETR once its performance saturates. In contrast, Co-DETR greatly speeds up convergence and improves the peak performance.

Performance of auxiliary branches

        Surprisingly, the authors observe in Table 11 that Co-DETR also brings consistent gains to the auxiliary heads. This means the training paradigm contributes to more discriminative encoder representations, which improve the performance of both the decoder and the auxiliary heads.

Difference in distribution of original and customized positive queries

        The authors visualize the positions of the original positive queries and the customized positive queries in Figure 7a. Only one object is shown per image (green box). Positive queries assigned by Hungarian matching in the decoder are marked in red, and the positive queries extracted from Faster R-CNN and ATSS are marked in blue and orange respectively. These customized queries are distributed around the central region of the instance, providing sufficient supervision signals for the detector.

Does distribution difference lead to instability?

        The authors calculate the average distance between the original query and the custom query in Figure 7b. The average distance between original negative queries and custom positive queries is significantly larger than the distance between original and custom positive queries. Since the distribution gap between the original query and the custom query is small, no instability is encountered during training.

The training configuration in the code uses projects/configs/co_deformable_detr/co_deformable_detr_r50_1x_coco.py. By default, K = 2 is used, i.e., ATSS + Faster R-CNN auxiliary heads.

Origin blog.csdn.net/athrunsunny/article/details/134565362