DETR~2

A comprehensive review of pre-training methods for DETR-based object detection, aimed at making DETR training smoother.

This paper studies how much the self-supervised pre-training methods represented by DETReg improve the continuously strengthened DETR architectures on the COCO object detection benchmark, and proposes to use a COCO-trained object detector to obtain more accurate pseudo-boxes together with informative pseudo-class labels.

Paper link: https://arxiv.org/abs/2308.01300

Motivated by the new records achieved by DETR-based methods on the COCO detection and segmentation benchmarks, many recent efforts have shown increased interest in further improving DETR methods by pre-training the Transformer in a self-supervised manner while keeping the Backbone network unchanged. Some studies have claimed significant improvements in accuracy.

In this paper, the authors study these experimental approaches in more detail and check whether the methods still work on state-of-the-art models such as the recent H-Deformable-DETR. The authors conduct comprehensive experiments on the COCO object detection task, studying the impact of pre-training dataset selection, localization, and classification target generation schemes. Unfortunately, they found that previously representative self-supervised methods, such as DETReg, fail to improve the performance of strong DETR-based methods across the full data regime. The authors further analyze the reasons and find that simply combining a more accurate bounding box predictor with the Objects365 benchmark significantly improves the results in subsequent experiments. The authors demonstrate the effectiveness of their method by achieving a strong object detection result of AP=59.3% on the COCO validation set, exceeding H-Deformable-DETR + Swin-L by 1.4%.

Finally, the authors generate a series of synthetic pre-training datasets by combining a recent image-to-text captioning model (LLaVA) and a text-to-image generative model (SDXL). Notably, pre-training on these synthetic datasets can significantly improve object detection performance. Going forward, the authors anticipate substantial advantages from expanding the synthetic pre-training datasets.

Recently, DETR-based methods have made significant progress on object detection and segmentation tasks and driven cutting-edge research. For example, DINO-DETR, H-Deformable-DETR, and Group-DETRv2 have set new state-of-the-art object detection results on the COCO benchmark. MaskDINO further extends DINO-DETR and achieves the best results on COCO instance segmentation and panoptic segmentation. To some extent, this is the first time that an end-to-end Transformer approach has outperformed conventional, highly tuned convolution-based strong detectors such as Cascade Mask-RCNN and HTC++.

These DETR-based methods achieve great success, but they still randomly initialize the Transformer and therefore fail to realize the full potential of a fully pre-trained detection architecture. Works such as "Aligning pretraining for detection via object-level contrastive learning" have demonstrated the benefits of aligning the pre-training architecture with the downstream architecture.

Figures 1a and 1b illustrate the distribution of parameters and GFLOPs in a standard Deformable-DETR network with a ResNet50 Backbone network. The Transformer encoder and decoder account for 65% of the GFLOPs and 34% of the parameters, which means there is considerable room for improvement by pre-training the Transformer part of DETR. Several recent studies have improved DETR-based object detection models by performing self-supervised pre-training on the Transformer encoder and decoder while freezing the Backbone network (see the process in Figure 2). For example, UP-DETR pre-trains the Transformer to detect random patches in images, DETReg pre-trains the Transformer to match object locations and features against priors generated by a selective search scheme, and more recently, Siamese DETR localizes target boxes using query features extracted from corresponding boxes across different views.
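For intuition about this parameter/GFLOPs split, here is a minimal sketch of how one might measure the parameter share of the backbone versus the Transformer in a PyTorch model; the `backbone` and `transformer` attribute names are assumptions about how a Deformable-DETR-style implementation is organized, not the paper's code.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())

def report_split(model: nn.Module) -> None:
    # Assumes the model exposes `backbone` and `transformer` submodules,
    # as Deformable-DETR-style implementations typically do.
    backbone = count_params(model.backbone)
    transformer = count_params(model.transformer)
    total = count_params(model)
    print(f"backbone:    {backbone / 1e6:.1f}M params ({backbone / total:.0%})")
    print(f"transformer: {transformer / 1e6:.1f}M params ({transformer / total:.0%})")
```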

However, these methods either use the basic DETR model (AP=42.1%) or the Deformable-DETR variant (AP=45.2%). When pre-training is applied to the latest, more powerful DETR models (such as H-Deformable-DETR, AP=49.6%), the results are significantly worse than expected and do not deliver good object detection performance on COCO (taking DETReg as an example in Figure 1c; all results use a ResNet50 Backbone network initialized with SwAV). In this work, the authors first carefully study how much the self-supervised pre-training methods represented by DETReg improve the ever-stronger DETR architectures on the COCO object detection benchmark. The investigation reveals significant limitations in the effectiveness of DETReg when applied to strong DETR networks that combine a Backbone network pre-trained with SwAV, the deformable attention of Deformable-DETR, and the hybrid matching scheme of H-Deformable-DETR (see Figure 1c).

The authors identify the crux of the problem as the unreliable proposal boxes generated by unsupervised methods (such as selective search), which produce noisy pre-training targets, combined with the weak semantic information introduced through feature reconstruction, which further exacerbates the problem. These shortcomings make unsupervised pre-training methods ineffective when applied to already powerful DETR models.

To solve this problem, the authors propose to use a COCO-trained object detector to obtain more accurate pseudo-boxes together with informative pseudo-class labels. Through extensive ablation experiments, the authors highlight the influence of 3 key factors:

  1. Selection of the pre-training dataset (ImageNet vs. Objects365)

  2. Selection of localization pre-training targets (selective search proposals vs. pseudo-box predictions)

  3. Selection of classification pre-training targets (object embedding loss vs. pseudo-class predictions)

These factors have an important impact on the effectiveness of the pre-training method. The authors' results show that a simple self-training scheme using pseudo-box and pseudo-class predictions as pre-training targets outperforms DETReg in a variety of settings. Remarkably, this simple design significantly improves the pre-training performance of state-of-the-art DETR networks even without access to the ground-truth labels of the pre-training benchmark.

For example, with the ResNet50 Backbone network and the Objects365 pre-training dataset, simple self-training improves the COCO object detection result of DETReg on H-Deformable-DETR by 3.6%. In addition, the authors also observe excellent performance with the Swin-L Backbone network, reaching 59.3% AP.

Method

DETR pre-training scheme

A conventional DETR consists of two network modules (Figure 2): a Backbone network that extracts general image features, and a Transformer that extracts detection-specific features and predicts target locations and categories. The Transformer is further composed of encoder and decoder modules, each built from stacked attention and feed-forward layers. The encoder applies a self-attention mechanism to refine the image features, while the decoder queries the encoder's features and predicts the information required by the task.
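To make this two-module layout concrete, here is a heavily simplified PyTorch sketch, assuming a torchvision ResNet-50 backbone and the standard `nn.Transformer`; positional encodings, deformable attention, and the matching loss are omitted, and all names are illustrative rather than taken from any DETR codebase.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class SimpleDETR(nn.Module):
    def __init__(self, num_classes: int, num_queries: int = 300, d_model: int = 256):
        super().__init__()
        # Backbone: generic image features (conv stages only, classifier removed).
        backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        # Transformer: encoder refines image tokens, decoder turns object
        # queries into per-object predictions.
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                   # (cx, cy, w, h)

    def forward(self, images: torch.Tensor):
        feats = self.input_proj(self.backbone(images))          # (B, C, H, W)
        tokens = feats.flatten(2).transpose(1, 2)                # (B, HW, C)
        queries = self.query_embed.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(tokens, queries)                   # decoder outputs (B, Q, C)
        return self.class_head(hs), self.box_head(hs).sigmoid()
```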

Existing self-supervised methods adopt a pre-training scheme similar to that shown in Figure 2 for the Transformer component. They choose ImageNet as the large-scale pre-training benchmark and only access the input images to build the self-supervised objectives. The carefully designed pre-training tasks usually include a localization task that predicts unsupervised pseudo-box proposals, and a feature reconstruction task that preserves the Transformer's feature discrimination ability.

To protect the Backbone network's general feature extraction ability from being compromised by the pre-training tasks, they freeze the Backbone network weights, which are initialized with either plain ImageNet pre-training or stronger self-supervised pre-training (e.g., SwAV). The Transformer's encoder and decoder are randomly initialized and updated during pre-training. During the fine-tuning phase, the Backbone network loads the unchanged ImageNet pre-trained weights, while the Transformer loads the updated weights. Then, all model weights are tuned together under the supervision of ground-truth labels from the object detection dataset. One representative self-supervised method is DETReg.
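A minimal sketch of this weight-handling recipe is shown below; it assumes the same illustrative `backbone`/`transformer` submodule names as above and plain state-dict checkpoints, which may not match any particular repository's checkpoint format.

```python
import torch

def prepare_for_pretraining(model, backbone_ckpt_path: str):
    # Pre-training phase: load frozen backbone weights (plain ImageNet or SwAV);
    # the Transformer encoder/decoder stay randomly initialized and are the only
    # parts updated by the pre-training losses.
    state = torch.load(backbone_ckpt_path, map_location="cpu")
    model.backbone.load_state_dict(state, strict=False)
    for p in model.backbone.parameters():
        p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]  # params to optimize

def prepare_for_finetuning(model, pretrained_transformer_ckpt: str):
    # Fine-tuning phase: the backbone keeps its original pre-trained weights,
    # the Transformer loads the pre-trained weights, then everything is trained
    # together with ground-truth detection labels.
    state = torch.load(pretrained_transformer_ckpt, map_location="cpu")
    model.transformer.load_state_dict(state, strict=False)
    for p in model.parameters():
        p.requires_grad = True
    return model.parameters()
```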

DETReg uses selective search as an unsupervised method to create pre-training box proposals for localization. Selective search generates boxes around possible objects without knowing their semantic categories. To compensate for the missing category information, DETReg also learns to reconstruct the features of each box (also called object features), which are extracted from the cropped input image regions by a fixed SwAV Backbone network. In this way, DETReg pre-trains both the localization and classification abilities of the detector simultaneously, using 3 prediction heads to predict the box position, a binary category indicating whether an object is present inside the box, and the associated object features.
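The following is a rough sketch of those three heads on top of the decoder query embeddings; it is not DETReg's actual code, and `d_model` and the object-embedding dimension are assumptions.

```python
import torch.nn as nn

class DETRegStyleHeads(nn.Module):
    """Three prediction heads applied to each decoder query embedding (sketch)."""
    def __init__(self, d_model: int = 256, obj_embed_dim: int = 512):
        super().__init__()
        self.box_head = nn.Linear(d_model, 4)            # proposal box (cx, cy, w, h)
        self.objectness_head = nn.Linear(d_model, 2)     # binary: object present or not
        self.embed_head = nn.Linear(d_model, obj_embed_dim)  # reconstruct the SwAV crop feature

    def forward(self, decoder_queries):
        return (self.box_head(decoder_queries).sigmoid(),
                self.objectness_head(decoder_queries),
                self.embed_head(decoder_queries))
```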

Simple self-training

In this work, the authors found that self-supervised pre-training methods bring only slight improvements to downstream tasks, especially when the original accuracy of the DETR architecture is already high.

For example, on the Deformable-DETR architecture in Figure 1c, DETReg pre-training improves performance by 0.3 AP, but reduces performance by 0.1 AP on the stronger H-Deformable-DETR architecture.

The authors propose a simple self-training scheme that not only alleviates this problem, but also improves pre-training on state-of-the-art DETR architectures.

The idea is to replace the low-quality unsupervised proposal boxes used for localization pre-training with more accurate proposal boxes predicted by a trained object detector. While feature reconstruction for classification pre-training helps prevent degradation of the network's discriminative ability, the authors further enhance it by replacing feature reconstruction with pseudo-class labels predicted by the trained detector.

This modification introduces semantic information into pre-training. Although simple self-training is not a self-supervised method, it only accesses the images of the pre-training dataset; the supervision introduced through the trained detector comes from the downstream task dataset, which the authors assume is already available.

This differs from traditional self-training schemes, which rely on complex data augmentation strategies to improve pseudo-label quality, require careful tuning of the non-maximum suppression (NMS) threshold, and iteratively regenerate more accurate pseudo-labels from fine-tuned models.

In contrast, the authors' method generates pseudo-labels directly in a single pass without these techniques, and the pseudo-labels keep only a fixed number of the most confident predictions, hence the name simple self-training.

To generate pseudo-labels, the authors first train an object detection model on the COCO dataset, and then use it to predict pseudo bounding boxes and pseudo class labels on a pre-training benchmark dataset such as ImageNet. The selected DETR-based network is then pre-trained on this pseudo-labeled benchmark dataset.
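A minimal sketch of this one-pass pseudo-labelling step is given below, assuming a COCO-trained detector callable that returns scored boxes and class labels; the detector output format and function names are illustrative assumptions, not the paper's interface.

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(detector, images, top_k: int = 30):
    """Run a COCO-trained detector over unlabeled pre-training images and
    keep only the top-k most confident predictions per image."""
    detector.eval()
    pseudo_labels = []
    for image in images:
        boxes, scores, labels = detector(image)        # assumed detector output format
        keep = scores.argsort(descending=True)[:top_k]
        pseudo_labels.append({
            "boxes": boxes[keep],     # pseudo bounding boxes
            "labels": labels[keep],   # pseudo class labels (COCO's 80 classes)
        })
    return pseudo_labels
```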

In this work, the authors aim to study the impact of two key components in the representative self-supervised method DETReg and in their simple self-training: the choice of localization pre-training targets and the choice of classification pre-training targets. Furthermore, in the ablation study, the authors emphasize how much the choice of pre-training benchmark matters for pre-training performance.

Localization pre-training targets

Several unsupervised box proposal algorithms are used in self-supervised pre-training methods, such as random patches in UP-DETR, selective search in DETReg, and EdgeBoxes in Siamese DETR. Regarding localization pre-training targets, the discussion revolves around the selective search boxes used by DETReg (Figure 4a) and the pseudo bounding box predictions generated by a trained object detector, as used in simple self-training (Figure 4c).

Selective search boxes

Selective search was one of the most influential region proposal generation methods before the deep learning era, and it performs well in terms of recall. Inspired by the hierarchical nature of images, it is built on a hierarchical grouping algorithm: the FH (Felzenszwalb-Huttenlocher) algorithm generates the initial regions, and a greedy algorithm iteratively merges regions into larger ones based on similarity of color, texture, size, and shape. The resulting regions form a set of candidate object proposals, each corresponding to an image region that may contain an object of interest.

Similar to the DETReg method, the authors retain about 30 of the highest-ranked selective search proposal boxes as the localization pre-training targets.
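For reference, selective search proposals can be generated with OpenCV's contrib module (the `opencv-contrib-python` package); a sketch that keeps roughly the first 30 boxes, analogous to the setup above, follows. OpenCV returns proposals approximately ranked rather than explicitly scored, so "top 30" here means the first 30 in that ordering.

```python
import cv2

def selective_search_proposals(image_path: str, top_k: int = 30):
    """Generate selective-search box proposals for one image and keep the
    first top_k; boxes are returned as (x, y, w, h)."""
    image = cv2.imread(image_path)
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image)
    ss.switchToSelectiveSearchFast()   # faster mode; switchToSelectiveSearchQuality() yields more boxes
    rects = ss.process()
    return rects[:top_k]
```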

Pseudo bounding box prediction

For the pseudo bounding box prediction scheme, the authors directly use several off-the-shelf, well-trained COCO object detectors to predict pseudo bounding boxes on the pre-training benchmarks.

Specifically, the authors chose a powerful DETR-based network, H-Deformable-DETR, as the object detector, with two different Backbone networks: ResNet50 and Swin-L. These two detectors show a significant gap in detection performance on the COCO dataset, due to their different Backbone capacities and training schedules:

  1. H-Deformable-DETR + ResNet50 trained for 12 epochs (AP=48.7%)

  2. H-Deformable-DETR + Swin-L, trained for 36 epochs (AP=57.8%)

Then, by running inference on the pre-training benchmark, the authors obtain pseudo bounding box predictions and retain approximately the 30 bounding boxes with the highest confidence per image.

Discussion

Table 1 compares the bounding box quality of the various proposal methods on the pre-training benchmark dataset Objects365. The authors report class-agnostic precision and recall. The pseudo bounding boxes predicted by H-Deformable-DETR are clearly more accurate than the unsupervised selective search boxes.
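A minimal sketch of class-agnostic precision and recall at a single IoU threshold is shown below; it uses greedy one-to-one matching in prediction order and boxes in (x1, y1, x2, y2) format, which is an assumption and not necessarily the exact evaluation protocol used for Table 1.

```python
import torch
from torchvision.ops import box_iou

def class_agnostic_pr(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor, iou_thresh: float = 0.5):
    """Greedily match predicted boxes to ground-truth boxes, ignoring class
    labels; returns (precision, recall) for one image."""
    if len(pred_boxes) == 0 or len(gt_boxes) == 0:
        return 0.0, 0.0
    ious = box_iou(pred_boxes, gt_boxes)          # shape (num_pred, num_gt)
    matched_gt = set()
    tp = 0
    for i in range(ious.size(0)):
        j = int(ious[i].argmax())
        if ious[i, j] >= iou_thresh and j not in matched_gt:
            matched_gt.add(j)
            tp += 1
    return tp / len(pred_boxes), tp / len(gt_boxes)
```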

To understand the quality difference, the authors visualize the ground-truth bounding boxes, the selective search boxes, and the pseudo bounding box predictions of the two detectors on Objects365 in Figure 3.

Classification pre-training targets

The authors discuss two methods for generating classification pre-training targets: feature reconstruction (represented by DETReg's object embedding loss, Figure 4a) and the pseudo-class predictions used in simple self-training (Figure 4c).

Object embedding loss

To associate each bounding box with explicit semantic meaning, DETReg applies an object embedding head to each query embedding in the decoder, regressing an object embedding that encodes the semantic information inside the associated bounding box. The target embeddings are obtained by feeding the image regions cropped by the proposal boxes into a Backbone network pre-trained with SwAV, as shown in Figure 4a.

Then, the L1 loss between the predicted object embedding and the corresponding target embedding is computed as the object embedding loss. In DETReg, both the Backbone network used to extract target embeddings and the Backbone network of the main DETReg network are frozen; only the Transformer encoder, decoder, and prediction heads are updated during pre-training.
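The following sketch illustrates this target-embedding supervision: crop each proposal, encode it with a frozen SwAV backbone, and compare against the predicted embeddings with an L1 loss. It assumes a single image tensor of shape (1, 3, H, W), non-degenerate boxes, a backbone returning a spatial feature map, and that query-to-box matching has already been done; all of these are simplifying assumptions, not DETReg's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def target_embeddings(frozen_swav_backbone, image, boxes, crop_size: int = 128):
    """Crop each proposal from the input image, resize, and encode it with the
    frozen SwAV backbone; `boxes` is (N, 4) in (x1, y1, x2, y2) pixel coords."""
    crops = []
    for x1, y1, x2, y2 in boxes.round().long().tolist():
        crop = image[:, :, y1:y2, x1:x2]                              # (1, 3, h, w), assumes h, w > 0
        crops.append(F.interpolate(crop, size=(crop_size, crop_size),
                                   mode="bilinear", align_corners=False))
    feats = frozen_swav_backbone(torch.cat(crops))                    # assumed output (N, C, h', w')
    return feats.mean(dim=(2, 3))                                     # pooled per-box target embeddings

def object_embedding_loss(pred_embeddings, tgt_embeddings):
    # L1 loss between the decoder's predicted object embeddings and the
    # SwAV target embeddings, computed after matching queries to boxes.
    return F.l1_loss(pred_embeddings, tgt_embeddings)
```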

Pseudo class prediction

The authors can also use the category predictions of the aforementioned COCO object detectors as the classification target for each bounding box, which carry finer and richer semantic information.

Since the detector is trained on COCO, the pseudo-class labels it predicts are COCO's 80 categories. Although these are only a subset of the pre-training benchmark's categories, they still enable effective pre-training. Because the COCO pseudo-categories also cover the categories of the downstream benchmarks (COCO and PASCAL VOC), they narrow the gap between pre-training and the downstream tasks.

Since each pseudo-class prediction is tied to a pseudo bounding box from the object detector, the authors cannot combine pseudo-class predictions with selective search boxes for localization. Figure 4 therefore shows the remaining three combinations of localization and classification pre-training targets: the original DETReg method, DETReg enhanced with the COCO detector's pseudo-boxes, and the simple self-training method.

Experiments

Comparison to the state-of-the-art

Results on different DETR architectures

Ablation experiments

  1. Pre-training dataset selection

  2. Pre-training method

  3. Number of pseudo-boxes

  4. Encoder and decoder pre-training

  5. Fine-tuning dataset size

  6. Qualitative analysis of synthetic data generated via T2I (text-to-image)


Origin blog.csdn.net/qq_29788741/article/details/132769063