Plain-DETR

Without multi-scale feature maps or hand-crafted locality constraints, DETR can be improved painlessly. Microsoft Research Asia proposes a strong set of improvements to DETR that upgrade the original detector while preserving its "plain" character: no multi-scale feature maps and no locality-based design in the cross-attention computation.

The paper presents an improved DETR detector that maintains a "plain" design: a single-scale feature map and global cross-attention without specific locality constraints. This contrasts with previous leading DETR-based detectors, which reintroduce multi-scale and locality architectural inductive biases into the decoder. The authors show that two simple techniques work surprisingly well within this plain design to compensate for the absence of multi-scale feature maps and locality constraints.

  • The first is a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation. It effectively guides each query to attend to its corresponding object region while retaining encoding flexibility.

  • The second is backbone pre-training based on masked image modeling (MIM), which helps learn representations with fine-grained localization ability and proves crucial for removing the dependence on multi-scale feature maps.

By integrating these techniques with recent advances in training and problem formulation, the improved "plain" DETR shows significant gains over the original DETR detector. Leveraging the Objects365 dataset for pre-training, it reaches 63.9 mAP with a Swin-L backbone, highly competitive with state-of-the-art detectors, all of which rely heavily on multi-scale feature maps and region-based feature extraction.

Code: https://github.com/impiga/Plain-DETR

Recent revolutionary advances in NLP have highlighted the importance of keeping task-specific heads or decoders as general, simple, and lightweight as possible, shifting the main effort toward building more powerful large-scale foundation models. The computer vision community, however, often keeps focusing on the tuning and complexity of task-specific heads, with designs growing increasingly heavy and complex.

The development of DETR-based object detection methods follows this trajectory. The original DETR method is impressive because it abandons complex, domain-specific designs such as multi-scale feature maps and region-based feature extraction, which require specialized understanding of the object detection problem. Subsequent work, however, reintroduced these designs; while this improved training speed and accuracy, it also violated the principle of "less inductive bias."

In this work, the authors aim to improve the original DETR detector while maintaining its "plain" properties: no multi-scale feature maps and no locality design in the cross-attention computation. This is challenging because an object detector must handle objects at diverse scales and locations. Despite recent progress in training and problem formulation, as shown in Table 1, plain DETR methods still lag far behind state-of-the-art detectors that use multi-scale feature maps and region-based feature extraction. So how can one compensate for these architectural "inductive biases" when addressing objects of varied scales and arbitrary locations? The authors' exploration revealed two techniques that, although not entirely new, work surprisingly well in this setting:

  1. Box-to-pixel relative position bias (BoxRPB)

  2. Masked image modeling (MIM) pre-training

BoxRPB is inspired by the relative position bias (RPB) term in vision Transformers, which encodes the geometric relationship between pixels and enhances translation invariance. BoxRPB extends RPB to encode the geometric relationship between 4D boxes and 2D pixels. The authors also propose an axial decomposition method that computes it efficiently, with no loss of accuracy compared with using the full term.

The authors' experiments show that the BoxRPB term effectively guides the cross-attention computation to focus on individual objects (see Figure 4) and significantly improves detection accuracy, by +8.9 mAP over the plain DETR baseline, reaching 37.2 mAP on the COCO benchmark (see Table 2).

MIM pre-training is another key technique for improving plain DETR. The results show that MIM pre-training brings a significant +7.4 mAP gain over the plain DETR baseline (see Table 2), which may be attributed to its fine-grained localization ability. While MIM pre-training has been shown to modestly improve other detectors, its impact in the plain setting is far more profound.

Furthermore, this technique proves to be a key factor in eliminating the need for multi-scale feature maps from the backbone, thereby enabling detectors built on hierarchical backbones to use single-scale heads.

By integrating these techniques with recent advances in training and problem formulation, the authors' improved "plain" DETR achieves significant gains over the original DETR detector, as shown in Figure 1. Moreover, the method achieves 63.9 mAP when pre-trained on the Objects365 dataset, making it highly competitive with state-of-the-art object detectors that rely on multi-scale feature maps and region-based feature extraction (such as Cascade R-CNN and DINO).

Beyond these results, the approach demonstrates how to minimize architectural "inductive bias" when designing task-specific heads or decoders, rather than relying on detection-specific multi-scale and locality designs. The authors hope this research inspires future work on using generic plain decoders, such as DETR's, to tackle a wider range of vision problems with minimal effort, letting the field shift more of its energy toward developing large-scale foundation vision models, as has happened in NLP.

A modern plain DETR baseline

Review of the original DETR

The original DETR detector consists of three sub-networks: a backbone that extracts image features, a Transformer encoder that enhances them, and a Transformer decoder that decodes a set of object queries into detections. The DETR framework has several advantages, including:

  1. Conceptually intuitive and generally applicable. It treats object detection as a pixel-to-object "translation" task, decoding image pixels into the objects of interest under a generic decoding scheme.

  2. It uses an end-to-end set matching loss (see the sketch after this list), so it requires minimal domain knowledge and avoids components such as custom label assignment and hand-designed non-maximum suppression.

  3. Domain-specific multi-scale feature maps and region-based feature extraction are avoided.
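For concreteness, here is a minimal sketch of such a set matching assignment using SciPy's Hungarian solver. The cost terms and the weights 2.0/5.0 are illustrative, not the paper's exact configuration; DETR-style criteria also include a GIoU cost, omitted here for brevity.

```python
# Minimal sketch of DETR-style one-to-one set matching (Hungarian algorithm).
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """pred_logits: (num_queries, num_classes); pred_boxes: (num_queries, 4);
    gt_labels: (num_gt,); gt_boxes: (num_gt, 4); boxes in normalized cxcywh."""
    cls_cost = -pred_logits.sigmoid()[:, gt_labels]    # (num_queries, num_gt)
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)  # pairwise L1 distance
    cost = 2.0 * cls_cost + 5.0 * box_cost             # illustrative weights
    query_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return query_idx, gt_idx  # each ground truth gets exactly one query: no NMS
```

Because each object is matched to exactly one query, duplicate predictions are penalized during training and no post-hoc suppression is needed.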

An enhanced plain DETR baseline

Basic setup

The authors' basic setup mostly follows the original DETR framework, apart from a few adjustments. On top of it, the authors incorporate recent advances in training and problem formulation, gradually improving the detection accuracy as shown in Table 1.

Merging the Transformer encoder into the backbone

The role of the backbone network and the Transformer encoder is to encode image features. The authors found that, with a Vision Transformer backbone, the Transformer encoder's computational budget can be folded into the backbone itself, slightly improving performance, likely because more of the parameters benefit from pre-training.

Specifically, the authors use a Swin-S backbone and remove the Transformer encoder; its FLOPs are roughly equal to those of the original Swin-T backbone plus a 6-layer Transformer encoder. This simplifies the whole DETR framework to just a backbone (acting as the encoder) and a decoder network.

Better classification using Focal Loss

Replacing the default cross-entropy loss with Focal Loss improves detection accuracy from 23.1 mAP to 31.6 mAP.
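As a reference, here is a minimal sketch of the sigmoid focal loss, with the common defaults alpha = 0.25 and gamma = 2.0; whether the paper deviates from these defaults is not stated here.

```python
# Minimal sketch of the focal loss substitution: logits are supervised with a
# sigmoid focal loss instead of softmax cross-entropy.
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: (N, num_classes); targets are one-hot in {0, 1}."""
    p = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)        # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()  # down-weights easy cases
```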

Iterative refinement

The authors adopt an iterative refinement scheme in which each decoder layer predicts an incremental update on top of the latest bounding box from the previous decoder layer, unlike the original DETR, where each Transformer decoder layer makes an independent prediction. This strategy improves detection accuracy by +1.5 mAP, to 33.1 mAP.
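A minimal sketch of one refinement step, assuming the common practice (as in Deformable DETR) of applying the correction in inverse-sigmoid space; names are illustrative.

```python
# Each decoder layer predicts a small correction applied to the previous
# layer's box in inverse-sigmoid space.
import torch

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

def refine_boxes(prev_boxes, delta):
    """prev_boxes: (num_queries, 4) normalized cxcywh from layer l-1;
    delta: (num_queries, 4) raw regression output of layer l."""
    return (inverse_sigmoid(prev_boxes) + delta).sigmoid()
```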

Content-based queries

Object queries are generated from the image content: the 300 predictions with the highest confidence are selected as queries for the subsequent decoding process. A set matching loss is still used to supervise them, retaining the advantage of needing no domain-specific label assignment strategy. This modification improves detection accuracy by +0.9 mAP, to 34.0 mAP.
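A minimal sketch of such content-based query selection, with illustrative head names: an auxiliary classification head scores every encoder token, and the top 300 tokens seed the decoder queries.

```python
# class_head / box_head are assumed nn.Linear-style auxiliary heads.
import torch

def select_queries(memory, class_head, box_head, num_queries=300):
    """memory: (num_tokens, dim) encoded image features."""
    scores = class_head(memory).sigmoid().max(dim=-1).values  # (num_tokens,)
    topk = scores.topk(num_queries).indices
    query_feats = memory[topk]                     # content queries
    query_boxes = box_head(query_feats).sigmoid()  # coarse initial boxes
    return query_feats, query_boxes
```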

Look forward twice

The authors adopt the look-forward-twice strategy to exploit the refined bounding box information from previous Transformer decoder layers, optimizing parameters across adjacent decoder layers more effectively. This modification brings a +0.8 mAP improvement.
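A minimal, hedged sketch of the look-forward-twice idea as described in DINO: the box supervised at layer l is built from the undetached box of layer l-1, so layer l's loss also updates the previous layer's regression head, while the next layer is still conditioned on a detached box. `refine_boxes` follows the iterative-refinement sketch above; the (decoder_layer, box_head) API is schematic.

```python
def decode_look_forward_twice(layers, queries, boxes):
    """layers: iterable of (decoder_layer, box_head) pairs (schematic API);
    boxes: (num_queries, 4) initial normalized boxes."""
    predictions = []
    undetached = boxes
    for decoder_layer, box_head in layers:
        queries = decoder_layer(queries, boxes)  # conditioned on detached box
        delta = box_head(queries)
        predictions.append(refine_boxes(undetached, delta))  # loss path:
                                                # gradients reach layer l-1
        undetached = refine_boxes(boxes, delta)
        boxes = undetached.detach()             # reference for the next layer
    return predictions                          # one prediction per layer
```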

Hybrid matching

The original one-to-one set matching provides too few positive samples for efficient training. Several methods improve performance with an auxiliary one-to-many set matching loss. The authors choose the hybrid matching approach because it retains the advantage of requiring no extra hand-crafted noising or assignment designs. This modification improves detection accuracy by +2.0 mAP, reaching a final 37.2 mAP.
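A minimal, hedged sketch of a hybrid matching loss, assuming a DETR-style criterion `set_criterion` (Hungarian matching plus classification and box losses, e.g., built on the matching sketch earlier); the repetition factor k is illustrative.

```python
def repeat_targets(targets, k):
    """Duplicate each ground truth k times so that one-to-many matching can
    assign up to k queries per object."""
    return [{"labels": t["labels"].repeat(k),
             "boxes": t["boxes"].repeat(k, 1)} for t in targets]

def hybrid_matching_loss(out_one2one, out_one2many, targets, set_criterion, k=6):
    loss = set_criterion(out_one2one, targets)  # 1-to-1 branch (inference)
    loss = loss + set_criterion(out_one2many, repeat_targets(targets, k))
    return loss                                 # auxiliary 1-to-many branch
```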

Box-to-pixel relative position bias

In this section, the authors introduce a simple technique, the box-to-pixel relative position bias (BoxRPB), which proves critical for compensating for the lack of multi-scale features and of explicit locality constraints in cross-attention.

The original DETR decoder adopts the standard cross-attention computation:

O = Softmax(QK^T)V + X,

where X and O are the input and output features of each object query, respectively, and Q, K, and V are the query, key, and value features, respectively.

As shown in Figure 4, this original cross-attention computation usually attends to irrelevant image regions within the plain DETR framework. The authors speculate that this may be one reason why its accuracy is much lower than that of multi-scale, explicitly local designs. Inspired by the success of pixel-to-pixel relative position bias in vision Transformer architectures, the authors explore a box-to-pixel relative position bias (BoxRPB) for the cross-attention computation:

O = Softmax(QK^T + B)V + X,

where B is the relative position bias determined by the geometric relationship between the boxes and the pixels.

Unlike the original relative position bias (RPB), which is defined on 2D relative positions, BoxRPB has to handle a larger 4D geometric space. The authors introduce two implementation variants below.

A simple BoxRPB implementation

The authors adapt the continuous RPB method to compute the 4D box-to-pixel relative position bias. The original continuous RPB method generates a bias term for each relative position configuration by applying a meta-network to the corresponding 2D relative coordinates. When extending it to BoxRPB, the authors represent a box by its top-left and bottom-right corners and feed the relative positions between these corner points and the image pixels to the meta-network. Their experiments show that this simple implementation is already very effective, as shown in Table 3a; however, it consumes a large amount of GPU computation and memory, and is thus impractical.
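A minimal sketch of this naive variant, assuming an MLP "meta-network" that maps the 4D corner-to-pixel offsets to per-head biases; names and shapes are illustrative. Note that the full (K, H*W, 4) offset tensor is materialized, which is what makes this variant memory-hungry.

```python
import torch
import torch.nn as nn

def naive_box_rpb(mlp, boxes, grid):
    """mlp: maps 4 -> num_heads, e.g.
           nn.Sequential(nn.Linear(4, 256), nn.ReLU(), nn.Linear(256, 8));
    boxes: (K, 4) query boxes as (x1, y1, x2, y2);
    grid:  (HW, 2) pixel center coordinates."""
    corners = boxes.view(-1, 2, 2)                    # (K, 2 corners, xy)
    offsets = grid[None, None] - corners[:, :, None]  # (K, 2, HW, 2)
    offsets = offsets.permute(0, 2, 1, 3).reshape(boxes.size(0), -1, 4)
    return mlp(offsets)   # (K, HW, num_heads), added to the attention logits
```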

Decomposed BoxRPB implementation

Now the authors propose a more efficient implementation of BoxRPB. Instead of computing the bias term directly from the 4D input, the bias computation is decomposed into two axial terms:

B = unsqueeze(B_x, 1) + unsqueeze(B_y, 2),

where B_x is computed from the x-axis relative positions between pixels and the box's left and right edges, and B_y from the y-axis relative positions to its top and bottom edges. Through this decomposition, both the computational FLOPs and the memory consumption are greatly reduced while accuracy remains almost unchanged, as shown in Table 3a. This decomposition-based implementation is the default in the authors' experiments.

Figure 4 shows the effect of the box-to-pixel relative position bias on the cross-attention computation. Overall, the BoxRPB term makes attention focus on individual objects and their boundaries, whereas cross-attention without BoxRPB may attend to many irrelevant regions. This may explain why the BoxRPB term improves accuracy by a substantial +8.9 mAP, as shown in Table 2.
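The axial decomposition can be sketched as follows, again a minimal, hedged implementation with illustrative names. Each MLP sees only 2D inputs (offsets of one pixel coordinate to the two box edges on that axis), and the axial biases are broadcast-added, so the 4D tensor of the naive variant is never materialized.

```python
import torch

def decomposed_box_rpb(mlp_x, mlp_y, boxes, xs, ys):
    """mlp_x, mlp_y: map 2 -> num_heads; boxes: (K, 4) as (x1, y1, x2, y2);
    xs: (W,) and ys: (H,) pixel coordinates."""
    dx = xs[None, :, None] - boxes[:, None, [0, 2]]  # (K, W, 2) left/right
    dy = ys[None, :, None] - boxes[:, None, [1, 3]]  # (K, H, 2) top/bottom
    bias_x = mlp_x(dx)                               # (K, W, num_heads)
    bias_y = mlp_y(dy)                               # (K, H, num_heads)
    bias = bias_y[:, :, None] + bias_x[:, None, :]   # (K, H, W, num_heads)
    return bias.flatten(1, 2)                        # (K, HW, num_heads)
```

Per head, this bias is added to the QK^T logits before the softmax, which is what steers each query's attention toward its own box.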

More improvements

In this section, the authors introduce two additional techniques that further improve the plain DETR framework.

MIM pre-training

The authors exploit recent state-of-the-art masked image modeling (MIM) pre-training, which has been shown to produce representations with better locality. Specifically, they initialize the Swin Transformer backbone with SimMIM pre-trained weights learned on ImageNet without labels. As shown in Table 2, MIM pre-training brings a +7.4 mAP improvement over the plain DETR baseline. The fact that the gain is much larger for plain DETR than for other detectors may highlight how important learned localization ability is to this framework.

On the stronger baseline that already includes BoxRPB, MIM pre-training still brings a +2.6 mAP gain, reaching 48.7 mAP.

Furthermore, the authors note that MIM pre-training is also crucial for discarding multi-scale backbone features with almost no loss of accuracy, as shown in Tables 5b and 5c.

Reparameterized bounding box regression

Another improvement the authors highlight is the reparameterization of bounding boxes when performing box regression.

The original DETR framework and most of its variants directly regress the bounding box center and size normalized to [0, 1]. Because large objects dominate the loss computed on these normalized values, the detector struggles with small objects. Instead, the authors reparameterize the box center and size of the l-th decoder layer relative to the box predicted by the previous layer, in the style of R-CNN box deltas:

pos_x^l = pos_x^{l-1} + Δx^l · w^{l-1},  pos_y^l = pos_y^{l-1} + Δy^l · h^{l-1},  w^l = w^{l-1} · exp(Δw^l),  h^l = h^{l-1} · exp(Δh^l),

where (Δx^l, Δy^l, Δw^l, Δh^l) are the raw regression outputs of the l-th decoder layer.
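A minimal sketch of decoding these reparameterized regression outputs, assuming the delta-style parameterization above; names are illustrative. Because center offsets are scaled by the previous box size and sizes are predicted as log ratios, regression targets have comparable magnitude for small and large objects.

```python
import torch

def decode_reparameterized(prev_boxes, deltas):
    """prev_boxes: (K, 4) as (cx, cy, w, h) predicted by layer l-1;
    deltas: (K, 4) raw regression outputs of layer l."""
    cx = prev_boxes[:, 0] + deltas[:, 0] * prev_boxes[:, 2]  # shift by w * dx
    cy = prev_boxes[:, 1] + deltas[:, 1] * prev_boxes[:, 3]  # shift by h * dy
    w = prev_boxes[:, 2] * deltas[:, 2].exp()   # scale-invariant size update
    h = prev_boxes[:, 3] * deltas[:, 3].exp()
    return torch.stack([cx, cy, w, h], dim=-1)
```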

Ablation Study and Analysis

The importance of box-to-pixel relative position bias

In Table 3, the authors study the impact of each factor in the BoxRPB scheme and report detailed comparisons in the following discussion.

Impact of axial decomposition. In Table 3a, the authors compare the two implementations and find that the axial decomposition scheme achieves comparable performance (50.9 vs. 50.8) while requiring a much lower memory footprint (9.5 GB vs. 26.8 GB) and smaller computational overhead (5.8 GFLOPs vs. 265.4 GFLOPs).

The influence of box points

Table 3b compares using only the center point versus the two corner points. The authors find that using the center point alone improves the baseline (fourth row of Table 2) by +1.7 AP, but still falls short of using the two corner points.

In particular, while the two choices achieve comparable AP50, using corner points improves AP75 by +2.2. This shows that not only the position (center) but also the scale (height and width) of the query box matters for accurately modeling the relative position bias.

The impact of hidden dimensions

The authors study the impact of the hidden dimension in Equation 5. As shown in Table 3c, a smaller hidden dimension of 128 causes a 0.5 AP drop, indicating that the positional relationship is nontrivial and needs a higher-dimensional space to model well.

Comparison with other methods

The authors study alternative choices for computing the modulation term B in Equation 2, comparing with the following representative methods:

  1. The conditional cross-attention scheme, which computes the modulation term as the inner product between a conditional spatial (positional) query embedding and the spatial key embedding.

  2. The DAB cross-attention scheme, which builds on conditional cross-attention and further uses box width and height information to modulate the positional attention map.

  3. The spatially modulated cross-attention (SMCA) scheme, which designs a hand-crafted, query-specific spatial prior, implemented as a 2D Gaussian-like weight map, to constrain the attended features around the initial estimate of the target query (see the sketch after this list).
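For intuition, here is a minimal, hedged sketch of an SMCA-style Gaussian prior, simplified to a single isotropic bandwidth `beta` (SMCA's actual design modulates per head and per axis). This illustrates the hand-crafted alternative that BoxRPB replaces with a learned bias.

```python
import torch

def gaussian_prior(boxes, xs, ys, beta=1.0):
    """boxes: (K, 4) as (cx, cy, w, h); xs: (W,), ys: (H,) pixel coordinates.
    Returns a (K, H, W) log-space modulation term."""
    cx, cy, w, h = boxes.unbind(-1)
    dx = (xs[None, None, :] - cx[:, None, None]) / (w[:, None, None] * beta)
    dy = (ys[None, :, None] - cy[:, None, None]) / (h[:, None, None] * beta)
    return -(dx ** 2 + dy ** 2)   # added to QK^T logits before the softmax
```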

Table 3d reports the detailed comparison. The authors' method performs best among all methods. Specifically, the conditional cross-attention module achieves performance similar to the authors' center-point-only setting (first row of Table 3b). DAB cross-attention and SMCA slightly outperform the conditional cross-attention module, but they still lag behind BoxRPB by 2.5 AP and 2.2 AP, respectively.

The authors also compare BoxRPB with DAB cross-attention using its official open-source code: replacing the DAB positional module with BoxRPB yields a +1.8 mAP improvement.

Comparison with local attention schemes

In this section, the authors compare their global cross-attention model with representative local cross-attention mechanisms, including deformable cross-attention, RoIAlign, RoI sampling (sampling fixed points within a region of interest), and box masking, and detail the key differences between these approaches. As shown in Table 4, their method outperforms all local cross-attention variants. Moreover, larger objects benefit more from their approach; similar observations were reported for DETR, possibly owing to the more effective long-range context modeling of global attention.

On MIM pre-training

The authors explore different ways of using the backbone and decoder feature maps, with and without MIM pre-training, evaluating three architectural configurations as shown in Figure 3. The results are discussed and analyzed below.

The observation that MIM pre-training yields consistent gains for decoders without multi-scale feature maps is not trivial, since most existing detection heads still require multi-scale features as input; it is what makes a competitive single-scale plain DETR possible. The authors hope this finding will simplify the design of future detection frameworks.

No need for multi-scale feature maps from Backbone

By comparing Table 5b with Table 5c, the authors analyze the effect of removing multi-scale feature maps from the backbone. With a supervised pre-trained backbone, taking only the last backbone feature map hurts performance; with MIM pre-training, the drop largely disappears. These results show that MIM pre-training reduces the dependence on multi-scale feature maps.

A single-scale feature map from the backbone and a single-scale decoder feature map are enough

Based on the above observations, the authors draw a surprising but important conclusion: with the proposed BoxRPB scheme and MIM pre-training, the need for multi-scale feature maps in both the backbone and the Transformer decoder can be completely eliminated.

Application to plain ViT

In this section, the authors build a simple and effective fully plain object detection system by applying their method to a plain ViT. The system uses only single-resolution feature maps in a plain Transformer encoder-decoder architecture, without any multi-scale design or processing. It is compared with the state-of-the-art Cascade Mask R-CNN on the COCO dataset. For a fair comparison, a MAE-pre-trained ViT-Base backbone is used and the object detectors are trained for ~50 epochs.

As shown in Table 8, the authors' method achieves results comparable to Cascade Mask R-CNN without relying on multi-scale feature maps to localize objects at different scales.

It is worth noting that the authors’ method is not trained using instance mask annotations, which are generally considered beneficial for object detection.

Visualization of cross attention maps

Figure 4 shows the cross-attention maps of models with and without BoxRPB. With BoxRPB, cross-attention concentrates on the individual target object; without it, attention spreads over multiple objects with similar appearance.

System-level results

SOTA comparison

In this section, the authors compare their method with other state-of-the-art methods. Table 7 shows the results; all experiments in this table use Swin-Large as the backbone. Since other works usually apply an encoder to enhance the backbone features, for a fair comparison the authors also stack 12 window-based single-scale Transformer layers (feature dimension 256) on top of the backbone. With 36 training epochs, the model achieves 60.0 AP on the COCO test-dev set, exceeding DINO-DETR by 1.4 AP. Further introducing Objects365 as a pre-training dataset, the method reaches 63.9 AP on test-dev, a significant improvement over DINO-DETR and DETA. These strong results verify that the fully plain DETR architecture has no inherent shortcomings and can achieve high performance.

More plain-ViT results

Table 8 reports further comparisons based on plain ViT. The authors use the default settings described in Section 5.4 of the paper, a MAE-pre-trained ViT-Base backbone, and ~50 training epochs. From the results, the authors observe:

  1. Their method improves the plain DETR baseline from 46.5 AP to 53.8 AP while using only a global cross-attention scheme on single-scale feature maps.

  2. Their approach outperforms strong DETR-based object detectors such as Deformable DETR, which uses a local cross-attention scheme to exploit multi-scale feature maps.

Runtime Comparison with Other Methods

The authors further analyze the runtime costs of different cross-attention modulations in Table 9. BoxRPB slightly increases runtime compared with standard cross-attention, but is comparable in speed to other positional bias methods.

More details on local attention schemes

Figure 5 illustrates the differences between the authors' method and local cross-attention methods such as deformable cross-attention, RoIAlign, RoI sampling, and box masking. Most local cross-attention methods require special sampling and interpolation mechanisms to build a sparse key-value space. The authors' method instead uses all image positions as the key-value space and learns a box-to-pixel relative position bias term (the gradient pink circular area in (e)) to adjust the attention weights, making it more flexible and general than previous approaches.

System-level comparison on COCO val

Table 10 compares the authors' method with previous state-of-the-art methods using Swin-Large as the backbone. With 36 training epochs, the model achieves 59.8 AP on the COCO validation set, exceeding DINO-DETR by 1.3 AP. With Objects365 pre-training, the method obtains 63.8 AP, much higher than DINO-DETR. These results show that, with the authors' approach, the improved plain DETR can achieve competitive performance without inherent limitations.


Origin: https://blog.csdn.net/qq_29788741/article/details/132927783