Focus-DETR

At present, DETR-style models have become a mainstream paradigm for object detection. However, DETR models are computationally heavy and slow at inference, which seriously hinders the deployment of high-accuracy object detection models on edge devices and widens the gap between academic research and industrial applications.

Researchers from Huawei Noah's Ark Lab and Huazhong University of Science and Technology designed a new lightweight DETR model, Focus-DETR, to address this problem.

  • Paper address: https://arxiv.org/abs/2307.12612

  • Code address - mindspore: https://github.com/linxid/Focus-DETR

  • Code address - torch: https://github.com/huawei-noah/noah-research/tree/master/Focus-DETR

This work builds on transformer-based object detectors, especially the DETR family. DETR-like models have made great progress and have gradually closed the gap with convolution-based detectors. However, the traditional encoder performs the same amount of computation on every token, which introduces computational redundancy. Recent work has explored compressing tokens in the transformer encoder to reduce computational complexity, but these methods tend to rely on unreliable model statistics, and simply reducing the number of tokens can seriously hurt detection performance. The research question is therefore how to reduce computation while maintaining detection accuracy, that is, how to strike a better balance between computational efficiency and model accuracy. Specifically, existing sparsification methods suffer from the following problems:

  1. Existing methods tend to rely heavily on the decoder's cross-attention map (DAM) to supervise foreground feature selection, which leads to suboptimal foreground selection. In particular, for models that use learnable queries, the correlation between the DAM and the retained foreground tokens decreases.

  2. Existing methods cannot effectively exploit the semantic correlation between multi-scale feature maps, and the discrepancy in token selection decisions across scales is ignored.

  3. Existing methods simply discard background tokens. The number of remaining foreground tokens is still large, and because the computational budget is limited, finer-grained query interaction cannot be performed to enrich semantic information.

  4. Some models only reduce the number of tokens in the encoder without any mechanism to enhance the semantic representation of the foreground, which makes it difficult to improve performance.

In short, existing models use unoptimized foreground selection strategies, perform insufficiently fine semantic discrimination, and ignore the correlation across multi-scale features, all of which cap the model's potential. How does the method proposed in the paper solve these problems?

This paper proposes Focus-DETR to focus attention on more informative tokens; its core is a hierarchical semantic discrimination mechanism. The authors design a token scoring scheme that considers both localization and semantic (category) information. First, a foreground token selector performs foreground selection based on multi-scale features, and then a multi-category score predictor selects the more fine-grained object tokens, yielding two levels of semantic discrimination. Guided by this reliable scoring mechanism, tokens at different semantic levels are fed into a dual-attention encoder: the fine-grained object tokens are first enhanced with self-attention and then scattered back into the foreground tokens, compensating for the limitations of deformable attention and improving the semantics of the foreground queries. By progressively introducing semantic information for fine-grained discrimination, Focus-DETR reconstructs the computation flow of the encoder and enhances the semantics of fine-grained tokens at minimal cost. In other words, through semantic-level progressive discrimination and fine-grained representation-enhancing attention, Focus-DETR lets the transformer encoder focus efficiently on the most informative features, improving detection performance while reducing computation. Compared with previous methods that rely too heavily on the DAM, this paper combines localization and semantic information for multi-level semantic discrimination and enhances the representation on that basis, which is more reliable and thus achieves better performance.
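To make the two-level selection concrete, the sketch below shows how a cascade of a foreground score and a multi-category score could pick out foreground tokens and an even smaller set of fine-grained object tokens. This is a minimal PyTorch illustration, not the authors' implementation; the head designs, keep ratios, and the top-k selection rule are assumptions.

```python
import torch
import torch.nn as nn

class TwoLevelTokenScorer(nn.Module):
    """Illustrative sketch of two-level token scoring (not the released code)."""

    def __init__(self, d_model=256, num_classes=80,
                 keep_fg_ratio=0.3, keep_obj_ratio=0.1):
        super().__init__()
        # Level 1: binary foreground score per token (localization cue).
        self.fg_head = nn.Linear(d_model, 1)
        # Level 2: multi-category score per token (semantic cue, no background class).
        self.cls_head = nn.Linear(d_model, num_classes)
        self.keep_fg_ratio = keep_fg_ratio
        self.keep_obj_ratio = keep_obj_ratio

    def forward(self, tokens):                                  # tokens: (B, N, d_model)
        B, N, C = tokens.shape
        fg_score = self.fg_head(tokens).squeeze(-1).sigmoid()   # (B, N)
        # Keep the highest-scoring tokens as foreground.
        k_fg = max(1, int(N * self.keep_fg_ratio))
        fg_idx = fg_score.topk(k_fg, dim=1).indices             # (B, k_fg)
        fg_tokens = torch.gather(
            tokens, 1, fg_idx.unsqueeze(-1).expand(-1, -1, C))  # (B, k_fg, C)
        # Score the foreground tokens again with class semantics and keep an
        # even smaller, fine-grained "object token" subset.
        cls_score = self.cls_head(fg_tokens).sigmoid().max(-1).values  # (B, k_fg)
        k_obj = max(1, int(N * self.keep_obj_ratio))
        obj_idx = cls_score.topk(k_obj, dim=1).indices           # indices into the fg subset
        return fg_idx, obj_idx
```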

To balance model performance against compute, memory consumption, and inference latency, Focus-DETR uses a carefully designed foreground feature selection strategy to accurately filter the features most relevant to object detection; it then proposes an attention enhancement mechanism for the selected features to make up for the lack of long-range information interaction in deformable attention. Compared with the full-input SOTA model, AP drops by less than 0.5 while computation is reduced by 45% and FPS increases by 41%, and the approach has been adapted to multiple DETR-like models.

The authors compare the GFLOPs and latency of several DETR-family detectors, as shown in Figure 1. In Deformable DETR and DINO, the computation of the encoder is 8.8 times and 7 times that of the decoder, respectively, while the encoder's latency is roughly 4 to 8 times that of the decoder. This shows that improving the efficiency of the encoder is crucial.

Figure 1: Comparison of computation and latency across several DETR detectors

Network structure

Focus-DETR consists of a backbone, an encoder built with dual attention, and a decoder. The Foreground Token Selector, placed between the backbone and the encoder, performs top-down score modulation across multi-scale features to decide whether a token belongs to the foreground. The dual-attention module then selects the more fine-grained object tokens through a multi-category scoring mechanism and feeds them into a self-attention module to compensate for the missing token-interaction information.

Figure 2: Overall network structure of Focus-DETR
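The data flow between these components can be summarized as in the sketch below. The function and argument names are illustrative placeholders, not the interfaces of the released mindspore or torch code.

```python
# Minimal structural sketch of the data flow described above (not the released
# code): component names and call signatures are assumptions for illustration.
def focus_detr_forward(backbone, foreground_token_selector,
                       dual_attention_encoder, decoder, images):
    # 1. Multi-scale feature maps from the backbone, flattened into tokens.
    multi_scale_tokens = backbone(images)

    # 2. Foreground Token Selector: top-down score modulation across scales
    #    decides which tokens are treated as foreground.
    fg_scores, fg_index = foreground_token_selector(multi_scale_tokens)

    # 3. Dual-attention encoder: deformable attention over foreground tokens,
    #    plus self-attention over the fine-grained object tokens chosen by the
    #    multi-category scoring mechanism.
    memory = dual_attention_encoder(multi_scale_tokens, fg_scores, fg_index)

    # 4. A standard DETR-style decoder produces the final detection queries/boxes.
    return decoder(memory)
```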

Computational Reduction: Foreground Screening Strategies

There are already methods that prune tokens to improve efficiency. For example, Sparse DETR (ICLR 2022) proposes using the decoder's attention map (DAM) as supervision. However, as shown in Figure 3, the authors found that the tokens retained by Sparse DETR are not all in foreground regions. They attribute this to Sparse DETR supervising the foreground tokens with the DAM, which introduces errors during training. Focus-DETR instead uses the ground truth (boxes and labels) to supervise foreground token selection.

Figure 3: Tokens retained by Focus-DETR and Sparse DETR on different feature maps

Figure 4. Visualization of foreground and background label assignments
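As a rough illustration of what ground-truth supervision of the selector can look like, the sketch below labels a token as foreground when its grid-cell centre falls inside a ground-truth box. This is a simplified assumption for illustration; the paper's actual label assignment also distributes objects across feature levels with overlapping scale ranges, which is omitted here.

```python
import torch

def assign_foreground_labels(feat_h, feat_w, stride, gt_boxes):
    """Sketch of ground-truth-based foreground labels for one feature level.

    gt_boxes: (M, 4) tensor of (x1, y1, x2, y2) boxes in image coordinates.
    Returns a (feat_h * feat_w,) binary label tensor for the flattened tokens.
    """
    # Image-space centres of the token grid cells at this stride.
    ys = (torch.arange(feat_h) + 0.5) * stride
    xs = (torch.arange(feat_w) + 0.5) * stride
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")            # (H, W)
    cx, cy = cx.reshape(-1), cy.reshape(-1)                    # flattened token centres

    labels = torch.zeros(feat_h * feat_w)
    for x1, y1, x2, y2 in gt_boxes.tolist():
        inside = (cx >= x1) & (cx <= x2) & (cy >= y1) & (cy <= y2)
        labels[inside] = 1.0                                   # token centre lies in a GT box
    return labels
```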

Furthermore, differences in token selection across feature maps of different scales are also ignored, which limits the potential to select features at the most appropriate resolution. To bridge this gap, Focus-DETR constructs a top-down score modulation module based on multi-scale feature maps, as shown in Figure 5. To fully exploit the semantic associations between multi-scale feature maps, the authors first use a multi-layer perceptron (MLP) to predict multi-category semantic scores on each feature map. Considering that high-level feature maps contain richer semantic information than low-level ones, the authors use the token importance scores of the high-level feature map as complementary information to modulate the predictions of the low-level feature map.

Figure 5: Top-down foreground screening score modulation strategy
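A minimal sketch of the top-down modulation idea is given below: the score map of the higher (coarser) level is upsampled and used to re-weight the lower level's prediction. The fusion rule and the `alpha` weight are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def top_down_score_modulation(score_maps, alpha=0.5):
    """Illustrative top-down modulation of per-level foreground scores.

    score_maps: list of score maps ordered from the highest level (lowest
    resolution) to the lowest level (highest resolution), each (B, 1, H_l, W_l).
    """
    modulated = [score_maps[0]]
    for low in score_maps[1:]:
        # Upsample the previously modulated (higher-level) scores to the
        # current resolution and use them to re-weight this level's prediction.
        prev_up = F.interpolate(modulated[-1], size=low.shape[-2:],
                                mode="bilinear", align_corners=False)
        modulated.append(low * (1.0 + alpha * prev_up.sigmoid()))
    return modulated
```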

Fine-grained feature enhancement strategy

After obtaining more accurate foreground features from the foreground selector described above, Focus-DETR uses an efficient operation to extract even more fine-grained features and exploits them for better detection performance. Intuitively, the authors hypothesize that introducing more fine-grained category information is beneficial in this scenario. Based on this motivation, they propose a new attention mechanism combined with foreground feature selection to better fuse fine-grained features with foreground features. Unlike the query selection strategy of two-stage Deformable DETR, the multi-class probabilities in Focus-DETR do not include a background class (∅). This module can be regarded as self-attention that performs enhancement on the fine-grained features; the enhanced features are then scattered back to update the original foreground features.
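The gather / self-attention / scatter pattern described here can be sketched as follows. This is an illustrative simplification, not the released code; the tensor shapes and example sizes in the comments are assumptions.

```python
import torch
import torch.nn as nn

def enhance_fine_grained_tokens(fg_tokens, obj_index, self_attn):
    """Sketch of fine-grained token enhancement.

    fg_tokens: (B, N_fg, C) foreground tokens.
    obj_index: (B, K) indices of the fine-grained object tokens within fg_tokens.
    self_attn: an nn.MultiheadAttention module created with batch_first=True.
    """
    B, N_fg, C = fg_tokens.shape
    idx = obj_index.unsqueeze(-1).expand(-1, -1, C)            # (B, K, C)
    obj_tokens = torch.gather(fg_tokens, 1, idx)               # gather fine-grained tokens
    refined, _ = self_attn(obj_tokens, obj_tokens, obj_tokens) # enhance with self-attention
    # Scatter the enhanced tokens back so the foreground sequence is updated.
    return fg_tokens.scatter(1, idx, refined)

# Example usage with assumed sizes:
# attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
# out = enhance_fine_grained_tokens(torch.randn(2, 300, 256),
#                                   torch.randint(0, 300, (2, 30)), attn)
```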

Experimental results

Main results

As shown in Table 1, the authors compare the performance of Focus-DETR with other models on the COCO validation set. Built on DINO and using only 30% of the tokens, Focus-DETR exceeds Sparse DETR by 2.2 AP. Compared with the original DINO, it loses only 0.5 AP while reducing computation by 45% and increasing inference speed by 40.8%.

Table 1: Overall comparison results

Model Performance Analysis

Figure 6 plots accuracy against computation for different models. Focus-DETR achieves the best trade-off between accuracy and computational complexity and, overall, obtains SOTA performance compared with the other models.

Figure 6: Accuracy versus computational complexity for different models

Ablation experiment

As shown in Table 2, the authors run ablation experiments on the model design to verify the effectiveness of the proposed components.

Table 2: Impact of the proposed foreground feature pruning strategy and the fine-grained feature self-attention enhancement module on performance

1. The impact of foreground feature selection strategies 

Directly using the foreground score for prediction gives 47.8 AP; adding the labels generated by the label assignment strategy as supervision improves AP by 1.0; adding the top-down modulation strategy improves the interaction between multi-scale feature maps and raises AP by a further 0.4. This shows that the proposed strategies are effective in improving accuracy. As the visualization in Figure 7 shows, Focus-DETR accurately selects foreground tokens on multi-scale features. One can also see that the detected objects overlap across feature maps of different scales, precisely because Focus-DETR uses an overlapping scale-assignment setting.

Figure 7: Tokens retained on multi-scale features

  2. The impact of the top-down score modulation strategy

Table 3: Ways of associating foreground scores across multi-scale feature maps; the authors try both top-down and bottom-up modulation

The authors compare the top-down and bottom-up modulation strategies. The results show that the proposed top-down modulation achieves better performance.

  3. Effect of the foreground retention ratio on performance

Table 4: Foreground token keep ratios for Focus-DETR, Sparse DETR, and DINO+Sparse DETR

The authors compare performance under different pruning ratios. The experimental results show that Focus-DETR achieves better results at the same pruning ratio.

Summary

Focus-DETR achieves comparable performance with only 30% of the foreground tokens, striking a better trade-off between computational efficiency and model accuracy. Its core component is a foreground token selector based on multi-level semantic features that takes both localization and semantic information into account. By accurately selecting foreground and fine-grained features and semantically enhancing the fine-grained features, Focus-DETR achieves a better balance between model complexity and accuracy.
