DETRs with Hybrid Matching

CVPR 2023 | DETRs with Hybrid Matching: an improved DETR algorithm based on a hybrid of one-to-one and one-to-many matching strategies

  • Paper link: https://arxiv.org/pdf/2207.13080.pdf
  • Source code link: https://github.com/HDETR/H-Deformable-DETR

Introduction

Models based on the DEtection TRansformer (DETR) have achieved great success across a wide range of basic visual recognition tasks, including object detection, instance segmentation, panoptic segmentation, oriented object detection, video instance segmentation, pose estimation, multi-object tracking, depth estimation, text detection, line segment detection, 3D object detection from point clouds or multi-view images, and visual question answering.

There are many angles from which to improve the DETR model, including redesigning the Transformer encoder or decoder architecture, or redesigning the query formulation. Unlike previous improvements, this work focuses on the training inefficiency caused by one-to-one matching (each GT is assigned only one query). For example, since more than 99% of the images in the COCO dataset contain fewer than 30 annotated boxes, Deformable-DETR matches fewer than 30 of its 300 queries to GT, while the remaining 270+ queries are assigned $\emptyset$ and supervised only by the classification loss.

To overcome the shortcomings of one-to-one matching and unlock the benefit of exploring more positive queries, this paper proposes a simple hybrid matching strategy that generates more informative queries matched to the GT in each forward pass. The core idea of hybrid matching is to use one-to-many matching to improve training, while keeping one-to-one matching to avoid NMS post-processing.

Two decoder branches are used, one for one-to-one matching and one for one-to-many matching. During training, one decoder branch handles the set of queries for one-to-one matching, while the other handles an additional set of queries for one-to-many matching. During evaluation, only the first decoder branch is used, with its query set supervised by one-to-one matching. This avoids NMS post-processing and introduces no additional computational cost at inference time.

Method of this paper

Overview of the DETR method

DETR architecture

Given an input image $\mathbf{I}$, DETR first uses a backbone and a Transformer encoder to extract a sequence of enhanced pixel embeddings $\mathbf{X} = \{\mathbf{x}_{0},\mathbf{x}_{1},\ldots,\mathbf{x}_{N}\}$. The pixel embeddings and a default set of object query embeddings $\mathbf{Q} = \{\mathbf{q}_{0},\mathbf{q}_{1},\ldots,\mathbf{q}_{n}\}$ are then fed into the Transformer decoder. DETR applies task-specific prediction heads to the updated object query embeddings after each Transformer decoder layer to independently generate a set of predictions $\mathbf{P}=\{\mathbf{p}_{0},\mathbf{p}_{1},\ldots,\mathbf{p}_{n}\}$. Finally, DETR performs one-to-one bipartite matching between the predictions and the GT: the GT is matched to the predictions by minimizing the matching cost, and the corresponding supervision loss is applied.
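As a rough illustration of this pipeline, the sketch below strings the pieces together in PyTorch-style code. All module names and interfaces (a backbone producing 2048 channels, encoder/decoder modules that accept these tensor shapes) are assumptions made for illustration, not the official DETR implementation.

```python
import torch
import torch.nn as nn

class MinimalDETR(nn.Module):
    """Toy sketch of the DETR pipeline described above; interfaces are assumed."""
    def __init__(self, backbone, encoder, decoder, num_queries=300,
                 d_model=256, num_classes=80):
        super().__init__()
        self.backbone = backbone                        # CNN feature extractor
        self.input_proj = nn.Conv2d(2048, d_model, 1)   # assumes ResNet-50-style channels
        self.encoder = encoder                          # Transformer encoder
        self.decoder = decoder                          # Transformer decoder, returns per-layer outputs
        self.query_embed = nn.Embedding(num_queries, d_model)   # default object queries Q
        self.class_head = nn.Linear(d_model, num_classes + 1)   # +1 for the "no object" class
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, images):
        feats = self.input_proj(self.backbone(images))            # (B, d, H, W)
        x = feats.flatten(2).permute(0, 2, 1)                     # pixel embeddings X
        memory = self.encoder(x)                                  # enhanced pixel embeddings
        q = self.query_embed.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.decoder(q, memory)                              # (L, B, n, d): output of every layer
        # Per-layer predictions P^l, later matched one-to-one against the GT.
        return self.class_head(hs), self.box_head(hs).sigmoid()
```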

Deformable DETR architecture

The main improvements of Deformable DETR include:

  1. Use multi-scale deformable self-attention and multi-scale deformable cross-attention modules to replace the original multi-head self-attention or cross-attention.

  2. The original independent per-layer box prediction is replaced by iterative bounding-box refinement (a sketch follows this list).

  3. The original image-content-agnostic queries are replaced by dynamic queries generated from the Transformer encoder output.
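A minimal sketch of the iterative refinement idea from item 2, assuming a decoder layer interface `layer(queries, memory, boxes)` and one small box head per layer (both hypothetical names, not Deformable DETR's actual modules):

```python
import torch
import torch.nn as nn

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

class IterativeRefinementDecoder(nn.Module):
    """Each decoder layer predicts a delta that refines the previous layer's boxes."""
    def __init__(self, layers, box_heads):
        super().__init__()
        self.layers = nn.ModuleList(layers)        # L decoder layers (assumed interface)
        self.box_heads = nn.ModuleList(box_heads)  # one small MLP per layer, output dim 4

    def forward(self, queries, reference_boxes, memory):
        per_layer_boxes = []
        boxes = reference_boxes                            # (n, 4), values in [0, 1]
        for layer, head in zip(self.layers, self.box_heads):
            queries = layer(queries, memory, boxes)        # attend to features around current boxes
            boxes = (head(queries) + inverse_sigmoid(boxes)).sigmoid()  # layer-wise refinement
            per_layer_boxes.append(boxes)
            boxes = boxes.detach()                         # stop gradients between layers
        return per_layer_boxes                             # one prediction set per decoder layer
```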

Hybrid branch architecture

Two sets of queries are maintained, $\mathbf{Q} = \{\mathbf{q}_{1},\mathbf{q}_{2},\ldots,\mathbf{q}_{n}\}$ and $\widehat{\mathbf{Q}} = \{\widehat{\mathbf{q}}_{1},\widehat{\mathbf{q}}_{2},\ldots,\widehat{\mathbf{q}}_{n}\}$. One-to-one matching is applied to the predictions of the first set, and one-to-many matching to the predictions of the second set.

One-to-one matching branch

The first set of queries is processed by $L$ Transformer decoder layers, and a set of predictions $\mathbf{P}^{l}$ is formed after each decoder layer. Bipartite matching between the predictions and the GT is used to compute the loss:

$$\mathcal{L}_{one2one} = \sum_{l=1}^{L} \mathcal{L}_{Hungarian}(\mathbf{P}^{l},\mathbf{G})$$

The same loss function as DETR and Deformable DETR is used, consisting of a classification loss, an $\mathcal{L}_{1}$ loss, and a GIoU loss.
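A minimal sketch of this one-to-one loss, using scipy's Hungarian solver for the bipartite matching. The cost weights are illustrative, and the GIoU terms are omitted for brevity, so this is not the paper's exact loss.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def hungarian_loss(pred_logits, pred_boxes, gt_labels, gt_boxes, no_object_class=80):
    """One-to-one matching loss for a single image and a single decoder layer.
    A full implementation also adds GIoU terms to both the cost and the loss."""
    prob = pred_logits.softmax(-1)                                             # (n, C+1)
    cost = -prob[:, gt_labels] + 5.0 * torch.cdist(pred_boxes, gt_boxes, p=1)  # illustrative weights
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())          # Hungarian matching
    device = pred_logits.device
    q_idx = torch.as_tensor(q_idx, device=device)
    g_idx = torch.as_tensor(g_idx, device=device)

    # Matched queries are supervised with their GT class; unmatched ones with "no object".
    target_cls = torch.full((pred_logits.size(0),), no_object_class,
                            dtype=torch.long, device=device)
    target_cls[q_idx] = gt_labels[g_idx]
    loss_cls = F.cross_entropy(pred_logits, target_cls)
    loss_box = F.l1_loss(pred_boxes[q_idx], gt_boxes[g_idx])
    return loss_cls + 5.0 * loss_box

def one2one_loss(per_layer_logits, per_layer_boxes, gt_labels, gt_boxes):
    """L_one2one: the Hungarian loss summed over the outputs of all L decoder layers."""
    return sum(hungarian_loss(pl, pb, gt_labels, gt_boxes)
               for pl, pb in zip(per_layer_logits, per_layer_boxes))
```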

One-to-many matching branch

The second set of queries is processed by the same $L$ Transformer decoder layers to obtain $L$ groups of predictions $\widehat{\mathbf{P}}^{l}$. To implement one-to-many matching, the GT is simply replicated $K$ times to obtain an augmented target $\widehat{\mathbf{G}} = \{\mathbf{G}^{1},\mathbf{G}^{2},\ldots,\mathbf{G}^{K}\}$ with $\mathbf{G}^{1}=\mathbf{G}^{2}=\ldots=\mathbf{G}^{K}=\mathbf{G}$. Bipartite matching between the predictions and the augmented target is again used:

$$\mathcal{L}_{one2many} = \sum_{l=1}^{L}\mathcal{L}_{Hungarian}(\widehat{\mathbf{P}}^{l},\widehat{\mathbf{G}})$$

Throughout training, the two losses are combined as $\lambda\mathcal{L}_{one2many}+\mathcal{L}_{one2one}$.
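A minimal sketch of the hybrid branch loss, reusing `one2one_loss` from the previous snippet. `K` (the GT repetition factor) and `lam` (the weight $\lambda$ on $\mathcal{L}_{one2many}$) are passed in as hyperparameters; their specific values are not taken from the paper.

```python
def hybrid_loss(one2one_preds, one2many_preds, gt_labels, gt_boxes, K, lam):
    """one2one_preds / one2many_preds: (per_layer_logits, per_layer_boxes) tuples
    from the first and second query sets, respectively."""
    # One-to-one branch: first query set matched against the original GT.
    l_one2one = one2one_loss(*one2one_preds, gt_labels, gt_boxes)

    # One-to-many branch: second query set matched against the GT replicated K times,
    # so up to K queries can be assigned to every GT box.
    aug_labels = gt_labels.repeat(K)
    aug_boxes = gt_boxes.repeat(K, 1)
    l_one2many = one2one_loss(*one2many_preds, aug_labels, aug_boxes)

    return lam * l_one2many + l_one2one
```

At inference time only the first query set is forwarded, so the one-to-many branch adds no inference cost and no NMS is required.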

Other hybrid matching variants

Hybrid epoch scheme

The core change is to use different matching strategies during different training epochs.

One-to-many matching epochs

During the first $\rho$ fraction of training epochs, the one-to-many matching strategy is used to process the $L$ groups of outputs produced by the $L$ Transformer decoder layers. The GT must also be augmented as $\widehat{\mathbf{G}} = \{\mathbf{G}^{1},\mathbf{G}^{2},\ldots,\mathbf{G}^{K}\}$, and the same one-to-many matching as before is applied.

One-to-one matching epochs

During the remaining $(1-\rho)$ fraction of training epochs, one-to-one matching is used instead of one-to-many matching.
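A minimal sketch of the epoch-level switch; `rho` and `K` follow the paper's symbols, but their specific values are not assumed here:

```python
def targets_for_epoch(epoch, total_epochs, gt_labels, gt_boxes, rho, K):
    """Return the matching target used at this epoch under the hybrid epoch scheme."""
    if epoch < rho * total_epochs:
        # First rho fraction of epochs: one-to-many matching against the K-times replicated GT.
        return gt_labels.repeat(K), gt_boxes.repeat(K, 1)
    # Remaining (1 - rho) fraction of epochs: one-to-one matching against the original GT.
    return gt_labels, gt_boxes
```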

Hybrid layer scheme

Different matching strategies are used for the outputs of different Transformer decoder layers: the first $L_{1}$ decoder layers use one-to-many matching, while the remaining $L_{2}$ decoder layers use one-to-one matching.
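A minimal sketch of this layer-level variant, reusing `hungarian_loss` from the earlier snippet; `L1` denotes the number of decoder layers supervised with one-to-many matching, and `K` is the GT repetition factor.

```python
def hybrid_layer_loss(per_layer_logits, per_layer_boxes, gt_labels, gt_boxes, L1, K):
    """First L1 decoder layers: one-to-many matching; remaining layers: one-to-one."""
    aug_labels, aug_boxes = gt_labels.repeat(K), gt_boxes.repeat(K, 1)
    total = 0.0
    for l, (logits, boxes) in enumerate(zip(per_layer_logits, per_layer_boxes)):
        if l < L1:
            total = total + hungarian_loss(logits, boxes, aug_labels, aug_boxes)
        else:
            total = total + hungarian_loss(logits, boxes, gt_labels, gt_boxes)
    return total
```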

Original article: blog.csdn.net/qgh1223/article/details/129972415