[Paper Notes] DiffusionTrack: Diffusion Model For Multi-Object Tracking

Original link: https://arxiv.org/abs/2308.09905

1. Introduction

  Multi-object tracking methods are usually divided into two-stage tracking-by-detection (TBD) and one-stage joint detection and tracking (JDT). TBD first detects objects in each frame and then uses a tracker to associate the same object across frames. Common trackers include motion-based tracking with Kalman filters, appearance-based association with re-identification features, and graph-based trackers (which model the association process as a minimum-cost flow problem).
  JDT methods handle detection and tracking in a unified manner and can be divided into three categories: query-based trackers (which use identity-consistent implicit queries so that each query keeps tracking the same object), offset-based trackers (which use motion features to estimate motion offsets), and trajectory-based trackers (which use spatio-temporal information to handle severe object occlusion).
  However, most TBD and JDT methods have the following problems:

  1. Global or local inconsistency: TBD trains detection and tracking separately, which leads to global inconsistency, while JDT still places detection and tracking in different branches or modules, which does not fully resolve the inconsistency.
  2. Suboptimal trade-off between robustness and model complexity: TBD is simple, but fluctuations in detection quality hurt tracking performance; JDT is more robust, but at the cost of detection accuracy.
  3. Inflexibility across different scenes within the same video: videos are processed with uniform settings, without adaptively applying different strategies to different scenes.

  This article proposes DiffusionTrack, which establishes a new noise-to-tracking paradigm. The method forms object associations directly from a set of random bounding-box pairs over adjacent frames, as shown in the figure. The coordinates of the box pairs are progressively refined until they cover the same object in both frames, enabling implicit, simultaneous detection and tracking within a unified model.
[Figure: object associations refined from random bounding-box pairs in adjacent frames]

3. Method

3.1 Preliminaries

  The goal of multi-object tracking is to obtain a temporally ordered set of triplets $(X_t, B_t, C_t)$, where $X_t$ is the input image at time $t$, and $B_t$ and $C_t$ are the sets of bounding boxes and class labels at time $t$. Each element of $B_t$ is $B_t^i=(c_x^i,c_y^i,w_i,h_i)$, where $i$ is the identity of the object. When object $i$ does not exist in $X_t$ or is not detected, $B_t^i=\emptyset$.
  Diffusion models typically use two Markov chains: a forward chain that adds noise to the data and a reverse chain that recovers the data from the noise. Given a data distribution $x_0\sim q(x_0)$ and a variance schedule $\beta_1,\beta_2,\cdots,\beta_T$, the forward process is defined as $p(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{1-\beta_t}\,x_{t-1},\beta_t I)$. Given $x_0$, a sample of $x_t$ can be obtained by drawing a Gaussian vector $\epsilon\sim\mathcal{N}(0,I)$ and computing
$$x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,$$
where $\bar{\alpha}_t=\prod_{s=0}^t(1-\beta_s)$. During training, the network is trained to predict $x_0$ from $x_t$ at different timesteps. At inference, sampling starts from random noise $x_T$ and the reverse process is iterated to obtain $x_0$.
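  For concreteness, here is a minimal PyTorch sketch of this closed-form forward sampling. The linear beta schedule is purely illustrative (DiffusionTrack itself uses a cosine schedule for $\alpha_t$, see Section 3.3), and all names are ours, not the authors':

```python
import torch

def make_alpha_bar(T: int = 1000) -> torch.Tensor:
    """alpha_bar_t = prod_{s<=t}(1 - beta_s); linear beta schedule for illustration only."""
    betas = torch.linspace(1e-4, 2e-2, T)
    return torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    """Closed-form forward sample: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = torch.randn_like(x0)                            # epsilon ~ N(0, I)
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))    # broadcast over data dims
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps
```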

3.2 DiffusionTrack

[Figure: DiffusionTrack network structure]

  The network structure is shown in the figure and consists of a feature-extraction backbone and a data-association denoising head (diffusion head). The former runs once to extract deep features from the two adjacent frames $(X_{t-1}, X_t)$; the latter takes these features as conditions and progressively refines noisy bounding-box pairs into predicted associated bounding-box pairs. A data sample is a set of bounding-box pairs $z_0=(B_{t-1},B_t)\in\mathbb{R}^{N\times 8}$. The neural network $f_\theta(z_s,s,X_{t-1},X_t)$, $s\in\{0,1,\cdots,T\}$, is trained to predict $z_0$ from $z_s$, conditioned on the two adjacent frames. The corresponding class labels $(C_{t-1},C_t)$ and the association confidence score $S$ are predicted at the same time. If $X_{t-1}=X_t$, the multi-object tracking task degenerates into an object detection task.
  The backbone uses YOLOX+FPN.
  The diffusion head uses the proposal boxes to crop RoI features from the feature maps and feeds them into a sequence of blocks to obtain bounding-box regression values, classification results, and association confidence scores. In addition, a spatio-temporal fusion module (STF) and an association score head are added to each block of the diffusion head.
  The spatio-temporal fusion module lets the two boxes of a pair exchange temporal information to ensure consistent data association. Given the RoI features $f_{roi}^{t-1},f_{roi}^t\in\mathbb{R}^{N\times R\times d}$ and the self-attention output queries $q_{pro}^{t-1},q_{pro}^t\in\mathbb{R}^{N\times d}$ of the current block, linear projections and batched matrix multiplication (Bmm) produce the object queries $q^{t-1},q^t\in\mathbb{R}^{N\times d}$:
$$\begin{aligned}[P_1^i;P_2^i]&=\mathtt{Linear1}(q_{pro}^i), & P_1^i,P_2^i&\in\mathbb{R}^{N\times Cd}\\ \mathbf{feat}&=[f_{roi}^i,f_{roi}^{2t-1-i}], & \mathbf{feat}&\in\mathbb{R}^{N\times 2R\times d}\\ \mathbf{feat}&=\mathtt{Bmm}(\mathbf{feat},P^i_1.\mathtt{view}(N,d,C))\\ \mathbf{feat}&=\mathtt{Bmm}(\mathbf{feat},P^i_2.\mathtt{view}(N,C,d))\\ q^i&=\mathtt{Linear2}(\mathbf{feat}.\mathtt{flatten}(1)), & q^i&\in\mathbb{R}^{N\times d}\end{aligned}$$
where $[\cdot,\cdot]$ denotes concatenation, $[\cdot;\cdot]$ denotes splitting, and $i\in\{t-1,t\}$ (so $f_{roi}^{2t-1-i}$ is the RoI feature of the other frame).
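  A hedged PyTorch sketch of these equations for a single frame index $i$ (module and variable names are ours; $C$, $R$ and $d$ are hyperparameters whose values these notes do not pin down):

```python
import torch
import torch.nn as nn

class STFBlock(nn.Module):
    """Spatio-temporal fusion for one frame index i, following the equations above."""

    def __init__(self, d: int = 256, C: int = 64, R: int = 49):
        super().__init__()
        self.d, self.C = d, C
        self.linear1 = nn.Linear(d, 2 * C * d)   # emits [P1; P2], split afterwards
        self.linear2 = nn.Linear(2 * R * d, d)   # fused features -> object query

    def forward(self, q_pro_i, f_roi_i, f_roi_other):
        # q_pro_i:     (N, d)    self-attention output query of frame i
        # f_roi_i:     (N, R, d) RoI features of frame i
        # f_roi_other: (N, R, d) RoI features of the other frame (index 2t-1-i)
        N = q_pro_i.size(0)
        P1, P2 = self.linear1(q_pro_i).chunk(2, dim=-1)     # each (N, C*d)
        feat = torch.cat([f_roi_i, f_roi_other], dim=1)     # (N, 2R, d) concatenation
        feat = torch.bmm(feat, P1.view(N, self.d, self.C))  # (N, 2R, C)
        feat = torch.bmm(feat, P2.view(N, self.C, self.d))  # (N, 2R, d)
        return self.linear2(feat.flatten(1))                # q^i in (N, d)
```

Applying the block twice, with $i=t-1$ and $i=t$, yields the pair of queries $q^{t-1},q^t$, so each frame's query mixes in the other frame's RoI features.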
  The association score head feeds the fused features of a bounding-box pair into a linear layer to obtain the confidence score of the data association. This score is used in the subsequent NMS post-processing step to decide whether an output bounding-box pair belongs to the same object.

3.3 Model training and inference

  During training, two frames are randomly sampled from the video (the frame interval is 5), and the ground-truth bounding boxes are padded so that their number equals $N_{train}$. Gaussian noise is then added to the ground-truth boxes, with $\alpha_t$ following a monotonically decreasing cosine schedule. Finally, a denoising process recovers the associated bounding-box pairs from the noisy boxes. In addition, a baseline scheme is designed that adds noise only to the current frame and denoises conditioned on the past frame's bounding boxes, in order to demonstrate the necessity of corrupting both frames simultaneously.
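  A sketch of this padding-and-corruption step, reusing q_sample and make_alpha_bar from the sketch in Section 3.1 (box normalization and the padding scale are our assumptions, not the authors' released code):

```python
import torch

def pad_and_corrupt(gt_pairs: torch.Tensor, n_train: int, t: int,
                    alpha_bar: torch.Tensor) -> torch.Tensor:
    """Pad ground-truth box pairs (M, 8) to N_train rows with Gaussian random
    boxes (the best padding strategy per Section 4.3), then corrupt them with
    the forward process."""
    m = gt_pairs.size(0)
    if m < n_train:
        pad = torch.randn(n_train - m, 8)                # Gaussian-distributed boxes
        gt_pairs = torch.cat([gt_pairs, pad], dim=0)
    t_batch = torch.full((n_train,), t, dtype=torch.long)
    return q_sample(gt_pairs, t_batch, alpha_bar)        # noisy z_s from z_0
```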
  Loss function: GIoU is extended to a 3D GIoU. For a matched pair $(T_d,T_{gt})$ in the matching set $M$ obtained with the Hungarian algorithm, denote its classification scores, bounding boxes, and association confidence score by $(C^{t-1}_d,C^t_d)$, $(B^{t-1}_d,B^t_d)$, and $S_d$. The loss function is:
$$\mathcal{L}_{cls}(T_d,T_{gt})=\sum_{i=t-1}^t\mathcal{L}_{cls}\left(\sqrt{C_d^i\times S_d},\,C_{gt}^i\right)$$
$$\mathcal{L}_{reg}(T_d,T_{gt})=\sum_{i=t-1}^t\mathcal{L}_{reg}(B_d^i,B_{gt}^i)$$
$$L_{det}=\frac{1}{N_{pos}}\sum_{(T_d,T_{gt})\in M}\lambda_1\mathcal{L}_{cls}(T_d,T_{gt})+\lambda_2\mathcal{L}_{reg}(T_d,T_{gt})+\lambda_3\left(1-\mathtt{GIoU}_{3D}(T_d,T_{gt})\right)$$
where $T_d$ and $T_{gt}$ contain, respectively, the estimated and ground-truth bounding boxes of the same target in two adjacent frames, $N_{pos}$ is the number of foreground targets, $\mathcal{L}_{cls}$ is the focal loss, and $\mathcal{L}_{reg}$ is the L1 loss.
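  To make the score fusion concrete, here is a minimal sketch of the classification term: the class score is fused with the association score $S_d$ by a geometric mean before the focal loss (the focal-loss hyperparameters are our assumptions; the Hungarian matching and the 3D GIoU term are omitted):

```python
import torch

def focal_loss(p: torch.Tensor, target: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Focal loss on probabilities p in (0, 1) with binary targets in {0, 1}."""
    p = p.clamp(1e-6, 1.0 - 1e-6)
    ce = -(target * p.log() + (1 - target) * (1 - p).log())
    p_t = target * p + (1 - target) * (1 - p)
    a_t = target * alpha + (1 - target) * (1 - alpha)
    return (a_t * (1 - p_t).pow(gamma) * ce).sum()

def pair_cls_loss(cls_prev, cls_cur, assoc_s, tgt_prev, tgt_cur):
    """L_cls(T_d, T_gt) = sum_{i in {t-1, t}} focal(sqrt(C_d^i * S_d), C_gt^i)."""
    return (focal_loss(torch.sqrt(cls_prev * assoc_s), tgt_prev)
            + focal_loss(torch.sqrt(cls_cur * assoc_s), tgt_cur))
```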
  The figure below shows the inference process. Unlike detection, since the estimates of the past frame are known, the $N_{test}$ initial noisy bounding boxes can be generated from the past frame's bounding boxes. In the baseline scheme, existing bounding boxes are repeated instead of adding random boxes, and noise is added only to the current frame. After the association results are obtained, IoU is used to measure similarity and link boxes into object trajectories. To handle potential occlusions, a Kalman filter is used to re-associate missing objects.
[Figure: inference process]
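  A simplified sketch of the trajectory-linking step: each refined pair $(B_{t-1}, B_t)$ is matched to an existing track by the IoU of its previous-frame box against the track's last box. Greedy matching is our simplification, and the Kalman-filter re-association for occluded objects is omitted:

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-6)

def link_pairs(pairs, tracks, iou_thresh: float = 0.5):
    """pairs: list of (box_prev, box_cur) arrays; tracks: list of box lists."""
    for box_prev, box_cur in pairs:
        scores = [iou(box_prev, tr[-1]) for tr in tracks]
        best = int(np.argmax(scores)) if scores else -1
        if best >= 0 and scores[best] >= iou_thresh:
            tracks[best].append(box_cur)        # same identity: extend the track
        else:
            tracks.append([box_prev, box_cur])  # otherwise start a new track
    return tracks
```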

4. Experiment

4.1 Settings

  Implementation details: YOLOX detector + DiffusionTrack head. Training is divided into two phases: first train on the detection task, then train on the tracking task.

4.2 Features

  DiffusionTrack offers dynamic bounding boxes, progressive refinement, and robustness to detection perturbations.
  Dynamic bounding boxes and progressive refinement: once the model is trained, the number of bounding boxes and the number of sampling steps can be changed at inference time, so a trade-off between speed and accuracy can be achieved without retraining.
  Robustness to detection perturbations: almost all previous methods are sensitive to detection perturbations, which poses safety issues in autonomous driving. Experiments show that DiffusionTrack is almost unaffected by such perturbations.

4.3 Ablation studies

  Referring to Figure 3, the paper studies the impact of each component.
  Proportion of prior information: since the tracking task has prior information about objects in the previous frame, the proportion of prior information can be controlled through the number of copied existing bounding boxes. Experiments show that an appropriate ratio improves performance.
  Bounding-box padding strategy: padding with random boxes drawn from a Gaussian distribution outperforms random boxes drawn from a uniform distribution, image-sized boxes, and copies of existing boxes.
  Perturbation scheduling: to cope with complex situations, $\alpha_t$ needs to be adjusted; for example, when objects move nonlinearly, $\alpha_t$ should be increased. The perturbation schedule can be modeled by the timestep $t$, expressed as $t=\min(999,\max(0,1000f(x)))$, where $x$ is the average motion percentage of objects across the two frames and $f$ is the scheduling function. Experiments show that the best schedule is logarithmic, i.e. $f(x)=\frac{\log(x+1)}{\log 2}$.
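  A small sketch of this schedule, mapping the motion ratio $x$ to a diffusion timestep (the rounding is our choice):

```python
import math

def log_schedule(x: float) -> float:
    """Logarithmic schedule f(x) = log(x + 1) / log 2, mapping [0, 1] to [0, 1]."""
    return math.log(x + 1.0) / math.log(2.0)

def perturbation_timestep(x: float) -> int:
    """t = min(999, max(0, 1000 * f(x))), clamped to the valid timestep range."""
    return int(min(999, max(0, round(1000 * log_schedule(x)))))
```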
  Efficiency comparison: more refinement steps improve performance but also reduce speed.

4.4 Comparison with SotA

  The method in this article reaches the SotA level of one-stage methods on the MOT17, MOT20, and DanceTrack datasets.
  The baseline scheme performs similarly on MOT17 but degrades severely on the remaining datasets. This is because, given only the features at time $t-1$, it learns coordinate regression between $B_{t-1}$ and $B_t$ but cannot handle nonlinear object motion. The article speculates that the diffusion process acts as a special data augmentation that enables DiffusionTrack to distinguish different targets.


Origin blog.csdn.net/weixin_45657478/article/details/133239770