【MOT】A summary of the general multi-object tracking pipeline and its methods

Classified by overall framework:

  • TBD (Tracking-by-Detection), or DBT (Detection-Based-Tracking): an object detector first detects objects, and then motion, position, and appearance cues (or a combination of them) are used to associate detections across frames into tracks, each corresponding to a specific identity. For online applications the association is solved frame by frame; offline, it can be solved in batch over whole sequences. The tracker's quality is limited by the detector's quality;
  • JDT (Joint-Detection-and-Tracking), or TBR (Tracking-by-Regression), or D2T (Detect-to-Track): broadly speaking, detection and tracking belong to the same framework, and the two tasks are performed at the same time.

Classified by whether an embedding is used:

  • w/o embedding;
  • SDE (Separate-Detection-and-Embedding): the embedding is computed by a separate neural network, which is time-consuming;
  • JDE (Joint-Detection-and-Embedding): the embedding is output while detecting, so only one inference pass is needed.

This article mainly draws on the MOT algorithms listed at https://blog.csdn.net/weixin_43082343/article/details/127421908 and https://github.com/luanshiyinyang/awesome-multiple-object-tracking#metrics .

1. Motion models

1.1 Explicit models

1.1.1 Linear motion

(1) No model (can be regarded as a constant position model):

[AVSS 2017] IOUTracker: High-Speed tracking-by-detection without using image information

The current-frame detections are matched directly to the previous-frame tracks by IOU; that is, no position prediction is made for the previous frame's tracks. Simple and fast (100k fps), but occlusion or missed detections interrupt the tracks.
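As a concrete illustration, here is a minimal Python sketch of this kind of IoU-only association (my own simplification, not the paper's code; `sigma_iou` is an assumed matching threshold):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate_by_iou(tracks, detections, sigma_iou=0.5):
    """Greedily give each track the best remaining detection by IoU."""
    matches, unmatched_dets = [], set(range(len(detections)))
    for t_idx, track in enumerate(tracks):
        best = max(unmatched_dets, key=lambda d: iou(track, detections[d]),
                   default=None)
        if best is not None and iou(track, detections[best]) >= sigma_iou:
            matches.append((t_idx, best))
            unmatched_dets.remove(best)
    return matches, unmatched_dets  # unmatched dets may start new tracks
```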

(2) Simple linear motion (average speed)

[WACV 2023] C-BIoU: Hard to Track Objects with Irregular Motions and Similar Appearances? Make It Easier by Buffering the Matching Space

The det position is taken as the track position; if a track is not matched, its estimated position is used as its current-frame position, where the velocity is averaged over the previous k frames (the velocity is the difference between the center points of the detection boxes in adjacent frames).
[2022] GHOST: Simple Cues Lead to a Strong Multi-Object Tracker also uses this approach.
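A sketch of this average-velocity extrapolation (my own illustration, not taken from either paper):

```python
import numpy as np

def predict_center(centers, k=5):
    """Extrapolate a track's next center from its recent average velocity.

    centers: list of (cx, cy) observations for one track, oldest to newest.
    """
    c = np.asarray(centers, dtype=float)
    if len(c) < 2:
        return c[-1]                       # not enough history: stay put
    diffs = np.diff(c[-(k + 1):], axis=0)  # per-frame center displacements
    v = diffs.mean(axis=0)                 # average velocity over <= k frames
    return c[-1] + v                       # estimated next-frame center
```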

(3) Kalman filter (constant velocity model)

[ICIP 2016] SORT: Simple online and realtime tracking

A Kalman filter estimates the position in the next frame, assuming uniform motion; the state variables are $[x,y,s,\gamma,\dot x,\dot y,\dot s]$ (s is the bbox scale/area and $\gamma$ its aspect ratio). A classic. Many MOT algorithms use the KF constant-velocity model, and some modify the state variables; for example, OC-SORT's state variables are $[x,y,w,h,\dot x,\dot y,\dot w,\dot h]$.
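A minimal SORT-style constant-velocity Kalman filter using the filterpy library (a sketch assuming filterpy is installed; not the official SORT code):

```python
import numpy as np
from filterpy.kalman import KalmanFilter

# State: [x, y, s, r, vx, vy, vs]; measurement: [x, y, s, r].
kf = KalmanFilter(dim_x=7, dim_z=4)
dt = 1.0
kf.F = np.eye(7)                 # constant-velocity transition
for i in range(3):
    kf.F[i, i + 4] = dt          # x += vx*dt, y += vy*dt, s += vs*dt
kf.H = np.zeros((4, 7))
kf.H[:4, :4] = np.eye(4)         # we observe [x, y, s, r] directly

kf.x[:4] = np.array([[100.], [50.], [400.], [0.5]])  # initialize from a det
kf.predict()                                         # propagate one frame
kf.update(np.array([102., 51., 405., 0.5]))          # correct with a new det
```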

[ICRA 2017] CIWT: Combined image-and world-space tracking in traffic scenes

Uses an EKF (extended Kalman filter) with a constant velocity model.
[2018] DetTA: Detection-Tracking for Efficient Person Analysis: The DetTA Pipeline likewise uses a Bi-EKF (bidirectional extended Kalman filter) with a constant velocity model.

[2022] GIAOTracker: A comprehensive framework for MCMOT with global information and optimizing strategies in VisDrone 2021

Uses a UKF (unscented Kalman filter) with the constant velocity model and, on top of it, modifies the observation noise R according to the detection confidence: $\widetilde R_k=(1-c_k)R_k$. The author calls the improved model NSA-KF [a bright idea, whose feasibility the author verified experimentally].
[2022] StrongSORT: Make DeepSORT Great Again also takes this approach.
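The NSA idea amounts to a one-line change at update time (a sketch reusing the filterpy filter from the SORT example above; `c_k` stands for the detection confidence):

```python
c_k = 0.9                     # detection confidence in [0, 1]
R_nsa = (1.0 - c_k) * kf.R    # \tilde{R}_k = (1 - c_k) R_k
kf.update(np.array([102., 51., 405., 0.5]), R=R_nsa)  # per-update R override
```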

1.1.2 Nonlinear Motion

(1) Quadratic motion

[ECCV2020] DMM-Net: Simultaneous Detection and Tracking with Motion Modelling for Multiple Object Tracking

The author regards a tracklet as a tube composed of N frames of bboxes, whose axis is the trajectory of the center point, and models it as a quadratic motion trajectory, where $P_{4\times 3}$ is the motion parameter matrix:

$$\begin{bmatrix} c_x(t)\\ c_y(t)\\ w(t)\\ h(t) \end{bmatrix}=P_{4\times 3}\begin{bmatrix} t^2\\ t\\ 1 \end{bmatrix}$$
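A sketch of fitting such a quadratic model by plain least squares (my own illustration with synthetic numbers, not the DMM-Net network itself):

```python
import numpy as np

t = np.arange(8, dtype=float)           # N=8 observed frames
boxes = np.stack([                      # columns: cx, cy, w, h per frame
    100 + 2 * t + 0.1 * t**2,           # synthetic example trajectory
    50 + 1 * t,
    40 + 0 * t,
    80 + 0 * t,
], axis=1)

# polyfit on a 2-D y fits each channel at once; transpose gives P (4x3),
# coefficient order [t^2, t, 1] matching the equation above.
P = np.polyfit(t, boxes, deg=2).T
t_next = len(t)
pred = P @ np.array([t_next**2, t_next, 1.0])   # predicted [cx, cy, w, h]
```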

(2) Uniform motion in the world coordinate system

[ICCV2021] PermaTrack: Learning to Track with Object Permanence

For an unmatched track, convert its position from the image coordinate system to the world coordinate system, compute the current-frame position under the uniform-motion assumption, and convert back to image coordinates. [The motion of a track that has not been updated for a long time is clearly not uniform in the image coordinate system, whereas in the world coordinate system the target can be treated as moving uniformly, so this method is more reasonable; however, the transformation matrix is not easy to obtain.]
A similar approach: [ICRA2018] MOTBeyondPixels: Beyond Pixels: Leveraging Geometry and Shape Cues for Online Multi-Object Tracking

1.2 Implicit models

(1) Recurrent neural network

[2021] DEFT: Detection Embeddings for Tracking

Uses an LSTM module as the motion model to veto unreasonable appearance-feature matches, in the same spirit as DeepSORT's use of spatial-distance gating to constrain the appearance distance.

[CVPR2021] ArTIST: Probabilistic Tracklet Scoring and Inpainting for Multiple Object Tracking

The author proposes ArTIST, a stochastic autoregressive motion model that explicitly learns natural multi-modal motion trajectories. Specifically, continuous trajectory velocities are clustered by K-means into 1024 bins; a recurrent neural network plus a fully connected layer outputs a 4×1024 tensor, and the offset $[\Delta_x,\Delta_y,\Delta_w,\Delta_h]$ with the largest probability after softmax is taken.

[Note] For RNN- and LSTM-based methods, I only list MOT methods from 2016 onward; many earlier algorithms used them as well.

(2) Network regression

These are generally JDT methods: a branch is added to the detector to predict motion, or the next-frame position is regressed directly from the previous frame. I also group them under motion models, since they can be viewed as implicit position propagation.

[ICCV2017] D2T: Detect to Track and Track to Detect

Arguably the earliest JDT-like method: detection and tracking are performed jointly by one CNN, which combines the features of two frames to output the position offset $[\Delta_x,\Delta_y,\Delta_w,\Delta_h]$.

[ICCV2019] Tracktor++: Tracking without bells and whistles

One of the classic JDT methods. Under the Faster RCNN framework, the author uses the target positions of the previous frame as region proposals, directly regresses the target positions in the current frame, and inherits the ids. Low-confidence targets are removed directly, and NMS is applied to overlapping targets, keeping the id with the higher confidence. In addition, when the video frame rate is low, a constant-velocity assumption is used to estimate the positions used as region proposals.
[2020] MAT: Motion-Aware Multi-Object Tracking adds a motion module for the previous-frame bboxes on top of Tracktor++: a KF constant-velocity model plus camera motion compensation.

[2020] SMOT: Single-Shot Multi Object Tracking

Similar to Tracktor++, with SSD as the framework: a prior position for each previous-frame target is estimated by dense optical flow, the posterior position picks the anchor with the highest IOU against the prior position, and the target position is then regressed by the SSD network.
[CVPR 2020] FFT: Multiple Object Tracking by Flowing and Fusing, based on Tracktor++, uses the optical-flow estimation network FlowTracker to estimate motion features as region proposals.

[2021] TrackFormer: Multi-Object Tracking with Transformers

The overall structure follows DETR: the track embeddings of the previous frame are transformed by MSA (multi-head self-attention) into track queries [I regard this transformation as a kind of motion model too], which are then fed into the decoder together with the object queries to perform cross-attention against the current-frame features.

[ECCV 2022] MOTR: End-to-End Multiple-Object Tracking with Transformer

The overall structure follows deformable DETR. The track-query concept is the same as in TrackFormer, but a QIM (Query Interaction Module) is added to manage track birth and disappearance. Specifically, the current frame's detect queries and the previous frame's track queries are fed into QIM together, and the encoder produces a tracking score that decides whether each track is kept. The kept track queries pass through a TAN (Temporal Aggregation Network, essentially a multi-head self-attention module) and are concatenated with the track queries generated from the detect queries to form the current frame's track queries.

[In summary: two-stage methods use the previous frame's target positions as region proposals; one-stage methods use them as anchors or anchor points; transformer methods use them as queries for cross-attention.]

[CVPR2021] SiamMOT: Siamese Multi-Object Tracking

Also uses Faster RCNN as the baseline, but adds an extra Siamese-tracker motion model, with two variants: (i) Implicit Motion Model (IMM), which uses an MLP to estimate the motion offset $p$ between two frames and a visibility confidence $v$; (ii) Explicit Motion Model (EMM), which cross-correlates the previous frame's feature map with the feature map of the current frame's search region and then uses a fully convolutional network to output the motion offset $p$ and confidence $v$.

[ECCV2020] CenterTrack: Tracking Objects as Points

Under the CenterNet framework, the input additionally includes the previous frame's image and detection results, and the output adds a head that predicts the position offset. Current-frame detections are associated, in descending order of confidence w, with the nearest unmatched previous-frame object; if no object is matched within a radius k, the detection is treated as a new object (k is defined as the geometric mean of the width and height of each track's predicted bounding box).
[CVPR2021] TraDes: Track to Detect and Segment: An Online Multi-Object Tracker predicts the inter-frame offset in the same way.

1.3 SOT Tracker

SOT is given an initial position and then searches near the target.
Differences from MOT: SOT separates the target from the background, while MOT separates different targets from one another; SOT searches locally, while MOT searches globally.
(My understanding of SOT is relatively shallow; please point out any mistakes.)

[ICCV 2017] STAM: Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism

Uses a SOT tracker to track each target separately; the motion model is a constant-velocity model with momentum, updated online:

$$\widetilde v_t^i=\frac{1}{T_{gap}}\left(l_t^i-l^i_{t-T_{gap}}\right)$$

$$v_t^i=\alpha^i_t v_{t-1}^i+(1-\alpha^i_t)\widetilde v_t^i$$

[AVSS 2018] V-IOU: Extending IOU Based Multi-Object Tracking by Visual Information

Adds motion estimation on top of the aforementioned IOU Tracker: for unmatched tracks, the position is predicted with a SOT tracker.

[ECCV 2018] Online Multi-Object Tracking with Dual Matching Attention Network

Each track is tracked separately with a SOT tracker; when the SOT position estimate is unreliable, a constant-velocity model is used to estimate the position instead.

[CVPR 2021] SOTMOT: Improving Multiple Object Tracking with Single Object Tracking

The author argues that earlier MOT methods using SOT trackers needed one tracker per track, so the runtime grows in proportion to the number of targets. A SOT branch is therefore added on top of CenterNet so that all targets can be SOT-tracked simultaneously (the SOT branch treats each target as a point).

2. Appearance model

That is, the re-ID branch. Depending on whether appearance features are extracted separately, methods divide into SDE and JDE.
SDE directly uses a mature re-ID feature extractor or a simple CNN. Since detection must run first and features are then extracted separately per target, efficiency is low.
JDE extracts appearance features during detection, generally by adding a head to a CNN detector, while transformer architectures extract appearance features implicitly. A few examples:

  • [CVPR2019] MOTS: Multi-Object Tracking and Segmentation —— Mask RCNN+embedding head
  • [ECCV2020] JDE: Towards Real-Time Multi-Object Tracking —— YOLOv3+embedding head
  • [CVPR2020] RetinaTrack: Online Single Stage Joint Detection and Tracking —— RetinaNet+embedding head
  • [IJCV2021] FairMOT: A Simple Baseline for Multi-Object Tracking —— CenterNet+embedding head
  • [CVPR2021] CorrTracker/TLR: Multiple Object Tracking with Correlation Learning —— CenterNet+embedding head
  • [CVPR2021] QDTrack: Quasi-Dense Similarity Learning for Multiple Object Tracking —— Faster RCNN+embedding head

This article classifies methods by the form of the embedding used when computing the appearance distance:

(1) one embedding

Embedding directly output from the network

(2) scaled embedding

Embeddings output by feature layers at different scales are merged

[2021] DEFT: Detection Embeddings for Tracking

With CenterNet as the baseline, the predicted target's center point is mapped onto each scale's feature layer to extract the corresponding embedding; after 1×1 convolutions, these are concatenated as the target's embedding

(3) embedding bank

Each track saves an embedding library

[ICIP 2017] DeepSORT: Simple online and realtime tracking with a deep association metric

When computing the cost function, take the minimum over the distances between the detection's embedding and all embeddings in the track's library
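A sketch of this min-over-bank appearance cost (illustrative, not the original DeepSORT code; embeddings are assumed L2-normalizable):

```python
import numpy as np

def gallery_cost(track_banks, det_embs):
    """track_banks: list of (n_i, d) arrays; det_embs: (m, d) array."""
    det_embs = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    cost = np.zeros((len(track_banks), len(det_embs)))
    for i, bank in enumerate(track_banks):
        bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
        cos_dist = 1.0 - bank @ det_embs.T   # (n_i, m) cosine distances
        cost[i] = cos_dist.min(axis=0)       # best match over the bank
    return cost
```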

[2022] GHOST: Simple Cues Lead to a Strong Multi-Object Tracker

When an inactive track $k$ is associated with detection $i$, the appearance distance is the average distance between the detection's embedding and the track's last $N_k$ frame embeddings:

$$d_{i,k}=\frac{1}{N_k}\sum_{n=1}^{N_k}d(f_i,f_k^n)$$
In addition, the author also compared two other embeddings in the appendix experiment:

  • (i) Mode embedding: keep an embedding library and take the per-dimension mode as the embedding used in the calculation;
  • (ii) Median embedding: keep an embedding library and take the per-dimension median as the embedding used in the calculation.

(4) MA embedding

The embedding is continuously updated with a MA (moving average); the GHOST paper mentions this being used in Tracktor++

(5) EMA embedding

An EMA (exponential moving average) is used to continuously update the embedding, and this averaged embedding is used to compute the distance during matching. Historical appearance information is thus taken into account, and as time goes on, older appearance information is gradually diluted so that the most recent frames receive more attention. JDE, FairMOT, TraDes, and others adopt this method:

$$e_i^t=\alpha e_i^{t-1}+(1-\alpha)f_i^t$$
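A minimal sketch of the update (the re-normalization step is how many implementations handle it, to keep cosine distances well-defined; this is my addition, not part of the formula above):

```python
import numpy as np

def ema_update(e_prev, f_cur, alpha=0.9):
    """EMA update of a track's appearance embedding."""
    e = alpha * e_prev + (1.0 - alpha) * f_cur
    return e / np.linalg.norm(e)   # re-normalize, as many implementations do
```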

(6) EMA bank

The EMA embedding of each frame is computed and saved in a bank; when computing the cost, the minimum distance to the detection's embedding is taken. GIAOTracker adopts this method

(7) other embedding

[CVPR 2021] TADAM: Online Multiple Object Tracking with Cross-Task Synergy

When multiple objects overlap, distractor objects (foreground) contaminate the appearance features of the actual object (background), leading to false associations. To solve this, the author builds on Tracktor++ and proposes a model in which position prediction and feature association enhance each other: by learning attention over the actual target and the distractors within the bbox, and through a memory aggregation module, distractors are identified and ignored when generating the actual target's embedding

[CVPR 2022] MeMOT: Multi-Object Tracking with Memory

A spatio-temporal memory module stores the embedding of each track over the past T frames, and a memory encoding module fuses the stored features: the short-term module performs cross-attention over the last $T_s$ frame embeddings; the long-term module uses the previous frame's memory-encoder output as Q and performs cross-attention over the last $T_l$ frame embeddings; a fusion module concatenates the outputs of the long- and short-term modules and applies self-attention, producing the current frame's memory-encoder output $Q_{tck}$

3. Cost function

3.1 IOU class

IOU is the most commonly used measure when matching bboxes; as a cost it is generally taken as $1-IOU$.

In addition, GIOU, DIOU, and CIOU are used in place of IOU in some source code.

[WACV 2023] C-BIoU: Hard to Track Objects with Irregular Motions and Similar Appearances? Make It Easier by Buffering the Matching Space

During matching, the original track and detection bboxes are proportionally enlarged, controlled by an expansion factor b, before the IOU is computed; the author calls this BIOU. With this setting, precise motion estimation can be omitted: simply enlarging the search range works better for tracks that have not been updated for a long time, since their motion uncertainty is then very large.
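A sketch of BIoU as I read the paper (the buffer factor `buf` enlarges each box around its center before the ordinary IoU; `iou()` is the helper from the IOUTracker sketch in Section 1.1.1):

```python
def buffered_iou(a, b, buf=0.3):
    """IoU after symmetrically enlarging both boxes by a buffer factor."""
    def expand(box, r):
        x1, y1, x2, y2 = box
        dw, dh = (x2 - x1) * r, (y2 - y1) * r   # extra width / height
        return (x1 - dw / 2, y1 - dh / 2, x2 + dw / 2, y2 + dh / 2)
    return iou(expand(a, buf), expand(b, buf))  # iou() defined earlier
```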


3.2 Distance class

The most commonly used distance functions are the Euclidean distance, the cosine distance, and the Mahalanobis distance (see the sketch after this list):

  • Euclidean distance, the intuitive distance in space:
    $d(x,y)=\sqrt{\sum_{i=1}^n(x_i-y_i)^2}$
  • Cosine distance, similarity in direction:
    $d(x,y)=\frac{\boldsymbol{x}\cdot\boldsymbol{y}}{\Vert\boldsymbol{x}\Vert\,\Vert\boldsymbol{y}\Vert}$
  • Mahalanobis distance, a spatial distance that accounts for the variable distribution, used early on in DeepSORT:
    $d(x,y)=\sqrt{(\boldsymbol{x}-\boldsymbol{y})^T\boldsymbol{S}^{-1}(\boldsymbol{x}-\boldsymbol{y})}$
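The three distances in numpy (a sketch; `S` is the covariance matrix of the reference distribution, and $1-$cosine is what is usually used as a cost):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def cosine(x, y):
    # similarity in direction; 1 - cosine(x, y) is the usual cost
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def mahalanobis(x, y, S):
    d = x - y
    return np.sqrt(d @ np.linalg.inv(S) @ d)
```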

In addition, there are some other distance functions, such as

[2021]TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking

When computing the correlation between a track and a det, the author defines a normalized distance:

$$D_{top}=\frac{\Vert(x_i+\frac{w_i}{2}-x_j-\frac{w_j}{2},\ y_i-y_j)\Vert}{h_i}$$

The quantities most often computed are the center-point distance, the embedding distance, and the velocity similarity.

  • Center-point distance: mostly computed as a Euclidean distance. When associating a det with a track, a circle centered at the det's or track's center point with radius $k$ is typically drawn, and greedy matching is performed within the circle, with
    $k=\sqrt{wh}$, as in CenterTrack, TraDes, PermaTrack, GIAOTracker; or
    $k=\alpha v\Delta t$, as in [CVPR 2021] Improving Multiple Pedestrian Tracking by Track Management and Occlusion Handling.

  • Velocity similarity: mainly the consistency of direction, computed with the cosine distance, e.g.

[CVPR 2022] OC-SORT: Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking

A weighted sum of IOU and velocity similarity is used as the cost function

  • Embedding distance: mostly computed with the Euclidean or cosine distance.

In addition, there are other ways to calculate the embedding distance. For example,
the author of QDTrack uses a bi-directional softmax to compute the embedding similarity, where there are N dets and M tracks:

$$f(i,j)=\frac{1}{2}\left[\frac{\exp(\boldsymbol{n}_i\cdot\boldsymbol{m}_j)}{\sum_{k=0}^{M-1}\exp(\boldsymbol{n}_i\cdot\boldsymbol{m}_k)}+\frac{\exp(\boldsymbol{n}_i\cdot\boldsymbol{m}_j)}{\sum_{k=0}^{N-1}\exp(\boldsymbol{n}_k\cdot\boldsymbol{m}_j)}\right]$$
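A numpy sketch of this bi-directional softmax (illustrative; `n` holds det embeddings, `m` holds track embeddings):

```python
import numpy as np

def bi_softmax(n, m):
    """n: (N, d) det embeddings; m: (M, d) track embeddings -> (N, M)."""
    logits = n @ m.T
    logits = logits - logits.max()   # global shift, for numerical stability
    e = np.exp(logits)
    over_tracks = e / e.sum(axis=1, keepdims=True)  # softmax over tracks
    over_dets = e / e.sum(axis=0, keepdims=True)    # softmax over dets
    return (over_tracks + over_dets) / 2.0
```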

Generally speaking, beyond using a single measure alone, most methods use a weighted sum of IOU and a distance, or of several distances, to take multiple factors into account.
Besides the sum form, there is also a minimum-value form, e.g.

[2022] BoT-SORT: Robust Associations Multi-Pedestrian Tracking

When the IOU distance and the embedding distance are both below their thresholds, the minimum of the two is taken:

$$f(i,j)=\min(IOU,d_{cos}(\cdot))$$

In addition, some methods, like DeepSORT and DEFT, use motion information to veto unreasonable appearance distances.

3.3 Time

GIAOTracker's cost function also takes the time difference into account, mainly the number of frames spanned between the track's last update and the detection.

3.4 Correlation, similarity, affinity

Generally obtained through network inference.

[TPAMI 2019] DAN: Deep Affinity Network for Multiple Object Tracking

The features of the current frame, extracted by a feature extractor, are fed together with the previous frame's features into an affinity estimator that estimates the affinity matrix between the two frames. Its shape is $(N_m+1)\times(N_m+1)$, where $N_m$ is the maximum number of targets that can appear in a frame (an adjustable parameter); the +1 accounts for unmatched detections (indicating new tracks) and unmatched tracks (indicating track disappearance).
The [2021] DEFT mentioned above also adopts DAN's idea and improves it: the (temporally) forward and backward affinities are each computed once and averaged to give the affinity matrix S, and a score matrix X indicating tracks unmatched to detections is concatenated on the right to form the cost matrix.

[CVPR 2021] CorrTracker/TLR: Multiple Object Tracking with Correlation Learning

To address the problem that appearance information alone cannot disambiguate multiple similar regions in an image, the author takes FairMOT as the baseline and adds a spatial local correlation layer before data association: feature similarities are computed only between the target and the coordinates around it (within a radius R), and Hungarian matching is performed on the resulting similarity matrix (correlation volume):

$$C^l(F_q,F_r,x,d)=F^l_q(x)^T F^l_r(x+d),\quad \Vert d\Vert_\infty\le R$$

The author also considers the embedding similarity between the current target and surrounding targets when generating the embedding:

$$F_c^l=F_t^l+\mathrm{MLP}^l(C^l(F_t^l,F_t^l))$$

(My understanding: when generating the embedding, the similarity between the target and surrounding targets is computed and folded into the embedding to make the current target more discriminable from its neighbors; the more similar the appearances, the higher the correlation, and adding the correlation increases the discrimination. Meanwhile, the spatial local correlation layer means each det only searches its own neighborhood, avoiding false associations with similar but distant targets.)

[CVPR 2021] MTP: Discriminative Appearance Modeling with Multi-track Pooling for Real-time Multi-object Tracking

The track's historical appearance features are fused by a shared Bilinear LSTM whose hidden-layer features are $h_{t-1}$; in addition, a binary classification network, the Track Proposal Classifier, outputs the association probability for a detection-track pair $(d_t,T_1(t-1))$:

$$f(d_t,T_1(t-1);\theta)=p(d_t\in T_1(t)\mid T_1(t-1))$$

[ECCV 2022] AiATrack: Attention in Attention for Transformer Visual Tracking

The first self-attention layer may learn wrong attention (spatially unreasonable, yet similar in features), so the author adds another attention layer on top of the attention correlation map and uses vector consistency to judge whether each earlier association is reasonable, effectively suppressing noise. (In the paper's figure, red lines mark the wrong attention associations.)

[CVPR2022] MeMOT: Multi-Object Tracking with Memory

With deformable DETR as the baseline, the proposal queries $Q_{pro}$ and the track queries $Q_{trk}$ mentioned above are concatenated into the decoder query $Q$, which performs cross-attention over the current frame's image features and outputs bboxes with two confidences each (an occlusion confidence o and a detection confidence u), the total confidence being their product. After filtering by confidence, outputs from $Q_{trk}$ inherit their ids, while outputs from $Q_{pro}$ are assigned new ids.

4. Association method

(1) Hungarian algorithm

The most used matching method in MOT; a minimal example follows.
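A sketch using scipy (the gating threshold `max_cost` is an assumed value for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost, max_cost=0.7):
    """Optimal assignment on a (tracks x dets) cost matrix, with gating."""
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]

cost = np.array([[0.1, 0.9], [0.8, 0.2], [0.95, 0.9]])
print(hungarian_match(cost))   # [(0, 0), (1, 1)]; track 2 stays unmatched
```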

(2) Greedy algorithm

The second most used matching method in MOT. Methods that compute center-point distances generally use this association, as mentioned in the previous section; a sketch follows.
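A radius-gated greedy association in the CenterTrack style (my own simplification; the gating radius uses $k=\sqrt{wh}$ of the track's predicted box, one of the choices listed in Section 3.2):

```python
import numpy as np

def greedy_associate(det_centers, det_scores, track_centers, track_whs):
    """Greedy center-distance matching, highest-confidence detections first."""
    det_centers = np.asarray(det_centers, dtype=float)
    track_centers = np.asarray(track_centers, dtype=float)
    order = np.argsort(-np.asarray(det_scores))
    free = set(range(len(track_centers)))
    matches = []
    for d in order:
        if not free:
            break
        t = min(free, key=lambda j: np.linalg.norm(det_centers[d] - track_centers[j]))
        k = np.sqrt(track_whs[t][0] * track_whs[t][1])   # gating radius
        if np.linalg.norm(det_centers[d] - track_centers[t]) <= k:
            matches.append((t, d))
            free.remove(t)
    return matches   # dets with no track within the radius start new tracks
```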

(3) Nearest neighbor search

So far, among the methods covered here, only QDTrack uses this association method.

(4) Network reasoning

These are mostly implicit associations, as in CNN-based methods like Tracktor++ and transformer-based methods like TrackFormer: the previous frame's tracked targets serve as priors (in the form of proposals or track queries), and the targets obtained after inference are directly associated with those priors (id inheritance).

[CVPR 2019] DeepMOT: How To Train Your Deep Multi-Object Tracker

The author builds an end-to-end network that directly uses MOTA, MOTP, and other metrics as loss functions. To this end, the Hungarian algorithm is folded into an RNN to form the DHN (Deep Hungarian Net), making it a differentiable function. The differentiable form of the metrics is interesting and is covered in the next section.

[CVPR 2020] FFT: Multiple Object Tracking by Flowing and Fusing

Uses the FuseTracker module to associate based on IOU and detection confidence. With Faster RCNN as the baseline, both the tracks propagated by the optical-flow estimation network and the detections are used as proposals; after two rounds of NMS, track ids are inherited by the matched dets, and dets without an inherited id become new tracks.

5. Other strategies

  • [CVPR 2019] DeepMOT: How To Train Your Deep Multi-Object Tracker

To make MOTA and MOTP differentiable, the author expresses FP, FN, and IDS as follows, where $\widetilde A$ is the soft assignment matrix output by the differentiable Hungarian network (N×M, associating N dets with M tracks). Append a column filled with the threshold $\delta$ to the matrix and apply a row-wise softmax; the sum of the last column then gives the soft FP count. For example, the differentiable Hungarian matching may produce the assignments $d_1\&t_3,\ d_2\&t_2,\ d_3\&t_1$, but if the $d_3$-$t_1$ association is weak (its score falls below the threshold $\delta$), $d_3$ can be regarded as a false detection, i.e. an FP. Similarly, append a row of $\delta$ and apply a column-wise softmax; the sum of the last row then gives the soft FN count ($t_3$ is regarded as a missed detection, i.e. an FN). For IDS, the previous frame's matching result is used as a mask over the current frame's matching result, and the number of changed matches is counted.

$$dMOTA=1-\frac{\widetilde{FN}+\widetilde{FP}+\gamma\widetilde{IDS}}{M}$$

$$dMOTP=1-\frac{\Vert D\bigodot B^{TP}\Vert_1}{\Vert B^{TP}\Vert_0}$$

where $D$ is a distance matrix combining the center-point distance, IOU distance, and appearance distance, and $B^{TP}$ is the current frame's matching result. The DeepMOT loss function is then defined as:

$$L_{DeepMOT}=(1-dMOTA)+\lambda(1-dMOTP)$$

  • [CVPR 2021] Improving Multiple Pedestrian Tracking by Track Management and Occlusion Handling

A boundary idea is proposed: when a track lies outside the image boundary and its velocity points outward, it is deleted so it cannot be associated with a new target; a simple and effective idea.
In addition, the author proposes a bidirectional tracking strategy for the offline setting: track once forward in time and once backward, then merge the two results, similar to DEFT (see Section 3.4).

  • [ECCV 2022] ByteTrack: Multi-Object Tracking by Associating Every Detection Box

Adhering to the idea that "whatever exists is reasonable", detection boxes are split by confidence into high-score and low-score sets: high-score detections are matched to the tracks first, and low-score detections are then matched to the remaining unmatched tracks. This, too, is a simple and effective idea, and many later algorithms incorporate it; a sketch follows.
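A schematic of this two-stage association (my own sketch, not the official ByteTrack code; `match_fn` stands for any matcher, e.g. IoU plus Hungarian, assumed to return the matches and the indices of still-unmatched tracks):

```python
def byte_associate(tracks, dets, scores, match_fn, tau_high=0.6):
    """BYTE-style two-stage association; match_fn(tracks, dets) -> (matches, unmatched_track_ids)."""
    high = [d for d, s in zip(dets, scores) if s >= tau_high]
    low = [d for d, s in zip(dets, scores) if s < tau_high]
    # Stage 1: high-score detections vs. all tracks.
    matches1, leftover = match_fn(tracks, high)
    # Stage 2: low-score detections vs. the still-unmatched tracks only
    # (indices in matches2 are relative to the leftover list).
    matches2, _ = match_fn([tracks[t] for t in leftover], low)
    return matches1, matches2   # unmatched high-score dets start new tracks
```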

  • Interpolation methods:

| Interpolation method | Examples |
| --- | --- |
| Linear interpolation | OC-SORT, BoT-SORT, GIAOTracker |
| Gaussian-smoothed interpolation | StrongSORT |
| Bidirectional KF interpolation | MAT |
  • Camera motion compensation [details to be added]: used by Tracktor++, GIAOTracker, MAT, Detecting Invisible People, and Modeling Ambiguous Assignments for Multi-Person Tracking in Crowds.
    Methods: Enhanced Correlation Coefficient (ECC) maximization, ORB feature matching; a sketch of the ECC variant follows.
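A sketch of ECC-based compensation using OpenCV (assuming OpenCV >= 4.1, whose `findTransformECC` takes the extra `inputMask` and `gaussFiltSize` arguments):

```python
import cv2
import numpy as np

def estimate_camera_motion(prev_gray, cur_gray):
    """Estimate a Euclidean warp between consecutive grayscale frames via ECC."""
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 100, 1e-5)
    _, warp = cv2.findTransformECC(prev_gray, cur_gray, warp,
                                   cv2.MOTION_EUCLIDEAN, criteria, None, 5)
    return warp

def compensate(center, warp):
    """Move a track's (x, y) center into the current frame's coordinates."""
    x, y = center
    return warp @ np.array([x, y, 1.0], dtype=np.float32)
```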

6. Runtime

The tracking metrics mainly come from paperswithcode and are generally the best results reported in each paper (private detectors outperform public-detector results). The inference speeds mainly come from the papers or the MOT Challenge; since inference speed is heavily affected by the experimental environment, the experimental hardware (as mentioned in each paper) is listed as well. For now only the two popular datasets MOT17 and MOT20 are included; datasets such as DanceTrack will be added later.

  • MOT17

| Model | MOTA | IDF1 | HOTA | FPS | Hardware | Remark |
| --- | --- | --- | --- | --- | --- | --- |
| SMILEtrack | 81.06 | 80.5 | 65.24 | - | - | |
| BoT-SORT | 80.6 | 79.5 | 64.6 | 6.6 | RTX 3060 desktop | |
| BoT-SORT-ReID | 80.5 | 80.2 | 65 | 4.5 | RTX 3060 desktop | |
| ByteTrack | 80.3 | 77.3 | 63.1 | 29.6 | V100 | Time includes YOLOX |
| StrongSORT | 79.6 | 79.5 | 64.4 | 7.1 | V100 | |
| ppbytemot | 79.4 | 76.7 | 62.9 | 2219.6 | - | Pure tracking time of the PaddlePaddle ByteTrack |
| OC-SORT | 78.0 | 77.5 | 63.2 | 29.0 | RTX 2080 | Time includes YOLOX; pure tracking 700+ fps |
| Unicorn | 77.2 | 75.5 | 61.7 | - | - | Training requires 16 A100s |
| FCG | 76.7 | 77.7 | 62.6 | - | - | |
| TransMOT/STGT | 76.7 | 75.1 | 61.7 | 9.6 | V100 | |
| SGT | 76.4 | 72.8 | 60.8 | 23.0 | V100 | |
| TransCenter | 76.4 | 65.4 | - | 1.0 | V100 | |
| SimpleTrack | 75.3 | 76.3 | 61.6 | 22.53 | RTX 2080Ti | Training requires 4 NVIDIA TITAN RTX, about 25 h |
| GTR | 75.3 | 71.5 | 59.1 | 19.6 | Titan Xp | Training requires 8 Quadro RTX 6000 |
| TrackFormer | 74.1 | 68 | - | 7.4 | - | Training requires 7 32-GB GPUs for 2 days |
| FairMOT | 73.7 | 72.3 | 59.3 | 25.9 | RTX 2080Ti | |
| OUTrack_fm | 73.5 | 70.2 | - | 25.4 | TITAN XP | |
| LMOT | 72.0 | 70.3 | - | 28.6 | RTX 2070 Mobile | |
| TraDeS | 69.1 | 63.9 | - | 17.5 | RTX 2080Ti | |
| QDTrack | 68.7 | 66.3 | - | 20.3 | V100 | |
| MPNTrack | 58.8 | 61.7 | - | 6.5 | - | |
| Tracktor++v2 | 56.3 | 55.1 | 44.8 | 1.5 | Titan X | |
  • MOT20

| Model | MOTA | IDF1 | HOTA | FPS | Hardware | Remark |
| --- | --- | --- | --- | --- | --- | --- |
| SMILEtrack | 81.06 | 80.5 | 65.24 | - | - | |
| BoT-SORT | 77.8 | 76.3 | 62.6 | 6.6 | RTX 3060 desktop | |
| BoT-SORT-ReID | 77.8 | 77.5 | 63.3 | 2.4 | RTX 3060 desktop | |
| ByteTrack | 77.8 | 75.2 | 61.3 | 17.5 | V100 | Time includes YOLOX |
| TransMOT/STGT | 77.5 | 75.2 | - | 9.6 | V100 | |
| OC-SORT | 75.7 | 76.3 | 62.4 | 18.7 | RTX 2080 | Time includes YOLOX; pure tracking 700+ fps |
| StrongSORT | 73.8 | 77 | 62.6 | 1.4 | V100 | |
| SGT | 72.8 | 70.6 | 57 | 19.9 | V100 | |
| SimpleTrack | 72.6 | 70.2 | 57.6 | 7 | RTX 2080Ti | Training requires 4 NVIDIA TITAN RTX, about 25 h |
| TransCenter | 72.4 | 57.9 | - | 8.4 | V100 | |
| TrackFormer | 68.6 | 65.7 | 54.7 | 7.4 | - | |
| OUTrack_fm | 68.5 | 69.4 | - | 12.4 | TITAN XP | |
| FCG | 68 | 69.7 | 57.3 | - | - | |
| FairMOT | 61.8 | 67.3 | 54.6 | 13.2 | RTX 2080Ti | |
| LMOT | 59.1 | 61.1 | - | 28.6 | RTX 2070 Mobile | |
| MPNTrack | 57.6 | 59.1 | - | - | - | |
| Tracktor++v2 | 52.6 | 52.7 | 42.1 | 1.2 | Titan X | |

Postscript

This is a simple summary of recent work. Many things were more or less understood when I first read them but forgotten by the time I came to summarize them, so I hope to record some of the content this way, for my own convenience and others'. It also commemorates my first time typing formulas in LaTeX.

This summary is not written in an "object-oriented" way, mainly because there are already many interpretations and quick reviews of individual papers; although those help in understanding specific methods, they are not organized enough for newcomers to MOT. I therefore take a "process-oriented" approach, hoping to start from the general MOT pipeline and introduce the methods that have appeared at each stage and how they evolved.

Even so, many details are still missing. For methods I do not understand well, such as graph convolutional networks, Siamese networks, and SOT, I mention them only briefly or not at all, so as not to mislead. Some content has no obvious distinguishing feature, or I was unsure where to put it, so it may be missing as well; I hope to complete it in the future.

Finally, I have not worked on MOT for long, and misunderstandings are unavoidable. Please point out any mistakes in the text, and I will correct them humbly!
