Image Tracking - MOTR: End-to-End Multiple-Object Tracking with Transformer (ECCV 2022)

Disclaimer: This translation is only a personal study record


Abstract

  Temporal modeling of objects is a key challenge in multiple object tracking (MOT). Existing methods associate detections for tracking via motion-based and appearance-based similarity heuristics. The post-processing nature of the association prevents end-to-end exploitation of temporal variations in video sequences.

  In this paper, we propose MOTR, which extends DETR [6] and introduces track queries to model the tracked instances throughout a video. Track queries are transferred and updated frame by frame to perform iterative predictions over time. We propose tracklet-aware label assignment to train track queries and nascent object queries. We further propose a temporal aggregation network and collective average loss to enhance temporal relation modeling. Experimental results on DanceTrack show that MOTR significantly outperforms the state-of-the-art method ByteTrack [42] by 6.5% on the HOTA metric. On MOT17, MOTR outperforms our contemporaneous works TrackFormer [18] and TransTrack [29] in terms of association performance. MOTR can serve as a stronger baseline for future research on temporal modeling and Transformer-based trackers. The code is available at https://github.com/megvii-research/MOTR.

Keywords: multi-object tracking, Transformer, end-to-end

1 Introduction

  Multiple Object Tracking (MOT) predicts the trajectories of instances in a continuous image sequence [39, 2]. Most existing methods split the temporal association of MOT into appearance and motion: appearance variance is usually measured by pairwise Re-ID similarity [37, 43], while motion is usually modeled by IoU [4] or Kalman filter [3] heuristics. These methods require similarity-based matching as post-processing, which becomes a bottleneck for the flow of temporal information across frames. In this paper, we aim to introduce a fully end-to-end MOT framework featuring joint motion and appearance modeling.

  Recently, DETR [6, 45] was proposed for end-to-end object detection. It formulates object detection as a set prediction problem. As shown in Figure 1(a), an object query, a decoupled representation of an object, is fed into a Transformer decoder and interacts with image features to update its representation. Bipartite matching is further adopted to achieve a one-to-one assignment between object queries and the ground truth, eliminating post-processing such as NMS. Different from object detection, MOT can be viewed as a sequence prediction problem. How to perform sequence prediction within an end-to-end DETR-style system remains an open problem.


Figure 1: (a) DETR achieves end-to-end detection by interacting object queries with image features and performing a one-to-one assignment between the updated queries and objects. (b) MOTR performs sequence-set prediction by updating track queries. Each track query represents a complete track. Best viewed in color.

  Iterative prediction is popular in machine translation [30, 31]. The output context is represented by a hidden state, and sentence features iteratively interact with the hidden state in the decoder to predict translated words. Inspired by these advances in machine translation, we intuitively view MOT as a problem of sequence-set prediction, since MOT requires a set of object sequences, each corresponding to one object trajectory. Technically, we extend the object queries in DETR to track queries used to predict object sequences. A track query serves as the hidden state of an object track. The representation of the track query is updated in the Transformer decoder and used to iteratively predict the object trajectory, as shown in Figure 1(b). Specifically, track queries are updated via self-attention and cross-attention over frame features. The updated track queries are further used to predict bounding boxes. The trajectory of an object can be obtained from all predictions of one track query across different frames.

  To achieve the above goals, we need to solve two problems: 1) tracking one object through one track query; 2) handling nascent and terminated objects. To address the first problem, we introduce tracklet-aware label assignment (TALA). This means that the predictions of a track query are supervised by a sequence of bounding boxes with the same identity. To address the second problem, we maintain a variable-length track query set. Queries for nascent objects are merged into this set, while queries for terminated objects are dropped. We refer to this process as the entry-and-exit mechanism. In this way, MOTR does not require explicit trajectory association during inference. Furthermore, the iterative updating of track queries enables temporal modeling of both appearance and motion.

  To enhance temporal modeling, we further propose the collective average loss (CAL) and the temporal aggregation network (TAN). With CAL, MOTR takes video clips as input during training, and the parameters of MOTR are updated based on the overall loss computed over the entire video clip. TAN introduces a shortcut for track queries, aggregating the historical information of their previous states through the key-query mechanism in the Transformer.

  MOTR is a simple online tracker. It is easily developed on top of DETR with minor modifications to the label assignment. It is a true end-to-end MOT framework that does not require post-processing such as tracking NMS or IoU matching used in our contemporaneous work TransTrack [29] and TrackFormer [18]. Experimental results on MOT17 and DanceTrack datasets show that MOTR has good performance. On DanceTrack [28], MOTR outperforms the state-of-the-art ByteTrack [42] by 6.5% and 8.1% on the HOTA metric and on the AssA metric, respectively.

In summary, our contributions are as follows:

  • We propose a fully end-to-end MOT framework named MOTR. MOTR can implicitly learn appearance and position changes in a joint manner.

  • We formulate MOT as a problem of predicting sets of sequences. We generate tracking queries from previous hidden states for iterative updating and prediction.

  • We propose tracklet-aware label assignment for the one-to-one assignment between track queries and objects. An entry-and-exit mechanism is introduced to handle nascent and terminated trajectories.

  • We further propose CAL and TAN to enhance temporal modeling.

2. Related work

Transformer-based architectures. Transformer [31] was first introduced to aggregate information from the entire input sequence for machine translation. It mainly involves self-attention and cross-attention mechanisms. Since then, it has been gradually introduced into many fields, such as speech processing [13, 7] and computer vision [34, 5]. Recently, DETR [6] combined a convolutional neural network (CNN), a Transformer, and bipartite matching to perform end-to-end object detection. To achieve fast convergence, Deformable DETR [45] introduces deformable attention modules into the Transformer encoder and decoder. ViT [9] builds a pure Transformer architecture for image classification. In addition, Swin-Transformer [16] proposes a shifted-window scheme that performs self-attention within local windows, leading to higher efficiency. VisTR [36] employs a straightforward end-to-end parallel sequence prediction framework to perform video instance segmentation.

Multi-object tracking. The dominant MOT methods mainly follow the tracking-by-detection paradigm [3, 12, 22, 24, 39]. These methods usually first use an object detector to localize objects in each frame, and then perform association between adjacent frames to generate tracking results. SORT [3] combines a Kalman filter [38] with the Hungarian algorithm [11] for trajectory association. DeepSORT [39] and Tracktor [2] introduce an additional cosine distance to compute the appearance similarity for track association. Track R-CNN [26], JDE [37] and FairMOT [43] further add a Re-ID branch on top of the object detector in a joint training framework, combining object detection and Re-ID feature learning. TransMOT [8] builds a spatio-temporal graph transformer for association. Our contemporaneous works, TransTrack [29] and TrackFormer [18], also develop Transformer-based frameworks for MOT. See Section 3.7 for a direct comparison with them.

Iterative sequence prediction. Sequence-to-sequence (seq2seq) prediction with an encoder-decoder architecture is popular in machine translation [30, 31] and text recognition [25]. In the seq2seq framework, an encoder network encodes the input into an intermediate representation. A hidden state carrying task-specific contextual information is then introduced and iteratively interacts with the intermediate representation to generate the target sequence through the decoder network. The iterative decoding process involves multiple iterations; at each iteration, the hidden state decodes one element of the target sequence.

3. Method

3.1 Queries in object detection

DETR [6] introduces a set of fixed-length object queries to detect objects. The object queries are fed to the Transformer decoder and interact with the image features extracted by the Transformer encoder to update their representations. Bipartite matching is further employed to achieve a one-to-one assignment between the updated object queries and the ground truth. Here, we refer to the object query as the "detection query" to emphasize that it is the query used for object detection.

3.2 Detection query and tracking query

When applying DETR from object detection to MOT, two main issues arise: 1) how to track one object through one track query; 2) how to deal with nascent and terminated objects. In this paper, we extend detection queries to track queries. The track query set is dynamically updated and variable in length. As shown in Figure 2, the track query set is initialized to be empty, and the detection queries in DETR are used to detect nascent objects (object 3 at T2). The hidden state of a detected object yields a track query for the next frame, while the track query assigned to a terminated object is removed from the track query set (object 2 at T4).

3.3 Tracklet-Aware Label Assignment

In DETR, a detection (object) query can be assigned to any object in the image, since the label assignment is determined by bipartite matching between all detection queries and the ground truth. In MOTR, however, the detection queries are only used to detect nascent objects, while the track queries predict all tracked objects. Here, we introduce tracklet-aware label assignment (TALA) to handle this setting.

  In general, TALA includes two strategies. For detection queries, we modify the assignment strategy in DETR to cover only nascent objects: bipartite matching is performed between the detection queries and the ground truth of nascent objects. For track queries, we design a target-consistent assignment strategy: track queries follow the same assignment as in previous frames and are therefore excluded from the bipartite matching above.


Figure 2: The update process of detection (object) queries and track queries in some typical MOT situations. The track query set is dynamically updated and variable in length. It is initialized to be empty, and the detection queries are used to detect nascent objects. The hidden states of all detected objects are concatenated to generate the track queries for the next frame. Track queries assigned to terminated objects are removed from the track query set.

  Formally, we denote the predictions of the track queries as $\widehat{Y}_{tr}$ and the predictions of the detection queries as $\widehat{Y}_{det}$, while $Y_{new}$ is the ground truth of the nascent objects. The label assignment results for the track queries and the detection queries are written as $\omega_{tr}$ and $\omega_{det}$, respectively. For frame $i$, the label assignment for the detection queries is obtained from the bipartite matching between the detection queries and the nascent objects, i.e.,

$$\omega_{det}^{i} = \arg\min_{\omega_{det}^{i} \in \Omega_i} \mathcal{L}\big(\widehat{Y}_{det}^{i}\big|_{\omega_{det}^{i}},\, Y_{new}^{i}\big), \tag{1}$$

where $\mathcal{L}$ is the pairwise matching cost defined in DETR and $\Omega_i$ is the space of all bipartite matches between the detection queries and the nascent objects. For the track query assignment, we merge the assignments of the nascent objects and the tracked objects from the previous frame, i.e., for $i > 1$,

$$\omega_{tr}^{i} = \omega_{tr}^{i-1} \cup \omega_{det}^{i-1}. \tag{2}$$

For the first frame $(i = 1)$, there are no tracked objects, so the track query assignment $\omega_{tr}^{1}$ is an empty set $\varnothing$. For subsequent frames $(i > 1)$, the track query assignment $\omega_{tr}^{i}$ is the concatenation of the previous track query assignment $\omega_{tr}^{i-1}$ and the assignment of nascent objects $\omega_{det}^{i-1}$.

  In practice, the TALA strategy is simple and effective thanks to the powerful attention mechanism in the Transformer. For each frame, the detection queries and track queries are concatenated and fed into the Transformer decoder to update their representations. Detection queries will only detect nascent objects, since the query interaction via self-attention in the Transformer decoder suppresses detection queries that detect tracked objects. This mechanism is similar to duplicate removal in DETR, i.e., suppressing duplicated boxes with low scores.
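
To make the two assignment rules concrete, here is a minimal sketch (not the official implementation; the function name, its arguments, and the toy identities are illustrative). It uses the Hungarian algorithm from SciPy for the detection-query matching of Eq. 1 and plain concatenation for the track-query assignment of Eq. 2:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def tala_assign(det_cost, prev_track_assignment, newborn_ids):
    """Tracklet-aware label assignment (sketch).

    det_cost: (num_det_queries, num_newborn) pairwise matching cost between
        detection queries and *nascent* ground-truth objects only (e.g. the DETR
        cost combining classification, L1 and GIoU terms).
    prev_track_assignment: ground-truth identities already bound to the track
        queries carried over from the previous frame (omega_tr of Eq. 2).
    newborn_ids: identities of the nascent ground-truth objects in this frame.
    """
    # Detection queries: one-to-one bipartite matching, nascent objects only (Eq. 1).
    rows, cols = linear_sum_assignment(det_cost)
    det_assignment = {int(q): newborn_ids[int(g)] for q, g in zip(rows, cols)}

    # Track queries inherit their previous assignment unchanged; the next frame's
    # track-query assignment is the concatenation of both (Eq. 2).
    next_track_assignment = list(prev_track_assignment) + list(det_assignment.values())
    return det_assignment, next_track_assignment

# Toy usage: two detection queries, one nascent object, one already-tracked identity.
cost = np.array([[0.9], [0.2]])
det_assign, next_tr = tala_assign(cost, prev_track_assignment=["id_7"], newborn_ids=["id_9"])
```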


Figure 3: The overall architecture of MOTR. "Enc" denotes the convolutional neural network backbone and the Transformer encoder that extract image features for each frame. The concatenation of the detection queries $q_d$ and the track queries $q_{tr}$ is fed to a Deformable DETR decoder (Dec) to produce hidden states. The hidden states are used to generate the predictions $\widehat{Y}$ for nascent and tracked objects. The Query Interaction Module (QIM) takes the hidden states as input and generates the track queries for the next frame.

3.4 MOTR Architecture

  The overall architecture of MOTR is shown in Figure 3. Video sequences are fed into a Convolutional Neural Network (CNN) such as ResNet-50 [10] and a Deformable DETR [45] encoder to extract frame features.

  For the first frame, there are no track queries, so we only feed the fixed-length learnable detection queries ($q_d$ in Figure 3) into the Deformable DETR [45] decoder. For subsequent frames, we feed the concatenation of the track queries from the previous frame and the learnable detection queries into the decoder. These queries interact with the image features in the decoder to generate hidden states for bounding box prediction. The hidden states are also fed into the Query Interaction Module (QIM) to generate the track queries for the next frame.

  During the training phase, labels are assigned to each frame as described in Section 3.3. All predictions of a video clip are gathered into a prediction bank $\{\widehat{Y}_1, \widehat{Y}_2, \ldots, \widehat{Y}_N\}$, and we use the proposed collective average loss (CAL) described in Section 3.6 for supervision. At inference time, the video stream can be processed online, producing a prediction for each frame.
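
The clip-level processing described above can be sketched as follows. This is a schematic, not the released implementation: the `encoder`, `decoder`, and `qim` interfaces, as well as the query dimensions, are assumptions made for illustration.

```python
import torch
from torch import nn

class MOTRSketch(nn.Module):
    """Schematic MOTR forward pass over a video clip (assumed module interfaces)."""

    def __init__(self, encoder, decoder, qim, num_det_queries=300, dim=256):
        super().__init__()
        self.encoder, self.decoder, self.qim = encoder, decoder, qim
        # Fixed-length, learnable detection queries q_d.
        self.det_queries = nn.Embedding(num_det_queries, dim)

    def forward(self, clip):
        """clip: list of frame tensors; returns one set of hidden states per frame."""
        device = self.det_queries.weight.device
        track_queries = torch.zeros(0, self.det_queries.embedding_dim, device=device)
        predictions = []
        for frame in clip:
            memory = self.encoder(frame)                      # CNN backbone + Transformer encoder
            queries = torch.cat([track_queries, self.det_queries.weight], dim=0)
            hidden = self.decoder(queries, memory)            # hidden states for all queries
            predictions.append(hidden)                        # boxes/scores are decoded from hidden
            track_queries = self.qim(hidden)                  # entrance/exit filtering + TAN
        return predictions                                    # supervised jointly by CAL
```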

3.5 Query Interaction Module

In this section, we describe the Query Interaction Module (QIM). QIM includes a target entry and exit mechanism and a temporal aggregation network.

Target entry and exit. As mentioned above, some objects in a video sequence may appear or disappear in intermediate frames. Here we describe how nascent and terminated objects are handled in our approach. For any frame, the track queries are concatenated with the detection queries and fed into the Transformer decoder, producing the hidden states (see the left side of Figure 4).


Figure 4: Structure of the Query Interaction Module (QIM). The inputs to QIM are the hidden states and the corresponding prediction scores produced by the Transformer decoder. During the inference phase, we keep nascent objects and discard terminated objects based on the confidence scores. The temporal aggregation network (TAN) enhances long-term temporal modeling.

  During training, the hidden states of terminated objects are removed, either when the matched objects disappear from the ground truth or when the intersection-over-union (IoU) between the predicted bounding box and the object falls below a threshold of 0.5. That is, the corresponding hidden states are filtered out if their objects disappear in the current frame, while the remaining hidden states are kept. For nascent objects, the corresponding hidden states are kept according to the nascent-object assignment $\omega^i_{det}$ defined in Equation 1.

  For inference, we use the predicted classification scores to determine when nascent objects appear and when tracked objects disappear, as shown in Figure 4. For detection queries, we keep the predictions whose classification scores are above the entrance threshold $\tau_{en}$ and remove the other hidden states. For track queries, we remove the predictions whose classification scores stay below the exit threshold $\tau_{ex}$ for $M$ consecutive frames, while keeping the other hidden states.
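
A minimal sketch of this inference-time rule follows. The threshold values, the `max_miss` count, and the tensor layout are illustrative assumptions; the bookkeeping in the released code may differ.

```python
import torch

def qim_entrance_exit(hidden, scores, is_track_query, miss_counts,
                      tau_en=0.8, tau_ex=0.5, max_miss=5):
    """Keep confident nascent objects and drop long-missed tracks (sketch).

    hidden:         (N, C) hidden states from the Transformer decoder
    scores:         (N,)   predicted classification confidences
    is_track_query: (N,)   bool mask, True for track queries, False for detection queries
    miss_counts:    (N,)   consecutive low-score frame counts carried over per query
    """
    # Entrance: a detection query becomes a new track if its score exceeds tau_en.
    keep_newborn = (~is_track_query) & (scores > tau_en)

    # Exit: a track query is dropped once its score stays below tau_ex
    # for max_miss consecutive frames.
    missed = is_track_query & (scores < tau_ex)
    miss_counts = torch.where(missed, miss_counts + 1, torch.zeros_like(miss_counts))
    keep_tracked = is_track_query & (miss_counts < max_miss)

    keep = keep_newborn | keep_tracked
    return hidden[keep], miss_counts[keep]
```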

Temporal aggregation network. Here, we introduce the temporal aggregation network (TAN) in QIM to enhance temporal relation modeling and provide contextual priors for tracked objects.

  As shown in Figure 4, the input to TAN is the filtered hidden states of the tracked objects (object "1"). We also collect the track queries $q^i_{tr}$ of the last frame for temporal aggregation. TAN is a modified Transformer decoder layer. The track queries of the last frame and the filtered hidden states are summed to form the key and query components of multi-head self-attention (MHA), while the hidden states alone serve as the value component. After the MHA, we apply a feed-forward network (FFN) and concatenate the result with the hidden states of the nascent objects (object "3") to produce the track query set $q^{i+1}_{tr}$ for the next frame.
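
A PyTorch sketch of such a layer is given below. The feature dimension, number of heads, FFN size, and normalization placement are illustrative assumptions rather than the exact released configuration.

```python
import torch
from torch import nn

class TemporalAggregationSketch(nn.Module):
    """TAN-style layer: aggregate track queries with their previous states (sketch)."""

    def __init__(self, dim=256, num_heads=8, ffn_dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, prev_track_query, tracked_hidden, newborn_hidden):
        # prev_track_query, tracked_hidden: (1, num_tracked, dim)
        # newborn_hidden:                   (1, num_newborn, dim)
        qk = prev_track_query + tracked_hidden           # query/key: sum of both
        out, _ = self.attn(qk, qk, tracked_hidden)       # value: the hidden state itself
        out = self.norm1(tracked_hidden + out)
        out = self.norm2(out + self.ffn(out))
        # Next-frame track queries: tracked objects followed by nascent objects.
        return torch.cat([out, newborn_hidden], dim=1)

# Example: 2 tracked objects and 1 nascent object, feature dimension 256.
tan = TemporalAggregationSketch()
next_queries = tan(torch.randn(1, 2, 256), torch.randn(1, 2, 256), torch.randn(1, 1, 256))
```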

3.6 Collective Average Loss

  Training samples are important for temporal modeling of trajectories, since MOTR learns temporal variance from data rather than from hand-crafted heuristics like the Kalman filter. Common training strategies, such as training on two adjacent frames, fail to generate training samples of long-range object motion. In contrast, MOTR takes video clips as input, so that training samples of long-range object motion can be generated for temporal learning.

  Instead of computing the loss frame by frame, our collective average loss (CAL) collects the multiple predictions $\widehat{Y} = \{\widehat{Y}_i\}^N_{i=1}$. The loss is then computed over the whole video sequence using the ground truth $Y = \{Y_i\}^N_{i=1}$ and the matching results $\omega = \{\omega_i\}^N_{i=1}$. CAL is the overall loss of the entire video sequence, normalized by the number of objects:

$$\mathcal{L}_o = \frac{\sum_{i=1}^{N}\Big(\mathcal{L}\big(\widehat{Y}^i_{tr}\big|_{\omega^i_{tr}},\, Y^i_{tr}\big) + \mathcal{L}\big(\widehat{Y}^i_{det}\big|_{\omega^i_{det}},\, Y^i_{det}\big)\Big)}{\sum_{i=1}^{N} V_i} \tag{3}$$

where $V_i = V^i_{tr} + V^i_{det}$ denotes the total number of ground-truth objects at frame $i$, and $V^i_{tr}$ and $V^i_{det}$ are the numbers of tracked objects and nascent objects at frame $i$, respectively. $\mathcal{L}$ is the loss of a single frame, which is similar to the detection loss in DETR. The single-frame loss $\mathcal{L}$ can be formulated as:

$$\mathcal{L}\big(\widehat{Y}_i\big|_{\omega_i},\, Y_i\big) = \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{l_1}\,\mathcal{L}_{l_1} + \lambda_{giou}\,\mathcal{L}_{giou} \tag{4}$$

where $\mathcal{L}_{cls}$ is the focal loss [14], $\mathcal{L}_{l_1}$ is the L1 loss, and $\mathcal{L}_{giou}$ is the generalized IoU loss [21]. $\lambda_{cls}$, $\lambda_{l_1}$ and $\lambda_{giou}$ are the corresponding weight coefficients.
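
The normalization described by Equation 3 amounts to a few lines (a sketch; the per-frame losses are assumed to already combine the tracked and nascent terms with the weights of Equation 4):

```python
def collective_average_loss(frame_losses, objects_per_frame):
    """Collective average loss (sketch).

    frame_losses:      per-frame losses over the clip, each already summing the
                       tracked-object and nascent-object terms.
    objects_per_frame: V_i = V_i^tr + V_i^det, ground-truth object count per frame.
    """
    total = sum(frame_losses)
    denom = max(sum(objects_per_frame), 1)   # guard against clips with no objects
    return total / denom

# Example with scalar losses over a 5-frame clip:
loss = collective_average_loss([2.0, 1.5, 1.8, 2.2, 1.9], [4, 4, 5, 5, 5])
```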

3.7 Discussion

  Built on DETR, our contemporaneous works TransTrack [29] and TrackFormer [18] also develop Transformer-based frameworks for MOT. However, our method differs from them substantially:

  TransTrack models a full trajectory as a composition of several independent short tracklets. Similar to the tracking-by-detection paradigm, TransTrack decouples MOT into two subtasks: 1) detecting object pairs as short tracklets within two adjacent frames; 2) associating the short tracklets into full trajectories via IoU matching. In contrast, MOTR models the complete trajectory in an end-to-end manner through iterative updates of track queries, without any IoU matching.

  TrackFormer shares the idea of track queries with us. However, TrackFormer still learns within two adjacent frames. As mentioned in Section 3.6, such short-range learning leads to relatively weak temporal modeling. Therefore, TrackFormer relies on heuristics, such as track NMS and Re-ID features, to filter out duplicate tracks. Unlike TrackFormer, MOTR learns stronger temporal motion via CAL and TAN, eliminating the need for these heuristics. See Table 1 for a direct comparison with TransTrack and TrackFormer.

  Here, we clarify that we started this work independently, long before TrackFormer and TransTrack appeared on arXiv. Given that they have not been formally published, we regard them as contemporaneous and independent works rather than prior works on which ours is based.

Table 1: Comparison with other Transformer-based MOT methods.


Table 2: Statistics of selected datasets used for evaluation.


4. Experiment

4.1 Datasets and metrics

Datasets. For comprehensive evaluation, we conduct experiments on three datasets: DanceTrack [28], MOT17 [19], and BDD100k [41]. MOT17 [19] contains 7 training sequences and 7 test sequences. DanceTrack [28] is a recent multi-object tracking dataset with uniform appearance and diverse motion. It includes more videos for training and evaluation, providing a better testbed for validating tracking performance. BDD100k [41] is an autonomous-driving dataset whose MOT track annotations cover multiple object classes. For more details, please refer to the dataset statistics shown in Table 2.

Evaluation metrics. We follow the standard evaluation protocols to evaluate our method. The common metrics include the higher-order tracking metrics [17] (HOTA, AssA, DetA), multi-object tracking accuracy (MOTA), identity switches (IDS), and the identity F1 score (IDF1).

4.2 Implementation Details

Following the settings in CenterTrack [44], MOTR adopts several data augmentation methods, such as random flipping and random cropping. The shorter side of the input image is resized to 800, and the maximum size is limited to 1536. At this resolution, the inference speed on a Tesla V100 is about 7.5 FPS. We sample keyframes at random intervals to account for variable frame rates. Furthermore, we erase track queries with probability $p_{drop}$ to generate more training samples for nascent objects, and insert false-positive track queries with probability $p_{insert}$ to simulate terminated objects. All experiments are conducted with PyTorch on 8 NVIDIA Tesla V100 GPUs. We also provide a memory-optimized version that can be trained on NVIDIA 2080 Ti GPUs.
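
A sketch of this augmentation is shown below. The probability values follow those studied in the ablation; how false-positive queries are collected and represented is an assumption made for illustration.

```python
import random

def augment_track_queries(track_queries, false_positive_queries,
                          p_drop=0.1, p_insert=0.3):
    """Erase tracked queries / insert false-positive queries during training (sketch).

    Erasing a track query forces its object to be re-detected as a nascent object;
    inserting a false-positive query from earlier frames simulates an object that
    should exit in the current frame.
    """
    kept = [q for q in track_queries if random.random() >= p_drop]
    inserted = [q for q in false_positive_queries if random.random() < p_insert]
    return kept + inserted
```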

  For fast convergence, we build MOTR on top of Deformable DETR [45] with ResNet-50 [10]. The batch size is set to 1, and each batch contains a video clip of 5 frames. We train our model with the AdamW optimizer and an initial learning rate of $2.0 \times 10^{-4}$. For all datasets, we initialize MOTR with the official Deformable DETR [45] weights pre-trained on the COCO [15] dataset. On MOT17, we train MOTR for 200 epochs, and the learning rate drops by a factor of 10 at the 100th epoch. For the state-of-the-art comparison, we train on a joint dataset (the MOT17 training set and the CrowdHuman [23] validation set). For the ~5k still images in the CrowdHuman validation set, we apply the random shift from [44] to generate video clips with pseudo-trajectories. The video clips have an initial length of 2, which we gradually increase to 3, 4, and 5 at the 50th, 90th, and 150th epochs, respectively. Gradually increasing the clip length improves training efficiency and stability. For the ablation study, we train MOTR on the MOT17 training set without the CrowdHuman dataset and validate on the 2DMOT15 training set. On DanceTrack, we train on the training set for 20 epochs, with the learning rate dropping at the 10th epoch; at the 5th, 9th, and 15th epochs, we gradually increase the clip length from 2 to 3, 4, and 5. On BDD100k, we train on the training set for 20 epochs, with the learning rate decaying at the 16th epoch; at the 6th and 12th epochs, we gradually increase the clip length from 2 to 3 and 4.
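
The progressive clip-length schedule for MOT17 can be expressed as a small helper (a sketch; the epoch breakpoints follow the description above, and DanceTrack/BDD100k use the same idea with their own breakpoints):

```python
def clip_length_at_epoch(epoch, schedule=((0, 2), (50, 3), (90, 4), (150, 5))):
    """Return the video-clip length used at a given training epoch (MOT17 schedule)."""
    length = schedule[0][1]
    for start_epoch, clip_len in schedule:
        if epoch >= start_epoch:
            length = clip_len
    return length

# clip_length_at_epoch(0) == 2, clip_length_at_epoch(120) == 4, clip_length_at_epoch(180) == 5
```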

4.3 Comparison with the State of the Art on MOT17

  Table 3 compares our method with the state-of-the-art methods on the MOT17 test set. We mainly compare MOTR with our Transformer-based contemporaneous work: TrackFormer [18] and TransTrack [29]. Our method achieves higher IDF1 scores, surpassing TransTrack and TrackFormer by 4.5%. MOTR outperforms TransTrack by 3.1% on the HOTA metric. For the MOTA metric, our method achieves better performance than TrackFormer (71.9% vs. 65.0%). Interestingly, we find that TransTrack outperforms our MOTR on MOTA. We hypothesize that the decoupling of detection and tracking branches in TransTrack does improve object detection performance. In MOTR, detection and tracking queries are learned through a shared Transformer decoder. Detection queries are suppressed when detecting tracked objects, limiting detection performance on nascent objects.

  If we compare its performance with other state-of-the-art methods such as ByteTrack [42], we find that MOTR is far inferior to them on the MOT17 dataset. Usually, the state-of-the-art performance on the MOT17 dataset is dominated by trackers with good detection performance for various appearance distributions. Furthermore, different trackers tend to employ different detectors for object detection. It is difficult for us to fairly verify the motion performance of various trackers. Therefore, we believe that the MOT17 dataset alone is insufficient to adequately evaluate the tracking performance of MOTR. We further evaluate the tracking performance on the DanceTrack [28] dataset, which has uniform appearance and varied motion, as described below.

Table 3: Performance comparison between MOTR and existing methods on the MOT17 dataset under the private detection protocol. The best numbers among Transformer-based methods are marked in bold.


4.4 Comparison with the State of the Art on DanceTrack

  Recently, DanceTrack [28], a dataset with uniform appearance and diverse motion, was introduced (see Table 2). It includes more videos for evaluation and provides a better testbed for verifying tracking performance. We further conduct experiments on the DanceTrack dataset and compare with state-of-the-art methods in Table 4. MOTR achieves clearly better performance on the DanceTrack dataset: our method achieves a higher HOTA score, surpassing ByteTrack by 6.5%. For the AssA metric, our method also achieves better performance than ByteTrack (40.2% vs. 32.1%). For the DetA metric, however, MOTR is inferior to some state-of-the-art methods. This means that MOTR performs well in temporal motion learning but is weaker in detection performance. The large improvement in HOTA mainly comes from the temporal aggregation network and the collective average loss.

4.5 Generalization to Multi-Class Scenarios

  Re-ID based methods, such as FairMOT [43], tend to treat each tracked object (e.g., a person) as a class and associate detection results by feature similarity. However, when the number of tracked objects is very large, association becomes difficult. In contrast, in MOTR each object is represented by a track query, and the track query set is of dynamic length. MOTR can easily handle multi-class prediction by simply modifying the number of classes in the classification branch, as sketched below. To verify the performance of MOTR in multi-class scenarios, we further conduct experiments on the BDD100k dataset (see Table 5). The results on the BDD100k validation set show that MOTR performs well in multi-class scenarios and achieves good performance with fewer identity switches.
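
In code, this multi-class extension amounts to widening the classification head; a minimal sketch (assuming a PyTorch-style linear classification head; names are illustrative):

```python
from torch import nn

def build_class_head(hidden_dim=256, num_classes=8):
    """Classification branch applied to each decoder hidden state; moving from
    pedestrian-only MOT17 to multi-class BDD100k only changes num_classes."""
    return nn.Linear(hidden_dim, num_classes)
```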

Table 4: Performance comparison between MOTR and existing methods on the DanceTrack [28] dataset. The results of existing methods are from DanceTrack [28]


Table 5: Performance comparison between MOTR and existing methods on the BDD100k [41] validation set.


4.6 Ablation studies

MOTR components. Table 6a shows the impact of integrating different components. Integrating our components into the baseline improves the overall performance incrementally. Since most objects are treated as entering objects, using only detection queries results in a large number of identity switches. By introducing track queries, the baseline is able to handle track association, improving IDF1 from 1.2 to 49.8. Furthermore, adding TAN to the baseline improves MOTA by 7.8% and IDF1 by 13.6%. When CAL is used in training, MOTA and IDF1 improve by 8.3% and 7.1%, respectively. These results show that TAN combined with CAL enhances temporal motion learning.

Collective average loss. Here we explore the effect of video clip length on tracking performance in CAL. As shown in Table 6b, when the length of the video clip gradually increases from 2 to 5, the MOTA and IDF1 metrics improve by 8.3% and 7.1%, respectively. Multi-frame CAL therefore greatly improves tracking performance. Our explanation is that multi-frame CAL helps the network handle difficult situations such as occlusions: we observe that duplicate boxes, identity switches, and lost objects are significantly reduced in occluded scenes. To verify this, we provide some visualizations in Figure 5.

Table 6: Our proposed MOTR ablation study. All experiments use single-stage C5 features in ResNet50.


Erasing and inserting track queries. In the MOT datasets, two situations have few training samples: objects entering and objects exiting the video sequence. We therefore adopt track query erasing and insertion, with probabilities $p_{drop}$ and $p_{insert}$ respectively, to simulate these two cases. Table 6c reports the performance of using different $p_{drop}$ values during training; MOTR achieves the best performance when $p_{drop}$ is set to 0.1. Similar to entering objects, track queries transferred from previous frames that are predicted as false positives are inserted into the current frame to simulate object exit. In Table 6d, we explore the impact of different $p_{insert}$ values on tracking performance. As $p_{insert}$ gradually increases from 0.1 to 0.7, MOTR achieves the highest MOTA score when $p_{insert}$ is set to 0.3, while the IDF1 score keeps dropping.

Target entry and exit thresholds. Table 6e studies the effect of different combinations of the target entrance threshold $\tau_{en}$ and exit threshold $\tau_{ex}$ in QIM. When we vary the entrance threshold $\tau_{en}$, we can see that the performance is not very sensitive to $\tau_{en}$ (within 0.5% on MOTA), and an entrance threshold of 0.8 yields relatively better performance. We also experiment with varying the exit threshold $\tau_{ex}$. The results show that a threshold of 0.5 yields slightly better performance than 0.6, while in our practice a $\tau_{ex}$ of 0.6 shows better performance on the MOT17 test set.

Sampling interval. In Table 6f, we evaluate the impact of the random sampling interval on tracking performance during training. When the sampling interval increases from 2 to 10, the IDS decreases significantly from 209 to 155. When frames are sampled at small intervals during training, the network can easily get stuck in a local optimum; appropriately increasing the sampling interval better simulates real-world object motion. When the random sampling interval is larger than 10, the tracking framework cannot capture such long-range dynamics, resulting in relatively poor tracking performance.


Figure 5: The effect of CAL on solving (a) the duplicate-box and (b) the ID-switch problems. The top and bottom rows are the tracking results without and with CAL, respectively.

5. Limitations

MOTR is an online tracker that enables end-to-end multi-object tracking. Thanks to the DETR architecture as well as tracklet-aware label assignment, it implicitly learns appearance and position variations in a joint manner. However, it also has several disadvantages. First, the performance of detecting nascent objects is far from satisfactory (results on the MOTA metric are not good enough). As mentioned above, detection queries are suppressed when detecting tracked objects, which may violate the nature of object queries and limit the detection performance on nascent objects. Second, query passing in MOTR is performed frame by frame, limiting the efficiency of model learning during training. In our practice, parallel decoding in VisTR [36] cannot handle complex scenes in MOT. Solving these two problems will be an important research topic of Transformer-based MOT framework.

Acknowledgments: This research was supported by the National Key Research and Development Program (No. 2017YFA0700800) and the Beijing Academy of Artificial Intelligence (BAAI).

References

  1. CodaLab Competition - CVPR 2020 BDD100K Multiple Object Tracking Challenge (Jul 2022), https://competitions.codalab.org/competitions/24910, [Online; accessed 19. Jul. 2022] 12
  2. Bergmann, P., Meinhardt, T., Leal-Taixe, L.: Tracking without bells and whistles. In: ICCV (2019) 1, 3, 11
  3. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: ICIP (2016) 1, 3
  4. Bochinski, E., Eiselein, V., Sikora, T.: High-speed tracking-by-detection without using image information. In: AVSS (2017) 1
  5. Camgoz, N.C., Koller, O., Hadfield, S., Bowden, R.: Sign language transformers: Joint end-to-end sign language recognition and translation. In: CVPR (2020) 3
  6. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV (2020) 1, 3, 4
  7. Chang, X., Zhang, W., Qian, Y., Le Roux, J., Watanabe, S.: End-to-end multi-speaker speech recognition with transformer. In: ICASSP (2020) 3
  8. Chu, P., Wang, J., You, Q., Ling, H., Liu, Z.: Transmot: Spatial-temporal graph transformer for multiple object tracking. arXiv preprint arXiv:2104.00194 (2021) 4
  9. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021) 3
  10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016) 6, 10
  11. Kuhn, H.W.: The hungarian method for the assignment problem. Naval research logistics quarterly 2(1-2), 83–97 (1955) 3
  12. Leal-Taixé, L., Canton-Ferrer, C., Schindler, K.: Learning by tracking: Siamese cnn for robust target association. In: CVPRW (2016) 3
  13. Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M.: Neural speech synthesis with transformer network. In: AAAI (2019) 3
  14. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV (2017) 8
  15. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV (2014) 10
  16. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021) 3
  17. Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., Leibe, B.: Hota: A higher order metric for evaluating multi-object tracking. IJCV 129(2), 548–578 (2021) 9
  18. Meinhardt, T., Kirillov, A., Leal-Taixe, L., Feichtenhofer, C.: Trackformer: Multi-object tracking with transformers. arXiv preprint arXiv:2101.02702 (2021) 1, 3, 4, 8, 9, 10, 11
  19. Milan, A., Leal-Taixé, L., Reid, I., Roth, S., Schindler, K.: Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831 (2016) 9
  20. Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., Yu, F.: Quasi-dense similarity learning for multiple object tracking. In: CVPR (2021) 11, 12
  21. Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: A metric and a loss for bounding box regression. In: CVPR (2019) 8
  22. Schulter, S., Vernaza, P., Choi, W., Chandraker, M.: Deep network flow for multi-object tracking. In: CVPR (2017) 3
  23. Shao, S., Zhao, Z., Li, B., Xiao, T., Yu, G., Zhang, X., Sun, J.: Crowdhuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123 (2018) 10
  24. Sharma, S., Ansari, J.A., Murthy, J.K., Krishna, K.M.: Beyond pixels: Leveraging geometry and shape cues for online multi-object tracking. In: ICRA (2018) 3
  25. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. TPAMI 39(11), 2298–2304 (2016) 4
  26. Shuai, B., Berneshawi, A.G., Modolo, D., Tighe, J.: Multi-object tracking with siamese track-rcnn. arXiv preprint arXiv:2004.07786 (2020) 3
  27. Stadler, D., Beyerer, J.: Modelling ambiguous assignments for multi-person tracking in crowds. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 133–142 (2022) 11
  28. Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., Luo, P.: Dancetrack: Multi-object tracking in uniform appearance and diverse motion. arXiv preprint arXiv:2111.14690 (2021) 3, 9, 11, 12
  29. Sun, P., Jiang, Y., Zhang, R., Xie, E., Cao, J., Hu, X., Kong, T., Yuan, Z., Wang, C., Luo, P.: Transtrack: Multiple-object tracking with transformer. arXiv preprint arXiv: 2012.15460 (2020) 1, 3, 4, 8, 9, 10, 11, 12
  30. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NeurIPS (2014) 2, 4
  31. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017) 2, 3, 4
  32. Wang, Q., Zheng, Y., Pan, P., Xu, Y.: Multiple object tracking with correlation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3876–3886 (2021) 11
  33. Wang, S., Sheng, H., Zhang, Y., Wu, Y., Xiong, Z.: A general recurrent tracking framework without real data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13219–13228 (2021) 11
  34. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR (2018) 3
  35. Wang, Y., Kitani, K., Weng, X.: Joint object detection and multi-object tracking with graph neural networks. In: 2021 IEEE International Conference on Robotics and Automation (ICRA). pp. 13708–13715. IEEE (2021) 11
  36. Wang, Y., Xu, Z., Wang, X., Shen, C., Cheng, B., Shen, H., Xia, H.: End-to-end video instance segmentation with transformers. In: CVPR (2021) 3, 14
  37. Wang, Z., Zheng, L., Liu, Y., Li, Y., Wang, S.: Towards real-time multi-object tracking. In: ECCV (2020) 1, 3
  38. Welch, G., Bishop, G., et al.: An introduction to the kalman filter (1995) 3
  39. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: ICIP (2017) 1, 3
  40. Wu, J., Cao, J., Song, L., Wang, Y., Yang, M., Yuan, J.: Track to detect and segment: An online multi-object tracker. In: CVPR (2021) 11, 12
  41. Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., Darrell, T.: Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020) 9, 12
  42. Zhang, Y., Sun, P., Jiang, Y., Yu, D., Yuan, Z., Luo, P., Liu, W., Wang, X.: Bytetrack: Multi-object tracking by associating every detection box. arXiv preprint arXiv:2110.06864 (2021) 1, 3, 10, 11, 12
  43. Zhang, Y., Wang, C., Wang, X., Zeng, W., Liu, W.: Fairmot: On the fairness of detection and re-identification in multiple object tracking. IJCV pp. 1–19 (2021) 1, 3, 11, 12
  44. Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. In: ECCV (2020) 9, 10, 11, 12
  45. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable detr: Deformable transformers for end-to-end object detection. In: ICLR (2020) 1, 3, 6, 10


Origin blog.csdn.net/i6101206007/article/details/131601448