DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion

Paper address:  https://arxiv.org/abs/2111.14690

Code address:  https://github.com/DanceTrack/DanceTrack

Author affiliations: The University of Hong Kong, Carnegie Mellon University (CMU), ByteDance

Although existing MOT datasets contain many occlusions, the motion in them is regular, and top performance can be obtained using IoU matching alone, so these datasets cannot accurately evaluate a tracker. Occlusion and similar appearance are still the main factors limiting tracking performance.
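To make the point about IoU matching concrete, here is a minimal sketch of IoU-only association. It is an illustration, not the code of any tracker evaluated in the paper: the function names, the box format `(x1, y1, x2, y2)`, the `min_iou` threshold, and the greedy pairing (real trackers usually solve an assignment problem instead) are all assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_iou_match(tracks, detections, min_iou=0.3):
    """Pair track boxes with detection boxes in order of descending IoU.

    Returns {track_index: detection_index}. Hypothetical helper: greedy
    pairing is a simplification chosen to keep the sketch dependency-free.
    """
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    matches, used_dets = {}, set()
    for score, ti, di in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be trusted
        if ti in matches or di in used_dets:
            continue
        matches[ti] = di
        used_dets.add(di)
    return matches


# In both cases detection i truly belongs to person i.

# Regular motion: each person shifts by one pixel between frames.
prev_boxes = [(0, 0, 10, 20), (30, 0, 40, 20)]
next_boxes = [(1, 0, 11, 20), (31, 0, 41, 20)]
easy = greedy_iou_match(prev_boxes, next_boxes)

# Crossing: two overlapping dancers swap positions between frames.
prev_cross = [(0, 0, 10, 20), (8, 0, 18, 20)]
next_cross = [(8, 0, 18, 20), (0, 0, 10, 20)]
swapped = greedy_iou_match(prev_cross, next_cross)
```

In the regular-motion case every track is paired with its own detection. In the crossing case each track is paired with the box that now occupies its old position, which belongs to the other dancer, so both identities are switched.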

Abstract

A common practice in multi-object tracking (MOT) is to use a detector to localize objects and a re-identification (Re-ID) model to associate them across frames. This pipeline is motivated partly by recent progress in object detection and Re-ID, and partly by the bias of existing datasets: their scenes are not diverse enough, most objects have clearly separable appearance, and Re-ID features alone are sufficient for association. In response to this bias, we argue that multi-object tracking methods should also work on objects that have few distinguishing appearance features. We therefore propose DanceTrack, a large-scale dataset for multi-person tracking with similar appearance and diverse motion. We hope that DanceTrack can provide a better evaluation benchmark for multi-object tracking and encourage algorithms that depend less on appearance features and more on motion analysis. We evaluated several state-of-the-art trackers and observed a significant performance drop compared with existing benchmarks. The data and code are available at https://github.com/DanceTrack/DanceTrack.

Introduction

Object tracking has been studied for a long time and is widely used in applications such as autonomous driving, video analysis, and motion planning. The goal of multi-object tracking is to locate objects in a video and link them across frames. We found that the development of multi-object tracking relies heavily on detection and Re-ID, with most methods using appearance features for association. This trend leads to poor performance of existing methods on similar-looking objects, which motivates us to propose a benchmark that encourages models to fuse motion patterns and temporal features.

As in many other fields of computer vision, progress in multi-object tracking has been driven by benchmark datasets, and algorithms developed on specific datasets are prone to inherit their distributional biases. In this paper, we identify a limitation of existing multi-object tracking datasets: most objects have distinct appearance and a fixed motion pattern (almost uniform linear motion). Influenced by these datasets, recently proposed algorithms rely heavily on appearance features for association and rarely consider motion cues, which hinders the development of more general and intelligent algorithms.
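The "almost uniform linear motion" assumption can be illustrated with a constant-velocity predictor, a simplified stand-in for the linear motion models used in many trackers. The sketch below is an assumption-laden toy, not code from the paper: the trajectories, the function names, and the error measure are all made up for this example.

```python
import math


def predict_constant_velocity(p_prev, p_curr):
    """Extrapolate the next center assuming the velocity stays constant."""
    return (2 * p_curr[0] - p_prev[0], 2 * p_curr[1] - p_prev[1])


def mean_prediction_error(trajectory):
    """Mean distance between constant-velocity predictions and true centers."""
    errors = []
    for i in range(2, len(trajectory)):
        px, py = predict_constant_velocity(trajectory[i - 2], trajectory[i - 1])
        tx, ty = trajectory[i]
        errors.append(math.hypot(px - tx, py - ty))
    return sum(errors) / len(errors)


# Hypothetical pedestrian-like track: uniform linear motion.
walker = [(2.0 * t, 0.5 * t) for t in range(10)]

# Hypothetical dancer-like track: moving on a circle of radius 20 px,
# turning 45 degrees per frame.
step = math.pi / 4
dancer = [(20 * math.cos(step * t), 20 * math.sin(step * t)) for t in range(10)]

walker_err = mean_prediction_error(walker)
dancer_err = mean_prediction_error(dancer)
```

For the linear track the prediction error is zero, while for the turning track it is a large fraction of the circle radius. A tracker that gates matches around the predicted position would therefore lose the turning target far more easily.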

We also observe that appearance matching is very unreliable when objects have similar appearance or are occluded, which leads to a large performance drop for current state-of-the-art algorithms in practical applications. To provide a better platform for developing more capable algorithms, we propose a new dataset, which we call DanceTrack because most of its videos show group dancing. It contains more than 100,000 images (about 10 times as many as MOT17). As shown in Figure 1, the dataset is characterized by (1) similar appearance: the people in a video look very similar or even wear the same clothes, which makes them difficult to distinguish using features extracted by Re-ID; and (2) diverse motion: people move over a large range and their poses change considerably, which places high demands on motion modeling. The second characteristic also brings occlusion and crossing: people overlap to a large extent, and their direction of movement changes constantly.
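Why similar appearance hurts Re-ID association can be seen from the gap between the best and second-best similarity scores. The following is a minimal sketch under stated assumptions: the 4-dimensional embeddings are invented toy values (real Re-ID features have hundreds of dimensions and come from a trained network), and `match_margin` is a hypothetical helper, not part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_margin(query, gallery):
    """Gap between the best and second-best similarity for one query.

    A small margin means the appearance cue barely separates the candidates.
    """
    scores = sorted((cosine_similarity(query, g) for g in gallery), reverse=True)
    return scores[0] - scores[1]


# Toy street scene: different clothes, so the embeddings differ widely.
street_gallery = [(1.0, 0.0, 0.0, 0.2), (0.0, 1.0, 0.1, 0.0), (0.1, 0.0, 1.0, 0.0)]
street_query = (0.9, 0.1, 0.0, 0.2)

# Toy dance group: identical costumes, so the embeddings are nearly the same.
dance_gallery = [(1.0, 0.50, 0.20, 0.1), (1.0, 0.52, 0.20, 0.1), (1.0, 0.50, 0.22, 0.1)]
dance_query = (1.0, 0.51, 0.21, 0.1)

street_margin = match_margin(street_query, street_gallery)
dance_margin = match_margin(dance_query, dance_gallery)
```

With these toy values the street margin is large, so the correct match stands out, while the dance margin is close to zero, so small feature noise from blur or occlusion is enough to change which candidate ranks first.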

Based on this dataset, we construct a benchmark that covers popular existing multi-object tracking algorithms. The results show that it is difficult to obtain satisfactory performance using only an appearance model or a linear motion model. Since the scenarios in this dataset are common in real life, we believe it exposes the problems that existing algorithms face in practical applications. To identify promising directions for future research, we also analyzed how different data association methods perform and reached the following conclusions: (1) fine-grained representations, such as segmentation masks and poses, give better results than coarse-grained bounding boxes; (2) although the task is a 2D problem, depth information is still beneficial; (3) modeling temporal motion information is very important.

In summary, the key contributions of this paper to the field of object tracking are as follows:

1. We construct a large-scale multi-object tracking dataset that fills the gap left by the lack of similar-looking objects in current datasets.

2. We evaluate many methods on this new dataset, revealing the shortcomings of current algorithms.

3. We provide a detailed analysis that uncovers clues for developing multi-object tracking algorithms suited to more complex real-life scenarios.
