A Survey of Research on Visual Multi-Target Tracking Based on Deep Learning

Authors: Wu Han, Nie Jiahao, Zhang Zhaowei, He Zhiwei, Gao Mingyu

Source: Computer Science

Edit: East Bank because of @一点Artificial Intelligence

Multiple Object Tracking (MOT) aims to output the motion trajectories of all objects in a given video sequence while maintaining the identity of each object. Because of its potential in both academic research and practical applications, it has received increasing attention in recent years and has become an active research direction in computer vision. The current mainstream approach splits the MOT task into three subtasks: target detection, feature extraction, and data association. This approach is well developed. However, because of challenges such as occlusion and interference from similar objects during tracking, maintaining robust tracking remains difficult. To achieve accurate, robust, and real-time tracking of multiple targets in complex scenes, further research on and improvement of MOT algorithms is needed.

Reviews of MOT algorithms already exist, but they summarize the field incompletely and omit the latest research results. This survey therefore first introduces the principle and challenges of MOT. Second, drawing on the latest research results, it summarizes and analyzes MOT algorithms and divides them into three categories according to the tracking paradigm used to complete the three subtasks: separate detection and feature extraction, joint detection and feature extraction, and joint detection and tracking. The main features of each category are described in detail. Finally, it compares and analyzes current mainstream algorithms on commonly used datasets, discusses the advantages, disadvantages, and development trends of current algorithms, and outlines future research directions.

01 Introduction

The main task of multiple object tracking (MOT) is to output the trajectories of all objects in a given video and maintain the identity (ID) of each object. The tracked target may be a pedestrian, a vehicle, or another object. With the development of computer vision technology, MOT has been widely used in many fields, such as intelligent video surveillance, human-computer interaction, and intelligent navigation [1-2]. In addition, MOT is the basis for higher-level computer vision tasks such as pose estimation, action recognition, behavior analysis, and video analysis [3-4]. However, robust tracking in complex scenes remains difficult, mainly for the following three reasons:

1) Frequent occlusion during the tracking process makes it difficult to accurately locate the target;

2) There may be a high appearance similarity between different targets, which increases the difficulty of maintaining the target ID;

3) Interaction between targets may cause the tracking box to drift.

Traditional MOT algorithms include Markov decision processes, joint probabilistic data association, and particle filters, but these methods predict positions with large errors and are not robust to occlusion or interference from similar objects. With the wide application of deep learning in computer vision, tracking methods based on deep learning have received extensive attention and have become the mainstream of research in recent years. Benefiting from the rapid development of target detection technology, current deep-learning-based methods mainly split MOT into three subtasks: target detection, feature extraction, and data association [5]. Specifically, they associate objects detected in different video frames into trajectories based on the similarity of the objects' appearance, motion, and spatio-temporal features. Deep-learning-based tracking algorithms do not require hand-crafted features; given a large amount of training data, the model learns to extract good features on its own.

To promote the development of MOT, several surveys have reviewed research results in recent years. Literature [6] comprehensively summarizes the main challenges and the main techniques in MOT; literature [7] summarizes the application of deep learning in each step of MOT; literature [8] summarizes in detail the application of deep learning to data association in MOT; literature [9] reviews MOT methods based on RGB-D three-dimensional visual information; and literature [10] reviews MOT models by dividing them into traditional methods and deep-learning-based methods. However, most of these classification schemes lack novelty and do not cover the latest research results.

To make up for the deficiencies of existing reviews and to help researchers understand the latest developments in MOT, this paper takes a novel perspective: it divides recent MOT algorithms into three categories according to the tracking paradigm used to complete the three subtasks of target detection, feature extraction, and data association. By reviewing the latest research results, it summarizes recent MOT algorithms together with their advantages and disadvantages and outlines future research directions.

02 Introduction to Existing MOT Algorithms

In recent years, MOT algorithms have mainly adopted the strategy of associating detected objects in a video sequence into complete trajectories according to the feature similarity of the objects. According to the tracking paradigm used to complete the three subtasks of target detection, feature extraction, and data association, recent MOT algorithms can be divided into methods based on separate detection and feature extraction (Separate Detection and Embedding, SDE), methods based on joint detection and feature extraction (Joint Detection and Embedding, JDE), and methods based on joint detection and tracking (Joint Detection and Tracking, JDT).

As shown in Figure 1, SDE-based methods complete the three subtasks in sequence: they first locate targets with a detection network, then extract the features of each target, and finally compute the affinity between targets and associate them with a data association algorithm. JDE-based methods output the location and appearance features of targets simultaneously from a single network and then compute the affinity between targets and associate them with a data association algorithm. JDT-based methods complete all three subtasks within a single network. The classic models of the three paradigms, along with their tracking performance, advantages, and disadvantages, are introduced in detail below.

Figure 1 Schematic diagram of each paradigm structure

Through the unremitting efforts of scholars at home and abroad, many MOT algorithms have achieved remarkable results in tracking accuracy and tracking speed. Figure 2 shows a number of representative algorithms in recent years according to the classification of algorithms, and each algorithm will be introduced in detail later. It is not difficult to see that a variety of tracking methods coexist, and more diversified network structures and tracking strategies have promoted the rapid development of MOT technology.

Figure 2 Classification of MOT algorithms in recent years

03 Algorithm based on SDE paradigm

According to the requirements of the algorithm for the input video frame, the algorithm based on SDE can be further divided into offline method and online method. The offline method considers the information of all video frames of the entire video sequence in the process of data association, while the online method only relies on the visual and spatiotemporal information of the current and past moments in the tracking process. Table 1 details the characteristics and differences of various aspects of offline and online methods.

Table 1 Comparison of offline tracking and online tracking

3.1 Offline Tracking Method

Offline tracking can be regarded as a global optimization problem: given the detection results of all video frames, the detections belonging to the same object are globally associated into one trajectory.

The key to offline tracking is to find the globally optimal solution. Continuous energy minimization [47] is a commonly used global optimization method that integrates data association and trajectory estimation into an energy function and constrains trajectories by constructing a motion model. Another commonly used strategy models the MOT task as a graph, where each vertex represents a detected target and the edges between vertices represent the similarity between targets; the matching between vertices is then determined by the Hungarian algorithm [48-49] or a greedy algorithm [50]. Graph-based methods include Network Flow (NF) [11], Conditional Random Field (CRF) [12], Minimum Cost Subgraph Multicut (MCSM) [13], and Maximum-Weight Independent Set (MWIS) [51].
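The greedy association step mentioned above can be sketched as follows. This is a minimal illustration, not the procedure of any cited paper: the function name, the similarity matrix, and the gating threshold `min_similarity` are assumptions made for the example. Unlike the Hungarian algorithm, greedy matching does not guarantee a globally optimal assignment.

```python
def greedy_match(similarity, min_similarity=0.5):
    """Greedily pair tracks (rows) with detections (columns).

    similarity[i][j] is the edge weight between track i and detection j.
    min_similarity is a hypothetical gating threshold chosen for illustration.
    """
    # Collect every candidate edge that passes the gate, strongest first.
    edges = [
        (score, i, j)
        for i, row in enumerate(similarity)
        for j, score in enumerate(row)
        if score >= min_similarity
    ]
    edges.sort(reverse=True)

    used_tracks, used_detections, matches = set(), set(), []
    for score, i, j in edges:
        # Each track and each detection may be used at most once.
        if i in used_tracks or j in used_detections:
            continue
        matches.append((i, j))
        used_tracks.add(i)
        used_detections.add(j)
    return sorted(matches)


similarity = [
    [0.9, 0.2, 0.1],  # track 0 vs. detections 0..2
    [0.3, 0.8, 0.4],  # track 1 vs. detections 0..2
]
matches = greedy_match(similarity)
```

Here detection 2 stays unmatched; a tracker would typically treat such a detection as a candidate for a new trajectory.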

NF is a directed graph in which each edge has a certain capacity. For the MOT task, each node in the graph represents a detected target, a flow indicates whether two nodes are connected, and a trajectory corresponds to a flow path in the graph. NF-based algorithms can obtain the globally optimal solution in polynomial time and improve tracking accuracy by considering the information of multiple frames at the same time. However, it is difficult for NF-based methods to take multiple types of information into account during tracking.

CRF is an undirected graphical model that represents conditional probability distributions between sets of random variables. Each node in the graph represents a detected target; with the trajectories as input, the CRF predicts the probabilistic relationship between each detected target and each trajectory. The advantage of CRF is that it can effectively model the interactions between targets. However, CRF-based MOT algorithms easily fall into local optima.

MCSM treats MOT as a graph clustering problem, where each output cluster represents a tracked target. MCSM measures the similarity between detected objects by edge-related cost, and then combines multiple high-confidence objects in time and space dimensions and performs clustering.

MWIS is the heaviest subset of non-adjacent nodes in the property graph. The nodes in the property graph represent the trajectory pairs in consecutive video frames, and the weights of the nodes represent the affinity of the trajectory pairs. If multiple trajectories share the same detection target, the nodes are connected. Finally, the global association results are obtained through the property graph.
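To make the MWIS formulation concrete, the sketch below searches a tiny graph exhaustively for the heaviest set of mutually non-adjacent nodes. It is only an illustration under stated assumptions: the node weights and conflict edges are made up, and brute-force search is a stand-in that scales exponentially, not the solver used in the cited work.

```python
from itertools import combinations


def max_weight_independent_set(weights, edges):
    """Return (nodes, total_weight) of the heaviest set of non-adjacent nodes.

    weights[i] is the affinity of candidate trajectory pair i (hypothetical values).
    edges lists pairs of nodes that conflict, e.g. because they share a detection.
    """
    conflicts = {frozenset(edge) for edge in edges}
    best_set, best_weight = (), 0.0
    n = len(weights)
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            # Skip subsets that contain two conflicting (adjacent) nodes.
            if any(frozenset(pair) in conflicts for pair in combinations(subset, 2)):
                continue
            total = sum(weights[i] for i in subset)
            if total > best_weight:
                best_set, best_weight = subset, total
    return list(best_set), best_weight


weights = [0.9, 0.8, 0.7, 0.6]
edges = [(0, 1), (2, 3)]  # nodes 0/1 and nodes 2/3 share a detection
selected, total = max_weight_independent_set(weights, edges)
```

With these example values the selected nodes are 0 and 2, the heaviest pair that shares no detection.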

Since more frames of image information can be utilized in the tracking process, offline methods usually have higher tracking accuracy and robustness than online methods, but their computational overhead is higher and their practical application range is smaller than online methods.

3.2 Online tracking method

Because the online tracking method has the characteristics of not relying on future information and is more in line with actual needs, online tracking algorithms have become the mainstream of research today. Online tracking methods usually correlate objects frame by frame in time order, so online tracking is also called sequential tracking. Current online tracking methods often associate objects based on their motion and appearance features. The early research mainly tracked the target based on the target's motion characteristics by building a motion model. Subsequently, benefiting from the powerful feature extraction capabilities of neural networks, tracking algorithms based on appearance features have attracted widespread attention. In order to further improve the tracking accuracy of the algorithm in various complex scenes, the MOT algorithm combined with motion and appearance features has become a research hotspot today.

3.2.1 Algorithms based on motion features

Many algorithms model key features such as position, velocity, and interaction of targets, and associate targets at different moments according to their motion states.

In 2016, Bewley et al. [14] modeled the position and velocity of each target and then associated targets based on the IoU between the predicted box obtained by Kalman filtering [52] and the detection box obtained by Faster R-CNN [53]. In 2019, Zhou et al. [15] used a convolutional neural network (CNN) [54] to model the motion patterns of targets and the interactions between them. Subsequently, Shan et al. [16] and Girbau et al. [17] designed models based on graph convolution and recurrent neural networks that fuse multi-frame image information to predict the motion state of the target.
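The IoU-based association idea can be illustrated with a short sketch. It assumes boxes in `(x1, y1, x2, y2)` format and replaces the Kalman filter with a plain constant-velocity shift (prediction step only, no covariance or update), so it is a simplification for illustration rather than the cited method.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def predict_box(box, velocity):
    """Shift a box by a per-frame (vx, vy) velocity.

    Simplified stand-in for the prediction step of a constant-velocity motion model.
    """
    vx, vy = velocity
    return (box[0] + vx, box[1] + vy, box[2] + vx, box[3] + vy)


track_box = (0.0, 0.0, 10.0, 10.0)   # last known position of the track
velocity = (5.0, 0.0)                # hypothetical estimated velocity
detection = (6.0, 0.0, 16.0, 10.0)   # detection in the current frame

predicted = predict_box(track_box, velocity)
overlap_with_prediction = iou(predicted, detection)
overlap_without_motion = iou(track_box, detection)
```

In this example the predicted box overlaps the detection far more than the stale box does, which is why a motion model helps association.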

The method based on target motion characteristics can effectively deal with short-term occlusion and alleviate the interference of similar targets on the model. However, due to the lack of appearance features, the tracking performance of these algorithms often degrades significantly in dense scenes or when the scale of the target changes.

3.2.2 Algorithms Based on Appearance Features

Benefiting from the powerful feature extraction capabilities of CNN, many current algorithms extract more discriminative appearance features through deep networks, thereby enhancing the tracking robustness of the model in crowded scenes.

In 2016, Yu et al. [18] designed a feature extraction network based on GoogLeNet [55] to extract the appearance features of the target, and associated targets through the k-dense neighbor algorithm [56]. In 2017, Son et al. [19] learned more discriminative object features by simultaneously learning multiple images containing different objects. Lee et al. [20] proposed a feature extraction network incorporating Feature Pyramid Network (FPN) [57] to enhance the network's target discrimination ability by fusing multiple levels of features. In 2021, Sun et al. [21] proposed a deep affinity network to extract the appearance features of objects and evaluate the appearance similarity between objects.
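Once a network has produced an appearance embedding for each target, association reduces to comparing vectors. The sketch below uses cosine similarity, which is an assumption made for illustration: the survey does not state which similarity measure the cited methods use, and the feature vectors are made-up values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_appearance_match(track_feature, detection_features):
    """Return (index, score) of the detection most similar to the track."""
    scores = [cosine_similarity(track_feature, f) for f in detection_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


track_feature = [1.0, 0.0, 0.5]          # hypothetical embedding of a tracked target
detection_features = [
    [0.0, 1.0, 0.0],                     # a different-looking target
    [2.0, 0.1, 1.0],                     # same appearance, different magnitude
]
index, score = best_appearance_match(track_feature, detection_features)
```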

Compared with algorithms based on motion features, algorithms based on appearance features track better in crowded scenes and are more robust to changes in target scale. However, algorithms based only on appearance features are prone to errors such as tracking box drift in scenes with interference from similar targets.

3.2.3 Algorithms Combining Motion and Appearance Features

It is difficult to track robustly in complex scenes only relying on the target's motion or appearance features. Therefore, combining target motion and appearance features is the mainstream direction of current research.

In 2017, Wojke et al. [22] combined the position predicted by a Kalman filter (KF) with the target appearance features extracted by a CNN to calculate the affinity between targets. Subsequently, to reduce the impact of noisy detections and redundant trajectories on tracking results, Chen et al. [23] designed a scoring mechanism that removes unreliable detection results and candidate trajectories and then associates the remaining targets based on KF predictions and target appearance features. In 2021, Li et al. [24] designed a self-correcting KF to predict the location of objects and evaluated the similarity between objects with a recurrent neural network.
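One simple way to combine the two cues is a weighted sum of a motion cost and an appearance cost. The sketch below is an assumption-laden illustration: the weight, the cost values, and the function name are hypothetical, and the survey does not specify how the cited methods fuse the two terms.

```python
def fuse_costs(motion_cost, appearance_cost, weight=0.5):
    """Blend two cost matrices: weight * motion + (1 - weight) * appearance.

    Rows are tracks, columns are detections; lower cost means a better match.
    weight is a hypothetical hyperparameter in [0, 1].
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [weight * m + (1.0 - weight) * a for m, a in zip(m_row, a_row)]
        for m_row, a_row in zip(motion_cost, appearance_cost)
    ]


# Track 0 is almost equally close to both detections (motion is ambiguous),
# but its appearance clearly favors detection 1.
motion_cost = [[0.1, 0.2], [0.9, 0.8]]
appearance_cost = [[0.7, 0.1], [0.2, 0.9]]
fused = fuse_costs(motion_cost, appearance_cost, weight=0.5)
```

With motion alone, track 0 would pick detection 0; after fusion it prefers detection 1, which shows how appearance can resolve an ambiguous motion cue.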

Tracking algorithms that combine motion and appearance features tend to have higher tracking accuracy and are more robust to various challenges in complex scenes. However, due to the high complexity of the network and the relatively large amount of calculation, the tracking speed of these algorithms is slow, and it is difficult to meet the requirements of real-time tracking.

04 Algorithm based on JDE paradigm

During tracking, SDE methods run inference on two computationally heavy deep networks in sequence, one for object detection and one for feature extraction. This high computational overhead limits tracking speed. The JDE paradigm, which completes object detection and feature extraction in a single network, has therefore received attention. By letting the two key tasks share a large number of features, the JDE paradigm significantly reduces the computational load of the algorithm. This section first introduces the development of the JDE paradigm and then summarizes the directions in which it has been improved in recent years.

4.1 The development history of JDE paradigm

The JDE paradigm outputs both location and appearance features of objects in a single network by adding a parallel feature extraction branch to the detection network. By making the two tasks share features, it effectively avoids some repeated calculations and improves the tracking speed of the model.

In 2019, Voigtlaender et al. [58] added a feature extraction branch to the two-stage detection network Mask R-CNN [59] and proposed TrackR-CNN. The feature extraction branch uses a fully connected layer to extract the appearance features of each candidate region generated by the Region Proposal Network (RPN). In addition, Mask R-CNN has an instance segmentation branch, which enables TrackR-CNN to extract pixel-level target features and thereby improves tracking accuracy. Although TrackR-CNN requires less computation than algorithms based on the SDE paradigm, the long inference time of the two-stage network means that it still does not meet the requirements of real-time tracking.

In 2020, Wang et al. [60] added a feature extraction branch to the single-stage detection network YOLOv3 [61] and proposed JDE864. Because YOLOv3 directly regresses the position and category of each target in the image, it helps improve tracking speed. In addition, JDE864 treats network training as a multi-task learning problem and adopts a self-balancing loss function [62] to balance the importance of classification, bounding box regression, and re-identification (ReID) feature extraction. By performing detection and feature extraction simultaneously in a single network, JDE864 became the first MOT algorithm to track in real time. However, its feature extraction branch is simple in design and does not fully account for the conflict between target detection and ReID, so its tracking robustness is relatively low.
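A self-balancing multi-task loss can be sketched as below. The survey does not give the formula, so this assumes the commonly used uncertainty-weighted form, total = sum over tasks of 0.5 * (exp(-s_i) * L_i + s_i), where each s_i is a learnable scalar. In training, the s_i would be optimized together with the network; here they are fixed numbers, and all values are hypothetical.

```python
import math


def self_balancing_loss(task_losses, log_variances):
    """Uncertainty-weighted sum of task losses (assumed form, see lead-in).

    task_losses: per-task loss values, e.g. classification, box regression, ReID.
    log_variances: one scalar s_i per task; larger s_i down-weights that task,
    while the additive s_i term penalizes making s_i arbitrarily large.
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("one log-variance is needed per task loss")
    return sum(
        0.5 * (math.exp(-s) * loss + s)
        for loss, s in zip(task_losses, log_variances)
    )


losses = [1.0, 2.0, 3.0]                      # hypothetical cls, box, ReID losses
equal = self_balancing_loss(losses, [0.0, 0.0, 0.0])
rebalanced = self_balancing_loss(losses, [0.0, 0.0, math.log(3.0)])
```

With all s_i at zero the result is simply half the sum of the task losses; raising s_i for the largest loss reduces its contribution.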

4.2 Research on the improvement of JDE paradigm

Although TrackR-CNN and JDE864 effectively reduce the computational load of the model, their tracking accuracy is not significantly better than that of earlier algorithms based on the SDE paradigm. Many scholars have therefore analyzed the reasons for the unsatisfactory tracking results and proposed improvements. These improvements focus on three aspects: adopting anchor-free detection networks, coordinating multiple subtasks, and designing attention mechanisms.

4.2.1 Anchor-free detection network

In a detection network that uses anchor boxes, one anchor box may contain multiple targets, and one target may correspond to multiple anchor boxes. This ambiguity reduces the discriminative power of the extracted ReID features. Therefore, many later studies build their algorithms on anchor-free detection networks.

Zhang et al. [25] added a parallel feature extraction branch to CenterNet, a center-point-based anchor-free detection network, and reduced the risk of overfitting by learning low-dimensional features of the target. In 2021, Liu et al. [26] designed a region transformation module based on deformable convolution [65] in the FCOS [64] network to reduce the network's attention to irrelevant regions. Subsequently, Yan et al. [66] integrated a feature extraction branch into the FCOS network. FCOS uses FPN to aggregate target features at multiple levels, making the extracted features better suited to both detection and ReID.

Compared with anchor-based networks, anchor-free networks can extract the features of the target itself more accurately, and algorithms built on them achieve a better balance between tracking accuracy and tracking speed.

4.2.2 Coordinating multiple subtasks

Target detection looks for what targets of the same class have in common, whereas ReID looks for what distinguishes targets of the same class from one another. This conflict makes it difficult for the extracted features to serve both tasks at once. Coordinating multiple subtasks within one network is therefore an important research direction.

In 2020, Liang et al. [27] designed a cross-correlation network to learn the common features shared by multiple tasks as well as the features specific to each task. Chen et al. [28] designed a norm-aware feature that maps feature vectors into polar coordinates and then uses the L2 norm [67] and the angle of the vectors for detection and ReID, respectively. In 2021, Wan et al. [29] designed a multi-channel spatio-temporal feature that encodes the appearance and motion features of the target into different channels, so that richer features serve both detection and ReID. Subsequently, Liang et al. [30] designed a re-check network to correct the detection results and the extracted ReID features.
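The idea of separating a feature vector into a norm and a direction can be sketched as follows. This is only an illustration of the decomposition: how the cited method turns the norm into a detection score or trains the embedding is not described here, and the example vectors are hypothetical.

```python
import math


def to_norm_and_direction(feature):
    """Split a feature vector into its L2 norm and its unit direction."""
    norm = math.sqrt(sum(x * x for x in feature))
    if norm == 0.0:
        return 0.0, [0.0] * len(feature)
    return norm, [x / norm for x in feature]


def direction_similarity(dir_a, dir_b):
    """Dot product of two unit vectors, i.e. the cosine of the angle between them."""
    return sum(a * b for a, b in zip(dir_a, dir_b))


# Two hypothetical embeddings that point the same way but differ in length.
norm_a, dir_a = to_norm_and_direction([3.0, 4.0])
norm_b, dir_b = to_norm_and_direction([6.0, 8.0])

# The norm could serve as a detection cue, the direction as the ReID cue.
same_identity_score = direction_similarity(dir_a, dir_b)
```

Because the two cues live in different components of the vector, changing the norm for detection does not alter the direction used for identity matching.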

Since the contradictions within the network can be alleviated, the improved strategy of coordinating multiple subtasks can effectively improve the tracking accuracy of the model. But it increases the network complexity and calculation amount of the model, so the tracking speed is correspondingly slowed down.

4.2.3 Attention mechanism

By designing different attention mechanisms to enhance the network's attention to specific areas, it can effectively improve the detection quality of the model in complex scenes and enable the network to accurately extract the more discriminative ReID features of the target, thereby effectively improving the tracking performance of the algorithm.

In 2020, Meng et al. [31] designed a spatiotemporal attention mechanism to learn and update the weights of features at each moment when tracking target features. Zhang et al. [32] introduced the spatial attention mechanism and the channel attention mechanism [68] to improve the model's robustness to similar object interference and target scale transformation. In 2021, the target attention mechanism and distractor attention mechanism proposed by Guo et al. [59] can effectively enhance the ability of the model to distinguish different targets. Subsequently, Yu et al. [33] designed a deformable attention to capture the association between the target and the surrounding background, and effectively learned the more discriminative ReID features of the target.

Adding an attention mechanism can focus the attention of the network on task-related areas, and different attention can effectively improve the tracking performance of the model in different tracking scenarios. Furthermore, adding an attention mechanism usually has little impact on the computation and complexity of the network.

05 Algorithm based on JDT paradigm

Although the JDE paradigm reduces computation compared with the SDE paradigm, it unifies only target detection and feature extraction. Model complexity therefore remains high, and the complete tracking pipeline cannot be trained end to end by backpropagation, which makes global optimization difficult. In recent years, the JDT paradigm, which completes all three subtasks in a single network, has attracted the attention of many scholars.

The JDT paradigm takes adjacent multi-frame images as input, predicts its current position offset or appearance features based on the previous motion or appearance information of the target, and then associates the target. The current algorithms based on the JDT paradigm are mainly divided into methods based on Siamese networks and methods based on Transformer [70].

5.1 Method based on Siamese network

A Siamese network is a variant of a standard CNN. As shown in Figure 3, Siamese-network-based methods extract the features of a target from different video frames through two convolutional branches with shared weights and combine information from the different images to learn more discriminative target features. The algorithm then searches the current frame for previously tracked objects. According to how the model searches for objects, these methods can be divided into candidate-region-based methods and center-point-based methods.

Figure 3 Schematic diagram of Siamese network

5.1.1 Algorithms based on candidate regions

Proposal-based methods first generate candidate regions for object locations, and then search for objects in the candidate regions and regress bounding boxes according to the characteristics of the objects at previous moments.

In 2019, Bergmann et al. [71] treated MOT as a detection problem with an integrated ReID task and designed a motion compensation model to handle camera motion and large changes in target position in low-frame-rate videos. Peng et al. [34] generated candidate regions through RPN and then regressed a pair of bounding boxes for each target from two adjacent frames using chained anchor boxes. Xu et al. [35] designed a deep Hungarian network based on a bidirectional recurrent neural network to improve association accuracy. In 2021, Shuai et al. [61] enlarged the target's tracking box from the previous frame, mapped it onto the current frame as a candidate region, and searched for the tracked target within it. The method in [37] selects a pair of adjacent images for contrastive learning during training: after generating a large number of candidate regions through RPN, it compares the similarity between the candidate regions of the two frames to train the model's feature extraction ability.

This proposal-based method is suitable for tracking scenarios where the target position changes relatively slowly. In scenes where the target position changes greatly between two adjacent frames, such as fast moving targets or low video frame rates, the candidate area generated by the model may deviate greatly from the actual position of the target, resulting in false tracking or missed tracking.
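The search-region idea and its failure mode can be illustrated with a small sketch. It assumes boxes in `(x1, y1, x2, y2)` format and a hypothetical enlargement factor `scale`; the cited methods may build their candidate regions differently.

```python
def expand_box(box, scale, image_size):
    """Enlarge a box around its center by `scale` and clip it to the image.

    scale is a hypothetical hyperparameter; image_size is (width, height).
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w, half_h = (x2 - x1) * scale / 2.0, (y2 - y1) * scale / 2.0
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(width), cx + half_w),
        min(float(height), cy + half_h),
    )


def contains_center(region, box):
    """Check whether the center of `box` falls inside `region`."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]


previous_box = (40.0, 40.0, 60.0, 60.0)
search_region = expand_box(previous_box, scale=2.0, image_size=(200, 200))

slow_target = (45.0, 42.0, 65.0, 62.0)      # small displacement between frames
fast_target = (120.0, 40.0, 140.0, 60.0)    # large displacement, e.g. low frame rate
```

The slowly moving target stays inside the search region, while the fast one falls outside it and would be missed, matching the limitation described above.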

5.1.2 Algorithm based on center point

The center point-based method directly predicts the center position of the target on the image, and at the same time estimates the coordinate position offset of the tracking target in the current image for subsequent data association, and finally returns the bounding box of the target.

In 2020, Zhou et al. [38] added two parallel branches to predict the vertical and horizontal offsets of the target between two adjacent frames. To address the problem that a traditional bounding box cannot represent the spatio-temporal information of a target, Pang et al. [39] designed a bounding tube that describes the state of the target by its locations at multiple moments. In 2021, to make the model more robust to occlusion, Wu et al. [40] designed a motion guidance module that predicts the position offset of corresponding pixels in two frames and, based on the predicted offset, fuses feature maps from multiple moments to enhance the target features. Wang et al. [41] strengthened the model's ability to discriminate each target by learning the relationships between the target, the surrounding background, and other targets. Also to improve robustness to occlusion, Hornakova et al. [42] designed a spatio-temporal recursive memory module that predicts the position of an occluded target from its positions in all historical frames. Subsequently, to make full use of the spatio-temporal information of the target, Wang et al. [43] modeled the spatio-temporal interactions between targets with a graph neural network, thereby fusing the information of multiple frames.
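Association from predicted center offsets can be sketched as follows. The sketch assumes that each offset points from the target's center in the current frame back to its center in the previous frame, and it uses a simple nearest-neighbor rule with a hypothetical distance gate `max_distance`. These conventions are assumptions for illustration, not details given in the survey.

```python
import math


def associate_by_offset(prev_centers, curr_centers, offsets, max_distance=20.0):
    """Match current detections to previous centers using predicted offsets.

    Returns a dict {current_index: previous_index}. Detections with no previous
    center within max_distance stay unmatched (candidates for new tracks).
    """
    matches, taken = {}, set()
    for j, ((cx, cy), (dx, dy)) in enumerate(zip(curr_centers, offsets)):
        # Move the current center back to where it should have been last frame.
        px, py = cx + dx, cy + dy
        best, best_dist = None, max_distance
        for i, (qx, qy) in enumerate(prev_centers):
            if i in taken:
                continue
            dist = math.hypot(px - qx, py - qy)
            if dist <= best_dist:
                best, best_dist = i, dist
        if best is not None:
            matches[j] = best
            taken.add(best)
    return matches


prev_centers = [(10.0, 10.0), (50.0, 50.0)]
curr_centers = [(52.0, 50.0), (14.0, 10.0), (200.0, 200.0)]
offsets = [(-2.0, 0.0), (-4.0, 0.0), (0.0, 0.0)]  # hypothetical network outputs
matches = associate_by_offset(prev_centers, curr_centers, offsets)
```

The third detection has no previous center nearby, so it is left unmatched and would start a new trajectory.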

Compared with the algorithm based on the candidate area, the algorithm based on the center point can more accurately extract the characteristics of the target itself. Second, the center point-based method is more suitable for representing the position offset of the target. At the same time, according to the predicted position offset, the method based on the center point can accurately fuse the characteristics of the target at multiple past moments, so as to improve the tracking accuracy of the algorithm in complex scenes by making full use of spatiotemporal information.

5.2 Transformer method

The Transformer was first proposed in natural language processing and uses the attention mechanism to fully extract deep features of the target. In recent years, the Transformer has achieved remarkable success in multiple computer vision tasks [72-74] because of its powerful feature representation and good parallel computing capabilities.
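The attention operation at the core of the Transformer can be sketched in a few lines. This is a minimal pure-Python version of standard scaled dot-product attention, softmax(QK^T / sqrt(d)) V. Interpreting the queries as track features and the keys and values as current-frame features is an assumption made for illustration, and all numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


track_queries = [[4.0, 0.0]]                 # one hypothetical track feature
frame_keys = [[4.0, 0.0], [0.0, 4.0]]        # two hypothetical frame features
frame_values = [[1.0, 0.0], [0.0, 1.0]]
attended = attention(track_queries, frame_keys, frame_values)
```

The query attends almost entirely to the first key, so the output is dominated by the first value vector.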

In 2020, Sun et al. [44] applied the Transformer to the MOT task for the first time. To solve the problem that a basic Transformer has difficulty tracking targets that newly appear in the video, they designed two decoders, one for detecting targets and one for tracking previous targets. Subsequently, Chu et al. [45] proposed a spatio-temporal graph Transformer to model the spatio-temporal interactions between targets. It arranges the trajectories of the targets into a set of weighted sparse graphs and models the interactions between multiple targets by constructing a spatial graph encoder, a temporal encoder, and a spatial graph decoder. Because representing objects by bounding boxes in complex scenes introduces interference from the background and other objects, Xu et al. [46] proposed a heat-map-based Transformer tracking algorithm that predicts the center point position of each target.

Benefiting from the powerful data association capability of the Transformer, Transformer-based algorithms have strong tracking robustness. In addition, the Transformer has a clear structure and excellent performance, and it still has great development potential in the field of MOT, which provides a new direction for follow-up research.

06 Datasets and Evaluation Indicators

6.1 MOT Dataset

In order to provide sufficient training data for the MOT algorithm and accurately evaluate the performance level of each algorithm, many scholars have released multiple MOT data sets in recent years. According to the different tracking objects of each dataset, it can be divided into pedestrian tracking dataset and vehicle tracking dataset.

MOT15 [75] is the first MOT dataset, which contains 22 video sequences. MOT15 mainly includes challenges such as unfixed cameras, viewing angle changes, and illumination changes, and provides the detection results of the ACF [76] algorithm. Subsequently, Milan et al. released the dataset MOT16 [77] with a higher target density. The dataset consists of 14 video sequences and provides the detection results of the DPM [78] algorithm. The video sequences of MOT17 [79] and MOT16 are the same, but MOT17 provides more accurate annotation results, and also provides detection results of Faster R-CNN, DPM and SDP [80]. The main challenges of the MOT16 and MOT17 datasets include camera shake, frequent object interactions, and lighting changes. The tracking scene in MOT20 [81] is extremely crowded, and its average object density far exceeds that of other datasets. TAO-person [82] is a large-scale pedestrian tracking dataset, which contains 418 training videos and 826 testing videos. The main challenge of the TAO-person dataset comes from the complex motion patterns and motion blur of pedestrians.

KITTI [83-84], which contains 50 video sequences, can be used for both pedestrian and vehicle tracking and provides detection results of DPM and Regionlets [85]. Most of the videos in the vehicle tracking dataset UA-DETRAC [86] are shot on crowded city roads or highways, so there is a lot of motion blur and mutual occlusion between objects. Waymo [87] contains 1150 videos taken in urban or suburban areas. In addition to 2D images and their annotations, Waymo also provides LiDAR data for 3D detection and tracking tasks. Table 2 lists the information of the commonly used datasets, where the target density (Density) is the average number of targets per frame in the dataset.

Table 2 Commonly used MOT datasets

6.2 Evaluation Index

To evaluate tracking performance comprehensively, multiple indicators [90-92] are usually used. The number of identity switches (Identity Switches, IDs) is the number of times the ID of any target changes during the entire tracking process. The Identification F-Score (IDF) takes into account both the precision and the recall of target IDs. IDs and IDF are important metrics reflecting the tracking robustness of a model. FP is the total number of false positives (false detections), and FN is the total number of false negatives (missed targets). Multiple Object Tracking Accuracy (MOTA), one of the most important evaluation indicators, comprehensively considers FP, FN and IDs, as shown in formula (1).

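$$\mathrm{MOTA} = 1 - \frac{\mathrm{FN} + \mathrm{FP} + \mathrm{IDs}}{N_{GT}} \tag{1}$$

where $N_{GT}$ is the total number of ground-truth objects.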
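As a worked illustration of these indicators, the sketch below computes MOTA as one minus the ratio of the summed FN, FP and IDs to the number of ground-truth objects, and an identification F-score in the form commonly reported as IDF1, that is 2·IDTP / (2·IDTP + IDFP + IDFN). The function names and all counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mota(fn, fp, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDs) / N_GT."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (fn + fp + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """Identification F-score: harmonic mean of ID precision and ID recall."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom > 0 else 0.0


# Hypothetical counts accumulated over a whole sequence.
mota_score = mota(fn=300, fp=150, id_switches=50, num_gt=2000)
idf1_score = idf1(idtp=800, idfp=200, idfn=200)

print(f"MOTA = {mota_score:.1%}")
print(f"IDF1 = {idf1_score:.1%}")
```

Note that MOTA has no lower bound at zero: when the total number of errors exceeds the number of ground-truth objects, the value becomes negative.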

Multiple Object Tracking Precision (MOTP) mainly considers the overlap between the tracked boxes and the ground-truth bounding boxes. Mostly Tracked (MT) is the proportion of targets whose trajectories are successfully tracked for at least 80% of their length; Mostly Lost (ML) is the proportion of targets whose trajectories are successfully tracked for at most 20% of their length. Fragmentation (Frag) is the total number of times tracked trajectories are interrupted. In 2021, Luiten et al. [93] proposed Higher Order Tracking Accuracy (HOTA), which evaluates a model comprehensively by computing the geometric mean of detection accuracy and association accuracy at various localization thresholds. Hz measures the tracking speed of an algorithm in frames per second (FPS).
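The trajectory-level indicators can be sketched in a few lines. The example below assumes that the per-trajectory coverage (the fraction of each ground-truth trajectory that was successfully tracked) and the per-threshold detection and association accuracies have already been computed elsewhere; it uses boundaries of at least 80% for MT and at most 20% for ML, and averages the per-threshold geometric means to obtain a HOTA-style score. The function names and all input values are hypothetical.

```python
import math


def mt_ml(coverage, mt_thresh=0.8, ml_thresh=0.2):
    """Return (MT, ML) as fractions of ground-truth trajectories.

    coverage: for each ground-truth trajectory, the fraction of its
    frames in which it was successfully tracked.
    """
    n = len(coverage)
    mt = sum(1 for c in coverage if c >= mt_thresh) / n
    ml = sum(1 for c in coverage if c <= ml_thresh) / n
    return mt, ml


def hota(det_acc, ass_acc):
    """Average over thresholds of sqrt(DetA * AssA).

    det_acc, ass_acc: detection and association accuracy, one value
    per localization threshold, in the same order.
    """
    per_threshold = [math.sqrt(d * a) for d, a in zip(det_acc, ass_acc)]
    return sum(per_threshold) / len(per_threshold)


# Five hypothetical ground-truth trajectories.
coverage = [0.95, 0.85, 0.50, 0.10, 0.30]
mt, ml = mt_ml(coverage)

# Two hypothetical localization thresholds.
hota_score = hota(det_acc=[0.64, 0.49], ass_acc=[0.81, 0.64])

print(f"MT = {mt:.0%}, ML = {ml:.0%}, HOTA = {hota_score:.2f}")
```

Because the geometric mean is taken per threshold, a tracker cannot obtain a high HOTA-style score by being strong in detection alone or in association alone.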

07 Model comparison and analysis

This section selects multiple algorithms and evaluates their performance on the MOT17 and MOT20 datasets. The performance indicators of each algorithm in the MOT17 and MOT20 data sets are listed in Table 3. The performance evaluation data of each algorithm are provided by relevant literature, where the bold font indicates the optimal value of the indicator, and the underline indicates the suboptimal value of the indicator.

Table 3 Performance evaluation results of algorithms in MOT17 and MOT20 datasets

The SDE paradigm designs dedicated algorithms for the two tasks of feature extraction and data association, so it usually has better tracking robustness, and most SDE-based algorithms have a low IDs value.

Executing the three tasks separately avoids conflicts within the model and gives the SDE paradigm a high performance ceiling. For example, TPAGT [16] achieves 76.2% MOTA on the MOT17 dataset. However, the tracking performance of SDE-based algorithms depends on detection performance, and unsatisfactory detection results such as missed, false and noisy detections often lead to a significant decline in tracking performance.

We selected the representative algorithms SORT [14], DAN [21] and MOTDT [23] from motion-based, appearance-based, and combined motion- and appearance-based methods, respectively. Figure 4 shows the MOTA of the three algorithms on the MOT17 dataset when Faster R-CNN (FRCNN), Mask R-CNN (MASK), YOLOv3, DPM and SDP are used as the detection algorithm. The MOTA changes significantly with the detection results.

When SORT and DAN use the detection results of SDP, the MOTA reaches 56.8% and 58.5% respectively, while when the detection results of DPM are used, the MOTA drops to 24.9% and 15.7%, respectively. When MOTDT adopts SDP as the detection algorithm, MOTA reaches 57.6%, while it drops to 30.2% when YOLOv3 is used.
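The dependence on detection quality can be illustrated with a minimal association step of the kind used by motion-based SDE trackers. The sketch below matches existing tracks to the detections of the current frame by IoU. It is a simplification under stated assumptions: it uses greedy matching rather than the Hungarian algorithm, compares against the last known box rather than a motion-model prediction, and the names `iou` and `greedy_associate`, the threshold of 0.3 and all boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_associate(tracks, detections, iou_thresh=0.3):
    """Match tracks to detections in order of descending IoU.

    tracks: dict mapping track id -> last known box.
    detections: list of boxes in the current frame.
    Returns (matches, unmatched_track_ids, unmatched_detection_indices),
    where matches maps track id -> detection index.
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, j in pairs:
        if score < iou_thresh:
            break  # all remaining pairs overlap too little
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    unmatched_tracks = [t for t in tracks if t not in matches]
    unmatched_dets = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched_tracks, unmatched_dets


tracks = {1: (10, 10, 50, 90), 2: (100, 20, 140, 100)}
good_dets = [(102, 22, 142, 102), (12, 11, 52, 91)]
poor_dets = [(12, 11, 52, 91)]  # the detector missed target 2

print(greedy_associate(tracks, good_dets))
print(greedy_associate(tracks, poor_dets))
```

With the complete detections both tracks are continued, while with the missed detection track 2 is left unmatched. The association step has no way to recover a target the detector did not report, which is why detector quality weighs so heavily on the MOTA of SDE-based trackers.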

In addition, SDE can only optimize the three tasks of object detection, feature extraction and data association separately and cannot globally optimize the model through backpropagation.

At the same time, SDE models are complex and computationally intensive, so SDE-based algorithms have low tracking speeds. The tracking speeds of CRF_CNN [12], TPAGT [16], DAN [21] and MOTDT [23] on the MOT17 dataset are 1.4 FPS, 6.8 FPS, 3.9 FPS and 6.3 FPS, respectively, which makes it difficult to meet the requirements of real-time tracking.

Fig.4 Accuracy of tracking algorithm under different detection results

By sharing features between the two most computationally expensive sub-modules, object detection and appearance feature extraction, the JDE paradigm requires little computation and few parameters, so it has become a commonly used tracking approach in industry.

However, early JDE-based algorithms have no obvious advantage in accuracy. For example, JDE864 [38] has a tracking speed of 30.3 FPS on the MOT16 dataset with a MOTA of 62.1%. In subsequent optimization of the JDE paradigm, using an anchor-free detection network is a direct and effective strategy. FairMOT [25] achieved 73.7% and 61.8% MOTA on the MOT17 and MOT20 datasets, respectively, with tracking speeds of 25.9 FPS and 13.2 FPS.

Alleviating the conflict between object detection and ReID within the network can bring considerable gains in tracking accuracy. For example, CSTrack [27] achieved 74.9% and 66.6% MOTA on the MOT17 and MOT20 datasets, respectively, and OMC [51] reached 76.3% MOTA on the MOT17 dataset.

However, this strategy has some impact on tracking speed. On the MOT17 dataset, the tracking speeds of CSTrack and OMC are 15.8 FPS and 12.8 FPS, respectively. Designing an attention mechanism helps improve the performance of the network in a specific direction. For example, RelationTrack [33] has an IDs value of 1374 on the MOT17 dataset, the best result among all algorithms, while its MOTA reaches 73.8% on MOT17 and 67.2% on MOT20.

In addition, different attention networks add different amounts of computation to the model, and complex attention networks often lead to a serious decrease in model speed. Tracking speeds dropped to 6.6 FPS and 4.3 FPS, respectively.

The above analysis shows that improvements in the tracking accuracy and robustness of the JDE paradigm often come at the cost of tracking speed. In future research, simultaneously optimizing the tracking accuracy and tracking speed of JDE-based algorithms remains both a focus and a difficulty.

JDT is the current research trend; its structure is simple and clear, and its performance is superior. Since the JDT method completes the three subtasks in a single network at the same time, most JDT algorithms can be trained end to end and globally optimized through backpropagation, so they usually have a higher MOTA. For example, CorrTracker [41] and GSDT [43] achieved MOTA of 76.5% and 73.2% on MOT17, respectively.

In addition, algorithms based on Siamese networks can process multiple video frames at the same time and make full use of spatio-temporal information, so most of them produce fewer false detections; for example, CTracker [34], CenterTrack [38] and TraDeS [40] all perform well on the FP indicator.

In addition, the Transformer has been successfully applied to MOT tasks [44-46], and Transformer-based tracking algorithms have achieved outstanding results on multiple indicators. For example, TransTrack [72] reaches a MOTA of 75.2% and a speed of 16.9 FPS on MOT17, while also achieving the best MT and IDF values. TransMOT [45] achieves the best MOTA, MT, IDF and IDs values on MOT20.

Although many JDT-based algorithms have achieved excellent tracking performance, some problems remain to be solved. First, most algorithms based on Siamese networks are not significantly better than other algorithms in IDs. How to robustly maintain the ID of each target during long-term tracking is still a research difficulty. Transformer-based methods can maintain strong tracking accuracy and robustness in complex scenes [94] and still have great research potential and room for development. However, most current Transformer-based algorithms track slowly, which makes it difficult to meet the requirements of practical applications. In addition, current Transformer-based MOT algorithms are computationally expensive, so they place high demands on hardware, and multiple high-performance GPUs are usually required to train the network.

08  Conclusion

MOT is widely used in fields such as intelligent monitoring and human-computer interaction. This paper first introduces the principle of MOT and the challenges in the tracking process. It then divides the MOT algorithms of recent years into three categories according to the tracking paradigm used to complete the three sub-tasks, gives a detailed overview of each, and discusses the advantages and disadvantages of each type of algorithm.

In recent years, the MOT technology based on deep learning has developed rapidly, and the tracking performance of the model has been significantly improved. At present, more and more technologies have been applied to the MOT task, but there are still many research directions worth exploring.

(1) Unsupervised MOT: Most current MOT algorithms are based on supervised learning. However, annotating an MOT dataset requires finding the same target across different images frame by frame, which is very costly in time and money. Designing MOT algorithms based on unsupervised learning [95-96] can help reduce the overhead of manual annotation. However, the unsupervised MOT task is very challenging due to the lack of prior knowledge about the tracked targets.

(2) Interaction between objects: Modeling the interactions between multiple objects can enhance the model's ability to discriminate each object in a crowded scene. However, current algorithms have explored the interactions between objects only to a limited extent. In future research, Transformers or graph neural networks [97-99] can be used to model these interactions, so as to further improve the tracking robustness of MOT algorithms in extremely crowded scenes such as subway stations during peak hours and tourist attractions during holidays.

(3) Tracking facilitates detection: The tracking performance of current MOT algorithms relies on detection algorithms. However, current MOT algorithms usually run the detection algorithm in isolation and do not exploit information about the target from previous moments. Making full use of the spatio-temporal information of the target and transferring its past motion and appearance features to the current frame helps improve tracking performance in tasks with heavy occlusion and motion blur, such as traffic vehicle tracking and the analysis of athlete behavior on the field.


Origin blog.csdn.net/weixin_40359938/article/details/130544139