Multi-target Tracking Review & Future Research Direction

Future Discussion on Multi-Target Tracking

0. Preface

Multi-target tracking (MOT) means identifying and tracking several targets in a video and assigning an ID to each one. It performs data association on top of detection, that is, it decides which detections in two consecutive frames belong to the same target object.

First, the video is split into frames, which are fed into the multi-target tracking algorithm. Suppose we have reached frame t. A feature extraction algorithm obtains the features of each target detected in the current frame; these can be appearance features or motion features. The algorithm then computes the similarity between the features of each detection and the features of the objects tracked in the previous frame, t-1, and performs data association to obtain the final tracking result.
Reference paper: https://arxiv.org/pdf/2006.13164.pdf

The input to multi-target tracking is a video in which multiple targets move continuously from the first frame to the last. The goal is to distinguish each target from the others by assigning it an ID and recording its trajectory. Target tracking differs from target detection: detection alone cannot assign an ID to each object and its results are unstable from frame to frame, whereas tracking can optimize over the entire process, assigning IDs and following trajectories.
The main steps of MOT:

  1. Detection - given the original video frames, run a target detector such as Faster R-CNN or YOLOv3 to obtain the detection boxes;
  2. Feature extraction and motion prediction - crop the targets from all detection boxes and extract their features (appearance features or motion features);
  3. Similarity calculation - compute the matching degree between the targets in two consecutive frames (the distance between detections of the same target is relatively small, and the distance between different targets is relatively large);
  4. Data association - assign a target ID to each object.
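
As a rough illustration of the similarity calculation step, the sketch below combines a motion cue (box IoU) with an appearance cue (cosine similarity of feature vectors). The dictionary keys `box` and `feat`, the function names, and the weight `w_motion` are assumptions made for this example; they do not come from any specific tracker.

```python
import math

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def cosine_similarity(u, v):
    """Cosine similarity of two appearance feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def similarity_matrix(tracks, detections, w_motion=0.5):
    """Rows are tracks from frame t-1, columns are detections in frame t.

    Each item is a dict with a 'box' and a 'feat' entry (hypothetical layout).
    w_motion is an assumed weight between the IoU and appearance terms.
    """
    return [
        [
            w_motion * iou(t["box"], d["box"])
            + (1.0 - w_motion) * cosine_similarity(t["feat"], d["feat"])
            for d in detections
        ]
        for t in tracks
    ]
```

A high entry in the matrix suggests that the track and the detection belong to the same target; the data association step then picks the pairs.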

1. Classification of multi-target tracking algorithms:

Over the past five or six years, as the performance of target detection improved by leaps and bounds, detection-based tracking emerged and quickly became the mainstream framework for multi-target tracking, greatly advancing MOT. More recently, frameworks based on joint detection and tracking and frameworks based on attention mechanisms have also emerged and begun to attract researchers' attention.

1. MOT based on Tracking-by-detection

An MOT algorithm based on the tracking-by-detection framework first runs target detection on each frame of the video sequence, then crops the targets according to their bounding boxes to obtain all targets in the image. Tracking then becomes a target association problem between two consecutive frames: a similarity matrix is constructed from IoU, appearance, and other cues, and solved with the Hungarian algorithm, a greedy algorithm, or similar methods.
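
To make the association step concrete, here is a minimal sketch of the greedy variant: it repeatedly takes the highest remaining similarity, matches that track to that detection, and gives every unmatched detection a new ID. The function name, its parameters, and the `min_similarity` threshold are assumptions for illustration and are not taken from SORT or DeepSORT.

```python
def greedy_associate(similarity, track_ids, num_detections, next_id,
                     min_similarity=0.3):
    """Greedy data association on a similarity matrix.

    similarity[i][j] scores track i against detection j (higher is better).
    Returns (ids, next_id): the ID assigned to each detection and the next
    unused ID. min_similarity is an assumed cut-off for accepting a match.
    """
    # All (score, track index, detection index) triples, best score first.
    pairs = sorted(
        ((similarity[i][j], i, j)
         for i in range(len(track_ids))
         for j in range(num_detections)),
        reverse=True,
    )
    ids = [None] * num_detections
    used_tracks = set()
    for score, i, j in pairs:
        if score < min_similarity:
            break  # every remaining pair is too dissimilar
        if i in used_tracks or ids[j] is not None:
            continue  # track or detection is already matched
        ids[j] = track_ids[i]
        used_tracks.add(i)
    # Unmatched detections start new trajectories with fresh IDs.
    for j in range(num_detections):
        if ids[j] is None:
            ids[j] = next_id
            next_id += 1
    return ids, next_id

# Two existing tracks (IDs 7 and 8) and three detections in the new frame.
sim = [[0.9, 0.1, 0.0],
       [0.2, 0.8, 0.05]]
ids, next_id = greedy_associate(sim, [7, 8], 3, next_id=9)
```

Greedy matching is simple but not guaranteed to maximize the total similarity. The Hungarian algorithm solves the same assignment problem optimally, for example through `scipy.optimize.linear_sum_assignment` applied to a cost matrix such as one minus the similarity.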

Representative methods:

SORT (reference code: abewley/sort on GitHub), DeepSORT (reference code: https://github.com/nwojke/deep_sort)

Because DeepSORT adds a CNN to SORT for feature extraction (the ReID part), it runs slower and consumes more resources. If you are tracking small targets rather than people, SORT is the suggested choice.

2. MOT based on joint detection and tracking

 Representative methods: JDE, FairMOT, CenterTrack, ChainedTracker, etc.

3. MOT based on attention mechanism

With the application of the Transformer and other attention mechanisms in computer vision, researchers have recently proposed multi-target tracking frameworks based on attention. The main ones at present are TransTrack and TrackFormer, both of which apply the Transformer to MOT. TransTrack uses the feature map of the current frame as the Key, and uses both the target feature queries from the previous frame and a set of target feature queries learned from the current frame as the input Query of the whole network.
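
The Query/Key idea can be illustrated with plain scaled dot-product attention. The sketch below is generic attention on toy vectors, not the TransTrack architecture: the variable names and numbers are made up, and the frame features are used as both keys and values for simplicity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs

# Hypothetical toy inputs: one query carried over from the previous frame,
# one learned query for new targets, and three feature-map locations.
track_queries = [[1.0, 0.0]]
object_queries = [[0.0, 1.0]]
frame_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

queries = track_queries + object_queries
out = attention(queries, frame_features, frame_features)
```

Each query attends most strongly to the feature-map locations that resemble it, which is how a query from the previous frame can locate the same target in the current frame.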

Representative methods: TransTrack, TrackFormer, etc. 

2. Research Difficulties

Object tracking is a long-standing research direction, but earlier work focused mainly on single-object tracking; only in recent years has multi-object tracking received close attention from researchers. Compared with other computer vision tasks, multi-target tracking has the following main research difficulties:

1) Data sets are lacking and labeling is difficult;

2) Target detection is not accurate enough;

3) Frequent target occlusion;

4) The number of targets is uncertain;

5) The speed is slow and the real-time performance is not enough.

3. Data sets

  1. MOT20 dataset (introduction)
  2. Processing results
  3. KITTI dataset (introduction and Baidu Netdisk sharing link, from a CSDN blog post)

4. Evaluation indicators

After continuous refinement, a dedicated set of evaluation metrics for multi-target tracking has taken shape [63-64]. The specific definitions are as follows:

1) FP: False Positive, that is, there is no target in the ground truth, but the tracking algorithm reports one.
2) FN: False Negative, that is, a target exists in the ground truth, but the tracking algorithm misses it.
3) IDS: ID Switch, the number of times a target's ID changes.
4) MOTA: Multiple Object Tracking Accuracy, which combines FP, FN, and IDS into a single score.

5) IDF1: ID F1 score, the ratio of correctly identified detections to the average of the number of ground-truth detections and the number of computed detections.

6) MT: Mostly Tracked, the number of trajectories for which the target is successfully tracked for at least 80% of its total track length.

7) ML: Mostly Lost, the number of trajectories for which the target is successfully tracked for at most 20% of its total track length.

8) MOTP: Multiple Object Tracking Precision, which indicates the degree of overlap between the predicted detection boxes and the ground-truth boxes.

9) FPS: Frames Per Second, the number of frames processed per second.
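
For reference, the metrics above can be computed roughly as follows. This is a sketch based on the commonly used definitions (MOTA = 1 - (FN + FP + IDS) / GT, IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN)); the function names and argument layout are assumptions, and the MOTP version assumes overlap-based (IoU) scoring. A real evaluation should use an established evaluation toolkit.

```python
def mota(num_fn, num_fp, num_ids, num_gt):
    """MOTA = 1 - (FN + FP + IDS) / GT, with counts summed over all frames."""
    return 1.0 - (num_fn + num_fp + num_ids) / num_gt

def motp(overlaps):
    """Mean overlap (for example IoU) over all matched box pairs."""
    return sum(overlaps) / len(overlaps) if overlaps else 0.0

def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0

def mt_ml(tracked_ratios, mt_thresh=0.8, ml_thresh=0.2):
    """Count Mostly Tracked and Mostly Lost trajectories.

    tracked_ratios holds, for each ground-truth trajectory, the fraction
    of its length during which the target was successfully tracked.
    """
    mt = sum(1 for r in tracked_ratios if r >= mt_thresh)
    ml = sum(1 for r in tracked_ratios if r <= ml_thresh)
    return mt, ml

def fps(num_frames, seconds):
    """Frames processed per second."""
    return num_frames / seconds
```

With 100 ground-truth boxes, 10 misses, 5 false positives, and 5 ID switches, `mota(10, 5, 5, 100)` gives 0.8. Note that MOTA can become negative when the number of errors exceeds the number of ground-truth boxes.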

5. Summary

(1) Multi-category multi-target tracking, namely MCOT

MCOT stands for Multiple Classes Object Tracking. Most current MOT algorithms, including my own previous work, still track only a single target type such as "car" or "pedestrian". In practical applications, however, it is usually necessary to track several categories of targets at once. In follow-up research, those working on MOT can therefore consider tracking multiple categories and multiple targets at the same time.

(2) Multi-camera multi-target tracking

Multi-target tracking across multiple cameras is clearly more challenging. There are two multi-camera (cross-camera) settings. In the first, several cameras shoot the same scene from multiple viewing angles, so the data from the different cameras must be fused. In the second, each camera records a different scene, forming a non-overlapping multi-camera network; data association across cameras then becomes a target re-identification (ReID) problem, which further increases the difficulty of the research.

(3) Multi-target tracking combined with other computer vision tasks

Some researchers have started to combine MOT with other computer vision tasks, and their experimental results show that these tasks can benefit from each other. Combinations to consider include, but are not limited to, target segmentation, pedestrian re-identification, human pose estimation, and action recognition. For example, an object segmentation branch can provide background information and scene structure, which may be very helpful for the MOT problem.

(4) Improvements to current schemes, such as new loss functions, new network designs, and the Transformer or other attention mechanisms

This is also the direction most people are currently working on, and it includes lightweight designs that ease deployment on embedded devices. Work on data labeling and generation is also very valuable, because labeling data for target tracking is time-consuming and labor-intensive. There is a great deal that can be done here; due to limits of space and ability, I will not go into detail. Additions are welcome.

Reference: http://zhuanlan.zhihu.com/p/388721763

Origin blog.csdn.net/weixin_64043217/article/details/128650926