YOLOv5 + ByteTrack object tracking: better results than DeepSORT

Table of contents

1. Motivation

2. BYTE

3. ByteTrack

4. Specific code

5. Complete code implementation + UI interface


ByteTrack: Multi-Object Tracking by Associating Every Detection Box

Following the tracking-by-detection paradigm in multi-object tracking (MOT), we propose BYTE, a simple and efficient data association method. Using the similarity between detection boxes and existing tracks, it keeps the high-scoring detections and, among the low-scoring ones, filters out background while recovering real objects (hard samples such as occluded or blurred objects), thereby reducing missed detections and improving trajectory continuity. BYTE can be easily applied to 9 state-of-the-art MOT methods and improves their IDF1 by 1 to 10 points. Based on BYTE, we propose the tracker ByteTrack, which is the first to achieve 80.3 MOTA, 77.3 IDF1 and 63.1 HOTA on MOT17 at a running speed of 30 FPS, ranking first on the MOTChallenge leaderboard at the time of writing. The open-source code also includes tutorials on applying BYTE to different MOT methods, as well as ByteTrack deployment code.

Paper: http://arxiv.org/abs/2110.06864

Figure: MOTA (vertical axis) versus FPS (horizontal axis); the radius of each circle represents the relative size of IDF1.


1. Motivation

Tracking-by-detection is a classic and efficient paradigm in MOT: it uses similarity (position, appearance, motion and other cues) to associate detection boxes across frames and obtain trajectories. Because real video scenes are complex, no detector produces perfect results. To handle the true positive/false positive trade-off, most current MOT methods choose a score threshold, keep only the detections above it for association, and directly discard everything below it. But is this reasonable? The answer is no. As Hegel is often quoted: "What exists is reasonable." Low-scoring detection boxes often still indicate that an object is present (for example, a severely occluded one). Simply discarding them introduces irreversible errors into MOT, including a large number of missed detections and broken trajectories, which lowers overall tracking performance.

2. BYTE

To fix the problem of previous methods discarding low-scoring detection boxes, we propose BYTE, a simple, efficient, and general data association method (each detection box is a basic unit of a tracklet, like a byte in a computer program). Directly associating low-scoring and high-scoring boxes with the tracks all at once is clearly not advisable, because it would bring in a lot of background (false positives). Instead, BYTE processes high-scoring and low-scoring boxes separately, and uses the similarity between low-scoring boxes and existing tracks to recover real objects from the low-scoring boxes while filtering out the background. The entire process is as follows:

(1) BYTE splits the detection boxes into two groups by score, high-scoring boxes and low-scoring boxes, and performs two rounds of matching in total.

(2) In the first round, the high-scoring boxes are matched against the existing tracks.

(3) In the second round, the low-scoring boxes are matched against the tracks that were left unmatched in the first round (for example, an object that is severely occluded in the current frame, so its score has dropped).

(4) For each detection box that matches no track and has a high enough score, a new track is created. Tracks that match no detection box are kept for 30 frames and are matched again if the object reappears.

We believe BYTE works because occlusion is usually accompanied by a gradual drop in detection score from high to low: before being occluded, the object is fully visible and its high detection score establishes a track; once it becomes occluded, the positional overlap between its detection box and the track lets us recover the occluded object from the low-scoring boxes and keep the track continuous.
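The four steps above can be sketched in plain Python as follows. This is only an illustrative sketch, not the official ByteTrack code: the function names (`iou`, `greedy_match`, `byte_associate`) and all threshold values are assumptions chosen for the example, and a simple greedy IoU matcher stands in for the Hungarian algorithm to keep the code dependency-free.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def greedy_match(track_boxes, det_boxes, iou_thresh):
    """Greedy IoU matching (a simplification, NOT the Hungarian algorithm).

    Returns (matches, unmatched_track_indices, unmatched_det_indices).
    """
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(track_boxes)
         for di, d in enumerate(det_boxes)),
        reverse=True,
    )
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < iou_thresh:
            break  # remaining pairs overlap even less
        if ti in used_t or di in used_d:
            continue
        matches.append((ti, di))
        used_t.add(ti)
        used_d.add(di)
    unmatched_t = [i for i in range(len(track_boxes)) if i not in used_t]
    unmatched_d = [i for i in range(len(det_boxes)) if i not in used_d]
    return matches, unmatched_t, unmatched_d


def byte_associate(tracks, detections, high_thresh=0.5, low_thresh=0.1,
                   new_track_thresh=0.6, iou_thresh=0.3):
    """tracks: predicted boxes; detections: list of (box, score).

    All thresholds are illustrative assumptions.
    Returns (matches, lost_track_indices, new_track_detections).
    """
    # Step 1: split detections by score; anything below low_thresh is dropped.
    high = [d for d in detections if d[1] >= high_thresh]
    low = [d for d in detections if low_thresh <= d[1] < high_thresh]

    # Step 2: first association, high-scoring boxes vs. all tracks.
    m1, rem_tracks, unmatched_high = greedy_match(
        tracks, [d[0] for d in high], iou_thresh)
    matches = [(ti, high[di]) for ti, di in m1]

    # Step 3: second association, low-scoring boxes vs. leftover tracks.
    rem_boxes = [tracks[i] for i in rem_tracks]
    m2, still_unmatched, _ = greedy_match(
        rem_boxes, [d[0] for d in low], iou_thresh)
    matches += [(rem_tracks[ti], low[di]) for ti, di in m2]

    # Step 4: leftover tracks become "lost" (the caller keeps them for a
    # while, e.g. 30 frames); confident unmatched detections start new tracks.
    lost = [rem_tracks[i] for i in still_unmatched]
    new = [high[di] for di in unmatched_high if high[di][1] >= new_track_thresh]
    return matches, lost, new
```

Note that unmatched low-scoring boxes are simply dropped as background, and only the caller decides how long a lost track is retained before it is deleted.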

3. ByteTrack

ByteTrack uses the high-performance detector YOLOX to obtain detections. For data association it follows SORT: only a Kalman filter is used to predict, for each existing track, where it will be in the next frame. The IoU between the predicted box and each actual detection box serves as the similarity for matching, and the matching itself is solved with the Hungarian algorithm. Note that we do not use ReID features to compute appearance similarity:

(1) First, this keeps the tracker as simple and fast as possible. Second, we found that when the detections are good enough, the Kalman filter's predictions are accurate enough to replace ReID for long-term association between objects. In our experiments, adding ReID did not improve the tracking results.

(2) If you do need ReID features to compute appearance similarity, see the tutorial in our open-source code on applying BYTE to joint-detection-and-embedding methods such as JDE and FairMOT.

(3) The fundamental reason ByteTrack can achieve high performance on MOT17 and MOT20 with only a motion model and no appearance similarity is that the motion patterns in the MOT datasets are relatively simple.
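To make the motion model more concrete, here is a minimal sketch of a constant-velocity Kalman filter for a single box coordinate (for example, the box centre x). It is a simplified illustration, not ByteTrack's actual filter: a real tracker filters the whole bounding-box state jointly, and the class name `ConstantVelocityKalman1D` and every noise value below are assumptions made for this example.

```python
class ConstantVelocityKalman1D:
    """Kalman filter over the state [position, velocity] of one coordinate."""

    def __init__(self, position, pos_var=1.0, vel_var=100.0,
                 process_var=0.01, measurement_var=1.0):
        self.p = float(position)   # estimated position
        self.v = 0.0               # estimated velocity (unknown at start)
        # State covariance [[P00, P01], [P10, P11]]; large velocity variance
        # because the initial velocity is only a guess.
        self.P = [[pos_var, 0.0], [0.0, vel_var]]
        self.q = process_var       # process noise added to each diagonal term
        self.r = measurement_var   # measurement noise variance

    def predict(self, dt=1.0):
        """Propagate one step: p <- p + v*dt and P <- F P F^T + Q."""
        (a, b), (c, d) = self.P
        self.p += self.v * dt
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.p

    def update(self, z):
        """Correct the state with a measured position z (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r             # innovation variance
        k0, k1 = a / s, c / s      # Kalman gain
        y = z - self.p             # innovation (measurement residual)
        self.p += k0 * y
        self.v += k1 * y
        self.P = [
            [(1 - k0) * a, (1 - k0) * b],
            [c - k1 * a, d - k1 * b],
        ]


# Usage: an object moving 2 pixels per frame.
kf = ConstantVelocityKalman1D(position=0.0)
for z in [2.0, 4.0, 6.0, 8.0, 10.0]:
    kf.predict()
    kf.update(z)
next_position = kf.predict()  # predicted position for the next frame
```

In a tracker, the predicted coordinates are assembled back into a box, and the IoU between that predicted box and each detection box gives the similarity used for matching.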

4. Specific code

UI design

History records

5. Complete code implementation + UI interface

The videos, notes, and commented code have all been uploaded to a network drive; the link is in the pinned post on my homepage.

To suit the needs of object tracking, the following features have been added to the source code:

1: Output from each camera is saved to its own auto-incrementing folder, and a snapshot can be captured at a configurable interval of a few seconds

2: Multiple cameras can be opened at once, and each one tracks its own video stream independently

3: Real-time statistics on the number of tracked targets

4: Dwell (loitering) detection for any target class with well-trained weights (pedestrians, animals, vehicles, drones, etc.): a warning is issued once a target has stayed longer than a user-defined number of seconds

5: Region restriction: targets are confined to a specified area, and an alarm is raised if a target leaves it
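As a rough illustration of how features 3 to 5 could be built on top of tracker output, the sketch below counts targets, flags targets that have been tracked longer than a dwell limit, and flags targets outside an allowed polygon. It is not the code from the download: `point_in_polygon`, `TrackMonitor`, the bottom-centre anchor point, and the input format (a dict mapping track ID to box) are all assumptions made for this example.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


class TrackMonitor:
    """Counts targets and raises dwell-time and out-of-zone alerts."""

    def __init__(self, allowed_zone, max_dwell_seconds):
        self.allowed_zone = allowed_zone    # polygon as list of (x, y)
        self.max_dwell = max_dwell_seconds  # user-defined dwell limit
        self.first_seen = {}                # track_id -> first timestamp

    def update(self, tracks, now):
        """tracks: dict of track_id -> (x1, y1, x2, y2); now: seconds.

        Returns (target_count, dwell_alert_ids, zone_alert_ids).
        """
        dwell_alerts, zone_alerts = [], []
        for tid, (x1, y1, x2, y2) in tracks.items():
            # Dwell time is measured from the first frame the ID was seen.
            first = self.first_seen.setdefault(tid, now)
            if now - first >= self.max_dwell:
                dwell_alerts.append(tid)
            # Use the bottom-centre of the box as the target's position.
            anchor = ((x1 + x2) / 2, y2)
            if not point_in_polygon(anchor, self.allowed_zone):
                zone_alerts.append(tid)
        # Forget IDs that are no longer being tracked.
        for tid in list(self.first_seen):
            if tid not in tracks:
                del self.first_seen[tid]
        return len(tracks), dwell_alerts, zone_alerts
```

In a live pipeline, `now` would typically come from `time.monotonic()` or the video timestamp, and `update` would be called once per frame with the tracker's current output; one monitor instance per camera keeps the streams independent.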

Origin blog.csdn.net/m0_56175815/article/details/131743004