Multi-object tracking - [two-stage] ByteTrack: Multi-Object Tracking by Associating Every Detection Box

Paper link: ByteTrack: Multi-Object Tracking by Associating Every Detection Box
Extraction code: tz60
Open source code: https://github.com/ifzhang/ByteTrack
MOT17 dataset link
Extraction code: qqzd

Article focus

  1. This article follows the tracking-by-detection paradigm of multi-object tracking (MOT): object detection is performed first, and then data association links the detection results across frames into trajectories, completing the multi-object tracking task.
  2. [Motivation] This article focuses on the fact that, during data association, occluded objects or objects affected by motion blur receive low detection scores and are therefore filtered out (the detection threshold is usually 0.6, and detection boxes below this threshold are treated as false detections of the background). As shown in the figure below, frames $t_1$, $t_2$, $t_3$ contain false positives with a confidence of 0.1, i.e. background. There is also an object (additionally marked with a green box) whose confidence is high in frames $t_1$ and $t_2$ but drops to 0.1 in frame $t_3$ because it is occluded.
    [Figure: detection boxes and confidence scores across frames $t_1$-$t_3$]
  3. For the problem described in [Motivation], the article's solution, which is also its main contribution, is a two-stage data association method that treats each detection box to be matched as a basic unit (just like a byte in a computer, hence the name BYTE) for trajectory matching.
    • First stage of data association: detection boxes above the threshold (confidence ≥ 0.6) are matched to trajectories first;
    • Second stage of data association: trajectories left unmatched after the first stage are matched against detection boxes whose confidence is below the threshold (0.6). After this stage, any still-unmatched detection boxes with confidence below the threshold (0.6) are treated as background and deleted. To handle long-term tracking, where a target disappears and later re-enters the field of view, unmatched trajectories are kept for a period of time (30 frames).
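The confidence split that drives the two stages can be sketched in a few lines of Python. This is only an illustration, not the real implementation; detections are assumed to be (box, score) tuples, and the 0.6 threshold follows the text:

```python
# Split detections into high- and low-confidence sets, as in BYTE's
# two-stage association (threshold 0.6 per the text; names illustrative).
def split_detections(detections, tau=0.6):
    """detections: list of (box, score) pairs; returns (high, low) lists."""
    high = [d for d in detections if d[1] >= tau]
    low = [d for d in detections if d[1] < tau]
    return high, low

dets = [((0, 0, 10, 10), 0.9),    # confident detection
        ((5, 5, 15, 15), 0.1),    # occluded / blurred object
        ((20, 20, 30, 30), 0.7)]
high, low = split_detections(dets)
# high boxes enter the first association stage; low boxes only the second
```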

Tracking Framework Pseudocode

[Figure: BYTE tracking-framework pseudocode from the paper]

  1. Input to the algorithm BYTE: a video sequence $V$, an object detector $Det$, and a detection threshold $\tau$.

  2. For each frame of the input video, perform the following loop:

    • Apply $Det$ to the frame to detect targets, producing bounding boxes and confidences for the potential targets, which form the detection set $D_k$. According to whether each detection's confidence exceeds the threshold $\tau$, $D_k$ is subdivided into a high-confidence set $D_{high}$ and a low-confidence set $D_{low}$.
    • Use the Kalman filter algorithm to predict, for each trajectory in the trajectory set $\Gamma$, the new bounding-box location of the tracked object.
    • Perform the first stage of trajectory association: match the high-confidence set $D_{high}$ against the trajectory set $\Gamma$. The similarity between $D_{high}$ and $\Gamma$ is computed from IoU or Re-ID features, and the Hungarian algorithm completes the matching. The detection boxes and trajectories left unmatched are recorded as $D_{remain}$ and $\Gamma_{remain}$.
    • After the first stage of trajectory association ends, each high-confidence bounding box that is still unmatched is initialized as a new trajectory.
    • Perform the second stage of trajectory association: match the low-confidence set $D_{low}$ against the unmatched trajectories $\Gamma_{remain}$. The authors found that in this stage it is best to use only IoU matching, because low-confidence detections are usually occluded objects, whose appearance features, i.e. Re-ID features, are unreliable.
    • At this point the matching has ended, and unmatched low-confidence detection boxes are treated as background and removed. Considering long-term tracking, where a target may reappear after disappearing, the trajectories $\Gamma_{re\text{-}remain}$ that remain unmatched after both stages are moved to a lost-target set $\Gamma_{lost}$, which is kept for 30 frames and discarded afterwards.
  3. Output of the algorithm BYTE: the trajectory set $\Gamma$ of the targets in the video, i.e., for each track, the object's detection box and its ID in every frame.
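The loop above can be sketched in Python. This is a simplified illustration under stated assumptions, not the real implementation: greedy IoU matching stands in for the Hungarian algorithm, Kalman prediction is omitted (a track is represented by its last box), and the 30-frame lost-track bookkeeping is reduced to returning the lost set:

```python
# Simplified sketch of one frame of BYTE's two-stage association.
# Illustrative only: greedy IoU matching replaces the Hungarian algorithm,
# and Kalman prediction is omitted.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def associate(tracks, boxes, iou_thresh=0.3):
    """Greedy IoU matching; returns (matches, unmatched_tracks, unmatched_boxes)."""
    pairs = sorted(((iou(t["box"], b), ti, bi)
                    for ti, t in enumerate(tracks)
                    for bi, b in enumerate(boxes)), reverse=True)
    used_t, used_b, matches = set(), set(), []
    for s, ti, bi in pairs:
        if s < iou_thresh or ti in used_t or bi in used_b:
            continue
        used_t.add(ti); used_b.add(bi); matches.append((ti, bi))
    unmatched_t = [t for i, t in enumerate(tracks) if i not in used_t]
    unmatched_b = [b for i, b in enumerate(boxes) if i not in used_b]
    return matches, unmatched_t, unmatched_b

def byte_step(tracks, detections, tau=0.6):
    """One frame of BYTE; detections is a list of (box, score) pairs."""
    high = [b for b, s in detections if s >= tau]
    low = [b for b, s in detections if s < tau]
    # Stage 1: high-confidence boxes vs. all live tracks.
    m1, remain_tracks, remain_high = associate(tracks, high)
    for ti, bi in m1:
        tracks[ti]["box"] = high[bi]
    # Stage 2: low-confidence boxes vs. still-unmatched tracks.
    m2, lost, _ = associate(remain_tracks, low)
    for ti, bi in m2:
        remain_tracks[ti]["box"] = low[bi]
    # Unmatched high-confidence boxes start new tracks; unmatched
    # low-confidence boxes are silently discarded as background. Tracks
    # unmatched in both stages go to the lost set (kept 30 frames in the paper).
    lost_ids = {id(t) for t in lost}
    active = [t for t in tracks if id(t) not in lost_ids]
    return active + [{"box": b} for b in remain_high], lost
```

Calling `byte_step` once per frame and re-feeding the active tracks reproduces the flow of the pseudocode, minus motion prediction.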

Experiments

The detector used in ByteTrack is YOLOX, specifically the YOLOX-X model, initialized with COCO-pretrained weights.

MOT17

  • Training phase: The training data consists of MOT17, CrowdHuman, CityPersons, and ETHZ.
  • Testing phase: Only IoU is used to generate a similarity matrix. Features of Re-ID are not used.
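Building the IoU similarity matrix used on MOT17 can be sketched as follows; box format and function names are illustrative assumptions:

```python
# IoU similarity matrix between track boxes and detection boxes,
# as used on MOT17 (no Re-ID features). Boxes are (x1, y1, x2, y2).
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def iou_matrix(track_boxes, det_boxes):
    """Similarity matrix S[i][j] = IoU(track i, detection j)."""
    return [[iou(t, d) for d in det_boxes] for t in track_boxes]
```

Passing `1 - S` as a cost matrix to a linear-assignment solver then yields the matching.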

BDD100K

  • Training phase: Only BDD100K's own training set is used, with no additional data.
  • Testing phase: The ResNet-50 ImageNet classification model from UniTrack is used to extract Re-ID features and compute appearance similarity. Because BDD100K is an autonomous-driving vehicle dataset, in which individual vehicles carry relatively little distinctive appearance information and look highly similar to one another, Re-ID features are extracted to help tell them apart.
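Appearance similarity between Re-ID embeddings is typically the cosine similarity of the feature vectors. A minimal sketch, assuming the features have already been extracted (e.g. by the ResNet-50 model mentioned above); this is not the UniTrack code itself:

```python
import math

# Cosine similarity between two Re-ID feature vectors (illustrative;
# real features would come from a trained extractor such as ResNet-50).
def cosine_similarity(f1, f2):
    dot = sum(a * b for a, b in zip(f1, f2))
    n1 = math.sqrt(sum(a * a for a in f1))
    n2 = math.sqrt(sum(b * b for b in f2))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```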

Origin blog.csdn.net/qq_42312574/article/details/129005565