YOLO target tracking: Kalman filter + Hungarian matching algorithm + DeepSORT algorithm

1. Project ideas

Target tracking builds on a target detection algorithm: the targets detected in each frame of the image are the ones that get tracked. In other words, a target can only be tracked if the detection algorithm finds it in the current frame, so the tracking quality depends directly on the detection quality.

  • (1) State estimation with the Kalman filter. Each tracker (track) predicts its state at the next time step from its current state, then corrects that estimate with the result returned by the target detector. The target information detected in the first frame is used to initialize the Kalman filter's state variables (the trackers).
  • (2) Target matching with the Hungarian algorithm, as sketched below. Three kinds of cost are used: motion cost from the 8-dimensional state vector estimated by the Kalman filter, appearance cost from a deep-learning pedestrian re-identification (ReID) network, and IOU overlap cost. After combining these costs, the assignment with the smallest total cost (cf. the Tian Ji horse-racing story) is taken as the currently tracked target.
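A minimal sketch of this per-frame loop. The helper names (`detect`, tracker objects with `predict`/`update` methods, `cost_matrix`) are illustrative stand-ins, not the actual deepsort code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def track_frame(frame, trackers, detect, cost_matrix):
    """One frame of the detect -> predict -> match -> correct loop (sketch)."""
    detections = detect(frame)                      # target detection (e.g. YOLO)
    predictions = [t.predict() for t in trackers]   # Kalman prediction per track
    cost = cost_matrix(predictions, detections)     # motion / appearance / IOU cost
    rows, cols = linear_sum_assignment(cost)        # Hungarian matching
    for r, c in zip(rows, cols):
        trackers[r].update(detections[c])           # correct the estimate with the detection
    return trackers
```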


2. Detailed Algorithm

2.1. Kalman filter algorithm

Ultimate goal: solve for the weighting term (the Kalman gain K) so that the resulting optimal estimate has minimum uncertainty. The target information detected in the first frame is used to initialize the Kalman filter's state variables (the trackers).
The topic is covered in three parts: (1) car position estimation (introductory); (2) the detailed theoretical derivation (advanced); (3) worked examples (practice).

2.1.1. Car Position Estimation (Getting Started)

How the optimal value of the current state is obtained: use the optimal value of the previous state to predict an estimate of the current state, then use the current measurement to correct that estimate, yielding the optimal value of the current state.
Why neither value can be trusted on its own: (1) the sensor measurement itself is inaccurate; (2) the estimate obtained from the theoretical model also deviates from the true situation by some error.
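A minimal 1-D numerical sketch of this predict-then-correct loop for the car position (the numbers are toy values chosen for illustration, not taken from the article):

```python
# Previous optimal position estimate and its variance.
x_est, p_est = 0.0, 1.0
# Process-noise and sensor (measurement) noise variances.
q, r = 0.01, 0.25
# Assume the car moves at a constant 1 m per time step.
velocity, dt = 1.0, 1.0

for z in [1.2, 1.9, 3.1, 4.0]:            # noisy position measurements
    # Predict: propagate the previous optimal value to the current step.
    x_pred = x_est + velocity * dt
    p_pred = p_est + q
    # Correct: the Kalman gain K weighs the prediction against the measurement.
    k = p_pred / (p_pred + r)
    x_est = x_pred + k * (z - x_pred)
    p_est = (1 - k) * p_pred
    print(f"measurement={z:.2f}  optimal estimate={x_est:.2f}  gain={k:.2f}")
```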

2.1.2. Detailed theoretical derivation (advanced)

Bilibili video explanation: "From giving up to mastery! Kalman filter from theory to practice"
An illustrated, text-and-graphics explanation of the Kalman Filter
Kalman Filtering explained in detail: the full derivation process
"The most understandable explanation of the Kalman filter on the Internet": an example carried through the whole explanation
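For reference, the standard discrete Kalman filter equations that these derivations arrive at, where x̂ is the state estimate, P its covariance, F the state-transition matrix, H the measurement matrix, Q and R the process- and measurement-noise covariances, and z_k the measurement:

$$
\begin{aligned}
\text{Predict:}\quad & \hat{x}_{k|k-1} = F\,\hat{x}_{k-1|k-1}, \qquad
                       P_{k|k-1} = F\,P_{k-1|k-1}\,F^{\top} + Q \\
\text{Gain:}\quad    & K_k = P_{k|k-1}\,H^{\top}\left(H\,P_{k|k-1}\,H^{\top} + R\right)^{-1} \\
\text{Correct:}\quad & \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\left(z_k - H\,\hat{x}_{k|k-1}\right), \qquad
                       P_{k|k} = \left(I - K_k H\right)P_{k|k-1}
\end{aligned}
$$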

2.1.3. Examples

(Figure: worked example.)

2.1.4. Application: the 8 state variables considered in tracking

(Figure: the 8 state variables used for tracking.)
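For concreteness, the 8-dimensional track state used by DeepSORT is the box centre (u, v), aspect ratio γ, height h, and their velocities. A sketch of the constant-velocity transition and measurement matrices, assuming dt = 1:

```python
import numpy as np

# 8-dimensional track state: [u, v, gamma, h, du, dv, dgamma, dh]
# (box centre, aspect ratio, height, and their velocities).
dt = 1.0

F = np.eye(8)                    # constant-velocity state-transition matrix
F[:4, 4:] = dt * np.eye(4)       # position block += velocity block * dt

H = np.hstack([np.eye(4), np.zeros((4, 4))])   # the detector only measures [u, v, gamma, h]

x = np.array([100., 50., 0.5, 80., 0., 0., 0., 0.])   # example state initialized from a detection
print(F @ x)                     # predicted state for the next frame
```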

2.2. Hungarian matching algorithm

Ultimate goal: minimize the total cost (the cost matrix) of the matching (cf. the Tian Ji horse-racing story). Drawback: the result is not the best match for each individual target; instead the method tries to match as many targets as possible. Ready-made APIs for the Hungarian algorithm exist in sklearn and scipy: (1) 【sklearn】linear_assignment() and (2) 【scipy】linear_sum_assignment() (the former has been removed from recent scikit-learn releases, so the scipy function is the usual choice). In practice you only need to feed in the prepared cost matrix, as in the example below.
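A minimal example with the scipy interface (the cost values here are made up for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: cost[i, j] is the cost of assigning tracker i to detection j.
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])

row_ind, col_ind = linear_sum_assignment(cost)   # Hungarian / Kuhn-Munkres assignment
print(list(zip(row_ind, col_ind)))               # assignment pairs (tracker i -> detection j)
print(cost[row_ind, col_ind].sum())              # minimum total cost = 5
```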

(Note: although an API interface already exists, the principle of the algorithm is analysed in detail with examples below so it can be better understood. Read this part as needed.)

2.2.1. Background

Within one frame, suppose there are N targets to detect (for example, 30 people in the image). The detection algorithm returns N targets, and the Kalman filter predicts, for each of them, the corresponding target in the next frame (8 pieces of state information). How, then, should the multiple targets in the current frame be matched to the corresponding trackers (tracks) in the next frame so that the total matching cost is minimized?

  • Case 1: two people are very close to or overlapping each other, making them hard to tell apart.
  • Case 2: the current frame has 8 trackers, but the next frame has 10 targets.

2.2.2. Algorithm matching principle and detailed calculation steps

(Figure: matching principle and step-by-step calculation.)

2.2.3. Examples

(Figure: worked example.)

2.2.4. Three forms of cost matrix (motion + appearance + IOU)

The three matching methods, in the order they are applied:

  • Step 1: motion information matching. The Kalman filter provides the 8 estimated state quantities for the next frame; comparing this estimate with the actual detection results in the next frame yields the motion-information cost matrix.
  • Step 2: appearance matching. This is the core of DeepSORT.

The pedestrian re-identification (ReID) network model is used for ReID feature extraction; it works best on pedestrian features. Note: it is therefore only suitable for tracking people; for other kinds of target a custom, self-trained model is needed. Main procedure: feed the target box (bbox) into the ReID model for feature extraction, obtaining a 128-dimensional feature vector.

(1) Feature extraction (convolution through the ReID model) is applied to the target box bbox1 produced by the current frame's tracker, yielding a 128-dimensional vector that encodes the appearance of target 1. Key point: each track keeps a feature sequence (gallery) [128-T1, 128-T2, ..., 128-TM], where T is the frame time and the maximum length is M = 100 (as specified in the paper); once the budget is exceeded, the oldest frame's feature is dropped to make room for the newest. Each frame contributes one 128-dimensional vector, so the feature sequence means that each tracker stores the target information from its last M frames (the stored targets may or may not be the same object).
(2) Similarly, the same operation is applied to the N target boxes produced by detection in the next frame, giving N 128-dimensional vectors.
(3) For any target detected in the next frame, compute the cosine similarity between its 128-dimensional vector and the 128-dimensional vectors of all frame times stored in bbox1's feature sequence (not just the previous frame), build the cost matrix from these values, and finally declare the detection with the greatest similarity to be the same target as the current frame's target box.
For example, if a target cannot be detected in the current frame because it has disappeared, but reappears a few frames later, it can still be recognized and tracked, because its information is available from the previous M frames.
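A sketch of this appearance cost: for each detection, the cosine distance to the closest feature stored in each track's gallery (the array sizes are illustrative, and the real deepsort code organizes this differently):

```python
import numpy as np


def cosine_distance_matrix(track_galleries, det_features):
    """track_galleries: list of (M_i, 128) arrays, one gallery per track.
    det_features: (N, 128) array, one ReID feature per detection.
    Returns a (num_tracks, N) appearance cost matrix."""
    dets = det_features / np.linalg.norm(det_features, axis=1, keepdims=True)
    cost = np.zeros((len(track_galleries), len(det_features)))
    for i, gallery in enumerate(track_galleries):
        g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
        sim = g @ dets.T                      # cosine similarity, shape (M_i, N)
        cost[i] = 1.0 - sim.max(axis=0)       # distance to the closest stored feature
    return cost


# Toy data: 2 tracks with galleries of 5 and 3 stored features, 4 detections.
galleries = [np.random.rand(5, 128), np.random.rand(3, 128)]
detections = np.random.rand(4, 128)
print(cosine_distance_matrix(galleries, detections).shape)   # (2, 4)
```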

  • Step 3: IOU matching (of the bounding boxes, bbox). This measures the IOU overlap between the next frame's target box and the current frame's target box; 1 − IOU is used as the IOU distance (cost).
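And a sketch of the IOU term, with boxes given as (x1, y1, x2, y2); 1 − IOU is the cost fed to the matcher:

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


track_box = (10, 10, 50, 80)
det_box = (12, 14, 52, 82)
print("IOU:", iou(track_box, det_box))
print("IOU distance (cost):", 1 - iou(track_box, det_box))
```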

2.2.5. Pedestrian re-identification network model (ReID)

(Figure: pedestrian re-identification network model.)

2.3. Tracking Algorithm

The DeepSORT algorithm builds on the SORT algorithm by adding features extracted with a deep-learning model (cascade matching and track-state confirmation). The core of the SORT algorithm is the Kalman filter plus the Hungarian algorithm.

2.3.1. SORT algorithm

Detailed algorithm flow (a code sketch of the association step follows the list):

  • (1) The target information detected in the first frame is used to initialize the Kalman filter's state variables (the trackers, Tracks). For example: ten people give ten trackers.
  • (2) The Kalman filter predicts the next frame to obtain a state estimate, YOLO target detection provides the actual detections (Detections), and IOU matching is then performed.
    • Match result 1: unmatched tracks (Unmatched Tracks). For example: 8 tracks were initialized in the previous frame, but some targets are occluded or disappear in the next frame; such tracks are deleted.
    • Match result 2: unmatched detections (Unmatched Detections). For example: 8 tracks were initialized in the previous frame, but the next frame contains 10 detection boxes; 2 new tracks are initialized.
    • Match result 3: matched detection boxes and trackers (Matched Tracks). The state vector of the current target is updated.
  • (3) Repeat step (2) until the end of the video.
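A condensed sketch of one SORT association step under these assumptions (the track boxes are the Kalman-predicted boxes, `iou_fn` can be the `iou` helper sketched earlier; this is illustrative, not the original SORT code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def sort_step(track_boxes, det_boxes, iou_fn, iou_threshold=0.3):
    """One SORT association step: IOU cost + Hungarian matching, then split the
    result into matched pairs, unmatched tracks (to delete) and unmatched
    detections (to initialize as new tracks)."""
    if not track_boxes or not det_boxes:
        return [], list(range(len(track_boxes))), list(range(len(det_boxes)))

    cost = np.array([[1.0 - iou_fn(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)            # Hungarian matching on 1 - IOU

    matches, unmatched_tracks, unmatched_dets = [], [], []
    for r, c in zip(rows, cols):
        if cost[r, c] <= 1.0 - iou_threshold:           # enough overlap -> Kalman update
            matches.append((r, c))
        else:                                           # assigned, but overlap is too small
            unmatched_tracks.append(r)
            unmatched_dets.append(c)
    unmatched_tracks += [r for r in range(len(track_boxes)) if r not in rows]   # -> delete
    unmatched_dets += [c for c in range(len(det_boxes)) if c not in cols]       # -> new tracks
    return matches, unmatched_tracks, unmatched_dets
```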


2.3.2. DeepSORT algorithm

Detailed algorithm flow (reference: "Target tracking: an explanation of the DeepSORT principle"):

  • (1) The target information detected in the first frame is used to initialize the Kalman filter's state variables (the trackers, Tracks). A target box is only displayed in the output video after it has been matched for 3 consecutive frames; until then the track is not confirmed and does not take part in cascade matching, only in IOU matching (see Case 2 below). For example: ten people give ten trackers.
  • (2) The Kalman filter predicts the next frame to obtain a state estimate, and YOLO target detection provides the actual detections (Detections).

Case 1: if the track has matched the target for 3 consecutive frames, it is marked Confirmed and enters cascade matching (Matching Cascade).
Case 2: otherwise it is marked Unconfirmed and goes directly to IOU matching.

Matching Cascade (cascade matching), also known as priority matching.
1. Compute the distance between the motion features predicted by the Kalman filter and the ReID appearance features extracted by the deep-learning model, forming a new cost matrix.
2. Suppose that over the first 100 frames there are 4 tracks that were matched 100, 100, 90 and 60 times respectively. In frame 101, track1 and track2 (the most frequently matched) are matched first; the remaining track3/track4 are then matched only against the detection boxes left over after track1/track2's matching.

Note: since the Hungarian algorithm tries to match every target successfully if at all possible, a gate (gate unit) is applied in both cascade matching and IOU matching: if a pair's cost exceeds the threshold, the two are considered too different to meet the requirement and are not allowed to be matched together (see the sketch after this list).

  • Match result 1: unmatched tracks (Unmatched Tracks). There are two cases: if the track is in the Unconfirmed state and matches no target, it is deleted immediately; if it is Confirmed and has not matched a target for 70 consecutive frames, it is also deleted, otherwise it carries over to the next frame and continues cascade matching in the confirmed state. For example: 8 tracks were initialized in the previous frame, but some targets are occluded or disappear in the next frame.
  • Match result 2: unmatched detections (Unmatched Detections). For example: 8 tracks were initialized in the previous frame, but the next frame contains 10 detection boxes; when certain conditions are met, 2 new tracks are initialized.
  • Match result 3: matched detection boxes and trackers (Matched Tracks). The state vector of the current target is updated.
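A simplified sketch of the gate and the cascade priority described above. `time_since_update`, `GATE_COST` and the dict-based track representation are assumptions made for illustration; the real deepsort implementation differs in detail:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

GATE_COST = 1e5   # "gate unit": pairs whose cost exceeds the threshold become effectively unmatchable


def matching_cascade(tracks, cost_rows, gate_threshold, max_age=70):
    """Cascade (priority) matching sketch: confirmed tracks are matched in order of
    how recently they were updated, so tracks seen in the last frame get priority
    over long-occluded ones. cost_rows[i] is the combined motion + appearance cost
    of track i against every detection."""
    matches = []
    unmatched_dets = set(range(len(cost_rows[0]))) if cost_rows else set()
    for age in range(1, max_age + 1):
        idx = [i for i, t in enumerate(tracks) if t["time_since_update"] == age]
        if not idx or not unmatched_dets:
            continue
        dets = sorted(unmatched_dets)
        cost = np.array([[cost_rows[i][j] for j in dets] for i in idx], dtype=float)
        cost[cost > gate_threshold] = GATE_COST        # gating: forbid pairs that differ too much
        rows, cols = linear_sum_assignment(cost)
        for r, c in zip(rows, cols):
            if cost[r, c] < GATE_COST:
                matches.append((idx[r], dets[c]))
                unmatched_dets.discard(dets[c])
    return matches, sorted(unmatched_dets)
```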



Source: blog.csdn.net/shinuone/article/details/129639864