Paper link: https://arxiv.org/pdf/2203.13250.pdf

Tracking-by-detection paradigm

Most current multi-target trackers follow the tracking-by-detection paradigm, which splits the tracking task into two steps: target detection and data association. As a result, many tracking-by-detection trackers focus on how to perform data association effectively.

A Local Tracker associates targets frame by frame, also known as pairwise association, which is a greedy, locally optimal method. It performs well as long as the target stays visible throughout the sequence, but it fails under long-term occlusion or drastic appearance change. The *MOTR* paper I wrote about earlier is such a locally associating tracker.
A Global Tracker performs global optimization, for example based on graph theory, which lets it link targets whose detections are not contiguous in time. This is more robust but slower, because the detections from many frames must be available in advance, which also means that tracking is completely decoupled from detection.
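The frame-by-frame association of a local tracker can be sketched roughly as follows. This is only an illustration, not code from the MOTR or GTR papers: the function name `greedy_pairwise_match`, the use of cosine similarity, and the `min_sim` threshold are all assumptions made for the example.

```python
import numpy as np

def greedy_pairwise_match(prev_feats, curr_feats, min_sim=0.5):
    """Greedily link current-frame detections to previous-frame tracks.

    Returns {current_index: previous_index}; unmatched detections are left out.
    """
    # Cosine similarity between every (previous, current) pair.
    prev = prev_feats / np.linalg.norm(prev_feats, axis=1, keepdims=True)
    curr = curr_feats / np.linalg.norm(curr_feats, axis=1, keepdims=True)
    sim = prev @ curr.T

    matches, used_prev = {}, set()
    # Visit pairs from most to least similar and keep a pair if both ends are free.
    for flat in np.argsort(-sim, axis=None):
        i, j = (int(x) for x in np.unravel_index(flat, sim.shape))
        if sim[i, j] < min_sim:
            break  # every remaining pair is even less similar
        if i in used_prev or j in matches:
            continue
        matches[j] = i
        used_prev.add(i)
    return matches

prev_feats = np.array([[1.0, 0.0], [0.0, 1.0]])               # two existing tracks
curr_feats = np.array([[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]])  # three new detections
matches = greedy_pairwise_match(prev_feats, curr_feats)
print(matches)
```

Because each decision only looks at two adjacent frames, a target that disappears for a while has no previous-frame feature to match against, which is the failure case described above. The third detection in the example stays unmatched and would have to start a new track.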

Article focus

The model in this paper has two parts: a target detector and a target matcher (GTR). The input is a sequence of images (32 frames in the paper). The target detector outputs the bounding boxes of all targets in each frame, and GTR matches these targets across frames and outputs the associated bounding boxes, that is, a set of matched trajectories. This makes it a Global Tracker: association is not done frame by frame but globally, over all targets detected within a sequence (time window).
This design has one consequence: the quality of the target detector is critical to the performance of the whole model, because the target features used in the later matching stage come from the detector's output.
In the implementation, GTR takes the features of all targets inside the bounding boxes output by the detection stage and, by computing similarities, outputs an association score vector for each target feature. The target features are then associated according to these score vectors.

Network structure

[Figure: framework] The figure shows the overall network structure of the tracker, covering both target detection and tracking.

Input: a sequence of images, 32 frames long.
Step 1: target detection. The target detector first detects all targets in every input frame. Each patch shown under "All-frame detections" in the figure is actually the feature corresponding to a bounding box output by the detector. (If a bounding box is offset from the actual target, the subsequent matching is affected.)
Step 2: the target tracking module, the Global Tracking Transformer. Its input is all targets detected in the previous step plus the Trajectory Queries, and its output is, for each trajectory, a classification over the targets.
- In previous work, the queries are usually learnable: their parameters are trained, fixed at inference, and used to weight the input vectors when computing similarity. The Trajectory Queries here do not work like this.
- The target detector outputs many target features for a sequence of images (32 frames in the paper). The Trajectory Queries in GTR are initialized as vectors whose number matches the number of distinct targets (the number of trajectories contained in the sequence). Cross-Attention is computed between the Trajectory Queries and the target features, and the resulting similarity scores determine the trajectories: a target feature belongs to the trajectory with which it has the highest similarity.
- Therefore, the Trajectory Queries are just vectors that store the similarity between target features and need little training. Training focuses on the layers that transform the target features, such as Cross-Attention and the linear projections, so that features of the same target obtain a higher similarity.
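The similarity computation between Trajectory Queries and target features can be sketched as a single scaled dot-product step. This is a minimal sketch and not the paper's actual layer stack: the function name `association_scores`, the identity matrices standing in for trained projections, and the choice of two target features as queries are all assumptions made for illustration.

```python
import numpy as np

def association_scores(Q, F, W_q, W_k):
    """Scaled dot-product scores between M queries and N target features.

    Q: (M, D), F: (N, D), W_q and W_k: (D, D). Returns an (M, N) matrix.
    """
    q = Q @ W_q  # projected queries
    k = F @ W_k  # projected target features
    return (q @ k.T) / np.sqrt(q.shape[1])

rng = np.random.default_rng(0)
N, M, D = 6, 2, 8
F = rng.normal(size=(N, D))
F /= np.linalg.norm(F, axis=1, keepdims=True)  # unit-length target features
Q = F[[0, 3]].copy()                           # hypothetical queries: two of the features
W_q = np.eye(D)                                # stand-ins for trained projections
W_k = np.eye(D)

G = association_scores(Q, F, W_q, W_k)
print(G.shape)  # (2, 6)
```

With unit-length features and identity projections, each query scores highest against the feature it was copied from. In a trained model the projections would be learned so that features of the same target in other frames also score high.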

Target Association Module - GTR


Input:
- $F \in \mathbb{R}^{N \times D}$ holds the image features corresponding to the bounding boxes output by the target detector, where $N$ is the number of targets and $D$ is the dimension of each flattened target feature.
- $Q \in \mathbb{R}^{M \times D}$ holds the Trajectory Queries (the similarity vectors), where $M$ is the number of distinct targets, i.e. the number of trajectories, and $D$ is the same feature dimension.
Output: $G \in \mathbb{R}^{M \times N}$ is the target association score matrix. Each column $G(:, i)$ is the trajectory similarity vector of the $i$-th target feature, and the $i$-th target feature is assigned to the trajectory with the highest similarity.
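Reading trajectories off the score matrix $G$ can be sketched as a column-wise argmax. The helper name `assign_trajectories` and the toy scores are hypothetical. The sketch also ignores any constraint that a trajectory should take at most one detection per frame, which a fuller tracker would likely need to handle.

```python
import numpy as np

def assign_trajectories(G):
    """Group detections by their best-scoring trajectory.

    G: (M, N) score matrix. Returns {trajectory_index: [detection indices]}.
    """
    best = G.argmax(axis=0)  # best trajectory for each of the N detections
    trajectories = {}
    for det, traj in enumerate(best):
        trajectories.setdefault(int(traj), []).append(det)
    return trajectories

# Toy scores: 2 trajectories (rows) x 4 detections (columns).
G = np.array([[0.9, 0.1, 0.8, 0.2],
              [0.1, 0.7, 0.3, 0.6]])
trajectories = assign_trajectories(G)
print(trajectories)  # {0: [0, 2], 1: [1, 3]}
```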

Training strategy and inference logic

This part contains many formulas; see the notes for details.

The weather in Xi'an is very good today, Maple Leaf Valley!

Multi-Target Tracking——【Transformer】Global Transformer Tracking
