Multi-object tracking - [Transformer] MOTR: End-to-End Multiple-Object Tracking with TRansformer

Paper link: https://arxiv.org/abs/2105.03247

Article focus

  1. Inspired by DETR, the object query used in object detection is migrated to multi-object tracking to construct the track query. Following the object query design of the DETR detection network greatly enhances the ability to extract features for each target.
  2. Many current tracking-by-detection methods first obtain detection results, extract appearance and motion features from them, and then perform data association. MOTR, by contrast, is an end-to-end tracking model.
  3. To make temporal modeling effective, MOTR proposes a tracklet-aware label assignment (TALA) training strategy plus a collective average loss (CAL) to strengthen the model's temporal modeling.

Problems to be solved when the object query becomes the track query

Generally speaking, although object detection and object tracking both belong to computer vision, their underlying tasks are fundamentally different, so directly transplanting the query design would cause problems and careful design is required.

  1. Use one track query to track the same target. Because the object queries in DETR are produced independently for each frame, there is no fixed correspondence between a target and an object query, as shown in Figure (a) below. Multi-object tracking, however, needs to generate a trajectory for every target in the sequence, which requires identity consistency along the trajectory and must avoid ID switches. This means that detection and trajectory matching have to be carried out by the track query itself; this is the essence of end-to-end tracking, which removes the post-processing step. The paper introduces a tracklet-aware label assignment (TALA) training strategy so that bounding boxes with the same ID supervise both detection and matching during training.
    [Figure: (a) object queries in DETR have no fixed correspondence with targets; (b) track queries in MOTR follow the same target across frames]
  2. Handling of newly appearing and disappearing targets. Because targets can suddenly appear or disappear in multi-object tracking, a fixed-length set of track queries cannot meet the actual need. The paper therefore uses two sets of queries: the track query (variable length) and the detect query (fixed length) to handle target appearance and disappearance. As shown in Figure (b) above, the track query set must be updated iteratively at every frame: the track query of a disappearing target is removed, the detect queries are used in every frame to detect targets, and a newborn target found by a detect query is added to the track query set. The specific process is shown in the figure below, and a data-structure sketch follows the figure:
    [Figure: per-frame update of the track query set: track queries of exited targets are removed, and newborn targets found by detect queries are added]
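
To make this bookkeeping concrete, below is a minimal, hypothetical sketch in plain PyTorch-flavoured Python (not the authors' code) of how a variable-length set of track instances could be maintained per frame; `TrackInstance`, `update_track_set` and the `exit_thresh` value are illustrative assumptions, not names from the paper.

```python
from dataclasses import dataclass, field
from itertools import count

import torch

_next_id = count()  # running ID counter for newborn targets (illustrative)

@dataclass
class TrackInstance:
    """One tracked identity: its track query embedding follows the same target across frames."""
    query: torch.Tensor                                   # track query embedding for this identity
    obj_id: int = field(default_factory=lambda: next(_next_id))
    score: float = 1.0                                    # latest classification score from the head

def update_track_set(tracks, newborn_queries, newborn_scores, exit_thresh=0.5):
    """Drop exited identities and append newborn ones (conceptual; thresholds are assumed)."""
    # keep only identities whose latest score is still above the exit threshold
    kept = [t for t in tracks if t.score >= exit_thresh]
    # every newborn detection spawns a fresh identity with its own track query
    kept += [TrackInstance(query=q, score=float(s)) for q, s in zip(newborn_queries, newborn_scores)]
    return kept
```

The point of the sketch is only that the track query set grows and shrinks frame by frame while each surviving query keeps its object ID.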

Overall network structure - temporal fusion network

[Figure: overall MOTR pipeline: per-frame Deformable DETR encoder (Enc) and decoder (Dec) taking detect queries plus propagated track queries, followed by the QIM]
The structure in the figure above can be analyzed as follows:

  1. Enc represents the feature extraction stage: the backbone network plus the encoder of Deformable DETR;
  2. Dec represents the decoder of Deformable DETR.
    • For the first frame, since no tracked target has appeared yet, the input is the fixed-length $q_d$ together with an empty $q_{tr}$; for subsequent frames, the input is $q_d$ together with the $q_{tr}$ passed on from the previous frame.
    • The output is the hidden state (intermediate feature), which is used both to generate the tracking predictions of the current frame and as the input of the QIM (a minimal sketch of this per-frame loop follows this list).
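
A minimal sketch of the per-frame loop just described, assuming generic `backbone_encoder`, `decoder` and `qim` callables; the real Deformable DETR and MOTR interfaces differ, so every name and signature here is a placeholder.

```python
import torch

def run_clip(frames, backbone_encoder, decoder, q_d, qim):
    """Iterate over a video clip: frame 1 uses only the detect queries q_d, while later
    frames additionally receive the track queries q_tr propagated by QIM."""
    q_tr = torch.empty(0, q_d.shape[-1])             # empty track-query set for the first frame
    outputs = []
    for img in frames:
        feats = backbone_encoder(img)                # Enc: backbone + Deformable DETR encoder
        queries = torch.cat([q_d, q_tr], dim=0)      # fixed detect queries + propagated track queries
        hidden = decoder(feats, queries)             # Dec: one hidden state per query
        outputs.append(hidden)                       # used for the frame's box/class predictions
        q_tr = qim(hidden, num_detect=q_d.shape[0])  # QIM builds q_tr for the next frame
    return outputs
```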

QIM - Query Interaction Module

[Figure: the Query Interaction Module: entrance/exit handling of targets and the temporal aggregation network (TAN)]
The role of this module is to handle the appearance and disappearance of targets. The scores in the figure are the classification scores that the head predicts for the tracked targets.

  • Input: the hidden states output by the decoder, shown as the leftmost input in the figure above. The yellow part corresponds to $q_d$ and the orange part to $q_{tr}$.
  • Step 1: feed the hidden states, together with the classification scores predicted by the head, into two branches that handle (a) target appearance and (b) target disappearance. Two thresholds act as filters to keep only valid queries.
  • Step 2: in the (a) target-appearance branch, a detected target whose classification score exceeds the threshold is treated as a newborn target.
  • Step 3: in the (b) target-disappearance branch, before the filtered track queries are passed on they go through the temporal aggregation network (TAN), which is essentially a self-attention mechanism. Its input is the current frame's track query $q_{tr}^i$ together with the hidden states that survive the (b) branch of Step 1; its output serves the tracking of the next frame.
  • Output: the outputs of Step 2 and Step 3 are concatenated into the track query $q_{tr}^{i+1}$ for the next frame (a sketch of this module follows this list).
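
Below is a hedged sketch of the QIM logic above: entrance/exit filtering by score thresholds plus a TAN-style self-attention block. The threshold values, tensor shapes and the use of `nn.MultiheadAttention` are assumptions made for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class QIMSketch(nn.Module):
    def __init__(self, dim=256, entrance_thresh=0.7, exit_thresh=0.5):
        super().__init__()
        self.entrance_thresh = entrance_thresh
        self.exit_thresh = exit_thresh
        # TAN: essentially a self-attention block over the surviving track queries
        self.tan = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden, scores, num_detect):
        """hidden: (N, dim) decoder hidden states; scores: (N,) classification scores.
        The first num_detect entries correspond to q_d, the remaining ones to q_tr."""
        det_h, det_s = hidden[:num_detect], scores[:num_detect]
        trk_h, trk_s = hidden[num_detect:], scores[num_detect:]

        # (a) entrance branch: detect queries above the entrance threshold become newborn tracks
        newborn = det_h[det_s > self.entrance_thresh]

        # (b) exit branch: keep track queries above the exit threshold, then refine them with TAN
        kept = trk_h[trk_s > self.exit_thresh]
        if kept.numel() > 0:
            x = kept.unsqueeze(0)                    # add a batch dimension for attention
            attn, _ = self.tan(x, x, x)
            kept = self.norm(x + attn).squeeze(0)    # residual + norm, a common transformer pattern

        # output: concatenation of the Step 2 and Step 3 results, i.e. q_tr for the next frame
        return torch.cat([kept, newborn], dim=0)
```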

Training logic

Tracklet-Aware Label Assignment

[The purpose is to make the track query model the one-to-one relationship between a trajectory and its target.]
TALA contains two strategies, corresponding to the training of the detect query and of the track query.

  • For the detect query: follow the detection strategy of DETR to detect the newborn targets that appear in each frame of the tracking sequence. The training strategy performs bipartite matching between the detect queries and the ground truth of the newly appearing targets only.

  • For the track query: the paper designs a target-consistent training strategy. The track queries of the current frame are formed from the track queries of the previous frame plus the detect queries matched in the previous frame, so each track query keeps supervising the same identity it was assigned before. For the first frame, the track query set is an empty set (a conceptual sketch of the two assignment rules follows).
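
Under these two rules, the assignment could be sketched as follows; the bipartite matching is realized here with scipy's `linear_sum_assignment`, and all argument names are illustrative rather than taken from the paper.

```python
from scipy.optimize import linear_sum_assignment

def tala_assign(det_cost, track_ids, newborn_gt_ids):
    """det_cost: (num_detect_queries, num_newborn_gt) matching-cost matrix for this frame;
    track_ids: identities already assigned to the track queries (inherited, never re-matched);
    newborn_gt_ids: ground-truth identities that appear for the first time in this frame."""
    # Detect queries: bipartite (Hungarian) matching against the newborn ground truth only
    q_idx, gt_idx = linear_sum_assignment(det_cost)
    detect_assignment = {int(q): newborn_gt_ids[g] for q, g in zip(q_idx, gt_idx)}

    # Track queries: tracklet-aware assignment, i.e. keep following the identity tracked so far
    track_assignment = {q: obj_id for q, obj_id in enumerate(track_ids)}

    # Next frame's track-query labels = previous track labels + newly matched detect labels
    next_track_ids = list(track_ids) + [newborn_gt_ids[g] for g in gt_idx]
    return detect_assignment, track_assignment, next_track_ids
```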

Collective Average Loss

[The purpose is to make the track query model the transfer of temporal information between consecutive frames.]
The usual training strategy computes the loss frame by frame, which ignores the motion information of the targets that exists across the sequence. This paper therefore designs the collective average loss, which takes a whole video clip as the basic unit: it sums the single-frame tracking loss and single-frame detection loss over the clip and then normalizes the sum (a rendering of this loss is given below).
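
Written out, the collective average loss over a clip of $N$ frames can be rendered roughly as follows. This is a reconstruction from the description above; in the MOTR paper the normalizer is the total number of ground-truth objects over the clip, $\sum_i V_i$, rather than the bare frame count.

$$
\mathcal{L}_{\text{CAL}} = \frac{\sum_{i=1}^{N}\Big(\mathcal{L}_{\text{tr}}\big(\hat{Y}^{i}_{\text{tr}}, Y^{i}_{\text{tr}}\big) + \mathcal{L}_{\text{det}}\big(\hat{Y}^{i}_{\text{det}}, Y^{i}_{\text{det}}\big)\Big)}{\sum_{i=1}^{N} V_i}
$$

Here $\hat{Y}^{i}$ and $Y^{i}$ denote the predictions and ground truth of frame $i$, split into the tracked and newborn (detected) parts, and $V_i$ is the number of ground-truth objects in frame $i$.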

Origin: https://blog.csdn.net/qq_42312574/article/details/127625903