Multi-object tracking - [Transformer] MOTR: End-to-End Multiple-Object Tracking with TRansformer

Paper link: https://arxiv.org/abs/2105.03247

Article focus

  1. Inspired by DETR, the object query used in object detection is migrated to multi-object tracking to construct the track query. Following the object query design of the DETR detection network greatly enhances the ability to extract features for each target.
  2. Many current tracking-by-detection methods first obtain detection results, extract appearance and motion features from them, and then perform data association. MOTR, by contrast, is an end-to-end tracking model.
  3. To make temporal modeling effective, MOTR proposes a tracklet-aware label assignment (TALA) training strategy plus a collective average loss (CAL) to strengthen the model's temporal modeling.

Problems to be solved when the object query becomes the track query

Generally speaking, although object detection and object tracking both belong to computer vision, their underlying tasks are fundamentally different, so directly transplanting the query design would cause problems and careful design is required.

  1. Use one track query to track the same target. Because the object queries in DETR are produced independently for each frame, there is no fixed correspondence between a target and an object query, as shown in Figure (a) below. Multi-object tracking, however, needs to generate a trajectory for every target in the sequence, which requires identity consistency along the trajectory and must avoid ID switches. This means that detection and trajectory matching have to be carried out by the track query itself; this is the essence of end-to-end tracking, which removes the post-processing step. The paper introduces a tracklet-aware label assignment (TALA) training strategy so that bounding boxes with the same ID supervise both detection and matching during training.
    [Figure: (a) object queries in DETR have no fixed correspondence with targets; (b) track queries in MOTR follow the same target across frames]
  2. Handling of newly appearing and disappearing targets. Because targets can suddenly appear or disappear in multi-object tracking, a fixed-length set of track queries cannot meet the actual need. The paper therefore uses two sets of queries: the track query (variable length) and the detect query (fixed length) to handle target appearance and disappearance. As shown in Figure (b) above, the track query set must be updated iteratively at every frame: the track query of a disappearing target is removed, the detect queries are used in every frame to detect targets, and a newborn target found by a detect query is added to the track query set. The specific process is shown in the figure below, and a data-structure sketch follows the figure:
    [Figure: per-frame update of the track query set: track queries of exited targets are removed, and newborn targets found by detect queries are added]
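
To make this bookkeeping concrete, below is a minimal, hypothetical sketch in plain PyTorch-flavoured Python (not the authors' code) of how a variable-length set of track instances could be maintained per frame; `TrackInstance`, `update_track_set` and the `exit_thresh` value are illustrative assumptions, not names from the paper.

```python
from dataclasses import dataclass, field
from itertools import count

import torch

_next_id = count()  # running ID counter for newborn targets (illustrative)

@dataclass
class TrackInstance:
    """One tracked identity: its track query embedding follows the same target across frames."""
    query: torch.Tensor                                   # track query embedding for this identity
    obj_id: int = field(default_factory=lambda: next(_next_id))
    score: float = 1.0                                    # latest classification score from the head

def update_track_set(tracks, newborn_queries, newborn_scores, exit_thresh=0.5):
    """Drop exited identities and append newborn ones (conceptual; thresholds are assumed)."""
    # keep only identities whose latest score is still above the exit threshold
    kept = [t for t in tracks if t.score >= exit_thresh]
    # every newborn detection spawns a fresh identity with its own track query
    kept += [TrackInstance(query=q, score=float(s)) for q, s in zip(newborn_queries, newborn_scores)]
    return kept
```

The point of the sketch is only that the track query set grows and shrinks frame by frame while each surviving query keeps its object ID.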

Overall network structure - temporal fusion network

[Figure: overall MOTR pipeline: per-frame Deformable DETR encoder (Enc) and decoder (Dec) taking detect queries plus propagated track queries, followed by the QIM]
The structure in the figure above can be analyzed as follows:

  1. Enc represents the feature extraction stage: the backbone network plus the encoder of Deformable DETR;
  2. Dec represents the decoder of Deformable DETR.
    • For the first frame, since no tracked target has appeared yet, the input is the fixed-length $q_d$ together with an empty $q_{tr}$; for subsequent frames, the input is $q_d$ together with the $q_{tr}$ passed on from the previous frame.
    • The output is the hidden state (intermediate feature), which is used both to generate the tracking predictions of the current frame and as the input of the QIM (a minimal sketch of this per-frame loop follows this list).
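
A minimal sketch of the per-frame loop just described, assuming generic `backbone_encoder`, `decoder` and `qim` callables; the real Deformable DETR and MOTR interfaces differ, so every name and signature here is a placeholder.

```python
import torch

def run_clip(frames, backbone_encoder, decoder, q_d, qim):
    """Iterate over a video clip: frame 1 uses only the detect queries q_d, while later
    frames additionally receive the track queries q_tr propagated by QIM."""
    q_tr = torch.empty(0, q_d.shape[-1])             # empty track-query set for the first frame
    outputs = []
    for img in frames:
        feats = backbone_encoder(img)                # Enc: backbone + Deformable DETR encoder
        queries = torch.cat([q_d, q_tr], dim=0)      # fixed detect queries + propagated track queries
        hidden = decoder(feats, queries)             # Dec: one hidden state per query
        outputs.append(hidden)                       # used for the frame's box/class predictions
        q_tr = qim(hidden, num_detect=q_d.shape[0])  # QIM builds q_tr for the next frame
    return outputs
```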

QIM - Query Interaction Module

[Figure: the Query Interaction Module: entrance/exit handling of targets and the temporal aggregation network (TAN)]
The role of this module is to handle the appearance and disappearance of targets. The scores in the figure are the classification scores that the head predicts for the tracked targets.

  • Input: the hidden states output by the decoder, shown as the leftmost input in the figure above. The yellow part corresponds to $q_d$ and the orange part to $q_{tr}$.
  • Step 1: feed the hidden states, together with the classification scores predicted by the head, into two branches that handle (a) target appearance and (b) target disappearance. Two thresholds act as filters to keep only valid queries.
  • Step 2: in the (a) target-appearance branch, a detected target whose classification score exceeds the threshold is treated as a newborn target.
  • Step 3: in the (b) target-disappearance branch, before the filtered track queries are passed on they go through the temporal aggregation network (TAN), which is essentially a self-attention mechanism. Its input is the current frame's track query $q_{tr}^i$ together with the hidden states that survive the (b) branch of Step 1; its output serves the tracking of the next frame.
  • Output: the outputs of Step 2 and Step 3 are concatenated into the track query $q_{tr}^{i+1}$ for the next frame (a sketch of this module follows this list).
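
Below is a hedged sketch of the QIM logic above: entrance/exit filtering by score thresholds plus a TAN-style self-attention block. The threshold values, tensor shapes and the use of `nn.MultiheadAttention` are assumptions made for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class QIMSketch(nn.Module):
    def __init__(self, dim=256, entrance_thresh=0.7, exit_thresh=0.5):
        super().__init__()
        self.entrance_thresh = entrance_thresh
        self.exit_thresh = exit_thresh
        # TAN: essentially a self-attention block over the surviving track queries
        self.tan = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden, scores, num_detect):
        """hidden: (N, dim) decoder hidden states; scores: (N,) classification scores.
        The first num_detect entries correspond to q_d, the remaining ones to q_tr."""
        det_h, det_s = hidden[:num_detect], scores[:num_detect]
        trk_h, trk_s = hidden[num_detect:], scores[num_detect:]

        # (a) entrance branch: detect queries above the entrance threshold become newborn tracks
        newborn = det_h[det_s > self.entrance_thresh]

        # (b) exit branch: keep track queries above the exit threshold, then refine them with TAN
        kept = trk_h[trk_s > self.exit_thresh]
        if kept.numel() > 0:
            x = kept.unsqueeze(0)                    # add a batch dimension for attention
            attn, _ = self.tan(x, x, x)
            kept = self.norm(x + attn).squeeze(0)    # residual + norm, a common transformer pattern

        # output: concatenation of the Step 2 and Step 3 results, i.e. q_tr for the next frame
        return torch.cat([kept, newborn], dim=0)
```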

Training logic

Tracklet-Aware Label Assignment

[The purpose is to make the track query model the one-to-one relationship between a trajectory and its target.]
TALA contains two strategies, corresponding to the training of the detect query and of the track query.

  • For the detect query: follow the detection strategy of DETR to detect the newborn targets that appear in each frame of the tracking sequence. The training strategy performs bipartite matching between the detect queries and the ground truth of the newly appearing targets only.

  • For the track query: the paper designs a target-consistent training strategy. The track queries of the current frame are formed from the track queries of the previous frame plus the detect queries matched in the previous frame, so each track query keeps supervising the same identity it was assigned before. For the first frame, the track query set is an empty set (a conceptual sketch of the two assignment rules follows).
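
Under these two rules, the assignment could be sketched as follows; the bipartite matching is realized here with scipy's `linear_sum_assignment`, and all argument names are illustrative rather than taken from the paper.

```python
from scipy.optimize import linear_sum_assignment

def tala_assign(det_cost, track_ids, newborn_gt_ids):
    """det_cost: (num_detect_queries, num_newborn_gt) matching-cost matrix for this frame;
    track_ids: identities already assigned to the track queries (inherited, never re-matched);
    newborn_gt_ids: ground-truth identities that appear for the first time in this frame."""
    # Detect queries: bipartite (Hungarian) matching against the newborn ground truth only
    q_idx, gt_idx = linear_sum_assignment(det_cost)
    detect_assignment = {int(q): newborn_gt_ids[g] for q, g in zip(q_idx, gt_idx)}

    # Track queries: tracklet-aware assignment, i.e. keep following the identity tracked so far
    track_assignment = {q: obj_id for q, obj_id in enumerate(track_ids)}

    # Next frame's track-query labels = previous track labels + newly matched detect labels
    next_track_ids = list(track_ids) + [newborn_gt_ids[g] for g in gt_idx]
    return detect_assignment, track_assignment, next_track_ids
```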

Collective Average Loss

[The purpose is to make the track query model the transfer of temporal information between consecutive frames.]
The usual training strategy computes the loss frame by frame, which ignores the motion information of the targets that exists across the sequence. This paper therefore designs the collective average loss, which takes a whole video clip as the basic unit: it sums the single-frame tracking loss and single-frame detection loss over the clip and then normalizes the sum (a rendering of this loss is given below).
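
Written out, the collective average loss over a clip of $N$ frames can be rendered roughly as follows. This is a reconstruction from the description above; in the MOTR paper the normalizer is the total number of ground-truth objects over the clip, $\sum_i V_i$, rather than the bare frame count.

$$
\mathcal{L}_{\text{CAL}} = \frac{\sum_{i=1}^{N}\Big(\mathcal{L}_{\text{tr}}\big(\hat{Y}^{i}_{\text{tr}}, Y^{i}_{\text{tr}}\big) + \mathcal{L}_{\text{det}}\big(\hat{Y}^{i}_{\text{det}}, Y^{i}_{\text{det}}\big)\Big)}{\sum_{i=1}^{N} V_i}
$$

Here $\hat{Y}^{i}$ and $Y^{i}$ denote the predictions and ground truth of frame $i$, split into the tracked and newborn (detected) parts, and $V_i$ is the number of ground-truth objects in frame $i$.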

Origin: https://blog.csdn.net/qq_42312574/article/details/127625903