DEFT: joint detection and tracking for 3D MOT

Learning detection and tracking jointly in one network improves performance. Several recent 2D multiple object tracking (MOT) results show that a SOTA detector combined with simple inter-frame association based on spatial motion tracks quite well, and even outperforms some methods that use appearance features to re-identify lost trajectories. DEFT (Detection Embeddings for Tracking), recently proposed by Uber and others, is a joint detection and tracking model. It uses the detector as the bottom layer and builds an appearance-based matching network on top of it. In 2D tracking it reaches SOTA results with stronger robustness, and in 3D tracking it achieves twice the performance of the current SOTA method.

demo: 3D and KITTI

paper:https://arxiv.org/abs/2102.02267

code:https://github.com/MedChaabane/DEFT

The underlying assumption is that a learnable object matching module can be added to a mainstream CNN detector to produce a high-performance multi-object tracker, and that training the detection and tracking (association) modules together lets the two adapt to each other and perform better. Compared with feeding detections from a black-box detector into the association module, sharing the backbone between object detection and inter-frame association gives better speed and accuracy.

In DEFT, association and detection are learned in a unified network, so association adds very little latency on top of detection.

Let's look at the overall network design of DEFT. Broadly, it is very similar to JDE and FairMOT. Following the tracking-by-detection (TBD) paradigm, DEFT uses the intermediate feature maps of the object detector (which serves as the backbone in this paper) to extract object embeddings for the object matching subnetwork. As shown in the figure below, the upper part alone already gives the structure of the whole network: the image enters the Detector branch, feature maps from different stages of the detector feed the Embedding Extractor module for appearance feature learning, and the appearance features of objects detected in different frames are sent to the Matching Head to obtain the association similarity score matrix.
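The data flow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: sampling each feature map at the object's center is an assumption not spelled out in this post, the function names are hypothetical, and cosine similarity stands in for the learned Matching Head.

```python
import math

def sample_embedding(feature_maps, center, image_size):
    """Concatenate the feature vectors found at an object's center in each map.

    feature_maps: list of maps, each indexed as fmap[channel][row][col]
    center: (x, y) object center in image pixels (assumed sampling point)
    image_size: (width, height) of the input image
    """
    x, y = center
    img_w, img_h = image_size
    embedding = []
    for fmap in feature_maps:
        height, width = len(fmap[0]), len(fmap[0][0])
        # Rescale the image-space center to this map's resolution.
        col = min(int(x * width / img_w), width - 1)
        row = min(int(y * height / img_h), height - 1)
        embedding.extend(channel[row][col] for channel in fmap)
    return embedding

def cosine(a, b):
    """Cosine similarity; a stand-in for the learned Matching Head."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm > 0 else 0.0

def similarity_matrix(track_embeddings, det_embeddings):
    """Rows are existing tracks, columns are detections in the new frame."""
    return [[cosine(t, d) for d in det_embeddings] for t in track_embeddings]
```

Each row of the resulting matrix scores one existing track against every detection in the new frame, which is the shape of output the Matching Head is described as producing.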

In DEFT, the detector and the object matching network are trained jointly. During training, the loss of the matching network is back-propagated into the detection backbone, which optimizes both appearance feature extraction and detection performance. DEFT also uses a low-dimensional LSTM module to provide geometric constraints for the matching network, ruling out associations that look similar in appearance but are spatially implausible. Although it can be built on several detectors, DEFT does most of its work on CenterNet, where it reaches SOTA results and runs faster than other similar methods. The speed comes from the fact that object association is only a small additional module, so it adds little latency compared with the detection task as a whole.
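One way to read the joint training is as a single objective whose gradient reaches the shared backbone through both terms. The weights w_det and w_match below are hypothetical placeholders; this post does not say how DEFT balances the two losses, so treat the formula as a generic sketch rather than the paper's exact objective.

```latex
\mathcal{L}_{\text{joint}}
  = w_{\text{det}}\,\mathcal{L}_{\text{det}}
  + w_{\text{match}}\,\mathcal{L}_{\text{match}},
\qquad
\frac{\partial \mathcal{L}_{\text{joint}}}{\partial \theta_{\text{backbone}}}
  = w_{\text{det}}\,\frac{\partial \mathcal{L}_{\text{det}}}{\partial \theta_{\text{backbone}}}
  + w_{\text{match}}\,\frac{\partial \mathcal{L}_{\text{match}}}{\partial \theta_{\text{backbone}}}
```

The second term of the gradient is what lets the matching loss shape the detector's features.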

If only the learned appearance embedding is used for inter-frame matching, two objects that are very similar in appearance space can easily be confused. A common remedy is to add a geometric or temporal constraint on the matching, typically a Kalman filter or an LSTM. The paper designs its motion prediction module with an LSTM, which predicts a trajectory's position in future frames from its past frames. The module rules out physically impossible associations by setting the similarity score of any detection box that lies too far from the trajectory's predicted position to −∞.
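The gating step can be illustrated with a small sketch. It assumes the predicted track positions are already available (in DEFT they would come from the LSTM, which is not modeled here). The distance threshold `max_dist`, the function names, and the greedy assignment are all hypothetical choices made for illustration.

```python
import math

def gate_scores(scores, predicted_centers, det_centers, max_dist):
    """Return a copy of scores with spatially implausible pairs set to -inf.

    scores[i][j]: appearance similarity between track i and detection j
    predicted_centers[i]: (x, y) predicted for track i in the current frame
    det_centers[j]: (x, y) of detection j
    """
    gated = []
    for i, row in enumerate(scores):
        px, py = predicted_centers[i]
        new_row = []
        for j, score in enumerate(row):
            dx, dy = det_centers[j]
            far = math.hypot(px - dx, py - dy) > max_dist
            new_row.append(-math.inf if far else score)
        gated.append(new_row)
    return gated

def greedy_match(scores):
    """Pair tracks and detections by descending score, skipping -inf entries."""
    candidates = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)
         if s != -math.inf),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for _, i, j in candidates:
        if i not in used_tracks and j not in used_dets:
            matches.append((i, j))
            used_tracks.add(i)
            used_dets.add(j)
    return matches

# Two look-alike objects far apart: appearance alone swaps their identities.
scores = [[0.6, 0.9], [0.8, 0.7]]
predicted = [(0.0, 0.0), (100.0, 100.0)]
detections = [(2.0, 1.0), (98.0, 101.0)]
gated = gate_scores(scores, predicted, detections, max_dist=10.0)
```

In this toy case, matching on the raw scores pairs each track with the wrong detection, while matching on `gated` keeps each track with the detection near its predicted position.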
 

MOT17 and KITTI are used for 2D evaluation, and nuScenes is the standard dataset for 3D evaluation.

The first experiment compares several backbone detectors. The results are shown in the figure below: CenterNet works best, so it is used as the backbone in the later experiments.

The paper proposes a new MOT method for joint detection and tracking that achieves SOTA performance on multiple benchmarks. It can be built on top of mainstream object detectors, and it is flexible and efficient, which makes it a method worth watching.
