Dalian University of Technology and Alibaba DAMO Academy release HQTrack | High-precision video multi-object tracking model

Title: Tracking Anything in High Quality

PDF: https://arxiv.org/pdf/2307.13974v1.pdf

Code: https://github.com/jiawen-zhu/hqtrack

Overview

This paper introduces HQTrack, a high-quality video object tracking framework. Video object tracking is a fundamental task in computer vision. Recent advances in perception algorithms have unified single/multi-object and box/mask-based tracking; among these, the Segment Anything Model (SAM) has attracted widespread attention.

HQTrack is mainly composed of two components: a video multi-object segmenter (VMOS) and a mask refiner (MR). Given the objects to track in the initial frame of a video, VMOS propagates their masks to the current frame. However, since VMOS is trained on only a few similar video object segmentation (VOS) datasets, its generalization to complex and corner-case scenes is limited, so the masks obtained at this stage may not be sufficiently accurate.

To further improve the quality of the tracking masks, the authors employ a pre-trained MR model to refine the tracking results. Notably, HQTrack ranked second in the Visual Object Tracking and Segmentation (VOTS2023) challenge without any test-time data augmentation or model ensemble tricks, which fully demonstrates the effectiveness of the method.

Introduction

Visual object tracking (VOT), the subject of some of the most influential challenges in the tracking field, faces many difficulties, such as understanding inter-object relationships, tracking multi-object trajectories, and estimating accurate masks.

Visual object tracking has made great progress with the help of deep learning. Current mainstream trackers are built on the Transformer framework: TransT proposes Transformer-based ego-context augment (ECA) and cross-feature augment (CFA) modules to replace correlation-based computation. More recently, some trackers adopt a pure Transformer architecture in which feature extraction and template-search-region interaction are performed in a single backbone, pushing tracking performance to new heights. These trackers mainly focus on single object tracking (SOT) and output bounding boxes for evaluation; therefore, SOT trackers alone are not suitable for the VOTS2023 challenge.

The goal of video object segmentation is to segment a specific object of interest throughout a video sequence. HQTrack mainly consists of a video multi-object segmenter (VMOS) and a mask refiner (MR). Small objects in complex scenes are better perceived by cascading a Gated Propagation Module (GPM) at the 1/8 scale. In addition, the authors use InternImage-T as the feature extractor to strengthen the ability to distinguish objects. To save memory, VMOS keeps a fixed-length long-term memory: apart from the initial frame, memories of earlier frames are discarded. On the other hand, to further improve the quality of the tracking masks, the authors adopt a pre-trained HQ-SAM model to refine them: the outer bounding boxes of the masks predicted by VMOS are computed as prompt boxes and fed into HQ-SAM together with the original image to obtain refined masks. The final tracking result is selected between the VMOS and MR outputs.

The VOTS2023 test videos contain a large number of long-term sequences, the longest exceeding 10,000 frames, which requires the tracker to handle drastic changes in target appearance and adapt to environmental variations. Long video sequences also pose memory-space challenges for memory-based methods. In addition, objects in VOTS videos may leave the field of view and reappear later, so the tracker needs additional design to handle object disappearance and reappearance. Further challenges such as fast motion, frequent occlusion, distractors, and small objects make the task even harder. HQTrack won runner-up in the VOTS2023 challenge by combining VMOS and MR and adopting HQ-SAM to refine the masks.

Method

As shown in Figure 1, given a video and the annotated reference masks of the first frame, HQTrack first segments the target objects in each frame with VMOS. The segmentation of the current frame is obtained by propagating from the first frame along the temporal dimension, using appearance/identification information together with long-term/short-term memory. VMOS is a variant of DeAOT, so multiple objects of interest can be modeled in a single propagation pass. The authors then use HQ-SAM as the MR to refine the VMOS segmentation masks: bounding boxes are first extracted from the object masks predicted by VMOS and fed into HQ-SAM as prompt boxes together with the original image. Finally, a mask selector chooses the final result between VMOS and MR.
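To make the overall flow concrete, here is a minimal per-frame sketch of the pipeline described above. The `vmos`, `hq_sam_refine`, and `select_mask` callables are hypothetical placeholders standing in for the three stages (propagation, box-prompted refinement, and mask selection); concrete sketches of the latter two appear in the sections below.

```python
import numpy as np

def track_frame(frame, vmos, hq_sam_refine, select_mask):
    """One HQTrack step per frame (illustrative sketch, not the authors' code).

    vmos          -- propagates the reference masks to `frame` (hypothetical callable)
    hq_sam_refine -- box-prompted HQ-SAM refinement (hypothetical callable)
    select_mask   -- chooses between the VMOS and refined masks (hypothetical callable)
    """
    # Stage 1: VMOS propagates each tracked object's mask to the current frame.
    vmos_masks = vmos(frame)

    final_masks = []
    for mask in vmos_masks:
        # Outer bounding box of the VMOS mask serves as the prompt box.
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:                      # object not visible in this frame
            final_masks.append(mask)
            continue
        box = (xs.min(), ys.min(), xs.max(), ys.max())

        # Stage 2: refine the mask with the box-prompted HQ-SAM model.
        refined = hq_sam_refine(frame, box)

        # Stage 3: keep the refined mask only if it agrees with the VMOS mask.
        final_masks.append(select_mask(mask, refined))
    return final_masks
```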

Video Multi-Object Segmenter (VMOS)

VMOS is a variant of DeAOT. To improve segmentation performance, especially the perception of small objects, the authors cascade a GPM at the 8× scale in VMOS and extend the propagation process to multiple scales. The original DeAOT performs propagation only on 16× scale visual and identification features; at that scale many detailed cues about the target are lost, and 16× features alone are not sufficient for accurate segmentation of small objects. In VMOS, to balance memory usage and model efficiency, the 4× scale propagated features are obtained only by upsampling and linear projection. The multi-scale propagated features are then fed into the decoder together with the multi-scale encoder features for mask prediction; the decoder uses a simple FPN structure. Furthermore, InternImage, a recent large-scale CNN-based foundation model that adopts deformable convolution as its core operator, performs well on typical tasks such as object detection and segmentation, and serves here as the backbone.
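As a rough illustration of this multi-scale design (not the authors' implementation), the sketch below fuses propagated features from the 16× and 8× scales, obtains 4× features only via upsampling and a linear projection, and combines them with encoder features in an FPN-like fashion; all channel sizes and the exact fusion order are placeholder assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePropagationHead(nn.Module):
    """Illustrative multi-scale fusion sketch (placeholder channel sizes)."""

    def __init__(self, c16=256, c8=128, c4=64):
        super().__init__()
        self.proj_16_to_8 = nn.Conv2d(c16, c8, kernel_size=1)  # align 16x features to 8x channels
        self.proj_8_to_4 = nn.Conv2d(c8, c4, kernel_size=1)    # "linear projection" down to 4x channels
        self.fuse_8 = nn.Conv2d(c8, c8, kernel_size=3, padding=1)
        self.fuse_4 = nn.Conv2d(c4, c4, kernel_size=3, padding=1)
        self.mask_head = nn.Conv2d(c4, 1, kernel_size=1)        # per-object mask logits

    def forward(self, prop16, prop8, enc8, enc4):
        # prop16 / prop8: features after gated propagation at the 16x and 8x scales
        # enc8 / enc4:    encoder features at the matching scales (FPN-style skips)
        up16 = F.interpolate(prop16, scale_factor=2, mode="bilinear", align_corners=False)
        x8 = self.fuse_8(enc8 + prop8 + self.proj_16_to_8(up16))
        up8 = F.interpolate(x8, scale_factor=2, mode="bilinear", align_corners=False)
        x4 = self.fuse_4(enc4 + self.proj_8_to_4(up8))
        return self.mask_head(x4)  # upstream code would upsample the logits to full resolution
```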

Mask Refiner (MR)

The MR is a pre-trained HQ-SAM model, which can produce more refined segmentation results than SAM. Besides the strong zero-shot capability brought by large-scale training, SAM offers a flexible human-computer interaction mechanism through different prompt formats. However, when dealing with images containing objects of complex structure, SAM's predicted masks are often not accurate enough. To address this while preserving SAM's original prompt design, efficiency, and zero-shot generalization, researchers proposed HQ-SAM, which introduces only a few additional parameters into the pre-trained SAM model and achieves more accurate segmentation. Concretely, HQ-SAM obtains high-quality masks by injecting a learnable output token into SAM's mask decoder.
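The box-prompted refinement step can be sketched with the segment-anything-style predictor interface (the HQ-SAM codebase exposes a compatible `SamPredictor`); the checkpoint path and model key below are placeholders, and this is an illustrative sketch rather than the authors' exact code.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def build_predictor(checkpoint="path/to/sam_checkpoint.pth", model_type="vit_h"):
    """Build a SAM-style predictor (placeholder checkpoint path / model key)."""
    sam = sam_model_registry[model_type](checkpoint=checkpoint)
    return SamPredictor(sam)

def hq_sam_refine(predictor, image_rgb, box):
    """Refine one object mask by prompting the predictor with a bounding box.

    image_rgb: HxWx3 uint8 RGB frame; box: [x0, y0, x1, y1] prompt box.
    """
    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(
        box=np.asarray(box, dtype=np.float32),
        multimask_output=False,  # one mask per box prompt
    )
    return masks[0]  # boolean HxW mask
```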

As shown on the right side of Figure 1, the masks predicted by VMOS are used as the input of the MR. Since the VMOS model is trained on closed-set datasets of limited scale, its first-stage masks may not be accurate enough, especially in complex situations. Refining these preliminary results with a segmentation model trained at large scale therefore brings significant performance gains.

Specifically, the authors compute the outer bounding boxes of the VMOS-predicted masks as prompt boxes and feed them into HQ-SAM together with the original image to obtain refined masks. The final output mask of HQTrack is then selected between the VMOS and HQ-SAM results. The authors found that for the same target object, the mask refined by HQ-SAM is sometimes completely different from the mask predicted by VMOS (i.e., the IoU between them is low), which in turn hurts segmentation performance; this may be caused by HQ-SAM understanding and delimiting the target differently from the reference annotation. Therefore, the authors set an IoU threshold τ (on the IoU between the VMOS and HQ-SAM masks) to determine which mask is used as the final output: the refined mask is selected only when its IoU with the VMOS mask is higher than τ. This constraint keeps HQ-SAM focused on refining the current object mask rather than re-predicting a different object.
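A minimal sketch of this selection rule, assuming binary numpy masks; the default value of τ below is only a placeholder, since this write-up does not state the exact threshold:

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def select_mask(vmos_mask, refined_mask, tau=0.1):
    """Keep the HQ-SAM refined mask only when it overlaps the VMOS mask enough.

    tau is the IoU threshold described above; 0.1 is a placeholder default,
    not the value used by the authors.
    """
    return refined_mask if mask_iou(vmos_mask, refined_mask) > tau else vmos_mask
```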

Implementation Details

In HQTrack's VMOS, InternImage-T is used as the backbone of the image encoder to trade off accuracy and efficiency. The number of GPM layers is set to 3 and 1 for the 16× and 8× scales, respectively. The 4× scale propagated features are obtained by upsampling and linearly projecting the 8× scale features. HQTrack's segmenter uses long-term and short-term memory to handle object appearance changes in long video sequences. To save memory, a fixed-length long-term memory of 8 is used: apart from the initial frame, earlier memories are discarded.
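A simple way to realize such a fixed-length memory is a rolling queue that always preserves the initial-frame memory; the sketch below is illustrative and not the authors' implementation:

```python
from collections import deque

class LongTermMemory:
    """Fixed-length long-term memory as described above (illustrative sketch).

    The initial-frame memory is always kept; beyond that, only the most recent
    `max_len` memorized frames are retained and older ones are discarded.
    """

    def __init__(self, max_len=8):
        self.initial = None                  # memory of the reference (first) frame
        self.recent = deque(maxlen=max_len)  # rolling window; deque drops the oldest entry

    def add(self, frame_memory):
        if self.initial is None:
            self.initial = frame_memory
        else:
            self.recent.append(frame_memory)

    def all(self):
        return ([self.initial] if self.initial is not None else []) + list(self.recent)
```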

Model Training

The training process consists of two stages. In the first stage, VMOS is pre-trained on synthetic video sequences generated from static image datasets; in the second stage, VMOS is trained on multi-object segmentation datasets to better understand the relationships between multiple targets. The DAVIS, YouTubeVOS, VIPSeg, BURST, MOTS, and OVIS training sets are used to train VMOS, with OVIS in particular improving the tracker's robustness to occluded targets.

Inference

The inference process is shown in Figure 1. No test-time data augmentation (TTA), such as flipping or multi-scale testing, and no model ensemble are used.

Experimental Results

Conclusion

In this paper, the authors propose HQTrack, a high-quality video multi-object tracking method. HQTrack mainly consists of a video multi-object segmenter (VMOS) and a mask refiner (MR): VMOS is responsible for propagating multiple objects across video frames, while MR is a large-scale pre-trained segmentation model responsible for refining the segmentation masks. HQTrack demonstrates strong object tracking and segmentation capability and won runner-up in the Visual Object Tracking and Segmentation (VOTS2023) challenge.

Source: blog.csdn.net/CVHub/article/details/132255594