Interpretation of the paper: High-quality object tracking


This paper introduces HQTrack, a framework for high-quality video object tracking and segmentation. HQTrack combines a video multi-object segmenter (VMOS) and a mask refiner (MR): VMOS tracks the objects specified in the initial frame of the video, and MR refines the resulting masks for higher accuracy. Because VMOS is trained on scale-limited video object segmentation (VOS) datasets, its masks can be unsatisfactory in complex scenes; the MR model compensates by refining them. HQTrack took second place in the Visual Object Tracking and Segmentation (VOTS2023) challenge without additional tricks such as test-time data augmentation or model ensembles.

Method


Video Multi-Object Segmenter

VMOS, a key component of the HQTrack framework, is a variant of the DeAOT model designed to improve segmentation performance. Unlike the original DeAOT, which propagates visual and identification features only at the 16× scale, VMOS cascades an additional Gated Propagation Module (GPM) at the 8× scale, extending the propagation process to multiple scales. This preserves detailed object cues that would be lost at coarser scales and improves the perception of tiny objects. For the sake of memory usage and efficiency, VMOS uses only upsampling and a linear projection to bring the propagated features up to the 4× scale. These multi-scale propagated features are then fed into a simple Feature Pyramid Network (FPN) decoder for mask prediction. In addition, VMOS replaces the backbone with InternImage-T, a large-scale CNN-based model that uses deformable convolution to enhance object discrimination.
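
To make the multi-scale idea concrete, here is a minimal PyTorch-style sketch of propagating at the 16× and 8× scales and decoding a mask at the 4× scale. This is not the authors' implementation: the plain convolutions merely stand in for the Gated Propagation Modules, and all module and variable names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePropagationSketch(nn.Module):
    """Toy stand-in for VMOS-style multi-scale propagation + FPN decoding.

    Real GPMs attend over memorized frames; plain convolutions are used
    here only to mark where propagation happens. Names are illustrative.
    """

    def __init__(self, c16=256, c8=128, c4=64):
        super().__init__()
        self.prop16 = nn.Conv2d(c16, c16, 3, padding=1)  # GPM at 16x scale (placeholder)
        self.prop8 = nn.Conv2d(c8, c8, 3, padding=1)     # cascaded GPM at 8x scale (placeholder)
        self.proj16 = nn.Conv2d(c16, c4, 1)              # linear projection before upscaling
        self.proj8 = nn.Conv2d(c8, c4, 1)
        self.mask_head = nn.Conv2d(c4, 1, 3, padding=1)  # simple FPN-style mask prediction

    def forward(self, f16, f8):
        p16 = F.relu(self.prop16(f16))
        p8 = F.relu(self.prop8(f8))
        # Propagated features are upsampled only to the 4x scale (memory/efficiency).
        u16 = F.interpolate(self.proj16(p16), scale_factor=4, mode="bilinear", align_corners=False)
        u8 = F.interpolate(self.proj8(p8), scale_factor=2, mode="bilinear", align_corners=False)
        return self.mask_head(u16 + u8)  # mask logits at 1/4 of the input resolution


# Example with feature maps for a 512x512 frame (16x -> 32x32, 8x -> 64x64).
f16 = torch.randn(1, 256, 32, 32)
f8 = torch.randn(1, 128, 64, 64)
print(MultiScalePropagationSketch()(f16, f8).shape)  # torch.Size([1, 1, 128, 128])
```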

Mask refiner

The mask refiner in HQTrack uses the pre-trained HQ-SAM model, a variant of the Segment Anything Model (SAM). SAM, trained on a high-quality dataset containing 1.1 billion masks, has attracted attention for its image segmentation ability and zero-shot generalization. However, SAM struggles on images containing objects with complex structures, which motivated HQ-SAM. HQ-SAM improves on SAM by adding a small number of extra parameters to the pre-trained model, producing higher-quality masks.

In HQTrack, MR refines the prediction masks generated by VMOS, especially in complex scenarios where VMOS results may be of insufficient quality because it is trained only on scale-limited, close-set VOS datasets. MR computes the outer bounding boxes of the masks predicted by VMOS, feeds these box cues together with the original image into the HQ-SAM model, and obtains refined masks.
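
As a concrete illustration, the sketch below computes the outer bounding box of a binary mask, which is the kind of box cue MR passes to HQ-SAM along with the original image. The function name is ours, and the predictor call is only hinted at in a comment, since the exact HQ-SAM API is not shown here.

```python
import numpy as np

def mask_to_box_prompt(mask: np.ndarray):
    """Outer bounding box (x0, y0, x1, y1) of a binary mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# Example: a blob covering rows 10..19 and columns 30..49.
m = np.zeros((100, 100), dtype=np.uint8)
m[10:20, 30:50] = 1
box = mask_to_box_prompt(m)
print(box)  # (30, 10, 49, 19)

# Assuming a SAM-style predictor interface (the exact HQ-SAM API may differ):
# predictor.set_image(frame_rgb)
# refined_mask, _, _ = predictor.predict(box=np.array(box), multimask_output=False)
```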

The final output mask of HQTrack is selected from the VMOS and HQ-SAM results: the refined mask is chosen only if the intersection-over-union (IoU) score between the VMOS and HQ-SAM masks is above a certain threshold. This criterion encourages HQ-SAM to focus on refining the current object mask rather than re-predicting a different target object, which improves segmentation performance.
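
A minimal sketch of this selection step, assuming binary masks stored as NumPy arrays; the 0.7 threshold is illustrative rather than the value used in the paper.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def select_final_mask(vmos_mask, hqsam_mask, iou_thresh=0.7):
    """Keep the HQ-SAM refinement only when it agrees with the VMOS mask."""
    if hqsam_mask is not None and mask_iou(vmos_mask, hqsam_mask) > iou_thresh:
        return hqsam_mask  # refinement stayed on the same object
    return vmos_mask       # otherwise fall back to the VMOS prediction
```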

Experiment


Ablation study

  • In the ablation study on tracking paradigms, jointly tracking all objects with a single tracker performed better than tracking each object separately. The advantage of joint tracking likely comes from the tracker capturing the relationships between target objects, which improves robustness to perturbations.

  • Component studies on the video multi-object segmenter (VMOS) show that replacing the original ResNet-50 backbone with InternImage-T and adding the multi-scale propagation mechanism leads to significant performance improvements: the area under the curve (AUC) score increases from 0.611 to 0.650, confirming the effectiveness of these modifications.

  • The long-term memory interval parameter was re-evaluated to account for the long sequences in Visual Object Tracking and Segmentation (VOTS) videos; a memory interval of 50 gave the best performance.

  • HQTrack's mask refiner (MR) was also ablated. Directly refining every segmentation mask is not optimal: while refinement can significantly improve good masks, it hurts performance when the VMOS masks are of lower quality. A selection process is therefore used: the refined mask is taken as the final output only when the intersection-over-union (IoU) score between the VMOS and HQ-SAM masks exceeds a threshold.

Paper link: https://arxiv.org/abs/2307.13974v1
Code link: https://github.com/jiawen-zhu/HQTrack

Origin: blog.csdn.net/weixin_38739735/article/details/132463718