[DAMO Academy OpenVI] ProContEXT: a progressive Transformer tracker for video object tracking

Papers & Code

Background

The video object tracking (VOT) task takes as input a video and the position (a rectangular box) of the target to be tracked in the first frame, and predicts the precise position of that target in all subsequent frames. The task places no restriction on the category of the tracked target; the goal is simply to follow the target instance of interest. Object tracking is an important research topic in both academia and industry, and is widely used in autonomous driving, human-computer interaction, and video surveillance.
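To make the task setup concrete, below is a minimal sketch of the generic tracking loop, assuming a hypothetical tracker object with init/update methods (this is not the ProContEXT API): the tracker is initialized with the first frame and its target box, then predicts one box per subsequent frame.

```python
from typing import List, Tuple

import cv2  # assumption: OpenCV is available for video decoding

Box = Tuple[int, int, int, int]  # (x, y, w, h)

def track_video(tracker, video_path: str, init_box: Box) -> List[Box]:
    """Run a tracker over a video given the first-frame target box."""
    cap = cv2.VideoCapture(video_path)
    ok, first_frame = cap.read()
    if not ok:
        raise IOError(f"cannot read {video_path}")
    tracker.init(first_frame, init_box)       # register the target from frame 0
    boxes = [init_box]
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        boxes.append(tracker.update(frame))   # predict the box in each later frame
    cap.release()
    return boxes
```

Any tracker exposing an init/update interface of this form could be plugged into such a loop.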

Because input videos are highly diverse, tracking algorithms must cope with challenges such as scale changes, shape changes, illumination changes, and occlusion. In particular, when the appearance of the tracked target changes drastically and similar objects appear nearby, tracking accuracy often drops sharply, or tracking fails altogether. As shown in Figure 1, the object to be tracked (red dotted circle) changes dramatically over time: the target's appearance in the current frame is more similar to its appearance in intermediate frames than to its appearance in the initial frame, so the target appearance in intermediate frames provides valuable temporal context. In addition, the spatial context around the target during tracking helps the algorithm distinguish similar objects and background distractors.

Figure 1 Context information plays an important role in the tracking process

Method

Recently, several Transformer-based trackers, such as OSTrack [1], MixFormer [2], and STARK [3], have achieved high accuracy. Building on this line of work, this paper proposes ProContEXT (Progressive Context Encoding Transformer Tracker), which introduces both temporal context and spatial context information into the Transformer network.

The overall structure of ProContEXT is shown in Figure 2. This method has the following characteristics:

Figure 2 The overall structure of ProContEXT
  1. ProContEXT is a progressive context-aware Transformer tracker. During feature extraction, the Transformer exploits dynamic temporal information and diverse spatial information, yielding more robust tracking features.
  2. ProContEXT modifies the ViT backbone so that the input includes multi-scale static templates and multi-scale dynamic templates, and a context-aware self-attention module makes full use of the target's temporal and spatial context during tracking. Through a progressive template refinement and update mechanism, the tracker adapts quickly to changes in the target's appearance (a simplified sketch of this one-stream encoding follows this list).
  3. ProContEXT achieves SOTA performance on multiple public datasets (TrackingNet and GOT-10k), and it fully meets real-time requirements, running at 54.3 FPS.
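As a rough illustration of point 2, the following is a simplified PyTorch sketch (PyTorch assumed; module sizes and names are illustrative, not the official implementation) of the one-stream idea: multi-scale static and dynamic template crops are patch-embedded together with the search region, and their tokens are concatenated so that ViT self-attention can mix context tokens with search tokens.

```python
import torch
import torch.nn as nn

class ContextAwareEncoder(nn.Module):
    """Illustrative one-stream encoder: templates and search share one ViT."""
    def __init__(self, dim=768, depth=12, heads=12, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def tokenize(self, img):                                      # img: (B, 3, H, W)
        return self.patch_embed(img).flatten(2).transpose(1, 2)   # (B, N, dim)

    def forward(self, static_templates, dynamic_templates, search):
        # static_templates / dynamic_templates: lists of crops at different scales
        tokens = [self.tokenize(t) for t in static_templates + dynamic_templates]
        tokens.append(self.tokenize(search))
        x = torch.cat(tokens, dim=1)   # joint template + search token sequence
        return self.encoder(x)         # self-attention mixes context and search tokens

# Example shapes: static and dynamic templates at two scales plus one search crop.
enc = ContextAwareEncoder()
out = enc([torch.randn(1, 3, 128, 128), torch.randn(1, 3, 192, 192)],
          [torch.randn(1, 3, 128, 128), torch.randn(1, 3, 192, 192)],
          torch.randn(1, 3, 256, 256))
```

Because all template and search tokens are encoded jointly, the attention layers can relate the search region to both the static (initial) appearance and the dynamically updated appearances at several scales.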

Experimental results

The experiments in this paper are conducted on the TrackingNet and GOT-10k datasets, fully complying with each dataset's usage guidelines.

Comparison with SOTA

The comparison with current SOTA methods is shown in the table below. ProContEXT outperforms the compared algorithms on both TrackingNet and GOT-10k, achieving SOTA accuracy.

Table 1 Comparison with SOTA

Ablation experiment

An ablation on the number of static templates was carried out; the results are shown in the table below. Using 2 static templates gives the best result. The data also show that adding more static templates introduces redundant information, which degrades tracking performance.

Table 2 Static template number ablation experiments

An ablation on the number and scale of dynamic templates was also carried out; the results are shown in the table below. Adding dynamic templates improves tracking accuracy, and using dynamic templates at two scales further improves accuracy compared with using a single scale.

Table 3 Dynamic template ablation experiments

Finally, the hyperparameter of the token pruning module is explored; the results are shown in the table below. A value of 0.7 gives the best balance between accuracy and efficiency (a sketch of the general token-pruning idea follows the table).

Table 4 Token pruning module ablation experiments
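For reference, here is a hedged sketch of score-based token pruning in this spirit (the general technique used by one-stream Transformer trackers; the exact criterion in ProContEXT may differ): search-region tokens are ranked by the attention they receive from template tokens, and only a fraction keep_ratio of them (e.g. 0.7) is retained for the following layers.

```python
import torch

def prune_search_tokens(attn, search_tokens, keep_ratio=0.7):
    """attn: (B, heads, N_template, N_search) attention from template to search tokens.
    search_tokens: (B, N_search, dim). Keeps the top keep_ratio fraction."""
    scores = attn.mean(dim=(1, 2))                   # average over heads and template tokens -> (B, N_search)
    n_keep = max(1, int(search_tokens.size(1) * keep_ratio))
    idx = scores.topk(n_keep, dim=1).indices         # indices of the highest-scoring search tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, search_tokens.size(-1))
    return torch.gather(search_tokens, 1, idx)       # (B, n_keep, dim)
```

Dropping low-scoring search tokens reduces the sequence length in later layers, which is where the speed/accuracy trade-off controlled by this hyperparameter comes from.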

Model Portal

Video tracking models (a usage sketch for ProContEXT follows this list):

  • Video Single Object Tracking ProContEXT: https://modelscope.cn/models/damo/cv_vitb_video-single-object-tracking_procontext/summary
  • Video single object tracking OSTrack: https://modelscope.cn/models/damo/cv_vitb_video-single-object-tracking_ostrack/summary
  • Video multi-object tracking FairMOT: https://modelscope.cn/models/damo/cv_yolov5_video-multi-object-tracking_fairmot/summary
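Below is a hedged usage sketch for the ProContEXT model above, assuming the standard ModelScope pipeline interface for the video single-object-tracking task; the exact input and output formats should be confirmed on the model card.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Build the tracking pipeline from the ProContEXT model linked above.
tracker = pipeline(
    Tasks.video_single_object_tracking,
    model='damo/cv_vitb_video-single-object-tracking_procontext')

video_path = 'input_video.mp4'        # hypothetical local video file
init_bbox = [100, 100, 200, 200]      # first-frame target box, assumed [x1, y1, x2, y2]
result = tracker((video_path, init_bbox))
print(result)                         # expected to contain the per-frame boxes
```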

Detection-related models:

  • Real-time object detection model YOLOX: https://modelscope.cn/models/damo/cv_cspnet_image-object-detection_yolox/summary
  • High-precision object detection model DINO: https://modelscope.cn/models/damo/cv_swinl_image-object-detection_dino/summary
  • Real-time object detection model DAMO-YOLO: https://modelscope.cn/models/damo/cv_tinynas_object-detection_damoyolo/summary
  • Vertical-industry object detection models: https://modelscope.cn/models?page=1&tasks=vision-detection-tracking%3Adomain-specific-object-detection&type=cv

Keypoint-related models:

  • 2D Human Keypoint Detection Model-HRNet: https://modelscope.cn/models/damo/cv_hrnetv2w32_body-2d-keypoints_image/summary
  • 2D face keypoint detection model-MobileNet: https://modelscope.cn/models/damo/cv_mobilenet_face-2d-keypoints_alignment/summary
  • 2D hand key point detection model-HRNet: https://modelscope.cn/models/damo/cv_hrnetw18_hand-pose-keypoints_coco-wholebody/summary
  • 3D Human Keypoint Detection Model-HDFormer: https://modelscope.cn/models/damo/cv_hdformer_body-3d-keypoints_video/summary
  • 3D Human Keypoint Detection Model-TPNet: https://modelscope.cn/models/damo/cv_canonical_body-3d-keypoints_video/summary

Face-related models:

  • https://modelscope.cn/models/damo/cv_ddsar_face-detection_iclr23-damofd/summary
  • https://modelscope.cn/models/damo/cv_resnet50_face-detection_retinaface/summary
  • https://modelscope.cn/models/damo/cv_resnet101_face-detection_cvpr22papermogface/summary
  • https://modelscope.cn/models/damo/cv_manual_face-detection_tinymog/summary
  • https://modelscope.cn/models/damo/cv_manual_face-detection_ulfd/summary
  • https://modelscope.cn/models/damo/cv_manual_face-detection_mtcnn/summary
  • https://modelscope.cn/models/damo/cv_resnet_face-recognition_facemask/summary
  • https://modelscope.cn/models/damo/cv_ir50_face-recognition_arcface/summary
  • https://modelscope.cn/models/damo/cv_manual_face-liveness_flir/summary
  • https://modelscope.cn/models/damo/cv_manual_face-liveness_flrgb/summary
  • https://modelscope.cn/models/damo/cv_manual_facial-landmark-confidence_flcm/summary
  • https://modelscope.cn/models/damo/cv_vgg19_facial-expression-recognition_fer/summary
  • https://modelscope.cn/models/damo/cv_resnet34_face-attribute-recognition_fairface/summary

For more models, see the ModelScope home page.

Detection Suite Development Tools

The ModelScope community's visual detection development kit, AdaDet, has been released.

References

  • [1] Ye B, Chang H, Ma B, et al., “Joint feature learning and relation modeling for tracking: A one-stream framework”, in ECCV 2022, pp. 341-357.
  • [2] Cui Y, Jiang C, Wang L, et al., “Mixformer: End-to-end tracking with iterative mixed attention”, in CVPR 2022, pp. 13608-13618.
  • [3] Yan B, Peng H, Fu J, et al., “Learning spatio-temporal transformer for visual tracking”, in ICCV 2021, pp. 10448-10457.
