CVPR 2023 | Dalian University of Technology and Microsoft propose SeqTrack: a new framework for object tracking

If a model knows where the target is, we only need to teach it to read out the target's position, without explicit classification or regression. With this work, the researchers hope to inspire the exploration of autoregressive sequence generation modeling for object tracking and other video tasks.

Autoregressive sequence generation models have long played a central role in natural language processing, and the recent emergence of ChatGPT has further demonstrated their remarkable generative power and potential.

Recently, researchers from Microsoft Research Asia and Dalian University of Technology proposed SeqTrack, a new framework that uses a sequence generation model for visual object tracking: tracking is modeled as the generation of a sequence of target coordinates.

Current tracking frameworks typically split tracking into multiple subtasks such as classification, regression, and corner prediction, each handled by a customized prediction head with its own loss function. By modeling tracking as a single, simple sequence generation task, SeqTrack not only discards these redundant prediction heads and loss functions, but also achieves excellent performance on multiple datasets.


Paper link:

http://arxiv.org/abs/2304.14394

GitHub:

https://github.com/microsoft/VideoX


Method Highlights

1. A new object tracking framework that models tracking as a sequence generation task, providing a concise and effective new baseline;

2. It discards redundant prediction heads and loss functions, using only a plain Transformer and a cross-entropy loss, which makes the framework highly extensible.


1. Research motivation

Most advanced tracking methods today adopt a "divide and conquer" strategy, decoupling the tracking problem into multiple subtasks such as center point prediction, foreground/background binary classification, bounding box regression, and corner prediction. Although this achieves excellent performance on the tracking benchmarks, the "divide and conquer" strategy has two disadvantages:

1. Complex models: each subtask requires a customized prediction head, which complicates the framework and hinders extension;

2. Redundant loss functions: each prediction head requires one or more loss functions, introducing extra loss-weighting hyperparameters and making training harder.

Figure 1 Commonly used tracking frameworks

The researchers argue that if the model knows where the target is in the image, it suffices to teach it to read out the target's bounding box; there is no need for a "divide and conquer" strategy with explicit classification and regression. To this end, the authors solve object tracking with autoregressive sequence generation modeling, teaching the model to "read out" the target's position like a sentence.

Figure 2 Sequence generation modeling for tracking

2. Method overview

The researchers convert the four coordinates of the target bounding box into a sequence of discrete tokens and train the SeqTrack model to predict this sequence token by token. Architecturally, SeqTrack adopts the original encoder-decoder form of the Transformer. The overall framework is shown in Figure 3 below:

Figure 3 SeqTrack architecture diagram

The encoder extracts visual features from the template and the search-region image, and the decoder attends to these features to generate the sequence. The sequence contains the x, y, w, h tokens that describe the bounding box, plus two special tokens, start and end, which mark the beginning and end of generation, respectively.

During inference, the start token tells the model to begin generating, and the model then produces x, y, w, h one after another. Each token is generated conditioned on the tokens already produced; for example, when generating w, the model takes [start, x, y] as input. After [x, y, w, h] has been generated, the model outputs the end token to signal that the prediction is complete.
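To make the inference procedure concrete, here is a minimal sketch of this greedy, token-by-token decoding loop. The `decoder` interface, tensor shapes, and argument names are illustrative assumptions, not the actual SeqTrack API:

```python
import torch

@torch.no_grad()
def generate_box_tokens(decoder, visual_feats, start_id, end_id, max_len=4):
    """Greedily decode the [x, y, w, h] coordinate tokens one at a time."""
    seq = torch.tensor([[start_id]])              # running input: [start]
    tokens = []
    for _ in range(max_len):
        logits = decoder(seq, visual_feats)       # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax().item()   # most likely word in V
        if next_id == end_id:                     # model signals completion
            break
        tokens.append(next_id)
        seq = torch.cat([seq, torch.tensor([[next_id]])], dim=1)
    return tokens                                 # discrete x, y, w, h tokens
```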

For training efficiency, tokens are predicted in parallel: [start, x, y, w, h] is fed to the model all at once, and the model predicts [x, y, w, h, end] simultaneously. To preserve the autoregressive property used at inference time, a causal attention mask is applied to the self-attention layers of the decoder during training, so that the prediction of each token depends only on the tokens before it. The attention mask is shown in Figure 4.

Figure 4 The attention mask. An orange cell in row i, column j means that when generating the i-th output token, the model may attend to the j-th input token; a white cell means it may not.
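The mask in Figure 4 is the standard lower-triangular causal mask. A minimal sketch of how such a mask can be built in PyTorch (the boolean convention, True = blocked, follows `nn.MultiheadAttention`; whether SeqTrack constructs it exactly this way is an assumption):

```python
import torch

def causal_attention_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True if output i must NOT see input j."""
    # Mask the upper triangle above the diagonal, so token i attends to tokens 0..i.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# For the training sequence [start, x, y, w, h] (length 5):
mask = causal_attention_mask(5)
# row 2 (the prediction made after [start, x, y]) may see columns 0..2 only
```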

The continuous coordinate values in the image are uniformly discretized into integers in [1, 4000]. Each integer can be regarded as a word, and together they constitute the vocabulary V; the four coordinates x, y, w, h all take their values from V.
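A minimal sketch of this uniform discretization, assuming coordinates are first normalized to [0, 1] (the exact normalization and rounding scheme are assumptions here):

```python
N_BINS = 4000  # size of the coordinate vocabulary V

def quantize(coord: float) -> int:
    """Map a normalized coordinate in [0, 1] to an integer token in [1, 4000]."""
    return min(max(round(coord * N_BINS), 1), N_BINS)

def dequantize(token: int) -> float:
    """Map a token back to an approximate continuous coordinate."""
    return token / N_BINS

# e.g. a box center at 37.5% of the image width becomes token 1500
assert quantize(0.375) == 1500
```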

Like common sequence models, SeqTrack is trained with a cross-entropy loss that maximizes the conditional probability of each target token given the preceding tokens, the search region, and the template:

$$\max \sum_{j=1}^{L} \log P\left(y_j \mid \mathbf{x}, \mathbf{z}, y_{1:j-1}\right)$$

where $y_j$ is the j-th token of the target sequence, $y_{1:j-1}$ are the preceding tokens, $\mathbf{x}$ is the search-region image, and $\mathbf{z}$ is the template.

At inference time, the value of each token is taken as the most likely word in the vocabulary V:

$$\hat{y}_j = \arg\max_{y_j \in V} P\left(y_j \mid \mathbf{x}, \mathbf{z}, y_{1:j-1}\right)$$

In this way, a single cross-entropy loss suffices to train the model, which greatly simplifies the framework.
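Taken together, the two formulas above mean training reduces to one cross-entropy call over the parallel predictions. A hedged sketch (the `model` signature and shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def seqtrack_loss(model, template, search, box_tokens, start_id, end_id):
    """box_tokens: LongTensor (B, 4) holding the quantized x, y, w, h targets."""
    B = box_tokens.size(0)
    start = torch.full((B, 1), start_id, dtype=torch.long)
    end = torch.full((B, 1), end_id, dtype=torch.long)
    inputs = torch.cat([start, box_tokens], dim=1)   # [start, x, y, w, h]
    targets = torch.cat([box_tokens, end], dim=1)    # [x, y, w, h, end]
    logits = model(template, search, inputs)         # (B, 5, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```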

In addition, the researchers integrate the prior knowledge of tracking through techniques such as online template update and a window penalty, without modifying the model or the loss function. Please refer to the paper for details.
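As a rough illustration of how such a prior can be injected without touching the model: a window penalty can re-weight the softmax scores of a coordinate token around the previous target position before the argmax, discouraging implausibly large jumps. This is only a plausible sketch of the idea; the window shape and where exactly it is applied are assumptions, not the paper's exact formulation:

```python
import torch

def window_penalty(probs: torch.Tensor, prev_token: int, vocab_size: int = 4000):
    """Down-weight coordinate tokens far from the previous position.

    probs: (vocab_size,) softmax scores for one coordinate token.
    """
    # Bell-shaped window of length 2*vocab_size + 1 peaking at its center,
    # indexed by each token's distance to the previous position.
    hann = torch.hann_window(2 * vocab_size + 1)
    dist_idx = torch.arange(vocab_size) - prev_token + vocab_size
    return probs * hann[dist_idx]  # argmax over this favors nearby positions
```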

3. Experimental results

The researchers built four models of different sizes to balance performance and speed, and evaluated them on 8 tracking datasets.

Table 1 SeqTrack model variants

As shown in Table 2 below, SeqTrack achieves excellent performance on the large-scale datasets LaSOT, LaSOT_ext, TrackingNet, and GOT-10k. For example, SeqTrack-B256 outperforms OSTrack-256, which also uses ViT-B and a 256x256 input resolution, on all four datasets.

Table 2 Performance on large-scale datasets

As shown in Table 3, SeqTrack achieves leading performance on TNL2K, a dataset containing many uncommon target categories, validating its generalization ability. It also achieves competitive performance on the small-scale datasets NFS and UAV123.

Table 3 Performance on additional datasets

As shown in Figure 5, SeqTrack also achieves excellent performance on the VOT2020 challenge dataset, under both the bounding-box and the segmentation-mask evaluations.

Figure 5 VOT2020 performance

Such a simple framework is also easy to extend: new information can be injected through sequence construction alone, without changing the network structure. For example, the researchers ran additional experiments introducing temporal information into the sequences (Figure 6). Specifically, the input sequence is extended across multiple frames to include the historical values of the target's bounding boxes, as in the sketch after Figure 6 below. Table 4 shows that this simple extension improves the performance of the baseline model.

Figure 6 Schematic of the temporal sequence extension
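Below is a minimal sketch of what this extension could look like at the sequence level. The layout (historical boxes flattened in chronological order before the start token) and the helper itself are assumptions for illustration, not necessarily the paper's exact format:

```python
def build_temporal_input(history, start_id):
    """history: list of (x, y, w, h) token 4-tuples from past frames, oldest first."""
    seq = [t for box in history for t in box]  # flatten historical coordinates
    return seq + [start_id]                    # model then generates the current box

# e.g. two past frames of quantized boxes:
tokens = build_temporal_input([(1500, 1800, 400, 600), (1520, 1790, 405, 610)],
                              start_id=4001)  # start_id value is illustrative
```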

Table 4 Results of the temporal extension

4. Conclusion

This paper proposes a new modeling approach for object tracking: sequence generation. It casts tracking as a sequence generation task, using only a plain Transformer architecture and a cross-entropy loss, which simplifies the tracking framework. Extensive experiments demonstrate the strong performance and potential of sequence generation modeling. The researchers hope this work inspires sequence modeling for visual object tracking and other video tasks. In future work, they plan to further fuse temporal information and to extend the approach to multimodal tasks.

