New work from Wang Limin's team at Nanjing University | MixFormerV2: the first Transformer-based object tracker to run in real time on a CPU!

This article was first published on the WeChat public account CVHub. Unauthorized reprinting or resale to other platforms is strictly prohibited, and violators will be held accountable.

Title: MixFormerV2: Efficient Fully Transformer Tracking

Paper: https://arxiv.org/pdf/2305.15896.pdf

Code: https://github.com/MCG-NJU/MixFormerV2

Introduction

This article introduces a Transformer-based object tracking framework. Traditional trackers follow a three-stage paradigm: feature extraction, information interaction, and location estimation. Most modern methods instead adopt a more unified single-stream structure that performs feature extraction and interaction simultaneously, which is very effective for modeling visual object tracking. However, such architectures are often too large and computationally expensive to deploy in practical applications.

To address this issue, the authors propose MixFormerV2, a fully Transformer-based tracking framework that requires neither dense convolution operations nor a complex score prediction module. The key design is to introduce four special prediction tokens and concatenate them with the tokens of the target template and the search region. A unified Transformer backbone is then applied to this mixed token sequence. Through mixed attention, the prediction tokens capture the complex correlations between the target template and the search region.

Based on these prediction tokens, a simple multi-layer perceptron (MLP) head suffices to predict the tracked bounding box and estimate its confidence score. In addition, to further improve model efficiency, the paper proposes a distillation-based model compression method comprising two parts:

  • Dense-to-sparse distillation
  • Deep-to-shallow distillation

Finally, the paper designs architectures at different scales for different application scenarios: MixFormerV2 achieves an AUC of 70.6% on the LaSOT dataset, and the lightweight variant surpasses FEAR-L by over 2.7% AUC while running at real-time speed on a CPU.

Motivation

The figure above shows MixFormerV2. As can be seen, it is a fully Transformer-based tracking framework without any convolution operations or complex score prediction modules. Its backbone is a plain Transformer operating on a mixed token sequence of target template tokens, search region tokens, and learnable prediction tokens. The network then uses simple MLP heads to predict the probability distributions of the bounding-box coordinates and the corresponding object quality score.

Compared with other Transformer-based trackers such as MixFormer and SimTrack, this method removes the customized convolutional classification and regression heads for the first time, making the tracking pipeline more unified, efficient, and flexible. Simply given template tokens, search region tokens, and learnable prediction tokens as input, the model predicts object bounding boxes and quality scores end to end.

Prediction-Token-Involved Mixed Attention

Compared with the original mixed attention in MixViT, the key difference of the mixed attention module designed here is the introduction of special learnable prediction tokens that capture the correlation between the target template and the search region. These prediction tokens progressively compress the target information and serve as a compact representation for the subsequent regression and classification. Specifically, the concatenated tokens of multiple templates, the search region, and four learnable prediction tokens are fed into N layers of the prediction-token-involved mixed attention module (P-MAM) shown in the figure.

Similar to the original MixFormer, an asymmetric mixed attention scheme is adopted for efficient online inference. Like the CLS token in a standard ViT, the learnable prediction tokens are learned automatically on tracking datasets to compress template and search information; a sketch of one P-MAM step follows below.
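To make this concrete, below is a minimal PyTorch sketch of one P-MAM attention step, written from the description above rather than from the authors' code; the single-head formulation and the `num_template_tokens` split point are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def p_mam_attention(q, k, v, num_template_tokens):
    """One prediction-token-involved mixed attention (P-MAM) step (sketch).

    q, k, v: (B, L, d) projections of the concatenated sequence
    [template tokens | prediction tokens | search tokens].
    Asymmetric scheme: template tokens attend only to themselves, while
    prediction and search tokens attend to the full mixed sequence.
    """
    d = q.size(-1)
    n_t = num_template_tokens

    # Template tokens: attention restricted to the template part.
    attn_t = F.softmax(q[:, :n_t] @ k[:, :n_t].transpose(-2, -1) / d**0.5, dim=-1)
    out_t = attn_t @ v[:, :n_t]

    # Prediction + search tokens: attend to template, prediction, and search tokens.
    attn_ps = F.softmax(q[:, n_t:] @ k.transpose(-2, -1) / d**0.5, dim=-1)
    out_ps = attn_ps @ v

    return torch.cat([out_t, out_ps], dim=1)
```

Because the template tokens never attend to the search region, their keys and values stay fixed for a given target and can be cached across frames during online tracking.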

Direct Prediction Based on Tokens

After the embeddings are extracted by the backbone network, let us see how the prediction tokens are used directly to regress the target position and estimate its confidence score.

Specifically, the paper first performs distribution-based regression on the four special learnable prediction tokens. Note that what is regressed is the probability distribution of each of the four bounding-box coordinates, not their absolute positions. Since the prediction tokens compress target-related information through the prediction-token-involved mixed attention module, a single shared MLP head suffices to predict all four bounding-box coordinates.

In the concrete implementation, the MLP weights are shared across the four prediction tokens. For the object quality assessment, the output prediction tokens are averaged and a separate MLP head estimates the object's confidence score. These token-based heads greatly reduce the complexity of bounding-box estimation and quality-score estimation, leading to a simpler and more unified tracking architecture (see the framework diagram above); a sketch of these heads follows below.
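As an illustration of these token-based heads, here is a minimal sketch under assumed hyper-parameters (`dim`, `num_bins`); it is not the authors' implementation. Each of the four prediction tokens is mapped by a shared MLP to a distribution over discretized normalized coordinates, the coordinate is taken as the expectation of that distribution, and a second MLP on the averaged tokens gives the quality score.

```python
import torch
import torch.nn as nn

class TokenPredictionHead(nn.Module):
    """Sketch of token-based box regression and quality estimation."""

    def __init__(self, dim=768, num_bins=256):
        super().__init__()
        # One MLP shared by the four prediction tokens (top/left/bottom/right).
        self.box_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_bins))
        # Separate MLP for the object quality score.
        self.score_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.register_buffer("bins", torch.linspace(0, 1, num_bins))  # normalized coordinates

    def forward(self, pred_tokens):  # pred_tokens: (B, 4, dim)
        probs = self.box_mlp(pred_tokens).softmax(dim=-1)           # (B, 4, num_bins)
        coords = (probs * self.bins).sum(dim=-1)                    # (B, 4) expectation per coordinate
        score = self.score_mlp(pred_tokens.mean(dim=1)).sigmoid()   # (B, 1) confidence
        return coords, probs, score
```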

The following is a brief description of the two distillation methods mentioned above.

Dense-to-Sparse Distillation

In MixFormerV2, the target bounding box is regressed directly as four random variables based on the prediction tokens, denoting the top, left, bottom, and right coordinates of the box. Specifically, a probability density function is first predicted for each coordinate, and the final coordinate is derived as the expectation of the regressed distribution. Since the original MixViT uses a dense convolutional corner head to predict 2D probability maps, i.e., joint distributions for the top-left and bottom-right corners, the 1D distributions of the bounding-box coordinates can easily be derived from their marginals.

This modeling therefore effectively bridges the gap between dense corner predictions and the sparse token-based predictions: the original MixViT regression outputs can be regarded as soft labels for dense-to-sparse distillation, supervised with a KL-divergence loss. In this way, localization knowledge is transferred from MixViT's dense corner head to MixFormerV2's sparse token-based head; a sketch of this loss follows below.
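A minimal sketch of this supervision is given below, assuming the teacher's corner maps and the student's token-based distributions share the same K bins (the shapes and the (y, x) orientation of the maps are assumptions):

```python
import torch
import torch.nn.functional as F

def dense_to_sparse_distill_loss(student_probs, teacher_tl_map, teacher_br_map):
    """KL loss between the teacher's marginalized corner distributions
    (soft labels) and the student's 1D coordinate distributions (sketch).

    student_probs:  (B, 4, K) student distributions for top/left/bottom/right
    teacher_tl_map: (B, K, K) joint distribution of the top-left corner (y, x)
    teacher_br_map: (B, K, K) joint distribution of the bottom-right corner (y, x)
    """
    top = teacher_tl_map.sum(dim=2)     # marginal over x -> distribution of top (y)
    left = teacher_tl_map.sum(dim=1)    # marginal over y -> distribution of left (x)
    bottom = teacher_br_map.sum(dim=2)  # marginal over x -> distribution of bottom (y)
    right = teacher_br_map.sum(dim=1)   # marginal over y -> distribution of right (x)
    teacher = torch.stack([top, left, bottom, right], dim=1)  # (B, 4, K) soft labels

    # KL divergence with the teacher marginals as soft targets.
    return F.kl_div(student_probs.clamp_min(1e-8).log(), teacher, reduction="batchmean")
```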

Deep-to-Shallow Distillation

To further improve efficiency, the paper prunes the Transformer backbone. Designing a new lightweight backbone is not well suited to fast single-stream tracking, because new backbones for single-stream trackers usually rely heavily on large-scale pre-training for good performance, which requires substantial computing resources. Therefore, the authors directly prune some layers of the MixFormerV2 backbone by means of feature mimicking and logits distillation.

Since directly removing layers can cause inconsistency and discontinuity, the paper explores a progressive model-depth pruning method based on feature and logits distillation. The idea is to keep the student consistent with the teacher during pruning and to provide a smooth transition: at the start, the student exactly replicates the teacher's structure and weights; then some of the student's layers are gradually eliminated while the remaining layers are trained to mimic the teacher's representations. This maintains student-teacher consistency throughout pruning and reduces the difficulty of feature mimicking.

Specifically, a decay rate γ is introduced to down-weight the layers to be pruned. During the first few training epochs, γ is gradually reduced from 1 to 0 following a cosine schedule, realizing a progressive pruning process. In this way, the pruned layers gradually lose their influence on the output and finally become identity transformations, achieving depth pruning. This gradual depth-pruning strategy reduces model complexity while maintaining good performance and improves runtime efficiency; a sketch of the decay mechanism follows below.
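The following sketch shows one way to implement the decaying residual weight, assuming pre-norm Transformer blocks whose output is a residual update (the class name and per-epoch schedule granularity are illustrative assumptions):

```python
import math
import torch.nn as nn

class ProgressivelyPrunedBlock(nn.Module):
    """Wraps a block scheduled for removal: output = x + gamma * block(x) (sketch)."""

    def __init__(self, block):
        super().__init__()
        self.block = block  # the residual branch of a Transformer layer
        self.gamma = 1.0    # starts at 1: student is an exact copy of the teacher

    def set_epoch(self, epoch, decay_epochs):
        # Cosine decay of gamma from 1 to 0 over the first `decay_epochs` epochs.
        t = min(epoch / decay_epochs, 1.0)
        self.gamma = 0.5 * (1.0 + math.cos(math.pi * t))

    def forward(self, x):
        # As gamma -> 0, the block fades into an identity mapping.
        return x + self.gamma * self.block(x)
```

Once γ reaches zero the wrapped blocks contribute nothing to the output, so they can be physically removed from the student without changing its behavior.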

Intermediate Teacher

Although the teacher-student distillation above already achieves a degree of model compression, distilling directly into a very shallow model (such as the 4-layer MixFormerV2) is difficult: the teacher's knowledge may be too complex for a small student to learn. Therefore, the paper also introduces an intermediate teacher to act as a teaching assistant and ease this extreme knowledge distillation. In this way, the distillation between the large teacher and the small student can be decomposed into several subproblems.

Specifically, the depth of the intermediate teacher lies between that of the deep teacher (12-layer MixFormerV2) and the shallow student (4-layer MixFormerV2). Its role is to decompose the knowledge distillation problem into multiple sub-problems so that the small student can learn and absorb the knowledge more easily. Through this step-by-step decomposition, complex knowledge is transferred effectively, building a bridge between models of different depths and improving the distillation result.

Experiments

Ablation Study

Qualitative Analysis


In summary, the MixFormerV2 tracker achieves significant improvements in both performance and efficiency, outperforming existing trackers and reaching state-of-the-art results on several benchmark datasets. For real-time requirements, the lightweight MixFormerV2-S achieves real-time speed on CPU devices while still performing well. These results demonstrate the effectiveness and superiority of the proposed method.

Summary

This paper proposes MixFormerV2, an innovative tracking framework that achieves efficient and accurate object tracking with a fully Transformer network and a simplified head structure. Through model simplification and knowledge distillation, MixFormerV2 achieves significant improvements in speed and performance and reaches state-of-the-art results on multiple benchmark datasets. This research provides a valuable reference for future tracker design and development.



Reprinted from: blog.csdn.net/CVHub/article/details/131039287