[CVPR2023] Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompts


Source: Zhuanzhi (专知)
This article is a paper introduction; recommended reading time: 5 minutes. The paper proposes a multimodal prompt learning scheme that balances supervised and zero-shot performance under a single unified training.


Using contrastive image-text pretrained models such as CLIP for video classification has attracted attention for its cost-effectiveness and competitive performance. However, recent work in this area faces a trade-off: fine-tuning the pretrained model to achieve strong supervised performance results in poor zero-shot generalization, while freezing the backbone to preserve zero-shot capability leads to a significant drop in supervised accuracy. Consequently, recent works in the literature usually train separate models for supervised and zero-shot action recognition. In this paper, we propose a multimodal prompt learning scheme that balances supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side addresses three aspects: 1) a global video-level prompt to model the data distribution; 2) local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, a prompting scheme is defined on the text side to augment the textual context. Through this prompting scheme, we achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51, and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize a much smaller number of parameters and retain the existing general representation, which helps achieve strong zero-shot performance. Our code/models are released at https://github.com/TalalWasim/Vita-CLIP.
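To make the prompting idea more concrete, below is a minimal PyTorch sketch (not the authors' released code; the class name `PromptedFrozenEncoder`, prompt counts, and dimensions are illustrative assumptions). It shows learnable global, frame-level, and summary prompt tokens being prepended to the patch tokens of a frozen transformer encoder, so only the prompt parameters are trained while the pretrained backbone stays fixed.

```python
import torch
import torch.nn as nn


class PromptedFrozenEncoder(nn.Module):
    """Illustrative sketch of vision-side prompting: learnable video-level,
    frame-level, and summary prompt tokens are prepended to the patch tokens
    of a frozen transformer encoder (hypothetical shapes, not the paper's
    exact implementation)."""

    def __init__(self, dim=512, n_frames=8, n_global_prompts=8, depth=4, heads=8):
        super().__init__()
        # Frozen stand-in for the pretrained CLIP vision encoder.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        for p in self.backbone.parameters():
            p.requires_grad = False

        # 1) Global video-level prompts shared across all frames.
        self.global_prompts = nn.Parameter(torch.randn(n_global_prompts, dim) * 0.02)
        # 2) Local frame-level prompts, one learnable token per frame position.
        self.frame_prompts = nn.Parameter(torch.randn(n_frames, dim) * 0.02)
        # 3) A single summary prompt used to pool a condensed video representation.
        self.summary_prompt = nn.Parameter(torch.randn(1, dim) * 0.02)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, n_frames * n_patches, dim) from the frozen patch embedder.
        b = patch_tokens.shape[0]
        prompts = torch.cat(
            [self.summary_prompt, self.global_prompts, self.frame_prompts], dim=0
        ).unsqueeze(0).expand(b, -1, -1)
        tokens = torch.cat([prompts, patch_tokens], dim=1)
        out = self.backbone(tokens)
        # The summary-prompt output serves as the pooled video embedding.
        return out[:, 0]


if __name__ == "__main__":
    enc = PromptedFrozenEncoder()
    video_tokens = torch.randn(2, 8 * 49, 512)  # 2 clips, 8 frames x 49 patches each
    video_embedding = enc(video_tokens)
    print(video_embedding.shape)  # torch.Size([2, 512])
    trainable = sum(p.numel() for p in enc.parameters() if p.requires_grad)
    print(f"trainable params: {trainable}")  # only the prompt tokens are optimized
```

For the paper's actual multimodal prompting on both the vision and text encoders, refer to the repository linked above.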



Source: blog.csdn.net/tMb8Z9Vdm66wH68VX1/article/details/130143261