Fengse, reporting from Aofeisi
Reposted from: QbitAI (量子位)
Prompt tuning, the "new darling" of NLP, has even been hailed by researchers as a new paradigm for NLP pre-training.
So can the CV field borrow the idea and get the same results?
Now, researchers from Cornell University, Meta AI, and other institutions have used prompts to tune Transformer-based vision models, and found:
it absolutely works!
Visual Prompt Tuning
Paper: https://arxiv.org/abs/2203.12119
Compared with full fine-tuning, prompt tuning improves performance significantly: regardless of model size or amount of training data, it wins outright in 20 of 24 cases.
At the same time, it drastically reduces the storage cost required for each task.
Uses less than 1% of model parameters
Full fine-tuning, the approach everyone has always used, requires storing and deploying a separate copy of the backbone parameters for every downstream task. That cost is steep, especially now that Transformer-based models keep growing larger and have surpassed CNN architectures in size.
Originally, a "prompt" referred to language instructions prepended to the input text, so that a pre-trained language model could directly understand various downstream tasks.
Prompts allowed GPT-3 to show strong generalization even in few-shot or zero-shot settings.
Recent results show that prompt tuning can match full fine-tuning in performance while cutting parameter storage by a factor of 1,000.
This success in NLP has led many to explore the magic of prompts in CV, but prior work was limited to the text-encoder inputs of cross-modal tasks.
In this paper, the authors propose Visual Prompt Tuning, or VPT for short. It is the first method to apply prompts to the backbone of a vision model and achieve strong results.
Specifically, inspired by recent methods for tuning large-scale NLP models, VPT introduces only a small number of task-specific parameters in the input space (less than 1% of the model's parameters) and freezes the pretrained backbone while training on downstream tasks.
In practice, these additional parameters are simply prepended to the input sequence of each Transformer layer and learned together with a linear head during fine-tuning.
In total, they explored two variants:
The VPT-Deep variant prepends a set of learnable parameters to the input of every layer of the Transformer encoder;
the VPT-Shallow variant inserts prompt parameters only into the input of the first layer.
In both variants, only the task-specific prompts and the linear head's parameters are updated during downstream training, while the entire Transformer encoder stays frozen.
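The mechanism above can be illustrated with a minimal PyTorch sketch of VPT-Shallow. This is not the authors' code: the class name `VPTShallow`, the toy encoder, and all hyperparameters are assumptions for illustration; it only shows the core idea of freezing the backbone, prepending learnable prompt tokens, and training just the prompts and the linear head.

```python
import torch
import torch.nn as nn

class VPTShallow(nn.Module):
    """Hypothetical sketch of VPT-Shallow: frozen encoder + learnable prompts."""

    def __init__(self, encoder, embed_dim=768, num_prompts=5, num_classes=10):
        super().__init__()
        self.encoder = encoder
        # Freeze the pretrained backbone: its weights receive no gradients.
        for p in self.encoder.parameters():
            p.requires_grad = False
        # Learnable prompt tokens, inserted into the first layer's input sequence.
        self.prompts = nn.Parameter(torch.empty(1, num_prompts, embed_dim))
        nn.init.uniform_(self.prompts, -0.1, 0.1)
        # Linear classification head, trained together with the prompts.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):
        # tokens: (batch, 1 + num_patches, embed_dim) = [CLS] + patch embeddings
        b = tokens.shape[0]
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        # Insert prompts between [CLS] and the patch tokens.
        x = torch.cat([cls_tok, self.prompts.expand(b, -1, -1), patches], dim=1)
        x = self.encoder(x)
        return self.head(x[:, 0])  # classify from the [CLS] position

# Stand-in for a pretrained ViT encoder (randomly initialized here).
toy_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2,
)
model = VPTShallow(toy_encoder)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / total: {total}")
```

With these toy sizes, only the prompts and head (a few thousand parameters) are trainable, well under 1% of the total. VPT-Deep would additionally insert a fresh set of prompt tokens before each encoder layer rather than only the first.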
So how does it hold up in practice? Let's put it to the test.
A 20/24 win rate
The experiments use two backbones pretrained on ImageNet-21k: a Vision Transformer and a Swin Transformer.
Seven fine-tuning methods, grouped into three categories, are compared:
(1) Full fine-tuning: update all backbone and classification-head parameters;
(2) methods focusing on the classification head: Linear, Partial-k, and Mlp-k;
(3) methods that update a subset of backbone parameters or add new trainable parameters to the backbone during fine-tuning: Sidetune, Bias, and Adapter.
There are two sets of experimental datasets, covering a total of 24 downstream recognition tasks across different domains:
(1) FGVC, consisting of 5 benchmark fine-grained visual classification tasks;
(2) VTAB-1k, consisting of 19 different visual classification tasks, subdivided into Natural (images captured with standard cameras), Specialized (images captured with specialized equipment, such as satellite imagery), and Structured (tasks requiring geometric understanding, such as object counting).
Measuring average accuracy on each task, the main results are as follows:
VPT-Deep outperformed full fine-tuning on 20 of the 24 tasks, while using far fewer total model parameters across tasks (1.18× vs. 24.02× the backbone size);
Notably, no matter how powerful prompts are in NLP, their performance there generally does not exceed full fine-tuning. This suggests prompts are especially well suited to vision Transformer models.
Compared with the other fine-tuning methods (groups (2) and (3)), VPT-Deep outperforms them all.
In addition, testing ViTs of different backbone and model scales (ViT-B, ViT-L, and ViT-H) shows that VPT is unaffected by scale and largely maintains its lead.
On the Swin Transformer, full fine-tuning does achieve higher average accuracy, but at a huge parameter cost;
all the other fine-tuning methods still fall short of VPT.
About the authors
First author Menglin Jia is a PhD student in Information Science at Cornell University, working mainly on fine-grained recognition of visual and textual information, with four top-conference papers published so far.
Co-first author Luming Tang is also a PhD student at Cornell, in Computer Science. He graduated from Tsinghua University with a degree in mathematics and physics.
His main research interests lie at the intersection of machine learning and computer vision.