Go beyond traditional fine-tuning! Meta's new work VPT (Visual Prompt Tuning) is here! Freeze the backbone, tune less than 1% of the parameters, and still get a significant performance boost!


Fengse, reporting from Aofeisi
Reproduced from: QbitAI (量子位)

Prompt tuning, the "new darling" of NLP, has even been hailed by researchers as a new paradigm for NLP pre-training.

So, can computer vision borrow the idea and get the same kind of results?

Now, researchers from Cornell University, Meta AI, and other institutions have used prompts to tune Transformer-based vision models, and the answer they found is:

Absolutely yes!


Visual Prompt Tuning

Paper: https://arxiv.org/abs/2203.12119

Compared with full fine-tuning, prompt tuning improves performance significantly: regardless of model size and training data scale, it wins outright in 20 of 24 cases.


At the same time, it can drastically reduce the storage cost required for each task.


Uses less than 1% of model parameters

The full fine-tuning everyone has always relied on requires storing and deploying a separate copy of the backbone parameters for every downstream task. That cost is prohibitive, especially now that Transformer-based models keep growing and have surpassed CNN architectures in size.

The so-called "prompt" originally refers to prepending language instructions to the input text so that a pre-trained language model can directly understand various downstream tasks.

It is what allows GPT-3 to generalize strongly even in few-shot or zero-shot settings.
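
As a toy illustration of what such a text prompt looks like (the task and wording below are made up purely for illustration):

```python
# A toy few-shot sentiment-classification prompt of the kind used with GPT-3.
# The frozen language model simply completes the text; no weights are updated.
prompt = (
    "Classify the sentiment of each review as Positive or Negative.\n"
    "Review: The plot was gripping from start to finish. Sentiment: Positive\n"
    "Review: I walked out halfway through. Sentiment: Negative\n"
    "Review: A beautifully shot but hollow film. Sentiment:"
)
```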

Recent results show that prompt tuning can match fully fine-tuned performance while cutting parameter storage by up to a factor of 1,000.

Its strong performance in NLP has led many to explore the magic of prompts in CV, but so far only for the text-encoder inputs of cross-modal tasks.

In this paper, the authors propose Visual Prompt Tuning, or VPT for short. It is the first time prompts have been applied to the backbone of a vision model with real success.

Specifically, VPT is inspired by recent methods for tuning large NLP models. Compared with full fine-tuning, it introduces only a small number of task-specific parameters in the input space (less than 1% of the model parameters) and keeps the pretrained backbone frozen while training on downstream tasks.


In practice, these additional parameters are simply prepended to the input sequence of each Transformer layer and learned together with the linear head during fine-tuning.

The authors explore two variants:

VPT-Deep prepends a set of learnable prompt parameters to the input of every layer of the Transformer encoder;

VPT-Shallow inserts prompt parameters only into the input of the first layer.

In both variants, only the task-specific prompts and the parameters of the linear head are updated during downstream training, while the entire Transformer encoder stays frozen. A minimal sketch of both variants follows.
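
Here is a minimal PyTorch sketch, assuming a generic ViT-style backbone split into a patch-embedding module and a list of Transformer blocks; the module and parameter names, the prompt length of 50, and the final pooling are illustrative assumptions rather than the paper's actual code:

```python
import torch
import torch.nn as nn

class VPT(nn.Module):
    """Sketch of VPT-Shallow / VPT-Deep around a frozen ViT-style encoder."""

    def __init__(self, patch_embed, blocks, embed_dim, num_classes,
                 num_prompts=50, deep=True):
        super().__init__()
        self.patch_embed = patch_embed          # pretrained, will be frozen
        self.blocks = nn.ModuleList(blocks)     # pretrained Transformer layers
        self.deep = deep
        n_prompted_layers = len(self.blocks) if deep else 1
        # Learnable prompt tokens: (prompted layers, prompts per layer, dim)
        self.prompts = nn.Parameter(
            0.02 * torch.randn(n_prompted_layers, num_prompts, embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)
        # Freeze the entire pretrained backbone.
        for p in self.patch_embed.parameters():
            p.requires_grad = False
        for p in self.blocks.parameters():
            p.requires_grad = False

    def forward(self, x):
        tokens = self.patch_embed(x)            # (B, N, D) patch tokens
        n_p = self.prompts.shape[1]
        for i, block in enumerate(self.blocks):
            if self.deep or i == 0:
                prompt = self.prompts[i if self.deep else 0]
                prompt = prompt.unsqueeze(0).expand(tokens.size(0), -1, -1)
                if i > 0:                       # deep: replace last layer's prompt outputs
                    tokens = tokens[:, n_p:]
                tokens = torch.cat([prompt, tokens], dim=1)
            tokens = block(tokens)
        # Pool the non-prompt tokens for classification (the paper uses the [CLS] token).
        return self.head(tokens[:, n_p:].mean(dim=1))
```

Only `self.prompts` and `self.head` receive gradients here, which is what keeps the per-task storage below 1% of the backbone.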


Next: mule or horse? Time to take it out for a trot and see.

20/24 win rate

The experiments use two backbones pretrained on ImageNet-21k: a Vision Transformer and a Swin Transformer.

For comparison, there are seven fine-tuning methods, grouped into three categories:

(1) Full fine-tuning: update all backbone and classification-head parameters;

(2) Methods focused on the classification head: Linear, Partial-k, and MLP-k;

(3) Methods that update a subset of backbone parameters or add new trainable parameters to the backbone during fine-tuning: Sidetune, Bias, and Adapter (a simplified sketch of what these train follows below).
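
Roughly speaking, these baselines differ mainly in which parameters are left trainable. Below is a simplified sketch of that selection; the parameter-name patterns are assumptions rather than the paper's exact protocols:

```python
def set_trainable(model, method="linear"):
    """Toggle requires_grad to mimic a few of the baselines (simplified)."""
    for name, p in model.named_parameters():
        if method == "full":        # (1) full fine-tuning: everything is trained
            p.requires_grad = True
        elif method == "linear":    # (2) linear probe: only the classification head
            p.requires_grad = name.startswith("head")
        elif method == "bias":      # (3) Bias: only bias terms, plus the head
            p.requires_grad = name.endswith("bias") or name.startswith("head")
        else:
            raise ValueError(f"unknown method: {method}")
```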


The experiments use two groups of datasets, covering a total of 24 downstream recognition tasks across different domains:

(1) FGVC, consisting of 5 benchmark fine-grained visual classification tasks;

(2) VTAB-1k, consisting of 19 diverse visual classification tasks, subdivided into tasks on natural images captured with standard cameras (Natural), tasks on images captured with specialized equipment such as satellite imagery (Specialized), and tasks requiring geometric understanding, such as object counting (Structured).

Measuring the average accuracy on each task, the main results are as follows:

VPT-Deep outperforms full fine-tuning on 20 of the 24 tasks while using far fewer total model parameters (1.18× vs. 24.02×; a back-of-envelope check of these figures follows below);

Note that in NLP, however powerful prompt tuning is, it generally does not surpass full fine-tuning, which suggests that prompts are particularly well suited to vision Transformer models.

Compared with the other fine-tuning methods (groups 2 and 3 above), VPT-Deep wins across the board.
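
As a rough back-of-envelope check of where the 24.02× vs. 1.18× figures can come from (only the ~86M ViT-B/16 backbone size is a standard number; the per-task head and prompt sizes are assumptions):

```python
# Illustrative storage accounting for 24 downstream tasks.
n_tasks  = 24
backbone = 86e6    # ViT-B/16 backbone parameters
head     = 0.1e6   # assumed average per-task classification head
prompts  = 0.5e6   # assumed per-task prompt parameters (<1% of the backbone)

full_ft = n_tasks * (backbone + head)            # a full backbone copy per task
vpt     = backbone + n_tasks * (prompts + head)  # one shared frozen backbone

print(round(full_ft / backbone, 2))  # -> 24.03
print(round(vpt / backbone, 2))      # -> 1.17
```

Under these assumed sizes, one shared frozen backbone plus 24 sets of prompts and heads lands close to the reported ratios.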


In addition, tests with ViT backbones of different parameter and model scales (ViT-B, ViT-L, and ViT-H) show that VPT is largely unaffected by model size and basically maintains its lead.


With the Swin Transformer, full fine-tuning does achieve higher average accuracy, but at a huge parameter cost.

All other fine-tuning methods are inferior to VPT.


About the authors

The first author, Menglin Jia, is a doctoral student in Information Science at Cornell University, whose main research direction is fine-grained recognition of visual and textual information, with four papers at top venues so far.


Co-first author Luming Tang is also a PhD student, in Computer Science at Cornell University; he graduated from Tsinghua University with a major in mathematics and physics.

His main research interests are at the intersection of machine learning and computer vision.


