Visual Prompt

Start with NLP

Simply put, a prompt applies certain modifications to the original input text so that the model performs better on the corresponding task without changing any parameters of the pre-trained model. For example, suppose the original input is: "I received the offer from ETH." For text classification, we modify it to "I received the offer from ETH. I'm so [MASK]", where [MASK] can be filled with an emotional word such as "happy"; compared with the original text, the modified sentence is more likely to be classified as happy. If it is changed to "I received the offer from ETH. Chinese: [MASK]", it is easier to elicit a correct translation for the translation task. These modification strategies are summarized in the well-known prompt survey paper (as shown below):

Prompt algorithm steps in NLP:

Prompt Addition: this step determines how to modify the original text into a prompted template.

Answer Search: build the corresponding answer space; for text classification, for example, this might be {happy, good, terrible, …}.

Answer Mapping: sometimes the answer is not the final result we want. For example, if the final labels are positive and negative, we need to map happy and good to positive, and terrible to negative.
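The three steps above can be sketched in a toy script. This is purely illustrative: the scoring function is a stand-in for a real masked language model, and all names (`add_prompt`, `toy_mask_scorer`, the answer space) are made up for this example.

```python
# Toy illustration of Prompt Addition, Answer Search, and Answer Mapping.
ANSWER_SPACE = ["happy", "good", "terrible"]            # answer search
ANSWER_MAP = {"happy": "positive", "good": "positive",
              "terrible": "negative"}                   # answer mapping

def add_prompt(text: str) -> str:
    """Prompt addition: wrap the input in a cloze template."""
    return f"{text} I'm so [MASK]."

def toy_mask_scorer(prompted: str, candidate: str) -> float:
    """Stand-in for a masked language model's fill-in score."""
    return 1.0 if ("offer" in prompted and candidate == "happy") else 0.1

def classify(text: str) -> str:
    prompted = add_prompt(text)                         # prompt addition
    best = max(ANSWER_SPACE, key=lambda w: toy_mask_scorer(prompted, w))
    return ANSWER_MAP[best]                             # answer mapping

print(classify("I received the offer from ETH."))       # -> positive
```

In a real pipeline the scorer would be an actual pre-trained masked language model; the control flow, however, is exactly the three steps listed above.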

VPT (Visual Prompt Tuning)

1. Paper information

Paper Title: Visual Prompt Tuning

Author team:

Conference: ECCV 2022

Github: https://github.com/kmnp/vpt

2. Motivation and innovation

motivation:

  • The standard way to adapt a pre-trained model is full fine-tuning: when the model is transferred to a downstream task this way, all of its parameters are trained and a separate full copy of the model must be stored per task, resulting in a large computational and storage cost;

  • As computer vision has developed, Transformer-based models have grown larger than CNN-based ones, sharply increasing the number of model parameters and making training more difficult;

  • In recent years, NLP has entered the era of large models. To transfer large pre-trained NLP models to downstream tasks, researchers proposed an alternative to fine-tuning, namely prompt tuning: the pre-trained model is kept frozen and only a small number of additional parameters are trained, transferring the large model to downstream tasks with good results.

  • How to more effectively adapt the pre-trained Transformer for downstream tasks?

Innovation:

  • This article proposes a simple and effective method for adapting pre-trained Transformer models to downstream tasks, namely Visual Prompt Tuning (VPT).

3. Method

The VPT-Deep variant prepends a set of learnable prompt parameters to the input of every layer of the Transformer encoder;

The VPT-Shallow variant inserts prompt parameters only into the input of the first layer.

During training on the downstream task, only the task-specific prompts and the parameters of the linear head are updated, while the entire Transformer encoder stays frozen.
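A minimal sketch of the VPT-Shallow idea in PyTorch (not the official implementation; the class name, prompt count, and pooling are assumptions made for this example):

```python
import torch
import torch.nn as nn

class VPTShallowSketch(nn.Module):
    """Sketch: prepend n_prompts learnable tokens to the patch
    embeddings before a frozen Transformer encoder."""
    def __init__(self, encoder: nn.Module, embed_dim: int = 768,
                 n_prompts: int = 5, n_classes: int = 10):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze the backbone
            p.requires_grad = False
        # only the prompts and the linear head are trainable
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, embed_dim) * 0.02)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        B = patch_tokens.shape[0]
        # concatenate prompt tokens in front of the patch tokens
        x = torch.cat([self.prompts.expand(B, -1, -1), patch_tokens], dim=1)
        x = self.encoder(x)
        return self.head(x.mean(dim=1))       # simple mean-pooled classifier

enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(768, 8, batch_first=True), num_layers=2)
model = VPTShallowSketch(enc)
out = model(torch.randn(2, 196, 768))         # 2 images, 196 patch tokens
print(out.shape)                              # torch.Size([2, 10])
```

VPT-Deep differs only in that a fresh set of prompt tokens is inserted at the input of every encoder layer, not just the first.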

4. Experimental results

The experiments use two groups of datasets covering a total of 24 downstream recognition tasks across different domains:

(1) FGVC, consisting of 5 benchmark fine-grained visual classification tasks;

(2) VTAB-1k, consisting of 19 different visual classification sets, subdivided into natural image tasks captured with standard cameras (Natural), image tasks captured with specialized equipment such as satellites (Specialized), and geometric understanding tasks such as object counting (Structured). Measuring the average accuracy on each task, the main results are as follows:

VPT-Deep outperforms full fine-tuning on 20 of the 24 tasks while storing significantly fewer total model parameters (1.18× vs. 24.02× the backbone size);

In NLP, no matter how strong a prompt method is, its performance rarely exceeds full fine-tuning; here VPT does exceed it, which suggests that prompting is particularly well suited to visual Transformer models.

Exploring Visual Prompts for Adapting Large-Scale Models

1. Paper information

Paper Title: Exploring Visual Prompts for Adapting Large-Scale Models

Author team:

Github: https://hjbahng.github.io/visual_prompting/

2. Motivation

Just as the attention mechanism and the Transformer became mainstream in NLP, CV models based on attention and Transformers, such as attention+CNN hybrids, ViT, Swin Transformer, and ShiftViT, keep emerging. Seeing prompting become more and more popular in NLP, the authors naturally asked: why not visual prompting? The goal is to show that prompting is feasible in the CV field and works well on certain tasks and datasets.

3. Method

Ways to transfer a pre-trained model:

In CV, the main methods for transferring a pre-trained model to new tasks are fine-tuning, linear probing, and visual prompting; the differences between the three are shown in the figure below:

Fine-tuning modifies the parameters of the pre-trained model. Linear probing does not modify the pre-trained model's parameters, but adds a task-specific linear layer after it. Visual prompting modifies neither the pre-trained model's parameters nor the architecture; it only modifies the input image.
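The contrast between the three strategies can be sketched as a parameter-freezing policy (a sketch only; the function name and the choice to model the head as a separate module are assumptions of this example):

```python
import torch.nn as nn

def configure_transfer(backbone: nn.Module, head: nn.Module, mode: str) -> None:
    """Toggle which parameters train under each of the three strategies."""
    if mode == "fine_tuning":              # all backbone weights update
        trainable_backbone = True
    elif mode in ("linear_probe", "visual_prompting"):
        trainable_backbone = False         # backbone stays frozen
    else:
        raise ValueError(mode)
    for p in backbone.parameters():
        p.requires_grad = trainable_backbone
    # linear probing trains the added head; visual prompting instead
    # trains pixels added to the input, leaving even the head frozen
    for p in head.parameters():
        p.requires_grad = (mode != "visual_prompting")

backbone, head = nn.Linear(8, 8), nn.Linear(8, 2)
configure_transfer(backbone, head, "linear_probe")
print(any(p.requires_grad for p in backbone.parameters()),
      all(p.requires_grad for p in head.parameters()))   # False True
```

Under visual prompting, the trainable parameters live in the image space (the prompt pixels), which is what the next section describes.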

Prompt form:

  • For an image, the natural way to add a prompt is to add some pixels. The advantage of a pixel-level prompt is that it is task-specific yet input-agnostic: it is task-specific because the prompt encodes information learned from a large amount of task data; it is input-agnostic because, for a given task, the same learned prompt can be applied at test time no matter which image is input.

  • How to add it: the authors tried three ways: 1) add a pixel patch at a random position; 2) add a pixel patch at a fixed position; 3) pad some pixels along the inner edge of the image (similar to convolutional padding). The third way works best.

  • Padding: the prompt is added as a border of width p around an image of size C, H, W; the total number of prompt pixels is 2*C*p*(H-p) + 2*C*p*(W-p), as shown in the figure:

How to obtain the prompt: for a given task, the task-specific prompt is learned through training; once obtained, it can be applied directly.
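A small NumPy sketch of the padding prompt: the parameter count follows the formula above, and the prompt pixels simply overwrite a border of width p (function names are made up for this illustration; in practice the prompt values are learned, not set to ones):

```python
import numpy as np

def padding_prompt_param_count(C: int, H: int, W: int, p: int) -> int:
    """Learnable pixels in a frame of width p around a C x H x W image:
    2*C*p*(H - p) + 2*C*p*(W - p)."""
    return 2 * C * p * (H - p) + 2 * C * p * (W - p)

def apply_padding_prompt(image: np.ndarray, prompt: np.ndarray, p: int) -> np.ndarray:
    """Overwrite the p-pixel border of the image with the flattened prompt."""
    out = image.copy()
    mask = np.zeros_like(image, dtype=bool)
    mask[:, :p, :] = mask[:, -p:, :] = True   # top and bottom strips
    mask[:, :, :p] = mask[:, :, -p:] = True   # left and right strips
    out[mask] = prompt                        # prompt fills exactly the border
    return out

C, H, W, p = 3, 224, 224, 30
n = padding_prompt_param_count(C, H, W, p)
img = np.zeros((C, H, W))
prompted = apply_padding_prompt(img, np.ones(n), p)
print(n)  # 69840
```

Note that 2*C*p*(H-p) + 2*C*p*(W-p) equals C times the border area H*W - (H-2p)*(W-2p), so the prompt vector length matches the border mask exactly.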

 4. Experimental results


The purpose of the article is not to achieve state of the art, but to demonstrate the effectiveness of visual prompting, and the experimental results are good.

Origin blog.csdn.net/qq_43687860/article/details/129924446