[Paper Reading Notes 75] P-Tuning v2

1. Basic information

| Topic | Author and unit | Source | Year |
| --- | --- | --- | --- |
| P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks | Xiao Liu et al., Tsinghua University | arXiv | 2021 |

Citations, References

Paper link: https://arxiv.org/pdf/2110.07602.pdf

[1] Liu X, Ji K, Fu Y, et al. P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks[J]. arXiv preprint arXiv:2110.07602, 2021.

Paper code: https://github.com/THUDM/P-tuning-v2

2. Key points

Research topic: fine-tuning of large pre-trained models.

Problem background: prompt tuning has clear limitations: it does not perform well on normal-sized pre-trained models, and it cannot handle hard sequence labeling tasks, so it is not a universal solution, especially for NLU tasks.

Core method: P-Tuning v2, an implementation of deep prompt tuning, optimized and adapted for NLU.

Highlights: tuning only 0.1%-3% of the parameters can reach the level of full-parameter fine-tuning.

Dataset: SuperGLUE.

Conclusion: the performance of prompt tuning on models below 10B parameters is improved; on NLU tasks such as named entity recognition, tuning 0.1%-3% of the parameters reaches the level of fine-tuning.

Compared with P-tuning, P-Tuning v2 uses roughly ten times as many trainable prompt parameters and works better on ordinary models: the performance of prompt tuning on ordinary, smaller-parameter models is improved and basically reaches the level of fine-tuning.

[Figure: average accuracy on the RTE, BoolQ, and CB validation sets]
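As a rough sanity check on the 0.1%-3% figure quoted above, here is a back-of-the-envelope calculation (my own, not from the paper), assuming a prompt of 100 virtual tokens stored as keys and values in every layer of BERT-large:

```python
# Back-of-the-envelope estimate (not from the paper): trainable-parameter
# fraction of deep prompts on BERT-large (24 layers, hidden size 1024,
# ~335M parameters) with 100 virtual tokens kept as keys and values per layer.
layers, hidden, prompt_len, total_params = 24, 1024, 100, 335_000_000
prompt_params = layers * 2 * prompt_len * hidden    # keys and values in every layer
print(prompt_params, prompt_params / total_params)  # ~4.9M parameters, roughly 1.5%
```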

3. Model (core content)

3.1 Comparison between P-tuning and P-tuning v2

[Figure: comparison of P-tuning/prompt tuning (prompts only at the input embedding layer) and P-tuning v2 (prompts added to every layer)]

Related methods: Prefix-Tuning (Optimizing Continuous Prompts for Generation) and "prompt tuning", proposed by Lester et al. at Google.

P-tuning and Lester et al.'s prompt tuning add prompts only at the input embedding layer, whereas P-tuning v2 adds prompts to every layer.

The older methods have two problems:

a. The number of tunable parameters is limited.

b. The prompt embeddings sit only at the input layer, so they have no direct influence on the model's output predictions.

To address these problems, P-tuning v2 adopts the idea of deep prompt tuning.
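A minimal PyTorch sketch of the deep-prompt idea follows. It is not the official implementation: it assumes a backbone whose attention layers can consume extra per-layer key/value pairs (for example the past_key_values interface of HuggingFace transformers), and the class name PrefixEncoder is illustrative.

```python
# Minimal sketch of deep prompt tuning (P-Tuning v2 style), not the official code.
# Assumes a backbone whose attention accepts extra per-layer key/value pairs,
# e.g. via the `past_key_values` argument of HuggingFace transformer models.
import torch
import torch.nn as nn

class PrefixEncoder(nn.Module):
    """Trainable key/value prefixes ("virtual tokens") for every transformer layer."""

    def __init__(self, prompt_len, n_layers, n_heads, head_dim):
        super().__init__()
        self.prompt_len, self.n_layers = prompt_len, n_layers
        self.n_heads, self.head_dim = n_heads, head_dim
        # One row per virtual token; the last dim packs (layer, K/V, head, head_dim).
        self.prefix = nn.Embedding(prompt_len, n_layers * 2 * n_heads * head_dim)

    def forward(self, batch_size):
        idx = torch.arange(self.prompt_len)
        kv = self.prefix(idx)                                   # (P, L*2*H*D)
        kv = kv.view(self.prompt_len, self.n_layers, 2, self.n_heads, self.head_dim)
        kv = kv.permute(1, 2, 3, 0, 4)                          # (L, 2, H, P, D)
        kv = kv.unsqueeze(2).expand(-1, -1, batch_size, -1, -1, -1)
        # One (key, value) pair per layer, each of shape (batch, heads, P, head_dim).
        return [(layer_kv[0], layer_kv[1]) for layer_kv in kv]
```

Only the prefix encoder and the task head are trained while the backbone stays frozen, which is where the 0.1%-3% trainable-parameter figure comes from. When feeding such prefixes to a real model, the attention mask also needs to be extended by prompt_len positions.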

Optimization and implementation details:

Reparameterization: previous work typically passes the prompt parameters through an MLP (a reparameterization trick), but on NLU tasks the benefit of doing so depends on the task and the dataset;

**Prompt length:** prompt length plays a key role in P-Tuning v2. In general, simple classification tasks favor shorter prompts (fewer than 20 tokens), while hard sequence labeling tasks favor longer prompts (around 100 tokens).

[Figure: effect of prompt length on different tasks]

**Multi-task learning:** multi-task learning is optional for P-Tuning v2, but it can further improve performance by providing a better initialization for the prompts.

**Classification head:** instead of a verbalizer with an LM head, P-tuning v2 applies a randomly initialized classification head on top of the tokens, as in BERT (e.g. a linear head on [CLS] for classification).
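A sketch of this [CLS]-plus-linear-head setup, under the same assumptions as the prefix sketch above (frozen backbone, prefixes injected via past_key_values); the class name and the mask handling are illustrative, not the repository's exact code:

```python
# Sketch: frozen backbone + randomly initialized linear head on [CLS], with
# optional per-layer prefixes. Not the official P-tuning-v2 implementation.
import torch
import torch.nn as nn
from transformers import AutoModel

class PrefixClassifier(nn.Module):
    def __init__(self, backbone_name="bert-large-uncased", num_labels=2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        for p in self.backbone.parameters():          # freeze the pre-trained model
            p.requires_grad = False
        self.classifier = nn.Linear(self.backbone.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, past_key_values=None, prompt_len=0):
        if past_key_values is not None:               # account for the virtual tokens
            prefix_mask = torch.ones(input_ids.size(0), prompt_len,
                                     device=attention_mask.device,
                                     dtype=attention_mask.dtype)
            attention_mask = torch.cat([prefix_mask, attention_mask], dim=1)
        out = self.backbone(input_ids=input_ids,
                            attention_mask=attention_mask,
                            past_key_values=past_key_values)
        cls_hidden = out.last_hidden_state[:, 0]      # hidden state of [CLS]
        return self.classifier(cls_hidden)            # logits over the classes
```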

4. Experiment and analysis

4.1 Experimental content

NLU Tasks: SuperGLUE.

BoolQ: question answering task;

CB (Commitment Bank): textual entailment task;

COPA (Choice of Plausible Alternatives): causal reasoning task, choosing the more plausible alternative;

MultiRC (Multi-Sentence Reading Comprehension): true/false question answering task;

ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset): cloze-style question answering over entities, requiring commonsense reasoning;

RTE (Recognizing Textual Entailment): textual entailment task;

WiC (Words in Context): decide whether the target word has the same meaning in the two sentences to be analyzed;

WSC (The Winograd Schema Challenge): coreference resolution task;

Pre-trained Models: BERT-large, RoBERTa-large, DeBERTa-xlarge, GLM-xlarge/xxlarge

Multitask Learning: named entity recognition (NER), (extractive) question answering (QA), semantic role labeling (SRL)

NER (IOB2 format): CoNLL03, OntoNotes 5.0, CoNLL04; the multi-task setting combines the training sets of the three datasets (a small IOB2 example follows this list);

**QA:** SQuAD 1.1, SQuAD 2.0; the multi-task setting combines the training sets of SQuAD 1.1 and 2.0;

**SRL:** CoNLL05, CoNLL12; the multi-task setting combines the training sets of CoNLL05 and CoNLL12;
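A tiny illustration of the IOB2 tagging scheme mentioned for NER (the sentence and tags are invented for illustration):

```python
# IOB2: the first token of an entity gets a B- tag, following tokens get I-,
# and non-entity tokens get O. The sentence and tags below are made up.
tokens = ["Barack", "Obama", "visited", "Tsinghua", "University", "."]
tags   = ["B-PER",  "I-PER", "O",       "B-ORG",    "I-ORG",      "O"]
```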

4.2 Effect

Regarding the comparison of different model sizes:

[Figure: results across model scales]

Across Tasks comparison:
[Figure: results across tasks]

Multi-task NER: pre-train by combining the training sets of the three datasets; the continuous prompts are shared, while each dataset uses its own linear classifier (see the sketch after the QA paragraph below).

Multi-task QA: pre-training combines the training sets of SQuAD 1.1 and 2.0, assuming that any question, regardless of its source, may be unanswerable.
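A sketch of the multi-task NER arrangement described above: the continuous prompts (and the frozen backbone) are shared, and each dataset gets its own linear classifier. The class name and label counts are illustrative:

```python
# Sketch of shared prompts + per-dataset linear heads for multi-task NER.
# The label counts used below are illustrative, not taken from the paper.
import torch.nn as nn

class MultiTaskTokenHeads(nn.Module):
    def __init__(self, hidden_size, label_counts):
        super().__init__()
        # e.g. label_counts = {"conll03": 9, "ontonotes": 37, "conll04": 9}
        self.heads = nn.ModuleDict({name: nn.Linear(hidden_size, n)
                                    for name, n in label_counts.items()})

    def forward(self, token_hidden_states, dataset_name):
        # token_hidden_states: (batch, seq_len, hidden) produced by the frozen
        # backbone run with the shared continuous prompts.
        return self.heads[dataset_name](token_hidden_states)   # per-token logits
```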

Ablation experiment:

Verbalizer with LM head vs. [CLS] label with linear head

[Figure: ablation of verbalizer with LM head vs. [CLS] label with linear head]

Prompt depth: adding prompts to layers in descending order (from the top layers downward) works better than adding them in ascending order.

[Figure: prompt depth ablation]

Judging from this figure, which layers should receive prompts depends strongly on the dataset: for RTE, adding prompts only to layers 17-24 is nearly sufficient, while for BoolQ, the more layers the better.

5. Code

6. Summary

The experimental results are encouraging, especially on NLU tasks. One advantage is that the pre-trained model does not need to be very large; another is that there is no need to store a separate full copy of the model for each task. A further point is that a [CLS] token with a linear head is used to replace the classic verbalizer.

7. Knowledge collation (knowledge points, literature to be read, extracting the original text)

The verbalizer is a label-word mapping that converts the model's vocabulary prediction at the **[MASK]** position into a classification label, for example {POLITICS: "politics", SPORTS: "sports"}.
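A small sketch of how a verbalizer scores classes from the [MASK] position; the model name, label words, and example sentence are illustrative:

```python
# Verbalizer sketch: read the masked-LM logits at the [MASK] position and keep
# only the label words' scores. Model, labels, and text are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

verbalizer = {"POLITICS": "politics", "SPORTS": "sports"}      # label -> label word
label_ids = {lbl: tokenizer.convert_tokens_to_ids(w) for lbl, w in verbalizer.items()}

text = "The senators debated the new bill. Topic: [MASK]."
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]                # vocabulary scores
scores = {lbl: logits[tid].item() for lbl, tid in label_ids.items()}
print(max(scores, key=scores.get))                              # predicted label
```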

Prompt tuning is the idea of fine-tuning only continuous prompts. Specifically, Liu et al. (2021b) and Lester et al. (2021) propose to augment the original sequence of input word embeddings with trainable continuous embeddings.
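For contrast with the deep variant sketched in Section 3, here is a minimal sketch of this shallow form of prompt tuning, where trainable embeddings are prepended only to the input word embeddings (names are illustrative):

```python
# Shallow prompt tuning sketch (Lester et al. / P-tuning v1 style): trainable
# continuous embeddings are prepended to the input word embeddings only.
import torch
import torch.nn as nn

class ShallowPromptEmbedding(nn.Module):
    def __init__(self, word_embedding: nn.Embedding, prompt_len: int):
        super().__init__()
        self.word_embedding = word_embedding          # taken from the frozen backbone
        dim = word_embedding.embedding_dim
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)  # trainable

    def forward(self, input_ids):
        tok = self.word_embedding(input_ids)                           # (B, L, D)
        prompt = self.prompt.unsqueeze(0).expand(tok.size(0), -1, -1)  # (B, P, D)
        return torch.cat([prompt, tok], dim=1)                         # (B, P+L, D)
```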

Excerpts from the original text:

Deep prompt tuning increases the capacity of continuous prompts and closes the gap to fine-tuning across various settings, especially for small models and hard tasks.

About SuperGLUE tasks:

BoolQ

BoolQ is a question answering dataset of yes/no questions with 15,942 examples. The questions arise naturally, in unprompted and unconstrained settings. Each example is a triple of (question, passage, answer), with the page title as optional additional context.

{ "question": "is windows movie maker part of windows essentials",

"passage": "Windows Movie Maker – Windows Movie Maker (formerly known as Windows Live Movie Maker in Windows 7) is a discontinued video editing software by Microsoft. It is a part of Windows Essentials software suite and offers the ability to create and edit videos as well as to publish them on OneDrive, Facebook, Vimeo, YouTube, and Flickr.",

"idx": 2,

"label": true}

CB: Commitment Bank
CB is a corpus of short texts in which at least one sentence contains an embedded clause. Each embedded clause is annotated with the expected degree of truth of that clause. The resulting task is framed as three-class textual entailment, with samples drawn from the Wall Street Journal, British National Corpus fiction, and Switchboard. Each sample contains a premise with an embedded clause, and the corresponding hypothesis is the extraction of that clause. SuperGLUE uses a subset of this dataset in which inter-annotator agreement exceeds 0.85. The data are not well balanced (there are relatively few neutral samples), so the evaluation metrics are accuracy and F1, where the multi-class F1 is the unweighted average of the per-class F1 scores.
In essence, CB is a textual entailment task: after processing the premise, the model judges whether the hypothesis drawn from the premise is entailed, contradicted, or neutral.

{ "premise": "The Susweca. It means 'dragonfly' in Sioux, you know.
Did I ever tell you that's where Paul and I met?",
"hypothesis": "Susweca is where she and Paul met,",
"label": "entailment", "idx": 77}

COPA: Choice of Plausible Alternatives
The COPA dataset represents a causal reasoning task: the system is given a premise sentence and two alternatives and must choose the alternative that has the more plausible causal relationship with the premise. The method used to construct the alternatives ensures that causal reasoning is required to solve the task. Samples ask either for a possible cause or a possible effect of the premise sentence, together with a simple disambiguating question that tells the model which of the two instance types is intended.

Premise: I knocked on my neighbor's door.
What happened as a result?

Alternative 1: My neighbor invited me in.
Alternative 2: My neighbor left his house.

MultiRC: Multi-Sentence Reading Comprehension
MultiRC is a true/false question answering task. Each sample consists of a context passage, a question about that passage, and a list of possible answers to the question, each of which must be labeled "true" or "false". Question answering is a very common problem and there are many datasets.
The reasons for choosing MultiRC include:
(1) each question can have multiple possible correct answers, so each question-answer pair must be evaluated independently of the other pairs;
(2) the questions are designed so that answering each question requires extracting facts from multiple context sentences;
(3) compared with span-based extractive question answering, the question-answer-pair format of this dataset is closer to the API of the other SuperGLUE tasks.
The passages are drawn from seven domains, including news, fiction, and historical texts. The evaluation metrics are the macro-average F1 score (F1m) over each question's set of correct answers and the binary F1 score (F1a) over all answer options. For example, given the text:

"text": "The rally took place on October 17, the shooting on
February 29. Again, standard filmmaking techniques are interpreted as
smooth distortion: “Moore works by depriving you of context and
guiding your mind to fill the vacuum – with completely false ideas.
It is brilliantly, if unethically, done.” As noted above, the “from
my cold dead hands” part is simply Moore’s way to introduce Heston.
Did anyone but Moore’s critics view it as anything else? He certainly
does not “attribute it to a speech where it was not uttered” and, as
noted above, doing so twice would make no sense whatsoever if Moore
was the mastermind deceiver that his critics claim he is. Concerning
the Georgetown Hoya interview where Heston was asked about Rolland,
you write: “There is no indication that [Heston] recognized Kayla
Rolland’s case.” This is naive to the extreme – Heston would not be
president of the NRA if he was not kept up to date on the most
prominent cases of gun violence. Even if he did not respond to that
part of the interview, he certainly knew about the case at that point.
Regarding the NRA website excerpt about the case and the highlighting
of the phrase “48 hours after Kayla Rolland is pronounced dead”:
This is one valid criticism, but far from the deliberate distortion
you make it out to be; rather, it is an example for how the facts can
sometimes be easy to miss with Moore’s fast pace editing. The reason
the sentence is highlighted is not to deceive the viewer into
believing that Heston hurried to Flint to immediately hold a rally
there (as will become quite obvious), but simply to highlight the
first mention of the name “Kayla Rolland” in the text, which is in
this paragraph. "

and the answer

"question": "When was Kayla Rolland shot?", "answers": [{"text": "February 17", "idx": 168, "label": 0}, {"text": "February 29", "idx": 169, "label": 1}, {"text": "October 29", "idx": 170, "label": 0}, {"text": "October 17", "idx": 171, "label": 0}, {"text": "February 17", "idx": 172, "label": 0}], "idx": 26}, {"question": "Who was president of the NRA on February 29?", "answers": [{"text": "Charleton Heston", "idx": 173, "label": 1}, {"text": "Moore", "idx": 174, "label": 0}, {"text": "George Hoya", "idx": 175, "label": 0}, {"text": "Rolland", "idx": 176, "label": 0}, {"text": "Hoya", "idx": 177, "label": 0}, {"text": "Kayla", "idx": 178, "label": 0}], "idx": 27}

8. References

https://blog.csdn.net/weixin_39754630/article/details/119146018

made by happyprince

Origin blog.csdn.net/ld326/article/details/131108644