P-Tuning v2 paper overview

Preface

Limitations of fine-tuning

Fine-tuning adapts a pre-trained model to a target task by updating the entire parameter set. Although it can achieve good performance, it is memory-intensive because the gradients and optimizer states of all parameters must be stored during training. Furthermore, since pre-trained models are usually large, keeping a separate copy of the model parameters for each task at inference time is inconvenient.

Disadvantages of P-Tuning

Insufficient generality across model scales: although P-Tuning is comparable to fine-tuning on large models (more than 10 billion parameters), for medium-sized models (100M to 1B parameters) the performance of prompt tuning falls far short of fine-tuning.

Insufficient generality across tasks: although P-Tuning performs well on some natural language understanding (NLU) benchmarks, its effectiveness on hard sequence labeling tasks has not been verified. A sequence labeling task predicts a label for each input token, which can be harder and is incompatible with verbalizers.

P-Tuning v2

Prompt tuning does not work well for models of ordinary size (less than 10B parameters).

To address these challenges, P-Tuning v2 is proposed: it adopts deep prompt tuning as a general solution across model scales and NLU tasks.

P-Tuning v2 fine-tunes only 0.1%-3% of the parameters and is suitable for models of common sizes (300M-10B).
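A back-of-envelope calculation illustrates where a figure in the 0.1%-3% range can come from. The configuration below (BERT-large-like sizes and a long 100-token prompt) is an illustrative assumption, not a number taken from the paper:

```python
# Back-of-envelope estimate of the trainable-parameter fraction in
# P-Tuning v2. All numbers are illustrative assumptions (a BERT-large-like
# configuration with a long 100-token prompt), not figures from the paper.
hidden_size = 1024          # model hidden dimension
num_layers = 24             # transformer layers
prefix_len = 100            # continuous prompt tokens per layer
total_params = 335_000_000  # roughly BERT-large

# Each prompt position contributes a key and a value vector in every layer.
trainable = prefix_len * num_layers * 2 * hidden_size
fraction = trainable / total_params
print(f"trainable: {trainable:,} ({fraction:.2%} of the model)")
# → trainable: 4,915,200 (1.47% of the model)
```

Shorter prompts or larger backbones push the fraction toward the lower end of the range.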


Summary

Prompt tuning tunes only continuous prompts while keeping the language model frozen, significantly reducing per-task storage and memory usage during training.

However, in the context of NLU, previous work has shown that prompt tuning performs poorly for normal-sized pre-trained models. We also find that existing prompt tuning methods cannot handle hard sequence labeling tasks, indicating a lack of generality.

We present a novel empirical finding that properly optimized prompt tuning can be generally effective across a wide range of model sizes and NLU tasks. It matches the performance of fine-tuning while requiring only 0.1%-3% of the parameters to be tuned.

Our method P-Tuning v2 is an implementation of Deep Prompt Tuning (Li and Liang, 2021; Qin and Eisner, 2021), optimized and adapted for NLU. Given the generality and simplicity of P-Tuning v2, we believe it can serve as an alternative to fine-tuning and provide a strong baseline for future research.

Ten questions about the paper

  1. What problem is the paper trying to solve?

This paper attempts to solve the problem of insufficient generalization of prompt tuning across model scales and on hard sequence labeling tasks.

  2. Is this a new problem?

It cannot be said to be a new problem: the article notes that the generalization problem of prompt tuning had already been identified by previous work.

  3. What scientific hypothesis does this article test?

This article verifies the hypothesis that an appropriately optimized prompt tuning method can match fine-tuning consistently across model scales and tasks.

  4. What are the relevant studies? How can they be classified? Who are the noteworthy researchers in this field?

Relevant studies include the work of Lester et al. (2021) and Liu et al. (2021), which explored prompt tuning on models with more than 10 billion parameters. Researchers worth following include Jie Tang, one of the paper's authors.

  5. What is the key to the solution proposed in the paper?

The key to the proposed solution, P-Tuning v2, is to add continuous prompts to every transformer layer.

  6. How were the experiments in the paper designed?

The paper designs experiments across different model sizes and NLU tasks to compare P-Tuning v2 with fine-tuning.

  7. What datasets are used for quantitative evaluation? Is the code open source?

The datasets include SuperGLUE, plus named entity recognition, reading comprehension, and semantic role labeling benchmarks. The code is open source on GitHub.

  8. Do the experiments and results in the paper support the scientific hypothesis being tested?

The experimental results support the hypothesis: an appropriately optimized prompt tuning method generalizes comparably to fine-tuning.

  9. What contribution does this paper make?

The main contribution of this paper is the discovery that appropriately optimized prompt tuning can be as effective as fine-tuning across model scales and NLU tasks.

  10. What's next? Is there work that can be further developed?

Future work can explore applying and optimizing this generalized prompt tuning method on other NLP tasks, such as generation tasks.

NLU tasks

Generally, shorter prompts (fewer than 20 tokens) are preferred for simple classification tasks; longer ones (around 100 tokens) are preferred for hard sequence labeling tasks.

  • Simple classification task

Simple classification tasks involve classification over a label space; most datasets in GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) belong to this category.

  • Hard sequence labeling tasks

Hard sequence labeling tasks are those natural language understanding (NLU) tasks that require predicting a label for each token in the input, rather than a single label for the whole input.

Examples of hard sequence labeling tasks include named entity recognition and extractive question answering. In these tasks, the model classifies each element of the input sequence to produce a sequence of labels, which is usually more challenging than simple classification.
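To make the contrast concrete, here is a minimal PyTorch sketch of sequence labeling as per-token classification. The dimensions and tag count are illustrative assumptions (e.g., a small BIO tag set for NER), not values from the paper:

```python
import torch
import torch.nn as nn

# Sequence labeling as per-token classification: one label per input token.
# Dimensions and the tag count are illustrative assumptions.
hidden_size, num_tags, batch, seq_len = 1024, 5, 2, 16

token_classifier = nn.Linear(hidden_size, num_tags)       # trainable head
hidden_states = torch.randn(batch, seq_len, hidden_size)  # stand-in for frozen encoder output
tag_logits = token_classifier(hidden_states)              # (batch, seq_len, num_tags)
pred_tags = tag_logits.argmax(dim=-1)                     # one tag id per token
```

A simple classification task, by contrast, would apply one classifier to a single pooled representation of the whole input.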

Optimization points

P-Tuning v2 is an implementation of deep prompt tuning for natural language understanding tasks.

The key takeaways are:

(1) Insert continuous prompts into every transformer layer of the pre-trained language model.

(2) Reparameterize the prompt representation (optional).

(3) Use a linear classification head instead of a language modeling head.
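Point (1) is commonly implemented by prepending trainable prefix key/value vectors to the keys and values of each layer's attention, so every query also attends over the prompt positions. The sketch below shows a single layer with illustrative shapes; it is an assumption-laden illustration, not the authors' code:

```python
import torch

# One transformer layer's attention inputs (illustrative shapes).
batch, n_heads, head_dim = 4, 16, 64
seq_len, prefix_len = 128, 20

# Trainable continuous prompt for this layer: one key and one value
# vector per prompt position (one such pair exists per layer).
prefix_k = torch.nn.Parameter(torch.randn(n_heads, prefix_len, head_dim))
prefix_v = torch.nn.Parameter(torch.randn(n_heads, prefix_len, head_dim))

# Keys/values computed from the actual input tokens by the frozen model.
k = torch.randn(batch, n_heads, seq_len, head_dim)
v = torch.randn(batch, n_heads, seq_len, head_dim)

# Prepend the prompt, so queries attend over prefix_len + seq_len positions.
k = torch.cat([prefix_k.expand(batch, -1, -1, -1), k], dim=2)
v = torch.cat([prefix_v.expand(batch, -1, -1, -1), v], dim=2)
print(k.shape)  # torch.Size([4, 16, 148, 64])
```

Because only the prefix tensors require gradients, the backbone stays frozen and per-task storage is just these prefixes.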
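Point (2), the optional reparameterization, can be sketched as a small embedding passed through an MLP rather than optimizing the prompt vectors directly. Sizes below are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Optional reparameterization: optimize a small prompt embedding passed
# through an MLP instead of the prompt vectors directly. Sizes are assumptions.
hidden_size, prefix_len, mlp_dim = 1024, 20, 512

prefix_ids = torch.arange(prefix_len)
embed = nn.Embedding(prefix_len, hidden_size)
mlp = nn.Sequential(
    nn.Linear(hidden_size, mlp_dim),
    nn.Tanh(),
    nn.Linear(mlp_dim, hidden_size),
)
prompts = mlp(embed(prefix_ids))  # (prefix_len, hidden_size)
```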
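Point (3) replaces the LM head plus verbalizer of earlier prompt tuning with an ordinary randomly initialized linear head on the [CLS] hidden state. A minimal sketch, with assumed dimensions and labels:

```python
import torch
import torch.nn as nn

# A linear classification head on the [CLS] hidden state, replacing the
# LM head + verbalizer. Dimensions and labels are illustrative assumptions.
hidden_size, num_labels, batch = 1024, 3, 4

classifier = nn.Linear(hidden_size, num_labels)
cls_hidden = torch.randn(batch, hidden_size)   # stand-in for encoder output at [CLS]
logits = classifier(cls_hidden)                # (batch, num_labels)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 0]))
```

This removes the need to map class labels onto vocabulary words, which is what makes tasks like sequence labeling tractable.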


Experiments

Datasets

SuperGLUE, plus named entity recognition, reading comprehension, and semantic role labeling benchmarks.

Pre-trained models

Model            Size
BERT-large       335M
RoBERTa-large    355M
DeBERTa-xlarge   750M
GLM-xlarge       2B
GLM-xxlarge      10B

Experimental results

Results on the SuperGLUE development set: P-Tuning v2 exceeds P-Tuning on models smaller than 10B and matches the performance of fine-tuning across model scales.

Results for named entity recognition (NER), extractive question answering (QA), and semantic role labeling (SRL). All metrics for NER and SRL are micro-F1 scores.

Ablation experiments

  • Verbalizer with LM head vs. [CLS] label with linear head.

On RoBERTa-large, experimental results show that the performance of the two is similar.

  • Effect of prompt depth

Given the same number of tunable parameters, adding continuous prompts to deeper layers (closer to the output) achieves better performance than adding them to the earliest layers, which verifies the effectiveness of multi-layer continuous prompts.

On RTE, adding prompts only to layers 17-24 yields performance very close to adding them to all layers.

Conclusion

  • The performance of P-Tuning v2 is comparable to fine-tuning methods on different model sizes (300M to 10B parameters) and natural language understanding (NLU) tasks.

  • P-Tuning v2 only needs to tune 0.1% to 3% of task-specific parameters, while the fine-tuning method requires tuning all parameters of the entire model.

  • Compared with other prompt tuning methods, P-Tuning v2's performance on both simple classification tasks and hard sequence labeling tasks (such as extractive question answering and named entity recognition) is closer to that of fine-tuning.

In summary, the P-Tuning v2 method has lower training time, memory cost, and storage cost per task, and has comparable performance to the fine-tuning method on various model sizes and NLU tasks. This makes P-Tuning v2 an alternative to fine-tuning methods and a strong baseline for future research.

Origin blog.csdn.net/qq128252/article/details/134769856