Paper Notes: P-Tuning v2, Prompt Tuning Comparable to Fine-tuning Performance

Recently I was reading Liu Pengfei's survey on prompting, but the paper is long, so I could only read part of it and take partial notes. Discrete prompts are easier to understand, so I will focus on notes about continuous prompts. P-tuning v2 builds on several earlier papers, so today's notes cover this paper, with a rough summary of the earlier work it references.

Paper address

https://arxiv.org/pdf/2110.07602.pdf

1. Abstract

Prompt tuning, proposed in earlier work, freezes the language model and tunes only continuous prompts. On models with more than 10B parameters it catches up with fine-tuning, but it performs poorly on normal-sized models and cannot handle sequence labeling tasks. P-tuning v2 is proposed to address these two problems.

2. Related work

This paper builds on prompt tuning, prefix tuning, and P-tuning v1, so here is a brief introduction to help understand it. These methods all optimize continuous prompts. Earlier work designed templates by hand or generated them automatically, collectively referred to as discrete prompts. Discrete prompts have limitations: the result may not be optimal, and they are very sensitive to token changes, so research has been moving toward prompts in continuous space.

2.1 Prefix tuning

Prefix tuning targets NLG tasks. The idea is to add a different prefix for each task: the pre-trained language model is frozen and only the task-specific prompt is tuned. The prompt is not added only to the input embedding layer; trainable parameters are added to every layer, and an MLP is used to reparameterize the prefix.

2.2 P-tuning v1

The starting point is to give GPT-style generative models NLU ability as well. The main structure uses a prompt encoder (BiLSTM + MLP) to encode pseudo prompts, which are then spliced with the input embeddings; the pseudo tokens do not have to be a prefix and can also sit at intermediate positions.
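To make the prompt encoder idea concrete, here is a minimal sketch of a P-tuning v1 style encoder in PyTorch. All sizes and module names are illustrative assumptions, not the paper's exact code; the point is only the BiLSTM + MLP re-encoding of learnable pseudo-token embeddings.

```python
# Minimal sketch of a P-tuning v1 style prompt encoder (BiLSTM + MLP).
import torch
import torch.nn as nn

class PromptEncoderV1(nn.Module):
    def __init__(self, prompt_len=10, hidden_dim=768):
        super().__init__()
        # Learnable pseudo-token embeddings that will be re-encoded.
        self.pseudo_embeddings = nn.Embedding(prompt_len, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                 nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))
        self.register_buffer("ids", torch.arange(prompt_len))

    def forward(self, batch_size):
        x = self.pseudo_embeddings(self.ids).unsqueeze(0)  # (1, prompt_len, hidden)
        x, _ = self.lstm(x)                  # contextualize the pseudo tokens
        x = self.mlp(x)                      # project back to the embedding size
        return x.expand(batch_size, -1, -1)  # spliced with input embeddings later

prompts = PromptEncoderV1()(batch_size=4)    # shape: (4, 10, 768)
```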

2.3 Prompt tuning

Trainable continuous embeddings are added on top of the input word embeddings of the original sequence. With a suitable initialization, once the model reaches 10B parameters the effect matches fine-tuning, while only about 0.01% of the parameters need to be tuned. P-tuning v2 continues the idea of this paper and solves some of its problems.
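Below is a minimal sketch of this input-layer prompt tuning: trainable soft-prompt embeddings are simply prepended to the frozen model's input embeddings. The shapes and variable names are assumptions for illustration, and the special initialization (e.g. copying embeddings of real words) is omitted.

```python
# Lester et al.-style prompt tuning: prepend trainable soft prompts to the
# (frozen) model's input embeddings; only `soft_prompt` receives gradients.
import torch
import torch.nn as nn

hidden_dim, prompt_len, batch, seq_len = 768, 20, 2, 16
soft_prompt = nn.Parameter(torch.randn(prompt_len, hidden_dim) * 0.02)

input_embeds = torch.randn(batch, seq_len, hidden_dim)  # frozen embedding-layer output
full_embeds = torch.cat(
    [soft_prompt.unsqueeze(0).expand(batch, -1, -1), input_embeds], dim=1
)  # (batch, prompt_len + seq_len, hidden_dim), fed to the frozen transformer
```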

(In fact, the P-tuning v2 approach is closer to prefix tuning; it can be seen as an optimized prefix tuning.)

3. Structure

The figure on the left shows the earlier P-tuning; the figure on the right shows P-tuning v2.

It can be seen that in P-tuning v2, as in prefix tuning, continuous prompts are added at the front of the sequence, and trainable prompts are added to every layer. In the earlier model (left figure), prompts are inserted only into the input embedding, so the number of trainable parameters is limited by the sentence length, and prompts at the input layer have only an indirect influence on the model's predictions. The change gives the model more task-specific parameters (from 0.01% up to 0.1%-3%), and adding prompts to every layer also has a more direct impact on the prediction results.

(The prompts added here sit outside the pre-trained model: when each layer's output is passed to the next layer, several vectors of size prompt length × embedding dimension are prepended. The model attends to these prompts when computing attention, but the prompts themselves do not attend to the following sentence to produce outputs; each layer gets its own prefix. So the total number of trainable parameters is number of layers × prompt length × embedding dimension. If I have misunderstood, please correct me.)
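A quick back-of-the-envelope check of the parameter ratios mentioned above, using BERT-large-like sizes (24 layers, hidden size 1024, roughly 335M parameters). The numbers are assumptions for illustration, not taken from the paper's code.

```python
# Trainable-parameter ratio: deep prompts (every layer) vs. input-layer prompts.
layers, hidden, prompt_len = 24, 1024, 100
backbone_params = 335_000_000

deep_prompt_params = layers * prompt_len * hidden   # P-tuning v2: one prefix per layer
input_prompt_params = prompt_len * hidden           # Lester et al.: input layer only

print(f"deep prompts : {deep_prompt_params:,} "
      f"({deep_prompt_params / backbone_params:.2%} of the backbone)")
print(f"input prompts: {input_prompt_params:,} "
      f"({input_prompt_params / backbone_params:.3%} of the backbone)")
# deep prompts : 2,457,600 (0.73% of the backbone)
# input prompts: 102,400 (0.031% of the backbone)
```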

4. Optimization and Implementation

4.1 Reparameterization

The structure above the right figure is a prompt encoder, which can be an MLP with trainable parameters. This structure is optional: for different tasks and datasets, the author found that it may improve results, but the improvement may also be small or even negative, so whether to use it should be decided case by case.

4.2 Prompt Length

Through experiments, the author also found that different tasks need different prompt lengths. Simple tasks tend to prefer shorter prompts (fewer than 20 tokens), while harder sequence labeling tasks need relatively long prompts (around 100 tokens).

4.3 Multi-task Learning

Before fine-tuning on individual tasks, shared prompts are pre-trained across multiple tasks, so that the prompts get a better initialization.

4.4 Classification Head

The verbalizer structure from prompt tuning is discussed here: a verbalizer maps each label to corresponding label words. The author argues that this structure is limiting on fully supervised datasets and on some sequence labeling tasks (perhaps this is why prompt tuning is ineffective on sequence labeling), so the author instead uses the [CLS] token with a linear head for classification. (The comparison is shown below.)

Comparison between [CLS] label with linear head and verbalizer with LM head on RoBERTa-large.

5. Experiments

In the paper, the model is evaluated on a variety of NLU tasks. The pre-trained language model is frozen and only the continuous prompts are fine-tuned; all experiments use full supervision rather than few-shot learning.

Results on SuperGLUE development set

Results on Named Entity Recognition (NER), Question Answering (Extractive QA), and Semantic Role Labeling (SRL)

It can be seen that on many NLU tasks the results are similar to fine-tuning, and some even exceed fine-tuning. On sequence labeling, the advantage of P-tuning v2 is less pronounced than on the NLU tasks, but after multi-task training the results improve substantially, with many exceeding fine-tuning.

6. Conclusion

Earlier prompt methods kept trying to close the gap: on fully supervised data they still fall short of fine-tuning unless the model is large enough (over 10B parameters). This paper makes the prompt method achieve results comparable to, and sometimes better than, fine-tuning even on smaller models, with far fewer trainable parameters than tuning the entire pre-trained model. (Personally I feel this line of thinking has drifted somewhat from the prompt learning originally proposed; it may already count as another field, and perhaps a new term will be coined for it later.)

7. References

  1. Alternative AI: Soul Torture: Has prompt tuning surpassed fine-tuning?
  2. Jinqin: One article to understand the basic knowledge and classic work of Prompt
  3. Liu Pengfei: The "Fourth Paradigm" of the Development of Modern Natural Language Processing Technology
  4. Eating Meat Does Not Gain Meat: Prompt Column No. 2 - Comparison with Fine-tuning

3. P-Tuning v2

3.1 Lack of universality

Prompt tuning and P-tuning have been shown to be quite effective in many NLP applications (see Section 5). However, given their lack of universality, they are not yet a comprehensive alternative to fine-tuning.

Lack of universality across scales. Lester et al. (2021) show that prompt tuning is comparable to fine-tuning when the model size exceeds 10 billion parameters. But for smaller models (from 100M to 1B), there is a large gap between prompt tuning and fine-tuning, which greatly limits the applicability of prompt tuning.

Lack of universality across tasks. Although Lester et al. (2021) and P-tuning have shown superiority on NLU benchmarks such as GLUE and SuperGLUE, their effectiveness on another large class of hard sequence NLU tasks, namely sequence labeling, has not been verified. First, sequence labeling requires predicting a sequence of labels rather than a single label. Second, sequence labeling often predicts meaningless labels, which can be challenging to translate into effective verbalizers (Schick and Schütze, 2020). In our experiments (see Section 4.3 and Table 3), we show that Lester et al. (2021) and P-tuning perform worse than fine-tuning on typical sequence labeling tasks.

Table 3: Results on question answering (extractive QA). Prompt tuning and P-tuning perform very poorly on question answering, while P-tuning v2 performs reasonably and can even outperform fine-tuned DeBERTa-xlarge. (FT: fine-tuning; PT: P-tuning & Lester et al. (2021); PT-2: P-tuning v2; MPT-2: multi-task P-tuning v2)

With these challenges in mind, we propose P-tuning v2, which adapts prefix tuning as a universal solution across scales and NLU tasks.

3.2 Deep Prompt Tuning

Prefix tuning (Li and Liang, 2021) was originally proposed for natural language generation (NLG) tasks, but we find it also very effective for NLU. We describe a version of prefix tuning adapted to NLU.

In Lester et al. (2021) and P-tuning, continuous prompts are only inserted into the input embedding sequence of the first transformer layer (cf. Fig. 2(a)). In subsequent transformer layers, the embeddings at the positions where the continuous prompts were inserted are computed by the previous transformer layers, which leads to two possible optimization challenges.

1. The number of tunable parameters is limited. Most language models currently support a maximum sequence length of 512 (due to the quadratic computational cost of attention). If we subtract the length of the context (e.g., the sentence to classify), only limited room is left to fill with continuous prompts.

2. Limited stability when tuning with a deep transformer. As the transformer gets deeper, the influence of prompts at the first layer may become unpredictable after passing through many intermediate layers (with non-linear activation functions), making optimization less smooth.

In view of these challenges, P-tuning v2 uses prompts at multiple layers (i.e. deep prompt tuning), like prefix tuning (Li and Liang, 2021) (see Fig. 2(b)), as a major improvement over P-tuning and Lester et al. (2021). Prompts in different layers are added to the input sequence as prefix tokens and are independent of the other layers (rather than being computed by previous transformer layers). On the one hand, P-tuning v2 thus has more tunable task-specific parameters (from 0.01% to 0.1%-3%) to allow more per-task capacity, while still being much smaller than the full pre-trained language model; on the other hand, prompts added to deeper layers (e.g. LayerN Prompts in Figure 2) have a more direct and significant impact on the output predictions, with fewer intermediate transformer layers in between (cf. Section 4.4).
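One common way to realize this is to give every transformer layer its own trainable prefix key/value vectors that the frozen attention can attend to. The sketch below is a simplified illustration under assumed sizes and a made-up layer interface, not the paper's or any specific library's API.

```python
# Deep prompt tuning sketch: per-layer trainable prefix key/value vectors.
import torch
import torch.nn as nn

class DeepPrompts(nn.Module):
    def __init__(self, n_layers=24, n_heads=16, head_dim=64, prompt_len=100):
        super().__init__()
        # One (key, value) prefix per layer, independent of the other layers.
        self.prefix_keys = nn.Parameter(
            torch.randn(n_layers, n_heads, prompt_len, head_dim) * 0.02)
        self.prefix_values = nn.Parameter(
            torch.randn(n_layers, n_heads, prompt_len, head_dim) * 0.02)

    def forward(self, layer_idx, batch_size):
        k = self.prefix_keys[layer_idx].unsqueeze(0).expand(batch_size, -1, -1, -1)
        v = self.prefix_values[layer_idx].unsqueeze(0).expand(batch_size, -1, -1, -1)
        return k, v  # prepended to that layer's own keys/values before attention

# Inside each frozen attention layer (pseudocode for the splice):
#   keys   = torch.cat([prefix_k, keys],   dim=2)  # (batch, heads, prompt+seq, head_dim)
#   values = torch.cat([prefix_v, values], dim=2)
#   queries stay unchanged, so the prompts influence the outputs but are never predicted.
```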

3.3 Optimization and implementation

There are also some useful optimizations and implementation details.

Optimization: reparameterization. Previous methods use reparameterization to improve training speed, robustness, and performance (e.g., an MLP for prefix tuning and an LSTM for P-tuning). However, for NLU tasks, we find that the benefit of this technique depends on the task and dataset. For some datasets (e.g. RTE and CoNLL04), MLP reparameterization brings more consistent improvements than embeddings; for other datasets, reparameterization may show no effect (e.g. BoolQ) or even hurt (e.g. CoNLL12). See our ablation studies in Section 4.4.
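A minimal sketch of the optional MLP reparameterization, under assumed dimensions: the actual prefix vectors are produced by passing a smaller trainable embedding through an MLP during training, instead of optimizing the prefix directly.

```python
# Optional reparameterization: an MLP maps a bottleneck embedding to the
# per-layer key/value prefixes; dimensions and names are illustrative.
import torch
import torch.nn as nn

class ReparamPrefix(nn.Module):
    def __init__(self, prompt_len=20, hidden=1024, bottleneck=512, n_layers=24):
        super().__init__()
        self.raw = nn.Parameter(torch.randn(prompt_len, bottleneck) * 0.02)
        self.mlp = nn.Sequential(
            nn.Linear(bottleneck, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, n_layers * 2 * hidden),
        )
        self.n_layers, self.hidden = n_layers, hidden

    def forward(self):
        out = self.mlp(self.raw)                      # (prompt_len, n_layers*2*hidden)
        return out.view(-1, self.n_layers, 2, self.hidden)

# Without reparameterization, `raw` would simply be a (prompt_len, n_layers*2*hidden)
# parameter used directly; which variant works better is dataset-dependent.
```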

Optimization: prompt length. Prompt length plays a central role in the hyperparameter search of prompt tuning methods. In our experiments, we find that different understanding tasks usually achieve their best performance with different prompt lengths, consistent with the findings of prefix tuning (Li and Liang, 2021), where different text generation tasks may have different optimal prompt lengths. See the discussion in Section 4.4.

Optimization: multi-task learning. Multi-task learning is optional for our method, but can be quite helpful. On the one hand, the random initialization of continuous prompts poses optimization difficulties, which can be alleviated by more training data or task-related unsupervised pre-training (Gu et al., 2021); on the other hand, continuous prompts are an ideal carrier of task-specific knowledge across tasks and datasets. Our experiments show that multi-task learning can serve as a beneficial complement to P-tuning v2 (denoted MPT-2) on some difficult sequence tasks (see Tables 2, 3, 4).

Table 2: Results on the named entity recognition (NER) test set (all metrics are micro-f1 scores). P-tuning v2 is generally comparable to fine-tuning, and multi-task P-tuning v2 can lead to further improvements. (FT: fine-tuning; PT: P-tuning & Lester et al. (2021); PT-2: P-tuning v2; MPT-2: multi-task P-tuning v2)

Implementation: [CLS] and token-level classification instead of a verbalizer. The verbalizer (Schick and Schütze, 2020) has been a core component of prompt tuning, mapping one-hot class labels to meaningful words so as to leverage the pre-trained language model head. Despite its potential necessity in few-shot settings, a verbalizer is not necessary in the full-data supervised setting, and it hinders applying prompt tuning to scenarios that need meaningless labels or sentence embeddings. Therefore, P-tuning v2 returns to the traditional paradigm of classifying the [CLS] label (see Fig. 2) with a randomly initialized linear head. See the comparison in Section 4.4.
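The sketch below illustrates this [CLS]-plus-linear-head setup; the linear classifier is the only new module besides the prompts, and all sizes here are assumptions for illustration.

```python
# [CLS]-based classification with a randomly initialized linear head.
import torch
import torch.nn as nn

hidden, n_classes, batch = 1024, 3, 8
classifier = nn.Linear(hidden, n_classes)      # randomly initialized, trainable

# `sequence_output` stands in for the frozen backbone's last hidden states.
sequence_output = torch.randn(batch, 128, hidden)
cls_vector = sequence_output[:, 0]             # representation of the [CLS] token
logits = classifier(cls_vector)                # (batch, n_classes)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, n_classes, (batch,)))
```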

4. Experiment

4.1 Settings

We conduct extensive experiments on different commonly used pre-trained models and NLU tasks to verify the effectiveness of P-tuning v2.

Evaluation settings. In this work, all results of "prompt tuning", "P-tuning", "P-tuning v2", and "multi-task P-tuning v2" are obtained by freezing the transformer parameters and optimizing only the continuous prompts. The ratio of task-specific parameters (e.g., 0.1%) is derived by comparing the number of continuous prompt parameters with the number of transformer parameters. Only the "fine-tuning" results are obtained by tuning the transformer parameters (without continuous prompts).

Another thing to note is that all our experiments are conducted in the full-data supervised setting rather than few-shot learning, which matters because some of the techniques we use (e.g., class labels with linear heads instead of verbalizers with LM heads) only make sense in a supervised setting.

NLU tasks. First, we include some datasets from the GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) benchmarks to test the general NLU ability of P-tuning v2, including SST-2, MNLI-m, RTE, BoolQ, and CB. More importantly, we introduce a set of sequence labeling tasks, which require the language model to predict a category for each token in the input sequence, including named entity recognition (CoNLL03 (Sang and De Meulder, 2003), OntoNotes 5.0 (Weischedel et al., 2013) and CoNLL04 (Carreras and Màrquez, 2004)), extractive question answering (SQuAD 1.1 and SQuAD 2.0 (Rajpurkar et al., 2016)), and semantic role labeling (CoNLL05 (Carreras and Màrquez, 2005) and CoNLL12 (Pradhan et al., 2012)).

Pre-trained models. We include BERT-large (Devlin et al., 2018), RoBERTa-large (Liu et al., 2019), DeBERTa-xlarge (He et al., 2020), and GLM-xlarge/xxlarge (Du et al., 2021) for evaluation. They are all bidirectional models designed for NLU, covering a wide range of scales from about 300M to 10B parameters.

Compared methods. We compare our P-tuning v2 (PT-2) with vanilla fine-tuning (FT) and with P-tuning & Lester et al. (2021) (PT). Furthermore, for the difficult sequence labeling tasks, we report results for multi-task P-tuning v2 (MPT-2); see Section 4.3 for details.

4.2 P-tuning v2: different scales

Table 1 shows the performance of P-tuning v2 across model scales. For simple NLU tasks such as SST-2 (single-sentence classification), Lester et al. (2021) and P-tuning show no significant disadvantage at smaller scales. But on harder challenges, such as natural language inference (RTE) and question answering (BoolQ), their performance can be very poor. In contrast, P-tuning v2 matches fine-tuning performance on all tasks at smaller scales. Surprisingly, P-tuning v2 even significantly outperforms fine-tuning on RTE, especially with BERT.

Table 1: Partial GLUE and SuperGLUE development set results (all metrics are accuracy). On models smaller than 10B, P-tuning v2 significantly outperforms P-tuning & Lester et al. (2021), and matches the performance of fine-tuning. (FT: fine-tuning; PT: P-tuning & Lester et al. (2021); PT-2: P-tuning v2)

For larger-scale (2B to 10B) GLM models (Du et al., 2021), the gap between P-tuning & Lester et al. (2021) and fine-tuning gradually narrows. At the 10B scale, we make an observation similar to that reported by Lester et al. (2021): prompt tuning becomes competitive with fine-tuning. P-tuning v2, however, is comparable to fine-tuning at all scales, while requiring only 0.1% of the task-specific parameters of fine-tuning.

Furthermore, we observe that RoBERTa-large performs worse than BERT-large on some datasets. This is partly because prompt tuning is empirically quite sensitive to hyperparameters, and optimization sometimes gets stuck; P-tuning v2 is more stable and robust to optimize. For more hyperparameter details, please refer to our code base.

4.3 P-tuning v2: across tasks

In Section 4.2, we discussed the consistency of P-tuning v2, which is comparable to fine-tuning at any scale. However, most tasks in GLUE and SuperGLUE are relatively simple NLU problems. Another important family of hard NLU challenges is sequence labeling, which underlies more advanced NLP applications, including open information extraction, reading comprehension, and so on.

To evaluate P-tuning v2 on these hard NLU challenges, we selected three typical sequence labeling tasks: named entity recognition, extractive question answering (QA), and semantic role labeling (SRL), eight datasets in total.

Table 4: Results on semantic role labeling (SRL). P-tuning v2 shows consistent improvements over Lester et al. (2021) and P-tuning on SRL. (FT: fine-tuning; PT: P-tuning & Lester et al. (2021); PT-2: P-tuning v2; MPT-2: multi-task P-tuning v2)

Named entity recognition (NER). NER aims to predict all spans of words in a sentence that denote entities of given categories. We use CoNLL03 (Sang and De Meulder, 2003), OntoNotes 5.0 (Weischedel et al., 2013) and CoNLL04 (Carreras and Màrquez, 2004). For CoNLL03 and CoNLL04, we train our models on the standard train-dev-test splits. For OntoNotes 5.0, we use the same train/dev/test split as (Xu et al., 2021b). All datasets are annotated in IOB2 format. We cast NER as sequence labeling, assigning labels that mark the beginning and interior of entities of each category. The language model produces a representation for each token, and a linear classifier predicts its label. We use the official scripts to evaluate the results. For the multi-task setting, we combine the training sets of the three datasets for pre-training, using a separate linear classifier for each dataset while sharing the continuous prompts.
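As a concrete picture of this setup, here is a minimal sketch of per-token classification over IOB2 tags; the tag set and sizes are illustrative assumptions.

```python
# NER as sequence labeling: one linear classifier over the frozen backbone's
# per-token representations, trained with token-level cross-entropy.
import torch
import torch.nn as nn

tags = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]  # IOB2 scheme
hidden, batch, seq_len = 1024, 4, 32

token_classifier = nn.Linear(hidden, len(tags))        # per-dataset head
hidden_states = torch.randn(batch, seq_len, hidden)    # frozen backbone output
logits = token_classifier(hidden_states)               # (batch, seq_len, n_tags)

gold = torch.randint(0, len(tags), (batch, seq_len))
loss = nn.CrossEntropyLoss()(logits.view(-1, len(tags)), gold.view(-1))
# In the multi-task variant, the continuous prompts are shared across the three
# NER datasets while each dataset keeps its own `token_classifier`.
```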

(Extractive) question answering (QA). Extractive QA extracts the answer to a question from a given context. We use SQuAD (Rajpurkar et al., 2016) 1.1 and 2.0, where each answer lies within a contiguous span of the context. Following tradition, we cast the problem as sequence labeling: each token is assigned one of the two labels "start" or "end", and the most confident start-end pair is selected as the extracted answer. If the probability of the most confident pair is below a threshold, the model assumes the question is unanswerable. For the multi-task setting, the pre-training set combines the training sets of SQuAD 1.1 and 2.0; during pre-training we assume that all questions, regardless of origin, may be unanswerable.
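The following is a minimal sketch of span selection under this start/end formulation, including the unanswerable threshold described above; the exact scoring rule and threshold value are assumptions for illustration.

```python
# Span selection for extractive QA from start/end token scores.
import torch

seq_len = 48
start_logits = torch.randn(seq_len)   # from a linear head over token states
end_logits = torch.randn(seq_len)

start_p = start_logits.softmax(-1)
end_p = end_logits.softmax(-1)

# Score every valid (start <= end) pair and keep the most confident one.
pair_scores = start_p.unsqueeze(1) * end_p.unsqueeze(0)   # (seq_len, seq_len)
pair_scores = torch.triu(pair_scores)                      # forbid end < start
best = pair_scores.argmax()
best_start, best_end = divmod(best.item(), seq_len)
best_score = pair_scores.max().item()

threshold = 0.1                                            # assumed value
answer = None if best_score < threshold else (best_start, best_end)
```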

Semantic role labeling (SRL). SRL assigns labels to words or phrases in a sentence, indicating their semantic role. We evaluate P-tuning v2 on CoNLL05 (Carreras and Màrquez, 2005) and CoNLL12 (Pradhan et al., 2012). Since a sentence can contain multiple verbs, a target-verb token is appended to the end of each sentence to indicate which verb the prediction is for. Each word is classified with a linear classifier based on its corresponding semantic-role representation. For the multi-task setting, the pre-training set combines the training sets of CoNLL05, CoNLL12, and propbank-release (a common extension dataset for SRL training). The multi-task training strategy is the same as for NER.

Results. From Tables 2, 3, and 4, we observe that P-tuning v2 is comparable to fine-tuning on all tasks. P-tuning and Lester et al. (2021) perform much worse, especially on QA, which is probably the hardest of the three tasks. We also note some abnormal results on SQuAD 2.0 (BERT/RoBERTa/DeBERTa all show the same performance with Lester et al. (2021) and P-tuning). This is probably because SQuAD 2.0, unlike SQuAD 1.1, contains unanswerable questions, for which Lester et al. (2021) and P-tuning may collapse to a trivial solution.

Multi-task P-tuning v2 generally brings clear improvements across tasks, with the exception of QA (which may again result from mixing answerable SQuAD 1.1 with unanswerable SQuAD 2.0), which means the potential of the randomly initialized prompts has not been fully exploited.

4.4 Ablation studies

We investigate some important hyperparameters and architecture designs that may play a central role in P-tuning v2.

Prompt depth. The main difference between Lester et al. (2021) & P-tuning and P-tuning v2 is the multi-layer continuous prompts we introduce. Intuitively, due to the many non-linear activation functions in the intermediate transformer layers, the deeper the layer a prompt is placed in, the more direct its influence on the output prediction. To verify the exact impact, given a number k of layers to which prompts are added, we select k layers in ascending or descending order and add prompts to them as prefix tokens; for the remaining layers, we change the attention masks so that their prefix prompts do not participate in the computation.
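A minimal sketch of this ablation setup: choose the k layers that keep their prefix prompts (counting either from the input side or from the output side), and at every other layer zero out the attention mask over the prefix positions so the prefix is ignored. This is a simplified illustration, not the paper's exact implementation.

```python
# Prompt-depth ablation: per-layer attention masks over [prefix | sentence].
import torch

n_layers, prompt_len, seq_len, k = 24, 20, 32, 8

def layers_with_prompts(k, descending=True):
    order = range(n_layers - 1, -1, -1) if descending else range(n_layers)
    return set(list(order)[:k])

# Descending with k=8 keeps 0-indexed layers 16-23, i.e. "layers 17-24" in the figure.
active = layers_with_prompts(k, descending=True)

masks = []
for layer in range(n_layers):
    prefix_mask = torch.ones(prompt_len) if layer in active else torch.zeros(prompt_len)
    masks.append(torch.cat([prefix_mask, torch.ones(seq_len)]))  # 1 = attend, 0 = ignore
masks = torch.stack(masks)               # (n_layers, prompt_len + seq_len)
```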

As shown in Figure 4, with the same number of parameters (i.e., the same number of transformer layers with prompts), adding prompts in descending order is always better than adding them in ascending order. In the case of RTE, adding prompts only at layers 17-24 already yields performance very close to adding them at all layers, suggesting the parameters needed to match fine-tuning could be cut further.

Figure 4: Ablation study of prompt depth with BERT-large. "[x-y]" refers to the layer interval to which we add continuous prompts (for example, "21-24" means we add prompts to transformer layers 21 to 24). The same number of continuous prompts added to deeper transformer layers (i.e. closer to the output layer) yields better performance than adding them to the starting layers.

Embedding vs. MLP reparameterization. In prefix tuning (Li and Liang, 2021) and P-tuning (Liu et al., 2021b), the authors found reparameterization useful for improving training speed, robustness, and performance. However, our experiments show that the effect of reparameterization is inconsistent across NLU tasks and datasets.

As shown in Figure 3, on RTE and CoNLL04, MLP reparameterization generally outperforms the plain embedding at almost all prompt lengths. However, on BoolQ, MLP and embedding give competitive results; on CoNLL12, the embedding is consistently better than the MLP.

Figure 3: Ablation study of prompt length and reparameterization with RoBERTa-large. For a given NLU task and dataset, the conclusions can be very different. (MQA: multiple-choice QA)

Prompt length. Prompt length is another influential hyperparameter of P-tuning v2, and its optimal value varies from task to task. From Figure 3 we observe that for simple NLU tasks, shorter prompts usually achieve the best performance; for hard sequence labeling tasks, prompts longer than 100 usually help.

We also find that reparameterization is strongly correlated with the optimal prompt length. For example, on RTE, CoNLL04, and BoolQ, MLP reparameterization reaches its best result at a shorter prompt length than the embedding. This observation may shed light on the optimization properties of P-tuning.

Verbalizer with LM head vs. [CLS] label with linear head. The verbalizer with an LM head has been a core component of previous prompt tuning methods. However, in the supervised setting, tuning a linear head with only a few thousand parameters is affordable for P-tuning v2. We present the comparison in Table 5, where we keep the other hyperparameters fixed and only replace the linear head on the [CLS] label with the verbalizer's LM head. Here, for simplicity, we use "true" and "false" for SST-2, RTE, and BoolQ, and "true", "false", and "neutral" for CB. The results show no significant performance difference between the verbalizer and [CLS].

Table 5: Comparison between [CLS] labels with linear head and verbalizers with LM head on RoBERTa-large.
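To make the verbalizer-with-LM-head alternative concrete, here is a minimal sketch: class labels are mapped to label words, and classification reads the LM-head logits of those words at the [MASK] position. The token ids below are placeholders, not real vocabulary ids.

```python
# Verbalizer with LM head: classify by comparing label-word logits at [MASK].
import torch

label_words = {"entailment": "true", "not_entailment": "false"}   # e.g. for RTE
# Assume a tokenizer has already mapped each label word to a single vocab id.
label_token_ids = {"entailment": 1001, "not_entailment": 1002}    # hypothetical ids

vocab_size, batch = 30522, 4
mask_logits = torch.randn(batch, vocab_size)   # LM-head logits at the [MASK] position

ids = torch.tensor(list(label_token_ids.values()))
class_logits = mask_logits[:, ids]             # (batch, n_classes)
pred = class_logits.argmax(-1)                 # index into the label-word list
```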

5. Related work

Pre-trained language models. Self-supervised (Liu et al., 2020) pre-trained language models (Han et al., 2021a) have become the backbone of natural language processing. From the early GPT (Radford et al., 2019), BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019) with a limited number of parameters (less than 350M), to T5 (Raffel et al., 2019) and GPT-3 (Brown et al., 2020), the field has moved toward giant language models with billions or even trillions of parameters.

Prompting. Prompting (Liu et al., 2021a) refers to the use of special templates in the input context to aid language model prediction for both understanding and generation. Recently, following the success of GPT-3 (Brown et al., 2020), various prompting strategies have emerged, including discrete natural language prompts (Shin et al., 2020; Gao et al., 2020), continuous prompts (Liu et al., 2021b; Li and Liang, 2021; Lester et al., 2021; Qin and Eisner, 2021; Zhong et al., 2021), bias tuning (Logan IV et al., 2021), and many others.

The strength and effectiveness of prompting in a wide range of NLP applications has been validated in the recent literature, including text classification (Hu et al., 2021; Min et al., 2021; Sun et al., 2021; Li et al., 2021; Zhang et al., 2021b), entity typing (Ding et al., 2021), few-shot learning (Zheng et al., 2021; Xu et al., 2021a; Zhao et al., 2021; Gu et al., 2021; Zhang et al., 2021a), relation extraction (Chen et al., 2021a; Han et al., 2021b; Sainz et al., 2021), knowledge probing (Zhong et al., 2021), named entity recognition (Chen et al., 2021b), machine translation (Tan et al., 2021; Wang et al., 2021b) and dialogue systems (Wang et al., 2021a).

In this work, we focus on extending prompting methods to smaller models and hard sequence NLU tasks.

6. Summary

We propose P-tuning v2, a prompt tuning method that is comparable to fine-tuning across scales and tasks. P-tuning v2 is not a conceptually new approach; rather, it optimizes and adapts prefix tuning and deep prompt tuning to the NLU setting. P-tuning v2 shows consistent improvements for models from 330M to 10B parameters, and outperforms Lester et al. (2021) and P-tuning by a large margin on hard sequence tasks such as sequence labeling. P-tuning v2 can be a comprehensive alternative to fine-tuning and a strong baseline for future work.
