P-Tuning v2: Prompt Tuning Comparable to Fine-tuning Performance

Original paper: P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks

Authors: Xiao Liu, Kaixuan Ji, et al.

Code:  https://github.com/THUDM/P-tuning-v2

1. Introduction
2. Preliminaries
---- 2.1 NLU tasks
---- 2.2 Prompt tuning
3. P-Tuning v2
---- 3.1 Lack of universality
---- 3.2 Deep prompt tuning
---- 3.3 Optimization and implementation
4. Experiments
---- 4.1 Setup
---- 4.2 P-tuning v2: Across scales
---- 4.3 P-tuning v2: Across tasks
---- 4.4 Ablation studies
5. Related work
6. Summary

1. Introduction

Prompt tuning, which tunes only continuous prompts with a frozen language model, substantially reduces per-task storage and memory usage during training. However, in the context of NLU, prior work has shown that prompt tuning does not perform well for normal-sized pre-trained models. We also find that existing prompt tuning methods cannot handle hard sequence labeling tasks, indicating a lack of universality. We present a new empirical finding that properly optimized prompt tuning can be universally effective across a wide range of model scales and NLU tasks, matching the performance of fine-tuning while tuning only 0.1%-3% of the parameters. Our method, P-Tuning v2, is not a new method, but a version of prefix-tuning (Li and Liang, 2021) optimized and adapted for NLU. Given the universality and simplicity of P-Tuning v2, we believe it can serve as an alternative to fine-tuning and a strong baseline for future research.

Pretrained language models (Han et al., 2021a) improve performance on a range of natural language understanding (NLU) tasks, such as question answering (Rajpurkar et al., 2016) and textual entailment (Dagan et al., 2005). A widely used method, fine-tuning, updates the entire set of model parameters for the target task. Although fine-tuning achieves good performance, it is memory intensive during training because the gradients and optimizer state of all parameters must be stored. Furthermore, fine-tuning requires keeping a copy of the model parameters for each task during inference, which is inconvenient because pre-trained models are usually large.

Figure 1: Average scores on RTE, BoolQ, and CB from the SuperGLUE dev set. With 0.1% task-specific parameters, P-tuning v2 is comparable to fine-tuning across pre-trained models of different scales, while Lester et al. (2021) & P-tuning only match fine-tuning at the 10B scale.

Prompting, on the other hand, freezes all parameters of the pre-trained model and queries the language model with a natural language prompt (Brown et al., 2020). For example, for sentiment analysis, we can concatenate a sample with the prompt "This movie is [MASK]" and ask the pre-trained language model to predict the masked token. We can then use the predicted probabilities of "good" and "bad" at the masked position to decide the sample's label. Prompting requires no training at all and only needs to store a single copy of the model parameters. However, compared with fine-tuning, prompting leads to suboptimal performance in many cases (Liu et al., 2021b; Lester et al., 2021).
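A minimal sketch of this manual-prompt classification is shown below, using a masked language model to compare "good" against "bad" at the [MASK] position; the checkpoint name ("bert-base-uncased") and the exact template are illustrative choices, not prescribed by the paper.

```python
# Score "good" vs. "bad" at the [MASK] position with a frozen masked LM.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "Amazing movie! This movie is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]   # vocabulary logits at [MASK]

good_id = tokenizer.convert_tokens_to_ids("good")
bad_id = tokenizer.convert_tokens_to_ids("bad")
print("positive" if logits[good_id] > logits[bad_id] else "negative")
```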

Prompt tuning is the idea of tuning only the continuous prompts. Specifically, Liu et al. (2021b) and Lester et al. (2021) propose to add trainable continuous embeddings to the original sequence of input word embeddings. These continuous embeddings (also called continuous prompts) play a role similar to manually designed discrete prompts. During training, only the continuous prompts are updated. Although prompt tuning improves over prompting on many tasks (Liu et al., 2021b; Lester et al., 2021), it still underperforms fine-tuning when the model is not large, especially below 10 billion parameters (Lester et al., 2021). Moreover, as our experiments show, prompt tuning performs worse than fine-tuning on several hard sequence tasks, such as extractive question answering and sequence labeling (see Section 4.3).

Our main contribution in this paper is a new empirical finding that properly optimized prompt tuning can be universally comparable to fine-tuning across different model scales and NLU tasks. In contrast to observations in previous work, our finding reveals the universality and great potential of prompt tuning for NLU.

Technically, our method P-tuning v2 can be seen as an optimized version of prefix-tuning (Li and Liang, 2021), a method designed for generation, adapted to NLU. The most notable improvement stems from deep prompt tuning, which applies continuous prompts to every layer of the pre-trained model (Li and Liang, 2021; Qin and Eisner, 2021). Deep prompt tuning increases the capacity of continuous prompts and closes the gap to fine-tuning in various settings, especially for small models and hard tasks. Moreover, we present a series of optimization and implementation details to further improve the results.

Experimental results show that P-tuning v2 matches fine-tuning performance at different model scales (from about 300M to 10B parameters) and on various hard NLU tasks (such as question answering and sequence labeling). Compared with fine-tuning, P-tuning v2 tunes only 0.1% to 3% of the parameters per task, which greatly reduces training-time memory consumption and per-task storage cost.

2. Preliminaries

2.1 NLU tasks

In this work, we divide the challenges of NLU into two families: simple tasks and difficult sequence tasks.

- Simple NLU tasks involve classification over a single label. Most datasets of GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019), including text classification (e.g., SST-2), natural language inference (NLI, e.g., MNLI-m, RTE), and multiple-choice question answering (e.g., BoolQ), fall into this category.

- Difficult sequence NLU tasks involve classifying a sequence of labels. Most of them are problems related to information extraction, such as open information extraction, named entity recognition, extractive question answering and semantic role labeling.

2.2 Prompt tuning

Prompt tuning (Lester et al., 2021), or P-tuning (Liu et al., 2021b), introduces trainable continuous prompts as a substitute for natural language prompts in NLU, while the parameters of the backbone model are frozen. For example, let V be the vocabulary of a language model M and e be the embedding layer of M.

To classify a movie review x, it is natural to append the prompt "It is [MASK]" to the input and use the conditional probability of predicting "good" or "bad" at the masked position as the classification. In this case, the prompt tokens {"it", "is", "[MASK]"} all belong to the model's vocabulary V, and the input embedding sequence will be [e(x), e("it"), e("is"), e("[MASK]")].

However, since the model M is continuous in nature, from an optimization perspective it can never reach the optimum with discrete natural prompts. In contrast, P-tuning proposes to replace the prompt tokens with trainable continuous embeddings [h0, ..., hi] and turns the input sequence into [e(x), h0, ..., hi, e("[MASK]")], which can therefore be differentially optimized (refer to Figure 2(a)). Under the strict constraint that the parameters of the backbone pre-trained model are frozen, prompt tuning has been shown to perform comparably to fine-tuning on 10-billion-parameter models.
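The following is a minimal sketch of this shallow prompt tuning setup in PyTorch: a frozen embedding layer e, trainable continuous prompts h, and the concatenated input embedding sequence. All dimensions, token ids, and variable names are illustrative and not taken from any particular implementation.

```python
import torch
import torch.nn as nn

vocab_size, d_model, prompt_len = 1000, 64, 4

# Frozen embedding layer e(.) of the pre-trained model M.
e = nn.Embedding(vocab_size, d_model)
e.requires_grad_(False)

# Trainable continuous prompts [h_0, ..., h_i]; the only parameters updated.
h = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

x_ids = torch.tensor([[5, 42, 7]])          # token ids of the input x (illustrative)
mask_id = torch.tensor([[0]])               # id of the [MASK] token (illustrative)

# Input sequence [e(x), h_0, ..., h_i, e("[MASK]")], as in the equation above.
inputs_embeds = torch.cat([e(x_ids), h.unsqueeze(0), e(mask_id)], dim=1)
print(inputs_embeds.shape)                  # torch.Size([1, 8, 64])
# `inputs_embeds` would then be fed to the frozen transformer, and the label is
# read off the prediction at the [MASK] position.
```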


Figure 2: From Lester et al. (2021) & P-tuning to P-tuning v2. The orange tokens (including h0, ..., hi) refer to the prompt embeddings we add; the blue tokens are the embeddings stored or computed by the frozen pre-trained language model. In contrast to Lester et al. (2021), P-tuning v2 adds trainable continuous prompts independently to the input of every transformer layer (as prefix-tuning (Li and Liang, 2021) does). Furthermore, P-tuning v2 removes the verbalizer with an LM head and returns to traditional class labels with an ordinary linear head, to allow task universality.

3. P-Tuning v2

3.1 Lack of universality

Prompt tuning (Lester et al., 2021) and P-tuning have proven quite effective in many NLP applications (see Section 5). However, given their lack of universality, they are not yet a comprehensive alternative to fine-tuning.

Lack of universality across scales. Lester et al. (2021) show that prompt tuning can be comparable to fine-tuning when the model size exceeds 10 billion parameters. But for the smaller models (from 100M to 1B) that are widely used, prompt tuning performs much worse than fine-tuning, which greatly limits its applicability.

Lack of universality across tasks. Although Lester et al. (2021) and P-tuning show superiority on NLU benchmarks such as GLUE and SuperGLUE, their effectiveness on another large class of hard sequence NLU tasks, i.e., sequence labeling, has not been verified. First, sequence labeling requires predicting a sequence of labels rather than a single label. Second, sequence labeling often predicts labels without semantic meaning, which can be hard to convert into effective verbalizers (Schick and Schütze, 2020). In our experiments (see Section 4.3 and Table 3), we show that Lester et al. (2021) & P-tuning perform worse than fine-tuning on typical sequence labeling tasks.

Table 3: Results on extractive question answering (QA). Prompt tuning & P-tuning perform extremely poorly on QA, while P-tuning v2 is generally reasonable and can even be better than fine-tuning on DeBERTa-xlarge. (FT: fine-tuning; PT: P-tuning & Lester et al. (2021); PT-2: P-tuning v2; MPT-2: multi-task P-tuning v2)

Considering these challenges, we propose P-tuning v2, which adapts prefix-tuning as a universal solution across scales and NLU tasks.

3.2 Deep prompt tuning

Prefix-tuning (Li and Liang, 2021) was originally proposed for natural language generation (NLG) tasks, but we find that it is also very effective for NLU. Below we describe a version of prefix-tuning adapted for NLU.

In Lester et al. (2021) and P-tuning, continuous prompts are only inserted into the input embedding sequence of the first transformer layer (see Figure 2(a)). In the following transformer layers, the embeddings at the positions where the continuous prompts were inserted are computed by the previous transformer layer, which leads to two possible optimization challenges.

1. The number of tunable parameters is limited. Most language models currently support a maximum sequence length of 512 (due to the quadratic computational cost of attention). If we further subtract the length of the context (e.g., the sentence to classify), only a limited length remains to be filled with continuous prompts.

2. Limited stability when tuning with a deep transformer. As the transformer gets deeper, the influence of the prompts from the first transformer layer becomes indirect after passing through many intermediate layers (with non-linear activation functions), which makes the optimization less smooth.

In view of these challenges, P-tuning v2 employs multi-layer prompts (i.e., deep prompt tuning), as in prefix-tuning (Li and Liang, 2021) (see Figure 2(b)), as the major improvement over P-tuning and Lester et al. (2021). Prompts in different layers are added to the input sequence as prefix tokens and are independent of the other layers (rather than being computed by the previous transformer layer). On the one hand, P-tuning v2 thereby has more tunable task-specific parameters (from 0.01% to 0.1%-3%) to allow more per-task capacity, while still being much smaller than a full pre-trained language model; on the other hand, prompts added to deeper layers (e.g., the Layer N prompts in Figure 2) have a more direct and significant impact on the output predictions, with fewer intermediate transformer layers in between (cf. Section 4.4).
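Below is a minimal, self-contained sketch of per-layer (deep) prompts in plain PyTorch: each layer receives its own trainable prefix key/value pairs, which are concatenated to the frozen layer's keys and values. The module and variable names (PrefixEncoder, prefix_len, etc.) are illustrative and do not reproduce the official P-tuning-v2 code base.

```python
import torch
import torch.nn as nn

n_layers, n_heads, d_model, prefix_len = 4, 4, 64, 8
head_dim = d_model // n_heads

class PrefixEncoder(nn.Module):
    """Trainable prefix key/value pairs for every transformer layer."""
    def __init__(self):
        super().__init__()
        # One (key, value) prefix of length `prefix_len` per layer and head.
        self.prefix = nn.Parameter(
            torch.randn(n_layers, 2, n_heads, prefix_len, head_dim) * 0.02)

    def forward(self, batch_size):
        # -> list of (key, value), each (batch, heads, prefix_len, head_dim)
        p = self.prefix.unsqueeze(1).expand(-1, batch_size, -1, -1, -1, -1)
        return [(p[i, :, 0], p[i, :, 1]) for i in range(n_layers)]

def attention_with_prefix(x, qkv, prefix_kv):
    """Self-attention where the frozen layer's keys/values are prepended
    with this layer's trainable prefix."""
    b, t, _ = x.shape
    q, k, v = qkv(x).chunk(3, dim=-1)
    reshape = lambda z: z.view(b, t, n_heads, head_dim).transpose(1, 2)
    q, k, v = map(reshape, (q, k, v))
    pk, pv = prefix_kv                      # prefix keys/values for this layer
    k = torch.cat([pk, k], dim=2)           # (b, heads, prefix_len + t, head_dim)
    v = torch.cat([pv, v], dim=2)
    scores = (q @ k.transpose(-1, -2)) / head_dim ** 0.5
    out = scores.softmax(-1) @ v
    return out.transpose(1, 2).reshape(b, t, d_model)

# Frozen "backbone": one QKV projection per layer, never updated during training.
qkv_layers = [nn.Linear(d_model, 3 * d_model) for _ in range(n_layers)]
for layer in qkv_layers:
    layer.requires_grad_(False)

prefix_encoder = PrefixEncoder()            # the only trainable module
x = torch.randn(2, 10, d_model)             # dummy input embeddings
for layer, prefix_kv in zip(qkv_layers, prefix_encoder(batch_size=2)):
    x = attention_with_prefix(x, layer, prefix_kv)
print(x.shape)                              # torch.Size([2, 10, 64])
```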

3.3 Optimization and implementation

There are also some useful optimization and implementation details.

Optimization: Reparameterization. Previous methods use a reparameterization encoder to improve training speed, robustness, and performance (e.g., an MLP in prefix-tuning and an LSTM in P-tuning). However, for NLU tasks, we find that the benefit of this technique depends on the task and dataset. For some datasets (e.g., RTE and CoNLL04), MLP reparameterization brings more consistent improvements than embeddings; for other datasets, reparameterization may show no effect (e.g., BoolQ), or even hurt (e.g., CoNLL12). See our ablation study in Section 4.4.
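A minimal sketch of the two options compared here (direct prompt embeddings vs. an MLP reparameterization) might look as follows; the dimensions and the ReparamPrompt name are illustrative assumptions, not the official code.

```python
import torch
import torch.nn as nn

prefix_len, d_model, hidden = 8, 64, 128

class ReparamPrompt(nn.Module):
    def __init__(self, use_mlp: bool = True):
        super().__init__()
        self.embedding = nn.Embedding(prefix_len, d_model)
        self.use_mlp = use_mlp
        if use_mlp:
            # Two-layer MLP that maps the raw prompt embedding to the prompt
            # actually fed to the model (similar in spirit to prefix-tuning).
            self.mlp = nn.Sequential(
                nn.Linear(d_model, hidden), nn.Tanh(), nn.Linear(hidden, d_model))

    def forward(self, batch_size):
        prompt = self.embedding(torch.arange(prefix_len))   # (prefix_len, d_model)
        if self.use_mlp:
            prompt = self.mlp(prompt)
        return prompt.unsqueeze(0).expand(batch_size, -1, -1)

print(ReparamPrompt(use_mlp=True)(batch_size=2).shape)      # torch.Size([2, 8, 64])
```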

Optimization: Prompt length. The prompt length plays a central role in the hyperparameter search of prompt tuning methods. In our experiments, we find that different NLU tasks usually achieve their best performance with different prompt lengths, which is consistent with the findings of prefix-tuning (Li and Liang, 2021), where different text generation tasks may have different optimal prompt lengths. See the discussion in Section 4.4.

Optimization: Multi-task learning. Multi-task learning is optional for our approach but can be quite helpful. On the one hand, continuous prompts start from a random initialization that makes optimization difficult, which can be alleviated by more training data or task-related unsupervised pre-training (Gu et al., 2021); on the other hand, continuous prompts are an ideal carrier of task-specific knowledge across tasks and datasets. Our experiments show that multi-task learning can serve as a beneficial complement to P-tuning v2, denoted MPT-2, on some hard sequence tasks (see Tables 2, 3, 4).

Table 2: Results on the named entity recognition (NER) test set (all metrics are micro-f1 scores). P-tuning v2 is generally comparable to fine-tuning, and multi-task P-tuning v2 can lead to further improvements. (FT: fine-tuning; PT: P-tuning & Lester et al. (2021); PT-2: P-tuning v2; MPT-2: multi-task P-tuning v2)

Implementation: Classification with [CLS] and token labels instead of a verbalizer. The verbalizer (Schick and Schütze, 2020) has been a core component of prompt tuning, turning one-hot class labels into meaningful words to leverage the pre-trained language model head. Despite being potentially necessary in few-shot settings, a verbalizer is not needed in a fully supervised setting, and it hinders the application of prompt tuning to scenarios that require labels without semantic meaning or sentence embeddings. Therefore, P-tuning v2 returns to the traditional [CLS]-label classification paradigm (see Figure 2) with a randomly initialized linear head. See the comparison in Section 4.4.
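For concreteness, a minimal sketch of this [CLS]-plus-linear-head classification could look like the following, assuming a frozen encoder that returns per-token hidden states; shapes and names are illustrative.

```python
import torch
import torch.nn as nn

d_model, num_classes = 64, 3

# Randomly initialized linear head on top of the [CLS] representation.
classifier = nn.Linear(d_model, num_classes)

hidden_states = torch.randn(2, 10, d_model)   # frozen encoder output (batch, seq, dim)
cls_repr = hidden_states[:, 0]                # position 0 holds the [CLS] token
logits = classifier(cls_repr)                 # (batch, num_classes)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1, 2]))
print(logits.shape, float(loss))
```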

4. Experiments

4.1 Setup

We conduct extensive experiments on different commonly used pre-trained models and NLU tasks to verify the effectiveness of P-tuning v2.

Evaluation settings. In this work, all results for "prompt tuning", "P-tuning", "P-tuning v2", and "multi-task P-tuning v2" are obtained by freezing the parameters of the transformer and tuning only the continuous prompts. The ratio of task-specific parameters (e.g., 0.1%) is derived by comparing the number of continuous prompt parameters with the number of transformer parameters. Only the "fine-tuning" results are obtained by tuning the parameters of the transformer (without using continuous prompts).
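As a rough, back-of-the-envelope illustration of how such a ratio arises (not the paper's exact accounting), consider BERT-large-like dimensions; note that storing separate key and value prefixes per layer would roughly double the prompt parameter count.

```python
num_layers, hidden_size, prefix_len = 24, 1024, 20        # illustrative values

backbone_params = 335_000_000                             # approx. BERT-large size
# Deep prompts: one prefix of `prefix_len` vectors per layer.
prompt_params = num_layers * prefix_len * hidden_size

print(f"prompt params: {prompt_params:,}")                # 491,520
print(f"ratio: {prompt_params / backbone_params:.2%}")    # roughly 0.15%
```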

Another thing to note is that our experiments are all conducted in the fully supervised setting with full training data, rather than in the few-shot setting. This matters because some of the techniques we exploit (e.g., using class labels with linear heads instead of verbalizers with LM heads) are only possible in a supervised setting.

NLU tasks. First, we include part of the GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) benchmarks to test the general NLU capability of P-tuning v2, including SST-2, MNLI-m, RTE, BoolQ, and CB. More importantly, we introduce a suite of sequence labeling tasks, which require the language model to predict a category for each token in the input sequence, including named entity recognition (CoNLL03 (Sang and De Meulder, 2003), OntoNotes 5.0 (Weischedel et al., 2013), and CoNLL04 (Carreras and Màrquez, 2004)), extractive question answering (SQuAD 1.1 and SQuAD 2.0 (Rajpurkar et al., 2016)), and semantic role labeling (CoNLL05 (Carreras and Màrquez, 2005) and CoNLL12 (Pradhan et al., 2012)).

Pre-trained models. We include BERT-large (Devlin et al., 2018), RoBERTa-large (Liu et al., 2019), DeBERTa-xlarge (He et al., 2020), and GLM-xlarge/xxlarge (Du et al., 2021) for evaluation. They are all bidirectional models designed for NLU and cover a wide range of sizes, from about 300M to 10B parameters.

Compared methods. We compare our P-tuning v2 (PT-2) with vanilla fine-tuning (FT) and with P-tuning & Lester et al. (2021) (PT). Furthermore, for the hard sequence labeling tasks, we report results of multi-task P-tuning v2 (MPT-2); see Section 4.3 for details.

4.2 P-tuning v2: across scales

Table 1 shows the performance of P-tuning v2 at different model scales. For simple NLU tasks such as SST-2 (single-sentence classification), Lester et al. (2021) and P-tuning show no significant disadvantage at smaller scales. But on harder challenges, such as natural language inference (RTE) and multiple-choice question answering (BoolQ), their performance can be very poor. In contrast, P-tuning v2 matches fine-tuning performance on all tasks at smaller scales. To our surprise, P-tuning v2 even performs significantly better than fine-tuning on RTE, especially with BERT.

Table 1: Results on some datasets of the GLUE and SuperGLUE development sets (all metrics are accuracy). On models smaller than 10B, P-tuning v2 significantly surpasses P-tuning & Lester et al. (2021) and matches the performance of fine-tuning. (FT: fine-tuning; PT: P-tuning & Lester et al. (2021); PT-2: P-tuning v2)

For the larger-scale (2B to 10B) GLMs (Du et al., 2021), the gap between P-tuning & Lester et al. (2021) and fine-tuning gradually narrows. At the 10B scale, we make an observation similar to that reported by Lester et al. (2021), namely that prompt tuning becomes competitive with fine-tuning. P-tuning v2, however, is comparable to fine-tuning at all scales, while requiring only 0.1% of the task-specific parameters of fine-tuning.

Furthermore, we observe that RoBERTa-large performs worse than BERT-large on some datasets. This is partly because we find empirically that prompt tuning is quite sensitive to hyperparameters, and the optimization sometimes gets stuck. P-tuning v2 is more stable and robust during optimization. For more details on hyperparameters, please refer to our code base.

4.3 P-tuning v2: across tasks

In Section 4.2, we show that P-tuning v2 is consistently comparable to fine-tuning at any scale. However, most tasks of GLUE and SuperGLUE are relatively simple NLU problems. Another important family of hard NLU challenges lies in sequence labeling, which underlies more advanced NLP applications such as open information extraction and reading comprehension.

To evaluate the capability of P-tuning v2 on these hard NLU challenges, we select three typical sequence labeling tasks: named entity recognition, extractive question answering (QA), and semantic role labeling (SRL), with eight datasets in total.

Table 4: Results on semantic role labeling (SRL). P-tuning v2 shows consistent improvements over Lester et al. (2021) and P-tuning on SRL. (FT: fine-tuning; PT: P-tuning & Lester et al. (2021); PT-2: P-tuning v2; MPT-2: multi-task P-tuning v2)

Named entity recognition (NER). The aim of NER is to predict all spans of words in a sentence that represent some given entity categories. We use CoNLL03 (Sang and De Meulder, 2003), OntoNotes 5.0 (Weischedel et al., 2013), and CoNLL04 (Carreras and Màrquez, 2004). For CoNLL03 and CoNLL04, we train our models on the standard train-dev-test splits. For OntoNotes 5.0, we use the same train, dev, and test split as (Xu et al., 2021b). All datasets are annotated in IOB2 format. We cast NER as sequence labeling by assigning labels that mark the beginning and the inside of entities of each category. The language model produces a representation for each token, and a linear classifier predicts its label. We use the official scripts to evaluate the results. For the multi-task setting, we combine the training sets of the three datasets for pre-training. We use a separate linear classifier for each dataset while sharing the continuous prompts.
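A minimal sketch of this per-token classification with shared prompts and per-dataset linear heads might look as follows; the label counts and names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

d_model = 64
num_labels = {"conll03": 9, "ontonotes": 37, "conll04": 9}   # illustrative IOB2 label counts

# Continuous prompts are shared across datasets (placeholder tensor here).
shared_prompt = nn.Parameter(torch.randn(8, d_model) * 0.02)

# One linear classifier per dataset.
heads = nn.ModuleDict({name: nn.Linear(d_model, n) for name, n in num_labels.items()})

hidden_states = torch.randn(2, 12, d_model)        # frozen encoder output per token
logits = heads["conll03"](hidden_states)           # (batch, seq_len, num_labels)
labels = torch.randint(0, 9, (2, 12))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 9), labels.reshape(-1))
print(logits.shape, float(loss))
```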

(Extractive) question answering (QA). Extractive QA extracts the answer to a question from a given context. We use SQuAD (Rajpurkar et al., 2016) 1.1 and 2.0, where every answer is a contiguous span of the context. Following tradition, we formulate the problem as sequence labeling: each token is assigned one of the two labels "start" or "end", and the span formed by the most confident start-end pair is selected as the extracted answer. If the probability of the most confident pair is below a threshold, the model assumes the question is unanswerable. For the multi-task setting, the training set used for pre-training combines the training sets of SQuAD 1.1 and 2.0. During pre-training, we assume that all questions, regardless of their origin, are possibly unanswerable.
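The span-selection step described above can be sketched as follows; the exact scoring rule and threshold value are illustrative, not the paper's precise procedure.

```python
import torch

def extract_span(start_logits, end_logits, threshold=1e-3, max_answer_len=30):
    start_p, end_p = start_logits.softmax(-1), end_logits.softmax(-1)
    best_span, best_score = None, 0.0
    for s in range(len(start_p)):
        for t in range(s, min(s + max_answer_len, len(end_p))):
            score = float(start_p[s] * end_p[t])
            if score > best_score:
                best_span, best_score = (s, t), score
    # None signals "unanswerable" (relevant for SQuAD 2.0).
    return best_span if best_score >= threshold else None

start_logits, end_logits = torch.randn(50), torch.randn(50)
print(extract_span(start_logits, end_logits))
```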

Semantic role labeling (SRL). SRL assigns labels to words or phrases in a sentence to indicate their semantic role. We evaluate P-tuning v2 on CoNLL05 (Carreras and Màrquez, 2005) and CoNLL12 (Pradhan et al., 2012). Since a sentence can contain multiple verbs, we append the target verb token to the end of the sentence to indicate which verb the prediction is for. Each word is classified by a linear classifier on its corresponding semantic-role representation. For the multi-task setting, the pre-training set combines the training sets of CoNLL05 (Carreras and Màrquez, 2005), CoNLL12 (Pradhan et al., 2012), and propbank-release (a commonly used extra dataset for SRL training). The multi-task training strategy is the same as for NER.

Results. From Tables 2, 3, and 4, we observe that P-tuning v2 is comparable to fine-tuning on all tasks. P-tuning and Lester et al. (2021) perform much worse, especially on QA, which is probably the hardest of the three tasks. We also note some abnormal results on SQuAD 2.0 (BERT/RoBERTa/DeBERTa show the same performance under Lester et al. (2021) and P-tuning). This is probably because SQuAD 2.0, unlike SQuAD 1.1, contains unanswerable questions, and Lester et al. (2021) & P-tuning may collapse to a trivial solution.

Multi-task P-tuning v2 generally brings significant improvements across tasks, with the exception of QA (possibly because of mixing the all-answerable SQuAD 1.1 with the partly unanswerable SQuAD 2.0), which suggests that the potential of randomly initialized prompts is not yet fully exploited.

4.4 Ablation studies

We studied some important hyperparameters and architectural designs that may play a central role in P-tuning v2.

Prompt depth. The main difference between Lester et al. (2021) & P-tuning and P-tuning v2 is the multi-layer continuous prompts we introduce. Intuitively, because of the many non-linear activation functions in the intermediate transformer layers, the deeper the layer a prompt is added to, the more direct its impact on the output prediction. To verify the exact impact, given a fixed number k of layers to add prompts to, we select k layers in either ascending or descending order and add prompts to them as prefix tokens; for the remaining layers, we change their attention masks so that their prefix prompts do not participate in the computation (a minimal sketch of this masking is shown below).
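The sketch below illustrates the masking trick: prefix prompts formally exist at every layer, but the attention mask hides them in layers that are not selected, so only k layers effectively receive prompts. Layer indices, lengths, and names are illustrative assumptions.

```python
import torch

n_layers, prefix_len, seq_len = 24, 8, 10
selected_layers = set(range(16, 24))   # e.g. prompts only at layers 17-24 (0-indexed 16-23)

def attention_mask_for_layer(layer_idx):
    # 1 = the position may be attended to, 0 = masked out.
    prefix_part = torch.ones(prefix_len) if layer_idx in selected_layers else torch.zeros(prefix_len)
    return torch.cat([prefix_part, torch.ones(seq_len)])

print(attention_mask_for_layer(20))    # prefix positions visible
print(attention_mask_for_layer(3))     # prefix positions masked out
```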

As shown in Figure 4, with the same number of parameters (i.e., the same number of transformer layers receiving prompts), adding prompts in descending order is always better than adding them in ascending order. In the RTE case, adding prompts only to layers 17-24 yields performance very close to adding them to all layers, which could further cut the number of parameters needed to match fine-tuning.

Figure 4: Ablation study on prompt depth using BERT-large. "x-y" refers to the interval of layers to which we add continuous prompts (for example, "21-24" means we add prompts to transformer layers 21 to 24). The same number of continuous prompts added to deeper transformer layers (i.e., closer to the output layer) yields better performance than adding them to the beginning layers.

Embedding vs. MLP reparameterization. In prefix-tuning (Li and Liang, 2021) and P-tuning (Liu et al., 2021b), the authors found reparameterization useful for improving training speed, robustness, and performance. However, our experiments show that the effect of reparameterization is inconsistent across NLU tasks and datasets.

As shown in Figure 3, in RTE and CoNLL04, the reparameterization of the MLP generally shows better performance than the embedding at almost all prompt lengths. However, in BoolQ, the results of MLP and embedding are competitive; in CoNLL12, the results of embedding are consistently better than MLP.

Figure 3: Ablation study on prompt length and reparameterization using RoBERTa-large. Conclusions can differ widely across NLU tasks and datasets. (MQA: multiple-choice QA)

Prompt length. The prompt length is another influential hyperparameter of P-tuning v2, and its optimal value varies from task to task. From Figure 3 we observe that for simple NLU tasks, shorter prompts usually achieve the best performance; for hard sequence tasks, prompts longer than 100 are usually helpful.

We also find that reparameterization is strongly correlated with the optimal prompt length. For example, on RTE, CoNLL04, and BoolQ, MLP reparameterization reaches its best result at a shorter prompt length than the embedding. This observation may help further study of the optimization properties of prompt tuning.

Verbalizer with LM head vs. [CLS] label with linear head. The verbalizer with an LM head has been a core component of previous prompt tuning methods. However, in the supervised setting, tuning a linear head with roughly a few thousand parameters is affordable for P-tuning v2. We present our comparison in Table 5, where we keep the other hyperparameters fixed and only change the linear head on the [CLS] label to a verbalizer with an LM head. Here, for simplicity, we use "true" and "false" for SST-2, RTE, and BoolQ, and "true", "false", and "neutral" for CB. The results show no significant difference in performance between the verbalizer and [CLS].

Table 5: Comparison between [CLS] labels with linear heads and verbalizers with LM heads on RoBERTa-large.

5. Related work

Pre-trained language models. Self-supervised learning (Liu et al., 2020) and pre-trained language models (Han et al., 2021a) have become the backbone of natural language processing. From the early GPT (Radford et al., 2019), BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019), with relatively limited numbers of parameters (less than 350M), T5 (Raffel et al., 2019) and GPT-3 (Brown et al., 2020) have promoted the development of giant language models with billions or even trillions of parameters.

Prompting. Prompting (Liu et al., 2021a) refers to the use of special templates in the input context to aid language model prediction for both understanding and generation. Recently, owing to the success of GPT-3 (Brown et al., 2020), various prompting strategies have emerged, including discrete natural language prompts (Shin et al., 2020; Gao et al., 2020), continuous prompts (Liu et al., 2021b; Li and Liang, 2021; Lester et al., 2021; Qin and Eisner, 2021; Zhong et al., 2021), bias tuning (Logan IV et al., 2021), and many others.

The advantages and effectiveness of prompting methods in a wide range of NLP applications have been verified in recent literature, including text classification (Hu et al., 2021; Min et al., 2021; Sun et al., 2021; Li et al., 2021; Zhang et al., 2021b), entity typing (Ding et al., 2021), few-shot learning (Zheng et al., 2021; Xu et al., 2021a; Zhao et al., 2021; Gu et al., 2021; Zhang et al., 2021a), relation extraction (Chen et al., 2021a; Han et al., 2021b; Sainz et al., 2021), knowledge probing (Zhong et al., 2021), named entity recognition (Chen et al., 2021b), machine translation (Tan et al., 2021; Wang et al., 2021b), and dialogue systems (Wang et al., 2021a).

In this work, we focus on adapting prompting methods to smaller models and hard sequence NLU tasks.

6. Summary

We present P-tuning v2, a prompting method that is comparable to fine-tuning across scales and tasks. P-tuning v2 is not a conceptually new method, but rather a version of prefix-tuning and deep prompt tuning optimized and adapted for NLU challenges. P-tuning v2 shows consistent improvements on models from 330M to 10B parameters and outperforms Lester et al. (2021) and P-tuning by a large margin on hard sequence tasks such as sequence labeling. P-tuning v2 can serve as a comprehensive alternative to fine-tuning and a strong baseline for future work.


Source: blog.csdn.net/chaishen10000/article/details/131304269