[Paper & Model Explanation] Text Classification: Towards Unified Prompt Tuning for Few-shot Text Classification

Foreword

Paper title: Towards Unified Prompt Tuning for Few-shot Text Classification
Paper URL: https://arxiv.org/abs/2205.05313

0 Summary

Prompt-based fine-tuning improves the performance of pre-trained language models (PLMs) in few-shot text classification by using task-specific prompts.

  However, since PLMs do not encounter prompt-style expressions during pre-training, their few-shot performance on downstream tasks is limited.

  It would be desirable for the model to acquire some prompting knowledge before adapting to a specific NLP task.

  This paper proposes the UPT (Unified Prompt Tuning) framework, which provides better few-shot text classification for BERT-style models by explicitly capturing prompt semantics from non-target NLP datasets. In UPT, a new paradigm, Prompt-Options-Verbalizer (POV), is proposed for joint prompt learning across different NLP tasks, enabling PLMs to capture task-invariant prompting knowledge. To further improve the generalization ability of the PLM so that it can accurately adapt to previously unseen tasks, the authors also design a self-supervised task, Knowledge-enhanced Selective Masked Language Modeling (KSMLM).

  After multi-task learning across multiple source tasks, the PLM can be better prompt-tuned for different low-resource target tasks. Experiments on various NLP tasks show that UPT outperforms SOTA prompt-based fine-tuning methods.

1 Introduction

  The emergence of pre-trained language models (PLMs) has improved the performance of various NLP tasks. However, during fine-tuning, PLMs perform poorly when training samples are scarce, due to overfitting.

  To alleviate this problem in low-resource scenarios, natural language prompts have been applied to PLMs for few-shot or zero-shot learning. To make prompts more flexible and applicable to individual tasks, prompt tuning freezes the PLM backbone and tunes only the prompt representations; this approach is especially suitable for very large PLMs that are difficult to fine-tune. For BERT-style PLMs, prompt-based fine-tuning is proposed to transform the text classification task into a cloze-style problem: task-specific discrete templates with masked tokens are added to the input text, and the tokens predicted by the MLM head at the masked positions are used for class label prediction. The pre-trained knowledge acquired by PLMs can therefore be better utilized by "reusing" the MLM training objective. Due to the success of prompts in few-shot learning, various follow-up works such as continuous prompt encoding, joint prompt learning, and prompt generation have been carried out.

  Recently, some works have focused on multi-task prompt-tuning of very large PLMs. Specifically, they fine-tune PLMs on all training samples from different tasks, forcing the PLMs to learn more prompting knowledge, and then directly predict on target tasks through zero-shot learning. However, the authors observe that for BERT-style PLMs the performance is not satisfactory, for the following two reasons:

  1. These PLMs are sensitive to the design of prompt templates and verbalizers, and cannot readily adapt to target tasks with new prompts and verbalizers;

  2. There is a gap between the lexical distribution of prompt-style texts and that of the sentences in the pre-training corpus.

  It would be even better if BERT-style PLMs could acquire some prompting knowledge before adapting to downstream tasks. A question therefore naturally arises: how can BERT-style PLMs accurately adapt to a target NLP task with more prompting knowledge?

  To address these issues, we introduce a new framework, UPT (Unified Prompt Tuning), to provide BERT-style models with better few-shot text classification performance by explicitly capturing universal prompt semantics from non-target datasets. In particular, we propose the unified Prompt-Options-Verbalizer (POV) paradigm, which enables hybrid prompt-tuning across a range of different non-target NLP tasks. To further improve the model's generalization ability to previously unseen tasks, we propose a new auxiliary task, Knowledge-enhanced Selective MLM (KSMLM), which mimics the behavior of MLM while following the POV paradigm with explicit prompts. After multi-task training, the same prompting paradigm can be used to fine-tune the underlying PLM for any few-shot task.

  In experiments, we verify the effectiveness of UPT on public NLP datasets covering various tasks. Experimental results show that UPT consistently outperforms SOTA methods in prompt-based few-shot fine-tuning.

The authors' contributions are as follows:

  1. To improve the prompt-based fine-tuning of BERT-style models, the authors introduce the new UPT framework, which captures unified prompting semantics from multiple different types of source tasks for few-shot text classification on new target tasks.
  2. In UPT, a new paradigm POV is proposed for joint prompt tuning across different NLP tasks. The authors further design a self-supervised KSMLM task to improve the generalization ability of PLM and achieve precise task adaptation.
  3. Extensive experiments on various NLP datasets show that UPT consistently outperforms SOTA by a large margin in prompt-based few-shot fine-tuning.

2 UPT: The Proposed Framework

We start with a brief overview of the UPT framework, followed by its detailed techniques.

2.1 Overview of UPT

Figure 1: UPT is a unified framework that learns prompting knowledge from non-target NLP datasets in the Prompt-Options-Verbalizer form to improve performance on target tasks. Panels (a) and (b) show examples of the supervised and self-supervised learning tasks (i.e. Knowledge-enhanced Selective MLM), respectively.

  First, some basic notation is introduced. Let $\mathcal D^*$ be the N-way-K-shot training set of a target NLP task $\mathcal T^*$, and let the underlying PLM be parameterized by $\Theta$. The basic goal of few-shot learning is to obtain, based on $\mathcal D^*$, a high-performance model for $\mathcal T^*$ whose parameters are initialized from $\Theta$. Since the size of $\mathcal D^*$ is only $N \times K$, model performance is greatly limited. Here, we assume that there exist $M$ NLP tasks different from $\mathcal T^*$, namely $\mathcal T^{(1)},\cdots,\mathcal T^{(M)}$, whose (usually non-few-shot) training sets are denoted $\mathcal D^{(1)},\cdots,\mathcal D^{(M)}$. The UPT framework explores how to use $\mathcal D^{(1)},\cdots,\mathcal D^{(M)}$ to improve the PLM on a new task (such as $\mathcal T^*$) with its own few-shot training set $\mathcal D^*$.

  Note that we restrict $\mathcal T^{(1)},\cdots,\mathcal T^{(M)}$ to be different from $\mathcal T^*$ in order to handle real low-resource scenarios where no similar training set is available. If $\mathcal T^{(1)},\cdots,\mathcal T^{(M)}$ were similar to $\mathcal T^*$, transfer learning techniques could be directly applied to train the model, which is considered a relatively trivial problem and is not the focus of this paper.

  In UPT, the model is first trained on all source tasks $\mathcal T^{(1)},\cdots,\mathcal T^{(M)}$, aiming to learn the semantics of prompts and to solve downstream tasks through prompting. Then, in the low-resource scenario, prompt-tuning is performed on the specific target task $\mathcal T^*$. To unify the learning process, each training sample $i$ of any task ($\mathcal T^{(1)},\cdots,\mathcal T^{(M)}$ or $\mathcal T^*$) is augmented with a Prompt-Options-Verbalizer (POV) triple ($P_i$, $O_i$, $V_i$) in the same format. Here $P_i$ is the prompt, $O_i$ lists all possible options for the masked token (i.e. the set of label words), and $V_i$ is the verbalizer that maps the target token predicted by the MLM head of the PLM to a class label. An example of a supervised learning task can be found in Figure 1, and a minimal code sketch is given below.
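To make the POV format concrete, here is a minimal sketch of how a training sample might be wrapped with a prompt and an options expression. The sentence, template wording, and label words are illustrative assumptions in the style of Figure 1, not the authors' exact strings.

```python
def build_pov_input(text: str, prompt: str, options: tuple, mask_token: str = "[MASK]") -> str:
    """Wrap a raw sentence with a prompt P_i (containing the mask token)
    and an options expression O_i listing the candidate label words."""
    assert mask_token in prompt
    option_text = f"Is it {options[0]} or {options[1]}?"
    return f"{text} {prompt} {option_text}"

# Hypothetical sentiment-analysis sample in the style of Figure 1
pov_text = build_pov_input(
    "The movie was a delight from start to finish.",
    "It is [MASK].",
    ("great", "terrible"),
)
# The verbalizer V_i then maps the word predicted at [MASK] to a class label
verbalizer = {"great": "positive", "terrible": "negative"}
print(pov_text)
# -> The movie was a delight from start to finish. It is [MASK]. Is it great or terrible?
```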

  In addition, we observe that the label vocabulary of the original tasks $\mathcal T^{(1)},\cdots,\mathcal T^{(M)}$ is limited. For previously unseen tasks, optimizing only on these tasks tends to produce a model that is biased towards them and less general. Therefore, we introduce the self-supervised Knowledge-enhanced Selective MLM (KSMLM) task $\widetilde{\mathcal T}$ as an auxiliary task. Specifically, the union of the source-task training data $\widetilde{\mathcal D}=\mathcal D^{(1)}\cup\mathcal D^{(2)}\cup\cdots\cup\mathcal D^{(M)}$ is taken as input. The sentences are selectively masked, and the options are generated from rich knowledge mined from a large corpus; an example is also shown in Figure 1. In this way, the model obtains better generalization ability and avoids catastrophic forgetting of pre-trained knowledge.

2.2 Unified prompting paradigm

  For BERT-style models, a fundamental challenge of prompt-based training across $\mathcal D^{(1)},\cdots,\mathcal D^{(M)}$ is that different NLP tasks have different label words for the masked tokens. When dealing with mixed training samples, a simple solution is to build a unified output prediction space consisting of the candidate label words of all tasks. However, the enlarged output space makes the optimization of the PLM challenging. Furthermore, the output prediction space may not cover all possible label words of unseen NLP tasks.

  We propose a unified prompting paradigm that augments each sample $i$ with a Prompt-Options-Verbalizer (POV) triple ($P_i$, $O_i$, $V_i$). Here $P_i$ is a prompt that provides task guidance (consistent with PET, Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference), and $O_i$ is a fixed expression that explicitly presents all candidate label words to the model. For fast adaptation to arbitrary tasks, the verbalizer $V_i$ maps the output at the masked token over the entire vocabulary $\mathcal V$. These options are crucial because they give a strong indication of the likely output (i.e. candidate words) of the PLM. Formally, for training sample $i$, the output probability of token $v \in \mathcal V$, $q(v|i,P_i,O_i,\Theta)$, is computed as:
$$q(v|i,P_i,O_i,\Theta)=\frac{\exp(s(v|i,P_i,O_i,\Theta))}{\sum_{v'\in \mathcal V}\exp(s(v'|i,P_i,O_i,\Theta))}$$
where $s(v|i,P_i,O_i,\Theta)$ is the unnormalized score of the MLM head (before the softmax function) for generating token $v$ at the [MASK] position, with $i$, $P_i$, $O_i$ as input. The entire prediction vector (of length $|\mathcal V|$) is defined as $Q(\mathcal V|i,P_i,O_i,\Theta)$.
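As a hedged illustration of this computation, the sketch below uses a Hugging Face masked-LM head to obtain the scores $s(\cdot)$ at the [MASK] position and softmax-normalizes them over the whole vocabulary. The model choice and the POV string are assumptions, not the authors' exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-large")   # assumed backbone
model = AutoModelForMaskedLM.from_pretrained("roberta-large")

# POV-formatted input: sentence + prompt + options (illustrative strings)
text = ("The movie was a delight from start to finish. "
        f"It is {tokenizer.mask_token}. Is it great or terrible?")
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # s(v | i, P_i, O_i, Theta) for every position

# Softmax over the full vocabulary V at the [MASK] position gives q(v | i, P_i, O_i, Theta)
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
q = torch.softmax(logits[0, mask_pos[0]], dim=-1)            # length |V|

for word in [" great", " terrible"]:                         # RoBERTa tokens carry a leading space
    token_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))[0]
    print(word.strip(), q[token_id].item())
```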

The multi-task prompting loss (defined as $\mathcal L_{MP}$) is:
$$\mathcal L_{MP}=-\sum_{i\in \mathcal D} P(\mathcal V|i,P_i,O_i,\Theta)\cdot \log Q(\mathcal V|i,P_i,O_i,\Theta)$$
where $\mathcal D=\cup_{k=1}^M \mathcal D^{(k)}$ and $P(\mathcal V|i,P_i,O_i,\Theta)$ is the one-hot ground-truth prediction vector.

  In addition, we note that $\mathcal D^{(1)},\cdots,\mathcal D^{(M)}$ can be datasets with arbitrary labels and very different sizes. Directly optimizing $\mathcal L_{MP}$ on the original datasets would make the few-shot learner more likely to be biased towards larger datasets. In our work, we perform stratified sampling to form each batch, where a training sample $i\in \mathcal D^{(k)}$ is drawn from $\mathcal D^{(1)},\cdots,\mathcal D^{(M)}$ with a probability $w_i$ determined by the dataset sizes:
$$w_i=\frac{\log|\mathcal D^{(k)}|+\gamma}{M\cdot \gamma+\sum_{k'=1}^M \log|\mathcal D^{(k')}|}$$
where $\gamma>0$ is a smoothing factor. Accordingly, $\mathcal L_{MP}$ is redefined as the weighted multi-task prompting (WMP) loss $\mathcal L_{WMP}$:
$$\mathcal L_{WMP}=-\sum_{i\in \mathcal D} w_i\cdot P(\mathcal V|i,P_i,O_i,\Theta)\cdot \log Q(\mathcal V|i,P_i,O_i,\Theta)$$
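The smoothed sampling weights and the stratified batch construction can be sketched as follows (function and variable names are hypothetical). Because the weights grow only logarithmically with dataset size, very large source datasets do not dominate the mixed batches.

```python
import math
import random

def dataset_weights(sizes, gamma: float = 0.001):
    """w_i for a sample from dataset k: (log|D^(k)| + gamma) / (M*gamma + sum_k' log|D^(k')|)."""
    denom = len(sizes) * gamma + sum(math.log(n) for n in sizes)
    return [(math.log(n) + gamma) / denom for n in sizes]

def sample_batch(datasets, batch_size: int, gamma: float = 0.001):
    """Stratified sampling: pick a source dataset by its weight, then draw one example from it."""
    weights = dataset_weights([len(d) for d in datasets], gamma)
    return [random.choice(datasets[random.choices(range(len(datasets)), weights=weights)[0]])
            for _ in range(batch_size)]

# Example with three hypothetical source datasets of very different sizes
print(dataset_weights([392_702, 104_743, 363_846]))   # larger sets get only mildly larger weights
```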

2.3 Generalization of Unified Prompts in Self-Supervised Learning

Figure 2: An example of the POV generation process in the KSMLM task .

  A disadvantage of the above approach is that the label vocabulary of these supervised learning tasks is limited and covers only a narrow range of the vocabulary $\mathcal V$, so the model does not generalize well to tasks with new label words. Therefore, we revisit the idea of MLM pre-training under the POV paradigm.

  Given a sentence, we could randomly mask a word, give the correct word and a randomly chosen word as the options, and let the model make a prediction. Unfortunately, this seemingly workable idea breaks the training process, because not all words are suitable label words; for example, stop words and a large number of verbs and adverbs are never used in the verbalizers of downstream tasks. The alternative word used in the options must be reasonable so that the model can learn genuinely useful knowledge. To solve this problem, we propose the self-supervised KSMLM task, as shown in Figure 2. In the following, the construction process of the POV triple in KSMLM is described, followed by the loss function of the task.

P-Generation: This process aims to generate a template with a [MASK] token for each sentence, which is fixed as "It is [MASK]." in the multi-task training phase. In the fine-tuning stage of a specific task, templates are automatically generated for each task following LM-BFF (Making Pre-trained Language Models Better Few-shot Learners). During training, the PLM is asked to predict the actual word at the [MASK] position.

O-Generation: Following LM-BFF, most label words for language understanding tasks are adjectives (such as "great" and "terrible" in sentiment analysis). So in our work, we use a part-of-speech tagging model (the spaCy toolkit, https://spacy.io/) to detect all adjectives in the corpus and filter out low-frequency ones. The adjectives are then clustered with K-Means, using their token representations from the underlying PLM as features. We construct a knowledge base called the Options Knowledge Repository (OKR) in the form of triples $\mathcal R=\{(v,\vec{v},c_v)\}$, where $v$ is a candidate label word, and $\vec{v}$ and $c_v$ are respectively the representation vector of $v$ and the cluster it belongs to. Cluster centers are also stored. We do not use existing dictionaries such as WordNet, as they may cover only limited label words. Furthermore, the automated process enables our algorithm to scale to arbitrary languages and domains.

  With $\mathcal R$ available, we can generate knowledge-guided options. Given a sentence in which the word $v$ is masked, we query $\mathcal R$ for the cluster most dissimilar to $v$, denoted $\widetilde c_v$, using the cosine similarity between the representation vector $\vec{v}$ and the cluster centers as the similarity measure. Finally, an adjective is randomly selected from $\widetilde c_v$ as the alternative label word to generate the knowledge-induced options. The textual expression of the options is fixed, i.e. "Is it [x1] or [x2]?". See the example in Figure 2 and the sketch below.
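A minimal sketch of how such an OKR could be built and queried is shown below. Note that the paper clusters adjectives using token representations from the underlying PLM; here spaCy word vectors, the `en_core_web_md` model, and the cluster count are stand-in assumptions for brevity.

```python
import numpy as np
import spacy
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

nlp = spacy.load("en_core_web_md")   # provides POS tags and word vectors (assumed model)

def build_okr(corpus, n_clusters: int = 20, min_freq: int = 5):
    """Collect frequent adjectives and store OKR triples (word, vector, cluster id)."""
    counts, vectors = Counter(), {}
    for doc in nlp.pipe(corpus):
        for tok in doc:
            if tok.pos_ == "ADJ" and tok.has_vector:
                counts[tok.lower_] += 1
                vectors[tok.lower_] = tok.vector
    words = [w for w, c in counts.items() if c >= min_freq]   # filter low-frequency adjectives
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(
        np.stack([vectors[w] for w in words]))
    okr = [(w, vectors[w], int(c)) for w, c in zip(words, km.labels_)]
    return okr, km.cluster_centers_

def knowledge_induced_option(masked_word: str, okr, centers, rng=np.random.default_rng(0)):
    """Pick an alternative label word from the cluster least similar to the masked word."""
    vec = nlp(masked_word)[0].vector.reshape(1, -1)
    farthest = int(np.argmin(cosine_similarity(vec, centers)[0]))
    return rng.choice([w for w, _, c in okr if c == farthest])

# okr, centers = build_okr(sentences)                      # sentences: list of raw strings
# alt = knowledge_induced_option("effective", okr, centers)
# options_text = f"Is it effective or {alt}?"
```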

V-Generation: For verbalizers, the real label word and the generated label word in the options are mapped to two classes, namely Class: Correct and Class: Incorrect. For example, the verbalizers for the example sentence in Figure 2 are:

It is “effective”. →“Class: Correct”
It is “ineffective”. →“Class: Incorrect”

Loss function: The KSMLM loss is significantly different from the auxiliary MLM loss used in PET. In $\widetilde{\mathcal D}$, each training sample $i$ can be directly extended into a KSMLM training sample through its POV triple, i.e. the options $O_i$ and the prompt $P_i$. The PLM is trained to predict the correct [MASK] word in the sentence, with the loss function
$$\mathcal L_{KSMLM}=-\sum_{i\in \widetilde{\mathcal D}} P(\mathcal V|i,P_i,O_i,\Theta)\cdot \log Q(\mathcal V|i,P_i,O_i,\Theta)$$
Overall, the loss function $\mathcal L$ of UPT is defined as the combination of the WMP and KSMLM losses:
$$\mathcal L=\mathcal L_{WMP}+\lambda\cdot\mathcal L_{KSMLM}$$
where $\lambda\geq 0$ is a balancing hyperparameter.
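Putting the two objectives together, one training step might look like the sketch below; the batch layout, helper keys, and the cross-entropy formulation (equivalent to the one-hot dot products above) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def training_step(model, wmp_batch, ksmlm_batch, lam: float = 0.1):
    """One multi-task step: weighted prompting loss + lambda * KSMLM loss."""
    # L_WMP over labeled source-task samples, weighted by the sampling weights w_i
    logits = model(**wmp_batch["inputs"]).logits                       # MLM-head scores
    mask_logits = logits[wmp_batch["mask_rows"], wmp_batch["mask_cols"]]
    per_sample = F.cross_entropy(mask_logits, wmp_batch["label_token_ids"], reduction="none")
    wmp_loss = (wmp_batch["weights"] * per_sample).sum()

    # L_KSMLM over the union of source datasets: predict the masked word itself
    logits = model(**ksmlm_batch["inputs"]).logits
    mask_logits = logits[ksmlm_batch["mask_rows"], ksmlm_batch["mask_cols"]]
    ksmlm_loss = F.cross_entropy(mask_logits, ksmlm_batch["masked_token_ids"])

    return wmp_loss + lam * ksmlm_loss
```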

Discussion: To the best of our knowledge, external knowledge has also been applied in other prompt-based methods such as KPT (Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification). The main difference from KPT is that UPT uses knowledge to create options for the self-supervised KSMLM task, improving the generalization ability of the model so that it can accurately adapt to new tasks. In contrast, previous work considered the expansion of verbalizers for specific downstream NLP tasks.

2.4 Few-shot Fine-tuning

  For a specific downstream task $\mathcal T^*$, the samples in the target few-shot training set $\mathcal D^*$ can be processed and computed in the same way as in UPT's multi-task stage. This consistency between the two learning stages ensures that the underlying PLM has already acquired prompting knowledge relevant to $\mathcal T^*$. Furthermore, a single PLM prompt-tuned on various tasks can be used to fine-tune any target task, so the corresponding models for these applications can be produced in a computationally efficient way.

3 Experiments

3.1 Experimental setup

  In the experiments, 9 publicly available text classification datasets are used to evaluate the proposed UPT framework, which are divided into 3 groups:

  • Sentiment Analysis: SST-2, MR, CR;
  • Natural Language Inference (NLI, Natural Language Inference): MNLI, SNLI, QNLI, RTE;
  • Paraphrase: MRPC, QQP;

See the appendix for dataset statistics. By default, $K=16$ (training instances per class).

  In UPT, we only utilize the full training data from the dissimilar task groups, and then prompt-tune the model on the target task in a low-resource setting. For example, when the target task is SST-2, the training data used in UPT comes from the NLI and Paraphrase groups. Unless otherwise stated, the underlying PLM is RoBERTa-large. The baselines include standard fine-tuning and four recently proposed few-shot learning algorithms: PET, LM-BFF, P-tuning, and PPT. For a fair comparison with these single-task baselines, a variant of our method (called UPT-Single) is also implemented by fine-tuning only on the POV-based few-shot target task, without using any dissimilar supervised source tasks.

  When other dissimilar datasets are used to train the model, we also include two multi-task methods that meta-tune on the same dissimilar datasets as strong baselines, namely MT (Zero-shot) and MT (Few-shot). We additionally report a zero-shot version of UPT, namely UPT (Zero-shot). Furthermore, given a supervised NLP task, multiple prompts can be created manually; by augmenting a training sample with these prompts, we can naturally implement self-ensemble learning. For the self-ensemble version of UPT, 5 different prompts are used, and for each input sample an option expression and a set of verbalizers are randomly selected. We refer to this method as UPT-SE. The designed prompts, options, and verbalizers are shown in Table 5. Results of all models are reported as mean accuracy and standard deviation over 5 random seeds.

  The Adam optimizer is used for training, with the learning rate fixed at 1e-5 throughout all training epochs. The hyperparameters are set to $\gamma=0.001$ and $\lambda=0.1$ by default; they are also tuned on the dev set. Other parameter settings follow LM-BFF.

3.2 Main results

Table 1: Comparison of accuracy (%) and standard deviation of UPT and the baselines on all test sets. "FT" and "PT" refer to the fine-tuning and prompt-based fine-tuning paradigms, respectively. Methods in bold are our method and its variants. The baseline scores are reproduced by rerunning their open-source code.

In Table 1 , the experimental results of UPT and all baselines are shown. The results show:

  1. Prompt-based methods (i.e. PET, LM-BFF, P-tuning, PPT) show great improvements over standard fine-tuning .
  2. The average performance of UPT-Single is superior to previous few-shot learning models, which shows that using POV outperforms ordinary prompts (PET).
  3. UPT (both the standard and self-ensemble versions) outperforms all baselines on all tasks, suggesting that our framework achieves better generalization by learning from groups of dissimilar tasks.
  4. The results of MT (Zero-shot) and UPT (Zero-shot) on BERT-style models are not satisfactory. Unlike very large models, we believe that few-shot prompt-tuning is necessary for BERT-style models to produce good results on these tasks.
  5. The comparison between UPT and MT (Few-shot) shows that the proposed POV paradigm and the self-supervised KSMLM task are more suitable for few-shot learning.
  6. Overall, UPT-SE achieves an average accuracy improvement of 1.2% over UPT across all tasks. This means that self-ensemble learning can improve the generalization ability of the model, but the improvement is not uniform across tasks; one possible reason is that some prompts and options are not optimal for the target task.

3.3 Model Analysis

Figure 3: Parametric analysis on the hyperparameter λ

Parameter analysis: We analyze the optimal choice of the balance coefficient $\lambda$. The results on SST-2 and RTE are shown in Figure 3. The performance is best when $\lambda=0.1$, which demonstrates the effectiveness of combining UPT with the self-supervised KSMLM task. We also observe that the performance degrades when $\lambda$ becomes larger. This means that KSMLM is a suitable regularization task, but may also introduce many prompts and options that are not relevant to the downstream task. This provides a new direction for model improvement.


Table 3: Ablation studies in terms of accuracy (%). The standard deviation is ignored here to save space.

Ablation study: To explicitly verify the contribution of each component in UPT, we perform an ablation study on all groups and report the average accuracy. As shown in Table 3:

  • w/o. POV means using manually designed prompts without any options;
  • w/o. KSMLM is equivalent to setting $\lambda=0$, which is the same as UPT-Single;
  • w/o. OKR refers to randomly selecting the alternative label words in the options, without knowledge guidance, when optimizing the KSMLM task;
  • w/o. POV & KSMLM means removing both the options and the auxiliary KSMLM task.

The results show that removing any module degrades model performance. In particular, when POV and KSMLM are removed at the same time, the performance drops by 1.4%, 1.5%, and 4.4% respectively, and the accuracy under this setting is lower than either w/o. POV or w/o. KSMLM, showing that both components contribute greatly to the performance of our framework. We also find that w/o. POV and w/o. KSMLM outperform MT (Few-shot) in all groups. Furthermore, if we use KSMLM but remove OKR, the performance on all these tasks decreases but is still higher than w/o. KSMLM, implying that the option knowledge mined from the corpus is suitable for the self-supervised learning task.


Figure 4: Results of the sample-efficiency analysis. The performance of UPT and standard fine-tuning with different numbers of training samples K is compared on two tasks.

Sample efficiency: We further explore model performance as the number of training samples per class ($K$) increases from 16 to 512, using standard fine-tuning as a reference. As shown in Figure 4, each point refers to the average score over 5 randomly sampled datasets. It can be observed that UPT achieves higher scores regardless of the number of training samples. In addition, the variance of UPT is smaller than that of fine-tuning, indicating that our method is more stable; this also distinguishes it from other prompt-based methods (PET, LM-BFF).


Table 2: Model size analysis results. We show the accuracy (%) of UPT based on BERT models at other scales, and the relative improvement compared to the variant without prompt learning on dissimilar datasets.

Model scale analysis: To further illustrate that UPT improves model performance regardless of scale, we use multiple smaller-scale BERT models as the backbone. Due to space limitations, we only report the results on SST-2, MR and CR in Table 2. For a fair comparison, we also test performance without using dissimilar NLP datasets and report the relative improvement. The results show that model size has a significant impact on the generalization ability of the model. We also find that using dissimilar datasets greatly improves performance, especially for small-scale PLMs. Therefore, our method is well suited to producing small PLMs for online applications.


Adaptation efficiency of task groups: Since we focus on multi-task training followed by prompt-tuning the target task in a low-resource setting, which group (and how many groups) of tasks best improves adaptation is worth discussing. Specifically, given a target task (such as MNLI), we select only one group of tasks (such as MRPC and QQP in Group 3, Paraphrase) for multi-task prompt-tuning, and then fine-tune the model on the target task. As shown in Figure 5, the cell in row $i$ and column $j$ represents the relative improvement over single-task learning on the $j$-th task when the $i$-th group is added for multi-task prompt learning. For visualization, the values of each column are normalized to show the percentage influence of each group. The results show that the performance of the target task improves most when data samples from other datasets in the same task group are added. However, similar datasets are not available in low-resource scenarios. With UPT, we can still adapt dissimilar tasks to the target task.

  Specifically, with NLI as the source group, $M$ datasets are randomly selected from this group as source tasks, and the model is then prompt-tuned for each target task. The results in Figure 6 show that accuracy increases as $M$ grows. We also find that the improvements are more pronounced for MRPC and QQP. We argue that NLI is easier to adapt to the paraphrase tasks since both model the relationship between sentence pairs.

4 Related work

Pre-trained language models: In recent years, thanks to the powerful modeling capabilities and computing resources of PLMs, we have witnessed substantial improvements on many NLP tasks. For example, the GPT model family utilizes a multi-layer Transformer decoder to capture left-to-right semantics of natural language, while BERT focuses on learning bidirectional contextual representations. Other notable PLMs include Transformer-XL, ELMo, RoBERTa, ALBERT, XLNet, StructBERT, T5, and many more. Since model structure is not the focus of our work, we do not elaborate further.

Prompt-based Learning: Directly fine-tuning PLMs through the [CLS] head may perform poorly when few training samples are available. Recently, the giant GPT-3 model was proposed to support in-context learning, which introduces handcrafted prompts and demonstrations. PET applies handcrafted prompts to the prompt-based fine-tuning of BERT-style models. To facilitate automatic prompt generation, Gao et al. proposed LM-BFF to generate discrete templates. Other works extract prompts from training corpora based on heuristic rules or semantic relations. However, these methods are time-consuming when mining the optimal prompt for a target task. A series of methods have therefore been proposed to learn continuous or soft prompt embeddings, such as P-tuning, P-tuning V2, OptiPrompt, and Prefix-tuning.

  Recently, works such as Finetuned Language Models Are Zero-Shot Learners, Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections, MetaICL: Learning to Learn In Context, and Reframing Instructional Prompts to GPTk's Language fine-tune PLMs on mixed data samples drawn from different NLP tasks, using manually designed task-specific prompts. The resulting PLMs are then used in a zero-shot manner to handle unseen tasks. These methods succeed thanks to very large PLMs (e.g., GPT-3, T5) but consume a lot of computing resources. We instead leverage data from non-target NLP tasks to prompt-tune PLMs with better adaptability to unseen NLP tasks.

5 Conclusions and future work

  In this paper, we propose the Unified Prompt Tuning (UPT) framework to enable BERT-style models to perform better few-shot text classification by explicitly capturing prompt semantics from non-target datasets.

  Experiments show that UPT consistently outperforms SOTA in terms of prompt-based fine-tuning.

  In future work, we seek to extend UPT to other tasks such as named entity recognition, text generation, and machine translation. Additionally, we will explore the extended prompt-tuning of UPT .
