Paper information
| name | content |
| --- | --- |
| paper title | AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts |
| paper address | https://arxiv.org/abs/2010.15980 |
| field of study | NLP, text classification, prompt learning |
| proposed model | AutoPrompt |
| source code | http://ucinlp.github.io/autoprompt |
Reading summary
Prompt-based tasks require building a suitable pattern (template), but writing a good pattern by hand takes manual effort and guesswork and carries a lot of uncertainty. To solve this problem, the paper proposes AUTOPROMPT, a model that finds patterns through gradient-guided search.
[0] Summary
This paper proposes an automated method called AUTOPROMPT for creating appropriate prompt templates (patterns) for diverse tasks.
Although the traditional MLM cloze task can evaluate how much knowledge a language model holds, it requires manually writing a suitable pattern, which involves substantial human cost and guesswork. AUTOPROMPT solves this problem by searching for patterns with a gradient-guided method.
Without additional parameters or fine-tuning, AUTOPROMPT can sometimes even achieve performance comparable to state-of-the-art supervised models.
[1] Introduction
The paper points out that traditional analysis methods, such as probing classifiers and attention visualization, have certain limitations. Prompting, i.e., casting the task as a language modeling problem, can elicit the knowledge a model holds more directly.
However, existing prompting methods require manually constructing the context, which is time-consuming and error-prone, and models are highly sensitive to that context. To address this, the researchers propose AUTOPROMPT, which replaces manual construction with automatically generated prompts, improving the efficiency and broad applicability of this kind of analysis.
AUTOPROMPT is built on a gradient-guided search strategy: it combines the original task input with a set of trigger tokens to generate a prompt that is shared across all inputs. The language model can then be evaluated as a classifier by combining its predictions for the prompt with the class probabilities of the associated label words.
The effectiveness of AUTOPROMPT is demonstrated through multiple experiments. First, the researchers used AUTOPROMPT to construct prompts for sentiment analysis and natural language inference; without fine-tuning, pre-trained masked language models (MLMs) alone achieve good performance, reaching 91% accuracy on the SST-2 dataset, better than a fine-tuned ELMo model. Second, the researchers applied AUTOPROMPT to the LAMA fact retrieval task and successfully extracted factual knowledge from MLMs by constructing more effective prompts. In addition, they introduced a variant of the task similar to relation extraction, testing whether knowledge can be extracted from MLMs given a piece of text. Experimental results show that MLMs can outperform existing relation extraction models when provided with context sentences containing real facts, but perform poorly when provided with context sentences built from artificial templates.
Finally, the paper points out that AUTOPROMPT has certain practical advantages over fine-tuning. When data is scarce, AUTOPROMPT achieves higher average and worst-case accuracy. Unlike fine-tuning, prompting does not require a lot of disk space to store model checkpoints, and once a prompt is found it can be used on top of existing pre-trained LMs. This is a benefit when serving models for multiple tasks.
[2] Model overview
Writing pattern templates by hand is time-consuming, and it is unclear whether the same template will work for every model or what criteria determine whether a template is best at eliciting the desired information. Based on this consideration, AUTOPROMPT is introduced; its structure is shown in the figure below.
[2.1] Background and math notation
To build the prompt, the notation distinguishes the original task input $x_{\text{inp}}$, the trigger tokens $x_{\text{trig}}$, and the prompt $x_{\text{prompt}}$ that is fed to the MLM as input. A template $\lambda$ maps $x_{\text{inp}}$ to $x_{\text{prompt}}$.
[Note] Looking at the lower left of the figure above, the $n$ `[T]` positions are actually the trigger tokens $x_{\text{trig}}$: they are the tokens the gradient-guided search will fill in, and they are initialized to `[MASK]`. The `[P]` position is the real `[MASK]` slot whose prediction we read out.
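To make the notation concrete, here is a minimal sketch (my own illustration, not the authors' released code) of how a template $\lambda$ could stitch the task input and the trigger slots into $x_{\text{prompt}}$; the template layout and the `apply_template` helper are hypothetical.

```python
# Minimal sketch of the template λ mapping x_inp -> x_prompt.
# The exact template format is an assumption; AutoPrompt's released code may differ.

def apply_template(x_inp: str, num_trigger_tokens: int = 3,
                   mask_token: str = "[MASK]") -> str:
    """Map the raw task input to a prompt with [T] trigger slots and a [P] slot."""
    # [T] slots start as [MASK] and are later replaced by the searched trigger tokens.
    triggers = " ".join([mask_token] * num_trigger_tokens)
    # The final mask_token plays the role of [P]: the slot whose prediction is read out.
    return f"{x_inp} {triggers} {mask_token}."

print(apply_template("a real joy."))
# -> "a real joy. [MASK] [MASK] [MASK] [MASK]."
```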
For the verbalizer part, the class probability is obtained by summing the MLM's probabilities over the set of label tokens mapped to each class:

$$p(y \mid x_{\text{prompt}}) = \sum_{w \in \mathcal{V}_y} p([\text{MASK}] = w \mid x_{\text{prompt}})$$
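Below is a minimal sketch of this summation, assuming `mask_logits` holds the MLM's logits at the `[P]` position and `label_token_ids` maps each class to its token set $\mathcal{V}_y$; both names are my own, not from the paper's code.

```python
import torch

def class_probs(mask_logits: torch.Tensor, label_token_ids: dict) -> torch.Tensor:
    """Sum the MLM's probabilities over each class's label-token set V_y."""
    probs = torch.softmax(mask_logits, dim=-1)        # p([MASK] = w | x_prompt) for every w
    return torch.stack([probs[ids].sum()              # p(y | x_prompt) = sum over w in V_y
                        for ids in label_token_ids.values()])

# Example usage with hypothetical token ids for a sentiment task:
# class_probs(logits, {"positive": [2307, 1363], "negative": [2919, 6659]})
```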
[2.2] Gradient-based prompt template search
The idea is to add a set of trigger tokens that are shared across the prompts for all inputs, i.e., the `[T]` positions in the template. These tokens are initialized to `[MASK]` and then iteratively updated to maximize the likelihood of the correct labels.
At each step, compute a first-order approximation of the change in the label log-likelihood caused by replacing the $j$-th trigger token with another token $w$ (where $w$ ranges over the vocabulary). The top-$k$ tokens estimated to cause the largest increase form the candidate set $\mathcal{V}_{\text{cand}}$:

$$\mathcal{V}_{\text{cand}} = \operatorname{top-}k_{w \in \mathcal{V}} \left[ \mathbf{w}_{\text{in}}^{\top} \nabla \log p(y \mid x_{\text{prompt}}) \right]$$

where $\mathbf{w}_{\text{in}}$ is the input embedding of candidate token $w$.
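A rough sketch of this candidate-selection step (my own illustration of the first-order approximation, not the authors' implementation); `grad_at_trigger` and `input_embeddings` are assumed to come from a backward pass through the MLM.

```python
import torch

def top_k_candidates(grad_at_trigger: torch.Tensor,
                     input_embeddings: torch.Tensor,
                     k: int = 10) -> torch.Tensor:
    """First-order estimate of the log-likelihood change from swapping the j-th
    trigger token with every vocabulary word w; keep the top-k as V_cand.

    grad_at_trigger:   gradient of log p(y | x_prompt) w.r.t. the trigger's input
                       embedding, shape (embed_dim,)
    input_embeddings:  the MLM's input embedding matrix, shape (vocab_size, embed_dim)
    """
    scores = input_embeddings @ grad_at_trigger   # w_in^T · ∇ log p(y | x_prompt), per word
    return scores.topk(k).indices                 # candidate token ids forming V_cand
```

In the full algorithm each candidate in $\mathcal{V}_{\text{cand}}$ is then re-evaluated on a batch, and the replacement that actually improves the likelihood most is kept.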
[2.3] Automatic label token selection
Although in some tasks the choice of label tokens is obvious, for problems with abstract class labels it is not clear which label tokens to pick. The paper therefore proposes a general two-step method for automatically selecting the label token sets $\mathcal{V}_y$.
In the first step, a logistic classifier is trained to predict the class label using the MLM's contextualized embedding of the `[MASK]` token as input.

Classifier input: the contextualized embedding $\mathbf{h}^{(1)}$ of the `[MASK]` token.

Classifier output: $p(y \mid \mathbf{h}^{(1)}) \propto \exp(\mathbf{h}^{(1)} \cdot \mathbf{y} + \beta_y)$

where $\mathbf{y}$ and $\beta_y$ are the learned weight and bias terms for class $y$.
In the second step, the MLM's output word embedding $\mathbf{w}_{\text{out}}$ of each vocabulary word is fed into the trained logistic classifier (replacing $\mathbf{h}^{(1)}$ in the formula above), giving a score $s(y, w) = p(y \mid \mathbf{w}_{\text{out}})$. Intuitively, the score $s_w = \exp(\mathbf{w}_{\text{out}} \cdot \mathbf{y} + \beta_y)$ is larger for words related to the given label. The top-$k$ highest-scoring words then form the label set:

$$\mathcal{V}_y = \operatorname{top-}k_{w \in \mathcal{V}} \left[ s(y, w) \right]$$
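Here is a small end-to-end sketch of the two-step selection under these definitions; the function name, the toy training loop, and the tensor shapes are my own assumptions rather than the paper's code.

```python
import torch

def select_label_tokens(mask_embeddings: torch.Tensor,        # h^(1) per example, shape (N, d)
                        labels: torch.Tensor,                  # class ids (LongTensor), shape (N,)
                        output_word_embeddings: torch.Tensor,  # w_out, shape (vocab_size, d)
                        num_classes: int, k: int = 3) -> dict:
    """Two-step selection of V_y: fit a logistic classifier on [MASK] embeddings,
    then score every vocabulary word's output embedding with the same classifier."""
    d = mask_embeddings.size(-1)
    clf = torch.nn.Linear(d, num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=1e-3)

    # Step 1: train p(y | h^(1)) on the contextualized [MASK] embeddings.
    for _ in range(100):
        loss = torch.nn.functional.cross_entropy(clf(mask_embeddings), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Step 2: score s(y, w) = p(y | w_out) for every word, keep top-k per class as V_y.
    with torch.no_grad():
        scores = torch.softmax(clf(output_word_embeddings), dim=-1)   # (vocab_size, num_classes)
    return {y: scores[:, y].topk(k).indices for y in range(num_classes)}
```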
[Note] Patterns come in two flavors, hard templates and soft templates, and AutoPrompt belongs to the hard-template category. Personally, I still think soft prompts, which train continuous pseudo-tokens, would work better: no matter how the hard template is searched, the words it finds are still confined to the PLM's vocabulary, whereas continuous vectors have a higher degree of freedom.