Prompt-"Design Prompt Template: Use less data to achieve superior performance of pre-trained models, helping Few-Shot and Zero-Shot tasks"

Prompt Tasks (Prompt任务)

By designing a prompt template, we can use a smaller amount of data to obtain better results from a pretrained model; this is mostly used for Few-Shot and Zero-Shot tasks.

1. Background introduction

Prompting is currently an important direction in few-shot learning research in NLP. For example, suppose we have the following two comments:

  1. What kind of apples are these? They have no apple flavor at all, taste weird, and are not sweet in the slightest. Absolutely awful!
  2. This lousy laptop is way too slow and lags constantly.

Now we want to classify each comment by the type of product it describes.

That is, the first sentence should be assigned to the "fruit" category, and the second to the "computer" category.

An intuitive approach is to model this as a traditional text classification task and assign an id to each category through manual labeling, for example:

{
    '电脑': 0,
    '水果': 1,
    ....
}

In this way, the labeled data set looks like this:

什么苹果啊,都没有苹果味,怪怪的味道,而且一点都不甜,超级难吃!  1
这破笔记本速度太慢了,卡的不要不要的。    0
...

This approach works, but it requires a relatively large amount of labeled data to achieve good results.
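
As a quick illustration (a sketch, not code from the project), the traditional setup simply encodes each comment with the label-to-id mapping above before training a standard classifier:

# Illustrative only: encode the labeled comments with the label-to-id mapping above.
label2id = {'电脑': 0, '水果': 1}

raw_samples = [
    ('什么苹果啊,都没有苹果味,怪怪的味道,而且一点都不甜,超级难吃!', '水果'),
    ('这破笔记本速度太慢了,卡的不要不要的。', '电脑'),
]
encoded = [(text, label2id[label]) for text, label in raw_samples]
print([label_id for _, label_id in encoded])    # [1, 0], matching the labeled lines above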

Most pretrained models (such as BERT) use the [MASK] token for the MLM objective during pretraining, but real downstream tasks usually do not use [MASK] at all. This means that when training on downstream tasks, more data is needed to bridge the gap between the pretraining objective and the downstream task.

So, what if we don't have enough training data?

Prompt learning emerged to solve this problem: it introduces the [MASK] token into the downstream task and reformulates the downstream task as an MLM-like task.

For example, we could rewrite the above comment as:

这是一条[MASK][MASK]评论:这破笔记本速度太慢了,卡的不要不要的。

Then we let the model predict the actual values of the two [MASK] tokens; from the context, the model can infer that the masked word should be "电脑" (computer).

Since the downstream task now uses the same MLM objective as pretraining, we can fine-tune with much less training data.
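
To make this concrete, here is a minimal sketch (assuming bert-base-chinese and the example sentence above; this is not the project's code) of querying a masked-language-model head for the two [MASK] positions:

# A minimal sketch, assuming bert-base-chinese and the example prompt above.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

text = "这是一条[MASK][MASK]评论:这破笔记本速度太慢了,卡的不要不要的。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                         # (1, seq_len, vocab_size)

mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
pred_ids = logits[0, mask_positions].argmax(dim=-1)         # greedy choice for each [MASK]
print(tokenizer.convert_ids_to_tokens(pred_ids.tolist()))   # ideally characters such as '电', '脑'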

However, this alone is not yet P-Tuning.

From the example above, we can see that the most critical part of constructing such a sentence is generating the prompt, namely:

「这是一条[MASK][MASK]评论:」(prompt) + 这破笔记本速度太慢了,卡的不要不要的。(content)

How this bracketed prefix (the prompt) is generated is very important: different prompts greatly affect how accurately the model predicts [MASK].

So how is this prompt generated?

Of course, we can manually design many different kinds of prefix prompts, which we call prompt patterns, for example:

这是一条[MASK][MASK]评论:
下面是一条描述[MASK][MASK]的评论:
[MASK][MASK]:
...

However, manually enumerating prompt patterns is tedious: different datasets require different patterns, and reusability is very low.

So, can the machine learn the prompt pattern by itself?

This is P-Tuning.

1.1 P-Tuning

Manually constructed templates look reasonable to humans, but in the eyes of the machine, does it really matter what the prompt pattern looks like?

A machine's understanding of natural language is likely to differ from a human's. We once ran a comparison between model attention and human judgments of which words matter, and found a noticeable deviation between how the model and humans understand language.

So instead of handing the model a bunch of prompt patterns that we consider "reasonable", why not let the model find the prompt pattern that it considers "reasonable"?

Therefore, P-Tuning training is divided into three steps: prompt token(s) generation, mask label generation, and MLM loss calculation.

1.1.1 prompt token(s) generation

Now that we don't have to manually build prompt templates, we don't know what kind of templates the machine likes...

Why not just make up a template and throw it at the model?

Sounds sloppy, but that's exactly how it's done.

We choose Chinese BERT as the backbone model and use the [unused] tokens in vocab.txt as the elements that make up the prompt template.

[unused] tokens are reserved, unused tokens in the BERT vocabulary. They carry no meaning on their own, so combining them arbitrarily has little semantic impact, which is exactly why we use them to build the prompt template.

Then, the constructed prompt pattern looks like this:

[unused1][unused2][unused3][unused4][unused5][unused6] 
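
A minimal sketch of building such a prompt from [unused] tokens (assuming the bert-base-chinese vocabulary; variable names are illustrative):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

p_embedding_num = 6                                    # number of pseudo prompt tokens
prompt_tokens = [f"[unused{i}]" for i in range(1, p_embedding_num + 1)]
prompt_ids = tokenizer.convert_tokens_to_ids(prompt_tokens)

print(prompt_tokens)   # ['[unused1]', '[unused2]', ..., '[unused6]']
print(prompt_ids)      # their reserved ids in vocab.txt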

1.1.2 mask label generation

After constructing the prompt template, we also need to add the mask labels to the sentence so that the model can perform the label prediction task for us.

We set the label length to 2 ("水果" fruit and "电脑" computer are both two characters long) and place the label, as [MASK] tokens, at the beginning of the sentence:

[CLS][MASK][MASK]这破笔记本速度太慢了,卡的不要不要的。[SEP]

Here, the [MASK] tokens are the label tokens we want the model to predict. Now we put the two parts together:

[unused1][unused2][unused3][unused4][unused5][unused6][CLS][MASK][MASK]这破笔记本速度太慢了,卡的不要不要的。[SEP]

This is our final input to the model.
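
A sketch of assembling this final input sequence (illustrative variable names, not the project's actual code):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

p_embedding_num, max_label_len = 6, 2
content = "这破笔记本速度太慢了,卡的不要不要的。"

prompt_ids = tokenizer.convert_tokens_to_ids(
    [f"[unused{i}]" for i in range(1, p_embedding_num + 1)]
)
mask_ids = [tokenizer.mask_token_id] * max_label_len
content_ids = tokenizer.encode(content, add_special_tokens=False)

input_ids = prompt_ids + [tokenizer.cls_token_id] + mask_ids + content_ids + [tokenizer.sep_token_id]
print(tokenizer.convert_ids_to_tokens(input_ids))
# ['[unused1]', ..., '[unused6]', '[CLS]', '[MASK]', '[MASK]', '这', '破', ..., '[SEP]']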

1.1.3 mlm loss calculation

The next step is to fine-tune the model. We feed the model data like this:

[unused1][unused2][unused3][unused4][unused5][unused6][CLS][MASK][MASK]这破笔记本速度太慢了,卡的不要不要的。[SEP]

We then take the model's predictions at the [MASK] positions and compute the CrossEntropy loss between them and the real label.

The labeled data for P-Tuning looks like this:

水果    什么苹果啊,都没有苹果味,怪怪的味道,而且一点都不甜,超级难吃!
电脑    这破笔记本速度太慢了,卡的不要不要的。
...

In other words, we compute the CrossEntropy loss between the model's output at the [MASK] positions and the two label tokens of "电脑" (computer), teaching the model that in such a context the [MASK] tokens should be restored to the item type.
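
A minimal sketch of this loss computation (assuming bert-base-chinese; in Hugging Face's BertForMaskedLM, label positions set to -100 are ignored, so the CrossEntropy loss is computed only at the [MASK] positions):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

tokens = (
    [f"[unused{i}]" for i in range(1, 7)]
    + ["[CLS]", "[MASK]", "[MASK]"]
    + tokenizer.tokenize("这破笔记本速度太慢了,卡的不要不要的。")
    + ["[SEP]"]
)
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

label_ids = tokenizer.convert_tokens_to_ids(list("电脑"))     # gold label, one id per character

labels = torch.full_like(input_ids, -100)                     # -100 = ignored by the loss
mask_positions = (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
labels[0, mask_positions] = torch.tensor(label_ids)

loss = model(input_ids=input_ids, labels=labels).loss         # CrossEntropy over the two masks
loss.backward()                                               # then step the optimizer as usual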

1.1.4 Experiment

We selected 63 comments (covering 8 categories) as training data and evaluated classification on 417 comments; the model's F1 converges at around 76%. The results show that a prompt-based model can achieve fairly good performance even with very few training samples. Compared with the traditional classification approach, P-Tuning better alleviates overfitting on small-sample data and is therefore more robust.

Paper link: https://arxiv.org/pdf/2103.10385.pdf

2. PET (Pattern-Exploiting Training)

  • Environment installation
    This project is implemented with PyTorch + transformers; please install the required dependencies before running:
pip install -r ../../requirements.txt

2.1 Dataset preparation

2.1.1 Label data preparation

The project provides some sample data for predicting the item category of a user comment (a classification task); the data is located in data/comment_classify.

To train with custom data (自定义数据), simply build a dataset in the same format as the example data:

水果	什么苹果啊,都没有苹果味,怪怪的味道,而且一点都不甜,超级难吃!
书籍	为什么不认真的检查一下, 发这么一本脏脏的书给顾客呢!
酒店	性价比高的酒店,距离地铁近,邻华师大,环境好。
...

Each line is separated by a tab (\t); the first part is the label (标签) and the second part is the original input (原始输入).
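
A short parsing sketch for this tab-separated format (an illustrative helper, not the project's utils code):

def load_label_data(path: str):
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            label, text = line.split("\t", 1)      # label \t original input
            samples.append({"label": label, "text": text})
    return samples

train_samples = load_label_data("data/comment_classify/train.txt")
print(train_samples[0])    # e.g. {'label': '水果', 'text': '什么苹果啊,...'}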

2.1.2 Verbalizer preparation

Verbalizer is used to define the mapping between "real labels" and "label prediction words".

In some cases, predicting the "real label" directly at the [MASK] positions may not read fluently, so we map the "real label" to other words.

For example:

"日本爆冷2-1战胜德国"是一则[MASK][MASK]新闻。	体育

The label of this sentence is "体育" (sports), but the [MASK] tokens would be easier to predict if the target word were "足球" (soccer).

Therefore, we can construct several sub-labels for the label "体育" (sports). At inference time we only need to predict a sub-label and then map it back to the real label, as follows:

体育 -> 足球,篮球,网球,棒球,乒乓,体育
...

A portion of sample data is provided in the project at data/comment_classify/verbalizer.txt.

To train with custom data (自定义数据), simply build a dataset in the same format as the example data:

电脑	电脑
水果	水果
平板	平板
衣服	衣服
酒店	酒店
洗浴	洗浴
书籍	书籍
蒙牛	蒙牛
手机	手机

In the example, we use a 1-to-1 verbalizer. If you want to define a one-to-many mapping, simply separate the sub-labels with ',', e.g.:

...
水果	苹果,香蕉,橘子
...
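
A short sketch of loading such a verbalizer and mapping a predicted sub-label back to its main label (helper names are assumptions, not the project's actual API):

def load_verbalizer(path: str):
    main_to_subs, sub_to_main = {}, {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            main_label, subs = line.split("\t", 1)
            sub_labels = subs.split(",")
            main_to_subs[main_label] = sub_labels
            for sub in sub_labels:
                sub_to_main[sub] = main_label
    return main_to_subs, sub_to_main

main_to_subs, sub_to_main = load_verbalizer("data/comment_classify/verbalizer.txt")
print(sub_to_main.get("苹果"))   # -> '水果' with the one-to-many line above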

2.1.3 Prompt Settings

The prompt is a manually constructed template; a sample is provided in the project at data/comment_classify/prompt.txt:

这是一条{MASK}评论:{textA}

Here, the parts enclosed in curly braces are "custom parameters" whose values can be customized.

In the example, {MASK} marks the position of the [MASK] token, and {textA} marks the position of the comment text.
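
For intuition, here is a plain str.format() sketch of filling the template with the custom parameters (the project's convert_example() may handle this differently, e.g. expanding {MASK} to max_label_len mask tokens):

prompt = "这是一条{MASK}评论:{textA}"

inputs_dict = {
    "MASK": "[MASK]",
    "textA": "这破笔记本速度太慢了,卡的不要不要的。",
}

print(prompt.format(**inputs_dict))
# -> 这是一条[MASK]评论:这破笔记本速度太慢了,卡的不要不要的。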

You can change it to the template you want, for example, if you want to add a {textB} parameter:

{textA}{textB}{MASK}同的意思。

In addition to modifying the prompt file, you also need to modify the convert_example() function in utils.py so that inputs_dict assigns a value to each "custom parameter":

...
content = content[:max_seq_len-10]      # keep room so a trailing [MASK] is not truncated

inputs_dict = {                         # custom parameters passed to the prompt
    'textA': content,
    'MASK': '[MASK]',
    'textB': ...                        # assign a value to the corresponding custom field
}
...

2.2. Model Training

Modify the corresponding parameters in the train.sh training script to start model training:

python pet.py \
    --model "bert-base-chinese" \
    --train_path "data/comment_classify/train.txt" \
    --dev_path "data/comment_classify/dev.txt" \
    --save_dir "checkpoints/comment_classify/" \
    --img_log_dir "logs/comment_classify" \
    --img_log_name "BERT" \
    --verbalizer "data/comment_classify/verbalizer.txt" \       # verbalizer文件位置
    --prompt_file "data/comment_classify/prompt.txt" \          # prompt_file文件位置
    --batch_size 8 \
    --max_seq_len 256 \
    --valid_steps 40  \
    --logging_steps 5 \
    --num_train_epochs 200 \
    --max_label_len 2 \                                         # maximum sub-label length
    --rdrop_coef 5e-2 \
    --device "cuda:0"                                           # 指定使用GPU

After the training is started correctly, the terminal will print the following information:

...
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 63
    })
    dev: Dataset({
        features: ['text'],
        num_rows: 590
    })
})
Prompt is -> 这是一条{MASK}评论:{textA}
100%|████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.96ba/s]
100%|████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.55ba/s]
global step 5, epoch: 0, loss: 3.74432, speed: 2.67 step/s
global step 10, epoch: 1, loss: 3.06417, speed: 5.86 step/s
global step 15, epoch: 1, loss: 2.51641, speed: 5.73 step/s
global step 20, epoch: 2, loss: 2.12264, speed: 5.84 step/s
global step 25, epoch: 3, loss: 1.80121, speed: 5.82 step/s
global step 30, epoch: 3, loss: 1.52964, speed: 5.78 step/s
...

The training curve plot will be saved under the logs/sentiment_classification directory.

2.3. Model Prediction

After model training is complete, run inference.py to load the trained model and use it:

...
contents = [
    '地理环境不错,但对面一直在盖楼,门前街道上打车不方便。',
    '跟好朋友一起凑单买的,很划算,洗发露是樱花香的,挺好的。。。'
]                               # custom comments
res = inference(contents)       # predict the comment categories
...

Run the inference program:

python inference.py

The following inference results are obtained:

Prompt is -> 这是一条{MASK}评论:{textA}。
Used 0.47s.
inference label(s): ['酒店', '洗浴']

3. P-Tuning: Automatically Learning the Prompt Pattern

  • Environment installation
    This project is implemented with PyTorch + transformers; please install the required dependencies before running:
pip install -r ../../requirements.txt

torch
transformers==4.22.1
datasets==2.4.0
evaluate==0.2.2
matplotlib==3.6.0
rich==12.5.1
scikit-learn==1.1.2
requests==2.28.1

3.1 Dataset preparation

3.1.1 Label data preparation

The project provides some sample data for predicting the item category of a user comment (a classification task); the data is located in data/comment_classify.

To train with custom data (自定义数据), simply build a dataset in the same format as the example data:

水果	什么苹果啊,都没有苹果味,怪怪的味道,而且一点都不甜,超级难吃!
书籍	为什么不认真的检查一下, 发这么一本脏脏的书给顾客呢!
酒店	性价比高的酒店,距离地铁近,邻华师大,环境好。
...

Each line is separated by a tab (\t); the first part is the label (标签) and the second part is the original input (原始输入).

3.1.2 Verbalizer preparation

Verbalizer is used to define the mapping between "real labels" and "label prediction words".

In some cases, predicting the "real label" directly at the [MASK] positions may not read fluently, so we map the "real label" to other words.

For example:

"日本爆冷2-1战胜德国"是一则[MASK][MASK]新闻。	体育

The label of this sentence is "体育" (sports), but the [MASK] tokens would be easier to predict if the target word were "足球" (soccer).

Therefore, we can construct several sub-labels for the label "体育" (sports). At inference time we only need to predict a sub-label and then map it back to the real label, as follows:

体育 -> 足球,篮球,网球,棒球,乒乓,体育
...

A portion of sample data is provided in the project at data/comment_classify/verbalizer.txt.

To train with custom data (自定义数据), simply build a dataset in the same format as the example data:

电脑	电脑
水果	水果
平板	平板
衣服	衣服
酒店	酒店
洗浴	洗浴
书籍	书籍
蒙牛	蒙牛
手机	手机

In the example, we use a 1-to-1 verbalizer. If you want to define a one-to-many mapping, simply separate the sub-labels with ',', e.g.:

...
水果	苹果,香蕉,橘子
...

3.2 Model training

Modify the corresponding parameters in the train.sh training script to start model training:

python p_tuning.py \
    --model "bert-base-chinese" \               # backbone
    --train_path "data/comment_classify/train.txt" \
    --dev_path "data/comment_classify/dev.txt" \
    --verbalizer "data/comment_classify/verbalizer.txt" \ # verbalizer存放地址
    --save_dir "checkpoints/comment_classify/" \
    --img_log_dir "logs/comment_classify" \     # loss曲线图存放地址
    --img_log_name "BERT" \                     # loss曲线图文件名
    --batch_size 16 \
    --max_seq_len 128 \
    --valid_steps 20  \
    --logging_steps 5 \
    --num_train_epochs 50 \
    --max_label_len 2 \                         # maximum label length
    --p_embedding_num 15 \                      # number of prompt (p) tokens
    --device "cuda:0"                           # which GPU to use

After the training is started correctly, the terminal will print the following information:

...
global step 5, epoch: 1, loss: 6.50529, speed: 4.25 step/s
global step 10, epoch: 2, loss: 4.77712, speed: 6.36 step/s
global step 15, epoch: 3, loss: 3.55371, speed: 6.19 step/s
global step 20, epoch: 4, loss: 2.71686, speed: 6.38 step/s
Evaluation precision: 0.70000, recall: 0.69000, F1: 0.69000
best F1 performence has been updated: 0.00000 --> 0.69000
global step 25, epoch: 6, loss: 2.20488, speed: 6.21 step/s
global step 30, epoch: 7, loss: 1.84836, speed: 6.22 step/s
global step 35, epoch: 8, loss: 1.58520, speed: 6.22 step/s
global step 40, epoch: 9, loss: 1.38746, speed: 6.27 step/s
Evaluation precision: 0.75000, recall: 0.75000, F1: 0.75000
best F1 performence has been updated: 0.69000 --> 0.75000
global step 45, epoch: 11, loss: 1.23437, speed: 6.14 step/s
global step 50, epoch: 12, loss: 1.11103, speed: 6.16 step/s
...

The training curve plot will be saved under the logs/sentiment_classification directory.

3.3 Model Prediction

After model training is complete, run inference.py to load the trained model and use it:

...
contents = [
    "苹果卖相很好,而且很甜,很喜欢这个苹果,下次还会支持的",
    "这破笔记本速度太慢了,卡的不要不要的"
]   # custom comments
res = inference(contents)       # predict the comment categories
...

Run the inference program:

python inference.py

The following inference results are obtained:

inference label(s): ['水果', '电脑']

Reference link: https://github.com/HarderThenHarder/transformers_tasks/blob/main/prompt_tasks/p-tuning



Origin: blog.csdn.net/sinat_39620217/article/details/132416452