I have been writing prompt learning code recently, so this post shares some code and background knowledge on discrete prompt learning.
If you are already familiar with pre-trained language models, pre-training, fine-tuning, and prompt learning and just want the code, you can jump directly to Chapter 2.
1 Basic knowledge
This chapter tries to explain, in the simplest and most intuitive language, what pre-trained language models, pre-training, fine-tuning, and prompt learning are.
1.1 Pre-training
Pre-training refers to using a pre-training objective (such as masked language modeling, MLM) on a large corpus to train a model either from scratch (with random or even all-zero initialization) or from an existing starting point [1]. The core of this stage is that the model learns the language knowledge you expect it to learn, rather than performing a downstream task such as classification on a particular dataset. A pre-training task usually produces one or more losses, which are then backpropagated.
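To make the masked language modeling objective concrete, here is a minimal pure-Python sketch of how a training example might be built: some tokens are replaced by [MASK], and the model is asked to recover them. The function name `mask_tokens`, the toy sentence, and the masking rates are illustrative assumptions; BERT's actual procedure also sometimes keeps the selected token or swaps in a random one.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token at masked positions and None
    elsewhere, meaning no loss is computed for that position.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "the movie was surprisingly good and i enjoyed it".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The loss would be computed only at the positions where `labels` is not `None`, which is what lets the model learn from raw text without any task labels.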
1.2 Fine-tuning
Fine-tuning refers to using an already pre-trained model to perform a specified downstream task. In such a task the language model itself can either take part in the training or be kept fixed (because it has learned a lot of knowledge during pre-training, it can perform well even when fixed). What mainly needs to be learned in the fine-tuning stage are the task-specific parameters of the downstream task, such as the fully connected layer that the [CLS] representation passes through in BertForSequenceClassification (discussed below).
1.3 Pre-trained language model
A pre-trained language model (PTM) is a model that, after training on a large-scale corpus, carries knowledge such as semantics, syntax, and context, for example in the form of word vectors. Since Word2Vec was proposed in 2013, many neural-network-based pre-trained language models have emerged.
- Word2Vec: For details about Word2Vec [2], please see my previous blogs: Detailed derivation of the principles and formulas of Word2Vec, and Hierarchical Softmax and Negative Sampling of Word2Vec. Simply put, Word2Vec is an Encoder plus a Decoder with only one hidden layer, where the Encoder and the Decoder are both fully connected layers. Using a sliding window, either word $i$ is used to predict the $n$ context words around it, or the $n$ context words are used to predict word $i$. Through training, the word vector of each word comes to carry knowledge of its context: the words that appear around word $i$ more often incur smaller losses during training, and through backpropagation their information is absorbed more strongly into the word vector of word $i$.
- GloVe: For details about GloVe [3], please see my previous blog: Explanation of GloVe principles and formulas. The GloVe paper points out that the sliding window in Word2Vec only lets word $i$ obtain its local context information and ignores global information, so GloVe also takes the global co-occurrence matrix into account. Word $i$ then carries global word co-occurrence information in addition to local context information.
The output of pre-trained language models such as Word2Vec and GloVe is a set of pre-trained word embeddings, which can be loaded into the embedding layer of the downstream task model. In the downstream model, however, only the parameters of the embedding layer start from pre-trained values; all the remaining parameters are randomly initialized and must be trained from scratch.
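The difference between Word2Vec's local sliding window and GloVe's global co-occurrence statistics can be sketched in a few lines of Python. This is only an illustration on a made-up toy corpus; the helper names `skipgram_pairs` and `cooccurrence_counts` and the window size are assumptions, and real GloVe additionally weights co-occurrences by distance.

```python
from collections import Counter

def skipgram_pairs(tokens, window=2):
    """(center, context) pairs seen through a sliding window (Word2Vec-style)."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cooccurrence_counts(corpus, window=2):
    """Pair counts accumulated over the whole corpus (GloVe-style statistics)."""
    counts = Counter()
    for sentence in corpus:
        counts.update(skipgram_pairs(sentence, window))
    return counts

corpus = [
    "i like this movie".split(),
    "i like this book".split(),
]
pairs = skipgram_pairs(corpus[0], window=1)
counts = cooccurrence_counts(corpus, window=1)
print(pairs)
print(counts[("like", "this")], counts[("this", "movie")])
```

Word2Vec trains on the individual pairs one window at a time, whereas GloVe fits word vectors to the aggregated counts, which is where the global information comes from.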
- ELMo: For details about ELMo [4], please see my previous blog: BERT Study Notes (4) - Newbie version of ELMo and BERT. Since Word2Vec and GloVe both capture local context information (because of the sliding window), ELMo uses a bidirectional LSTM in order to capture the information of the whole sentence: the forward and backward LSTMs capture the forward and backward information of word $i$, and the weights of the forward and backward information are adjusted in the downstream task.
  Although ELMo seems able to give word $i$ all of the preceding and following information in the sentence, think carefully about the structure of an LSTM: how much long-distance information can be retained after passing through tanh and the forget gate many times? So although the full name of LSTM is Long Short-Term Memory, in practice it mostly extracts local information.
- BERT: For details about BERT [5], please see my previous blog: BERT Study Notes (4) - Newbie version of ELMo and BERT. BERT uses the Transformer structure, and through self-attention each word can obtain information from the entire sentence. However, self-attention means that no matter how far apart words $i$ and $j$ are in the sentence, the distance between them under attention is 1. Therefore, BERT also adds a positional embedding, so that positional information between words is available during self-attention.
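Why self-attention needs positional information can be checked with a small experiment. The sketch below is a deliberately simplified single-head attention in pure Python (identity query/key/value projections and toy 2-dimensional vectors, all assumptions made for illustration): shuffling the input words merely shuffles the outputs, so without positional embeddings the layer cannot tell one word order from another.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention where Q = K = V = the input vectors."""
    dim = len(vectors[0])
    outputs = []
    for q in vectors:
        # Every word scores every other word directly, whatever the distance.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors))
                        for d in range(dim)])
    return outputs

def same(u, v):
    return all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(u, v))

sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shuffled = [sentence[2], sentence[0], sentence[1]]

out = self_attention(sentence)
out_shuffled = self_attention(shuffled)

# Each word gets the same output vector regardless of where it sits.
print(same(out_shuffled[0], out[2]),
      same(out_shuffled[1], out[0]),
      same(out_shuffled[2], out[1]))
```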
Since BERT was proposed, pre-trained language models have entered the third paradigm. (The first paradigm is machine learning and rule-based methods, represented by one-hot and TF-IDF models; the second paradigm is pre-trained word embeddings, represented by Word2Vec and GloVe, which are fed into the embedding layer of the downstream model.) The third paradigm refers to a pre-trained language model (for example BERT) for which only some parameters are fine-tuned, and whose backbone can achieve good results in downstream tasks without any further training. Take the figure in the original paper as an example:
- Text classification tasks: For single-sentence classification (Figure b) and sentence-pair classification (Figure a), you only need to obtain the hidden state of BERT's [CLS] token and attach a fully connected layer after it to perform classification. ([CLS] is the special symbol at the first position of BERT's input; its hidden state can be understood as a sentence vector that contains the information of the entire sentence.) This fully connected layer is the part that needs to be fine-tuned in the downstream task. We can see this both in the message printed whenever BertForSequenceClassification is instantiated and in its source code:
# Message printed when instantiating BertForSequenceClassification
Some weights of the model checkpoint at H:\huggingface\bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at H:\huggingface\bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
# Source code of BertForSequenceClassification related to the fully connected layer
# __init__:
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
# forward:
# outputs[1] is the output of [CLS] after the Pooler layer of BertModel
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
- Sequence labeling tasks: For sequence labeling tasks such as QA (Figure c) and NER (Figure d), you only need to obtain the hidden state of each token and pass it through a fully connected layer to produce an output for each position. (This is analogous to an LSTM that outputs the hidden state of each token, followed by a fully connected layer that outputs the logits.) This fully connected layer is the part that is fine-tuned in the downstream task.
For the third paradigm, the language model has already learned rich semantic, syntactic, and contextual information during pre-training, so it can be kept fixed when doing downstream tasks. The number of parameters that need to be fine-tuned then ranges roughly from $(embed\_size \times num\_outputs)$ to $(embed\_size \times max\_seq\_length \times num\_outputs)$.
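As a rough worked example of these two bounds, the snippet below plugs in some numbers. The hidden size of 768 matches bert-base, while the two output labels and the sequence length of 512 are assumed values chosen for illustration; bias terms are ignored, as in the expressions above.

```python
def head_params(embed_size, num_outputs, max_seq_length=1):
    """Weights of a fully connected head over max_seq_length
    concatenated hidden states (bias terms ignored)."""
    return embed_size * max_seq_length * num_outputs

embed_size, num_outputs, max_seq_length = 768, 2, 512

lower = head_params(embed_size, num_outputs)                  # [CLS] only
upper = head_params(embed_size, num_outputs, max_seq_length)  # every position
print(lower, upper)  # 1536 786432
```

Either way, the head is tiny next to the roughly 110 million parameters of bert-base itself.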
But the third paradigm also raises some questions:
- The large model has been pre-trained on a large corpus and already carries a lot of language knowledge. So can we complete downstream tasks using just the pre-trained language model, without fine-tuning?
- Models such as GPT-3 can only be called for a fee and cannot be fine-tuned, and those of us stuck with nothing more than a 2080Ti and its roughly 10 GB of video memory cannot fine-tune large models either. How can we still make use of their powerful text representations?
This is where prompt learning comes in: the downstream task can be completed without fine-tuning any parameters [6-7].
Note:
Readers who have just finished learning pre-trained language models such as Word2Vec may mistakenly think that "BERT embedding" refers to BERT's embedding layer, but this is not the case. In torch, BERT's embedding layer is exactly the same as the embedding layer of a traditional model: a mapping from a word id to a row of weights in the embedding layer. The embeddings of BERT and other Transformer-based models, however, are also called contextualized word embeddings. This is because "BERT embedding" refers to the hidden states obtained after the sentence has passed through the 12 Encoder layers; that whole pass is BERT's word-embedding process. The figure below shows the cosine similarity between the BERT embeddings of 苹果 ("apple") obtained from different input sentences.
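Cosine similarity between two embedding vectors is simple to compute. The sketch below uses made-up 3-dimensional vectors standing in for the contextual embeddings of 苹果 in one sentence about fruit and two sentences about the company; the numbers are invented for illustration and are not real BERT outputs.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical contextual embeddings of the same word in three sentences.
apple_fruit = [0.9, 0.1, 0.2]
apple_company_1 = [0.1, 0.8, 0.5]
apple_company_2 = [0.2, 0.9, 0.4]

sim_same_sense = cosine_similarity(apple_company_1, apple_company_2)
sim_diff_sense = cosine_similarity(apple_fruit, apple_company_1)
print(sim_same_sense, sim_diff_sense)
```

With a static embedding layer the three vectors would be identical, because the word id is the same; with contextual embeddings the same word can end up close to or far from itself depending on the sentence.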
2 prompt learning
For details about prompt learning, please see my previous blog: What is prompt learning? Simple and intuitive understanding of prompt learning. Simply put, prompt learning expects the pre-trained language model to solve the downstream task by itself, without any additional trainable parameters.
Prompt learning has two key components: the prompt template and the verbalizer.
- prompt template: The prompt sentence. It must contain a [MASK] token, which marks what the model is expected to output. For example, a sentiment analysis task can be constructed as: [X]. Totally, it was [MASK]. Here [X] is the input text.
- verbalizer: The words the model is expected to fill into the blank, together with their corresponding labels. Taking the sentence above as an example, we expect the model to answer a single-choice question, choosing between 'good' and 'bad', and we also need to specify that 'good' represents positive and 'bad' represents negative.
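The two pieces can be sketched in plain Python as follows. `TEMPLATE`, `VERBALIZER`, and `build_prompt` are hypothetical names used only for this illustration and are not taken from the repository's code.

```python
TEMPLATE = "[X]. Totally, it was [MASK]."

# verbalizer: label word -> class label (1 = positive, 0 = negative)
VERBALIZER = {"good": 1, "bad": 0}

def build_prompt(text, template=TEMPLATE):
    """Insert the input text into the [X] slot of the template."""
    # Drop a trailing period so the template does not produce "..".
    return template.replace("[X]", text.rstrip("."))

prompt = build_prompt("I watched this film twice")
print(prompt)

# If the model fills [MASK] with 'good', the verbalizer maps it to label 1.
predicted_word = "good"
print(VERBALIZER[predicted_word])
```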
Both of these are active research topics, but this article focuses only on manually constructed discrete prompt templates (that is, writing the prompt text by hand instead of letting the model learn it) together with a given verbalizer vocabulary. We want to use a pre-trained BERT to perform sentiment classification on the IMDB dataset.
The data and code have been packaged on GitHub and are ready to use: https://github.com/Balding-Lee/prompt-learning
The data processing and batching code will not be described here; details can be found on GitHub. Only the prediction part is explained in detail. As mentioned earlier, prompt learning can run on a downstream task without fine-tuning, so we are given:
prefix = 'Totally, it was [MASK].'
# label word -> class label (1 = positive, 0 = negative)
verbalizer = {
    'good': 1,
    'fascinating': 1,
    'perfect': 1,
    'bad': 0,
    'horrible': 0,
    'terrible': 0,
}
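The prediction code relies on two lookup structures, `verbalizer_ids` and `index2ids`, whose construction is not shown in this post. One plausible way to build them is sketched here, with a tiny made-up vocabulary standing in for the tokenizer; the ids are invented and the repository may well do this differently.

```python
# verbalizer as given above: label word -> class label
verbalizer = {
    'good': 1, 'fascinating': 1, 'perfect': 1,
    'bad': 0, 'horrible': 0, 'terrible': 0,
}

# Toy stand-in for the tokenizer vocabulary (ids are made up).
toy_vocab = {'good': 11, 'fascinating': 12, 'perfect': 13,
             'bad': 21, 'horrible': 22, 'terrible': 23}

# Every label word must be a single token in the vocabulary, otherwise it
# cannot be read off a single [MASK] position.
missing = [w for w in verbalizer if w not in toy_vocab]
assert not missing, f"not single tokens: {missing}"

# Column ids used to slice the [MASK] logits, in verbalizer order.
verbalizer_ids = [toy_vocab[w] for w in verbalizer]
# Position in the sliced logits -> vocabulary id.
index2ids = {i: token_id for i, token_id in enumerate(verbalizer_ids)}
# Vocabulary id -> token, mimicking tokenizer.convert_ids_to_tokens.
id2token = {token_id: w for w, token_id in toy_vocab.items()}

print(verbalizer_ids)
print(index2ids[3], id2token[index2ids[3]])
```

With a real Hugging Face tokenizer, `tokenizer.convert_tokens_to_ids(list(verbalizer))` would play the role of the toy vocabulary lookup.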
The prediction code is as follows:
import numpy as np
import pyprind
import torch
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# bert, tokenizer, config, device, batch_X, batch_y, batch_count,
# verbalizer_ids, index2ids and softmax are defined elsewhere in the repository.
with torch.no_grad():
    predict_all = np.array([], dtype=int)
    labels_all = np.array([], dtype=int)
    pper = pyprind.ProgPercent(batch_count)
    for i in range(batch_count):
        inputs = batch_X[i]
        labels = batch_y[i]
        tokens = tokenizer.batch_encode_plus(inputs, add_special_tokens=True,
                                             max_length=config.max_seq_length,
                                             padding='max_length', truncation=True)
        ids = torch.tensor(tokens['input_ids']).to(device)
        attention_mask = torch.tensor(tokens['attention_mask']).to(device)

        # shape: (batch_size, max_seq_length, vocab_size)
        logits = bert(ids, attention_mask=attention_mask).logits

        # mask_token_index[0]: index of each piece of data in the batch
        # mask_token_index[1]: position of [MASK] within that piece of data
        mask_token_index = (ids == tokenizer.mask_token_id).nonzero(as_tuple=True)

        # Logits at the [MASK] positions
        # shape: (batch_size, vocab_size)
        masked_logits = logits[mask_token_index[0], mask_token_index[1], :]

        # Keep only the logits of the verbalizer words at the [MASK] position
        # shape: (batch_size, verbalizer_size)
        verbalizer_logits = masked_logits[:, verbalizer_ids]

        # Turn the verbalizer logits into a pseudo distribution
        pseudo_distribution = softmax(verbalizer_logits)

        # Index of the highest probability in the pseudo distribution
        pred_indices = pseudo_distribution.argmax(axis=-1).tolist()
        # Convert the index to a vocabulary id
        pred_ids = [index2ids[index] for index in pred_indices]
        # Convert the id to a token
        pred_tokens = tokenizer.convert_ids_to_tokens(pred_ids)
        # Look up the label that corresponds to the token
        pred_labels = [config.verbalizer[token] for token in pred_tokens]

        predict_all = np.append(predict_all, pred_labels)
        labels_all = np.append(labels_all, labels)
        pper.update()

acc = accuracy_score(labels_all, predict_all)
p = precision_score(labels_all, predict_all)
r = recall_score(labels_all, predict_all)
f1 = f1_score(labels_all, predict_all)
print('accuracy: %f | precision: %f | recall: %f | f1: %f' % (acc, p, r, f1))
The experimental results are:
accuracy: 0.616440 | precision: 0.568737 | recall: 0.963440 | f1: 0.715249
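As a quick sanity check on these numbers, F1 is the harmonic mean of precision and recall, so the reported value can be recomputed from the other two:

```python
precision, recall = 0.568737, 0.963440

# F1 = 2PR / (P + R)
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7152
```

Recall is far higher than precision, which suggests that without any tuning the model leans heavily toward the positive label words.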
The following points need to be explained here:
- mask_token_index is a tuple. Element [0] holds the index $i$ of each piece of data in the batch; for example, with a batch size of 4, ${\rm mask\_token\_index}[0]=[0, 1, 2, 3]$, corresponding to the first, second, third, and fourth pieces of data. Element [1] holds the position of [MASK] in the $i$-th piece of data; assuming ${\rm mask\_token\_index}[1]=[4, 16, 8, 32]$, the 5th position of the 1st piece of data, the 17th position of the 2nd, the 9th position of the 3rd, and the 33rd position of the 4th are [MASK].
- The logits output by BertForMaskedLM have shape $(batch\_size, max\_seq\_length, vocab\_size)$, that is, for every token in the batch, the logits over the vocabulary computed from its hidden state. We only need the logits at the [MASK] position, and within those only the logits of the words in the verbalizer. We apply softmax to these logits to build a pseudo distribution; the word with the highest score in this pseudo distribution is the predicted word.
- Once we have the predicted word, we look it up in the verbalizer to find its label, which gives us the label predicted by prompt learning.
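These three steps can be traced on a tiny hand-made example. The sketch below uses plain Python lists instead of tensors, a made-up vocabulary of six ids, and invented logits, so every name and number in it is an assumption for illustration only.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

MASK_ID = 5                       # toy id of [MASK]
id2token = {3: 'good', 4: 'bad'}  # toy vocabulary entries for the label words
verbalizer = {'good': 1, 'bad': 0}
verbalizer_ids = [3, 4]

# Batch of 2 sequences of length 4; each contains exactly one [MASK].
ids = [[0, 1, 5, 2],
       [0, 5, 1, 2]]

# logits[b][t][v]: score of vocabulary id v at position t of sequence b.
# Only the rows at [MASK] positions matter here; the rest are zeros.
zeros = [0.0] * 6
logits = [
    [zeros, zeros, [0.1, 0.0, 0.2, 2.5, 0.3, 0.0], zeros],
    [zeros, [0.0, 0.4, 0.1, 0.2, 1.9, 0.0], zeros, zeros],
]

# Plain-Python equivalent of (ids == mask_token_id).nonzero(as_tuple=True).
rows, cols = [], []
for b, seq in enumerate(ids):
    for t, token_id in enumerate(seq):
        if token_id == MASK_ID:
            rows.append(b)
            cols.append(t)

pred_labels = []
for b, t in zip(rows, cols):
    masked_logits = logits[b][t]                                    # (vocab_size,)
    verbalizer_logits = [masked_logits[v] for v in verbalizer_ids]  # (verbalizer_size,)
    pseudo_distribution = softmax(verbalizer_logits)
    best = pseudo_distribution.index(max(pseudo_distribution))
    pred_token = id2token[verbalizer_ids[best]]
    pred_labels.append(verbalizer[pred_token])

print(rows, cols)   # [0, 1] [2, 1]
print(pred_labels)  # [1, 0]
```

Because softmax is monotonic, taking the argmax of the raw verbalizer logits would select the same word; the pseudo distribution mainly helps when the probabilities themselves are of interest.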
References
[1] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, Qun Liu. ERNIE: Enhanced Language Representation with Informative Entities[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 1441-1451.
[2] Tomas Mikolov, Kai Chen, Greg Corrado, et al. Efficient Estimation of Word Representations in Vector Space[C]//ICLR (Workshop Poster), 2013.
[3] Jeffrey Pennington, Richard Socher, Christopher D. Manning. GloVe: Global Vectors for Word Representation[C]//Conference on Empirical Methods in Natural Language Processing, 2014.
[4] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, Russell Power. Semi-supervised sequence tagging with bidirectional language models[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, 2019.
[6] Pengfei Liu, Weizhe Yuan, Jinlan Fu, et al. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing[J]. CoRR, 2021, abs/2107.13586.
[7] Pengfei Liu (刘鹏飞). 近代自然语言处理技术发展的“第四范式” [The "Fourth Paradigm" in the Development of Modern Natural Language Processing Technology][EB/OL]. (2021-08-01)[2022-01-03]. https://zhuanlan.zhihu.com/p/395115779