Prompt learning: the basic knowledge you need and code for discrete prompts

I have been writing prompt learning code recently, so here is the code for discrete prompt learning together with the related background knowledge.
If you are already familiar with pre-trained language models, pre-training, fine-tuning, and prompt learning and just want the code, you can jump directly to Chapter 2.

1 Basic knowledge

This chapter tries to use the simplest and most intuitive language possible to explain what a pre-trained language model, pre-training, fine-tuning, and prompt learning are.

1.1 Pre-training

Pre-training (预训练) refers to using a pre-training technique (such as the Masked Language Model (MLM) objective) on a large corpus to train a model either from scratch (random initialization, or even all zeros) or from some starting point [1]. The core of this task is that the model learns the language knowledge you expect it to learn, rather than performing a downstream task such as classification on a particular dataset. Such a task usually outputs one or more losses and backpropagates them.
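For concreteness, here is a minimal sketch (not from this post's repo) of one MLM pre-training step with Hugging Face Transformers; the sentence, the masked position, and the use of the bert-base-uncased tokenizer are illustrative assumptions.

import torch
from transformers import BertConfig, BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# "From scratch": a randomly initialized BERT; a checkpoint could be loaded
# instead to continue from an existing starting point.
model = BertForMaskedLM(BertConfig())

inputs = tokenizer('prompt learning is easy to use .', return_tensors='pt')
labels = inputs['input_ids'].clone()

# Mask one token (position 4, chosen arbitrarily) and ask the model to recover it;
# positions we do not want to score are set to -100 so the loss ignores them.
masked_ids = inputs['input_ids'].clone()
masked_ids[0, 4] = tokenizer.mask_token_id
labels[masked_ids != tokenizer.mask_token_id] = -100

outputs = model(input_ids=masked_ids,
                attention_mask=inputs['attention_mask'],
                labels=labels)
outputs.loss.backward()  # the MLM loss is output and backpropagated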

1.2 Fine-tuning

Fine-tuning (微调) refers to using a model to perform a specified downstream task. In these tasks the language model has already been trained, and you can choose to let it participate in fine-tuning or keep it fixed (because the language model has learned a lot of knowledge during pre-training, it can perform well even when fixed). What the fine-tuning stage mainly needs to learn are the task-specific parameters of the downstream task, such as the fully connected layer that the [CLS] hidden state passes through in BertForSequenceClassification (discussed below).
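As a hedged illustration of "participate in fine-tuning or be fixed", the sketch below freezes the BERT backbone of BertForSequenceClassification so that only the task-specific classifier head receives gradients; the learning rate is an arbitrary placeholder.

import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                       num_labels=2)

# Fix the pre-trained language model; only the classifier head stays trainable
for param in model.bert.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)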

1.3 Pre-trained language model

A pre-trained language model (预训练语言模型, PTM) is a model obtained by training on a large-scale corpus, whose word vectors carry semantic, syntactic, and contextual knowledge. Since Word2Vec was proposed in 2013, many pre-trained language models based on neural networks have emerged.

  • Word2Vec: For details about Word2Vec [2], please see my previous blogs: Detailed derivation of the principles and formulas of Word2Vec, and Hierarchical Softmax and Negative Sampling of Word2Vec. Simply put, Word2Vec has only one hidden layer, and its Encoder and Decoder are both fully connected layers. Through a sliding window, word i predicts the n context words around it, or the n context words of word i predict word i. Through training, each word's vector comes to carry the knowledge of its context (the more easily a word appears around word i, the smaller its loss during training, and through backpropagation that word's information is more readily absorbed into word i's vector).
  • GloVe: For details about GloVe [3], please see my previous blog: Explanation of GloVe principles and formulas. The GloVe paper points out that Word2Vec's sliding window only gives word i its local context information and ignores global information, so GloVe also considers the global co-occurrence matrix, so that word i carries not only local context information but also global word co-occurrence information.

The output of pre-trained language models represented by Word2Vec and GloVe is pre-trained word embeddings, which can be fed into the embedding layer of the downstream task model. However, for the downstream model, only the parameters of the embedding layer are not randomly initialized; all remaining parameters still need to be trained from scratch.
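A minimal sketch of this "second paradigm" setup: the pre-trained vectors fill only the embedding layer, while the rest of the downstream model (an LSTM classifier here, purely as an example) is randomly initialized. pretrained_vectors stands in for a Word2Vec or GloVe matrix loaded elsewhere.

import numpy as np
import torch
import torch.nn as nn

vocab_size, embed_size, hidden_size, num_classes = 10000, 300, 128, 2
# Placeholder for real Word2Vec / GloVe vectors of shape (vocab_size, embed_size)
pretrained_vectors = np.random.randn(vocab_size, embed_size).astype('float32')

class DownstreamModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Only this layer starts from pre-trained weights
        self.embedding = nn.Embedding.from_pretrained(
            torch.from_numpy(pretrained_vectors), freeze=False)
        # Everything below is trained from scratch
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))
        return self.fc(out[:, -1, :])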

  • ELMo: For details about ELMo [4], please see my previous blog: BERT Study Notes (4) - Newbie version of ELMo and BERT. Since Word2Vec and GloVe only capture local context information (because of the sliding window), ELMo uses a bidirectional LSTM to capture all the information in a sentence: the forward and backward LSTMs capture word i's forward and backward information, and the downstream task adjusts the weights of the forward and backward information.

Although it may seem that ELMo lets word i acquire all the contextual information in the sentence, if you think carefully about the structure of the LSTM, how much long-distance information can survive after passing through tanh (the forget gate) so many times? So although the full name of LSTM is Long Short-Term Memory, it mostly extracts local information.

  • BERT: For details about BERT [5], please see my previous blog: BERT Study Notes (4) - Newbie version of ELMo and BERT. BERT uses the Transformer structure, where self-attention lets each word obtain information from the entire sentence; moreover, under self-attention the attention distance between word i and word j is 1 no matter how far apart they are in the sentence. Therefore BERT also adds positional information embeddings, so that self-attention between words still carries positional information.

Since BERT was proposed, pre-trained language models have entered the third paradigm (the first paradigm is rule-based and machine learning methods represented by one-hot and TF-IDF; the second paradigm is pre-trained word embeddings, represented by Word2Vec and GloVe, fed into the embedding layer of the downstream model). The third paradigm refers to a pre-trained language model (for example BERT) in which only a few parameters are fine-tuned, while its backbone can achieve good results on downstream tasks without any training. Take the figure from the original paper as an example:
(Figure: BERT downstream tasks)

  1. Text classification tasks: For single-sentence text classification (Figure b) and sentence-pair classification (Figure a), you only need to take the hidden state of BERT's [CLS] token ([CLS] is the special symbol at the first position of BERT's input; its hidden state can be understood as a sentence vector, i.e. [CLS] contains the information of the entire sentence), and then attach a fully connected layer after the [CLS] hidden state to perform text classification. This fully connected layer is the part that needs to be fine-tuned in the downstream task. We can also see this in the warning printed every time BertForSequenceClassification is instantiated, and in its source code:
# Warning printed when instantiating BertForSequenceClassification
Some weights of the model checkpoint at H:\huggingface\bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at H:\huggingface\bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

# Source code of the fully connected layer in BertForSequenceClassification
# __init__:
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
# forward: 
# outputs[1] here is the output of [CLS] after the Pooler layer of BertModel
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
  2. Sequence labeling tasks: For sequence labeling tasks such as QA (Figure c) and NER (Figure d), you only need to take the hidden state of each token and pass it through a fully connected layer to output logits for each position (analogous to an LSTM outputting a hidden state for each token, followed by a fully connected layer that outputs logits); a minimal sketch of this head follows below. Again, this fully connected layer is the part that is fine-tuned in the downstream task.
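Here is the sketch referred to above: a per-token classification head on top of BertModel, with illustrative names and an arbitrary number of tags. Only the linear layer is new; the backbone could be frozen or fine-tuned as discussed earlier.

import torch.nn as nn
from transformers import BertModel

num_tags = 9  # e.g. BIO tags for NER; purely illustrative
bert = BertModel.from_pretrained('bert-base-uncased')
classifier = nn.Linear(bert.config.hidden_size, num_tags)

def token_logits(input_ids, attention_mask):
    # shape: (batch_size, max_seq_length, hidden_size)
    hidden_states = bert(input_ids, attention_mask=attention_mask).last_hidden_state
    # shape: (batch_size, max_seq_length, num_tags)
    return classifier(hidden_states)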

For the third paradigm, the language model has already learned rich semantic, syntactic, and contextual information during pre-training, so the model can be kept fixed for downstream tasks. The number of parameters that need to be fine-tuned is basically $(embed\_size \times num\_outputs)$ or $(embed\_size \times max\_seq\_length \times num\_outputs)$.
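A quick sanity check of those counts for bert-base (hidden size 768), a binary output, and a hypothetical maximum sequence length of 128 (bias terms would add a few more parameters):

embed_size, max_seq_length, num_outputs = 768, 128, 2
print(embed_size * num_outputs)                   # 1536: a head on the [CLS] vector
print(embed_size * max_seq_length * num_outputs)  # 196608: a head over every position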

But the third paradigm also brings problems:

  1. Large models have been pre-trained on large corpora, and the language model itself already carries a lot of language knowledge. So, can we complete downstream tasks with the pre-trained language model alone, without fine-tuning?
  2. Models such as GPT-3 can only be called for a fee and cannot be fine-tuned, and "GPU refugees" with only a 2080Ti and 10 GB of video memory cannot fine-tune large models either. How can we still make use of their powerful text representations?

Hence prompt learning: no parameters need to be fine-tuned for the downstream task, and the downstream task can still be completed [6-7].

Note: readers who have just learned Word2Vec-style pre-trained language models may mistakenly think that "BERT embedding" refers to BERT's embedding layer, but this is not the case. In PyTorch, BERT's embedding layer is actually exactly the same as that of traditional models: a mapping between word ids and the weights of the embedding layer. Nevertheless, BERT (a Transformer-based model) is also called a contextualized word embedding. This is because "BERT embedding" refers to the hidden states obtained after the sentence passes through the 12 Encoder layers; that is BERT's word embedding process. As shown in the figure below, this is the cosine similarity of the BERT embeddings of 苹果 (apple) when it is passed in within different sentences.
(Figure: cosine similarity of BERT embeddings of 苹果 in different sentences)
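The figure itself is not reproduced here, but the effect is easy to check with a sketch like the one below (the sentences, the probed word, and bert-base-uncased are illustrative choices): the same word gets different hidden states in different sentences, so their cosine similarity is typically noticeably below 1.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

def word_vector(sentence, word):
    tokens = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        # (seq_len, hidden_size): contextualized embeddings after the 12 Encoder layers
        hidden = bert(**tokens).last_hidden_state[0]
    position = tokens['input_ids'][0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[position]

v1 = word_vector('i ate an apple this morning', 'apple')
v2 = word_vector('apple released a new phone yesterday', 'apple')
print(torch.cosine_similarity(v1, v2, dim=0).item())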

2 prompt learning

For details about prompt learning, please see my previous blog: What is prompt learning? Simple and intuitive understanding of prompt learning. Simply put, prompt learning expects the pre-trained language model to need no additional trained parameters in downstream tasks, solving the problem with the model itself alone.

Prompt learning has two key ingredients: the prompt template and the verbalizer.

  • prompt template: the prompt sentence; it must contain a [MASK] token, which is what the model is expected to output. For example, a sentiment analysis task can be constructed as: [X]. Totally, it was [MASK]. where [X] is the input text.
  • verbalizer: the words the model is expected to fill in the blank with, and their corresponding labels. Taking the sentence above as an example, we expect the model to answer a single-choice question, choosing between 'good' and 'bad', and we also need to specify that 'good' represents the positive label and 'bad' the negative one. (A minimal sketch of wrapping an input with the template follows this list.)
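As promised above, a minimal sketch of wrapping raw inputs with the discrete template; the template string mirrors the example above and the review text is made up.

template = '{text} Totally, it was [MASK].'
verbalizer = {'good': 1, 'bad': 0}  # word -> label

def wrap(text):
    return template.format(text=text)

print(wrap('This movie kept me on the edge of my seat.'))
# This movie kept me on the edge of my seat. Totally, it was [MASK].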

Both of the above have a lot of research behind them, but this article only considers a manually constructed discrete prompt template (i.e. the text is set by hand rather than learned by the model) and a fixed verbalizer range. We use pre-trained BERT to perform sentiment classification on IMDB.

The data and code have been packaged on github, the link is: https://github.com/Balding-Lee/prompt-learning , plug and play.

The data processing and batch-packaging parts are not described in detail here; see GitHub for details. Only the prediction part is explained in detail. Since, as stated before, prompt learning can run on downstream tasks without fine-tuning, we are given:

prefix = 'Totally, it was [MASK].'
verbalizer = {
    'good': 1,
    'fascinating': 1,
    'perfect': 1,
    'bad': 0,
    'horrible': 0,
    'terrible': 0,
}
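The prediction code below also uses verbalizer_ids and index2ids, which are built in the repo; one plausible construction from the tokenizer is sketched here (assumption: every verbalizer word is a single token in the bert-base-uncased vocabulary, which holds for the six words above).

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Vocabulary ids of the verbalizer words, in a fixed order
# (`verbalizer` is the dict defined above)
verbalizer_ids = tokenizer.convert_tokens_to_ids(list(verbalizer.keys()))
# Map a column index inside verbalizer_logits back to the vocabulary id
index2ids = {index: token_id for index, token_id in enumerate(verbalizer_ids)}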

The prediction code is as follows:

with torch.no_grad():
    predict_all = np.array([], dtype=int)
    labels_all = np.array([], dtype=int)
    pper = pyprind.ProgPercent(batch_count)
    for i in range(batch_count):
        inputs = batch_X[i]
        labels = batch_y[i]

        tokens = tokenizer.batch_encode_plus(inputs, add_special_tokens=True,
                                             max_length=config.max_seq_length,
                                             padding='max_length', truncation=True)
        ids = torch.tensor(tokens['input_ids']).to(device)
        attention_mask = torch.tensor(tokens['attention_mask']).to(device)

        # shape: (batch_size, max_seq_length, vocab_size)
        logits = bert(ids, attention_mask=attention_mask).logits

        # mask_token_index[0]: the index of the i-th example in the batch
        # mask_token_index[1]: the position of [MASK] in the sequence of the i-th example
        mask_token_index = (ids == tokenizer.mask_token_id).nonzero(as_tuple=True)

        # Get the logits at the [MASK] position
        # shape: (batch_size, vocab_size)
        masked_logits = logits[mask_token_index[0], mask_token_index[1], :]

        # Extract the logits of the verbalizer words at the [MASK] position
        # shape: (batch_size, verbalizer_size)
        verbalizer_logits = masked_logits[:, verbalizer_ids]

        # Build a pseudo distribution from the logits of the verbalizer words
        pseudo_distribution = softmax(verbalizer_logits)

        # Find the index with the largest probability in the pseudo distribution
        pred_indices = pseudo_distribution.argmax(axis=-1).tolist()
        # Convert the index to the word id
        pred_ids = [index2ids[index] for index in pred_indices]
        # Convert the id to the token
        pred_tokens = tokenizer.convert_ids_to_tokens(pred_ids)
        # Find the label corresponding to the token
        pred_labels = [config.verbalizer[token] for token in pred_tokens]

        predict_all = np.append(predict_all, pred_labels)
        labels_all = np.append(labels_all, labels)

        pper.update()

    acc = accuracy_score(labels_all, predict_all)
    p = precision_score(labels_all, predict_all)
    r = recall_score(labels_all, predict_all)
    f1 = f1_score(labels_all, predict_all)

    print('accuracy: %f | precision: %f | recall: %f | f1: %f' % (acc, p, r, f1))

The experimental results are:

accuracy: 0.616440 | precision: 0.568737 | recall: 0.963440 | f1: 0.715249

The following points need to be explained here:

  1. mask_token_index is a tuple, where [0] gives the index of each example. For example, if the batch size is 4, then mask_token_index[0] = [0, 1, 2, 3], corresponding to the 1st, 2nd, 3rd, and 4th examples. And [1] gives the position of [MASK] in each example; if, say, mask_token_index[1] = [4, 16, 8, 32], then the 5th position of the 1st example, the 17th position of the 2nd example, the 9th position of the 3rd example, and the 33rd position of the 4th example are [MASK].
  2. The shape of the logits of BertForMaskedLM is (batch_size, max_seq_length, vocab_size), i.e. for every position in the batch, the logits of that token's hidden state over the vocabulary. We only need the logits of the verbalizer words at the [MASK] position; we apply softmax to these logits to build a pseudo distribution over the verbalizer words, and the word with the highest score in this pseudo distribution is the predicted word.
  3. Once we have the predicted word, we go back to the verbalizer to find the label corresponding to this word, which gives us the predicted label of prompt learning. A small worked example follows this list.
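The worked example mentioned above, with made-up logits for a batch of two [MASK] positions:

import torch

tokens = ['good', 'fascinating', 'perfect', 'bad', 'horrible', 'terrible']
verbalizer = {'good': 1, 'fascinating': 1, 'perfect': 1,
              'bad': 0, 'horrible': 0, 'terrible': 0}

# Hypothetical logits of the verbalizer words at the [MASK] position
verbalizer_logits = torch.tensor([[2.0, 1.5, 1.0, 0.1, 0.2, 0.3],
                                  [0.1, 0.3, 0.2, 2.5, 1.0, 1.2]])

pseudo_distribution = torch.softmax(verbalizer_logits, dim=-1)
pred_indices = pseudo_distribution.argmax(dim=-1).tolist()   # [0, 3] -> 'good', 'bad'
pred_labels = [verbalizer[tokens[i]] for i in pred_indices]  # [1, 0]
print(pred_labels)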

reference

[1] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, Qun Liu. ERNIE: Enhanced Language Representation with Informative Entities[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 1441-1451.
[2] Tomas Mikolov, Kai Chen, Greg Corrado, et al. Efficient Estimation of Word Representations in Vector Space[C]//ICLR (Workshop Poster), 2013.
[3] Jeffrey Pennington, Richard Socher, Christopher D. Manning. Glove: Global Vectors for Word Representation[C]// Conference on Empirical Methods in Natural Language Processing, 2014.
[4] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, Russell Power. Semi-supervised sequence tagging with bidirectional language models[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, 2019.
[6] Pengfei Liu, Weizhe Yuan, Jinlan Fu, et al. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing[J]. CoRR, 2021, abs/2107.13586.
[7] 刘鹏飞 (Pengfei Liu). The "Fourth Paradigm" in the development of modern natural language processing[EB/OL]. (2021-08-01)[2022-01-03]. https://zhuanlan.zhihu.com/p/395115779
