NLP study notes (1)

These are study notes: some records of what I have studied, my own plans, and some ideas...

1. Douban movie rating prediction (Greedy Academy, project 10)

1. Convert the text into vectors, using three methods: TF-IDF, word2vec, and BERT embeddings

2. Train logistic regression and naive Bayes models, with cross-validation

3. Evaluate the accuracy of the models

Code https://github.com/blockpanda/douban
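A minimal sketch of the TF-IDF + logistic regression / naive Bayes branch of this pipeline with scikit-learn; the file name and column names below are assumptions for illustration, not taken from the repo:

```python
# Sketch of step 1 (TF-IDF vectors) plus steps 2-3 (models, cross-validated accuracy).
# Assumed input: a CSV with a text column "comment" and a label column "rating";
# real Chinese comments would first need word segmentation (e.g. jieba).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("douban_comments.csv")            # hypothetical file name
X = TfidfVectorizer(max_features=20000).fit_transform(df["comment"])
y = df["rating"]

for name, model in [("logistic_regression", LogisticRegression(max_iter=1000)),
                    ("naive_bayes", MultinomialNB())]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(name, scores.mean())
```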

2. A simple Skip-Gram implementation

Reference: Understanding the Skip-Gram Model of Word2Vec

Word2Vec has two main variants, Skip-Gram and CBOW. Intuitively, Skip-Gram takes an input word and predicts its context, while CBOW takes the context and predicts the input word.

Simply put, the input is the center word and the outputs are the words near it; the probability of each candidate word is computed, and the final result is the one with the highest probability.

The first part builds the model; the second part obtains the embedded word vectors from the trained model.

1. Build the vocabulary  2. One-hot encode the words  3. Train  4. Output the embeddings

The output layer is a softmax classifier: each of its nodes outputs a value (a probability) between 0 and 1, and the probabilities of all output-layer nodes sum to 1.

Code: https://github.com/blockpanda/skipgram
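A minimal PyTorch sketch of the Skip-Gram idea described above (center word in, softmax over the vocabulary out); this is my own toy version, not the code from the linked repo:

```python
# Minimal Skip-Gram sketch: the center word predicts a nearby context word,
# and the output layer is a softmax over the whole vocabulary.
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 100):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_dim)  # center-word vectors
        self.out_proj = nn.Linear(embed_dim, vocab_size)     # one score per vocab word

    def forward(self, center_ids: torch.Tensor) -> torch.Tensor:
        return self.out_proj(self.in_embed(center_ids))      # logits; softmax is in the loss

# (center, context) training pairs come from sliding a window over the corpus.
model = SkipGram(vocab_size=5000)
loss_fn = nn.CrossEntropyLoss()                              # softmax + negative log-likelihood
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

center = torch.tensor([10, 42, 7])    # toy center-word indices
context = torch.tensor([11, 41, 8])   # toy context-word indices
optimizer.zero_grad()
loss_fn(model(center), context).backward()
optimizer.step()
# After training, model.in_embed.weight holds the learned word vectors.
```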

Ready to learn Prompt

3. Prompt

Question: how can a smaller pre-trained model be used so that the pre-trained language model's role as a language model is fully exploited for downstream tasks?

From 2017 to 2019, researchers' focus gradually shifted from traditional task-specific supervised models to pre-training. The usual research paradigm based on pre-trained language models is "pre-train, fine-tune": the PLM is applied to a downstream task, the training objectives in the pre-training and fine-tuning stages are designed around that task, and the PLM itself is adjusted.

The new paradigm that incorporates prompts can be roughly summarized as "pre-train, prompt, predict". In this paradigm, downstream tasks are recast into a form similar to the pre-training task. For example, a common pre-training task is masked language modeling. For a text sentiment classification task, given the input "I love this movie.", we can append a prompt of the form "The movie is ___", let the PLM fill in the blank with an emotion word such as "great" or "fantastic", and finally convert that answer into a sentiment label. In this way, by choosing an appropriate prompt we can control the model's prediction, so that a PLM trained entirely without supervision can be used to solve a variety of downstream tasks.
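As a toy illustration of this "pre-train, prompt, predict" idea, the sketch below uses the HuggingFace fill-mask pipeline; the model name and candidate answer words are my own choices, not taken from any particular paper.

```python
# Toy "pre-train, prompt, predict" example with a masked language model.
# bert-base-uncased and the answer words are illustrative choices.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
prompt = "I love this movie. The movie is [MASK]."

# Restrict the blank to candidate answer words, then map each answer to a label.
for pred in fill(prompt, targets=["great", "terrible"]):
    label = "positive" if pred["token_str"] == "great" else "negative"
    print(pred["token_str"], round(pred["score"], 4), "->", label)
```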

Prompts help the PLM "recall" what it "learned" during pre-training, which is roughly how the name Prompt came about.

Concretely, "prompting" is a technique that gives the pre-trained model hand-crafted cues so that it better understands human instructions, and can therefore be put to better use.

Prompting is more dependent on the prior, while fine-tuning is more dependent on the posterior.

1. OpenPrompt

General steps:

1. Define the task

2. Select the appropriate pre-trained language model

3. Define the template

4. Define the mapping (verbalizer) from answer words to labels

5. Load the data and train
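A minimal sketch of these five steps with OpenPrompt, closely following the style of its README; the exact signatures may differ between library versions, and the toy data, template text, and label words are my own choices:

```python
# OpenPrompt sketch of steps 1-5; based on the library's README-style API.
import torch
from openprompt import PromptDataLoader, PromptForClassification
from openprompt.data_utils import InputExample
from openprompt.plms import load_plm
from openprompt.prompts import ManualTemplate, ManualVerbalizer

# 1. Define the task: toy binary sentiment classification.
classes = ["negative", "positive"]
dataset = [InputExample(guid=0, text_a="I love this movie.", label=1)]

# 2. Select a pre-trained language model.
plm, tokenizer, model_config, WrapperClass = load_plm("bert", "bert-base-cased")

# 3. Define the template (a cloze like "The movie is ___").
template = ManualTemplate(text='{"placeholder":"text_a"} It was {"mask"}.',
                          tokenizer=tokenizer)

# 4. Define the mapping (verbalizer) from answer words to labels.
verbalizer = ManualVerbalizer(classes=classes,
                              label_words={"negative": ["terrible"], "positive": ["great"]},
                              tokenizer=tokenizer)

# 5. Load the data and predict (training would add a loss and an optimizer).
model = PromptForClassification(template=template, plm=plm, verbalizer=verbalizer)
loader = PromptDataLoader(dataset=dataset, template=template, tokenizer=tokenizer,
                          tokenizer_wrapper_class=WrapperClass)
model.eval()
with torch.no_grad():
    for batch in loader:
        logits = model(batch)
        print(classes[logits.argmax(dim=-1).item()])
```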

2. Related literature

(1)Schick T, Schütze H. Exploiting cloze questions for few shot text classification and natural language inference[J]. arXiv preprint arXiv:2001.07676, 2020.

 Didn't read this carefully.....

Code https://github.com/timoschick/pet#/

(2) P-tuning

Essence 1: automatically search for knowledge templates (prompts) in a continuous space;

Essence 2: train the knowledge templates, but do not fine-tune the language model.

(Describing the figure in the paper) Left: the blue "Britain" is the context, the red [MASK] is the target, and the orange "the capital of ... is ..." are the prompt tokens.

Right: some of those tokens are replaced with dense vectors that are trained directly, which amounts to learning the prompt in a continuous space and checking whether, given only a small number of training samples, the model predicts better.

Detailed description:

  1. Given a pre-trained language model M;
  2. A discrete input sequence x1:n = {x0, x1, ..., xn}, whose embeddings are {e(x0), e(x1), ..., e(xn)};
  3. A sequence of target tokens y that "participates" in the downstream task. For example, in BERT pre-training, x is the original, unmasked input sequence and y is the set of masked words (words that were not masked are not included); in sentence classification, x is the original input sentence and y is [CLS];
  4. The job of a prompt p is to organize the context x, the target sequence y, and the prompt itself into a template T. For example, when predicting the capital of a country, one template T is "The capital of Britain is [MASK].", where "The capital of ... is ..." is the prompt, "Britain" is the context, and [MASK] is the target;
  5. V is the vocabulary of the pre-trained language model M;
  6. [Pi] is the i-th prompt token in template T;
  7. Formally, a template T is written as T = {[P0:i], x, [Pi+1:m], y}, where every element is a token;
  8. After the embedding layer, the template T becomes {e([P0:i]), e(x), e([Pi+1:m]), e(y)};
  9. In P-tuning, instead of using discrete prompt tokens, each [Pi] is treated as a pseudo token, and the template T is mapped to {h0, ..., hi, e(x), hi+1, ..., hm, e(y)}, where the hi (0 ≤ i < m) are trainable embedding tensors (continuous tensors). The advantage of replacing discrete tokens with dense tensors is that the search for better continuous prompts can go beyond the expressive range of the original vocabulary V;
  10. Finally, based on the downstream loss function L (whose concrete form depends on the task), the continuous prompts hi (0 ≤ i < m) can be optimized by gradient-based (differentiable) optimization: ĥ0:m = argmin_h L(M(x, y)).

Literature: GPT Understands, Too

Open source code: https://github.com/THUDM/P-tuning
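A small PyTorch sketch of point 9 above: the pseudo tokens [Pi] become trainable embedding tensors hi while the PLM M stays frozen. The class and variable names are illustrative; the paper additionally passes the pseudo-token embeddings through an LSTM-based prompt encoder, which is omitted here, and the template is simplified to prompts prepended in front of x.

```python
# Sketch of P-tuning-style continuous prompts: trainable h_i replace discrete [P_i].
import torch
import torch.nn as nn

class ContinuousPrompt(nn.Module):
    def __init__(self, prompt_len: int, hidden: int):
        super().__init__()
        # h_0 ... h_{m-1}: trainable continuous prompt vectors.
        self.h = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)

    def forward(self, x_embeds: torch.Tensor) -> torch.Tensor:
        # Build a simplified template {h, e(x)}: prepend the prompts
        # to the context embeddings e(x) of shape (batch, seq, hidden).
        prompts = self.h.unsqueeze(0).expand(x_embeds.size(0), -1, -1)
        return torch.cat([prompts, x_embeds], dim=1)

# Only the prompt parameters receive gradients from the downstream loss L;
# the weights of the pre-trained model M are kept frozen.
prompt = ContinuousPrompt(prompt_len=5, hidden=768)
optimizer = torch.optim.Adam(prompt.parameters(), lr=1e-4)
```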

(3) P-tuning v2: all-vector prompts

These full-vector prompts can be spliced directly into the layers of the pre-trained model, and the model can then handle sequence tagging tasks (labeling every token in the input sequence).

Literature:  https://arxiv.org/pdf/2110.07602.pdf

Open source code: https://github.com/THUDM/P-tuni
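A rough sketch of the P-tuning v2 idea (trainable prompt vectors for every layer of the model rather than only at the input); the class below is purely illustrative and is not the authors' implementation:

```python
# Illustrative deep-prompt module: one table of trainable vectors per layer,
# to be spliced into each transformer layer rather than only the input embeddings.
import torch
import torch.nn as nn

class DeepPromptEncoder(nn.Module):
    def __init__(self, num_layers: int = 12, prompt_len: int = 16, hidden: int = 768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_layers, prompt_len, hidden) * 0.02)

    def forward(self, layer_idx: int, batch_size: int) -> torch.Tensor:
        # Prompt vectors for one layer, broadcast over the batch.
        return self.prompts[layer_idx].unsqueeze(0).expand(batch_size, -1, -1)
```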

(4) PPT: Pre-trained Prompt Tuning for Few-shot Learning

Once prompts take the vector form described above, results are good when the training set is large (full-data), but poor in the few-shot setting (small training set), because with so little data the prompt is hard to learn. What can be done? Since every NLP task already starts from a pre-trained model, can the prompt itself also be pre-trained and then fine-tuned? That is how the PPT (Pre-trained Prompt Tuning) model was born: pre-train the prompt on a large amount of unlabeled corpus first, and then fine-tune it on the downstream task.

 Literature: https://arxiv.org/pdf/2109.04332.pdf


Origin: blog.csdn.net/qq_39953312/article/details/127353646