[Overview of 100 large models] Anthropic LLM (Anthropic)

Author: Wang Jianing. This article is original content. Repository: https://github.com/wjn1996/LLMs-NLP-Algo

Subscribe to the column [Large Model & NLP & Algorithm] to get all the NLP, large-model, and algorithm materials the blogger has accumulated over the years: nearly 200 papers, 300 markdown notes written by the blogger, and nearly 100 large-model data cards, to help with NLP research, study, and job hunting.


Anthropic LLM basic information data card

  • Serial number: 32
  • Model name: Anthropic LM
  • Attribution: Anthropic
  • Launch time: 2022-04
  • Scale: 52B parameters
  • Training data: three comparison datasets. Core base dataset: collected from the base language model (a context-distilled LM with 52 billion parameters), containing 44,000 helpfulness pairs and 42,000 harmlessness pairs. RS dataset: collected via rejection sampling against a preference model trained on the base dataset (52 billion parameters), containing 52,000 helpfulness pairs and 2,000 harmlessness pairs. Online dataset: collected from the RLHF models, updated weekly for 5 weeks, containing 22,000 helpfulness pairs and no harmlessness data.
  • Model and training method: see the "Overall process" section below.
  • Open source: HH-RLHF comparison data
  • Paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (links below)
  • Relevant information: introduction to Anthropic LM (this article)

Paper title: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Paper link: https://arxiv.org/pdf/2204.05862.pdf
Paper data address: https://github.com/anthropics/hh-rlhf


The LLM series mainly covers large language models, including GPT-1, GPT-2, GPT-3, Codex, InstructGPT, Anthropic LLM, ChatGPT, LIMA, RWKV, and other papers or technical reports. This article covers the Anthropic LLM paper.

Introduction

This paper is very similar to InstructGPT and was published around the same time, so the two can be read together.
The paper collects human preference data and uses a preference model (PM) plus reinforcement learning (PPO/RLHF) to train the language model, making it both relatively helpful and relatively harmless.
The RLHF-trained models are better than the original models on most NLP tasks, as shown in the figure below. The paper argues that alignment training and specific NLP tasks can be mixed without hurting either the alignment effect or task performance.
[Figure: performance of RLHF-trained models vs. base models on NLP tasks]

Overall process

[Figure: overall data collection and training pipeline]
The overall process of the paper is shown in the figure above. Starting from the pre-trained language model (PLM), one branch first pre-trains a preference model on comparison data from the Internet to obtain a pre-trained preference model (PMP), and then fine-tunes it on human-feedback comparison data to obtain the preference model (PM). The other branch also starts from the PLM: the behavior of the 52-billion-parameter model is distilled into models of various sizes on prompt data (several models, from 13 million to 52 billion parameters, are trained independently), and these serve as the initial policy models for reinforcement learning, with the PM used as the reward model for PPO-based RL training. The resulting RL policy models then generate new comparison data, which is labeled by humans and used to retrain the PM, which in turn is used to retrain the RL model, and so on iteratively.
The paper believes that the most important steps in the whole process are: human feedback data collection, preference model training and RLHF training.
Compared with InstructGPT: InstructGPT includes a supervised fine-tuning stage, which this paper does not have, although its context-distillation stage plays a somewhat similar role; InstructGPT does not train for harmlessness, while this paper trains on helpfulness and harmlessness data together; InstructGPT's PM has only 6 billion parameters, while this paper's PM goes up to 52 billion; and InstructGPT mixes pre-training data into the RL stage to avoid losing evaluation performance, while this paper does not.
Another major difference is that this paper performs "online" training: different models are deployed for users to interact with, which yields higher-quality data and reduces the impact of the long-tail data distribution, and new models are then trained iteratively.
The figure below shows the effect of RLHF training. RLHF-trained models are generally better than the original models, and the benefit grows with model scale; "online" training improves the models further. Training only on helpfulness data makes the model more helpful, but harmlessness decreases to some extent, which again illustrates the tension between helpfulness and harmlessness.
[Figure: effect of RLHF and "online" training across model sizes]

Data

The paper shows that helpfulness and harmlessness are often in tension, so helpfulness data and harmlessness data are collected separately. For the helpfulness collection task, annotators judge which response is more helpful; for the harmlessness collection task, annotators judge which response is more harmless. The data mainly falls into the following three categories (a data-loading sketch follows the list):

  1. Core base dataset: collected from the base language model (a context-distilled LM with 52 billion parameters), containing 44,000 helpfulness comparison pairs and 42,000 harmlessness comparison pairs.
  2. RS dataset: collected via rejection sampling against a preference model trained on the base dataset (also 52 billion parameters), containing 52,000 helpfulness pairs and 2,000 harmlessness pairs.
  3. Online dataset: collected from the RLHF models, updated once a week for 5 weeks, containing 22,000 helpfulness pairs and no harmlessness data.
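For reference, the comparison data released alongside the paper (the hh-rlhf repository linked above) is distributed as gzipped JSONL files in which each record contains a "chosen" and a "rejected" dialogue transcript. A minimal loading sketch under that assumption, with a hypothetical local path:

```python
import gzip
import json

def load_comparisons(path):
    """Load HH-RLHF comparison pairs from a gzipped JSONL file.

    Each record is assumed to contain a "chosen" and a "rejected"
    multi-turn dialogue (a Human:/Assistant: transcript).
    """
    pairs = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            pairs.append((record["chosen"], record["rejected"]))
    return pairs

# Hypothetical local path to one split of the released data.
helpful_base = load_comparisons("hh-rlhf/helpful-base/train.jsonl.gz")
print(len(helpful_base), "comparison pairs")
```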

The "static" data referred to below means the base dataset plus the RS dataset. Note that the data used to train the PM comes almost entirely from models with 52 billion parameters, so during RLHF training the replies that smaller policy models generate for a prompt may fall outside the distribution of the PM's training data, which makes training the smaller models harder. In practice, however, the figures in the previous section show that the smaller models still learn useful information: even a model with 55 times fewer parameters is still better than a large model without RLHF training.

Context distillation

In this stage, the behavior of the 52-billion-parameter model is distilled into models of different sizes, which serve as the initial policy models for subsequent reinforcement learning. During distillation the batch size is 32 and the learning rate is 0.05 times the pre-training learning rate [1], decayed to 0. The distillation process uses a total of 350 million tokens. The training at this stage proceeds roughly as follows:

  1. Prepare prompt data: 50% from the pre-training corpus and 50% from the StackExchange dataset. For pre-training data, the prompt is placed directly after "Human:" as the user-side input of a dialogue; for StackExchange data, the question is used as the user-side input and the highest-voted answer as the assistant's reply.
  2. Feed both kinds of prompts into the 52-billion-parameter pre-trained language model and record, for each token, the top-50 log-probabilities and their indices, forming a new, smaller dataset.
  3. Perform context distillation: models of all sizes are fine-tuned on the data from step 2 with a KL-divergence loss. Each token becomes a 51-way classification problem, where the 51st class collects the total probability of all tokens outside the top 50 (a loss sketch follows this list).
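A minimal sketch of what the 51-way KL loss in step 3 could look like (PyTorch; the tensor layout, clamping, and function names are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def distill_kl_loss(student_logits, teacher_top50_logprobs, teacher_top50_ids):
    """KL distillation against a truncated teacher distribution.

    student_logits:         (batch, seq, vocab) logits of the smaller model
    teacher_top50_logprobs: (batch, seq, 50) teacher log-probs of its top-50 tokens
    teacher_top50_ids:      (batch, seq, 50) vocabulary indices of those tokens

    Both distributions are reduced to 51 classes: the teacher's top-50 tokens
    plus one bucket holding the probability mass of every other token.
    """
    student_logprobs = F.log_softmax(student_logits, dim=-1)

    # Student log-probs at the teacher's top-50 token ids.
    student_top50 = torch.gather(student_logprobs, -1, teacher_top50_ids)

    # Probability mass outside the top 50, for student and teacher (class 51).
    student_rest = torch.clamp(1.0 - student_top50.exp().sum(-1, keepdim=True), min=1e-8)
    teacher_rest = torch.clamp(1.0 - teacher_top50_logprobs.exp().sum(-1, keepdim=True), min=1e-8)

    # Build the two 51-way log-distributions and take KL(teacher || student).
    student_51 = torch.cat([student_top50, student_rest.log()], dim=-1)
    teacher_51 = torch.cat([teacher_top50_logprobs, teacher_rest.log()], dim=-1)
    return F.kl_div(student_51, teacher_51, log_target=True, reduction="batchmean")
```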

Preference model

The preference model is trained on comparison data. Each sample consists of a prompt and a pair of replies; the prompt may contain several rounds of dialogue between the human and the model.
PM models range in size from 13 million to 52 billion parameters, and all of them go through three stages:

  1. Language-model pre-training on a large-scale corpus [1].
  2. Preference-model pre-training on mixed comparison data from StackExchange/Reddit/Wikipedia, with a learning rate of 10% of the LM pre-training learning rate.
  3. Fine-tuning on human-feedback data with a learning rate of 1% of the LM pre-training learning rate; the maximum sequence length is 1024, and 2048 when continuing training on "online" data.

To prevent overfitting, each stage is trained for only one epoch. See [1] for more details.
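The summary above does not spell out the PM training objective; preference models of this kind are typically trained with a pairwise ranking loss that pushes the score of the chosen reply above that of the rejected one, consistent with the P(A > B) formula given later in the RLHF section. A minimal sketch under that assumption (the function name and toy scores are illustrative):

```python
import torch
import torch.nn.functional as F

def preference_loss(score_chosen, score_rejected):
    """Pairwise ranking loss for a preference model.

    score_chosen / score_rejected: (batch,) scalar scores the PM assigns to
    the preferred and non-preferred reply of each comparison pair.
    Minimising this loss maximises sigmoid(score_chosen - score_rejected),
    i.e. the probability that the chosen reply outranks the rejected one.
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage: pairs that are already ranked correctly give a small loss.
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.9])
print(preference_loss(chosen, rejected))
```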
[Figure: distribution of dialogue turns in PM training data and PM accuracy by number of turns]
The figure above shows the distribution of dialogue turns in the PM training data and the accuracy of PMs of different sizes across that distribution. Most conversations have at most 3 turns; accuracy is highest on single-turn dialogues, drops noticeably after 2 turns, and continues to decline slowly over more turns. It can also be seen that the larger the model, the higher the accuracy.
[Figure: PM accuracy vs. amount of training data and model size, for helpfulness and harmlessness data]
This figure again shows PM accuracy, comparing the influence of training-data volume, model size, and the helpfulness vs. harmlessness datasets. Overall, accuracy rises as training data and model parameters increase; on the helpfulness data, larger models keep getting better, but on the harmlessness data, further increasing model size brings no additional gain.
[Figure: PM accuracy as a function of the score difference between the two replies]
This figure shows PM accuracy as a function of the PM score difference: the larger the gap between the two scores, the higher the PM's accuracy. The PM also works better when trained only on helpfulness data; when trained on the mixture of helpfulness and harmlessness data, it underfits.
The paper further observes that if both samples in a pair are required to have a PM score above some threshold, PM accuracy decreases as the threshold increases, as shown in the figure below. This indicates that the PM is not confident in its judgments on high-scoring samples, partly because high-quality, high-scoring comparison pairs are scarce. The paper therefore adds an "online" learning stage, which supplies more high-scoring samples for retraining the PM.
[Figure: PM accuracy when both samples are required to exceed a score threshold]

RLHF

The paper performs reinforcement learning (RL) on top of the PM, in the following two steps:

  1. Prepare comparison data and train a PM that assigns a higher score to the "better" response.
  2. Take all previously collected prompts and train an RL policy against the PM's score: the policy generates a reply for each prompt, and the PM scores each reply.

The prompt data in step 2 does not come entirely from the existing training data; some prompts are generated with a large language model.
The main idea of this pipeline is to use the preference model to guide the policy toward better answers. However, as noted earlier, the PM becomes less confident at higher scores, so a higher reward does not necessarily mean better performance.
The paper performs RL training with the PPO method. The reward is computed from the PM score and a KL-divergence penalty, as follows:

$r_{total} = r_{PM} - \lambda_{KL} \times D_{KL}(\text{policy} \,\|\, \text{policy}_0)$


where $\lambda_{KL}$ is a hyperparameter greater than 0. In practice $\lambda_{KL} = 0.001$, which is very small; during RL training this term usually plays only a minor role, because $D_{KL}$ is typically less than 100, so the penalty may not even be needed. $r_{PM}$ is the score given by the PM, and the probability of "choosing A over B" can be computed as:

$P(A > B) = \dfrac{1}{1 + e^{\,r_{PM}(B) - r_{PM}(A)}}$
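A small numerical sketch of the two formulas above (plain Python; the PM scores and KL value are made-up numbers, chosen only to illustrate the scale of the penalty):

```python
import math

LAMBDA_KL = 0.001  # KL penalty coefficient reported by the paper

def total_reward(r_pm, kl_policy_vs_init, lambda_kl=LAMBDA_KL):
    """r_total = r_PM - lambda_KL * D_KL(policy || policy_0)."""
    return r_pm - lambda_kl * kl_policy_vs_init

def prob_a_over_b(r_pm_a, r_pm_b):
    """P(A > B) = 1 / (1 + exp(r_PM(B) - r_PM(A)))."""
    return 1.0 / (1.0 + math.exp(r_pm_b - r_pm_a))

# Even a KL of 100 changes the reward by only 0.1, which is why the paper
# notes the penalty usually plays a minor role.
print(total_reward(r_pm=2.3, kl_policy_vs_init=100.0))  # -> 2.2
print(prob_a_over_b(r_pm_a=2.3, r_pm_b=1.1))            # -> ~0.77
```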


The paper finds that the PM score achieved by the policy is roughly a linear function of $\sqrt{D_{KL}}$ (and models of different sizes have similar linear coefficients), as shown in the figure below. However, once $\sqrt{D_{KL}}$ grows past a certain threshold, model performance drops slightly, which indicates that the preference model is less robust, and easier to exploit, at higher rewards.

In addition, the experiments show that RLHF gradually becomes less robust at higher PM scores, that larger preference models are more robust than smaller ones, and that the helpfulness data and the harmlessness data are in tension with each other.

Online iterative RLHF

The PM is not very robust on high-scoring data. To alleviate the shortage of high-scoring data, the paper proposes online iterative RLHF:

  1. First train the best-performing RLHF policy model and use it to generate comparison data for human labeling. Because the RLHF model has been optimized against the PM, the data it generates is biased toward responses the PM scores highly.
  2. Mix the new comparison data with the existing data, train a new PM, and then use the new PM to train a new RLHF policy model.

The paper's assumption is that the online RLHF policy collects data with relatively high PM scores, so the newly trained PM performs better on high-scoring data, which in turn helps train a better RLHF policy model.
However, RLHF tends to reduce the entropy of the policy (the drop in loss is an entropy-reduction process), which reduces the diversity of the collected data (more diverse data has higher entropy). The paper alleviates this by deploying several different versions of the RL models and online iterated models.
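In outline, the online iteration might be organized as below. This is only a process sketch: train_pm, train_policy, and collect_comparisons are caller-supplied placeholders standing in for the paper's actual PM training, PPO training, and human data collection.

```python
def online_iterative_rlhf(train_pm, train_policy, collect_comparisons,
                          static_comparisons, num_iterations=5):
    """Sketch of online iterative RLHF: policy -> new comparisons -> new PM -> new policy.

    train_pm(comparisons)        -> preference model
    train_policy(pm)             -> RLHF policy trained against that PM
    collect_comparisons(policy)  -> human-labeled comparison pairs from the deployed policy
    """
    comparisons = list(static_comparisons)       # base + RS ("static") data
    pm = train_pm(comparisons)
    policy = train_policy(pm)

    for _ in range(num_iterations):              # e.g. weekly model refreshes
        new_pairs = collect_comparisons(policy)  # deploy the policy, get human labels
        comparisons.extend(new_pairs)            # mix new data with the old
        pm = train_pm(comparisons)               # retrain the PM on the mixture
        policy = train_policy(pm)                # retrain the RLHF policy
    return policy, pm
```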

Other

The paper is concerned that alignment techniques may hurt model performance, but the experimental results suggest that models with more parameters suffer less damage.
Various experiments show a tension between helpfulness and harmlessness. In the paper, helpfulness data and harmlessness data are combined by weighting their losses, as follows:

$Loss_{Total} = Loss_{Helpfulness} + \lambda \times Loss_{Harmlessness}$
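As a trivial sketch of this weighted combination (λ is a tunable weight; the two terms would be the pairwise PM losses computed on the helpfulness and harmlessness comparison data respectively):

```python
def combined_pm_loss(loss_helpfulness, loss_harmlessness, lam=1.0):
    """Weighted sum of the helpfulness and harmlessness preference losses."""
    return loss_helpfulness + lam * loss_harmlessness
```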


The paper also mentions other methods for reducing harmful outputs, such as OOD (out-of-distribution) detection, which rejects strange or harmful requests by judging the prompt text directly.

References

【1】A General Language Assistant as a Laboratory for Alignment.

  This blog records my learning progress and shares up-to-date techniques. Thank you for reading; it will be updated continuously, and I hope it helps you technically.


【Large Model & NLP & Algorithm】Column

Nearly 200 papers and 300 markdown notes written by the blogger. Subscribe to the [Large Model & NLP & Algorithm] column, or go to https://github.com/wjn1996/LLMs-NLP-Algo to get all of the following materials:

  • Machine learning & deep learning fundamentals and advanced materials (notes, PPT, code)
  • NLP fundamentals and advanced materials (notes, PPT, code)
  • A complete large-model curriculum: pre-trained language model foundations, knowledge-enhanced pre-training, large-model surveys, large-model training and optimization, large-model fine-tuning, ChatGPT-style reproduction and applications, etc.
  • Algorithm interview problems from major tech companies


Origin blog.csdn.net/qq_36426650/article/details/131612445