The most powerful open source large model? An interpretation of the Llama 2 paper


  The Llama model has long been considered the most capable open source model in the AI community, and many open source models in China and abroad have been trained or developed on top of it, such as Alpaca, BELLE, Guanaco, and Vicuna. However, because of its license terms, the original Llama model could not be used commercially for free. Last week, the Meta AI team open sourced its latest large model, Llama 2, whose capabilities are comparable to ChatGPT and which is free for commercial use. The open source address is here and the sample code is here, which further fuels the open sourcing of large models. Notably, the Meta AI team not only released the pre-trained Llama 2 models but also published a paper that describes the training and fine-tuning of Llama 2 in detail. This article interprets the Llama 2 paper and examines the core methods used in fine-tuning the Llama 2 model.

Introduction

  Before introducing the detailed work, the Meta AI team first compares the capabilities of the Llama 2 model with other open source models and with closed-source models, as shown in Figure 1. In Meta AI's tests, Llama 2 is comparable to ChatGPT in capability; when capability and safety are considered together, Llama 2 is even better.

[Figure 1: comparison of Llama 2 with other open source and closed-source models on helpfulness and safety]
  The paper states that, based on Meta AI's comparisons of helpfulness and safety across large models, the Llama 2 released here is currently the best open source model, and it is free for commercial use. Two series of models are open sourced in this release:

  • The Llama 2 series, an updated version of Llama 1. These models are trained on a larger, newer mix of public data: the pre-training corpus is 40% larger, the context length is doubled compared with Llama 1, and the grouped-query attention (GQA) method is adopted. This series is open sourced in 7B, 13B, and 70B parameter versions; the Meta AI team also trained a 34B version, which has not been open sourced.
  • The Llama 2-Chat series, built on the Llama 2 series and fine-tuned for dialogue tasks. The open sourced models in this series come in 7B, 13B, and 70B parameter versions.

  The training of the Llama 2 model is divided into three parts: pre-training, fine-tuning, and human feedback; the overall process is shown in Figure 2. The pre-training module covers model pre-training. The fine-tuning module focuses on reinforcement learning from human feedback (RLHF), which involves two algorithms/strategies: one is the Proximal Policy Optimization (PPO) algorithm used in the GPT series of work, and the other is the Rejection Sampling fine-tuning strategy. The human feedback module trains the reward models, focusing on both the helpfulness and the safety of the model's answers; two reward models, a Safety Reward Model and a Helpfulness Reward Model, are trained separately.
[Figure 2: overall training process of Llama 2-Chat]

Model pre-training

Pre-training settings

  The Llama 2 model is largely consistent with Llama 1 in model architecture and pre-training settings. It uses a standard Transformer architecture with RMSNorm pre-normalization, the SwiGLU activation function, and rotary positional embeddings (RoPE). Compared with the training of Llama 1, the main changes in Llama 2 training are:
  1) More robust data cleaning
  2) An updated data mixing strategy
  3) More training data (40% more tokens)
  4) A doubled context length
  5) Use of grouped-query attention (GQA); a minimal sketch of GQA is given after this list
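
To make the last point concrete, below is a minimal sketch of grouped-query attention in PyTorch, in which several query heads share one key/value head. It is an illustration under assumed tensor shapes and a causal mask, not Meta's implementation:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
    Each group of n_heads // n_kv_heads query heads attends with a shared K/V head."""
    n_heads, n_kv_heads, seq = q.shape[1], k.shape[1], q.shape[2]
    group = n_heads // n_kv_heads
    # Repeat the K/V heads so every query head has a matching K/V head.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    # Causal mask for autoregressive language modeling.
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Example: 8 query heads sharing 2 K/V heads.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v)   # shape (1, 8, 16, 64)
```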
  The table below compares the pre-training settings of Llama 2 and Llama 1; Appendix A.2.1 of the paper also provides the related comparative experiments:
[Table: pre-training settings of Llama 1 vs. Llama 2]
  For Llama 2 pre-training, Meta AI used a new mix of publicly available data (containing no user data from any Meta products) and filtered out personal and private information, for a total of 2 trillion tokens. The tokenizer is the same as Llama 1's, with a vocabulary size of 32K. The main hyperparameters of the pre-training process are as follows:

Hyperparameter           Value
AdamW optimizer          β_1 = 0.9, β_2 = 0.95, eps = 10^(-5)
Learning rate schedule   cosine
Warmup steps             2000
Weight decay             0.1
Gradient clipping        1.0
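
As an illustration, these hyperparameters can be wired together in PyTorch roughly as follows. The model, peak learning rate, and total step count are placeholders rather than values from the paper (the paper's peak learning rates vary by model size, and its cosine schedule decays to 10% of the peak):

```python
import math
import torch

# A sketch of the reported pre-training optimizer settings: AdamW with
# beta1 = 0.9, beta2 = 0.95, eps = 1e-5, weight decay 0.1, a cosine learning-rate
# schedule with 2000 warmup steps, and gradient clipping at 1.0.
model = torch.nn.Linear(4096, 4096)                        # placeholder model
total_steps, warmup_steps, peak_lr = 100_000, 2_000, 3e-4  # illustrative values

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), eps=1e-5, weight_decay=0.1)

def lr_lambda(step):
    if step < warmup_steps:                                # linear warmup
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay to 10% of peak

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```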

  The paper also shows the training loss of the Llama 2 models during pre-training, as in the figure below. As the number of model parameters increases, the training loss is lower; this trend is also reported in the GPT series of papers. Likewise, as the amount of pre-training data grows, the training loss keeps decreasing without flattening out, which suggests that training on even more data might further improve the pre-trained model.
[Figure: training loss curves of the Llama 2 models during pre-training]

Hardware resources and carbon emissions

  Meta AI trained Llama 2 on Meta's supercomputing clusters, which are equipped with NVIDIA A100 GPUs. In line with its low-carbon commitments, the paper also lists the GPU hours, power consumption, and carbon emissions required to pre-train each model size, as shown in the table below. From these figures and the pre-training time, one can also estimate how many GPUs were used to pre-train each Llama 2 model.

[Table: GPU hours, power consumption, and carbon emissions of Llama 2 pre-training]

Model evaluation

  After pre-training, the paper compares Llama 1, Llama 2 (pre-trained versions), MPT, and Falcon on internal evaluations covering coding, reasoning, reading comprehension, math, and other abilities, as shown in the figure below. At the same parameter scale, the pre-trained Llama 2 model outperforms Llama 1, MPT, and Falcon across these capabilities.
[Figure: benchmark comparison of pre-trained Llama 2 with Llama 1, MPT, and Falcon]
  Besides the open source models above, the paper also compares Llama 2 with several closed-source models, as shown in the figure below. Although the pre-trained Llama 2 model scores lower than these closed-source models on most benchmarks, the gap is relatively small. Note that this is only the pre-trained Llama 2 (i.e., the open sourced Llama 2 base series), before any fine-tuning.

[Figure: comparison of pre-trained Llama 2 with closed-source models]

Model fine-tuning

  For fine-tuning, Meta AI names the fine-tuned model Llama 2-Chat. The fine-tuning process consists of instruction fine-tuning and RLHF. The paper divides the fine-tuning work into three sections: the first covers supervised fine-tuning, the second covers training the reward models and the RLHF process, and the third introduces the Ghost Attention (GAtt) method.

Supervised fine-tuning

  The authors first performed supervised fine-tuning using the same method and publicly available instruction data as Llama 1, and then fine-tuned again with higher-quality data. During this process they found that a smaller amount of high-quality data significantly improves the model, so the Meta AI team invested considerable effort in data quality, collecting a total of 27,540 high-quality annotated examples (containing no user data from any Meta products).
  Fine-tuning details: in supervised fine-tuning, each sample consists of a prompt and an answer. To make full use of the model's sequence length, the prompt and answer of each sample are concatenated with a special token before being fed into the model. The whole process ran for only 2 epochs.
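
As an illustration of this formatting step, here is a minimal sketch of packing one SFT sample. The token ids and separator token are hypothetical, and the label masking (computing loss only on answer tokens, as the paper also describes) uses the ignore-index convention of common cross-entropy implementations:

```python
from typing import List, Tuple

BOS, EOS, SEP = 1, 2, 3   # hypothetical special-token ids
IGNORE_INDEX = -100       # value ignored by typical cross-entropy losses

def pack_sft_example(prompt_ids: List[int], answer_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and answer with a separator token and build labels
    so that only the answer tokens contribute to the loss."""
    input_ids = [BOS] + prompt_ids + [SEP] + answer_ids + [EOS]
    labels = [IGNORE_INDEX] * (1 + len(prompt_ids) + 1) + answer_ids + [EOS]
    return input_ids, labels

# Usage with dummy token ids.
inp, lab = pack_sft_example([101, 102, 103], [201, 202])
print(inp)  # [1, 101, 102, 103, 3, 201, 202, 2]
print(lab)  # [-100, -100, -100, -100, -100, 201, 202, 2]
```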

Reinforcement Learning with Human Feedback (RLHF)

  RLHF is a further training stage whose goal is to align the fine-tuned model with human preferences, i.e., to make it more likely to produce the answers humans prefer. The authors had annotators compare answers generated by the model and select the ones they prefer, and these preference annotations are then used to train a reward model.

Collection of human preference data

  For data collection in the RLHF stage, annotators first write a prompt/question and then choose one of two model-generated answers according to specified criteria. To increase the diversity of the answer pairs, the two answers are sampled from two different model variants with different temperature settings. Additionally, the authors asked annotators to indicate how strongly they prefer the chosen answer: significantly better, better, slightly better, or negligibly better / unsure.
  The authors also took the helpfulness and safety of the dataset into account and processed it accordingly. Human preference data is collected in weekly batches: after each batch is used to train a new reward model, the Llama 2-Chat model is updated as well; the next batch is then annotated on answers from the new Llama 2-Chat, and the reward model is updated again. This iteration keeps the reward model in step with the latest Llama 2-Chat model.
  In the table below, the authors compare the collected dataset with several open source human preference datasets. The comparison shows that the collected dataset contains more conversation turns and longer average content.

[Table: comparison of Meta's human preference data with open source preference datasets]

Reward model

  For RLHF, the authors trained two reward models: one optimized for helpfulness using the "helpfulness" preference data, and the other optimized for safety using the "safety" preference data. Both reward models are initialized from pre-trained Llama 2-Chat checkpoints, which ensures that the reward model understands the same tasks as the Llama 2-Chat model. The advantage of this approach is that it avoids unexpected mismatches, for example an information mismatch between the two models that could make the chat model believe it is being rewarded according to human preferences when in fact it is not. The only difference between the reward model and the Llama 2-Chat model is that the classification head used for next-token prediction is replaced with a regression head that outputs a scalar reward.
  The training objective (loss function) of the reward model is:
L_ranking = -log( σ( r_θ(x, y_c) - r_θ(x, y_r) ) )

  Here y_c denotes the answer chosen (preferred) by the annotator, and y_r the rejected answer; r_θ(x, y_c) is the scalar score output by the reward model with parameters θ for prompt x and answer y_c, and r_θ(x, y_r) is defined in the same way. As mentioned earlier, the preference data also carries a rating (significantly better, better, slightly better, or negligibly better / unsure). To make the loss reflect these rating differences, the authors add a discrete margin term m(r) to the loss, which further improves the accuracy of the reward model (explained in Table 28 and Appendix A.3.3 of the paper):
L_ranking = -log( σ( r_θ(x, y_c) - r_θ(x, y_r) - m(r) ) )
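
A minimal PyTorch sketch of this ranking loss is shown below; the scores and margin values are illustrative only (the actual m(r) values are listed in Table 28 of the paper):

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(score_chosen, score_rejected, margin=None):
    """Binary ranking loss for reward-model training.
    score_chosen / score_rejected: scalar rewards r_theta(x, y_c) and r_theta(x, y_r).
    margin: optional m(r) term reflecting how much better the chosen answer was rated."""
    diff = score_chosen - score_rejected
    if margin is not None:
        diff = diff - margin
    return -F.logsigmoid(diff).mean()

# Illustrative usage: one strongly preferred pair and one near-tie.
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.5])
margins = torch.tensor([1.0, 0.0])   # e.g. "significantly better" vs. "negligibly better"
print(reward_ranking_loss(chosen, rejected, margins))
```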
  The training setup of the reward model is briefly described in the paper: it also uses the AdamW optimizer; the maximum learning rate is 5×10^-6 for the reward model based on the 70B Llama 2-Chat and 1×10^-5 for the others; the learning rate schedule is cosine; the warm-up is 3% of the total number of training steps (with a minimum of 5 steps); and the batch size is 512.
  After the reward models were trained, the authors compared them with other publicly available reward models, as shown in the table below. Llama 2's reward models perform very well on both helpfulness and safety, and overall they surpass the other models.
[Table: comparison of the Llama 2 reward models with other publicly available reward models]
  Beyond comparing with other reward models, the authors also analyzed how the reward models scale, as shown in the figure below. They found that reward model accuracy improves as model size increases (consistent with the trend seen during pre-training) and also improves as more training data is used. The authors argue that the reward model is one of the most important factors in the final capability of the large model, so improving the reward model directly yields significant improvements in the final model (and they suggest the reward model could be further optimized in a future version).
[Figure: reward model accuracy versus model size and amount of training data]

Iterative fine-tuning process

  As mentioned earlier, training of the reward model and training of the Llama 2-Chat model proceed iteratively, in tandem. The authors trained a sequence of RLHF model versions, named RLHF-V1, RLHF-V2, ..., RLHF-V5, and the whole process uses two algorithms to update the large model:

  • The Proximal Policy Optimization (PPO) algorithm, currently one of the most common RLHF algorithms; it was proposed by the OpenAI team and has been applied in the GPT series of work with very good results.
  • Rejection Sampling fine-tuning: for each input prompt, K outputs are sampled from the previous model (the model before training in this iteration), the latest reward model scores these outputs, and the highest-scoring output is selected for the model parameter update (a sketch of this procedure is given after the note below).

  Note that up to RLHF-V4 the authors used only rejection sampling; after that, rejection sampling and PPO were combined sequentially, applying PPO on top of the rejection-sampling checkpoint. In addition, rejection sampling was only performed with the largest model, the 70B Llama 2-Chat; the smaller models were fine-tuned on the rejection-sampling data produced by the large model, i.e., the capable large model is used to distill the smaller models.
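
Below is a minimal sketch of how rejection-sampling data can be constructed; the generate and reward functions are hypothetical stand-ins for the current policy model and the latest reward model:

```python
from typing import Callable, List, Tuple

def rejection_sample(prompts: List[str],
                     generate: Callable[[str], str],       # draws one sample from the current policy
                     reward: Callable[[str, str], float],  # latest reward model's score
                     k: int = 8) -> List[Tuple[str, str]]:
    """For each prompt, sample k candidate answers, keep the highest-reward one,
    and return (prompt, best answer) pairs for a further round of fine-tuning."""
    new_sft_data = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        best = max(candidates, key=lambda answer: reward(prompt, answer))
        new_sft_data.append((prompt, best))
    return new_sft_data
```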

Multi-turn dialogue control

  In dialogue tasks, an instruction may need to hold across many turns of a conversation, but the authors found in the original RLHF setup that the model tends to forget the instruction given at the beginning of the dialogue after several turns. To address this, they propose the Ghost Attention (GAtt) method, a training trick that helps the model stay focused on the instruction across multiple turns; its effect is shown in the figure below.

[Figure: effect of GAtt on multi-turn instruction following]
  The Ghost Attention (GAtt) method attaches the instruction to every user turn of the original multi-turn dialogue: the dialogue [u_1, a_1, ..., u_n, a_n] (where u_1 is the user's first-turn input, a_1 the corresponding first-turn answer, u_2 the second-turn user input, and so on) becomes [inst + u_1, a_1', ..., inst + u_n, a_n']. The idea is to synthesize a more instruction-focused dataset and fine-tune the model on it: starting from the original data [u_1, a_1, ..., u_n, a_n], the authors first obtain the instruction-augmented data [inst + u_1, a_1', ..., inst + u_n, a_n'], and then fine-tune on the mixed form [inst + u_1, a_1', u_2, a_2', ..., u_n, a_n'], in which only the first turn keeps the instruction, so that the model learns to keep following the instruction throughout a multi-turn dialogue. To avoid a mismatch during training, the loss on the intermediate turns is set to 0. Note that GAtt was not used from the beginning of the RLHF stage; it was only applied after RLHF-V3 to strengthen the model's control over multi-turn dialogue.
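
A minimal sketch of this data construction is shown below; the dialogue layout and loss flag are hypothetical, and the assistant answers are assumed to have been generated with the instruction attached to every turn, as described above:

```python
def build_gatt_sample(instruction, turns):
    """turns: list of (user_msg, assistant_msg) pairs, where the assistant messages
    were generated with the instruction prepended to every user turn."""
    sample = []
    for i, (user, assistant) in enumerate(turns):
        user_text = f"{instruction}\n{user}" if i == 0 else user   # keep the instruction only on turn 1
        sample.append({"role": "user", "content": user_text})
        sample.append({"role": "assistant", "content": assistant,
                       "compute_loss": i == len(turns) - 1})        # zero out loss on earlier turns
    return sample

# Illustrative usage.
sample = build_gatt_sample("Always answer in the style of a pirate.",
                           [("Hi, who are you?", "Arr, I be yer assistant!"),
                            ("What's 2 + 2?", "That be 4, matey!")])
```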

RLHF results

  To evaluate the model after RLHF, the authors compared Llama 2-Chat with several open source and closed-source models, using human evaluation to score each model's answers; the overall results are shown in the figure below. Overall, Llama 2-Chat outperforms the other open source models and is roughly on par with ChatGPT.

[Figure: human evaluation of Llama 2-Chat against open source and closed-source models]

Security of model answers

  To make the model's answers safe, the authors also did a great deal of work, mainly including:
  (1) Safety-aware data collection: human safety preferences were taken into account when collecting data, so that even before RLHF the model tends to generate answers that follow human safety expectations.
  (2) Safety RLHF: a separate safety reward model is trained so that further fine-tuning biases the model toward safer outputs.
  (3) Safety context distillation in the RLHF stage: safer answers are generated by adding an extra prompt such as "You are a safe and responsible assistant", and the model is then fine-tuned on the questions without this extra prompt paired with the safe answers that were generated with it (a small sketch follows this list).
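
Here is a minimal sketch of the context-distillation data construction in point (3); the generate function is a hypothetical stand-in for the chat model, and the preprompt wording follows the example quoted above:

```python
from typing import Callable, Dict, List

SAFETY_PREPROMPT = "You are a safe and responsible assistant."

def build_context_distillation_data(questions: List[str],
                                    generate: Callable[[str], str]) -> List[Dict[str, str]]:
    """Generate answers with the safety preprompt attached, then pair each answer
    with the original question (without the preprompt) as a fine-tuning target."""
    data = []
    for q in questions:
        safe_answer = generate(f"{SAFETY_PREPROMPT}\n{q}")   # answer produced under the safety prompt
        data.append({"prompt": q, "answer": safe_answer})    # fine-tune without the safety prompt
    return data
```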
  In addition, the authors conducted a series of proactive security tests, known as "red teaming". In these tests, testers probe the model with security-sensitive questions in different domains, and the results are fed back to further improve the safety of the model's answers. More than 350 red-team testers participated, including experts in fields such as cybersecurity, public opinion, law, policy, human rights, and ethics; the topics covered are also broad, including socioeconomic status, gender, and race.
  The paper evaluates the safety of Llama 2-Chat's answers and compares it with several open source and closed-source models, as shown in the figure below. In terms of answer safety, Llama 2-Chat is far better than the other open source models, and some versions even exceed ChatGPT.

[Figure: safety evaluation of Llama 2-Chat against open source and closed-source models]


Origin: blog.csdn.net/weixin_39561364/article/details/131939857