Llama2 technical details & open source impact

From: NLP Workstation


Preface

Hello everyone, I am Cong Liu NLP.

MetaAI open-sourced the Llama2 model yesterday. I just want to say: "MetaAI should be renamed OpenAI!"

Llama2 open-sources not only the pre-trained models but also the Llama2-Chat models obtained by SFT on dialogue data, and the paper describes the Llama2-Chat fine-tuning process in detail.


The open-source release currently comes in three sizes: 7B, 13B, and 70B. Pre-training uses 2 trillion tokens, the SFT stage uses more than 100,000 examples, and the human preference data exceeds 1 million comparisons.

MetaAI is clearly feeling confident: before saying anything else, the paper leads with an evaluation comparison chart, as if asking OpenAI whether it is convinced.

[Figure: evaluation comparison chart]

No rush, though. Although Chinese accounts for only 0.13% of the pre-training corpus, a wave of Chinese-vocabulary-extended continued pre-training and domain-fine-tuned models from the Chinese community is sure to follow.

A quick search on GitHub shows that many people have already staked out their repos. This is "the victory of those who have the GPUs."

The technical details of Llama2 are briefly recorded below.

Pre-training Phase

The model is a standard Transformer. Like Llama, it uses RMSNorm normalization, the SwiGLU activation function, RoPE positional embeddings, and the same vocabulary construction and size. The differences from Llama are: grouped-query attention (GQA) is added, the maximum context length is increased, and the pre-training corpus is about 40% larger.

The training hyperparameters are as follows: the AdamW optimizer uses β1 = 0.9, β2 = 0.95, and eps = 1e−5. A cosine learning-rate schedule is used, with 2,000 warmup steps, after which the learning rate decays down to 10% of its peak. Weight decay is 0.1 and gradient clipping is 1.0.
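For reference, here is a minimal sketch of that schedule (linear warmup for 2,000 steps, then cosine decay down to 10% of the peak). The function name, pure-Python form, and the example peak learning rate are illustrative assumptions, not MetaAI's code.

```python
import math

def cosine_lr(step, total_steps, peak_lr, warmup_steps=2000, min_ratio=0.1):
    """Cosine learning-rate schedule with linear warmup, decaying to
    min_ratio * peak_lr (10% of the peak, as described above)."""
    if step < warmup_steps:
        # linear warmup from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # cosine decay from peak_lr down to min_ratio * peak_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

# example with an assumed peak learning rate of 3e-4
print(cosine_lr(step=1000, total_steps=500_000, peak_lr=3e-4))      # mid-warmup
print(cosine_lr(step=500_000, total_steps=500_000, peak_lr=3e-4))   # ~10% of peak
```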

But note: the 7B and 13B models do not use GQA!
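As a rough illustration of what GQA changes relative to standard multi-head attention, here is a minimal PyTorch-style sketch: there are fewer key/value heads than query heads, and each K/V head is shared by a group of query heads. The head counts and shapes are illustrative only.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_q_heads=32, n_kv_heads=8):
    """Minimal GQA: q has n_q_heads heads, k/v have only n_kv_heads heads;
    each K/V head is shared by (n_q_heads // n_kv_heads) query heads.
    q: (batch, seq, n_q_heads, head_dim); k, v: (batch, seq, n_kv_heads, head_dim)."""
    group = n_q_heads // n_kv_heads
    # repeat each K/V head `group` times so shapes line up with the query heads
    k = k.repeat_interleave(group, dim=2)
    v = v.repeat_interleave(group, dim=2)
    # switch to (batch, heads, seq, head_dim) for scaled_dot_product_attention
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2)  # back to (batch, seq, n_q_heads, head_dim)

b, s, hd = 2, 16, 128
q = torch.randn(b, s, 32, hd)
k = torch.randn(b, s, 8, hd)
v = torch.randn(b, s, 8, hd)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([2, 16, 32, 128])
```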

The pre-training loss is shown in the figure below; the models have clearly not yet fully converged.
[Figure: pre-training loss curves]

The pre-trained model's performance, summarized in one sentence: first among open-source models, but still no match for the closed-source ones.


Fine-tuning Stage

It doesn't matter that the pre-trained model above hasn't beaten OpenAI's yet; wait until the whole fine-tuning pipeline is finished.

SFT

"Data Quality Is All You Need." During MetaAI's experiments, it was found that the training effect of a small number of high-quality data sets is better than that of a large number of low-quality data sets. Therefore, in the future, in SFT, don't blindly pursue quantity, quality is more important.

For fine-tuning, the initial learning rate is 2e−5 with cosine decay, weight decay is 0.1, the batch size is 64, and the maximum sequence length is 4096. To improve training efficiency, multiple samples are packed together to fill the 4096-token window as much as possible, separated by a special token. When computing the loss, only the tokens of each sample's target response are counted; the prompt tokens are masked out.
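A minimal sketch of how the packing and prompt-masking described above might look. The separator token id and the label value of -100 (the index PyTorch's cross-entropy ignores by default) are implementation assumptions, not details from the paper.

```python
def pack_sft_examples(examples, max_len=4096, eos_id=2, ignore_index=-100):
    """Pack (prompt_ids, answer_ids) pairs into one sequence of up to max_len tokens.
    Loss is only computed on answer tokens: prompt tokens get ignore_index labels."""
    input_ids, labels = [], []
    for prompt_ids, answer_ids in examples:
        piece = prompt_ids + answer_ids + [eos_id]
        if len(input_ids) + len(piece) > max_len:
            break  # in a real data loader, start a new packed sequence here
        input_ids += piece
        # mask out the prompt, keep the answer (and the separator) for the loss
        labels += [ignore_index] * len(prompt_ids) + answer_ids + [eos_id]
    return input_ids, labels

# toy example with fake token ids
examples = [([10, 11, 12], [20, 21]), ([13, 14], [22, 23, 24])]
ids, labels = pack_sft_examples(examples, max_len=16)
print(ids)     # [10, 11, 12, 20, 21, 2, 13, 14, 22, 23, 24, 2]
print(labels)  # [-100, -100, -100, 20, 21, 2, -100, -100, 22, 23, 24, 2]
```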

RM

Human preference data collection focuses on the helpfulness and safety of model responses. Annotators compare outputs from two models and pick the better one; beyond choosing, they also label how much better it is: significantly better, better, slightly better, or negligibly better / unsure. For safety, each pair is additionally labeled as both responses safe, only the chosen response safe, or neither safe, so that safety data can be collected.

As the model is iterated, the preference data needed by the reward model is collected in batches over successive rounds.

The reward model produces a scalar score for a response given its prompt, to evaluate generation quality. Since it is hard for a single reward model to do well on both helpfulness and safety, two reward models are trained separately: one optimized for helpfulness and one for safety.

The reward model is initialized from the chat model checkpoint, with the next-token prediction head replaced by a scalar reward regression head. Training uses a binary ranking loss with a margin term (a sketch is given below):
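The loss formula itself did not survive extraction here. Based on the paper's description, it is a binary ranking loss of the form -log(σ(r(x, y_chosen) − r(x, y_rejected) − m(r))), where the margin m(r) grows with the annotated preference strength. A hedged PyTorch-style sketch; the exact margin values per label are assumptions, not the paper's numbers.

```python
import torch
import torch.nn.functional as F

# illustrative margin per preference label; the paper uses a discrete margin
# that grows with preference strength, exact values assumed here
MARGIN = {"significantly_better": 3.0, "better": 2.0,
          "slightly_better": 1.0, "negligibly_better": 0.0}

def ranking_loss(reward_chosen, reward_rejected, labels):
    """Binary ranking loss with a margin term:
    -log(sigmoid(r_chosen - r_rejected - margin))."""
    margin = torch.tensor([MARGIN[l] for l in labels])
    return -F.logsigmoid(reward_chosen - reward_rejected - margin).mean()

r_c = torch.tensor([2.5, 1.0])   # scores of the chosen responses
r_r = torch.tensor([0.5, 0.8])   # scores of the rejected responses
print(ranking_loss(r_c, r_r, ["better", "negligibly_better"]))
```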

The margin term improves reward-model accuracy. To make the reward model generalize better and to prevent reward hacking (for example, Llama2-Chat exploiting weaknesses of the reward model to inflate the reward score despite poor actual quality), some open-source human preference data is also mixed into the reward-model training set.

Training settings: the maximum learning rate is 5e−6 for the 70B model and 1e−5 for the other models, with cosine decay down to 10% of the maximum; warmup uses 3% of the total steps (at least 5 steps); the training batch size is 1,024.

The accuracy of the different reward models on different evaluation sets is reported in the paper. The reward models perform better on pairs labeled "significantly better" and worse on pairs labeled "negligibly better" or "unsure."

The paper also studies how the reward models scale with data and model size: accuracy keeps improving as more preference data is added.

Iterative Fine-Tuning

As more batches of human preference data arrive, better reward models can be trained and more prompts collected. Five successive versions of the RLHF model (RLHF-v1 to RLHF-v5) were therefore trained.

Key training strategies include:

  • Proximal Policy Optimization (PPO): the standard reinforcement learning algorithm.

  • Rejection sampling fine-tuning: for each prompt, K outputs are sampled from the model, the one with the highest reward is selected, and it is used for the gradient update in the reinforcement learning stage (see the sketch below).

Before RLHF-v4, only rejection sampling fine-tuning was used; afterwards the two methods were combined sequentially (rejection sampling, then PPO). Also, rejection sampling itself is only performed with the 70B model; the smaller models are fine-tuned on rejection-sampled data from the large model, which effectively distills the large model into the small ones.
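A minimal sketch of the rejection-sampling step described above: sample K responses per prompt, score them with the reward model, and keep the best one as a new fine-tuning target. `generate` and `reward_model` are placeholder callables here, not MetaAI's APIs.

```python
import random

def rejection_sample(prompts, generate, reward_model, k=4):
    """For each prompt, sample k candidate responses and keep the highest-reward one.
    The returned (prompt, best_response) pairs are then used for SFT-style fine-tuning."""
    best_pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        scores = [reward_model(prompt, c) for c in candidates]
        best = candidates[scores.index(max(scores))]
        best_pairs.append((prompt, best))
    return best_pairs

# toy stand-ins for the policy model and reward model
fake_generate = lambda p: f"{p} -> answer#{random.randint(0, 9)}"
fake_reward = lambda p, c: len(c) + random.random()
print(rejection_sample(["Q1", "Q2"], fake_generate, fake_reward, k=4))
```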

The reward used for reinforcement learning combines the helpfulness reward and the safety reward; the combination is roughly as sketched below.
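The original formula image is lost here. As I read the paper, the combined score falls back to the safety reward for safety-tagged prompts or when the safety score is low, and otherwise uses the helpfulness reward (the paper additionally whitens this combined score before applying the KL penalty during PPO). A hedged Python sketch; the threshold value is my reading of the paper and should be double-checked against it.

```python
def combined_reward(r_helpful, r_safety, prompt_is_safety_tagged,
                    safety_threshold=0.15):
    """Piecewise combination of the two reward scores: use the safety reward for
    safety-tagged prompts or when the safety score is low, otherwise use the
    helpfulness reward. The 0.15 threshold is an assumption from my reading."""
    if prompt_is_safety_tagged or r_safety < safety_threshold:
        return r_safety
    return r_helpful

print(combined_reward(r_helpful=0.9, r_safety=0.8, prompt_is_safety_tagged=False))  # 0.9
print(combined_reward(r_helpful=0.9, r_safety=0.1, prompt_is_safety_tagged=False))  # 0.1
```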

Training settings: all models use the AdamW optimizer with β1 = 0.9, β2 = 0.95, and eps = 1e−5, weight decay 0.1, gradient clipping 1.0, and a constant learning rate of 1e−6. During PPO training, the batch size is 512 with a mini-batch size of 64, and the PPO clipping threshold is 0.2. The KL penalty coefficient is 0.01 for the 7B and 13B models and 0.005 for the 34B and 70B models. All models are trained for 200 to 400 iterations.
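To make the two PPO-specific pieces mentioned above concrete, here is a minimal generic sketch of the KL-penalized reward and the clipped surrogate objective with the stated constants. This is standard PPO-with-KL code, not MetaAI's implementation.

```python
import torch

def kl_penalized_reward(reward, logprob_policy, logprob_ref, kl_coef=0.01):
    """Scalar reward used for PPO: reward-model score minus a KL penalty that
    keeps the policy close to the initial (reference) model.
    kl_coef = 0.01 for 7B/13B and 0.005 for 34B/70B, as stated above."""
    return reward - kl_coef * (logprob_policy - logprob_ref)

def ppo_clipped_loss(logprob_new, logprob_old, advantage, clip=0.2):
    """Standard PPO clipped surrogate objective with the 0.2 clipping threshold."""
    ratio = torch.exp(logprob_new - logprob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * advantage
    return -torch.min(unclipped, clipped).mean()

adv = torch.tensor([1.0, -0.5])
print(ppo_clipped_loss(torch.tensor([-1.0, -2.0]), torch.tensor([-1.1, -1.9]), adv))
print(kl_penalized_reward(0.7, logprob_policy=-1.0, logprob_ref=-1.2))
```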

Consistency across multiple rounds of dialogue

The original RLHF model forgets the initial instruction after a few rounds of dialogue, as shown in the figure below (left). To address this limitation, the Ghost Attention method (GAtt, really a training trick) is proposed to improve the model's adherence to instructions.

[Figure: multi-turn dialogue without GAtt (left) and with GAtt (right)]

Given multi-turn dialogue data [u1, a1, ..., un, an], define an instruction (inst) that should be followed throughout the dialogue, and concatenate it to every user message, producing [inst + u1, a1, ..., inst + un, an]. To avoid a mismatch between training and inference, during training the instruction is kept only in the first turn, and the loss on the intermediate turns is set to 0 (a data-construction sketch follows below).
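A minimal sketch of how such GAtt-style training examples might be assembled, assuming the behavior described above: the instruction is kept only in the first user turn and only the final assistant reply is supervised. This is an illustration, not MetaAI's data pipeline.

```python
def build_gatt_example(inst, turns, drop_after_first=True):
    """Construct a GAtt-style training example from a multi-turn dialogue.
    `turns` is [(u1, a1), ..., (un, an)]. The instruction is conceptually attached
    to every user turn; at training time it is kept only in the first turn, and
    only the final assistant reply contributes to the loss (earlier turns masked)."""
    inputs, loss_mask = [], []
    for i, (user, assistant) in enumerate(turns):
        user_text = f"{inst} {user}" if (i == 0 or not drop_after_first) else user
        inputs += [("user", user_text), ("assistant", assistant)]
        # only the last assistant turn is supervised; earlier turns get loss 0
        loss_mask += [0, 1 if i == len(turns) - 1 else 0]
    return inputs, loss_mask

turns = [("Hi!", "Ahoy, how can I help?"), ("Tell me a joke.", "Why did ...")]
inputs, mask = build_gatt_example("Always answer as a pirate.", turns)
for (role, text), m in zip(inputs, mask):
    print(m, role, text)
```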

Summary

Llama2 comes in 7B, 13B, 34B, and 70B sizes, which covers the practical range; unfortunately, the release of the most anticipated 34B model has been postponed.

Domestic (Chinese) open-source base models are still mostly at the 6B, 7B, and 13B scale; a 33B-34B model is exactly what's needed.

With more and more open-source and commercially usable models, the large-model community will become more and more prosperous, which is good news for small and medium-sized companies. Open source is the real hero.

I'd like to follow MetaAI down the open-source AI route.


