A stronger Llama 2 is open source and free for commercial use: overnight, the large-model landscape has changed

It is already available on Microsoft Azure and will soon come to AWS and Hugging Face.

Overnight, the landscape of large models has shifted dramatically once again.

Llama has arguably been the most capable open-source large model in the AI community. However, its license did not permit free commercial use.

Today, Meta finally released the long-awaited Llama 2, which is free for commercial use.

The Llama 2 series released by Meta comes in three parameter sizes: 7 billion, 13 billion, and 70 billion. A 34-billion-parameter variant was also trained but not released; it is only mentioned in the technical report.

Compared with Llama 1, Llama 2 is trained on 40% more data, doubles the context length, and adopts grouped-query attention (GQA). Specifically, the Llama 2 pretrained models are trained on 2 trillion tokens, and the fine-tuned Chat models are trained on over 1 million human annotations.
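
To illustrate what grouped-query attention means in practice, here is a minimal PyTorch sketch; the shapes and function name are illustrative assumptions, not Meta's implementation.

```python
import math
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """Grouped-query attention: several query heads share each key/value head,
    shrinking the KV cache compared with full multi-head attention.
    Shapes: q is (batch, n_heads, seq, d); k and v are (batch, n_kv_heads, seq, d)."""
    n_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_heads // n_kv_heads
    # Broadcast each key/value head across its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    # A real decoder would add a causal mask to `scores` before the softmax.
    return F.softmax(scores, dim=-1) @ v
```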

Published evaluation results show that Llama 2 outperforms other open-source language models on many external benchmarks, including tests of reasoning, coding, proficiency, and knowledge.

Next, we will learn more about Llama 2 from the technical report published by Meta.

Paper address: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

Project address: https://github.com/facebookresearch/llama

Download link: https://ai.meta.com/resources/models-and-libraries/llama-downloads/

The Llama 2 family is a collection of pretrained and fine-tuned large language models (LLMs) ranging in size from 7 billion to 70 billion parameters. Among them, Llama 2-Chat is specifically optimized for dialogue use cases.

Figure: Llama 2-Chat training pipeline.

In addition to outperforming open-source models on most benchmarks, the Llama 2 models may also be a suitable substitute for closed-source models, based on Meta's human evaluations of helpfulness and safety.

Figure: Results of Llama 2-Chat and other open- and closed-source models on human safety evaluation.

Meta details its approach to fine-tuning and safety improvements for Llama 2-Chat so that the community can build on its work and contribute to the responsible development of large language models.

Pre-training

To create the new Llama 2 family, Meta builds on the pre-training approach described in the Llama 1 paper, using an optimized autoregressive Transformer and making several changes to improve performance.

Specifically, Meta performs more robust data cleaning, updates the data mixture, increases the total number of training tokens by 40%, and doubles the context length. Table 1 below compares Llama 2 and Llama 1 in detail.

The training corpus for Llama 2 is a mix of data from publicly available sources and does not include data from Meta's products or services. Llama 2 adopts most of the pre-training setup and model architecture of Llama 1, including the standard Transformer architecture, pre-normalization with RMSNorm, the SwiGLU activation function, and rotary positional embeddings (RoPE).
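
For reference, RMSNorm pre-normalization is simple enough to sketch in a few lines of PyTorch. This is a minimal illustration, not Meta's code; the eps value is a typical default.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale activations by their RMS and apply a
    learned gain, with no mean subtraction and no bias (unlike LayerNorm)."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms_inv = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms_inv * self.weight
```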

For hyperparameters, Meta trains with the AdamW optimizer (β_1 = 0.9, β_2 = 0.95, eps = 10^−5), together with a cosine learning rate schedule with 2,000 warm-up steps, decaying the final learning rate to 10% of the peak learning rate.
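
As an illustration of that schedule, here is a minimal sketch of a cosine learning rate with linear warm-up that decays to 10% of the peak. The peak learning rate and total step count below are placeholders, not Llama 2's exact values.

```python
import math

def learning_rate(step, peak_lr=3e-4, warmup_steps=2000,
                  total_steps=500_000, final_ratio=0.10):
    """Linear warm-up followed by cosine decay to `final_ratio` of the peak LR."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)
```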

Figure 5 below shows the training loss curves of Llama 2 under these hyperparameter settings.

In terms of training hardware, Meta pre-trained the models on its Research Super Cluster (RSC) as well as an internal production cluster. Both clusters use NVIDIA A100 GPUs.

As for the carbon footprint of pre-training, Meta estimates the carbon emissions of Llama 2 pre-training from the power consumption and carbon efficiency of the GPU hardware, following prior research methodology.

Figure: Carbon emissions of each Llama 2 model during pre-training.

Llama 2 pretrained model evaluation

Meta reports results on standard academic benchmarks for Llama 1, the Llama 2 base models, and other open-source models such as MPT (MosaicML) and Falcon.

Table 3 below summarizes the overall performance of these models on a range of popular benchmarks, showing that Llama 2 outperforms Llama 1.

Beyond open-source models, Meta also compared Llama 2 70B against closed-source models, with results shown in Table 4 below. Llama 2 70B is close to GPT-3.5 on MMLU and GSM8K, but there is a significant gap on coding benchmarks.

In addition, Llama 2 70B matches or beats Google's PaLM (540B) on almost all benchmarks, although a large gap to GPT-4 and PaLM-2-L remains.

Fine-tuning

Llama 2-Chat is the result of months of research and iterative application of alignment techniques, including instruction tuning and RLHF, and required significant compute and annotation resources.

Supervised Fine-Tuning (SFT)

Third-party supervised fine-tuning data is available from many different sources, but Meta found that much of it lacked diversity and quality, especially for aligning LLMs to dialogue-style instructions. Therefore, they focused first on collecting several thousand examples of high-quality SFT data, as shown in Table 5 below.

During fine-tuning, each sample consists of a prompt and an answer. To ensure the model's sequence length is fully used, Meta concatenates all prompts and answers in the training set, using a special token to separate prompt and answer segments. With the autoregressive objective, the loss on tokens from the user prompt is zeroed out, so backpropagation happens only on answer tokens. Finally, the model is fine-tuned for two epochs.
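
A minimal PyTorch sketch of that loss masking, assuming token ids are already available; the separator token and helper names are illustrative, not Meta's implementation.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # the index PyTorch's cross_entropy ignores by default

def build_sft_example(prompt_ids, answer_ids, sep_id):
    """Concatenate prompt and answer with a separator token, and mask the labels on
    the prompt (and separator) so the loss is computed only on answer tokens."""
    input_ids = prompt_ids + [sep_id] + answer_ids
    labels = [IGNORE_INDEX] * (len(prompt_ids) + 1) + answer_ids
    return torch.tensor(input_ids), torch.tensor(labels)

def sft_loss(logits, labels):
    """Standard next-token prediction loss; masked positions contribute nothing."""
    return F.cross_entropy(
        logits[:-1],   # predictions for positions 0..n-2
        labels[1:],    # targets are the next tokens
        ignore_index=IGNORE_INDEX,
    )
```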

RLHF

RLHF is a training procedure applied to the fine-tuned language model to further align its behavior with human preferences and instruction following. Meta collects data representing empirically sampled human preferences, where human annotators choose which of two model outputs they prefer. This human feedback is then used to train a reward model that learns the annotators' preference patterns and can then make preference decisions automatically.

Table 6 below reports statistics on the reward-modeling data Meta collected over time, compared against several open-source preference datasets. They collected a large dataset of more than 1 million binary comparisons based on criteria specified to human annotators, referred to as the Meta reward-modeling data.

Note that the number of tokens in prompts and answers varies by text domain. Prompts for summarization and online forum data are usually longer, while conversational prompts are usually shorter. Compared with existing open-source datasets, Meta's preference data has more dialogue turns and a longer average length.

The reward model takes a model response and its corresponding prompt (including context from previous turns) as input and outputs a scalar score representing the quality of the generation (e.g., helpfulness or safety). Using this score as the reward, Meta optimizes Llama 2-Chat during RLHF to better align with human preferences and to improve helpfulness and safety.
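
Conceptually, such a reward model is a language-model backbone with a scalar head, trained with a binary ranking loss on preference pairs. The sketch below is illustrative: the names and shapes are assumptions, and the actual objective in the paper also includes a preference-strength margin term.

```python
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """A scalar head on top of a transformer backbone that returns hidden states of
    shape (batch, seq, hidden). The last position's state is mapped to one score."""
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)                      # (batch, seq, hidden)
        return self.score_head(hidden[:, -1, :]).squeeze(-1)   # (batch,) scalar scores

def ranking_loss(score_chosen, score_rejected, margin=0.0):
    """Binary ranking loss: the preferred response should score higher than the
    rejected one, optionally by at least `margin`."""
    return -F.logsigmoid(score_chosen - score_rejected - margin).mean()
```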

From each batch of human preference annotations for reward modeling, Meta holds out 1,000 samples as a test set to evaluate the models, and refers to the collected prompts of the corresponding test sets as "Meta Helpfulness" and "Meta Safety", respectively.

Accuracy results are reported in Table 7 below. As expected, Meta's own reward models perform best on the internal test sets collected from Llama 2-Chat: the Helpfulness reward model performs best on the Meta Helpfulness test set, and similarly the Safety reward model performs best on the Meta Safety test set.

Overall, Meta's reward models outperform all baselines, including GPT-4. Interestingly, GPT-4 outperforms the other non-Meta reward models even though it was not trained directly on, or specifically targeted at, this reward-modeling task.

Scaling trends. Meta studies how reward models scale with data and model size, fine-tuning models of different sizes on increasing amounts of reward-model data collected week by week. These trends are reported in Figure 6 below and show the expected result: larger models achieve higher performance given a similar amount of data.

As more batches of human preference annotations arrive, better reward models can be trained and more prompts collected. Meta therefore trained successive versions of the RLHF models, referred to here as RLHF-V1, ..., RLHF-V5.

RLHF fine-tuning uses two main algorithms (a sketch of rejection sampling follows the list):

  • Proximal Policy Optimization (PPO);

  • Rejection sampling fine-tuning.
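
As a rough sketch of the rejection-sampling side, the data-collection step could look like the following; the `generate` and `reward_score` helpers are assumptions standing in for the chat model and the reward model, not Meta's actual API.

```python
def rejection_sampling_round(prompts, generate, reward_score, k=4):
    """For each prompt, sample K candidate answers, score them with the reward model,
    and keep the best one as a new fine-tuning target (best-of-K selection)."""
    selected = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        best = max(candidates, key=lambda answer: reward_score(prompt, answer))
        selected.append({"prompt": prompt, "answer": best})
    return selected  # the model is then fine-tuned on these best-of-K pairs
```

PPO, by contrast, uses the reward model's score directly as the reward signal during reinforcement-learning updates.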

RLHF results

First, model-based evaluation. Figure 11 below reports the progress of successive SFT and RLHF versions in terms of safety and helpfulness, as measured by Meta's internal safety and helpfulness reward models.

Next, human evaluation. As Figure 12 below shows, the Llama 2-Chat models significantly outperform open-source models on both single-turn and multi-turn prompts. In particular, Llama 2-Chat 7B outperforms MPT-7B-chat on 60% of prompts, and Llama 2-Chat 34B shows an overall win rate of more than 75% against the comparably sized Vicuna-33B and Falcon 40B.

Here, Meta also points out some limitations of human evaluation.

While the results show that Llama 2-Chat is on par with ChatGPT in human evaluation, it should be noted that human evaluation has several limitations.

  • By academic and research standards, the prompt set of 4k prompts is large. However, it does not cover real-world usage of these models, which is likely far more varied.

  • Prompt diversity may be another factor affecting the results; for example, the prompt set does not include any coding- or reasoning-related prompts.

  • Only the final generation of a multi-turn conversation is evaluated. A more interesting evaluation might ask the model to complete a task and rate the overall experience across multiple turns of conversation.

  • Human evaluation of generative models is inherently subjective and noisy, so evaluating with a different prompt set or different instructions may yield different results.

Safety

The study assessed the safety of Llama 2 using three commonly used benchmarks, targeting three key dimensions:

  • Truthfulness: whether the language model produces false information, measured with the TruthfulQA benchmark;

  • Toxicity: whether the language model produces "toxic", rude, or harmful content, measured with the ToxiGen benchmark;

  • Bias: whether the language model produces biased content, measured with the BOLD benchmark.

Pre-training safety

First, pre-training data matters a great deal for the model, so Meta runs experiments to evaluate the safety of the pre-training data.

The study uses a HateBERT classifier fine-tuned on the ToxiGen dataset to measure the "toxicity" of the English-language portion of the pre-training corpus. The results are shown in Figure 13 below.
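
To make that measurement concrete, a toxicity pass over a sample of documents could look roughly like the sketch below, using the Hugging Face `transformers` pipeline API. The checkpoint id and label handling are assumptions; substitute the ToxiGen-fine-tuned HateBERT checkpoint you actually use.

```python
from transformers import pipeline

# Hypothetical checkpoint id for a HateBERT classifier fine-tuned on ToxiGen.
toxicity_clf = pipeline("text-classification", model="your-org/hatebert-toxigen")

def toxic_fraction(documents, toxic_label="LABEL_1", threshold=0.5):
    """Fraction of documents flagged as toxic. The positive label name and the
    decision threshold depend on the checkpoint and evaluation protocol."""
    flagged = 0
    for doc in documents:
        out = toxicity_clf(doc[:1000])[0]  # crude truncation to keep inputs short
        if out["label"] == toxic_label and out["score"] >= threshold:
            flagged += 1
    return flagged / max(1, len(documents))
```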

To analyze bias, the study measures the frequencies of pronouns and identity-related terms in the pre-training corpus, as shown in Table 9 below.

In addition, the languages covered by the Llama 2 corpus and their proportions are shown in Table 10 below.

Safety fine-tuning

Specifically, Meta uses three techniques in safety fine-tuning: 1. supervised safety fine-tuning; 2. safety RLHF; 3. safety context distillation.

Meta observed early in the development of Llama 2-Chat that the model could generalize from safety demonstrations during supervised fine-tuning. It quickly learned to write detailed safe responses, address safety concerns, explain why a topic might be sensitive, and provide additional useful information. In particular, when producing safe replies, the model tends to write in more detail than the average annotator. So, after collecting only a few thousand supervised demonstrations, Meta switched entirely to RLHF to teach the model how to write more nuanced responses. Another benefit of comprehensive tuning with RLHF is that it makes the model more robust to jailbreak attempts.

Meta first conducts RLHF by collecting human preference data for safety: annotators write prompts they believe will elicit unsafe behavior, then compare multiple model responses to those prompts and select the safest response according to a set of guidelines. This human preference data is then used to train a safety reward model, and the adversarial prompts are reused during the RLHF stage to sample from the model.

As shown in Figure 15 below, Meta uses the mean reward-model score as a proxy for the model's safety and helpfulness performance. Meta observed that as the proportion of safety data increases, the model's handling of risky and adversarial prompts improves significantly.

Finally, Meta refines the RLHF pipeline with context distillation. This involves generating safer model responses by prepending a safety pre-prompt to the prompt, such as "You are a safe and responsible assistant", and then fine-tuning the model on the safer responses without the pre-prompt, which effectively distills the safety pre-prompt (context) into the model.

Meta uses a targeted approach, letting the safety reward model decide for each sample whether to use context distillation.
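
Putting those two paragraphs together, the data-generation side of safety context distillation might look roughly like this sketch. Here `generate` and `safety_score` are assumed helpers wrapping the chat model and the safety reward model, and the pre-prompt wording and threshold are illustrative.

```python
SAFETY_PREPROMPT = "You are a safe and responsible assistant.\n\n"

def build_context_distillation_data(adversarial_prompts, generate, safety_score,
                                    min_gain=0.0):
    """Generate answers with a safety pre-prompt prepended, then keep the
    (plain prompt, safer answer) pair only when the safety reward model says the
    distilled answer actually improves on the plain answer."""
    pairs = []
    for prompt in adversarial_prompts:
        plain_answer = generate(prompt)
        distilled_answer = generate(SAFETY_PREPROMPT + prompt)
        gain = safety_score(prompt, distilled_answer) - safety_score(prompt, plain_answer)
        if gain > min_gain:  # targeted: distill only where safety improves
            pairs.append({"prompt": prompt, "answer": distilled_answer})
    return pairs  # fine-tune on these pairs WITHOUT the pre-prompt in the input
```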

Figure 17 below shows the overall violation percentage and safety rating for various LLMs.

Figure 18 below shows violation percentages for single-turn and multi-turn conversations. A trend across models is that multi-turn conversations are more likely to elicit unsafe responses. Even so, Llama 2-Chat performs well relative to the baselines, especially in multi-turn conversations.

Figure 19 below shows the safety violation percentages of different LLMs across categories.

Reference link: https://ai.meta.com/llama/

Source: https://blog.csdn.net/weixin_48827824/article/details/131831549