Llama 2's high-profile open-source release shakes up the large-model world! Trained on 2 trillion tokens, yet still can't beat GPT-3.5

Source | Xinzhiyuan (ID: AI-era)

Wake up: Meta has dropped a bombshell, Llama 2! Following the open-sourcing of LLaMA, Meta has now teamed up with Microsoft to open-source Llama 2 with great fanfare, in three sizes: 7B, 13B, and 70B.

According to the report, Llama 2 was trained on 2 trillion tokens, and its context length is 4k, twice that of Llama 1. The fine-tuned models were trained on over 1 million human annotations. Llama 2 blows past many open-source language models, achieving SOTA on reasoning, coding, proficiency, and knowledge tests. Most important of all, this time Llama 2 is available not only for research but also for free commercial use!

After LLaMA was open-sourced in February this year, Meta received more than 100,000 requests for access to the large language model. Unexpectedly, opening up LLaMA instantly set off an explosion of models in the AI community, and a whole family of "alpacas," such as UC Berkeley's Vicuna and Stanford's Alpaca, swarmed out. This time, open-sourcing Llama 2 is a direct challenge to OpenAI and Google. With those two dominating, Meta's move takes a different path in an attempt to reshape the competitive landscape of large-scale AI. LeCun said that free commercial use of Llama 2 will directly change the market structure of large language models.

Deified overnight, yet still short of GPT-3.5

Unexpectedly, at its birth Llama 2 was immediately "deified" by a crowd of netizens.

Meta, your crown is falling off. Even GPT-4 has been pushed off the battlefield.

But objectively speaking, is Llama 2 really all-powerful? According to Nvidia scientist Jim Fan, Llama 2 has not yet reached GPT-3.5's level, mainly because of its weak coding ability.

For more details on Llama 2, Jim Fan put together a TL;DR:

- The training cost of Llama 2 probably exceeds 20 million US dollars. Meta has done an incredible service to the community by releasing the model under a commercially friendly license. AI researchers at big companies were wary of Llama 1 because of its licensing issues, but now I think many of them will jump in and contribute.
- The Meta team ran a human study on 4K prompts to evaluate Llama 2's helpfulness. They use "win rate" as the metric for comparing models, in the spirit of the Vicuna benchmark. The 70B model is roughly on par with GPT-3.5-0301 and significantly outperforms Falcon, MPT, and Vicuna.

- I trust real human ratings more than academic benchmarks.
- Llama 2 is not yet at GPT-3.5's level. On HumanEval, it is not as good as StarCoder or many other models designed specifically for coding. Still, I have no doubt that Llama 2 will improve significantly thanks to its open weights.
- The Meta team spares no effort on AI safety. In fact, almost half of the paper is devoted to safety guardrails, red-teaming, and evaluations. In previous work, balancing helpfulness and safety was very hard; Meta mitigates this by training 2 separate reward models. These models are not open-sourced yet, but they would be extremely valuable to the community.
- Llama 2 will greatly advance multimodal AI and robotics research. These fields need more than black-box access to an API. Until now, researchers have had to convert complex sensory signals (video, audio, 3D perception) into text descriptions before feeding them into an LLM, which is clumsy and loses a lot of information. It would be far more effective to "graft" sensory modules directly onto a powerful LLM backbone.
- The technical report is a masterpiece in itself. Unlike GPT-4's technical report, which shared very little information, Llama 2's spells out the whole recipe in detail, including model details, training stages, hardware, the data pipeline, and the annotation process. For example, the paper offers a systematic analysis of the effects of RLHF, with excellent visualizations.

How was Llama 2 born?

Llama 2's technical report, over 70 pages long, was also released today.

Notably, GenAI makes its first appearance as the team name credited with the model training. Like ChatGPT, Llama 2 went through three stages: pretraining, fine-tuning, and reinforcement learning from human feedback (RLHF). In addition to open-sourcing Llama 2 itself, Meta also fine-tuned a Llama 2-Chat model on top of it.

On the major benchmarks, Llama 2 performs quite well on reasoning and other capabilities.

Next, let's take a look at how Llama 2 was born.

Pretraining

To create the new Llama 2, Meta's researchers started from the pretraining approach of Touvron et al., using an optimized autoregressive Transformer. To push performance further, however, the team made several changes. Specifically, the researchers performed more robust data cleaning, updated the data mixture, increased the total number of training tokens by 40%, doubled the context length, and adopted grouped-query attention (GQA) to improve inference scalability for the larger models (a minimal sketch of the GQA idea follows below). The table below compares the attributes of Llama 2 and Llama 1.
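
The article does not spell out how GQA works, so here is a minimal, illustrative PyTorch sketch of the idea (not Meta's implementation): several query heads share one key/value head, which shrinks the KV cache and speeds up inference for large models.

```python
import torch

def grouped_query_attention(q, k, v):
    """Minimal grouped-query attention (GQA): many query heads share a
    smaller number of key/value heads, shrinking the KV cache at inference.

    q: (batch, n_heads, seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim), n_kv_heads divides n_heads
    """
    n_heads, head_dim = q.shape[1], q.shape[-1]
    group = n_heads // k.shape[1]
    # Repeat each K/V head so every query head in its group can attend to it.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 8 query heads sharing 2 KV heads.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])
```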

For pretraining data, Meta's training corpus includes a new mixture of data from publicly available sources, but no data from Meta's own products or services. The researchers also made an effort to remove data from certain sites known to contain large amounts of personal information about private individuals.

The Meta team trained on 2 trillion tokens of data (as shown in the table above), which offers a good performance-cost trade-off, and up-sampled the most factual sources to increase knowledge and reduce hallucinations.

On training details, the Meta team both carried things over and innovated. The researchers kept most of the pretraining settings and model architecture from Llama 1: the standard Transformer architecture, RMSNorm for pre-normalization, the SwiGLU activation function, and rotary position embeddings. The main architectural differences from Llama 1 are the increased context length and grouped-query attention (GQA), as shown in the table above.

The figure below shows Llama 2's training loss. The researchers compared the training loss of Llama 2 models of different sizes and found that even after pretraining on 2T tokens, the models showed no sign of saturation.
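
For readers unfamiliar with these components, below is a minimal PyTorch sketch of RMSNorm pre-normalization and a SwiGLU feed-forward block as they are commonly implemented in Llama-style architectures; the dimensions and names are illustrative assumptions, not Meta's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization used by Llama-style models: scale by the
    root-mean-square of the activations, with no mean-centering."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """Gated feed-forward block: silu(x W1) * (x W3), projected back by W2."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Toy forward pass with illustrative dimensions.
x = torch.randn(2, 16, 512)
print(SwiGLU(512, 1376)(RMSNorm(512)(x)).shape)  # torch.Size([2, 16, 512])
```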

Evaluation

Next, the researchers report performance results for the Llama 1, Llama 2, MPT, and Falcon models on a set of standard academic benchmarks. In all evaluations the Meta team used an internal evaluation library, reproducing the MPT and Falcon results internally. For those models, the researchers always take the best score between their own evaluation framework and any publicly reported result. In Table 3, the researchers summarize the overall performance of Llama 2 across a suite of popular benchmarks. Here is an overview of those benchmarks:

  • Code: The researchers report the models' average pass@1 scores on HumanEval and MBPP (see the pass@k sketch after this list).

  • Commonsense reasoning: The researchers report the average of PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, and CommonsenseQA, using 7-shot results for CommonsenseQA and 0-shot results for all the other benchmarks.

  • Knowledge: The researchers report the average of 5-shot scores on NaturalQuestions and TriviaQA.

  • Reading comprehension: The researchers report the 0-shot average on SQuAD, QuAC, and BoolQ.

  • Math: The researchers report the average of the GSM8K (8-shot) and MATH (4-shot) benchmark scores at top 1.

  • Popular aggregated benchmarks: The researchers report the overall results for MMLU (5-shot), Big Bench Hard (BBH) (3-shot), and AGIEval (3–5 shot). For AGIEval, the researchers evaluated only the English tasks and report the average.
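
As a point of reference for the Code bullet above: pass@1 scores such as those on HumanEval are conventionally computed with the unbiased pass@k estimator of Chen et al. (2021). The article does not spell this out, so the sketch below reflects standard practice rather than Meta's exact evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples drawn from n generations (c of which are
    correct) passes the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# E.g., 200 samples per problem, 17 of which passed the tests:
print(round(pass_at_k(n=200, c=17, k=1), 4))  # 0.085
```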

As the table above shows, Llama 2 outperforms Llama 1 across the board. Compared with Llama 1-65B in particular, Llama 2-70B improves the MMLU and BBH scores by 5 and 8 points, respectively. The Llama 2-7B and 30B models outperform MPT models of the same size on everything except the code benchmarks. Against Falcon, Llama 2-7B and 34B beat the Falcon-7B and 40B models on all benchmark categories, and the Llama 2-70B model outperforms all open-source models.

Besides comparing against open-source models, the Meta team also compared Llama 2-70B's results with closed-source models. As the table below shows, Llama 2-70B scores close to GPT-3.5 on MMLU and GSM8K, but there is a clear gap on the coding benchmarks.

Llama 2-70B's results are on par with or better than PaLM-540B on almost all benchmarks, but a sizable gap remains between Llama 2-70B and GPT-4 or PaLM-2-L.

Fine-tuning

Llama 2-Chat is the result of several months of research by the Meta team and iterative application of alignment techniques, including instruction fine-tuning and RLHF, which required substantial compute and annotation resources.

Supervised fine-tuning (SFT)

Third-party SFT data is available from many different sources, but the Meta team found that much of it lacked diversity and quality, especially for aligning an LLM to dialogue-style instructions. So the researchers first focused on collecting several thousand high-quality SFT examples, as shown in the figure above. By setting aside the millions of examples from third-party datasets and using fewer but higher-quality examples, the results improved significantly. The researchers found that a total of 27,540 annotations was enough for the SFT stage to reach high quality.

To verify data quality, the researchers carefully examined a set of 180 examples, comparing the human-provided annotations with samples generated by the model. Surprisingly, they found that the outputs sampled from the SFT model were often competitive with the SFT data handwritten by human annotators, which suggests that they could reprioritize and devote more annotation effort to preference-based annotation for RLHF.

For supervised fine-tuning, the researchers used a cosine learning-rate schedule with an initial learning rate of 2 × 10⁻⁵, weight decay of 0.1, a batch size of 64, and a sequence length of 4096 tokens. During fine-tuning, each sample consists of a prompt and an answer. To ensure model sequence lengths are properly filled, the researchers concatenate all prompts and answers from the training set, using a special token to separate the prompt and answer segments. With the autoregressive objective, they zero out the loss on tokens from the user prompt, so backpropagation happens only on answer tokens (a minimal sketch of this loss masking follows below). Finally, the model was fine-tuned for 2 epochs.

Reinforcement learning with human feedback (RLHF)

The data collected by the Meta team represents an empirical sampling of human preferences: human annotators choose which of two model outputs they prefer. This human feedback is then used to train a reward model, which learns the preference patterns of the human annotators and can then automate preference decisions. The team chose a binary comparison protocol over other schemes mainly because it maximizes the diversity of the collected prompts.

The researchers list both the open-source datasets used for reward modeling and the human preference data collected internally. Note that a binary human-preference comparison contains two responses (chosen and rejected) that share the same prompt. Each example consists of a prompt and a response; the latter is the input to the reward model. The researchers report the number of comparisons, the average number of turns per dialogue, and the average number of tokens per example, per prompt, and per response. Human-preference statistics for reward modeling:
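
To make the loss-masking detail concrete, here is a minimal sketch of how a prompt and answer might be concatenated with a separator token while zeroing the loss on prompt tokens. The token ids and helper function are hypothetical, not Meta's pipeline.

```python
import torch

IGNORE_INDEX = -100  # PyTorch's cross-entropy skips tokens with this label

def build_sft_example(prompt_ids, answer_ids, sep_id):
    """Concatenate prompt and answer with a separator token, masking the
    prompt so the loss is backpropagated only on answer tokens."""
    input_ids = prompt_ids + [sep_id] + answer_ids
    labels = [IGNORE_INDEX] * (len(prompt_ids) + 1) + answer_ids
    return torch.tensor(input_ids), torch.tensor(labels)

# Hypothetical token ids for a "prompt" and an "answer" fragment:
input_ids, labels = build_sft_example([5, 8, 13], [21, 34, 55], sep_id=2)
print(labels)  # tensor([-100, -100, -100, -100,   21,   34,   55])
```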

The table below shows the accuracy results for the reward models.

Reward model results

Meta's own reward models performed best on the internal test sets collected for Llama 2-Chat: the helpfulness reward model performs best on the Meta Helpfulness test set, and likewise the safety reward model performs best on the Meta Safety test set. Overall, Meta's reward models outperform all alternatives, including GPT-4. Interestingly, GPT-4 outperforms the other non-Meta models even though it was not trained directly on a reward-modeling task.

From each batch of human preference annotations collected for reward modeling, the researchers held out 1,000 examples as a test set to evaluate the models, and they refer to the union of all prompts for the corresponding test sets as Meta Helpfulness and Meta Safety, respectively. For reference, the researchers also evaluated other publicly available alternatives: SteamSHP-XL based on FLAN-T5-xl, the Open Assistant reward model based on DeBERTa V3 Large, and GPT-4. Note that at inference time, unlike during training, all reward models can predict a scalar for a single output, without needing access to its paired counterpart.

Of course, more data and a larger model generally improve accuracy, and Meta's models do not appear to have saturated on the training data yet, as shown below.
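
Reward models of this kind are typically trained with a binary ranking loss on (chosen, rejected) pairs that share a prompt; the Llama 2 paper additionally describes a margin term based on the preference rating. Below is a minimal sketch under those assumptions, not Meta's code.

```python
import torch
import torch.nn.functional as F

def ranking_loss(r_chosen, r_rejected, margin=0.0):
    """Binary ranking loss for reward modeling: push the scalar score of the
    chosen response above the rejected one, optionally by a margin that
    reflects how strongly the annotator preferred it."""
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Scalar rewards for a batch of (chosen, rejected) pairs:
r_c = torch.tensor([1.2, 0.3])
r_r = torch.tensor([0.4, 0.5])
print(ranking_loss(r_c, r_r, margin=0.5))
```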

For more on RLHF, see the original paper.

Consistent system messages across multiple turns

In dialogue settings, some instructions should apply to the entire conversation, such as "respond concisely" or "act as some public figure." When Llama 2-Chat is given such an instruction, every subsequent response should respect the constraint. However, the initial RLHF models tended to forget the instruction after a few turns of dialogue, as shown in the figure below.

To address this limitation, the Meta team proposes Ghost Attention (GAtt), a very simple method that leverages the fine-tuning data to help the model's attention stay focused across a multi-turn dialogue (an illustrative sketch of the data construction follows below). As the figure below shows, with GAtt applied, the model achieves dialogue control over multiple turns.
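
The article gives no implementation details, but the gist of GAtt is a data-construction trick: attach the system instruction to every user turn when sampling training dialogues, then keep it only in the first turn of the final training sample. The sketch below is an illustrative simplification; among other things, it omits the loss-masking of intermediate turns described in the paper.

```python
def apply_gatt(instruction, turns):
    """Illustrative GAtt-style data construction. `turns` is a list of
    (user_msg, assistant_msg) pairs that were sampled with the instruction
    prepended to every user message, so the answers already respect it;
    the final training sample keeps the instruction only in turn one."""
    training_sample = []
    for i, (user, assistant) in enumerate(turns):
        if i == 0:
            user = f"{instruction}\n{user}"  # instruction survives only here
        training_sample.append((user, assistant))
    return training_sample

dialogue = [("Hi, who are you?", "I am Napoleon Bonaparte."),
            ("What year is it?", "It is the year 1805.")]
print(apply_gatt("Always answer as Napoleon.", dialogue))
```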

The figure below is a visualization of dialogue attention with and without GAtt.

To illustrate how GAtt reshapes attention during fine-tuning, the figure above shows the model's maximum attention activations: the researchers took the maximum activations across the network, binning adjacent tokens together, and the left side of each figure corresponds to the system message. Compared with the model without GAtt (left), the GAtt-equipped model (right) maintains much larger attention activations on the system message for most of the conversation. But while GAtt is useful, its current implementation is crude; more development and iteration on the technique would only further benefit the model.

Results of RLHF

Evaluating LLMs is, of course, a challenging open research problem. Human evaluation, while a decent standard, is complicated by various human-computer-interaction considerations and does not always scale. So, to pick the best-performing model among several candidates at each iteration from RLHF-V1 to V5, Meta's researchers first observed the reward improvements under the latest reward models, to save cost and speed up iteration, and then validated the major model versions with human evaluation.

The figure below shows the evolution of Llama 2-Chat: its win rate versus ChatGPT after several iterations of fine-tuning. The judge on the left is Meta's own reward model, which may be biased in favor of Meta's model; the judge on the right is GPT-4, whose results should be more neutral.

As mentioned above, human evaluation is generally considered the gold standard for judging natural-language-generation models, including dialogue models. To assess the quality of the major model versions, Meta asked human evaluators to rate them on helpfulness and safety. The researchers compared the Llama 2-Chat models with open-source models (Falcon, MPT) as well as closed-source models (ChatGPT and PaLM) on more than 4,000 single-turn and multi-turn prompts. For ChatGPT, the researchers used the gpt-3.5-turbo-0301 model in all generations; for PaLM, the chat-bison-001 model. The evaluation results are shown below:

As the figure shows, the Llama 2-Chat models significantly outperform the open-source models on both single-turn and multi-turn prompts. In particular, Llama 2-Chat 7B beats MPT-7B-chat on 60% of the prompts, and Llama 2-Chat 34B has an overall win rate of over 75% against the similarly sized Vicuna-33B and Falcon-40B. Moreover, the largest model, Llama 2-Chat 70B, is competitive with ChatGPT, with a win rate of 36% and a tie rate of 31.5%. On the Meta researchers' prompt set, the Llama 2-Chat 70B model outperforms the PaLM-bison chat model by a large margin.

Commercial restrictions: no more than 700 million users

Llama 2 is free for commercial use, a first for Meta. However, it is not unconditionally free.

According to the license terms, Meta stipulates that Llama 2's data or outputs may not be used to improve any other LLM, which is similar to OpenAI's terms but uncommon among open-source models. In addition, any product whose monthly active users exceeded 700 million as of June 2023 must apply to Meta for a special commercial license.

Other than the restrictions above, use, reproduction, distribution, copying, and the creation of derivative works and modifications of Llama 2 are royalty-free. For details, see: https://github.com/facebookresearch/llama/blob/main/LICENSE

A powerful alliance: Microsoft is life's biggest winner

If anyone is life's biggest winner here, it is Microsoft. On one hand, it teams up with OpenAI to launch the GPT-4-powered paid version of Office; on the other, it takes Meta's hand and welcomes Llama 2 onto Azure and Windows.

Today, Zuckerberg ("Xiao Zha") posted a photo of himself and Nadella on Instagram. Recalling the photo of Nadella with Sam Altman from the first half of the year, one instantly gets the feeling that OpenAI has been backstabbed.

Add a netizen's caption: Nadella has made a surprising and admirable move between open and closed AI. (A master stroke.)

According to Meta's official blog, the partnership with Microsoft has been taken to the next level, with Microsoft becoming the preferred partner for Llama 2. Llama 2 is available in the Azure AI model catalog, so developers using Microsoft Azure can build with it and take advantage of cloud-native tools for content filtering. It is also optimized to run natively on Windows, giving developers a seamless workflow. In addition, Llama 2 is available through AWS, Hugging Face, and other platforms (a hedged loading sketch follows below). Reportedly, running the 70B Llama 2 model on Amazon AWS costs at least roughly $85,000 per year.
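
As a usage illustration, a Llama 2 chat checkpoint hosted on Hugging Face can be loaded with the standard transformers API. The checkpoint name below is the publicly listed one, but access is gated and must be granted by Meta first.

```python
# pip install transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated checkpoint on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain grouped-query attention in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```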

In addition, Meta announced today that it has teamed up with Qualcomm to bring Llama 2-based capabilities to flagship smartphones and PCs starting in 2024, enabling developers to use the Snapdragon platform's AI capabilities to launch exciting new generative-AI applications.

Netizens are already trying it out, and a Mac can run it. The open-sourcing of Llama 2 set off a carnival in the AI community, with many netizens using Midjourney and various other AI tools to generate alpacas in tribute to this big moment.

Zuckerberg, too, was "deified."

"Xiao Zha, you are my god." The head of Hugging Face said that Meta's influence in open-source AI keeps expanding, with 600+ models released on Hugging Face, such as MusicGen, Galactica, and Wav2Vec.

After Llama 2 was open-sourced, the first order of business was spinning up demos. Confirmed: Llama 2-70B can easily be fine-tuned on a single GPU with 48GB of memory; 70B with 4-bit QLoRA runs without a hitch on an A6000 (a hedged configuration sketch follows below).
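
For context, a 4-bit QLoRA fine-tuning setup of the kind described here typically looks like the sketch below, using the transformers, bitsandbytes, and peft libraries. The checkpoint name, LoRA rank, and target modules are illustrative assumptions, not a verified recipe.

```python
# pip install transformers peft bitsandbytes accelerate
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-70b-hf"  # assumed gated checkpoint name

# Load the base model in 4-bit (NF4) so the 70B weights fit in ~40 GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Train only small low-rank adapters on top of the frozen 4-bit weights.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a tiny fraction of 70B is trainable
```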

Llama 2-7B has been converted to Core ML and runs natively on Mac at ~6.5 tokens per second.

I just got Llama 2 running on my Mac using the latest version of this project: https://github.com/jmorganca/ollama

Many people are asking: how does Llama 2 compare to other popular models? Compared with other models of similar size, Llama 2 is clearly superior, and according to the benchmarks, Llama 2 is the best open-source model!

Reference: https://ai.meta.com/llama/?utm_source=twitter&utm_medium=organic_social&utm_campaign=llama2&utm_content=video

Origin: https://blog.csdn.net/lqfarmer/article/details/131858530