How does the refreshed Llama 2 perform?

Datawhale curated content

Large models · Llama 2 | Source: Machine Heart

Although its performance still falls short of GPT-3.5, the power of open source is immeasurable.

Many people's feeds have likely been flooded by Meta's release of Llama 2. OpenAI research scientist Andrej Karpathy said on Twitter: "For AI and LLMs, this is really a big day. This is the most powerful LLM whose weights are available for everyone to use."

For the open-source community, this large model is "the hope of the whole village." Its appearance will further narrow the gap between open-source and closed-source large models, and give everyone the opportunity to build their own large-model applications on top of it.

Llama 2 has therefore been the focus of the whole community for the past 24 hours: how it performs, how to deploy it, and what impact it may have. To bring everyone up to speed quickly, we have summarized the information in this article.

How exactly does Llama 2 perform?

Before showing the evaluation results, let's first review Llama 2's basic facts:

  • It comes in three sizes: 7 billion, 13 billion, and 70 billion parameters. A 34-billion-parameter variant was also trained but not released; it is only mentioned in the technical report.

  • It was trained on 2 trillion tokens, 40% more training data than Llama 1, and the fine-tuned Chat model was trained on over 1 million human-annotated examples.

  • The supported context length is doubled, from 2,048 to 4,096 tokens.

  • It is free for commercial use, but products with more than 700 million monthly active users must apply to Meta for a separate commercial license.
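The fine-tuned Chat model also expects its input wrapped in a specific prompt template. Here is a minimal sketch of assembling a single-turn prompt, following the tag format used in Meta's reference generation code (treat the exact tags as an assumption to verify against the official llama repository):

```python
def build_llama2_prompt(system: str, user: str) -> str:
    """Assemble a single-turn Llama-2-chat prompt.

    The <<SYS>> / [INST] tags follow Meta's reference generation code;
    verify against the official repo before relying on them.
    """
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system}\n"
        "<</SYS>>\n\n"
        f"{user} [/INST]"
    )

prompt = build_llama2_prompt(
    "You are a helpful assistant.",
    "What can Llama 2 do?",
)
print(prompt)
```

The model's reply is then expected to follow the closing `[/INST]` tag; multi-turn conversations repeat the `[INST] … [/INST]` pair per turn.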

Since the release of Llama 2, the GitHub stars of the Llama project have been approaching 30k.

The paper published by Meta also reveals some of Llama 2's performance:

  • Llama 2 70B scores close to GPT-3.5 on MMLU and GSM8K, but there is a significant gap on coding benchmarks.

  • On almost all benchmarks, Llama 2 70B matches or beats Google's PaLM (540B), although it still lags far behind GPT-4 and PaLM-2-L.

In other words, even Llama 2 70B, the largest variant, has not yet surpassed GPT-3.5, and the gap to GPT-4 is even larger.

To test Llama 2's strength, netizens have been devising all kinds of trials, such as asking it to write a company email based on some "ridiculous" premises.

Some people have already started livestreaming head-to-head comparisons between Llama 2 and other ChatGPT-like products:

During testing, we found some problems. For example, Llama 2 appears to have undergone so much RLHF training that it has become overly cautious:

Machine Heart also got some hands-on experience through a free app built by a developer.

First, we asked what Llama 2 can do. It replied: "I can help you with various tasks, such as answering questions, providing information, and even assisting with daily tasks. I am just an artificial intelligence and cannot perform tasks in the real world, but I can provide you with the information or assistance you need..." As for language: Llama 2 accepts Chinese input, but its answers are mostly in English, and its ability to understand and generate Chinese is limited.

Next, we asked about its training data cutoff. According to Llama 2's answer, its knowledge cutoff is December 2022.

Next, we asked Llama 2 a somewhat unreasonable question. Llama 2 pointed out the flaw in the premise and offered some advice:

However, Llama 2 still struggles with the classic "chickens and rabbits in a cage" arithmetic puzzle.
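For reference, the puzzle the model stumbles on is simple linear arithmetic: given the total number of heads and legs in a cage of chickens (2 legs) and rabbits (4 legs), recover the counts. A minimal Python solver:

```python
def chickens_and_rabbits(heads: int, legs: int):
    """Solve the 'chickens and rabbits in a cage' puzzle.

    With c chickens and r rabbits: c + r = heads and 2c + 4r = legs,
    so r = (legs - 2 * heads) / 2. Returns (chickens, rabbits), or
    None when no non-negative integer solution exists.
    """
    rabbits, remainder = divmod(legs - 2 * heads, 2)
    if remainder != 0 or rabbits < 0 or rabbits > heads:
        return None
    return heads - rabbits, rabbits

# The textbook instance: 35 heads and 94 legs.
print(chickens_and_rabbits(35, 94))  # → (23, 12)
```

The closed-form step is exactly the kind of two-equation elimination that current LLMs often fumble when asked to do it purely in natural language.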

Test address: https://llama-2.replit.app/

On Twitter, the creators of the Vicuna project shared the results of their systematic tests, with the following conclusions:

  • Llama-2 exhibits stronger instruction-following ability, but still lags significantly behind GPT-3.5/Claude in information extraction, coding, and math;

  • Over-sensitivity to safety may lead it to misinterpret user queries;

  • Its chat performance is comparable to leading Llama-1-based models such as Vicuna and WizardLM;

  • Its non-English language skills are limited.

Here are some test data and results:

Which devices can run these models locally?

Because Llama 2 is open-sourced at several sizes, the models are very flexible to deploy locally. If you don't want to send your data over the Internet, local deployment is the best choice, and it can be realized through the MLC-LLM project created by Tianqi Chen et al.:

Project address: https://github.com/mlc-ai/mlc-llm

We covered this project in previous reports. Its goal is to let you "compile and run large language models on any device," including mobile phones, consumer PCs, and web browsers. Supported platforms include:

After the release of Llama 2, project members including Tianqi Chen stated that MLC-LLM now supports local deployment of Llama-2-70B-chat (running it requires an Apple Silicon Mac with about 50 GB of memory). On an M2 Ultra, the decoding speed can reach roughly 10 tokens/second.

Of course, with MLC-LLM, running the smaller Llama 2 variants is even easier: the 7B model runs at about 46 tokens/s on an Apple M2 Max and about 156 tokens/s on an RTX 4090.
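To put those throughput figures in perspective, a back-of-envelope estimate of wall-clock generation time (ignoring prompt-processing/prefill time, which the quoted decode rates do not cover):

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Rough wall-clock time for autoregressive decoding.

    Ignores prefill (prompt processing), so this is a lower bound
    on the real latency.
    """
    return num_tokens / tokens_per_second

# Throughput figures quoted above.
for setup, tps in [
    ("Llama-2-70B-chat on M2 Ultra", 10.0),
    ("7B on M2 Max", 46.0),
    ("7B on RTX 4090", 156.0),
]:
    print(f"{setup}: ~{generation_seconds(500, tps):.1f} s for 500 tokens")
```

So a 500-token answer takes under a minute even for the 70B model on a single machine, which is what makes fully local deployment practical.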

In addition, with the "MLC Chat" app released by Tianqi Chen and others (available on the Apple App Store), we can also try Llama 2 on iPhones and iPads, no internet connection required.

What impact will Llama 2 have?

If Meta had not open-sourced Llama in February this year, you might not know how many names there are for a camelid: derivative projects built on the open-source model have claimed nearly every English word for the animal family. With Meta's iteration of the model to version 2.0, these projects have naturally been pulled to a new starting point.

Less than a day after the release of Llama 2, the developers of LLaVA, a large multimodal model that can process image information in a GPT-4-like fashion, announced an update based on Llama 2. The new version adds support for LLaMA-2, LoRA training on academic GPUs, higher resolution (336x336), and 4-bit/8-bit inference.

Additionally, they released a preview version of a new LLaVA variant based on the latest RLHF-fine-tuned LLaMA-2-Chat checkpoint that provides a longer context window. These new releases support and validate training on the RTX 3090 and RTX A6000, making training large multimodal models easier and more accessible to the wider community.

Of course, this is just the beginning. Over time, models based on Llama 2 will be launched or updated one after another, and a "war of a thousand models" is about to begin.

Regarding the future development and impact of Llama, Jim Fan, a senior AI scientist at Nvidia, also gave his own prediction:

  • Training Llama-2 likely cost more than $20 million. Previously, AI researchers at some large companies were cautious about Llama-1 because of its commercial licensing, but Llama-2's commercial restrictions have been greatly relaxed; many people may now join the Llama camp and contribute.

  • Although Llama-2 has not yet reached GPT-3.5's level and has obvious shortcomings in coding and other areas, its weights are open, so these issues will be improved sooner or later;

  • Llama-2 will greatly advance research in multimodal AI and robotics. These areas need more than black-box API access. Currently, we have to convert complex sensory signals (video, audio, 3D perception) into textual descriptions and then feed them into LLMs, which is clumsy and loses a great deal of information. Grafting perception modules directly onto a powerful LLM backbone will be far more efficient.

The release of Llama 2 also matters greatly for companies developing closed-source large models. If their models are not strong enough, or do not stay well ahead of the open-source Llama 2 and its derivatives, their commercial value will be hard to realize.

If you also have some opinions on the future impact of Llama 2, please leave a message in the comment area.

Origin blog.csdn.net/Datawhale/article/details/131874705