Baby Llama 2

OpenAI scientist Andrej Karpathy spent a weekend building the hit project llama2.c. With help from GPT-4, he implemented inference for a baby Llama 2 model in only about 500 lines of C.

Have you ever thought about running inference on a baby Llama 2 model in nothing but C?

No? You can do it now!

Just this past weekend, OpenAI scientist Andrej Karpathy put together a very interesting project: llama2.c.

The inspiration came from an earlier hit project, llama.cpp.

First, train a smaller Llama 2 model in PyTorch.

Then, run inference in pure C with 500 lines of code and no dependencies.
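
For a sense of what "pure C with no dependencies" means in practice: fp32 transformer inference ultimately boils down to loops over flat float arrays. The snippet below is an illustrative sketch of the kind of matrix-vector kernel such a program spends most of its time in, not an excerpt from run.c itself:

#include <stddef.h>

/* y = W @ x, where W is a (rows x cols) matrix stored row-major as a flat float array.
   An fp32 transformer forward pass spends nearly all of its time in loops like this one. */
void matvec(float *y, const float *W, const float *x, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; c++) {
            acc += W[r * cols + c] * x[c];
        }
        y[r] = acc;
    }
}

Compilers optimize exactly this kind of loop aggressively, which is part of why the -O3 numbers mentioned below are so much higher than the unoptimized build.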

The resulting pre-trained model (trained on TinyStories) generates story samples at about 18 tokens per second in fp32 on a MacBook Air M1 CPU.

Once released, llama2.c quickly gained 1.6k stars on GitHub, and the count is still climbing rapidly.

Project address: https://github.com/karpathy/llama2.c

By the way, Karpathy also thanked GPT-4 for helping him with C, a language he is not very familiar with.

Nvidia scientist Jim Fan remarked that GPT-4 helped Karpathy "raise" a baby Llama in C, calling it amazing.

Netizens also said that using GPT-4 to build llama2.c is the ultimate crossover.

Llama 2 inference in pure C

Karpathy himself may not have expected this llama2.c project to have so much potential.

Surprisingly, you can run inference on these smaller (on the order of ~10MB) models at interactive rates in fp32 on a single CPU thread.

He hasn't tried it with the smallest Meta Llama 2 checkpoint (7B), though, and expects that to be slow.

In narrower domains, such as stories, one can do interesting things with smaller Transformers, Karpathy said.

So this simple, pure-C implementation is still very practical, especially since it is also portable.

Right after that, he compiled with -O3, raising the throughput on the MacBook Air M1 from 18 to 98 tok/s.

Karpathy said he was very happy to be able to run a fairly large model (tens of millions of parameters) at a highly interactive rate with such a simple approach:

"Looks like I'll have to train a bigger model now."

It turns out that my original checkpoint compiled with -O3 was running _way_ faster (100 tok/s) on the MacBook Air M1 than I expected, so I'm now training a much larger 44M model, which should still run interactively. Maybe the 7B Llama model is within reach.

Open-source code

The llama2.c code is already open source.

With it, you can train the Llama 2 LLM architecture from scratch in PyTorch, save the weights to a raw binary file, and load them into a ~500-line C file (run.c), which currently runs fp32 inference on the model.
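
The export/load split is deliberately simple: the PyTorch side writes a small header of integer hyperparameters followed by flat fp32 weight tensors, and the C side reads them straight into memory. The sketch below illustrates the idea; the struct fields and file layout here are illustrative assumptions, not the exact format run.c expects:

#include <stdio.h>
#include <stdlib.h>

/* Illustrative config header: a few ints written by the PyTorch export script. */
typedef struct {
    int dim;        /* transformer width, e.g. 288 */
    int n_layers;   /* e.g. 6 */
    int n_heads;    /* e.g. 6 */
    int vocab_size;
    int seq_len;
} Config;

/* Read the header, then the flat fp32 weights that follow it. */
float *load_weights(const char *path, Config *cfg, long *n_floats) {
    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); exit(1); }
    if (fread(cfg, sizeof(Config), 1, f) != 1) { fprintf(stderr, "bad header\n"); exit(1); }
    long start = ftell(f);
    fseek(f, 0, SEEK_END);
    *n_floats = (ftell(f) - start) / (long)sizeof(float);
    fseek(f, start, SEEK_SET);
    float *w = malloc((size_t)(*n_floats) * sizeof(float));
    if (!w || fread(w, sizeof(float), (size_t)(*n_floats), f) != (size_t)(*n_floats)) {
        fprintf(stderr, "bad weights\n"); exit(1);
    }
    fclose(f);
    return w;
}

Keeping the format a raw, dependency-free dump is what lets the whole loader fit in a few dozen lines of standard C.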

In his cloud Linux development environment, Karpathy runs fp32 inference on a model with dimension 288, 6 layers, and 6 heads (about 15 million parameters) at roughly 100 tok/s, which is much the same as on the M1 MacBook Air.

Feel the magic

Before running a baby Llama 2 model in C, a model checkpoint is first required.

For this, you can download this 15M parameter model (about 58MB) trained on the TinyStories dataset and put it into the default checkpoint directory out:

wget https://karpathy.ai/llama2c/model.bin -P out

Then, compile and run the C code:

gcc -O3 -o run run.c -lm
./run out/model.bin

As you can see, this just streams raw token ids; to read them, they need to be converted into text.

Unfortunately, that decoding is currently only possible via a simple Python wrapper (about 30 lines of code):

pip install sentencepiece
python run_wrap.py

On an M1 MacBook Air, it runs at about 100 tokens per second, which is not bad for super simple fp32 single-threaded C code.

Example output:

Once upon a time, there was a boy named Timmy. Timmy loved to play sports with his friends. He was very good at throwing and catching balls. One day, Timmy's mom gave him a new shirt to wear to a party. Timmy thought it was impressive and asked his mom to explain what a shirt could be for. "A shirt is like a special suit for a basketball game," his mom said. Timmy was happy to hear that and put on his new shirt. He felt like a soldier going to the army and shouting. From that day on, Timmy wore his new shirt every time he played sports with his friends at the party. Once upon a time, there was a little girl named Lily. She loved to play outside with her friends. One day, Lily and her friend Emma were playing with a ball. Emma threw the ball too hard and it hit Lily's face. Lily felt embarrassed and didn't want to play anymore. Emma asked Lily what was wrong, and Lily told her about her memory. Emma told Lily that she was embarrassed because she had thrown the ball too hard. Lily felt bad achieved tok/s: 98.746993347843922


Usage guide

In theory, it should be possible to load the weights released by Meta, but even for the smallest 7B model, inference with this simple single-threaded C program is not expected to be fast.
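
A rough back-of-envelope estimate shows why (assuming fp32 weights and that generating each token has to read every weight once):

7B parameters x 4 bytes ≈ 28 GB of weights touched per generated token
28 GB divided by a few tens of GB/s of memory bandwidth ≈ several hundred milliseconds per token

So even before counting compute, a single-threaded fp32 loop would manage at best a few tokens per second, and 28 GB of weights would not fit in the RAM of a typical laptop anyway.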

So in this repo, we focus on a narrower application domain and train the same architecture from scratch.

First, download and pretokenize a source dataset, such as TinyStories:

python tinystories.py download
python tinystories.py pretokenize

Then, train the model:

python train.py

See the train.py script for more information on launch options and hyperparameter overrides. Karpathy expects that even simple hyperparameter exploration should yield a better model, so he has not tuned it.

If you want to skip model training, just download Karpathy's pre-trained model and save it to the out directory, and you can do a simple demo:

wget https://karpathy.ai/llama2c/model.bin -P out

Once you have the model.bin file, you can run inference in C.

First, compile the C code:

gcc -O3 -o run run.c -lm

Then, run:

./run out/model.bin

Note that this only outputs SentencePiece token ids. To decode them into text, run this simple wrapper script:

python run_wrap.py

Alternatively, run the PyTorch inference script for comparison (add model.ckpt to the /out directory):

python sample.py

This gives the same results. More detailed tests live in test_all.py; run them as follows:

$ pytest

Currently, you need two files for testing or sampling: the model.bin file and the model.ckpt file from the previous PyTorch training.

(How to run the tests without downloading 200MB of data is still an open item.)

To-do list

- Why can't SentencePiece iteratively decode correctly?

- It would be nice to delete the run_wrap.py file and convert tokens to strings directly in C (a sketch of what that could look like follows this list)

- Support multi-query attention? It doesn't seem very useful for smaller models running on CPU

- Support inference beyond max_seq_len steps, which requires thinking about the KV cache

- Why is MFU so low (only about 10%) when training on my A100 40GB GPU?

- Weird errors with torch.compile and wandb when using DDP

- Add better tests to reduce yolo
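
On that second item (replacing run_wrap.py with pure-C decoding), the conversion itself is straightforward once the vocabulary strings are in memory. Here is a minimal sketch, assuming a hypothetical vocab table has already been loaded; it is not code from the repo, which currently relies on SentencePiece via Python:

#include <stdio.h>

/* Print generated token ids as text, given a vocab table mapping id -> string.
   The vocab array is a stand-in for whatever a tokenizer export would provide;
   handling of SentencePiece's special whitespace markers is left out of this sketch. */
void print_tokens(const int *tokens, int n_tokens, char **vocab, int vocab_size) {
    for (int i = 0; i < n_tokens; i++) {
        int id = tokens[i];
        if (id < 0 || id >= vocab_size) continue;  /* skip anything out of range */
        fputs(vocab[id], stdout);
    }
    fputc('\n', stdout);
}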

Hot discussion among netizens

Riding the enthusiasm around llama2.c, one netizen compiled it with Emscripten and ran it in a web page.

He modified the code to predict one token per render, and the page automatically loads 50MB of model data.

 

In addition, he added support for de-tokenization.

Some netizens asked whether, following the success of llama.cpp, the industry is moving toward a separate, dedicated source implementation for each released model rather than a general framework like PyTorch/TensorFlow/ONNX Runtime.

What is the significance of llama2.c?

One netizen gave a vivid example: imagine a computer game about a small island with 100 inhabitants, each of them conscious, with llama2.c as their brain. You could then simulate a thousand years of history and see what happens.

References:

https://github.com/karpathy/llama2.c
