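llama2.c: A Full-Stack Tool for Training and Running Vertical-Domain LLMs

llama2.c is a minimalist full-stack tool for the Llama 2 LLM, well suited to building large language models for niche vertical domains.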

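1. Introduction

With the code in this repository, you can train the Llama 2 LLM architecture from scratch in PyTorch, export the weights to a binary file, and load that file into a single C file of about 500 lines (run.c) that runs inference on the model. Alternatively, you can load, fine-tune, and run inference on Meta's Llama 2 (though this is still being actively fleshed out).

This repo is therefore a "full-stack" training + inference solution for the Llama 2 LLM, with a focus on minimalism and simplicity. You might think you need LLMs with billions of parameters to do anything useful, but in fact very small LLMs can perform surprisingly well if you make the domain narrow enough. I recommend looking at the TinyStories paper for inspiration.

Note that this started only recently as a fun weekend project: I took my earlier nanoGPT, adapted it to implement the Llama 2 architecture instead of GPT-2, and the heart of the work was writing the C inference engine in run.c. The project is therefore still young and moving quickly. Hat tip to the excellent llama.cpp for inspiring it. I wanted something super simple, so I chose to hard-code the Llama 2 architecture, stick to fp32, and ship just one pure C inference file with no dependencies.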
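
The export-then-load flow described above can be illustrated with a minimal Python sketch. The file layout used here (a header of two 32-bit ints followed by raw float32 values) and the names export_weights and load_weights are hypothetical, chosen only for illustration; the actual layout that run.c expects is defined in the repository and may differ.

```python
import struct

def export_weights(path, dim, n_layers, weights):
    """Write a hypothetical header (two int32) followed by float32 weights."""
    with open(path, "wb") as f:
        # Illustrative header only; not the real llama2.c header.
        f.write(struct.pack("<ii", dim, n_layers))
        f.write(struct.pack(f"<{len(weights)}f", *weights))

def load_weights(path):
    """Read the header back, then the flat float32 weight array."""
    with open(path, "rb") as f:
        dim, n_layers = struct.unpack("<ii", f.read(8))
        data = f.read()
    # Each float32 occupies 4 bytes, so the count follows from the length.
    weights = list(struct.unpack(f"<{len(data) // 4}f", data))
    return dim, n_layers, weights
```

A C reader for such a file would do the mirror image: read the header ints, then read the remaining bytes into a float array.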

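2. Feel the Magic

Let's run a small Llama 2 model in C. You need a model checkpoint. Download this 15M-parameter model that I trained on the TinyStories dataset (about a 58MB download) and place it in the default checkpoint directory, out: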

wget https://karpathy.ai/llama2c/model.bin -P out

(If this doesn't work, try Google Drive.) Compile and run the C code:

gcc -O3 -o run run.c -lm
./run out/model.bin

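You will see a stream of sampled text. On my M1 MacBook Air this runs at about 110 tokens/s. See the Performance section or the Makefile for compile flags that can speed this up significantly.

Sample output: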

Once upon a time, there was a boy named Timmy. Timmy loved to play sports with his friends. He was very good at throwing and catching balls. One day, Timmy's mom gave him a new shirt to wear to a party. Timmy thought it was impressive and asked his mom to explain what a shirt could be for. "A shirt is like a special suit for a basketball game," his mom said. Timmy was happy to hear that and put on his new shirt. He felt like a soldier going to the army and shouting. From that day on, Timmy wore his new shirt every time he played sports with his friends at the party. Once upon a time, there was a little girl named Lily. She loved to play outside with her friends. One day, Lily and her friend Emma were playing with a ball. Emma threw the ball too hard and it hit Lily's face. Lily felt embarrassed and didn't want to play anymore. Emma asked Lily what was wrong, and Lily told her about her memory. Emma told Lily that she was embarrassed because she had thrown the ball too hard. Lily felt bad
achieved tok/s: 129.146172

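Update: I have uploaded a larger checkpoint. This one is a Transformer with dimension 512, 8 layers, 8 heads, and a context length of 1024, for ~44M parameters. It was trained for 200K iterations at batch size 32 on 4x A100 40GB GPUs in about 8 hours.

You can use this larger and more capable checkpoint like so: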

wget https://karpathy.ai/llama2c/model44m.bin -P out44m
./run out44m/model44m.bin

It still runs at interactive rates and samples more coherent and diverse stories:

Once upon a time, there was a little girl named Lily. She loved playing with her toys on top of her bed. One day, she decided to have a tea party with her stuffed animals. She poured some tea into a tiny teapot and put it on top of the teapot. Suddenly, her little brother Max came into the room and wanted to join the tea party too. Lily didn't want to share her tea and she told Max to go away. Max started to cry and Lily felt bad. She decided to yield her tea party to Max and they both shared the teapot. But then, something unexpected happened. The teapot started to shake and wiggle. Lily and Max were scared and didn't know what to do. Suddenly, the teapot started to fly towards the ceiling and landed on the top of the bed. Lily and Max were amazed and they hugged each other. They realized that sharing was much more fun than being selfish. From that day on, they always shared their tea parties and toys.

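Update 2: A 110M-parameter model is now available as well; see the Models section.

3. Meta's Llama 2 Models

Because the neural network architecture is identical, we can also run inference on the Llama 2 models released by Meta. Unfortunately there is some friction here due to licensing (I don't think I can upload the checkpoints directly). So, step 1: get the Llama 2 checkpoints by following Meta's instructions. Once we have those checkpoints, we have to convert them into the llama2.c format.

For this we use the export_meta_llama_bin.py file, e.g. for the 7B model: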

python export_meta_llama_bin.py path/to/llama/model/7B llama2_7b.bin

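The export takes about 10 minutes and produces a 26GB file named llama2_7b.bin in the current directory (the weights of the 7B model in float32). Reportedly, despite efforts, the 13B export currently does not work, for unknown reasons (PRs with a fix are welcome). We can run the model as usual: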

./run llama2_7b.bin
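
The file sizes quoted in this article are consistent with storing every parameter as a 4-byte float32. The following is a rough check only: it uses the nominal parameter counts from the text (15M and 7B) rather than exact counts, ignores any file header, and the helper names are made up for this example.

```python
BYTES_PER_FLOAT32 = 4

def fp32_size_bytes(n_params):
    """Size of n_params weights stored as raw float32."""
    return n_params * BYTES_PER_FLOAT32

def to_mib(n_bytes):
    return n_bytes / 2**20

def to_gib(n_bytes):
    return n_bytes / 2**30

# Nominal counts taken from the article, not exact parameter counts.
print(f"15M model: about {to_mib(fp32_size_bytes(15_000_000)):.0f} MiB")
print(f"7B model:  about {to_gib(fp32_size_bytes(7_000_000_000)):.0f} GiB")
```

Both estimates land close to the quoted 58MB download and 26GB export, which suggests the files are dominated by the raw fp32 weights.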

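On my cloud Linux CPU machine, compiled with OpenMP and run on 96 threads, this runs at about 4 tokens/s. (On my MacBook Air M1, if you just build with make runfast, it is currently closer to 30 seconds per token.) Sample output: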

The purpose of this document is to highlight the state-of-the-art of CoO generation technologies, both recent developments and those in commercial use. The focus is on the technologies with the highest merit to become the dominating processes of the future and therefore to be technologies of interest to S&T ... R&D. As such, CoO generation technologies developed in Russia, Japan and Europe are described in some depth. The document starts with an introduction to cobalt oxides as complex products and a short view on cobalt as an essential material. The document continues with the discussion of the available CoO generation processes with respect to energy and capital consumption as well as to environmental damage.

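Base models... ¯\_(ツ)_/¯. Since we can run inference on the base model, it should also be easy to run inference on the chat model and hold a conversation with it. And if we can find a way to run 7B more efficiently, we can start adding LoRA to our training script and do all the fine-tuning within the repository!

4. Models

For examples of smaller, from-scratch models, I trained several models on TinyStories and list them below. All of them were trained on my setup (4x A100 40GB GPUs) in a few hours, except the 110M model, which took about 24 hours.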

Model	Dim	Layers	Heads	Max context length	Parameters	Validation loss	Download
OG	288	6	6	256	15M		model.bin
44M	512	8	8	1024	44M		model44m.bin
110M	768	12	12	1024	110M	0.7601	model110m.bin
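
The parameter counts in the table can be roughly reproduced from the dimension and the layer count. The sketch below is an estimate under assumptions that are not stated in this article: a vocabulary of 32,000 tokens, input and output embeddings that share one weight matrix, an MLP hidden size of 2/3 * 4 * dim rounded up to a multiple of 32, as many key/value heads as query heads, and no biases. The function name estimate_params is made up for this example.

```python
def estimate_params(dim, n_layers, vocab_size=32000, multiple_of=32):
    """Estimate the parameter count of a Llama-style Transformer (assumptions above)."""
    # Assumed MLP hidden size: 2/3 of 4*dim, rounded up to a multiple of `multiple_of`.
    hidden = int(2 * (4 * dim) / 3)
    hidden = multiple_of * ((hidden + multiple_of - 1) // multiple_of)

    embedding = vocab_size * dim   # shared with the output classifier (assumed)
    attention = 4 * dim * dim      # query, key, value and output projections
    mlp = 3 * dim * hidden         # three projections of the SwiGLU MLP
    norms = 2 * dim                # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    return embedding + n_layers * per_layer + dim  # + final RMSNorm

for name, dim, n_layers in [("OG", 288, 6), ("110M", 768, 12)]:
    print(f"{name}: ~{estimate_params(dim, n_layers) / 1e6:.1f}M parameters")
```

Under these assumptions the estimates come out at about 15.2M and 109.5M, in line with the OG and 110M rows.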

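You will notice that the 110M model is equivalent in size to GPT-1. Alternatively, it is also the size of the smallest model in the GPT-2 series (GPT-2 small), except that the max context length is only 1024 instead of 2048. The only notable changes from the GPT-1/2 architecture are that Llama uses RoPE relative positional embeddings instead of absolute/learned positional embeddings, a fancier SwiGLU non-linearity in the MLP, RMSNorm instead of LayerNorm, bias=False on all Linear layers, and optional multi-query attention (not yet supported in llama2.c).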
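
To make two of these differences concrete, here is a minimal pure-Python sketch of RMSNorm and of the SwiGLU gating used in the MLP. It operates on plain lists, the eps default is an assumption, and the projection matrices of the real MLP are omitted: swiglu_gate takes the two already-projected vectors as inputs. It illustrates the formulas, not the repository's implementation.

```python
import math

def rmsnorm(x, weight, eps=1e-5):
    """Scale x by the inverse of its root mean square, then by a learned weight.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * scale for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_gate(gate, up):
    """Element-wise SwiGLU gating: silu(gate) * up."""
    return [silu(g) * u for g, u in zip(gate, up)]

print(rmsnorm([3.0, 4.0], [1.0, 1.0]))
print(swiglu_gate([0.0, 1.0], [5.0, 2.0]))
```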

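5. Training

Let's see how to train a small Llama 2 from scratch using the code in this repository. First, download and pretokenize a source dataset. I like TinyStories, so it is the only example currently available in this repo, but adding other datasets should be easy; see the code.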

python tinystories.py download
python tinystories.py pretokenize

Then train the model:

python train.py

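6. Brief Training Guide

See the train.py script for more exotic launches and hyperparameter overrides. Here is a brief guide to setting the parameters.

Look at the table at the very end of the Chinchilla paper to get a sense of how the Transformer parameters (dim, n_layers, n_heads) grow or shrink together. Extrapolate or interpolate this pattern to get bigger or smaller transformers.

Set the max context length according to your problem: this should be the maximum number of tokens that matter for predicting the next token. For example, Llama 2 uses 2048.

Next, for medium-sized applications you want the total batch size per update (printed by the script as "tokens per iteration will be:") to be around 100K tokens. For tiny applications it can be lower; for large training runs (e.g. GPTs/LLamas) it is usually around 0.5M or even more.

To get there, first set batch_size to the largest value your system allows (e.g. mine was 16 in a recent run, because beyond that my GPU runs out of memory), then increase gradient_accumulation_steps as high as needed to reach a total batch size of about 100K. Finally, tune your learning rate (LR).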
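
The batch-size recipe above can be sketched as a small helper. This is an illustration, not code from train.py: the function name pick_grad_accum is made up, and it assumes a single training process, so with several GPUs the total per update also depends on how the script splits the accumulation steps across devices.

```python
import math

def pick_grad_accum(batch_size, seq_len, target_tokens=100_000):
    """Return (steps, tokens_per_update) with at least target_tokens per update."""
    tokens_per_micro_batch = batch_size * seq_len
    steps = max(1, math.ceil(target_tokens / tokens_per_micro_batch))
    return steps, steps * tokens_per_micro_batch

# batch_size 16 is the largest that fit in memory in the example above.
steps, tokens = pick_grad_accum(batch_size=16, seq_len=1024)
print(f"gradient_accumulation_steps={steps}, tokens per update={tokens}")
```

Rounding up keeps the update at or above the target; rounding further up to a convenient number is equally reasonable.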

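You want the learning rate to be as high as your training allows. Very small networks can get away with a large LR (e.g. 1e-3 or even higher). Large networks need lower LRs. 3e-4 is a safe choice in most medium-sized applications, but it can be too low for small networks, so try increasing it!

Finally, max_iters is the length of training. Play with different settings. I mostly only ever tune these parameters and leave most of the others unchanged.

Here is an example of how I trained the 110M model, which I don't think is anywhere near optimal but which looked sensible to me: dim 768, n_layers 12, n_heads 12 (so each head has 768 / 12 = 64 channels), seq len 1024, batch size 16 (the most that fits on my A100 40GB GPU), and gradient_accumulation_steps = 8, which makes the total token batch size 16 batch size * 1024 tokens per sequence * 8 grad_accum = 131,072 tokens per update. Good. Learning rate 4e-4 (probably a little too low). max_iters 200K (probably a bit too high). Dropout 0.1, as that usually helps a bit at medium size. That's it. I ran this with Distributed Data Parallel (DDP) on 4 GPUs on my cloud machine, and training took about a day.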
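
The arithmetic in this example can be checked with a few lines of Python. RunConfig is a hypothetical container written for this check, not the configuration object used by train.py, and the total token count assumes that 131,072 is the full per-update batch, as the paragraph above states.

```python
from dataclasses import dataclass

@dataclass
class RunConfig:
    """Hypothetical container for the 110M settings quoted above."""
    dim: int = 768
    n_heads: int = 12
    seq_len: int = 1024
    batch_size: int = 16
    gradient_accumulation_steps: int = 8
    max_iters: int = 200_000

    @property
    def head_size(self):
        # The model dimension must split evenly across the heads.
        assert self.dim % self.n_heads == 0
        return self.dim // self.n_heads

    @property
    def tokens_per_update(self):
        return self.batch_size * self.seq_len * self.gradient_accumulation_steps

    @property
    def total_tokens(self):
        # Tokens processed over the whole run (repeats across epochs included).
        return self.tokens_per_update * self.max_iters

cfg = RunConfig()
print(cfg.head_size, cfg.tokens_per_update, cfg.total_tokens)
```

The last number shows that a 200K-iteration run at this batch size processes roughly 26 billion tokens.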

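Totally understandable if you want to skip model training. For a simple demo, just download my pretrained model and save it into the directory out: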

wget https://karpathy.ai/llama2c/model.bin -P out

Once we have the model.bin file, we can run inference in C. First compile the C code:

gcc -O3 -o run run.c -lm

You can now simply run it:

./run out/model.bin

Watch the tokens stream by, fun! We can also run the PyTorch inference script for comparison (to run it, add model.ckpt to /out if you haven't already):

python sample.py

This gives the same results. More detailed testing is done in test_all.py, run as:

$ pytest

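Currently you need two files to test or sample: the model.bin file and the model.ckpt file from a PyTorch training run I did earlier. I still have to think through how to run the tests without downloading 200MB of data.

7. Performance

Note: this guide is not great, because I personally spend most of my time in Python land and don't have a particularly deep understanding of many features and flags of C compilation.

There are many ways to potentially speed up this code depending on your system. Here we document a few together with a high-level guide on what they do. Here again is the default way to compile, but using -O3: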

gcc -O3 -o run run.c -lm

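-O3 includes optimizations that are expensive in terms of compile time and memory usage, including vectorization, loop unrolling, and predicting branches. Here are a few more to try.

-Ofast runs additional optimizations on top of -O3 that may break compliance with the C/IEEE specifications. See the GCC documentation for more information.

-march=native compiles the program for the architecture of the machine you are compiling on rather than a more generic CPU. This can enable additional optimizations and hardware-specific tuning, such as improved vector instructions/width.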
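
One reason -Ofast can change numerical results is that relaxed floating-point rules let the compiler regroup arithmetic, and floating-point addition is not associative. The effect is easy to see even in Python, used here purely as an illustration (Python floats are 64-bit doubles, whereas the C code uses fp32, so the magnitudes differ):

```python
# The same three numbers, summed with two different groupings.
left_to_right = (0.1 + 0.2) + 0.3
regrouped = 0.1 + (0.2 + 0.3)

print(left_to_right == regrouped)        # the two groupings disagree
print(abs(left_to_right - regrouped))    # by a tiny rounding error
```

For sampling text from a small model such differences are usually harmless, but they explain why outputs may not match bit for bit across compiler flags.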

The fastest throughput I have seen so far on my MacBook Air (M1) is with:

gcc -Ofast -o run run.c -lm

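You can also experiment with replacing gcc with clang.

OpenMP: big improvements can also be achieved by compiling with OpenMP, which "activates" the #pragma omp parallel directives inside the matmul and attention. You can compile e.g. like so: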

clang -Ofast -fopenmp -march=native run.c  -lm  -o run

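You can try swapping clang/gcc, and you can try leaving out -march=native. When you run inference, however, make sure to set the number of threads through OpenMP's OMP_NUM_THREADS environment variable, e.g.: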

OMP_NUM_THREADS=4 ./run out/model.bin

Depending on your system resources you may want to tweak these hyperparameters.
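
To compare thread counts systematically, a small script can run the binary with different OMP_NUM_THREADS values and read the throughput line from its output. This is a sketch under assumptions: the helper names are made up, the binary and checkpoint paths are the ones used earlier in this article, and the regular expression relies on the "achieved tok/s:" line shown in the sample output above.

```python
import os
import re
import subprocess

def parse_tok_per_s(output):
    """Extract the number after 'achieved tok/s:' or return None if absent."""
    match = re.search(r"achieved tok/s:\s*([0-9.]+)", output)
    return float(match.group(1)) if match else None

def run_with_threads(n_threads, binary="./run", checkpoint="out/model.bin"):
    """Run the compiled binary with OMP_NUM_THREADS set; return tokens/s."""
    env = dict(os.environ, OMP_NUM_THREADS=str(n_threads))
    result = subprocess.run([binary, checkpoint], env=env,
                            capture_output=True, text=True)
    # Search both streams, since the timing line could be on either one.
    return parse_tok_per_s(result.stdout + result.stderr)
```

With a compiled ./run in the current directory, calling run_with_threads(n) for n in 1, 2, 4 and 8 gives a quick picture of where extra threads stop helping.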

Original link: DIY Large LLMs for Vertical Domains - BimAnt