LLMs / Guanaco: "QLoRA: Efficient Finetuning of Quantized LLMs", Translation and Commentary


Overview: This article presents QLORA, an efficient finetuning method, here used to efficiently finetune the Guanaco model family. QLORA reduces memory usage enough to finetune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. It achieves this by backpropagating gradients through a frozen, 4-bit quantized pretrained language model into Low-rank Adapters (LoRA).

QLoRA is a quantized variant of LoRA that achieves its efficiency through several techniques. Because QLoRA inserts a LoRA adapter into every fully connected layer of the LLM, it can typically approach the performance of full-parameter 16-bit finetuning; and thanks to the NF4 data type and the double-quantization strategy, it greatly reduces GPU memory usage during training.

>> The NF4 data type matches normally distributed weights (information-theoretically): 4-bit NormalFloat is a new data type that is information-theoretically optimal for normally distributed weights; empirically, NF4 yields better results on such data than Int4 integers and FP4 floats.

>> DQ, Double Quantization, saves storage: the quantization constants produced by the first quantization are themselves quantized a second time, reducing their storage footprint. Double quantization treats the first-level quantization constants c^FP32 as inputs to a second quantization step.

>> PO, Paged Optimizers, manage memory spikes: they prevent out-of-memory errors caused by memory spikes during gradient checkpointing. Using NVIDIA unified memory, pages are transferred automatically between CPU and GPU; when GPU memory runs out, optimizer states are automatically evicted to CPU RAM and paged back in when needed.

>> All-Linear-Layer Adapters (ALLA): LoRA adapters are inserted into every linear (fully connected) layer. This increases the number of trainable parameters but is what allows matching the performance of 16-bit full-parameter finetuning.

----------------------------------------------------

The paper uses QLORA to finetune more than 1,000 models, reporting instruction-following and chatbot performance across model scales (7B to 65B parameters), architectures, and datasets. QLORA finetuning makes the best Guanaco model outperform all previously released models, reaching 99.3% of ChatGPT's performance level on the Vicuna benchmark after only 24 hours of finetuning on a single GPU. QLORA finetuning on small, high-quality datasets reaches state-of-the-art results, even with models smaller than the previous state of the art.

Chapter 4 shows that QLORA can finetune large models at 4-bit precision and match the results of full 16-bit finetuning, and that the choice of finetuning dataset matters greatly for the outcome. Guanaco is the best-performing open chatbot trained on open data, approaching ChatGPT's quality. Current evaluation methods and benchmarks have shortcomings and still need improvement. Guanaco is trained with cross-entropy loss alone, without reinforcement learning, which merits further study.

Chapter 7 notes that QLORA allows a 33B-parameter model to be finetuned on a single consumer GPU and a 65B-parameter model on a single professional GPU, without degrading performance relative to a full-finetuning baseline. QLORA may help bring large language models to phones and other low-resource devices. Finetuning technology is a double-edged sword, however, and can be misused to cause harm. QLORA will make finetuning high-quality LLMs far more widely and easily available, which may strengthen independent analysis. The chapter also compares QLORA with related work on quantization methods, finetuning methods, instruction-tuning datasets, and chatbots.

Chapter 8 points out limitations, such as not having established that QLORA matches full 16-bit finetuning at the largest scales, and open questions such as 3-bit quantization and other adapter variants.

Overall, the paper discusses QLORA's potential impact: its benefit is making finetuning far more accessible, but the same capability could be misused to cause harm. The authors judge the overall effect as likely positive, while oversight and auditing are needed to avoid bad outcomes.

Contents

"QLoRA: Efficient Finetuning of Quantized LLMs": Translation and Commentary
Abstract
1. Introduction
2. Background
Block-wise k-bit Quantization
Low-rank Adapters
Memory Requirement of Parameter-Efficient Finetuning
3. QLORA Finetuning
4-bit NormalFloat Quantization
Double Quantization
Paged Optimizers
QLORA
4. QLoRA vs. Standard Finetuning
Default LoRA hyperparameters do not match 16-bit performance
4-bit NormalFloat yields better performance than 4-bit Floating Point
k-bit QLORA matches 16-bit full finetuning and 16-bit LoRA performance
Summary
5. Pushing the Chatbot State-of-the-art with QLoRA
5.1 Experimental setup
Data
Training Setup
Baselines
5.2 Evaluation
Benchmark Data
Automated Evaluation
Human Evaluation
Elo Rating
5.3 Guanaco: QLORA trained on OASST1 is a State-of-the-art Chatbot
6. Qualitative Analysis
6.1 Qualitative Analysis of Example Generations
6.2 Considerations
Evaluation
Data & Training
7. Related Work
Quantization of Large Language Models
Finetuning with Adapters
Instruction Finetuning
Chatbots
8. Limitations and Discussion
9. Broader Impacts
Acknowledgements


"QLoRA: Efficient Finetuning of Quantized LLMs": Translation and Commentary

Links

Paper: https://arxiv.org/abs/2305.14314

GitHub: https://github.com/artidoro/qlora

Online demo: Guanaco Playground TGI (a Hugging Face Space by uwnlp)

Date

May 23, 2023

Authors

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

University of Washington

Abstract

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low-rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.


1. Introduction

Finetuning large language models (LLMs) is a highly effective way to improve their performance,[40, 62, 43, 61, 59, 37] and to add desirable or remove undesirable behaviors [43, 2, 4]. However, finetuning very large models is prohibitively expensive; regular 16-bit finetuning of a LLaMA 65B parameter model [57] requires more than 780 GB of GPU memory. While recent quantization methods can reduce the memory footprint of LLMs [14, 13, 18, 66], such techniques only work for inference and break down during training [65].

We demonstrate for the first time that it is possible to finetune a quantized 4-bit model without any performance degradation. Our method, QLORA, uses a novel high-precision technique to quantize a pretrained model to 4-bit, then adds a small set of learnable Low-rank Adapter weights [28] that are tuned by backpropagating gradients through the quantized weights.


QLORA reduces the average memory requirements of finetuning a 65B parameter model from >780GB of GPU memory to <48GB without degrading the runtime or predictive performance compared to a 16-bit fully finetuned baseline. This marks a significant shift in accessibility of LLM finetuning: the largest publicly available models to date can now be finetuned on a single GPU. Using QLORA, we train the Guanaco family of models, with the second best model reaching 97.8% of the performance level of ChatGPT on the Vicuna [10] benchmark, while being trainable in less than 12 hours on a single consumer GPU; using a single professional GPU over 24 hours we achieve 99.3% with our largest model, essentially closing the gap to ChatGPT on the Vicuna benchmark. When deployed, our smallest Guanaco model (7B parameters) requires just 5 GB of memory and outperforms a 26 GB Alpaca model by more than 20 percentage points on the Vicuna benchmark (Table 6).

QLORA introduces multiple innovations designed to reduce memory use without sacrificing performance: (1) 4-bit NormalFloat, an information theoretically optimal quantization data type for normally distributed data that yields better empirical results than 4-bit Integers and 4-bit Floats. (2) Double Quantization, a method that quantizes the quantization constants, saving an average of about 0.37 bits per parameter (approximately 3 GB for a 65B model). (3) Paged Optimizers, using NVIDIA unified memory to avoid the gradient checkpointing memory spikes that occur when processing a mini-batch with a long sequence length. We combine these contributions into a better tuned LoRA approach that includes adapters at every network layer and thereby avoids almost all of the accuracy tradeoffs seen in prior work.


QLORA’s efficiency enables us to perform an in-depth study of instruction finetuning and chatbot performance on model scales that would be impossible using regular finetuning due to memory overhead. Therefore, we train more than 1,000 models across several instruction tuning datasets, model architectures, and sizes between 80M to 65B parameters. In addition to showing that QLORA recovers 16-bit performance (§4) and training a state-of-the-art chatbot, Guanaco, (§5), we also analyze trends in the trained models. First, we find that data quality is far more important than dataset size, e.g., a 9k sample dataset (OASST1) outperformed a 450k sample dataset (FLAN v2, subsampled) on chatbot performance, even when both are meant to support instruction following generalization. Second, we show that strong Massive Multitask Language Understanding (MMLU) benchmark performance does not imply strong Vicuna chatbot benchmark performance and vice versa—in other words, dataset suitability matters more than size for a given task.


Furthermore, we also provide an extensive analysis of chatbot performance that uses both human raters and GPT-4 for evaluation. We use tournament-style benchmarking where models compete against each other in matches to produce the best response for a given prompt. The winner of a match is judged by either GPT-4 or human annotators. The tournament results are aggregated into Elo scores [16, 17] which determine the ranking of chatbot performance. We find that GPT-4 and human evaluations largely agree on the rank of model performance in the tournaments, but we also find there are instances of strong disagreement. As such, we highlight that model-based evaluation, while providing a cheap alternative to human annotation, also has its uncertainties.

We augment our chatbot benchmark results with a qualitative analysis of Guanaco models. Our analysis highlights success and failure cases that were not captured by the quantitative benchmarks.

We release all model generations with human and GPT-4 annotations to facilitate further study. We open-source our codebase and CUDA kernels and integrate our methods into the Hugging Face transformers stack [64], making them easily accessible to all. We release a collection of adapters for 7/13/33/65B size models, trained on 8 different instruction following datasets, for a total of 32 different open sourced, finetuned models.


2. Background

Block-wise k-bit Quantization

Quantization is the process of discretizing an input from a representation that holds more information to a representation with less information. It often means taking a data type with more bits and converting it to fewer bits, for example from 32-bit floats to 8-bit Integers. To ensure that the entire range of the low-bit data type is used, the input data type is commonly rescaled into the target data type range through normalization by the absolute maximum of the input elements, which are usually structured as a tensor. For example, quantizing a 32-bit Floating Point (FP32) tensor into an Int8 tensor with range [−127, 127]:

X^Int8 = round(127 / absmax(X^FP32) · X^FP32) = round(c^FP32 · X^FP32),   (1)

where c is the quantization constant. Dequantization is the inverse:

dequant(c^FP32, X^Int8) = X^Int8 / c^FP32 = X^FP32.   (2)


The problem with this approach is that if a large magnitude value (i.e., an outlier) occurs in the input tensor, then the quantization bins—certain bit combinations—are not utilized well with few or no numbers quantized in some bins. To prevent the outlier issue, a common approach is to chunk the input tensor into blocks that are independently quantized, each with their own quantization constant c.

This can be formalized as follows: We chunk the input tensor X ∈ R^(b×h) into n contiguous blocks of size B by flattening the input tensor and slicing the linear segment into n = (b × h)/B blocks. We quantize these blocks independently with Equation 1 to create a quantized tensor and n quantization constants c_i.

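To make the block-wise scheme concrete, here is a minimal NumPy sketch of absmax Int8 quantization with an independent constant c_i per block of B = 64 values (Equations 1 and 2); the function names and the padding of the last block are illustrative choices, not the paper's released implementation.

```python
import numpy as np

def quantize_blockwise_int8(x: np.ndarray, block_size: int = 64):
    """Absmax-quantize a tensor to Int8 with one constant c_i per block (Eq. 1)."""
    flat = x.astype(np.float32).ravel()
    pad = (-len(flat)) % block_size               # pad so the tail forms a full block
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    c = 127.0 / np.maximum(absmax, 1e-12)         # per-block quantization constants
    q = np.round(blocks * c).astype(np.int8)      # X_Int8 = round(c * X_FP32)
    return q, c, x.shape, pad

def dequantize_blockwise_int8(q, c, shape, pad):
    """Invert Eq. 1: dequant(c, X_Int8) = X_Int8 / c ~= X_FP32 (Eq. 2)."""
    flat = (q.astype(np.float32) / c).ravel()
    flat = flat[:flat.size - pad] if pad else flat
    return flat.reshape(shape)

w = np.random.randn(4, 96).astype(np.float32)
q, c, shape, pad = quantize_blockwise_int8(w)
w_hat = dequantize_blockwise_int8(q, c, shape, pad)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Because each block carries its own constant, a single outlier only degrades the precision of its own block rather than the whole tensor.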

Low-rank Adapters

Low-rank Adapter (LoRA) finetuning [28] is a method that reduces memory requirements by using a small set of trainable parameters, often termed adapters, while not updating the full model parameters which remain fixed. Gradients during stochastic gradient descent are passed through the fixed pretrained model weights to the adapter, which is updated to optimize the loss function. LoRA augments a linear projection through an additional factorized projection. Given a projection XW = Y with X ∈ R^(b×h) and W ∈ R^(h×o), LoRA computes:

Y = XW + s · XL1L2,   (3)

where L1 ∈ R^(h×r), L2 ∈ R^(r×o), and s is a scalar.

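A minimal sketch of the factorized projection in Equation 3; the rank r, the scaling s, and the zero-initialization of L2 follow common LoRA practice and are assumptions here, not values specified in this passage.

```python
import numpy as np

b, h, o, r = 2, 16, 8, 4          # batch, input dim, output dim, adapter rank
s = 2.0                            # adapter scaling factor (assumed)

X  = np.random.randn(b, h).astype(np.float32)
W  = np.random.randn(h, o).astype(np.float32)          # frozen pretrained weight
L1 = np.random.randn(h, r).astype(np.float32) * 0.01   # trainable adapter factor
L2 = np.zeros((r, o), dtype=np.float32)                # zero-init: adapter starts as a no-op

# Y = XW + s * X L1 L2 : only L1 and L2 would receive gradient updates
Y = X @ W + s * (X @ L1 @ L2)
```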

Memory Requirement of Parameter-Efficient Finetuning

One important point of discussion is the memory requirement of LoRA during training, both in terms of the number and size of adapters used. Since the memory footprint of LoRA is so minimal, we can use more adapters to improve performance without significantly increasing the total memory used. While LoRA was designed as a Parameter Efficient Finetuning (PEFT) method, most of the memory footprint for LLM finetuning comes from activation gradients and not from the learned LoRA parameters. For a 7B LLaMA model trained on FLAN v2 with a batch size of 1, with LoRA weights equivalent to the commonly used 0.2% of the original model weights [28, 37], the LoRA input gradients have a memory footprint of 567 MB while the LoRA parameters take up only 26 MB. With gradient checkpointing [9], the input gradients reduce to an average of 18 MB per sequence, making them more memory intensive than all LoRA weights combined. In comparison, the 4-bit base model consumes 5,048 MB of memory. This highlights that gradient checkpointing is important but also that aggressively reducing the number of LoRA parameters yields only minor memory benefits. This means we can use more adapters without significantly increasing the overall training memory footprint (see Appendix G for a detailed breakdown). As discussed later, this is crucial for recovering full 16-bit precision performance.

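These figures can be sanity-checked with back-of-the-envelope arithmetic (an illustrative sketch; exact byte counts depend on dtypes and buffers not detailed here):

```python
base_params = 7e9                       # 7B LLaMA
lora_params = base_params * 0.002       # ~0.2% of the base weights -> ~14M params

# Adapter weights at 16 bit: ~27 MB, close to the 26 MB reported above.
print(f"LoRA params: {lora_params / 1e6:.0f}M, "
      f"~{lora_params * 2 / 2**20:.0f} MB at 16-bit")

# Raw 4-bit base weights alone: ~3.3 GB; the reported 5,048 MB also
# includes quantization constants and other buffers.
print(f"4-bit base weights alone: ~{base_params * 0.5 / 2**20:.0f} MB")
```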

3. QLORA Finetuning

QLORA achieves high-fidelity 4-bit finetuning via two techniques we propose—4-bit NormalFloat (NF4) quantization and Double Quantization. Additionally, we introduce Paged Optimizers, to prevent memory spikes during gradient checkpointing from causing out-of-memory errors that have traditionally made finetuning on a single machine difficult for large models.

QLORA has one low-precision storage data type, in our case usually 4-bit, and one computation data type that is usually BFloat16. In practice, this means whenever a QLORA weight tensor is used, we dequantize the tensor to BFloat16, and then perform a matrix multiplication in 16-bit.

We now discuss the components of QLORA followed by a formal definition of QLORA.


4-bit NormalFloat Quantization

The NormalFloat (NF) data type builds on Quantile Quantization [15] which is an information-theoretically optimal data type that ensures each quantization bin has an equal number of values assigned from the input tensor. Quantile quantization works by estimating the quantile of the input tensor through the empirical cumulative distribution function.

The main limitation of quantile quantization is that the process of quantile estimation is expensive. Therefore fast quantile approximation algorithms, such as SRAM quantiles [15], are used to estimate them. Due to the approximate nature of these quantile estimation algorithms, the data type has large quantization errors for outliers, which are often the most important values.

Expensive quantile estimates and approximation errors can be avoided when input tensors come from a distribution fixed up to a quantization constant. In such cases, input tensors have the same quantiles making exact quantile estimation computationally feasible.


Since pretrained neural network weights usually have a zero-centered normal distribution with standard deviation σ (see Appendix F), we can transform all weights to a single fixed distribution by scaling σ such that the distribution fits exactly into the range of our data type. For our data type, we set the arbitrary range [−1, 1]. As such, both the quantiles for the data type and the neural network weights need to be normalized into this range.

The information theoretically optimal data type for zero-mean normal distributions with arbitrary standard deviations σ in the range [−1, 1] is computed as follows: (1) estimate the 2^k + 1 quantiles of a theoretical N(0, 1) distribution to obtain a k-bit quantile quantization data type for normal distributions, (2) take this data type and normalize its values into the [−1, 1] range, (3) quantize an input weight tensor by normalizing it into the [−1, 1] range through absolute maximum rescaling.

Once the weight range and data type range match, we can quantize as usual. Step (3) is equivalent to rescaling the standard deviation of the weight tensor to match the standard deviation of the k-bit data type. More formally, we estimate the 2^k values q_i of the data type as follows:

q_i = (1/2) · (Q_X(i / (2^k + 1)) + Q_X((i + 1) / (2^k + 1))),   (4)

where Q_X(·) is the quantile function of the standard normal distribution N(0, 1). A problem for a symmetric k-bit quantization is that this approach does not have an exact representation of zero, which is an important property to quantize padding and other zero-valued elements with no error. To ensure a discrete zeropoint of 0 and to use all 2^k bits of a k-bit datatype, we create an asymmetric data type by estimating the quantiles q_i of two ranges: 2^(k−1) quantiles for the negative part and 2^(k−1) + 1 quantiles for the positive part. We then unify these sets of q_i and remove one of the two zeros that occurs in both sets. We term the resulting data type, which has an equal expected number of values in each quantization bin, k-bit NormalFloat (NFk), since the data type is information-theoretically optimal for zero-centered normally distributed data. The exact values of this data type can be found in Appendix E.

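A sketch of the asymmetric NFk construction described above, using SciPy's normal quantile function. The clipping offset eps is an illustrative stand-in for the offset handling in the released code, which differs in detail:

```python
import numpy as np
from scipy.stats import norm

def normalfloat_levels(k: int = 4) -> np.ndarray:
    """Asymmetric NFk levels: 2**(k-1) quantile levels for the negative half,
    2**(k-1)+1 for the positive half, merged with the duplicate zero removed."""
    eps = 1e-2  # keep norm.ppf finite at the ends (illustrative choice)
    neg = norm.ppf(np.linspace(eps, 0.5, 2 ** (k - 1)))
    pos = norm.ppf(np.linspace(0.5, 1 - eps, 2 ** (k - 1) + 1))
    levels = np.union1d(neg, pos)           # sorted, duplicated zero removed
    return levels / np.abs(levels).max()    # normalize into [-1, 1]

def quantize_nf(w: np.ndarray, levels: np.ndarray):
    """Absmax-rescale into [-1, 1], then map each weight to the nearest level."""
    c = np.abs(w).max()
    idx = np.abs(w.ravel()[:, None] / c - levels[None, :]).argmin(axis=1)
    return idx.reshape(w.shape).astype(np.uint8), c

levels = normalfloat_levels(4)               # 16 values, exact zero included
assert levels.size == 16 and 0.0 in levels
codes, c = quantize_nf(np.random.randn(8, 8), levels)
```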

Double Quantization

We introduce Double Quantization (DQ), the process of quantizing the quantization constants for additional memory savings. While a small blocksize is required for precise 4-bit quantization [13], it also has a considerable memory overhead. For example, using 32-bit constants and a blocksize of 64 for W, quantization constants add 32/64 = 0.5 bits per parameter on average. Double Quantization helps reduce the memory footprint of quantization constants.

More specifically, Double Quantization treats quantization constants c_2^FP32 of the first quantization as inputs to a second quantization. This second step yields the quantized quantization constants c_2^FP8 and the second level of quantization constants c_1^FP32. We use 8-bit Floats with a blocksize of 256 for the second quantization as no performance degradation is observed for 8-bit quantization, in line with results from Dettmers and Zettlemoyer [13]. Since the c_2^FP32 are positive, we subtract the mean from c_2 before quantization to center the values around zero and make use of symmetric quantization. On average, for a blocksize of 64, this quantization reduces the memory footprint per parameter from 32/64 = 0.5 bits, to 8/64 + 32/(64 · 256) = 0.127 bits, a reduction of 0.373 bits per parameter.

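The per-parameter overhead quoted above can be checked directly:

```python
block_w, block_c = 64, 256   # block sizes for the weights W and the constants c2

plain  = 32 / block_w                             # FP32 constants: 0.5 bits/param
double = 8 / block_w + 32 / (block_w * block_c)   # FP8 c2 plus FP32 c1
print(plain, double, plain - double)              # 0.5  0.126953125  0.373046875
```

This reproduces the 0.5 to 0.127 bits-per-parameter reduction (0.373 bits saved) stated in the paragraph above.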

Paged Optimizers

Paged Optimizers use the NVIDIA unified memory feature, which does automatic page-to-page transfers between the CPU and GPU for error-free GPU processing in the scenario where the GPU occasionally runs out of memory. The feature works like regular memory paging between CPU RAM and the disk. We use this feature to allocate paged memory for the optimizer states, which are then automatically evicted to CPU RAM when the GPU runs out of memory and paged back into GPU memory when the memory is needed in the optimizer update step.

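The paper's released code builds on the bitsandbytes library. A hedged usage sketch follows, assuming a bitsandbytes version that exposes a paged 32-bit AdamW under the name used below (class names vary across releases and should be checked against the installed version):

```python
import torch
import bitsandbytes as bnb  # assumption: a release exposing paged optimizers

model = torch.nn.Linear(1024, 1024).cuda()

# Intended as a drop-in replacement for torch.optim.AdamW: optimizer state
# lives in paged (unified) memory and is evicted to CPU RAM under pressure.
optimizer = bnb.optim.PagedAdamW32bit(model.parameters(), lr=2e-4)

loss = model(torch.randn(8, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```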

QLORA

QLORA. Using the components described above, we define QLORA for a single linear layer in the quantized base model with a single LoRA adapter as follows:

Y^BF16 = X^BF16 · doubleDequant(c_1^FP32, c_2^k-bit, W^NF4) + X^BF16 · L1^BF16 · L2^BF16,   (5)

where doubleDequant(·) is defined as:

doubleDequant(c_1^FP32, c_2^k-bit, W^k-bit) = dequant(dequant(c_1^FP32, c_2^k-bit), W^4bit) = W^BF16.   (6)

We use NF4 for W and FP8 for c_2. We use a blocksize of 64 for W for higher quantization precision and a blocksize of 256 for c_2 to conserve memory.

For parameter updates only the gradient with respect to the error for the adapter weights ∂E/∂L_i is needed, and not for the 4-bit weights ∂E/∂W. However, the calculation of ∂E/∂L_i entails the calculation of ∂X/∂W, which proceeds via equation (5) with dequantization from storage W^NF4 to the computation data type W^BF16 to calculate the derivative ∂X/∂W in BFloat16 precision.


To summarize, QLORA has one storage data type (usually 4-bit NormalFloat) and a computation data type (16-bit BrainFloat). We dequantize the storage data type to the computation data type to perform the forward and backward pass, but we only compute weight gradients for the LoRA parameters which use 16-bit BrainFloat.

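Putting the pieces together, here is a schematic single-layer QLORA forward pass in the spirit of Equation 5. For brevity this sketch uses one first-level quantization constant instead of block-wise double quantization, a uniform 16-level codebook standing in for NF4, and float32 standing in for BFloat16:

```python
import numpy as np

rng = np.random.default_rng(0)
h, o, r, s = 64, 64, 8, 1.0

# Frozen base weight, stored as 4-bit codes plus a quantization constant.
W = rng.standard_normal((h, o)).astype(np.float32)
levels = np.linspace(-1, 1, 16)                         # stand-in codebook
c1 = np.abs(W).max()                                    # single constant, for brevity
codes = np.abs(W[..., None] / c1 - levels).argmin(-1)   # "NF4" storage

L1 = rng.standard_normal((h, r)).astype(np.float32) * 0.01
L2 = np.zeros((r, o), dtype=np.float32)                 # trainable adapter, zero-init

def qlora_forward(X):
    W_compute = levels[codes] * c1                      # dequantize storage -> compute dtype
    return X @ W_compute + s * (X @ L1 @ L2)            # Eq. 5; only L1, L2 get gradients

Y = qlora_forward(rng.standard_normal((2, h)).astype(np.float32))
```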

4. QLoRA vs. Standard Finetuning

We have discussed how QLoRA works and how it can significantly reduce the required memory for finetuning models. The main question now is whether QLoRA can perform as well as full-model finetuning. Furthermore, we want to analyze the components of QLoRA including the impact of NormalFloat4 over standard Float4. The following sections will discuss the experiments that aimed at answering these questions.

Experimental setup. We consider three architectures (encoder, encoder-decoder, and decoder only) and compare QLoRA with 16-bit adapter-finetuning and with full-finetuning for models up to 3B. Our evaluations include GLUE [58] with RoBERTa-large [38], Super-NaturalInstructions (TKInstruct) [61] with T5 [49], and 5-shot MMLU [24] after finetuning LLaMA on Flan v2 [39] and Alpaca [55]. To additionally study the advantages of NF4 over other 4-bit data types, we use the setup of Dettmers and Zettlemoyer [13] and measure post-quantization zero-shot accuracy and perplexity across different models (OPT [72], LLaMA [57], BLOOM [52], Pythia [7]) for model sizes 125M to 13B. We provide more details in the results section for each particular setup to make the results more readable. Full details in Appendix A.


While paged optimizers are critical to do 33B/65B QLORA tuning on a single 24/48GB GPU, we do not provide hard measurements for Paged Optimizers since the paging only occurs when processing mini-batches with long sequence lengths, which is rare. We do, however, perform an analysis of the runtime of paged optimizers for 65B models on 48GB GPUs and find that with a batch size of 16, paged optimizers provide the same training speed as regular optimizers. Future work should measure and characterize under what circumstances slow-downs occur from the paging process.


Default LoRA hyperparameters do not match 16-bit performance

When using the standard practice of applying LoRA to query and value attention projection matrices [28], we are not able to replicate full finetuning performance for large base models. As shown in Figure 2 for LLaMA 7B finetuning on Alpaca, we find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total, and that LoRA on all linear transformer block layers is required to match full finetuning performance. Other LoRA hyperparameters, such as the projection dimension r, do not affect performance (see Appendix A).

Similarly, we find that default hyperparameters for fully finetuned baselines are undertuned. We do a hyperparameter search over learning rates 1e-6 to 5e-5 and batch sizes 8 to 128 to find robust baselines. Results for 7B LLaMA finetuning on Alpaca are shown in Figure 2.


4-bit NormalFloat yields better performance than 4-bit Floating Point

While the 4-bit NormalFloat (NF4) data type is information-theoretically optimal, it still needs to be determined if this property translates to empirical advantages. We follow the setup from Dettmers and Zettlemoyer [13] where quantized LLMs (OPT [72], BLOOM [52], Pythia [7], LLaMA) of different sizes (125M to 65B) with different data types are evaluated on language modeling and a set of zero-shot tasks. In Figure 3 and Table 2 we see that NF4 improves performance significantly over FP4 and Int4 and that double quantization reduces the memory footprint without degrading performance.


k-bit QLORA matches 16-bit full finetuning and 16-bit LoRA performance

Recent findings have established that 4-bit quantization for inference is possible, but leads to performance degradation relative to 16-bit [13, 18]. This raises the crucial question of whether the lost performance can be recovered by conducting 4-bit adapter finetuning. We test this for two setups.

The first focuses on a comparison with full 16-bit finetuning of RoBERTa and T5 models sized 125M to 3B parameters on GLUE and the Super-NaturalInstructions dataset. Results are shown in Table 3. In both datasets, we observe that 16-bit, 8-bit, and 4-bit adapter methods replicate the performance of the fully finetuned 16-bit baseline. This suggests that the performance lost due to the imprecise quantization can be fully recovered through adapter finetuning after quantization.


For our second setup, since full finetuning models at and beyond 11B parameters requires more than one server of high memory GPUs, we continue to test whether 4-bit QLORA can match 16-bit LoRA at the 7B to 65B parameter scales. To this end, we finetune LLaMA 7B through 65B on two instruction following datasets, Alpaca and FLAN v2, and evaluate on the MMLU benchmark via 5-shot accuracy. Results are shown in Table 4 where we see that NF4 with double quantization fully recovers the 16-bit LoRA MMLU performance. In addition, we also note that QLORA with FP4 lags behind the 16-bit brain float LoRA baseline by about 1 percentage point. This corroborates both our findings that (1) QLORA with NF4 replicates both 16-bit full finetuning and 16-bit LoRA finetuning performance, and (2) NF4 is superior to FP4 in terms of quantization precision.


Summary

Our results consistently show that 4-bit QLORA with NF4 data type matches 16-bit full finetuning and 16-bit LoRA finetuning performance on academic benchmarks with well-established evaluation setups. We have also shown that NF4 is more effective than FP4 and that double quantization does not degrade performance. Combined, this forms compelling evidence that 4-bit QLORA tuning reliably yields results matching 16-bit methods.

In line with previous work on quantization [13], our MMLU and Elo results indicate that with a given finetuning and inference resource budget it is beneficial to increase the number of parameters in the base model while decreasing their precision. This highlights the importance of efficiency benefits from QLORA. Since we did not observe performance degradation compared to full-finetuning in our experiments with 4-bit finetuning, this raises the question of where the performance-precision trade-off exactly lies for QLoRA tuning, which we leave to future work to explore.

We proceed to investigate instruction tuning at scales that would be impossible to explore with full 16-bit finetuning on academic research hardware.


5. Pushing the Chatbot State-of-the-art with QLoRA

Having established that 4-bit QLORA matches 16-bit performance across scales, tasks, and datasets, we conduct an in-depth study of instruction finetuning up to the largest open-source language models available for research. To assess the performance of instruction finetuning these models, we evaluate on a challenging Natural Language Understanding benchmark (MMLU) and develop new methods for real-world chatbot performance evaluation.


5.1 Experimental setup

We now describe an overview of the experimental setup with full details in Appendix B.


Data

As, to our knowledge, there is no comprehensive study of recent instruction-following datasets, we select eight recent datasets. We include datasets obtained through crowd-sourcing (OASST1 [31], HH-RLHF [4]), distillation from instruction-tuned models (Alpaca [55], self-instruct [59], unnatural-instructions [26]), corpora aggregations (FLAN v2 [12]), as well as hybrids (Chip2 [32], Long-form [30]). These datasets cover different languages, data sizes, and licenses.


Training Setup

To avoid confounding effects from different training objectives, we perform QLoRA finetuning with cross-entropy loss (supervised learning) without reinforcement learning, even for datasets that include human judgments of different responses. For datasets that have a clear distinction between instruction and response, we finetune only on the response (see ablations in Appendix B). For OASST1 and HH-RLHF, multiple responses are available. We then select the top response at every level of the conversation tree and finetune on the full selected conversation, including the instructions. In all of our experiments, we use NF4 QLORA with double quantization and paged optimizers to prevent memory spikes during gradient checkpointing. We do small hyperparameter searches for the 13B and 33B LLaMA models and we find that all hyperparameter settings found at 7B generalize (including number of epochs) except learning rate and batch size. We halve the learning rate for 33B and 65B while doubling the batch size.

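The per-level top-response selection over the OASST1 conversation tree described above can be sketched as follows; the node schema here is hypothetical (OASST1's actual export format differs), and a lower rank is assumed to mean a better response:

```python
def best_conversation(node):
    """Walk a conversation tree, keeping the top-ranked reply at every level,
    and return the flattened turns (instruction, response, instruction, ...).
    `node` is a hypothetical dict: {"text": str, "rank": int, "replies": [...]}."""
    turns = []
    while node is not None:
        turns.append(node["text"])
        # pick the top-ranked reply, if any, as the next turn
        node = min(node["replies"], key=lambda r: r["rank"]) if node["replies"] else None
    return turns
```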

Baselines

We compare our models to both research (Vicuna [10] and Open Assistant [31]) and commercial (GPT-4 [42], GPT-3.5-turbo and Bard) chatbot systems. The Open Assistant model is a LLaMA 33B model finetuned with Reinforcement Learning from Human Feedback (RLHF) on the same OASST1 dataset that we experiment with. Vicuna does full fine-tuning of LLaMA 13B on proprietary user-shared conversations from ShareGPT and is thus the result of distillation from OpenAI GPT models.


5.2 Evaluation

Following common practice, we use the MMLU (Massively Multitask Language Understanding) benchmark [24] to measure performance on a range of language understanding tasks. This is a multiple-choice benchmark covering 57 tasks including elementary mathematics, US history, computer science, law, and more. We report 5-shot test accuracy.

We also test generative language capabilities through both automated and human evaluations. This second set of evaluations relies on queries curated by humans and aims at measuring the quality of model responses. While this is a more realistic testbed for chatbot model performance and is growing in popularity, there is no commonly accepted protocol in the literature. We describe below our proposed setup, using nucleus sampling with p = 0.9 and temperature 0.7 in all cases.


Benchmark Data

We evaluate on two curated datasets of queries (questions): the Vicuna prompts [10] and the OASST1 validation dataset [31]. We use the Vicuna prompts, a set of 80 prompts from a diverse set of categories, without modifications. The OASST1 dataset is a multilingual collection of crowd-sourced multiturn dialogs between a user and an assistant. We select all user messages in the validation dataset as queries and include previous turns in the prompt. This procedure leads to 953 unique user queries. We term these two datasets the Vicuna and OA benchmarks. 


Automated Evaluation

First, based on the evaluation protocol introduced by Chiang et al. [10], we use GPT-4 to rate the performance of different systems against ChatGPT (GPT-3.5 Turbo) on the Vicuna benchmark. Given a query along with ChatGPT’s and a model’s responses, GPT-4 is prompted to assign a score out of ten to both responses and provide an explanation. The overall performance of a model is calculated as a percentage of the score that ChatGPT achieved. Note this relative score can be higher than 100% if the model achieves a higher absolute score than ChatGPT. We find a significant ordering effect with GPT-4 increasing the score of the response occurring earlier in the prompt. To control for such effects, we recommend reporting the mean score over both orders.

Next, we measure performance through direct comparisons between system outputs. We simplify the rating scheme to a three-class labeling problem that accounts for ties. We prompt GPT-4 to pick the best response or declare a tie and provide an explanation. We conduct these head-to-head comparisons on all permutations of pairs of systems on both the Vicuna and OA benchmarks.

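A small sketch of the order-controlled relative scoring described above: each query is judged twice with the response order swapped, and the final number is a mean percentage of ChatGPT's score. The `judge` callable is a hypothetical placeholder for the GPT-4 call, and the aggregation shown is one reasonable reading of the protocol, not the paper's exact computation:

```python
from statistics import mean

def relative_score(queries, judge):
    """judge(query, first, second) -> (score_first, score_second), two marks
    out of ten in presentation order; here it is a hypothetical callable."""
    ratios = []
    for q in queries:
        m1, g1 = judge(q, "model", "chatgpt")    # model shown first
        g2, m2 = judge(q, "chatgpt", "model")    # ChatGPT shown first
        ratios.append(mean([m1 / g1, m2 / g2]))  # average over both orders
    return 100 * mean(ratios)                    # percent of ChatGPT's score
```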

Human Evaluation

While recent work indicates generative models can be effectively employed for system evaluations [19], the reliability of GPT-4 ratings for assessing chatbot performance has, to our knowledge, yet to be proven to correlate with human judgments. Therefore, we run two parallel human evaluations on the Vicuna benchmark matching both automated evaluation protocols described above. We use Amazon Mechanical Turk (AMT) and get two human annotators for comparisons to ChatGPT and three annotators for pairwise comparisons.


Elo Rating

With both human and automated pairwise comparisons, we create a tournament-style competition where models compete against each other. The tournament is made up of matches where pairs of models compete to produce the best response for a given prompt. This is similar to how Bai et al. [4] and Chiang et al. [10] compare models, but we also employ GPT-4 ratings in addition to human ratings. We randomly sample from the set of labeled comparisons to compute Elo [16, 17]. Elo rating, which is widely used in chess and other games, is a measure of the expected win-rate relative to an opponent’s win rate, for example, an Elo of 1100 vs 1000 means the Elo 1100 player has an expected win-rate of approximately 65% against the Elo 1000 opponent; a 1000 vs 1000 or 1100 vs 1100 match results in an expected win-rate of 50%. The Elo rating changes after each match proportionally to the expected outcome, that is, an unexpected upset leads to a large change in Elo rating while an expected outcome leads to a small change. Over time, Elo ratings approximately match the skill of each player at playing the game. We start with a score of 1,000 and use K = 32. Similar to Chiang et al. [10], we repeat this procedure 10,000 times with different random seeds to control for ordering effects, e.g., the effect of which model pairs compete with each other first.

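The Elo bookkeeping described above can be written down compactly (standard logistic Elo with a 1,000 start and K = 32; the random sampling over labeled comparisons and the 10,000 repetitions with different seeds are omitted):

```python
def expected(r_a: float, r_b: float) -> float:
    """Expected win-rate of A against B; ~0.64 for 1100 vs 1000, 0.5 for equal ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1 for a win, 0 for a loss, 0.5 for a tie."""
    e = expected(r_a, r_b)
    return r_a + k * (score_a - e), r_b + k * ((1 - score_a) - (1 - e))

ratings = {"guanaco-65b": 1000.0, "chatgpt": 1000.0}
ratings["guanaco-65b"], ratings["chatgpt"] = update(
    ratings["guanaco-65b"], ratings["chatgpt"], score_a=1.0)
```

An unexpected upset moves both ratings a lot (the expected term e is far from the observed score), while an expected outcome moves them only slightly, which is the proportional-update behavior described above.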

5.3 Guanaco: QLORA trained on OASST1 is a State-of-the-art Chatbot

Based on our automated and human evaluations, we find that the top QLORA tuned model, Guanaco 65B, which we finetune on a variant of OASST1, is the best-performing open-source chatbot model and offers performance competitive with ChatGPT. When compared to GPT-4, Guanaco 65B and 33B have an expected win probability of 30%, based on Elo ratings from human annotators' system-level pairwise comparisons, the highest reported to date.

The Vicuna benchmark [10] results relative to ChatGPT are shown in Table 6. We find that Guanaco 65B is the best-performing model after GPT-4, achieving 99.3% performance relative to ChatGPT. Guanaco 33B has more parameters than the Vicuna 13B model, but uses only 4-bit precision for its weights and is thus much more memory efficient at 21 GB vs 26 GB, providing a three percentage point improvement over Vicuna 13B. Furthermore, Guanaco 7B easily fits on modern phones at a 5 GB footprint while still scoring nearly 20 percentage points higher than Alpaca 13B.

However, Table 6 also has very wide confidence intervals, with many models overlapping in performance. We hypothesize that this uncertainty comes from the lack of clear specification of scale, e.g., it is unclear what 8 on a 10-point scale means across different scenarios. As such, we instead recommend using the Elo ranking method [16], based on pairwise judgments from human annotators and GPT-4, to avoid the problem of grounding an absolute scale. Elo ratings of the most competitive models can be seen in Table 1. We note that human and GPT-4 ranking of models on the Vicuna benchmark disagree partially, particularly for Guanaco 7B, but are consistent for most models with a Kendall Tau of τ = 0.43 and Spearman rank correlation of r = 0.55 at the system level. At the example level, the agreement between GPT-4 and human annotators' majority vote is weaker with Fleiss κ = 0.25. Overall, this shows a moderate agreement between system-level judgments by GPT-4 and human annotators, and thus that model-based evaluation represents a somewhat reliable alternative to human evaluation. We discuss further considerations in Section 6.2.


Elo rankings in Table 7 indicate that Guanaco 33B and 65B models outperform all models besides GPT-4 on the Vicuna and OA benchmarks and that they perform comparably to ChatGPT in line with Table 6. We note that the Vicuna benchmark favors open-source models while the larger OA benchmark favors ChatGPT. Furthermore, we can see from Tables 5 and 6 that the suitability of a finetuning dataset is a determining factor in performance. Finetuning Llama models on FLAN v2 does particularly well on MMLU, but performs worst on the Vicuna benchmark (similar trends are observed with other models). This also points to partial orthogonality in current evaluation benchmarks: strong MMLU performance does not imply strong chatbot performance (as measured by Vicuna or OA benchmarks) and vice versa.

Guanaco is the only top model in our evaluation that is not trained on proprietary data as the OASST1 dataset collection guidelines explicitly forbid the use of GPT models. The next best model trained on only open-source data is the Anthropic HH-RLHF model, which scores 30 percentage points lower than Guanaco on the Vicuna benchmark (see Table 6). Overall, these results show that 4-bit QLORA is effective and can produce state-of-the-art chatbots that rival ChatGPT. Furthermore, our 33B Guanaco can be trained on 24 GB consumer GPUs in less than 12 hours. This opens up the potential for future work via QLORA tuning on specialized open-source data, which produces models that can compete with the very best commercial models that exist today.


6. Qualitative Analysis

While quantitative analysis is the core of our evaluation, there are a number of issues with only looking at summary statistics. Perhaps the largest is the problem of benchmark validity [36]—whether a benchmark truly tests what its name or description suggests is always at question, especially as we discover "shortcuts" to solve benchmarks that machine learning models sometimes exploit [22, 46]. To partially alleviate this, we here perform some qualitative analysis, in two sections. First, in §6.1 we show some examples that we believe are representative of some observed patterns in the text generated by our 65B Guanaco model. Second, in §6.2 we detail considerations about the results we have discussed and our interpretation of them.


6.1 Qualitative Analysis of Example Generations

To find examples, we first go through data generated for the Vicuna benchmark and the OpenAssistant benchmark, and look for patterns in the answers Guanaco generates. When we notice a pattern, we attempt to set up a question or prompt that will induce the pattern even though it is the incorrect solution, e.g., if we observe that the model tends to give long-winded answers we prompt the model to "Answer yes or no without explanation." We use this to find "lemons" where we manage to adversarially break the model and "cherries" where we fail to break the model, and present both. All generations in this section were generated with Nucleus Sampling [25] with p = 0.9.

Of course, this is by no means comprehensive, since it is beyond the scope of this small qualitative study to control for all the variables involved, e.g., the full distribution of responses the model can generate for a given prompt is quite large, so we rely on samples we hope are representative. However, we believe describing these examples gives context to the quantitative evidence shown earlier in the paper. Since we open source all models and code, we hope this section will inspire future work to examine in more detail the issues we present here.


(The example generations themselves are omitted from this excerpt.) In one failure pattern, Guanaco presumes information transfer that was never described. These issues echo recent literature [51], but require more study.

6.2 Considerations

Evaluation

We report moderate agreement among human annotators (Fleiss κ = 0.42) with additional deterioration when comparing two strong systems. This points to limitations in the current benchmarks and human evaluation protocols for chatbot task performance. When manually comparing generations from ChatGPT and Guanaco 65B on the Vicuna benchmark, we find that subjective preferences start to play an important role as the authors of this paper disagreed on the many preferred responses. Future work should investigate approaches to mitigate these problems drawing from disciplines that developed mechanisms to deal with subjective preferences, such as Human-Computer Interaction and Psychology.


In our analysis, we also find that automated evaluation systems have noticeable biases. For example, we observe strong order effects with GPT-4 assigning higher scores to the system appearing first in its prompt. The relatively weak sample-level agreement between GPT-4 and human annotators (Fleiss κ = 0.25) also suggests that human annotators and automated systems might rely on preferences that are not always aligned. In addition, in Table 7, we observe that GPT-4 assigns significantly higher scores to its own outputs compared to human ratings, Elo of 1348 vs 1176, which represent an additional 20% probability of winning against an opponent. Future work should examine the presence of potential biases in automated evaluation systems as well as possible mitigation strategies.


Data & Training

We note that the OASST1 dataset on which Guanaco models are trained is multilingual and that the OA benchmark also contains prompts in different languages. We leave it to future work to investigate the degree to which such multilingual training improves performance on instructions in languages other than English and whether this explains the larger gap between the Vicuna-13B model (only trained on English data) and Guanaco 33B and 65B on the OA benchmark.


Given the strong performance of Guanaco models, we investigate any data leakage between the OASST1 data and the Vicuna benchmark prompts. We do not find overlapping prompts after performing fuzzy string matching in the two datasets and inspecting the closest matches manually.

Furthermore, we note that our model is only trained with cross-entropy loss (supervised learning) without relying on reinforcement learning from human feedback (RLHF). This calls for further investigations of the tradeoffs of simple cross-entropy loss and RLHF training. We hope that QLORA enables such analysis at scale, without the need for overwhelming computational resources.


7. Related Work

Quantization of Large Language Models

Quantization of LLMs has largely focused on quantization for inference time. Major approaches for preserving 16-bit LLM quality focus on managing outlier features (e.g., SmoothQuant [66] and LLM.int8() [14]) while others use more sophisticated grouping methods [44, 69]. Lossy quantization approaches study the trade-offs for regular rounding [13, 71, 47] or how to optimize rounding decisions to improve quantization precision [18]. Besides our work, SwitchBack layers [65] is the only work that studies backpropagation through quantized weights at a scale beyond 1B parameters.


Finetuning with Adapters

While we use Low-rank Adapters [28] (LoRA), many other Parameter Efficient FineTuning (PEFT) methods have been proposed such as prompt tuning [48, 33, 34], tuning the embedding layer inputs [1], tuning hidden states (IA3) [37], adding full layers [27], tuning biases [70], learning a mask over weights based on Fisher information [54], and a combination of approaches [23]. In our work, we show that LoRA adapters are able to reach full 16-bit finetuning performance. We leave it to future work to explore the tradeoffs of other PEFT approaches.

Instruction Finetuning

To help a pretrained LLM follow the instructions provided in a prompt, instruction finetuning uses input-output pairs of various data sources to finetune a pretrained LLM to generate the output given the input as a prompt. Approaches and datasets include MetaICL [40], MetaTuning [73], InstructGPT [43], FLAN [62, 12], PromptSource [3], Super-NaturalInstructions [61, 50], Self-instruct [59], UnnaturalInstructions [26], OPT-IML [29], UnifiedSKG [67], OIG/Chip2 [32], Alpaca [55], Vicuna [10], Koala [20], and Self-instruct-GPT-4 [45].
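One common recipe for this setup, assumed here for illustration (the paper itself ablates whether to also train on the instruction tokens), computes cross-entropy only on the response tokens, masking out the prompt:

```python
import torch
import torch.nn.functional as F

def response_only_loss(logits, input_ids, prompt_len):
    """Next-token cross-entropy over [prompt; response], with the prompt
    positions masked via ignore_index so only the response is learned."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100      # ignore the instruction/prompt tokens
    shift_logits = logits[:, :-1, :]   # position t predicts token t+1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```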

Chatbots

Many instruction following models are structured as dialogue-based chatbots, often using Reinforcement Learning from Human Feedback (RLHF) [11] or generating data from an existing model to train with AI model feedback (RLAIF) [5]. Approaches and datasets include Anthropic-HH [2, 4], Open Assistant [31], LaMDA [56], and Sparrow [21]. We do not use reinforcement learning, but our best model, Guanaco, is finetuned on multi-turn chat interactions from the Open Assistant dataset, which was designed to be used for RLHF training [31]. For the evaluation of chatbots, approaches that use GPT-4 instead of costly human annotation have been developed [10, 45]. We improve on such approaches with a focus on a more reliable evaluation setup.
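A simple mitigation for the order effect discussed earlier is to randomize which system appears first in the judge's prompt. The sketch below is hypothetical; the prompt wording is our own, not taken from the paper:

```python
import random

def build_judge_prompt(question, answer_a, answer_b, rng=random):
    """Pairwise judging prompt with randomized presentation order, so that
    a first-position bias averages out over many comparisons."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    prompt = (
        f"Question: {question}\n\n"
        f"Assistant 1: {first}\n\n"
        f"Assistant 2: {second}\n\n"
        "Which assistant answered better? Reply with 1 or 2."
    )
    return prompt, swapped  # 'swapped' maps the verdict back to A/B
```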

8. Limitations and Discussion

We have shown evidence that our method, QLORA, can replicate 16-bit full finetuning performance with a 4-bit base model and Low-rank Adapters (LoRA). Despite this evidence, we did not establish that QLORA can match full 16-bit finetuning performance at 33B and 65B scales. Due to the immense resource costs, we leave this study to future work.

Another limitation is the evaluation of instruction finetuning models. While we provide evaluations on MMLU, the Vicuna benchmark, and the OA benchmark, we did not evaluate on other benchmarks such as BigBench, RAFT, and HELM, and it is not certain that our evaluations generalize to these benchmarks. On the other hand, we perform a very broad study on MMLU and develop new methods for evaluating chatbots.

From the evidence presented, it appears that the performance of these benchmarks likely depends on how similar the finetuning data is to the benchmark dataset. For example, FLAN v2 is similar to MMLU but dissimilar to chatbot benchmarks, and vice versa for the Chip2 dataset, and both models score accordingly on the MMLU and Vicuna benchmarks. This highlights that not only are better benchmarks and evaluations needed, but that one needs to be careful about what one is evaluating in the first place. Do we want to create models that do well on classroom high-school and college knowledge, or do we want to do well on chatbot conversational ability? Maybe something else? Because it is always easier to evaluate on an existing benchmark than to create a new one, certain benchmarks can steer the community in a particular direction. We should ensure as a community that benchmarks measure what we care about.

While we provide a detailed evaluation of general chatbot performance, another limitation is that we only perform a limited responsible-AI evaluation of Guanaco. In Table 8, we evaluate the likelihood of Guanaco-65B generating a socially biased sequence of tokens compared to other models. We see that the average score of Guanaco-65B is much lower than that of other raw pretrained models. As such, it seems that finetuning on the OASST1 dataset reduces the bias of the LLaMA base model. While these results are encouraging, it is unclear whether Guanaco also does well when assessed on other types of biases. We leave further analysis of biases in Guanaco and similar chatbots to future work.
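Likelihood-based bias benchmarks of this kind (e.g., CrowS-Pairs) compare the model's probability of a stereotyping sentence against a minimally different neutral variant. A sketch under our own assumptions, using the Hugging Face transformers API:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_nll(model, tokenizer, text: str) -> float:
    """Mean per-token negative log-likelihood; lower means the model
    considers the text more plausible."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # HF averages cross-entropy over tokens
    return out.loss.item()

# Hypothetical usage with a pair of minimally different sentences:
# model = AutoModelForCausalLM.from_pretrained(model_name)
# tok = AutoTokenizer.from_pretrained(model_name)
# prefers_biased = mean_nll(model, tok, sent_more) < mean_nll(model, tok, sent_less)
```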

An additional limitation is that we did not evaluate different bit-precisions, such as using 3-bit base models, or different adapter methods. Besides LoRA, there is also a wide variety of Parameter Efficient FineTuning (PEFT) methods that have been shown to work well. However, it is unclear whether these methods scale to large models. We used LoRA, as many results have established its robustness, but other adapters might yield better performance. Since finetuning after quantization seems to recover most of the information lost during quantization, this might enable much more aggressive quantization. For example, 3-bit GPTQ quantization of the base model with LoRA might also yield 16-bit full finetuning performance after finetuning.

9. Broader Impacts

Our QLORA finetuning method is the first that enables the finetuning of 33B parameter models on a single consumer GPU and 65B parameter models on a single professional GPU, without degrading performance relative to a full finetuning baseline. We have demonstrated that our best 33B model, trained on the Open Assistant dataset, can rival ChatGPT on the Vicuna benchmark. Since instruction finetuning is an essential tool for transforming raw pretrained LLMs into ChatGPT-like chatbots, we believe that our method will make finetuning widespread and common, in particular for the researchers with the fewest resources: a big win for the accessibility of state-of-the-art NLP technology. QLORA can be seen as an equalizing factor that helps to close the resource gap between large corporations and small teams with consumer GPUs.
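A back-of-envelope estimate (ours, ignoring activations, LoRA parameters, optimizer state, and quantization constants) shows why a 4-bit base model makes the 33B scale reachable on a consumer GPU:

```python
# 33B parameters at 4 bits (0.5 bytes) each, weights only:
params = 33e9
weight_gib = params * 0.5 / 2**30
print(f"~{weight_gib:.1f} GiB of 4-bit weights")  # ~15.4 GiB, within a 24 GB card
# A full 16-bit finetune would instead need 2 bytes/param for weights alone,
# plus gradients and Adam states, far beyond a single consumer GPU.
```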

Another potential source of impact is deployment to mobile phones. We believe our QLORA method might reach the critical milestone of enabling the finetuning of LLMs on phones and other low-resource settings. While 7B models have been shown to run on phones before, QLORA is the first method that would enable the finetuning of such models. We estimate that with an iPhone 12 Plus, QLORA can finetune 3 million tokens per night while the phone is charging. While finetuned 7B models do not reach the quality of ChatGPT, we believe that the quality is good enough to enable novel applications that were not possible before due to privacy or LLM quality issues. QLORA can help enable privacy-preserving usage of LLMs, where users can own and manage their own data and models, while simultaneously making LLMs easier to deploy.

However, finetuning is a dual-use technology that can be abused to cause harm. Widespread use of LLMs has known dangers [8, 6], but we believe that equalizing access to a technology that is quickly becoming ubiquitous will allow for better, more independent analysis than keeping the power of LLMs in the hands of large corporations that do not release models or source code for auditing.

All in all, we believe that QLORA will have a broadly positive impact by making the finetuning of high-quality LLMs much more widely and easily accessible.

Acknowledgements

We thank Aditya Kusupati, Ofir Press, Ashish Sharma, Margaret Li, Raphael Olivier, Zihao Ye, and Evangelia Spiliopoulou for their valuable feedback. Our research was facilitated by the advanced computational, storage, and networking infrastructure of the Hyak supercomputer system at the University of Washington. We thank the Hyak team for ensuring a smooth operation. We thank the beta testers of the bitsandbytes library, in particular Alex Birch and Alyssa Vance. We thank Younes Belkada for help with the integration of our software into the Hugging Face transformers stack.
