QLoRA paper overview

Foreword (TL;DR version)

Fine-tuning large models requires a large amount of GPU memory.

Most quantization in prior work targeted inference rather than training.

The experiments find that data quality matters more than data quantity.

Evaluation uses a combination of human raters and GPT-4.

Three techniques are proposed that make it possible to fine-tune a 65B model on a single GPU while matching the performance of 16-bit fine-tuning:

  • 4-bit NormalFloat (NF4) quantization: QLoRA uses a new data type, NF4, which is information-theoretically optimal for normally distributed weights and in practice outperforms 4-bit integers and 4-bit floats (a toy sketch of the scheme follows this list).
  • Double Quantization: QLoRA quantizes the quantization constants themselves a second time, further reducing the memory footprint per parameter.
  • Paged Optimizers: QLoRA uses paged optimizers to avoid out-of-memory errors caused by memory spikes during gradient checkpointing. This relies on the NVIDIA Unified Memory feature, which automatically performs page-to-page transfers between CPU and GPU so that GPU processing does not fail when GPU memory runs low. Optimizer states are allocated in paged memory, automatically evicted to CPU RAM when the GPU runs out of memory, and paged back into GPU memory during the optimizer update step.
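
The NF4 idea can be illustrated with a minimal, self-contained sketch (assuming a simplified equal-mass quantile construction; the paper's actual NF4 levels are fixed constants built somewhat differently, e.g. with an exact zero level): build 16 levels from standard-normal quantiles, scale each weight block by its absolute maximum, and round every weight to the nearest level.

```python
import numpy as np
from scipy.stats import norm

def nf4_levels():
    """16 levels from standard-normal quantiles, normalized to [-1, 1].

    Simplified equal-mass construction for illustration; the real NF4
    levels are fixed constants with an exact zero point.
    """
    probs = np.linspace(0.5 / 16, 1 - 0.5 / 16, 16)  # centers of 16 equal-mass bins
    q = norm.ppf(probs)                              # N(0, 1) quantiles
    return q / np.abs(q).max()

def quantize_block(w):
    """Quantize one weight block to 4-bit codes plus one scaling constant."""
    levels = nf4_levels()
    absmax = np.abs(w).max()                         # per-block quantization constant
    codes = np.abs(w[:, None] / absmax - levels[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), absmax

def dequantize_block(codes, absmax):
    """Recover approximate weights from 4-bit codes and the block constant."""
    return nf4_levels()[codes] * absmax

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.02, size=64).astype(np.float32)  # one 64-weight block
    codes, c = quantize_block(w)
    print("mean abs error:", np.abs(w - dequantize_block(codes, c)).mean())
```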

Summary

We propose QLoRA, an efficient fine-tuning approach that reduces memory usage enough to fine-tune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning task performance.
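
A rough back-of-the-envelope estimate (a sketch under stated assumptions, not figures reported in the paper) makes the 48GB budget plausible; the adapter fraction and optimizer-state sizes below are illustrative assumptions.

```python
# Illustrative memory estimate for QLoRA on a 65B-parameter model (GB = 1e9 bytes).
params = 65e9

base_weights = params * 4 / 8 / 1e9          # frozen base model stored as 4-bit NF4 (~32.5 GB)
quant_constants = params * 0.127 / 8 / 1e9   # constants after Double Quantization
                                             # (~0.127 bits/param; see the paper)

lora_fraction = 0.01                         # assumption: trainable adapters ~1% of params
lora_weights = params * lora_fraction * 2 / 1e9  # adapters kept in 16-bit (2 bytes each)
adam_states = params * lora_fraction * 8 / 1e9   # two fp32 Adam moments per adapter param

total = base_weights + quant_constants + lora_weights + adam_states
print(f"~{total:.0f} GB before activations")     # activations stay small with gradient
                                                 # checkpointing; spikes are absorbed by
                                                 # the paged optimizer
```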

QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low-Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previously publicly released models on the Vicuna benchmark, reaching 99.3% of ChatGPT's performance while requiring only 24 hours of fine-tuning on a single GPU.

QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights; (b) Double Quantization, which reduces the average memory footprint by quantizing the quantization constants; and (c) Paged Optimizers to manage memory spikes.
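
In practice, these pieces plus the LoRA adapters are commonly wired together through the Hugging Face transformers, peft, and bitsandbytes libraries. The sketch below assumes recent versions of those libraries; the checkpoint name, LoRA hyperparameters, learning rate, and optimizer variant are placeholders, not values taken from the paper.

```python
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# (a) NF4 base weights and (b) Double Quantization of the quantization constants.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                   # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)   # freezes base weights, enables gradient checkpointing

# LoRA adapters receive the gradients; the 4-bit base model stays frozen.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # illustrative subset of attention modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# (c) Paged optimizer: its states live in paged memory and can be evicted
# to CPU RAM when GPU memory runs out, then paged back for the update step.
optimizer = bnb.optim.PagedAdamW8bit(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4
)
```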

We have fine-tuned more than 1,000 models with QLoRA and provide a detailed analysis of instruction following and chatbot performance.

Our results show that QLoRA fine-tuning on a small, high-quality dataset can lead to state-of-the-art results, even with models smaller than the previous SoTA. We provide a detailed analysis of chatbot performance based on human and GPT-4 evaluation, showing that GPT-4 evaluation is a cheap and reasonable alternative to human evaluation.

Additionally, we find that current chatbot benchmarks are not trustworthy for accurately assessing chatbot performance levels. A lemon-picked analysis shows where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.

Ten questions about the paper

  1. What problem is the paper trying to solve?

This paper addresses the huge GPU memory requirements of fine-tuning large language models, so that such models can be fine-tuned on a single GPU.

  2. Is this a new problem?

Yes, it is relatively new. Previous work has mainly focused on quantization during inference and has not studied quantization during training and fine-tuning.

  3. What scientific hypothesis does this paper test?

The core scientific hypothesis is that 4-bit quantized fine-tuning can match the performance of full 16-bit fine-tuning.

  4. What is the related work? How can it be categorized? Who are the noteworthy researchers in this area?

Related research includes language model quantization and low-rank adapter (LoRA) fine-tuning. Noteworthy researchers include Tim Dettmers and Luke Zettlemoyer.

  5. What is the key to the solution mentioned in the paper?

The key is the 4-bit NormalFloat data type, together with techniques such as Double Quantization and Paged Optimizers. Combined, these techniques make high-fidelity 4-bit quantization practical for fine-tuning (see the memory-overhead sketch below).
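
To make the Double Quantization saving concrete, here is a small worked calculation based on the block sizes reported in the paper (first-level blocks of 64 weights with an fp32 constant each; second-level blocks of 256 constants quantized to 8 bits); treat it as an illustrative sketch.

```python
# Memory overhead of the quantization constants, in bits per base-model parameter.

# Without Double Quantization: one fp32 constant per 64-weight block.
plain = 32 / 64                        # = 0.5 bits per parameter

# With Double Quantization: first-level constants are quantized to 8 bits in
# blocks of 256, each block keeping one fp32 second-level constant.
double = 8 / 64 + 32 / (64 * 256)      # ~0.127 bits per parameter

print(f"without DQ: {plain:.3f} bits/param")
print(f"with DQ:    {double:.3f} bits/param")
print(f"saved:      {plain - double:.3f} bits/param")  # ~0.373 bits per parameter
```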

  6. How were the experiments in the paper designed?

Comparative experiments were designed across different model architectures, datasets, and model sizes, and the method's effectiveness was verified on academic benchmarks.

  7. What datasets are used for quantitative evaluation? Is the code open source?

The datasets used include GLUE and Super-Natural Instructions, among others, and the code is open-sourced on GitHub.

  8. Do the experiments and results in the paper support the scientific hypothesis being tested?

Yes. The detailed experimental results support the core hypothesis that 4-bit quantized fine-tuning can match the performance of full 16-bit fine-tuning.

  9. What contribution does this paper make?

The main contribution is to demonstrate, for the first time, the effectiveness of 4-bit quantized fine-tuning, and to train a new state-of-the-art chatbot family (Guanaco) on top of it.

  10. What's next? What follow-up work could be developed?

Future work could study quantized fine-tuning at other bit precisions, verify the results at larger model scales, and explore additional tasks.

Experiments

Experiment 1

Datasets and models

GLUE, Super-Natural Instructions

RoBERTa-large, T5

Experimental results

Our results consistently show that 4-bit QLoRA with the NF4 data type matches the performance of 16-bit full fine-tuning and 16-bit LoRA fine-tuning on academic benchmarks with well-established evaluation setups. We also show that NF4 is more effective than FP4 and that Double Quantization does not degrade performance. Taken together, this forms compelling evidence that 4-bit QLoRA tuning reliably produces results matching the 16-bit methods.


Experiment 2

Datasets and models

MMLU: This is a multiple-choice benchmark covering 57 tasks including elementary math, U.S. history, computer science, law, and more.

Alpaca, FLAN v2

Experimental results

Average 5-shot MMLU test accuracy for LLaMA 7B-65B models after adapter fine-tuning with different data types on Alpaca and FLAN v2. Overall, NF4 with Double Quantization (DQ) matches BFloat16 performance, while FP4 consistently lags about one percentage point behind both.


Limitations

Model scale

At the 33B and 65B model scales, it has not been established that QLoRA fully matches 16-bit full fine-tuning performance; this comparison was left for future work, mainly because of its enormous resource cost.

Datasets

Although evaluation was performed on MMLU, the Vicuna benchmark, and the OA benchmark, other benchmarks such as BigBench, RAFT, and HELM were not evaluated, so the results are not guaranteed to generalize to them.

Other fine-tuning methods

In addition to LoRA, there are various other parameter-efficient fine-tuning (PEFT) methods that were not included in the evaluation.

Origin blog.csdn.net/qq128252/article/details/134884456