Generative AI New World | Overview of the principles of efficient fine-tuning and quantization of large model parameters


The author of this article is Huang Haowen

Amazon Cloud Technology Senior Developer Evangelist

In the previous article, we compared two different methods for deploying large models on Amazon SageMaker. In this article, we will discuss two topics that developers in the large language model (LLM) field currently care about most: efficient fine-tuning and quantization of large language models.

Fine-tuning large language models allows developers to adapt an open-source base model to improve performance on domain-specific tasks. In the next two articles, we will explore how to use Hugging Face's parameter-efficient fine-tuning (PEFT) library and QLoRA quantization technology to fine-tune and deploy large models efficiently on a single instance.

Since this is still a cutting-edge research field, I plan to use two articles to explain the principles and the main papers behind them, and then walk everyone through hands-on practice on Amazon Cloud Technology. The two articles are:

  • Principle exploration: overview of the principles of efficient fine-tuning and quantization of large model parameters;

  • Hands-on lab: Fine-tuning the Falcon-40B open source large model on a single ml.g5.12xlarge instance using PEFT and QLoRA quantization techniques.

First, let's go through the theoretical foundations behind parameter-efficient fine-tuning (PEFT) and QLoRA quantization.

Parameter-efficient fine-tuning (PEFT) of large models

The scale of pre-trained language models keeps growing, and full fine-tuning on consumer-grade hardware is becoming increasingly unrealistic. At the same time, because a fully fine-tuned model is the same size as the pre-trained model, the cost of separately storing and deploying a fine-tuned model for each downstream task can be very high.

Parameter-Efficient Fine-Tuning (PEFT) technology was proposed to solve these two problems.

While delivering almost the same performance as full fine-tuning, PEFT technology helps a pre-trained model adapt efficiently to various downstream tasks without fine-tuning all of its parameters. PEFT freezes most of the pre-trained parameters and fine-tunes only a small number of model parameters, which greatly reduces the compute and storage costs of fine-tuning.

The figure below shows PEFT, a library open sourced by Hugging Face for efficient fine-tuning of large models.

[Figure: the Hugging Face PEFT library on GitHub]

Source: https://github.com/huggingface/peft

As can be seen above, the library currently supports the following six methods for efficient fine-tuning of large models (a minimal usage sketch follows the list):

1. LoRA: LoRA: Low-Rank Adaptation of Large Language Models

https://arxiv.org/abs/2106.09685

2. Prefix Tuning: Prefix-Tuning: Optimizing Continuous Prompts for Generation; P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks

https://aclanthology.org/2021.acl-long.353/

https://arxiv.org/pdf/2110.07602.pdf

3. P-Tuning: GPT Understands, Too

https://arxiv.org/abs/2103.10385

4. Prompt Tuning: The Power of Scale for Parameter-Efficient Prompt Tuning

https://arxiv.org/abs/2104.08691

5. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

https://arxiv.org/abs/2303.10512

6. (IA)^3: Infused Adapter by Inhibiting and Amplifying Inner Activations

https://arxiv.org/abs/2205.05638
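To give a sense of how little code these methods require, here is a minimal, hedged sketch using the PEFT library's LoRA support; the model name and hyperparameters are illustrative assumptions, not the settings used in the hands-on article.

```python
# Minimal PEFT/LoRA sketch. The model id and hyperparameters are illustrative
# assumptions; the hands-on article will use its own settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")  # assumed small demo model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # causal language modeling
    r=8,               # rank of the low-rank decomposition
    lora_alpha=32,     # scaling factor applied to the LoRA update
    lora_dropout=0.05,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the parameters are trainable
```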

Due to space limitations, these two articles will focus on LoRA-based parameter-efficient fine-tuning of large models and its implementation on Amazon Cloud Technology.

Low-rank adaptation (LoRA) for large language models

In the 2021 LoRA paper, researchers first proposed the low-rank adaptation (LoRA) method for large language models. LoRA freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of parameters that need to be trained for downstream tasks. The LoRA paper is shown below.

[Figure: the LoRA paper]

Source: https://arxiv.org/pdf/2106.09685.pdf

In the abstract of the paper, the researchers give an incisive explanation of why the LoRA method is needed. Their analysis: as pre-trained models get larger and larger, retraining all model parameters becomes unrealistic. Taking the GPT-3 175B model as an example, deploying independent fine-tuned instances of the model, each with 175B parameters, carries training costs that are prohibitive for ordinary enterprises and individuals.

The paper proposes low-rank adaptation (LoRA), which freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared with GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by a factor of 10,000 and the GPU memory requirement by a factor of 3.

The value of this 2021 paper has become even more apparent in today's era of open-source large models. The famous Figure 1 in the LoRA paper has been turned into an animation by some researchers, which makes the principle more vivid to explain.

As shown below:

[Figure: animated illustration of the LoRA reparameterization (Figure 1 of the LoRA paper)]

Source: https://huggingface.co/blog/4bit-transformers-bitsandbytes

The figure above illustrates that during reparameterization, only A and B need to be trained. The low-rank adaptation (LoRA) method indirectly trains some of the dense layers in a neural network by optimizing rank-decomposition matrices of those layers' change during adaptation, while keeping the pre-trained weights frozen. Taking GPT-3 175B as an example, even when the full rank d is as high as 12,288, a very low rank r (shown in the figure) is sufficient, making LoRA quite efficient in both storage and computation.
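To make the "only A and B are trained" idea concrete, here is a small sketch of a LoRA-style linear layer written directly in PyTorch. It is a simplified illustration of the paper's idea, not the PEFT library's actual implementation; the layer sizes, rank, and scaling convention are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Simplified LoRA layer: y = W0 x + (alpha / r) * B A x, with W0 frozen."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False                   # freeze pre-trained weight W0
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # trainable A (r x d_in)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))        # trainable B (d_out x r), zero-initialized
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T) @ self.lora_B.T

layer = LoRALinear(d_in=1024, d_out=1024, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / total: {total}")  # 16,384 trainable vs. ~1.06M total
```

Only A and B receive gradients; W0 stays frozen, which is exactly the reparameterization shown in the figure, and the smaller the rank r, the fewer parameters need to be stored per downstream task.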

Efficient fine-tuning with low-precision quantization (QLoRA)

Having gone through the LoRA paper, let's now look at the innovative angle and vision of the QLoRA paper.

LoRA adds a small number of trainable parameters (adapters) to each layer of a large model and freezes all the original parameters. For fine-tuning, only the adapter weights then need to be updated, which can significantly reduce the memory footprint. However, as large models grow even larger, their full-precision representations can no longer fit into the memory of a single GPU, or even multiple GPUs. To support fine-tuning and inference of models of this scale on a single instance, QLoRA emerged.

1. QLoRA paper overview

QLoRA stands for Quantized LLMs with Low-Rank Adapters; it comes from a paper published by the University of Washington in May 2023, shown below:

[Figure: the QLoRA paper]

Source: https://arxiv.org/pdf/2305.14314.pdf

To sum up, QLoRA makes three important contributions: 4-bit NormalFloat quantization, double quantization, and paged optimizers based on NVIDIA unified memory. Each is explained below, followed by a configuration sketch:

  • 4-bit NormalFloat quantization (NF4): an improved quantization method that ensures an equal number of values in each quantization bin, avoiding computational problems and errors on outlier values

  • Double quantization: defined by the QLoRA authors as the process of quantizing the quantization constants themselves to save additional memory

  • Unified memory paging (Paged Optimizers): relies on NVIDIA unified memory management to automatically handle page-to-page transfers between the CPU and GPU

[Figure: from the QLoRA paper]

Source: https://arxiv.org/pdf/2305.14314.pdf
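Putting the three contributions together, below is a hedged sketch of how they are typically enabled through the transformers/bitsandbytes integration; the model id is a placeholder assumption, and a GPU with bitsandbytes installed is assumed.

```python
# Sketch: load a model with 4-bit NF4 quantization and double quantization
# via bitsandbytes. The model id is a placeholder assumption.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat (NF4)
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # computation data type
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",                     # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```

The paged optimizer is enabled on the training side instead, for example via the `optim="paged_adamw_32bit"` option of transformers' TrainingArguments, as shown in the sketch at the end of this article.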

Regarding QLoRA's performance, the paper reports that Guanaco, a family of models the authors trained with the QLoRA method, outperformed all previously publicly released models on the Vicuna benchmark. The Guanaco 65B model, fine-tuned within 24 hours, even reaches 99.3% of the performance level of ChatGPT. The benchmark results published in the paper are shown below:

[Figure: benchmark results from the QLoRA paper]

Source: https://arxiv.org/pdf/2305.14314.pdf

The "Q" in QLoRA stands for quantization. To help you understand this abstract concept, the next section first sorts out the common data types in machine learning, so that everyone has a sense of how many resources each data type occupies; we then turn to the papers to uncover the ideas behind the quantization techniques.

2. Common data types for machine learning

We start with a basic understanding of the different floating-point data types, also referred to as "precision" in the context of machine learning. The size of a model is determined by the number of its parameters and their precision, usually Float32 (FP32), Float16 (FP16), or BFloat16 (BF16). As shown below:

[Figure: bit layouts of common floating-point formats]

Source: https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/

Float32 (FP32) data type: the standardized IEEE 32-bit floating-point representation, which can represent a wide range of floating-point numbers. In FP32, 8 bits are used for the exponent, 23 bits for the mantissa, and 1 bit for the sign. Most hardware supports FP32 operations and instructions.

Float16 (FP16) data type: 5 bits are reserved for the exponent and 10 bits for the mantissa. This makes the representable range of FP16 numbers much smaller than FP32, exposing FP16 to the risk of overflow (trying to represent very large numbers) and underflow (trying to represent very small numbers).

For example, if you compute 10k * 10k, the result is 100M, which cannot be represented in FP16 because the largest representable number is only about 64k. The computation overflows, and in sequential computations such as a neural network you can quickly end up with NaN (Not a Number) results, destroying all previous work. Loss scaling can partially work around this problem, but it does not always help.

BFloat16 (BF16) data type: to avoid these limitations, a new data type called bfloat16 (BF16) emerged. In BF16, 8 bits are reserved for the exponent (the same as FP32) and 7 bits for the mantissa. This means BF16 retains the same dynamic range as FP32, but compared with FP16 we lose 3 bits of precision. Large numbers are no longer a problem at all, but the precision is worse than FP16.
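A quick way to see the difference in dynamic range is to try the 10k * 10k example in both half-precision formats. A small sketch, assuming a reasonably recent PyTorch version:

```python
import torch

a16 = torch.tensor(10_000.0, dtype=torch.float16)
b16 = torch.tensor(10_000.0, dtype=torch.bfloat16)

print(a16 * a16)  # inf: 1e8 overflows FP16, whose largest value is about 65504
print(b16 * b16)  # on the order of 1e8: BF16 shares FP32's exponent range, so no overflow (only rounding)
print(torch.tensor(1.0001, dtype=torch.bfloat16))  # rounds to 1.0: BF16 keeps only ~2-3 decimal digits
```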

With the Ampere architecture, NVIDIA also introduced the TensorFloat-32 (TF32) data type, which combines the dynamic range of BF16 with the precision of FP16, using only 19 bits. It is currently used only internally for certain operations.

In machine learning terminology, FP32 is called full precision (4 bytes), while BF16 and FP16 are called half precision (2 bytes). The int8 (INT8) data type uses an 8-bit representation and can store 2^8 different values (between [0, 255], or [-128, 127] for signed integers).

Although training and inference could ideally be done in FP32, it is twice as slow as FP16/BF16, so in practice a mixed-precision approach is used: the weights are kept in FP32 as a precise "master weights" reference, while the forward and backward passes are computed in FP16/BF16 to speed up training. The FP16/BF16 gradients are then used to update the FP32 master weights.

During training, the master weights are always stored in FP32, but in practice half-precision weights often provide result quality similar to their FP32 counterparts during inference: a full-precision reference of the model is only needed when it receives gradient updates. This means we can use half-precision weights at inference time and get the same results with half the GPU memory.
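As a rough illustration of the mixed-precision recipe (FP32 master weights, FP16 compute, scaled gradients), here is a hedged sketch using PyTorch automatic mixed precision; it assumes a CUDA GPU and uses a toy model and loss purely for illustration.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()                  # parameters stay in FP32 ("master weights")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                        # loss scaling to avoid FP16 underflow

for _ in range(10):
    x = torch.randn(8, 1024, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()                       # forward pass computed in FP16
    scaler.scale(loss).backward()                           # backward pass on the scaled loss
    scaler.step(optimizer)                                  # unscales gradients, updates FP32 weights
    scaler.update()
    optimizer.zero_grad()
```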

[Figure: from the Hugging Face bitsandbytes blog]

Source: https://huggingface.co/blog/hf-bitsandbytes-integration

Let's elaborate with an example. As shown in the image above, if we want to calculate the model size in bytes, we multiply the number of parameters by the size of the chosen precision in bytes. For example, if we use the bfloat16 version of the BLOOM-176B model, we need (176 * 10**9) x 2 bytes = 352 GB! That would be a huge challenge if you only had a few GPUs.
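The same back-of-the-envelope arithmetic is easy to script for other precisions; a small sketch using the parameter count quoted above (weights only, ignoring activations, optimizer states, and quantization constants):

```python
params = 176e9  # BLOOM-176B parameter count

bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "INT8": 1, "4-bit": 0.5}
for name, nbytes in bytes_per_param.items():
    print(f"{name:>10}: {params * nbytes / 1e9:.0f} GB")
# FP32 ~704 GB, FP16/BF16 ~352 GB, INT8 ~176 GB, 4-bit ~88 GB
```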

But what if we could use a different data type to store these weights with less memory? This is where a technique called quantization makes its debut.

3. Overview of Quantization Technology

Quantization essentially works by "rounding" from one data type to another. It is a noisy process that can lose information; in other words, it is a form of lossy compression.
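To make "rounding from one data type to another" concrete, here is a tiny sketch of absmax INT8 quantization, one common scheme used here purely as an illustration: each value is scaled by 127 / max(|x|), rounded to an 8-bit integer, and later dequantized; the round trip loses a little information.

```python
import torch

x = torch.tensor([1.2, -0.5, 0.03, 3.1])

scale = 127 / x.abs().max()                      # absmax scaling factor
x_int8 = torch.round(x * scale).to(torch.int8)   # quantize: round to 8-bit integers
x_back = x_int8.to(torch.float32) / scale        # dequantize

print(x_int8)  # tensor([ 49, -20,   1, 127], dtype=torch.int8)
print(x_back)  # close to x but not identical: the rounding error is the quantization noise
```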

FP8 and FP4 represent floating-point precisions of 8 and 4 bits respectively. Let's first look at how floating-point values are represented in the FP8 data format, and then look at the FP4 data format.

 FP8 data format

A floating-point number contains n bits, each belonging to a specific category responsible for representing a component of the number (sign, mantissa, or exponent). The FP8 (Floating Point 8) data format was originally introduced in the paper "FP8 Formats for Deep Learning" and has two different encodings: E4M3 (4-bit exponent, 3-bit mantissa) and E5M2 (5-bit exponent, 2-bit mantissa).

  • FP8 Formats for Deep Learning

    https://arxiv.org/pdf/2209.05433.pdf

The bit layout of each data type is shown in the figure below. For example, the FP8 E4M3 format can represent floating-point numbers between -448 and 448; in the FP8 E5M2 format, as the number of exponent bits increases, the range grows to -57344 to 57344, but precision drops because the total number of representable values stays the same.

The experience of some researchers suggests that E4M3 is best suited for the forward pass and E5M2 for the backward pass.

[Figure: bit layouts of different data types, including the FP8 formats]

Source: https://huggingface.co/blog/4bit-transformers-bitsandbytes

While looking for material, I also found another figure explaining the different machine learning data formats that is more concise and visually appealing than the one above, so I am posting it here as well for your reference:

[Figure: comparison of machine learning data formats]

Source: https://www.nextplatform.com/2022/03/31/deep-dive-into-nvidias-hopper-gpu-architecture/

 FP4 data format

Unlike the FP8 data format, FP4 precision uses only 4 bits to represent a number: 1 sign bit plus 3 bits split between exponent and mantissa (for example, 3 exponent bits and 0 mantissa bits). In the table below, the columns differ in their sign and mantissa bits, while the rows differ in their exponent bits.

[Figure: minifloat value table]

Source: https://en.wikipedia.org/wiki/Minifloat

There is no fixed format for FP4 precision, meaning different combinations of exponent and mantissa bits can be used. Generally, 3 exponent bits work a bit better because they provide better accuracy, but sometimes 2 exponent bits and 1 mantissa bit give better performance.

Due to space limitations, we will not go into the details of the FP4 data format here. Hugging Face, together with the paper's authors, published a detailed blog post, "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA", which covers the low-level details of 4-bit quantization. Interested developers can refer to:

https://huggingface.co/blog/4bit-transformers-bitsandbytes

Summary

Compared with standard 16-bit fine-tuning, QLoRA reduces the memory usage of large-model fine-tuning without a performance trade-off. This approach makes it possible to fine-tune a 33B model on a single 24 GB GPU and a 65B model on a single 48 GB GPU. QLoRA uses 4-bit quantization to compress a pre-trained language model, freezes the large model's parameters, and adds a relatively small number of trainable parameters to the model in the form of low-rank adapters. During fine-tuning, QLoRA backpropagates gradients through the frozen, 4-bit-quantized pre-trained model into the low-rank adapters, and the LoRA layers are the only parameters updated during training.

The quantization process of QLoRA can basically be summarized as follows (a fine-tuning sketch follows the list):

  1. QLoRA has a storage data type (NF4) for the base model weights and a computation data type (BF16) for performing calculations;

  2. QLoRA dequantizes weights from the storage data type to the computation data type to perform the forward and backward passes, but during these passes it only computes weight gradients for the LoRA parameters, which use BF16. Weights are decompressed only when needed, so memory usage stays low during training and inference.
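Putting quantization and LoRA together, an end-to-end preparation sketch might look like the following. It is a hedged sketch only: the model id, target module names, and hyperparameters are assumptions for illustration, not the exact setup of the next article.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "tiiuae/falcon-7b"  # placeholder; the hands-on article targets Falcon-40B

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)   # enables gradient checkpointing, casts norms, etc.

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["query_key_value"],          # assumed attention projection name for Falcon-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="qlora-demo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    optim="paged_adamw_32bit",                   # the paged optimizer, as exposed by transformers
)
# ...then pass `model`, `training_args`, and a tokenized dataset to a Trainer.
```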


After exploring the basic theory, we will move on to practice. In the next article, we'll explore using an Amazon SageMaker Notebook to quickly and efficiently fine-tune large language models in an interactive environment. We will apply QLoRA and 4-bit bitsandbytes quantization, and use Hugging Face PEFT on Amazon SageMaker to fine-tune the Falcon-40B model, so stay tuned.

Please continue to follow the "Amazon Cloud Developer" WeChat official account to learn more about technology sharing and cloud development trends for developers!

The author of this article


Huang Haowen

Senior Developer Evangelist at Amazon Cloud Technology, focusing on AI/ML, data science, and related areas. He has more than 20 years of experience in architecture design, technology, and entrepreneurial management in industries such as telecommunications, mobile internet, and cloud computing. He has worked for Microsoft, Sun Microsystems, China Telecom, and other companies, providing consulting services on AI/ML, data analytics, and enterprise digital transformation to corporate customers in gaming, e-commerce, media, advertising, and other sectors.
