With 8 GPUs, you can fine-tune all the parameters of a 65-billion-parameter model. The latest paper from Qiu Xipeng's team is here!


Published by: Qiu Xipeng's team; edited by: Heart of the Machine

Full-parameter fine-tuning now uses about as much GPU memory as inference, and large models are no longer toys reserved for big tech companies.

A 65-billion-parameter model can now have all of its parameters fine-tuned on 8 GPUs. While technology giants keep training ever larger models, the academic community has been looking for ways to optimize them, and methods for making the most of limited compute have recently reached a new level.

Large language models (LLMs) have revolutionized natural language processing (NLP), demonstrating remarkable abilities such as emergence and grokking. However, building a model with broad general capabilities requires billions of parameters, which sharply raises the bar for NLP research. Tuning an LLM typically requires expensive GPU resources, such as an 8×80GB GPU node, making it difficult for small laboratories and companies to participate in research in this field.

Recently proposed parameter-efficient fine-tuning (PEFT) techniques, such as LoRA and Prefix-tuning, offer ways to tune LLMs with limited resources. However, these methods do not provide a practical solution for full-parameter fine-tuning, which is widely recognized as more powerful than parameter-efficient fine-tuning.
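For readers unfamiliar with PEFT, the sketch below shows the core idea behind LoRA: freeze the pre-trained weight and learn only a low-rank update. It is a minimal illustration assuming PyTorch, not any particular library's implementation; the rank `r` and scaling `alpha` defaults are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style adapter: the frozen base weight W is augmented with a
    trainable low-rank update B @ A, so only r * (d_in + d_out) parameters are
    trained instead of d_in * d_out."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pre-trained weight
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # frozen path + low-rank trainable path
        return self.base(x) + (x @ self.lora_A.T) @ self.lora_B.T * self.scaling
```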

In the paper "Full Parameter Fine-tuning for Large Language Models with Limited Resources", submitted last week by Qiu Xipeng's team at Fudan University, the researchers propose a new optimizer, LOw-Memory Optimization (LOMO).

By integrating LOMO with existing memory-saving techniques, the new method cuts memory usage to 10.8% of the standard approach (the DeepSpeed solution). As a result, it can perform full-parameter fine-tuning of a 65B model on a single machine with 8 RTX 3090s, each with 24GB of memory.


Paper link: https://arxiv.org/abs/2306.09782

In this work, the authors analyze memory usage during LLM training across four components: activations, optimizer states, gradient tensors, and parameters, and optimize the training process in three respects:

  1. They rethink the role of the optimizer from an algorithmic perspective and find that SGD is a good enough substitute for full-parameter fine-tuning of LLMs. This lets them drop the optimizer state entirely, since SGD stores no intermediate state.

  2. The newly proposed optimizer LOMO reduces the memory used for gradient tensors to O(1): only the single largest gradient tensor needs to be resident at any time.

  3. To stabilize mixed-precision training with LOMO, the authors integrate gradient normalization and loss scaling, and switch certain computations to full precision during training (a generic sketch of these stabilization techniques follows this list).
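To make point 3 concrete, here is a minimal, generic sketch of how dynamic loss scaling and gradient clipping are usually combined with a plain SGD update in fp16 training; it is not the authors' implementation, and the 0.5×/2× scale-adjustment factors and the `max_grad_norm` default are illustrative assumptions.

```python
import torch

def fp16_sgd_step(model, loss, lr, loss_scale, max_grad_norm=1.0):
    """One fp16 training step with dynamic loss scaling and gradient clipping
    (a generic recipe, not the paper's exact implementation).
    Returns the loss scale to use for the next step."""
    (loss * loss_scale).backward()           # scale up so small gradients don't underflow in fp16

    grads = [p.grad for p in model.parameters() if p.grad is not None]
    if any(not torch.isfinite(g).all() for g in grads):
        model.zero_grad()
        return loss_scale * 0.5              # overflow: skip this update and shrink the scale

    # Clip the (still scaled) gradients to an equivalent unscaled norm of max_grad_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm * loss_scale)

    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.add_(p.grad, alpha=-lr / loss_scale)  # unscale inside the SGD update itself
    model.zero_grad()
    return loss_scale * 2.0                  # no overflow: grow the scale (real schedules grow more slowly)
```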

With these techniques, memory usage comes down to the parameters plus the activations plus the largest single gradient tensor, pushing full-parameter fine-tuning to an extreme where it costs roughly the same memory as inference; this is close to a lower bound, since a forward+backward pass cannot use less memory than the forward pass alone. It is worth noting that LOMO achieves these savings without compromising the fine-tuning process, because the parameter update process remains equivalent to SGD.

The study evaluates LOMO's memory usage and throughput, showing that it allows researchers to train a 65B-parameter model on 8 RTX 3090 GPUs. To verify performance on downstream tasks, the authors also apply LOMO to fine-tune all parameters of an LLM on the SuperGLUE benchmark collection. The results demonstrate that LOMO is effective for optimizing LLMs with billions of parameters.

Method introduction

The methods section of the paper introduces LOMO (LOw-Memory Optimization) in detail. In general, a gradient tensor holds the gradient of a parameter tensor and has the same size as that parameter, so its memory overhead is substantial, and existing deep learning frameworks such as PyTorch store gradient tensors for all parameters. There are two reasons gradients are normally stored: to compute the optimizer state and to normalize gradients.

Since this study uses SGD as the optimizer, there is no gradient-dependent optimizer state, and the authors propose alternatives to gradient normalization.

They propose LOMO: as shown in Algorithm 1, LOMO fuses gradient computation and the parameter update into a single step, thereby avoiding the storage of gradient tensors.

The figure below compares SGD and LOMO in the backpropagation and parameter update stages, where Pi is a model parameter and Gi is the gradient corresponding to Pi. LOMO fuses gradient computation and the parameter update into one step, keeping gradient tensors to a minimum.

[Figure: comparison of SGD and LOMO in the backpropagation and parameter update stages]

The pseudocode of the LOMO algorithm:

[Algorithm 1: LOMO pseudocode]

Specifically, the study expresses vanilla gradient descent as the two-step process grad = ∂L/∂p followed by p = p − lr · grad, which first computes the gradient and then updates the parameter. The fused version is p = p − lr · ∂L/∂p.

The key idea of this work is to update a parameter immediately when its gradient is computed, so that gradient tensors are not kept in memory. This can be achieved by injecting hook functions into backpropagation. PyTorch provides APIs for registering such hooks, but the current API cannot achieve an exact immediate update; instead, the method stores the gradient of at most one parameter in memory and updates parameters one by one as backpropagation proceeds. This reduces gradient memory from storing gradients for all parameters to storing the gradient of a single parameter.
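A minimal sketch of this hook-based fused update, assuming a recent version of PyTorch (`register_post_accumulate_grad_hook` requires PyTorch 2.1+); the authors' released implementation differs in detail (e.g., it also handles gradient clipping and loss scaling), so treat this as an illustration of the idea rather than LOMO itself.

```python
import torch

def attach_fused_sgd(model: torch.nn.Module, lr: float):
    """Apply the SGD update for each parameter as soon as its gradient has been
    accumulated during backward, then free that gradient, so at most one
    parameter's gradient tensor needs to stay in memory at a time."""
    def hook(p):
        with torch.no_grad():
            p.add_(p.grad, alpha=-lr)   # p <- p - lr * grad, applied immediately
        p.grad = None                   # free the gradient right away
    for p in model.parameters():
        if p.requires_grad:
            # fires after p.grad has been populated for this backward pass
            p.register_post_accumulate_grad_hook(hook)

# usage sketch: attach hooks once, then just run forward + backward; no optimizer.step()
# attach_fused_sgd(model, lr=1e-3)
# loss = compute_loss(model, batch)   # hypothetical helper
# loss.backward()                     # parameters are updated during this call
```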

Most of LOMO's memory usage coincides with that of parameter-efficient fine-tuning methods, which means combining LOMO with these methods adds only a small amount of gradient memory. This allows PEFT methods to tune more parameters.

Experimental results

In the experiments, the researchers evaluate the proposed method from three angles: memory usage, throughput, and downstream performance. Unless otherwise specified, all experiments use LLaMA models ranging from 7B to 65B parameters.

Memory usage

The researchers first dissect the memory usage of model states and activations during training under different settings. As shown in Table 1, the LOMO optimizer leads to a dramatic reduction in memory footprint: from 102.20GB with AdamW, and from 51.99GB with SGD, down to 14.58GB. This large reduction is mainly due to the lower memory requirements for gradients and optimizer states. As a result, memory during training is mostly occupied by the parameters, which is comparable to memory usage during inference.

[Table 1: memory usage of model states and activations under different settings]

As shown in Figure 2, when the AdamW optimizer is used to train LLaMA-7B, a considerable share of memory (73.7%) is allocated to the optimizer state. Replacing AdamW with SGD effectively shrinks the share occupied by the optimizer state and thus reduces GPU memory usage (from 102.20GB to 51.99GB). With LOMO, the parameter update and the backward pass are fused into a single step, further eliminating the memory needed for gradient tensors.

[Figure 2: breakdown of memory usage when training LLaMA-7B with different optimizers]
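As a rough back-of-envelope estimate (not the paper's exact accounting), the sketch below shows why the optimizer state dominates under AdamW; it assumes fp16 weights and gradients with fp32 optimizer states, a round 7B parameter count, and it ignores activations and allocator overhead.

```python
def model_state_gb(n_params: float, optimizer: str) -> float:
    """Rough model-state memory (GB) for mixed-precision training:
    fp16 weights + fp16 gradients, plus fp32 optimizer state."""
    GB = 1024 ** 3
    weights = 2 * n_params                # fp16 weights
    grads = 2 * n_params                  # fp16 gradients (full set kept by standard training)
    if optimizer == "adamw":              # fp32 master weights + momentum + variance
        opt_state = 3 * 4 * n_params
    elif optimizer == "sgd":              # fp32 master weights only (no momentum)
        opt_state = 4 * n_params
    else:                                 # "lomo": no optimizer state, at most one gradient alive
        grads = 0
        opt_state = 0
    return (weights + grads + opt_state) / GB

for opt in ("adamw", "sgd", "lomo"):
    print(opt, round(model_state_gb(7e9, opt), 1), "GB")
# roughly: adamw ~104 GB, sgd ~52 GB, lomo ~13 GB (plus activations),
# in the same ballpark as Table 1's 102.20 / 51.99 / 14.58 GB
```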

Throughput

The researchers compared the throughput of LOMO, AdamW, and SGD. The experiments were performed on a server equipped with 8 RTX 3090 GPUs.

For the 7B model, LOMO's throughput shows a clear advantage, exceeding that of AdamW and SGD by roughly a factor of 11. This large improvement is mainly because LOMO can train the 7B model on a single GPU, which removes inter-GPU communication overhead. SGD's slightly higher throughput compared with AdamW comes from SGD skipping the momentum and variance computations.

The 13B model cannot be trained with AdamW on 8 RTX 3090 GPUs due to memory constraints. Even though LOMO requires model parallelism in this case, it still beats SGD in throughput, thanks to its memory efficiency and to the fact that only two GPUs are needed to train the model under the same settings, which lowers communication costs and raises throughput. Furthermore, SGD runs out of memory (OOM) on 8 RTX 3090 GPUs when training the 30B model, while LOMO manages well with only 4 GPUs.

Finally, the researchers successfully trained a 65B model on 8 RTX 3090 GPUs, reaching a throughput of 4.93 TGS (tokens per GPU per second). With this server configuration and LOMO, training on 1,000 samples of 512 tokens each takes about 3.6 hours.
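Reading TGS as tokens per GPU per second, these reported numbers are mutually consistent, as the quick check below shows:

```python
# Sanity check: does 1,000 samples x 512 tokens in 3.6 hours on 8 GPUs match ~4.93 TGS?
samples, tokens_per_sample, gpus, hours = 1000, 512, 8, 3.6
tgs = samples * tokens_per_sample / (gpus * hours * 3600)
print(round(tgs, 2))   # ~4.94 tokens per GPU per second, close to the reported 4.93 TGS
```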

Downstream performance

To evaluate LOMO's effectiveness for fine-tuning large language models, the researchers conducted an extensive series of experiments, comparing LOMO with two other methods: zero-shot (no fine-tuning at all) and LoRA, a currently very popular parameter-efficient fine-tuning technique.

[Table 3: downstream performance of zero-shot, LoRA, and LOMO]

Table 3 results show:

  • LOMO performs significantly better than zero-shot;

  • LOMO generally outperforms LoRA in most experiments;

  • LOMO scales efficiently to models with 65 billion parameters.

LOMO and LoRA are essentially independent of each other. To test this statement, the researchers conducted experiments on the BoolQ and MultiRC datasets using LLaMA-13B. The result is shown in Figure 3.

They found that LOMO consistently improves on LoRA's performance, no matter how strong LoRA's results already are. This suggests that the two fine-tuning approaches are complementary: LOMO fine-tunes the pre-trained model's own weights, while LoRA tunes additional modules. LOMO therefore does not interfere with LoRA; instead, it enables better model tuning for downstream tasks.

[Figure 3: results of combining LOMO with LoRA on BoolQ and MultiRC with LLaMA-13B]
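One hypothetical way to combine the two approaches, sketched below under the same PyTorch 2.1+ assumption as before: base weights get the fused SGD update during backward, while LoRA adapter parameters (assumed here to contain "lora_" in their names) are handled by a regular AdamW optimizer. This is an illustration of the complementarity, not the authors' experimental setup.

```python
import torch

def prepare_lomo_plus_lora(model, base_lr=1e-3, lora_lr=1e-4):
    """Fused SGD-style updates for the base weights, AdamW for LoRA adapters."""
    lora_params = []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "lora_" in name:
            lora_params.append(p)          # adapters: left to the AdamW optimizer
        else:
            def hook(param, lr=base_lr):   # base weight: update in-flight, then free the grad
                with torch.no_grad():
                    param.add_(param.grad, alpha=-lr)
                param.grad = None
            p.register_post_accumulate_grad_hook(hook)
    return torch.optim.AdamW(lora_params, lr=lora_lr)

# usage sketch:
# optimizer = prepare_lomo_plus_lora(model)
# loss.backward()        # base weights updated during backward
# optimizer.step()       # LoRA adapters updated afterwards
# optimizer.zero_grad()
```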

See the original paper for more details.




Source: blog.csdn.net/Datawhale/article/details/131356015