65 billion parameters, fine-tunable on 8 GPUs: Qiu Xipeng's team lowers the barrier to large models

Editing | Heart of the Machine

Full-parameter fine-tuning with no more GPU memory than inference: large models are no longer just toys for big tech companies.

In the large-model arena, tech giants keep training ever-larger models while the academic community works to make them more efficient, and efforts to optimize limited compute have recently reached a new level.

Large language models (LLMs) have revolutionized natural language processing (NLP), demonstrating remarkable capabilities such as emergence and grokking. However, obtaining a model with general capabilities requires billions of parameters, which greatly raises the bar for NLP research. Fine-tuning an LLM typically demands expensive GPU resources, such as an 8×80GB GPU server, which makes it difficult for small laboratories and companies to participate in research in this area.

Recently, parameter-efficient fine-tuning (PEFT) techniques such as LoRA and Prefix-tuning have offered a way to tune LLMs with limited resources. However, these methods do not provide a practical route to full-parameter fine-tuning, which is widely regarded as more powerful than parameter-efficient approaches.
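For reference, the sketch below shows what a parameter-efficient method like LoRA boils down to: a frozen pre-trained weight plus a small trainable low-rank correction. This is a minimal illustration of the general idea, not code from the paper; the rank and scaling values are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style layer: the frozen base weight is augmented with a
    trainable low-rank update, so only a tiny fraction of parameters is tuned."""
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                     # pre-trained weight stays frozen
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # frozen path + trainable low-rank correction
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```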

In the paper "Full Parameter Fine-tuning for Large Language Models with Limited Resources" submitted by Qiu Xipeng's team at Fudan University last week, the researchers proposed a new optimizer, LOw-Memory Optimization (LOMO).

By integrating LOMO with existing memory-saving techniques, the new method reduces memory usage to 10.8% of the standard approach (the DeepSpeed solution). As a result, it can perform full-parameter fine-tuning of a 65B model on a single machine with 8× RTX 3090 GPUs, each with 24GB of memory.
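A rough back-of-the-envelope calculation makes the 8×24GB claim plausible. This is a sketch under the assumptions that weights are stored in fp16 and that activation memory and the single live gradient tensor are small by comparison; the exact accounting is in the paper.

```python
# Rough memory budget for full-parameter fine-tuning of a 65B model on 8x RTX 3090.
# Illustrative estimate only, under the assumptions stated above.
n_params = 65e9
GB = 1024 ** 3

weights_fp16_gb = n_params * 2 / GB   # ~121 GB of fp16 weights, sharded across GPUs
available_gb = 8 * 24                 # 192 GB of total GPU memory on the machine
adamw_state_gb = n_params * 12 / GB   # fp32 master copy + momentum + variance: ~727 GB

print(f"weights: {weights_fp16_gb:.0f} GB, available: {available_gb} GB, "
      f"AdamW state alone would need: {adamw_state_gb:.0f} GB")
# Dropping the optimizer state (SGD) and never materializing all gradients at once (LOMO)
# is what leaves room for the weights plus activations on 8x 24GB cards.
```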

Paper link: https://arxiv.org/abs/2306.09782

In this work, the authors analyze four components of memory usage during LLM training: activations, optimizer states, gradient tensors, and parameters, and optimize the training process in three ways:

  1. They rethink the role of the optimizer from an algorithmic standpoint and find that SGD is a good substitute for full-parameter fine-tuning of LLMs. Since SGD stores no intermediate state, the entire optimizer-state portion of memory can be removed.

  2. The newly proposed optimizer, LOMO, reduces the memory used by gradient tensors to O(1): only as much memory as the single largest gradient tensor.

  3. To stabilize mixed-precision training with LOMO, the authors integrate gradient normalization and loss scaling, and switch certain computations to full precision during training (a hedged sketch of the loss-scaling idea follows this list).
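To make point 3 concrete, here is a minimal sketch of the loss-scaling idea in fp16 training. The toy model, the fixed scale factor, and the learning rate are illustrative assumptions (a CUDA device is assumed), not the paper's exact recipe, which also handles gradient normalization and selective fp32 computation.

```python
import torch
import torch.nn as nn

# Toy fp16 setup purely for illustration.
model = nn.Linear(16, 4).half().cuda()
x = torch.randn(8, 16, dtype=torch.float16, device="cuda")
target = torch.randn(8, 4, dtype=torch.float16, device="cuda")

scale = 2.0 ** 10                                # enlarge the loss so tiny fp16 gradients don't underflow
loss = nn.functional.mse_loss(model(x), target)
(loss * scale).backward()                        # every gradient is now `scale` times too large

for p in model.parameters():
    grad = p.grad.float() / scale                # unscale in fp32 before applying the update
    p.data -= 1e-3 * grad.to(p.dtype)
    p.grad = None
# With LOMO, this unscale-and-update step would happen inside a per-parameter hook during
# backpropagation, so the full set of scaled gradients never has to be stored at once.
```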

With these changes, memory usage equals the memory for parameters plus activations plus the largest single gradient tensor. This pushes the memory cost of full-parameter fine-tuning to its practical minimum, roughly the cost of inference alone, since a forward+backward pass cannot use less memory than the forward pass by itself. It is worth noting that these savings do not compromise fine-tuning: the parameter-update process remains equivalent to SGD.

The study evaluates LOMO's memory usage and throughput, showing that with LOMO researchers can train a 65B-parameter model on 8 RTX 3090 GPUs. To verify LOMO's performance on downstream tasks, the authors also use it to fine-tune all parameters of LLMs on the SuperGLUE benchmark collection. The results demonstrate that LOMO is effective for optimizing LLMs with billions of parameters.

Method introduction

The methods section introduces LOMO (LOw-Memory Optimization) in detail. In general, a gradient tensor represents the gradient of a parameter tensor and has the same size as that parameter, so its memory overhead is large, and existing deep learning frameworks such as PyTorch store the gradient tensors of all parameters. There are two reasons gradient tensors are stored: to compute the optimizer state and to normalize gradients.

Since this study adopts SGD as the optimizer, there is no gradient-dependent optimizer state to compute, and the authors propose alternatives to gradient normalization.

They propose LOMO, shown in Algorithm 1, which fuses gradient computation and parameter update into a single step and thereby avoids storing gradient tensors.

The figure below compares SGD and LOMO in the backpropagation and parameter-update phases, where Pi denotes a model parameter and Gi the gradient of Pi. LOMO fuses gradient computation and parameter update into a single step, so the gradient tensors held in memory are kept to a minimum.

[Figure: SGD vs. LOMO during backpropagation and parameter update]

Pseudocode of the LOMO algorithm:

[Algorithm 1: LOMO pseudocode]

Specifically, the study expresses vanilla gradient descent as a two-step process, grad = ∂L/∂p followed by p = p - lr * grad: first compute the gradient, then update the parameter. The fused version is p = p - lr * ∂L/∂p.

The key idea of this research is to update a parameter as soon as its gradient is computed, so that gradient tensors need not be kept in memory. This can be achieved by injecting hook functions into backpropagation. PyTorch provides APIs for registering hook functions, but the current API cannot achieve an exact immediate update. Instead, the study keeps at most one parameter's gradient in memory and updates parameters one by one as backpropagation proceeds. This reduces gradient memory from storing the gradients of all parameters to storing the gradient of a single parameter.
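The sketch below is one way to express this idea in PyTorch. It is a simplified reconstruction from the description above rather than the authors' released code: it uses `register_post_accumulate_grad_hook` (available in newer PyTorch releases, 2.1+), and omits loss scaling, gradient clipping, and multi-GPU sharding.

```python
import torch
import torch.nn as nn

def attach_fused_sgd_update(model: nn.Module, lr: float = 1e-3) -> None:
    """Update each parameter with plain SGD as soon as its gradient has been
    accumulated, then free that gradient, so at most one full gradient tensor
    is alive at any time (simplified LOMO-style sketch)."""
    def hook(param: torch.Tensor) -> None:
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)    # p <- p - lr * grad
        param.grad = None                        # release the gradient immediately
    for p in model.parameters():
        if p.requires_grad:
            # post-accumulate-grad hooks fire right after p.grad is filled during backward()
            p.register_post_accumulate_grad_hook(hook)

# Usage: after attaching the hooks there is no optimizer.step(); backward() does everything.
model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))
attach_fused_sgd_update(model, lr=1e-2)
loss = model(torch.randn(4, 32)).pow(2).mean()
loss.backward()                                  # parameters are already updated on return
```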

Most of LOMO's memory usage coincides with that of parameter-efficient fine-tuning methods, so combining LOMO with such methods adds only a small amount of gradient memory. This makes it possible to tune far more parameters alongside a PEFT method.

Experimental results

In the experiments, the researchers evaluate the proposed method from three angles: memory usage, throughput, and downstream performance. Unless otherwise stated, all experiments use LLaMA models ranging from 7B to 65B parameters.

Memory usage

The researchers first profiled the memory used by model states and activations during training under different settings. As shown in Table 1, the LOMO optimizer sharply reduces the memory footprint, from 102.20GB with AdamW and 51.99GB with SGD down to 14.58GB. This large reduction is mainly attributable to the lower memory requirements for gradients and optimizer states. As a result, memory during training is mostly occupied by the parameters, comparable to memory usage during inference.

[Table 1: memory usage of model states and activations under different settings]

As shown in Figure 2, when the AdamW optimizer is used for LLaMA-7B training, a considerable share of memory (73.7%) is allocated to optimizer states. Replacing AdamW with SGD effectively reduces that share and brings GPU memory usage down from 102.20GB to 51.99GB. With LOMO, the parameter update is merged into the backward pass, further eliminating the need to store gradient tensors.

[Figure 2: GPU memory breakdown for LLaMA-7B training with AdamW, SGD, and LOMO]
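The breakdown in Figure 2 can be approximated with simple arithmetic. This is a sketch assuming roughly 6.7B parameters for LLaMA-7B, fp16 weights and gradients, and about 12 bytes per parameter of fp32 AdamW state; the paper's exact numbers differ slightly.

```python
# Back-of-the-envelope memory breakdown for LLaMA-7B fine-tuning (illustrative assumptions).
n = 6.7e9
GB = 1024 ** 3

weights_fp16 = n * 2 / GB      # ~12.5 GB of fp16 parameters
grads_fp16   = n * 2 / GB      # ~12.5 GB if all gradients are materialized at once
adamw_fp32   = n * 12 / GB     # fp32 master weights + momentum + variance: ~75 GB

print(f"AdamW: ~{weights_fp16 + grads_fp16 + adamw_fp32:.0f} GB + activations")        # paper: 102.20 GB
print(f"SGD:   ~{weights_fp16 + grads_fp16:.0f} GB + remaining state + activations")   # paper: 51.99 GB
print(f"LOMO:  ~{weights_fp16:.0f} GB + activations + one gradient tensor")            # paper: 14.58 GB
```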

Throughput

The researchers compared the throughput performance of LOMO, AdamW and SGD. Experiments are performed on a server equipped with 8 RTX 3090 GPUs.

For the 7B model, LOMO shows a clear throughput advantage, exceeding AdamW and SGD by about 11×. This large improvement is attributed to LOMO's ability to train the 7B model on a single GPU, which removes inter-GPU communication overhead. SGD's slightly higher throughput compared with AdamW comes from SGD skipping the momentum and variance computations.

As for the 13B model, it cannot be trained with AdamW on the existing 8 RTX 3090 GPUs due to memory constraints. In this case, model parallelism is necessary for LOMO, which still outperforms SGD in terms of throughput. This advantage is attributed to the memory-efficient nature of LOMO and the fact that only two GPUs are required to train the model with the same settings, which reduces communication costs and improves throughput. Furthermore, SGD suffers from out-of-memory (OOM) on 8 RTX 3090 GPUs when training the 30B model, while LOMO performs well with only 4 GPUs.

Finally, the researchers successfully trained the 65B model using 8 RTX 3090 GPUs, achieving a throughput of 4.93 TGS (tokens per GPU per second). With this server configuration and LOMO, training on 1,000 samples, each containing 512 tokens, takes about 3.6 hours.
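The quoted training time is consistent with the reported throughput. A quick sanity check using the figures above:

```python
# 1000 samples x 512 tokens, processed at 4.93 tokens/GPU/second across 8 GPUs
tokens = 1000 * 512
seconds = tokens / (4.93 * 8)
print(f"{seconds / 3600:.1f} hours")   # ~3.6 hours, matching the reported training time
```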

Downstream performance

To evaluate the effectiveness of LOMO for fine-tuning large language models, the researchers conducted an extensive series of experiments. They compared LOMO with two baselines: zero-shot evaluation, which requires no fine-tuning, and LoRA, a currently popular parameter-efficient fine-tuning technique.

[Table 3: downstream results on SuperGLUE]

Table 3 results show:

  • LOMO performed significantly better than Zero-shot;

  • LOMO generally outperforms LoRA in most experiments;

  • LOMO scales efficiently to models with 65 billion parameters.

LOMO and LoRA are essentially independent of each other. To test this statement, the researchers conducted experiments on the BoolQ and MultiRC datasets using LLaMA-13B. The result is shown in Figure 3.

They found that LOMO consistently improves on LoRA's performance, regardless of how strong LoRA's results already are, which indicates that the two fine-tuning approaches are complementary. Specifically, LOMO fine-tunes the weights of the pre-trained model itself, while LoRA tunes additional modules. LOMO therefore does not hurt LoRA; rather, it enables better tuning of the model for downstream tasks.

[Figure 3: results of combining LOMO and LoRA on BoolQ and MultiRC with LLaMA-13B]

Origin blog.csdn.net/CV_Autobot/article/details/131368826