Fudan University releases the low-memory optimization technique LOMO | It cuts the memory usage of large-model training to 10.8% of the standard DeepSpeed solution!

Title: Full Parameter Fine-tuning for Large Language Models with Limited Resources
PDF: arxiv.org/pdf/2306.09…
Code: github.com/openlmlab/l…

overview

Large language models (LLMs) have revolutionized the field of natural language processing (NLP), but require massive GPU resources for training. Lowering the threshold for training LLMs would encourage more researchers to participate, benefiting both academia and society. While existing methods mainly focus on parameter-efficient fine-tuning, i.e., tuning or adding a small number of parameters, few approaches address the challenge of tuning the full parameters of LLMs with limited resources. This paper proposes a new optimizer, LOw-Memory Optimization (LOMO), which fuses gradient computation and parameter update into one step to reduce memory usage. By combining LOMO with existing memory-saving techniques, the method reduces memory usage to 10.8% of the standard approach (the DeepSpeed solution). As a result, it becomes possible to fine-tune all parameters of a 65B model on a single machine equipped with 8 RTX 3090 GPUs, each with 24GB of memory.

introduction

Large language models (LLMs) have revolutionized the field of natural language processing (NLP), demonstrating surprising emergent capabilities. However, training these models with billions of parameters, such as models with 30B to 175B parameters, sets a high bar for NLP research. Tuning LLMs usually requires expensive GPU resources, such as 8×80GB devices, which makes it difficult for small laboratories and companies to participate in research in this field.

Recently, parameter-efficient fine-tuning methods such as LoRA and Prefix-tuning have emerged, which provide a solution for tuning LLMs with limited resources. However, these methods do not provide practical solutions for full-parameter fine-tuning, which has been considered more powerful than parameter-efficient fine-tuning. In this paper, we aim to explore techniques to achieve full-parameter fine-tuning with limited resources.

This paper analyzes four aspects of memory usage in LLMs, namely activations, optimizer states, gradient tensors and parameters, and optimizes the training process in three aspects:

  • This paper rethinks the function of optimizers from an algorithmic perspective and finds that SGD is a good alternative for full-parameter fine-tuning of LLMs. This enables us to remove entire parts of the optimizer state, since SGD does not store any intermediate state.
  • The optimizer LOMO proposed in this paper reduces the memory usage of the gradient tensor to O(1), which is equivalent to the memory usage of the largest gradient tensor.
  • To stabilize LOMO's mixed-precision training, this paper integrates gradient normalization, loss scaling, and methods to convert certain calculations to full precision during training.

The techniques in this paper make memory usage equal to that of the parameters plus the activations and the largest single gradient tensor, pushing the memory usage of full-parameter fine-tuning to the extreme and making it merely equivalent to that of inference. It is worth noting that, while LOMO saves memory, the fine-tuning process remains unharmed, since the parameter update process is still equivalent to SGD.
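
As a rough, back-of-the-envelope illustration of that claim (my own estimate, assuming fp16 weights sharded evenly across the 8 GPUs; the paper's exact accounting may differ):

```python
# Rough sanity check (my own estimate) of the 8 x RTX 3090 (24 GB each) claim.
# Assumes a 65B-parameter model stored in fp16 with weights sharded evenly across 8 GPUs.
n_params, n_gpus, gpu_mem_gib = 65e9, 8, 24

weights_per_gpu = n_params * 2 / n_gpus / 2**30   # ~15.1 GiB of fp16 weights per GPU
headroom = gpu_mem_gib - weights_per_gpu          # ~8.9 GiB left for activations and
                                                  # the single live gradient tensor
print(f"{weights_per_gpu:.1f} GiB weights/GPU, {headroom:.1f} GiB headroom")
```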

This paper empirically evaluates the memory and throughput performance of LOMO and shows that LOMO enables the successful training of a 65B model with only 8 RTX 3090 GPUs. Furthermore, to verify the performance of the technique on downstream tasks, we apply LOMO to full-parameter fine-tuning of LLMs on a collection of SuperGLUE tasks. Empirical results demonstrate the efficiency and effectiveness of LOMO in optimizing LLMs with billions of parameters. The contributions of this paper are as follows:

  • Theoretical analysis is provided, showing that SGD can successfully fine-tune the full parameters of LLMs. Problems that previously prevented widespread use of SGD may no longer be serious problems in fine-tuning LLMs.
  • A method called LOw-Memory Optimization (LOMO) is proposed, which greatly saves GPU memory usage without compromising the fine-tuning process.
  • Through a thorough evaluation of memory usage and throughput, we empirically verify the effectiveness of LOMO for optimizing LLMs in resource-limited scenarios.

method

::: block-1 Figure 1. Comparison of SGD and LOMO in the backpropagation and parameter update stages

where $P_i$ denotes the model's parameters and $G_i$ denotes the gradient corresponding to $P_i$. LOMO fuses gradient computation and parameter update into a single step to minimize the size of the gradient tensors. :::

Rethinking the optimizer

Optimizer states occupy most of the memory used for training LLMs. Modern optimizers such as Adam store intermediate states that are twice the size of the parameters. As the number of parameters grows, the optimizer states become the dominant term in memory usage.
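
A quick back-of-the-envelope comparison (my own estimate, assuming fp16 weights and gradients with fp32 Adam momentum and variance; real frameworks add master weights and other buffers on top):

```python
# Why optimizer state dominates for a 65B-parameter model.
n_params = 65e9

weights_fp16 = n_params * 2 / 2**30           # ~121 GiB of fp16 weights
grads_fp16 = n_params * 2 / 2**30             # ~121 GiB of fp16 gradients
adam_states_fp32 = 2 * n_params * 4 / 2**30   # ~484 GiB: fp32 momentum + variance
sgd_states = 0                                # plain SGD keeps no per-parameter state

print(f"weights {weights_fp16:.0f} GiB | grads {grads_fp16:.0f} GiB | "
      f"Adam states {adam_states_fp32:.0f} GiB | SGD states {sgd_states} GiB")
```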

The SGD optimizer

Although Adam has achieved great success in training deep models, can we use a cheaper optimizer to fine-tune LLMs?

Evidently, SGD, as a basic optimizer, is an acceptable solution for fine-tuning LLMs. Previous work has often discussed three challenges of SGD:

  1. Large curvature of the loss surface
  2. Local optima
  3. Saddle points

Modern optimizers have proven effective at handling challenge 1 and can mitigate challenges 2 and 3 in some cases. However, when we restrict the scope to fine-tuning LLMs, these three challenges may look different.

  • Smoother loss surface

An important assumption is that the parameter space of LLMs is quite smooth, so that small perturbations of the parameters do not change the loss too much. If we believe that larger models have smoother loss surfaces, we can conclude that challenge 1 is not an issue, since the loss surface of LLMs should not have large curvature. Note that this holds only when we train LLMs on natural language tasks (or on code-based tasks, if the model is pre-trained on code). A synthetic loss function unrelated to the pre-training task would indeed face the problem of large curvature.

  • Local optima are good enough

The goal of fine-tuning is to adapt LLMs to new tasks and domains without significantly changing the model itself. Therefore, a local optimum is usually a good enough solution, and the limited training data (compared to the pre-training corpus) makes it hard to push the model to a distant global optimum. The same applies to distant saddle points. Likewise, for common NLP tasks, the initial point of an LLM should lie in a valley. This phenomenon may be even more pronounced if the model has been pre-trained with instructions (tasks), since there is a greater chance of finding a pre-training task similar to the new one. Saddle points usually appear on ridges at some distance from valleys, so we may not run into a saddle-point problem as long as we do not move the parameters far from their pre-trained values.

Implicit batch size

Beyond the qualitative discussion above, we would like a deeper analysis of the stability of fine-tuning LLMs with SGD. Suppose we have a pre-trained model $f(\cdot)$ with parameters $\theta$, a training set $D=\{d_1,d_2,\ldots,d_n\}$, and a loss function $L$. One SGD step on a batch containing two data points can be written as:

$$\theta' = \theta - \alpha\left(\nabla L(d_i, f(d_i,\theta)) + \nabla L(d_j, f(d_j,\theta))\right) \tag{1}$$

where $\alpha$ is the learning rate and $d_i$, $d_j$ are two different training samples.

Next, two consecutive SGD steps on these two training samples $d_i$ and $d_j$ can be written as:

$$\begin{aligned} \theta_1 &= \theta - \alpha\nabla L(d_i, f(d_i,\theta)) \\ \theta_2 &= \theta_1 - \alpha\nabla L(d_j, f(d_j,\theta_1)) \end{aligned}$$

By the differential mean value theorem, we have:

$$L(d_j, f(d_j,\theta_1)) = L(d_j, f(d_j,\theta)) + L(d_j,\xi)\left(f(d_j,\theta_1) - f(d_j,\theta)\right) \tag{4}$$

where $\xi$ is a point between $f(d_j,\theta)$ and $f(d_j,\theta_1)$. Substituting Equation (4) into the second step above, the resulting two-step update differs from the single-batch update of Equation (1) only by the term $\alpha\nabla\left[L(d_j,\xi)\left(f(d_j,\theta_1)-f(d_j,\theta)\right)\right]$. Assuming the loss surface is sufficiently smooth, this term is negligible. This suggests that using the SGD optimizer on a smooth loss surface may amount to an implicitly larger batch size.
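
Spelling the substitution out (a reconstruction that follows the paper's derivation; the intermediate equation is left unnumbered in this summary), plugging Equation (4) into the update for $\theta_2$ gives:

$$\theta_2 = \theta - \alpha\nabla L(d_i, f(d_i,\theta)) - \alpha\nabla L(d_j, f(d_j,\theta)) - \alpha\nabla\left[L(d_j,\xi)\left(f(d_j,\theta_1)-f(d_j,\theta)\right)\right]$$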

As noted above, it is reasonable to assume that the loss surface of LLMs is smooth, and a larger batch size indicates stronger training stability, so we believe that fine-tuning LLMs with the SGD optimizer is stable. This also explains why SGD fails on small models yet succeeds on large ones.

LOMO: LOW-MEMORY OPTIMIZATION

Gradient tensors represent the gradients of parameter tensors and have the same size as the parameters, so they introduce a large memory overhead. Modern deep learning frameworks such as PyTorch store gradient tensors for all parameters, mainly for computing optimizer states and performing gradient normalization.

In our case, with SGD as the optimizer, there are no gradient-based optimizer states, and we have alternative ways to perform gradient normalization. We therefore propose LOw-Memory Optimization (LOMO), which fuses gradient computation and parameter update into a single step to avoid storing any gradient tensors.

Specifically, conventional gradient descent can be expressed as $grad = \frac{\partial L}{\partial p}$, $p = p - lr \cdot grad$, a two-step process that first computes the gradients and then uses them to update the parameters. The fused version is $p = p - lr \cdot \frac{\partial L}{\partial p}$.

The core idea is to update a parameter as soon as its gradient is computed, so that the gradient tensor does not need to be kept in memory. This can be achieved by injecting hook functions into backpropagation. PyTorch provides APIs for injecting hook functions, but the current APIs cannot realize an exactly immediate update. Instead, we store the gradient of at most one parameter in memory and update each parameter one by one during the backward pass. This reduces the memory usage of gradients from storing the gradients of all parameters to storing the gradient of only one parameter.
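
Below is a minimal sketch of this fused "update during backward" idea. It is my own illustration under stated assumptions, not the authors' implementation (the official code hooks into backpropagation differently and adds loss scaling and gradient normalization); it assumes PyTorch 2.1+ for `register_post_accumulate_grad_hook`.

```python
import torch
import torch.nn as nn

def attach_fused_sgd(params, lr=1e-3):
    """Update each parameter as soon as its gradient is ready, then free the gradient."""
    def hook(param: torch.Tensor):
        with torch.no_grad():
            # In mixed-precision training one would unscale the gradient and,
            # if needed, cast it to fp32 here before applying the update.
            param.add_(param.grad, alpha=-lr)  # in-place SGD step
        param.grad = None                      # drop the gradient immediately

    for p in params:
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2))
attach_fused_sgd(model.parameters(), lr=0.01)

x, y = torch.randn(4, 16), torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()  # parameters are updated during this call; no optimizer.step() is needed
```

Here `loss.backward()` both computes gradients and applies the updates, so at any moment only one parameter's gradient is alive.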

The memory usage of LOMO is comparable to that of parameter-efficient fine-tuning (PEFT) methods, which means that combining LOMO with these methods only slightly increases the memory occupied by gradients. This makes it possible to tune more parameters alongside PEFT methods.
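
As a rough illustration of that combination (my own sketch, not the repo's API), the injected LoRA modules can keep a conventional optimizer while the full pre-trained weights are updated in place during backward as above; the `"lora_"` substring used to identify the injected parameters is an assumption.

```python
# Hypothetical parameter split for "LoRA + LOMO"-style training; `model` is any
# nn.Module with LoRA modules already injected (names assumed to contain "lora_").
lora_params = [p for n, p in model.named_parameters() if "lora_" in n]
base_params = [p for n, p in model.named_parameters() if "lora_" not in n]

lora_optimizer = torch.optim.AdamW(lora_params, lr=1e-4)  # optimizer state only for the tiny LoRA tensors
attach_fused_sgd(base_params, lr=1e-3)                    # full weights: fused in-place SGD as sketched above,
                                                          # adding at most one live gradient tensor
```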

experiments

::: block-1

Proportion of memory used by each component when training LLaMA-7B with different optimizers. The sequence length and batch size are set to 512 and 8, respectively. :::

::: block-1

Memory usage (in GB) when training LLaMA-7B under different settings. AC denotes the activation checkpointing technique. The sequence length and batch size are set to 512 and 8, respectively. :::

::: block-1

Throughput testing on a server with 8 RTX 3090 GPUs. The sequence length and batch size are set to 1024 and 1, respectively. Memory denotes the peak memory allocated per GPU during training. Throughput denotes the number of tokens processed per GPU per second (TGS). :::

::: block-1

Main results on SuperGLUE using LLaMA models at various scales (with 1,000 training examples). :::

::: block-1

Results of the LLaMA-13B model on the BoolQ and MultiRC datasets (with 1,000 training examples). "LoRA+LOMO" means that LoRA modules are injected while LOMO is used to fine-tune the weights of the pre-trained model. :::

conclusion

This paper introduces a novel optimizer called LOw-Memory Optimization (LOMO), which aims to achieve full-parameter fine-tuning of large language models with limited resources. It demonstrates the feasibility of fine-tuning a 65B model on a server equipped with consumer-grade GPUs such as the RTX 3090. By analyzing LOMO's memory usage, running throughput tests, and conducting experiments on the SuperGLUE dataset, the paper demonstrates its effectiveness and potential impact.

With the advent of the era of large models, an important direction for future work is to further lower the resource threshold required to train large language models, so that more people can access and adopt them. Currently, when training with LOMO, most of the memory is occupied by the parameters. A promising direction is therefore to investigate parameter quantization techniques, which could significantly reduce memory usage. Future work will also explore more application scenarios and deepen the theoretical analysis of optimizing large-scale language models, which is of great value for advancing the field.

Origin juejin.im/post/7250491326260264997