Full-parameter fine-tuning of large language models with limited resources



Summary

Paper link: https://arxiv.org/pdf/2306.09782v1.pdf
Large language models (LLMs) have revolutionized natural language processing (NLP), but require massive GPU resources for training. Lowering the training threshold for LLMs would encourage more researchers to participate, benefiting both academia and society. While existing approaches mainly focus on parameter-efficient fine-tuning, i.e., tuning or adding a small number of parameters, few have addressed the challenge of tuning all parameters of an LLM with limited resources. In this paper, we propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses gradient computation and parameter update in one step to reduce memory usage. By integrating LOMO with existing memory-saving techniques, we reduce memory usage to 10.8% of the standard approach (the DeepSpeed solution). As a result, our method enables full-parameter fine-tuning of a 65B model on a single machine with 8 × RTX 3090 GPUs, each with 24GB of memory.

1 Introduction

Large language models (LLMs) have revolutionized natural language processing (NLP), exhibiting remarkable capabilities such as emergence and grokking (Wei et al., 2022) and pushing model sizes ever larger. However, training these models with billions of parameters, such as those with 30B to 175B parameters, raises the bar for NLP research. Tuning LLMs usually requires expensive GPU resources, such as 8 × 80GB devices, which makes it difficult for small laboratories and companies to participate in research in this area.

Recently, parameter-efficient fine-tuning methods (Ding et al., 2022), such as LoRA (Hu et al., 2022) and Prefix-tuning (Li & Liang, 2021), have provided a way to tune LLMs with limited resources. However, these methods do not offer a practical solution for full-parameter fine-tuning, which has been considered more powerful than parameter-efficient fine-tuning (Ding et al., 2022; Sun et al., 2023). In this work, we aim to explore techniques for accomplishing full-parameter fine-tuning with limited resources.

We analyze memory usage in LLM training from four aspects: activations, optimizer states, gradient tensors, and parameters, and optimize the training process in three ways: 1) We rethink the function of the optimizer from an algorithmic perspective and find that SGD is a good substitute for full-parameter fine-tuning of LLMs. This allows us to remove the optimizer-state part entirely, since SGD does not store any intermediate state (Section 3.1). 2) Our proposed optimizer, LOMO, shown in Fig. 1, reduces the memory usage of gradient tensors to O(1), i.e., to the memory of the single largest gradient tensor (Section 3.2). 3) To stabilize mixed-precision training with LOMO, we integrate gradient normalization and loss scaling, and convert certain computations to full precision during training (Section 3.3).

Our technique brings memory usage down to that of the parameters plus the activations plus the largest gradient tensor. We push the memory cost of full-parameter fine-tuning to an extreme: it is merely equivalent to the cost of inference. This is because the memory usage of the forward + backward process should not be less than that of the forward process alone. It is worth noting that while saving memory with LOMO, we ensure the fine-tuning process is unaffected, because the parameter update process is still equivalent to SGD.

We empirically evaluate the memory and throughput performance of LOMO and show that the use of LOMO can successfully train a 65B model with only 8 RTX 3090 GPUs. Furthermore, to verify the downstream performance of our proposed technique, we apply LOMO to tune all parameters of LLM on the SuperGLUE dataset collection (Wang et al., 2019). Empirical results demonstrate the efficiency and effectiveness of LOMO for optimizing LLMs with billions of parameters. Overall, our contributions are as follows:

  • We provide a theoretical analysis suggesting that SGD can successfully fine-tune the full parameters of LLMs. The issues that previously hindered the widespread use of SGD may no longer be severe problems when fine-tuning LLMs.
  • We propose a low-memory optimization, named LOMO, to significantly save GPU memory usage without compromising the fine-tuning process.
  • Through comprehensive evaluation of memory usage and throughput performance, we empirically verify the effectiveness of LOMO in optimizing LLM in resource-constrained scenarios. This is further supported by performance evaluations on downstream tasks.

2. Related work

In this section, we present related work on memory saving techniques during full parameter tuning. These techniques can be effectively combined with LOMO to further reduce memory consumption.

Activation checkpointing. During standard backpropagation, all activations from the forward pass are kept in memory to compute gradients. This can be a large memory overhead, especially for large language models. Alternatively, all activations could be discarded and recomputed on demand during gradient computation to save memory, but this incurs substantial extra computation. Activation checkpointing (also called gradient checkpointing) balances memory usage and computational cost, providing a compromise (Chen et al., 2016). The activations of checkpoint nodes selected by a policy over the computation graph are kept in memory after the forward pass, while the activations of the remaining nodes are recomputed at most once. This reduces activation memory to roughly the square root of the original amount, at the cost of one extra forward pass.
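As an illustration of the technique (not part of the original paper), a minimal PyTorch sketch of activation checkpointing might look as follows; the module and layer sizes are made up for the example, and only torch.utils.checkpoint.checkpoint is a real API call.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    """Hypothetical stack of blocks whose inner activations are recomputed
    during the backward pass instead of being kept in memory."""
    def __init__(self, num_layers=4, hidden=1024):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
            for _ in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            # Only the input of each checkpointed segment is stored; the
            # activations inside the segment are recomputed at most once.
            x = checkpoint(layer, x, use_reentrant=False)
        return x

x = torch.randn(8, 1024, requires_grad=True)
out = CheckpointedStack()(x)
out.sum().backward()  # recomputation of segment activations happens here
```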

Mixed precision training. Due to its ability to speed up training and reduce memory footprint, mixed precision training has become a popular approach for training large language models (Narayanan et al., 2021; Rajbhandari et al., 2020). By employing half-precision storage for parameters, activations, and gradients, mixed-precision training enables high-throughput computation during forward and backward propagation. To maintain stability and model accuracy, Micikevicius et al. (2018) proposed three techniques: keeping full-precision weight copies, loss scaling, and performing specific arithmetic operations in full precision.

Heterogeneous training system. Several studies (Rhu et al., 2016; Wang et al., 2018; Ren et al., 2021a) have attempted to reduce GPU memory consumption by utilizing heterogeneous memory such as CPU and NVMe memory. L2L (Pudipeddi et al., 2020) employs a layer-to-layer strategy, where only the tensors required for the computation of a particular layer are transferred to GPU memory, while the remaining tensors are kept in CPU memory. ZeRO-Offload (Ren et al., 2021b) is an extension of ZeRO-2 (Rajbhandari et al., 2020) that keeps gradients and optimizer states in CPU memory and updates parameters via CPU computation. Tensors and compute operations are assigned to GPUs or CPUs according to the dataflow graph. ZeRO-Infinity (Rajbhandari et al., 2021), a follow-up improvement of ZeRO-Offload built on ZeRO-3 (Rajbhandari et al., 2020), enables further scaling of the model size: the partitioned model states and other tensors can be offloaded not only to CPU memory but also to NVMe, taking full advantage of heterogeneous architectures.
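For reference, ZeRO-style offloading is usually enabled through a DeepSpeed configuration roughly like the sketch below (shown as a Python dict; the values are illustrative and the exact keys should be checked against the DeepSpeed documentation for the version in use):

```python
# Sketch of a DeepSpeed config enabling ZeRO-2 with optimizer states
# offloaded to CPU memory, in the spirit of ZeRO-Offload.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                 # partition optimizer states and gradients
        "offload_optimizer": {      # keep optimizer states in CPU memory
            "device": "cpu",
            "pin_memory": True,
        },
    },
}
# Typically passed to deepspeed.initialize(model=model, config=ds_config, ...).
```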

3. Method

3.1. Rethinking the function of the optimizer

The optimizer state takes up most of the memory used to train an LLM. Modern optimizers such as Adam (Kingma & Ba, 2015) store intermediate states that are twice the size of the parameters. As the parameter size increases, the optimizer state becomes the dominant term in memory usage.
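To make the scale concrete, here is a back-of-the-envelope estimate (a rough sketch that ignores activations, the largest retained gradient tensor, and framework overhead) of the model-state memory for a 7B-parameter model under the mixed-precision setup described later, comparing AdamW, plain SGD, and LOMO:

```python
# Rough model-state memory estimate for mixed-precision training of a 7B model.
n_params = 7e9
GB = 1024 ** 3

fp16_weights  = 2 * n_params   # half-precision parameters
fp16_grads    = 2 * n_params   # half-precision gradients
fp32_master   = 4 * n_params   # full-precision weight copy
adam_momentum = 4 * n_params   # full-precision first moment
adam_variance = 4 * n_params   # full-precision second moment

adamw = (fp16_weights + fp16_grads + fp32_master + adam_momentum + adam_variance) / GB
sgd   = (fp16_weights + fp16_grads + fp32_master) / GB  # no momentum/variance
lomo  = fp16_weights / GB  # gradients fused away, no optimizer state kept

print(f"AdamW ~ {adamw:.0f} GB, SGD ~ {sgd:.0f} GB, LOMO ~ {lomo:.0f} GB")
# Roughly 104 GB, 52 GB and 13 GB, broadly in line with the measurements in Section 4.1.
```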

3.1.1. Using SGD

Despite Adam's great success in training deep models, we ask the question: can we use a cheaper optimizer to fine-tune LLMs? Our answer is SGD, the most basic optimizer. Fortunately, we find it to be an acceptable solution for LLM fine-tuning once we limit the scope.

Previous work often discusses three challenges of SGD: 1) large curvature loss surfaces, 2) local optima, 3) saddle points (Ruder, 2016; Sun et al., 2020). Modern optimizers have shown effectiveness in dealing with 1) and can mitigate 2) and 3) in some cases. However, these three challenges may differ when we limit the scope to fine-tuning the LLM.

Smooth loss surface. An important assumption is that the parameter space of LLMs is quite smooth: small perturbations to the parameters do not change the loss much. Empirical results and theoretical analysis support this hypothesis (Hao et al., 2019). If we believe that larger models have smoother loss surfaces, we can conclude that challenge 1) is not an issue, since the loss surface of an LLM should not have large curvature. Note that this only holds for tasks where we teach LLMs natural-language-based (or code-based, if pre-trained on code) tasks. A synthetic loss function unrelated to the pre-training objective may indeed exhibit large curvature.


A local optimum is sufficient. The goal of fine-tuning is to adapt the LLM to new tasks and domains without significantly changing the model itself. Therefore, a local optimum is often a good enough solution, while limited training data (compared to a pre-trained corpus) makes it difficult for the model to reach a distant global optimum.

Distant saddle points. Similarly, for common NLP tasks, the initial point of the LLM should lie in a valley. This is likely even more pronounced if the model is pre-trained with instructions (tasks), since we then have more chances that a pre-training task resembles the new task. Saddle points usually appear on ridges some distance away from valleys, so if we do not move the parameters far from the pre-trained values, we are unlikely to encounter the saddle-point problem.

However, there is no guarantee that SGD is a strong optimizer compared to modern optimizers. Our intuition is to create a simple and practical solution for fine-tuning the LLM, and identify its flaws to continuously improve it.

3.1.2. Implicit batch size

In addition to the qualitative discussion above, this paper would like to conduct a deeper analysis of the stability of fine-tuning LLMs with SGD. Suppose we have a pre-trained model $f(\cdot)$ with parameters $\boldsymbol{\theta}$, a training set $\mathcal{D}=\{d_1, d_2, \cdots, d_n\}$, and a loss function $\mathcal{L}$. A one-step SGD update on a batch of two data points can be written as

$$\boldsymbol{\theta}' = \boldsymbol{\theta} - \alpha\left[\nabla \mathcal{L}\left(d_i, f(d_i, \boldsymbol{\theta})\right) + \nabla \mathcal{L}\left(d_j, f(d_j, \boldsymbol{\theta})\right)\right],$$

where $\alpha$ is the learning rate and $d_i$, $d_j$ are two different training samples.

Next, consider two consecutive single-sample SGD updates on $d_i$ and $d_j$:

$$\boldsymbol{\theta}_1 = \boldsymbol{\theta} - \alpha \nabla \mathcal{L}\left(d_i, f(d_i, \boldsymbol{\theta})\right), \qquad \boldsymbol{\theta}_2 = \boldsymbol{\theta}_1 - \alpha \nabla \mathcal{L}\left(d_j, f(d_j, \boldsymbol{\theta}_1)\right).$$

By the differential mean value theorem, we have

$$\mathcal{L}\left(d_j, f(d_j, \boldsymbol{\theta}_1)\right) = \mathcal{L}\left(d_j, f(d_j, \boldsymbol{\theta})\right) + \mathcal{L}'\left(d_j, \xi\right)\left(f(d_j, \boldsymbol{\theta}_1) - f(d_j, \boldsymbol{\theta})\right),$$

$$\boldsymbol{\theta}_2 = \boldsymbol{\theta} - \alpha \nabla \mathcal{L}\left(d_i, f(d_i, \boldsymbol{\theta})\right) - \alpha \nabla \mathcal{L}\left(d_j, f(d_j, \boldsymbol{\theta})\right) - \alpha \nabla\left[\mathcal{L}'\left(d_j, \xi\right)\left(f(d_j, \boldsymbol{\theta}_1) - f(d_j, \boldsymbol{\theta})\right)\right],$$

$$\boldsymbol{\theta}_2 = \boldsymbol{\theta} - \alpha\left[\nabla \mathcal{L}\left(d_i, f(d_i, \boldsymbol{\theta})\right) + \nabla \mathcal{L}\left(d_j, f(d_j, \boldsymbol{\theta})\right)\right] - \alpha \nabla\left[\mathcal{L}'\left(d_j, \xi\right)\left(f(d_j, \boldsymbol{\theta}_1) - f(d_j, \boldsymbol{\theta})\right)\right],$$

where $\xi$ is a point between $f(d_j, \boldsymbol{\theta})$ and $f(d_j, \boldsymbol{\theta}_1)$. Comparing the last equation with the one-step batched update, the two differ only by the term $\alpha \nabla\left[\mathcal{L}'\left(d_j, \xi\right)\left(f(d_j, \boldsymbol{\theta}_1) - f(d_j, \boldsymbol{\theta})\right)\right]$. Assuming the loss surface is smooth enough, this term is negligible, which shows that using the SGD optimizer on a smooth loss surface implicitly enjoys a larger effective batch size.

As we mentioned above, we have reason to assume that the loss surface of LLMs is smooth, and a larger batch size implies stronger training stability, so we believe the fine-tuning process of LLMs with the SGD optimizer is stable. This also explains why SGD fails on small models but works on large ones.

3.2. LOMO: LOw-Memory Optimization

The gradient tensor represents the gradient of the parameter tensor, which has the same size as the parameter, resulting in a large memory overhead. Modern deep learning training frameworks like PyTorch (Paszke et al., 2017) store gradient tensors for all parameters. Gradient tensors are stored for two reasons: to compute optimizer state and to normalize gradients.

Since we use SGD as the optimizer, there is no gradient-dependent optimizer state, and we have some alternatives to gradient normalization. Therefore, this paper proposes Low Memory Optimization (LOMO), as shown in Algorithm 1, which fuses gradient computation and parameter update in one step to avoid storing any gradient tensors.

In detail, we can express vanilla gradient descent as $\mathrm{grad} = \frac{\partial \mathcal{L}}{\partial p}$, $p = p - lr \cdot \mathrm{grad}$, a two-step process that first computes the gradient and then updates the parameter. The fused version is $p = p - lr \cdot \frac{\partial \mathcal{L}}{\partial p}$.

The key idea is to update a parameter as soon as its gradient is computed, so that we never store the gradient tensors in memory. This can be achieved by injecting hook functions into backpropagation. PyTorch provides related APIs for injecting hook functions, but the current APIs do not allow an exactly immediate update. Instead, we store the gradient of at most one parameter in memory and update each parameter individually during backpropagation. This reduces the gradient memory from storing the gradients of all parameters to storing the gradient of only one parameter.
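A minimal sketch of this fused update is shown below. It is not the authors' released implementation: it assumes PyTorch 2.1 or later, which provides Tensor.register_post_accumulate_grad_hook, whereas the original LOMO code predates this API and arranges its hooks differently; the effect, however, is the same, namely that at most one parameter's gradient is alive at any moment of the backward pass.

```python
import torch
import torch.nn as nn

def attach_fused_sgd(model: nn.Module, lr: float = 1e-3):
    """Sketch of a LOMO-style fused update: each parameter is updated as soon
    as its gradient has been produced during backward(), and the gradient is
    freed immediately, so only one gradient tensor is held at a time."""
    def make_hook():
        @torch.no_grad()
        def hook(param):
            param.add_(param.grad, alpha=-lr)  # in-place SGD step
            param.grad = None                  # free the gradient right away
        return hook

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(make_hook())

# Usage sketch: no optimizer.step() is needed; the update happens inside backward().
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
attach_fused_sgd(model, lr=1e-2)
loss = model(torch.randn(2, 16)).sum()
loss.backward()  # parameters are already updated when this call returns
```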

The memory usage of LOMO is on par with that of parameter-efficient fine-tuning (PEFT) methods, which means that combining LOMO with them adds only a small amount of gradient memory. This makes it possible to tune many more parameters.

3.3. Stabilizing training with LOMO

3.3.1. Alternatives to Gradient Normalization and Clipping

Gradient normalization and clipping are essential tools for dealing with exploding and vanishing gradients (Chen et al., 2018), but computing them requires the gradient tensors of all parameters. We propose two alternatives here:

  • Clip the gradient tensors by their values rather than by the norm.
  • Compute the gradient norm in an additional backward pass.

Clipping gradient tensors by value is a simple but effective way to handle gradient explosion before the gradient norm can be computed. The main concern with clipping by value is that truncating some gradient elements may change the direction of the gradient tensor. For example, the two-dimensional vector [1.3, 0.8] and its clipped version [1.0, 0.8] point in different directions. Our experience is that clipping by value performs poorly when the learning rate is high, because truncation then occurs more frequently. However, it performs well with moderate and small learning rates. Note that the appropriate learning-rate scale is highly task- and data-dependent, but in general we suggest clipping by value for learning rates below 1e−3.

Our method cannot compute the gradient norm directly, because we update parameters during backpropagation, so when a given parameter is updated we do not yet know the norms of the remaining gradients. However, we can introduce an additional backward pass to compute and accumulate the gradient norm of each parameter, which results in two backward passes: one to compute the gradient norm and one to update the parameters. Memory usage stays the same, at the expense of speed.
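Both alternatives can be sketched on top of a hook-based fused update like the one in Section 3.2; the snippet below is illustrative only, with hypothetical function names, and is not the reference implementation.

```python
import torch

# Alternative 1: clip each gradient by value right before the fused update.
@torch.no_grad()
def clipped_fused_update(param, lr=1e-3, clip_value=1.0):
    param.grad.clamp_(-clip_value, clip_value)  # truncation may change the gradient direction
    param.add_(param.grad, alpha=-lr)
    param.grad = None

# Alternative 2: two backward passes to recover the global gradient norm.
class GradNormState:
    def __init__(self):
        self.sq_norm = 0.0    # accumulated during the first backward pass
        self.clip_coef = 1.0  # computed between the passes, applied in the second

@torch.no_grad()
def accumulate_norm(param, state: GradNormState):
    # First backward pass: only accumulate the squared norm, keep no gradient.
    state.sq_norm += param.grad.float().pow(2).sum().item()
    param.grad = None

@torch.no_grad()
def scaled_fused_update(param, state: GradNormState, lr=1e-3):
    # Second backward pass (after recomputing the loss): scale and update.
    param.add_(param.grad, alpha=-lr * state.clip_coef)
    param.grad = None

# The hooks are registered per parameter, e.g. with
# p.register_post_accumulate_grad_hook(functools.partial(accumulate_norm, state=state)),
# and the clip coefficient is set between the two passes as
# state.clip_coef = min(1.0, max_norm / (state.sq_norm ** 0.5 + 1e-6)).
```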

A controversial solution. Our current training framework computes the gradient norm over all parameters and therefore requires two backward passes. One way to save the extra backward pass is to approximate the gradient norm with a group of parameters, such as adjacent layers. However, this approach is indeed biased, because it leads to different update step sizes for different parameters: when updating, parameters are multiplied by a scale factor derived from the gradient norm, and differences in gradient norms across parameter groups produce different scale factors. Despite this limitation, this grouped gradient-clipping approach can be regarded as applying a dynamic learning rate to different groups of parameters according to their gradient norms. Sun et al. (2020) argue that it is not always appropriate to use the same learning rate for all parameters in SGD, so we believe our method also has the potential to further benefit SGD. We leave this as a compelling direction for future work.

3.3.2. Mitigating precision degradation

Mixed-precision training is commonly used to speed up the training process. To mitigate the loss of precision, we employ dynamic loss scaling and convert certain computations to full precision. Loss scaling is crucial for preventing underflow during FP16 training: the loss is amplified by a specific factor before the backward pass, and the gradients are divided by the same factor afterwards.

In our case, dynamic loss scaling is integrated with LOMO to adjust the scaling factor dynamically throughout training. If no overflow occurs within a specified number of backward passes, the scale factor is doubled; otherwise, the step is skipped and the scale factor is halved. This is similar to the situation encountered with gradient normalization: we do not know whether an overflow has occurred until the backward pass is complete. Therefore, we perform two backward passes: the first determines whether there is an overflow, and the second updates the parameters if no overflow is detected. These two backward passes for dynamic loss scaling can share the passes used for gradient normalization. To update parameters effectively and handle gradients in operations such as normalization and scaling, the gradients and their associated parameters are converted to full precision within these computations.
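The dynamic loss-scaling rule described above can be summarized in a few lines. The sketch below is schematic, with hypothetical names and default values, and is not the actual implementation.

```python
class DynamicLossScaler:
    """Schematic dynamic loss scaler: double the scale after a window of
    overflow-free steps, halve it (and skip the update) on overflow."""
    def __init__(self, init_scale=2.0 ** 16, growth_interval=1000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Returns True if the parameter update for this step should be applied."""
        if found_overflow:
            self.scale /= 2.0   # shrink the scale and skip this step
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= 2.0   # grow the scale after a stable window
        return True

# Usage sketch: the loss is multiplied by scaler.scale before backward(),
# gradients are divided by scaler.scale (in full precision) before the update,
# and overflow is detected by checking the gradients for inf/NaN during the
# first of the two backward passes.
```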

4. Experiment

In this section, we evaluate the proposed method in terms of memory configuration, throughput, and downstream performance. Unless otherwise stated, all experiments are performed with LLaMA models (Touvron et al., 2023) ranging from 7B to 65B parameters.

4.1. Memory configuration

We first analyze the memory usage of model states and activations during training under different settings. As shown in Table 1, compared with the AdamW optimizer (Loshchilov & Hutter, 2019), using the LOMO optimizer to train LLaMA-7B significantly reduces the memory footprint from 102.20GB to 14.58GB; compared with SGD, it is reduced from 51.99GB to 14.58GB. This large reduction mainly comes from the reduced memory requirements of gradients and optimizer states. As a result, memory during training is mostly occupied by the parameters, commensurate with the memory usage during inference.

Optimizer state. Figure 2 illustrates that training LLaMA-7B with the AdamW optimizer (a widely adopted configuration) allocates a large proportion of memory (73.7%) to optimizer states. This is a consequence of the mixed-precision training approach, in which full-precision copies of the weights, momentum, and variance are kept in the optimizer state for weight updates. Replacing AdamW with SGD effectively reduces the share of optimizer states in memory and thus GPU memory usage (from 102.20GB to 51.99GB), because SGD does not need to store full-precision momentum and variance. For LOMO, parameter update and backward computation are fused into one step, further eliminating the need for optimizer-state memory.

Gradient. During training with LOMO, parameters are updated as soon as their gradients arrive, and the gradients are then discarded from memory. Therefore, the upper bound on gradient memory consumption is determined by the gradient of the largest parameter matrix. This greatly reduces memory usage, by nearly the size of the full parameter set.

Activation. Training a 7B model on batches of 512×8 tokens demands a lot of activation memory. LOMO is compatible with activation-memory-reduction techniques such as activation checkpointing. By integrating activation checkpointing with LOMO, the memory footprint of activations can be reduced from 45.61GB to 1.79GB.

4.2. Throughput

We evaluate the throughput of LOMO against AdamW and SGD. The experiments are conducted on a server equipped with 8 RTX 3090 GPUs interconnected via a PCIe motherboard. The sequence length and batch size are set to 1024 and 1, respectively. Throughput is measured in tokens processed per GPU per second (TGS), and parameter partitioning is achieved with ZeRO-3 (Rajbhandari et al., 2020).

For the 7B model, LOMO exhibits significant throughput, surpassing AdamW and SGD by about 11 times. This dramatic improvement can be attributed to LOMO's ability to train 7B models on a single GPU, reducing inter-GPU communication overhead. The throughput of SGD is slightly higher than that of AdamW, which can be attributed to the fact that SGD excludes the computation of momentum and variance.

As for the 13B model, AdamW cannot be used for training on the 8 RTX 3090 GPUs due to memory limitations. In this scenario, where model parallelism is needed for LOMO, LOMO still outperforms SGD in throughput. This advantage is attributed to LOMO's memory-efficient nature and to the fact that only two GPUs are needed to train the model with the same settings, which reduces communication cost and increases throughput. Furthermore, SGD runs out of memory (OOM) on 8 RTX 3090 GPUs when training the 30B model, while LOMO works well with only 4 GPUs.

Finally, the 65B model is successfully trained with 8 RTX 3090 GPUs, achieving a throughput of 4.93 TGS. With this server configuration and LOMO, training on 1000 samples, each containing 512 tokens, takes about 3.6 hours.

4.3. Downstream performance

To evaluate the effectiveness of LOMO in fine-tuning large language models, we perform an extensive set of experiments. LOMO is compared with two baselines: zero-shot inference, which requires no fine-tuning, and LoRA, one of the most popular parameter-efficient fine-tuning techniques. As described by Hu et al. (2022), LoRA reparameterizes dense layers and updates only low-rank matrices, introducing no latency during inference.

We evaluate model performance on the SuperGLUE dataset collection, specifically RTE (Dagan et al., 2005), BoolQ (Clark et al., 2019), WSC (Levesque et al., 2012), WIC (Pilehvar & Camacho-Collados, 2019), MultiRC (Khashabi et al., 2018), and COPA (Roemmele et al., 2011). Given the high computational cost of running large language models, we follow MeZO (Malladi et al., 2023) and randomly sample 1000 training examples from the training set and 1000 test examples from the validation set, reporting the best results. The prompts used in the experiments are the same as those in MeZO, and hyperparameter details are given in Appendix A. During inference, we insert the candidate labels into the prompt and compute the average log-likelihood of each label; the label with the highest score is chosen as the model's answer. We use accuracy as the evaluation metric.
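For clarity, the label-scoring step can be sketched with the Hugging Face transformers API as below. The model name is a placeholder, the prompt templates actually follow MeZO rather than this sketch, and the snippet assumes (a simplification) that the tokenization of the prompt is a prefix of the tokenization of prompt + label.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def label_score(model, tokenizer, prompt: str, label: str) -> float:
    """Average log-likelihood of the label tokens appended to the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + label, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits               # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    label_len = full_ids.shape[1] - prompt_ids.shape[1]
    return token_lp[0, -label_len:].mean().item()     # average over label tokens only

# Usage sketch (model name is a placeholder):
# tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
# lm  = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
# answer = max(["Yes", "No"], key=lambda c: label_score(lm, tok, prompt, " " + c))
```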

4.3.1. Main results

The downstream performance of LOMO compared with zero-shot and LoRA is shown in Table 3. Based on the results, we obtain the following observations.

The performance of LOMO is significantly better than that of Zero-shot. Across all 6 datasets and model sizes, LOMO consistently achieves better results than zero-shot, with an average improvement of more than 20 points using LLaMA-13B. While previous studies have demonstrated the impressive capabilities of large language models in the zero-shot setting, fine-tuning can still lead to significant performance gains for specific downstream tasks. Experimental results confirm the effectiveness of LOMO in optimizing large language models of different scales.

LOMO generally outperforms LoRA in most experiments. Compared with LoRA, LOMO delivers strong performance; for example, with LLaMA-13B it achieves an average improvement of 2.8 points. This suggests that full-parameter fine-tuning benefits model performance more than parameter-efficient fine-tuning, because more parameters are adjusted. LOMO strikes a good balance between performance and efficiency, making it a competitive choice for fine-tuning.

In some cases, LOMO performs worse than LoRA. One possible reason is that we use a relatively small training set, which may not be sufficient for full-parameter fine-tuning of large models. Furthermore, LoRA and LOMO employ different model architectures. Specifically, LoRA provides a shortcut for model tuning, which may be advantageous in some cases. In fact, the two approaches are not conflicting or mutually exclusive. In the next subsection, we verify that the combination of LoRA and LOMO does not hurt the model performance, and in most cases leads to performance improvement.

LOMO scales efficiently to the 65-billion-parameter model. Although all experiments are conducted on a single machine equipped with 8 × RTX 3090 GPUs, LOMO consistently shows strong performance even at the 65B-parameter scale. This further supports the effectiveness of LOMO in optimizing LLMs in resource-constrained scenarios.

4.3.2. LoRA and LOMO

LOMO and LoRA are largely independent of each other. To verify this, we conduct experiments on the BoolQ and MultiRC datasets with LLaMA-13B; the results are shown in Figure 3. We find that LOMO consistently improves the performance of LoRA, regardless of how high the results achieved by LoRA already are. This shows that the fine-tuning methods adopted by LOMO and LoRA are complementary: LOMO focuses on fine-tuning the weights of the pre-trained model, while LoRA tunes additional modules. Therefore, LOMO does not hurt LoRA's performance; rather, it facilitates better model tuning for downstream tasks.

5 Conclusion

This paper presents LOw-Memory Optimization (LOMO), a new optimizer designed to enable full-parameter fine-tuning of large language models with limited resources. We demonstrate the feasibility of fine-tuning a 65B model on a server equipped with consumer-grade GPUs such as the RTX 3090. The effectiveness and potential impact of LOMO are demonstrated by analyzing its memory usage, conducting throughput tests, and running experiments on SuperGLUE.

Going forward, our future work aims to further lower the resource threshold required to train large language models, thereby enabling wider access and adoption of these models. When training with LOMO, most of the memory is currently occupied by parameters. Therefore, a promising direction is to explore parameter quantization techniques, which can significantly reduce memory usage. Studying more applicable scenarios of LOMO and in-depth theoretical analysis of optimizing large-scale language models are of great value in promoting the development of this field.



Origin blog.csdn.net/hhhhhhhhhhwwwwwwwwww/article/details/131320059