New work from Fudan's Qiu Xipeng: single-machine fine-tuning of a 65-billion-parameter large model. Industry insiders: it is of great significance for the popularization of large models

From: Qubit


A single machine can now full-parameter fine-tune the LLaMA ("alpaca") model!

This latest result, which has the open-source community ecstatic, comes from Qiu Xipeng's team at Fudan University.


Specifically, the researchers proposed a new optimizer called LOMO (LOw-Memory Optimization) and used it to successfully fine-tune the full 65B-parameter LLaMA on a single server with eight RTX 3090 GPUs (24GB of memory each).

Once the paper was released, it sparked plenty of discussion.

In the wake of the GPT-4 frenzy, people marvel at the capabilities of large language models while thinking more and more about who gets to control them.

Insiders are very excited about this:

Single-machine fine-tuning of LLaMA 65B is hugely significant for the popularization of large models!
I used to dream that everyone could fine-tune at least a model of the size and quality of Chinchilla (70 billion parameters, from DeepMind), and now Fudan has done it.


Single-machine fine-tuning of a large model with 65 billion parameters

The main contribution of the paper is the LOMO (LOw-Memory Optimization) optimizer, which aims to enable full-parameter fine-tuning of large models under limited resources.

The researchers point out that during the training of large language models, the optimizer state takes up most of the memory. Adam, for example, stores intermediate states (the first and second moments of the gradients) whose combined size is twice that of the parameters themselves.
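A back-of-the-envelope calculation makes the point concrete. This is a rough sketch assuming Adam's two moment tensors are kept in fp32 (4 bytes each), as is typical in mixed-precision training; the paper's exact accounting may differ:

```python
# Rough optimizer-state accounting for a 65B-parameter model.
# Adam keeps two moment tensors (m and v) per parameter; assuming
# fp32 storage, that is 8 bytes of optimizer state per parameter.
params = 65e9
adam_state_gb = params * (4 + 4) / 1024**3  # m + v in fp32
sgd_state_gb = 0.0                          # plain SGD keeps no optimizer state

print(f"Adam optimizer state alone: {adam_state_gb:.0f} GB")
```

Hundreds of gigabytes of pure optimizer state, before parameters, gradients, or activations are even counted, which is why eliminating this state is the first lever to pull.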

Therefore, the optimization idea of the Fudan team is as follows:

The first step is to rethink the optimizer's role from an algorithmic perspective. Since SGD (stochastic gradient descent) stores no intermediate state, it is a good alternative. The problem is that in standard SGD, gradient computation and the parameter update are performed as separate steps, so the full set of gradient tensors must still be held in memory, keeping usage high.

Therefore, the researchers proposed LOMO, which fuses gradient computation and the parameter update into a single step, avoiding the storage of full gradient tensors and thereby reducing memory usage.
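The fusion idea can be sketched in a few lines: instead of materializing gradients for the whole model and then calling an optimizer step, each weight tensor is updated the moment its gradient is computed, and the gradient is immediately discarded. The following is an illustrative NumPy toy (a two-layer network with plain SGD), not the paper's actual hook-based PyTorch implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.1, (4, 8)).astype(np.float64)  # toy layer-1 weights
W2 = rng.normal(0.0, 0.1, (8, 1)).astype(np.float64)  # toy layer-2 weights
LR = 0.1

def fused_sgd_step(x, y):
    """One training step in which each weight's gradient is applied and
    discarded immediately, so gradients for all layers are never held
    in memory at the same time (the core idea behind LOMO)."""
    global W1, W2
    # Forward pass
    h = np.tanh(x @ W1)
    pred = h @ W2
    loss = float(np.mean((pred - y) ** 2))
    # Backward pass, fused with the update:
    g_pred = 2.0 * (pred - y) / len(y)
    g_h = (g_pred @ W2.T) * (1.0 - h**2)  # backprop first, using the old W2
    W2 -= LR * (h.T @ g_pred)             # update W2; its gradient goes out of scope
    W1 -= LR * (x.T @ g_h)                # update W1 likewise
    return loss
```

Note that the gradient flowing to earlier layers is computed *before* the layer's own weights are overwritten; the real LOMO achieves the same effect inside PyTorch's backward pass via hooks.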


To stabilize LOMO's mixed-precision training, the researchers also took the following steps:

  • Gradient Normalization: Normalizes the gradients before applying them to the model parameters.

  • Loss Scaling: Before computing gradients, the loss function is multiplied by a scaling factor.

  • Full-precision computation: certain calculations are converted to full precision during training.
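Loss scaling in particular is easy to see numerically: a gradient too small for float16 underflows to zero, but scaling the loss (and hence every gradient) by a factor S before the backward pass keeps it representable, and the update simply divides by S afterwards. A minimal numerical illustration (not the paper's code; the scale factor here is an arbitrary power of two):

```python
import numpy as np

S = 1024.0          # loss-scaling factor (a typical power-of-two choice)
true_grad = 1e-9    # a gradient smaller than float16 can represent

naive = float(np.float16(true_grad))       # underflows to exactly 0.0
scaled = float(np.float16(true_grad * S))  # scaled value survives in fp16
recovered = scaled / S                     # unscale in higher precision before updating
```

Gradient normalization plays a complementary role: dividing gradients by their overall norm bounds the step size, which also protects the fp16 range on the overflow side.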

The researchers analyzed the memory usage of model states and activations during training with different optimizers.


It can be seen that, compared with AdamW, LOMO reduces memory usage from 102.20GB to 14.58GB.

The throughput tests show that on a server equipped with eight RTX 3090 GPUs, LOMO can handle training LLaMA 65B.

The researchers note that with this server configuration and LOMO, training on 1,000 samples of 512 tokens each takes about 3.6 hours.


The researchers also compared the downstream task performance of LOMO with Zero-shot and LoRA on the SuperGLUE benchmark.

The results show that LOMO outperforms Zero-shot on six datasets across models of different sizes, and in most experiments LOMO also outperforms LoRA.


Of course, while eight RTX 3090s is not a high-end configuration for large-model training, it is still out of reach for most individuals.

Many netizens complained: can eight RTX 3090s really be called "limited resources"?

However, others believe that this is still good news.

Although few people own such a server themselves, renting a machine with this configuration is not expensive.


On the other hand, the researchers acknowledged the paper's limitations and said they will continue working to lower the resource threshold for training large language models.

Currently, when training with LOMO, most of the memory is occupied by parameters. Therefore, a promising direction is to explore parameter quantization techniques, which may significantly reduce memory usage.
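As a rough illustration of why quantization is a promising direction (a hypothetical sketch of symmetric per-tensor int8 quantization, not anything from the paper): storing weights as int8 plus one scale factor uses a quarter of fp32 memory, at the cost of a small, bounded rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, 1000).astype(np.float32)  # toy weight tensor

scale = float(np.abs(w).max()) / 127.0              # symmetric per-tensor scale
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale                # dequantized for compute

memory_ratio = q.nbytes / w.nbytes                  # int8 vs fp32, ignoring the scale
max_err = float(np.abs(w - w_hat).max())            # rounding error, at most scale / 2
```

Real quantization schemes for LLMs are more involved (per-channel scales, outlier handling), but the memory arithmetic is the same: fewer bits per parameter directly shrinks the dominant term in LOMO's memory budget.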


Lv Kai, the first author of LOMO, is a master's student of Professor Qiu Xipeng (the paper's corresponding author) at the School of Computer Science and Technology, Fudan University; he also received his bachelor's degree from Fudan.

Previously, Fudan’s open-source MOSS model came from Qiu Xipeng’s team.

Paper address:
https://arxiv.org/abs/2306.09782

Project address:
https://github.com/OpenLMLab/LOMO



Origin blog.csdn.net/qq_27590277/article/details/131336024