New work from Danqi Chen's team: a single A100 GPU can train a 30-billion-parameter model!

Xi Xiaoyao Tech Says | Original
Authors | IQ dropped all over the place, ZenMoore

In recent years, with the emergence of large models, fine-tuned language models have demonstrated superior performance on various downstream tasks. However, these models often have billions or even tens of billions of parameters. Training a model of this size requires a large amount of memory, and optimization with traditional backpropagation is slow.

The authors of this paper propose MeZO, a memory-efficient zeroth-order optimizer. By adapting the classic ZO-SGD method to operate in place, MeZO fine-tunes language models with the same memory footprint as inference. On a single A100 80GB GPU, for example, MeZO can train a model with 30 billion parameters, whereas traditional backpropagation can only fine-tune a 2.7-billion-parameter model under the same memory budget.

As shown in Figure 1, the authors ran experiments on the OPT-13B model and compared the results. Despite using only 1/12 of the memory, MeZO outperforms zero-shot and in-context learning (ICL) on 7 tasks.

Figure 1: Zero-shot, in-context learning (ICL), MeZO, and Adam fine-tuning

This means that researchers and developers using MeZO, a memory-efficient zeroth-order optimizer, may be able to overcome the memory constraints and compute bottlenecks of traditional methods. It allows freer exploration, training, and optimization of language models at very large parameter scales, bringing more accurate and efficient solutions to natural language processing.

Paper Title:
Fine-Tuning Language Models with Just Forward Passes

Paper Link:
https://arxiv.org/abs/2305.17333

Code:
https://github.com/princeton-nlp/MeZO

Paper Quick Facts

Background setting: consider a labeled dataset $\mathcal{D}$ and a mini-batch $\mathcal{B} \subset \mathcal{D}$, and let $\mathcal{L}(\theta; \mathcal{B})$ denote the loss on the mini-batch. Under this setting, the classical zeroth-order (ZO) gradient estimator is introduced.
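
For reference, the classical SPSA estimator that ZO-SGD builds on (written here in the notation above, following the standard two-point SPSA form) perturbs $\theta$ with a Gaussian vector $z$ and takes a finite difference of two forward passes:

$$
\hat{\nabla} \mathcal{L}(\theta; \mathcal{B}) \;=\; \frac{\mathcal{L}(\theta + \epsilon z; \mathcal{B}) - \mathcal{L}(\theta - \epsilon z; \mathcal{B})}{2\epsilon}\, z, \qquad z \sim \mathcal{N}(0, I_d),
$$

and ZO-SGD then updates $\theta \leftarrow \theta - \eta\, \hat{\nabla}\mathcal{L}(\theta; \mathcal{B})$ with learning rate $\eta$.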

Memory-Efficient ZO-SGD (MeZO)

The classic ZO-SGD algorithm needs to store the perturbation vector $z \in \mathbb{R}^d$, so its memory overhead is twice that of inference. Memory-efficient ZO-SGD (MeZO) is proposed to solve this problem, as shown in Algorithm 1.

Algorithm 1: MeZO

At each step, a seed s is first drawn at random, and for each of the four usages of z in Algorithm 1, the random number generator is reset with s and the relevant entries of z are resampled. With this in-place implementation, the memory footprint of MeZO is comparable to the memory cost of the inference stage.

Note that Algorithm 1 describes perturbing each parameter separately, which may be time-consuming for large models. In practice, time can be saved by perturbing an entire weight matrix at once rather than each scalar independently. This incurs an additional memory overhead comparable in size to the largest weight matrix, which is typically the word embedding matrix.
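
As an illustration, here is a minimal PyTorch sketch of the in-place trick (not the authors' official implementation, which lives in the linked repo; `loss_fn`, the hyperparameters, and the seed handling are simplifying assumptions):

```python
import torch

def zo_step(model, loss_fn, batch, eps=1e-3, lr=1e-6):
    """One MeZO-style step: perturb the parameters in place with a seeded
    Gaussian z, evaluate the loss with two forward passes, restore the
    parameters, then apply the update -- z itself is never stored."""
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        # Resetting the RNG with the same seed regenerates the same z.
        torch.manual_seed(seed)
        for p in model.parameters():
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    perturb(+1)                                   # theta + eps * z
    with torch.no_grad():
        loss_plus = loss_fn(model, batch)
    perturb(-2)                                   # theta - eps * z
    with torch.no_grad():
        loss_minus = loss_fn(model, batch)
    perturb(+1)                                   # restore theta

    grad_scale = (loss_plus - loss_minus) / (2 * eps)  # projected gradient
    torch.manual_seed(seed)                       # fourth use of z: the update
    for p in model.parameters():
        z = torch.randn_like(p)
        p.data.add_(-lr * grad_scale * z)
    return loss_plus
```

Each `torch.randn_like(p)` draws a whole weight tensor at once, which is exactly the matrix-level perturbation mentioned above; only one such tensor is alive at a time, so the extra memory is bounded by the largest weight matrix.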

MeZO Extensions

MeZO can be combined with other gradient-based optimizers, such as SGD with momentum or the Adam optimizer.

While a naive implementation would require additional memory to store the gradient moment estimates, MeZO-momentum and MeZO-Adam reduce this overhead by recomputing the moving averages of gradients from the saved forward-pass losses and the regenerated z vectors.

Furthermore, all coordinates in the SPSA gradient estimate share the same scale, but different layers of a deep Transformer may have gradients of different scales. The authors therefore borrow the idea of layer-wise adaptive optimizers and design several variants of MeZO.

Experiments

MeZO is compared against the following baselines:

  • ICL: in-context learning
  • LP: linear probing
  • FT: full fine-tuning with Adam

Memory Usage

MeZO performs on par with FT on many tasks and outperforms memory-equivalent methods, while substantially reducing memory cost. Figures 2 and 3 compare the memory consumption of ICL, FT, LP, and MeZO.

Figure 2 GPU memory consumption of different OPT models and tuning methods on MultiRC

Figure 3. The largest OPT model that can be tuned with specific hardware and algorithms

Medium-Scale Masked Language Models

Figure 4 Experiments on RoBERTa-large

The experimental results show:

  • MeZO performs significantly better than zero-shot, LP, and other memory-equivalent methods.
  • With enough data, MeZO achieves performance comparable to FT (up to a 5% gap).
  • MeZO works well for both full-parameter tuning and PEFT (parameter-efficient fine-tuning).

Large Autoregressive Language Models

MeZO exhibits strong performance in classification, multiple choice, and generation tasks.

Table 1 Experiments on OPT-13B

MeZO scales to models with 66 billion parameters.

Table 2 Experiments performed on OPT-30B and OPT-66B

Training with non-differentiable objectives

MeZO can optimize non-differentiable objectives such as accuracy and F1 score.

Table 3 Using MeZO under non-differentiable objectives
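
To make this concrete, a hypothetical non-differentiable objective can be plugged into the `zo_step` sketch above unchanged, since MeZO only ever consumes the scalar loss value (this is an illustrative assumption, not code from the paper):

```python
import torch

def accuracy_loss(model, batch):
    """Return negative accuracy as a plain scalar. It has no useful
    gradient, but MeZO never needs one -- only the loss value."""
    inputs, labels = batch
    with torch.no_grad():
        preds = model(inputs).argmax(dim=-1)
    return -(preds == labels).float().mean().item()

# One MeZO step optimizing accuracy directly, via the earlier zo_step sketch:
# loss = zo_step(model, accuracy_loss, batch)
```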

Summary

The importance of this method is self-evident. It provides an effective path for researchers and developers to train larger-scale language models with limited resources.

  • First, the memory efficiency of MeZO makes it possible to use similar hardware configurations for training as for inference, which matters greatly for practical applications and deployment. Traditional backpropagation requires more memory and compute, limiting its use for large-scale models. With MeZO, researchers and developers can train at relatively low cost while obtaining higher model capacity and expressive power.
  • Second, MeZO's in-place operation avoids unnecessary memory overhead and data transfer, further improving training efficiency. Traditional backpropagation must store and move a large number of intermediate results, whereas MeZO updates model parameters without additional memory overhead, reducing memory usage and data-transfer requirements. This is especially important for training large-scale language models, making the process faster and more efficient.
  • Therefore, MeZO is significant for advancing large-scale language models. It gives researchers and developers an innovative optimization method that lets them design and train models with large numbers of parameters more flexibly. By improving training efficiency and reducing resource costs, MeZO opens up new possibilities for further research and application of language models and is expected to bring further breakthroughs in natural language processing. In addition, this memory-efficient optimizer helps lower the barrier to training large-scale models, enabling more people to participate in the research and application of language models and thereby promoting innovation and broader adoption.


Source: blog.csdn.net/xixiaoyaoww/article/details/131118363