Meta proposes a new parameter-efficient fine-tuning scheme: by adding just a small RNN, the GPU energy consumption of fine-tuning a Transformer model is reduced by 84%!

Recently, with the rapid development of ChatGPT and GPT-4, major Internet companies in China and abroad have launched their own large language models, such as Google's PaLM series and Meta AI's LLaMA series, alongside models from Chinese companies and universities such as Baidu's ERNIE Bot (Wenxin Yiyan) and Tsinghua University's ChatGLM. A brand-new large model is released almost every few days, but for researchers and developers the more pressing questions concern practical innovations in training, fine-tuning, inference, and deployment of these base models. That brings us to the language-modeling architecture underneath: most large models today are still built on the Transformer, published at NeurIPS six years ago.

Fine-tuning an entire Transformer model becomes increasingly expensive as model size and the number of tasks grow, so many parameter-efficient transfer learning (PETL) methods have been proposed. This paper, from Meta AI, proposes REcurrent ADaption (READ), a parameter-efficient adaptation method built on the traditional RNN architecture. Specifically, READ only needs to insert a small RNN network alongside the backbone Transformer to achieve efficient parameter fine-tuning, and no backpropagation through the backbone Transformer is required. Through a series of experiments, the authors show that READ can save 56% of training memory consumption and 84% of GPU energy usage while maintaining high-quality fine-tuning results.

Paper link:

https://arxiv.org/abs/2305.15348

1. Introduction

Since 2018, the parameter scale of large language models has grown nearly two orders of magnitude faster than GPU memory, which keeps raising the entry barrier: assembling a GPU rig capable of holding a large model is extremely expensive, and only a handful of well-funded companies and institutions can afford to train and fine-tune such models. To lower this barrier, PETL methods have become the preferred solution. For example, the Adapter method [1] reduces the number of parameters the model must update by inserting small modules into the Transformer; the Soft Prompts method [2] achieves a similar effect by concatenating a small set of trainable prompt embeddings with the model's input embeddings; the widely followed LoRA method [3] reduces the number of trainable parameters through a low-rank decomposition of the weight updates; and the BitFit method [4] fine-tunes only the bias terms of the network. The table below compares the fine-tuning costs of the READ method proposed in this paper with the methods above.
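To make the low-rank idea behind LoRA more concrete, here is a minimal sketch (in PyTorch) of a linear layer whose pretrained weight is frozen while a small trainable low-rank update is learned on top of it. The class and parameter names are illustrative assumptions, not taken from any official implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA-style layer: frozen base weight + trainable low-rank update."""
    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)  # stands in for a frozen pretrained weight
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # Low-rank factors: only rank * (in_features + out_features) trainable parameters.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen full-rank path plus a scaled low-rank correction B @ A.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(768, 768, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12,288 trainable parameters vs. ~590k frozen in the base layer
```

Adapter and BitFit follow the same spirit: train a small number of extra or existing parameters while leaving the bulk of the pretrained weights untouched.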

As the comparison above shows, PETL methods greatly reduce the cost of fine-tuning relative to full fine-tuning, and READ holds a clear advantage over the other methods thanks to the small RNN structure it adds. Even in an era dominated by the Transformer architecture, the comparatively old RNN is showing renewed vitality: an open-source team led largely by Chinese developers recently released RWKV [5], a large language model based on an RNN architecture that is billed as offering the best of both worlds relative to the Transformer.

2. The READ Method

2.1 What is READ? 

The READ architecture proposed in this paper consists mainly of a standard RNN and a Joiner network; its overall structure is shown in the figure below. READ has several notable properties (a rough code sketch follows this list):

1. Fine-tuning with READ does not require backpropagating gradients through the backbone Transformer; the backbone only needs to run a forward pass.

2. The optimization process involves only the RNN and feed-forward networks (FFNs); the self-attention layers never need to be updated. This improves overall usability and training efficiency, and READ can be used plug-and-play with any Transformer structure.

3. Thanks to READ's recurrent structure, the number of trainable parameters for fine-tuning does not grow with the number of backbone layers; it grows sub-linearly with the overall backbone size.

4. READ can be computed without modifying the intermediate results of the backbone Transformer.
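As noted above, here is a rough sketch of how a READ-style side network could be wired. It follows the high-level recipe described in the paper (a shared RNN plus a feed-forward Joiner consume the frozen backbone's per-layer hidden states and produce an additive correction to the backbone output), but the class, the projection layers, and all names are this article's illustrative assumptions, not Meta's released code.

```python
import torch
import torch.nn as nn

class READSideNetwork(nn.Module):
    """Small recurrent side network in the spirit of READ (illustrative sketch only).

    It consumes the frozen backbone's per-layer hidden states, recurs over the
    layer (depth) dimension with a single shared RNN, and produces an additive
    correction to the backbone's final output. Only this module is trained.
    """
    def __init__(self, backbone_dim, read_dim=128, cell="gru"):
        super().__init__()
        cells = {"rnn": nn.RNN, "gru": nn.GRU, "lstm": nn.LSTM}
        self.read_dim = read_dim
        self.joiner = nn.Linear(backbone_dim, read_dim)    # feed-forward joiner: project states down
        self.rnn = cells[cell](read_dim, read_dim, batch_first=True)
        self.out_proj = nn.Linear(read_dim, backbone_dim)  # project the correction back up

    def forward(self, layer_states, backbone_output):
        # layer_states: list of (batch, seq, backbone_dim) tensors, one per backbone layer.
        b, s, d = backbone_output.shape
        stacked = torch.stack(layer_states, dim=2)         # (batch, seq, num_layers, dim)
        x = self.joiner(stacked).reshape(b * s, -1, self.read_dim)
        out, _ = self.rnn(x)                               # the same RNN is reused at every depth
        correction = self.out_proj(out[:, -1]).reshape(b, s, d)
        return backbone_output + correction                # backbone activations are never modified
```

Because the RNN and joiner are shared across layers, the trainable parameter count of this sketch depends only on `backbone_dim` and `read_dim`, never on how deep the backbone is.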

2.2 How does READ work? 
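Conceptually, a READ fine-tuning step proceeds in three parts: run a single forward pass through the frozen backbone, feed the cached per-layer hidden states through the small RNN-plus-Joiner side network to obtain a correction that is added to the backbone output, and update only the side network's parameters. The sketch below continues the illustrative READSideNetwork from Section 2.1 and shows what such a step could look like; the toy backbone and all other names are assumptions made for illustration, not the authors' training code.

```python
import torch
import torch.nn as nn

# Toy frozen backbone: a stack of Transformer encoder layers whose per-layer
# hidden states are collected during a single, gradient-free forward pass.
class ToyBackbone(nn.Module):
    def __init__(self, dim=256, num_layers=6, num_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, x):
        states = []
        for layer in self.layers:
            x = layer(x)
            states.append(x)
        return states, x                       # per-layer hidden states and final output

backbone = ToyBackbone()
for p in backbone.parameters():                # the backbone is completely frozen
    p.requires_grad_(False)

read = READSideNetwork(backbone_dim=256)       # side network sketched in Section 2.1
optimizer = torch.optim.AdamW(read.parameters(), lr=1e-3)  # only READ parameters are optimized

x = torch.randn(8, 32, 256)                    # stand-in for embedded input tokens
target = torch.randn(8, 32, 256)               # stand-in for a regression-style training target

with torch.no_grad():                          # forward pass only; no backprop through the backbone
    layer_states, backbone_out = backbone(x)

optimizer.zero_grad()
pred = read(layer_states, backbone_out)        # gradients flow only through the small RNN + FFN
loss = nn.functional.mse_loss(pred, target)
loss.backward()
optimizer.step()
```

Because the backbone's forward pass runs under `torch.no_grad()`, none of its activations need to be stored for gradient computation, which is the intuition behind the memory and energy savings reported in the experiments below.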

3. Experimental results 

The experiments in this paper are carried out on multiple natural language tasks from the GLUE benchmark, with the T5 model as the backbone Transformer. For the recurrent component, the authors try several recurrent network structures, including the vanilla RNN, LSTM, and GRU.

3.1 The READ method outperforms other methods with significantly lower energy consumption 

The figure below compares the READ method with other PETL methods in terms of GPU energy consumption. From the left half of the figure, we can see that compared with full fine-tuning, READ reduces GPU energy consumption by about 90% and GPU memory usage by 56%, while the model's prediction accuracy remains essentially unchanged.

While PETL methods such as LoRA, BitFit, and Adapters also greatly reduce the number of trainable parameters, they do little to reduce the actual computational cost of fine-tuning, which is what PETL ultimately aims to optimize. From the right half of the figure, we can see that READ uses very little GPU memory during training. The figure shows the trade-off between model performance and GPU memory usage: compared with every other baseline method, READ reduces training memory by at least 25% while achieving better downstream prediction performance.

3.2 READ has strong scalability

As shown in the figure below, the number of trainable parameters in READ grows much more slowly than in other PETL methods: as the T5 backbone scales up, READ's trainable parameter count follows a log-linear (sub-linear) trend. This is a consequence of READ's recurrent structure, which makes the fine-tuning parameter count independent of the number of backbone layers, and it makes READ particularly well suited to fine-tuning very large Transformer models in practice.
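As a quick sanity check of this property, the snippet below (reusing the illustrative READSideNetwork from Section 2.1) counts the side network's trainable parameters for backbones of different widths and depths: the count changes with the hidden dimension but is identical for shallow and deep backbones of the same width.

```python
def count_params(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Depth never enters the side network's construction, so deeper backbones add
# no trainable parameters; only a wider hidden dimension does.
for hidden_dim, num_layers in [(512, 6), (512, 24), (1024, 24), (4096, 48)]:
    side = READSideNetwork(backbone_dim=hidden_dim, read_dim=128)
    print(f"backbone dim={hidden_dim:5d}, layers={num_layers:3d} "
          f"-> READ trainable params: {count_params(side):,}")
```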

3.3 READ also fares well on inference speed and memory usage

As shown in the left half of the figure below, READ uses less memory during inference than other PETL methods while keeping inference speed high. In addition, to evaluate READ's inference memory usage more thoroughly, the right half of the figure shows how inference memory changes as the backbone grows: compared with full fine-tuning, the extra inference memory introduced by READ is almost negligible.

4. Summary

This paper proposes REcurrent ADaption (READ), a new parameter-efficient fine-tuning method for large-scale Transformer models. READ is not only lightweight but also comparable to traditional fine-tuning in accuracy. By introducing an RNN-plus-Joiner module, READ avoids backpropagating through the backbone Transformer during fine-tuning, which significantly reduces GPU energy consumption, with savings of up to 84%. READ also shows strong scalability and can be used plug-and-play with almost any Transformer structure, without touching the complex self-attention layers of the original model. At the same time, compared with full fine-tuning, READ reduces training memory usage by 56%, further lowering the barrier for deep learning engineers to fine-tune large models.

References

[1] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790-2799. PMLR, 2019.

[2] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.

[3] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

[4] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models, 2022.

[5] Bo Peng, Eric Alcaide, Quentin Anthony, et al. RWKV: Reinventing RNNs for the Transformer era. arXiv preprint arXiv:2305.13048, 2023.

Author: seven_


Source: blog.csdn.net/hanseywho/article/details/131688340