LoRA: Best Practices for Personalization with Large Language Models

Producer: Towhee Technical Team

Large language models (LLMs) have attracted enormous attention this year. For a long time, pre-training followed by fine-tuning has been the standard paradigm for adapting a model to task-specific data. With very large models, however, this kind of full fine-tuning (retraining all model parameters) becomes increasingly impractical: for GPT-3 175B, deploying independent fine-tuned instances, each with 175B parameters, is prohibitively expensive. In 2021, Microsoft proposed LoRA (Low-Rank Adaptation), a method that has received growing attention in the era of large models and delivers very strong results. LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, which greatly reduces the number of trainable parameters required for downstream tasks. Compared with fine-tuning GPT-3 175B using Adam, LoRA reduces the number of trainable parameters by 10,000 times and GPU memory requirements by 3 times. Moreover, LoRA performs comparably to or better than full fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters and higher training throughput, and, unlike adapters, it adds no extra inference latency.
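
To make the idea concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer. This is illustrative only, not the paper's reference implementation: the pretrained weight W is frozen, and only the low-rank factors A and B (rank r much smaller than the layer dimensions) are trained.

```python
# A minimal sketch of the LoRA idea in PyTorch (illustrative, not the
# paper's reference implementation).
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        # Frozen pretrained projection W: its weights are never updated.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False

        # Trainable rank decomposition: A is initialized with small random
        # values, B with zeros, so the update B @ A starts at zero and
        # training begins from the pretrained behaviour.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W x + (alpha / r) * B A x
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


# Only A and B contribute trainable parameters.
layer = LoRALinear(768, 768, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / total: {total}")  # 12,288 vs. 602,112
```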

Figure: LoRA's reparametrization, in which only A and B are trained.

Table: Performance of pretrained RoBERTa adapted to downstream tasks with and without LoRA.

The experimental results clearly illustrate the effectiveness of the method. The authors fine-tuned RoBERTa on a range of downstream tasks: FT denotes full fine-tuning, where all parameters are updated, while BitFit trains only the bias vectors and freezes all other weights. Apart from full fine-tuning, each adaptation method trains relatively few parameters, and LoRA achieves better results while training fewer parameters than the alternatives.
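
As a rough sketch of how this kind of setup can be reproduced today, the Hugging Face PEFT library (an assumption here; it is not mentioned in the paper or in this article) can wrap a pretrained RoBERTa so that only the injected LoRA matrices are trainable. The hyperparameters below are illustrative, not the paper's exact configuration.

```python
# Sketch: applying LoRA to RoBERTa for sequence classification with PEFT.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

config = LoraConfig(
    r=8,                                # rank of the update matrices
    lora_alpha=16,                      # scaling factor
    target_modules=["query", "value"],  # attention projections to adapt
    lora_dropout=0.1,
    task_type="SEQ_CLS",
)
model = get_peft_model(base, config)

# Prints the trainable/total parameter ratio; well under 1% is trainable.
model.print_trainable_parameters()
```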

LoRA is proving to be an effective method well beyond NLP: more and more algorithms are built on Transformers, and LoRA adapts to Transformer architectures very easily. The highly popular Stable Diffusion has also embraced LoRA, so that users with limited computing power can quickly fine-tune a LoRA model on their own data. As large models attract ever more attention and become the baseline for all kinds of tasks, this method is likely to become one of the most routine operations of the large-model era.
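
For example, recent versions of the Hugging Face diffusers library can load LoRA weights into a Stable Diffusion pipeline at inference time. The sketch below assumes such a setup; the model identifier, weight path, and prompt are placeholders, not something taken from this article.

```python
# A minimal sketch, assuming the diffusers library and a previously
# trained set of LoRA weights; the model id and weight path are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Attach the low-rank adapter on top of the frozen base model.
pipe.load_lora_weights("./my_lora_weights")

image = pipe("a watercolor painting of a lighthouse at dusk",
             num_inference_steps=30).images[0]
image.save("lora_sample.png")
```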
