The secret weapon for efficient development of large models: MindSpore PET, a low-parameter fine-tuning kit for large models

Abstract: This article introduces MindSpore PET, a low-parameter fine-tuning suite for large models.

This article is shared from the Huawei Cloud Community post "The Secret Weapon for Efficient Development of Large Models - Large Model Low-Parameter Fine-tuning Kit MindSpore PET", author: yd_280874276.

Artificial intelligence has entered the era of large models. Large models have stronger generalization ability: when they are applied in vertical domains, only a small amount of fine-tuning is needed to adapt them to multiple scenarios. As a result, developing large models has become a consensus across industry, academia, and research.

For large-model development, Ascend has launched a large-model development enablement platform. Built on MindSpore, it provides a full-process enablement suite for large models, including the Transformer large-model suite MindSpore Transformers, the text-to-image large-model suite MindSpore Diffusion, the human-feedback reinforcement learning suite MindSpore RLHF, and the low-parameter fine-tuning kit MindSpore PET, supporting large models from pre-training and fine-tuning to compression, inference, and service deployment.

This post opens the "Secret Weapon for Efficient Development of Large Models" series with an introduction to the low-parameter fine-tuning kit for large models, MindSpore PET.

1. Introduction to MindSpore PET

MindSpore PET (MindSpore Parameter-Efficient Tuning) is a low-parameter fine-tuning kit for large models developed on top of the MindSpore AI framework. The kit currently provides six algorithms: five classic low-parameter fine-tuning algorithms (LoRA, Prefix-Tuning, Adapter, LowRankAdapter, and BitFit) and one fine-tuning algorithm, R-Drop, for improving accuracy on downstream tasks. The low-parameter fine-tuning algorithms tune only a very small number of parameters while maintaining the accuracy of full-parameter fine-tuning, which greatly reduces compute and storage memory and shortens fine-tuning time. The accuracy-improving algorithm adds almost no extra compute memory or time, while increasing the randomness of the model to prevent overfitting and improve accuracy.

The kit provides API interfaces and use cases for all algorithms, so they can be used out of the box. For the low-parameter fine-tuning algorithms, it also provides interfaces that save only the small set of learnable parameters, keeping the generated ckpt files very small.

Open source warehouse address: https://github.com/mindspore-lab/MindPet

2. MindSpore PET - LoRA

2.1 Algorithm principle

LoRA (Low-Rank Adaptation of Large Language Models) is a low-parameter fine-tuning algorithm for large language models proposed by Microsoft. LoRA assumes that when a large model adapts to a downstream task, its fully connected layers have a low intrinsic rank, i.e. they contain a lot of redundant information. It therefore injects trainable rank-decomposition matrices into the fully connected layers of the Transformer architecture and freezes the weights of the original pre-trained model, which greatly reduces the number of trainable parameters.
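
Conceptually, a LoRA layer keeps the pre-trained weight W0 frozen and adds a trainable low-rank update, computing h = W0·x + (alpha/r)·x·A·B. Below is a minimal, self-contained sketch of such a layer written against standard MindSpore APIs; it only illustrates the idea and is not the kit's actual LoRADense implementation (the class and parameter names are made up):

import mindspore as ms
from mindspore import nn, ops, Parameter
from mindspore.common.initializer import initializer

class LoRALinear(nn.Cell):
    """Illustrative LoRA layer: frozen base projection plus a trainable rank-r update."""

    def __init__(self, in_features, out_features, rank=4, alpha=4):
        super().__init__()
        # frozen pre-trained projection W0
        self.base = nn.Dense(in_features, out_features, has_bias=False)
        for param in self.base.trainable_params():
            param.requires_grad = False
        # trainable rank decomposition: A is random, B starts at zero so the update starts at zero
        self.lora_a = Parameter(initializer('normal', (in_features, rank), ms.float32), name='lora_a')
        self.lora_b = Parameter(initializer('zeros', (rank, out_features), ms.float32), name='lora_b')
        self.scaling = alpha / rank

    def construct(self, x):
        # h = W0 x + (alpha / r) * x A B
        return self.base(x) + ops.matmul(ops.matmul(x, self.lora_a), self.lora_b) * self.scaling

Only lora_a and lora_b are updated during training, which is why the saved checkpoint stays tiny compared with the full model.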

2.2 Application effect: taking Wukong-Huahua as an example

The Wukong-Huahua (Wukong painting) model is a Chinese text-to-image model based on the diffusion model. Although it is highly capable, its network is huge, with about 900 million parameters. When adapting it to downstream tasks, training takes a long time and the compute and storage memory overhead is large.

Analysis shows that Wukong-Huahua uses a CLIP model to convert human language into vectors the machine can understand, and a U-Net model to predict noise. The attention structures of both models contain fully connected layers, which may carry a lot of redundant information when adapting to downstream tasks.

Therefore, we injected LoRA modules into the q, k, v, and output projection layers of U-Net's cross-attention, and the results are very good.

As shown in the figure below, after adapting LoRA, high-quality images can still be generated even though only 0.07% of the parameters are trained!

At the same time, compared with full-parameter fine-tuning, applying the LoRA algorithm greatly improves training performance:

  1. End-to-end full-parameter fine-tuning originally took 17 hours; after adaptation it takes only 9 hours, saving nearly 50% of the time;
  2. Compute memory is reduced by 40%, so the batch_size can be doubled and training runs faster;
  3. The final saved ckpt is only 3.06 MB, instead of the roughly 4 GB needed to save all parameters.

This means that for n downstream tasks, only n × 3.06 MB needs to be stored, avoiding an n × 4 GB "behemoth". We also ran an exciting experiment: if a user has trained models in several styles, switching between them takes only 0.5 s, truly seamless switching between "Picasso" and "Makoto Shinkai"!

The reason lies in the static-graph feature of the MindSpore framework: the graph only needs to be compiled during the first forward pass of training, and loading a different LoRA ckpt to update the parameters later does not trigger recompilation.
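
For example, switching styles only involves loading a different LoRA checkpoint into the already-compiled network. A hypothetical illustration using the standard MindSpore checkpoint APIs (the file names and the net object here are placeholders):

from mindspore import load_checkpoint, load_param_into_net

# the backbone is compiled once during the first forward pass;
# afterwards only the ~3 MB LoRA checkpoints are swapped in
picasso_params = load_checkpoint("lora_picasso.ckpt")
load_param_into_net(net, picasso_params)

# switching to another style loads another small ckpt and triggers no recompilation
shinkai_params = load_checkpoint("lora_makoto_shinkai.ckpt")
load_param_into_net(net, shinkai_params)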

2.3 How to use

The LoRA algorithm not only lightens the load of large models but is also very easy to use: end-to-end adaptation takes just five simple steps.

Step 1:

Replace the Dense layers for q, k, v, and output (qkvo) in the model's CrossAttention structure with LoRADense:

from tk.delta import LoRADense

# original Dense layer
# self.to_q = nn.Dense(query_dim, inner_dim, has_bias=False).to_float(dtype)

# replace the Dense layer with LoRADense; lora_rank and lora_alpha control the low-rank update
self.to_q = LoRADense(query_dim, inner_dim, has_bias=False, lora_rank=4, lora_alpha=4).to_float(dtype)

Step 2:

Call the freezing method in the training script so that only the newly added LoRA modules are trained:

from tk.graph import freeze_delta
# freeze all cells except LoRA and head
freeze_delta(LatentDiffusionWithLoss, 'lora')

Step 3:

In the training script, replace the ModelCheckpoint used to save ckpt files with TrainableParamsCheckPoint, so that only the parameters that need to be updated are saved:

from tk.graph import TrainableParamsCheckPoint
# original callback
# ckpt_callback = ModelCheckpoint(...)
# replace ModelCheckpoint with TrainableParamsCheckPoint
ckpt_callback = TrainableParamsCheckPoint(...)

Step 4:

Adjust the learning rate, batch_size and other parameters according to the training goal:

epochs: 15 
start_learning_rate: 1e-4 
end_learning_rate: 1e-6 
train_batch_size: 3 
warmup_steps: 0
lora_rank: 4
lora_alpha: 4

Step 5:

After the training is complete, load the pre-trained ckpt and the ckpt generated after fine-tuning in the evaluation script:

from mindspore import load_checkpoint, load_param_into_net

# load the pre-trained ckpt
pre_trained_params = load_checkpoint(pre_trained_ckpt_path)
load_param_into_net(net, pre_trained_params)
# load the ckpt generated by fine-tuning
trainable_params = load_checkpoint(trainable_ckpt_path)
load_param_into_net(net, trainable_params)
# start evaluation
model.eval()

We have open-sourced all the code and provided detailed interface documentation and use cases: https://github.com/mindspore-lab/MindPet/blob/master/doc/TK_DeltaAlgorithm_README.md

Note that, compared with full-parameter fine-tuning, a larger learning rate is generally needed after adapting LoRA. For example, when adapting Wukong-Huahua, we increased the learning rate from 1e-5 to 1e-4.

3. MindSpore PET - Prefix-Tuning

Prefix-Tuning (Optimizing Continuous Prompts for Generation) is another low-parameter fine-tuning algorithm for large language models. The researchers proposed building prefix templates from continuous vectors instead of discrete vocabulary tokens; that is, prepending continuous token embeddings to the input can increase the correlation between queries and keys. Prefix-Tuning therefore injects trainable prefix vectors in front of the key and value matrices of every multi-head attention layer and freezes the original network parameters, which greatly improves performance on generation tasks.
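
As a rough illustration of this mechanism (not the kit's actual API; the class name, shapes, and parameter names below are assumptions), the trainable prefix can be thought of as extra key/value rows concatenated in front of the frozen projections of each attention layer:

import mindspore as ms
from mindspore import nn, ops, Parameter
from mindspore.common.initializer import initializer

class PrefixKV(nn.Cell):
    """Illustrative Prefix-Tuning module: trainable prefix vectors prepended to key/value."""

    def __init__(self, prefix_len, hidden_size):
        super().__init__()
        # the only trainable parameters; the backbone attention weights stay frozen
        self.prefix_k = Parameter(initializer('normal', (1, prefix_len, hidden_size), ms.float32), name='prefix_k')
        self.prefix_v = Parameter(initializer('normal', (1, prefix_len, hidden_size), ms.float32), name='prefix_v')
        self.tile = ops.Tile()
        self.concat = ops.Concat(axis=1)

    def construct(self, key, value):
        # key / value: (batch, seq_len, hidden_size) produced by the frozen k/v projections
        batch = key.shape[0]
        key = self.concat((self.tile(self.prefix_k, (batch, 1, 1)), key))
        value = self.concat((self.tile(self.prefix_v, (batch, 1, 1)), value))
        return key, value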

Prefix-Tuning works well on both GPT-2 and PanGu-Alpha. Compared with full-parameter fine-tuning, and while maintaining the original accuracy, training PanGu-Alpha with Prefix-Tuning requires only 5.5% of the parameters, saves more than 65% of compute memory, and cuts the time per iteration in half.

4. MindSpore PET - R-Drop

R-Drop (Regularized Dropout for Neural Networks) is a fine-tuning algorithm for improving accuracy. It constructs positive sample pairs for contrastive-style regularization simply by running dropout twice, which increases the randomness of the model. Concretely, after the model loads a batch of data, the batch is copied and both copies are fed through the model at the same time; the loss is computed for each copy and the results are added to obtain the final loss. Although the logic is simple, it prevents the model from overfitting and further improves accuracy. Verified on multiple downstream tasks with BERT, it improves accuracy by 2.6 points while keeping memory and time overhead almost unchanged.
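
For reference, below is a minimal sketch of what the R-Drop loss looks like, written against standard MindSpore 2.x APIs rather than the kit's own interface. Besides summing the two cross-entropy losses as described above, the original R-Drop paper also adds a symmetric KL divergence between the two dropout-perturbed predictions, weighted by a coefficient alpha (a hyperparameter whose value is not specified here):

import mindspore.nn as nn

class RDropLoss(nn.Cell):
    """Illustrative R-Drop loss: two dropout forward passes of the same batch give
    logits1 and logits2; the loss sums both cross-entropies plus a symmetric KL term."""

    def __init__(self, alpha=1.0):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.softmax = nn.Softmax(axis=-1)
        self.log_softmax = nn.LogSoftmax(axis=-1)
        self.alpha = alpha

    def construct(self, logits1, logits2, labels):
        # standard cross-entropy on both copies of the batch
        ce = self.ce(logits1, labels) + self.ce(logits2, labels)
        # symmetric KL divergence between the two dropout-perturbed distributions
        p1, p2 = self.softmax(logits1), self.softmax(logits2)
        lp1, lp2 = self.log_softmax(logits1), self.log_softmax(logits2)
        kl = ((p1 * (lp1 - lp2)).sum(axis=-1) + (p2 * (lp2 - lp1)).sum(axis=-1)).mean()
        return ce + self.alpha * 0.5 * kl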

Taking a large model from development to deployment is a complicated process with a high barrier to entry. The large-model enablement suites will help developers make large models easier to develop, adapt, and deploy.

If you want to learn more about the Transformer large-model suite MindSpore Transformers, the text-to-image large-model suite MindSpore Diffusion, and the human-feedback reinforcement learning suite MindSpore RLHF, please follow the MindSpore official account; we will keep bringing you technical insights and event news from the field of artificial intelligence.

 

Click "Follow" to be the first to learn about Huawei Cloud's latest technologies~
