LoRA: accelerated fine-tuning and training algorithms for large models

ChatGPT has made large models spring up like mushrooms after a spring rain, and everyone is eager to fine-tune them. Let's go over the common algorithms.

1 LoRA

Theory

Low-Rank Matrix Factorization (LRMF) is a common data dimensionality-reduction technique: it maps high-dimensional data into a low-dimensional space while preserving as much of the important information in the original data as possible. LoRA (Low-Rank Adaptation) builds on this low-rank idea; by representing a weight matrix as a product of low-rank factors, it further reduces storage and computational complexity while keeping the representation low-rank.

The core idea of the LoRA algorithm is to decompose the original matrix $A$ into the product of two low-rank matrices $X$ and $Y$, that is, $A \approx X \cdot Y$. Concretely, the algorithm first performs an SVD on the original matrix, $A = U \Sigma V^T$, where $U$ and $V$ are the eigenvector matrices of $AA^T$ and $A^TA$ respectively and $\Sigma$ is the matrix of singular values. It then takes the first $k$ columns of $U$, the top-left $k \times k$ block of $\Sigma$, and the first $k$ rows of $V^T$, giving the low-rank factors $X = U(:,1:k)\,\Sigma(1:k,1:k)$ and $Y = V^T(1:k,:)$, where $k$ is a preset parameter representing the rank of the approximation. Finally, the matrix $A_k = X \cdot Y$ is used as an approximation of the original matrix $A$, that is, $A_k \approx A$.

The advantage of the LoRA algorithm is that it further reduces the storage and computational complexity of the matrix while preserving its low rank. Specifically, the storage complexity is $O(mk + nk)$ and the computational complexity is $O(mnk)$, where $m$ and $n$ are the numbers of rows and columns of $A$. The algorithm is therefore well suited to large-scale data processing, especially in resource-constrained environments, where it can greatly reduce computing and storage overhead.
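To make the notation above concrete, here is a small numpy sketch of the rank-$k$ truncation (illustrative only; in LoRA itself the low-rank factors are learned during fine-tuning rather than computed by an explicit SVD, and the matrix sizes and rank below are arbitrary choices):

```python
import numpy as np

m, n, k = 1024, 768, 8
A = np.random.randn(m, n)

# Full SVD of A: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

X = U[:, :k] * S[:k]   # first k columns of U, scaled by the top-k singular values
Y = Vt[:k, :]          # first k rows of V^T
A_k = X @ Y            # rank-k approximation, A_k ≈ A

# Storage drops from m*n entries to k*(m + n).
print(A.size, X.size + Y.size)
```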

Summary: freezing the pre-trained model weights and injecting trainable rank-decomposition matrices into each weight matrix of the Transformer layers greatly reduces the number of trainable parameters for downstream tasks.

How to use

HuggingFace's peft package provides ready-made support for LoRA; you only need to call its API, as sketched below.
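A minimal sketch of wrapping a causal language model with LoRA via peft (the model name, rank, scaling factor, and target_modules are illustrative choices, not values prescribed by this post):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                 # rank k of the injected decomposition
    lora_alpha=32,                       # scaling factor for the LoRA update
    lora_dropout=0.1,
    target_modules=["query_key_value"],  # which Transformer weight matrices to adapt
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only the injected low-rank matrices are trainable
```

The wrapped model can then be trained with the usual Trainer or a plain PyTorch loop; the frozen base weights are untouched and only the LoRA matrices receive gradients.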

Parallel Training with DistributedDataParallel

During parallel training, each GPU computes in parallel and every replica uses the same model weights. At each gradient-descent update the processes synchronize once, so every process applies the same gradients and ends up with the same updated model. Once the model is wrapped in a DistributedDataParallel layer, the forward pass and the backward gradient update work as before, and all other operations remain unchanged.

After DistributedDataParallel wraps the model into ddp_model, a module. prefix is added to every parameter name, because the original model is stored in the member variable ddp_model.module.

When mixing single-GPU and multi-GPU training code, watch out for this parameter-name mismatch; the same issue appears when loading the LoRA model above, since the layer names change there as well. The safest approach is to always go through ddp_model.module, so that single-GPU and multi-GPU checkpoints stay compatible, as in the sketch below.
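A minimal sketch of the module. prefix introduced by DDP and of saving through ddp_model.module so that single-GPU code can load the checkpoint directly (assumes the process-group environment variables are set, e.g. by torchrun; the toy Linear model and file name are placeholders):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")              # env vars assumed set by torchrun
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).cuda()
ddp_model = DDP(model, device_ids=[local_rank])

# Parameter names gain a "module." prefix after wrapping: "weight" -> "module.weight"
print(list(ddp_model.state_dict().keys()))

# Save the unwrapped model so single-GPU code can load the checkpoint as-is.
if dist.get_rank() == 0:
    torch.save(ddp_model.module.state_dict(), "checkpoint.pt")

dist.destroy_process_group()
```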

Reinforcement Learning from Human Feedback (RLHF)

RLHF-Stage1 is supervised fine-tuning, which fine-tunes the model on the dataset mentioned above.

RLHF-Stage2 trains the reward model: the different outputs for the same prompt are manually ranked to obtain corresponding scores, which supervise the training of the reward model, as in the sketch below.
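A minimal sketch of the pairwise ranking loss commonly used for this stage; the scores are assumed to come from a reward model (not defined here) that maps a prompt–response pair to a scalar:

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # For each ranked pair, the human-preferred (chosen) output should score higher
    # than the rejected one: loss = -log(sigmoid(r_chosen - r_rejected)).
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Dummy scores for a batch of 4 ranked pairs, purely for illustration.
chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
rejected = torch.tensor([0.5, 0.1, 1.0, 0.7])
print(reward_ranking_loss(chosen, rejected))
```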

RLHF-Stage3 uses a reinforcement learning algorithm and is the most complex part of the training process:

RLHF (Reinforcement Learning from Human Feedback) fine-tunes the model with a reinforcement learning algorithm guided by the trained reward model. This stage is also very sensitive to hyperparameters, so classical reinforcement-learning practice is often combined with Bayesian optimization to find a good hyperparameter combination more efficiently. The following is a typical workflow for the reinforcement-learning fine-tuning stage:

Data preprocessing: process the data for the reinforcement learning task as needed, e.g. normalization and denoising.
Determine the hyperparameter space: specify ranges and distributions for each hyperparameter to be optimized.
Determine the evaluation metric: according to the nature and goal of the reinforcement learning task, choose an appropriate metric, such as cumulative return or average reward.
Design the search strategy: based on the evaluation metric and the shape of the hyperparameter space, choose an appropriate search strategy, such as random search, grid search, or Bayesian optimization.
Perform hyperparameter optimization: search the hyperparameter space with the chosen strategy and record the performance of each hyperparameter combination (a minimal sketch follows this list).
Analyze results: examine the performance of each combination and the relationships among hyperparameters to understand which ones have a significant impact on model performance.
Fine-tune the model: based on the analysis, select the best hyperparameter combination and run the final reinforcement-learning fine-tuning.
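An illustrative random-search loop over PPO hyperparameters for this stage; the search space, trial count, and train_and_evaluate stub are hypothetical placeholders rather than values from this post:

```python
import random

search_space = {
    "learning_rate": [1e-6, 5e-6, 1e-5, 5e-5],
    "kl_coef": [0.02, 0.05, 0.1, 0.2],
    "batch_size": [16, 32, 64],
}

def train_and_evaluate(config):
    # Placeholder: a real run would launch PPO fine-tuning with `config`
    # and return the chosen evaluation metric (e.g. mean reward on held-out prompts).
    return random.random()

def sample_config():
    return {name: random.choice(values) for name, values in search_space.items()}

results = []
for trial in range(10):
    config = sample_config()
    results.append((train_and_evaluate(config), config))

best_score, best_config = max(results, key=lambda pair: pair[0])
print(best_score, best_config)
```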

p-tuning v2 is, simply put, an improvement on the soft prompt. The original soft prompt acts only on the embedding layer; in practice, working on the embedding layer alone weakens the interaction with the model, and because all model parameters are frozen and only the inserted tokens are learned, the small amount of trainable change makes the effect unstable and sometimes worse than full fine-tuning. p-tuning v2 therefore does not target only the embedding layer but inserts continuous tokens into every layer, increasing both the amount of trainable change and the interactivity. A sketch of this deep-prompt idea follows.
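A minimal sketch of deep prompts at every layer via peft's prefix tuning, which follows the same idea as p-tuning v2 (trainable continuous tokens injected into each layer); the model name and number of virtual tokens are illustrative assumptions:

```python
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,   # continuous prompt tokens prepended at every layer
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # the base model stays frozen; only the prefix is trained
```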

The common point of these low-resource fine-tuning methods for large models is that they freeze the parameters of the large model and learn the low-rank changes produced by fine-tuning through small modules. A problem with these training methods, however, is that they are prone to catastrophic forgetting: the parameters of the model layers do not change during fine-tuning, while the small learnable module changes a great deal, which can introduce a large bias at inference time and skew the model's previous answering ability through the learnable module. When fine-tuning, one must also make sure the learnable module does not over-fit the fine-tuning data, otherwise the original pre-trained knowledge will be lost, again resulting in catastrophic forgetting.

It is best to mix general corpora into the fine-tuning corpus and fine-tune on both together, to avoid a strong bias towards the fine-tuning corpus. The InstructGPT paper also mentions that during PPO reinforcement learning the model easily fits the PPO data and loses ability on general natural language tasks, so the SFT gradient and the pre-training gradient are added to the PPO loss to alleviate this forgetting problem, as sketched below.
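A schematic sketch of that mixed objective: the PPO loss is combined with an SFT term and a pre-training language-modeling term (InstructGPT's PPO-ptx adds the pre-training gradient term for exactly this purpose). The individual loss tensors and the weighting coefficients are placeholders for illustration and must be tuned per setup:

```python
import torch

def combined_rl_loss(ppo_loss: torch.Tensor,
                     sft_loss: torch.Tensor,
                     pretrain_lm_loss: torch.Tensor,
                     alpha: float = 1.0,
                     gamma: float = 1.0) -> torch.Tensor:
    # alpha weights the SFT term, gamma weights the pre-training term;
    # both auxiliary terms pull the policy back towards its pre-RL behaviour.
    return ppo_loss + alpha * sft_loss + gamma * pretrain_lm_loss
```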

Origin blog.csdn.net/dream_home8407/article/details/129837940