Brief reading of the paper LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

Paper address: https://arxiv.org/pdf/2106.09685.pdf
Project address: https://github.com/microsoft/LoRA
Full text translation address: https://zhuanlan.zhihu.com/p/611557340 (I originally wanted to translate it myself, but I haven't had time recently.)

1. Key points

1.1 What is LORA?

LORA is a technique for solving the finetuning problem of large models. Training and finetuning current large models (such as GPT-3, with 175B parameters) is expensive, and a single training run can take months, which raises the entry threshold for large NLP models. The purpose of finetuning a large model is to transfer its general-domain capabilities to a professional domain (the downstream application environment), because directly training an NLP model on a professional domain risks failing to converge: professional-domain NLP applications rely on general-domain vocabulary embeddings to provide basic word-level understanding, and that embedding ability is built up by training on large-scale general-domain data before training further on the professional domain.

In layman's terms, LORA is like makeup: a person who is not good-looking (a large model that is not accurate enough in a professional domain) wants plastic surgery but has no money (cannot meet the hardware threshold for transfer learning), so he can only improve himself through makeup (training only a small set of parameters).

1.2 What does LORA solve?

1. LORA effectively reduces the cost of finetuning large models, lowering the hardware entry threshold by roughly 3x and improving training efficiency. Existing finetuning techniques for large models mainly take two forms: adapter layers, and optimizing some form of the input-layer activations (prefix/prompt tuning). Both forms modify the network details of the original large model and add parameters, which introduces inference latency.

2. LORA also solves the problem of hot-swapping capabilities when a model is deployed. While the model is running, switching between large-model capabilities only requires replacing the small set of finetuned parameters. Full models carry an enormous number of parameters: the model file of GPT-3 with 175 billion parameters is roughly 700 GB in fp32, so even at a memory transfer rate on the order of 10 GB/s, reloading the full model takes about a minute. Switching a LORA model only replaces the optimized low-rank part, which is only about 35 MB.
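As a rough sketch of the hot-swap idea (the sizes, rank, and scaling below are illustrative assumptions, not values from the paper or its codebase):

```python
import torch

d, r = 1024, 8                      # illustrative sizes, not values from the paper
W = torch.randn(d, d)               # frozen pretrained weight, kept resident in memory

# Pretend these are two trained adapters loaded from small checkpoint files.
B_a, A_a = torch.randn(d, r), torch.randn(r, d)
B_b, A_b = torch.randn(d, r), torch.randn(r, d)

def swap_adapter(W, B_old, A_old, B_new, A_new, scale=1.0):
    """Hot-swap tasks by removing the old low-rank update and adding the new one.
    Only the small B/A matrices move over the bus, never the full weight matrix."""
    W -= scale * (B_old @ A_old)
    W += scale * (B_new @ A_new)
    return W

W = W + B_a @ A_a                         # model currently serving task A
W = swap_adapter(W, B_a, A_a, B_b, A_b)   # switch to task B in place
```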

1.3 LORA’s technical solution?

1. LORA holds that an existing large model is over-parameterized (a parameter-redundant model) for a professional domain, and that there actually exists a lower intrinsic dimension that can represent all the dimensions of the large model, i.e. a low-rank matrix can stand in for the update to the original parameters. LORA trains these low-rank parameters while freezing the model's original parameters, and after training superimposes the trained low-rank matrices onto the original parameters.
This is similar in spirit to the singular value decomposition of a matrix: only the decomposed matrices are trained; the trained matrices are then multiplied to obtain a full-size update, which is superimposed onto the original model.
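To make the SVD analogy concrete, here is a small, purely illustrative sketch (sizes and rank are assumed): a low-rank product BA can approximate a full-size update, and LORA parameterizes the update in exactly this factored form, except that B and A are learned directly rather than obtained from an SVD.

```python
import torch

d, r = 512, 8                        # illustrative sizes
delta_W = torch.randn(d, d)          # some full-size update we would like to compress

# Rank-r truncation via SVD: the classic way to build a low-rank approximation.
U, S, Vh = torch.linalg.svd(delta_W)
B = U[:, :r] * S[:r]                 # d x r
A = Vh[:r, :]                        # r x d
low_rank_update = B @ A              # rank-r stand-in for delta_W

# LORA never computes this SVD: it simply parameterizes the update as B @ A
# and learns B and A by gradient descent while the original W stays frozen.
```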

The figure below shows LORA's technical scheme. The blue area represents the frozen original parameters, and the orange part represents the new parameters introduced by LORA (d is the original parameter dimension and r is the assumed low rank). A is initialized from a Gaussian distribution and B is initialized to all zeros, so the update BA starts at zero. For a d×d weight, ordinary training updates d×d parameters, whereas LORA trains only 2×d×r parameters. In practice, the update matrix W' = BA is scaled and then superimposed onto the original parameters.
[Figure: LORA architecture, with the frozen pretrained weight W (blue) in parallel with the trainable low-rank matrices A and B (orange)]
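A minimal PyTorch sketch of the structure in the figure (the class name, initialization scale, and the alpha hyperparameter are my own assumptions, not the official loralib implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base weight plus a trainable low-rank update scaled by alpha / r."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight (the blue block in the figure).
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # LoRA matrices (the orange blocks): A ~ Gaussian, B = 0, so BA starts at zero.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self):
        """Fold the trained update into the base weight so inference adds no latency."""
        self.weight += self.scale * (self.B @ self.A)
```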

2. LORA narrows the study of parameter redundancy down to the Transformer layers. Its experiments focus on the attention module, applying low-rank update training to Wq, Wk, Wv and Wo, and suggest that the closer the adapted parameters are to the attention output, the better the effect. The experiments also show that the learned update only amplifies features that are useful for the downstream task, rather than the dominant features of the pre-trained model. A sketch of applying this to the attention projections follows below.
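For reference, the project repository ships the loralib package; wiring it into the attention projections might look like the following sketch (layer sizes and rank are illustrative, and here only Wq and Wv are adapted, which is one of the configurations studied in the paper):

```python
import torch.nn as nn
import loralib as lora   # the package shipped with github.com/microsoft/LoRA

d_model, r = 768, 8      # illustrative sizes

# Attention projections: here only W_q and W_v receive low-rank updates.
attn = nn.ModuleDict({
    "W_q": lora.Linear(d_model, d_model, r=r),
    "W_k": nn.Linear(d_model, d_model),
    "W_v": lora.Linear(d_model, d_model, r=r),
    "W_o": nn.Linear(d_model, d_model),
})

# Freeze everything except the LoRA matrices before finetuning.
lora.mark_only_lora_as_trainable(attn)

# At deployment time only this small state dict needs to be saved or swapped.
lora_weights = lora.lora_state_dict(attn)
```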

2. Key points from the original text

2.1 Low-rank parameterized update matrix

The content is referenced from https://zhuanlan.zhihu.com/p/611557340.
The original transfer learning tunes W_0 directly; the result of the tuning is defined as ΔW, and ΔW has the same number of parameters as W_0. LORA decomposes ΔW into the product of two matrices, ΔW = BA. Assuming the original W is d×k with intrinsic rank r, B is d×r and A is r×k, so training BA involves d×r + r×k = r×(d + k) parameters.
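A quick back-of-the-envelope check, assuming d = k = 12288 (roughly GPT-3's hidden size) and an intrinsic rank r = 4:

```python
d = k = 12288                 # roughly GPT-3's hidden size (assumed for illustration)
r = 4                         # assumed intrinsic rank

full_update = d * k           # 150,994,944 trainable parameters per weight matrix
lora_update = r * (d + k)     # 98,304 trainable parameters per weight matrix

print(full_update / lora_update)   # 1536.0 -> over a thousand times fewer parameters
```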

2.2 Experimental results of LORA

[Figures: experimental result tables from the original paper, omitted here]

2.3 Effectiveness of low-rank structures

Low-rank structures are very common in machine learning; many machine learning problems have some intrinsic low-rank structure. Moreover, for many deep learning tasks, especially those with heavily over-parameterized neural networks, the learned network tends to exhibit low-rank properties after training. Some previous works even explicitly impose a low-rank constraint while training the original network; however, to the best of our knowledge, none of these works considers a low-rank update to a frozen model for adapting it to downstream tasks. In the theoretical literature, neural networks are known to outperform other classical learning methods, including the corresponding (finite-width) neural tangent kernels, when the underlying concept class has a certain low-rank structure. Another theoretical result, by Allen-Zhu & Li (2020b), shows that low-rank adaptation can be useful for adversarial training.

The low-rank structure revealed here is somewhat similar to Criss-Cross Attention, and it also shares some ideas with depthwise separable convolution. LORA limits the finetuning space of the model to a low-rank subspace of the original parameters (this necessarily affects model performance, but transferring a general model to a professional domain is, in essence, narrowing the capability range of the original model), optimizes the parameters within that low-rank subspace, and then applies the result back to the original parameter space.

Original post: blog.csdn.net/a486259/article/details/132767182