In-depth analysis of the principle of LoRA technology


Editor's recommendation:

This article explains the principles and details of LoRA in depth, and also gives a detailed interpretation of the experiments in the paper.

The following article comes from the blog "Brief Notes of a Brick-Moving Great Ape", written by Meng Yuan.


Our explanation of LoRA is split into a "principles" article and a "source code" article.

In the principles article, we use diagrams to analyze in detail the core questions: how to use LoRA, why it works, and what its advantages and disadvantages are. In particular, if you have been confused about the definition and role of "rank" while learning LoRA, this article may offer some concrete interpretations.

In the source-code article, we will read Microsoft's LoRA source code together and set up a LoRA fine-tuning environment on Google Colab with free GPUs, so that everyone can run the original LoRA code themselves and deepen their understanding of how LoRA operates (happiness that costs no money is true happiness).


1. Full parameter fine-tuning

[Figure: full-parameter fine-tuning of the Transformer Q/K/V matrices]

We know that fine-tuning means taking a pretrained model, giving it data for a specific downstream task, and continuing to train on the pretrained weights until the model meets the performance bar of that downstream task. The pretrained model acts like a feature extractor: based on the experience learned from its earlier training data, it extracts effective features for us, greatly improving the training quality and convergence speed of downstream tasks.

Full fine-tuning means updating every parameter of the pretrained model while training on the downstream task. The figure gives an example of fully fine-tuning Transformer's Q/K/V matrices: each of these matrices has d*d parameters, and all of them participate in the update during fine-tuning.

A significant disadvantage of full fine-tuning is that training is expensive. For example, GPT-3 has 175B parameters; fine-tuning it is far out of reach for anyone holding a single GPU, not to mention the despair when a bug is found halfway through fine-tuning. At the same time, since the model has already digested enough data and gained enough experience during pre-training, I really just need a way to attach an extra knowledge module to the model and let this small module adapt to my downstream task, while the main body of the model stays unchanged (frozen).

How to add such a small knowledge module?

2. Adapter Tuning and Prefix Tuning

Let's look at the two mainstream parameter-efficient (local) fine-tuning methods that preceded LoRA: Adapter Tuning and Prefix Tuning. These are also the two methods the LoRA paper compares against most closely.

2.1 Adapter Tuning

[Figure: Adapter Tuning (Houlsby et al., 2019); left: a Transformer layer with Adapters inserted; right: the internal structure of an Adapter]

There are many Adapter Tuning methods. Here we cite the one proposed by Houlsby et al., 2019, which is also the first work referenced when the LoRA paper mentions this technique.

The left side of the figure is one Transformer layer; the Adapter is what we called the "extra knowledge module". The right side is the internal structure of the Adapter. During fine-tuning, all parameters except the Adapters are frozen, which effectively reduces the training cost. The internal architecture of the Adapter is not the focus of this article, so we won't introduce it here.

But this design has a significant drawback: after adding Adapters, the model becomes deeper overall, which slows down both training and inference. The reasons are:

  • Need to spend extra computing power on the Adapter

  • When we use parallel training (for example: tensor model parallelism commonly used in Transformer architecture), the Adapter layer will generate additional communication traffic and increase communication time

2.2 Prefix Tuning

[Figure: Prefix Tuning (Li & Liang, 2021); prefix tokens prepended to the input]

There are also many Prefix Tuning methods. Here we pick Li & Liang, 2021 for a brief introduction. In this work, the authors fine-tune by adding a prefix to the input data. The prefix does not have to live only at the input layer; it can also be added to the intermediate outputs of Transformer layers. Interested readers can look up the paper and dig deeper.

As shown in the figure, for a decoder-only generative model such as GPT, prefix tokens are prepended to the input sequence (two prefix tokens in the illustration). In practice, the number of prefix tokens is a hyperparameter that can be tuned according to the actual fine-tuning results. For an encoder-decoder model such as BART, prefix tokens are prepended to both x and y. During fine-tuning, we freeze the rest of the model and only train the parameters associated with the prefix tokens; each downstream task can then train its own set of prefix tokens.
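
To make this mechanism concrete, below is a minimal sketch of prepending learnable prefix embeddings to a frozen embedding layer. It is an illustration written for this article (names such as PrefixEmbedding and prefix_len are made up), not the implementation of Li & Liang, 2021:

import torch
import torch.nn as nn

class PrefixEmbedding(nn.Module):
    # Hypothetical minimal sketch: only the prefix embeddings are trainable;
    # the base embedding layer (and the rest of the model) stays frozen.
    def __init__(self, base_embedding: nn.Embedding, prefix_len: int = 2):
        super().__init__()
        self.base = base_embedding
        for p in self.base.parameters():
            p.requires_grad = False                   # freeze old knowledge
        embed_dim = base_embedding.embedding_dim
        self.prefix = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)

    def forward(self, input_ids):                     # input_ids: (batch, seq_len)
        tok = self.base(input_ids)                    # (batch, seq_len, embed_dim)
        pre = self.prefix.unsqueeze(0).expand(input_ids.shape[0], -1, -1)
        return torch.cat([pre, tok], dim=1)           # prefix tokens sit in front of x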

So what does the prefix mean? Its function is to guide the model to extract the information in x that is relevant to the task, and thereby generate y better. For example, for a summarization task, after fine-tuning the prefix can convey that "this is a summarization task" and guide the model to extract the key information from x; for a sentiment classification task, the prefix can guide the model to extract the sentiment-related semantics of x, and so on. This explanation is not entirely rigorous, but it gives a rough sense of the role of the prefix.

Although Prefix Tuning looks convenient, it has two significant disadvantages:

  • It is hard to train, and the model's performance does not strictly improve as the number of prefix parameters grows; the original paper points this out as well.

  • It reduces the effective information length of the input layer. To save compute and GPU memory, we usually fix the input sequence length; after adding the prefix, less room is left for the original text, so the expressive power of the prompt within the original text may drop.

3. What is LoRA

To sum up: full-parameter fine-tuning is too expensive, Adapter Tuning introduces training and inference latency, and Prefix Tuning is hard to train and shrinks the effective text length of the training data. Is there a fine-tuning method that improves on these shortcomings?

Driven by this motivation, the authors proposed LoRA (Low-Rank Adaptation). Let's set aside abstract terms such as "low rank" and "adaptation" for now, first look at what LoRA looks like and how to use it, and explain in the next section why "low rank" works.

3.1 LoRA Overall Architecture

[Figure: left, full-parameter fine-tuning; right, LoRA fine-tuning with the low-rank matrices A and B on a bypass]

The left side of the figure shows the scenario of "full-parameter fine-tuning". We split the fine-tuned weight into two parts:

  • W (d*d): the pretrained weights

  • ΔW (d*d): the incremental weight produced by fine-tuning

The reason for this split is that full-parameter fine-tuning can be understood as "frozen pretrained weights" plus "the weight update produced during fine-tuning". If the input is x and the output is h, then:

h = Wx + ΔWx = (W + ΔW)x

The right side of the figure shows the scenario of "LoRA fine-tuning". In LoRA, we use two matrices A and B to approximate ΔW:

  • A (r*d): a low-rank matrix, where r is called the "rank"; A is initialized from a Gaussian distribution.

  • B (d*r): a low-rank matrix; B is initialized to zero.

After this split, we rewrite ΔW as BA, which reduces the number of fine-tuned parameters from d*d to 2*r*d without changing the dimension of the output. That is, under LoRA we have:

h = Wx + BAx

In addition, the original paper mentions that the product of the two low-rank matrices is further adjusted by a hyperparameter α (a constant), but it does not say much about what α does. After reading the LoRA source code, I found that α, together with r, is multiplied onto the low-rank branch as a scaling factor, that is, the final output is:

h = Wx + (α/r)·BAx

In practice, α is usually set to a constant no smaller than r; the official LoRA source code does this, for example, when fine-tuning GPT-2 on NLG tasks. We will come back to the role of this scaling factor, and to the concrete meaning of "rank", later on.
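
To make the structure above concrete, here is a minimal sketch of a LoRA-style linear layer. It is an illustration written for this article (the class name LoRALinear and its arguments are made up), not Microsoft's implementation; it simply wires up h = Wx + (α/r)·BAx with W frozen, A Gaussian-initialized and B zero-initialized:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Minimal sketch: h = W x + (alpha / r) * B A x, with the pretrained W frozen
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 8.0):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)          # pretrained weight (old knowledge)
        self.W.weight.requires_grad = False                   # frozen during fine-tuning
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero init, so BA = 0 at the start
        self.scaling = alpha / r

    def forward(self, x):
        return self.W(x) + self.scaling * (x @ self.A.t() @ self.B.t())

Only A and B receive gradients during training; W is loaded from the pretrained checkpoint and never updated.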

Initialization methods for A and B

Note that the point of Gaussian-initializing A and zero-initializing B is to make BA = 0 at the start of training, so that the LoRA branch introduces no extra noise into the model. You might then ask: could I do it the other way round, i.e. zero-initialize A and Gaussian-initialize B? After all, it seems all that matters is that the product starts at 0.

Regarding this question, I found an answer from the LoRA author in a GitHub issue:

[Screenshot: the LoRA GitHub issue discussing swapping the initialization of A and B]

In short, the author has so far not observed a significant difference from swapping the two initializations, as long as one of the two matrices is initialized to zero and the other is not.


3.2 LoRA training and inference

In 3.1 we introduced the overall architecture of LoRA: on a bypass next to the original pretrained matrix, low-rank matrices A and B are used to approximate the incremental update ΔW. You can do this on whichever layers you want, such as the MLP weights in a Transformer, or even the Embedding weights. The LoRA paper only applies low-rank adaptation to the Attention weights, but in practice we can flexibly design experiments according to our needs and find the best adaptation scheme.

3.2.1 Training

During training, we fix the pretrained weights W and only train the low-rank matrices A and B. When saving weights, we only need to save the low-rank parts. According to the statistics in the LoRA paper, this reduces the memory consumption of fine-tuning GPT-3 175B from 1.2TB to 350GB; with r=4, the final saved checkpoint shrinks from 350GB to 35MB, which greatly reduces the training overhead.
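
As an illustration of "only saving the low-rank parts", here is a hedged sketch that filters a model's state_dict down to the LoRA weights. It assumes the hypothetical LoRALinear layer sketched in 3.1, whose low-rank parameters are named A and B:

import torch

def lora_state_dict(model):
    # Keep only the low-rank parameters; the frozen pretrained weights can be
    # reloaded from the original checkpoint, so they do not need to be saved again.
    return {k: v for k, v in model.state_dict().items()
            if k.endswith(".A") or k.endswith(".B")}

# torch.save(lora_state_dict(model), "lora_only.pt")  # a few MB instead of the full model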

Regarding training, let's look at another interesting question: overall, LoRA clearly saves GPU memory, but does LoRA save memory at every single moment of training?

Consider the backward pass. To compute the gradient of A (for ease of typesetting we temporarily ignore the α/r factor), writing the merged weight as W + BA, we have:

∂L/∂A = B^T · ∂L/∂(W + BA)

Pay attention to the term ∂L/∂(W + BA): its shape is d*d, exactly the same as the pretrained weight's. That is, to compute this gradient we need an intermediate result of the same size as in full-parameter fine-tuning. So for LoRA, the peak memory of this layer is basically on par with full fine-tuning (and slightly higher if the extra terms on the LoRA branch are counted as well).

Why, then, can LoRA still reduce the overall memory usage? Because:

  • LoRA is not applied to every layer of the model; for example, in the paper LoRA only acts on the Attention weights.

  • Although LoRA may push the peak memory of a particular layer up to the level of full fine-tuning, the intermediate result can be released once the gradient has been computed; it does not need to be kept around.

  • When the trainable weights shrink from d*d to 2*r*d, the optimizer states that must be stored (which are kept in fp32) shrink accordingly (a rough calculation is sketched below).
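
Here is a rough back-of-the-envelope sketch of that last point. The sizes d and r, and the assumption that Adam keeps two fp32 states (m and v) per trainable parameter, are illustrative only:

d, r = 4096, 8                        # illustrative sizes for a single d*d weight matrix
full_params = d * d                   # trainable parameters under full fine-tuning
lora_params = 2 * r * d               # trainable parameters under LoRA (A and B)

adam_state_bytes = 2 * 4              # two fp32 states per trainable parameter
print("optimizer state, full fine-tuning (MB):", full_params * adam_state_bytes / 1e6)  # ~134 MB
print("optimizer state, LoRA (MB):", lora_params * adam_state_bytes / 1e6)              # ~0.5 MB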

3.2.2 Inference

During inference, we merge the low-rank weights into the pretrained weights, i.e. compute W + (α/r)·BA, and then run the forward pass as usual. This does not change the model architecture at all, so there is no inference latency of the kind Adapter Tuning introduces. The figure below shows the experimental results from the paper (inference time in milliseconds): LoRA's inference speed is clearly higher than Adapter Tuning's.

[Table: inference latency in milliseconds for LoRA vs. Adapter Tuning variants, from the paper]

When switching between downstream tasks, we can flexibly split the low-rank weights back out of the merged weights. For example, we first train on downstream task A, merge the weights, and keep a separate copy of task A's low-rank weights. When switching to downstream task B, we subtract that low-rank part from the merged weights to recover the pretrained weights, and then train a new LoRA. In other words, each downstream task can have its own set of low-rank weights.
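
Below is a minimal sketch of this merge/unmerge trick, again assuming the hypothetical LoRALinear layer from 3.1:

import torch

@torch.no_grad()
def merge(layer):
    # Fold the low-rank update into the pretrained weight for zero-overhead inference:
    # W <- W + (alpha / r) * B @ A
    layer.W.weight += layer.scaling * (layer.B @ layer.A)

@torch.no_grad()
def unmerge(layer):
    # Subtract the low-rank part again before switching to another task's LoRA weights
    layer.W.weight -= layer.scaling * (layer.B @ layer.A)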

You may ask: after each fine-tuning run, do I have to merge the low-rank weights into W? Can I store the pretrained weights and the low-rank weights separately instead? Of course you can. LoRA is very flexible; you can adapt the code to your own needs and decide how to store the weights, as long as you keep one core principle: you must always be able to tell the pretrained part and the LoRA part apart. We will look at this in detail in the source-code walkthrough.

Congratulations! At this point you have mastered the architecture of LoRA. It's pretty simple, isn't it? Eager to give it a try? However, as a qualified alchemist, we need to study the principles of LoRA a bit more deeply in order to debug the training process better.

4. The principle of LoRA low-rank adaptation

In the sections above we repeatedly mentioned the concept of "rank" and explained that LoRA's rank is the hyperparameter r. We also kept emphasizing that BA is an approximation of ΔW. In this section we will look concretely at what "rank" is and explain why it is only an "approximation". Once we understand this, we can interpret the role of the hyperparameter α and develop some intuition for the alchemy.

4.1 What is rank

Let's first look at a matrix A:

A = [[1, 2, 3],
     [2, 4, 6],
     [3, 6, 9]]

In this matrix, row2 = row1 * 2, row3 = row1*3, that is to say, each row in the matrix can be represented linearly by the first row .

Let's look at another matrix B:

B = [[1, 2, 3],
     [7, 11, 5],
     [8, 13, 8]]

In this matrix, any row can always be represented by a linear combination of the other two rows .

Let's finally look at a matrix C:

C = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]

In this matrix, no row can be derived from a linear combination of the remaining rows.

By calling the np.linalg.matrix_rank function, we can compute the rank of any matrix. The ranks of the three matrices above are:

import numpy as np

A = np.array(A)
B = np.array(B)
C = np.array(C)

print("Rank of A:", np.linalg.matrix_rank(A)) # 1
print("Rank of B:", np.linalg.matrix_rank(B)) # 2
print("Rank of C:", np.linalg.matrix_rank(C)) # 3

For matrix A, the rank is 1: once you know any single row, every other row can be derived linearly from it.

For matrix B, the rank is 2: once you know any two rows, the remaining row can be derived as a linear combination of them.

For matrix C, the rank is 3: you must know all three rows to reconstruct C.

Seeing this, do you now have an intuitive feel for rank? Rank represents the amount of information in a matrix. If some dimension of the matrix can always be derived linearly from the other dimensions, then, for the model, the information in that dimension is redundant and repeatedly expressed. Cases like A and B are called rank deficient, and the case of C is called full rank. For a more rigorous mathematical definition, you can consult your linear algebra textbook (wink).

With this understanding of rank, it is natural to suspect that the incremental weight ΔW from full-parameter fine-tuning may also contain redundant information, so we may not need a full d*d matrix to represent it. Then how do we find the truly useful feature dimensions in ΔW? SVD (Singular Value Decomposition) can help us solve this problem.

4.2 SVD decomposition

[Figure: geometric illustration of SVD]

As shown in the figure, M is the matrix whose information content we want to examine. Suppose that in the feature space of the input data there is a pair of orthogonal unit vectors v1 and v2; after being transformed by M they become another pair of orthogonal vectors σ1·u1 and σ2·u2, where u1 and u2 are again orthogonal unit vectors and σ1, σ2 denote the lengths along the corresponding directions. This change can be written as:

M·(v1, v2) = (σ1·u1, σ2·u2)

With a little rewriting, we have:

M = (u1, u2) · diag(σ1, σ2) · (v1, v2)^T

It is not hard to see that the σ values carry a hint of "information content": when M projects the input onto the directions u1 and u2, a larger σ1 means the information along the u1 direction is emphasized more.

Now let's generalize a little. If we can find such a set of u's, v's and σ's, and sort the σ values from large to small, then we can decompose M and, in the process, find out exactly which feature directions M emphasizes. That is:

M = U·Σ·V^T

Once we have found U, Σ and V, we can take the rows (or columns) corresponding to the top r singular values from each of the three. This is equivalent to keeping only the few most emphasized feature dimensions, and we can then use lower-dimensional matrices to approximate M. Decomposing M along this line of thinking is called SVD (singular value decomposition). We won't describe the concrete procedure in this article; interested readers can, again, consult their linear algebra textbook.

Let's use another code example to get a more intuitive feel for this approximation. Please pay attention to the comments (example adapted from: https://medium.com/@Shrishml/lora-low-rank-adaptation-from-the-first-principle-7e1adec71541):

import torch
import numpy as np
torch.manual_seed(0)

# ------------------------------------
# n: input data dimension
# m: output data dimension
# ------------------------------------
n = 10
m = 10

# ------------------------------------
# Randomly initialize the weight W.
# W is built this way on purpose so that it is NOT full rank;
# otherwise a low-rank decomposition would be pointless.
# ------------------------------------
nr = 10
mr = 2
W = torch.randn(nr,mr)@torch.randn(mr,nr)

# ------------------------------------
# Randomly initialize the input data x
# ------------------------------------
x = torch.randn(n)

# ------------------------------------
# Compute y = Wx
# ------------------------------------
y = W@x
print("y computed with the original weight W:\n", y)

# ------------------------------------
# Compute the rank of W
# ------------------------------------
r= np.linalg.matrix_rank(W)
print("Rank of W: ", r)

# ------------------------------------
# SVD decomposition of W
# ------------------------------------
U, S, V = torch.svd(W)

# ------------------------------------
# Build the low-rank matrices A and B
# from the SVD result
# ------------------------------------
U_r = U[:, :r]
S_r = torch.diag(S[:r])
V_r = V[:,:r].t()

B = U_r@S_r # shape = (d, r)
A = V_r     # shape = (r, d)

# ------------------------------------
# Compute y_prime = BAx
# ------------------------------------
y_prime = B@A@x

print("y computed after the SVD decomposition of W:\n", y_prime)

print("Number of parameters in the original weight W: ", W.shape[0]*W.shape[1])
print("Number of parameters in the low-rank weights B and A: ", A.shape[0]*A.shape[1] + B.shape[0]*B.shape[1])

The output is:

y computed with the original weight W:
 tensor([ 3.3896,  1.0296,  1.5606, -2.3891, -0.4213, -2.4668, -4.4379, -0.0375,
        -3.2790, -2.9361])
Rank of W:  2
y computed after the SVD decomposition of W:
 tensor([ 3.3896,  1.0296,  1.5606, -2.3891, -0.4213, -2.4668, -4.4379, -0.0375,
        -3.2790, -2.9361])
Number of parameters in the original weight W:  100
Number of parameters in the low-rank weights B and A:  40

The number of parameters is reduced, yet the final output is unaffected. Hopefully this example gives everyone a better feel for what low-rank matrices can do~

4.3 LoRA low-rank adaptation

Well, since SVD is so effective, can't we just run SVD on ΔW directly, take the corresponding low-rank matrices, and be done with it?

The idea is nice, but the difficulty is obvious: doing SVD directly requires ΔW to be known, yet ΔW is precisely the weight increment produced by full-parameter fine-tuning. If you don't run full fine-tuning, how would you know what ΔW looks like? And if you have already run full fine-tuning, why would you still need low-rank adaptation?

Hmm, you might then think: can I run SVD on the pretrained weights W instead? After all, W is known.

Again a nice idea, but the logic does not hold: we said the purpose of fine-tuning is to inject new, downstream-task-related domain knowledge into the model. In other words, ΔW and W mean different things: the former is new knowledge and the latter is old knowledge. What we want to extract are the information-rich dimensions of the new knowledge ΔW.

Well, since doing the SVD directly with mathematical tools is not feasible, let the model learn the decomposition by itself! So LoRA's final low-rank adaptation strategy is: treat the rank r as a hyperparameter, and let the model learn the low-rank matrices B and A on its own. Simple and painless!
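
In code, "letting the model learn the low-rank matrices by itself" simply means that only A and B are handed to the optimizer while the pretrained weight stays frozen. A self-contained toy sketch (the sizes and the dummy regression loss are made up for illustration):

import torch
import torch.nn as nn

d, r, alpha = 768, 8, 8.0
W = nn.Linear(d, d, bias=False)                   # stands in for a pretrained weight (old knowledge)
W.weight.requires_grad = False                    # frozen
A = nn.Parameter(torch.randn(r, d) * 0.01)        # Gaussian init
B = nn.Parameter(torch.zeros(d, r))               # zero init

optimizer = torch.optim.Adam([A, B], lr=1e-4)     # only the low-rank matrices are trained

x = torch.randn(4, d)                             # dummy batch
target = torch.randn(4, d)                        # dummy target
h = W(x) + (alpha / r) * (x @ A.t() @ B.t())      # forward pass with the LoRA bypass
loss = ((h - target) ** 2).mean()
loss.backward()                                   # gradients flow only into A and B
optimizer.step()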

Okay, at this point we have a concrete understanding of LoRA's low-rank adaptation, and we also know how BA differs in meaning from ΔW. Now we can come back to the question left open earlier: what does the hyperparameter α mean?

4.4 The hyperparameter α

Let's first look at the paper's explanation of α:

[Excerpt from the paper: the authors' explanation of the scaling factor α]

This passage roughly means: when we use Adam as the optimizer, tuning α is more or less equivalent to tuning the learning rate. In general, we set α to the first r we try in our experiments, then fix it, and afterwards only adjust r. The benefit is that when we try different values of r, we don't have to re-tune the other hyperparameters again.

I don't know how you felt the first time you read this passage; I, for one, didn't understand it. I searched around and found no concrete explanation either. Only after walking through the design of LoRA's low-rank adaptation in order did I seem to grasp something. Below is my personal take.

First, recall that our output is computed as:

h = Wx + (α/r)·BAx

where W denotes the pretrained weights (old knowledge) and BA denotes the approximation of the incremental weights ΔW (new knowledge). In theory, when r is smaller, the dimensions we extract from ΔW are the most informative ones; the information is refined but not comprehensive. When r is larger, the low-rank approximation BA is closer to ΔW; the information is more comprehensive, but more noise comes with it (lots of redundant, useless information).

Based on this conjecture, in the first experiment we tend to set r as large as possible, e.g. 32 or 64, and assume that at this rank the low-rank weights already approximate ΔW well. We therefore set α = r, i.e. α/r = 1, which amounts to assuming that LoRA low-rank fine-tuning is on par with full-parameter fine-tuning.

Next, we will naturally try smaller values of r. At that point we keep α fixed, which means that as r decreases, α/r grows larger and larger. We do this because:

  • When r is smaller, the information captured by the low-rank matrices is refined but less comprehensive. By enlarging α/r we amplify the influence of the new knowledge on the model in the forward pass.

  • When r is smaller, the information captured by the low-rank matrices is more refined and carries less noise/redundancy, so the direction of gradient descent is more reliable. We can therefore take bigger steps by increasing α/r, which is effectively the same as raising the learning rate (see the small numeric sketch below).
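
A tiny numeric sketch of this "fix α, then shrink r" strategy (the values are chosen only for illustration):

alpha = 64                            # fixed to the first (large) r we tried
for r in [64, 32, 16, 8, 4]:
    print(f"r = {r:>2}, scaling alpha / r = {alpha / r}")
# As r shrinks, alpha / r grows, amplifying the (more refined) low-rank update
# in the forward pass and effectively enlarging its updates during training.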

Good. We have now covered the core idea of LoRA's low-rank adaptation. As mentioned earlier, because ΔW cannot be obtained for a direct SVD, the authors hope LoRA can "learn" the real low-rank decomposition matrices. But how do we show that what LoRA learns is related to what SVD would produce? Next, let's read the authors' experiments together.

5. LoRA experiments: verifying the effectiveness of the low-rank matrices

5.1 Overall effect

[Table: comparison of LoRA with other fine-tuning methods (full fine-tuning, Adapter Tuning, etc.) across datasets, from the paper]

First, the authors compare LoRA with other fine-tuning methods (full-parameter fine-tuning, Adapter Tuning, etc.). The rows list the different fine-tuning methods, the columns list the different datasets, and the bold entries mark the best scores. LoRA achieves strong results both on each dataset's accuracy metric and on the overall average (Avg.), while the number of trainable parameters is very small.

5.2 Verification of low-rank matrix information content

As said before, the smaller r is, the more refined, but possibly less comprehensive, the information carried by the low-rank matrices. So how large should r be?

5.2.1 Directly verify the fine-tuning effect under different r values

Although in theory we can insert low-rank adapters into any layer of the model (Embedding, Attention, MLP, and so on), the LoRA paper only inserts them into the Attention layers and runs the related experiments there (the paper also encourages readers to try more). Let's look at the experimental results on the Attention layers:

[Table: validation accuracy on WikiSQL and MultiNLI for different r and different Attention weight types, from the paper]

WikiSQL and MultiNLI are the datasets used for fine-tuning, and "Weight Type" indicates which Attention weights receive the low-rank adaptation. It can be seen that the results with a very small r are almost on par with, or even slightly better than, those with a much larger r. This further illustrates the effectiveness of "low rank". To verify this more intuitively, we next look at how much the low-rank subspaces learned with r=8 and r=64 intersect.

5.2.2 Degree of intersection of different low-rank spaces

Suppose A_{r=8} and A_{r=64} are the low-rank matrices trained with r=8 and r=64 respectively. We now want to do the following:

  • take the i most informative dimensions of A_{r=8} (where i ≤ 8);

  • take the j most informative dimensions of A_{r=64} (where j ≤ 64);

  • compute the degree of intersection between these i dimensions and those j dimensions, to judge how much information the two low-rank matrices share.

Well, how do we find the top dimensions with the most information? Don't forget we have SVD, and this time A_{r=8} and A_{r=64} are fixed, trained matrices. We can therefore run SVD on the two low-rank matrices and obtain their right singular matrices. In the previous section we used V to denote the right singular matrix, but the LoRA paper uses U, so let's follow the paper's convention and define:

  • U^i_{A_{r=8}}: the top i most informative dimensions (columns) of the right singular matrix of A_{r=8} (recall the previous section: the singular values tell us how much information each direction carries);

  • U^j_{A_{r=64}}: the top j most informative dimensions of the right singular matrix of A_{r=64}.

With these definitions in place, we can compute the degree of intersection between the i feature dimensions and the j feature dimensions. This intersection metric is also known as a (normalized) Grassmann distance:

φ(A_{r=8}, A_{r=64}, i, j) = || (U^i_{A_{r=8}})^T · U^j_{A_{r=64}} ||_F^2 / min(i, j)

From the formula above, the intersection measure (Grassmann distance) φ lies between 0 and 1, and the larger it is, the more similar the two corresponding subspaces are. Interested readers can look at the related proof in Appendix G of the paper; here we only care about the conclusion.
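
Here is a hedged sketch of how such a subspace-similarity value can be computed for two trained low-rank matrices. The function name phi and the random toy matrices are made up; the numbers in the paper come from the actual trained weights:

import torch

def phi(A1, A2, i, j):
    # Normalized similarity between the span of the top-i right-singular directions
    # of A1 and the top-j right-singular directions of A2 (a value in [0, 1]).
    _, _, V1 = torch.svd(A1)          # columns of V are the right singular vectors
    _, _, V2 = torch.svd(A2)
    Ui, Uj = V1[:, :i], V2[:, :j]
    return (torch.norm(Ui.t() @ Uj) ** 2 / min(i, j)).item()

# Toy example with random stand-ins for the trained A_{r=8} and A_{r=64}
A_r8 = torch.randn(8, 512)
A_r64 = torch.randn(64, 512)
print(phi(A_r8, A_r64, 1, 8))         # the larger, the more the two subspaces overlap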

OK, after computing this metric, let's visualize it. The authors give the following four plots:

[Figure: four heatmaps of the subspace similarity φ between A_{r=8} and A_{r=64}, from the paper]

I don't know how you felt when you first saw this figure; I, for one, didn't understand it (yes, this is starting to sound familiar). So here is my (unofficial) interpretation again.

First, the authors ran the low-rank decomposition on both ΔW_q and ΔW_v, so plots 1 and 3 form one group and plots 2 and 4 form another. Let's take plots 1 and 3.

Second, the goal of the experiment is to see how much of the information in the low-rank space (r=8) is contained in the higher-rank space (r=64), which would explain why r=8 and r=64 perform roughly the same.

So the author's logic when calculating the Grassmann distance and drawing the chart is:

  • For i = 1, I compute the similarity between U^1_{A_{r=8}} and U^j_{A_{r=64}}, so that I know how much of the single most informative dimension of A_{r=8} is contained in A_{r=64}.

  • For i = 2, I compute the similarity between U^2_{A_{r=8}} and U^j_{A_{r=64}}, so that I know how much of the top-2 most informative dimensions of A_{r=8} is contained in A_{r=64}.

  • And so on. Because what I want to verify is how much the large-rank space (r=64) contains of the small-rank space (r=8), I only draw that part and leave the rest out. That is why there is a blank area in the lower-left corner of plot 1.

  • The remaining part, i.e. how much the small-rank space contains of the top dimensions of the large-rank space, is not drawn in plot 1, but can be drawn separately in plot 3. Plot 3 is thus the filling-in of the missing lower-left corner of plot 1.

With that explained, let's look at the plots in detail. The lighter the color, the higher the similarity. In plot 1 it is not hard to see that the i=1 row has the lightest color, and the color gradually darkens as i grows. This shows that the higher-information dimensions of the small-rank space intersect more strongly with the large-rank space; they are the main reason the small-rank space can match the performance of the large-rank space, which again demonstrates the effectiveness of "low rank".

Seeing the conclusion of this plot, you may have a doubt: didn't we say A_{r=8} takes the 8 most informative dimensions and A_{r=64} takes the 64 most informative ones? Then their first 8 dimensions should be identical! So as the rank increases, shouldn't the subspace overlap keep growing? Why does the plot show it shrinking?

This is because "r=8 takes exactly the 8 most informative dimensions and r=64 takes exactly the 64 most informative dimensions" is only our ideal; it is not what the model actually learns. The model tries to learn along the most informative directions, but there is no guarantee it ends up with exactly the objective top r. We can only say that when r is relatively small, the model is more likely to get close to the true top r; when r is relatively large, the model learns some valuable information plus some noise. This experiment demonstrates exactly that.

Once we understand this, we can better interpret the next experiment: how should r be set for different parts of the model?

5.2.3 r-value settings for different layers

We saw earlier that LoRA is applied to Attention weight matrices such as W_q and W_v. For these different matrices, should the value of r be set differently?

To answer this, the authors designed another experiment: for each of these weight matrices, train the low-rank matrices twice with two different random seeds, and compute the Grassmann distance between the two resulting low-rank matrices. The results are as follows:

[Figure: subspace similarity between low-rank matrices trained with two different random seeds, from the paper]

As explained before, neither run learns the objective top-64 dimensions of information perfectly; each learns "some useful information plus some noise". Given that, it is not hard to imagine that whatever information both runs manage to learn in common is likely to be the genuinely useful part. That is why we compute the similarity between the two runs. From the left plot, the top 10 or so dimensions have the lightest color; the information within them is probably the more effective information. Based on this kind of analysis, we can also use different ranks for different parts of the model.

5.2.4 Pre-training weights VS fine-tuning weights

We said before that the pretrained weights are old knowledge and the fine-tuned incremental weights are new knowledge, so normally ΔW should emphasize some directions that W did not pay attention to. We therefore also need to verify whether the low-rank matrices we trained actually satisfy this. The authors' experimental results are as follows:

[Table: ||U^T·W_q·V^T||_F for W_q projected onto the low-rank spaces of ΔW_q, W_q itself, and a random matrix, from the paper]

Here ΔW_q denotes the increment approximated by the trained low-rank matrices, not the objectively existing ΔW of full-parameter fine-tuning discussed above.

Let's interpret this experiment:

  • First look at the bottom rows of the table, which give the (Frobenius) norms of the pretrained weight W_q and of the incremental weight ΔW_q. We can roughly read this metric as the total amount of information each of them carries.

  • Next, find the ||U^T·W_q·V^T||_F row. There are 6 values here: the first 3 come from the singular value decompositions at r=4 (of ΔW_q, of W_q itself, and of a random matrix), and the last 3 are the analogous values at r=64. The meaning of this metric is: project the pretrained weight W_q onto each of the three low-rank feature spaces and measure how much information remains after the projection; the more similar W_q is to the corresponding feature space, the larger the value.

The concepts alone may still feel a bit confusing, so let's pick some concrete numbers and interpret them:

  • First look at the pair 61.95 and 21.67. 61.95 is the amount of information carried by the pretrained weight itself, and 21.67 is the amount remaining after projecting it onto its own low-rank space (projecting onto a low-rank space inevitably loses some information).

  • Next look at the pair 0.32 and 0.02. 0.32 is the amount of information remaining after projecting the pretrained weight onto the low-rank space of the incremental weight, and 0.02 is the same quantity for a random matrix. Compared with random weights, the incremental weights that carry the new knowledge therefore still retain some correlation with the pretrained weights.

  • Finally, look at 6.91 and 0.32. After the pretrained weight is projected onto the low-rank space of the incremental weight, its information drops from 61.95 to 0.32, which shows that the distributions of the pretrained weight (old knowledge) and the incremental weight (new knowledge) still differ significantly. 6.91 is the amount of information in the incremental weight itself. The ratio 6.91 / 0.32 ≈ 21.5 can therefore be read as the degree to which the incremental weight amplifies directions that were not emphasized in the pretrained weight; the smaller the rank, the more pronounced this amplification. A toy sketch of this projection follows the list.
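
Below is a toy sketch of the projection behind this table, assuming we already have a pretrained weight W and a learned increment delta_W (random matrices here, so the numbers will not match the paper's):

import torch

def projected_info(W, basis_src, r):
    # Project W onto the top-r singular directions of basis_src and measure how much
    # "information" (Frobenius norm) survives, i.e. ||U_r^T W V_r||_F
    U, S, V = torch.svd(basis_src)
    return torch.norm(U[:, :r].t() @ W @ V[:, :r]).item()

d, r = 512, 4
W = torch.randn(d, d)                              # stand-in for the pretrained weight
delta_W = torch.randn(d, r) @ torch.randn(r, d)    # stand-in for the learned increment (rank r)

print("||W||_F:", torch.norm(W).item())                              # plays the role of 61.95
print("W in its own top-r space:", projected_info(W, W, r))          # plays the role of 21.67
print("W in delta_W's top-r space:", projected_info(W, delta_W, r))  # plays the role of 0.32
print("amplification:", torch.norm(delta_W).item() / projected_info(W, delta_W, r))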

Good! That wraps up our walkthrough of LoRA's principles. You may notice that this article spends a lot of space on the experiments. On the one hand, the experiments help us better understand the meaning and role of "low rank"; on the other hand, I personally find the LoRA experimental results not that easy to read, so I wanted to spend some time digging into them. In the next article, let's read LoRA's code implementation!


6. References

1. https://arxiv.org/pdf/2106.09685.pdf
2. https://github.com/microsoft/LoRA
3. https://medium.com/@Shrishml/lora-low-rank-adaptation-from-the-first-principle-7e1adec71541
4. https://blog.sciencenet.cn/blog-696950-699432.html
5. https://kexue.fm/archives/9590/comment-page-1
