This article covers DiffPruning and BitFit among the selective methods in PEFT, LoRA and AdaLoRA among the reparameterization-based methods, and MAM Adapters and UniPELT among the hybrid methods. The taxonomy follows the PEFT survey "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning".
4. Selective Methods
Reference: Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning
Selective methods fine-tune a subset of the model's existing parameters. The subset can be chosen by layer depth, by layer type, or even at the level of individual parameters.
4.1 DiffPruning(2020.10)
Adapter Tuning works by inserting task-specific residual modules between layers of the model and optimizing only those residual modules. Since the residual module has fewer parameters (about 3.6%), fine-tuning costs are lower.
The Diff pruning method proposed in this paper is similar to Adapters, but instead of modifying the structure of the model, it extends the base model with a task-specific diff vector and needs to fine-tune only about 0.5% of the pre-trained parameters. That is, Diff pruning expresses fine-tuning for a specific task as learning a diff vector $\delta_{task}$ that is added to the pretrained model parameters $\theta_{pretrained}$:
$$\theta_{task} = \theta_{pretrained} + \delta_{task}$$
The diff vector is regularized with a differentiable approximation of the L0-norm penalty to encourage sparsity (see Zhihu post for details).
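The relation between the task parameters, the frozen pretrained parameters and the diff vector can be illustrated with a minimal NumPy sketch. The vector size, the random values and the hard top-k mask are illustrative assumptions: the paper learns the sparsity pattern through a relaxed L0 penalty, whereas a fixed magnitude mask is used here only as a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical flattened parameter vector of a pretrained model (kept frozen).
theta_pretrained = rng.normal(size=1000)

# Task-specific diff vector; in Diff pruning this is the only part that is trained.
delta_task = rng.normal(scale=0.01, size=1000)

# Stand-in for the learned sparsity: keep the 0.5% largest-magnitude entries.
k = delta_task.size // 200
keep = np.argsort(np.abs(delta_task))[-k:]
mask = np.zeros_like(delta_task)
mask[keep] = 1.0
delta_sparse = delta_task * mask

# Task parameters are the frozen base plus the sparse diff.
theta_task = theta_pretrained + delta_sparse

changed = int(np.count_nonzero(theta_task != theta_pretrained))
print(f"{changed} of {theta_task.size} parameters differ from the pretrained model")
```

Only the sparse diff has to be stored per task, which is where the storage saving comes from.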
Prompt tuning fine-tunes a set of soft prompt tokens, whereas DiffPruning freezes most of the language model parameters and fine-tunes only an added diff vector. The underlying idea is essentially the same.
4.2 BitFit(2021.6)
Ideally, we would like to have an efficient fine-tuning method that satisfies the following conditions:
- Match the performance of full fine-tuning.
- Change only a small set of model parameters.
- Allow tasks to arrive in a stream, rather than requiring simultaneous access to all datasets.
- Keep the set of changed parameters consistent across different downstream tasks, which is convenient for efficient hardware deployment.
Ben-Zaken et al. (2021) propose to fine-tune only the bias of the network. That is, in each linear or convolutional layer, the weight matrix W remains unchanged, and only the bias vector b is optimized. The pseudocode is as follows:
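import torch

# Assumes `model` is an existing torch.nn.Module. Freeze everything except bias terms.
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(params, lr=1e-3)  # optimizer and learning rate are illustrative choices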
BitFit updates only about 0.05% of the model parameters. The original paper shows that the method achieves performance similar to or better than full fine-tuning on BERT models (fewer than 1 billion parameters) in low- and medium-data regimes. On larger networks such as T0-3B or GPT-3, however, BitFit underperforms full fine-tuning and other PEFT methods.
For the Transformer model, most of the Transformer encoder parameters are frozen, and only the bias parameters and the task-specific classification layer are updated. The bias parameters involved are the biases used to compute the query, key, and value in the attention module, the bias used when merging the outputs of the attention heads, the biases in the MLP layers, and the bias parameters in the LayerNorm layers.
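To get a feel for how small the bias terms are, the sketch below counts the weights and biases of one Transformer encoder layer using the bias locations listed above. The hidden size of 768 and the 4x FFN expansion are assumed BERT-base-like values, and embeddings and the classification head are ignored, so the resulting fraction is only indicative and will not match the figure reported for a whole model.

```python
def encoder_layer_param_counts(d: int, ffn_mult: int = 4):
    """Return (weight_count, bias_count) for one Transformer encoder layer."""
    d_ff = ffn_mult * d
    # Attention: query, key, value and output projections, each d x d with a d-dim bias.
    attn_weights, attn_biases = 4 * d * d, 4 * d
    # FFN: d -> d_ff and d_ff -> d, with biases of size d_ff and d.
    ffn_weights, ffn_biases = 2 * d * d_ff, d_ff + d
    # Two LayerNorms, each with a d-dim scale (counted as weight) and a d-dim bias.
    ln_weights, ln_biases = 2 * d, 2 * d
    weights = attn_weights + ffn_weights + ln_weights
    biases = attn_biases + ffn_biases + ln_biases
    return weights, biases


weights, biases = encoder_layer_param_counts(768)
bias_fraction = biases / (weights + biases)
print(f"biases: {biases}, fraction of layer parameters: {bias_fraction:.4%}")
```

Because the weight count grows with d squared while the bias count grows with d, the fraction shrinks further as the hidden size increases.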
Comparing BitFit, Adapter and Diff-Pruning on the GLUE benchmark with the BERT-Large model shows that:
- Although BitFit uses far fewer parameters than Adapter and Diff-Pruning, its results are on par with theirs, and even better on some tasks.
- Although BitFit does not match full fine-tuning, it is far better than the Frozen baseline, in which all model parameters are fixed.
In addition, comparing the parameters before and after BitFit training shows that the bias of the query projection and the bias of the first FFN layer (where the feature dimension is expanded from N to 4N) change the most. Updating only these two groups of biases also gives good results, whereas fixing either one of them causes a large drop in performance.
4.3 Freeze and Reconfigure (FAR,2022)
FAR (Vucetic et al., 2022) selects the columns of the parameter matrix for pruning and reconfigures the linear layer to a trainable and frozen state. The method is divided into two phases.
- Phase 1: Determine the most important row in the parameter matrix to update. The process is similar to structured pruning, and any pruning method can be used.
- Phase 2: Split each parameter matrix $W$ into a trainable part $W_t$ and a frozen part $W_f$, do the same for the biases, then concatenate the results to reconfigure the network.
The pseudo code of the whole method is as follows:
import torch

def far_layer(x, W_t, W_f):
    h1 = x @ W_t  # W_t: trainable part of the weights
    h2 = x @ W_f  # W_f: frozen part of the weights
    return torch.cat([h1, h2], dim=-1)
The original paper focuses on edge scenarios and uses DistilBERT (66M) in experiments. FAR is applied only to the feed-forward layers, since these layers account for most of the parameters of DistilBERT. The authors show that FAR achieves performance similar to full fine-tuning on five GLUE tasks and SQuAD 2.0 (Rajpurkar et al., 2018) while updating only 6% of the parameters.
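The Phase 2 reconfiguration can be sketched in NumPy as follows. This is a minimal illustration under the assumption that the trainable part is a subset of the output columns of W and that the original column order is restored after concatenation. The column choice is random rather than produced by a pruning criterion, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_trainable = 8, 6, 2

W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)

# Phase 1 stand-in: pick the trainable output columns (random here, not a real pruning score).
perm = rng.permutation(d_out)
train_idx, frozen_idx = perm[:n_trainable], perm[n_trainable:]

# Phase 2: split weights and biases into trainable and frozen parts.
W_t, W_f = W[:, train_idx], W[:, frozen_idx]
b_t, b_f = b[train_idx], b[frozen_idx]


def far_layer(x):
    h1 = x @ W_t + b_t  # trainable part
    h2 = x @ W_f + b_f  # frozen part
    return np.concatenate([h1, h2], axis=-1)


x = rng.normal(size=(3, d_in))
out = far_layer(x)

# Undo the column permutation to compare with the original dense layer.
order = np.concatenate([train_idx, frozen_idx])
restored = np.empty_like(out)
restored[:, order] = out
print(np.allclose(restored, x @ W + b))
```

Before any training step the reconfigured layer computes the same function as the original one. Only W_t and b_t would receive gradient updates.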
4.4 FishMask (omitted)
5. Reparameterization-based Methods
5.1 Intrinsic SAID(2020.12)
《Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning》
While pretrained language models can be fine-tuned to produce state-of-the-art results in a wide range of language understanding tasks, the dynamics of this process are not fully understood, especially in low-data situations. Why can we fine-tune a model with hundreds of millions of parameters using a relatively traditional gradient descent algorithm (e.g., without strong regularization) and only use a dataset of hundreds or thousands of labeled examples?
Aghajanyan et al. (2020) show that common pre-trained models have a very low intrinsic dimensionality, so there exists a low-dimensional reparameterization in which fine-tuning is as effective as fine-tuning in the full parameter space.
Specifically, they use the Fastfood transform to reparameterize the update of the model weights. Their results show that larger models need changes in a lower-dimensional subspace than smaller models to reach the same fine-tuning performance. This observation motivates the focus on large models and on parameter-efficient fine-tuning.
Although the number of parameters that can be optimized is low, Fastfood's memory complexity and updates to all model parameters make Intrinsic SAID impractical for fine-tuning large networks.
5.2 LoRA(2021.6)
《LoRA: Low-Rank Adaptation of Large Language Models》、Microsoft/LoRA、stanford_alpaca
5.2.1 Background
Neural networks contain many dense layers that perform matrix multiplication. The weight matrices in these layers usually have full rank. The Intrinsic SAID work shows that although a pre-trained model has a large number of parameters, the intrinsic dimension of each downstream task is not large, and the model can still learn effectively even when randomly projected into a smaller subspace. In other words, in theory we can fine-tune a very small number of parameters and still achieve good results on downstream tasks.
Inspired by this, the authors hypothesize that the weight updates during adaptation also have a low intrinsic rank. For a pre-trained weight matrix $W_0\in \mathbb{R}^{d\times k}$, we do not fine-tune $W_0$ directly but instead fine-tune an increment $\Delta W$ to update the model.
5.2.2 Algorithms
Specifically, a bypass branch (similar to a plug-in) is added next to the original pre-trained language model (PLM), and the intrinsic rank is simulated by the product of two matrices A and B. The input and output of the bypass have dimension d, matching the pre-trained layer. The first layer projects from dimension d down to dimension r through a fully connected layer, and the second layer maps from r back to dimension d through another fully connected layer, where r << d.
Here r is the rank of the matrix, so the parameter count of the d x d matrix becomes d x r + r x d, which is a large reduction. This step is called low-rank decomposition.
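The reduction from d x d to d x r + r x d is easy to quantify. The sketch below computes both counts for one layer. The values d = 4096 and r = 8 are illustrative assumptions, not settings taken from the paper.

```python
def lora_param_counts(d: int, r: int):
    """Parameters in a dense d x d update versus its rank-r factorization (d x r plus r x d)."""
    full = d * d
    low_rank = d * r + r * d
    return full, low_rank


full, low_rank = lora_param_counts(d=4096, r=8)
ratio = low_rank / full
print(f"full: {full}, low-rank: {low_rank}, ratio: {ratio:.4%}")
```

The ratio equals 2r / d, so for a fixed rank the saving grows with the layer width.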
After adding the plug-in layer, the forward propagation can be expressed by the formula:
$$h = W_{0}x + \Delta Wx = W_{0}x + BAx$$
where $W_{0}\in \mathbb{R}^{d\times k}$, $B\in \mathbb{R}^{d\times r}$, and $A\in \mathbb{R}^{r\times k}$.
The whole process is expressed in pseudocode as follows:
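def lora_linear(x, W, W_A, W_B, scale):
    # Row-vector convention: x is (batch, d), W_A is (d, r), W_B is (r, d).
    h = x @ W  # frozen pre-trained weights
    h = h + scale * (x @ W_A @ W_B)  # low-rank update; only this branch is scaled (scale = alpha / r)
    return h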
Matrix A is initialized from a Gaussian distribution and matrix B is initialized to zero. This ensures that BA = 0 at the start of training, so the new path initially has no effect on the model output.
At inference time the two branches can simply be merged: $h = W_0x + BAx = (W_0 + BA)x$. After training, the product $BA$ is added to the original weight matrix $W_0$, and the sum replaces the PLM's $W_0$ as the new weight. Inference therefore requires no additional compute.
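The initialization and the merge can be checked numerically with a small NumPy sketch. The dimensions and the random values standing in for a trained B are illustrative assumptions, and the scaling factor is left out to match the formula in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 12, 4

W0 = rng.normal(size=(d, k))  # frozen pre-trained weight
A = rng.normal(size=(r, k))   # Gaussian initialization
B = np.zeros((d, r))          # zero initialization, so B @ A == 0 at the start

x = rng.normal(size=k)

# At initialization the LoRA branch contributes nothing.
h_init = W0 @ x + B @ (A @ x)

# Stand-in for training: give B some nonzero values.
B = rng.normal(scale=0.1, size=(d, r))

# Two-branch forward pass versus a single merged weight.
h_two_branch = W0 @ x + B @ (A @ x)
W_merged = W0 + B @ A
h_merged = W_merged @ x
print(np.allclose(h_two_branch, h_merged))
```

Since W_merged has the same shape as W0, the merged model has exactly the architecture and inference cost of the original one.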
5.2.3 Experiment
- Comparing the performance of other PEFT methods
- Fine-tune weight selection
The weight matrices of the Transformer include:
- Attention module:
- $W_q$, $W_k$, $W_v$, used to compute the query, key, and value
- $W_o$, applied when the multi-head attention outputs $head_1 \dots head_n$ are concatenated
- The weight matrices of the MLP layer
LoRA is applied only to the 4 weight matrices in the attention module, and ablation experiments show that adapting $W_q$ and $W_v$ together gives the best results.
In addition, covering more types of weight matrices matters more than increasing the rank r, and a larger r does not necessarily cover a more meaningful subspace.
- Choice of rank: r is usually set to 4, 8, or 16.
5.3 AdaLoRA(2023.3)
Paper: 《Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning》、QingruZhang/AdaLoRA
5.3.1 Background
Earlier approaches, which either add small modules (such as Adapter tuning) or learn incremental updates for downstream tasks, have some problems:
- Adding small network modules: Small modules are added to the PLM and only these modules are fine-tuned for each task, while the base model stays unchanged and can be shared by all tasks. Only a small number of task-specific parameters need to be introduced and updated to adapt to a downstream task, which greatly improves the practicality of pre-trained models. Examples are Adapter tuning, Prefix tuning, and Prompt Tuning. These methods greatly reduce memory consumption, but they have drawbacks: Adapter tuning introduces inference latency, while Prefix tuning and Prompt tuning optimize the prefix or prompt directly, which is non-monotonic, hard to converge, and consumes input tokens.
- Incremental updates for downstream tasks: The increment to the pre-trained weights is modeled without modifying the model architecture, i.e., W=W0+△W. Examples are Diff pruning and LoRA. These methods can achieve almost the same performance as full fine-tuning, but they also have drawbacks. Diff pruning needs low-level support to accelerate unstructured sparse matrix computation and cannot use existing frameworks directly. It also has to store the complete ∆W matrix during training, so its computational cost is no lower than that of full fine-tuning. LoRA requires the intrinsic rank r of every incremental matrix to be specified in advance and to be the same, ignoring that the importance of the weight matrices differs significantly across modules and layers when fine-tuning a pre-trained model. It also trains only the attention weights and not the FFN, although the FFN is in fact more important.
To summarize the issues above:
- The rank of the matrices should not be fixed in advance. The rank r of each incremental matrix needs to be updated dynamically, since the importance of the weight matrices varies significantly across modules and layers.
- The more important matrices need to be identified and given more parameters, and the unimportant ones pruned. Identifying important matrices improves model quality, while pruning unimportant ones reduces the number of parameters and computation and lowers the risk of degrading the model.
To bridge this gap, the authors propose AdaLoRA, which adaptively allocates parameter budgets among weight matrices according to their importance scores.
5.3.2 Algorithms
AdaLoRA is an improvement on LoRA that dynamically allocates the parameter budget to weight matrices based on importance scores. The specific method is as follows:
- Adjust the rank allocation of the incremental matrices. AdaLoRA assigns high ranks to critical incremental matrices to capture finer, task-specific information, and lower ranks to less important matrices to prevent overfitting and save computational budget.
- Incremental updates are parameterized in the form of a singular value decomposition, and unimportant singular values are pruned according to an importance score while the singular vectors are kept. Pruning singular values reduces the parameter budget of a matrix without running an exact SVD, which is computationally expensive for large matrices. Keeping the singular vectors preserves the possibility of recovering a pruned component later and stabilizes training.
$$W = W^{(0)} + \Delta = W^{(0)} + P\Lambda Q$$
where $P\in \mathbb{R}^{d_{1}\times r}$ and $Q\in \mathbb{R}^{r\times d_{2}}$ hold the left and right singular vectors of $\Delta$, and $\Lambda \in \mathbb{R}^{r\times r}$ is a diagonal matrix holding the singular values.
- An extra penalty term is added to the training loss to regularize the singular matrices P and Q toward orthogonality, thus avoiding the heavy computation of SVD and stabilizing training.
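The three ingredients above (the $P\Lambda Q$ parameterization, pruning of singular values, and the orthogonality penalty) can be sketched in NumPy. The penalty is written as $\|P^{\top}P - I\|_F^2 + \|QQ^{\top} - I\|_F^2$. The sizes and values are illustrative assumptions, and the magnitude of each singular value is used as a stand-in for the importance score, which the paper computes differently.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 10, 8, 4

W0 = rng.normal(size=(d1, d2))
P = rng.normal(size=(d1, r))            # left singular vectors (learned, not exactly orthogonal)
Q = rng.normal(size=(r, d2))            # right singular vectors
lam = np.array([0.9, 0.05, 0.4, 0.01])  # diagonal of Lambda


def orthogonality_penalty(P, Q):
    """||P^T P - I||_F^2 + ||Q Q^T - I||_F^2, which is zero when P and Q are orthonormal."""
    eye = np.eye(P.shape[1])
    return np.sum((P.T @ P - eye) ** 2) + np.sum((Q @ Q.T - eye) ** 2)


def prune_singular_values(lam, budget):
    """Zero all but the `budget` largest-magnitude singular values; P and Q stay untouched."""
    keep = np.argsort(np.abs(lam))[-budget:]
    pruned = np.zeros_like(lam)
    pruned[keep] = lam[keep]
    return pruned


lam_pruned = prune_singular_values(lam, budget=2)
W = W0 + P @ np.diag(lam_pruned) @ Q

penalty_random = orthogonality_penalty(P, Q)
# An orthonormal pair (from QR decompositions) gives a penalty of about zero.
P_orth = np.linalg.qr(rng.normal(size=(d1, r)))[0]
Q_orth = np.linalg.qr(rng.normal(size=(d2, r)))[0].T
penalty_orth = orthogonality_penalty(P_orth, Q_orth)
```

Because the pruned entries of Lambda are only set to zero while P and Q are kept, a pruned component can be reactivated later simply by letting its singular value become nonzero again.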
5.3.3 Experiment
It is experimentally demonstrated that AdaLoRA achieves better or comparable performance to existing methods on all budgets and all datasets. For example, when the parameter budget is 0.3M, AdaLoRA is 1.8% higher than the best-performing baseline (Baseline) on the RTE dataset.
6. Hybrid Methods
6.1 SparseAdapter (omitted)
6.2 MAM Adapters(2021.10)
6.2.1 Background
Recent studies have proposed a variety of parameter-efficient transfer learning methods that achieve robust performance with only a small number of (extra) parameters fine-tuned. Although these approaches are effective, little is known about the key factors for success and how the various approaches are linked.
For example, the figure below compares different fine-tuning methods on the XSum English summarization task. It plots the ROUGE-2 score (the evaluation metric for this task, higher is better) against the number of fine-tuned parameters, expressed as a percentage of the parameters updated in full fine-tuning. The upper-left corner of the figure is the ideal region, and the figure shows that Adapter, Prefix Tuning, and LoRA are the better-performing methods.
Figure 1: The Transformer architecture and several state-of-the-art parameter-efficient tuning methods. Blocks with dashed borders denote modules added by these methods.
Figure 2: Performance of different methods on the XSum summarization task.
The mathematical representation of these three methods is organized as follows:
Why do Adapter, Prefix Tuning, and LoRA achieve similar results even though they look different in structure and formula, Prefix Tuning in particular?
6.2.2 Further Research on Prefix Tuning
Prefix Tuning prepends l tunable prefix vectors to the keys and values of the multi-head attention in every layer. Specifically, two sets of prefix vectors $P_{k},P_{v}\in \mathbb{R}^{l\times d}$ are concatenated with the original key K and value V, and multi-head attention is then computed over the new prefixed keys and values. The computation of the i-th head of multi-head attention becomes:
Prompt-tuning simplifies prefix tuning by prepending the prefix only to the input word embeddings of the first layer. P-tuning is similar work. Below, the authors derive an equivalent form of Equation 5 and provide an alternative view of prefix tuning.
where λ(x) is a scalar denoting the sum of normalized attention weights over the prefix. The first term in Equation 7, $Attn(xW_q, CW_k, CW_v)$, is the original attention without a prefix, while the second term is a position-wise modification that is independent of C. Equation 7 provides another view of prefix tuning, which essentially applies a position-wise modification to the original head attention output h through linear interpolation:
Redefining $W_{1}=W_{q}P_{k}^{\top}$, $W_{2}=P_{v}$, and $f=\mathrm{softmax}$, formula 9 can be rewritten as:
The formula obtained from this point of view is very similar to the Adapter formula $h\leftarrow h+f(h\cdot W_{down})\cdot W_{up}$, except that prefix tuning performs a weighted addition whereas the adapter does not. Figure 3b shows the computational graph of prefix tuning from this perspective, which allows prefix tuning to be abstracted as an adapter-like plug-in module.
Furthermore, when l is small, $W_1\in\mathbb{R}^{d_h\times l}$ and $W_2\in\mathbb{R}^{l\times d_h}$ are low-rank matrices, so they play a role similar to the $W_{down}$ and $W_{up}$ matrices in the adapter. This view also suggests that the number of prefix vectors l plays a role similar to the bottleneck dimension r in the adapter: both impose a rank constraint on the computation of the modification vector ∆h. We therefore also refer to l as the bottleneck dimension.
The rank constraint means that for any x, ∆h is a linear combination of the same l (or ≤ l) basis vectors.
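The equivalence described in this section can be checked numerically. The sketch below computes one attention head with prefixed keys and values, and compares it with the interpolated form $(1-\lambda(x))\,h + \lambda(x)\,f(xW_1)W_2$ built from the definitions of $W_1$, $W_2$, $f$ and λ(x) given in the text. All sizes and random values are illustrative assumptions, and the $1/\sqrt{d_h}$ attention scaling is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_h, m, l = 16, 8, 5, 3  # model dim, head dim, context length, prefix length

x = rng.normal(size=d)        # query token
C = rng.normal(size=(m, d))   # context tokens
W_q, W_k, W_v = (rng.normal(size=(d, d_h)) for _ in range(3))
P_k, P_v = rng.normal(size=(l, d_h)), rng.normal(size=(l, d_h))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def attn(q, K, V):
    """Single-query attention without the 1/sqrt(d_h) scaling."""
    return softmax(q @ K.T) @ V


q = x @ W_q
# Prefix tuning: prepend the prefix vectors to the keys and values.
head_prefix = attn(q, np.concatenate([P_k, C @ W_k]), np.concatenate([P_v, C @ W_v]))

# lambda(x): share of the normalized attention weight that falls on the prefix.
scores = np.concatenate([q @ P_k.T, q @ (C @ W_k).T])
lam = softmax(scores)[:l].sum()

# Adapter-like view: interpolate the prefix-free head output with f(x W_1) W_2.
h = attn(q, C @ W_k, C @ W_v)
W_1, W_2 = W_q @ P_k.T, P_v
head_interp = (1 - lam) * h + lam * (softmax(x @ W_1) @ W_2)
print(np.allclose(head_prefix, head_interp))
```

The modification term softmax(x W_1) W_2 is a combination of the l rows of P_v, which is the rank constraint mentioned above.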
6.2.3 Unified framework of PEFT
In the previous section, transforming Prefix Tuning showed that the formulas of Prefix Tuning and Adapters are highly similar. Going further, the authors deconstruct the state-of-the-art PEFT methods and propose a unified framework that establishes the connections between them. Specifically, the methods are redefined as modifications (∆h) to specific hidden states in the pretrained model, and a set of design dimensions is defined along which the methods vary, including the function that computes the modification and the position where the modification is applied.
The figure below analyzes the similarities in internal structure and insertion form among different fine-tuning methods. It shows the structure of Adapter, Prefix Tuning, and LoRA together with the new variants Parallel Adapter and Scaled PA, which were designed by swapping some elements and did not appear in previous work.
The following table shows the comparison of the efficient fine-tuning methods Adapter, Prefix Tuning, LoRA and new variants in each dimension.
- ∆h functional form: the specific function that computes ∆h. This is the part that is learned. The functional forms of all these methods resemble a proj_down → nonlinear → proj_up architecture, where the nonlinearity degenerates into the identity function in LoRA.
- Insertion form: how the added module is inserted into the network.
- Modified representation: which hidden representation in the PLM the new module modifies.
- Composition function: how the modification vector ∆h is composed with the original hidden representation h to form the new hidden representation. For example, adapters perform simple additive composition, prefix tuning uses gated additive composition, and LoRA scales ∆h by a constant factor and adds it to the original hidden representation.
Among these, the functional form of the newly added trainable parameters is the part that is learned (note: for Prefix Tuning this is the transformed form derived above), the insertion form can be sequential or parallel, and the modified representations are in the Attention and FFN layers.
This unified framework makes it possible to study parameter-efficient fine-tuning methods along these design dimensions, identify the key design choices, and transfer design elements between approaches. Based on this, the authors build a new parameter-efficient fine-tuning method, MAM Adapter, which fine-tunes fewer parameters than previous methods while being more effective, achieving results comparable to fine-tuning all parameters on all four tasks.
6.2.4 Transfer design elements
Figure 3 and Table 1 include several new methods, obtained by transferring design elements between existing methods under the unified view above:
- Parallel Adapter: a variant obtained by transferring the parallel insertion of prefix tuning to the adapter. Interestingly, while the authors proposed the Parallel Adapter because of its similarity to prefix tuning, concurrent work independently proposed this variant and studied it empirically.
- Multi-head Parallel Adapter: a further step to make the adapter more similar to prefix tuning, in which a parallel adapter is applied to modify the head attention outputs, as prefix tuning does.
- Scaled Parallel Adapter: a variant obtained by transferring the composition and insertion form of LoRA into the adapter, as shown in Fig. 3e.
6.2.5 MAM Adapters
The author conducted a detailed investigation on the placement of the Adapter and the soft prompt. The following conclusions are drawn (see the experimental part of the paper for details):
- Scaled parallel adapter is the best variant to modify FFN . The Adapter placed in parallel is better than the Adapter placed in sequence, and the Adapter placed in parallel with FFN is better than the Adapter placed in parallel with multi-head attention (MHA) (as shown in the figure below, blue indicates modification of Attention, red indicates modification of FFN).
- Modifying head attention gives the best results when the parameter budget is very small, while the FFN makes better use of modification at larger capacity.
- Soft prompts such as prefix tuning can achieve strong performance by changing only 0.1% of parameters.
Based on this, the authors propose MAM (mix-and-match). The final model, MAM Adapter, combines a parallel adapter in the FFN layer with soft prompts. Specifically, prefix tuning with a small bottleneck dimension (l=30) is used in the attention sublayer, and more of the parameter budget is allocated to modifying the FFN representation with the scaled parallel adapter (r=512).
Table 6 compares the MAM Adapter with various parameter-efficient tuning methods. For completeness, Table 6 also shows the results of other combinations: parallel adapters in both the attention and FFN layers, and prefix tuning (attn) combined with LoRA (ffn). Both combinations improve over their respective prototypes.
The experimental results show that the MAM Adapter achieves results similar to full fine-tuning on both the XSum and MT tasks while tuning only 6.7% of the parameters (relative to full fine-tuning). It greatly outperforms BitFit and Prompt Tuning, and consistently outperforms LoRA, Adapter and Prefix Tuning.
6.3 UniPELT(2021.10)
《UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning》
6.3.1 Background
In recent years, many parameter-efficient language model tuning (PELT) methods have emerged. With a greatly reduced number of trainable parameters, they reach results on par with full fine-tuning. However, different PELT methods can perform very differently on the same task, which makes choosing the appropriate method for a specific task cumbersome.
Based on this, the authors propose UniPELT, which treats different PELT methods as submodules and learns, through a gating mechanism, to activate the method best suited to the current data or task.
6.3.2 Model structure
UniPELT is a gated combination of LoRA, Prefix Tuning and Adapter, where:
- LoRA: through low-rank decomposition, optimizing the pre-trained parameters $W_0$ is replaced by optimizing the low-rank matrices $W_{down}, W_{up}$, i.e. the parameter matrices $W_B, W_A$.
- Prefix Tuning: in the multi-head attention of each layer, l tunable prefix vectors are prepended to the keys and values. Specifically, two sets of prefix vectors $P_{k},P_{v}\in \mathbb{R}^{l\times d}$ are concatenated with the original key K and value V, and multi-head attention is then computed over the new prefixed keys and values.
- Adapter: an Adapter module is added after the feed-forward sublayer of each Transformer block.
These three modules are then combined, and each is controlled by a gating mechanism (implemented as a linear layer): the gate GP switches the Prefix-tuning method, GL switches the LoRA method, and GA switches the Adapter method. The trainable parameters (shown in blue in the figure) are LoRA's reparameterization matrices $W_B, W_A$, the prefix tuning parameters $P_k, P_v$, the Adapter parameters, and the gate function weights. The whole structure is shown in the figure below:
6.3.3 Experiment
- Low-data performance comparison
UniPELT shows a significant improvement over LoRA, Adapter, and Prefix Tuning used individually in low-data scenarios with only 100 examples. With more data, UniPELT performs comparably to or better than these methods.
- High-data performance comparison
- Table 3 lists the performance of different methods when using all training samples. UniPELT is still the best overall, but its advantage is smaller than in the low-resource setting. This is understandable, since existing PELT methods usually perform comparably to full fine-tuning given sufficient training data, which leaves less room for improvement.
- Furthermore, simply combining multiple PELT methods without gating (UniPELT-NoGate) does not perform well in high-resource settings (its average score is 0.89 lower than that of UniPELT).
- Number of trainable parameters and training/inference time comparison
- Trainable parameters: LoRA, BitFit, and Prefix-tuning use relatively few, while UniPELT uses relatively more.
- Training time: UniPELT takes somewhat longer to train than the earlier fine-tuning methods, but the overhead is still acceptable.
- Inference time: BitFit adds the least, while UniPELT increases inference time by 27%.
Table 4: Comparison of the number of trainable parameters and training/inference time of various PEFT methods relative to fine-tuning
6.4 Compacter (omitted)
6.5 S4 (omitted)
7. RLHF (to be added)
Currently the best end-to-end implementation is Microsoft's DeepSpeedChat.