Overview of efficient fine-tuning of large models (Part 2): DiffPruning, BitFit, LoRA, AdaLoRA, MAM Adapters, UniPELT

  This article introduces DiffPruning and BitFit among the selective methods in PEFT; LoRA and AdaLoRA among the reparameterization methods; and MAM Adapters and UniPELT among the hybrid methods. For the taxonomy of methods, see the PEFT survey "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning".

4. Selective Methods

Reference : Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning

  Selective methods fine-tune a subset of the model's existing parameters. The subset can be chosen by layer depth, by layer type, or even by individual parameter.

4.1 DiffPruning(2020.10)

  Adapter Tuning inserts task-specific residual modules between the layers of the model and optimizes only those modules. Since the residual modules contain few parameters (about 3.6% of the model), the fine-tuning cost is low.

  The Diff pruning proposed in this paper is similar to Adapters, but instead of modifying the model structure, Diff pruning extends the base model with a task-specific diff vector and only needs to fine-tune about 0.5% of the pretrained parameters. That is, Diff pruning expresses fine-tuning for a specific task as learning a diff vector $\delta_{task}$ that is added to the pretrained model parameters $\theta_{pretrained}$:
$\theta_{task} = \theta_{pretrained} + \delta_{task}$

The diff vector is regularized with a differentiable approximation of the L0-norm penalty to encourage sparsity (see the Zhihu post for details).
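
As an illustration, here is a minimal PyTorch-style sketch of the diff-vector parameterization (the class name DiffLinear is hypothetical, the layer is assumed to have a bias, and the differentiable L0 relaxation that sparsifies the diff vector is omitted):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffLinear(nn.Module):
    """Sketch: frozen pretrained weights plus a trainable task-specific diff vector."""
    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        # theta_pretrained: frozen copies of the pretrained weights
        self.register_buffer("w0", pretrained.weight.detach().clone())
        self.register_buffer("b0", pretrained.bias.detach().clone())
        # delta_task: trainable diff (sparsified via an L0 relaxation in the paper, omitted here)
        self.dw = nn.Parameter(torch.zeros_like(self.w0))
        self.db = nn.Parameter(torch.zeros_like(self.b0))

    def forward(self, x):
        # theta_task = theta_pretrained + delta_task
        return F.linear(x, self.w0 + self.dw, self.b0 + self.db)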

  Prompt tuning fine-tunes a set of soft prompt tokens, whereas DiffPruning freezes most of the language model parameters and fine-tunes only an added diff vector; the two ideas are essentially similar.

4.2 BitFit(2021.6)

Ideally, we would like to have an efficient fine-tuning method that satisfies the following conditions:

  • Match the performance of full fine-tuning.
  • Change only a small set of model parameters.
  • Allow data to arrive in a stream rather than all at once, which is convenient for efficient hardware deployment.
  • Keep the set of changed parameters consistent across different downstream tasks.

  Ben-Zaken et al. (2021) propose to fine-tune only the bias terms of the network: in each linear or convolutional layer, the weight matrix W is kept fixed and only the bias vector b is optimized. The pseudocode is as follows:

params = (p for n, p
		in model.named_parameters()
		if "bias" in n)    # keep only parameters whose name contains "bias"
optimizer = Optimizer(params)  # optimize the bias terms only
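
As a concrete PyTorch sketch (an assumption for illustration, not from the paper), freezing everything except the bias terms could look like the following; the AdamW optimizer and the "bias" name filter are illustrative choices, and a task-specific classification head would normally be left trainable as well:

from torch.optim import AdamW

# model: a pretrained PyTorch model (e.g. a BERT encoder plus a classification head)
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name      # train biases only, freeze everything else

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = AdamW(trainable, lr=1e-4)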

  BitFit updates only about 0.05% of the model parameters. The original paper shows that on BERT-scale models (fewer than 1 billion parameters) the method reaches similar or better performance than full fine-tuning in low- and medium-data regimes. On larger networks such as T0-3B or GPT-3, however, BitFit lags behind full fine-tuning and other PEFT methods.

  For a Transformer model, most of the transformer-encoder parameters are frozen; only the bias parameters and the parameters of the task-specific classification layer are updated. The bias parameters involved include the biases used to compute query, key, and value and to merge the multiple attention heads in the attention module, the biases in the MLP layers, and the biases in the LayerNorm layers.

  Comparing BitFit, Adapter, and Diff-Pruning on the GLUE benchmark with the BERT-Large model shows that:

  • Although BitFit uses far fewer parameters than Adapter and Diff-Pruning, it matches their performance and is even better on some tasks.
  • Although BitFit does not match full fine-tuning, it is far better than the Frozen baseline that fixes all model parameters.

  Comparing the parameters before and after BitFit training also shows that the bias terms of the query projection and of the first FFN layer (which expands the feature dimension from N to 4N) change the most. Updating only these two groups of bias parameters already achieves good results; conversely, fixing either of them noticeably degrades the model.

4.3 Freeze and Reconfigure (FAR,2022)

  FAR (Vucetic et al., 2022) selects columns of the parameter matrices to prune and reconfigures each linear layer into a trainable part and a frozen part. The method has two phases.

  • Phase 1: Determine the most important rows of the parameter matrices to update. The procedure is analogous to structured pruning, and any pruning criterion can be used.
  • Phase 2: Split each parameter matrix $W$ into a trainable part $W_t$ and a frozen part $W_f$, do the same for the biases, then concatenate the results to reconfigure the network.

The pseudo code of the whole method is as follows:

def far_layer(x):
	h1 = x @ W_t 	# W_t: the trainable part of the weights
	h2 = x @ W_f 	# W_f: the frozen part of the weights
	return concat([h1, h2], dim=-1)

  The original paper focuses on edge scenarios and uses DistilBERT (66M) in its experiments. FAR is applied only to the feed-forward layers, since these account for most of DistilBERT's parameters. The authors show that on five GLUE tasks and SQuAD 2.0 (Rajpurkar et al., 2018), FAR achieves performance similar to fine-tuning while updating only 6% of the parameters.
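
A minimal PyTorch sketch of the reconfiguration idea follows; the set of important rows is assumed to come from some pruning criterion, the class name FARLinear is hypothetical, and note that the output rows end up permuted relative to the original layer, which the real method has to account for:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FARLinear(nn.Module):
    """Sketch: a linear layer split into a trainable part W_t and a frozen part W_f."""
    def __init__(self, linear: nn.Linear, important_rows: torch.Tensor):
        super().__init__()
        mask = torch.zeros(linear.out_features, dtype=torch.bool)
        mask[important_rows] = True
        self.w_t = nn.Parameter(linear.weight[mask].detach().clone())        # trainable rows
        self.b_t = nn.Parameter(linear.bias[mask].detach().clone())
        self.register_buffer("w_f", linear.weight[~mask].detach().clone())   # frozen rows
        self.register_buffer("b_f", linear.bias[~mask].detach().clone())

    def forward(self, x):
        h1 = F.linear(x, self.w_t, self.b_t)   # trainable part
        h2 = F.linear(x, self.w_f, self.b_f)   # frozen part
        return torch.cat([h1, h2], dim=-1)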

4.4 FishMask (omitted)

5. Reparameterization-based methods

5.1 Intrinsic SAID(2020.12)

《Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning》

  While pretrained language models can be fine-tuned to produce state-of-the-art results in a wide range of language understanding tasks, the dynamics of this process are not fully understood, especially in low-data situations. Why can we fine-tune a model with hundreds of millions of parameters using a relatively traditional gradient descent algorithm (e.g., without strong regularization) and only use a dataset of hundreds or thousands of labeled examples?

  In the work of Aghajanyan et al. (2020), the authors show that common pretrained models have a very low intrinsic dimension, so there exists a low-dimensional reparameterization whose fine-tuning is as effective as fine-tuning in the full parameter space.

  Specifically, they use the Fastfood transform to reparameterize the update of the model weights. Their results show that larger models need changes in a lower-dimensional subspace than smaller models to reach the same fine-tuning performance. This observation motivates both the focus on fine-tuning large models and the pursuit of parameter efficiency.

  However, although the number of optimized parameters is low, the memory complexity of the Fastfood transform and the fact that all model parameters are still updated make Intrinsic SAID impractical for fine-tuning large networks.

5.2 LoRA (2021.6)

《LoRA: Low-Rank Adaptation of Large Language Models》 (code: Microsoft/LoRA, stanford_alpaca)

5.2.1 Background

  Neural networks contain many dense layers that perform matrix multiplication, and the weight matrices of these layers are usually full-rank. The Intrinsic SAID work shows that although a pretrained model has a huge number of parameters, the intrinsic dimension of each downstream task is not large, and the model can still learn effectively even when randomly projected into a much smaller subspace. In other words, in theory we can fine-tune a very small number of parameters and still obtain good downstream results.

  Inspired by this, the authors hypothesize that the weight updates during adaptation also have a low intrinsic rank. For a pretrained weight matrix $W_0 \in \mathbb{R}^{d\times k}$, we do not fine-tune $W_0$ directly but instead learn an increment $\Delta W$ that updates the model.

5.2.2 Algorithms

  Specifically, a bypass branch (effectively a plug-in) is added alongside the pretrained PLM weights: two matrices A and B, multiplied together, simulate the low intrinsic rank. The branch takes the same dimension d as the pretrained layer; the first matrix projects from dimension d down to dimension r, and the second projects from r back to d, where r << d.

  Here $r$ is the rank of the decomposition, so the number of trainable parameters for a $d \times d$ matrix drops from $d \times d$ to $d \times r + r \times d$, a large reduction. This step is called low-rank decomposition.
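
For instance (illustrative numbers only): with d = 1024 and r = 8, a full d × d update would require 1024 × 1024 ≈ 1.05M parameters, whereas the low-rank pair needs only d × r + r × d = 2 × 1024 × 8 = 16,384 parameters, roughly a 64× reduction.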


Figure 1: Reparameterization, training only A and B

After adding the bypass, the forward propagation can be expressed as:
$h = W_0x + \Delta Wx = W_0x + BAx$

where $W_0 \in \mathbb{R}^{d\times k}$, $B \in \mathbb{R}^{d\times r}$, and $A \in \mathbb{R}^{r\times k}$.

The whole process is expressed in pseudocode as follows:

def lora_linear(x):
	h = x @ W 	# W: the frozen pretrained weight
	h += scale * (x @ W_A @ W_B)  # low-rank update; scale is the scaling factor (alpha / r in the LoRA paper)
	return h

  Matrix A is initialized from a Gaussian distribution and matrix B is initialized to zero, so that the new branch BA = 0 at the start of training and has no effect on the model's output.

  At inference time, the two branches are simply summed: $h = W_0x + BAx = (W_0 + BA)x$, so the product $BA$ can be merged with the original weight matrix $W_0$ to form a new weight that replaces the PLM's $W_0$. Inference therefore requires no additional compute.
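
To make this concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer (the class name LoRALinear, the Gaussian initialization scale, and the default r and alpha are illustrative assumptions, not the reference implementation):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch: frozen nn.Linear plus a trainable low-rank branch B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)              # freeze W_0
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # Gaussian init
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init, so BA = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        # h = W_0 x + (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

    @torch.no_grad()
    def merge(self):
        # Fold B A into W_0 so that inference needs no extra matmul.
        self.base.weight += self.scale * (self.lora_B @ self.lora_A)
        return self.base

After training, calling merge() replaces W_0 with W_0 + BA, matching the merged-inference behaviour described above.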

5.2.3 Experiment

  1. Performance comparison with other PEFT methods


Table 2: RoBERTa base, RoBERTa large, and DeBERTa XXL with different adaptation methods on the GLUE benchmark. We report overall (matched and mismatched) accuracy for MNLI, Matthews correlation for CoLA, Pearson correlation for STS-B, and accuracy for the other tasks. Higher is better for all metrics. * denotes numbers published in previous work. † denotes runs configured in a setup similar to Houlsby et al. (2019) for a fair comparison.


Table 3: Performance of GPT-2 medium (M) and large (L) models on the E2E NLG challenge using different adaptation methods. For all metrics, higher numbers are better. LoRA outperforms several baseline models with comparable or fewer trainable parameters. Confidence intervals are shown for the experiments we ran. *Denotes figures published in previous studies.

  2. Which weight matrices to apply LoRA to

The Transformer's weight matrices include:

  • Attention module:
    • $W_q$, $W_k$, $W_v$, which compute the query, key, and value
    • $W_o$, which combines the concatenated heads $head_1, \dots, head_n$
  • The weight matrices of the MLP layer

  In the paper, LoRA is applied only to the four weight matrices of the attention module, and ablation experiments show that adapting $W_q$ and $W_v$ together yields the best results.

  In addition, covering more types of weight matrices is more important than increasing the rank r; increasing r does not necessarily cover a more meaningful subspace.
3. Choice of rank: r is usually set to 4, 8, or 16.


Table 18: Validation loss and test-set metrics of LoRA on the E2E NLG Challenge with different ranks r, using the GPT-2 Medium model. Unlike on GPT-3, where r = 1 already works well for many tasks, here the validation loss is best at r = 16 while the BLEU metric is best at r = 4, suggesting that GPT-2 Medium has an intrinsic rank for adaptation similar to that of GPT-3 175B. Note that some of the hyperparameters were tuned at r = 4, which matches the parameter count of another baseline, so the other choices of r may not be optimal.

5.3 AdaLoRA(2023.3)

Paper: 《Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning》 (code: QingruZhang/AdaLoRA)

5.3.1 Background

The two earlier families of methods, adding small network modules and incremental updates for downstream tasks, have some problems:

  • Adding small network modules: small modules are added to the PLM and, for each task, only these modules are fine-tuned while the base model stays unchanged and can be shared across tasks. Only a small number of task-specific parameters need to be introduced and updated, which greatly improves the practicality of pretrained models. Examples: Adapter tuning, Prefix tuning, Prompt tuning. Although this greatly reduces memory consumption, these methods have issues: Adapter tuning introduces inference latency; Prefix tuning and Prompt tuning optimize the prefix or prompt directly, which is non-monotonic, hard to converge, and consumes input token positions.
  • Incremental updates for downstream tasks: incrementally update the pretrained weights without modifying the model architecture, i.e., W = W0 + ∆W. Examples: Diff pruning, LoRA. These methods can come close to full fine-tuning, but also have problems: Diff pruning needs low-level support to accelerate unstructured sparse matrix computation, so it cannot directly use existing frameworks, and the complete ∆W matrix must be stored during training, so the training cost is not reduced compared to full fine-tuning. LoRA pre-specifies the same intrinsic rank r for every increment matrix, ignoring that the importance of the weight matrices varies significantly across modules and layers, and it adapts only the attention weights and not the FFN, even though the FFN is in fact important.

To summarize based on the above questions:

  1. We cannot pre-specify a fixed rank for every matrix; the rank of each increment matrix needs to be allocated dynamically, since the importance of the weight matrices varies significantly across modules and layers.
  2. We need to find the more important matrices and assign them more parameters, while pruning the unimportant ones. Finding the important matrices improves model quality; pruning the unimportant ones reduces the amount of computation and lowers the risk of hurting the model.

  To bridge this gap, the authors propose AdaLoRA, which adaptively allocates parameter budgets among weight matrices according to their importance scores.

5.3.2 Algorithms

AdaLoRA is an improvement over LoRA that dynamically allocates the parameter budget across weight matrices according to importance scores. The specific approach is as follows:

  • Adjust the rank allocation of the increment matrices. AdaLoRA assigns higher ranks to critical increment matrices to capture finer, task-specific information, and lower ranks to less important ones to prevent overfitting and save computational budget.
  • Parameterize the incremental update in the form of a singular value decomposition, and prune unimportant singular values according to an importance score while keeping the singular vectors. Since exact SVD of a large matrix is computationally expensive, pruning only the singular values avoids that cost, keeps the possibility of recovering pruned ranks later, and stabilizes training.

$W = W^{(0)} + \Delta = W^{(0)} + P\Lambda Q$

where $P \in \mathbb{R}^{d_{1}\times r}$ and $Q \in \mathbb{R}^{r\times d_{2}}$ contain the left and right singular vectors of $\Delta$, and $\Lambda \in \mathbb{R}^{r\times r}$ is a diagonal matrix of singular values.

  • An extra penalty term is added to the training loss to enforce orthogonality of the singular matrices P and Q, which avoids the heavy computation of an exact SVD and stabilizes training.
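
A minimal sketch of this SVD-style parameterization and the orthogonality penalty is shown below (class and variable names are hypothetical; the importance scoring and the actual budget-allocation schedule of AdaLoRA are omitted):

import torch
import torch.nn as nn

class SVDDelta(nn.Module):
    """Sketch: increment Delta = P diag(lambda) Q with an orthogonality regularizer."""
    def __init__(self, d1: int, d2: int, r: int):
        super().__init__()
        self.P = nn.Parameter(torch.randn(d1, r) * 0.01)   # left singular vectors
        self.lam = nn.Parameter(torch.zeros(r))            # singular values, pruned by importance
        self.Q = nn.Parameter(torch.randn(r, d2) * 0.01)   # right singular vectors

    def delta(self):
        return self.P @ torch.diag(self.lam) @ self.Q      # used as W = W0 + delta() in the host layer

    def orth_penalty(self):
        # Encourage P^T P ≈ I and Q Q^T ≈ I instead of running an exact SVD.
        eye = torch.eye(self.lam.numel(), device=self.P.device)
        return ((self.P.T @ self.P - eye) ** 2).sum() + ((self.Q @ self.Q.T - eye) ** 2).sum()

The total training loss would then be the task loss plus gamma * orth_penalty(), and entries of lam judged unimportant are zeroed out, which lowers the effective rank of that matrix.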

5.3.3 Experiment

  Experiments show that AdaLoRA achieves better or comparable performance to existing methods across all budgets and datasets. For example, with a parameter budget of 0.3M, AdaLoRA is 1.8% higher than the best-performing baseline on the RTE dataset.


Table 1: Results using DeBERTaV3-base on the GLUE development set. The best results on each dataset are shown in bold. We report the average correlation of STS-B. Full FT, HAdapter, and PAdapter stand for full fine-tuning, Houlsby adapter, and Pfeiffer adapter, respectively. We report the average of 5 runs with different random seeds.


Table 2: Results using DeBERTaV3-base on SQuAD v1.1 and SQuAD v2.0. Here #Params is relative to the number of trainable parameters in full fine-tuning. We report EM/F1. The best results in each setting are shown in bold.

6. Hybrid methods

6.1 SparseAdapter (omitted)

6.2 MAM Adapters(2021.10)

6.2.1 Background

  Recent studies have proposed a variety of parameter-efficient transfer learning methods that achieve robust performance with only a small number of (extra) parameters fine-tuned. Although these approaches are effective, little is known about the key factors for success and how the various approaches are linked.

  For example, the figure below plots, for different fine-tuning methods, the performance on the XSum English summarization task (ROUGE-2, higher is better) against the number of tuned parameters as a percentage of those of full fine-tuning. The upper-left corner of the figure is the ideal region, and the figure shows that Adapter, Prefix Tuning, and LoRA are among the better-performing methods.

Figure 1: The Transformer architecture and several state-of-the-art parameter-efficient tuning methods; blocks with dashed borders denote modules added by these methods. Figure 2: Performance of different methods on the XSum summarization task.

The mathematical representation of these three methods is organized as follows:

In brief: Adapter: $h \leftarrow h + f(hW_{down})W_{up}$; Prefix Tuning: attention over prefix-augmented keys and values (Equation 5 below); LoRA: $h \leftarrow h + s\cdot xW_{down}W_{up}$.

  Why do Adapter, Prefix Tuning, and LoRA look so different in structure and formula, especially Prefix Tuning, yet achieve similarly good results?

6.2.2 Further Research on Prefix Tuning

  Prefix Tuning prepends $l$ adjustable prefix vectors to the keys and values of the multi-head attention in every layer. Specifically, two sets of prefix vectors $P_k, P_v \in \mathbb{R}^{l\times d}$ are concatenated with the original key $K$ and value $V$, and multi-head attention is then computed over the prefixed keys and values. The computation of the $i$-th attention head becomes:
$head_i = \text{Attn}(xW_q^{(i)}, \text{concat}(P_k^{(i)}, CW_k^{(i)}), \text{concat}(P_v^{(i)}, CW_v^{(i)})) \qquad (5)$

where $C$ is the input context and $\text{Attn}(Q,K,V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$.

  Prompt tuning simplifies prefix tuning by prepending the prefix only to the word embeddings of the first layer; P-tuning is related work. Below, the authors derive an equivalent form of Equation 5 and provide an alternative view of prefix tuning.

$head = (1-\lambda(x))\,\text{Attn}(xW_q, CW_k, CW_v) + \lambda(x)\,\text{softmax}(xW_qP_k^{\top})P_v \qquad (7)$
  where $\lambda(x)$ is a scalar equal to the sum of the normalized attention weights on the prefix positions. The first term in Equation 7, $\text{Attn}(xW_q, CW_k, CW_v)$, is the original attention without the prefix, while the second term is a position-wise modification independent of $C$. Equation 7 thus gives an alternative view of prefix tuning: it applies a position-wise modification to the original head output $h$ via linear interpolation:
$h \leftarrow (1-\lambda(x))\,h + \lambda(x)\,\Delta h, \quad \text{where } \Delta h := \text{softmax}(xW_qP_k^{\top})P_v$
Defining $W_1 = W_qP_k^{\top}$, $W_2 = P_v$, and $f = \text{softmax}$, the formula above can be rewritten as:
$h \leftarrow (1-\lambda(x))\,h + \lambda(x)\,f(xW_1)W_2$
  This formula is very similar to the Adapter formula $h \leftarrow h + f(h\,W_{down})\,W_{up}$, except that prefix tuning performs a gated (weighted) addition whereas the adapter does not. Figure 3b shows the computation graph of prefix tuning from this perspective, which allows prefix tuning to be abstracted as an adapter-like plug-in module.

  Furthermore, when $l$ is small, $W_1 \in \mathbb{R}^{d_h\times l}$ and $W_2 \in \mathbb{R}^{l\times d_h}$ are low-rank matrices, functioning like the $W_{down}$ and $W_{up}$ matrices of the adapter. This view also suggests that the number of prefix vectors $l$ plays a role similar to the adapter's bottleneck dimension $r$: both impose a rank constraint on the computation of the modification vector $\Delta h$. Therefore, $l$ is also referred to as the bottleneck dimension.

The rank constraint means that for any x, ∆h is a linear combination of the same l (or ≤ l) basis vectors.
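
As a sanity check of this equivalence, the following small numeric sketch (an assumption for illustration, not code from the paper) verifies Equation 7 for a single query vector; the 1/sqrt(d) scaling is applied consistently on both sides:

import torch

torch.manual_seed(0)
l, m, d = 4, 6, 8                      # prefix length, context length, head dimension
q   = torch.randn(1, d)                # x W_q for one query position
K   = torch.randn(m, d)                # C W_k
V   = torch.randn(m, d)                # C W_v
P_k = torch.randn(l, d)                # prefix keys
P_v = torch.randn(l, d)                # prefix values

def attn(q, K, V):
    return torch.softmax(q @ K.T / d ** 0.5, dim=-1) @ V

# Left-hand side: attention over the prefix-augmented keys and values.
full = attn(q, torch.cat([P_k, K]), torch.cat([P_v, V]))

# Right-hand side: gated interpolation between normal attention and a prefix-only term.
logits_p = q @ P_k.T / d ** 0.5
logits_c = q @ K.T / d ** 0.5
lam = logits_p.exp().sum() / (logits_p.exp().sum() + logits_c.exp().sum())
gated = (1 - lam) * attn(q, K, V) + lam * torch.softmax(logits_p, dim=-1) @ P_v

print(torch.allclose(full, gated, atol=1e-6))   # True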

6.2.3 Unified framework of PEFT

  The previous section showed, by rewriting Prefix Tuning, that its formula is highly similar to that of Adapters. Going further, the authors deconstruct the state-of-the-art PEFT methods and propose a unified framework that establishes the connections between them. Specifically, each method is recast as a modification ∆h applied to a specific hidden state of the pretrained model, together with a set of design dimensions, such as the function that computes the modification and the position where it is applied, along which the methods vary.

  The figure below analyzes the similarities of the different fine-tuning methods in internal structure and in how that structure is inserted. It shows the structure of the efficient fine-tuning methods Adapter, Prefix Tuning, and LoRA, as well as new variants, Parallel Adapter and Scaled PA, designed by swapping elements in ways not explored in previous work.

Figure 3: Graphical illustration of existing methods and proposed variants. A "PLM module" means that a certain sublayer of a PLM (such as an attention or feed-forward network) is frozen. "Scaled PA" stands for Scaled Parallel Adapter

  The following table shows the comparison of the efficient fine-tuning methods Adapter, Prefix Tuning, LoRA and new variants in each dimension.

[Table 1: Adapter, Prefix Tuning, LoRA, and the new variants compared along the design dimensions below]

  • Functional form: the specific function that computes ∆h; this is the part that is learned. For all of these methods the functional form follows a proj_down → nonlinear → proj_up architecture, where the nonlinearity degenerates to the identity function in LoRA.
  • Insertion form: how the added module is inserted into the structure (sequential or parallel).
  • Modified representation: which hidden representation in the PLM is modified.
  • Composition function: how the modification vector ∆h is composed with the original hidden representation h to form the new hidden representation. For example, the adapter performs simple additive composition, prefix tuning uses gated additive composition, and LoRA scales ∆h by a constant factor and adds it to the original hidden representation.

  Here, the newly added trainable structure is the part to be learned (note: for Prefix Tuning this refers to the rewritten form above); the insertion form can be sequential or parallel; and the positions modified in the model are the attention and FFN sublayers.

  This unified framework lets us study parameter-efficient fine-tuning methods along these design dimensions, identify the key design choices, and transfer design elements between approaches. On this basis, the authors build a new parameter-efficient fine-tuning method, MAM Adapters, which tunes fewer parameters than previous methods while being more effective, achieving results comparable to full fine-tuning on all four tasks.

6.2.4 Transfer design elements

  In Figure 3 and Table 1, several new methods are designed by transferring design elements between existing methods under the unified view above:

  • Parallel Adapter: a variant obtained by transferring the parallel insertion of prefix tuning to the adapter. Interestingly, while the authors arrived at the Parallel Adapter through its similarity to prefix tuning, concurrent work proposed the same variant independently and studied it empirically.
  • Multi-head Parallel Adapter: a further step toward prefix tuning: parallel adapters are applied to modify the attention output of each head, as prefix tuning does.
  • Scaled Parallel Adapter: a variant obtained by transferring LoRA's composition function and insertion form to the adapter, as shown in Figure 3e.

6.2.5 MAM Adapters

The author conducted a detailed investigation on the placement of the Adapter and the soft prompt. The following conclusions are drawn (see the experimental part of the paper for details):

  • Scaled parallel adapter is the best variant to modify FFN . The Adapter placed in parallel is better than the Adapter placed in sequence, and the Adapter placed in parallel with FFN is better than the Adapter placed in parallel with multi-head attention (MHA) (as shown in the figure below, blue indicates modification of Attention, red indicates modification of FFN).
  • Modified head attention shows the best results when the parameter budget is very small, while FFN can make better use of modification at larger capacity.
  • Soft prompts such as prefix tuning can achieve strong performance by changing only 0.1% of parameters.


Figure 5: Results on XSum (left image) and en-ro (right image). PA stands for Parallel Adapter. Blue and red marks apply modifications in the attention and FFN sublayers, respectively

  Based on this, the authors propose MAM (mix-and-match); the final MAM Adapter combines a scaled parallel adapter on the FFN layer with soft prompts on attention. Specifically, prefix tuning with a small bottleneck dimension (l = 30) is used in the attention sublayer, and more of the parameter budget is allocated to modifying the FFN representation with a scaled parallel adapter (r = 512).
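
A minimal sketch of the FFN branch of MAM, a scaled parallel adapter, is shown below (the class name and the default scaling factor s are illustrative assumptions; in MAM this branch runs alongside the frozen FFN sublayer while prefix tuning handles the attention sublayer):

import torch
import torch.nn as nn

class ScaledParallelAdapter(nn.Module):
    """Sketch: a parallel adapter whose output is scaled by a constant s and added to the FFN output."""
    def __init__(self, d_model: int, r: int = 512, s: float = 4.0):
        super().__init__()
        self.down = nn.Linear(d_model, r)
        self.up = nn.Linear(r, d_model)
        self.act = nn.ReLU()
        self.s = s

    def forward(self, x, ffn):
        # x: input to the FFN sublayer; ffn: the frozen feed-forward function of the block.
        return ffn(x) + self.s * self.up(self.act(self.down(x)))

Note the parallel insertion: the adapter reads the same input as the FFN rather than the FFN's output, which is the design choice favoured by the ablations above.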

  Table 6 compares the MAM Adapter with various parameter-efficient tuning methods. For completeness, it also reports other combinations: using parallel adapters in both the attention and FFN layers, and combining prefix tuning (attn) with LoRA (ffn); both combinations improve over their respective prototypes.


Table 6: Comparison of various parameter efficient tuning methods and their proposed variants. For the highest performing method, we run with 3 random seeds and report the mean and standard deviation.

  The results in Table 6 show that the MAM Adapter achieves results close to full fine-tuning on both XSum and MT while using only 6.7% of the parameters (relative to full fine-tuning); it greatly outperforms BitFit and Prompt Tuning and consistently outperforms LoRA, Adapter, and Prefix Tuning.

6.3 UniPELT(2021.10)

《UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning》

6.3.1 Background

  In recent years, many parameter-efficient language model tuning (PELT) methods have emerged; with far fewer trainable parameters, they can match the performance of full fine-tuning. However, different PELT methods may perform very differently on the same task, which makes choosing the right method for a specific task cumbersome.

  Based on this, the authors propose UniPELT, which incorporates different PELT methods as submodules and learns, through a gating mechanism, to activate the submodules best suited to the current data or task.

6.3.2 Model structure

UniPELT is a gated combination of LoRA, Prefix Tuning and Adapter, where:

  • LoRA: low-rank decomposition replaces direct optimization of the pretrained matrix $W_0$ with optimization of the reparameterization matrices $W_A, W_B$ of the plug-in branch.
  • Prefix Tuning: $l$ adjustable prefix vectors are prepended to the keys and values of the multi-head attention in every layer; that is, two sets of prefix vectors $P_k, P_v \in \mathbb{R}^{l\times d}$ are concatenated with the original key $K$ and value $V$, and attention is computed over the prefixed keys and values.
  • Adapter: an Adapter module is added after the feed-forward sublayer of each Transformer block.

  These three modules are then combined, each controlled by its own gate implemented as a linear layer: $\mathcal{G}_P$ gates the Prefix Tuning branch, $\mathcal{G}_L$ gates LoRA, and $\mathcal{G}_A$ gates the Adapter. All trainable parameters (blue in the figure) include LoRA's reparameterization matrices $W_B, W_A$, the prefix tuning parameters $P_k, P_v$, the Adapter parameters, and the weights of the gate functions. The whole structure is shown in the figure below:
[Figure: UniPELT architecture — LoRA, Prefix Tuning, and Adapter submodules, each controlled by its own gate]
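
The gating idea can be sketched as follows for the adapter submodule (a rough illustration only; the exact inputs to each gate and the pooling used in UniPELT may differ, and the other two gates G_P and G_L are analogous):

import torch
import torch.nn as nn

class GatedAdapter(nn.Module):
    """Sketch: an adapter whose contribution is scaled by a learned gate G_A."""
    def __init__(self, d_model: int, r: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, r)
        self.up = nn.Linear(r, d_model)
        self.act = nn.ReLU()
        self.gate = nn.Linear(d_model, 1)                   # G_A: produces a scalar gate

    def forward(self, h):                                   # h: (batch, seq_len, d_model)
        g = torch.sigmoid(self.gate(h)).mean(dim=1, keepdim=True)   # one gate value per example
        return h + g * self.up(self.act(self.down(h)))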

6.3.3 Experiment

  1. Performance with little data
    UniPELT shows significant improvements over the individual LoRA, Adapter, and Prefix Tuning methods in low-data scenarios with only 100 examples. With more data, UniPELT performs comparably or better than these methods.


Table 1: Results using K = {100, 500, 1000} training samples on the GLUE benchmark. The evaluation indicators are Matthew's correlation of CoLA, F1 value of MRPC and QQP, Spearman's correlation of STS-B and the accuracy of other tasks. For MNLI, we evaluate on the matching dataset. We report the average performance over five random seeds, with standard deviation as subscript. Under each setting, the best and second-best methods are bolded and underlined.

  2. Performance with abundant data
    • Table 3 lists the performance of the different methods when all training samples are used. UniPELT is still the best overall, but its advantage is smaller than in the low-resource setting. This is understandable, since existing PELT methods already perform close to full fine-tuning when training data is sufficient, leaving less room for improvement.
    • Furthermore, simply combining the PELT methods without gating (UniPELT-NoGate) does not perform well in the high-resource setting (its average score is 0.89 lower than UniPELT's).


Table 3: Results on the GLUE benchmark when using all training samples

  3. Trainable parameters and training/inference time
    • Trainable parameters: LoRA, BitFit, and Prefix Tuning use relatively few, while UniPELT uses somewhat more.
    • Training speed: UniPELT is slower than the individual methods it combines, but still acceptable.
    • Inference speed: BitFit adds the least overhead; UniPELT increases inference time by 27%.
      Table 4: Comparison of the number of trainable parameters and training/inference time of various PEFT methods relative to fine-tuning

6.4 Compacter (omitted)

6.5 S4 (omitted)

7. RLHF (to be added)

Currently the best end-to-end implementation is Microsoft's DeepSpeedChat.


Origin blog.csdn.net/qq_56591814/article/details/131334254