LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention (Paper Notes)

Introduction

The authors note that large language models (LLMs) have recently drawn extensive attention from both academia and industry and have shown strong capabilities: given instructions or prompts, they can generate professional, complex, context-aware dialogue. However, instruction-following models are limited by closed-source data and heavy compute requirements.

Alpaca is a model obtained by fully fine-tuning LLaMA on data generated with self-instruct: starting from only 175 seed instruction-output pairs, it uses self-instruct to produce 52K instruction-following pairs, and its performance is close to GPT-3.5.

The authors point out that fully fine-tuning LLaMA is still time-consuming, does not support multimodal inputs, and is cumbersome when switching between downstream tasks.

In this paper, the authors add learnable adaption prompts as a prefix at the higher layers of the model to inject the new instruction knowledge. To avoid noise at the early stage of training, the vanilla attention in the insertion layers is replaced with zero-init attention, which carries a learnable gating factor.

In short, it has the following advantages:

  1. Only 1.2M trainable parameters reach a capability similar to fully fine-tuned Alpaca (a quick parameter count follows this list).
  2. About 1 hour of fine-tuning.
  3. Flexible switching between downstream tasks.
  4. Multimodal support.
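
As a rough sanity check on the 1.2M figure, assuming the 7B setup described in the paper (prompt length $K = 10$, hidden dimension $C = 4096$, prompts inserted into the top $L = 30$ layers), the adaption prompts alone account for about

$$K \times C \times L = 10 \times 4096 \times 30 \approx 1.2\,\text{M parameters},$$

with the per-layer gating factors adding only a negligible amount on top.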

[Figure: overview of the LLaMA-Adapter architecture]

LLaMA-Adapter

Learnable Adaption Prompts

Given the 52K instruction-following data and a pretrained N-layer LLaMA, the adaption prompts for the topmost L transformer layers are defined as:

$$\{P_l\}_{l=1}^{L}, \qquad P_l \in \mathbb{R}^{K \times C}$$

where K is the prompt length and C is the hidden dimension of the model.
Here L ≤ N: L is the number of topmost layers that receive the prefix, and N is the total number of transformer layers. The authors argue that inserting prompts only into the higher layers better tunes the language representations that carry higher-level semantics.

Each of these layers originally holds M tokens, denoted $T_l \in \mathbb{R}^{M \times C}$, and the prompt is concatenated with them along the token dimension:

$$[P_l; T_l] \in \mathbb{R}^{(K+M) \times C}$$

The authors argue that in this way $P_l$ can efficiently guide the output $T_l$ of each inserted layer. (This feels somewhat like P-Tuning v2.)
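
A minimal sketch of this prefix concatenation, assuming a PyTorch-style module (class and variable names are illustrative, not the official LLaMA-Adapter code):

```python
import torch
import torch.nn as nn

class AdaptionPrompt(nn.Module):
    """Sketch: prepend K learnable prompt tokens (P_l) to the M token
    features (T_l) of one of the top-L transformer layers."""

    def __init__(self, prompt_len: int, hidden_dim: int):
        super().__init__()
        # P_l in R^{K x C}, one set per inserted layer
        self.prompt = nn.Parameter(torch.zeros(prompt_len, hidden_dim))
        nn.init.normal_(self.prompt, std=0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, M, C)  ->  (batch, K + M, C)
        batch_size = tokens.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompt, tokens], dim=1)

# usage: only the top-L layers of the N-layer model get such a prompt
# layer = AdaptionPrompt(prompt_len=10, hidden_dim=4096)
# h = layer(torch.randn(2, 128, 4096))   # -> torch.Size([2, 138, 4096])
```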

Zero-init Attention

[Figure: vanilla attention vs. zero-init attention]
As shown in the figure above, in the original setup, predicting the yellow token requires attending to the yellow token and all the tokens before it. If a randomly initialized prefix is added to this layer, it can introduce noise at the early stage of training, disturbing the pretrained knowledge and making fine-tuning unstable.

The authors replace the vanilla attention with zero-init attention: to suppress this noise, they introduce a learnable gating factor, initialized to zero, that is multiplied with the attention weights of the prefix.

The second change concerns the attention scores: instead of applying softmax over all tokens at once (the scores being the usual $QK^T$ products), the authors apply softmax to the prefix scores and to the original-token scores separately:

$$S_l = \frac{Q_l K_l^T}{\sqrt{C}} \in \mathbb{R}^{1 \times (K+M+1)}, \qquad S_l^g = \left[\operatorname{softmax}(S_l^K) \cdot g_l;\ \operatorname{softmax}(S_l^{M+1})\right]^T$$

where $S_l^K$ and $S_l^{M+1}$ are the score components for the K prompt tokens and the M+1 original tokens, and $g_l$ is the zero-initialized gating factor.

The output passed to the next layer is then obtained through a linear projection:

$$t_l^o = \operatorname{Linear}_o\!\left(S_l^g\, V_l\right)$$
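
A minimal sketch of this gated attention, assuming single-head attention, batch-first tensors, and no causal mask for brevity (names are illustrative, not the official implementation):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroInitAttention(nn.Module):
    """Sketch of zero-init attention: softmax is applied to the prompt scores
    and the original-token scores separately, and the prompt part is scaled
    by a gating factor that starts at zero."""

    def __init__(self, hidden_dim: int, prompt_len: int):
        super().__init__()
        self.q = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.k = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.o = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.prompt_len = prompt_len
        # g_l: zero at initialization, so the prompt contributes nothing at first
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, K + M, C); the prompt occupies the first K positions
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))  # (B, K+M, K+M)

        # split the last dimension into prompt scores and original-token scores
        s_prompt, s_tokens = scores.split(
            [self.prompt_len, x.size(1) - self.prompt_len], dim=-1)

        # independent softmax, then gate the prompt attention weights
        attn = torch.cat([F.softmax(s_prompt, dim=-1) * self.gate,
                          F.softmax(s_tokens, dim=-1)], dim=-1)
        return self.o(attn @ v)
```

At initialization the gate is zero, so the layer behaves exactly like attention over the original tokens; as training progresses, the gate gradually lets the prompt information in.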

Experiments

The number of trainable parameters is lower than that of LoRA, and the training time is about 3x shorter than that of fully fine-tuned Alpaca.

[Figure: comparison of trainable parameters and training time]

Increasing the number of transformer layers into which the adapter is inserted improves accuracy.

[Figure: accuracy vs. number of inserted layers]

Zero initialization also makes the method more robust to over-fitting: accuracy is highest when training for 60 epochs.

[Figure: effect of zero initialization and robustness to over-fitting]

Source: blog.csdn.net/qq_18555105/article/details/130224392