Parameter-Efficient Fine-Tuning Practice with LLaMA, ChatGLM, and BLOOM

Author: Thomas X (Tencent NLP Algorithm Engineer)

Original article: https://zhuanlan.zhihu.com/p/635710004

1. Comparison of open source base models

The training of a large language model is divided into two stages:

(1) Unsupervised pre-training on massive text corpora to learn general semantic representations and world knowledge.

(2) Instruction fine-tuning and reinforcement learning from human feedback on small-scale data, to better align the model with downstream tasks and human preferences. LIMA [1] shows that almost all of an LLM's knowledge is learned during pre-training, and only a limited amount of instruction fine-tuning data is needed to generate high-quality replies. The performance of the base model is therefore crucial: if the base model is not strong enough, instruction fine-tuning and reinforcement learning can hardly achieve good results.

At present, there are three mainstream open-source large language models: LLaMA, ChatGLM and BLOOM. The industry has carried out instruction fine-tuning or reinforcement learning on top of these three models, deriving many different large models. The following compares the three base models in terms of training data, tokenizer and model structure.

1.1 LLaMA

LLaMA [2] is a large language model proposed by Meta. The training data consists mainly of English and other Latin-script languages, and also contains code data from GitHub; it does not include Chinese, Korean or Japanese. All training data is open source, amounting to about 1400B tokens after tokenization.

By parameter count, LLaMA has four versions: 7B, 13B, 33B and 65B. The 7B and 13B versions are trained on 1T tokens, and the 33B and 65B versions on 1.4T tokens. Chinchilla [3] showed that, under a fixed training compute budget, a smaller model trained on more data (more training tokens) can match or even exceed a larger model. For comparison, the 280B Gopher model was trained on only 300B tokens, the 176B BLOOM model on only 350B tokens, and GLM-130B on only 400B tokens, while LLaMA was trained on 1T/1.4T tokens, a significantly larger amount of training data. As a result, although LLaMA-13B has less than 1/10 the parameters of GPT-3, it outperforms GPT-3 on most tasks.

In terms of model structure, like GPT, LLaMA uses a causal decoder-only transformer architecture, with the following changes in the model details:

  • Layer normalization: to improve training stability, pre layer norm is used instead of the traditional post layer norm. Specifically, the bias term of layer normalization is removed and RMS Norm (root mean square norm) is used.

  • Activation function: the SwiGLU activation function is used instead of ReLU. A standard FFN has two weight matrices: it first projects the vector from dimension d up to the intermediate dimension 4d, then projects it back from 4d to d. The FFN with the SwiGLU activation function adds a third weight matrix; to keep the parameter count unchanged, the intermediate dimension is set to (2/3)·4d = 8d/3 instead of 4d.

  • Position encoding: The absolute position encoding is removed, and the rotary position encoding RoPE is adopted.

In terms of training objective, LLaMA is trained as a standard language model: predicting the next token given the preceding context.

Regarding the tokenizer, LLaMA's training corpus is mainly English; it uses SentencePiece as the tokenizer, with a vocabulary size of only 32,000. The vocabulary contains very few Chinese tokens (only a few hundred), so the LLaMA tokenizer encodes Chinese text inefficiently.

Here are some large models derived from LLaMA:

  • Alpaca (https://github.com/tatsu-lab/stanford_alpaca): Stanford fine-tuned the 7B LLaMA on a 52k English instruction-following dataset.

  • Vicuna (https://github.com/lm-sys/FastChat): UC Berkeley fine-tuned the 13B LLaMA on user-shared conversations collected from ShareGPT.

  • Baize (https://github.com/project-baize/baize-chatbot): a model obtained by fine-tuning LLaMA with LoRA on 100k dialogues generated by ChatGPT.

  • StableLM (https://github.com/Stability-AI/StableLM): a suite of open-source language models released by Stability AI.

  • BELLE (https://github.com/LianjiaTech/BELLE): Lianjia fine-tuned LLaMA using only data produced by ChatGPT and optimized it for Chinese.

1.2 Vocabulary expansion: Chinese LLaMA

The necessity of vocabulary expansion. The vocabulary size of the original LLaMA model is 32,000. The tokenizer was trained mainly on English corpora and works relatively poorly on Chinese and other languages. LLaMA performs poorly on Chinese for two reasons: on the one hand, its training corpus is mainly English and other Latin-script text and contains no Chinese; on the other hand, its vocabulary is small and contains only a few hundred Chinese characters, so a Chinese character is often split into multiple tokens, the encoding efficiency is low, and learning becomes harder. After the Chinese vocabulary is expanded, a single Chinese character tends to be encoded as one token, which avoids splitting a character into multiple tokens and improves Chinese encoding efficiency.

How is the vocabulary expanded? [6] expands the vocabulary by adding Chinese tokens to improve Chinese encoding efficiency. The specific method is as follows.

1. Train a Chinese tokenizer with SentencePiece on a Chinese corpus, with a vocabulary of 20,000 Chinese tokens. Then merge the Chinese tokenizer with the original LLaMA tokenizer by taking the union of the two vocabularies, obtaining the merged Chinese LLaMA tokenizer with a vocabulary size of 49,953.

2. To adapt to the new tokenizer, extend the embedding matrix of the transformer model from V×h to V'×h, appending the newly added Chinese tokens to the end of the original embedding matrix so that the embeddings of the original vocabulary are not affected (see the code sketch after this list).

3. Further pre-train on Chinese corpora with the transformer parameters frozen, training only the embedding matrix to learn the representations of the newly added Chinese tokens while minimizing interference with the original model.

4. In the instruction fine-tuning stage, all model parameters are unfrozen and trained.
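A minimal sketch of steps 2-3 with Hugging Face transformers; the model and tokenizer paths are placeholders, and the merged Chinese LLaMA tokenizer is assumed to already exist:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# Hypothetical paths: original LLaMA weights and an already-merged Chinese LLaMA tokenizer.
model = LlamaForCausalLM.from_pretrained("path/to/llama-7b")
tokenizer = LlamaTokenizer.from_pretrained("path/to/chinese-llama-tokenizer")

# Step 2: extend the embedding matrix from V x h to V' x h.
# New rows for the added Chinese tokens are appended at the end; original rows are untouched.
model.resize_token_embeddings(len(tokenizer))

# Step 3: freeze the transformer and train only the embeddings
# (the output projection also gains new rows, so it is unfrozen here as well).
for name, param in model.named_parameters():
    param.requires_grad = ("embed_tokens" in name) or ("lm_head" in name)
```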

The effect of vocabulary expansion. Judging from the results of Chinese-LLaMA-Alpaca (https://github.com/ymcui/Chinese-LLaMA-Alpaca) and BELLE (https://github.com/LianjiaTech/BELLE), expanding the Chinese vocabulary improves Chinese encoding efficiency and model performance.

1.3 ChatGLM-6B

ChatGLM-6B is a bilingual (Chinese-English) dialogue language model proposed by Tsinghua University. It adopts the same model structure as GLM-130B [4]. As of July 2022, GLM-130B had been trained on only 400B tokens with a 1:1 Chinese-English ratio. ChatGLM-6B uses more training data, up to 1T tokens; its training corpus contains only Chinese and English, also at a 1:1 ratio.

In terms of model structure, ChatGLM-6B adopts a prefix decoder-only transformer architecture: bidirectional attention over the input and unidirectional (causal) attention over the output. The following changes are made in the model details:

  • Embedding layer gradient shrink: to improve training stability, the gradient of the embedding layer is scaled down. Specifically, word_embedding = word_embedding * α + word_embedding.detach() * (1 − α), where α = 0.1. Since detach() returns a new tensor detached from the computation graph, this is equivalent to shrinking the gradient of the embedding layer by a factor of 10, reducing the gradient norm (a short PyTorch sketch follows this list).

  • layer normalization: The post layer norm based on Deep Norm is adopted.

  • Activation function: the GeGLU activation function is used. Compared with an ordinary FFN, the GLU with its linear gating unit adds a third weight matrix; to keep the parameter count consistent, the intermediate dimension is set to (2/3)·4d = 8d/3 instead of 4d.

  • Position encoding: The absolute position encoding is removed, and the rotary position encoding RoPE is adopted.
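A minimal PyTorch sketch of the embedding gradient shrink trick described above; the α value follows the text, everything else is illustrative:

```python
import torch.nn as nn

class ShrinkEmbedding(nn.Module):
    # Scales the gradient flowing into the embedding weights by alpha,
    # while leaving the forward output numerically unchanged.
    def __init__(self, vocab_size, hidden_size, alpha=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.alpha = alpha

    def forward(self, input_ids):
        emb = self.embed(input_ids)
        # Forward value: alpha*emb + (1-alpha)*emb = emb; backward: only the alpha term carries gradient.
        return emb * self.alpha + emb.detach() * (1 - self.alpha)
```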

Regarding the training objective, ChatGLM-6B is trained on autoregressive blank infilling. Compared with causal decoder-only large language models, the prefix decoder-only ChatGLM-6B has a disadvantage: lower training efficiency. A causal decoder computes the loss on all tokens, while a prefix decoder computes the loss only on the output, not on the input. With the same number of training tokens, the prefix decoder therefore performs worse, because fewer tokens are actually used for learning. In addition, the success of ChatGPT has shown that causal decoder models can obtain very good few-shot and zero-shot generation ability, which can be further enhanced by instruction fine-tuning. Whether prefix decoder models can obtain comparable few-shot and zero-shot ability still lacks sufficient verification.

Regarding the tokenizer, ChatGLM trains a SentencePiece tokenizer on 25GB of Chinese-English bilingual data, with a vocabulary size of 130,528.

The following are some large-scale model applications derived from ChatGLM:

  • langchain-ChatGLM (https://github.com/imClumsyPanda/langchain-ChatGLM): a ChatGLM application based on langchain that implements question answering over an extensible knowledge base.

  • Wenda (https://github.com/l15y/wenda): a large language model invocation platform that implements ChatPDF-like functionality based on ChatGLM-6B.

1.4 BLOOM

The BLOOM [5] series of models are large language models trained by the BigScience team. The training data covers 46 natural languages (including English, Chinese, French, Spanish and Portuguese) and 13 programming languages; 1.5TB of deduplicated and cleaned text was converted into 350B tokens. The language distribution of the training data is shown in the figure below; Chinese accounts for 16.2% of the corpus.

By parameter count, BLOOM has several versions: 560M, 1.1B, 1.7B, 3B, 7.1B and 176B. The BLOOMZ series is fine-tuned on the xP3 dataset and is recommended for English prompts; the BLOOMZ-MT series is fine-tuned on the xP3mt dataset and is recommended for non-English prompts.

In terms of model structure, like GPT, BLOOM adopts a causal decoder-only transformer architecture, with the following changes in the model details:

  • embedding layer norm: A layer normalization is added after the embedding layer to make the training more stable.

  • layer normalization: In order to improve the stability of training, the traditional post layer norm is not used, but the pre layer norm is used.

  • Activation function: The GeLU activation function is used.

  • Position encoding: the absolute position encoding is removed and the relative position encoding ALiBi is adopted. Compared with absolute position encoding, ALiBi extrapolates better: although the maximum sequence length during training is 2048, the model can handle longer sequences at inference time.

In terms of training objective, BLOOM is trained as a standard language model: predicting the next token given the preceding context.

Regarding the tokenizer, BLOOM trains a Byte Pair Encoding (BPE) tokenizer on the multilingual corpus, with a vocabulary size of 250,880.

The following are some large-scale model applications derived from BLOOM:

  • Xuanyuan (https://arxiv.org/pdf/2305.12002.pdf): a large model for the financial domain. Du Xiaoman performed targeted pre-training and fine-tuning for the general Chinese domain and the financial domain on top of BLOOM-176B.

  • BELLE (https://github.com/LianjiaTech/BELLE): Lianjia instruction-fine-tuned BLOOMZ-7B1-mt using only data produced by ChatGPT.

2. Language model details

2.1 tokenizer comparison

The tokenizers of the above base models have different vocabulary sizes, and the same Chinese text is segmented differently by each of them. We tokenize the 69,000 Chinese-English parallel sentence pairs of news_commentary with each tokenizer and compare the tokenization results and the time consumed; "average number of tokens in Chinese" denotes the average number of tokens produced per Chinese character.

From the results,

  1. LLaMA has the smallest vocabulary and the largest average number of tokens in both Chinese and English, which means its tokenization is the most fragmented and fine-grained. In Chinese in particular, the average number of tokens per character is as high as 1.45, meaning LLaMA very often splits a Chinese character into two or more tokens.

  2. After Chinese LLaMA expands the vocabulary, the average number of tokens per Chinese character drops significantly; one or two Chinese characters are typically encoded as a single token, improving Chinese encoding efficiency.

  3. ChatGLM-6B's tokenizer strikes the best balance between Chinese and English. Because its vocabulary is relatively large, the time spent tokenizing Chinese also increases.

  4. Although BLOOM has the largest vocabulary, its tokenization efficiency for Chinese and English is roughly the same as ChatGLM-6B's, because the vocabulary is multilingual. Note that BLOOM's tokenizer is the BloomTokenizerFast implementation in transformers, so its tokenization is faster.
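A minimal sketch of how such a comparison could be reproduced with Hugging Face tokenizers; the model identifiers and the one-sentence "corpus" are placeholders, not the author's original script:

```python
import time
from transformers import AutoTokenizer

# Hypothetical model identifiers for the tokenizers being compared.
tokenizer_names = {
    "LLaMA": "huggyllama/llama-7b",
    "ChatGLM-6B": "THUDM/chatglm-6b",
    "BLOOM": "bigscience/bloom-7b1",
}

sentences = ["男儿何不带吴钩，收取关山五十州。"]  # replace with the news_commentary corpus

for name, repo in tokenizer_names.items():
    tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    start = time.time()
    n_tokens = sum(len(tok.tokenize(s)) for s in sentences)
    n_chars = sum(len(s) for s in sentences)
    print(f"{name}: {n_tokens / n_chars:.2f} tokens per character, "
          f"{time.time() - start:.2f}s elapsed")
```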

Let us intuitively compare the tokenization results of the different tokenizers on two examples. The first is "男儿何不带吴钩，收取关山五十州。", 16 characters in total. The tokenization results are as follows:

  • LLaMA tokenizes it into 24 tokens:

 [ '男', '<0xE5>', '<0x84>', '<0xBF>', '何', '不', '<0xE5>', '<0xB8>', '<0xA6>', '<0xE5>', '<0x90>', '<0xB4>', '<0xE9>', '<0x92>', '<0xA9>', ',', '收', '取', '关', '山', '五', '十', '州', '。'] 
  • Chinese LLaMA tokenizes it into 14 tokens:

[ '男', '儿', '何', '不', '带', '吴', '钩', ',', '收取', '关', '山', '五十', '州', '。']
  • ChatGLM-6B tokenizes it into 11 tokens:

[ '男儿', '何不', '带', '吴', '钩', ',', '收取', '关山', '五十', '州', '。'] 
  • BLOOM tokenizes it into 13 tokens:

 ['男', '儿', '何不', '带', '吴', '钩', ',', '收取', '关', '山', '五十', '州', '。'] 

"Miscellaneous peppers and fungus osmanthus, Qi Weiren's husband Hui Chou." The length is 15 characters. The word segmentation results of several tokenizers are as follows:

  • LLaMA tokenizes it into 37 tokens:

[ '<0xE6>', '<0x9D>', '<0x82>', '<0xE7>', '<0x94>', '<0xB3>', '<0xE6>', '<0xA4>', '<0x92>', '与', '<0xE8>', '<0x8F>', '<0x8C>', '<0xE6>', '<0xA1>', '<0x82>', '<0xE5>', '<0x85>', '<0xAE>', ',', '<0xE5>', '<0xB2>', '<0x82>', '<0xE7>', '<0xBB>', '<0xB4>', '<0xE7>', '<0xBA>', '<0xAB>', '夫', '<0xE8>', '<0x95>', '<0x99>', '<0xE8>', '<0x8C>', '<0x9D>', '。']
  • Chinese LLaMA tokenizes it into 17 tokens:

[ '杂', '申', '椒', '与', '菌', '桂', '兮', ',', '岂', '维', '纫', '夫', '蕙', '<0xE8>', '<0x8C>', '<0x9D>', '。'] 
  • ChatGLM-6B tokenizes it into 17 tokens:

 [ '杂', '申', '椒', '与', '菌', '桂', '兮', ',', '岂', '维', '纫', '夫', '蕙', '<0xE8>', '<0x8C>', '<0x9D>', '。'] 
  • BLOOM tokenizes it into 17 tokens:

 ['杂', '申', '椒', '与', '菌', '桂', '兮', ',', '岂', '维', '�', '�', '夫', '蕙', '�', '�', '。'] 

As the examples above show, the LLaMA vocabulary contains very few Chinese characters; even the common character "儿" is split into 3 tokens. The vocabularies of Chinese LLaMA, ChatGLM-6B and BLOOM cover most common Chinese characters and also contain some common Chinese words, for example "收取" is encoded as a single token; rare characters such as "茝" are still split into 2-3 tokens. In general, LLaMA usually splits a Chinese character into two or more tokens and its Chinese encoding efficiency is low, while Chinese LLaMA, ChatGLM-6B and BLOOM encode Chinese much more efficiently.

2.2 Layer Normalization

As shown in the figure below, layer normalization can be divided into post layer norm and pre layer norm according to where it is placed.

Post layer norm. In the original transformer, layer normalization is placed after the residual connection, called post LN. Deep transformer models using post LN are prone to training instability: as shown in the figure below, as the number of layers increases, the gradient norm under post LN gradually grows, which destabilizes training.

Pre layer norm. [7] changes the position of layer normalization, placing it inside the residual branch, before the self-attention or FFN block; this is called pre LN. As shown in the figure below, with pre LN the gradient norm is roughly equal across transformer layers, which helps training stability. Compared with post LN, deep transformers with pre LN train more stably, alleviating the instability problem, at the cost of slightly lower model performance. One challenge of large language models is how to improve training stability; to this end, large language models such as GPT-3, PaLM, BLOOM and OPT all use pre layer norm.

The calculation of layer normalization is as follows:

$$\mathrm{LayerNorm}(x) = \frac{x-\mu}{\sigma}\odot\gamma+\beta,\qquad \mu=\frac{1}{d}\sum_{i=1}^{d}x_i,\qquad \sigma=\sqrt{\frac{1}{d}\sum_{i=1}^{d}(x_i-\mu)^2}$$

Layer normalization provides two properties: re-centering (translation) invariance and re-scaling invariance. [8] argues that the success of layer normalization comes from the re-scaling invariance rather than the re-centering invariance. The mean-subtraction step is therefore removed, only the scaling is kept, and the result is simplified into RMS Norm (Root Mean Square Layer Normalization):

$$\mathrm{RMSNorm}(x)=\frac{x}{\mathrm{RMS}(x)}\odot\gamma,\qquad \mathrm{RMS}(x)=\sqrt{\frac{1}{d}\sum_{i=1}^{d}x_i^2}$$

Compared with ordinary layer normalization, RMS norm removes the mean-centering step, computes faster, and performs on par with or even slightly better than layer norm. Large language models such as Gopher, LLaMA and T5 all use RMS norm.
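A minimal PyTorch sketch of RMS norm as described above; the eps handling is a common practical choice, not prescribed by the text:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Root mean square layer normalization: scale-only, no mean subtraction, no bias.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight
```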

[9] proposes Deep Norm to alleviate the problem of exploding model updates, bounding the model update by a constant so that training is more stable. Specifically, Deep Norm up-scales the residual connection by a factor α > 1 before applying layer norm, and down-scales the model parameters by a factor β < 1 at initialization. ChatGLM-6B uses post LN based on Deep Norm.

2.3 Activation function

Each transformer layer consists of two sub-blocks: the self-attention block and the FFN block. The FFN usually first projects the vector from dimension d up to the intermediate dimension 4d and then back from 4d to d:

$$\mathrm{FFN}(x)=f(xW_1)W_2,\qquad W_1\in\mathbb{R}^{d\times 4d},\ W_2\in\mathbb{R}^{4d\times d}$$

Here f(·) is a nonlinear activation function. Widely used activation functions include GELU (Gaussian Error Linear Unit) and Swish. Swish is a self-gated activation function:

$$\mathrm{Swish}_{\beta}(x)=x\cdot\sigma(\beta x)$$

GELU also adjusts its output through a gating mechanism, similar to Swish, and can be approximated with the tanh function or with the sigmoid function σ(·):

$$\mathrm{GELU}(x)=x\cdot\Phi(x)\approx x\cdot\sigma(1.702x)$$

[10] proposes the gated linear unit GLU (Gated Linear Units). An ordinary FFN has only two weight matrices; an FFN with GLU adds a third weight matrix V, which gives better model performance:

$$\mathrm{FFN}_{\mathrm{GLU}}(x)=(\sigma(xW_1)\otimes xV)W_2$$

The GLU variant using the GELU activation function (GeGLU) is:

$$\mathrm{FFN}_{\mathrm{GEGLU}}(x)=(\mathrm{GELU}(xW_1)\otimes xV)W_2$$

The GLU variant using the Swish activation function (SwiGLU) is:

$$\mathrm{FFN}_{\mathrm{SwiGLU}}(x)=(\mathrm{Swish}(xW_1)\otimes xV)W_2$$
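A minimal PyTorch sketch of an FFN with the SwiGLU activation, using the 8d/3 intermediate dimension mentioned earlier; the dimension rounding is an illustrative choice:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    # Gated FFN: (Swish(x W1) * (x V)) W2, with intermediate dim ~ 2/3 * 4d
    # so that the parameter count matches a standard 4d FFN.
    def __init__(self, d_model):
        super().__init__()
        hidden = int(2 * 4 * d_model / 3)
        self.w1 = nn.Linear(d_model, hidden, bias=False)  # gate branch
        self.v = nn.Linear(d_model, hidden, bias=False)   # value branch
        self.w2 = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.v(x))  # F.silu is Swish with beta = 1
```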

2.4 Position encoding

Position encoding is essential for transformer models: the attention module by itself cannot capture the order of the input sequence and cannot distinguish tokens at different positions. Position encoding falls into absolute position encoding and relative position encoding.

The most direct approach is learned position encoding, which treats the position encodings as trainable parameters and learns a position embedding matrix; GPT-3 takes this approach. Its disadvantage is the lack of extrapolation: if the maximum sequence length during training is 2048, the model can only handle sequences up to 2048 tokens at inference time and cannot process anything longer.

Su Jianlin [11] proposed rotary position encoding (RoPE). Learned position encoding acts on the token embeddings, whereas RoPE acts inside the self-attention block of every transformer layer: after Q and K are computed, the rotation is applied to Q and K, and then the attention scores are computed. RoPE realizes relative position encoding by way of absolute position encoding and extrapolates well. Notably, RoPE contains no trainable parameters. Large language models such as LLaMA, GLM-130B and PaLM use RoPE.
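A minimal sketch of applying a rotary embedding to a query/key tensor; this is a simplified single-head version for illustration, and the pairing of channels follows one common convention:

```python
import torch

def apply_rope(x, base=10000.0):
    # x: (seq_len, dim), dim even. Rotates channel pairs (2i, 2i+1) by position-dependent angles.
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)              # (seq_len, 1)
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)   # (dim/2,)
    angles = pos * inv_freq                                                    # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# RoPE is applied to Q and K (not V) before the attention scores are computed:
# q, k = apply_rope(q), apply_rope(k); scores = q @ k.T / dim ** 0.5
```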

ALiBi (Attention with Linear Biases) [12] also acts inside the self-attention block of every transformer layer. As shown in the figure below, after the attention scores are computed, a preset bias matrix is added directly to the score matrix. The bias matrix is fixed, not trainable, and penalizes the attention score according to the relative distance between q and k: the larger the distance, the larger the penalty, so tokens that are far apart contribute less to each other. ALiBi extrapolates well; BLOOM uses this position encoding.
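A minimal sketch of constructing the ALiBi bias for causal attention; the head-slope formula below assumes the number of heads is a power of two, as in the simplest setting of the paper:

```python
import torch

def alibi_bias(n_heads, seq_len):
    # Head-specific slopes: a geometric sequence 2^(-8/n), 2^(-16/n), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    # For causal attention only j <= i is visible, so (j - i) <= 0 and the bias
    # is a penalty proportional to the distance i - j.
    rel = pos[None, :] - pos[:, None]                # (seq_len, seq_len)
    return slopes[:, None, None] * rel[None, :, :]   # (n_heads, seq_len, seq_len)

# scores = q @ k.transpose(-1, -2) / dim ** 0.5 + alibi_bias(n_heads, seq_len) + causal_mask
```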

3. Efficient parameter fine-tuning methods

As large language models grow ever larger, full fine-tuning becomes very expensive: it demands a lot of hardware resources and GPU memory, training is slow and time-consuming, and storage costs are high. Parameter-efficient fine-tuning (PEFT) techniques train only a small number of parameters when fine-tuning a large model, instead of the full set of model parameters. Parameter-efficient fine-tuning has the following advantages:

  • Lower GPU memory usage and lower hardware requirements

  • Training is faster and takes less time

  • Lower storage cost, different tasks can share most of the weight parameters

  • Potentially better model performance, alleviating the overfitting problem

3.1 prompt tuning

The original meaning of prompt tuning [13] is to obtain better model performance by modifying the input prompt. The prompt here is a "hard prompt": we directly modify the input text, which is not differentiable.

Corresponding to "hard prompt", " soft prompt tuning " splices a trainable tensor with embeddings of the input text . This trainable tensor can be optimized by backpropagation to improve the performance of the target task. model effect. The trainable tensor here can be understood as the embedding corresponding to the prompt text, which is a soft prompt. As shown in the figure below, the shape of this trainable tensor is

Prompt tuning freezes the original parameters of the large model and trains only this newly added prompt tensor. The effect of prompt tuning improves as the base model grows larger.
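A minimal PyTorch sketch of the soft-prompt idea: a trainable tensor of virtual-token embeddings is prepended to the input embeddings while the base model stays frozen; the shapes and initialization are illustrative:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    # Prepends n_tokens trainable "virtual token" embeddings to the input embeddings.
    def __init__(self, n_tokens, d_model):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds):  # input_embeds: (batch, seq_len, d_model)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)  # fed to the frozen transformer
```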

3.2 prefix tuning

Prefix tuning [14] is similar to prompt tuning: a task-specific trainable tensor is added while the parameters of the pre-trained model are kept unchanged. The main differences are as follows:

1. Prefix tuning adds the prefix parameters (trainable tensors) to every transformer layer, while prompt tuning only adds a trainable matrix to the input embeddings. Specifically, prefix tuning injects the prefix tensor into all transformer layers as past_key_value.

2. A separate FFN is used to encode and optimize the prefix parameters, rather than optimizing the soft prompt directly, because direct optimization can be unstable and hurt performance. After training, the FFN is discarded and only the resulting prefix is kept.

The relationship between prompt tuning and prefix tuning is somewhat like that between learned position encoding and rotary position encoding RoPE: the former acts directly on the input embeddings, while the latter acts in the self-attention block of every transformer layer, where the trainable prefix tensor is concatenated with K and V after they are computed.

3.3 adapter

The adapter [15] is somewhat similar to prefix tuning in that both add extra trainable parameters to every transformer layer. The difference is in where the parameters go: prefix tuning prepends a trainable prefix, while the adapter method inserts small adapter layers at two positions inside each transformer layer, as shown in the figure below.

3.4 LLaMA-Adapter

LLaMA-Adapter [16] combines ideas from prefix tuning and the adapter. Similar to prefix tuning, LLaMA-Adapter adds a trainable prompt tensor to the input embeddings. Note that the prefix is learned and maintained as an embedding matrix rather than given externally, and each transformer layer has its own learnable prefix, allowing more tailored adaptation of different layers.

As shown in the figure above, LLaMA-Adapter introduces a zero-initialized attention mechanism with gating. The motivation is that adapter layers and prefix prompts with randomly initialized tensors are likely to disturb the semantic knowledge of the pre-trained language model, causing unstable fine-tuning and large performance loss early in training.

Another important difference is that LLaMA-Adapter adds learnable adaptation prompts only to the L deepest transformer layers rather than all of them. The authors argue that this more effectively tunes the language representations that focus on high-level semantic information.

The basic idea of LLaMA-Adapter is related to prefix tuning in that both add trainable soft prompts, but there are subtle differences: when the soft prompt is added, only the key and value sequences are modified, not the query. In addition, a gating factor (initialized to 0 at the start of training) decides how much the prefix-modified attention is used.

In general, the main differences between LLaMA-Adapter and prefix tuning are that LLaMA-Adapter modifies only the L deepest transformer layers and introduces a gating mechanism for stable training. Although LLaMA-Adapter was evaluated on LLaMA, it is applicable to other GPT-style large models.

As shown in the figure below, each transformer layer of LLaMA-Adapter has its own learnable parameters: a prefix tensor of shape (prefix length) × (embedding dimension) and a gating factor.

3.5 LoRA

The transformer model contains many dense layers that perform matrix multiplication. Some work has found that pre-trained language models have a low intrinsic dimension and can still learn efficiently even when projected into a smaller subspace. Based on this, LoRA (Low-Rank Adaptation) [17] assumes that the weight updates during adaptation/fine-tuning also have a low intrinsic rank.

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA keeps $W_0$ frozen and represents its update with a low-rank decomposition:

$$h = W_0 x + \Delta W x = W_0 x + BAx,\qquad B\in\mathbb{R}^{d\times r},\ A\in\mathbb{R}^{r\times k},\ r\ll\min(d,k)$$

For LLaMA-7B, LoRA injects the low-rank bypass module into the query and value projections of the self-attention block of each transformer layer; only the low-rank matrices of the bypass are trainable.
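A minimal PyTorch sketch of the LoRA bypass around a frozen linear layer; the rank, scaling and initialization values are illustrative choices, not the exact settings used in the experiments below:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen base weight W0 plus a trainable low-rank bypass B @ A, scaled by alpha / r.
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: delta starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```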

4. Efficient parameter fine-tuning practice

The PEFT library from Hugging Face implements the parameter-efficient fine-tuning methods described above. We try the different methods on the three base models LLaMA-7B, ChatGLM-6B and BLOOM-7B and compare their fine-tuning results.
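A minimal sketch of wrapping a base model with PEFT for LoRA fine-tuning; the model path and hyperparameters are placeholders, and swapping in PromptTuningConfig or PrefixTuningConfig selects the other methods:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_path = "path/to/llama-7b"  # placeholder; the same pattern applies to ChatGLM-6B / BLOOM-7B
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)

# LoRA on the attention projections; for prompt/prefix tuning use
# PromptTuningConfig / PrefixTuningConfig with num_virtual_tokens instead.
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # module names in the LLaMA implementation
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable
```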

Training is performed on a single machine with 8 × 40GB A100 GPUs. To save GPU memory and speed up training, ZeRO stage 3 and CPU offloading from the DeepSpeed framework are used, with float16 mixed-precision training enabled and without activation recomputation. The per-GPU batch size is 4, the maximum sequence length is 512, the number of gradient accumulation steps is 4, and the learning rate is 1e-4. Each run trains for 3 epochs on the alpaca Chinese data.
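A sketch of a DeepSpeed configuration matching that description, written as a Python dict; values mirror the text, and any key not mentioned there is a common default and an assumption on my part:

```python
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},                   # float16 mixed precision
    "zero_optimization": {
        "stage": 3,                              # ZeRO stage 3
        "offload_optimizer": {"device": "cpu"},  # CPU offload
        "offload_param": {"device": "cpu"},
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-4},
    },
}
# Passed to the trainer or to deepspeed.initialize() as the DeepSpeed config.
```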

The figure below shows the loss curves of LLaMA-7B under the parameter-efficient fine-tuning methods and full fine-tuning. The acc metric measures the accuracy of the language model in predicting the next token. From the loss curves, for the LLaMA-7B base model, prompt tuning and prefix tuning perform relatively poorly; in terms of training loss, prompt tuning is even better than prefix tuning. This may be because prefix tuning injects randomly initialized tensors into every transformer layer, which is likely to disturb the semantic knowledge of the pre-trained language model and hurt fine-tuning. Compared with prompt tuning and prefix tuning, LLaMA-Adapter improves noticeably, probably thanks to its gating mechanism for stable training. Among the parameter-efficient methods, LoRA works best; in terms of validation loss, LoRA is only slightly worse than full-parameter fine-tuning. Overall, for LLaMA-7B: full fine-tuning > LoRA > LLaMA-Adapter > prompt tuning > prefix tuning.

The figure below shows the loss curves of Chinese LLaMA-7B under parameter-efficient fine-tuning and full fine-tuning. Chinese LLaMA-7B is obtained by further pre-training the LLaMA-7B base model. The relative ordering of the lightweight fine-tuning methods is basically the same as for LLaMA-7B: prompt tuning and prefix tuning perform relatively poorly, LLaMA-Adapter improves noticeably, and LoRA is basically on par with full-parameter fine-tuning.

The figure below shows the loss curves of ChatGLM-6B under parameter-efficient fine-tuning and full fine-tuning. From the loss curves, LoRA even outperforms full-parameter fine-tuning.

The figure below shows the loss curves of BLOOM under parameter-efficient fine-tuning and full fine-tuning. For the BLOOM-7B base model, prefix tuning performs slightly better than prompt tuning. In terms of validation loss, prompt tuning and prefix tuning are even basically on par with full-parameter fine-tuning, and LoRA achieves a lower validation loss than full-parameter fine-tuning.

Summarizing the fine-tuning loss curves of the four base models: for all of them, full-parameter fine-tuning is prone to overfitting. As the number of training epochs increases, the training loss keeps decreasing in a stepwise fashion, but the validation loss starts to rise from the third epoch. In contrast, the parameter-efficient fine-tuning methods train only a very small number of new parameters and freeze the pre-trained language model, which effectively avoids overfitting: their validation loss curves basically overlap with their training loss curves and follow the same trend.

Having compared the different parameter-efficient fine-tuning methods, which base model fine-tunes better? The tokenizers and vocabulary sizes of the base models differ, so their training losses are not directly comparable. The figure below shows the loss curves of LoRA fine-tuning on the different base models; since the losses are not comparable, this comparison has little practical meaning.

As shown in the figure below, we can instead compare the different base models and parameter-efficient fine-tuning methods on concrete examples. For the LLaMA and Chinese LLaMA models, prompt tuning and prefix tuning perform very poorly and cannot generate fluent Chinese replies. Although LLaMA's pre-training data contains no Chinese, it can still generate decent Chinese replies after instruction fine-tuning. In addition, the replies generated with LoRA fine-tuning are noticeably better than those of LLaMA-Adapter.

Compared with LLaMA, Chinese LLaMA with the expanded vocabulary generates better replies after fine-tuning, which again confirms that expanding the vocabulary helps improve LLaMA's performance on Chinese.

For BLOOM, both prompt tuning and prefix tuning generate fluent Chinese replies, and the replies from prefix tuning are better than those from prompt tuning. The ChatGLM-6B model generates short, relatively simple replies.

5. Summary

This article first compared the three mainstream open-source large language models LLaMA, ChatGLM and BLOOM in terms of training data, tokenizer and model structure, and introduced the models derived from these three bases; it then went into the model details of tokenizer, layer normalization, activation function and position encoding; next it described the parameter-efficient fine-tuning methods prompt tuning, prefix tuning, LLaMA-Adapter and LoRA; finally, it compared the effects of the different base models and the different fine-tuning methods.

6. Reference Links

1. Zhou C, Liu P, Xu P, et al. Lima: Less is more for alignment[J]. arXiv preprint arXiv:2305.11206, 2023.

2. Touvron H, Lavril T, Izacard G, et al. Llama: Open and efficient foundation language models[J]. arXiv preprint arXiv:2302.13971, 2023.

3. Hoffmann J, Borgeaud S, Mensch A, et al. Training compute-optimal large language models[J]. arXiv preprint arXiv:2203.15556, 2022.

4. Zeng A, Liu X, Du Z, et al. Glm-130b: An open bilingual pre-trained model[J]. arXiv preprint arXiv:2210.02414, 2022.

5. Scao T L, Fan A, Akiki C, et al. Bloom: A 176b-parameter open-access multilingual language model[J]. arXiv preprint arXiv:2211.05100, 2022.

6. Cui Y, Yang Z, Yao X. Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca[J]. arXiv preprint arXiv:2304.08177, 2023.

7. Xiong R, Yang Y, He D, et al. On layer normalization in the transformer architecture[C]//International Conference on Machine Learning. PMLR, 2020: 10524-10533.

8. Zhang B, Sennrich R. Root mean square layer normalization[J]. Advances in Neural Information Processing Systems, 2019, 32.

9. Wang H, Ma S, Dong L, et al. Deepnet: Scaling transformers to 1,000 layers[J]. arXiv preprint arXiv:2203.00555, 2022.

10. Shazeer N. Glu variants improve transformer[J]. arXiv preprint arXiv:2002.05202, 2020.

11. Su J, Lu Y, Pan S, et al. Roformer: Enhanced transformer with rotary position embedding[J]. arXiv preprint arXiv:2104.09864, 2021.

12. Press O, Smith N A, Lewis M. Train short, test long: Attention with linear biases enables input length extrapolation[J]. arXiv preprint arXiv:2108.12409, 2021.

13. Lester B, Al-Rfou R, Constant N. The power of scale for parameter-efficient prompt tuning[J]. arXiv preprint arXiv:2104.08691, 2021.

14. Li X L, Liang P. Prefix-tuning: Optimizing continuous prompts for generation[J]. arXiv preprint arXiv:2101.00190, 2021.

15. Houlsby N, Giurgiu A, Jastrzebski S, et al. Parameter-efficient transfer learning for NLP[C]//International Conference on Machine Learning. PMLR, 2019: 2790-2799.

16. Zhang R, Han J, Zhou A, et al. Llama-adapter: Efficient fine-tuning of language models with zero-init attention[J]. arXiv preprint arXiv:2303.16199, 2023.

17. Hu E J, Shen Y, Wallis P, et al. Lora: Low-rank adaptation of large language models[J]. arXiv preprint arXiv:2106.09685, 2021.
