Large Language Model (LLM) Fine-Tuning: Experience Sharing & Summary

Cong Liu NLP

Preface

Large language models are everywhere these days. At first they caused plenty of anxiety, but by now they have been fully embraced, and many open source projects for fine-tuning them have appeared. Having worked on large models for a while myself, I would like to share my experience fine-tuning the ChatGLM-6B model and summarize the current open source projects and data. My conclusion differs from what many people report: I fine-tuned the model with a single instruction and found that, after fine-tuning, there was no catastrophic forgetting.

Project address: github.com/liucongg/ChatGLM-Finetuning

Update 2023.04.18: added text generation task evaluation

ChatGLM-6B model fine-tuning

The larger the model, the higher the demands on the GPU. At present there are three mainstream methods for fine-tuning large models: the Freeze method, the P-Tuning method, and the LoRA method. I use all three to fine-tune the ChatGLM-6B model on an information extraction task. To reduce the risk that the data has already leaked into the large model's training corpus, a domain competition dataset (automobile industry failure mode relation extraction) is used, and 50 entries are randomly selected as the test set.

See the GitHub link above for the detailed code; the project is also officially listed by ChatGLM.

Freeze method

The Freeze method, i.e. parameter freezing, freezes part of the original model's parameters and trains only the remainder, so that the large model can be trained on a single GPU without tensor parallelism (TP) or pipeline parallelism (PP).

For the fine-tuning code, see finetuning_freeze.py; the core part is as follows:

# Freeze everything except the last five transformer layers of ChatGLM-6B (layers 23-27).
for name, param in model.named_parameters():
    if not any(nd in name for nd in ["layers.27", "layers.26", "layers.25", "layers.24", "layers.23"]):
        param.requires_grad = False
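
As a quick sanity check (my own snippet, not part of the repository), you can count how many parameters remain trainable after freezing:

def count_trainable_parameters(model):
    # Verify that the freeze actually took effect before launching training.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

count_trainable_parameters(model)  # with the last five layers unfrozen, expect roughly 16%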

Which layers are trained can be changed by editing the list above. All training scripts use DeepSpeed. The configurable arguments include train_path, model_dir, num_train_epochs, train_batch_size, gradient_accumulation_steps, output_dir, prompt_text, etc., and can be set according to your own task.

CUDA_VISIBLE_DEVICES=0 deepspeed finetuning_freeze.py --num_train_epochs 5 --train_batch_size 2

For the triple-extraction inference code, see predict_freeze.py; other tasks can run inference and be evaluated according to their own criteria.

PT method

The PT method, i.e. the P-Tuning method, follows the official ChatGLM code and is a soft-prompt method for large models.

  • P-Tuning adds new trainable parameters only to the Embedding layer of the large model.
  • P-Tuning-V2 adds new trainable parameters to the Embedding layer and before every transformer layer.

For the fine-tuning code, see finetuning_pt.py; the core part is as follows:

config = ChatGLMConfig.from_pretrained(args.model_dir)
config.pre_seq_len = args.pre_seq_len              # length of the soft prompt (number of virtual tokens)
config.prefix_projection = args.prefix_projection  # True: P-Tuning-V2, False: P-Tuning

model = ChatGLMForConditionalGeneration.from_pretrained(args.model_dir, config=config)

# Train only the prefix encoder; all original model weights stay frozen.
for name, param in model.named_parameters():
    if not any(nd in name for nd in ["prefix_encoder"]):
        param.requires_grad = False

When prefix_projection is True, this is the P-Tuning-V2 method: new parameters are added to the Embedding layer and before every transformer layer. When it is False, this is the P-Tuning method: new parameters are added only to the Embedding layer.
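
For intuition, here is a minimal conceptual sketch of a prefix encoder in the spirit of the P-Tuning family (the class name, arguments and wiring are mine and purely illustrative; see the official modeling code for the real implementation). It shows the role of pre_seq_len and how prefix_projection toggles an extra reparameterization network:

import torch.nn as nn

class PrefixEncoderSketch(nn.Module):
    # Maps pre_seq_len virtual token ids to prefix vectors of size prefix_dim.
    def __init__(self, pre_seq_len, hidden_size, prefix_dim, prefix_projection=False):
        super().__init__()
        self.prefix_projection = prefix_projection
        if prefix_projection:
            # Reparameterized variant: a small embedding followed by an MLP.
            self.embedding = nn.Embedding(pre_seq_len, hidden_size)
            self.trans = nn.Sequential(nn.Linear(hidden_size, hidden_size),
                                       nn.Tanh(),
                                       nn.Linear(hidden_size, prefix_dim))
        else:
            # Plain soft prompt: the embedding table itself is the only new parameter.
            self.embedding = nn.Embedding(pre_seq_len, prefix_dim)

    def forward(self, prefix_ids):  # prefix_ids: (batch, pre_seq_len)
        prompt = self.embedding(prefix_ids)
        return self.trans(prompt) if self.prefix_projection else prompt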

The configurable arguments include train_path, model_dir, num_train_epochs, train_batch_size, gradient_accumulation_steps, output_dir, prompt_text, pre_seq_len, etc., and can be set according to your own task.

CUDA_VISIBLE_DEVICES=0 deepspeed finetuning_pt.py --num_train_epochs 5 --train_batch_size 2 --pre_seq_len 16

For the triple-extraction inference code, see predict_pt.py; other tasks can run inference and be evaluated according to their own criteria.

LoRA method

The LoRA method adds extra low-rank matrices alongside specified weight matrices of a large language model, and only these added parameters are trained. When the rank r is much smaller than the original weight dimensions, the number of newly added parameters is very small, so only a small fraction of parameters needs to be trained while still achieving good results.
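
For intuition, here is a minimal LoRA sketch (illustrative only, not the peft implementation): the frozen base weight stays untouched, and only the low-rank factors A and B of rank r are trained, with their product added to the base output scaled by alpha / r.

import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen nn.Linear and adds a trainable low-rank update of rank r.
    def __init__(self, base: nn.Linear, r=8, alpha=32, dropout=0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # the original weights are never updated
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)       # start as a no-op: B @ A = 0
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.base(x) + self.lora_B(self.lora_A(self.dropout(x))) * self.scaling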

  • LoRA paper: Paper
  • Official code: GitHub
  • HuggingFace's peft library: GitHub

For the fine-tuning code, see finetuning_lora.py; the core part is as follows:

model = ChatGLMForConditionalGeneration.from_pretrained(args.model_dir)
config = LoraConfig(r=args.lora_r,                       # rank of the low-rank matrices
                    lora_alpha=32,                       # scaling factor
                    target_modules=["query_key_value"],  # inject LoRA into the attention projection
                    lora_dropout=0.1,
                    bias="none",
                    task_type="CAUSAL_LM",
                    inference_mode=False,
                    )

model = get_peft_model(model, config)
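
After wrapping, peft can report how few parameters are actually trained (a one-line check; the exact output format may vary by peft version):

model.print_trainable_parameters()  # prints trainable params / all params / trainable %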

The configurable arguments include train_path, model_dir, num_train_epochs, train_batch_size, gradient_accumulation_steps, output_dir, prompt_text, lora_r, etc., and can be set according to your own task.

CUDA_VISIBLE_DEVICES=0 deepspeed finetuning_lora.py --num_train_epochs 5 --train_batch_size 2 --lora_r 8

For the triple-extraction inference code, see predict_lora.py; other tasks can run inference and be evaluated according to their own criteria.

Note: For tasks that require consistent results (i.e. dropout off, and do_sample off during decoding), you need to set the inference_mode parameter to false in the saved adapter_config.json file and call model.eval(). The main reason is that the ChatGLM model code does not use the Conv1D function.
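
For example, loading the saved LoRA adapter for deterministic inference might look like the hedged sketch below (paths and the query are placeholders; the repo's predict_lora.py is the authoritative version):

import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
base = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
model = PeftModel.from_pretrained(base, "output_dir_lora/")  # reads adapter_config.json (inference_mode set to false)
model.eval()                                                 # switch off dropout for repeatable outputs

with torch.no_grad():
    query = "你现在是一个信息抽取模型,请你帮我抽取出相关三元组。文本:故障现象:发动机水温高,风扇始终是低速转动。"
    inputs = tokenizer(query, return_tensors="pt").to("cuda")
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy decoding
    print(tokenizer.decode(output[0], skip_special_tokens=True))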

Triplet Extraction Experimental Results

  • During training, the maximum sequence length is 768, the batch size is 2, the number of epochs is 5, fp16 training is used, with DeepSpeed ZeRO-1;
  • PT is the official P-Tuning-V2 training method; PT-Only-Embedding applies the soft prompt to the Embedding layer only; Freeze trains only the parameters of the last five layers of the model; LoRA trains low-rank matrices with rank 8;
  • Because the earlier PT training ran out of memory on a 48G A40 GPU, gradient_checkpointing_enable was turned on for that earlier PT experiment, which lowers the model's memory usage but increases training time (a minimal sketch of enabling it follows the training example below);
  • Training example:
prompt_text: 你现在是一个信息抽取模型,请你帮我抽取出关系内容为\"性能故障\", \"部件故障\", \"组成\"和 \"检测工具\"的相关三元组,三元组内部用\"_\"连接,三元组之间用\\n分割。文本:
Input (输入): 故障现象:发动机水温高,风扇始终是低速转动,高速档不工作,开空调尤其如此。
Output (输出): 发动机_部件故障_水温高\n风扇_部件故障_低速转动
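
As mentioned above, the memory-for-time trade-off in the earlier PT run comes from gradient checkpointing. A minimal sketch of enabling it (assuming the model implementation supports it; the surrounding training loop is the repo's own):

model.gradient_checkpointing_enable()   # recompute activations in the backward pass instead of storing them
model.enable_input_require_grads()      # usually needed so checkpointed inputs keep their gradients
model.config.use_cache = False          # the KV cache is incompatible with checkpointing during training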

Trading time for memory is a good way around limited GPU resources, and it is perfectly fine for casual experimentation. But if you want the model to reach its best results, or want to see results quickly, it is better to rent an A100, run quick experiments there, and then use your own modest card for the inference stage.

The following experiments were all run on a rented 80G A100, so the results differ somewhat from the A40 numbers on GitHub, mainly in training time (pure training speed, excluding the time spent saving the model). Honestly, if you really want to train large models, multiple A100s are essential; they remove a lot of model-parallel plumbing and make the results easier to control.

Fine-tuning method        | PT-Only-Embedding | PT     | Freeze | LoRA
GPU memory usage          | 37G               | 56G    | 24G    | 39G
Total parameters          | 6.259B            | 7.211B | 6.255B | 6.259B
Trainable parameter ratio | 0.0586%           | 13.26% | 16.10% | 0.0586%
Training time             | 20 min            | 52 min | 46 min | 25 min
Test F1                   | 0.0               | 0.6283 | 0.5675 | 0.5359

Result analysis:

  • Effect: PT > Freeze > LoRA > PT-Only-Embedding;
  • Speed: PT-Only-Embedding > LoRA > Freeze > PT;
  • The result of PT-Only-Embedding is very unsatisfactory: during training its loss only converges to around 2.x, while the other methods converge to around 0.x. The likely reason is that the output format differs greatly from the original language-modeling task, and adding extra Embedding parameters alone is not enough for a complex downstream task;
  • The PT method uses more GPU memory because it adds a large number of extra parameters;
  • Regarding inference time, float16 is used for inference. Since the other methods add extra parameters, their inference is slower than the Freeze method. And since this is a generative model, the length of the generated output also affects latency;
  • After fine-tuning on the specified task, the model does not lose its original abilities. For example, when prompted with "help me write a quick sort algorithm", it can still generate quick-sort code;
  • Because large models are trained with a huge number of instructions, fine-tuning on a single new instruction has little effect on the other, original instructions, so it does not destroy the model's original abilities;
  • The above represent my personal test results only.

Many people experienced catastrophic forgetting after fine-tuning, but it did not happen in my case. Using the Freeze-tuned model I tested a translation task, a code task and a question-answering task, which can be reproduced with test_forgetting.py (a rough spot-check sketch follows the list below). The specific results are as follows:

  • translation task

  • code task

  • question answering task
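
A rough way to run this kind of spot check yourself (my own loading code and probes, not the repo's test_forgetting.py; the checkpoint path is a placeholder):

from transformers import AutoModel, AutoTokenizer

# Point these at the fine-tuned checkpoint directory (or the base model plus your saved weights).
tokenizer = AutoTokenizer.from_pretrained("output_dir_freeze/", trust_remote_code=True)
model = AutoModel.from_pretrained("output_dir_freeze/", trust_remote_code=True).half().cuda().eval()

probes = [
    "请把下面的句子翻译成英文:今天天气很好。",  # translation task
    "帮我写一个快速排序算法。",                  # code task
    "中国的首都是哪里?",                        # question-answering task
]
for q in probes:
    response, _history = model.chat(tokenizer, q, history=[])  # ChatGLM's chat() helper
    print(q, "->", response)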

The generation and classification tasks will be added later; please keep following the GitHub repo, which is updated regularly. (I am rather busy, but will update as soon as I can. The official code is also being updated continuously; if you run into a case where the code no longer works, please contact me promptly. The code version and model version I used are also given on GitHub.)

Text generation

  • To reduce the risk that the data has leaked into the large model's training corpus, the "Wanchuang Cup" Traditional Chinese Medicine Tianchi Big Data Competition (Traditional Chinese Medicine Literature Question Generation Challenge) dataset was used, and 20 entries were randomly selected as the test set;
  • PT is the official P-Tuning-V2 training method; PT-Only-Embedding applies the soft prompt to the Embedding layer only; Freeze trains only the parameters of the last five layers of the model; LoRA trains low-rank matrices with rank 8;
  • Training example:
prompt_text: 你现在是一个问题生成模型,请根据下面文档生成一个问题,文档:
Input (输入): 紫色红薯是近年从日本引进的新品种红薯,中国农业大学农学与生物技术学院副院长刘庆昌指出,紫薯中的花青素具有显著的抗生物氧化作用,在延缓人体衰老方面具有非常好的效果。紫薯中所含赖氨酸、铜、锰、钾、锌的含量高于一般红薯5-8倍,尤其是抗癌物质碘、硒的含量比其他红薯高出20倍以上,占食物中的第一位。
Output (输出): 紫薯和红薯吃哪个好?

For model training, take the Freeze method as an example:

CUDA_VISIBLE_DEVICES=0 nohup deepspeed --master_port 5555 finetuning_freeze.py --train_path "data/d2q_0.json" --output_dir "output_dir_freeze/" --prompt_text "你现在是一个问题生成模型,请根据下面文档生成一个问题,文档:" > log_fz.log 2>&1 &

Since generated text cannot be evaluated the way information extraction output can, and existing metrics such as BLEU or ROUGE are not a good fit either, I drew up my own scoring rules. The quality of the D2Q (document-to-question) model is judged from two angles, diversity and accuracy. Each sample is worth 5 points in total, and there are 20 samples (a small scoring sketch follows the list below).

  • Diversity:
    • Whether the questions are highly similar to each other: 0.25 points are deducted for each repeated question;
    • Whether the questions share the same answer: 0.25 points are deducted for each repeated answer or question whose answer cannot be found;
  • Accuracy:
    • Whether the question can be answered from the document: 0.25 points are deducted for each question without an answer in the document;
    • Whether the question reads fluently: 0.25 points are deducted for each disfluent question;
    • Whether the question contains harmful content: 0.25 points are deducted for each harmful question;
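
A small sketch of how these deductions add up (the field names are mine; the counts come from manual annotation):

def d2q_score(annotated_samples):
    # Each sample starts at 5 points; every flagged issue costs 0.25 points.
    total = 0.0
    for s in annotated_samples:
        deductions = 0.25 * (s.get("repeated_question", 0)
                             + s.get("repeated_or_missing_answer", 0)
                             + s.get("not_answerable_from_doc", 0)
                             + s.get("disfluent", 0)
                             + s.get("harmful", 0))
        total += max(5.0 - deductions, 0.0)
    return total  # 20 samples x 5 points = 100 at most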

See d2q_result_data/ for the test data and predict_d2q.py for the test code.

Fine-tuning method | Original model | PT-Only-Embedding | PT    | Freeze | LoRA
Score              | 51.75          | 73.75             | 87.75 | 79.25  | 86.75

Chinese open source models & projects

Although many large models have been released, not many are both open and Chinese-ready enough to use directly. Below is a summary of Chinese open source large models, datasets, and projects.

Chinese open source models

Models that can be fine-tuned directly, without incremental training on Chinese instruction data:

Models whose base is multilingual or English and that therefore require incremental training on a Chinese instruction dataset:

Chinese open source instruction data

Most of the following Chinese instruction datasets are translated from Alpaca; see the data directory in each project. At the moment, using ChatGPT or GPT-4 as a cheap annotator to label your own data is a good option; a hedged sketch follows.
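
A hedged sketch of calling ChatGPT as an annotator (this uses the pre-1.0 openai package interface; the prompt, model name and key are placeholders):

import openai  # pip install "openai<1.0" for this interface

openai.api_key = "YOUR_API_KEY"

def label_with_chatgpt(text):
    # Ask the model to produce an instruction-style label for one raw text sample.
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "你是一个数据标注助手,请为下面的文本生成一条指令和对应回答。"},
            {"role": "user", "content": text},
        ],
        temperature=0.2,
    )
    return resp["choices"][0]["message"]["content"]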

Open source projects

A summary of popular open source projects:

Summary

At the moment, the major vendors' large models are being released one after another; it is truly a case of a hundred schools of thought contending. Individual players are embracing them fully too, trying every possible way to train and fine-tune large models. I only hope that in the future everyone can achieve "large model" freedom, and that we will no longer be limited to "model-as-a-service".


Origin: blog.csdn.net/sinat_37574187/article/details/132606834