ChatGLM2 of LLMs: Introduction and usage of ChatGLM-Finetuning (based on DeepSpeed): four fine-tuning methods (Freeze / LoRA / P-Tuning / full parameters) + single-card/multi-card training settings + video memory usage comparison, plus a detailed case application (4× A800-80G + the ChatGLM-6B model + full parameters + pipeline implementation on the DeepSpeed framework with ZeRO-3 model partitioning)

Table of contents

related articles

Related papers

GLM-130B/ChatGLM of LLMs: Translation and interpretation of "GLM-130B: AN OPEN BILINGUAL PRE-TRAINED MODEL"

ChatGLM2 of LLMs: Detailed guide to the introduction, installation and usage of ChatGLM2-6B

Practical cases

LLMs: Teach you step by step from beginning to end how to use the ChatGLM-6B model for training, deployment, inference (CLI/Gradio interactive interface), and fine-tuning (two efficiency techniques [mixed precision + ZeRO zero-redundancy optimization] + three fine-tuning methods [full fine-tuning / P-tuning v2, which changes the parameter distribution / LoRA, whose low-rank approximation reduces the number of parameters that need updating]): a detailed graphic tutorial

ChatGLM of LLMs: Based on the Langchain framework, use text2vec-large-chinese + the ChatGLM large model (Docker deployment) to access a local knowledge base (build the local knowledge base / segmentation / vectorization + for each question [Embedding + vectorization + matching TopK as context] = generate a Prompt to feed the large model → LLM response), a detailed guide to implementing the question-answering project (CLI/WebUI/VUE) with graphic tutorials

ChatGLM of LLMs: Detailed guide on the introduction, installation and usage of ChatGLM Efficient Tuning (a tool for efficient fine-tuning of ChatGLM-6B/ChatGLM2-6B [LoRA/P-Tuning])

ChatGLM2 of LLMs: Single-machine inference for local ChatGLM2-6B deployment (API/CLI/GUI), low-cost deployment (quantized GPU deployment / CPU deployment with quantization / Mac deployment / multi-card deployment), efficient fine-tuning under limited resources (full parameters / P-tuning v2), and a detailed graphic-tutorial guide to model evaluation and inference

ChatGLM2 of LLMs: Based on ChatGLM Efficient Tuning (fine-tuning toolkit), detailed strategy for LoRA fine-tuning and inference testing of ChatGLM2 with graphic tutorials

LLMs: LLaMA Efficient Tuning (an efficient tool for fine-tuning [full parameters/LoRA/QLoRA] mainstream large models [ChatGLM2/LLaMA2/Baichuan, etc.] across [pre-training + supervised instruction fine-tuning + reward model training + PPO training + DPO training]): detailed introduction, installation and usage instructions

LLMs: Detailed guide to the introduction, installation and usage of LangChain-Chachat (a question-and-answer application over a local knowledge base)

ChatGLM2 of LLMs: Introduction to ChatGLM-Finetuning, usage methods (four fine-tuning methods (Freeze method/Lora method/P-Tuning method/full parameters) + single-card/multi-card training settings + video memory resource usage comparison), case applications Detailed strategy

ChatGLM2 of LLMs: ChatGLM-Finetuning source code interpretation (train.py file) - parse the command line (model path + dataset settings [maximum sequence length / maximum input length] + training hyperparameters [batch size / learning rate / weight decay coefficient / number of epochs / gradient accumulation steps / learning-rate warm-up ratio] + output settings [output path / training method (Freeze / Lora / P-Tuning / full parameters) / progress flag / loss display frequency / model save frequency] + whether to enable gradient checkpointing + DeepSpeed configuration + LoRA/Freeze/P-Tuning configuration) and initialization (whether to enable distributed GPUs + load DeepSpeed configuration + log writer) → load data (tokenizer / training set) → model training (load the optimizer and learning-rate scheduler + optionally enable gradient checkpointing + wrap the model and optimizer with DeepSpeed [which handles data parallelism] + run the training epoch loop [iterate over training data + compute loss + backpropagation + gradient clipping]) → model saving (periodically log the training loss and save the model; merge the sharded parameters when saving under ZeRO-3)

Introduction to ChatGLM-Finetuning

Installation and usage of ChatGLM-Finetuning

1. Configure the environment

1.1. Download project code

1.2. Configure the environment: required package versions

2. Fine-tuning methods: fine-tuning on 4× A800-80G takes at most about 2 hours

T1, Freeze method: set train_type=freeze and the freeze_module_name parameter

(1) ChatGLM single-card training

(2) ChatGLM four-card training: set the CUDA_VISIBLE_DEVICES parameter

(3) ChatGLM2 single-card training

(4) ChatGLM2 four-card training: set the CUDA_VISIBLE_DEVICES parameter

(5) Comparison of video memory consumption with the Freeze method: ChatGLM vs. ChatGLM2

T2, PT method (i.e. P-Tuning and P-Tuning v2): set train_type=ptuning (prefix_projection=True selects P-Tuning v2)

(1) ChatGLM single-card training

(2) ChatGLM four-card training

(3) ChatGLM2 single-card training

(4) ChatGLM2 four-card training

(5) Comparison of video memory consumption with the PT method: ChatGLM vs. ChatGLM2

T3, LoRA method: set train_type=lora with lora_dim=16, lora_alpha=64, lora_dropout=0.1, lora_module_name="query_key_value"

(1) ChatGLM single-card training

(2) ChatGLM four-card training

(3) ChatGLM2 single-card training

(4) ChatGLM2 four-card training

(5) Comparison of video memory consumption with the LoRA method: ChatGLM vs. ChatGLM2

T4, full-parameter method (DeepSpeed ZeRO-3 [model parameters sharded across cards] + Offload [optimizer states offloaded to the CPU]): set train_type=all

(1) ChatGLM four-card training

(2) ChatGLM2 four-card training

(3) Comparison of video memory consumption with the full-parameter method: ChatGLM vs. ChatGLM2

Comparison of the four fine-tuning methods: PT-only-Embedding (37G video memory, 53m training + 3m testing), PT (30G, 135m + 3m), Freeze (24G, 112m + 3m), LoRA (39G, 65m + 4m)

Case application of ChatGLM-Finetuning

1. Pipeline-parallel training case: 4× A800-80G + the ChatGLM-6B model + full parameters + pipeline implementation on the DeepSpeed framework (ZeRO-3 model partitioning)

(1) Training stage: Use triplet data for model training and testing

(2) Inference stage: model conversion script

(3) Comparison of results

Real-time video memory usage per card: 55/80G, 80/80G, 80/80G, and 70/80G

The F1 score is around 0.59


related articles

Related papers

GLM-130B/ChatGLM of LLMs: Translation and interpretation of "GLM-130B: AN OPEN BILINGUAL PRE-TRAINED MODEL"

ChatGLM2 of LLMs: Detailed guide to the introduction, installation and usage of ChatGLM2-6B

Practical cases

LLMs: Teach you step by step from beginning to end how to use the ChatGLM-6B model for training, deployment, inference (CLI/Gradio interactive interface), and fine-tuning (two efficiency techniques [mixed precision + ZeRO zero-redundancy optimization] + three fine-tuning methods [full fine-tuning / P-tuning v2, which changes the parameter distribution / LoRA, whose low-rank approximation reduces the number of parameters that need updating]): a detailed graphic tutorial

https://yunyaniu.blog.csdn.net/article/details/120249551

ChatGLM of LLMs: Based on the Langchain framework, use text2vec-large-chinese + the ChatGLM large model (Docker deployment) to access a local knowledge base (build the local knowledge base / segmentation / vectorization + for each question [Embedding + vectorization + matching TopK as context] = generate a Prompt to feed the large model → LLM response), a detailed guide to implementing the question-answering project (CLI/WebUI/VUE) with graphic tutorials

https://yunyaniu.blog.csdn.net/article/details/130998758

ChatGLM of LLMs: Detailed guide on the introduction, installation and usage of ChatGLM Efficient Tuning (a tool for efficient fine-tuning of ChatGLM-6B/ChatGLM2-6B [LoRA/P-Tuning])

ChatGLM2 of LLMs: Single-machine inference for local ChatGLM2-6B deployment (API/CLI/GUI), low-cost deployment (quantized GPU deployment / CPU deployment with quantization / Mac deployment / multi-card deployment), efficient fine-tuning under limited resources (full parameters / P-tuning v2), and a detailed graphic-tutorial guide to model evaluation and inference

ChatGLM2 of LLMs: Based on ChatGLM Efficient Tuning (fine-tuning toolkit), detailed strategy for LoRA fine-tuning and inference testing of ChatGLM2 with graphic tutorials

LLMs: LLaMA Efficient Tuning (an efficient tool for fine-tuning [full parameters/LoRA/QLoRA] mainstream large models [ChatGLM2/LLaMA2/Baichuan, etc.] across [pre-training + supervised instruction fine-tuning + reward model training + PPO training + DPO training]): detailed introduction, installation and usage instructions

LLMs: Detailed guide to the introduction, installation and usage of LangChain-Chachat (a question-and-answer application over a local knowledge base)

ChatGLM2 of LLMs: Introduction to ChatGLM-Finetuning, usage methods (four fine-tuning methods (Freeze method/Lora method/P-Tuning method/full parameters) + single-card/multi-card training settings + video memory resource usage comparison), case applications Detailed strategy

https://yunyaniu.blog.csdn.net/article/details/132613495

ChatGLM2 of LLMs: ChatGLM-Finetuning source code interpretation (train.py file) - parse the command line (model path + dataset settings [maximum sequence length / maximum input length] + training hyperparameters [batch size / learning rate / weight decay coefficient / number of epochs / gradient accumulation steps / learning-rate warm-up ratio] + output settings [output path / training method (Freeze / Lora / P-Tuning / full parameters) / progress flag / loss display frequency / model save frequency] + whether to enable gradient checkpointing + DeepSpeed configuration + LoRA/Freeze/P-Tuning configuration) and initialization (whether to enable distributed GPUs + load DeepSpeed configuration + log writer) → load data (tokenizer / training set) → model training (load the optimizer and learning-rate scheduler + optionally enable gradient checkpointing + wrap the model and optimizer with DeepSpeed [which handles data parallelism] + run the training epoch loop [iterate over training data + compute loss + backpropagation + gradient clipping]) → model saving (periodically log the training loss and save the model; merge the sharded parameters when saving under ZeRO-3)

Introduction to ChatGLM-Finetuning

This project fine-tunes the ChatGLM and ChatGLM2 models with different methods (Freeze, Lora, P-Tuning, full parameters, etc.). All training code uses DeepSpeed, and the effects of the different fine-tuning methods on the large models are compared, mainly on information extraction, generation, and classification tasks. The project supports both single-card and multi-card training. Because fine-tuning uses a single instruction-style dataset, the fine-tuned models show no serious catastrophic forgetting. Since the official code and models are constantly being updated, the current code and models use the latest version (20230806). PS: Trainer is not used (although Trainer keeps the code simple, it is not easy to modify; in the era of large models, algorithm engineers have become data engineers, so they need to understand the training process).

Address: GitHub - liucongg/ChatGLM-Finetuning: fine-tuning downstream tasks based on the ChatGLM-6B and ChatGLM2-6B models, covering Freeze, Lora, P-tuning, full-parameter fine-tuning, etc.

Installation and usage of ChatGLM-Finetuning

1. Configure the environment

1.1. Download project code

Code address: GitHub - liucongg/ChatGLM-Finetuning: fine-tuning downstream tasks based on the ChatGLM-6B and ChatGLM2-6B models, covering Freeze, Lora, P-tuning, full-parameter fine-tuning, etc.

1.2. Configure the environment: required package versions

cpm_kernels==1.0.11
deepspeed==0.9.0
numpy==1.24.2
peft==0.3.0
sentencepiece==0.1.96
tensorboard==2.11.0
tensorflow==2.13.0
torch==1.13.1+cu116
tqdm==4.64.1
transformers==4.27.1

2. Fine-tuning methods: fine-tuning on 4× A800-80G takes at most about 2 hours

When fine-tuning, if you run out of video memory, you can enable gradient_checkpointing, ZeRO-3, offload, and similar options to save memory. The model_name_or_path parameter below is the model path; change it to wherever your model is actually saved.
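
For reference, the ds_file configurations used below follow DeepSpeed's standard schema. A minimal sketch of the two extremes used in this project, written as Python dicts (which deepspeed.initialize() also accepts); the repo's actual ds_zero2_no_offload.json and ds_zero3_offload.json may set additional fields:

# A sketch only -- the repo's JSON files may differ in details.
ds_zero2_no_offload = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},               # shard optimizer states + gradients
}

ds_zero3_offload = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                                  # also shard the model parameters
        "offload_optimizer": {"device": "cpu"},      # push optimizer states to CPU RAM
    },
}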

T1, Freeze method: set train_type=freeze and the freeze_module_name parameter

The Freeze method, i.e. parameter freezing, freezes part of the original model's parameters and trains only the rest, so that a large model can be trained on a single card or multiple cards without tensor parallelism (TP) or pipeline parallelism (PP).

For fine-tuning code, see train.py. The core part is as follows:

freeze_module_name = args.freeze_module_name.split(",")
for name, param in model.named_parameters():
	if not any(nd in name for nd in freeze_module_name):
		param.requires_grad = False

To freeze different layers of the model, adjust the freeze_module_name parameter yourself, e.g. "layers.27.,layers.26.,layers.25.,layers.24.". All training code runs with DeepSpeed. Configurable parameters include train_path, model_name_or_path, mode, train_type, freeze_module_name, ds_file, num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps, output_dir, etc.; set them according to your own task.
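
To sanity-check which parameters remain trainable after the freeze, a generic snippet (not part of the repo) that counts parameters by requires_grad:

# Assumes `model` has already gone through the freeze loop shown above.
trainable, total = 0, 0
for name, param in model.named_parameters():
    total += param.numel()
    if param.requires_grad:
        trainable += param.numel()
        print("trainable:", name)
print("{}/{} parameters will be updated ({:.2f}%)".format(
    trainable, total, 100 * trainable / total))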

(1) ChatGLM single-card training

CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py \
                --train_path data/spo_0.json \
                --model_name_or_path ChatGLM-6B/ \
                --per_device_train_batch_size 1 \
                --max_len 1560 \
                --max_src_len 1024 \
                --learning_rate 1e-4 \
                --weight_decay 0.1 \
                --num_train_epochs 2 \
                --gradient_accumulation_steps 4 \
                --warmup_ratio 0.1 \
                --mode glm \
                --train_type freeze \
                --freeze_module_name "layers.27.,layers.26.,layers.25.,layers.24." \
                --seed 1234 \
                --ds_file ds_zero2_no_offload.json \
                --gradient_checkpointing \
                --show_loss_step 10 \
                --output_dir ./output-glm

(2) ChatGLM four-card training: set the CUDA_VISIBLE_DEVICES parameter

CUDA_VISIBLE_DEVICES controls which specific cards are used for training; if it is not set, all cards on the machine are used.

CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py \
                --train_path data/spo_0.json \
                --model_name_or_path ChatGLM-6B/ \
                --per_device_train_batch_size 1 \
                --max_len 1560 \
                --max_src_len 1024 \
                --learning_rate 1e-4 \
                --weight_decay 0.1 \
                --num_train_epochs 2 \
                --gradient_accumulation_steps 4 \
                --warmup_ratio 0.1 \
                --mode glm \
                --train_type freeze \
                --freeze_module_name "layers.27.,layers.26.,layers.25.,layers.24." \
                --seed 1234 \
                --ds_file ds_zero2_no_offload.json \
                --gradient_checkpointing \
                --show_loss_step 10 \
                --output_dir ./output-glm

(3) ChatGLM2 single-card training

CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py \
                --train_path data/spo_0.json \
                --model_name_or_path ChatGLM2-6B/ \
                --per_device_train_batch_size 1 \
                --max_len 1560 \
                --max_src_len 1024 \
                --learning_rate 1e-4 \
                --weight_decay 0.1 \
                --num_train_epochs 2 \
                --gradient_accumulation_steps 4 \
                --warmup_ratio 0.1 \
                --mode glm2 \
                --train_type freeze \
                --freeze_module_name "layers.27.,layers.26.,layers.25.,layers.24." \
                --seed 1234 \
                --ds_file ds_zero2_no_offload.json \
                --gradient_checkpointing \
                --show_loss_step 10 \
                --output_dir ./output-glm2

(4) ChatGLM2 four-card training: set the CUDA_VISIBLE_DEVICES parameter

CUDA_VISIBLE_DEVICES controls which specific cards are used for training; if it is not set, all cards on the machine are used.

CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py \
                --train_path data/spo_0.json \
                --model_name_or_path ChatGLM2-6B/ \
                --per_device_train_batch_size 1 \
                --max_len 1560 \
                --max_src_len 1024 \
                --learning_rate 1e-4 \
                --weight_decay 0.1 \
                --num_train_epochs 2 \
                --gradient_accumulation_steps 4 \
                --warmup_ratio 0.1 \
                --mode glm2 \
                --train_type freeze \
                --freeze_module_name "layers.27.,layers.26.,layers.25.,layers.24." \
                --seed 1234 \
                --ds_file ds_zero2_no_offload.json \
                --gradient_checkpointing \
                --show_loss_step 10 \
                --output_dir ./output-glm2

(5) Comparison of video memory consumption with the Freeze method: ChatGLM vs. ChatGLM2

PS: ChatGLM consumes more video memory than ChatGLM2 during fine-tuning. Detailed video memory usage:

Model | DeepSpeed Stage | Offload | Gradient Checkpointing | Batch Size | Max Length | Number of A40 GPUs | Video Memory Used
ChatGLM | zero2 | No | Yes | 1 | 1560 | 1 | 36G
ChatGLM | zero2 | No | No | 1 | 1560 | 1 | 38G
ChatGLM | zero2 | No | Yes | 1 | 1560 | 4 | 24G
ChatGLM | zero2 | No | No | 1 | 1560 | 4 | 29G
ChatGLM2 | zero2 | No | Yes | 1 | 1560 | 1 | 35G
ChatGLM2 | zero2 | No | No | 1 | 1560 | 1 | 36G
ChatGLM2 | zero2 | No | Yes | 1 | 1560 | 4 | 22G
ChatGLM2 | zero2 | No | No | 1 | 1560 | 4 | 27G

T2, PT method (i.e. P-Tuning and P-Tuning v2): set train_type=ptuning (prefix_projection=True selects P-Tuning v2)

The PT method, i.e. P-Tuning, follows ChatGLM's official code and is a soft-prompt method for large models.

  • P-Tuning adds new parameters only to the Embedding of the large model. (paper)
  • P-Tuning v2 adds new parameters to the Embedding of the large model and before each layer. (paper)

For fine-tuning code, see train.py. The core part is as follows:

config = MODE[args.mode]["config"].from_pretrained(args.model_name_or_path)
config.pre_seq_len = args.pre_seq_len
config.prefix_projection = args.prefix_projection
model = MODE[args.mode]["model"].from_pretrained(args.model_name_or_path, config=config)
for name, param in model.named_parameters():
	if not any(nd in name for nd in ["prefix_encoder"]):
		param.requires_grad = False

When prefix_projection is True, it is the P-Tuning v2 method: new parameters are added to the Embedding of the large model and before each layer. When it is False, it is the P-Tuning method: new parameters are added only to the Embedding of the large model.
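
As a back-of-envelope illustration of the difference in trainable parameters (the hidden size and layer count below are placeholders, not exact ChatGLM configuration values):

# Illustrative only -- not the repo's implementation.
hidden_size, num_layers, pre_seq_len = 4096, 28, 16

# P-Tuning: a soft-prompt table prepended at the input-embedding level only.
p_tuning_params = pre_seq_len * hidden_size

# P-Tuning v2: learned prefix key/value vectors injected before every layer.
p_tuning_v2_params = num_layers * 2 * pre_seq_len * hidden_size

print("P-Tuning adds    ~{:.2f}M trainable parameters".format(p_tuning_params / 1e6))
print("P-Tuning v2 adds ~{:.2f}M trainable parameters".format(p_tuning_v2_params / 1e6))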

All training code runs with DeepSpeed. Configurable parameters include train_path, model_name_or_path, mode, train_type, pre_seq_len, prefix_projection, ds_file, num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps, output_dir, etc.; set them according to your own task.

(1) ChatGLM single-card training

CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py \
                --train_path data/spo_0.json \
                --model_name_or_path ChatGLM-6B \
                --per_device_train_batch_size 1 \
                --max_len 768 \
                --max_src_len 512 \
                --learning_rate 1e-4 \
                --weight_decay 0.1 \
                --num_train_epochs 2 \
                --gradient_accumulation_steps 4 \
                --warmup_ratio 0.1 \
                --mode glm \
                --train_type ptuning \
                --seed 1234 \
                --ds_file ds_zero2_no_offload.json \
                --gradient_checkpointing \
                --show_loss_step 10 \
                --pre_seq_len 16 \
                --prefix_projection True \
                --output_dir ./output-glm

(2) ChatGLM four-card training

CUDA_VISIBLE_DEVICES controls which specific cards are used for training; if it is not set, all cards on the machine are used.

CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py \
                --train_path data/spo_0.json \
                --model_name_or_path ChatGLM-6B \
                --per_device_train_batch_size 1 \
                --max_len 1560 \
                --max_src_len 1024 \
                --learning_rate 1e-4 \
                --weight_decay 0.1 \
                --num_train_epochs 2 \
                --gradient_accumulation_steps 4 \
                --warmup_ratio 0.1 \
                --mode glm \
                --train_type ptuning \
                --seed 1234 \
                --ds_file ds_zero2_no_offload.json \
                --gradient_checkpointing \
                --show_loss_step 10 \
                --pre_seq_len 16 \
                --prefix_projection True \
                --output_dir ./output-glm

(3) ChatGLM2 single-card training

CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py \
                --train_path data/spo_0.json \
                --model_name_or_path ChatGLM2-6B \
                --per_device_train_batch_size 1 \
                --max_len 1560 \
                --max_src_len 1024 \
                --learning_rate 1e-4 \
                --weight_decay 0.1 \
                --num_train_epochs 2 \
                --gradient_accumulation_steps 4 \
                --warmup_ratio 0.1 \
                --mode glm2 \
                --train_type ptuning \
                --seed 1234 \
                --ds_file ds_zero2_no_offload.json \
                --gradient_checkpointing \
                --show_loss_step 10 \
                --pre_seq_len 16 \
                --prefix_projection True \
                --output_dir ./output-glm2

(4) ChatGLM2 four-card training

CUDA_VISIBLE_DEVICES controls which specific cards are used for training; if it is not set, all cards on the machine are used.

CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py \
                --train_path data/spo_0.json \
                --model_name_or_path ChatGLM2-6B \
                --per_device_train_batch_size 1 \
                --max_len 1560 \
                --max_src_len 1024 \
                --learning_rate 1e-4 \
                --weight_decay 0.1 \
                --num_train_epochs 2 \
                --gradient_accumulation_steps 4 \
                --warmup_ratio 0.1 \
                --mode glm2 \
                --train_type ptuning \
                --seed 1234 \
                --ds_file ds_zero2_no_offload.json \
                --gradient_checkpointing \
                --show_loss_step 10 \
                --pre_seq_len 16 \
                --prefix_projection True \
                --output_dir ./output-glm2

(5) Comparison of video memory consumption with the PT method: ChatGLM vs. ChatGLM2

PS: ChatGLM consumes more video memory than ChatGLM2 during fine-tuning. Detailed video memory usage:

Model | DeepSpeed Stage | Offload | Gradient Checkpointing | Batch Size | Max Length | Number of A40 GPUs | Video Memory Used
ChatGLM | zero2 | No | Yes | 1 | 768 | 1 | 43G
ChatGLM | zero2 | No | No | 1 | 300 | 1 | 44G
ChatGLM | zero2 | No | Yes | 1 | 1560 | 4 | 37G
ChatGLM | zero2 | No | No | 1 | 1360 | 4 | 44G
ChatGLM2 | zero2 | No | Yes | 1 | 1560 | 1 | 20G
ChatGLM2 | zero2 | No | No | 1 | 1560 | 1 | 40G
ChatGLM2 | zero2 | No | Yes | 1 | 1560 | 4 | 19G
ChatGLM2 | zero2 | No | No | 1 | 1560 | 4 | 39G

T3, LoRA method: set train_type=lora with lora_dim=16, lora_alpha=64, lora_dropout=0.1, lora_module_name="query_key_value"

The LoRA method adds extra low-rank matrices in parallel to specified weight matrices of the large language model and, during training, updates only these added low-rank matrices. When the rank is much smaller than the original parameter dimensions, the number of new parameters is very small, so downstream tasks can be tuned by training only a small number of parameters while still achieving good performance.

  • Paper: paper
  • Official code: Github
  • HuggingFace's peft library: Github
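
Before looking at the repo code, a back-of-envelope sketch of why the added parameters are small; the 4096×4096 weight shape below is illustrative, not the exact ChatGLM query_key_value shape:

import torch

d_out, d_in, r, alpha = 4096, 4096, 16, 64   # illustrative shapes; r corresponds to lora_dim

W = torch.randn(d_out, d_in)                 # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01              # trainable low-rank factor
B = torch.zeros(d_out, r)                    # starts at zero so the update starts at zero

x = torch.randn(d_in)
y = W @ x + (alpha / r) * (B @ (A @ x))      # LoRA forward: base output + scaled low-rank update

full_params = d_out * d_in                   # parameters if W itself were trained
lora_params = r * (d_in + d_out)             # parameters actually trained
print("full: {:,}  lora: {:,} ({:.2f}% of the full matrix)".format(
    full_params, lora_params, 100 * lora_params / full_params))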

For fine-tuning code, see train.py. The core part is as follows:

model = MODE[args.mode]["model"].from_pretrained(args.model_name_or_path)
lora_module_name = args.lora_module_name.split(",")
config = LoraConfig(r=args.lora_dim,
					lora_alpha=args.lora_alpha,
					target_modules=lora_module_name,
					lora_dropout=args.lora_dropout,
					bias="none",
					task_type="CAUSAL_LM",
					inference_mode=False,
					)
model = get_peft_model(model, config)
model.config.torch_dtype = torch.float32

All training code runs with DeepSpeed. Configurable parameters include train_path, model_name_or_path, mode, train_type, lora_dim, lora_alpha, lora_dropout, lora_module_name, ds_file, num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps, output_dir, etc.; set them according to your own task.

(1) ChatGLM single-card training

CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py \
              --train_path data/spo_0.json \
              --model_name_or_path ChatGLM-6B \
              --per_device_train_batch_size 1 \
              --max_len 1560 \
              --max_src_len 1024 \
              --learning_rate 1e-4 \
              --weight_decay 0.1 \
              --num_train_epochs 2 \
              --gradient_accumulation_steps 4 \
              --warmup_ratio 0.1 \
              --mode glm \
              --train_type lora \
              --lora_dim 16 \
              --lora_alpha 64 \
              --lora_dropout 0.1 \
              --lora_module_name "query_key_value" \
              --seed 1234 \
              --ds_file ds_zero2_no_offload.json \
              --gradient_checkpointing \
              --show_loss_step 10 \
              --output_dir ./output-glm

(2) ChatGLM four-card training

CUDA_VISIBLE_DEVICES controls which specific cards are used for training; if it is not set, all cards on the machine are used.

CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py \
              --train_path data/spo_0.json \
              --model_name_or_path ChatGLM-6B \
              --per_device_train_batch_size 1 \
              --max_len 1560 \
              --max_src_len 1024 \
              --learning_rate 1e-4 \
              --weight_decay 0.1 \
              --num_train_epochs 2 \
              --gradient_accumulation_steps 4 \
              --warmup_ratio 0.1 \
              --mode glm \
              --train_type lora \
              --lora_dim 16 \
              --lora_alpha 64 \
              --lora_dropout 0.1 \
              --lora_module_name "query_key_value" \
              --seed 1234 \
              --ds_file ds_zero2_no_offload.json \
              --gradient_checkpointing \
              --show_loss_step 10 \
              --output_dir ./output-glm

(3) ChatGLM2 single-card training

CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py \
              --train_path data/spo_0.json \
              --model_name_or_path ChatGLM2-6B \
              --per_device_train_batch_size 1 \
              --max_len 1560 \
              --max_src_len 1024 \
              --learning_rate 1e-4 \
              --weight_decay 0.1 \
              --num_train_epochs 2 \
              --gradient_accumulation_steps 4 \
              --warmup_ratio 0.1 \
              --mode glm2 \
              --train_type lora \
              --lora_dim 16 \
              --lora_alpha 64 \
              --lora_dropout 0.1 \
              --lora_module_name "query_key_value,dense_h_to_4h,dense_4h_to_h,dense" \
              --seed 1234 \
              --ds_file ds_zero2_no_offload.json \
              --gradient_checkpointing \
              --show_loss_step 10 \
              --output_dir ./output-glm2

(4) ChatGLM2 four-card training

CUDA_VISIBLE_DEVICES controls which specific cards are used for training; if it is not set, all cards on the machine are used.

CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py \
              --train_path data/spo_0.json \
              --model_name_or_path ChatGLM2-6B \
              --per_device_train_batch_size 1 \
              --max_len 1560 \
              --max_src_len 1024 \
              --learning_rate 1e-4 \
              --weight_decay 0.1 \
              --num_train_epochs 2 \
              --gradient_accumulation_steps 4 \
              --warmup_ratio 0.1 \
              --mode glm2 \
              --train_type lora \
              --lora_dim 16 \
              --lora_alpha 64 \
              --lora_dropout 0.1 \
              --lora_module_name "query_key_value,dense_h_to_4h,dense_4h_to_h,dense" \
              --seed 1234 \
              --ds_file ds_zero2_no_offload.json \
              --gradient_checkpointing \
              --show_loss_step 10 \
              --output_dir ./output-glm2

(5) Comparison of video memory consumption with the LoRA method: ChatGLM vs. ChatGLM2

PS: ChatGLM consumes more video memory than ChatGLM2 during fine-tuning. Detailed video memory usage:

Model | DeepSpeed Stage | Offload | Gradient Checkpointing | Batch Size | Max Length | Number of A40 GPUs | Video Memory Used
ChatGLM | zero2 | No | Yes | 1 | 1560 | 1 | 20G
ChatGLM | zero2 | No | No | 1 | 1560 | 1 | 45G
ChatGLM | zero2 | No | Yes | 1 | 1560 | 4 | 20G
ChatGLM | zero2 | No | No | 1 | 1560 | 4 | 45G
ChatGLM2 | zero2 | No | Yes | 1 | 1560 | 1 | 20G
ChatGLM2 | zero2 | No | No | 1 | 1560 | 1 | 43G
ChatGLM2 | zero2 | No | Yes | 1 | 1560 | 4 | 19G
ChatGLM2 | zero2 | No | No | 1 | 1560 | 4 | 42G

Note: the LoRA method saves only the LoRA weights when the model is saved, so the LoRA weights must be merged back into the base model before inference; see merge_lora.py for details.
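
The repo provides merge_lora.py for this step. For reference, a minimal sketch of the same idea using the peft API directly (the checkpoint paths below are placeholders):

import torch
from transformers import AutoModel
from peft import PeftModel

# Placeholder paths: the base ChatGLM model and the directory holding the saved LoRA weights.
base = AutoModel.from_pretrained("ChatGLM-6B/", trust_remote_code=True, torch_dtype=torch.float16)
lora = PeftModel.from_pretrained(base, "./output-glm/")
merged = lora.merge_and_unload()              # fold the low-rank updates into the base weights
merged.save_pretrained("./output-glm-merged/")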

T4, full-parameter method (DeepSpeed ZeRO-3 [model parameters sharded across cards] + Offload [optimizer states offloaded to the CPU]): set train_type=all

The full-parameter method trains all parameters of the large model. It mainly uses DeepSpeed ZeRO-3 to shard the model parameters across multiple cards, and uses Offload to move the optimizer states to the CPU, to work around insufficient video memory.

For fine-tuning code, see train.py. The core part is as follows:

model = MODE[args.mode]["model"].from_pretrained(args.model_name_or_path)

All training code runs with DeepSpeed. Configurable parameters include train_path, model_name_or_path, mode, train_type, ds_file, num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps, output_dir, etc.; set them according to your own task.
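
Because ZeRO-3 shards the weights across ranks, a checkpoint written during training usually has to be consolidated into a single fp32 state dict before ordinary single-process inference. A sketch using DeepSpeed's bundled utility (the checkpoint path is a placeholder; the repo's train.py handles ZeRO-3 saving in its own way):

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# "./output-glm/" is a placeholder for the DeepSpeed checkpoint directory.
state_dict = get_fp32_state_dict_from_zero_checkpoint("./output-glm/")
torch.save(state_dict, "./output-glm/pytorch_model.bin")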

(1) ChatGLM four-card training

CUDA_VISIBLE_DEVICES controls which specific cards are used for training; if it is not set, all cards on the machine are used.

CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py \
              --train_path data/spo_0.json \
              --model_name_or_path ChatGLM-6B \
              --per_device_train_batch_size 1 \
              --max_len 1560 \
              --max_src_len 1024 \
              --learning_rate 1e-4 \
              --weight_decay 0.1 \
              --num_train_epochs 2 \
              --gradient_accumulation_steps 4 \
              --warmup_ratio 0.1 \
              --mode glm \
              --train_type all \
              --seed 1234 \
              --ds_file ds_zero3_offload.json \
              --gradient_checkpointing \
              --show_loss_step 10 \
              --output_dir ./output-glm

(2) ChatGLM2 four-card training

CUDA_VISIBLE_DEVICES controls which specific cards are used for training; if it is not set, all cards on the machine are used.

CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py \
              --train_path data/spo_0.json \
              --model_name_or_path ChatGLM2-6B \
              --per_device_train_batch_size 1 \
              --max_len 1560 \
              --max_src_len 1024 \
              --learning_rate 1e-4 \
              --weight_decay 0.1 \
              --num_train_epochs 2 \
              --gradient_accumulation_steps 4 \
              --warmup_ratio 0.1 \
              --mode glm2 \
              --train_type all \
              --seed 1234 \
              --ds_file ds_zero3_no_offload.json \
              --gradient_checkpointing \
              --show_loss_step 10 \
              --output_dir ./output-glm2

(3) Comparison of video memory consumption with the full-parameter method: ChatGLM vs. ChatGLM2

PS: ChatGLM consumes more video memory than ChatGLM2 during fine-tuning. Detailed video memory usage:

Model | DeepSpeed Stage | Offload | Gradient Checkpointing | Batch Size | Max Length | Number of A40 GPUs | Video Memory Used
ChatGLM | zero3 | Yes | Yes | 1 | 1560 | 4 | 33G
ChatGLM2 | zero3 | No | Yes | 1 | 1560 | 4 | 44G
ChatGLM2 | zero3 | Yes | Yes | 1 | 1560 | 4 | 26G

More details on DeepSpeed's ZeRO stages will be added later.

Comparison of the four fine-tuning methods: PT-only-Embedding (37G video memory, 53m training + 3m testing), PT (30G, 135m + 3m), Freeze (24G, 112m + 3m), LoRA (39G, 65m + 4m)

Case application of ChatGLM-Finetuning

1. Pipeline-parallel training case: 4× A800-80G + the ChatGLM-6B model + full parameters + pipeline implementation on the DeepSpeed framework (ZeRO-3 model partitioning)

This project's code was trained on 4× A800-80G (2 cards are actually enough), and the results show an F1 score between 0.55 and 0.65. Although LoRA, P-tuning, and similar techniques allow a 13B model to be trained on a single A100, pipeline parallelism becomes essential once the model parameters are too large for a single card to hold during training.
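
For intuition, pipeline parallelism expresses the network as an ordered list of layers that DeepSpeed partitions into stages, one stage per group of GPUs. A toy sketch of the idea with deepspeed.pipe (toy layers, not the repo's train_pipeline.py; it must be launched with the deepspeed launcher, e.g. deepspeed --num_gpus 4 toy_pipeline.py):

import torch
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec

class ToyBlock(torch.nn.Module):
    # Stand-in for an embedding / transformer layer / LM head.
    def __init__(self, dim=256):
        super().__init__()
        self.ff = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.ff(x))

if __name__ == "__main__":
    deepspeed.init_distributed()
    # 32 identical blocks partitioned across 4 pipeline stages (cf. --num_stages 4 below).
    model = PipelineModule(layers=[LayerSpec(ToyBlock) for _ in range(32)], num_stages=4)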

(1) Training stage: train and test the model on triplet data

Training script:

CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 5524 train_pipeline.py \
                --train_path data/spo_0.json \
                --model_name_or_path ./ChatGLM-6B/ \
                --per_device_train_batch_size 14 \
                --max_len 1024 \
                --max_src_len 512 \
                --num_train_epochs 5 \
                --gradient_accumulation_steps 1 \
                --seed 1234 \
                --show_loss_step 20 \
                --num_stages 4 \
                --save_model_step 100 \
                --output_dir ./output-glm-pp

(2) Inference stage: model conversion script

Because pipeline training changes the model's variable names relative to the original model, a checkpoint must first be converted with convert_model_to_hf.py before inference with the original model structure: the script renames the variables and re-saves the model.

The model conversion code is as follows:

import torch
from pathlib import Path
from os.path import join

# pipeline_model_dir and save_model_dir come from the script's command-line arguments
# (see the conversion command below).
model_static_dict = {}
for path in Path(pipeline_model_dir).iterdir():
    print("File processed: {}".format(path))
    # DeepSpeed pipeline checkpoints store weights in per-layer files named "layer_XX-...".
    if not path.name.startswith('layer'):
        continue
    small_static_dict = torch.load(path, map_location="cpu")
    layer_i = int(path.name.split('-')[0].replace('layer_', ''))
    if layer_i == 0:
        # The first pipeline layer holds the word embeddings.
        model_static_dict["transformer.word_embeddings.weight"] = small_static_dict["word_embeddings.weight"]
    elif layer_i == 30:
        # The last pipeline layer: the LM head reuses the embedding weight.
        model_static_dict["lm_head.weight"] = small_static_dict["word_embeddings.weight"]
    elif layer_i == 29:
        # Layer 29's weights keep their names under the "transformer." prefix.
        for k, v in small_static_dict.items():
            model_static_dict["transformer." + k] = v
    else:
        # Transformer blocks: pipeline index i maps to the original "transformer.layers.{i-1}." names.
        for k, v in small_static_dict.items():
            model_static_dict["transformer." + k.replace("layer.", "layers.{}.".format(layer_i - 1))] = v

torch.save(model_static_dict, join(save_model_dir, "pytorch_model.bin"))

Inference script:

python3 convert_model_to_hf.py \
                --ori_model_dir ./ChatGLM-6B/ \
                --pipeline_model_dir output-glm-pp/global_step300/ \
                --save_model_dir output-glm-pp/gs300/
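
After conversion, the saved pytorch_model.bin can be loaded with the standard ChatGLM-6B code, assuming the original ChatGLM-6B config and tokenizer files have been copied into the save directory (a sketch; the path is a placeholder):

from transformers import AutoModel, AutoTokenizer

model_dir = "output-glm-pp/gs300/"   # must also contain the ChatGLM-6B config/tokenizer files
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).half().cuda()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)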

(3) Comparison of results

Real-time video memory usage per card: 55/80G, 80/80G, 80/80G, and 70/80G

The F1 score is around 0.59.

Steps | 300 | 400 | 500
F1 | 0.5882 | 0.5793 | 0.5874
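
For reference, an F1 score over extracted triples is typically computed like this (a generic sketch, not the repo's exact evaluation code):

def triple_f1(pred_triples, gold_triples):
    # pred_triples / gold_triples: iterables of (subject, predicate, object) tuples.
    pred, gold = set(pred_triples), set(gold_triples)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(triple_f1([("A", "founder", "B")], [("A", "founder", "B"), ("A", "CEO", "C")]))  # ≈ 0.667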
