LLMs: LLaMA Efficient Tuning (a tool for efficiently fine-tuning mainstream large models [ChatGLM2/LLaMA2/Baichuan, etc.] with [full-parameter/LoRA/QLoRA] training, covering [pre-training + supervised instruction fine-tuning + reward model training + PPO training + DPO training]): detailed introduction, installation, and usage
Table of contents
Introduction to LLaMA Efficient Tuning
Installation of LLaMA Efficient Tuning
1. Configuration environment dependency
2. Environment construction
3. Data preparation: build a custom data set
4. Fine-tuning/testing
(1), One-click fine-tuning/testing in the browser
(2), Single-GPU training
(3), Multi-GPU distributed training: T1, using Huggingface Accelerate; T2, using DeepSpeed
5. Multiple inference methods: API, CLI, GUI
6. Metric evaluation and model prediction
7. Exporting the fine-tuned model
How to use ChatGLM Efficient Tuning
Related articles
ChatGLM of LLMs: ChatGLM Efficient Tuning (an efficient fine-tuning tool for ChatGLM-6B/ChatGLM2-6B [LoRA/P-Tuning V2/Freeze Tuning/full-parameter fine-tuning]): introduction, installation, and detailed usage guide
https://yunyaniu.blog.csdn.net/article/details/131427931
LLMs: LLaMA Efficient Tuning (a tool for efficiently fine-tuning mainstream large models [ChatGLM2/LLaMA2/Baichuan, etc.] with [full-parameter/LoRA/QLoRA] training, covering [pre-training + supervised instruction fine-tuning + reward model training + PPO training + DPO training]): detailed introduction, installation, and usage
https://yunyaniu.blog.csdn.net/article/details/132012771
Introduction to LLaMA Efficient Tuning
LLaMA Efficient Tuning, released in June 2023, is a tool for efficiently fine-tuning mainstream large models (ChatGLM2, LLaMA2, Baichuan, etc.) with full-parameter, LoRA, or QLoRA training. It covers pre-training, supervised instruction fine-tuning, reward model training, PPO training, and DPO training. The project is still being updated.
1. Supported models
Model name | Model size | Default module | Template |
---|---|---|---|
LLaMA | 7B/13B/33B/65B | q_proj,v_proj | - |
LLaMA-2 | 7B/13B/70B | q_proj,v_proj | llama2 |
BLOOM | 560M/1.1B/1.7B/3B/7.1B/176B | query_key_value | - |
BLOOMZ | 560M/1.1B/1.7B/3B/7.1B/176B | query_key_value | - |
Falcon | 7B/40B | query_key_value | - |
Baichuan | 7B/13B | W_pack | baichuan |
InternLM | 7B | q_proj,v_proj | intern |
Qwen | 7B | c_attn | chatml |
XVERSE | 13B | q_proj,v_proj | xverse |
ChatGLM2 | 6B | query_key_value | chatglm2 |
- The default module is one of the values accepted by the --lora_target parameter. Run python src/train_bash.py -h to view all available options.
- For all "Base" models, the --template parameter can be any value such as default, alpaca, vicuna, etc. For "Chat" models, be sure to use the corresponding template.
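The table and the two notes above can be condensed into a small lookup. This is only a sketch: the values are copied from the table, while the helper name suggest_flags and its is_chat_model argument are hypothetical and not part of the project.

```python
from typing import List

# Default LoRA target modules and templates, copied from the table above.
# A template of None means the table lists "-" (no dedicated template).
MODEL_DEFAULTS = {
    "LLaMA": ("q_proj,v_proj", None),
    "LLaMA-2": ("q_proj,v_proj", "llama2"),
    "BLOOM": ("query_key_value", None),
    "BLOOMZ": ("query_key_value", None),
    "Falcon": ("query_key_value", None),
    "Baichuan": ("W_pack", "baichuan"),
    "InternLM": ("q_proj,v_proj", "intern"),
    "Qwen": ("c_attn", "chatml"),
    "XVERSE": ("q_proj,v_proj", "xverse"),
    "ChatGLM2": ("query_key_value", "chatglm2"),
}


def suggest_flags(model_name: str, is_chat_model: bool) -> List[str]:
    """Return --lora_target/--template flags for a model family (hypothetical helper)."""
    lora_target, template = MODEL_DEFAULTS[model_name]
    # Base models accept any template, so fall back to "default";
    # chat models keep the template listed for their family.
    if not is_chat_model or template is None:
        template = "default"
    return ["--lora_target", lora_target, "--template", template]


print(suggest_flags("Baichuan", is_chat_model=True))
print(suggest_flags("LLaMA-2", is_chat_model=False))
```

The returned list can be appended to a python src/train_bash.py command line.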
2. Supported training methods
Method | Full-parameter training | Partial-parameter training | LoRA | QLoRA |
---|---|---|---|---|
Pre-training | ✅ | ✅ | ✅ | ✅ |
Supervised instruction fine-tuning | ✅ | ✅ | ✅ | ✅ |
Reward model training | | | ✅ | ✅ |
PPO training | | | ✅ | ✅ |
DPO training | ✅ | | ✅ | ✅ |
- Use the --quantization_bit 4/8 parameter to enable QLoRA training.
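As a rough illustration of the note above, the flag can be validated before a command is launched. The helper qlora_flags is hypothetical, and the rule that quantization is only combined with --finetuning_type lora is an assumption based on what QLoRA means (a quantized base model plus LoRA adapters), not something checked against the project's source.

```python
from typing import List, Optional


def qlora_flags(finetuning_type: str, quantization_bit: Optional[int]) -> List[str]:
    """Build fine-tuning flags, adding --quantization_bit for QLoRA (hypothetical helper)."""
    flags = ["--finetuning_type", finetuning_type]
    if quantization_bit is None:
        return flags  # no quantization requested
    if quantization_bit not in (4, 8):
        raise ValueError("quantization_bit must be 4 or 8")
    if finetuning_type != "lora":
        # Assumption: quantized training is only used together with LoRA adapters.
        raise ValueError("QLoRA requires --finetuning_type lora")
    return flags + ["--quantization_bit", str(quantization_bit)]


print(" ".join(qlora_flags("lora", 4)))
```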
3. Available data sets: for pre-training, for instruction supervised fine-tuning, for reward model or DPO training
- For pre-training:
- For instruction supervised fine-tuning:
- Stanford Alpaca (en)
- Stanford Alpaca (zh)
- GPT-4 Generated Data (en&zh)
- Open Assistant (multilingual)
- Self-cognition (zh)
- ShareGPT (zh)
- Guanaco Dataset (multilingual)
- BELLE 2M (zh)
- BELLE 1M (zh)
- BELLE 0.5M (zh)
- BELLE Dialog 0.4M (zh)
- BELLE School Math 0.25M (zh)
- BELLE Multiturn Chat 0.8M (zh)
- Firefly 1.1M (zh)
- LIMA (en)
- CodeAlpaca 20k (en)
- Alpaca CoT (multilingual)
- Web QA (zh)
- UltraChat
- WebNovel (zh)
- For reward model or DPO training:
Please refer to the data/README.md file for usage.
Some data sets require confirmation before they can be used, so we recommend logging in to your Hugging Face account with the following commands.
pip install --upgrade huggingface_hub
huggingface-cli login
Installation of LLaMA Efficient Tuning
1. Configuration environment dependency
(1), Python dependencies
- Python 3.8+, PyTorch 1.13.1
- Transformers, Datasets, Accelerate, PEFT, TRL
- protobuf, cpm-kernels, sentencepiece
- jieba, rouge-chinese, nltk (for evaluation)
- gradio, matplotlib (for the web interface)
- uvicorn, fastapi, sse-starlette (for the API)
2. Environment construction
General setup:
git clone https://github.com/hiyouga/LLaMA-Efficient-Tuning.git
conda create -n llama_etuning python=3.10
conda activate llama_etuning
cd LLaMA-Efficient-Tuning
pip install -r requirements.txt
Windows platform + QLoRA:
If you want to enable quantized LoRA (QLoRA) on the Windows platform, you need to install the precompiled bitsandbytes library, which supports CUDA 11.1 to 12.1:
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.39.1-py3-none-win_amd64.whl
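After installation, a quick check can confirm that the listed Python dependencies are importable. The sketch below uses only the standard library. The import names are an assumption where they differ from the pip names (for example rouge-chinese as rouge_chinese and sse-starlette as sse_starlette), and protobuf and cpm-kernels are left out for brevity.

```python
import importlib.util
import sys
from typing import Dict, List

# Import names for the packages listed above (assumed where they differ from pip names).
REQUIRED_MODULES = [
    "torch", "transformers", "datasets", "accelerate", "peft", "trl",
    "sentencepiece", "jieba", "rouge_chinese", "nltk",
    "gradio", "matplotlib", "uvicorn", "fastapi", "sse_starlette",
]


def check_environment(modules: List[str] = REQUIRED_MODULES) -> Dict[str, bool]:
    """Map each module name to True if it can be located in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if sys.version_info < (3, 8):
    print("Python 3.8+ is required, found", sys.version.split()[0])

missing = [name for name, found in check_environment().items() if not found]
print("missing packages:", ", ".join(missing) or "none")
```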
3. Data preparation: build a custom data set
For the format of the dataset file, please refer to the contents of the data/example_dataset folder. When building a custom dataset, you can use either a single .json file, or a data load script and multiple files.
Note: When using a custom dataset, please update the data/dataset_info.json file. For the format of the file, please refer to data/README.md.
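Source code address:
https://github.com/hiyouga/LLaMA-Efficient-Tuning/blob/main/data/dataset_info.json
# Locate the data set configuration file (data/dataset_info.json) and add or modify the entry for your data set.
# The "file_sha1" field is left out of this entry; the original example had it commented out with the value 607f94a7f581341e59685aef32f531095232cf23.
{
  "dataset_DIY": {
    "file_name": "dataset_DIY.json"
  }
}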
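Registering a custom data set can also be scripted. The sketch below writes a small data set file and adds its entry to dataset_info.json. The instruction/input/output field names, the example records, and the reading of file_sha1 as the SHA-1 digest of the file's bytes are all assumptions (check data/example_dataset and data/README.md for the real format), and register_dataset is a hypothetical helper.

```python
import hashlib
import json
import os
import tempfile

# Hypothetical records; the field names are an assumption about the expected format.
records = [
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
    {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"},
]


def register_dataset(data_dir: str, name: str, records: list) -> dict:
    """Write <name>.json into data_dir and add its entry to dataset_info.json."""
    file_name = f"{name}.json"
    payload = json.dumps(records, ensure_ascii=False, indent=2).encode("utf-8")
    with open(os.path.join(data_dir, file_name), "wb") as f:
        f.write(payload)

    # Load the existing registry if there is one, so other entries are preserved.
    info_path = os.path.join(data_dir, "dataset_info.json")
    info = {}
    if os.path.exists(info_path):
        with open(info_path, encoding="utf-8") as f:
            info = json.load(f)

    # Assumption: file_sha1 is the SHA-1 hex digest of the data set file's bytes.
    info[name] = {"file_name": file_name, "file_sha1": hashlib.sha1(payload).hexdigest()}
    with open(info_path, "w", encoding="utf-8") as f:
        json.dump(info, f, ensure_ascii=False, indent=2)
    return info[name]


# Demo in a temporary directory; use the repository's data/ folder for real runs.
with tempfile.TemporaryDirectory() as demo_dir:
    entry = register_dataset(demo_dir, "dataset_DIY", records)
    print(entry["file_name"], len(entry["file_sha1"]))
```

The name passed to register_dataset is the value that would later go into the --dataset argument.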
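4. Fine-tuning/testing
(1), One-click fine-tuning/testing in the browser
CUDA_VISIBLE_DEVICES=0 python src/train_web.py
We strongly recommend that beginners use the all-in-one browser interface, because it can also generate the command-line script needed for a run automatically. The web UI currently supports single-GPU training only.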
(2), Single-GPU training: pre-training, supervised instruction fine-tuning, reward model training, PPO training, DPO training
Pre-training:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage pt \
    --model_name_or_path path_to_llama_model \
    --do_train \
    --dataset wiki_demo \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_pt_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16
Supervised instruction fine-tuning:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path path_to_llama_model \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16
Reward model training:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage rm \
    --model_name_or_path path_to_llama_model \
    --do_train \
    --dataset comparison_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --resume_lora_training False \
    --checkpoint_dir path_to_sft_checkpoint \
    --output_dir path_to_rm_checkpoint \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-6 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16
PPO training:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage ppo \
    --model_name_or_path path_to_llama_model \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --resume_lora_training False \
    --checkpoint_dir path_to_sft_checkpoint \
    --reward_model path_to_rm_checkpoint \
    --output_dir path_to_ppo_checkpoint \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss
DPO training:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage dpo \
    --model_name_or_path path_to_llama_model \
    --do_train \
    --dataset comparison_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --resume_lora_training False \
    --checkpoint_dir path_to_sft_checkpoint \
    --output_dir path_to_dpo_checkpoint \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16
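The commands above chain together: the reward model and PPO stages both load path_to_sft_checkpoint, and PPO additionally loads path_to_rm_checkpoint. A minimal sketch of assembling that chain in Python follows. build_command is a hypothetical helper, and only the stage-specific flags from the commands above are included (batch size, scheduler and similar flags are omitted for brevity).

```python
import shlex
from typing import Dict, List, Optional

# Flags shared by the example commands in this section.
COMMON = {
    "model_name_or_path": "path_to_llama_model",
    "template": "default",
    "finetuning_type": "lora",
    "lora_target": "q_proj,v_proj",
}


def build_command(stage: str, dataset: str, output_dir: str,
                  checkpoint_dir: Optional[str] = None,
                  reward_model: Optional[str] = None) -> List[str]:
    """Assemble a train_bash.py argument list for one stage (hypothetical helper)."""
    args: Dict[str, str] = {"stage": stage, **COMMON,
                            "dataset": dataset, "output_dir": output_dir}
    if checkpoint_dir is not None:
        # Mirrors the rm/ppo/dpo commands above, which pass both flags together.
        args["resume_lora_training"] = "False"
        args["checkpoint_dir"] = checkpoint_dir
    if reward_model is not None:
        args["reward_model"] = reward_model
    cmd = ["python", "src/train_bash.py", "--do_train"]
    for key, value in args.items():
        cmd += [f"--{key}", value]
    return cmd


pipeline = [
    build_command("sft", "alpaca_gpt4_zh", "path_to_sft_checkpoint"),
    build_command("rm", "comparison_gpt4_zh", "path_to_rm_checkpoint",
                  checkpoint_dir="path_to_sft_checkpoint"),
    build_command("ppo", "alpaca_gpt4_zh", "path_to_ppo_checkpoint",
                  checkpoint_dir="path_to_sft_checkpoint",
                  reward_model="path_to_rm_checkpoint"),
]
for cmd in pipeline:
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each list could be passed to subprocess.run once the placeholder paths are replaced with real ones.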
(3), Multi-GPU distributed training: T1, using Huggingface Accelerate; T2, using DeepSpeed
T1, using Huggingface Accelerate:
accelerate config # configure the distributed environment first
accelerate launch src/train_bash.py # same arguments as above
Example Accelerate configuration for full-parameter fine-tuning with DeepSpeed ZeRO-2:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 4
  gradient_clipping: 0.5
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
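When moving from one GPU to several, the number of samples per optimizer step grows with the number of processes. A quick calculation makes this visible; it assumes one GPU per process and that the training command keeps --per_device_train_batch_size 4 from the pre-training and SFT examples, together with gradient_accumulation_steps: 4 and num_processes: 4 from the Accelerate configuration above.

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_processes: int) -> int:
    """Samples consumed per optimizer step across all processes."""
    return per_device_batch_size * gradient_accumulation_steps * num_processes


# Single GPU with the settings from the pre-training/SFT commands.
print(effective_batch_size(4, 4, 1))  # 16
# Four processes, as in the Accelerate configuration.
print(effective_batch_size(4, 4, 4))  # 64
```

If the learning rate was tuned for the single-GPU setting, it may be worth revisiting once the effective batch size changes.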
T2, using DeepSpeed:
deepspeed --num_gpus 8 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json \
    ... # same arguments as above
Example DeepSpeed configuration for full-parameter fine-tuning with DeepSpeed ZeRO-2:
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
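The ds_config.json file passed to --deepspeed can be generated from Python, which avoids JSON typos when the settings are edited. The sketch below reproduces the ZeRO-2 settings above. The helper names make_ds_config and write_ds_config are hypothetical, and the bucket sizes are written as integers (500000000) rather than 5e8.

```python
import json
import os
import tempfile


def make_ds_config(zero_stage: int = 2) -> dict:
    """Return the ZeRO configuration shown above as a Python dict."""
    return {
        # "auto" values are filled in by the Hugging Face Trainer integration
        # from the corresponding command-line arguments.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "zero_allow_untested_optimizer": True,
        "fp16": {
            "enabled": "auto",
            "loss_scale": 0,
            "initial_scale_power": 16,
            "loss_scale_window": 1000,
            "hysteresis": 2,
            "min_loss_scale": 1,
        },
        "zero_optimization": {
            "stage": zero_stage,
            "allgather_partitions": True,
            "allgather_bucket_size": 500_000_000,
            "reduce_scatter": True,
            "reduce_bucket_size": 500_000_000,
            "overlap_comm": False,
            "contiguous_gradients": True,
        },
    }


def write_ds_config(path: str, zero_stage: int = 2) -> None:
    """Serialize the configuration to a JSON file at the given path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(make_ds_config(zero_stage), f, indent=2)


# Demo: write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "ds_config.json")
    write_ds_config(config_path)
    with open(config_path, encoding="utf-8") as f:
        print(json.load(f)["zero_optimization"]["stage"])
```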
5. Multiple inference methods: API, CLI, GUI
API service |
Command-line test:
python src/cli_demo.py \
    --model_name_or_path path_to_llama_model \
    --template default \
    --finetuning_type lora \
    --checkpoint_dir path_to_checkpoint
Browser test:
python src/web_demo.py \
    --model_name_or_path path_to_llama_model \
    --template default \
    --finetuning_type lora \
    --checkpoint_dir path_to_checkpoint
6. Metric evaluation and model prediction
Metric evaluation (BLEU and Chinese ROUGE scores) |
Model prediction:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path path_to_llama_model \
    --do_predict \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --checkpoint_dir path_to_checkpoint \
    --output_dir path_to_predict_result \
    --per_device_eval_batch_size 8 \
    --max_samples 100 \
    --predict_with_generate
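To give a feel for what a ROUGE score measures on Chinese text, here is a minimal character-level ROUGE-L (F1 variant) in pure Python. It is only an illustration: it does not use jieba, rouge-chinese, or nltk, so it is not the project's evaluation code, and its numbers are not expected to match the project's output. The example sentences are made up.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two character sequences."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Character-level ROUGE-L F1 between a prediction and a reference."""
    lcs = lcs_length(prediction, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(prediction)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction/reference pair sharing the first four characters.
print(round(rouge_l_f1("今天天气很好", "今天天气不错"), 4))  # 0.6667
```

Word-level scoring would first segment both strings into words and then run the same computation over the token lists.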
7. Exporting the fine-tuned model
Export the fine-tuned model |
How to use ChatGLM Efficient Tuning
To be updated.