LLMs: LLaMA Efficient Tuning (an efficient tool for fine-tuning mainstream large models [ChatGLM2/LLaMA2/Baichuan, etc.] with [full-parameter/LoRA/QLoRA] methods, covering [pre-training + supervised instruction fine-tuning + reward model training + PPO training + DPO training]): detailed introduction, installation, and usage

Table of contents

Related articles

LLMs: ChatGLM Efficient Tuning (an efficient fine-tuning tool for ChatGLM-6B/ChatGLM2-6B [LoRA/P-Tuning V2/Freeze Tuning/full-parameter fine-tuning]): introduction, installation, and detailed usage guide

LLMs: LLaMA Efficient Tuning (an efficient tool for fine-tuning mainstream large models [ChatGLM2/LLaMA2/Baichuan, etc.] with [full-parameter/LoRA/QLoRA] methods, covering [pre-training + supervised instruction fine-tuning + reward model training + PPO training + DPO training]): detailed introduction, installation, and usage

Introduction to LLaMA Efficient Tuning

1. Supported models

2. Supported training methods

3. Available datasets: for pre-training, supervised instruction fine-tuning, and reward model or DPO training

Installation of LLaMA Efficient Tuning

1. Environment dependencies

(1) Python dependencies

2. Environment setup

3. Data preparation: building a custom dataset

4. Fine-tuning/testing

(1) One-click fine-tuning/testing in the browser

(2) Single-GPU training: pre-training, supervised instruction fine-tuning, reward model training, PPO training, DPO training

(3) Multi-GPU distributed training: T1, using Hugging Face Accelerate; T2, using DeepSpeed

5. Multiple inference methods: API, CLI, GUI

6. Metric evaluation and model prediction

7. Export the fine-tuned model

How to use LLaMA Efficient Tuning


Related articles

LLMs: ChatGLM Efficient Tuning (an efficient fine-tuning tool for ChatGLM-6B/ChatGLM2-6B [LoRA/P-Tuning V2/Freeze Tuning/full-parameter fine-tuning]): introduction, installation, and detailed usage guide

https://yunyaniu.blog.csdn.net/article/details/131427931

LLMs: LLaMA Efficient Tuning (an efficient tool for fine-tuning mainstream large models [ChatGLM2/LLaMA2/Baichuan, etc.] with [full-parameter/LoRA/QLoRA] methods, covering [pre-training + supervised instruction fine-tuning + reward model training + PPO training + DPO training]): detailed introduction, installation, and usage

https://yunyaniu.blog.csdn.net/article/details/132012771

Introduction to LLaMA Efficient Tuning

      LLaMA Efficient Tuning, released in June 2023, is an efficient tool for fine-tuning mainstream large models (ChatGLM2, LLaMA2, Baichuan, etc.) with full-parameter, LoRA, or QLoRA methods. It covers pre-training, supervised instruction fine-tuning, reward model training, PPO training, DPO training, and other functions. The project is still being actively updated.

Official address:
GitHub - hiyouga/LLaMA-Efficient-Tuning: Easy-to-use LLM fine-tuning framework (LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, ChatGLM2)

1. Supported models

Model name | Model size                  | Default module  | Template
LLaMA      | 7B/13B/33B/65B              | q_proj,v_proj   | -
LLaMA-2    | 7B/13B/70B                  | q_proj,v_proj   | llama2
BLOOM      | 560M/1.1B/1.7B/3B/7.1B/176B | query_key_value | -
BLOOMZ     | 560M/1.1B/1.7B/3B/7.1B/176B | query_key_value | -
Falcon     | 7B/40B                      | query_key_value | -
Baichuan   | 7B/13B                      | W_pack          | baichuan
InternLM   | 7B                          | q_proj,v_proj   | intern
Qwen       | 7B                          | c_attn          | chatml
XVERSE     | 13B                         | q_proj,v_proj   | xverse
ChatGLM2   | 6B                          | query_key_value | chatglm2
  • The default modules are used for the --lora_target argument; use python src/train_bash.py -h to view all available options.
  • For all "Base" models, the --template argument can be any value such as default, alpaca, vicuna, etc. But be sure to use the corresponding template for "Chat" models.

2. Supported training methods

Five training methods are supported: pre-training, supervised instruction fine-tuning, reward model training, PPO training, and DPO training. Each can be combined with full-parameter training, partial-parameter training, LoRA, or QLoRA (see the support matrix in the project README for the exact combinations).

  • Use the --quantization_bit 4/8 argument to enable QLoRA training, as sketched below.
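A minimal example (a sketch based on the supervised fine-tuning command shown later in this article; only the relevant flags differ from a plain LoRA run):

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path path_to_llama_model \
    --do_train \
    --finetuning_type lora \
    --quantization_bit 4 \
    ...   # remaining arguments as in the single-GPU training examples below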

3. Available datasets: for pre-training, supervised instruction fine-tuning, and reward model or DPO training

Please refer to the data/README.md file for usage.

Some datasets require confirmation before use, so we recommend logging in to your Hugging Face account with the following commands:

pip install --upgrade huggingface_hub

huggingface-cli login
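Newer versions of huggingface_hub also allow passing the token non-interactively, which is convenient for scripts (a sketch; verify the flag with huggingface-cli login --help, and $HF_TOKEN is an assumed environment variable holding your access token):

huggingface-cli login --token $HF_TOKEN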

Installation of LLaMA Efficient Tuning

1. Environment dependencies

(1) Python dependencies

Python 3.8+, PyTorch 1.13.1

Transformers, Datasets, Accelerate, PEFT, TRL

protobuf, cpm-kernels, sentencepiece

jieba, rouge-chinese, nltk (for evaluation)

gradio, matplotlib (for web-side interaction)

uvicorn, fastapi, sse-starlette (for API)

2. Environment setup

General

git clone https://github.com/hiyouga/LLaMA-Efficient-Tuning.git
conda create -n llama_etuning python=3.10
conda activate llama_etuning
cd LLaMA-Efficient-Tuning
pip install -r requirements.txt
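Before moving on, a quick sanity check that PyTorch sees the GPU (a generic check, not specific to this project):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"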

Windows platform + QLoRA

If you want to enable quantized LoRA (QLoRA) on the Windows platform, you need to install the precompiled bitsandbytes library, which supports CUDA 11.1 to 12.1.

pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.39.1-py3-none-win_amd64.whl
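To verify the wheel was installed correctly, try importing the package (a quick sanity check; an import error here usually indicates a CUDA/driver mismatch):

python -c "import bitsandbytes"
pip show bitsandbytes   # confirm the installed version is 0.39.1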

3. Data preparation: building a custom dataset

For the format of the dataset file, please refer to the contents of the data/example_dataset folder. When building a custom dataset, you can use either a single .json file or a data loading script with multiple files.
Note: when using a custom dataset, please update the data/dataset_info.json file; for its format, refer to data/README.md.

Source code address:

https://github.com/hiyouga/LLaMA-Efficient-Tuning/blob/main/data/dataset_info.json

# Locate the dataset configuration file (data/dataset_info.json) and modify the corresponding entry.
# An optional "file_sha1" field (e.g. "607f94a7f581341e59685aef32f531095232cf23") can be added to verify the data file.
{
  "dataset_DIY": {
    "file_name": "dataset_DIY.json"
  }
}
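For reference, a minimal instruction-style dataset file matching the entry above could be created as follows (a sketch assuming the default alpaca-style instruction/input/output columns described in data/README.md; dataset_DIY.json is an illustrative name):

cat > data/dataset_DIY.json <<'EOF'
[
  {
    "instruction": "Translate the following sentence into English.",
    "input": "今天天气很好。",
    "output": "The weather is nice today."
  }
]
EOF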

4. Fine-tuning/testing

(1) One-click fine-tuning/testing in the browser

CUDA_VISIBLE_DEVICES=0 python src/train_web.py

We strongly recommend that beginners use the all-in-one browser interface, because it can also automatically generate the command-line scripts needed for a run.

Note: the web UI currently only supports single-GPU training.

(2) Single-GPU training: pre-training, supervised instruction fine-tuning, reward model training, PPO training, DPO training

Pre-training
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage pt \
    --model_name_or_path path_to_llama_model \
    --do_train \
    --dataset wiki_demo \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_pt_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16

Supervised instruction fine-tuning
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path path_to_llama_model \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16

Reward model training
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage rm \
    --model_name_or_path path_to_llama_model \
    --do_train \
    --dataset comparison_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --resume_lora_training False \
    --checkpoint_dir path_to_sft_checkpoint \
    --output_dir path_to_rm_checkpoint \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-6 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16

PPO training
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage ppo \
    --model_name_or_path path_to_llama_model \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --resume_lora_training False \
    --checkpoint_dir path_to_sft_checkpoint \
    --reward_model path_to_rm_checkpoint \
    --output_dir path_to_ppo_checkpoint \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss

DPO training
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage dpo \
    --model_name_or_path path_to_llama_model \
    --do_train \
    --dataset comparison_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --resume_lora_training False \
    --checkpoint_dir path_to_sft_checkpoint \
    --output_dir path_to_dpo_checkpoint \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16

(3) Multi-GPU distributed training: T1, using Hugging Face Accelerate; T2, using DeepSpeed

T1. Using Hugging Face Accelerate

accelerate config   # configure the distributed environment first

accelerate launch src/train_bash.py   # arguments same as above

Example Accelerate configuration for full-parameter fine-tuning with DeepSpeed ZeRO-2:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 4
  gradient_clipping: 0.5
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
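To reuse this configuration, one option (a sketch; accelerate_config.yaml is an assumed filename) is to save the YAML above to a file and pass it explicitly when launching:

accelerate launch --config_file accelerate_config.yaml src/train_bash.py \
    ...   # arguments same as above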

T2. Using DeepSpeed

deepspeed --num_gpus 8 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json \
    ...   # arguments same as above

Example DeepSpeed configuration (ds_config.json) for full-parameter fine-tuning with ZeRO-2:

{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}

5. Multiple inference methods: API, CLI, GUI

API service
python src/api_demo.py \
    --model_name_or_path path_to_llama_model \
    --template default \
    --finetuning_type lora \
    --checkpoint_dir path_to_checkpoint

See http://localhost:8000/docs for the API documentation.
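As a usage sketch, assuming the service exposes an OpenAI-style chat completion endpoint (check http://localhost:8000/docs for the actual routes and request schema), a request might look like:

curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "default", "messages": [{"role": "user", "content": "Hello!"}]}'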

Command-line test (CLI)
python src/cli_demo.py \
    --model_name_or_path path_to_llama_model \
    --template default \
    --finetuning_type lora \
    --checkpoint_dir path_to_checkpoint

Browser test (web demo)
python src/web_demo.py \
    --model_name_or_path path_to_llama_model \
    --template default \
    --finetuning_type lora \
    --checkpoint_dir path_to_checkpoint

6. Metric evaluation and model prediction

Metric evaluation (BLEU score and Chinese ROUGE score)
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path path_to_llama_model \
    --do_eval \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --checkpoint_dir path_to_checkpoint \
    --output_dir path_to_eval_result \
    --per_device_eval_batch_size 8 \
    --max_samples 100 \
    --predict_with_generate

We recommend using --per_device_eval_batch_size=1 and --max_target_length 128 when evaluating quantized models.

Model prediction
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path path_to_llama_model \
    --do_predict \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --checkpoint_dir path_to_checkpoint \
    --output_dir path_to_predict_result \
    --per_device_eval_batch_size 8 \
    --max_samples 100 \
    --predict_with_generate

7. Export the fine-tuned model

Export the fine-tuned model
python src/export_model.py \
    --model_name_or_path path_to_llama_model \
    --template default \
    --finetuning_type lora \
    --checkpoint_dir path_to_checkpoint \
    --output_dir path_to_export
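After exporting, the merged weights in path_to_export can be used like an ordinary model directory, for example with the CLI demo above (a sketch assuming the exported model no longer needs a LoRA checkpoint):

python src/cli_demo.py \
    --model_name_or_path path_to_export \
    --template default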
 

How to use LLaMA Efficient Tuning

Updating…
