[AI in Practice] Train your own ChatGPT

This article uses Alpaca-LoRA to train your own ChatGPT-style model. The training data consists of an open-source dataset of 550,000 entries (Belle) plus my own medical question-and-answer dataset of 10 million entries.

Preparation

Environment

  • CUDA 10.2
  • Ubuntu 20.04
  • python 3.8
  • torch 2.0.0
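
Before going further, it helps to confirm that this PyTorch build actually sees the GPUs (a small sketch):

    # Quick environment check: versions and visible GPUs.
    import torch

    print(torch.__version__)          # PyTorch version, e.g. 2.0.0
    print(torch.version.cuda)         # CUDA version this PyTorch build targets
    print(torch.cuda.is_available())  # should be True on a GPU machine
    print(torch.cuda.device_count())  # number of visible GPUs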

Code

We use the Alpaca-LoRA code base. First, clone the repository:

git clone git@github.com:tloen/alpaca-lora.git

If the following error message appears:

# git clone git@github.com:tloen/alpaca-lora.git

Cloning into 'alpaca-lora'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

This means your SSH key is not configured for GitHub; you can fix it by following this article:
https://blog.csdn.net/helloasimo/article/details/123778112
Alternatively, clone over HTTPS instead: https://github.com/tloen/alpaca-lora.git

  • Install the remaining dependencies

    cd alpaca-lora
    pip install -r requirements.txt
    

    If the installation fails, try again a few times; it is usually a network problem.

  • After the above steps, the repository should look roughly like this:

    /notebooks/alpaca-lora# ls -lh
    total 44M
    -rw-r--r-- 1 root root  20K Mar 31 07:53 DATA_LICENSE
    -rw-r--r-- 1 root root  635 Mar 31 07:53 Dockerfile
    -rw-r--r-- 1 root root  12K Mar 31 07:53 LICENSE
    -rw-r--r-- 1 root root  15K Mar 31 07:53 README.md
    -rw-r--r-- 1 root root  22M Mar 31 07:53 alpaca_data.json
    -rw-r--r-- 1 root root  22M Mar 31 07:53 alpaca_data_cleaned.json
    -rw-r--r-- 1 root root  643 Mar 31 07:53 docker-compose.yml
    -rw-r--r-- 1 root root 1.5K Mar 31 07:53 export_hf_checkpoint.py
    -rw-r--r-- 1 root root 3.6K Mar 31 07:53 export_state_dict_checkpoint.py
    -rw-r--r-- 1 root root 9.5K Mar 31 07:53 finetune.py
    -rw-r--r-- 1 root root 5.8K Mar 31 07:53 generate.py
    -rw-r--r-- 1 root root  81K Mar 31 07:53 lengths.ipynb
    -rw-r--r-- 1 root root  131 Mar 31 07:53 pyproject.toml
    -rw-r--r-- 1 root root  206 Mar 31 07:53 requirements.txt
    drwxr-xr-x 2 root root 4.0K Mar 31 07:53 templates
    drwxr-xr-x 2 root root 4.0K Mar 31 07:53 utils
    

Model

We use the alpaca-lora-cn-13b weights as our model.
Address: https://huggingface.co/facat/alpaca-lora-cn-13b/tree/main

Clone the model:

# Make sure you have git-lfs installed (https://git-lfs.github.com)

git lfs install
git clone https://huggingface.co/facat/alpaca-lora-cn-13b

Note: You may need to download adapter_model.bin manually from https://huggingface.co/facat/alpaca-lora-cn-13b/tree/main if the LFS pull does not fetch it.
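
If you prefer a scripted download, one option is the huggingface_hub client (a minimal sketch, assuming huggingface_hub is installed; it downloads into the local cache, and you then copy the file into the cloned directory):

    # Sketch: fetch adapter_model.bin via the Hugging Face Hub client.
    # Requires: pip install huggingface_hub
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="facat/alpaca-lora-cn-13b",
        filename="adapter_model.bin",
    )
    print(path)  # copy this file into the alpaca-lora-cn-13b/ directory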

The directory should then look something like this:

/notebooks# ls -lh alpaca-lora-cn-13b/
total 26M
-rw-r--r-- 1 root root 341 Mar 30 09:50 README.md
-rw-r--r-- 1 root root 371 Mar 30 09:50 adapter_config.json
-rw-rw-r-- 1 1003 1003 26M Mar 30 09:54 adapter_model.bin

Dataset: Belle

[If you do not have your own dataset, you can use this one to train the model.]

Download Belle's open source Chinese dataset (for research use only, commercial use is prohibited!)

  • Introduction
    The dataset contains 550,000 entries.
    Download address:
    https://huggingface.co/datasets/BelleGroup/generated_train_0.5M_CN/tree/main

  • Data format (JSON):

    {"input": "用一句话描述地球为什么是独一无二的。\n", "target": "地球上有适宜生命存在的条件和多样化的生命形式。"}
    {"input": "给出一段对话,要求GPT模型使用合适的语气和回答方式继续对话。\n对话:\nA:你今天看起来很高兴,发生了什么好事?\nB:是的,我刚刚得到一份来自梅西银行的工作通知书。\nA:哇,恭喜你!你打算什么时候开始工作?\nB:下个月开始,所以我现在正为这份工作做准备。", "target": "A: 这太好了!你的新工作听起来很令人兴奋。你对接下来的日子有什么期望吗?\nB: 是啊,我非常期待能在梅西银行工作。我希望我能够尽快适应新环境,并展示出我的所有技能和才能。"}

Clean your dataset

Clean the data into the JSON format shown below, which is what finetune.py expects.
My own dataset has 10 million entries, which is too large, so I will not release it here; after cleaning it follows the same format.

The data format after cleaning is as follows, in all_data.train.json:

[
    {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1. Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
    },
    {
        "instruction": "What are the three primary colors?",
        "input": "",
        "output": "The three primary colors are red, blue, and yellow."
    }
]
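
If you train on the Belle data rather than your own, its {"input", "target"} records first need to be converted into this instruction/input/output format. A minimal conversion sketch (the filenames are placeholders; assumes the Belle file stores one JSON object per line, as in the sample above):

    # Sketch: convert Belle-style records to the Alpaca-style format used by finetune.py.
    # Filenames are placeholders; adjust them to your local paths.
    import json

    records = []
    with open("belle_0.5M.json", encoding="utf-8") as f:  # Belle file, one JSON object per line
        for line in f:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            records.append({
                "instruction": item["input"],   # the prompt/question
                "input": "",                    # Belle has no separate context field
                "output": item["target"],       # the reference answer
            })

    with open("all_data.train.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=4)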

After the above work is completed, the working directory looks roughly like this:

/notebooks# ls -lh
total 1.5G
drwxr-xr-x 6 root root 4.0K Mar 31 07:53 alpaca-lora
drwxr-xr-x 3 root root 4.0K Mar 31 08:15 alpaca-lora-cn-13b
drwxrwxr-x 2 1003 1003 4.0K Mar 30 09:14 data

The cleaned data:

/notebooks/data# du -sh all_data.train.json
6.2G    all_data.train.json
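
Before launching a multi-hour training run, it is worth a quick sanity check that the cleaned file parses and has the expected fields (a small sketch; note that json.load reads the whole 6.2 GB file into memory, and the path is a placeholder):

    # Sketch: sanity-check the cleaned training file before starting finetune.py.
    import json

    with open("all_data.train.json", encoding="utf-8") as f:
        data = json.load(f)  # loads the entire file into memory

    print(len(data), "examples")
    required = ("instruction", "input", "output")
    bad = [i for i, ex in enumerate(data) if not all(k in ex for k in required)]
    print(len(bad), "examples missing required fields")
    print(data[0])  # eyeball the first example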

Train

  • Single-card training
python3 finetune.py \
    --base_model 'decapoda-research/llama-7b-hf' \
    --data_path '../data/all_data.train.json' \
    --output_dir './lora-alpaca-zh' \
    --micro_batch_size 1 \
    --num_epochs 3

The training process then runs. The downloads are very large and take a long time (about 4 hours in my case)!

  • Multi-card training
    I used 4 cards (one torchrun process per GPU, so --nproc_per_node should match the number of cards):
WORLD_SIZE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
    --nproc_per_node=4 \
    --master_port=1234 \
    finetune.py \
    --base_model 'decapoda-research/llama-7b-hf' \
    --data_path '../data/all_data.train.json' \
    --output_dir './lora-alpaca-zh' \
    --micro_batch_size 1 \
    --num_epochs 3

Test

python3 generate.py \
    --load_8bit \
    --base_model 'decapoda-research/llama-7b-hf' \
    --lora_weights './lora-alpaca-zh'
  • Test results: generate.py launches a Gradio web UI where you can enter prompts and inspect the model's answers.
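
If you prefer to query the fine-tuned weights from Python instead of the web UI, here is a rough sketch using transformers + peft (the prompt template is the standard Alpaca no-input template; adjust it to match whatever template you trained with):

    # Sketch: programmatic inference with the base model plus the trained LoRA adapter.
    # Assumes transformers, peft and accelerate (for device_map="auto") are installed.
    import torch
    from peft import PeftModel
    from transformers import LlamaForCausalLM, LlamaTokenizer

    base = "decapoda-research/llama-7b-hf"
    tokenizer = LlamaTokenizer.from_pretrained(base)
    model = LlamaForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")
    model = PeftModel.from_pretrained(model, "./lora-alpaca-zh")  # the LoRA weights trained above
    model.eval()

    # Standard Alpaca prompt template for an instruction with no extra input.
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n用一句话描述地球为什么是独一无二的。\n\n### Response:\n"
    )
    device = next(model.parameters()).device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))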

References

  • https://huggingface.co/facat/alpaca-lora-cn-13b/tree/main
  • https://github.com/tloen/alpaca-lora
  • https://github.com/gururise/AlpacaDataCleaned
