I previously reproduced Stanford Alpaca (Stanford Alpaca 7B) from 0 to 1. Stanford Alpaca fine-tunes the entire LLaMA model, that is, all parameters of the pre-trained model are updated (full fine-tuning). However, this approach still demands expensive hardware and trains slowly.
Alpaca-Lora therefore uses LoRA: it freezes the original LLaMA parameters and adds small additional network layers to the model, training only the parameters of these new layers. Because there are so few new parameters, the cost of fine-tuning drops sharply (on a single RTX 4090 it takes only about 5 hours to train a model comparable to Alpaca, bringing the compute requirements of this kind of model down to consumer grade) while achieving results similar to full fine-tuning.
Principles of LoRA Technology
The principle of LoRA is not complicated. Its core idea is to add a bypass next to the original pre-trained language model that performs a dimensionality-reduction followed by a dimensionality-expansion, to model the so-called intrinsic rank: the generalization of a pre-trained model to various downstream tasks is, in effect, the optimization of a very small number of free parameters in a low-dimensional intrinsic subspace shared across tasks. During training, the parameters of the pre-trained language model are frozen and only the down-projection matrix A and the up-projection matrix B are trained. The input and output dimensions of the model remain unchanged, and at the output the contribution of BA is added to that of the pre-trained weights. A is initialized from a random Gaussian distribution and B is initialized to zero, which guarantees that at the start of training the new path BA = 0 and has no effect on the model's outputs.
During inference, the outputs of the two branches are simply added: h = Wx + BAx = (W + BA)x. So after training, the matrix product BA can be added to the original weight matrix W, and W + BA simply replaces W in the original pre-trained language model as the new weight; no additional compute is required at inference time.
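To make the bypass concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer (purely illustrative; the class name `LoRALinear` and the scaling convention are my own choices, not code from Alpaca-LoRA, which relies on the PEFT library introduced later):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer W plus a trainable low-rank bypass B·A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze the pre-trained weight W
        in_f, out_f = base.in_features, base.out_features
        # A: down-projection (in_f -> r), initialized from a random Gaussian
        self.lora_A = nn.Parameter(torch.randn(r, in_f) * 0.01)
        # B: up-projection (r -> out_f), initialized to zero so that B·A = 0 at the start
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W x + B A x (scaled); input and output dimensions are unchanged
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # after training, fold B·A into W (W' = W + scaling * B·A),
        # so inference needs no extra compute
        self.base.weight += self.scaling * (self.lora_B @ self.lora_A)
        return self.base
```

Wrapping a 4096x4096 projection this way with r=8 adds only 2 x 8 x 4096 = 65,536 trainable parameters, compared with the roughly 16.8 million parameters in the frozen weight itself.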
The biggest advantage of LoRA is that it is faster and uses less memory; therefore, it can run on consumer-grade hardware.
Next, let's use Alpaca-Lora for parameter-efficient fine-tuning. The relevant code is on GitHub: llm-action.
Environment setup
The basic environment configuration is as follows:
- Operating system: CentOS 7
- CPUs: a single node with Intel CPUs and 1 TB of memory; 64 physical CPUs with 16 cores each
- GPUs: 8 x A800 80GB
- Python: 3.10 (first upgrade OpenSSL to version 1.1.1t, then compile and install Python)
- NVIDIA driver: 515.65.01 (choose the driver that matches your GPU model)
- CUDA Toolkit: 11.7
- NCCL: nccl_2.14.3-1+cuda11.7
- cuDNN: 8.8.1.3_cuda11
The installation of the NVIDIA driver, CUDA, Python, and the other tools above is not covered here.
Create and activate the virtual environment alpara-lora-venv-py310-cu117:
cd /home/guodong.li/virtual-venv
virtualenv -p /usr/bin/python3.10 alpara-lora-venv-py310-cu117
source /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/bin/activate
To install PyTorch offline, download the torch and torchvision wheels that match your CUDA version.
pip install torch-1.13.1+cu117-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.14.1+cu117-cp310-cp310-linux_x86_64.whl
Install transformers. At the time of writing, the LLaMA implementation has not appeared in a released version but has been merged into the main branch, so we need to check out the corresponding commit and install from source.
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5
pip install .
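A quick way to verify that this source build actually contains the LLaMA classes (a small sanity check of my own, not part of the original steps):

```python
# These classes only exist in transformers builds that include the merged LLaMA support.
from transformers import LlamaForCausalLM, LlamaTokenizer

print(LlamaForCausalLM.__name__, LlamaTokenizer.__name__)
```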
In the Alpaca-LoRA project, the authors mention that for cheap and efficient fine-tuning they use Hugging Face's PEFT. PEFT is a library that supports LoRA as well as Prefix Tuning, P-Tuning, and Prompt Tuning, and lets you fine-tune various Transformer-based language models efficiently. Install PEFT as follows.
git clone https://github.com/huggingface/peft.git
cd peft/
git checkout e536616
pip install .
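For context, this is roughly how finetune.py in Alpaca-LoRA uses PEFT to wrap the frozen LLaMA model; treat the snippet below as a simplified sketch rather than a verbatim excerpt (in particular, the real script loads the base model in 8-bit and calls prepare_model_for_int8_training, which is omitted here):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import LlamaForCausalLM

# the converted HF checkpoint used throughout this article
base_model = LlamaForCausalLM.from_pretrained(
    "/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b",
    torch_dtype=torch.float16,
)

# hyperparameters mirror the defaults listed in the fine-tuning section below
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # attention query/value projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # prints the "trainable params: ... || all params: ..." line seen later in the logs
```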
Install bitsandbytes.
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=117 make cuda11x
python setup.py install
Install other related libraries.
cd alpaca-lora
pip install -r requirements.txt
The contents of the requirements.txt file are as follows:
accelerate
appdirs
loralib
black
black[jupyter]
datasets
fire
sentencepiece
gradio
Model format conversion
Convert the original LLaMA weight files to the model format used by the Transformers library. For details, refer to the previous article: Reproducing Stanford Alpaca (Stanford Alpaca 7B) from 0 to 1. If you don't want to convert the LLaMA weights yourself, you can also download an already converted model directly from Hugging Face.
Model fine-tuning
The default values for training are as follows:
batch_size: 128
micro_batch_size: 4
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca
With the default parameters, training takes about 5 hours on a single card, and GPU memory consumption stays quite low.
1%|█▌ | 12/1170 [03:21<5:24:45, 16.83s/it]
Here, to speed up training, batch_size and micro_batch_size are increased and num_epochs is reduced.
python finetune.py \
--base_model '/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b' \
--data_path '/data/nfs/guodong.li/data/alpaca_data_cleaned.json' \
--output_dir '/home/guodong.li/output/lora-alpaca' \
--batch_size 256 \
--micro_batch_size 16 \
--num_epochs 2
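A note on the two batch-size flags: batch_size is the effective batch size per optimizer step, which finetune.py reaches by accumulating gradients over micro-batches of micro_batch_size samples (this mirrors the usual gradient_accumulation_steps = batch_size // micro_batch_size logic; treat the exact variable names as an assumption). A tiny sketch of the arithmetic:

```python
# defaults: 128 // 4 = 32 accumulation steps per optimizer update
default_accum = 128 // 4

# the values used in this run: 256 // 16 = 16 accumulation steps,
# i.e. fewer but larger micro-batches, which improves GPU utilization
this_run_accum = 256 // 16

print(default_accum, this_run_accum)
```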
Of course, the hyperparameters can also be adjusted as needed. A reference example:
python finetune.py \
--base_model 'decapoda-research/llama-7b-hf' \
--data_path 'yahma/alpaca-cleaned' \
--output_dir './lora-alpaca' \
--batch_size 128 \
--micro_batch_size 4 \
--num_epochs 3 \
--learning_rate 1e-4 \
--cutoff_len 512 \
--val_set_size 2000 \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--lora_target_modules '[q_proj,v_proj]' \
--train_on_inputs \
--group_by_length
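lora_target_modules selects which sub-modules receive LoRA bypasses; q_proj and v_proj are the attention query and value projections in LLaMA. If you want to see what other module names are available to target (k_proj, o_proj, the MLP projections, and so on), a small inspection snippet of my own:

```python
import torch.nn as nn
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b"
)

# collect the distinct names of linear sub-modules inside the model
linear_names = sorted({name.split(".")[-1]
                       for name, module in model.named_modules()
                       if isinstance(module, nn.Linear)})
print(linear_names)
# typically: ['down_proj', 'gate_proj', 'k_proj', 'lm_head', 'o_proj', 'q_proj', 'up_proj', 'v_proj']
```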
Run log:
python finetune.py \
> --base_model '/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b' \
> --data_path '/data/nfs/guodong.li/data/alpaca_data_cleaned.json' \
> --output_dir '/home/guodong.li/output/lora-alpaca' \
> --batch_size 256 \
> --micro_batch_size 16 \
> --num_epochs 2
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/rh/devtoolset-9/root/usr/lib/dyninst'), PosixPath('/opt/rh/devtoolset-7/root/usr/lib/dyninst')}
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Training Alpaca-LoRA model with params:
base_model: /data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b
data_path: /data/nfs/guodong.li/data/alpaca_data_cleaned.json
output_dir: /home/guodong.li/output/lora-alpaca
batch_size: 256
micro_batch_size: 16
num_epochs: 2
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:10<00:00, 3.01it/s]
Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 228.95it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
{'loss': 2.2501, 'learning_rate': 2.6999999999999996e-05, 'epoch': 0.05}
...
{'loss': 0.8998, 'learning_rate': 0.000267, 'epoch': 0.46}
{'loss': 0.8959, 'learning_rate': 0.00029699999999999996, 'epoch': 0.51}
28%|███████████████████████████████████████████▎ | 109/390 [32:48<1:23:14, 17.77s/it]
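The `trainable params: 4194304` line in the log is easy to verify by hand: LLaMA-7B has 32 decoder layers with a hidden size of 4096, and LoRA with r=8 on q_proj and v_proj adds an A (8x4096) and a B (4096x8) matrix to each of those two projections in every layer:

```python
hidden = 4096   # LLaMA-7B hidden size
layers = 32     # number of decoder layers
r = 8           # LoRA rank
modules = 2     # q_proj and v_proj per layer

trainable = layers * modules * (r * hidden + hidden * r)
print(trainable)                          # 4194304, matching the log
print(100 * trainable / 6_742_609_920)    # ~0.0622 %, matching "trainable%"
```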
Memory usage:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... Off | 00000000:34:00.0 Off | 0 |
| N/A 71C P0 299W / 300W | 57431MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
...
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A800 80G... Off | 00000000:9E:00.0 Off | 0 |
| N/A 33C P0 71W / 300W | 951MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 55017 C python 57429MiB |
...
| 7 N/A N/A 55017 C python 949MiB |
+-----------------------------------------------------------------------------+
With these settings, GPU utilization and training speed both increase, although the GPU is still not fully utilized; single-card training (3 epochs) would complete in about 3 hours.
To speed up training further, let's use data parallelism across multiple cards.
torchrun --nproc_per_node=8 --master_port=29005 finetune.py \
--base_model '/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b' \
--data_path '/data/nfs/guodong.li/data/alpaca_data_cleaned.json' \
--output_dir '/home/guodong.li/output/lora-alpaca' \
--batch_size 256 \
--micro_batch_size 16 \
--num_epochs 2
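For reference, finetune.py switches between single-GPU and DDP mode based on the WORLD_SIZE environment variable that torchrun sets; the sketch below paraphrases that logic from memory, so treat the details (in particular the division of the accumulation steps) as an assumption rather than a verbatim excerpt:

```python
import os

batch_size = 256
micro_batch_size = 16
gradient_accumulation_steps = batch_size // micro_batch_size  # 16

world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1
if ddp:
    # each rank loads the model onto its own GPU ...
    device_map = {"": int(os.environ.get("LOCAL_RANK", 0))}
    # ... and only accumulates its share of the global batch
    gradient_accumulation_steps = gradient_accumulation_steps // world_size  # 2 with 8 GPUs
else:
    device_map = "auto"

print(ddp, device_map, gradient_accumulation_steps)
```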
Run log:
torchrun --nproc_per_node=8 --master_port=29005 finetune.py \
> --base_model '/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b' \
> --data_path '/data/nfs/guodong.li/data/alpaca_data_cleaned.json' \
> --output_dir '/home/guodong.li/output/lora-alpaca' \
> --batch_size 256 \
> --micro_batch_size 16 \
> --num_epochs 2
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
...
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/rh/devtoolset-9/root/usr/lib/dyninst'), PosixPath('/opt/rh/devtoolset-7/root/usr/lib/dyninst')}
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
/home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/rh/devtoolset-7/root/usr/lib/dyninst'), PosixPath('/opt/rh/devtoolset-9/root/usr/lib/dyninst')}
...
Training Alpaca-LoRA model with params:
base_model: /data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b
data_path: /data/nfs/guodong.li/data/alpaca_data_cleaned.json
output_dir: /home/guodong.li/output/lora-alpaca
batch_size: 256
micro_batch_size: 16
num_epochs: 2
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:14<00:00, 2.25it/s]
...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:20<00:00, 1.64it/s]
Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 129.11it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
Map: 4%|██████▎ | 2231/49942 [00:01<00:37, 1256.31 examples/s]Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 220.24it/s]
...
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
Map: 2%|██▋ | 939/49942 [00:00<00:37, 1323.94 examples/s]Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 362.77it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
{'loss': 2.2798, 'learning_rate': 1.7999999999999997e-05, 'epoch': 0.05}
...
{'loss': 0.853, 'learning_rate': 0.0002006896551724138, 'epoch': 1.02}
{'eval_loss': 0.8590874075889587, 'eval_runtime': 10.5401, 'eval_samples_per_second': 189.752, 'eval_steps_per_second': 3.036, 'epoch': 1.02}
{'loss': 0.8656, 'learning_rate': 0.0001903448275862069, 'epoch': 1.07}
...
{'loss': 0.8462, 'learning_rate': 6.620689655172413e-05, 'epoch': 1.69}
{'loss': 0.8585, 'learning_rate': 4.137931034482758e-06, 'epoch': 1.99}
{'loss': 0.8549, 'learning_rate': 0.00011814432989690721, 'epoch': 2.05}
{'eval_loss': 0.8465630412101746, 'eval_runtime': 10.5273, 'eval_samples_per_second': 189.983, 'eval_steps_per_second': 3.04, 'epoch': 2.05}
{'loss': 0.8492, 'learning_rate': 0.00011195876288659793, 'epoch': 2.1}
...
{'loss': 0.8398, 'learning_rate': 1.2989690721649484e-05, 'epoch': 2.92}
{'loss': 0.8473, 'learning_rate': 6.804123711340206e-06, 'epoch': 2.97}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 585/585 [23:46<00:00, 2.38s/it]
{'train_runtime': 1426.9255, 'train_samples_per_second': 104.999, 'train_steps_per_second': 0.41, 'train_loss': 0.9613736364576552, 'epoch': 2.99}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 585/585 [23:46<00:00, 2.44s/it]
Model output files:
> tree /home/guodong.li/output/lora-alpaca
/home/guodong.li/output/lora-alpaca
├── adapter_config.json
├── adapter_model.bin
└── checkpoint-200
├── optimizer.pt
├── pytorch_model.bin
├── rng_state_0.pth
├── rng_state_1.pth
├── rng_state_2.pth
├── rng_state_3.pth
├── rng_state_4.pth
├── rng_state_5.pth
├── rng_state_6.pth
├── rng_state_7.pth
├── scaler.pt
├── scheduler.pt
├── trainer_state.json
└── training_args.bin
1 directory, 16 files
We can see that with data parallelism the whole training run completes in about 25 minutes (train_runtime ≈ 1427 s in the log above). For comparison, the latest "official" Alpaca-LoRA adapter, tloen/alpaca-lora-7b, was trained on March 26 with the following hyperparameters:
- Epochs: 10 (load from best epoch)
- Batch size: 128
- Cutoff length: 512
- Learning rate: 3e-4
- Lora r: 16
- Lora target modules: q_proj, k_proj, v_proj, o_proj
The specific commands are as follows:
python finetune.py \
--base_model='decapoda-research/llama-7b-hf' \
--num_epochs=10 \
--cutoff_len=512 \
--group_by_length \
--output_dir='./lora-alpaca' \
--lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
--lora_r=16 \
--micro_batch_size=8
Model inference
Run the command as follows:
python generate.py \
--load_8bit \
--base_model '/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b' \
--lora_weights '/home/guodong.li/output/lora-alpaca'
Running this script starts a Gradio service that you can test from a web page in your browser.
The running process is as follows:
python generate.py \
> --load_8bit \
> --base_model '/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b' \
> --lora_weights '/home/guodong.li/output/lora-alpaca'
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/rh/devtoolset-9/root/usr/lib/dyninst'), PosixPath('/opt/rh/devtoolset-7/root/usr/lib/dyninst')}
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:12<00:00, 2.68it/s]
/home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/gradio/inputs.py:27: UserWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components
warnings.warn(
/home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: `optional` parameter is deprecated, and it has no effect
warnings.warn(value)
/home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: `numeric` parameter is deprecated, and it has no effect
warnings.warn(value)
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
Memory usage:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... Off | 00000000:34:00.0 Off | 0 |
| N/A 50C P0 81W / 300W | 8877MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 7837 C python 8875MiB |
+-----------------------------------------------------------------------------+
Open the browser and enter the IP+port for testing.
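If you would rather query the fine-tuned model from code than through the web page, here is a minimal inference sketch (it reuses the paths from the command above; the prompt template and generation settings are my own simplifications of what generate.py does):

```python
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

BASE = "/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b"
LORA = "/home/guodong.li/output/lora-alpaca"

tokenizer = LlamaTokenizer.from_pretrained(BASE)
model = LlamaForCausalLM.from_pretrained(
    BASE, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, LORA, torch_dtype=torch.float16)
model.eval()

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nTell me about alpacas.\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```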
Merge LoRA weights back into the base model
The following merges the LoRA weights back into the base model and exports them in HuggingFace format and as PyTorch state_dicts, to help users who want to run inference in projects like llama.cpp or alpaca.cpp.
Export to HuggingFace format:
Modify the export_hf_checkpoint.py file:
import os
import torch
import transformers
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer # noqa: F402
BASE_MODEL = os.environ.get("BASE_MODEL", None)
# TODO
LORA_MODEL = os.environ.get("LORA_MODEL", "tloen/alpaca-lora-7b")
HF_CHECKPOINT = os.environ.get("HF_CHECKPOINT", "./hf_ckpt")
assert (
BASE_MODEL
), "Please specify a value for BASE_MODEL environment variable, e.g. `export BASE_MODEL=decapoda-research/llama-7b-hf`" # noqa: E501
tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
base_model = LlamaForCausalLM.from_pretrained(
BASE_MODEL,
load_in_8bit=False,
torch_dtype=torch.float16,
device_map={"": "cpu"},
)
first_weight = base_model.model.layers[0].self_attn.q_proj.weight
first_weight_old = first_weight.clone()
lora_model = PeftModel.from_pretrained(
base_model,
# TODO
# "tloen/alpaca-lora-7b",
LORA_MODEL,
device_map={"": "cpu"},
torch_dtype=torch.float16,
)
...
# TODO
LlamaForCausalLM.save_pretrained(
    base_model, HF_CHECKPOINT, state_dict=deloreanized_sd, max_shard_size="400MB"
)
Run the command:
BASE_MODEL=/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b \
LORA_MODEL=/home/guodong.li/output/lora-alpaca \
HF_CHECKPOINT=/home/guodong.li/output/hf_ckpt \
python export_hf_checkpoint.py
Run log:
BASE_MODEL=/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b \
> LORA_MODEL=/home/guodong.li/output/lora-alpaca \
> HF_CHECKPOINT=/home/guodong.li/output/hf_ckpt \
> python export_hf_checkpoint.py
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/rh/devtoolset-7/root/usr/lib/dyninst'), PosixPath('/opt/rh/devtoolset-9/root/usr/lib/dyninst')}
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:05<00:00, 5.99it/s]
View model output files:
> tree /home/guodong.li/output/hf_ckpt
/home/guodong.li/output/hf_ckpt
├── config.json
├── generation_config.json
├── pytorch_model-00001-of-00039.bin
├── pytorch_model-00002-of-00039.bin
...
├── pytorch_model-00038-of-00039.bin
├── pytorch_model-00039-of-00039.bin
└── pytorch_model.bin.index.json
0 directories, 42 files
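As an aside, later PEFT releases expose a one-call merge that produces the same kind of merged checkpoint. The sketch below uses merge_and_unload, which may not exist in the older e536616 commit installed earlier, so treat it as an alternative for newer environments rather than a drop-in replacement for the script above:

```python
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

base_model = LlamaForCausalLM.from_pretrained(
    "/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b",
    torch_dtype=torch.float16,
    device_map={"": "cpu"},
)
lora_model = PeftModel.from_pretrained(
    base_model,
    "/home/guodong.li/output/lora-alpaca",
    device_map={"": "cpu"},
    torch_dtype=torch.float16,
)

# fold the B·A deltas into the base weights and drop the adapter wrappers
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("/home/guodong.li/output/hf_ckpt", max_shard_size="400MB")
```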
Export as PyTorch state_dicts:
Modify the export_state_dict_checkpoint.py file:
import json
import os
import torch
import transformers
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer # noqa: E402
BASE_MODEL = os.environ.get("BASE_MODEL", None)
LORA_MODEL = os.environ.get("LORA_MODEL", "tloen/alpaca-lora-7b")
PTH_CHECKPOINT_PREFIX = os.environ.get("PTH_CHECKPOINT_PREFIX", "./ckpt")
assert (
BASE_MODEL
), "Please specify a value for BASE_MODEL environment variable, e.g. `export BASE_MODEL=decapoda-research/llama-7b-hf`" # noqa: E501
tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
base_model = LlamaForCausalLM.from_pretrained(
BASE_MODEL,
load_in_8bit=False,
torch_dtype=torch.float16,
device_map={"": "cpu"},
)
lora_model = PeftModel.from_pretrained(
base_model,
# todo
#"tloen/alpaca-lora-7b",
LORA_MODEL,
device_map={"": "cpu"},
torch_dtype=torch.float16,
)
...
os.makedirs(PTH_CHECKPOINT_PREFIX, exist_ok=True)
torch.save(new_state_dict, PTH_CHECKPOINT_PREFIX+"/consolidated.00.pth")
with open(PTH_CHECKPOINT_PREFIX+"/params.json", "w") as f:
json.dump(params, f)
Run the command:
BASE_MODEL=/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b \
LORA_MODEL=/home/guodong.li/output/lora-alpaca \
PTH_CHECKPOINT_PREFIX=/home/guodong.li/output/ckpt \
python export_state_dict_checkpoint.py
View model output files:
tree /home/guodong.li/output/ckpt
/home/guodong.li/output/ckpt
├── consolidated.00.pth
└── params.json
Of course, you can also encapsulate it as a Docker image to isolate the training and inference environments.
Package as a Docker image and run inference
- Build the Docker image:
docker build -t alpaca-lora .
- Run a Docker container for inference (you can also train with finetune.py and all the hyperparameters provided above):
docker run --gpus=all --shm-size 64g -p 7860:7860 -v ${HOME}/.cache:/root/.cache --rm alpaca-lora generate.py \
--load_8bit \
--base_model 'decapoda-research/llama-7b-hf' \
--lora_weights 'tloen/alpaca-lora-7b'
- Open the browser and enter the URL https://localhost:7860 to test.
Epilogue
As shown above, on an 8-card A800 server, parameter-efficient fine-tuning with Alpaca-Lora on the alpaca_data_cleaned.json instruction data completes in about 25 minutes, which is significantly faster than training Stanford Alpaca.
Reference documents :
- LLaMA
- Stanford Alpaca
- Alpaca-LoRA