GPT Large Language Model Vicuna Local Deployment Practice (the results blow Alpaca away) | JD Cloud Technical Team

Background

The previous article, "GPT Large Language Model Alpaca-lora Localization Deployment Practice", introduced the local deployment of Stanford's Alpaca-lora model and verified its actual inference quality.

The overall impression was not particularly good. The original Alpaca-lora model does not support Chinese well, and even after fine-tuning it with a 52k Chinese instruction set, the results still fell short of the GPT-3.5 inference quality people describe online. It bears out the old saying: can you really judge whether something is any good when you have neither seen nor heard it for yourself?

On a server with 3 Tesla P40 graphics cards, loading the model parameters and computation across all 3 GPUs, a simple inference (nothing mathematical or logical) took about 30 s to 1 min, which is painfully slow. Deployed on a JD Cloud GPU cloud host, inference was considerably faster, but even though the model had been fine-tuned with Chinese datasets, its Chinese support was still poor: garbled characters, repetition, and incoherent sentences were common.

Recently, large models have been springing up like mushrooms after rain; major companies and research institutes have all launched their own, most of them based on LLaMA (open source and easy to use). So I decided to look at other models and find one with good inference quality, good Chinese support, and high inference efficiency.

After some screening, Vicuna-13B stood out: its inference quality is said to reach more than 90% of ChatGPT's capability, better than both LLaMA-13B and Alpaca-13B (as shown in the figure below). The evaluation method is to feed the same questions to each model (Alpaca, LLaMA, ChatGPT, and Bard), then have GPT-4 act as the judge and score the answers, with ChatGPT's answer counted as 100 points; the closer an answer comes to it, the higher the score. The method is not exactly scientific, but at the moment there is no better way to evaluate model inference results more rigorously.

At the same time, Vicuna's training cost is very low, reportedly only about $300, so let's deploy Vicuna-7B locally and see how it performs. No sooner said than done.

Environment preparation

Since I had already deployed the Alpaca-lora model locally, I assumed I could simply download the open-source package, do a quick deployment, and look at the results. It turned out I was still "too young, too simple": sorting out the environment and resolving package conflicts was actually more laborious than deploying Alpaca-lora the first time around.

Here is a brief recap of the deployment process. For details, please refer to the previous article "GPT Large Language Model Alpaca-lora Localization Deployment Practice".

  1. Local deployment or GPU cloud host deployment: the GPU server has 4 independent GPUs (Tesla P40); the computing power of a single P40 is roughly equivalent to 60 CPU cores at the same clock speed. For a GPU cloud host, purchase a P40 instance: https://www.jdcloud.com/cn/calculator/calHost
  2. Install the graphics card driver and the CUDA driver (a quick check is shown below)
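Once the driver is installed, a quick way to confirm that all the cards are visible is the standard NVIDIA utility that ships with the driver (nothing specific to this deployment):

# Confirm the driver is working and all 4 P40s are visible
nvidia-smi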

Model preparation

Since Vicuna is based on the LLaMA model, only delta weights are released in order to comply with the LLaMA model license, so we need to combine the original llama-7b weights with the delta weights to obtain the Vicuna weights.
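A note in case git-lfs is not yet set up on the machine: the clone commands below rely on it, so something along the following lines may be needed first (the package source varies by distribution, so treat this as a sketch):

# Install and initialize git-lfs if it is not already available
yum install -y git-lfs     # on CentOS 7 this may require an extra repository
git lfs install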

First, download the llama-7b model. Since the files are large, use git lfs to clone them directly from the file server; the total size is about 26 GB. Execute:

git lfs clone https://huggingface.co/decapoda-research/llama-7b-hf

Then download the delta model and execute:

git lfs clone https://huggingface.co/lmsys/vicuna-7b-delta-v1.1

After the download is complete, merge the weights and execute:

python -m fastchat.model.apply_delta \
    --base ./model/llama-7b-hf \
    --delta ./model/vicuna-7b-delta-v1.1 \
    --target ./model/vicuna-7b-all-v1.1

The merging process is quick; after merging, the model weights take up about 13 GB.

The merged directory contains the model's configuration files and weight files.
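As an optional sanity check, a couple of plain shell commands will confirm that the merged files are in place and the size matches:

# List the merged model directory and check its total size
ls -lh ./model/vicuna-7b-all-v1.1
du -sh ./model/vicuna-7b-all-v1.1    # should be roughly 13G, matching the note above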

Install dependencies

Vicuna mainly relies on 3 packages: fschat, tensorboardX, and flash-attn. The first two install smoothly; a plain pip install of fschat and tensorboardX is enough. flash-attn, however, ran into a problem and reported a build error.
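For reference, the install commands are simply the following (versions are whatever pip resolves at the time):

pip install fschat
pip install tensorboardX
pip install flash-attn    # this is the one that initially failed to build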

After some searching, the cause turned out to be that the gcc version was too low and needed to be upgraded. Checking the local version with gcc -v and g++ -v showed 4.8.5, which is indeed too old. I decided to go straight to the latest release, 13.1; you can pick the version you want at http://ftp.gnu.org/gnu/gcc/, in this case gcc-13.1.0.tar.gz.

Execute:

tar -xzf gcc-13.1.0.tar.gz

cd gcc-13.1.0

./contrib/download_prerequisites

mkdir build

cd build/

../configure --enable-checking=release --enable-languages=c,c++ --disable-multilib

Then run make to compile. Note that this can take a very long time, possibly several hours; you can use make -j 8 to let make run up to 8 compile jobs in parallel and speed things up.

After it completes successfully, run make install to install.

Then check gcc -v and g++ -v again to verify that the version has been updated; if they report the new version, the installation is complete.

Next, uninstall the original gcc and g++: switch to root and execute yum -y remove gcc g++.

To configure the new version to be globally available, execute ln -s /usr/local/bin/gcc /usr/bin/gcc.
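Since flash-attn also compiles C++ code, you will most likely want the freshly built g++ on the default path as well; assuming the same default /usr/local install prefix, the counterpart link would be:

# Link the new g++ the same way (assumes gcc 13.1 was installed under /usr/local)
ln -s /usr/local/bin/g++ /usr/bin/g++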

To update the link library, execute:

View the original link library: strings /usr/lib64/libstdc++.so.6 | grep CXXABI

Delete the original link library: rm -f /usr/lib64/libstdc++.so.6

Establish a soft link: ln -s /usr/local/lib64/libstdc++.so.6.0.29 /usr/lib64/libstdc++.so.6

Check the new link library: strings /usr/lib64/libstdc++.so.6 | grep CXXABI

If the output now includes the newer CXXABI versions, congratulations, the upgrade has succeeded.

Install CUDA

Since CUDA had previously been installed from rpm packages, some files were missing and all kinds of strange errors showed up at runtime (only those who have been through it will understand), so I won't go into the details here and will instead walk through installing CUDA from the binary (runfile) installer.

Download link:
https://developer.nvidia.com/cuda-11-7-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=CentOS&target_version=7&target_type=runfile_local

Note that runfile(local) should be selected here.

Then execute: sh cuda_11.7.0_515.43.04_linux.run

After installation completes, you need to configure environment variables: add the following two entries to your local .bash_profile.
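The two entries are typically the CUDA binary and library paths; assuming the default install location for CUDA 11.7, they would look like this (adjust if you installed elsewhere):

# Append to ~/.bash_profile, then run: source ~/.bash_profile
export PATH=/usr/local/cuda-11.7/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64:$LD_LIBRARY_PATH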

Next, verify the installation by executing nvcc -V; if it prints the CUDA version information, congratulations, the installation succeeded.

Install cuDNN and NCCL

To install cuDNN and NCCL, you first need to register an NVIDIA account. After registering, download the corresponding rpm packages from the two addresses below, then install each with rpm -ivh XXXXX.rpm.

cuDNN download address: https://developer.nvidia.com/cudnn

NCCL download address: https://developer.nvidia.com/nccl/nccl-legacy-downloads

After installation completes, both rpm packages should show as successfully installed.
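One simple way to double-check is a standard rpm query (nothing NVIDIA-specific):

# Verify the cuDNN and NCCL packages are registered with rpm
rpm -qa | grep -i -E 'cudnn|nccl'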

Model inference

Now for the exciting part: let's test how well the model actually works. First, wipe off the sweat that hasn't dried yet; all this effort was so that we could finally talk to the program, and ideally feel that it isn't a robot at all.

Execute the following command in the terminal, and then enter the question.

python -m fastchat.serve.cli --model-path ./model/vicuna-7b-all-v1.1 --style rich

Of course, you can also set different runtime parameters for different scenarios, as follows:

# Load the model in compressed (8-bit) form; predictions are slightly worse, suitable when GPU memory is tight
python -m fastchat.serve.cli --model-path ./model/vicuna-7b-all-v1.1 --load-8bit --style rich

# Use the CPU for inference; very slow, use with caution
python -m fastchat.serve.cli --model-path ./model/vicuna-7b-all-v1.1 --device cpu --style rich

# Use multiple GPUs for inference
python -m fastchat.serve.cli --model-path ./model/vicuna-7b-all-v1.1 --num-gpus 3 --style rich
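Beyond the terminal CLI, FastChat also ships a simple web UI served by three cooperating processes (a controller, a model worker, and a Gradio web server). The sketch below reuses the same model path as above; the exact module names and flags may differ across FastChat versions, so check your installed version:

# Optional: serve a web UI instead of the terminal CLI (run each command in its own terminal)
python -m fastchat.serve.controller
python -m fastchat.serve.model_worker --model-path ./model/vicuna-7b-all-v1.1
python -m fastchat.serve.gradio_web_server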

1) Recommended recipe test:

2) Multilingual test:

3) Code ability test:

4) Mathematical calculation test:

5) General dialogue test:

As for GPU server resource usage during inference: with a single GPU, responses arrive within seconds. GPU memory sits at about 13 GB once the model is loaded and stays under 15 GB during inference, while the utilization of the single GPU basically reaches 90% or even 100% while answering, as shown in the figure below.
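If you want to observe this yourself while a question is being answered, a simple way is to keep nvidia-smi refreshing in another terminal:

# Refresh GPU memory usage and utilization once per second
watch -n 1 nvidia-smi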

Conclusions:

1) Precise reasoning is not ideal. With recipe recommendations, for example, it produces confident nonsense, and it would be hard to cook a decent meal from its output;

2) Multilingual support is genuinely unexpected: even Japanese and Spanish are handled with ease, which is quite impressive;

3) Coding ability is decent. It can roughly meet basic requirements; if you want code that compiles and runs directly you may still need to adjust it by hand, but as an assistive tool it is perfectly usable;

4) Numerical computation is still relatively weak; at present it cannot even get simple multiplication right;

5) Ordinary conversation is no problem at all, and its understanding of Chinese meets expectations; it is more than enough for killing time and easing loneliness.

Considering the model has not been fine-tuned at all, the current inference quality is already very good, and inference efficiency is solid as well: even with a single GPU, responses come within seconds, and GPU memory usage during inference is only a little over 60%, not much higher than the roughly 50% at idle. In short, without any fine-tuning, the model's inference quality and efficiency deserve a 7-8 out of 10, and with enough corpus and some time spent fine-tuning, even better results can be expected.

Model fine-tuning

To adapt the model to a specific domain, domain knowledge is essential, and that means fine-tuning on top of the original model. So let's give fine-tuning a try and see how it goes.

Fine-tuning is started with the following command in the terminal:

torchrun --nproc_per_node=3 --master_port=40001 ./FastChat/fastchat/train/train_mem.py \
    --model_name_or_path ./model/llama-7b-hf  \
    --data_path dummy.json \
    --bf16 False \
    --output_dir ./model/vicuna-dummy \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 300 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True


The ./model/vicuna-dummy directory given as the output is where the fine-tuned model weights end up.
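For reference, the file passed as --data_path (dummy.json above) is expected to hold conversations in the ShareGPT-style format FastChat uses for training; a minimal hand-written example, assuming that schema, could be created like this:

cat > dummy.json <<'EOF'
[
  {
    "id": "identity_0",
    "conversations": [
      {"from": "human", "value": "Who are you?"},
      {"from": "gpt", "value": "I am Vicuna, a language model fine-tuned from LLaMA."}
    ]
  }
]
EOF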

Unfortunately, the fine-tuning attempt in this article did not succeed and reported an error.

The reason is simple: the GPU we use is a Tesla P40, which is built on the SM_62 architecture, while model fine-tuning currently requires at least the SM_75 architecture. The community reports successful fine-tuning on 4090, A100, or A800 cards, so fine-tuning can only be done on cards with a newer architecture.
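If you are unsure what architecture your own card reports, a quick check is available from PyTorch (which is already installed for this deployment):

# Print the GPU's compute capability as a (major, minor) tuple; fine-tuning here reportedly needs (7, 5) or newer
python -c "import torch; print(torch.cuda.get_device_capability(0))"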

Follow-up

Overall, in terms of both output quality and inference efficiency, Vicuna simply blows Alpaca away. We used Vicuna-7B for the tests in this article; Vicuna-13B would do even better, and its support for multiple natural languages (including Chinese) is also far better than Alpaca's. As the community says, the current Vicuna model can be considered the ceiling among open-source large models, and it is the best choice if you want to do secondary development on top of an open-source model.

The local deployment work for the large model has now come to an end; follow-up work may include the following:

1) With a better graphics card, fine-tune Vicuna to verify whether the model can learn domain-specific knowledge; the plan is to use the trial resources provided by the company [JD Cloud GPU cloud host p.n3a100 series]. This product provides an NVIDIA® A100 GPU (80 GB of video memory) paired with an Intel® Xeon® Platinum 8338C processor and DDR4 memory, supports NVLink, and reaches a peak single-precision floating-point performance of 156 TFLOPS, arguably the strongest compute available;

2) Find a suitable scenario that fits our current applications and put the large language model to practical use;

3) Carry out secondary development based on the Vicuna open-source project and package it as usable services;

4) More exploration and learning based on large language models.

Source: JD Cloud Developer Community

Author: Beyond_luo (do not reprint without authorization)
