Deploying the Vicuna-7B and Vicuna-13B models on Windows 10/11, running on CPU or GPU

Running Vicuna-7B requires more than 30 GB of RAM, or 14 GB of video memory.
Running Vicuna-13B requires more than 60 GB of RAM, or 28 GB of video memory.

If you don't have hardware at this level, you can skip this article. My laptop has 64 GB of RAM, so I tried both models. I used Python 3.9.

Download the original LLaMA model

nyanko7/LLaMA-7B: https://huggingface.co/nyanko7/LLaMA-7B/tree/main

huggyllama/llama-13b: https://huggingface.co/huggyllama/llama-13b/tree/main

You can also download the weights with Thunder via the magnet link below; only the 7B and 13B folders are needed.

Magnet link: magnet:?xt=urn:btih:b8287ebfa04f879b048d4d4404108cf3e8014352&dn=LLaMA

The downloaded files are as follows:
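(The listing below is the typical layout of the original LLaMA release; exact contents may vary slightly depending on the mirror. Only the 7B and 13B folders are needed here.)

LLaMA/
├── tokenizer.model
├── tokenizer_checklist.chk
├── 7B/
│   ├── consolidated.00.pth
│   ├── params.json
│   └── checklist.chk
└── 13B/
    ├── consolidated.00.pth
    ├── consolidated.01.pth
    ├── params.json
    └── checklist.chk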

Download vicuna-7b-delta-v1.1 and vicuna-13b-delta-v1.1 

lmsys/vicuna-7b-delta-v1.1: https://huggingface.co/lmsys/vicuna-7b-delta-v1.1/tree/main

lmsys/vicuna-13b-delta-v1.1: https://huggingface.co/lmsys/vicuna-13b-delta-v1.1/tree/main

Install the required software

pip install fschat
pip install protobuf==3.20.0
git clone https://github.com/huggingface/transformers.git
cd transformers
python setup.py install
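
To confirm the installation, a quick version check like the one below can help. This is only a minimal sketch; the file name checkenv.py is arbitrary, and it assumes nothing beyond the packages above being installed into the current Python environment.

# checkenv.py - print the versions of the packages this guide depends on
from importlib.metadata import version, PackageNotFoundError

for pkg in ("fschat", "transformers", "protobuf", "torch"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "is not installed")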

Convert the LLaMA model to Hugging Face format

7b

python transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py  --input_dir LLaMA/  --model_size 7B  --output_dir ./output/llama-7b

13b

python transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py  --input_dir LLaMA/  --model_size 13B  --output_dir ./output/llama-13b
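
To verify that the conversion produced a usable Hugging Face checkpoint without loading the full weights, you can read back just the config and tokenizer. A minimal sketch, assuming the output directories above and a transformers build with LLaMA support (sentencepiece is required for the tokenizer):

# checkconvert.py - loads only config and tokenizer, so no multi-GB weights are read
from transformers import AutoConfig, AutoTokenizer

for path in ("./output/llama-7b", "./output/llama-13b"):
    config = AutoConfig.from_pretrained(path)
    tokenizer = AutoTokenizer.from_pretrained(path, use_fast=False)
    print(path, "model type:", config.model_type, "vocab size:", tokenizer.vocab_size)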

 

Merge the deltas to generate the Vicuna models. For the 13B model, 64 GB of RAM is not enough on its own, so set the Windows virtual memory (page file) to roughly 16-64 GB before merging.

python -m fastchat.model.apply_delta --base ./output/llama-7b --target ./vicuna-7b --delta ./vicuna-7b-delta-v1.1

python -m fastchat.model.apply_delta --base ./output/llama-13b --target ./vicuna-13b --delta ./vicuna-13b-delta-v1.1

Parameter introduction:

--base: the path of the converted LLaMA model
--target: the path where the merged model will be saved
--delta: the path of the downloaded vicuna-7b-delta-v1.1 (or vicuna-13b-delta-v1.1)
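
Conceptually, applying the delta just adds the delta weights to the base LLaMA weights, tensor by tensor, and saves the result as the target. The snippet below is only a rough sketch of that idea for the 7B model, not FastChat's actual implementation; it assumes both checkpoints fit in RAM and share the same parameter names and shapes.

# merge_sketch.py - illustrative only; use fastchat.model.apply_delta for real merges
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("./output/llama-7b", torch_dtype=torch.float16)
delta = AutoModelForCausalLM.from_pretrained("./vicuna-7b-delta-v1.1", torch_dtype=torch.float16)

base_sd = base.state_dict()
delta_sd = delta.state_dict()
for name in base_sd:                 # target weight = base weight + delta weight
    base_sd[name] += delta_sd[name]

base.save_pretrained("./vicuna-7b")  # save the merged weights
AutoTokenizer.from_pretrained("./vicuna-7b-delta-v1.1", use_fast=False).save_pretrained("./vicuna-7b")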

 

Run the model; the weights are loaded in 16-bit half precision (FP16) by default:

python -m fastchat.serve.cli --model-path ./vicuna-7b --device cpu

python -m fastchat.serve.cli --model-path ./vicuna-13b --device cpu

The 7B model occupies about 26 GB of RAM; on 64 GB of RAM with an i9-12900H it runs and responds reasonably well.

The 13B model occupies about 50 GB of RAM; on the same 64 GB / i9-12900H machine it runs slowly.

Using the quantized version compresses the floating-point weights down to 8-bit integers: it runs faster and uses less memory, but the model's "IQ" drops a little. A rough numerical sketch of the idea follows the memory figures below.

python -m fastchat.serve.cli --model-path ./vicuna-7b --device cpu --load-8bit

python -m fastchat.serve.cli --model-path ./vicuna-13b --device cpu --load-8bit

With 8-bit quantization, the 7B model occupies about 7 GB of memory and the 13B model about 13 GB.
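
To see why 8-bit quantization cuts memory so much, here is a rough numerical sketch of symmetric int8 quantization of a single weight tensor. It only illustrates the idea; it is not what FastChat/bitsandbytes actually do internally.

# quant_sketch.py - symmetric int8 quantization of one fake weight matrix
import torch

w = torch.randn(4096, 4096)                                     # FP32 weights, ~64 MB
scale = w.abs().max() / 127                                     # one scale for the whole tensor
q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)  # int8 weights, ~16 MB

w_restored = q.float() * scale                                  # dequantize when computing
print("max absolute error:", (w - w_restored).abs().max().item())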

Summary: although the smaller model can run on a CPU, if you want to fine-tune it yourself you still need a GPU; an A100 or A800 card is recommended. If you don't want to invest in hardware up front, you can rent one first. Chiyun (Matpool) is a GPU cloud provider focused on artificial intelligence, offering AI cloud servers, AI teaching and training environments, high-speed network storage, and public cloud, private cloud, dedicated cloud, and direct hardware procurement options. https://matpool.com/

However, the CPU runs very slowly even on the machine above, so I wanted to try the GPU version. I only have a 1080 Ti with 11 GB of video memory, so I can barely run the 7B model with 8-bit quantization. How do you deploy it?

Start Locally | PyTorch: https://pytorch.org/get-started/locally/

The NVIDIA GPU Computing Toolkit previously installed on this machine was v10.0, but PyTorch 2.0.1 needs CUDA 11.8, so the toolkit has to be updated: CUDA Toolkit Archive | NVIDIA Developer https://developer.nvidia.com/cuda-toolkit-archive

You need to register an NVIDIA account to download it; choose version 11.8.

After downloading and installing it, the environment variables are updated accordingly.

Then you also need to download cuDNN, again in the version that matches CUDA 11.x.

cuDNN Archive | NVIDIA Developer: https://developer.nvidia.com/rdp/cudnn-archive

Pick the cuDNN release built for CUDA 11.x. After the download completes, unzip it.

Then copy everything from the unzipped directory into

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8

The directory structures overlap, so the files merge into the existing CUDA folders and there is no need to configure the environment variables again.
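
To double-check that the 11.8 toolkit is the one on the PATH (assuming it was installed with the default options), run:

nvcc --version

The release line of the output should mention 11.8.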

After installing these two packages, go back to the PyTorch page linked above and install the GPU build of PyTorch:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Then write a small script, testgpu.py, to check whether the GPU is usable:

import torch

print(torch.cuda.is_available())      # True if the CUDA build of PyTorch can see a GPU
print(torch.cuda.get_device_name(0))  # name of the first GPU, e.g. the 1080 Ti

If the first line prints True, the GPU can be used.

Run with the GPU. Because my 1080 Ti only has 11 GB of video memory, only --load-8bit can be used:

python -m fastchat.serve.cli --model-path ./vicuna-7b --load-8bit

If you have more than 12 GB of video memory, you can run without quantization:

python -m fastchat.serve.cli --model-path ./vicuna-7b 
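
If you would rather drive the merged model from your own Python script instead of fastchat.serve.cli, something along these lines works as a smoke test. A minimal sketch, assuming the CUDA build of PyTorch from above, enough VRAM for FP16, and the ./vicuna-7b directory produced earlier; note that it ignores Vicuna's conversation template, so the answers will be rougher than through FastChat.

# rungpu_sketch.py - minimal FP16 generation with the merged weights
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./vicuna-7b", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("./vicuna-7b", torch_dtype=torch.float16).cuda()

prompt = "Hello, who are you?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))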

 

Model inference (Web UI mode)
To serve the model through a web UI, you need to set up three components:

web server: the user interface
model worker: hosts the model
controller: coordinates the web server and the model workers

First, start the controller:

python -m fastchat.serve.controller --host 0.0.0.0

Start the model worker. To use the CPU:

python -m fastchat.serve.model_worker  --model-path ./vicuna-7b --model-name vicuna-7b --host 0.0.0.0 --device cpu --load-8bit

To use the GPU:

python -m fastchat.serve.model_worker --model-path ./vicuna-7b --model-name vicuna-7b --host 0.0.0.0  --load-8bit 

Wait until the process finishes loading the model, and you'll see "Uvicorn is running...". The model worker will register itself with the controller.

To make sure your model worker is properly connected to the controller, send a test message with:

python -m fastchat.serve.test_message --model-name vicuna-7b

If the worker is connected correctly, the test message prints a short reply from the model.

Then start the Gradio web server:

python -m fastchat.serve.gradio_web_server --port 8809

Then open the following address in a browser:

http://localhost:8809

It's ready to use! With the GPU it even seems to handle multiple users in parallel: the two requests I sent at the same time were both answered.

 


Original post: blog.csdn.net/babytiger/article/details/130691846