Running Vicuna-7B requires more than 30 GB of RAM, or 14 GB of video memory.
Running Vicuna-13B requires more than 60 GB of RAM, or 28 GB of video memory.
If your hardware does not meet these requirements, this guide may not work for you. My laptop has 64 GB of RAM, so I tried running both models, using Python 3.9.
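As a rough sanity check on these numbers: memory use scales with parameter count times bytes per parameter (fp16 uses 2 bytes per weight, fp32 uses 4). A minimal sketch, ignoring activations and runtime overhead:

```python
def model_memory_gb(n_params_billion, bytes_per_param):
    """Lower bound on memory: parameter count times bytes per parameter."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

# 7B in fp16 on a GPU needs roughly 13 GB of VRAM;
# 13B in fp32 on the CPU needs roughly 48 GB of RAM.
print(round(model_memory_gb(7, 2)), round(model_memory_gb(13, 4)))
```

The real figures quoted above are higher because inference also needs memory for activations, the KV cache, and the runtime itself.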
Download the original LLaMA model
nyanko7/LLaMA-7B: https://huggingface.co/nyanko7/LLaMA-7B/tree/main
huggyllama/llama-13b: https://huggingface.co/huggyllama/llama-13b/tree/main
You can also download via the magnet link below (e.g. with Thunder or another BitTorrent client); note that only the 7B and 13B folders are needed.
Magnetic link : magnet:?xt=urn:btih:b8287ebfa04f879b048d4d4404108cf3e8014352&dn=LLaMA
The downloaded files are as follows:
Download vicuna-7b-delta-v1.1 and vicuna-13b-delta-v1.1
https://huggingface.co/lmsys/vicuna-7b-delta-v1.1/tree/main
https://huggingface.co/lmsys/vicuna-13b-delta-v1.1/tree/main
Install related software
pip install fschat
pip install protobuf==3.20.0
git clone https://github.com/huggingface/transformers.git
cd transformers
python setup.py install
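To confirm the installs succeeded, you can probe for the modules with the standard library (note the import name of the fschat package is `fastchat`; the helper below is just an illustrative check, not part of FastChat):

```python
import importlib.util

def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Parent package of a dotted name is not installed at all
            found = False
        if not found:
            missing.append(name)
    return missing

# After the pip/setup.py steps above, this should print an empty list:
print(missing_modules(["fastchat", "transformers", "google.protobuf"]))
```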
Convert the LLaMA model to Hugging Face format
7b
python transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir LLaMA/ --model_size 7B --output_dir ./output/llama-7b
13b
python transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir LLaMA/ --model_size 13B --output_dir ./output/llama-13b
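The `--input_dir` should contain the per-size folders from the original release; each size ships a fixed number of `consolidated.XX.pth` weight shards (7B has one, 13B has two). A small helper to list the shard files the converter expects (shard counts are from the public LLaMA release):

```python
# Shard counts per model size in the original LLaMA release
SHARDS = {"7B": 1, "13B": 2, "30B": 4, "65B": 8}

def expected_checkpoint_files(model_size):
    """List the consolidated.XX.pth shard names for a given model size."""
    return [f"consolidated.{i:02d}.pth" for i in range(SHARDS[model_size])]

print(expected_checkpoint_files("13B"))  # ['consolidated.00.pth', 'consolidated.01.pth']
```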
Merge to generate the Vicuna model. For 13B, 64 GB of RAM is not enough on its own, so set the virtual memory (page file) to roughly 16-64 GB.
python -m fastchat.model.apply_delta --base ./output/llama-7b --target ./vicuna-7b --delta ./vicuna-7b-delta-v1.1
python -m fastchat.model.apply_delta --base ./output/llama-13b --target ./vicuna-13b --delta ./vicuna-13b-delta-v1.1
Parameter introduction:

| Parameter | Description |
|---|---|
| base | Path to the converted LLaMA model |
| target | Save path for the merged model |
| delta | Path to the downloaded vicuna-*-delta-v1.1 |
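Conceptually, applying a v1.1 delta is an element-wise addition of the delta weights to the base LLaMA weights. A toy sketch, with plain lists standing in for the model's tensors:

```python
def apply_delta(base, delta):
    """Element-wise sum of base and delta parameters (toy version of the merge)."""
    return [b + d for b, d in zip(base, delta)]

merged = apply_delta([0.1, -0.2, 0.5], [0.05, 0.3, -0.1])
print(len(merged))  # the merged model has the same number of parameters as the base
```

This is why the merge needs to hold both the base weights and the delta in memory at the same time, which is what exhausts RAM for 13B.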
Run the model on the CPU (weights load in 16-bit half precision):
python -m fastchat.serve.cli --model-path ./vicuna-7b --device cpu
python -m fastchat.serve.cli --model-path ./vicuna-13b --device cpu
The 7B model occupies about 26 GB of RAM; on 64 GB with an i9-12900H it runs and responds well.
The 13B model occupies about 50 GB of RAM; on the same machine it runs slowly.
The quantized version compresses the floating-point weights down to 8-bit integers: it runs faster and uses less memory, at some cost in output quality.
python -m fastchat.serve.cli --model-path ./vicuna-7b --device cpu --load-8bit
python -m fastchat.serve.cli --model-path ./vicuna-13b --device cpu --load-8bit
The 7B model then occupies about 7 GB of RAM, and the 13B about 13 GB.
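The core idea behind the 8-bit path can be sketched as absmax quantization: scale the weights so the largest magnitude maps to 127, then round to integers. This is a simplification; bitsandbytes, which FastChat uses for `--load-8bit`, quantizes per block with outlier handling.

```python
def quantize_absmax(weights):
    """Map floats to int8 values by scaling the largest magnitude to 127."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(qs, scale):
    """Recover approximate floats; the rounding error is the quality loss."""
    return [q * scale for q in qs]

qs, scale = quantize_absmax([0.4, -1.27, 0.01])
print(qs)  # [40, -127, 1]
```

Each weight now needs one byte instead of two, which is why the memory footprint roughly halves versus fp16.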
Summary: the smaller models can run on CPU, but if you want to fine-tune them yourself you still need a GPU; an A100 or A800 is recommended. If you don't want to invest in hardware up front, you can rent one first. Matpool (Chiyun) is a GPU cloud provider focused on AI, offering cloud servers, training environments, high-speed storage, and related services: https://matpool.com/
However, the CPU is very slow on the machine above, so I wanted to try the GPU version. I only have a 1080 Ti with 11 GB of video memory, which can barely run the 7B model with 8-bit quantization. How to deploy it?
Start Locally | PyTorch: https://pytorch.org/get-started/locally/
My machine's previous NVIDIA GPU Computing Toolkit was v10.0, but PyTorch 2.0.1 requires CUDA 11.8, so update it from the CUDA Toolkit Archive: https://developer.nvidia.com/cuda-toolkit-archive
You need to register an account to download; choose version 11.8.
After downloading and installing it, the environment variables are updated automatically.
Then download cuDNN as well, in the version matching this CUDA release:
cuDNN Archive | NVIDIA Developer: https://developer.nvidia.com/rdp/cudnn-archive
Pick the cuDNN build matching CUDA 11.x (the one in the red box in the screenshot above). After the download completes, unzip it,
then copy everything in the extracted directory to
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8
The directory layouts overlap, so there is no need to configure the environment variables again.
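The copy step can also be scripted. A sketch that mirrors the unzipped cuDNN `bin`/`include`/`lib` folders into the CUDA directory (paths in the comment are illustrative):

```python
import shutil
from pathlib import Path

def merge_tree(src, dst):
    """Copy every file under src into dst, keeping relative subpaths,
    so the cuDNN bin/include/lib contents land inside the CUDA folders."""
    for f in Path(src).rglob("*"):
        if f.is_file():
            target = Path(dst) / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)

# e.g. merge_tree(r"C:\Downloads\cudnn-unzipped",
#                 r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8")
```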
After installing these two, install the GPU build of PyTorch from the page linked above:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Then write a small script, testgpu.py, to check whether the GPU is usable:
import torch

print(torch.cuda.is_available())
if torch.cuda.is_available():
    # Only query the device name when a CUDA device is actually present
    print(torch.cuda.get_device_name(0))
If the first line prints True, the GPU can be used.
Run on the GPU. Because my 1080 Ti has only 11 GB of video memory, --load-8bit is required:
python -m fastchat.serve.cli --model-path ./vicuna-7b --load-8bit
With more than 12 GB of video memory, you can run without quantization:
python -m fastchat.serve.cli --model-path ./vicuna-7b
Model inference (web UI mode)
If you want to serve the model through the web UI, you need three components:
- web server: the user interface
- model worker: hosts the model
- controller: coordinates the web server and model workers
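The controller's role can be sketched as a simple registry: workers register an address under a model name, and the web server asks the controller where to route each request. This is a toy model; the real controller speaks HTTP and also handles load balancing and worker heartbeats, and the port below is only illustrative.

```python
class Controller:
    """Toy registry mapping model names to worker addresses."""
    def __init__(self):
        self.workers = {}

    def register_worker(self, model_name, address):
        self.workers[model_name] = address

    def get_worker_address(self, model_name):
        return self.workers.get(model_name)

c = Controller()
c.register_worker("vicuna-7b", "http://localhost:21002")
print(c.get_worker_address("vicuna-7b"))  # http://localhost:21002
```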
Start the controller:
python -m fastchat.serve.controller --host 0.0.0.0
Start the model worker, using the CPU:
python -m fastchat.serve.model_worker --model-path ./vicuna-7b --model-name vicuna-7b --host 0.0.0.0 --device cpu --load-8bit
Or using the GPU:
python -m fastchat.serve.model_worker --model-path ./vicuna-7b --model-name vicuna-7b --host 0.0.0.0 --load-8bit
Wait until the process finishes loading the model, and you'll see "Uvicorn is running...". The model worker will register itself with the controller.
To make sure your model worker is properly connected to the controller, send a test message with:
python -m fastchat.serve.test_message --model-name vicuna-7b
The output is as follows:
Then start the web server:
python -m fastchat.serve.gradio_web_server --port 8809
Then open the following address in a browser:
http://localhost:8809
It's ready to use! With the GPU, it appears to support multiple users accessing it in parallel: the two requests in the figure below were sent at the same time.