[AI in Practice] vLLM: A Deployment and Inference Framework for Large Language Models (LLMs)
Introduction to vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels
vLLM is flexible and easy to use:
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
vLLM seamlessly supports many HuggingFace models, including:
- BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
- GPT-2 (gpt2, gpt2-xl, etc.)
- GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
- GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
- GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
- LLaMA (lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
- MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
- OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
Environment configuration
Environment requirements:
- OS: Linux
- Python: 3.8 or higher
- CUDA: 11.0 – 11.8
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX 20xx, A100, L4, etc.)
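A quick way to confirm the Python and CUDA versions on the target machine is with PyTorch (a minimal sketch; it assumes PyTorch with CUDA support is already installed):
import sys
import torch

# Python must be 3.8 or higher
print("Python:", sys.version.split()[0])
# CUDA version PyTorch was built against (should be 11.x)
print("CUDA:", torch.version.cuda)
# Confirm a CUDA-capable GPU is visible
print("GPU available:", torch.cuda.is_available())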
Install vLLM
- Install with pip:
pip install vllm
- Install from source:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e . # This may take 5-10 minutes.
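To verify that the installation succeeded (a quick sanity check, assuming the install above completed without errors):
import vllm
# Prints the installed vLLM version
print(vllm.__version__)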
Compute capability requirements
How to check a GPU's compute capability:
- Open Bing: https://cn.bing.com/
- Switch the search to the international version
- Enter a query such as "T4 GPU compute capability" (replace T4 with your own GPU model)
- The search result shows the GPU's compute capability
Compute capability issues
vLLM requires a GPU with compute capability of at least 7.0; on older GPUs it reports the following error:
RuntimeError: GPUs with compute capability less than 7.0 are not supported.
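The compute capability can also be checked locally instead of via a web search (a minimal sketch; assumes PyTorch with CUDA support is installed):
import torch

# Returns (major, minor), e.g. (7, 5) for a T4
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
# vLLM requires compute capability >= 7.0
assert (major, minor) >= (7, 0), "This GPU is not supported by vLLM"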
Quickstart
Offline Batch Inference
Sample code:
from vllm import LLM, SamplingParams

# Prompts to run in a single batch.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Sampling settings shared by all prompts.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Load the model (downloaded from HuggingFace on first use).
llm = LLM(model="facebook/opt-125m")
# Generate completions for all prompts in one call.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
API Server
FastAPI server
vLLM ships a demo FastAPI server that uses the AsyncLLMEngine class to serve requests asynchronously.
- Start the service:
python -m vllm.entrypoints.api_server
Default address: http://localhost:8000
Default model: OPT-125M (facebook/opt-125m)
- Test with curl:
curl http://localhost:8000/generate \
-d '{
"prompt": "San Francisco is a",
"use_beam_search": true,
"n": 4,
"temperature": 0
}'
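The same request can also be sent from Python with the requests library (a minimal sketch mirroring the curl call above; the exact shape of the JSON response may differ between vLLM versions):
import requests

payload = {
    "prompt": "San Francisco is a",
    "use_beam_search": True,
    "n": 4,
    "temperature": 0,
}
# Hits the demo FastAPI server started above.
response = requests.post("http://localhost:8000/generate", json=payload)
print(response.json())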
OpenAI-compatible server
- Start the service:
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
Optional arguments: --host, --port
- List the served models:
curl http://localhost:8000/v1/models
- Send a completion request:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/opt-125m",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
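Because the server speaks the OpenAI API, it can also be queried with the openai Python package (a minimal sketch using the pre-1.0 openai interface; the package version is an assumption):
import openai

# Point the client at the local vLLM server instead of api.openai.com.
openai.api_key = "EMPTY"  # vLLM does not check the key
openai.api_base = "http://localhost:8000/v1"

completion = openai.Completion.create(
    model="facebook/opt-125m",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion)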
Serving
Distributed Inference and Serving
Install the required dependency:
pip install ray
- Multi-GPU inference
Inference on 4 GPUs:
from vllm import LLM

llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
The tensor_parallel_size argument specifies the number of GPUs to use.
- Multi-GPU service
python -m vllm.entrypoints.api_server \
--model facebook/opt-13b \
--tensor-parallel-size 4
- Scale to multiple nodes
Start the Ray runtime on every node before running vLLM:
# On head node
ray start --head
# On worker nodes
ray start --address=<ray-head-address>
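Once the Ray cluster is up, run vLLM on the head node with tensor_parallel_size set to the total number of GPUs across all nodes (a minimal sketch; the 2-node x 4-GPU layout is an assumed example):
from vllm import LLM

# Example: 2 nodes with 4 GPUs each -> 8-way tensor parallelism.
llm = LLM("facebook/opt-13b", tensor_parallel_size=8)
output = llm.generate("San Francisco is a")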
Run the service with SkyPilot
Install SkyPilot:
pip install skypilot
sky check
serving.yaml:
resources:
  accelerators: A100

envs:
  MODEL_NAME: decapoda-research/llama-13b-hf
  TOKENIZER: hf-internal-testing/llama-tokenizer

setup: |
  conda create -n vllm python=3.9 -y
  conda activate vllm

  git clone https://github.com/vllm-project/vllm.git
  cd vllm
  pip install .
  pip install gradio

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  python -u -m vllm.entrypoints.api_server \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --tokenizer $TOKENIZER 2>&1 | tee api_server.log &
  echo 'Waiting for vllm api server to start...'
  while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
  echo 'Starting gradio server...'
  python vllm/examples/gradio_webserver.py
Start the service:
sky launch serving.yaml
Other optional parameters:
sky launch -c vllm-serve-new -s serving.yaml --gpus A100:8 --env MODEL_NAME=decapoda-research/llama-65b-hf
Test:
Open https://<gradio-hash>.gradio.live in a browser.
Model
Models supported by vLLM
https://vllm.readthedocs.io/en/latest/models/supported_models.html#supported-models
Adding your own model
This document provides a high-level guide for integrating HuggingFace Transformers models into vLLM.
https://vllm.readthedocs.io/en/latest/models/adding_model.html
References
1. https://vllm.readthedocs.io/en/latest/
2. https://github.com/vllm-project/vllm
3. https://vllm.ai/
4. https://github.com/vllm-project/vllm/discussions
5. https://github.com/skypilot-org/skypilot/blob/master/llm/vllm