[AI in Practice] vLLM: A Deployment and Inference Framework for Large Language Models

Introduction to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Optimized CUDA kernels

vLLM is flexible and easy to use:

  • Seamless integration with popular HuggingFace models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server

vLLM seamlessly supports most HuggingFace models, including:

  • BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
  • GPT-2 (gpt2, gpt2-xl, etc.)
  • GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
  • GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
  • GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
  • LLaMA (lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
  • MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
  • OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)

Environment configuration

Environment requirements

  • OS: Linux

  • Python: 3.8 or higher

  • CUDA: 11.0 – 11.8

  • GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, etc.)

Install vLLM

  • Install with pip:
pip install vllm
  • Install from source:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .  # This may take 5-10 minutes.
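
A quick way to confirm the installation is to import the package and print its version. This is a minimal check; it assumes the installed package exposes a __version__ attribute, which current releases do:

import vllm

# Prints the installed vLLM version string.
print(vllm.__version__)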

Compute capability requirements

How to check compute capability

  1. Open Bing: https://cn.bing.com/
  2. Switch the search to the international version
  3. Enter the query:
    t4   GPUs  compute capability
    
    My GPU is a T4; replace t4 with your own GPU model
  4. Read the GPU's compute capability from the search results (a programmatic check is sketched below)
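
If PyTorch is available in the environment, the compute capability can also be checked programmatically instead of searching online. A minimal sketch, assuming a CUDA-enabled PyTorch build and at least one visible GPU:

import torch

# get_device_capability returns a (major, minor) tuple, e.g. (7, 5) for a T4.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 compute capability: {major}.{minor}")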

Compute capability errors

vLLM requires GPUs with compute capability 7.0 or higher; on older GPUs it reports the following error:

RuntimeError: GPUs with compute capability less than 7.0 are not supported.

Quickstart

Offline Batch Inference

Sample code:

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

API Server

vLLM ships an example FastAPI server; it uses the AsyncLLMEngine class to serve requests asynchronously.

  • Start the service:
python -m vllm.entrypoints.api_server

Default address: http://localhost:8000
Default model: OPT-125M

  • Test:
curl http://localhost:8000/generate \
    -d '{
        "prompt": "San Francisco is a",
        "use_beam_search": true,
        "n": 4,
        "temperature": 0
    }'
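
The same endpoint can be called from Python; the sketch below uses the requests library and mirrors the fields of the curl example above. The demo server returns a JSON body whose exact schema can change between versions, so it is simply printed as-is:

import json

import requests

payload = {
    "prompt": "San Francisco is a",
    "use_beam_search": True,
    "n": 4,
    "temperature": 0,
}
# The demo server started above listens on http://localhost:8000 by default.
response = requests.post("http://localhost:8000/generate", json=payload)
print(json.dumps(response.json(), indent=2))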

OpenAI-Compatible Server

  • Start the service:
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m

Optional arguments: --host, --port

  • Query the available models:
curl http://localhost:8000/v1/models
  • Test:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
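
Since the server implements the OpenAI API, the official openai Python package can be pointed at it instead of curl. This sketch assumes the legacy openai<1.0 client that was current when this post was written; newer client versions expose a different interface:

import openai

# The API key is not checked by the local server, but the client requires one to be set.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

completion = openai.Completion.create(
    model="facebook/opt-125m",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion)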

Serving

Distributed Inference and Serving

Install the required dependency:

pip install ray
  • Multi-GPU inference
    4 GPU inference:
from vllm import LLM
llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")

The tensor_parallel_size argument specifies the number of GPUs to use.

  • Multi-GPU service
python -m vllm.entrypoints.api_server \
    --model facebook/opt-13b \
    --tensor-parallel-size 4
  • Scale to multiple nodes
    Start the Ray runtime before running vLLM:
# On head node
ray start --head

# On worker nodes
ray start --address=<ray-head-address>
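
Once the Ray cluster is up, launch vLLM from the head node with tensor_parallel_size set to the total number of GPUs across all nodes. A minimal sketch, assuming 2 nodes with 4 GPUs each:

from vllm import LLM

# tensor_parallel_size spans all GPUs in the Ray cluster (2 nodes x 4 GPUs here).
llm = LLM("facebook/opt-13b", tensor_parallel_size=8)
outputs = llm.generate("San Francisco is a")
print(outputs[0].outputs[0].text)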

Run the service with SkyPilot

Install SkyPilot:

pip install skypilot
sky check

serving.yaml:

resources:
    accelerators: A100

envs:
    MODEL_NAME: decapoda-research/llama-13b-hf
    TOKENIZER: hf-internal-testing/llama-tokenizer

setup: |
    conda create -n vllm python=3.9 -y
    conda activate vllm
    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    pip install .
    pip install gradio

run: |
    conda activate vllm
    echo 'Starting vllm api server...'
    python -u -m vllm.entrypoints.api_server \
                    --model $MODEL_NAME \
                    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
                    --tokenizer $TOKENIZER 2>&1 | tee api_server.log &
    echo 'Waiting for vllm api server to start...'
    while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
    echo 'Starting gradio server...'
    python vllm/examples/gradio_webserver.py

Start the service:

sky launch serving.yaml

Other optional arguments, for example:

sky launch -c vllm-serve-new -s serving.yaml --gpus A100:8 --env MODEL_NAME=decapoda-research/llama-65b-hf

Test:
Open https://<gradio-hash>.gradio.live in a browser.

Models

Models supported by vLLM

The full list of supported model architectures is maintained in the vLLM documentation:

https://vllm.readthedocs.io/en/latest/models/supported_models.html#supported-models

Adding your own model

The vLLM documentation provides a high-level guide on integrating a HuggingFace Transformers model into vLLM:
https://vllm.readthedocs.io/en/latest/models/adding_model.html

References

1. https://vllm.readthedocs.io/en/latest/
2. https://github.com/vllm-project/vllm
3. https://vllm.ai/
4. https://github.com/vllm-project/vllm/discussions
5. https://github.com/skypilot-org/skypilot/blob/master/llm/vllm

Reprinted from: blog.csdn.net/zengNLP/article/details/131764968