【ChatGLM3】(9): Deploy the chatglm3-6b model with fastchat and vllm, and run a simple speed comparison. vllm is indeed noticeably faster.

1. Video demo

https://www.bilibili.com/video/BV1ei4y1v7LY/


2. Install the vllm environment (just installing vllm is enough)


Choose a CUDA version of at least 11.8, then run:

pip3 install vllm
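After installing, a quick way to confirm the environment meets the CUDA requirement is to check the torch build that vllm pulls in as a dependency (a minimal sanity-check sketch, not part of the original setup):

# Check the CUDA version torch was built against and whether the GPU is visible.
import torch

print(torch.__version__)
print(torch.version.cuda)          # should be 11.8 or higher
print(torch.cuda.is_available())   # should be True on the GPU instance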

Download the model:

apt update && apt install git-lfs

git clone https://www.modelscope.cn/ZhipuAI/chatglm3-6b.git
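To confirm the download is usable, one option is to load the tokenizer from the local clone (a minimal sketch; it assumes the repository was cloned to /root/autodl-tmp/chatglm3-6b, the same path passed to the server below):

# Load the tokenizer from the local directory to verify the files are complete.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/root/autodl-tmp/chatglm3-6b", trust_remote_code=True
)
print(tokenizer("你好"))  # prints input_ids if the tokenizer loads correctly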

3. Select a GPU instance on AutoDL and start the vllm API server

python -m vllm.entrypoints.api_server --trust-remote-code --model /root/autodl-tmp/chatglm3-6b
INFO 12-11 08:09:34 llm_engine.py:73] Initializing an LLM engine with config: model='/root/autodl-tmp/chatglm3-6b', tokenizer='/root/autodl-tmp/chatglm3-6b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
WARNING 12-11 08:09:34 tokenizer.py:79] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 12-11 08:09:46 llm_engine.py:222] # GPU blocks: 20255, # CPU blocks: 9362
INFO:     Started server process [2175]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
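Once Uvicorn reports it is running, you can post a prompt directly to confirm the server responds. This is a minimal sketch against the /generate endpoint of vllm's demo api_server; the field names follow vllm's SamplingParams and may differ between versions. Note that the /v1/chat/completions route used by the test script below is served by an OpenAI-compatible server (for example vllm.entrypoints.openai.api_server, or fastchat's openai_api_server when testing the fastchat deployment).

# Post a single prompt to the demo server to confirm it is serving requests.
# Endpoint and field names are assumed from vllm's demo api_server.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={
        "prompt": "你好，请介绍一下你自己。",  # "hello, please introduce yourself"
        "max_tokens": 128,
        "temperature": 0.7,
    },
)
print(resp.status_code)
print(resp.json())  # expected to contain a "text" field with the generated output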

Test the API with the throughput script:
test_throughput.py
https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/test_throughput.py

# coding=utf-8
"""

代码测试工具:

python3 test_throughput.py --api-address http://localhost:8000 --model-name chatglm3-6b --n-thread 10


"""
import argparse
import json
import threading
import time

import requests


def main():
    headers = {"User-Agent": "openai client", "Content-Type": "application/json"}
    ploads = {
        "model": args.model_name,
        # Prompt means "write a 50-character story".
        "messages": [{"role": "user", "content": "生成一个50字的故事"}],
        "temperature": 0.7,
    }
    thread_api_addr = args.api_address

    def send_request(results, i):
        print(f"thread {i} goes to {thread_api_addr}")
        response = requests.post(
            thread_api_addr + "/v1/chat/completions",
            headers=headers,
            json=ploads,
            stream=False,
        )
        print(response.text)
        response_new_words = json.loads(response.text)["usage"]["completion_tokens"]
        #error_code = json.loads(response.text)["error_code"]
        print(f"=== Thread {i} ===, words: {response_new_words} ")
        results[i] = response_new_words

    # use N threads to prompt the backend
    tik = time.time()
    threads = []
    results = [None] * args.n_thread
    for i in range(args.n_thread):
        t = threading.Thread(target=send_request, args=(results, i))
        t.start()
        # time.sleep(0.5)
        threads.append(t)

    for t in threads:
        t.join()

    print(f"Time (POST): {time.time() - tik} s")
    n_words = sum(results)
    time_seconds = time.time() - tik
    print(
        f"Time (Completion): {time_seconds}, n threads: {args.n_thread}, "
        f"throughput: {n_words / time_seconds} words/s."
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("--api-address", type=str, default="http://127.0.0.1:8000")
    parser.add_argument("--model-name", type=str, default="chatglm3-6b")
    parser.add_argument("--n-thread", type=int, default=10)
    args = parser.parse_args()

    main()
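For the fastchat side of the comparison, the same script can be pointed at fastchat's OpenAI-compatible API server. A hedged sketch of the usual fastchat launch commands (flags may differ by version; the model path is the same local clone used above):

python3 -m fastchat.serve.controller
python3 -m fastchat.serve.model_worker --model-path /root/autodl-tmp/chatglm3-6b
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000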

4. Summary

I ran a simple comparison between fastchat and vllm.
No quantization was applied, and no other configuration was changed.

fastchat reached roughly 20 tokens/s, while vllm reached 200+ tokens/s, so vllm's speed is indeed very good.
However, the content returned by vllm was not as good as fastchat's.

Reposted from blog.csdn.net/freewebsys/article/details/134917274