NLP (59): Deploying the Baichuan Large Model with FastChat

   This article introduces how to use FastChat to deploy a Chinese large language model, the Baichuan model.

   Before that, let's go over two concepts: the Baichuan model and FastChat.

Baichuan model

   On June 15, 2023, Baichuan Intelligence, known as the "China ChatGPT Dream Team", released baichuan-7B, a bilingual (Chinese and English) pre-trained large model with 7 billion parameters.

   baichuan-7B is an open-source large-scale pre-trained model developed by Baichuan Intelligence. Based on the Transformer architecture, the 7-billion-parameter model was trained on about 1.2 trillion tokens, supports both Chinese and English, and has a context window of 4096 tokens. It achieves the best results among models of the same size on the standard authoritative Chinese and English benchmarks (C-EVAL/MMLU).

   In constructing the pre-training corpus, Baichuan Intelligence started from a high-quality Chinese corpus and incorporated high-quality English data. Compared with other open-source Chinese pre-trained models of the same parameter scale, the amount of training data is more than 50% larger.

  • For data quality, a quality model scores the data, and the original dataset is precisely filtered at both the chapter and sentence level
  • For content diversity, a self-developed ultra-large-scale locality-sensitive hashing clustering system and a semantic clustering system cluster the data at multiple levels and granularities
  • The result is a pre-training corpus of 1.2 trillion tokens that balances quality and diversity

   Unlike LLaMA, which prohibits commercial use outright, the baichuan-7B code is released under the more permissive Apache-2.0 license, which allows commercial use.

FastChat

   FastChat is an open platform for training, deploying, and evaluating chatbot models. Its core features include:

  • Model weights, training code, and evaluation code for state-of-the-art models (such as Vicuna and FastChat-T5)
  • Distributed multi-model deployment system with built-in Web UI and OpenAI compatible RESTful APIs

   FastChat integrates open-source models such as Vicuna, Koala, Alpaca, and LLaMA. Among them, Vicuna is claimed to reach 90% of the quality of GPT-4, making it one of the better-performing open-source ChatGPT-style models.

   FastChat's demo is available at https://chat.lmsys.org/ , and it can be installed with: pip3 install fschat.

CLI deployment

   Download the baichuan-7B model from the Hugging Face Hub ( https://huggingface.co/baichuan-inc/Baichuan-7B ) and place it in a local path on the GPU machine.

   The author's GPU machine has 4 RTX6000 cards, each with 80 GB of video memory.

   The FastChat command to deploy the Baichuan large model via the CLI is:

python3 -m fastchat.serve.cli --model-path path_of_Baichuan-7B --num-gpus 2

   During CLI deployment, you may encounter an error asking for trust_remote_code=True (see this issue: https://github.com/lm-sys/FastChat/issues/1789 ). To fix it, edit FastChat's fastchat/model/model_adapter.py file under the corresponding Python installation path. At lines 57 to 61, the tokenizer is loaded:

			tokenizer = AutoTokenizer.from_pretrained(
                model_path,
                use_fast=self.use_fast_tokenizer,
                revision=revision,
                trust_remote_code=True,  # added
            )

and at lines 69 to 71, the model is loaded:

		model = AutoModelForCausalLM.from_pretrained(
            model_path, low_cpu_mem_usage=True, trust_remote_code=True, **from_pretrained_kwargs  # trust_remote_code added
        )

Adding `trust_remote_code=True` to both from_pretrained calls, as shown, allows deployment to proceed smoothly.

   The interface after successful deployment looks like this:
[Figure: user interface after CLI deployment]

Web deployment

   FastChat also supports Web deployment, including a Web UI and OpenAI-compatible RESTful APIs.

   Here we focus on deploying RESTful APIs compatible with OpenAI. The reference document is: https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md .

   The deployment is divided into three steps:

  1. python3 -m fastchat.serve.controller
  2. python3 -m fastchat.serve.model_worker --model-path path_of_Baichuan-7B
  3. python3 -m fastchat.serve.openai_api_server --host localhost --port 8000

During deployment, if you encounter a PydanticImportError caused by the pydantic version, downgrade pydantic to the latest 1.* release (for example, `pip3 install "pydantic<2"`).

   After successful deployment, the service can provide RESTful APIs similar to OpenAI, as follows:

  • List models

The curl command is:

curl http://localhost:8000/v1/models

The output is:

{
  "object": "list",
  "data": [
    {
      "id": "baichun_7b",
      "object": "model",
      "created": 1689004839,
      "owned_by": "fastchat",
      "root": "baichun_7b",
      "parent": null,
      "permission": [
        {
          "id": "modelperm-UERow2kYwq5B2M8aVQkwdk",
          "object": "model_permission",
          "created": 1689004839,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": true,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
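The ids of the deployed models can be read from this response programmatically. A minimal sketch, assuming a response dictionary with the same structure as the JSON output above (fields abbreviated for brevity):

```python
# Sketch: extract model ids from a /v1/models response.
# `response` mirrors the structure of the JSON output shown above.
response = {
    "object": "list",
    "data": [
        {"id": "baichun_7b", "object": "model", "owned_by": "fastchat"},
    ],
}

model_ids = [m["id"] for m in response["data"]]
print(model_ids)  # ['baichun_7b']
```

The id in this list is the value to pass as "model" in the completion requests below.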
  • Text Completions

The curl command is:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baichun_7b",
    "prompt": "Once upon a time",
    "max_tokens": 40,
    "temperature": 0.5
  }' | jq .

The output is:

{
  "id": "cmpl-izbe3cRRiY4zAbJueBAyxZ",
  "object": "text_completion",
  "created": 1689004991,
  "model": "baichun_7b",
  "choices": [
    {
      "index": 0,
      "text": ", you could find a variety of different types of chocolate in stores. But now, many chocolate companies are focusing on creating vegan chocolate that is not only delicious but also cruelty-free. Here are",
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 4,
    "total_tokens": 43,
    "completion_tokens": 39
  }
}
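Note that finish_reason is "length" here because generation stopped at the max_tokens=40 cap rather than at a natural end. A minimal sketch of reading the generated text and token accounting out of this response (the dictionary mirrors the output above, with the text truncated):

```python
# Sketch: read completion text and token usage from a /v1/completions
# response; field names follow the OpenAI-compatible output shown above.
response = {
    "object": "text_completion",
    "model": "baichun_7b",
    "choices": [
        {"index": 0, "text": ", you could find a variety of ... (truncated)",
         "finish_reason": "length"},
    ],
    "usage": {"prompt_tokens": 4, "total_tokens": 43, "completion_tokens": 39},
}

text = response["choices"][0]["text"]
finish = response["choices"][0]["finish_reason"]
used = response["usage"]["completion_tokens"]
print(finish, used)  # "length" means the max_tokens limit was hit
```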
  • Chat Completions

The curl command is:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baichun_7b",
    "messages": [{"role": "user", "content": "请用中文简单介绍三国演义?"}]
  }' | jq .

The output is:

{
  "id": "chatcmpl-3SiRqRgbZR8v6gLnQYo9eJ",
  "object": "chat.completion",
  "created": 1689005219,
  "model": "baichun_7b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " 三国演义是中国古代长篇小说,讲述了东汉末年至晋朝初年的历史故事。主要人物包括曹操、刘备、孙权和关羽等。故事情节曲折复杂,涉及政治、军事、文化等多个方面,被誉为中国古代小说的经典之作。《三国演义》不仅是一部文学作品,也是中国文化的重要组成部分,对中国历史和文化产生了深远的影响。"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 533,
    "total_tokens": 629,
    "completion_tokens": 96
  }
}
  • Multi-turn dialogue

The curl command is:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baichun_7b",
    "messages": [{"role": "user", "content": "请用中文简单介绍西游记?"}, {"role": "assistant", "content": "三国演义是中国古代长篇小说,讲述了东汉末年至晋朝初年的历史故事。主要人物包括曹操、刘备、孙权和关羽等。故事情节曲折复杂,涉及政治、军事、文化等多个方面,被誉为中国古代小说的经典之作。《三国演义》不仅是一部文学作品,也是中国文化的重要组成部分,对中国历史和文化产生了深远的影响。"}, {"role": "user", "content": "它的作者是谁?"}]
  }' | jq .

The output is:

{
  "id": "chatcmpl-8oE57oXC862wKYyrPLnSGM",
  "object": "chat.completion",
  "created": 1689005374,
  "model": "baichun_7b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " 《三国演义》的作者是明代小说家罗贯中。罗贯中是明代文学家,他的代表作品还有《水浒传》和《西游记》等。他在创作《三国演义》时,参考了大量的历史资料和传说,将这些内容融合在一起,创造了一个虚构的世界,成为了中国文学史上的经典之作。"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 640,
    "total_tokens": 724,
    "completion_tokens": 84
  }
}
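Multi-turn dialogue works because the client resends the entire conversation each time: the server is stateless, so the previous assistant reply and the new user turn are both appended to the messages list, as in the curl command above. A minimal sketch of that bookkeeping (the helper name is illustrative):

```python
# Sketch: maintaining multi-turn context for /v1/chat/completions.
# The server keeps no state; the client resends the full history each turn.
history = []

def add_turn(role, content):
    """Append one message in the OpenAI chat format."""
    history.append({"role": role, "content": content})

add_turn("user", "请用中文简单介绍西游记?")
add_turn("assistant", "三国演义是中国古代长篇小说……")  # the model's previous reply
add_turn("user", "它的作者是谁?")

# `history` is what goes into the "messages" field of the request body
roles = [m["role"] for m in history]
print(roles)  # ['user', 'assistant', 'user']
```

Because the whole history is resent, prompt_tokens grows with each turn, which matches the usage figures in the outputs above.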
  • Use Python code

import openai
openai.api_key = "EMPTY"  # not supported yet
openai.api_base = "http://localhost:8000/v1"

model = "baichun_7b"
prompt = "Once upon a time"

# create a text completion
completion = openai.Completion.create(model=model, prompt=prompt, max_tokens=64)
# print the completion
print(prompt + completion.choices[0].text)

# create a chat completion
completion = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": "Hello! What is your name?"}],
)
# print the completion
print(completion.choices[0].message.content)

Both deployment methods support streaming output, and model inference is fast: in the test examples above, the author's inference time is generally 5-7 seconds. FastChat also supports distributed deployment with high concurrency.
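Streaming responses follow the OpenAI convention of server-sent events: each chunk arrives as a `data: {json}` line carrying a delta, and the stream ends with `data: [DONE]`. A minimal sketch of how a client reassembles the streamed text (the sample payload below is illustrative, not captured from the server):

```python
# Sketch: reassembling an OpenAI-style streaming (SSE) chat response.
import json

def parse_sse_chunks(raw: str) -> str:
    """Concatenate content deltas from a server-sent-events stream body."""
    pieces = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if "content" in delta:
            pieces.append(delta["content"])
    return "".join(pieces)

sample = (
    'data: {"choices": [{"delta": {"role": "assistant"}}]}\n'
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n'
    'data: {"choices": [{"delta": {"content": "lo"}}]}\n'
    'data: [DONE]\n'
)
print(parse_sse_chunks(sample))  # Hello
```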

Summary

   This article introduced how to use FastChat to deploy the Chinese large model Baichuan, demonstrated two deployment methods (CLI deployment and Web deployment), and covered the problems that arise during deployment along with their solutions. We hope it is helpful to readers.

References

  1. GeekPark: an open-source Chinese-English model with 7 billion parameters released: https://www.geekpark.net/news/320721
  2. baichuan-inc/Baichuan-7B on Hugging Face Hub: https://huggingface.co/baichuan-inc/Baichuan-7B
  3. OpenAI-Compatible RESTful APIs & SDK: https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md
  4. Error when executing python -m fastchat.serve.openai_api_server --host localhost --port 8000: https://github.com/lm-sys/FastChat/issues/1641
  5. FastChat on GitHub: https://github.com/lm-sys/FastChat

Origin blog.csdn.net/jclian91/article/details/131650918