Background
With open source large language models (LLMs) surging in popularity, many developers want to deploy an open source LLM locally, either to study LLMs themselves or to build their own applications on top of them. The author has also been building his own LLM application by combining excellent open source community projects with locally deployed services. So what preparation is needed to deploy an open source LLM locally and build a chat application on it?
Preparing the local environment:
Because we will run a large open source model locally, you need a fairly capable machine. On the hardware side: an NVIDIA graphics card with strong performance and plenty of video memory, large-capacity high-speed RAM, and a large solid-state drive. On the software side: the graphics driver, CUDA, and a Python environment. This time I chose to run the Baichuan-chat-13B model as the example. My configuration is an i9-13900K CPU, dual RTX 3090 24GB graphics cards, 64GB of RAM, and a 2TB SSD.
A large language model (LLM):
This is the foundation we build the application on. Different LLMs have different model architectures and have learned different knowledge from their pre-training data and target tasks, so AI applications built on different models will behave differently. You can browse the popular AI community Hugging Face to find open source LLMs that interest you, try them out, and compare their capabilities.
An inference service that deploys the LLM locally: The inference service loads the pre-trained LLM onto your local server and exposes a model prediction interface, so the LLM can perform various NLP tasks locally without relying on cloud services. Several excellent open source GitHub projects provide one-click deployment of inference services for popular open source LLMs; well-known, highly-starred examples include LocalAI, OpenLLM, etc.
A simple and easy-to-use "LLM operating system", Dify.AI: To build a chat application on LLM capabilities, you would otherwise need to study the full LLM technology stack: API calls for different models, vector database selection, embedding techniques, and so on. With the open source project Dify.AI you can skip that research and quickly create AI applications on top of different LLMs through a visual interface. The latest version of Dify adds support for open source LLMs: all models hosted on Hugging Face and Replicate can be quickly called and switched, and it also supports local deployment, building AI applications on top of OpenLLM and Xorbits Inference services.
The author will use the open source LLMOps platform Dify.AI, the open source inference service Xinference, and the open source model baichuan-chat-13B as examples to walk you step by step through building an LLM chat application in a Windows environment. Without further ado, let's get straight to work.
Environment preparation
You may already have basic conda and Python available, but this article introduces the environment configuration from scratch.
Configure the Python environment
It is generally recommended to use conda for Python version management. First install conda according to the official conda documentation [1]. Then use conda to create a Python 3.11 environment:
conda create --name python-3-11 python=3.11
conda activate python-3-11
Install CUDA
It is recommended to install CUDA directly from the official website [2]. On Windows 11, select the version shown below.
After following the installation wizard, open "NVIDIA Control Panel -> System Information" to confirm that it is installed.
WSL2 preparation
Dify's Docker deployment is best run in a WSL2 environment, so install WSL2 first. Refer to Microsoft's official guide [3].
Step 1: Run CMD as administrator.
Step 2: Install WSL using the following command in CMD:
wsl --install
The command lists the supported distributions:
The Windows Subsystem for Linux has been installed.
The following is a list of valid distributions that can be installed.
Install using "wsl --install -d <Distro>".
NAME FRIENDLY NAME
Ubuntu Ubuntu
Debian Debian GNU/Linux
kali-linux Kali Linux Rolling
Ubuntu-18.04 Ubuntu 18.04 LTS
Ubuntu-20.04 Ubuntu 20.04 LTS
Ubuntu-22.04 Ubuntu 22.04 LTS
OracleLinux_7_9 Oracle Linux 7.9
OracleLinux_8_7 Oracle Linux 8.7
OracleLinux_9_1 Oracle Linux 9.1
openSUSE-Leap-15.5 openSUSE Leap 15.5
SUSE-Linux-Enterprise-Server-15-SP4 SUSE Linux Enterprise Server 15 SP4
SUSE-Linux-Enterprise-15-SP5 SUSE Linux Enterprise 15 SP5
openSUSE-Tumbleweed openSUSE Tumbleweed
I installed the default Ubuntu distribution:
wsl --install -d Ubuntu
After that, you can use the "wsl" command in CMD to enter Ubuntu.
Step 3: Install Docker Desktop
Go to the official Docker documentation [4] and download "Docker Desktop". When installing, be sure to check the "Use WSL 2 instead of Hyper-V" option. After the installation is complete, restart the computer, then check via CMD that everything is installed normally:
wsl -l --verbose
NAME STATE VERSION
* Ubuntu Running 2
docker-desktop Running 2
docker-desktop-data Running 2
You can see that Ubuntu and Docker are running in WSL, and they are confirmed to be WSL2 versions.
Step 4: Configure the proxy for WSL
Since the IP address of the WSL host changes after every restart, we can write a script to handle the proxy setup. Change the `port` value near the top of the script to your own proxy port.
#!/bin/sh
hostip=$(grep nameserver /etc/resolv.conf | awk '{ print $2 }')
wslip=$(hostname -I | awk '{print $1}')
port=7890

PROXY_HTTP="http://${hostip}:${port}"
# Assumes your proxy also serves SOCKS5 on the same port; adjust if not.
PROXY_SOCKS5="socks5://${hostip}:${port}"

set_proxy(){
    export http_proxy="${PROXY_HTTP}"
    export HTTP_PROXY="${PROXY_HTTP}"
    export https_proxy="${PROXY_HTTP}"
    export HTTPS_PROXY="${PROXY_HTTP}"
    export ALL_PROXY="${PROXY_SOCKS5}"
    export all_proxy="${PROXY_SOCKS5}"
    git config --global http.https://github.com.proxy "${PROXY_HTTP}"
    git config --global https.https://github.com.proxy "${PROXY_HTTP}"
    echo "Proxy has been opened."
}

unset_proxy(){
    unset http_proxy
    unset HTTP_PROXY
    unset https_proxy
    unset HTTPS_PROXY
    unset ALL_PROXY
    unset all_proxy
    git config --global --unset http.https://github.com.proxy
    git config --global --unset https.https://github.com.proxy
    echo "Proxy has been closed."
}

test_setting(){
    echo "Host IP:" "${hostip}"
    echo "WSL IP:" "${wslip}"
    echo "Try to connect to Google..."
    resp=$(curl -I -s --connect-timeout 5 -m 5 -w "%{http_code}" -o /dev/null www.google.com)
    if [ "${resp}" = "200" ]; then
        echo "Proxy setup succeeded!"
    else
        echo "Proxy setup failed!"
    fi
}

if [ "$1" = "set" ]; then
    set_proxy
elif [ "$1" = "unset" ]; then
    unset_proxy
elif [ "$1" = "test" ]; then
    test_setting
else
    echo "Unsupported arguments."
fi

Save the script (for example as proxy.sh) and run it with "source proxy.sh set" so the exported variables take effect in the current shell; "source proxy.sh unset" disables the proxy, and "source proxy.sh test" verifies the connection.
Step 5: Enter Ubuntu, install conda, and configure Python
As in the earlier environment preparation, follow the official documentation to install conda and configure Python, this time choosing the Linux version.
Step 6: Install CUDA for WSL
Go to the official website, select the WSL-Ubuntu version, and follow the command-line installation instructions.
Step 7: Install PyTorch
Go to the official PyTorch website [5] and install PyTorch for your environment.
This completes the environment preparation.
Deploy the inference service Xinference
According to Dify's deployment documentation [6], Xinference supports quite a few models. This time, let's choose Xinference and try baichuan-chat-13B.
Xorbits Inference (Xinference) is a powerful and versatile distributed inference framework designed to serve large language models, speech recognition models, and multi-modal models, even on a laptop. It supports a variety of GGML-compatible models, such as ChatGLM, Baichuan, Whisper, Vicuna, and Orca. Dify can connect to the large language model inference and embedding capabilities served by a locally deployed Xinference.
Install Xinference
Execute the following command in WSL:
$ pip install "xinference"
The above command installs Xinference's basic dependencies for inference. Xinference also supports "ggml inference" and "PyTorch inference"; install the corresponding extras as needed:
$ pip install "xinference[ggml]"
$ pip install "xinference[pytorch]"
$ pip install "xinference[all]"
Start Xinference and download and deploy the baichuan-chat-13B model
Execute the following command in WSL:
$ xinference -H 0.0.0.0
Xinference starts a worker locally by default at the endpoint "http://127.0.0.1:9997" (the default port is "9997"). By default it can only be accessed from the local machine; configuring "-H 0.0.0.0" allows non-local clients to connect. To further change the "host" or "port", view Xinference's help information: "xinference --help".
2023-08-25 18:08:31,204 xinference 27505 INFO Xinference successfully started. Endpoint: http://0.0.0.0:9997
2023-08-25 18:08:31,204 xinference.core.supervisor 27505 INFO Worker 0.0.0.0:53860 has been added successfully
2023-08-25 18:08:31,205 xinference.deploy.worker 27505 INFO Xinference worker successfully started.
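Once the worker is up, it can be reached over HTTP. As a sketch of what a client call looks like, the request can be assembled as below. The `/v1/chat/completions` path is assumed from Xinference's OpenAI-compatible REST API; verify it against your version's documentation before relying on it.

```python
import json

def build_chat_request(endpoint: str, model_uid: str, prompt: str):
    """Assemble the URL and JSON body for a chat completion call.

    The path follows the OpenAI-compatible convention that Xinference
    exposes (an assumption here, not verified against every version).
    """
    url = f"{endpoint.rstrip('/')}/v1/chat/completions"
    body = {
        "model": model_uid,  # the UID returned by `xinference list`
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, json.dumps(body)

# "<model-uid>" is a placeholder; use your real deployed model UID.
url, body = build_chat_request("http://127.0.0.1:9997", "<model-uid>", "你好")
print(url)  # → http://127.0.0.1:9997/v1/chat/completions
```

Sending `body` to `url` with any HTTP client (curl, urllib, requests) should return the model's chat completion once the service is running.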
Open http://localhost:9997 in the browser; select baichuan-chat, pytorch, 13B, and 4bit, then click create to deploy the model.
Or deploy using CLI:
xinference launch --model-name baichuan-chat --model-format pytorch --size-in-billions 13 --quantization 4
Since model compatibility varies across hardware platforms, check the Xinference built-in models list [7] to confirm that the model you create supports your current hardware.
Use Xinference to manage models
To view all deployed models, on the command line, execute the following command:
$ xinference list
Information similar to the following will be displayed:
UID Type Name Format Size (in billions) Quantization
------------------------------------ ------ ------------- -------- -------------------- --------------
0db1e250-4330-11ee-b9ef-00155da30d2d LLM baichuan-chat pytorch 13 4-bit
"0db1e250-4330-11ee-b9ef-00155da30d2d" is the uid of the model just deployed.
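If you want to feed that UID into later steps from a script instead of copying it by hand, a small parser over the `xinference list` output works. The sample text below mirrors the table above; the UUID-shape check is a heuristic of this sketch, not part of Xinference itself.

```python
# Sample text mirroring the `xinference list` table shown above.
sample = """\
UID                                   Type  Name           Format   Size (in billions)  Quantization
------------------------------------  ----  -------------  -------  ------------------  ------------
0db1e250-4330-11ee-b9ef-00155da30d2d  LLM   baichuan-chat  pytorch  13                  4-bit
"""

def parse_model_uids(listing: str) -> list[str]:
    """Extract model UIDs (UUID-shaped first columns) from the listing."""
    uids = []
    for line in listing.splitlines():
        fields = line.split()
        first = fields[0] if fields else ""
        # UUIDs are 36 characters with exactly 4 dashes; this filters out
        # the header row and the dashed separator line.
        if len(first) == 36 and first.count("-") == 4:
            uids.append(first)
    return uids

print(parse_model_uids(sample))  # → ['0db1e250-4330-11ee-b9ef-00155da30d2d']
```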
Deploy Dify.AI
For the main process, please refer to the official website deployment document [8] .
Clone Dify
Clone Dify source code to local
git clone https://github.com/langgenius/dify.git
Start Dify
Enter the docker directory of the Dify source code and execute the one-click startup command:
cd dify/docker
docker compose up -d
Deployment results:
[+] Running 7/7
✔ Container docker-weaviate-1 Running 0.0s
✔ Container docker-web-1 Running 0.0s
✔ Container docker-redis-1 Running 0.0s
✔ Container docker-db-1 Running 0.0s
✔ Container docker-worker-1 Running 0.0s
✔ Container docker-api-1 Running 0.0s
✔ Container docker-nginx-1 Started 0.9s
Finally check if all containers are running properly:
docker compose ps
Operating status:
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
docker-api-1 langgenius/dify-api:0.3.16 "/bin/bash /entrypoi…" api 24 hours ago Up 3 hours 5001/tcp
docker-db-1 postgres:15-alpine "docker-entrypoint.s…" db 33 hours ago Up 3 hours 0.0.0.0:5432->5432/tcp
docker-nginx-1 nginx:latest "/docker-entrypoint.…" nginx 24 hours ago Up 4 minutes 0.0.0.0:80->80/tcp
docker-redis-1 redis:6-alpine "docker-entrypoint.s…" redis 33 hours ago Up 3 hours 6379/tcp
docker-weaviate-1 semitechnologies/weaviate:1.18.4 "/bin/weaviate --hos…" weaviate 33 hours ago Up 3 hours
docker-web-1 langgenius/dify-web:0.3.16 "/bin/sh ./entrypoin…" web 33 hours ago Up 3 hours 3000/tcp
docker-worker-1 langgenius/dify-api:0.3.16 "/bin/bash /entrypoi…" worker 33 hours ago Up 3 hours 5001/tcp
This includes 3 business services (api / worker / web) and 4 basic components (weaviate / db / redis / nginx).
After Docker starts successfully, visit: http://127.0.0.1/ in the browser. After setting a password and logging in, you will enter the application list page.
At this point, Dify Community Edition has been successfully deployed using Docker.
Connect to Xinference at Dify
Configure model provider
Fill in the model information under "Settings > Model Providers > Xinference":
- Model Name is the name you give this model deployment.
- Server URL is the endpoint address of Xinference.
- Model UID is the UID of the deployed model, obtained via `xinference list`.
Note that the Server URL cannot use localhost: filling in localhost means accessing localhost inside Docker, which refers to the container itself, so the access will fail. The solution is to use the LAN IP instead; in a WSL environment, use the WSL IP address.
Use the command in WSL to get:
hostname -I
172.31.157.121
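To make this concrete, here is a minimal sketch of how the Server URL is formed from that WSL IP. The IP shown is just the example value above; substitute your own `hostname -I` output.

```python
# Dify's Server URL field must point at the WSL IP, not localhost,
# because "localhost" inside Dify's Docker containers refers to the
# container itself. The IP below is illustrative.
wsl_ip = "172.31.157.121"
server_url = f"http://{wsl_ip}:9997"  # 9997 is Xinference's default port
print(server_url)  # → http://172.31.157.121:9997
```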
Use baichuan-chat
After creating an application, you can use the baichuan-chat-13B model configured in the previous step. In Dify's prompt orchestration interface, select the baichuan-chat model, design your application's prompt, and then publish an accessible AI application.
The above is the whole process of locally deploying Dify and connecting to baichuan-chat deployed by Xinference. At this point, our chat application based on baichuan-chat-13B is basically completed.
Postscript
Of course, for a production-grade LLM application, it is not enough to merely wire up model access, inference, and chat interaction. We also need to tune the LLM's prompts, add private data as context, or fine-tune the LLM itself, which requires long-term iteration and optimization to make the application perform better and better. As a middleware tool platform, Dify.AI provides a visual operating system for the complete LLM app technology stack. Once the basic services above are deployed, subsequent application iteration and improvement can be done on top of Dify, making LLM applications simpler to build and manage: business data can be uploaded directly and cleaned automatically, data annotation and improvement services will be offered in the future, and even your business team can participate in the collaboration.
At present, the development and application of LLMs are still at a very early stage. I believe that in the near future, whether through the release of new LLM capabilities or the continuous improvement of the tools built around them, the threshold for developers to explore LLM capabilities will keep falling, letting more AI applications with rich scenarios emerge.
If you like Dify, welcome:
- Contribute code on GitHub and build a better Dify with us;
- Share Dify and your experience with your friends through online and offline activities and social media;
- Give Dify a star on GitHub ⭐️
You can also contact the Dify assistant and join our friend group chat to share experience with each other: