Deploying the Open-Source Falcon-7B Model to the Cloud with Truss, Docker, and Kubernetes (Translation)

Background

So far, we have seen what ChatGPT is capable of and the great features it offers. However, for enterprise applications, a closed-source model like ChatGPT can pose risks, since enterprises have no control over their data. Although OpenAI claims that user data will not be stored or used to train models, this does not guarantee that the data will never be leaked in some way.

To address some of the issues associated with closed-source models, researchers are rushing to build open-source large language models (LLMs) that compete with models like ChatGPT. With open source models, businesses can host models in a secure cloud environment, reducing the risk of data breaches. Most importantly, you get complete transparency into the inner workings of the model, which helps users build more trusting relationships with AI systems.

With recent advances in open-source LLMs, it is tempting to try out new models and see how they compare to closed-source models like ChatGPT.

However, there are huge barriers to running an open source model today. For example, it is much easier to call the ChatGPT API than to understand how to run an open source LLM.

In this post, I aim to overcome the aforementioned difficulties by showing how an open source model like the Falcon-7B model can be run in the cloud in a production-like environment. Eventually, we will be able to access these models through API endpoints similar to ChatGPT.

Challenge

A significant challenge in running open-source models is the lack of computing resources. Even a "small" model like the Falcon-7B requires a GPU to run.

To solve this problem, we can use GPUs in the cloud. However, this presents another challenge: How do we containerize the LLM? How do we enable GPU support? Enabling GPU support can be tricky because it requires knowledge of CUDA, and working with CUDA can be a pain: you have to figure out how to install the correct dependencies and which versions are compatible.

[Translator's Note] CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing platform and architecture launched by the graphics card manufacturer NVIDIA. It includes the CUDA instruction set architecture (ISA) and the parallel computing engine inside the GPU, and developers can write programs for the CUDA architecture in C, C++, and Fortran.

Therefore, to avoid the CUDA death trap, many companies have created solutions that can easily containerize models while supporting GPUs. In this blog post, we'll use an open source tool called Truss to help us easily containerize LLM without too much hassle.

Truss allows developers to easily containerize models built with any framework.

Why use Truss?

Truss — https://truss.baseten.co/e2e

Truss offers many useful features out of the box, such as:

  • Convert Python models to microservices with production-ready API endpoints
  • Freeze dependencies via Docker
  • Support for GPU inference
  • Simple preprocessing and postprocessing of the model
  • Easy and secure secret management

I've used Truss before to deploy machine learning models, and the process was smooth and easy. Truss automatically creates the Dockerfile and manages Python dependencies. All we have to do is provide the code for our model.

Actually, the main reason we want to use a tool like Truss is that it makes it easier to deploy our models with GPU support.

Plan

Here are the main steps I will cover in this blog post:

  1. Setting up the Falcon 7B locally with Truss
  2. If you have a GPU (I have an RTX 3080), run the model locally
  3. Containerize the model and run it with Docker
  4. Create a GPU-enabled Kubernetes cluster in Google Cloud to run our model

Don't worry if you don't have a GPU for step 2; you can still run the model in the cloud.

Here is the GitHub repository that contains all the code described in this article, in case you want to follow along:

https://github.com/htrivedi99/falcon-7b-truss

Let's get started!

Step 1: Set up Falcon-7B locally using Truss

First, we need to create a project with Python version ≥ 3.8.

Then, we will download the model from Hugging Face and package it using Truss. Here is the dependency we need to install:

pip install truss

Then, create a script called main.py in your Python project. This is a scratch script that we will use to work with Truss.

Next, we will set up the Truss package by running the following command in the terminal:

truss init falcon_7b_truss

Press 'y' if prompted to create a new Truss. Once complete, you should see a new directory called falcon_7b_truss containing some auto-generated files and folders. The two files we need to fill in are model.py, which lives under the model folder, and config.yaml.
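
For orientation, the generated scaffold typically looks roughly like the listing below (the exact contents can vary between Truss versions; the data, packages, and examples.yaml entries correspond to fields we will see in config.yaml):

falcon_7b_truss/
├── config.yaml        # model configuration: dependencies, resources, etc.
├── data/              # optional bundled data (data_dir in config.yaml)
├── examples.yaml      # optional example inputs
├── model/
│   └── model.py       # the model code we will write next
└── packages/          # optional bundled packages (bundled_packages_dir)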

As I mentioned before, Truss only needs the code for our model; it takes care of everything else automatically. We will write the code in model.py, but it must be written in a specific format.

Truss expects each model to support at least three functions: __init__, load, and predict.

  • __init__ is mainly used to create class variables
  • load is where we download the model from the HuggingFace official website
  • predict is where we call the model

Here is the full code of model.py:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from typing import Dict

MODEL_NAME = "tiiuae/falcon-7b-instruct"
DEFAULT_MAX_LENGTH = 128


class Model:
    def __init__(self, data_dir: str, config: Dict, **kwargs) -> None:
        self._data_dir = data_dir
        self._config = config
        # Use the GPU if one is available, otherwise fall back to the CPU
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print("THE DEVICE INFERENCE IS RUNNING ON IS: ", self.device)
        self.tokenizer = None
        self.pipeline = None

    def load(self):
        # Download the tokenizer and the 8-bit quantized model from Hugging Face
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        model_8bit = AutoModelForCausalLM.from_pretrained(
            MODEL_NAME,
            device_map="auto",
            load_in_8bit=True,
            trust_remote_code=True)

        self.pipeline = pipeline(
            "text-generation",
            model=model_8bit,
            tokenizer=self.tokenizer,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            device_map="auto",
        )

    def predict(self, request: Dict) -> Dict:
        # Run in inference mode; any keys besides "prompt" are forwarded to the pipeline
        with torch.no_grad():
            try:
                prompt = request.pop("prompt")
                data = self.pipeline(
                    prompt,
                    eos_token_id=self.tokenizer.eos_token_id,
                    max_length=DEFAULT_MAX_LENGTH,
                    **request
                )[0]
                return {"data": data}

            except Exception as exc:
                return {"status": "error", "data": None, "message": str(exc)}

Here's what's happening:

  • MODEL_NAME is the model we are going to use, in our case the falcon-7b-instruct model
  • Inside load, we download the model from Hugging Face in 8-bit. The reason we want 8-bit is that the quantized model uses significantly less GPU memory: roughly one byte per parameter instead of two, so on the order of 7 GB of weights instead of about 14 GB.
  • Also, if you want to run the model locally on a GPU with less than 13 GB of VRAM, you will need to load the model in 8-bit.
  • The predict function accepts a JSON request as an argument and invokes the model using self.pipeline. torch.no_grad tells PyTorch that we are in inference mode, not training mode.

That's all we need to set up our model. To see how Truss drives this class, have a look at the sketch below.
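
Truss itself instantiates this class and calls load and predict for us, but purely for illustration, the lifecycle looks roughly like this. This is a hand-driven sketch, not something Truss requires; don't run it until you have installed the dependencies from Step 2 and have a GPU available, and note that the import assumes your working directory is falcon_7b_truss/model:

# Hypothetical smoke test of the Model class defined above
from model import Model  # model.py in the current directory

m = Model(data_dir=".", config={})
m.load()                                   # downloads the weights on first run
print(m.predict({"prompt": "Hi there how are you?"}))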

Step 2: Run the model locally (optional)

If you have an Nvidia GPU with more than 8 GB of VRAM, you can run the model locally.

If not, proceed to the next step.

We need to download more dependencies to run the model locally. Before downloading dependencies, you need to make sure you have CUDA and the correct CUDA driver installed.

Because we are running the model natively here, Truss can't manage CUDA for us.

pip install transformers
pip install torch
pip install peft
pip install bitsandbytes
pip install einops
pip install scipy 
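
Before loading a 7B-parameter model, it is worth confirming that PyTorch can actually see your GPU. Here is a quick, optional sanity check (the exact values printed will depend on your setup):

import torch

print(torch.__version__)           # installed PyTorch version
print(torch.version.cuda)          # CUDA version this PyTorch build was compiled for
print(torch.cuda.is_available())   # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. an RTX 3080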

Next, in the main.py script we created outside the falcon_7b_truss directory, we need to load our Truss.

The following is the code of main.py:

import truss
from pathlib import Path
import requests

# Load the Truss package we scaffolded and run a test prediction
tr = truss.load("./falcon_7b_truss")
output = tr.predict({"prompt": "Hi there how are you?"})
print(output)

Here's what's happening:

  • If you recall, the falcon_7b_truss directory was automatically created by Truss. We can load the whole package, including the model and its dependencies, using truss.load
  • Once we have loaded our package, we can simply call the predict method to get the model's output, so running main.py runs the model. The model weights are roughly 15 GB, so it may take 5-10 minutes to download them. After running the script, you should see output like this (you can also pass extra generation parameters in the request, as shown in the sketch after this output):

{'data': {'generated_text': "Hi there how are you?\nI'm doing well. I'm in the middle of a move, so I'm a bit tired. I'm also a bit overwhelmed. I'm not sure how to get started. I'm not sure what I'm doing. I'm not sure if I'm doing it right. I'm not sure if I'm doing it wrong. I'm not sure if I'm doing it at all.\nI'm not sure if I'm doing it right. I'm not sure if I'm doing it wrong. I"}}
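
Because predict pops only the "prompt" key and forwards everything else to the text-generation pipeline, you can experiment with generation settings per request. A minimal sketch, where the extra keys are standard transformers generation arguments rather than anything Truss-specific:

output = tr.predict({
    "prompt": "Write a short poem about a falcon.",
    "do_sample": True,      # sample instead of greedy decoding
    "temperature": 0.7,     # lower values are more deterministic
    "top_k": 50,            # sample only from the 50 most likely tokens
})
print(output["data"]["generated_text"])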

Step 3: Package the model with Docker

Typically, when people containerize a model, they take the model binary and Python dependencies and wrap them in a Flask or FastAPI server.

A lot of that is boilerplate, and we don't have to deal with it ourselves; Truss handles it automatically. We've already provided the model, and Truss will create the server, so the only thing left to do is supply the Python dependencies.

config.yaml holds the configuration of our model. This is where we can add dependencies to our models. The configuration file already provides most of what we need, but we still need to add some things.

Here's what you need to add to your config.yaml:

apply_library_patches: true
bundled_packages_dir: packages
data_dir: data
description: null
environment_variables: {}
examples_filename: examples.yaml
external_package_dirs: []
input_type: Any
live_reload: false
model_class_filename: model.py
model_class_name: Model
model_framework: custom
model_metadata: {}
model_module_dir: model
model_name: Falcon-7B
model_type: custom
python_version: py39
requirements:
- torch
- peft
- sentencepiece
- accelerate
- bitsandbytes
- einops
- scipy
- git+https://github.com/huggingface/transformers.git
resources:
  use_gpu: true
  cpu: "3"
  memory: 14Gi
secrets: {}
spec_version: '2.0'
system_packages: []

The main thing we added is the requirements section. All of the listed dependencies are required to download and run the model.

The other important addition is the resources section. Setting use_gpu: true is essential, because it tells Truss to create a Dockerfile for us with GPU support enabled. That's it for the configuration.

Next, we will containerize our model. If you don't know how to package your model with Docker, don't worry, Truss has you covered.

In the main.py file, we'll tell Truss to package everything together. Here is the code you need:

import truss
from pathlib import Path
import requests

# Generate the Dockerfile and server scaffolding inside the falcon_7b_truss directory
tr = truss.load("./falcon_7b_truss")
command = tr.docker_build_setup(build_dir=Path("./falcon_7b_truss"))
print(command)  # prints the docker build command to run

Here's what's happening:

  • First, we load falcon_7b_truss.
  • Next, the docker_build_setup function handles all the complicated stuff like creating the Dockerfile and setting up the FastAPI server.
  • If you look in your falcon_7b_truss directory, you'll see more files generated. We don't need to worry about how these files work as they will all be managed behind the scenes.
  • At the end of the run, we get a Docker command to build our Docker image:
docker build falcon_7b_truss -t falcon-7b-model:latest

If you want to build the Docker image, go ahead and run the build command. The image is about 9 GB, so it may take a while to build. If you don't want to build it yourself but still want to follow along, you can use the image I provide instead:

htrivedi05/truss-falcon-7b:latest

If you build the image yourself, you need to tag it and push it to Docker Hub so that our containers in the cloud can pull it. Here are the commands to run after building the image:

docker tag falcon-7b-model <docker_user_id>/falcon-7b-model
docker push <docker_user_id>/falcon-7b-model

Amazingly, at this point we are ready to run our model in the cloud!

[Note] The following optional steps (before Step 4) show how to run the image locally with GPU support.

If you have an Nvidia GPU and want to run a containerized model locally with GPU support, you need to make sure Docker is configured to use your GPU.

To do this, all you need is to open a terminal and run the following command:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

apt-get update
apt-get install -y nvidia-docker2

sudo systemctl restart docker

Now that your Docker is configured to access the GPU, here's how to run the container:

docker run --gpus all -d -p 8080:8080 falcon-7b-model

Again, it will take a while to download the model. To make sure everything is working, you can check the container logs (for example, with docker logs <container_id>), and you should see "THE DEVICE INFERENCE IS RUNNING ON IS: cuda".

You can make calls to the model through the API endpoint as follows:

import requests

data = {"prompt": "Hi there, how's it going?"}
res = requests.post("http://127.0.0.1:8080/v1/models/model:predict", json=data)
print(res.json())

Step 4: Deploy the model to production

I'm using the word "production" quite loosely here. We will run our model in Kubernetes because it can easily scale and handle variable amounts of traffic in this environment.

That said, Kubernetes provides a lot of configuration options, such as network policies, storage, ConfigMaps, load balancing, secrets management, and so on.

Although Kubernetes was built to "scale" and run "production" workloads, many of the production-grade configurations you need aren't available out-of-the-box. A discussion covering those advanced Kubernetes topics is beyond the scope of this article and a distraction from what we are trying to achieve here. So, for this blog post, we'll create a minimal cluster of the basic type.

Without further ado, let's get down to creating our cluster!

Prerequisites:

  1. Have a Google Cloud account with a project created
  2. Have the gcloud CLI installed on your machine
  3. Make sure you have enough quota to run a GPU-enabled machine. You can check your quotas under "IAM & Admin" in the console.

Create our GKE cluster

We will use Google Kubernetes Engine (GKE) to create and manage our cluster. Some important information first:

GKE is not free; Google won't let us use a powerful GPU at no cost. That said, we are creating a single-node cluster with a less powerful GPU, so this experiment should cost no more than $1 to $2.

Here is the configuration of the Kubernetes cluster we will be running on:

  • 1 node, standard Kubernetes cluster
  • 1 Nvidia T4 GPU
  • n1-standard-4 machine (4 vCPU, 15GB memory)
  • All of this will run on a Spot Instance

Note: If you're in another region and don't have access to exactly the same resources, feel free to adjust the configuration.

Steps to create a cluster:

1. Go to Google Cloud Console and search for a service called Kubernetes Engine:

2. Click the "CREATE" button:

  • Make sure you are creating a standard cluster, not an autopilot cluster. It should say "Create a kubernetes cluster" at the top of the page.

3. Cluster basics:

  • In the "Cluster basics" tab, we don't want to make too many changes. Just give the cluster a name. You don't need to change regions or control planes.

4. Click on the default-pool tab and change the number of nodes to 1.

5. Under the "default-pool" tab, click the "Nodes" tab in the left sidebar:

  • Change the machine configuration from General purpose to GPU
  • Select Nvidia T4 as the GPU type and set the quantity to 1
  • Enable GPU time-sharing (even though we will not use this feature)
  • Set the maximum number of shared clients per GPU to 8
  • For the machine type, choose n1-standard-4 (4 vCPU, 15 GB memory)
  • Change the boot disk size to 50 GB
  • Scroll down to the very bottom and check the box that says: Enable nodes on spot VMs

After the cluster is configured, go ahead and create the cluster.

Google takes a few minutes to set everything up. Once your cluster is up and running, we need to connect to it. To do this, open your terminal and run the following command:

gcloud config set compute/zone us-central1-c
gcloud container clusters get-credentials gpu-cluster-1

If you used a different zone or cluster name, update these commands accordingly. To check that we are connected, run the following command:

kubectl get nodes

You should see 1 node appear in your terminal. Although our cluster has a GPU, it is missing some Nvidia drivers that we have to install. Thankfully, installing them is a snap. Run the following command to install the driver:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Let's celebrate, we are finally ready to deploy our model.

Deploy the model

In order to deploy our model to a cluster, we need to create a Kubernetes deployment. Kubernetes deployments allow us to manage instances of the containerized model. Here, I won't discuss Kubernetes or how to write yaml files in depth, because this is beyond the scope of this article's topic.

You need to create a file called truss-falcon-deployment.yaml. Open that file and paste the following:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: truss-falcon-7b
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      component: truss-falcon-7b-layer
  template:
    metadata:
      labels:
        component: truss-falcon-7b-layer
    spec:
      containers:
        - name: truss-falcon-7b-container
          image: <your_docker_id>/falcon-7b-model:latest
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: truss-falcon-7b-service
  namespace: default
spec:
  type: ClusterIP
  selector:
    component: truss-falcon-7b-layer
  ports:
    - port: 8080
      protocol: TCP
      targetPort: 8080

Here's what's happening:

  • We tell Kubernetes that we want to create pods with our falcon-7b-model image. Make sure to replace <your_docker_id> with your actual id. If you didn't create your own Docker image and want to use mine, replace it with the following: htrivedi05/truss-falcon-7b:latest.
  • We enable GPU access for our container by setting the resource limit nvidia.com/gpu: 1. This tells Kubernetes to request one GPU for our container.
  • In order to interact with our model, we need to create a Kubernetes service that will run on port 8080.

Create a deployment by running the following command in a terminal:

kubectl create -f truss-falcon-deployment.yaml

If you run this command:

kubectl get deployments

You should see a display similar to the following:

NAME READY UP-TO-DATE AVAILABLE AGE
truss-falcon-7b 0/1 1 0 8s

It will take a few minutes for the deployment to change to the ready state. Remember that the model has to be downloaded from the HuggingFace page every time the container is restarted. You can check the progress of the container by running:

kubectl get pods
 
kubectl logs truss-falcon-7b-8fbb476f4-bggts

Change the pod name accordingly.

You need to look for the following in the logs:

  • Look for the print statement THE DEVICE INFERENCE IS RUNNING ON IS: cuda. This confirms that our container is properly connected to the GPU.

Next, you should see some print statements about the model file being downloaded.

Downloading (…)model.bin.index.json: 100%|██████████| 16.9k/16.9k [00:00<00:00, 1.92MB/s]
Downloading (…)l-00001-of-00002.bin: 100%|██████████| 9.95G/9.95G [02:37<00:00, 63.1MB/s]
Downloading (…)l-00002-of-00002.bin: 100%|██████████| 4.48G/4.48G [01:04<00:00, 69.2MB/s]
Downloading shards: 100%|██████████| 2/2 [03:42<00:00, 111.31s/it]

After downloading the model and creating the microservice, you should see the following output at the end of the log:

{"asctime": "2023-06-29 21:40:40,646", "levelname": "INFO", "message": "Completed model.load() execution in 330588 ms"}

From this message, we can confirm that the model is loaded and ready for inference tasks.

Model inference

We cannot call the model directly; instead, we must call the model's service.

You can get the name of the service by running the following command:

kubectl get svc

The output is as follows:

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.80.0.1 <none> 443/TCP 46m
truss-falcon-7b-service ClusterIP 10.80.1.96 <none> 8080/TCP 6m19s

What we want to call is the truss-falcon-7b service. In order to make the service accessible, we need to port forward it with the following command:

kubectl port-forward svc/truss-falcon-7b-service 8080

The output is as follows:

Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080

Beautiful, our model is served as a REST API endpoint at 127.0.0.1:8080. Open any Python script, such as main.py, and run the following code:

import requests

data = {"prompt": "Whats the most interesting thing about a falcon?"}
res = requests.post("http://127.0.0.1:8080/v1/models/model:predict", json=data)
print(res.json())

The output is as follows:

{'data': {'generated_text': 'Whats the most interesting thing about a falcon?\nFalcons are known for their incredible speed and agility in the air, as well as their impressive hunting skills. They are also known for their distinctive feathering, which can vary greatly depending on the species.'}}

Wow! We have successfully containerized the Falcon-7B model and deployed it as a microservice in production!

Feel free to try different prompts and see what the model returns.
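
To make experimenting easier, you can wrap the endpoint in a small helper. This is just a convenience sketch built on the request and response shapes shown above; the function name and timeout are my own choices, not part of Truss:

import requests

FALCON_URL = "http://127.0.0.1:8080/v1/models/model:predict"  # via the kubectl port-forward above


def ask_falcon(prompt: str, **generation_kwargs) -> str:
    """Send a prompt to the Truss endpoint and return the generated text.

    Extra keyword arguments are forwarded to the text-generation pipeline
    by the predict() method we wrote in model.py.
    """
    payload = {"prompt": prompt, **generation_kwargs}
    res = requests.post(FALCON_URL, json=payload, timeout=300)
    res.raise_for_status()
    return res.json()["data"]["generated_text"]


print(ask_falcon("What makes falcons such fast fliers?"))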

Shut down the cluster

Once you're happy with your Falcon 7B, you can delete your deployment by running:

kubectl delete -f truss-falcon-deployment.yaml

Next, go to Kubernetes Engine in Google Cloud and delete the Kubernetes cluster.

Note: Unless otherwise stated, all images in this article are provided by the original author.

Conclusion

Running and managing a production-grade model like ChatGPT is not easy. However, over time, developers will get better at deploying their own models to the cloud.

In this blog post, we covered everything needed to deploy an LLM into production at a basic level. To sum up, we packaged the model using Truss, containerized it using Docker, and deployed it in the cloud using Kubernetes. I know there is a lot of detail involved, and while it wasn't the easiest thing in the world to do, we did it anyway.

In conclusion, I hope you learned something interesting from this blog post. Thanks for reading!

