Model Service Mesh: model service management in the cloud-native era

Model Service Mesh is an architectural pattern for deploying and managing machine learning model services in a distributed environment. It provides a scalable, high-performance infrastructure for managing, deploying, and scheduling multiple model services, so that model deployment, version management, routing, and load balancing of inference requests can be handled more effectively.

The core idea of the model service mesh is to deploy models as scalable services and to manage and route those services through the mesh, which simplifies the management and operation of model services. By abstracting model services into orchestrated, scalable units, it makes model deployment, scaling, and version control easier. It also provides core functions such as load balancing, automatic scaling, and failure recovery to ensure the high availability and reliability of model services.

Models can be automatically scaled and load balanced based on the actual inference request load, enabling efficient model inference. The model service mesh also provides advanced functions such as traffic splitting, A/B testing, and canary (gray) releases to give finer control over model service traffic, making it easy to switch between and roll back different model versions. It also supports dynamic routing, which routes requests to the appropriate model service based on request properties such as model type, data format, or other metadata.
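
As an illustration of traffic splitting, a canary-style split between two versions of a model service can be expressed with an Istio-style VirtualService (ASM is built on Istio). This is only a minimal sketch: the host, subset names, and weights are hypothetical, and the v1/v2 subsets would need to be defined in a matching DestinationRule, which is not shown here.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-traffic-split        # hypothetical name
  namespace: modelmesh-serving
spec:
  hosts:
    - modelmesh-serving            # hypothetical service host
  http:
    - route:
        - destination:
            host: modelmesh-serving
            subset: v1             # current model version receives 90% of traffic
          weight: 90
        - destination:
            host: modelmesh-serving
            subset: v2             # new model version receives 10% of traffic
          weight: 10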

Alibaba Cloud Service Mesh (ASM) provides scalable, high-performance model service mesh capabilities for managing, deploying, and scheduling multiple model services, to better handle model deployment, version management, routing, and load balancing of inference requests. By using a model service mesh, developers can more easily deploy, manage, and scale machine learning models while gaining the high availability, resilience, and flexibility needed to meet varying business requirements.

01 Using the model service mesh for multi-model inference services

The model service mesh is implemented on top of KServe ModelMesh and is optimized for large-volume, high-density, and frequently changing model use cases. It intelligently loads models into memory and unloads them again to strike a balance between responsiveness and computation.

The model service mesh provides the following capabilities:

  • Cache management
    • Pods are managed as a distributed least recently used (LRU) cache.
    • Copies of models are loaded and unloaded based on frequency of use and current request volume.
  • Smart placement and loading
    • Model placement is balanced across Pods by cache age and request load.
    • Queues are used to handle concurrent model loading and minimize the impact on runtime traffic.
  • Resilience
    • Failed model loads are automatically retried in different Pods.
  • Ease of operation
    • Rolling model updates are handled automatically and seamlessly.

The following is an example of deploying a model. For the prerequisites, see [1].

1.1 Create the PVC storage claim

In the ACK cluster, create a storage claim (PVC) named my-models-pvc using the following YAML:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-models-pvc
  namespace: modelmesh-serving
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: alibabacloud-cnfs-nas
  volumeMode: Filesystem
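
Assuming the YAML above is saved as my-models-pvc.yaml (the file name here is only an example), the PVC can be created with:

kubectl apply -f my-models-pvc.yaml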

Then run the following command to check the status of the PVC:

kubectl get pvc -n modelmesh-serving

The expected output is similar to the following:

NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS            AGE
my-models-pvc    Bound    nas-379c32e1-c0ef-43f3-8277-9eb4606b53f8   1Gi        RWX            alibabacloud-cnfs-nas   2h

1.2 Create a Pod to access the PVC

To use the new PVC, we need to mount it as a volume in a Kubernetes Pod. We can then use this Pod to upload the model files to the persistent volume.

Let's deploy a pvc-access Pod and have it claim the PVC we requested earlier by specifying "my-models-pvc":

kubectl apply -n modelmesh-serving -f - <<EOF
---
apiVersion: v1
kind: Pod
metadata:
  name: "pvc-access"
spec:
  containers:
    - name: main
      image: ubuntu
      command: ["/bin/sh", "-ec", "sleep 10000"]
      volumeMounts:
        - name: "my-pvc"
          mountPath: "/mnt/models"
  volumes:
    - name: "my-pvc"
      persistentVolumeClaim:
        claimName: "my-models-pvc"
EOF

Confirm that the pvc-access Pod is running:

kubectl get pods -n modelmesh-serving | grep pvc-access

The expected output is similar to the following:

pvc-access 1/1     Running

1.3 Store the model on a persistent volume

Now we need to add our AI model to the storage volume. We will use an MNIST handwritten-digit recognition model trained with scikit-learn. A copy of the mnist-svm.joblib model file can be downloaded from the kserve/modelmesh-minio-examples repository [2].

Copy the mnist-svm.joblib model file to the /mnt/models folder on the pvc-access pod with the following command:

kubectl -n modelmesh-serving cp mnist-svm.joblib pvc-access:/mnt/models/

Execute the following command to confirm that the model file has been copied successfully:

kubectl -n modelmesh-serving exec -it pvc-access -- ls -alr /mnt/models/

You should get something like this:

-rw-r--r-- 1 501 staff 344817 Oct 30 11:23 mnist-svm.joblib

1.4 Deploy inference service

Next, we need to deploy a sklearn-mnist inference service:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-mnist
  namespace: modelmesh-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storage:
        parameters:
          type: pvc
          name: my-models-pvc
        path: mnist-svm.joblib
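
You can create this InferenceService by saving the YAML above to a file, for example sklearn-mnist-isvc.yaml (an illustrative name), and applying it:

kubectl apply -f sklearn-mnist-isvc.yaml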

After a few dozen seconds (depending on the image pull speed), the new inference service sklearn-mnist should be ready.

Run the following command:

kubectl get isvc -n modelmesh-serving

The expected output is similar to the following:

NAME            URL                                                READY
sklearn-mnist   grpc://modelmesh-serving.modelmesh-serving:8033   True

1.5 Running the inference service

Now we can use curl to send inference requests to our sklearn-mnist model. The request data is an array representing the grayscale values of the 64 pixels in the scanned digit image to be classified.

MODEL_NAME="sklearn-mnist"
ASM_GW_IP="<ASM gateway IP address>"
curl -X POST -k "http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer" -d '{"inputs": [{"name": "predict", "shape": [1, 64], "datatype": "FP32", "contents": {"fp32_contents": [0.0, 0.0, 1.0, 11.0, 14.0, 15.0, 3.0, 0.0, 0.0, 1.0, 13.0, 16.0, 12.0, 16.0, 8.0, 0.0, 0.0, 8.0, 16.0, 4.0, 6.0, 16.0, 5.0, 0.0, 0.0, 5.0, 15.0, 11.0, 13.0, 14.0, 0.0, 0.0, 0.0, 0.0, 2.0, 12.0, 16.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.0, 16.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 16.0, 16.0, 16.0, 7.0, 0.0, 0.0, 0.0, 0.0, 11.0, 13.0, 12.0, 1.0, 0.0]}}]}'

The JSON response should look like the following, predicting that the scanned digit is "8":

{
  "modelName": "sklearn-mnist__isvc-3c10c62d34",
  "outputs": [
    {
      "name": "predict",
      "datatype": "INT64",
      "shape": [
        "1",
        "1"
      ],
      "contents": {
        "int64Contents": [
          "8"
        ]
      }
    }
  ]
}

02 Customize the model runtime using the model service mesh

The model service mesh (ModelMesh for short) is optimized for the deployment and operation of large-capacity, high-density, and frequently changing model inference services. It can intelligently load models into memory and unload them again to strike the best balance between responsiveness and computation.

ModelMesh integrates the following model server runtimes by default:

  • Triton Inference Server: NVIDIA's inference server, suitable for frameworks such as TensorFlow, PyTorch, TensorRT, or ONNX.
  • MLServer: Seldon's Python-based inference server, suitable for frameworks such as scikit-learn, XGBoost, or LightGBM.
  • OpenVINO Model Server: Intel's inference server for frameworks such as Intel OpenVINO or ONNX.
  • TorchServe: supports PyTorch models, including eager-mode models.

If these model servers cannot meet your specific requirements, for example if you need custom inference logic or your model requires a framework that is not on the list above, you can create a custom serving runtime to extend the supported set.

For details, please refer to [3].

03 Serving a large language model (LLM)

A large language model (LLM) is a neural network language model with hundreds of millions to hundreds of billions of parameters, such as GPT-3, GPT-4, PaLM, and PaLM 2. The following describes how to serve an LLM with the model service mesh.

For the prerequisites, see [4].

3.1 Building a custom runtime

Build a custom runtime that serves a HuggingFace LLM with a prompt-tuning configuration. The default settings in this example use our pre-built custom runtime image and a pre-built prompt-tuning configuration.

3.1.1 Implement a class that inherits from MLServer MLModel

The peft_model_server.py file in the kfp-tekton/samples/peft-modelmesh-pipeline directory [5] contains all of the code for serving a HuggingFace LLM with a prompt-tuning configuration.

The _load_model function below shows how we select the pre-trained LLM together with the PEFT prompt-tuning configuration. The tokenizer is also defined as part of the model, so it can encode and decode the raw string inputs of inference requests without requiring users to preprocess their input into tensor bytes.

from typing import List

from mlserver import MLModel, types
from mlserver.codecs import decode_args

from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os

class PeftModelServer(MLModel):
    async def load(self) -> bool:
        self._load_model()
        self.ready = True
        return self.ready

    @decode_args
    async def predict(self, content: List[str]) -> List[str]:
        return self._predict_outputs(content)

    def _load_model(self):
        # Load the base pre-trained model, apply the PEFT prompt-tuning weights,
        # and keep the tokenizer so raw strings can be encoded and decoded during inference.
        model_name_or_path = os.environ.get("PRETRAINED_MODEL_PATH", "bigscience/bloomz-560m")
        peft_model_id = os.environ.get("PEFT_MODEL_ID", "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, local_files_only=True)
        config = PeftConfig.from_pretrained(peft_model_id)
        self.model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
        self.model = PeftModel.from_pretrained(self.model, peft_model_id)
        self.text_column = os.environ.get("DATASET_TEXT_COLUMN_NAME", "Tweet text")
        return

    def _predict_outputs(self, content: List[str]) -> List[str]:
        # Encode each input string, generate a short completion, and decode it back to text.
        output_list = []
        for input in content:
            inputs = self.tokenizer(
                f'{self.text_column} : {input} Label : ',
                return_tensors="pt",
            )
            with torch.no_grad():
                inputs = {k: v for k, v in inputs.items()}
                outputs = self.model.generate(
                    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=10, eos_token_id=3
                )
                outputs = self.tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)
            output_list.append(outputs[0])
        return output_list

3.1.2 Build Docker image

After implementing the model class, we need to package its dependencies (including MLServer) into an image that supports the ServingRuntime resource. Refer to the following Dockerfile to build the image.

# TODO: choose appropriate base image, install Python, MLServer, and
# dependencies of your MLModel implementation
FROM python:3.8-slim-buster
RUN pip install mlserver peft transformers datasets
# ...

# The custom `MLModel` implementation should be on the Python search path
# instead of relying on the working directory of the image. If using a
# single-file module, this can be accomplished with:
COPY --chown=${USER} ./peft_model_server.py /opt/peft_model_server.py
ENV PYTHONPATH=/opt/

# environment variables to be compatible with ModelMesh Serving
# these can also be set in the ServingRuntime, but this is recommended for
# consistency when building and testing
ENV MLSERVER_MODELS_DIR=/models/_mlserver_models \
    MLSERVER_GRPC_PORT=8001 \
    MLSERVER_HTTP_PORT=8002 \
    MLSERVER_LOAD_MODELS_AT_STARTUP=false \
    MLSERVER_MODEL_NAME=peft-model

# With this setting, the implementation field is not required in the model
# settings which eases integration by allowing the built-in adapter to generate
# a basic model settings file
ENV MLSERVER_MODEL_IMPLEMENTATION=peft_model_server.PeftModelServer

CMD mlserver start ${MLSERVER_MODELS_DIR}
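
To make the image available to your cluster, build it and push it to a container registry. The image name below is only a placeholder; the ServingRuntime in the next step references registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest.

docker build -t <your-registry>/peft-model-server:latest .
docker push <your-registry>/peft-model-server:latest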

3.1.3 Create a new ServingRuntime resource

You can use the YAML template in the following code block to create a new ServingRuntime resource and point it to the image you just created.

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: peft-model-server
  namespace: modelmesh-serving
spec:
  supportedModelFormats:
    - name: peft-model
      version: "1"
      autoSelect: true
  multiModel: true
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  containers:
    - name: mlserver
      image: registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest
      env:
        - name: MLSERVER_MODELS_DIR
          value: "/models/_mlserver_models/"
        - name: MLSERVER_GRPC_PORT
          value: "8001"
        - name: MLSERVER_HTTP_PORT
          value: "8002"
        - name: MLSERVER_LOAD_MODELS_AT_STARTUP
          value: "true"
        - name: MLSERVER_MODEL_NAME
          value: peft-model
        - name: MLSERVER_HOST
          value: "127.0.0.1"
        - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
          value: "-1"
        - name: PRETRAINED_MODEL_PATH
          value: "bigscience/bloomz-560m"
        - name: PEFT_MODEL_ID
          value: "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM"
        # - name: "TRANSFORMERS_OFFLINE"
        #   value: "1" 
        # - name: "HF_DATASETS_OFFLINE"
        #   value: "1"   
      resources:
        requests:
          cpu: 500m
          memory: 4Gi
        limits:
          cpu: "5"
          memory: 5Gi
  builtInAdapter:
    serverType: mlserver
    runtimeManagementPort: 8001
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000

Then use the kubectl apply command to create the ServingRuntime resource. You should then see the new custom runtime in the ModelMesh deployment.
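
For example, assuming the YAML above is saved as peft-model-server-runtime.yaml (an illustrative file name), you can create the resource and then list the runtimes to confirm that it was registered:

kubectl apply -f peft-model-server-runtime.yaml
kubectl get servingruntimes -n modelmesh-serving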

3.2 Deploy LLM service

To deploy the model using your newly created runtime, you need to create an InferenceService resource to serve the model. This resource is the main interface that KServe and ModelMesh use to manage models, and it represents the model's logical endpoint for inference.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: peft-demo
  namespace: modelmesh-serving
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: peft-model
      runtime: peft-model-server
      storage:
        key: localMinIO
        path: sklearn/mnist-svm.joblib

In the preceding code block, the InferenceService is named peft-demo and its model format is declared as peft-model, the same format used by the example custom runtime created earlier. An optional field runtime is also passed, explicitly telling ModelMesh to use the peft-model-server runtime to deploy this model.
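
As with the earlier example, you can apply this manifest and then check that the InferenceService becomes ready (the file name here is illustrative):

kubectl apply -f peft-demo-isvc.yaml
kubectl get isvc peft-demo -n modelmesh-serving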

3.3 Running the inference service

Now we can use curl to send inference requests to the LLM model service we deployed above.

MODEL_NAME="peft-demo"
ASM_GW_IP="<ASM gateway IP address>"
curl -X POST -k http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer -d @./input.json

Where input.json represents the request data:

{
  "inputs": [
    {
      "name": "content",
      "shape": [1],
      "datatype": "BYTES",
      "contents": {"bytes_contents": ["RXZlcnkgZGF5IGlzIGEgbmV3IGJpbm5pbmcsIGZpbGxlZCB3aXRoIG9wdGlvbnBpZW5pbmcgYW5kIGhvcGU="]}
    }
  ]
}

bytes_contents corresponds to the base64 encoding of the string "Every day is a new beginning, filled with opportunities and hope".
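
If you want to build a request payload for your own input text, you can base64-encode the string first and place the result in bytes_contents, for example:

echo -n "Every day is a new beginning, filled with opportunities and hope" | base64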

The JSON response should look like the following:

{
  "modelName": "peft-demo__isvc-5c5315c302",
  "outputs": [
    {
      "name": "output-0",
      "datatype": "BYTES",
      "shape": [
        "1",
        "1"
      ],
      "parameters": {
        "content_type": {
          "stringParam": "str"
        }
      },
      "contents": {
        "bytesContents": [
          "VHdlZXQgdGV4dCA6IEV2ZXJ5IGRheSBpcyBhIG5ldyBiaW5uaW5nLCBmaWxsZWQgd2l0aCBvcHRpb25waWVuaW5nIGFuZCBob3BlIExhYmVsIDogbm8gY29tcGxhaW50"
        ]
      }
    }
  ]
}

The base64 decoded content of bytesContents is:

Tweet text : Every day is a new binning, filled with optionpiening and hope Label : no complaint

This shows that the inference request to the LLM service above returned the expected result.

04 Summary

Alibaba Cloud Service Mesh (ASM) provides scalable, high-performance model service mesh capabilities for managing, deploying, and scheduling multiple model services, to better handle model deployment, version management, routing, and load balancing of inference requests.

You are welcome to try it out: https://www.aliyun.com/product/servicemesh

Related Links:

[1] Prerequisites for using the model service mesh for multi-model inference services

https://help.aliyun.com/zh/asm/user-guide/multi-model-inference-service-using-model-service-mesh?spm=a2c4g.11186623.0.0.7c4e6561k1qyJV#213af6d078xu7

[2] kserve/modelmesh-minio-examples repository

https://github.com/kserve/modelmesh-minio-examples/blob/main/sklearn/mnist-svm.joblib

[3] Customizing the model runtime using the model service mesh

https://help.aliyun.com/zh/asm/user-guide/customizing-the-model-runtime-using-the-model-service-mesh?spm=a2c4g.11186623.0.0.1db77614Vw96Eu

[4] Prerequisites for serving a large language model (LLM)

https://help.aliyun.com/zh/asm/user-guide/services-for-the-large-language-model-llm?spm=a2c4g.11186623.0.0.29777614EEBYWt#436fc73079euz

[5] kfp-tekton/samples/peft-modelmesh-pipeline directory

https://github.com/kubeflow/kfp-tekton

Author: Wang Xining


This article is original content from Alibaba Cloud and may not be reproduced without permission.
