NVIDIA Triton Getting Started Tutorial

1 Relevant preliminaries

  • Model: A network (structure + parameters) containing a large number of parameters, ranging in size from roughly 10 MB to 10 GB.
  • Model format: The same model can be stored in different formats (similar to audio and video files). The current mainstream formats are torch, tf, onnx, and trt, and tf itself includes three formats.
  • Model inference: Various operations are performed on the input and the network's parameters to produce an output. This is a compute-intensive task that requires GPU acceleration.
  • Model inference engine: A tool that makes model inference faster. Using such a tool often requires a specific model format. The current mainstream inference engines include trt and ort.
  • Model inference framework: Encapsulates the model inference process so that models can be added, removed, and replaced more conveniently; more advanced frameworks also provide features such as load balancing, model monitoring, and automatic generation of gRPC and HTTP interfaces. They are built for deployment.

Triton, introduced next, is currently an excellent model inference framework.

2 Getting started with Triton

Next, I will walk you through running Triton step by step, so that you can understand what Triton does.

2.1 Register NGC platform

NGC can be understood as NVIDIA's official software repository, which contains a lot of pre-built software, Docker images, and so on. We need to register on NGC and generate an API key, which is used to log in to NGC from Docker and pull images from it.

The NGC website address is: https://ngc.nvidia.com

After registering and logging in, you can generate an API key in the profile UI. This key allows you to log in to NGC from Docker or other appropriate places and download resources such as Docker images from the repository.

2.2 Login

In the CLI, run docker login nvcr.io

Then enter the username and the key you generated in the previous step. The username is $oauthtoken (don't forget the $ symbol, and don't use your own username).
Finally, the words Login Succeeded will appear, which means the login was successful.

2.3 Pull the image

docker pull nvcr.io/nvidia/tritonserver:22.04-py3

You can also choose to pull other versions of Triton. The image is several gigabytes in size, so wait patiently. The image does not distinguish between GPU and CPU; the same image works for both.

2.4 Build the model directory

Execute the command mkdir -p /home/triton/model_repository/fc_model_pt/1.
Here, /home/triton/model_repository is your model repository; all models live under this directory. When the container is started, it will be mapped to the /models folder inside the container. fc_model_pt can be understood as the storage directory of one particular model, such as a sentiment classification model. The name is not restricted, but it is best to pick a self-explanatory one, and 1 means the version is 1.
The directory structure of the model warehouse is as follows:

  <model-repository-path>/        # model repository directory
    <model-name>/                 # model name
      [config.pbtxt]              # model configuration file
      [<output-labels-file> ...]  # label files (optional)
      <version>/                  # the model for this version
        <model-definition-file>
      <version>/
        <model-definition-file>
      ...
    <model-name>/
      [config.pbtxt]
      [<output-labels-file> ...]
      <version>/
        <model-definition-file>
      <version>/
        <model-definition-file>
      ...
    ...

2.5 Generate a torch model for inference

Create a torch model and save it with torchscript:

import torch
import torch.nn as nn


class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.embedding = nn.Embedding(100, 8)
        self.fc = nn.Linear(8, 4)
        self.fc_list = nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])

    def forward(self, input_ids):
        word_emb = self.embedding(input_ids)
        output1 = self.fc(word_emb)
        output2 = self.fc_list(word_emb)

        return output1, output2


if __name__ == "__main__":
    model = SimpleModel()
    ipt = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.long)
    
    script_model = torch.jit.trace(model, ipt, strict=True)
    torch.jit.save(script_model, "model.pt")

// This code defines a simple PyTorch model, traces it in strict mode with an example input tensor, and saves the traced model to a file named "model.pt".
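
If you want, you can sanity-check the exported TorchScript file by loading it back and running the same example input through it; a minimal sketch:

import torch

# Load the traced model and run the example input through it again.
loaded_model = torch.jit.load("model.pt")
ipt = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.long)
output1, output2 = loaded_model(ipt)
print(output1.shape)  # expected: torch.Size([2, 3, 4])
print(output2.shape)  # expected: torch.Size([2, 3, 8])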

After generating the model, copy it to the directory you just created. Note that it should be placed in the directory corresponding to the version number, and the model file name must be model.pt.

Triton supports many model formats; this is just a torch example.

2.6 Writing configuration files

For the framework to recognize the model we just placed, we need to write a configuration file, config.pbtxt. The specific content is as follows:

name: "fc_model_pt" # 模型名,也是目录名
platform: "pytorch_libtorch" # 模型对应的平台,本次使用的是torch,不同格式的对应的平台可以在官方文档找到
max_batch_size : 64 # 一次送入模型的最大bsz,防止oom
input [{name: "input__0" # 输入名字,对于torch来说名字于代码的名字不需要对应,但必须是<name>__<index>的形式,注意是2个下划线,写错就报错data_type: TYPE_INT64 # 类型,torch.long对应的就是int64,不同语言的tensor类型与triton类型的对应关系可以在官方文档找到dims: [ -1 ]  # -1 代表是可变维度,虽然输入是二维的,但是默认第一个是bsz,所以只需要写后面的维度就行(无法理解的操作,如果是[-1,-1]调用模型就报错)}
]
output [{name: "output__0" # 命名规范同输入data_type: TYPE_FP32dims: [ -1, -1, 4 ]},{name: "output__1"data_type: TYPE_FP32dims: [ -1, -1, 8 ]}
]

This model configuration file is probably the most complicated part of the whole Triton setup; most of the work of putting a model online is writing configuration files. It is hard to explain fully in a few words, so this is only a brief introduction; see the official documentation for details. Note that the configuration file should not be placed in the version directory but in the model directory, that is, config.pbtxt sits at the same level as the version directories.
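
To make this concrete, after sections 2.4 to 2.6 the fc_model_pt part of the repository looks like this:

/home/triton/model_repository/
`-- fc_model_pt
    |-- config.pbtxt
    `-- 1
        `-- model.pt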

The official documentation describes the default assumption that the first dimension is the batch size:

As discussed above, Triton assumes that batching occurs along the first dimension which is not listed in the input or output tensor dims. However, for shape tensors, batching occurs at the first shape value. For the above example, an inference request must provide inputs with the following shapes.

2.7 Create a container and start it

Execute the command:

docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 \
  -v /home/triton/model_repository/:/models \
  nvcr.io/nvidia/tritonserver:22.04-py3 \
  tritonserver \
  --model-repository=/models

If your system has an available GPU, you can add the --gpus=1 parameter so that the inference framework can use GPU acceleration; this parameter should be placed right after run.
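
For example, with a GPU the same command could be written as follows (only the --gpus flag added; everything else unchanged):

docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 \
  -v /home/triton/model_repository/:/models \
  nvcr.io/nvidia/tritonserver:22.04-py3 \
  tritonserver \
  --model-repository=/models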

2.8 Test the interface

If you followed my tutorial step by step, you should be able to start the container successfully. Next, we can write a piece of code to test whether the interface works. The endpoint is:
http://localhost:8000/v2/models/{model_name}/versions/{version}/infer
The test code is as follows:

import requests

if __name__ == "__main__":
    request_data = {
        "inputs": [
            {
                "name": "input__0",
                "shape": [2, 3],
                "datatype": "INT64",
                "data": [[1, 2, 3], [4,5,6]]
            }
        ],
        "outputs": [
            {"name": "output__0"}, 
            {"name": "output__1"}
        ]
    }
    
    res = requests.post(
        url="http://localhost:8000/v2/models/fc_model_pt/versions/1/infer",
        json=request_data
    ).json()
    
    print(res)


// This code uses Python's requests library to send a POST request to the locally running server, which hosts the fc_model_pt model. The POST request contains the input data and the desired outputs, and the server's response is printed.

After executing the code, you will get the corresponding output:

{'model_name': 'fc_model_pt','model_version': '1','outputs': [{'name': 'output__0','datatype': 'FP32','shape': [2, 3, 4],'data': [1.152763843536377, 1.1349767446517944, -0.6294105648994446, 0.8846281170845032, 0.059508904814720154, -0.06066855788230896, -1.497096061706543, -1.192716121673584, 0.7339693307876587, 0.28189709782600403, 0.3425392210483551, 0.08894850313663483, 0.48277992010116577, 0.9581012725830078, 0.49371692538261414, -1.0144696235656738, -0.03292369842529297, 0.3465275764465332, -0.5444514751434326, -0.6578375697135925, 1.1234807968139648, 1.1258794069290161, -0.24797165393829346, 0.4530307352542877]}, {'name': 'output__1','datatype': 'FP32','shape': [2, 3, 8],'data': [-0.28994596004486084, 0.0626179575920105, -0.018645435571670532, -0.3376324474811554, -0.35003775358200073, 0.2884367108345032, -0.2418503761291504, -0.5449661016464233, -0.48939061164855957, -0.482677698135376, -0.27752232551574707, -0.26671940088272095, -0.2171783447265625, 0.025355860590934753, -0.3266356587409973, -0.06301657110452652, -0.1746724545955658, -0.23552510142326355, 0.10773542523384094, -0.4255935847759247, -0.47757795453071594, 0.4816707670688629, -0.16988427937030792, -0.35756853222846985, -0.06549499928951263, -0.04733048379421234, -0.035484105348587036, -0.4210450053215027, -0.07763291895389557, 0.2223128080368042, -0.23027443885803223, -0.4195460081100464, -0.21789231896400452, -0.19235755503177643, -0.16810789704322815, -0.34017443656921387, -0.05121977627277374, 0.08258339017629623, -0.2775516211986542, -0.27720844745635986, -0.25765007734298706, -0.014576494693756104, 0.0661710798740387, -0.38623639941215515, -0.45036202669143677, 0.3960753381252289, -0.20757021009922028, -0.511818528175354]}]
}

I don't know whether I am using it wrong, but from a user-experience perspective this inference interface has a few rough edges:

  1. The datatype is clearly specified in config.pbtxt, yet it still has to be specified in the request; if it is omitted, an error is reported.
  2. The input shape also has to be specified, otherwise an error is reported.
  3. The datatype values are inconsistent with those in config.pbtxt: if datatype is set to TYPE_INT64, an error is reported; it must be INT64.
  4. The data in the output is a flat 1-dimensional array, which the caller has to reshape according to the returned shape (see the sketch after this list).
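
For point 4, here is a minimal sketch of reshaping the flat output with NumPy, assuming res is the response dict returned by the test code above:

import numpy as np

# Reshape each flat output back into an array with the shape reported by the server.
outputs = {
    out["name"]: np.array(out["data"], dtype=np.float32).reshape(out["shape"])
    for out in res["outputs"]
}
print(outputs["output__0"].shape)  # (2, 3, 4)
print(outputs["output__1"].shape)  # (2, 3, 8)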

In addition to writing raw HTTP calls like I did, you can also use the official client library: pip install tritonclient[http]. The repository is here: https://github.com/triton-inference-server/client.
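
Below is a rough sketch of the equivalent call using the official HTTP client; the class and method names follow the tritonclient package, but check the client repository for the exact API of the version you install:

import numpy as np
import tritonclient.http as httpclient

# Connect to the local Triton server and build the input tensor.
client = httpclient.InferenceServerClient(url="localhost:8000")
ipt = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)
infer_input = httpclient.InferInput("input__0", list(ipt.shape), "INT64")
infer_input.set_data_from_numpy(ipt)

# Request both outputs and run inference against fc_model_pt.
requested_outputs = [
    httpclient.InferRequestedOutput("output__0"),
    httpclient.InferRequestedOutput("output__1"),
]
result = client.infer(model_name="fc_model_pt", inputs=[infer_input], outputs=requested_outputs)
print(result.as_numpy("output__0").shape)  # already returned as a shaped array, e.g. (2, 3, 4)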

3 Using the advanced features of Triton

The tutorial in the previous section only used the basic features of Triton, so your rank can only be called Gold. The following introduces some of Triton's advanced features.

3.1 Model Parallelism

Model parallelism here means serving multiple models, or multiple instances of the same model, at the same time. It is not complicated to achieve: you only need to modify the configuration parameters. By default, Triton deploys one instance of a model on each available GPU to achieve parallelism.
Next, I test several configurations so you can see the effect of model parallelism. My setup is two (rented) RTX 3060 cards for the multi-model tests.
The stress test command is ab -k -c 5 -n 500 -p ipt.json http://localhost:8000/v2/models/fc_model_pt/versions/1/infer
This command sends 500 requests in total to the interface using 5 concurrent connections.
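
Here ipt.json is assumed to contain the same JSON request body used in section 2.8; a minimal sketch for generating it:

import json

# Write the request body from section 2.8 to ipt.json for use with ab.
request_data = {
    "inputs": [
        {"name": "input__0", "shape": [2, 3], "datatype": "INT64",
         "data": [[1, 2, 3], [4, 5, 6]]}
    ],
    "outputs": [{"name": "output__0"}, {"name": "output__1"}],
}

with open("ipt.json", "w") as f:
    json.dump(request_data, f)
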
The test configuration and corresponding QPS are as follows:

  • 1 card in total; each card runs 1 instance: QPS is 603
  • 2 cards in total; each card runs 1 instance: QPS is 1115
  • 2 cards in total; each card runs 2 instances: QPS is 1453
  • 2 cards in total; each card runs 2 instances; put 2 instances on the CPU at the same time: QPS is 972

The conclusions are as follows: using more cards improves performance; multiple instances per card further improve concurrency; adding CPU instances slows things down, mainly because the CPU is too slow.

The following are the configuration items corresponding to the tests above; just copy them into config.pbtxt:

# 2 cards in total; 2 instances per card
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]

# 2 cards in total; 2 instances per card; plus 2 instances on the CPU
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 1 ]
  },
  {
    count: 2
    kind: KIND_CPU
  }
]

As for how many cards are used, that is specified by the --gpus parameter of docker run.

3.2 Dynamic batching

Dynamic batching means that when a request arrives, Triton does not run inference immediately; it waits a few milliseconds and stitches all the requests received within that window into a single batch for inference. This makes full use of the hardware and improves throughput. The downside is that the latency for an individual user becomes longer, so it is not suitable for low-frequency request scenarios. Using dynamic batching is very simple: just add dynamic_batching { } to config.pbtxt; see the documentation for the detailed parameters. With this minimal form, the upper limit of the batch size is max_batch_size. In my stress test it gave roughly a 50% QPS improvement; in short, it just works.
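
As a sketch, a slightly more explicit configuration could look like the following; the parameter names come from the Triton model configuration documentation and the values are purely illustrative:

dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}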

PS: This optimization method is basically cheating at stress tests...

3.3 Custom backend

A custom backend means writing the inference process yourself. Normally the whole inference process is handled directly by a model, but some inference pipelines also include business logic. For example, a pipeline may require two models, where the output of the first model is modified after some logical checks and then used as the input of the second model. The simplest approach is to call the Triton service twice: first call the first model, apply the business logic and modify the output, and then call the second model. However, in Triton we can write a custom backend that contains the whole calling process, which simplifies the invocation and avoids part of the HTTP transmission overhead.
The example I just described is actually a pipeline that includes business logic; this approach is called BLS (Business Logic Scripting).

Implementing a custom backend is also very simple, and the process is basically the same as deploying the torch model above: first create a model directory, put a config.pbtxt in it, then create a version directory, and put a model.py file inside it, which contains the inference logic. To illustrate the directory structure, here is the resulting model repository tree:

model_repository/        # model repository
|-- custom_model         # our custom-backend model directory
|   |-- 1                # version
|   |   `-- model.py     # the model's Python file, containing mainly the inference logic
|   `-- config.pbtxt     # configuration file
`-- fc_model_pt          # the model introduced in the previous section
    |-- 1
    |   `-- model.pt
    `-- config.pbtxt

If you understood the previous section, you will notice that the directory layout of a custom backend is the same as that of a normal model; the only difference is that the model file changes from network weights to code you write yourself. Here are the contents of model.py and config.pbtxt, which you can copy and paste directly:

import json
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """
    Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """
        `initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
            Both keys and values are strings. The dictionary keys and values are:
            * model_config: A JSON string containing the model configuration
            * model_instance_kind: A string containing model instance kind
            * model_instance_device_id: A string containing model instance device ID
            * model_repository: Model repository path
            * model_version: Model version
            * model_name: Model name
        """
        # You must parse model_config. JSON string is not parsed here
        self.model_config = model_config = json.loads(args['model_config'])

        # Get output__0 configuration
        output0_config = pb_utils.get_output_config_by_name(model_config, "output__0")

        # Get output__1 configuration
        output1_config = pb_utils.get_output_config_by_name(model_config, "output__1")

        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(output0_config['data_type'])
        self.output1_dtype = pb_utils.triton_string_to_numpy(output1_config['data_type'])

    def execute(self, requests):
        """
        requests : list
            A list of pb_utils.InferenceRequest

        Returns
        -------
        list
            A list of pb_utils.InferenceResponse. The length of this list must
            be the same as `requests`
        """
        output0_dtype = self.output0_dtype
        output1_dtype = self.output1_dtype

        responses = []

        # Every Python backend must iterate over every one of the requests
        # and create a pb_utils.InferenceResponse for each of them.
        for request in requests:
            # Get the input tensor from the request
            in_0 = pb_utils.get_input_tensor_by_name(request, "input__0")

            # Fabricate the first output ourselves, pretending there is some business logic
            out_0 = np.array([1, 2, 3, 4, 5, 6, 7, 8])  # hard-coded for demonstration
            out_tensor_0 = pb_utils.Tensor("output__0", out_0.astype(output0_dtype))

            # For the second output, call fc_model_pt to get the result
            inference_request = pb_utils.InferenceRequest(
                model_name='fc_model_pt',
                requested_output_names=['output__0', 'output__1'],
                inputs=[in_0]
            )
            inference_response = inference_request.exec()
            out_tensor_1 = pb_utils.get_output_tensor_by_name(inference_response, 'output__1')

            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor_0, out_tensor_1])
            responses.append(inference_response)

        return responses

    def finalize(self):
        """
        `finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')

// This code is the basic skeleton of a Python backend model for Triton. The initialize method is called once when the model is loaded, execute is called whenever inference requests arrive and returns the responses, and finalize is called when the model is unloaded and can optionally perform cleanup.

The file content of config.pbtxt:

name: "custom_model"
backend: "python"
input [
  {
    name: "input__0"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ -1, -1, 4 ]
  },
  {
    name: "output__1"
    data_type: TYPE_FP32
    dims: [ -1, -1, 8 ]
  }
]

After the above work is completed, you can start the service and check the results. You can copy my test code directly:

import requests

if __name__ == "__main__":
    request_data = {
        "inputs": [
            {
                "name": "input__0",
                "shape": [1, 2],
                "datatype": "INT64",
                "data": [[1, 2]]
            }
        ],
        "outputs": [
            {"name": "output__0"}, 
            {"name": "output__1"}
        ]
    }
    
    res = requests.post(
        url="http://localhost:8000/v2/models/fc_model_pt/versions/1/infer",
        json=request_data
    ).json()
    print(res)
    
    res = requests.post(
        url="http://localhost:8000/v2/models/custom_model/versions/1/infer",
        json=request_data
    ).json()
    print(res)


// This code uses Python's requests library to send POST requests to the locally running server, which hosts two models: fc_model_pt and custom_model. The POST requests contain the input data and the desired outputs, and the server's responses are printed.

Looking at the results, you can see that output__0 differs between the two calls while output__1 is the same. This matches the logic of the model.py we wrote: custom_model's output__0 is the hard-coded dummy array, while its output__1 is obtained by forwarding the input to fc_model_pt, so it is identical to the direct call.

PS: A custom backend avoids the transmission delay caused by repeatedly calling an NLG model during generation.
