Implementation and comparison of Triton Pipelines

In the article on the deployment of YOLOv5 Triton Pipelines, the two ways of implementing Triton Pipelines, BLS and Ensemble, were briefly introduced. In the benchmark of the high-performance deployment of the YOLOv5 Triton service, the two deployment approaches, Pipelines and All in TensorRT Engine, were performance tested. This article compares BLS and Ensemble in more detail and also interprets the results of that performance test.

Related code links

1. Python Backend

1.1 Implementation method and structure

BLS is a special python backend that implements Pipelines by calling other model services from within the python backend. The structure of the python backend is as follows:

(Figure: structure of the Triton python backend)

  • Inter-process communication (IPC)

    Due to GIL limitations, the python backend supports multi-instance deployment by starting a separate process, the python stub process (C++), for each model instance. Since this is a multi-process setup, shared memory is needed for communication between the python model instance and the Triton main process. Specifically, each python stub process is allocated a shm block in shared memory and communicates with the python backend agent (C++) through that shm block.

  • Data flow

    The shm block schedules and relays Input and Output through a Request MessageQ and a Response MessageQ; the two queues are implemented with producer-consumer logic:

    1. Requests sent to the Triton server are placed into the Request MessageQ by the python backend agent (C++).
    2. The python stub process takes the Input out of the Request MessageQ, hands it to the python model instance for inference, and then puts the Output into the Response MessageQ.
    3. The python backend agent (C++) then takes the Output out of the Response MessageQ, packages it into a response, and returns it to the Triton server main process.

    Examples are as follows:

    responses = []
    for request in requests:
        input_tensor = pb_utils.get_input_tensor_by_name(
            request, 'input')

        # INFER_FUNC stands for the python backend's core inference logic
        output_tensor = INFER_FUNC(input_tensor)

        inference_response = pb_utils.InferenceResponse(
            output_tensors=[output_tensor])
        responses.append(inference_response)
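
For context, here is a minimal sketch (not the article's code) of how such a loop typically sits inside a python backend model.py; the class and method names follow the python backend convention, and instead of real inference the sketch simply echoes the input back as an output named 'output':

    import triton_python_backend_utils as pb_utils

    class TritonPythonModel:
        def initialize(self, args):
            # Load weights / build the model here; `args` carries the model
            # config and instance information provided by Triton.
            pass

        def execute(self, requests):
            responses = []
            for request in requests:
                input_tensor = pb_utils.get_input_tensor_by_name(request, 'input')
                # Real code would run the model here; this sketch echoes the input.
                output_tensor = pb_utils.Tensor('output', input_tensor.as_numpy())
                responses.append(
                    pb_utils.InferenceResponse(output_tensors=[output_tensor]))
            return responses

        def finalize(self):
            # Release resources when the model is unloaded.
            pass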
    

1.2 Notes

  • You need to manually manage whether a tensor lives on the CPU or the GPU; the instance_group {kind: KIND_GPU} setting in the config does not control this for the python backend.
  • Inputs are not automatically batched. You need to manually convert the request list into batches; this is the same for all backends.
  • By default, the python backend moves input tensors to the CPU before handing them to the model for inference. Set FORCE_CPU_ONLY_INPUT_TENSORS to no to avoid memory copies between host and device as much as possible (see the config sketch after this list).
  • Data exchange between a python backend model instance and the Triton server goes through shared memory, so each instance needs a fairly large shared-memory region, at least 64MB.
  • If performance becomes a bottleneck, especially when the code contains many loops, you need to switch to a C++ backend.
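
The flag mentioned above is set per model in config.pbtxt via the parameters field. A minimal sketch (the model name and max_batch_size are illustrative):

    name: "nms"
    backend: "python"
    max_batch_size: 8
    parameters: {
      key: "FORCE_CPU_ONLY_INPUT_TENSORS"
      value: { string_value: "no" }
    }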

2. BLS

BLS (Business Logic Scripting) is a special python backend that calls other model services from python code. Typical usage scenario: dynamically composing already-deployed model services based on some logical conditions.

2.1 BLS process

(Figure: BLS call flow)

The part above the dotted line shows the general way of calling the python backend, and the part below shows calling other model services from within the python backend. The overall process can be summarized as follows (a minimal code sketch of the call appears after the list):

  1. The python model instance processes the received Input tensor.
  2. The python model instance initiates a request through a BLS call.
  3. The request is placed into the shm block through the python stub process.
  4. The python backend agent takes the BLS input out of the shm block and sends it to the specified model through the Triton C API to perform inference.
  5. The Triton python backend agent writes the inference Output into the shm block.
  6. The BLS Output is taken out of the shm block through the python stub process, encapsulated into a BLS response, and returned to the python model instance.
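
A minimal sketch of such a BLS call inside execute(), assuming a downstream model named simple_yolov5 with input "images" and output "output" (these names are illustrative, not the article's exact code):

    # Inside TritonPythonModel.execute(), handling one request.
    images = pb_utils.get_input_tensor_by_name(request, 'images')

    # Build and execute a BLS request against another deployed model service.
    infer_request = pb_utils.InferenceRequest(
        model_name='simple_yolov5',
        requested_output_names=['output'],
        inputs=[images])
    infer_response = infer_request.exec()

    if infer_response.has_error():
        raise pb_utils.TritonModelException(infer_response.error().message())

    # The BLS output can now be post-processed by this python model instance.
    bls_output = pb_utils.get_output_tensor_by_name(infer_response, 'output')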

2.2 Notes

  • Position of the input tensor
    By default, the python backend moves input tensors to the CPU before handing them to the model for inference; setting FORCE_CPU_ONLY_INPUT_TENSORS to no avoids this behavior, so that the position of an input tensor depends on where the upstream step produced it. This setting should be turned on, and once it is, the python backend must be able to handle both CPU and GPU tensors.

  • Module execution sequence
    BLS does not support parallel execution of steps. Steps must run sequentially: the next step starts only after the previous step has finished.

  • Data transmission
    DLPack is used to encode and decode tensors so that they can be passed between different frameworks and the python backend. This step is zero-copy and very fast.
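
As a sketch of what this looks like in practice, assuming PyTorch is available inside the python backend environment (tensor names are illustrative):

    import torch.utils.dlpack

    # pb_utils.Tensor -> framework tensor via DLPack (zero-copy).
    torch_tensor = torch.utils.dlpack.from_dlpack(input_tensor.to_dlpack())

    # framework tensor -> pb_utils.Tensor via DLPack (zero-copy).
    output_tensor = pb_utils.Tensor.from_dlpack(
        'output', torch.utils.dlpack.to_dlpack(torch_tensor))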


3. Ensemble

3.1 Ensemble Overview

Using Ensemble to implement Pipelines avoids the overhead of transmitting intermediate tensors and minimizes the number of requests that must be sent to the Triton server. Compared with BLS, the advantage of Ensemble is that it can execute multiple models (steps) in parallel (that is, each step runs asynchronously, which is the true meaning of Pipelines), thereby improving overall performance.

A typical Ensemble Pipelines is as follows:

name: "simple_yolov5_ensemble"
platform: "ensemble"
max_batch_size: 8
input [
  {
    name: "ENSEMBLE_INPUT_0"
    data_type: TYPE_FP32
    dims: [3, 640, 640]
  }
]

output [
  {
    name: "ENSEMBLE_OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ 300, 6 ]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "simple_yolov5"
      model_version: 1
      input_map: {
        key: "images"
        value: "ENSEMBLE_INPUT_0"
      }
      output_map: {
        key: "output"
        value: "FILTER_BBOXES"
      }
    },
    {
      model_name: "nms"
      model_version: 1
      input_map: {
        key: "candidate_boxes"
        value: "FILTER_BBOXES"
      }
      output_map: {
        key: "BBOXES"
        value: "ENSEMBLE_OUTPUT_0"
      }
    }
  ]
}

The above Pipelines contain two independently deployed model services, simple_yolov5 and nms, connected in series through the Ensemble: the output of simple_yolov5 is used as the input of nms, and the output of nms is the output of the whole Pipelines. Each input_map and output_map entry is a key-value pair: the key is the input/output name of the sub-model, and the value is the corresponding tensor name inside the Ensemble (an Ensemble input/output or an intermediate tensor such as FILTER_BBOXES).
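
From the client's point of view, the whole Ensemble is invoked as a single model. A minimal sketch using the tritonclient HTTP API (the server URL and the random input data are illustrative):

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url='localhost:8000')

    # One preprocessed image, matching ENSEMBLE_INPUT_0 dims [3, 640, 640] plus batch.
    images = np.random.rand(1, 3, 640, 640).astype(np.float32)

    infer_input = httpclient.InferInput('ENSEMBLE_INPUT_0', list(images.shape), 'FP32')
    infer_input.set_data_from_numpy(images)

    result = client.infer(
        'simple_yolov5_ensemble',
        inputs=[infer_input],
        outputs=[httpclient.InferRequestedOutput('ENSEMBLE_OUTPUT_0')])
    bboxes = result.as_numpy('ENSEMBLE_OUTPUT_0')  # shape: [1, 300, 6]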

3.2 Ensemble data transmission

  • If all sub-models of Ensemble are deployed based on Triton's built-in framework backend, data between sub-models can be transferred point-to-point through the CUDA API without going through CPU memory copying.

  • If a sub-model of the Ensemble uses a custom backend or the python backend, tensor communication between sub-models goes through a system (CPU) memory copy. Even if FORCE_CPU_ONLY_INPUT_TENSORS is set to no for the python backend, this memory copy cannot be avoided. In the following step, the previous step's output comes from the tensorrt backend and lives on the GPU, yet the input observed inside the python backend is always located on the CPU, i.e. a Device-to-Host memory copy happens here.

    for request in requests:
        before_nms = pb_utils.get_input_tensor_by_name(
            request, 'candidate_boxes')

        # Always prints True: the ensemble has already copied the tensor to CPU memory.
        print(f'nms pb_tensor is from cpu {before_nms.is_cpu()}', flush=True)
    

4. Performance analysis

Data source: Benchmark

Throughput and latency are the two main performance indicators to consider. There is little difference between the three deployments in latency. In terms of throughput, however, batched_nms_dynamic > BLS > Ensemble. The reasons are:

  • For batched_nms_dynamic, both inference and nms live inside a single TensorRT engine; data between layers is passed through the CUDA API, which is the most efficient.
  • For Ensemble and BLS, inference and nms are two independent model instances. The Input tensor of the python backend in BLS stays on the GPU, whereas in the Ensemble it is forced to the CPU. The overhead of this memory copy outweighs the benefit of parallel step execution, so BLS performs better than Ensemble when the pipeline includes a python backend.

Origin blog.csdn.net/weixin_41817841/article/details/127658834