Triton Tutorial - Decoupling Backends and Models

Triton Series Tutorials:

  1. Quick Start
  2. Deploy Your Own Models with Triton
  3. Triton Architecture
  4. Model Repository
  5. Repository Agents
  6. Model Configuration
  7. Optimization
  8. Dynamic Batching
  9. Rate Limiter
  10. Model Management
  11. Custom Operations

Decoupling backends and models

Triton supports backends and models that send multiple responses for a single request, or no response at all. A decoupled model or backend may also send responses out of order relative to the order in which the batched requests were executed, which lets the backend deliver each response whenever it sees fit. This is especially useful in use cases such as automatic speech recognition (ASR), where a request that produces a large number of responses should not block the responses for other requests from being delivered.

Develop decoupled backends/models

C++ backend

Read through the Triton backend API, Inference Requests and Responses, and Decoupled Responses. The repeat backend and square backend demonstrate how the Triton backend API can be used to implement a decoupled backend. These examples are intended to show the flexibility of the Triton API and should never be used in production. They handle multiple batches of requests concurrently without increasing the number of instances. In a real deployment, the backend should not allow the caller thread to return from TRITONBACKEND_ModelInstanceExecute until the instance is ready to handle another set of requests. If not designed properly, the backend can easily become oversubscribed. This can also lead to underutilization of features like dynamic batching, because it results in eager batching.

Python models using the Python backend

Read the Python backend documentation carefully, in particular how execute() is implemented for a decoupled model.

The decoupled examples demonstrate how the decoupled API can be used to implement a decoupled Python model. As noted in the examples, they are intended to demonstrate the flexibility of the decoupled API and should never be used in production.
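As a minimal sketch of the pattern, a decoupled model.py might look like the following. The tensor names IN and OUT and the three responses per request are illustrative assumptions, not taken from the official examples.

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            # In decoupled mode, responses are sent through a response sender
            # instead of being returned from execute().
            sender = request.get_response_sender()

            in_tensor = pb_utils.get_input_tensor_by_name(request, "IN")
            value = in_tensor.as_numpy()

            # Send an arbitrary number of responses for this one request
            # (three here, purely for illustration).
            for i in range(3):
                out_tensor = pb_utils.Tensor("OUT", value + i)
                sender.send(
                    pb_utils.InferenceResponse(output_tensors=[out_tensor]))

            # Tell Triton that no more responses will follow for this request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

        # A decoupled execute() returns None; responses have already been
        # delivered through the response senders.
        return None
```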

Deploy a decoupled model

The decoupled model transaction policy must be set in the model configuration provided for the model. Triton needs this information to enable the special handling required for decoupled models. Deploying a decoupled model without this configuration setting will produce errors at runtime.
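For example, a config.pbtxt along the following lines enables the decoupled transaction policy; the model name, backend, and tensor definitions are placeholders that should match your own model.

```
name: "my_decoupled_model"
backend: "python"
max_batch_size: 0

input [
  {
    name: "IN"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]

output [
  {
    name: "OUT"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]

model_transaction_policy {
  decoupled: True
}
```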

Run inference on decoupled models

The inference protocols and APIs describe the various ways a client can communicate with the server and run inference. For decoupled models, Triton's HTTP endpoint cannot be used to run inference because it supports exactly one response per request. Even the standard ModelInfer RPC in the GRPC endpoint does not support decoupled responses. To run inference on a decoupled model, the client must use the bi-directional streaming RPC. See here for more details. decoupled_test.py demonstrates how gRPC streaming can be used to infer decoupled models.
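As a rough sketch, a streaming client built with the tritonclient Python package might look like the following. The model name "my_decoupled_model", the tensor names and shapes, and the five-second drain timeout are illustrative assumptions.

```python
from functools import partial
import queue

import numpy as np
import tritonclient.grpc as grpcclient


def callback(responses, result, error):
    # Invoked once for every response (or error) streamed back by the server.
    responses.put((result, error))


responses = queue.Queue()
client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=partial(callback, responses))

# Hypothetical input tensor "IN"; adjust names, shapes, and dtypes to your model.
inputs = [grpcclient.InferInput("IN", [1], "INT32")]
inputs[0].set_data_from_numpy(np.array([4], dtype=np.int32))
client.async_stream_infer(model_name="my_decoupled_model", inputs=inputs)

# A decoupled model may produce zero, one, or many responses per request, so
# keep draining the queue until nothing new arrives for a while.
while True:
    try:
        result, error = responses.get(timeout=5)
    except queue.Empty:
        break
    if error is not None:
        raise error
    print(result.as_numpy("OUT"))

client.stop_stream()
```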

If you are using Triton's in-process C API, your application should be aware that the callback function registered with TRITONSERVER_InferenceRequestSetResponseCallback can be invoked any number of times, each time with a new response. You can take a look at grpc_server.cc to see how this is handled.

Origin blog.csdn.net/kunhe0512/article/details/131352435