Triton Tutorial - Model Configuration

Is this your first time writing a configuration file? Check out this guide or this example!

Triton Series Tutorials:

  1. Quick Start
  2. Deploy Your Own Models with Triton
  3. Triton Architecture
  4. Model Repository
  5. Repository Agent
  6. Model Configuration
  7. Optimization
  8. Dynamic Batching

Every model in the model repository must include a model configuration that provides required and optional information about the model. Typically this configuration is provided in a config.pbtxt file specified as a ModelConfig protobuf. In some cases, as described in Auto-Generated Model Configuration, the model configuration can be generated automatically by Triton and therefore does not need to be provided explicitly.

This section describes the most important model configuration properties, but you should also consult the documentation in the ModelConfig protobuf .

Minimal Model Configuration

A minimal model configuration must specify the platform and/or backend properties, the max_batch_size property, and the model's input and output tensors.

For example, consider a TensorRT model that has two inputs, input0 and input1, and one output, output0, all of which are 16-element float32 tensors. The minimal configuration is:

  platform: "tensorrt_plan"
  max_batch_size: 8
  input [
    {
    
    
      name: "input0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    },
    {
    
    
      name: "input1"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]
  output [
    {
    
    
      name: "output0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]

Name, Platform, and Backend

The model configuration name attribute is optional. If the name of the model is not specified in the configuration, it is assumed to be the same as the model repository directory containing the model. If a name is specified, it must match the name of the model repository directory containing the model. The values required for platform and backend are described in the backend documentation.

Model Transaction Policy

The model_transaction_policy attribute describes the nature of the transactions expected of the model.

Decoupled

This boolean setting indicates whether responses generated by the model are decoupled from the requests issued to it. With decoupling, the number of responses generated by the model may differ from the number of requests issued, and the responses may be out of order relative to the order of the requests. The default is false, which means the model will generate exactly one response for each request.
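
For example, a decoupled model would set the transaction policy in its config.pbtxt as in this minimal sketch (the rest of the configuration is omitted):

  model_transaction_policy {
    decoupled: true
  }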

Maximum Batch Size

The max_batch_size property indicates the maximum batch size that the model supports for the types of batching that Triton can exploit. If the model's batch dimension is the first dimension and all of its inputs and outputs have this batch dimension, then Triton can automatically batch requests to the model using its dynamic batcher or sequence batcher. In this case max_batch_size should be set to a value greater than or equal to 1, indicating the maximum batch size that Triton should use with the model.

max_batch_size must be set to zero for models that do not support batching, or that do not support batching in the specific way described above.

Inputs and Outputs

Each model input and output must specify a name, data type, and shape. Names given to input or output tensors must match the names expected by the model.

Special Conventions for the PyTorch Backend

Naming convention:

Due to the absence of sufficient metadata for inputs and outputs in TorchScript model files, the "name" attribute of inputs and outputs in the configuration must follow a specific naming convention, detailed below.

  1. [Only for inputs] When the input is not a dictionary of tensors, the input name in the config file should reflect the input parameter name of the forward function in the model definition.

    For example, if the forward function of a Torchscript model is defined as forward(self, input0, input1), the first and second inputs should be named "input0" and "input1", respectively.

  2. <name>__<index>: where <name> can be any string and <index> is an integer index indicating the position of the corresponding input/output.

    This means that if there are two inputs and two outputs, the first and second inputs can be named "INPUT__0" and "INPUT__1", and the first and second outputs can be named "OUTPUT__0" and "OUTPUT__1", respectively (see the configuration sketch after this list).

  3. If the inputs (or outputs) do not all follow the same naming convention, strict ordering from the model configuration is enforced, i.e., the order of the inputs (or outputs) in the configuration is assumed to be the true order of those inputs (or outputs).
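
As an illustration of convention 2, a PyTorch model with two inputs and one output might be configured as sketched below; the data types and dimensions here are placeholders rather than values from a real model:

  backend: "pytorch"
  max_batch_size: 8
  input [
    {
      name: "INPUT__0"
      data_type: TYPE_FP32
      dims: [ 3, 224, 224 ]
    },
    {
      name: "INPUT__1"
      data_type: TYPE_FP32
      dims: [ 3, 224, 224 ]
    }
  ]
  output [
    {
      name: "OUTPUT__0"
      data_type: TYPE_FP32
      dims: [ 1000 ]
    }
  ]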

A dictionary of tensors as input:

The PyTorch backend supports passing inputs to the model as a dictionary of tensors. This is supported only when the model has a single input that is a dictionary mapping strings to tensors. For example, if you have a model that expects an input of the form:

{'A': tensor1, 'B': tensor2}

In this case the input names in the configuration cannot follow the <name>__<index> naming convention above. Instead, the name of each input must map to the string "key" of the corresponding tensor. For this case the inputs would be "A" and "B", where input "A" refers to the value corresponding to tensor1 and input "B" refers to the value corresponding to tensor2.
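
For that example, the input section of the configuration might look like the following sketch (the data types and dimensions are placeholders):

  input [
    {
      name: "A"
      data_type: TYPE_FP32
      dims: [ 16 ]
    },
    {
      name: "B"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]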

The data types allowed for input and output tensors vary by model type. The Data Types section describes the allowed data types and how they map to the data types of each model type.

The input shape represents the shape of the input tensor expected by the model and by Triton in inference requests. The output shape represents the shape of the output tensor produced by the model and returned by Triton in responses to inference requests. Both input and output shapes must have rank greater than or equal to 1; the empty shape [ ] is not allowed.

The full shape of an input or output is formed by combining max_batch_size with the dims specified by the input or output property. For models with max_batch_size greater than 0, the full shape is [ -1 ] + dims. For models with max_batch_size equal to 0, the full shape is just dims. For example, in the following configuration the shape of "input0" is [ -1, 16 ] and the shape of "output0" is [ -1, 4 ].

  platform: "tensorrt_plan"
  max_batch_size: 8
  input [
    {
    
    
      name: "input0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]
  output [
    {
    
    
      name: "output0"
      data_type: TYPE_FP32
      dims: [ 4 ]
    }
  ]

For the same configuration, except with max_batch_size equal to 0, the shape of "input0" is [ 16 ] and the shape of "output0" is [ 4 ].

  platform: "tensorrt_plan"
  max_batch_size: 0
  input [
    {
    
    
      name: "input0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]
  output [
    {
    
    
      name: "output0"
      data_type: TYPE_FP32
      dims: [ 4 ]
    }
  ]

For models that support input and output tensors with variable-size dimensions, those dimensions can be listed as -1 in the input and output configurations. For example, if a model expects a 2-dimensional input tensor where the first dimension must be of size 4 and the second dimension can be of any size, the model configuration for that input would include dims: [ 4, -1 ]. Triton will then accept inference requests where the second dimension of the input tensor is any value greater than or equal to 0. The model configuration may be more restrictive than the underlying model allows. For example, even if the framework model itself allows the second dimension to be of any size, the model configuration could specify dims: [ 4, 4 ]. In this case, Triton will only accept inference requests where the input tensor's shape is exactly [ 4, 4 ].

The reshape attribute must be used if the input shape Triton receives in an inference request does not match the input shape expected by the model. Likewise, the reshape attribute must be used if the output shape produced by the model does not match the shape returned by Triton in the response to the inference request.

A model input can specify allow_ragged_batch to indicate that the input is a ragged input. This field is used with dynamic batchers to allow batching without forcing the input to have the same shape across all requests.
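
For example, an input that may have a different shape in every request could be marked as ragged roughly as follows (the name, data type, and dims are placeholders):

  input [
    {
      name: "input0"
      data_type: TYPE_FP32
      dims: [ -1 ]
      allow_ragged_batch: true
    }
  ]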

Auto-Generated Model Configuration

A model configuration file containing the required settings must be available for each model to be deployed on Triton. In some cases, the required parts of the model configuration can be generated automatically by Triton. The required part of the model configuration is the set of settings shown in the minimal model configuration. By default, Triton attempts to complete these sections. Triton can be configured not to auto-complete model configurations on the backend by starting it with the --disable-auto-complete-config option; even with this option, however, Triton fills in missing instance_group settings with default values.

Triton can automatically derive all required settings for most TensorRT, TensorFlow SavedModel, ONNX, and OpenVINO models. For Python models, the auto_complete_config function can be implemented in the Python backend to provide the max_batch_size, input, and output properties using the set_max_batch_size, add_input, and add_output functions. These properties allow Triton to load a Python model with minimal model configuration even when no configuration file is present. All other model types must provide a model configuration file.

When developing a custom backend, you can populate the configuration with the required settings and call the TRITONBACKEND_ModelSetConfig API to update the completed configuration in the Triton core. You can look at the TensorFlow and ONNX Runtime backends as examples of how to do this. Currently, a backend can only populate the input, output, max_batch_size, and dynamic batching settings. For custom backends, your config.pbtxt file must contain a backend field, or your model name must be in the <model_name>.<backend_name> format.

You can also view the model configuration that Triton generates for a model using the model configuration endpoint . The easiest way is to use a utility like curl:

$ curl localhost:8000/v2/models/<model name>/config

This returns a JSON representation of the generated model configuration. From here you can take the max_batch_size, input, and output sections of the JSON and convert them into a config.pbtxt file. Triton generates only the minimal portion of the model configuration; you must still provide the optional portions of the model configuration by editing the config.pbtxt file.

Default Maximum Batch Size and Dynamic Batcher

When a model uses auto-complete, a default maximum batch size can be set with the --backend-config=default-max-batch-size=<int> command line argument. This allows all models that are capable of batching and that use an auto-generated configuration to have a default maximum batch size. By default this value is set to 4. Backend developers can make use of it by obtaining default-max-batch-size from the TRITONBACKEND_BackendConfig API. Currently, the backends that use this default batch size value and turn on dynamic batching in their generated model configurations are:

  1. TensorFlow backend

  2. Onnx runtime backend

  3. TensorRT backend

    TensorRT models store the maximum batch size explicitly and do not use the default-max-batch-size parameter. However, if max_batch_size > 1 and no scheduler is provided , the dynamic batch scheduler will be enabled.

If the maximum batch size value set for the model is greater than 1, the dynamic_batching configuration will be set if no scheduler is provided in the configuration file .
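
As a rough sketch, for a hypothetical ONNX model loaded without a config.pbtxt and with the default maximum batch size left at 4, the auto-completed configuration would contain settings along these lines, in addition to the derived input and output sections:

  max_batch_size: 4
  dynamic_batching { }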

Data Types

The following table shows the tensor data types supported by Triton. The first column shows the name of the data type as it appears in the model configuration file. The next four columns show the corresponding data types for the supported model frameworks. If a framework does not have an entry for a given data type, Triton does not support that data type for that framework. The sixth column, labeled "API", shows the corresponding data types for the TRITONSERVER C API, TRITONBACKEND C API, HTTP/REST protocol, and GRPC protocol. The last column shows the corresponding data types for the Python numpy library.

| Model Config | TensorRT | TensorFlow | ONNX Runtime | PyTorch | API | NumPy |
|--------------|----------|------------|--------------|---------|-----|-------|
| TYPE_BOOL | kBOOL | DT_BOOL | BOOL | kBool | BOOL | bool |
| TYPE_UINT8 | kUINT8 | DT_UINT8 | UINT8 | kByte | UINT8 | uint8 |
| TYPE_UINT16 | | DT_UINT16 | UINT16 | | UINT16 | uint16 |
| TYPE_UINT32 | | DT_UINT32 | UINT32 | | UINT32 | uint32 |
| TYPE_UINT64 | | DT_UINT64 | UINT64 | | UINT64 | uint64 |
| TYPE_INT8 | kINT8 | DT_INT8 | INT8 | kChar | INT8 | int8 |
| TYPE_INT16 | | DT_INT16 | INT16 | kShort | INT16 | int16 |
| TYPE_INT32 | kINT32 | DT_INT32 | INT32 | kInt | INT32 | int32 |
| TYPE_INT64 | | DT_INT64 | INT64 | kLong | INT64 | int64 |
| TYPE_FP16 | kHALF | DT_HALF | FLOAT16 | | FP16 | float16 |
| TYPE_FP32 | kFLOAT | DT_FLOAT | FLOAT | kFloat | FP32 | float32 |
| TYPE_FP64 | | DT_DOUBLE | DOUBLE | kDouble | FP64 | float64 |
| TYPE_STRING | | DT_STRING | STRING | | BYTES | dtype(object) |
| TYPE_BF16 | | | | | BF16 | |

For TensorRT, each value is in the nvinfer1::DataType namespace. For example, nvinfer1::DataType::kFLOAT is the 32-bit floating-point data type.

For TensorFlow, each value is in the tensorflow namespace. For example, tensorflow::DT_FLOAT is the 32-bit floating-point data type.

For ONNX Runtime, each value is prefixed with ONNX_TENSOR_ELEMENT_DATA_TYPE_. For example, ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT is the 32-bit floating-point data type.

For PyTorch, each value is in the torch namespace. For example, torch::kFloat is the 32-bit floating-point data type.

For NumPy, each value is in the numpy module. For example, numpy.float32 is the 32-bit floating-point data type.

Reshape

The ModelTensorReshape attribute on a model configuration input or output is used to indicate that the inference API accepts a different input or output shape than the one expected or generated by the underlying framework model or custom backend.

For an input, reshape can be used to reshape the input tensor into the different shape expected by the framework or backend. A common use case is a model that supports batching and expects its batched input to have shape [ batch-size ], meaning the batch dimension fully describes the shape. For the inference API, the equivalent shape [ batch-size, 1 ] must be specified, since each input must specify a non-empty dims. For this case the input should be specified as:

  input [
    {
      name: "in"
      dims: [ 1 ]
      reshape: { shape: [ ] }
    }
  ]

For an output, reshape can be used to reshape the output tensor produced by the framework or backend into the different shape returned by the inference API. A common use case is a model that supports batching and produces a batched output with shape [ batch-size ], meaning the batch dimension fully describes the shape. For the inference API, the equivalent shape [ batch-size, 1 ] must be specified, since each output must specify a non-empty dims. For this case the output should be specified as:

  output [
    {
      name: "out"
      dims: [ 1 ]
      reshape: { shape: [ ] }
    }
  ]

Shape Tensors

For models that support shape tensors, the is_shape_tensor attribute must be set appropriately for the inputs and outputs that act as shape tensors. An example configuration specifying shape tensors is shown below.

  name: "myshapetensormodel"
  platform: "tensorrt_plan"
  max_batch_size: 8
  input [
    {
    
    
      name: "input0"
      data_type: TYPE_FP32
      dims: [ 1 , 3]
    },
    {
    
    
      name: "input1"
      data_type: TYPE_INT32
      dims: [ 2 ]
      is_shape_tensor: true
    }
  ]
  output [
    {
    
    
      name: "output0"
      data_type: TYPE_FP32
      dims: [ 1 , 3]
    }
  ]

As mentioned above, Triton assumes that batching occurs along the first dimension not listed in the input or output tensor dims. However, for shape tensors, batching occurs at the first shape value. For the example above, an inference request must provide an input with the following shape.

  "input0": [ x, 1, 3]
  "input1": [ 3 ]
  "output0": [ x, 1, 3]

where x is the batch size of the request. Triton requires that shape tensors be marked as shape tensors in the model when batching is used. Note that the shape of "input1" is [ 3 ] and not [ 2 ], which is how it is described in the model configuration. Because myshapetensormodel is a batching model, the batch size must be provided as an additional value. Triton accumulates all of the shape values for "input1" along the batch dimension before issuing the request to the model.

For example, suppose a client sends the following three requests to Triton with the following inputs:

Request1:
input0: [[[1,2,3]]] <== shape of this tensor [1,1,3]
input1: [1,4,6] <== shape of this tensor [3]

Request2:
input0: [[[4,5,6]], [[7,8,9]]] <== shape of this tensor [2,1,3]
input1: [2,4,6] <== shape of this tensor [3]

Request3:
input0: [[[10,11,12]]] <== shape of this tensor [1,1,3]
input1: [1,4,6] <== shape of this tensor [3]

Assuming these requests are batched together, they will be passed to the model:

Batched Requests to model:
input0: [[[1,2,3]], [[4,5,6]], [[7,8,9]], [[10,11,12]]] <== shape of this tensor [4,1,3]
input1: [4, 4, 6] <== shape of this tensor [3]

Currently, only TensorRT supports shape tensors. Read Shape Tensor I/O to learn more about shape tensors.

Version Policy

Each model can have one or more versions . The ModelVersionPolicy property of the model configuration is used to set one of the following policies.

  • All: All versions of the model available in the model repository are available for inference. version_policy: { all: {}}

  • Latest: Only the latest "n" versions of the model in the repository are available for inference. The latest versions of the model are those with the numerically greatest version numbers. version_policy: { latest: { num_versions: 2 }}

  • Specific: Only the specifically listed versions of the model are available for inference. version_policy: { specific: { versions: [1,3] }}

If no version policy is specified, Latest (n=1) is used by default, meaning that Triton only provides the latest version of the model. In all cases, adding or removing a version subdirectory from the model repository can change the model version used in subsequent inference requests.

The following configuration specifies that all versions of the model are available from the server.

  platform: "tensorrt_plan"
  max_batch_size: 8
  input [
    {
    
    
      name: "input0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    },
    {
    
    
      name: "input1"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]
  output [
    {
    
    
      name: "output0"
      data_type: TYPE_FP32
      dims: [ 16 ]
    }
  ]
  version_policy: {
    
     all {
    
     }}

Instance Groups

Triton can provide multiple instances of a model so that multiple inference requests for that model can be handled concurrently. The model configuration's ModelInstanceGroup property is used to specify the number of execution instances that should be made available and what compute resources should be used for those instances.

Multiple Model Instances

By default, a single execution instance of the model is created for each GPU available in the system. The instance group setting can be used to place multiple execution instances of a model on every GPU or on only certain GPUs. For example, the following configuration will place two execution instances of the model on each system GPU.

  instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
  ]

And the following configuration will place one execution instance on GPU 0 and two execution instances on GPUs 1 and 2.

  instance_group [
    {
      count: 1
      kind: KIND_GPU
      gpus: [ 0 ]
    },
    {
      count: 2
      kind: KIND_GPU
      gpus: [ 1, 2 ]
    }
  ]

For a more detailed example of using instance groups, see this guide .

CPU Model Instances

The instance group setting is also used to enable execution of a model on the CPU. A model can be executed on the CPU even if a GPU is available in the system. The following configuration places two execution instances on the CPU.

  instance_group [
    {
      count: 2
      kind: KIND_CPU
    }
  ]

If no count is specified for a KIND_CPU instance group, the default instance count is 2 for selected backends (TensorFlow and ONNX Runtime). All other backends default to 1.

Host Policy

Instance group settings are associated with a host policy. The following configuration associates all instances created by the instance group setting with host policy "policy_0". By default the host policy is set according to the device kind of the instance: "cpu" for KIND_CPU, "model" for KIND_MODEL, and "gpu_<gpu_id>" for KIND_GPU.

  instance_group [
    {
      count: 2
      kind: KIND_CPU
      host_policy: "policy_0"
    }
  ]

Rate Limiter Configuration

Instance groups can optionally specify a rate limiter configuration that controls how the rate limiter operates on the instances in the group. If rate limiting is off, the rate limiter configuration is ignored. If rate limiting is on and the instance_group does not provide this configuration, the execution of the model instances belonging to this group is not limited in any way by the rate limiter. The configuration includes the following specifications:

Resources

The set of resources required to execute a model instance. The "name" field identifies the resource and the "count" field refers to the number of copies of the resource that the model instances in the group require in order to run. The "global" field specifies whether the resource is shared per device or globally across the system. A loaded model may not specify a resource with the same name as both global and non-global. If no resources are provided, Triton assumes that execution of the model instances does not require any resources and will begin executing as soon as a model instance is available.

Priority

Priority serves as a weighting value used to prioritize across the instances of all models. An instance with priority 2 receives half the scheduling opportunities of an instance with priority 1.

The following example specifies that the instances in the group require four "R1" and two "R2" resources to execute. Resource "R2" is a global resource. In addition, the instance group has a rate limiter priority of 2.

  instance_group [
    {
      count: 1
      kind: KIND_GPU
      gpus: [ 0, 1, 2 ]
      rate_limiter {
        resources [
          {
            name: "R1"
            count: 4
          },
          {
            name: "R2"
            global: True
            count: 2
          }
        ]
        priority: 2
      }
    }
  ]

The configuration above creates 3 model instances, one on each device (0, 1, and 2). The three instances do not contend with each other for "R1", because "R1" is local to their own device; however, they do contend for "R2" because it is specified as a global resource, which means "R2" is shared across the entire system. Although these instances do not contend for "R1" among themselves, they do contend for "R1" with any other model instances that include "R1" in their resource requirements and run on the same device.

Ensemble Model Instance Groups

An ensemble model is an abstraction Triton uses to execute a user-defined pipeline of models. Since there is no physical instance associated with an ensemble model, the instance_group field cannot be specified for it.

However, each composing model that makes up an ensemble can specify instance_group in its own configuration file and individually support parallel execution, as described above, when the ensemble receives multiple requests.

CUDA Compute Capability

Similar to the default_model_filename field, you can optionally specify the cc_model_filenames field to map a GPU's CUDA compute capability to the corresponding model filename at model load time. This is especially useful for TensorRT models, since they are usually tied to a specific compute capability.

cc_model_filenames [
  {
    key: "7.5"
    value: "resnet50_T4.plan"
  },
  {
    key: "8.0"
    value: "resnet50_A100.plan"
  }
]

Scheduling and Batching

Triton supports batch inference by allowing a single inference request to specify a batch of inputs. Inference on a batch of inputs is performed simultaneously, which is especially important for GPUs because it can greatly increase inference throughput. In many use cases, individual inference requests are not batched, and therefore, they cannot benefit from the throughput benefits of batching.

Inference Server includes a variety of scheduling and batching algorithms, supporting many different model types and use cases. For more information on model types and schedulers, see Models and Schedulers .

Default Scheduler

If no scheduling_choice attribute is specified in the model configuration, the default scheduler is used for the model. The default scheduler simply distributes inference requests to all model instances configured for the model.

Dynamic Batcher

Dynamic batching is a feature of Triton that allows the server to combine inference requests so that batches are created dynamically. Creating a batch of requests usually results in increased throughput. The dynamic batcher should be used for stateless models. The dynamically created batches are distributed to all model instances configured for the model.

Dynamic batching is enabled and configured independently for each model using the ModelDynamicBatching property in the model configuration. These settings control the preferred sizes of the dynamically created batches, the maximum time a request can be delayed in the scheduler to allow other requests to join the dynamic batch, and queue properties such as queue size, priorities, and timeouts. See this guide for more detailed examples of dynamic batching.

Recommended Configuration Process

The individual settings are described in detail below. The following steps are the recommended process for tuning the dynamic batcher for each model. It is also possible to use the Model Analyzer to automatically search across different dynamic batcher configurations.

  • Determine the maximum batch size for the model.

  • Add the following to your model configuration to enable the dynamic batcher with all default settings. By default, the dynamic batcher will create batches as large as possible, up to the maximum batch size, with no delay when forming batches.

      dynamic_batching { }
    
  • Use the Performance Analyzer to determine the latency and throughput provided by the default dynamic batcher configuration.

  • If the default configuration results in latency values that are within your latency budget, try one or both of the following to trade off increased latency for increased throughput:

    • Increase the maximum batch size.

    • Set the batch delay to a non-zero value. Try increasing the latency value until you exceed your latency budget to see the impact on throughput.

  • Most models should not use the preferred batch size. A preferred batch size should only be configured if that batch size results in significantly higher performance than other batch sizes.

Preferred Batch Size

The preferred_batch_size attribute indicates the batch size that the dynamic batcher should attempt to create. For most models, preferred_batch_size should not be specified, as described in the recommended configuration procedure. An exception is TensorRT models that specify multiple optimization profiles for different batch sizes. In such cases, since some optimization profiles may provide significant performance improvements over others, it may make sense to use preferred_batch_size to denote the batch size supported by those higher performance optimization profiles.

The following examples show configurations that enable dynamic batching with preferred batch sizes of 4 and 8.

  dynamic_batching {
    preferred_batch_size: [ 4, 8 ]
  }

When a model instance becomes available for inference, the dynamic batcher attempts to create batches from the requests available in the scheduler. Requests are added to the batch in the order they were received. If the dynamic batcher can form a batch of a preferred size, it creates the largest possible preferred-size batch and sends it for inference. If the dynamic batcher cannot form a batch of a preferred size (or if it is not configured with any preferred batch sizes), it sends a batch that is as large as possible, which may be smaller than the maximum batch size allowed by the model (but see the next section for a delay option that changes this behavior).

The sizes of the generated batches can be examined in aggregate using count metrics.

Delayed Batching

Dynamic batchers can be configured to allow requests to be delayed in the scheduler for a limited amount of time to allow other requests to join the dynamic batch. For example, the following configuration sets a maximum latency of 100 microseconds for requests.

  dynamic_batching {
    max_queue_delay_microseconds: 100
  }

The max_queue_delay_microseconds property setting changes the dynamic batcher behavior when a batch of the maximum size (or a preferred size) cannot be created. When a batch of the maximum or preferred size cannot be created from the available requests, the dynamic batcher delays sending the batch as long as no request is delayed longer than the configured max_queue_delay_microseconds value. If a new request arrives during this delay and allows the dynamic batcher to form a batch of the maximum or preferred batch size, that batch is sent immediately for inference. If the delay expires, the dynamic batcher sends the batch as is, even though it is not of the maximum or preferred size.

Preserve Ordering

The preserve_ordering attribute is used to force all responses to be returned in the same order that the requests were received. See the protobuf documentation for details.
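
For example (a minimal sketch):

  dynamic_batching {
    preserve_ordering: true
  }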

Priority Levels

By default, the dynamic batcher maintains a queue containing all inference requests for a model. Requests are processed sequentially and batched. The priority_levels attribute can be used to create multiple priorities in a dynamic batcher, allowing requests with higher priority to bypass requests with lower priority. Requests with the same priority are processed in order. Inference requests with no priority set are scheduled using the default_priority_level attribute.
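
The following sketch configures a dynamic batcher with two priority levels, where requests without an explicit priority are treated as priority 2 (the values are illustrative):

  dynamic_batching {
    priority_levels: 2
    default_priority_level: 2
  }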

Queue Policy

The dynamic batcher provides several settings to control how requests are queued for batching.

When priority_levels is not defined, the ModelQueuePolicy for the single queue can be set using default_queue_policy. When priority_levels is defined, each priority level can have a different ModelQueuePolicy, as specified by default_queue_policy and priority_queue_policy.

The ModelQueuePolicy attribute allows setting the maximum queue size with max_queue_size. The timeout_action, default_timeout_microseconds, and allow_timeout_override settings allow configuring the queue to reject or defer individual requests if the time in the queue exceeds the specified timeout.
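
For example, the following sketch caps the queue at 16 requests and rejects any request that waits longer than 100 milliseconds (the numbers are illustrative, not recommendations):

  dynamic_batching {
    default_queue_policy {
      timeout_action: REJECT
      default_timeout_microseconds: 100000
      allow_timeout_override: true
      max_queue_size: 16
    }
  }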

Custom Batching

In addition to the specified behavior of the dynamic batcher, you can also define custom batching rules. To do so, you implement five functions in tritonbackend.h and create a shared library. These functions are described in the table below.

| Function | Description |
|----------|-------------|
| TRITONBACKEND_ModelBatchIncludeRequest | Determines whether a request should be included in the current batch |
| TRITONBACKEND_ModelBatchInitialize | Initializes a record-keeping data structure for a new batch |
| TRITONBACKEND_ModelBatchFinalize | Deallocates the record-keeping data structure after a batch is formed |
| TRITONBACKEND_ModelBatcherInitialize | Initializes a read-only data structure for use with all batches |
| TRITONBACKEND_ModelBatcherFinalize | Deallocates the read-only data structure after the model is unloaded |

The path to the shared library can be passed into the model configuration via the TRITON_BATCH_STRATEGY_PATH parameter. If it is not provided, the dynamic batcher will look for a custom batching strategy named batchstrategy.so in the model version, model, and backend directories, in that order, and load it if found. This makes it easy to share a custom batching strategy across all models that use the same backend.
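
For example, the parameter can be set in the model configuration roughly as follows (the library path is hypothetical):

  parameters: {
    key: "TRITON_BATCH_STRATEGY_PATH"
    value: { string_value: "/path/to/batchstrategy.so" }
  }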

See the backend examples directory for a tutorial on how to create and use a custom batch library .

Sequence Batcher

Like the dynamic batcher, the sequence batcher combines non-batched inference requests so that batches are created dynamically. Unlike the dynamic batcher, the sequence batcher is used for stateful models, where a sequence of inference requests must be routed to the same model instance. The dynamically created batches are distributed to all model instances configured for the model.

Sequence batching is enabled and configured independently for each model using the ModelSequenceBatching property in the model configuration. These settings control the sequence timeout and configure how Triton sends control signals to the model indicating sequence start, end, ready, and correlation ID. See Stateful Models for more information and examples.
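
As a rough sketch, a sequence-batched configuration might declare the control tensors the model uses for sequence start and ready signals as follows; the tensor names "START" and "READY" and the timeout value are assumptions about a particular model, not requirements:

  sequence_batching {
    max_sequence_idle_microseconds: 5000000
    control_input [
      {
        name: "START"
        control [
          {
            kind: CONTROL_SEQUENCE_START
            fp32_false_true: [ 0, 1 ]
          }
        ]
      },
      {
        name: "READY"
        control [
          {
            kind: CONTROL_SEQUENCE_READY
            fp32_false_true: [ 0, 1 ]
          }
        ]
      }
    ]
  }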

Ensemble Scheduler

The ensemble scheduler must be used for ensemble models and cannot be used for any other type of model.

The ensemble scheduler is enabled and configured independently for each model using the ModelEnsembleScheduling property in the model configuration. These settings describe the models included in the ensemble and the flow of tensor values between them. See Ensemble Models for more information and examples.
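
The following sketch shows the general shape of an ensemble_scheduling section for a hypothetical two-step pipeline; the model and tensor names are invented for illustration, and the ensemble's own input and output declarations are omitted:

  platform: "ensemble"
  ensemble_scheduling {
    step [
      {
        model_name: "preprocess"
        model_version: -1
        input_map {
          key: "RAW_INPUT"
          value: "INPUT"
        }
        output_map {
          key: "PREPROCESSED_OUTPUT"
          value: "preprocessed"
        }
      },
      {
        model_name: "classifier"
        model_version: -1
        input_map {
          key: "input0"
          value: "preprocessed"
        }
        output_map {
          key: "output0"
          value: "OUTPUT"
        }
      }
    ]
  }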

Optimization Policy

The model configuration's ModelOptimizationPolicy property is used to specify optimization and prioritization settings for the model. These settings control whether and how the backend optimizes the model and how Triton schedules and executes it. See the ModelConfig protobuf and the optimization documentation for the currently available settings.
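
For instance, to ask a framework backend to use the TensorRT execution accelerator, the optimization section might look roughly like this hedged sketch based on the optimization documentation; the precision setting is illustrative:

  optimization {
    execution_accelerators {
      gpu_execution_accelerator : [
        {
          name : "tensorrt"
          parameters { key: "precision_mode" value: "FP16" }
        }
      ]
    }
  }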

Model Warmup

When Triton loads a model, the corresponding backend is initialized for that model. For some backends, some or all of this initialization is deferred until the model receives its first inference request (or first few inference requests). Therefore, the first (few) inference requests may be significantly slower due to lazy initialization.

To avoid these initial, slow inference requests, Triton provides a configuration option that enables the model to be "warmed up" so that it is fully initialized before the first inference request is received. When the ModelWarmup property is defined in the model configuration, Triton will not show that the model is ready for inference until the model warmup is complete.

The model configuration's ModelWarmup property is used to specify the warmup settings for the model. These settings define a series of inference requests that Triton creates to warm up each model instance. A model instance is served only if the requests complete successfully. Note that the effect of warming up a model varies by framework backend and can cause Triton to be slower to respond to model updates, so users should experiment and choose a configuration that suits their needs. See the ModelWarmup protobuf documentation for the currently available settings, and L0_warmup for examples specifying different variants of warmup samples.
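
As a sketch, a warmup section that sends a single zero-valued sample to an input named "input0" (the input name, type, and dims are placeholders) might look like:

  model_warmup [
    {
      name: "zero_value_sample"
      batch_size: 1
      inputs {
        key: "input0"
        value: {
          data_type: TYPE_FP32
          dims: [ 16 ]
          zero_data: true
        }
      }
    }
  ]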

Response Caching

The model configuration's response_cache section has an enable boolean used to enable response caching for the model.

response_cache {
  enable: true
}

In addition to enabling caching in the model configuration, --cache-config must be specified when starting the server to enable caching on the server side. See the Response Caching documentation for more details on enabling server-side caching.

Origin blog.csdn.net/kunhe0512/article/details/131283856