Triton Tutorial: Rate Limiter

Triton Series Tutorials:

  1. Quick Start
  2. Deploy your own models with Triton
  3. Triton Architecture
  4. Model Repository
  5. Repository Agent
  6. Model Configuration
  7. Optimization
  8. Dynamic Batching

Rate limiter

Rate limiters govern the rate at which Triton schedules requests on model instances. The rate limiter operates on all models loaded in Triton to allow cross-model prioritization.

When rate limiting is disabled (--rate-limit=off), Triton schedules the execution of a request (or a set of requests when using dynamic batching) as soon as a model instance becomes available. This behavior is usually best for performance. However, in some cases, running all the models at the same time may put too much load on the server. For example, model execution on some frameworks allocates memory dynamically, so running all such models simultaneously may cause the system to run out of memory.

Rate limiters allow deferring inference execution for certain model instances so that not all model instances are running at the same time. Model priority is used to decide which model instance to schedule next.

Use a rate limiter

To enable rate limiting, set the --rate-limit option when starting tritonserver (for example, --rate-limit=execution_count). See the usage text printed by tritonserver --help for more information.

The rate limiter is controlled by the rate limiter configuration specified for each model instance, as described in Rate Limiter Configuration. A rate limiter configuration contains the resources and the priority for the model instances defined by an instance group.

Resources

A resource is identified by a unique name and a count giving the number of copies of that resource. By default, model instances do not use any rate limiter resources. By listing a resource and a count, a model instance indicates that it needs that many copies of the resource to be available on its device before it is allowed to execute. At execution time, the specified number of copies are allocated to the model instance and are released only when execution ends. By default, the number of available copies of a resource is the maximum count requested for that resource across all model instances that list it. For example, suppose three loaded model instances A, B, and C each specify the following resource requirements for a single device:

A: [R1: 4, R2: 4]
B: [R2: 5, R3: 10, R4: 5]
C: [R1: 1, R3: 7, R4: 2]

By default, based on these model instance requirements, the server creates the following resources with the indicated number of copies:

R1: 4
R2: 5
R3: 10
R4: 5

These values ensure that every model instance can be scheduled successfully. The default for a resource can be overridden by specifying it explicitly on the command line with the --rate-limit-resource option. See tritonserver --help for detailed usage.
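The default rule (take the largest count that any single instance requests for each resource) can be reproduced with a few lines of Python. This is only an illustrative sketch of the arithmetic, not Triton's implementation; the function name default_resource_counts is made up for this example.

```python
from collections import defaultdict

# Resource requirements per model instance on one device (from the example above).
requirements = {
    "A": {"R1": 4, "R2": 4},
    "B": {"R2": 5, "R3": 10, "R4": 5},
    "C": {"R1": 1, "R3": 7, "R4": 2},
}


def default_resource_counts(requirements):
    """Return, for each resource, the largest count any single instance requests."""
    counts = defaultdict(int)
    for needs in requirements.values():
        for name, count in needs.items():
            counts[name] = max(counts[name], count)
    return dict(counts)


defaults = default_resource_counts(requirements)
print(defaults)  # {'R1': 4, 'R2': 5, 'R3': 10, 'R4': 5}
```

Because each pool is at least as large as the biggest single request, any one instance can always run on its own, which is why the defaults guarantee that every instance is schedulable.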

By default, the available resource copies are per device, and the resource requirements of a model instance are enforced against the resources of the device on which that instance runs. The --rate-limit-resource option also lets users give different devices different numbers of copies. Rate limiters can also handle global resources: instead of one set of copies per device, a global resource has a single set of copies shared across the whole system.

The rate limiter relies on the model configuration to determine whether a resource is global. See the Resources section of the model configuration documentation for details on how to specify this.

For example, when tritonserver runs on a two-device machine and is started with --rate-limit-resource=R1:10 --rate-limit-resource=R2:5:0 --rate-limit-resource=R2:8:1 --rate-limit-resource=R3:2, the available resource copies are:

GLOBAL   => [R3: 2]
DEVICE 0 => [R1: 10, R2: 5]
DEVICE 1 => [R1: 10, R2: 8]

Here R3 is marked as a global resource by one of the loaded models.
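To make the mapping from flags to pools concrete, here is a small Python sketch that expands the values above into the same table. It assumes the name:count[:device] form seen in the example and takes the set of global resource names as an input, since that information comes from the model configuration. The function parse_rate_limit_resources is hypothetical and is not part of Triton.

```python
def parse_rate_limit_resources(flags, device_ids, global_names):
    """Expand --rate-limit-resource values into global and per-device pools.

    flags:        values such as "R1:10" or "R2:5:0" (name:count[:device])
    device_ids:   devices present on the machine
    global_names: resources that a loaded model marks as global (assumed known)
    """
    pools = {"GLOBAL": {}}
    pools.update({dev: {} for dev in device_ids})
    for flag in flags:
        parts = flag.split(":")
        name, count = parts[0], int(parts[1])
        if name in global_names:
            # One shared pool for the whole system.
            pools["GLOBAL"][name] = count
        elif len(parts) == 3:
            # An explicit device was given: only that device gets this count.
            pools[int(parts[2])][name] = count
        else:
            # No device given: every device gets the same count.
            for dev in device_ids:
                pools[dev][name] = count
    return pools


flags = ["R1:10", "R2:5:0", "R2:8:1", "R3:2"]
pools = parse_rate_limit_resources(flags, device_ids=[0, 1], global_names={"R3"})
print(pools)
```

With these inputs, the GLOBAL pool holds R3: 2, device 0 holds R1: 10 and R2: 5, and device 1 holds R1: 10 and R2: 8, matching the table above.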

Priority

In a resource-constrained system, model instances compete for resources to execute their inference requests. The priority setting helps determine which model instance is selected to execute next. See the Priority section of the rate limiter configuration documentation for details.
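The interplay between resources and priority can be pictured with a toy scheduler in Python. This is a simplified sketch, not Triton's algorithm: the priority values assigned to A, B, and C are invented, and it assumes that a lower priority value is served first and that ties go to the instance that has waited longest. The helper names pick_next, acquire, and release are hypothetical.

```python
def pick_next(waiting, available):
    """Pick the waiting instance to run next, or None if nothing fits.

    waiting:   list of (name, priority, needs) tuples, oldest first
    available: dict of free resource copies on the device
    Assumption: a lower priority value is served first.
    """
    runnable = [
        inst for inst in waiting
        if all(available.get(res, 0) >= n for res, n in inst[2].items())
    ]
    if not runnable:
        return None
    # min() returns the first of several equal minima, so ties go to the oldest.
    return min(runnable, key=lambda inst: inst[1])


def acquire(available, needs):
    """Take the listed resource copies when execution starts."""
    for res, n in needs.items():
        available[res] -= n


def release(available, needs):
    """Return the listed resource copies when execution ends."""
    for res, n in needs.items():
        available[res] += n


available = {"R1": 4, "R2": 5, "R3": 10, "R4": 5}
waiting = [
    ("A", 2, {"R1": 4, "R2": 4}),            # hypothetical priorities
    ("B", 1, {"R2": 5, "R3": 10, "R4": 5}),
    ("C", 1, {"R1": 1, "R3": 7, "R4": 2}),
]

first = pick_next(waiting, available)    # B: best priority, listed before C
acquire(available, first[2])
waiting.remove(first)

second = pick_next(waiting, available)   # None: B holds all of R2, R3 and R4

release(available, first[2])             # B finishes and frees its resources
third = pick_next(waiting, available)    # C: it now fits and outranks A
print(first[0], second, third[0])
```

While B is executing, neither A nor C can start because B holds every copy of R2, R3, and R4. This is the deferral effect the rate limiter is meant to provide.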

Origin blog.csdn.net/kunhe0512/article/details/131319988