1. The Design Philosophy of ONNX Runtime


This document outlines the design philosophy of ONNX Runtime, a high-performance, cross-platform deep learning inference engine.

Primary Goals of ONNX Runtime

  • Maximize, as automatically as possible, the use of custom accelerators
    and runtimes across different platforms.
  • Provide the right abstractions and runtime environment for custom
    accelerators and runtimes. We call this abstraction an execution provider;
    it defines and exposes a set of capabilities to ONNX Runtime: the set of
    single or fused nodes it can execute, its memory allocator, and so on.
    Custom accelerators and runtimes are instances of execution providers.
  • We do not expect an execution provider to always be able to run an entire
    ONNX model on its device. Put simply, a platform such as CUDA may not
    support every operator in a given ONNX model: an individual operator may
    lack a CUDA kernel while still having a CPU one, in which case ONNX Runtime
    can run the model across CUDA and CPU together (see the sketch after this
    list). This means ONNX Runtime must be able to execute a single model in a
    heterogeneous environment involving multiple execution providers.
  • Provide high-level optimizations, described as model-to-model
    transformations performed through a graph-transformation API. These
    transformations fall into two broad categories: global transformations,
    which require analysis and rewriting of the entire graph, and local
    transformations, which can be captured as simple (algebraic) rewrite rules.
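
As a concrete illustration of the heterogeneous-execution point above, the Python API lets you request several execution providers when creating an inference session, and ONNX Runtime falls back to later entries in the list for any operator the earlier ones cannot handle. A minimal sketch (the model file name and input shape are placeholder assumptions; the CUDA provider requires the onnxruntime-gpu package):

    import numpy as np
    import onnxruntime as ort

    # Providers are tried in order: operators without a CUDA kernel
    # automatically fall back to the CPU execution provider.
    session = ort.InferenceSession(
        "model.onnx",  # placeholder model file
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )

    # Feed a dummy input; the shape here is a placeholder and must
    # match the actual model input.
    input_name = session.get_inputs()[0].name
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)
    outputs = session.run(None, {input_name: x})
    print(session.get_providers())  # effective provider order for this session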

High-Level System Architecture of ONNX Runtime

The flow is quite simple. Starting from an ONNX model, ONNX Runtime first converts the model graph into its in-memory graph representation. It then applies a number of graph transformations: a) a set of provider-independent optimizations, such as conversions between float16 and float32, and b) partitioning of the graph into a set of subgraphs based on the available execution providers. Each subgraph is assigned to one execution provider. By querying an execution provider's capabilities through the GetCapability() API, we make sure that each subgraph can be executed by its assigned execution provider.
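These transformations can also be observed from the Python API: SessionOptions exposes the graph optimization level, and the graph as it looks after transformation can be serialized for inspection. A hedged sketch (file names are placeholders):

    import onnxruntime as ort

    so = ort.SessionOptions()
    # ORT_ENABLE_BASIC applies provider-independent rewrites only, while
    # ORT_ENABLE_EXTENDED and ORT_ENABLE_ALL add provider-aware fusions.
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
    # Serialize the graph after transformation so it can be inspected.
    so.optimized_model_filepath = "model.optimized.onnx"  # placeholder path

    session = ort.InferenceSession(
        "model.onnx",  # placeholder model file
        sess_options=so,
        providers=["CPUExecutionProvider"],
    )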

More on Partitioning the Graph into Subgraphs


ONNXRuntime partitions a model graph into subgraphs based on the available execution providers, one for each distinct provider. ONNXRuntime provides
a default execution provider that is used as the fallback execution for the
operators that cannot be pushed onto the more specialized but more efficient
execution providers. Intuitively we want to push computation to more
specialized execution providers whenever possible.

We use a simple graph partitioning technique. The available execution providers
will be considered in a specific order, and each will be assigned the maximal
subgraphs (possibly more than one) that it is able to handle. The
ONNXRuntime-provided default execution provider will be the last one
considered, and it ensures completeness. More sophisticated optimizations can be
considered in the future (or can even be implemented as a composite execution
provider).

Conceptually, each partition is reduced to a single fused operator. It is
created by invoking the execution provider's Compile() method and is wrapped
as a custom operator. Currently we support only synchronous mode of execution. An execution
provider exposes its memory allocator, which is used to allocate the input
tensors for the execution provider. The rewriting and partitioning transform the
initial model graph into a new graph composed of operators assigned to either
the default execution provider or other registered execution
providers. The ONNXRuntime execution engine is responsible for running this graph.
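
There is no public Python binding for GetCapability() itself, but the outcome of partitioning can be observed indirectly: with verbose logging enabled, ONNX Runtime logs which execution provider each node (or fused subgraph) was assigned to during session construction. A sketch, noting that the exact wording of these log lines varies across versions:

    import onnxruntime as ort

    so = ort.SessionOptions()
    so.log_severity_level = 0  # 0 = VERBOSE; node placement is logged at this level

    session = ort.InferenceSession(
        "model.onnx",  # placeholder model file
        sess_options=so,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )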

Notes on Key Decisions

  • Multiple threads can invoke the Run() method on the same inference
    session object; in other words, ONNX Runtime is thread-safe, and a single
    session object can be shared across threads (see the sketch after this
    list). See the API doc for more details.
  • To facilitate this, the Compute() function of all kernels is const,
    implying the kernels are stateless.
  • Implementations of the operators by execution providers are called
    kernels. Each execution provider supports a subset of the (ONNX)
    operators/kernels.
  • The ONNX Runtime guarantees that all operators are supported by the
    default execution provider.
  • Tensor representation: ONNXRuntime uses a standard representation for
    tensor runtime values. Execution providers can internally use a different
    representation if they choose to, but it is their responsibility to
    convert values from/to the standard representation at the boundaries of
    their subgraph.
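
To illustrate the thread-safety point from the first bullet: several threads can share one inference session and call run() concurrently, since kernels hold no mutable state. A minimal sketch (model file and input shape are placeholders):

    import threading

    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(
        "model.onnx",  # placeholder model file
        providers=["CPUExecutionProvider"],
    )
    input_name = session.get_inputs()[0].name

    def worker(i: int) -> None:
        # Safe to call concurrently on one session: kernels are stateless,
        # so run() does not mutate shared session state.
        x = np.random.rand(1, 3, 224, 224).astype(np.float32)
        session.run(None, {input_name: x})
        print(f"thread {i} done")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()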



Reposted from blog.csdn.net/xxradon/article/details/104099603