Introduction to TensorRT

 

What is TensorRT

TensorRT is a high-performance neural network inference library developed by NVIDIA in C++. It is an optimizer and runtime engine for production deployment, and its high-performance computing relies on NVIDIA graphics processors. It focuses on inference and complements the commonly used training frameworks, including TensorFlow, Caffe, PyTorch, MXNet, and more. It can load trained model files from these frameworks directly, and it also provides API interfaces for building models yourself through programming.

 

 

TensorRT depends on NVIDIA deep learning hardware, either a GPU or a DLA, and cannot be used without one.

TensorRT supports most of the commonly used neural network layer definitions and provides APIs for developers to implement custom layer operations themselves. The related function is INetworkDefinition::addPluginV2().
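For a custom layer, the call looks roughly like the following minimal sketch. It assumes the network and an IPluginV2 implementation already exist; the plugin would be obtained from the plugin registry or written by the developer, and the layer name is a made-up example.

#include <NvInfer.h>

// Minimal sketch: attach an existing IPluginV2 implementation to a network.
// Assumes `network` and `plugin` were created elsewhere (the plugin is
// typically obtained from the plugin registry or implemented by the developer).
nvinfer1::IPluginV2Layer* addCustomLayer(nvinfer1::INetworkDefinition& network,
                                         nvinfer1::IPluginV2& plugin)
{
    nvinfer1::ITensor* inputs[] = { network.getInput(0) };
    nvinfer1::IPluginV2Layer* layer = network.addPluginV2(inputs, 1, plugin);
    layer->setName("my_custom_layer");  // hypothetical name, for easier debugging
    return layer;
}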

 

Key interface types

In the TensorRT core library, the most critical interface types are:

  • IExecutionContext: the execution context in which inference runs
  • ICudaEngine: the inference engine
  • IRuntime: deserializes a cached CudaEngine
  • INetworkDefinition: the network definition
  • IParser: parses trained network model files
  • IOptimizationProfile: optimization configuration
  • IBuilderConfig: construction parameters for a CudaEngine
  • IBuilder: the builder, mainly used to construct a CudaEngine
  • ILogger: the logging interface, which must be implemented by the developer

IExecutionContext

The execution context (Context) of the inference engine: it uses a CudaEngine to perform inference and is the interface on which inference operations are ultimately executed.

A CudaEngine may have multiple execution contexts, and each context can use a different batch size. If the network model supports dynamically sized input, each context can also use its own input dimensions.

Its main job is to perform the inference operation itself; the specific function is IExecutionContext::executeV2(bindings).

It is created from a CudaEngine via ICudaEngine::createExecutionContext().
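A minimal sketch of synchronous inference through a context is shown below. It assumes the engine has already been built or deserialized, and that inputDev and outputDev are device buffers allocated with cudaMalloc to match the engine's bindings.

#include <NvInfer.h>

// Minimal sketch: run synchronous inference with an already-built engine.
// `inputDev` and `outputDev` are assumed to be device pointers whose sizes
// match the engine's input and output bindings.
bool runInference(nvinfer1::ICudaEngine& engine, void* inputDev, void* outputDev)
{
    nvinfer1::IExecutionContext* context = engine.createExecutionContext();
    if (!context)
        return false;

    // The bindings array is ordered by binding index
    // (ICudaEngine::getBindingIndex gives each tensor's slot).
    void* bindings[] = { inputDev, outputDev };
    const bool ok = context->executeV2(bindings);

    context->destroy();  // older TensorRT releases; newer ones use delete
    return ok;
}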

 

ICudaEngine

This is the inference engine proper. Applications call this interface to perform inference; it supports both synchronous and asynchronous execution, with asynchrony implemented through CUDA streams and events. An inference engine can have multiple execution contexts and supports batched input.

A CudaEngine performs inference tasks by creating a Context; the way to create one is ICudaEngine::createExecutionContext().

A CudaEngine can be serialized into memory and then cached to disk. The next time it is needed, the cached file can be loaded from disk into memory and deserialized back into a CudaEngine, which saves a lot of build time and parameter configuration.

Creating a CudaEngine depends on INetworkDefinition (the network definition interface). An INetworkDefinition is generally obtained by parsing an ONNX model file or a trained TensorFlow model; parsing ONNX model files requires the nvonnxparser::IParser interface.

Note: generating a CudaEngine from an INetworkDefinition is a time-consuming process. The generated CudaEngine can be cached to a disk file and loaded directly for subsequent use.

In addition, a CudaEngine is not portable: engines generated on different GPU models or with different TensorRT versions may be incompatible. When using cached CudaEngine files, take care to distinguish them; these factors can be encoded in the file name, for example Win64_RTX2080TI_70011.engine.
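A minimal sketch of caching a built engine to disk follows; the file name is just an example of the naming idea above, and error handling is kept to a minimum.

#include <NvInfer.h>
#include <fstream>

// Minimal sketch: serialize a built engine and cache it to a disk file.
bool saveEngine(nvinfer1::ICudaEngine& engine, const char* path)
{
    nvinfer1::IHostMemory* blob = engine.serialize();
    if (!blob)
        return false;

    std::ofstream file(path, std::ios::binary);
    file.write(static_cast<const char*>(blob->data()),
               static_cast<std::streamsize>(blob->size()));

    blob->destroy();  // older TensorRT releases; newer ones use delete
    return file.good();
}

// Usage with the naming scheme suggested above (hypothetical file name):
// saveEngine(*engine, "Win64_RTX2080TI_70011.engine");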

Related interfaces:

  • IExecutionContext: creates a Context and performs inference through it; related function: ICudaEngine::createExecutionContext()
  • IRuntime: deserializes a cached engine file back into a CudaEngine; related function: IRuntime::deserializeCudaEngine()
  • IBuilder: constructs a CudaEngine from an INetworkDefinition and an IBuilderConfig; related function: IBuilder::buildEngineWithConfig(INetworkDefinition, IBuilderConfig)

 

IRuntime

The name of this interface is easy to misread: it sounds like a very low-level interface, but in practice it has essentially one job, which is to deserialize a CudaEngine's serialized cache file and obtain the CudaEngine object again.

How to obtain it: nvinfer1::createInferRuntime(ILogger&)

Related interfaces:

  • ICudaEngine: obtained by deserializing a cached engine file; related function: IRuntime::deserializeCudaEngine() (a loading sketch follows below)
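A minimal sketch of loading a cached engine from disk follows; the logger is the user-provided ILogger implementation described later in this article, and the path is a placeholder.

#include <NvInfer.h>
#include <fstream>
#include <iterator>
#include <vector>

// Minimal sketch: read a cached engine file and deserialize it.
// `logger` is a user-provided ILogger implementation (see the ILogger section).
nvinfer1::ICudaEngine* loadEngine(const char* path, nvinfer1::ILogger& logger)
{
    std::ifstream file(path, std::ios::binary);
    if (!file)
        return nullptr;
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    // Note: the runtime is intentionally not destroyed here; it should outlive
    // the engine it produces. The third argument (plugin factory) was removed
    // in newer TensorRT releases.
    return runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
}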

INetworkDefinition

The network definition interface. It provides a series of functions that let developers construct a neural network from scratch, including the dimensions of the input and output tensors, the type and activation function of each layer, and so on. It also lets developers add custom layers, which makes it a very powerful interface. In practice, however, its functions are rarely called directly, because the network definition is usually generated automatically from a trained model file such as an ONNX file.

The general steps for using this interface (a minimal sketch follows the list):

  1. Create an INetworkDefinition through IBuilder::createNetwork().
  2. Use nvonnxparser::createParser(INetworkDefinition&, ...) to create an IParser object bound to this INetworkDefinition.
  3. Call IParser::parseFromFile("path.onnx"), which populates the bound INetworkDefinition from the model file.
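A minimal sketch of these steps follows; the ONNX path is a placeholder. Note that recent TensorRT releases require an explicit-batch network created with createNetworkV2 when using the ONNX parser.

#include <NvInfer.h>
#include <NvOnnxParser.h>

// Minimal sketch: populate an INetworkDefinition by parsing an ONNX file.
nvinfer1::INetworkDefinition* parseOnnx(nvinfer1::IBuilder& builder,
                                        nvinfer1::ILogger& logger,
                                        const char* onnxPath)
{
    const auto flags = 1U << static_cast<uint32_t>(
        nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    nvinfer1::INetworkDefinition* network = builder.createNetworkV2(flags);

    nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, logger);
    const int verbosity = static_cast<int>(nvinfer1::ILogger::Severity::kWARNING);
    if (!parser->parseFromFile(onnxPath, verbosity))
        return nullptr;  // parse errors are reported through the logger

    return network;
}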

IParser

The ONNX parser. It parses a trained model file and populates the bound INetworkDefinition object according to the model file.

How to obtain it: nvonnxparser::createParser(INetworkDefinition&, ...)

Main function: IParser::parseFromFile("path.onnx")

IOptimizationProfile

Specifies the range of dimensions of each dynamic input tensor of the model; the function is IOptimizationProfile::setDimensions().

At least one IOptimizationProfile is required when constructing a CudaEngine, because every ExecutionContext must be assigned an IOptimizationProfile before it can perform inference.

How to obtain it: IBuilder::createOptimizationProfile(); a usage sketch follows the related-interface list below.

Related interfaces:

  • IBuilderConfig: each IBuilderConfig must have at least one IOptimizationProfile; the profiles are built into the CudaEngine along with the IBuilderConfig and are later selected by an ExecutionContext. The related function is IBuilderConfig::addOptimizationProfile(IOptimizationProfile).
  • IExecutionContext: after each Context is created, it needs to be assigned an IOptimizationProfile via IExecutionContext::setOptimizationProfile(index), where index is the profile's position within the IBuilderConfig, following the order of the addOptimizationProfile calls.
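A minimal sketch of defining a profile for a dynamic-shape input follows; the input name "input" and the NCHW dimensions are made-up examples.

#include <NvInfer.h>

// Minimal sketch: create an optimization profile for a dynamic-shape input
// and register it with the builder configuration.
void addProfile(nvinfer1::IBuilder& builder, nvinfer1::IBuilderConfig& config)
{
    nvinfer1::IOptimizationProfile* profile = builder.createOptimizationProfile();

    profile->setDimensions("input", nvinfer1::OptProfileSelector::kMIN,
                           nvinfer1::Dims4{1, 3, 224, 224});
    profile->setDimensions("input", nvinfer1::OptProfileSelector::kOPT,
                           nvinfer1::Dims4{8, 3, 224, 224});
    profile->setDimensions("input", nvinfer1::OptProfileSelector::kMAX,
                           nvinfer1::Dims4{16, 3, 224, 224});

    config.addOptimizationProfile(profile);
}

// Later, each execution context selects a profile by index, for example:
// context->setOptimizationProfile(0);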

IBuilderConfig

The configuration parameters for constructing a CudaEngine. Besides adding IOptimizationProfile entries, it can set the maximum workspace memory size, the maximum batch size, the minimum acceptable precision level, half-precision (FP16) computation, and so on. A sketch follows the related-interface list below.

How to obtain it: IBuilder::createBuilderConfig()

Related interfaces:

  • IBuilder: constructs a CudaEngine from an INetworkDefinition and an IBuilderConfig; function: IBuilder::buildEngineWithConfig(INetworkDefinition, IBuilderConfig)
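A minimal sketch of configuring the build and creating the engine follows; the workspace size and FP16 flag are illustrative choices, not requirements.

#include <NvInfer.h>

// Minimal sketch: configure the build and create the CudaEngine.
nvinfer1::ICudaEngine* buildEngine(nvinfer1::IBuilder& builder,
                                   nvinfer1::INetworkDefinition& network)
{
    nvinfer1::IBuilderConfig* config = builder.createBuilderConfig();

    config->setMaxWorkspaceSize(1ULL << 30);            // 1 GiB scratch memory
    if (builder.platformHasFastFp16())
        config->setFlag(nvinfer1::BuilderFlag::kFP16);  // allow half precision

    // This is the time-consuming step mentioned above; cache the result to disk.
    return builder.buildEngineWithConfig(network, *config);
}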

IBuilder

The IBuilder interface is mainly used to build the CudaEngine; it is also used to create INetworkDefinition, IOptimizationProfile, and IBuilderConfig objects. It is obtained via nvinfer1::createInferBuilder(ILogger&).

ILogger

The logging interface, used to output messages, warnings, errors, and other information from inside TensorRT.

When creating an IBuilder or an IRuntime, an ILogger object must be passed in, so we need to implement this interface and create an object to pass to them. The simplest implementation of the interface looks like this:

#include <NvInfer.h>
#include <iostream>

class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) override
    {
        // suppress info-level messages
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;
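The gLogger object is then passed in when creating the builder or the runtime, for example:

nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);

Note that newer TensorRT releases (8.x and later) also require the log override to be marked noexcept.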

Flow chart

[Figure: flow chart of the TensorRT interface workflow]

Summary

This article has introduced the interfaces required to use TensorRT; mastering the calling relationships between them gives an understanding of the TensorRT workflow. Later articles will cover the specific details of each step in combination with an actual project.

Reference

https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html

 
