PyTorch model deployment: Introduction to TensorRT

Introduction

TensorRT (TRT) is a tool that can significantly speed up the inference of deep learning models. Used well, it can greatly improve GPU utilization and model running speed.

TensorRT (TRT) is a fast GPU inference framework. Its usual workflow is to compile an engine from an existing model file. During engine compilation, TRT searches for the optimal kernel implementation for each layer's computation, so the compiled engine executes very efficiently. The process is quite similar to C++ compilation.

For details about TRT, the official NVIDIA documentation should be treated as authoritative. Its speed advantage at inference time is well established; how fast it is and whether it fits your business scenario is something you have to judge for yourself. The general workflow and some basic optimizations are also explained in NVIDIA's official material; you can refer to the article deploying-deep-learning-nvidia-tensorrt. For a quick overview of TRT and its installation process, you can also refer to TensorRT-Introduction-Use-Install.

Model building

TRT compiles the model structure and parameters, together with the corresponding kernel implementations, into a binary engine, which greatly speeds up inference after deployment. To run inference with TRT, an engine needs to be created. There are two ways to create an engine in TRT:

  • Compiling it from the network model structure and parameter files, which is very slow.
  • Reading an existing engine (a GIE file), which is faster because it skips model parsing and related steps.

The first method is very slow, but the first time you deploy a model, or whenever you change the model's precision, input data type, network structure, and so on, the engine must be recompiled (TRT actually also has a mechanism for refitting parameters into an existing engine, which is not covered in this article).

Now suppose we are using TRT for the first time, so we can only choose the first way to create an engine. To create an engine we need two files, the model structure and the model parameters, plus a way to parse them. In TRT, the engine is compiled by an IBuilder object, so we first need to create a new IBuilder object:

nvinfer1::IBuilder *builder = createInferBuilder(gLogger);

Here gLogger is TRT's logging interface ILogger: inherit from this interface, create your own logger object, and pass it in.
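As a minimal sketch (not from the original article; the exact virtual signature may differ slightly between TRT versions), such a logger could look like this:

#include "NvInfer.h"
#include <iostream>

class Logger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, const char *msg) override
    {
        // Only print warnings and errors; skip info/verbose messages.
        if (severity <= Severity::kWARNING)
            std::cerr << "[TRT] " << msg << std::endl;
    }
} gLogger;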

To compile an engine, the builder first needs to create an INetworkDefinition as a container for the model:

nvinfer1::INetworkDefinition *network = builder->createNetwork();

Note that at this point network is empty; we need to fill in the model structure and parameters, that is, parse our own model structure and parameter files and load their data into it.

TRT officially provides parsers for three mainstream framework model formats:

  • ONNX:IOnnxParser parser = nvonnxparser::createParser(*network, gLogger);
  • Caffe:ICaffeParser parser = nvcaffeparser1::createCaffeParser();
  • UFF :IUffParser parser = nvuffparser::createUffParser();

Among them, UFF is the format used for TensorFlow. The corresponding files can be parsed by calling these three parsers. Taking ICaffeParser as an example, call its parse method to fill network.

virtual const IBlobNameToTensor* nvcaffeparser1::ICaffeParser::parse(
    const char* deploy,
    const char* model,
    nvinfer1::INetworkDefinition& network,
    nvinfer1::DataType weightType)

// Parameters
// deploy       The plain text prototxt file used to define the network configuration.
// model        The binaryproto Caffe model that contains the weights associated with the network.
// network      The network in which the CaffeParser will fill the layers.
// weightType   The type to which the weights will be transformed.

In this way you obtain a filled network and can compile the engine; it seems everything is wonderful...

However, real-world TRT is not that perfect. For example, many TensorFlow operations are not supported, so the files you pass in often cannot be parsed at all (one of the most common dilemmas of deep learning frameworks). In that case we have to fill the network ourselves, which means calling TRT's low-level interfaces to create the model structure, much like you would do in TensorFlow or Caffe.

TRT provides a richer interface so that you can create your own network directly through these interfaces, such as adding a convolutional layer:

virtual IConvolutionLayer* nvinfer1::INetworkDefinition::addConvolution(ITensor &input, 
                                                                        int nbOutputMaps,
                                                                        DimsHW kernelSize,
                                                                        Weights kernelWeights,
                                                                        Weights biasWeights)		

// Parameters
// input	The input tensor to the convolution.
// nbOutputMaps	The number of output feature maps for the convolution.
// kernelSize	The HW-dimensions of the convolution kernel.
// kernelWeights	The kernel weights for the convolution.
// biasWeights	The optional bias weights for the convolution.

The parameters here have meanings similar to those in other deep learning frameworks, so there is not much to explain; just wrap your data into TRT's data structures. Perhaps the one difference from building a training network as usual is that you need to fill in the model's parameters, because TRT is an inference framework and the parameters are already known and fixed. In general this means reading the trained model, constructing TRT's data structure types, and putting the values in, which in turn means you need to parse the model parameter file yourself.
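For example, a rough sketch (not from the original article) of filling one convolution layer by hand might look like the following; the weight-loading helpers loadKernelFromFile and loadBiasFromFile, and the input tensor, are hypothetical placeholders for your own parsing code:

#include "NvInfer.h"
#include <vector>

// Assumed: weights already read from your own parameter file into float buffers.
std::vector<float> kernel_data = loadKernelFromFile();   // hypothetical helper
std::vector<float> bias_data   = loadBiasFromFile();     // hypothetical helper

nvinfer1::Weights kernelWeights{nvinfer1::DataType::kFLOAT,
                                kernel_data.data(),
                                static_cast<int64_t>(kernel_data.size())};
nvinfer1::Weights biasWeights{nvinfer1::DataType::kFLOAT,
                              bias_data.data(),
                              static_cast<int64_t>(bias_data.size())};

// input is an ITensor* obtained from network->addInput(...) or from a previous layer.
nvinfer1::IConvolutionLayer *conv =
    network->addConvolution(*input, /*nbOutputMaps=*/64,
                            nvinfer1::DimsHW{3, 3},
                            kernelWeights, biasWeights);
conv->setStride(nvinfer1::DimsHW{1, 1});
conv->setPadding(nvinfer1::DimsHW{1, 1});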

The reason TRT's network-building interfaces are relatively rich is that even with these low-level interfaces, many operations still cannot be expressed, that is, there is no corresponding add* method, not to mention that real business often involves many custom layers. For this reason TRT provides a plugin interface that lets you define your own add*-style operation. The flow is: inherit the nvinfer1::IPluginV2 interface and implement your custom layer with CUDA; then inherit nvinfer1::IPluginCreator to write its creator class, overriding its virtual method createPlugin; finally call the REGISTER_TENSORRT_PLUGIN macro to register the plugin. The member functions of the plugin interface are introduced below.

// Get the number of outputs of this custom layer, e.g. a leaky relu layer has 1 output
virtual int getNbOutputs() const = 0;

// Get the dimensions of the output Tensor
virtual Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) = 0;

// Configure the layer's parameters. This function is called by the builder before initialize().
// It gives the layer a chance to make algorithm choices based on its weights, dimensions and maximum batch size.
virtual void configure(const Dims* inputDims, int nbInputs, const Dims* outputDims, int nbOutputs, int maxBatchSize) = 0;

// Initialize the layer; called when the engine is created.
virtual int initialize() = 0;

// Called when the engine is destroyed
virtual void terminate() = 0;

// Get the amount of temporary GPU memory (workspace) required by this layer.
virtual size_t getWorkspaceSize(int maxBatchSize) const = 0;

// Execute the layer
virtual int enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream) = 0;

// Get the amount of memory required to serialize this layer
virtual size_t getSerializationSize() = 0;

// Serialize the layer: based on the size returned by getSerializationSize(), write all of the class's parameters and extra memory into the serialization buffer.
virtual void serialize(void* buffer) = 0;

We need to override all or part of these functions according to the behavior of our own layer. There are many details here that cannot all be covered; when you need a custom layer, you still have to consult the official API.
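As a rough skeleton of the creator class and registration step (assuming a MyLeakyReluPlugin class derived from nvinfer1::IPluginV2, implementing the member functions above, has already been written; all names here are hypothetical):

#include "NvInfer.h"
#include <string>

class MyPluginCreator : public nvinfer1::IPluginCreator
{
public:
    const char* getPluginName() const override { return "MyLeakyRelu"; }
    const char* getPluginVersion() const override { return "1"; }
    const nvinfer1::PluginFieldCollection* getFieldNames() override { return &mFC; }

    // Called when the network is built: create the plugin from its attributes.
    nvinfer1::IPluginV2* createPlugin(const char* name,
                                      const nvinfer1::PluginFieldCollection* fc) override
    {
        return new MyLeakyReluPlugin(/* parse fields from fc here */);
    }

    // Called during engine deserialization: rebuild the plugin from serialized data.
    nvinfer1::IPluginV2* deserializePlugin(const char* name,
                                           const void* serialData,
                                           size_t serialLength) override
    {
        return new MyLeakyReluPlugin(serialData, serialLength);
    }

    void setPluginNamespace(const char* ns) override { mNamespace = ns; }
    const char* getPluginNamespace() const override { return mNamespace.c_str(); }

private:
    std::string mNamespace;
    nvinfer1::PluginFieldCollection mFC{};
};

REGISTER_TENSORRT_PLUGIN(MyPluginCreator);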

Once the network model is built, the engine can be compiled, but some settings for the engine are needed first, such as the computation precision and the maximum supported batch size, because engines compiled with different settings are different.

TRT supports FP16 computation, which is also the officially recommended precision, and setting it is simple; just call:

builder->setFp16Mode(true);

In addition, when setting the precision, there is an interface for setting the strict policy:

builder->setStrictTypeConstraints(true);

This interface controls whether type conversion strictly follows the configured precision. If the strict policy is not set, TRT may choose a higher-precision computation type for some operations (without hurting performance).

In addition to precision, the maximum batch size and the workspace size need to be set:

builder->setMaxBatchSize(batch_size);
builder->setMaxWorkspaceSize(workspace_size);

The batch size here is the largest batch size supported at runtime; a smaller batch size can be chosen at run time, and the workspace is also sized relative to this maximum batch size.

After setting the above parameters, you can compile the engine.

nvinfer1::ICudaEngine *engine = builder->buildCudaEngine(*network);

Compilation takes a long time, wait patiently.

Engine serialization and deserialization

Compiling the engine takes a long time. When the model, computation precision, batch size, and so on remain unchanged, we can save the engine locally for the next run, that is, serialize the engine. TRT provides a convenient serialization method:

nvinfer1::IHostMemory *modelStream = engine->serialize();

This call produces a binary stream, which can be saved by writing it to a file.
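For example, a simple sketch of saving the stream (the file name is arbitrary):

#include <fstream>

std::ofstream engine_file("model.engine", std::ios::binary);
engine_file.write(static_cast<const char *>(modelStream->data()), modelStream->size());
engine_file.close();
modelStream->destroy();  // release the serialized stream once it has been written out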

If you need to deploy again, you can directly deserialize the saved file and skip the compilation step.

IRuntime* runtime = createInferRuntime(gLogger);
ICudaEngine* engine = runtime->deserializeCudaEngine(modelData, modelSize, nullptr);
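As a sketch of where modelData and modelSize above might come from, the saved engine file can be read back into a byte buffer (file name assumed):

#include <fstream>
#include <vector>

std::ifstream engine_file("model.engine", std::ios::binary | std::ios::ate);
const size_t modelSize = engine_file.tellg();   // file size in bytes
engine_file.seekg(0, std::ios::beg);
std::vector<char> buffer(modelSize);
engine_file.read(buffer.data(), modelSize);
const void *modelData = buffer.data();          // passed to deserializeCudaEngine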

Using the engine for prediction

Once you have the engine, you can use it for inference.

First create an inference context. This context is similar to a namespace and is used to store variables of an inference task.

IExecutionContext *context = engine->createExecutionContext();

An engine can have multiple contexts, which means one engine can perform multiple prediction tasks at the same time.

Next we bind the input and output indices. This step is needed because, while building the engine, TRT maps the inputs and outputs to a sequence of index numbers, so we can only access the input and output layers through their indices. Although TRT provides an interface for looking up an index by name, storing the indices locally makes subsequent operations easier.

We can first get the number of bindings:

int index_number = engine->getNbBindings();

We can check whether this number matches the total number of inputs and outputs of our network. For example, if you have one input and one output, the number of bindings should be 2. If it does not match, something is wrong with the engine; if it does, we can get the index corresponding to each input and output by name:

int input_index = engine->getBindingIndex(input_layer_name);
int output_index = engine->getBindingIndex(output_layer_name);

For a network with a single input and a single output, the input index is 0 and the output index is 1, so this step is not strictly necessary.
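If you want to double-check the bindings, a small sketch like the following (not in the original) prints each binding's name and direction:

#include <iostream>

for (int i = 0; i < engine->getNbBindings(); ++i)
{
    std::cout << "binding " << i << ": " << engine->getBindingName(i)
              << (engine->bindingIsInput(i) ? " (input)" : " (output)") << std::endl;
}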

Next, you need to allocate memory for the input and output layers. To allocate GPU memory we need to know the dimensions of the inputs and outputs and the type of data they store. Dimensions and data types are represented in TRT as follows:

class Dims
{
public:
    static const int MAX_DIMS = 8; //!< The maximum number of dimensions supported for a tensor.
    int nbDims;                    //!< The number of dimensions.
    int d[MAX_DIMS];               //!< The extent of each dimension.
    DimensionType type[MAX_DIMS];  //!< The type of each dimension.
};

enum class DataType : int
{
    kFLOAT = 0, //!< FP32 format.
    kHALF = 1,  //!< FP16 format.
    kINT8 = 2,  //!< quantized INT8 format.
    kINT32 = 3  //!< INT32 format.
};

We obtain the data dimensions (dims) and data type (dtype) of each input and output through its index, and then allocate GPU memory for each output layer to store the output results:

for (int i = 0; i < index_number; ++i)
{
    nvinfer1::Dims dims = engine->getBindingDimensions(i);
    nvinfer1::DataType dtype = engine->getBindingDataType(i);
    // Get the data length
    auto buff_len = std::accumulate(dims.d, dims.d + dims.nbDims, 1, std::multiplies<int64_t>());
    // ...
    // Get the size of the data type
    dtype_size = getTypeSize(dtype);    // custom function
}

// Allocate GPU memory for each output
for (auto &output_i : outputs)
{
    cudaMalloc(&output_i, buffer_len_i * dtype_size_i * batch_size);
}

The code in this article is pseudo-code that only conveys the logic, so some simple custom functions are involved.
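For instance, the custom getTypeSize() helper used above could be implemented as a simple mapping from a TRT data type to its size in bytes:

size_t getTypeSize(nvinfer1::DataType dtype)
{
    switch (dtype)
    {
    case nvinfer1::DataType::kFLOAT: return 4;
    case nvinfer1::DataType::kHALF:  return 2;
    case nvinfer1::DataType::kINT8:  return 1;
    case nvinfer1::DataType::kINT32: return 4;
    }
    return 0;
}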

At this point the preparations are complete, and we can feed data into the model for inference.

Forward prediction

The forward prediction execution of TRT is asynchronous, and the context submits tasks through an enqueue call:

cudaStream_t stream;
cudaStreamCreate(&stream);
context->enqueue(batch_size, buffers, stream, nullptr);
cudaStreamSynchronize(stream);

enqueue is the function in TRT that actually executes the task; we also had to implement this interface when writing a plugin. Its parameters are:

  • batch_size: at most the max_batch_size passed in when the engine was built.

  • buffers: an array of pointers whose indices correspond to the binding indices of the input and output layers. It holds the input data pointers and the output storage addresses (that is, the allocated GPU memory addresses).

  • stream: a stream is the CUDA concept of a sequence of ordered operations. For our model, all operations are executed on the specified device in the order given by the network structure.

    A CUDA stream is a set of asynchronous CUDA operations that are executed on the device in the order in which they are issued by the host code. A stream maintains the order of these operations, lets them enter the work queue once all preprocessing is done, and also allows some query operations on them. These operations include host-to-device data transfers, kernel launches, and other actions initiated by the host and executed on the device. Their execution is always asynchronous, and the CUDA runtime decides the appropriate time to run them. We can use the corresponding CUDA APIs to make sure results are read only after all operations are complete. Operations within the same stream have a strict execution order, while different streams have no such restriction.

Note that the input and output data buffers in this array are on the GPU; the input data must be copied from the CPU to the GPU with cudaMemcpy (the GPU memory must be allocated in advance). Likewise, the output data must be copied back from the GPU to the CPU.

In the snippet above, the first two lines create a CUDA stream, and the last line waits for the asynchronous stream to finish, after which the data can be copied out of GPU memory.
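Putting the copies and the enqueue call together, a sketch of the full data flow might look like this (host_input, host_output, input_size and output_size are assumed to be prepared by the caller):

cudaMemcpyAsync(buffers[input_index], host_input, input_size,
                cudaMemcpyHostToDevice, stream);               // CPU -> GPU
context->enqueue(batch_size, buffers, stream, nullptr);        // asynchronous inference
cudaMemcpyAsync(host_output, buffers[output_index], output_size,
                cudaMemcpyDeviceToHost, stream);               // GPU -> CPU
cudaStreamSynchronize(stream);                                 // wait for everything to finish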

At this point, we have walked through a basic TRT prediction workflow.

Summary

This article only describes the TRT prediction workflow and some common calls; it does not cover specific networks or concrete implementations, nor many coding details. Different networks and operations may require writing extension plugins, and topics such as host and GPU memory allocation and management and the teardown and cleanup of TRT objects are beyond the scope of this article.

Origin: blog.csdn.net/ahelloyou/article/details/114870232