Three ways to convert ONNX to TensorRT (and finally run it in Python)

Background

This post records three ways to convert an ONNX model to TensorRT for accelerated inference.

1. Use onnxruntime directly

When the onnxruntime session is initialized, put TensorrtExecutionProvider first in the provider list. The library automatically checks whether TensorRT is actually available; if it is, the model is converted and run with TensorRT, otherwise it falls back to the next provider. TensorRT may also throw an error partway through, in which case you have to figure out what is wrong with the environment.
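A minimal sketch of that initialization (my own example, assuming an onnxruntime-gpu build that ships the TensorRT provider; the fallback order is just one reasonable choice):

import onnxruntime as ort

# TensorRT is tried first; onnxruntime falls back to CUDA and then CPU if it cannot be used.
providers = ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession('model.onnx', providers=providers)
print(session.get_providers())  # shows which providers were actually registered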
But this method has a serious drawback: the conversion is redone every time the program runs, and I have not yet found a way to save the converted engine. For engineering applications the initialization simply takes too long.

2. Use trtexec.exe

I assume anyone who wants to use TensorRT has already downloaded the TensorRT package; if not, you can get it here. There is a trtexec.exe in the bin folder, and we can run this program directly from the command line to convert an ONNX model.
Here is an example of converting ONNX to a TensorRT engine:

trtexec.exe --onnx=model.onnx --saveEngine=model.trt --fp16

This converts model.onnx, saves the resulting engine as model.trt (the file extension does not matter), and builds with FP16 precision (use it according to your needs: precision drops slightly and speed improves, but some models produce wrong results in FP16). There are other parameters as well; see the help output of trtexec.exe and decide for yourself.
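For example, if the model has a dynamic batch dimension, shape ranges can be passed as well; the tensor name input and the dimensions here are placeholders, not from the original post:

trtexec.exe --onnx=model.onnx --saveEngine=model.trt --fp16 --minShapes=input:1x3x224x224 --optShapes=input:4x3x224x224 --maxShapes=input:8x3x224x224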

3. Use a Python program to convert

For the conversion itself, refer to an article I wrote earlier.
If you need FP16 precision, you only have to add a single line that enables the FP16 builder flag.
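The image that showed that line is lost here. As a rough sketch of the builder flow (my own reconstruction, not the original article; it uses the same TensorRT 7/8-era Python API as the inference code below, where builder.build_engine and config.max_workspace_size still exist), the FP16 flag is the single config.set_flag line:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, engine_path, use_fp16=True):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError('failed to parse the ONNX model')
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1 GiB scratch space
    if use_fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # this is the one extra line for FP16
    engine = builder.build_engine(network, config)
    with open(engine_path, 'wb') as f:
        f.write(engine.serialize())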

Running inference with the engine in Python

1. Create trt_session.py

import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit


TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine, context):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for i, binding in enumerate(engine):
        size = trt.volume(context.get_binding_shape(i))
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream


# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream, batch_size):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
    
    
class TensorRTSession():
    def __init__(self, model_path):
        # Deserialize the engine from disk and create an execution context.
        runtime = trt.Runtime(TRT_LOGGER)
        trt.init_libnvinfer_plugins(TRT_LOGGER, '')
        with open(model_path, 'rb') as f:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        self.inputs_info = None
        self.outputs_info = None
        # Buffers and the CUDA stream are allocated once and reused for every call.
        self.inputs, self.outputs, self.bindings, self.stream = allocate_buffers(self.engine, self.context)

    class nodes_info():
        def __init__(self, name, shape):
            self.name = name
            self.shape = shape

    def __call__(self, inputs):
        return self.update(inputs)

    def get_inputs(self):
        inputs_info = []
        for i, binding in enumerate(self.engine):
            if self.engine.binding_is_input(binding):
                shape = self.context.get_binding_shape(i)
                inputs_info.append(self.nodes_info(binding, shape))
                # print(binding, shape)
        self.inputs_info = inputs_info
        return inputs_info

    def get_outputs(self):
        outputs_info = []
        for i, binding in enumerate(self.engine):
            if not self.engine.binding_is_input(binding):
                shape = self.context.get_binding_shape(i)
                outputs_info.append(self.nodes_info(binding, shape))

        self.outputs_info = outputs_info
        return outputs_info

    def update(self, input_arr, cuda_ctx=pycuda.autoinit.context):
        cuda_ctx.push()

        # Point the host buffers at contiguous versions of the input arrays, then do inference.
        for i in range(len(input_arr)):
            self.inputs[i].host = np.ascontiguousarray(input_arr[i])

        trt_outputs = do_inference(self.context, bindings=self.bindings, inputs=self.inputs, outputs=self.outputs,
                                   stream=self.stream, batch_size=1)
        if cuda_ctx:
            cuda_ctx.pop()

        # Only one output here; reshape it to the binding shape recorded by get_outputs().
        trt_outputs = trt_outputs[0].reshape(self.outputs_info[0].shape)
        return trt_outputs
The final trt_outputs reshape may need to be adjusted to the actual number of outputs; I wrote it this way because my model has only one output. I previously ran into GPU memory overflow caused by repeatedly creating a new Stream for every inference; I am not sure whether this version still has that problem.
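If the model has several outputs, one hypothetical adjustment (not in the original code) is to replace that final reshape in update() with a comprehension over all outputs:

# Sketch: reshape every flat host output to the binding shape recorded by get_outputs().
trt_outputs = [out.reshape(info.shape) for out, info in zip(trt_outputs, self.outputs_info)]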

2. Calling the session

This part is abstracted from my own code, so there may be some issues; if you find any, please let me know in the comments.

# Initialization
session = TensorRTSession(model_path)
session.get_inputs()
session.get_outputs()
# Inference
session((tensor1, tensor2))
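A more concrete, hypothetical end-to-end example (model.trt and the 1x3x224x224 float32 input are placeholder assumptions):

import numpy as np

session = TensorRTSession('model.trt')
print([(node.name, node.shape) for node in session.get_inputs()])
session.get_outputs()  # must be called once so update() knows the output shape
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
output = session((dummy,))
print(output.shape)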

3. Possible problems

In my experience the most likely issue is TensorRT reporting a version incompatibility, which took me a long time to figure out.
After picking a TensorRT build that matched my CUDA version, I started running and still got the same version-incompatibility error, even though according to the documentation my CUDA and TensorRT versions were compatible, which confused me. I had not hit this problem with C++ inference before, so at first I suspected it was specific to the Python version of TensorRT. It turned out that the PyTorch installed in the environment brought along its own cuDNN, and TensorRT conflicted with that cuDNN version. My solution at the time was to switch PyTorch to the CPU build; if you only need inference, you can also create a fresh environment and install only TensorRT.
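A quick diagnostic sketch (my own addition) to see which versions each library reports, which can help spot this kind of cuDNN conflict:

import tensorrt as trt
print('TensorRT:', trt.__version__)

try:
    import torch
    # PyTorch wheels bundle their own cuDNN; this prints the version they ship.
    print('PyTorch:', torch.__version__, 'cuDNN:', torch.backends.cudnn.version())
except ImportError:
    pass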


Origin blog.csdn.net/weixin_42492254/article/details/126028199