Background
This post records three ways to convert an ONNX model for TensorRT acceleration.
1. Use onnxruntime directly
When the onnxruntime session is initialized, put TensorrtExecutionProvider first in the list of providers. onnxruntime then checks automatically whether TensorRT can be used. If it can, the model is converted and run with TensorRT; if not, onnxruntime falls back to the next provider in the list. TensorRT may also report an error partway through, in which case you have to find out what is wrong with the environment.
But this method has a serious problem: the model is converted again on every run. I have not yet found out how to save the converted engine, so initialization takes too long for production applications.
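For reference, a minimal sketch of such a session setup. The ONNX Runtime documentation lists engine-cache options for the TensorRT provider (trt_engine_cache_enable and trt_engine_cache_path), which may avoid rebuilding the engine on every start; that is untested in the setup described here. build_providers and create_session are hypothetical helper names, and trt_cache is a placeholder directory.

```python
def build_providers(cache_dir="trt_cache", use_cache=True):
    """Return a provider list that tries TensorRT first, then CUDA, then CPU."""
    trt_options = {}
    if use_cache:
        # Assumption: these option names come from the ONNX Runtime TensorRT
        # provider documentation; check them against your installed version.
        trt_options["trt_engine_cache_enable"] = True
        trt_options["trt_engine_cache_path"] = cache_dir
    return [
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]


def create_session(model_path, **kwargs):
    """Create an inference session; needs a GPU build of onnxruntime."""
    import onnxruntime as ort  # imported lazily so build_providers works without it

    return ort.InferenceSession(model_path, providers=build_providers(**kwargs))
```

After creating the session, session.get_providers() shows which providers were actually registered, which helps to tell a silent fallback apart from a working TensorRT setup.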
2. Use trtexec.exe
I assume that anyone who wants to use TensorRT has already downloaded the TensorRT package; if you have not, you can download it here. The bin folder contains trtexec.exe, which can be run directly from the command line to convert an ONNX model.
Here is an example of converting ONNX to TensorRT:
trtexec.exe --onnx=model.onnx --saveEngine=model.trt --fp16
This converts model.onnx, saves the resulting engine as model.trt (the file extension can be anything), and uses fp16 precision. Whether to use fp16 depends on your needs: precision drops slightly and speed improves, but some models produce wrong results with fp16. There are other parameters as well; run trtexec.exe --help to see them and decide for yourself.
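If the conversion has to be part of a script, the same command can be driven from Python. This is only a sketch: trtexec_command and convert are hypothetical helper names, and it assumes trtexec.exe is on PATH (otherwise pass its full path). Only the three flags from the command above are used.

```python
import subprocess


def trtexec_command(onnx_path, engine_path, fp16=False, trtexec="trtexec.exe"):
    """Build the argument list for the trtexec call shown above."""
    cmd = [trtexec, f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")
    return cmd


def convert(onnx_path, engine_path, fp16=False, trtexec="trtexec.exe"):
    """Run trtexec and raise CalledProcessError if it exits with an error."""
    subprocess.run(trtexec_command(onnx_path, engine_path, fp16, trtexec), check=True)
```

Passing the arguments as a list rather than one string avoids quoting problems when the model path contains spaces.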
3. Convert with a Python program
For this, refer to an article I wrote earlier.
If you need FP16 precision, a single extra line that enables FP16 is all you have to add.
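A minimal sketch of what such a conversion can look like, assuming the TensorRT 8.x Python API. It is not necessarily the code from the earlier article; build_engine is a hypothetical name, and older TensorRT versions may also need a workspace size set on the config.

```python
import os


def build_engine(onnx_path, engine_path, fp16=False):
    """Parse an ONNX file and write a serialized TensorRT engine to disk."""
    if not os.path.isfile(onnx_path):
        raise FileNotFoundError(onnx_path)

    import tensorrt as trt  # imported lazily so the check above runs without TensorRT

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # ONNX models need an explicit-batch network definition.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    if fp16:
        # Assumption: this is the one-line FP16 switch in the 8.x API.
        config.set_flag(trt.BuilderFlag.FP16)

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("TensorRT failed to build the engine")
    with open(engine_path, "wb") as f:
        f.write(serialized)
    return engine_path
```

Unlike the onnxruntime route, the engine is written to disk once and can be loaded directly afterwards, so the slow build step is not repeated at every start.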
Inference using engines in Python
1. Create trt_session.py
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates and activates a CUDA context on import

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine, context):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for i, binding in enumerate(engine):
        size = trt.volume(context.get_binding_shape(i))
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream


# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
# batch_size is kept for compatibility with existing callers; engines built
# from ONNX use an explicit batch dimension, so it is not passed to TensorRT.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    # Run inference. execute_async_v2 is the call for explicit-batch engines.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]


class TensorRTSession():
    def __init__(self, model_path):
        runtime = trt.Runtime(TRT_LOGGER)
        trt.init_libnvinfer_plugins(TRT_LOGGER, '')
        # Read the serialized engine and close the file right away.
        with open(model_path, 'rb') as f:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        self.inputs_info = None
        self.outputs_info = None
        # Buffers and the stream are allocated once and reused for every call.
        self.inputs, self.outputs, self.bindings, self.stream = allocate_buffers(self.engine, self.context)

    class nodes_info():
        def __init__(self, name, shape):
            self.name = name
            self.shape = shape

    def __call__(self, inputs):
        return self.update(inputs)

    def get_inputs(self):
        inputs_info = []
        for i, binding in enumerate(self.engine):
            if self.engine.binding_is_input(binding):
                shape = self.context.get_binding_shape(i)
                inputs_info.append(self.nodes_info(binding, shape))
                # print(binding, shape)
        self.inputs_info = inputs_info
        return inputs_info

    def get_outputs(self):
        outputs_info = []
        for i, binding in enumerate(self.engine):
            if not self.engine.binding_is_input(binding):
                shape = self.context.get_binding_shape(i)
                outputs_info.append(self.nodes_info(binding, shape))
        self.outputs_info = outputs_info
        return outputs_info

    def update(self, input_arr, cuda_ctx=pycuda.autoinit.context):
        if cuda_ctx:
            cuda_ctx.push()
        # Copy each input into its pre-allocated page-locked host buffer, so a
        # wrong dtype or size fails here instead of overrunning device memory.
        for i in range(len(input_arr)):
            np.copyto(self.inputs[i].host, np.asarray(input_arr[i]).ravel())
        # Do inference
        trt_outputs = do_inference(self.context, bindings=self.bindings, inputs=self.inputs, outputs=self.outputs,
                                   stream=self.stream, batch_size=1)
        if cuda_ctx:
            cuda_ctx.pop()
        # Only the first output is returned; look up its shape if needed.
        if self.outputs_info is None:
            self.get_outputs()
        trt_outputs = trt_outputs[0].reshape(self.outputs_info[0].shape)
        return trt_outputs
The final handling of trt_outputs may need to be adjusted to the actual number of outputs. I hard-coded it because my model has only one output. I previously ran into GPU memory exhaustion caused by repeatedly allocating a new Stream; I do not know whether this version still has that problem.
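For a model with several outputs, the reshape step could be generalized along these lines. This is a sketch only: reshape_outputs is a hypothetical helper, and it assumes the flat host buffers arrive in the same order as the output shapes.

```python
import numpy as np


def reshape_outputs(flat_outputs, shapes):
    """Reshape each flat host buffer to its binding shape, in order."""
    if len(flat_outputs) != len(shapes):
        raise ValueError("expected one shape per output buffer")
    return [
        np.asarray(buf).reshape(tuple(shape))
        for buf, shape in zip(flat_outputs, shapes)
    ]
```

Inside update this would replace the single reshape with reshape_outputs(trt_outputs, [info.shape for info in self.outputs_info]). The results are views of the reused host buffers, so copy them if they must survive the next inference call.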
2. Call the session
This part is extracted from my own code, so it may contain mistakes. If you find any, please leave a comment and tell me.
# Initialize
session = TensorRTSession(model_path)
session.get_inputs()
session.get_outputs()
# Run inference
session((tensor1, tensor2))
3. Possible problems
The problem you are most likely to hit is TensorRT reporting a version incompatibility. It took me a long time to track down.
After choosing the TensorRT build that matched my CUDA version, I ran the code and it kept reporting a version incompatibility, even though the documentation said my CUDA and TensorRT versions were compatible, so I could not make sense of it. I had not had this problem when doing inference in C++, so at first I suspected the Python build of TensorRT. It later turned out that the PyTorch installed in the environment ships its own cuDNN, and TensorRT conflicted with that cuDNN version. My solution at the time was to switch PyTorch to the CPU-only build. If you only need inference, you can create a new environment and install only TensorRT.
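When chasing this kind of conflict, it helps to first list what is actually installed in the environment. A small sketch using only the standard library (Python 3.8+); installed_versions is a hypothetical helper, and the package names in the example are the usual pip distribution names, which may differ in your setup.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    packages = ["tensorrt", "torch", "pycuda", "onnxruntime-gpu"]
    for name, version in installed_versions(packages).items():
        print(f"{name}: {version or 'not installed'}")
```

If torch turns out to be installed in the same environment, torch.backends.cudnn.version() reports the cuDNN version it bundles, which can then be compared with the one your TensorRT build expects.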