[Personal record] torch to onnx: accelerating the grid_sample interface (4D/5D) with a TensorRT plugin

background

Many excellent stereo-matching models use torch's grid_sample, but TensorRT has no native interface for it. There are workarounds for the 4D case, but they still lose performance. I recently tested BGNet, which feels like the most reliable real-time solution, and it does not produce large numbers of mismatches under jitter; however, its bilateral filter adds a dimension and turns the op into a 5D grid_sample, which is a real headache. I could not find an alternative implementation online, so I wrote one myself; the output was identical, but the runtime doubled outright. After some thought, I decided to try getting a plugin interface instead.
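
For context, this is the op in question. A minimal sketch of the two cases (the shapes are made up for illustration, not BGNet's actual sizes): the 4D case, which has known workarounds, and the 5D case, which does not:

import torch
import torch.nn.functional as F

# 4D case: input (N, C, H, W) sampled with grid (N, H_out, W_out, 2) -> (N, C, H_out, W_out)
x4 = torch.rand(1, 32, 96, 160)
g4 = torch.rand(1, 96, 160, 2) * 2 - 1   # grid coordinates are expected in [-1, 1]
y4 = F.grid_sample(x4, g4, mode='bilinear', padding_mode='zeros', align_corners=True)

# 5D case (what the bilateral filtering step produces):
# input (N, C, D, H, W) sampled with grid (N, D_out, H_out, W_out, 3) -> (N, C, D_out, H_out, W_out)
x5 = torch.rand(1, 16, 8, 96, 160)
g5 = torch.rand(1, 8, 96, 160, 3) * 2 - 1
y5 = F.grid_sample(x5, g5, mode='bilinear', padding_mode='zeros', align_corners=True)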

testing platform

Jetson Xavier NX

preliminary attempts

After some convoluted attempts: I first tried torch2trt, but small details of a model easily fail to meet its interface requirements, and there was too much to investigate, so I gave it up.
Compiling the TensorRT source code directly on the NX did not seem appropriate either. After some searching and experimenting, I settled on the following solution.

reference

  1. A GitHub tutorial from another user
  2. NVIDIA's official instructions
  3. An open-source TensorRT plugin interface on GitHub
  4. NVIDIA's trtexec source folder

process

  1. First, register the unsupported operator and export the model to onnx. The registration code below is copied from the reference above; note that if you want to change the operator's name, modify the place I marked.
import torch
from my_model import my_model 

import typing
from torch.onnx import symbolic_helper


_OPSET_VERSION = 11
_registered_ops: typing.AbstractSet[str] = set()


def _reg(symbolic_fn: typing.Callable):
	name = "::%s" % symbolic_fn.__name__
	torch.onnx.register_custom_op_symbolic(name, symbolic_fn, _OPSET_VERSION)
	_registered_ops.add(name)


def register():
	"""Register ONNX Runtime's built-in contrib ops.
	Should be run before torch.onnx.export().
	"""

	def grid_sampler(g, input, grid, mode, padding_mode, align_corners):
		# mode
		#   'bilinear'      : onnx::Constant[value={0}]
		#   'nearest'       : onnx::Constant[value={1}]
		#   'bicubic'       : onnx::Constant[value={2}]
		# padding_mode
		#   'zeros'         : onnx::Constant[value={0}]
		#   'border'        : onnx::Constant[value={1}]
		#   'reflection'    : onnx::Constant[value={2}]
		mode = symbolic_helper._maybe_get_const(mode, "i")
		padding_mode = symbolic_helper._maybe_get_const(padding_mode, "i")
		mode_str = ["bilinear", "nearest", "bicubic"][mode]
		padding_mode_str = ["zeros", "border", "reflection"][padding_mode]
		align_corners = int(symbolic_helper._maybe_get_const(align_corners, "b"))

		# From opset v13 onward, the output shape can be specified with
		# (N, C, H, W) (N, H_out, W_out, 2) => (N, C, H_out, W_out)
		# input_shape = input.type().sizes()
		# gird_shape = grid.type().sizes()
		# output_shape = input_shape[:2] + gird_shape[1:3]
		# g.op(...).setType(input.type().with_sizes(output_shape))

		return g.op(
		    ## op name, modify here. not sure whether "com.microsoft::" is required
			"com.microsoft::GridSamplePluginDynamic",  
			input,
			grid,
			mode_s=mode_str,
			padding_mode_s=padding_mode_str,
			align_corners_i=align_corners,
		)

	_reg(grid_sampler)



@torch.no_grad()
def convert():
	register()

	# select device and load the trained weights
	device = "cuda"
	model = my_model(88, 'models.pth').to(device)
	model.eval()

	t1 = torch.rand(1, 1, 384, 640).to(device)
	t2 = torch.rand(1, 1, 384, 640).to(device)

	# Export the model
	torch.onnx.export(model,
					  (t1, t2),
					  'model.onnx',   # where to save the model (can be a file or file-like object)
					  export_params=True,        # store the trained parameter weights inside the model file
					  opset_version=11,          # the ONNX version to export the model to
					  do_constant_folding=True,  # whether to execute constant folding for optimization
					  input_names = ['left', 'right'],   # the model's input names
					  output_names = ['output'])


if __name__ == "__main__":
	convert()

Because the grid_sample I need is 5D, even exporting with the latest onnx does not support it (at the time of writing, the native ONNX GridSample op only covers the 4D case), so I still export with the older opset. For 4D I do not know whether exporting directly with a higher opset version causes any problems; I may get a chance to try it on CREStereo later.
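
To sanity-check that the custom node actually ended up in the exported graph, and to see which op type and domain it got, a quick look with the onnx package helps (the file name is assumed to be the one exported above):

import onnx

m = onnx.load("model.onnx")
for node in m.graph.node:
    if "GridSample" in node.op_type:
        print(node.op_type, node.domain, node.name)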

  2. Run onnx-simplifier on the exported model; it can eliminate some If nodes and avoid certain TensorRT errors:
python3 -m onnxsim model.onnx model_sim.onnx
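
If you prefer staying in Python, onnx-simplifier also has a Python API; a sketch equivalent to the command above:

import onnx
from onnxsim import simplify

model = onnx.load("model.onnx")
model_sim, ok = simplify(model)
assert ok, "onnx-simplifier could not validate the simplified model"
onnx.save(model_sim, "model_sim.onnx")
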
  3. Download the open-source plugin interface from the reference above (mmcv also ships one, so you can probably find it there too). I only use grid_sample, so I deleted everything else; after deleting, change the two places shown in the screenshot.
    (screenshot in the original post: the two places in the plugin source to change)
    After compiling it according to the repo's markdown, everything is fine. The generated library file is under build/lib and looks like this:
    (screenshot in the original post: the generated libamirstan_plugin.so under build/lib)
  4. Then use trtexec to convert the model file, linking the plugin library:
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --saveEngine=model.trt --fp16 \
--plugins=/home/ubuntu/Documents/amirstan_plugin/build/lib/libamirstan_plugin.so

Here I linked the plugin at first, but the operator still failed to match; the name change on the plugin side did not seem to take effect. After thinking it over, I changed the operator type in the onnx model to GridSamplePluginDynamic instead, as sketched below.
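
A sketch of that rename with the onnx package; the op types to match ("GridSampler" / "GridSample" below) are only guesses, so check what your exported graph actually contains first (e.g. with the snippet above):

import onnx

m = onnx.load("model.onnx")
for node in m.graph.node:
    # match whatever op type the export actually produced
    if node.op_type in ("GridSampler", "GridSample"):
        node.op_type = "GridSamplePluginDynamic"
onnx.save(m, "model.onnx")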

  5. Then run it through the C++ project from my earlier work to get the actual result.
    At first I tried to link the plugin library directly in CMakeLists, but that had no effect. trtexec could load it, though, so I looked at the trtexec source code: it uses dlopen, so I followed suit.
#include <dlfcn.h>

    ......

    // loading the plugin .so with dlopen runs its static plugin registration,
    // so TensorRT can find the custom op when deserializing the engine
    string dll_path = "/home/ubuntu/Documents/amirstan_plugin/build/lib/libamirstan_plugin.so";

    void *handle = dlopen(dll_path.c_str(), RTLD_LAZY);
    if (NULL == handle)
    {
        printf("dlopen error. msg:%s\n", dlerror());
        return -1;
    }

    ......

At the same time, add this to CMakeLists so that dlopen is available:

target_link_libraries(main ${CMAKE_DL_LIBS} )
  6. Then compile and run, and look at the output. The result is not convenient to show here, but it looks the same as the earlier version that used my hand-written grid_sample replacement. The model with the hand-written grid_sample ran at about 12 frames per second; with the plugin it now runs at about 17, a clear improvement. Done~

supplement

For a Python implementation, you can refer to the reference above: use ctypes.CDLL(plugin_lib) to load the plugin library, and the rest is the same. I am not sure about this and have not tried it yet; a rough sketch is below.
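
A rough, untested sketch of that Python route, using the standard TensorRT Python bindings and the same paths as above:

import ctypes
import tensorrt as trt

# load the plugin .so so its creators register themselves with TensorRT
ctypes.CDLL("/home/ubuntu/Documents/amirstan_plugin/build/lib/libamirstan_plugin.so")

logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")

# deserialize the engine produced by trtexec
with open("model.trt", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()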

Origin: blog.csdn.net/weixin_42492254/article/details/127324189