【个人记录】torch转onnx对上TensorRT的grid_sample接口（4D/5D）进行加速

背景

双目里比较优秀的很多模型都有用到torch的grid_sample，但是其在TensorRT中没有接口。虽说4D的有替代方案，但是还是有性能损失。并且我最近测试了BGNET，感觉在实时方案中是最可靠的，在抖动中不会大量误匹配，但是其有双边滤波是升维的，变成了5D的grid_sample，让人头大。替代写法网上都没有，我自己照着实现了一下，结果是一致的，但是时间直接翻一倍，我人裂开，onnx部署时间和torch原来跑的时间一致就很难受。思来想去还是试试能不能怼个接口上去。

测试平台

Jetson Xavier NX

前情

经过一些复杂尝试，首先试了下torch2trt，感觉细节上容易对不上他的接口要求，要排查的太多了，放弃。
觉得在NX上编译源码TensorRT源码还是不太合适，在一顿搜索尝试之后确定了以下方案。

参考

过程

首先将onnx内不支持的接口注册再导出，注册代码是复制的这里，注意要改算子名字的话就改我标的地方。

import torch
from my_model import my_model 

import typing
from torch.onnx import symbolic_helper


_OPSET_VERSION = 11
_registered_ops: typing.AbstractSet[str] = set()


def _reg(symbolic_fn: typing.Callable):
	name = "::%s" % symbolic_fn.__name__
	torch.onnx.register_custom_op_symbolic(name, symbolic_fn, _OPSET_VERSION)
	_registered_ops.add(name)


def register():
	"""Register ONNX Runtime's built-in contrib ops.
	Should be run before torch.onnx.export().
	"""

	def grid_sampler(g, input, grid, mode, padding_mode, align_corners):
		# mode
		#   'bilinear'      : onnx::Constant[value={0}]
		#   'nearest'       : onnx::Constant[value={1}]
		#   'bicubic'       : onnx::Constant[value={2}]
		# padding_mode
		#   'zeros'         : onnx::Constant[value={0}]
		#   'border'        : onnx::Constant[value={1}]
		#   'reflection'    : onnx::Constant[value={2}]
		mode = symbolic_helper._maybe_get_const(mode, "i")
		padding_mode = symbolic_helper._maybe_get_const(padding_mode, "i")
		mode_str = ["bilinear", "nearest", "bicubic"][mode]
		padding_mode_str = ["zeros", "border", "reflection"][padding_mode]
		align_corners = int(symbolic_helper._maybe_get_const(align_corners, "b"))

		# From opset v13 onward, the output shape can be specified with
		# (N, C, H, W) (N, H_out, W_out, 2) => (N, C, H_out, W_out)
		# input_shape = input.type().sizes()
		# gird_shape = grid.type().sizes()
		# output_shape = input_shape[:2] + gird_shape[1:3]
		# g.op(...).setType(input.type().with_sizes(output_shape))

		return g.op(
		    ## op name, modify here. not sure whether "com.microsoft::" is required
			"com.microsoft::GridSamplePluginDynamic",  
			input,
			grid,
			mode_s=mode_str,
			padding_mode_s=padding_mode_str,
			align_corners_i=align_corners,
		)

	_reg(grid_sampler)



@torch.no_grad()
def convert():
	register()

	# set cpu
	device = "cuda"
	model = my_model (88, 'models.pth').to(device)
	model.eval()

	t1 = torch.rand(1, 1, 384, 640).to(device)
	t2 = torch.rand(1, 1, 384, 640).to(device)

	# Export the model
	torch.onnx.export(model,
					  (t1, t2),
					  'model.onnx',   # where to save the model (can be a file or file-like object)
					  export_params=True,        # store the trained parameter weights inside the model file
					  opset_version=11,          # the ONNX version to export the model to
					  do_constant_folding=True,  # whether to execute constant folding for optimization
					  input_names = ['left', 'right'],   # the model's input names
					  output_names = ['output'])


if __name__ == "__main__":
	convert()

因为我要使用的grid_sample是5D的，就算用最新的onnx导出也不支持，所以我还是用老版本的onnx导出。4D的我不知道直接用高版本导出会有啥问题哈。之后有机会试试CREStereo的。

随手用下onnx-simplifier，有可能可以消除一些if节点避免tensorrt报一些错

python3 -m onnxsim model.onnx model_sim.onnx

从这里下载开源的接口（mmcv也有，我感觉也可以找那边的用），我个人就用到grid_sample所以把其他都删了。删完之后这两个地方改一下。

按照他的markdown编译好之后就可以了。生成的库文件在build/lib里面，长这样
随后使用trtexec转换模型文件，链接上接口：

/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --saveEngine=model.trt --fp16 \
--plugins=/home/ubuntu/Documents/amirstan_plugin/build/lib/libamirstan_plugin.so

这里我一开始的时候链接上但是接口还是对不上，改接口工程里面的名字好像没变换，想了想算了把onnx模型里面算子type改成GridSamplePluginDynamic了。

随后在自己之前工程上随手链上工程文件用c++跑一遍效果。
一开始尝试在CMakeLists里面直接链上，然后发现没有效果。但是trtexec是可以的呀，所以我瞅了眼trtexec的源码，他用的是dlopen函数，我也跟着用了。

#include <dlfcn.h>

    ......
    
    string dll_path = "/home/ubuntu/Documents/amirstan_plugin/build/lib/libamirstan_plugin.so";
    
    void *handle = dlopen(dll_path.c_str(), RTLD_LAZY);
    if (NULL == handle)
    {
        printf("dlopen error. msg:%s", dlerror());
        return -1;
    }
    
    ......

同时在CMakeLists里面写上

target_link_libraries(main ${CMAKE_DL_LIBS} )

然后编译跑图，效果不太方便展示，看上去是和之前自己用别的方法实现的grid_sample是一样的。原先用别的方法实现的grid_sample跑模型大概12帧，现在17帧，提升明显。完事儿~

补充

python实现个人感觉可以参考这里，使用ctypes.CDLL(plugin_lib)来链接上库，其他一致。不确定哈，还没试过。

【个人记录】torch转onnx对上TensorRT的grid_sample接口（4D/5D）进行加速

背景

测试平台

前情

参考

过程

补充

猜你喜欢