In a recent experiment I needed to deploy an end-to-end grasping network with TensorRT. I had never used TensorRT before, so I consulted a lot of material and hit a lot of pitfalls before the deployment finally succeeded, and I wanted to record the process. This article uses Unet and grcnn (antipodal robotic grasping) as examples to explain how to convert an end-to-end PyTorch model to TensorRT.
Chapter 1: Preparations
My software environment: Ubuntu 18 + CUDA 11.3 + cuDNN 8.6.0 + Python 3.8 + torch 1.12.0 + TensorRT 8.5.2.2, with an RTX 3070 GPU. Since there are many tutorials online for installing CUDA, cuDNN, and PyTorch 1.12, I won't repeat them here.
Step1: Install tensorRT8.5.2.2
It can be downloaded from the official website.
Because the download requires logging in to NVIDIA's official website, a copy is also provided on Baidu Netdisk for convenience (extraction code: ltjy).
Extract the archive:
tar -xzvf TensorRT-8.5.2.2.Linux.x86_64-gnu.cuda-11.8.cudnn8.6.tar.gz
Add environment variables:
export PATH="$PATH:*****/TensorRT-8.5.2.2/bin"  # path to the bin directory of your TensorRT download
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:*******/TensorRT/lib"  # path to the lib directory of your TensorRT download
Create and activate a virtual environment:
conda create -n pt12 python=3.8
conda activate pt12
Install the wheel file:
cd TensorRT-8.5.2.2/python
pip install tensorrt-8.5.2.2-cp38-none-linux_x86_64.whl
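To confirm that the wheel really landed in the active environment, a quick check from Python can help. The snippet below is only a minimal sketch: the helper name `installed_tensorrt_version` is made up for illustration, and it returns None instead of raising when the package cannot be imported.

```python
import importlib


def installed_tensorrt_version():
    """Return the TensorRT version string, or None if the package cannot be imported."""
    try:
        trt = importlib.import_module("tensorrt")
    except ImportError:
        return None
    # The tensorrt Python package exposes its version as __version__.
    return getattr(trt, "__version__", None)


if __name__ == "__main__":
    version = installed_tensorrt_version()
    if version is None:
        print("tensorrt is not importable in this environment")
    else:
        print("tensorrt version:", version)
```

If this reports that tensorrt is not importable, the usual suspects are a different conda environment being active or the TensorRT lib directory missing from LD_LIBRARY_PATH.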
Step2: Install the onnx-tensorrt toolkit
There are many ways to convert an ONNX file to a TRT file. If you do not need INT8 quantized inference, this toolkit is the recommended way to do the conversion.
git clone https://github.com/onnx/onnx-tensorrt.git
cd onnx-tensorrt
git checkout 8.0-GA
git submodule update --init
mkdir build && cd build
cmake .. -DTENSORRT_ROOT=/******/TensorRT-8.5.2.2  # where TensorRT was just extracted
Error 1 (similar to the one described in this blog): the cmake version is too low
Solution: upgrade cmake
pip install cmake --upgrade
or download a newer version from the official cmake website
Error 2: Could NOT find Protobuf (missing: Protobuf_LIBRARIES Protobuf_INCLUDE_DIR)
Solution: install libprotobuf-dev and protobuf-compiler
sudo apt-get install libprotobuf-dev protobuf-compiler
protoc --version
Start compiling:
make -j8
Error: /usr/include/NvInferRuntimeCommon.h:56:10: fatal error: cuda_runtime_api.h: No such file or directory
Solution: configure the CUDA-related environment variables in ~/.bashrc
sudo gedit ~/.bashrc
export PATH=/usr/local/cuda-11.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64:$LD_LIBRARY_PATH
export CPATH=/usr/local/cuda-11.3/targets/x86_64-linux/include:$CPATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
source ~/.bashrc
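Before rebuilding, it can be worth checking that the header the compiler complained about is now reachable through CPATH. The following is a small sketch using only the standard library; the function name `find_in_search_path` is hypothetical.

```python
import os


def find_in_search_path(filename, search_path):
    """Return the first directory in a path-separator-delimited list that contains filename, else None."""
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            return directory
    return None


if __name__ == "__main__":
    # CPATH is the variable exported in ~/.bashrc; it is empty if the shell was not re-sourced.
    cpath = os.environ.get("CPATH", "")
    print("cuda_runtime_api.h found in:", find_in_search_path("cuda_runtime_api.h", cpath))
```

If the result is None, the shell probably has not picked up the new ~/.bashrc yet, or the CUDA include directory differs from the one used here.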
Start the installation:
sudo make install
Run the test:
onnx2trt -V
If the version number is printed, all the preliminary preparations are complete.
Chapter 2: Pytorch-Unet to TensorRT-Unet
Converting a PyTorch model to a TensorRT engine takes two steps:
1. Convert the .pt or model file to an ONNX file;
2. Use a conversion tool to convert the ONNX file to a TRT file.
Pytorch-Unet takes an RGB image as input and outputs its segmentation as a grayscale image. Its GitHub project is here. For related explanations, please refer to this blog. Let's go straight to the practical part.
Step1: pull the code from github
git clone https://github.com/milesial/Pytorch-UNet.git
cd Pytorch-UNet
git checkout v1.0
Download the dataset:
Download the dataset by following the project's README on GitHub, then put the images from train_hq.zip into data/img and the images from train_mask.zip into data/mask.
Step2: Training network
To simplify the later deployment, modify the preprocess function in utils/dataset.py and set NewW, NewH to 960, 640.
Then simply run train.py.
Step3: Convert pt model file to onnx
test.py:
Here dummy_input is changed to the modified input shape (1, 3, 640, 960), and the model is initialized with the same parameters as in train.py so that the two stay consistent.
train.py
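For readers who want a starting point for the export step, here is a minimal sketch of what such a script might look like, assuming the trained network is an ordinary `torch.nn.Module`. The helper names, the tensor names `input` and `output`, and the opset version are my assumptions and are not taken from the project.

```python
def dummy_input_shape(width=960, height=640, channels=3, batch=1):
    """Shape of the export-time dummy input in NCHW order."""
    return (batch, channels, height, width)


def export_to_onnx(model, onnx_path, width=960, height=640):
    """Trace `model` with a random dummy input and write the graph to `onnx_path`."""
    # Imported lazily so that the shape helper above works without PyTorch installed.
    import torch

    model.eval()  # export in inference mode (fixes dropout / batch-norm behaviour)
    dummy_input = torch.randn(*dummy_input_shape(width, height))
    torch.onnx.export(
        model,
        dummy_input,
        onnx_path,
        input_names=["input"],    # assumed tensor names
        output_names=["output"],
        opset_version=11,         # assumed; pick one your onnx-tensorrt build supports
    )
```

A call such as `export_to_onnx(net, "unet_deconv.onnx")` would be made after the trained weights have been loaded into `net`. The model and the dummy input must be on the same device, so this sketch assumes the model is on the CPU.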
Step4: Convert onnx to trt model
onnx2trt unet_deconv.onnx -o unet_deconv.trt
Error 1: The ONNX model is too complex to convert.
Solution: install the onnxsim tool and simplify the model
pip install onnx-simplifier
python -m onnxsim input_onnx_model output_onnx_model
Error 2: Cannot open the shared object file libnvinfer.so.8
Solution: register the dynamic library path
sudo gedit /etc/ld.so.conf
Add a line:
/home/lab/xcy/TensorRT-8.5.2.2/lib  # path to the lib directory of your TensorRT
sudo ldconfig
Error 3: Cannot find libnvinfer.so.8.4.3
Solution: copy the builder resource library into /usr/lib
sudo cp /home/lab/xcy/TensorRT-8.5.2.2/lib/libnvinfer_builder_resource.so.8.4.3 /usr/lib
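Whether the dynamic loader can now resolve the library can also be checked from Python without running onnx2trt again. This is a small sketch built on `ctypes`; the helper name `can_load_shared_library` is hypothetical.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can resolve and load `name`."""
    try:
        ctypes.CDLL(name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # libnvinfer.so.8 is the library named in Error 2.
    print("libnvinfer.so.8 loadable:", can_load_shared_library("libnvinfer.so.8"))
```

The loader consults both the ldconfig cache and LD_LIBRARY_PATH, so a False result means neither of the two fixes has taken effect in the current shell.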
Step5: Write the inference code
inference.py
import os
import sys
import time
# from PIL import Image
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import cv2
# TensorRT logger singleton
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
class HostDeviceMem(object):
    """Pairs a page-locked host buffer with its device buffer."""
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem
    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)
    def __repr__(self):
        return self.__str__()
def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream
def load_engine(trt_path):
    # Deserialize the engine
    with open(trt_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())
class TRTInference(object):
    """Manages TensorRT objects for model inference."""
    def __init__(self, trt_engine_path, onnx_model_path, trt_engine_datatype=trt.DataType.FLOAT, batch_size=1):
        """Initializes TensorRT objects needed for model inference.
        Args:
            trt_engine_path (str): path of the serialized TensorRT engine to load
            onnx_model_path (str): path of the .onnx model (not used when loading a cached engine)
            trt_engine_datatype (trt.DataType):
                requested precision of TensorRT engine used for inference
            batch_size (int): batch size for which engine
                should be optimized for
        """
        # TRT engine placeholder
        self.trt_engine = None
        # Display requested engine settings to stdout
        print("TensorRT inference engine settings:")
        print(" * Inference precision - {}".format(trt_engine_datatype))
        print(" * Max batch size - {}\n".format(batch_size))
        # The engine file is assumed to exist already, so we can load it
        if not self.trt_engine:
            print("Loading cached TensorRT engine from {}".format(
                trt_engine_path))
            self.trt_engine = load_engine(
                trt_engine_path)
        # This allocates memory for network inputs/outputs on both CPU and GPU
        self.inputs, self.outputs, self.bindings, self.stream = allocate_buffers(self.trt_engine)
        # Execution context is needed for inference
        self.context = self.trt_engine.create_execution_context()
    def infer(self, full_img, output_shapes, new_width, new_height):
        """Infers model on given image.
        Args:
            full_img (np.ndarray): HWC image, e.g. as returned by cv2.imread
            output_shapes (list of tuple): shape to restore for each flat output
            new_width (int), new_height (int): network input size
        """
        assert new_width > 0 and new_height > 0, "Scale is too small"
        # resize and transform to array
        scale_img = cv2.resize(full_img, (new_width, new_height))
        print("scale image shape:{}".format(scale_img.shape))
        # scale_img = np.array(scale_img)
        # HWC to CHW
        scale_img = scale_img.transpose((2, 0, 1))
        # Normalize to [0, 1]
        if scale_img.max() > 1:
            scale_img = scale_img / 255
        # Add the batch dimension
        # scale_img = np.expand_dims(scale_img, axis=0)
        # Make the data a contiguous float32 block
        scale_img = np.array(scale_img, dtype=np.float32, order='C')
        # Copy it into appropriate place into memory
        # (self.inputs was returned earlier by allocate_buffers())
        np.copyto(self.inputs[0].host, scale_img.ravel())
        # Output shapes expected by the post-processor
        # output_shapes = [(1, 11616, 4), (11616, 21)]
        # When inferring on a single image, we measure inference
        # time to output it to the user
        inference_start_time = time.time()
        # Fetch output from the model
        trt_outputs = do_inference(
            self.context, bindings=self.bindings, inputs=self.inputs,
            outputs=self.outputs, stream=self.stream)
        print("network output shape:{}".format(trt_outputs[0].shape))
        # Output inference time
        print("TensorRT inference time: {} ms".format(
            int(round((time.time() - inference_start_time) * 1000))))
        # Before doing post-processing, we need to reshape the outputs as do_inference will
        # give us flat arrays.
        outputs = [output.reshape(shape) for output, shape in zip(trt_outputs, output_shapes)]
        # And return results
        return outputs
# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    # Run inference.
    # Note: execute_async with batch_size is the implicit-batch API; for explicit-batch
    # engines TensorRT also provides execute_async_v2(bindings=..., stream_handle=...).
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
P.S. This is the complete inference code. Much of what can be found online is incomplete, so I am posting all of it here.
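The data layout that infer() relies on is easy to get wrong, so here is a standalone sketch of just that part: an HWC image becomes a flat float32 CHW buffer on the way in, and the flat output buffer is reshaped on the way out. The function names `to_trt_input` and `from_trt_output` are hypothetical, and a synthetic array stands in for the resized camera frame so that the example needs only NumPy.

```python
import numpy as np


def to_trt_input(image_hwc):
    """Convert an HxWxC image into a flat, contiguous float32 CHW buffer."""
    chw = image_hwc.transpose((2, 0, 1)).astype(np.float32)  # HWC -> CHW
    if chw.max() > 1:                                         # same normalization rule as infer()
        chw = chw / 255.0
    return np.ascontiguousarray(chw, dtype=np.float32).ravel()


def from_trt_output(flat_output, shape):
    """Reshape a flat host buffer, as returned by do_inference, back to `shape`."""
    return np.asarray(flat_output).reshape(shape)


if __name__ == "__main__":
    fake_image = np.full((640, 960, 3), 255, dtype=np.uint8)  # stand-in for a resized frame
    flat = to_trt_input(fake_image)
    print(flat.shape, flat.dtype)
```

The flat length has to equal the size of the page-locked host buffer allocated for the input binding, otherwise np.copyto in infer() will fail.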
predict.py
Parameters that need to be set according to your actual situation:
engine_file_path: engine file path
onnx_file_path: onnx file path
new_width, new_height: input width and height
trt_engine_datatype: engine precision (fp32 and fp16 are supported)
image_path: test image path
import tensorrt as trt
import numpy as np
import cv2
import inference as inference_utils # TRT/TF inference wrappers
if __name__ == "__main__":
    # 1. Build the network
    # Precision command line argument -> TRT Engine datatype
    TRT_PRECISION_TO_DATATYPE = {
        16: trt.DataType.HALF,
        32: trt.DataType.FLOAT
    }
    # datatype: float16 (use key 32 for float32)
    trt_engine_datatype = TRT_PRECISION_TO_DATATYPE[16]
    # batch size = 1
    max_batch_size = 1
    engine_file_path = "unet_deconv_sim.trt"
    onnx_file_path = "unet_deconv_sim.onnx"
    new_width, new_height = 960, 640
    output_shapes = [(1, new_height, new_width)]
    trt_inference_wrapper = inference_utils.TRTInference(
        engine_file_path, onnx_file_path,
        trt_engine_datatype, max_batch_size,
    )
    # 2. Image preprocessing
    image_path = "example.jpg"
    img = cv2.imread(image_path)
    # inference
    trt_outputs = trt_inference_wrapper.infer(img, output_shapes, new_width, new_height)[0]
    # Output post-processing
    out_threshold = 0.5
    print("the size of tensorrt output : {}".format(trt_outputs.shape))
    output = trt_outputs.transpose((1, 2, 0))
    # Binarize to 0/255 pixel values
    output[output > out_threshold] = 255
    output[output <= out_threshold] = 0
    output = output.astype(np.uint8)
    result = cv2.resize(output, (img.shape[1], img.shape[0]))
    cv2.imwrite("best_output_deconv.jpg", result)
Final result:
This network is only used to test whether TensorRT can be deployed. To save time, the model was trained for just one epoch, so the accuracy is not high.
Chapter 3: Pytorch-grcnn to Tensorrt-grcnn
The antipodal robotic grasping network takes a 224x224 image with 1, 3, or 4 channels as input (depth only, RGB, or RGB-D) and outputs the grasp pos, grasp sin, grasp cos, and the gripper opening width. Its GitHub project is here. This chapter covers the conversion of the network that takes an RGB image as input.
Step1: Train rgb input pytorch network
I won't explain this step. The README of the GitHub project has a tutorial, so just follow that.
Step2: Convert the model file to onnx file
The content of the onnxtotrt.py code is as follows:
Since the input image is RGB with three channels, the input shape is (3, 224, 224), so dummy_input has to be modified accordingly, and the model parameters (input channels, number of channels, etc.) have to be initialized to match. Note that since the trained model is not saved as a .pt file, you only need to call torch.load to load the weights. The rest is the same as the conversion in Chapter 2.
Step3: Convert onnx to .trt file
Due to space reasons, the command is given directly:
onnx2trt grcnn.onnx -o grcnn.trt
Step4: Write the inference code
After getting the .trt file, start writing the inference and predict code.
The key point in inferencetest.py:
Since the network has 4 outputs, the 4 results have to be separated by slicing after do_inference returns.
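As an illustration of that splitting step, here is a minimal sketch. It assumes that do_inference returns one flat host buffer per output binding and that the bindings come in the order pos, cos, sin, width; that order, the 224x224 map size, and the helper name `split_outputs` are my assumptions, so check them against your own engine.

```python
import numpy as np

OUTPUT_NAMES = ("pos", "cos", "sin", "width")  # assumed binding order; verify on your engine


def split_outputs(trt_outputs, height=224, width=224):
    """Map the flat host buffers from do_inference to named HxW arrays."""
    if len(trt_outputs) != len(OUTPUT_NAMES):
        raise ValueError("expected %d outputs, got %d" % (len(OUTPUT_NAMES), len(trt_outputs)))
    return {
        name: np.asarray(buf, dtype=np.float32).reshape(height, width)
        for name, buf in zip(OUTPUT_NAMES, trt_outputs)
    }
```

Returning a dictionary keyed by name keeps the later post-processing from depending on list positions, which is handy if the binding order of a rebuilt engine ever changes.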
The key point in predict.py:
The input image preprocessing here is the same as in run_offline.py of the original network, except that depth is set to false. Compared with the PyTorch version of run_offline.py, the TensorRT deployment replaces the predict function inside with the TensorRT inference function, and the rest remains unchanged. The post-processing function is also the same as in the original network, except that the tensor operations supported by PyTorch have to be changed to NumPy operations.
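To show what changing tensor operations to NumPy operations can look like, here is a hedged sketch of a post-processing step on the four output maps. It assumes the original post-processing recovers the grasp angle as atan2(sin, cos) / 2 and rescales the width map by a constant; the value 150.0 and the function name are assumptions, and any smoothing filter the original applies is left out.

```python
import numpy as np

WIDTH_SCALE = 150.0  # assumed scale factor; use the value from the original post-processing


def post_process_numpy(q_img, cos_img, sin_img, width_img):
    """NumPy stand-in for a PyTorch post-processing step on the four output maps."""
    q_img = np.squeeze(q_img)
    # Assumed encoding: the network predicts cos(2*theta) and sin(2*theta),
    # so the recovered angle is halved.
    ang_img = np.squeeze(np.arctan2(sin_img, cos_img) / 2.0)
    width_img = np.squeeze(width_img) * WIDTH_SCALE
    return q_img, ang_img, width_img
```

Here np.arctan2 replaces torch.atan2 and np.squeeze replaces the .cpu().numpy().squeeze() chain, since the TensorRT outputs already live in host memory as NumPy arrays.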
The final result is as follows:
Running the evaluation code over all images in the dataset, the original inference speed is 33 ms per image.
The inference speed after TensorRT acceleration is:
As can be seen, there is still a speed-up (my computer is rather weak), and I expect the gain to be larger on a desktop. In addition, the original network occupies much more video memory than the TensorRT-accelerated one. I forgot to take a screenshot, so I can only describe it in words. The open-source code for the TensorRT deployment is here. If you find it useful, please give it a star when you clone it!
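For anyone who wants to reproduce a per-image timing comparison like the one above, a simple way is to average wall-clock time over the whole dataset. The helper below is a minimal sketch with a made-up name; `fn` stands for whichever inference call is being measured and must block until the result is ready (do_inference does, because it synchronizes the stream).

```python
import time


def mean_latency_ms(fn, inputs, warmup=1):
    """Average wall-clock time of fn(x) over `inputs`, in milliseconds."""
    inputs = list(inputs)
    for x in inputs[:warmup]:
        fn(x)  # warm-up calls are not timed
    start = time.perf_counter()
    for x in inputs:
        fn(x)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / max(len(inputs), 1)
```

Measuring the PyTorch model and the TensorRT engine with the same helper and the same images keeps the two numbers comparable.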