Ubuntu 18: End-to-End PyTorch Network Deployment and TensorRT Model Inference in One Article

Because the blogger needed to deploy TensorRT for an end-to-end grasping network in a recent experiment, but had never used TensorRT before, I consulted a lot of material and stepped on a lot of pitfalls before finally deploying successfully, so I wanted to record the process. This article mainly uses UNet and grcnn (antipodal robotic grasping) as examples to explain how to convert an end-to-end PyTorch model to TensorRT.

Chapter 1: Preparations

The blogger's software environment: Ubuntu 18 + CUDA 11.3 + cuDNN 8.6.0 + Python 3.8 + torch 1.12.0 + TensorRT 8.5.2.2, with an RTX 3070 GPU. Since there are already many tutorials online for installing CUDA + cuDNN + PyTorch 1.12, they are not repeated here.

Step1: Install tensorRT8.5.2.2

It can be downloaded from the official website.
Because downloading requires logging in to NVIDIA's official website, for convenience it is also shared on a Baidu cloud network disk: extraction code ltjy.
Extract the archive:

tar -xzvf TensorRT-8.5.2.2.Linux.x86_64-gnu.cuda-11.8.cudnn8.6.tar.gz

Add environment variables:

export PATH="$PATH:*****/TensorRT-8.5.2.2/bin"   # path to the bin directory of your downloaded TensorRT
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:*******/TensorRT/lib"   # path to the lib directory of your downloaded TensorRT

Create and activate a virtual environment:

conda create -n pt12 python=3.8
conda activate pt12

Install the wheel file:

cd TensorRT-8.5.2.2/python
pip install tensorrt-8.5.2.2-cp38-none-linux_x86_64.whl

Step2: Install the onnx-tensorrt toolkit

There are many ways to convert an ONNX file into a TRT file. If you do not need INT8 quantized inference, it is recommended to use this toolkit for the conversion.
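For reference, an engine can also be built directly with TensorRT's own Python API (installed via the wheel above) instead of onnx2trt. This is only a minimal sketch, assuming TensorRT 8.x; the file names are placeholders.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, engine_path, fp16=False):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse the ONNX file")
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30   # 1 GB workspace, adjust to your GPU
    if fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    serialized_engine = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(serialized_engine)

# build_engine("unet_deconv.onnx", "unet_deconv.trt", fp16=True)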

git clone https://github.com/onnx/onnx-tensorrt.git
cd onnx-tensorrt
git checkout 8.0-GA
git submodule update --init
mkdir build && cd build
cmake .. -DTENSORRT_ROOT=/******/TensorRT-8.5.2.2   # path where TensorRT was just installed

Error 1 (similar to this blog): the CMake version is too low.
Solution: upgrade CMake.

pip install cmake --upgrade
or download it from the CMake official website

Error 2: Could NOT find Protobuf (missing: Protobuf_LIBRARIES Protobuf_INCLUDE_DIR)
Solution: install libprotobuf-dev and protobuf-compiler.

sudo apt-get install libprotobuf-dev protobuf-compiler
protoc --version

Start compiling:

make -j8

Error: /usr/include/NvInferRuntimeCommon.h:56:10: fatal error: cuda_runtime_api.h: No such file or directory
Solution: configure the CUDA-related environment variables.

sudo gedit ~/.bashrc
export PATH=/usr/local/cuda-11.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64:$LD_LIBRARY_PATH
export CPATH=/usr/local/cuda-11.3/targets/x86_64-linux/include:$CPATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
source ~/.bashrc


Start the installation:

sudo make install

Run the test:

onnx2trt -V

If it prints the version number, all the preliminary preparations are complete.

Chapter 2: Pytorch-Unet to TensorRT-Unet

Converting a PyTorch model to a TensorRT engine takes two steps:
1. Convert the .pt or model file to an ONNX file;
2. Use a conversion tool to turn the ONNX file into a .trt file.
Pytorch-UNet takes an RGB image as input and produces its segmented grayscale mask. The address of its GitHub project is here, and a related explanation can be found in this blog. Let's get straight to the practical steps.

Step1: pull the code from github

git clone https://github.com/milesial/Pytorch-UNet.git
cd Pytorch-UNet
git checkout v1.0

Download the dataset:
Download the dataset by following the project's README on GitHub, put the images from train_hq.zip into data/img, and put the images from train_mask.zip into data/mask.

Step2: Training network

To make the subsequent deployment easier, modify the preprocess function in utils/dataset.py and change newW, newH to 960 and 640, as sketched below.
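A minimal sketch of the modified preprocessing, written as a plain function; the exact method body in utils/dataset.py may differ slightly, the only point is fixing the size to 960x640 instead of scaling.

import numpy as np

def preprocess(pil_img, scale=1.0):
    # Fixed network input size instead of int(scale * w), int(scale * h); scale is ignored here
    newW, newH = 960, 640
    pil_img = pil_img.resize((newW, newH))
    img_nd = np.array(pil_img)
    if len(img_nd.shape) == 2:
        # add a channel axis for grayscale masks
        img_nd = np.expand_dims(img_nd, axis=2)
    # HWC to CHW
    img_trans = img_nd.transpose((2, 0, 1))
    if img_trans.max() > 1:
        img_trans = img_trans / 255
    return img_trans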
Just run the train.py file directly.

Step3: Convert pt model file to onnx

test.py:
The dummy_input here is set to (1, 3, 640, 960) to match the modified input size, and the model is initialized with the same parameters as in train.py so that the exported graph is consistent with training.
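A minimal export sketch; the checkpoint path and UNet arguments are assumptions based on the repository defaults, so adjust them to your own training run.

import torch
from unet import UNet

# Initialize the model with the same arguments as in train.py
net = UNet(n_channels=3, n_classes=1)
net.load_state_dict(torch.load("checkpoints/CP_epoch1.pth", map_location="cpu"))
net.eval()

# N, C, H, W matches the 960x640 input chosen above
dummy_input = torch.randn(1, 3, 640, 960)
torch.onnx.export(net, dummy_input, "unet_deconv.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=11)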

Step4: Convert onnx to trt model

onnx2trt unet_deconv.onnx -o unet_deconv.trt

Error 1: the ONNX model is too complex to convert.
Solution: install the onnx-simplifier tool.

pip install onnx-simplifier
python -m onnxsim input_onnx_model output_onnx_model
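Equivalently, the simplification can be run from Python. A minimal sketch, assuming the file names used later in this post (unet_deconv.onnx simplified to unet_deconv_sim.onnx):

import onnx
from onnxsim import simplify

model = onnx.load("unet_deconv.onnx")
# simplify returns the simplified model and a validation flag
model_simp, check = simplify(model)
assert check, "the simplified ONNX model could not be validated"
onnx.save(model_simp, "unet_deconv_sim.onnx")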

Error 2: cannot open shared object file libnvinfer.so.8
Solution: register the TensorRT lib directory with the dynamic linker

sudo gedit /etc/ld.so.conf
Add a line (the path to the lib directory of your TensorRT):
/home/lab/xcy/TensorRT-8.5.2.2/lib
sudo ldconfig

Error 3: cannot find libnvinfer.so.8.4.3
Solution:

sudo cp /home/lab/xcy/TensorRT-8.5.2.2/lib/libnvinfer_builder_resource.so.8.4.3 /usr/lib

Step5: Write the inference code

inference.py

import os
import sys
import time
# from PIL import Image
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import cv2
# TensorRT logger singleton
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
 
def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    class HostDeviceMem(object):
        def __init__(self, host_mem, device_mem):
            self.host = host_mem
            self.device = device_mem

        def __str__(self):
            return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

        def __repr__(self):
            return self.__str__()

    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))

    return inputs, outputs, bindings, stream

def load_engine(trt_path):
    # Deserialize the engine
    with open(trt_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


class TRTInference(object):
    """Manages TensorRT objects for model inference."""
 
    def __init__(self, trt_engine_path, onnx_model_path, trt_engine_datatype=trt.DataType.FLOAT, batch_size=1):
        """Initializes TensorRT objects needed for model inference.
        Args:
            trt_engine_path (str): path where TensorRT engine should be stored
            onnx_model_path (str): path of the .onnx model
            trt_engine_datatype (trt.DataType):
                requested precision of TensorRT engine used for inference
            batch_size (int): batch size for which engine
                should be optimized for
        """
 
        # Initialize runtime needed for loading TensorRT engine from file
        # TRT engine placeholder
        self.trt_engine = None
 
        # Display requested engine settings to stdout
        print("TensorRT inference engine settings:")
        print("  * Inference precision - {}".format(trt_engine_datatype))
        print("  * Max batch size - {}\n".format(batch_size))
        # If we get here, the file with engine exists, so we can load it
        if not self.trt_engine:
            print("Loading cached TensorRT engine from {}".format(
                trt_engine_path))
            self.trt_engine = load_engine(
                trt_engine_path)
 
        # This allocates memory for network inputs/outputs on both CPU and GPU
        self.inputs, self.outputs, self.bindings, self.stream = allocate_buffers(self.trt_engine)
 
        # Execution context is needed for inference
        self.context = self.trt_engine.create_execution_context()
 
    def infer(self, full_img, output_shapes, new_width, new_height):
        """Infers model on given image.
        Args:
            image_path (str): image to run object detection model on
        """
        
        assert new_width > 0 and new_height > 0, "Scale is too small"
        # resize and transform to array
        scale_img = cv2.resize(full_img, (new_width, new_height))
        print("scale image shape:{}".format(scale_img.shape))
        # scale_img = np.array(scale_img)
        # HWC to CHW
        scale_img = scale_img.transpose((2, 0, 1))
        # Normalize to [0, 1]
        if scale_img.max() > 1:
            scale_img = scale_img / 255
        # Expand the batch dimension
        # scale_img = np.expand_dims(scale_img, axis=0)
        # Make the array contiguous in C order
        scale_img = np.array(scale_img, dtype=np.float32, order='C')
        # Copy it into appropriate place into memory
        # (self.inputs was returned earlier by allocate_buffers())
        np.copyto(self.inputs[0].host, scale_img.ravel())
        # Output shapes expected by the post-processor
        # output_shapes = [(1, 11616, 4), (11616, 21)]
        # When infering on single image, we measure inference
        # time to output it to the user
        inference_start_time = time.time()
 
        # Fetch output from the model
        trt_outputs = do_inference(
            self.context, bindings=self.bindings, inputs=self.inputs,
            outputs=self.outputs, stream=self.stream)
        print("network output shape:{}".format(trt_outputs[0].shape))
        # Output inference time
        print("TensorRT inference time: {} ms".format(
            int(round((time.time() - inference_start_time) * 1000))))
        # Before doing post-processing, we need to reshape the outputs as the common.do_inference will
        # give us flat arrays.
        outputs = [output.reshape(shape) for output, shape in zip(trt_outputs, output_shapes)]
        # And return results
        return outputs
 
 
# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

PS: the complete inference code is given here because a lot of the material available online is incomplete.

The parameters that predict.py needs to set according to the actual situation:
engine_file_path: path of the engine file
onnx_file_path: path of the ONNX file
new_width, new_height: input width and height
trt_engine_datatype: engine precision, FP32 and FP16 are supported
image_path: path of the test image

import tensorrt as trt
import numpy as np
import cv2
import inference as inference_utils  # TRT/TF inference wrappers
 
if __name__ == "__main__":
    # 1. Build the network (load the TensorRT engine)
    # Precision command line argument -> TRT Engine datatype
    TRT_PRECISION_TO_DATATYPE = {
        16: trt.DataType.HALF,
        32: trt.DataType.FLOAT
    }
    # datatype: float16
    trt_engine_datatype = TRT_PRECISION_TO_DATATYPE[16]
    # batch size = 1
    max_batch_size = 1
    engine_file_path = "unet_deconv_sim.trt"
    onnx_file_path = "unet_deconv_sim.onnx"
    new_width, new_height = 960, 640
    output_shapes = [(1, new_height, new_width)]
    trt_inference_wrapper = inference_utils.TRTInference(
        engine_file_path, onnx_file_path,
        trt_engine_datatype, max_batch_size,
    )
    
    # 2. Image preprocessing
    image_path = "example.jpg"
    img = cv2.imread(image_path)
    # inference
    trt_outputs = trt_inference_wrapper.infer(img, output_shapes, new_width, new_height)[0]
    # Post-process the output
    out_threshold = 0.5
    print("the size of tensorrt output : {}".format(trt_outputs.shape))
    output = trt_outputs.transpose((1, 2, 0))
    # Binarize the pixel values
    output[output > out_threshold] = 255
    output[output <= out_threshold] = 0
    
    output = output.astype(np.uint8)
    result = cv2.resize(output, (img.shape[1], img.shape[0]))
    cv2.imwrite("best_output_deconv.jpg", result)

Final result:
Because this network was only used to test whether TensorRT can be deployed, the model was trained for just one epoch to save time, so the accuracy is not high.

Chapter 3: Pytorch-grcnn to Tensorrt-grcnn

The antipodal robotic grasping network takes a 224x224 image with 1, 3, or 4 channels (depth-only, RGB, or RGB-D) as input, and outputs the grasp position map, the grasp sin and cos maps, and the gripper opening width. Its GitHub project is here; below is the conversion procedure for the RGB-input network.

Step1: Train the RGB-input PyTorch network

This step is not explained here; there is a tutorial in the README on the GitHub project page.

Step2: Convert the model file to onnx file

The content of the onnxtotrt.py code is as follows:
Since the input image is three-channel RGB, the input shape is (3, 224, 224), so dummy_input needs to be modified accordingly, and the model parameters (input channels, number of channels, etc.) initialized to match. Note that because the trained model was not saved as a .pt state dict, you only need to call torch.load to load it. The rest is consistent with the conversion in Chapter 2.
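A minimal sketch of onnxtotrt.py under those assumptions; the checkpoint path and output names are illustrative, not the exact ones from the repository.

import torch

# The training script saves the whole network object, so torch.load returns the model directly
model = torch.load("trained-models/epoch_30_iou_0.97", map_location="cpu")
model.eval()

# RGB input: batch 1, 3 channels, 224x224
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "grcnn.onnx",
                  input_names=["input"],
                  output_names=["pos", "cos", "sin", "width"],
                  opset_version=11)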

Step3: Convert onnx to .trt file

Due to space constraints, only the conversion command is given here:

onnx2trt grcnn.onnx -o grcnn.trt

Step4: Write the inference code

After obtaining the .trt file, start writing the inference and predict code.
The key point in inferencetest.py:
Since the network has four outputs, after do_inference returns its flat arrays, the four results need to be separated and reshaped, as sketched below.
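A minimal sketch of that step; it assumes the engine bindings come back in the order pos, cos, sin, width and that each map is 224x224, so check your own binding order.

import numpy as np

def split_grcnn_outputs(trt_outputs, size=224):
    # do_inference returns one flat host array per output binding
    shapes = [(1, size, size)] * 4
    pos_out, cos_out, sin_out, width_out = [
        np.asarray(out).reshape(shape) for out, shape in zip(trt_outputs, shapes)
    ]
    return pos_out, cos_out, sin_out, width_out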
The key point in predict.py:
The input image preprocessing here is consistent with that of run_offline.py in the original network, except that the depth input is set to False. The difference from the PyTorch version of run_offline.py is that, to deploy with TensorRT, the predict call inside it is replaced by the TensorRT inference function; everything else stays unchanged. The post-processing function is also the same as in the original network, except that the PyTorch tensor operations need to be rewritten with NumPy operations, as sketched below.
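A sketch of such a NumPy rewrite; the Gaussian smoothing and the width scale factor follow the original repository's post-processing and should be treated as assumptions.

import numpy as np
from skimage.filters import gaussian

def post_process_output_np(q_img, cos_img, sin_img, width_img):
    # np.arctan2 replaces torch.atan2; everything stays on the CPU as plain arrays
    q_img = np.squeeze(q_img)
    ang_img = np.squeeze(np.arctan2(sin_img, cos_img) / 2.0)
    width_img = np.squeeze(width_img) * 150.0
    q_img = gaussian(q_img, 2.0, preserve_range=True)
    ang_img = gaussian(ang_img, 2.0, preserve_range=True)
    width_img = gaussian(width_img, 1.0, preserve_range=True)
    return q_img, ang_img, width_img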
The final result is as follows

Using the evaluation code, all images in the dataset are used for verification. The original network's inference time is about 33 ms per image, while the inference time after TensorRT acceleration is lower.
So there is still a clear acceleration effect (the blogger's laptop is not very powerful); it should be even better on a desktop GPU. In addition, the GPU memory occupied by the original network is much larger than after TensorRT acceleration; the blogger forgot to take a screenshot, so this is stated only verbally. The open-source TensorRT code is here. Don't just take it for free: give it a star and clone it!


Origin blog.csdn.net/Stay_Foo_lish/article/details/129132178