Several ways to convert pytorch model (.pth) to tensorrt model (.engine)

Preface

This article summarizes several ways to convert a trained PyTorch model into a TensorRT model for deployment. The overall conversion pipeline is roughly as follows:

  1. Export the network definition and associated weights;
  2. Parse the network definition and associated weights;
  3. Build the optimal execution plan for the target GPU and its available operators;
  4. Serialize and store the execution plan;
  5. Deserialize the execution plan;
  6. Run inference.

The third point is worth noting: a model converted by TensorRT is bound to the hardware. If the GPU or its related driver software (CUDA, cuDNN) changes during deployment, the model needs to be re-converted.

1. trtexec

trtexec is a conversion tool that ships with the TensorRT package and lives in its bin directory. It is easy to use and is the simplest way to produce a TRT engine. CUDA and cuDNN must be installed on the system beforehand, otherwise it will not run. A usage example follows.

First, convert the PyTorch model into an ONNX model. Sample code:

import torch
import torch.onnx as tr_onnx

def torch2onnx(model_path, onnx_path):
    model = load_model(model_path)          # user-defined function that loads the trained model
    test_arr = torch.randn(1, 3, 32, 448)   # dummy input with the expected input shape
    input_names = ['input']
    output_names = ['output']
    tr_onnx.export(
        model,
        test_arr,
        onnx_path,
        verbose=False,
        opset_version=11,
        input_names=input_names,
        output_names=output_names,
        # Dynamic inference along the W dimension; change the axis for other dynamic
        # dimensions, or comment this argument out if dynamic inference is not needed
        dynamic_axes={"input": {3: "width"}}
    )
    print('->> Model converted successfully!')
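
Before handing the ONNX file to trtexec, it can be worth sanity-checking the export. A minimal sketch, assuming onnx and onnxruntime are installed (neither is required by trtexec itself):

import numpy as np
import onnx
import onnxruntime as ort

onnx_path = "repvgg_a1.onnx"                      # file produced by torch2onnx above
onnx.checker.check_model(onnx.load(onnx_path))    # structural validity check

# Run one forward pass with onnxruntime and inspect the output shapes
sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
dummy = np.random.randn(1, 3, 32, 448).astype(np.float32)
outputs = sess.run(None, {"input": dummy})
print([o.shape for o in outputs])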

The trtexec conversion command is as follows:

Fixed size model conversion:

./trtexec --onnx=repvgg_a1.onnx --saveEngine=repvgg_a1.engine --workspace=1024  --fp16

Dynamic size model conversion:

./trtexec --onnx=repvgg_a1.onnx --saveEngine=repvgg_a1.engine --workspace=1024 --minShapes=input:1x3x32x32 --optShapes=input:1x3x32x320 --maxShapes=input:1x3x32x640 --fp16

Detailed explanation of parameters:

  • --onnx: path to the ONNX model
  • --saveEngine: path where the serialized TRT inference engine is saved
  • --workspace: workspace size in megabytes (default = 16)
  • --minShapes: minimum input shapes for the dynamic-shape optimization profile
  • --optShapes: optimal input shapes for the dynamic-shape optimization profile
  • --maxShapes: maximum input shapes for the dynamic-shape optimization profile
  • --fp16: enable float16 precision inference (recommended: it speeds things up and the accuracy drop is usually small)
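
The .engine file produced by trtexec is a serialized plan; at inference time it is deserialized with the TensorRT runtime. A minimal Python sketch of loading it (assuming the tensorrt Python package is installed; for a dynamic engine the concrete input shape must also be chosen within the min/max range before inference):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize the engine built by trtexec
with open("repvgg_a1.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
# Only needed for an engine built with --minShapes/--optShapes/--maxShapes
context.set_binding_shape(0, (1, 3, 32, 256))
print(context.get_binding_shape(0))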

2. torch2trt

torch2trt is an easy-to-use PyTorch-to-TensorRT converter maintained by NVIDIA. It is simple to use, but the environment setup is more involved than the method above: torch, torch2trt, and tensorrt must be installed in advance. To install TensorRT in a Python environment, locate the wheel packages inside the TensorRT .tar package and install them directly with pip:


#1. Install tensorrt
cd ~/TensorRT-8.2.4.2/python
pip install tensorrt-8.2.4.2-cp37-none-linux_x86_64.whl

#2. Install the Python UFF wheel. Only needed if you use TensorRT together with TensorFlow (used for pb -> TensorRT conversion)
cd ~/TensorRT-8.2.4.2/uff
pip install uff-0.6.9-py2.py3-none-any.whl

#3. Install the Python graphsurgeon wheel (lets TensorRT handle custom network structures)
cd ~/TensorRT-8.2.4.2/graphsurgeon
pip install graphsurgeon-0.4.5-py2.py3-none-any.whl

#Note: TRT 7.0 does not ship this package (skip this step there)
#4. Install the Python onnx-graphsurgeon wheel
cd ~/TensorRT-8.2.4.2/onnx_graphsurgeon
pip install onnx_graphsurgeon-0.3.12-py2.py3-none-any.whl

#5. Install pycuda, which enables CUDA programming from Python
pip install pycuda

#6. Verify the installation: if the tensorrt version is printed, the installation succeeded
python
import tensorrt
tensorrt.__version__

torch2trt installation:

git clone https://github.com/NVIDIA-AI-IOT/torch2trt.git
cd torch2trt
sudo python setup.py install --plugins

Model conversion code usage example:

import os
import torch
import tensorrt as trt
from torch2trt import torch2trt

# model_path, output and logger are assumed to be defined by the surrounding script
model = load_model(model_path)
model.cuda().eval()
arr = torch.ones(1, 3, 32, 448).cuda()      # example input with the deployment shape
model_trt = torch2trt(model,
                      [arr],
                      fp16_mode=True,
                      log_level=trt.Logger.INFO,
                      max_workspace_size=(1 << 32),
                      max_batch_size=1,
                      )
torch.save(model_trt.state_dict(), os.path.join(output, "model_trt.pth"))

logger.info("Converted TensorRT model done.")

# Also dump the raw serialized engine so it can be loaded from C++
engine_file = os.path.join(output, "model_trt.engine")
with open(engine_file, "wb") as f:
    f.write(model_trt.engine.serialize())
logger.info("Converted TensorRT model engine file is saved for C++ inference.")

The saved .pth file can be loaded into a TRTModule, which can then run inference just like a regular torch model:

from torch2trt import TRTModule

model_trt = TRTModule()
model_trt.load_state_dict(torch.load('model_trt.pth'))
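
Once loaded, the TRTModule can be called like the original torch model; a quick check with the same 1x3x32x448 input shape used during conversion might look like this:

import torch

x = torch.ones(1, 3, 32, 448).cuda()
y_trt = model_trt(x)      # runs the TensorRT engine under the hood
print(y_trt.shape)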

The serialized engine file can be loaded by a TensorRT program for inference, but note that torch2trt does not support dynamic inference. For more usage examples, refer to the torch2trt GitHub instructions.

3. torch2trt_dynamic

torch2trt_dynamic is the dynamic-inference version of torch2trt: the converted model supports dynamic input shapes. Its usage is basically the same as torch2trt. First, install it:

git clone https://github.com/grimoire/torch2trt_dynamic.git 
cd torch2trt_dynamic
python setup.py develop

A usage example follows; compared with torch2trt, it adds a dynamic-shape parameter (opt_shape_param):

from torch2trt_dynamic import torch2trt_dynamic
import torch
from torch import nn
from torchvision.models.resnet import resnet50
import os

# create some regular pytorch model...
model = resnet50().cuda().eval()

# create example data
x = torch.ones((1, 3, 224, 224)).cuda()

# convert to TensorRT feeding sample data as input
opt_shape_param = [
    [
        [1, 3, 128, 128],   # min
        [1, 3, 256, 256],   # opt
        [1, 3, 512, 512]    # max
    ]
]
model_trt = torch2trt_dynamic(model, [x], fp16_mode=False, opt_shape_param=opt_shape_param)
torch.save(model_trt.state_dict(), os.path.join(output, "model_trt.pth"))

logger.info("Converted TensorRT model done.")

engine_file = os.path.join(output, "model_trt.engine")
with open(engine_file, "wb") as f:
    f.write(model_trt.engine.serialize())
logger.info("Converted TensorRT model engine file is saved for C++ inference.")

4. Parsing the ONNX model with the TensorRT parser

If you do not want to use a conversion tool, you can also write your own code: use TensorRT's parser interface to parse the ONNX model and build the engine. This method is relatively simple, does not depend on other libraries, and supports converting models for dynamic inference. A Python example follows:

# --*-- coding:utf-8 --*--
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import time
import cv2, os
import numpy as np
import math

TRT_LOGGER = trt.Logger()

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        """
        host_mem: cpu memory
        device_mem: gpu memory
        """
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def get_engine(max_batch_size=1, onnx_file_path="", engine_file_path="", fp16_mode=False, save_engine=False, input_dynamic=False):
    """
    params max_batch_size:      maximum batch size, set in advance so GPU memory can be allocated
    params onnx_file_path:      path to the ONNX file
    params engine_file_path:    path where the serialized engine will be saved
    params fp16_mode:           whether to use FP16
    params save_engine:         whether to save the engine
    returns:                    ICudaEngine
    """
    # If a serialized engine already exists, deserialize it directly into a cudaEngine
    if os.path.exists(engine_file_path):
        print("Reading engine from file: {}".format(engine_file_path))
        with open(engine_file_path, 'rb') as f, \
                trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())  # deserialize
    else:  # build the cudaEngine from the ONNX file

        # Create a builder from the logger;
        # the builder creates a computation graph (INetworkDefinition)
        explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        # In TensorRT 7.0, the ONNX parser only supports full-dimensions mode, meaning that your network definition must be created with the explicitBatch flag set. For more information, see Working With Dynamic Shapes.

        with trt.Builder(TRT_LOGGER) as builder, \
                builder.create_network(explicit_batch) as network, \
                trt.OnnxParser(network, TRT_LOGGER) as parser:  # bind the ONNX parser to the network; parsing fills in the computation graph
            # builder.max_workspace_size = 1 << 30  # pre-allocated workspace size, i.e. the maximum GPU memory the ICudaEngine may need at execution time
            config = builder.create_builder_config()
            config.max_workspace_size = 1 << 30

            builder.max_batch_size = max_batch_size  # maximum batch size usable at execution time
            if fp16_mode:
                config.set_flag(trt.BuilderFlag.FP16)
            # builder.fp16_mode = fp16_mode

            # Parse the ONNX file and fill in the computation graph
            if not os.path.exists(onnx_file_path):
                quit("ONNX file {} not found!".format(onnx_file_path))
            print('loading onnx file from path {} ...'.format(onnx_file_path))
            with open(onnx_file_path, 'rb') as model:  # serialized network structure and weights
                print("Beginning onnx file parsing")
                parser.parse(model.read())  # parse the ONNX file
            # parser.parse_from_file(onnx_file_path)  # the parser can also parse an ONNX model directly from a file path
            print("Completed parsing of onnx file")

            # Once the graph is filled in, use the builder to create the CudaEngine from it
            print("Building an engine from file {}; this may take a while...".format(onnx_file_path))
            if input_dynamic:  # dynamic inference: add an optimization profile with min/opt/max input shapes
                profile = builder.create_optimization_profile()
                profile.set_shape("input", (1, 3, 32, 32), (1, 3, 32, 320), (1, 3, 32, 640))
                config.add_optimization_profile(profile)
            print(network.get_layer(network.num_layers - 1).get_output(0).shape)
            engine = builder.build_engine(network, config)
            print("Completed creating Engine")
            if save_engine:  # save the engine so it can be deserialized directly next time
                with open(engine_file_path, 'wb') as f:
                    f.write(engine.serialize())  # serialize
            return engine

if __name__ == "__main__":
    # These settings depend on your hardware
    fp16_mode = True
    max_batch_size = 1
    onnx_model_path = "./repvgg_a1.onnx"
    trt_engine_path = "./repvgg_a1.engine"
    # Build a cudaEngine
    engine = get_engine(max_batch_size, onnx_model_path, trt_engine_path, fp16_mode, True, True)
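
The HostDeviceMem class defined above is normally paired with a buffer-allocation helper once the engine is used for inference. The following sketch follows the pattern of NVIDIA's Python samples and assumes static binding shapes (for a dynamic engine, the concrete shapes have to be set on the execution context and queried from there instead):

def allocate_buffers(engine):
    """Allocate page-locked host buffers and device buffers for every binding (sketch)."""
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:  # iterates over binding names
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(size, dtype)     # page-locked host memory
        device_mem = cuda.mem_alloc(host_mem.nbytes)      # device memory of the same size
        bindings.append(int(device_mem))
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream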

C++ example for parsing ONNX:

//step1: create the logger
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        // suppress info-level messages
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;


//step2: create the builder
IBuilder* builder = createInferBuilder(gLogger);

//step3: create the network (the ONNX parser requires the explicit-batch flag)
const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
INetworkDefinition* network = builder->createNetworkV2(explicitBatch);

//step4: create the parser
nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, gLogger);

//step5: parse the model with the parser to populate the network
const char* onnx_filename = "./model.onnx";
parser->parseFromFile(onnx_filename, static_cast<int>(ILogger::Severity::kWARNING));
for (int i = 0; i < parser->getNbErrors(); ++i)
{
    std::cout << parser->getError(i)->desc() << std::endl;
}

//step6: mark the network outputs (the ONNX parser marks them automatically;
//       mark additional tensors with network->markOutput(...) only if needed)

//step7: create the config and set the maximum batch size and workspace size
IBuilderConfig* config = builder->createBuilderConfig();
builder->setMaxBatchSize(maxBatchSize);  // maximum batch size
config->setMaxWorkspaceSize(1 << 30);    // 2^30 bytes, i.e. 1 GB

//step8: build the engine
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
assert(engine);

//step9: serialize the engine and save it to a plan file
IHostMemory* serializedModel = engine->serialize();
assert(serializedModel != nullptr);
std::ofstream p("xxxxx.engine", std::ios::binary);
p.write(reinterpret_cast<const char*>(serializedModel->data()), serializedModel->size());

//step10: release resources
serializedModel->destroy();
engine->destroy();
parser->destroy();
network->destroy();
config->destroy();
builder->destroy();

5. tensorrtx

tensorrtx takes a rather unusual approach to building the model: it first constructs the network with TensorRT's own API and then assigns the weights to it. As long as the network is defined correctly, the conversion basically cannot fail, which neatly side-steps the unsupported-operator problems that can occur when converting ONNX to TRT. The downside is that the process is more involved and dynamic-shape inference is not supported. TensorRT's ONNX support is now very good and almost all ONNX models convert successfully, so try this method only if you do not mind the extra work. Example workflow (using yolov5):

Clone tensorrtx

git clone https://github.com/wang-xinyu/tensorrtx.git

Generate the yolov5s.wts file: download the weight file yolov5s.pt, copy tensorrtx/yolov5/gen_wts.py into ultralytics/yolov5, and run:

python gen_wts.py

Compile tensorrtx/yolov5 (this builds the yolov5 executable that will later generate the engine file):

mkdir build
cd build
cmake ..
make

By default, the s model is built with FP16 inference and batch size 1; for other yolov5 variants, modify the relevant parameters in the code:

#define USE_FP16
#define DEVICE 0                   // GPU ID
#define NMS_THRESH 0.4
#define CONF_THRESH 0.5
#define BATCH_SIZE 1

#define NET s            // s m x l 

Copy yolov5s.wts into the tensorrtx/yolov5/build directory, then run the following commands to generate yolov5s.engine and run detection on the sample images:

sudo ./yolov5 -s yolov5s.wts yolov5s.engine s
sudo ./yolov5 -d yolov5s.engine ../samples
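
To make the tensorrtx approach more concrete: the network is defined layer by layer with the TensorRT API and the trained weights are attached to each layer. The real implementation is in C++; the rough Python sketch below only illustrates the pattern, with random arrays standing in for the weights that tensorrtx reads from the .wts file:

import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Define the graph with the TensorRT API and attach the weights directly
data = network.add_input("input", trt.float32, (1, 3, 640, 640))
conv_w = np.random.randn(32, 3, 3, 3).astype(np.float32)   # would come from the .wts file
conv_b = np.zeros(32, dtype=np.float32)
conv = network.add_convolution_nd(data, num_output_maps=32, kernel_shape=(3, 3),
                                  kernel=conv_w, bias=conv_b)
conv.stride_nd = (2, 2)
conv.padding_nd = (1, 1)
relu = network.add_activation(conv.get_output(0), trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30
engine = builder.build_engine(network, config)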

6. onnx-tensorrt

onnx-tensorrt is the official ONNX conversion repository; it provides branches corresponding to each TensorRT version. The 8.2-EA branch is used here. The compilation steps are:

First clone onnx-tensorrt:

git clone --recursive -b 8.2-EA https://github.com/onnx/onnx-tensorrt.git

Compile:

cd onnx-tensorrt
mkdir build
cd build
# replace /path/to/TensorRT-8.2.4.2 with the absolute path of your own TensorRT installation
cmake .. -DTENSORRT_ROOT=/path/to/TensorRT-8.2.4.2
make -j8
make install

After compilation, check your CUDA environment variables. If they are already configured, nothing more is needed; otherwise configure them as follows.

In a terminal, run vim ~/.bashrc and add:

#cuda
export PATH=/usr/local/cuda-11.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH

Save and exit, then run source ~/.bashrc for the changes to take effect.

The onnx-tensorrt conversion command below serializes the model into an engine:

onnx2trt my_model.onnx -o my_engine.trt

Convert to human-readable text:

onnx2trt my_model.onnx -t my_model.onnx.txt

Use onnx-tensorrt on the python side:

#install tensorrt
python3 -m pip install <tensorrt_install_dir>/python/tensorrt-8.x.x.x-cp<python_ver>-none-linux_x86_64.whl
#install onnx
python3 -m pip install onnx==1.8.0
#install onnx-tensorrt (run inside the onnx-tensorrt directory)
python3 setup.py install

Python usage example for inference:

import onnx
import onnx_tensorrt.backend as backend
import numpy as np

model = onnx.load("/path/to/model.onnx")
engine = backend.prepare(model, device='CUDA:0')
input_data = np.random.random(size=(32, 3, 224, 224)).astype(np.float32)
output_data = engine.run(input_data)[0]
print(output_data)
print(output_data.shape)

7. Torch-TensorRT

Torch-TensorRT
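
Torch-TensorRT compiles a PyTorch/TorchScript module directly into a module with embedded TensorRT engines, so no explicit ONNX step is needed. A minimal usage sketch, assuming the torch_tensorrt package is installed and MyModel stands in for your own network:

import torch
import torch_tensorrt

model = MyModel().eval().cuda()   # placeholder for your own torch.nn.Module

trt_module = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.float32)],
    enabled_precisions={torch.half},   # allow FP16 kernels
)

out = trt_module(torch.randn(1, 3, 224, 224).cuda())
torch.jit.save(trt_module, "model_trt.ts")   # TorchScript frontend output can be saved like this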

8. Manually parse ONNX (C++ version)

Onnx2TensorRT

Origin blog.csdn.net/qq_39056987/article/details/124588857