Comparing GPU inference in PyTorch, ONNX Runtime, and TensorRT, with installation instructions and annotated code

Files needed for testing
Test image:
https://upload.wikimedia.org/wikipedia/commons/2/26/YellowLabradorLooking_new.jpg
Class labels file:
https://raw.githubusercontent.com/Lasagne/Recipes/master/examples/resnet50/imagenet_classes.txt
A packaged copy of both files can also be downloaded here:
https://download.csdn.net/download/m0_59156726/88478676
The code below reads both files from an imagenet_class/ directory, with the labels file named imagenet_class.txt.

1. PyTorch inference

The model is the ResNet-50 that ships with torchvision.
torchvision provides models pre-trained on ImageNet that can be used directly in PyTorch for image classification.

The code is short and speaks for itself.

import time
from torchvision import models, transforms
import torch
from PIL import Image

# Load ResNet-50 using the weights API introduced in torchvision 0.13
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Load the class labels
with open('imagenet_class/imagenet_class.txt') as f:
    classes = [line.strip() for line in f.readlines()]

device = torch.device("cuda:0")

# Move the model to the GPU
resnet.to(device)

# Switch to evaluation (inference) mode
resnet.eval()

# Image preprocessing
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Load the image
img = Image.open("./imagenet_class/YellowLabradorLooking_new.jpg")

# Apply the preprocessing
img_t = transform(img)

# Add a batch dimension and move the tensor to the GPU
img_t = img_t.unsqueeze(0).to(device)

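# Run inference in a loop to measure the latency
with torch.no_grad():  # gradients are not needed for inference
    for i in range(10):
        t1 = time.time()
        out = resnet(img_t)  # logits, shape (1, 1000)
        torch.cuda.synchronize()  # wait for the GPU to finish before stopping the timer
        t2 = time.time()
        print("time:", t2 - t1)

# Sort the logits in descending order; shape (1, 1000)
out_sorted, indices = torch.sort(out, descending=True)
percentage = torch.nn.functional.softmax(out, dim=1)[0] * 100

# Top-5 predictions
top5_list = [(classes[idx], percentage[idx].item()) for idx in indices[0][:5]]

# Print the results
print(top5_list)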

Results: the first inference takes noticeably longer; subsequent runs average about 10 ms. The top-1 probability is 52.3%, and the image is classified correctly.
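Because the first call includes one-time setup cost and CUDA kernels run asynchronously, latency figures like these are usually taken after a warm-up and with an explicit synchronization before reading the clock. The following is a minimal sketch of such a timing helper. `benchmark` and its parameters are hypothetical names, not part of PyTorch; for the model above one would pass a function that calls `resnet(img_t)` and `sync=torch.cuda.synchronize`.

```python
import time
from statistics import mean


def benchmark(fn, warmup=1, iters=10, sync=None):
    """Return per-call latencies in seconds, excluding warm-up calls.

    fn   : zero-argument callable that runs one inference (hypothetical wrapper)
    sync : optional callable that blocks until queued GPU work is done,
           e.g. torch.cuda.synchronize; leave as None for CPU-only code
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the GPU before stopping the clock
        latencies.append(time.perf_counter() - start)
    return latencies


# Stand-in workload so the sketch runs without a GPU.
latencies = benchmark(lambda: sum(range(10_000)), warmup=2, iters=5)
print(f"mean latency: {mean(latencies) * 1000:.3f} ms over {len(latencies)} runs")
```

Reporting the mean of the timed runs, rather than eyeballing ten printed values, also makes the three backends easier to compare.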

2. ONNX Runtime GPU inference

2.1 Environment preparation

All of the NVIDIA GPU inference paths here rely on CUDA and cuDNN.
This article assumes the GPU build of PyTorch is already installed in your Python environment, since installing it also brings in CUDA and cuDNN. If it is not installed yet, a single command is enough. The following is the official PyTorch install command for CUDA 11.8 (see https://pytorch.org/):
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Installing PyTorch this way normally installs CUDA and cuDNN as well. Check the versions with:
import torch
print(torch.__version__)
print(torch.version.cuda)
print(torch.backends.cudnn.version())

Then install the onnxruntime-gpu release that matches your CUDA version:
pip install onnxruntime-gpu==xx.xx.xx
In most cases a plain pip install onnxruntime-gpu works as well.
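To confirm that the installed build can actually use the GPU, it helps to check which execution providers ONNX Runtime reports. The following is a small sketch that also runs when onnxruntime is not installed. `pick_providers` is a hypothetical helper; `ort.get_available_providers()` is the library call that lists the providers.

```python
def pick_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Keep the preferred providers that are actually available, in order."""
    return [name for name in preferred if name in available]


try:
    import onnxruntime as ort
    available = ort.get_available_providers()
except ImportError:
    # onnxruntime is not installed in this environment
    available = []

providers = pick_providers(available)
print("available:", available)
print("will use:", providers)
```

If CUDAExecutionProvider is missing from the list, the usual suspects are a CPU-only onnxruntime package or a CUDA/cuDNN version that does not match the installed onnxruntime-gpu release.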

2.2 Model conversion

ResNet-50 is used as the example again.

import time
from torchvision import models, transforms
import torch
from PIL import Image

# Load ResNet-50 using the weights API introduced in torchvision 0.13
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
# Switch to evaluation mode before exporting
resnet.eval()

# Export to ONNX; see the torch.onnx.export documentation for the other parameters
input_shape = (1, 3, 224, 224)
dummy_input = torch.randn(input_shape)
torch.onnx.export(resnet, dummy_input, "resnet50.onnx", verbose=True, opset_version=11, input_names=["input0"], output_names=["output0"])

2.3 ONNX Runtime inference

import time
import torch
import onnxruntime as ort
from PIL import Image
from torchvision import transforms

# Load the class labels
with open('imagenet_class/imagenet_class.txt') as f:
    classes = [line.strip() for line in f.readlines()]

# Configure the execution providers: CUDA first, CPU as fallback
providers = [
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'arena_extend_strategy': 'kNextPowerOfTwo',
        'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2 GiB
        'cudnn_conv_algo_search': 'EXHAUSTIVE',
        'do_copy_in_default_stream': True,
    }),
    'CPUExecutionProvider',
]

# Load the model
ort_session = ort.InferenceSession("resnet50.onnx", providers=providers)

# Image preprocessing
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Load the image
img = Image.open("./imagenet_class/YellowLabradorLooking_new.jpg")

# Apply the preprocessing
img_t = transform(img)
# Add a batch dimension: (3, 224, 224) -> (1, 3, 224, 224)
img_numpy = img_t.numpy()[None, :]
# Look up the input name once, outside the timed loop
input_name = ort_session.get_inputs()[0].name
for i in range(10):
    t1 = time.time()
    # Output shape (1, 1000)
    out = ort_session.run(None, {input_name: img_numpy})[0]
    t2 = time.time()
    print("time:", t2 - t1)

# Sort the logits in descending order; shape (1, 1000)
out = torch.from_numpy(out)
out_sorted, indices = torch.sort(out, descending=True)
percentage = torch.nn.functional.softmax(out, dim=1)[0] * 100

# Top-5 predictions
top5_list = [(classes[idx], percentage[idx].item()) for idx in indices[0][:5]]

# Print the results
print(top5_list)

Results: the first inference again takes noticeably longer; subsequent runs average about 4 ms. The top-1 probability is 52.3%, and the image is classified correctly. Accuracy is the same as with PyTorch, while each inference is about 6 ms faster.
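The post-processing used in these examples is a softmax over the 1000 logits followed by picking the five largest values. The following is a dependency-free sketch of that step on a toy input. The function name `top_k_softmax` and the four-class example are made up for illustration.

```python
import math


def top_k_softmax(logits, labels, k=5):
    """Return the k most probable (label, percent) pairs from raw logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total * 100 for e in exps]
    # Indices ordered from the largest logit to the smallest
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(labels[i], probs[i]) for i in order[:k]]


# Toy example with four made-up classes instead of the 1000 ImageNet ones.
labels = ["cat", "dog", "fox", "owl"]
logits = [1.0, 3.0, 0.5, 2.0]
top2 = top_k_softmax(logits, labels, k=2)
print(top2)
```

Since softmax preserves the ordering of the logits, sorting the logits and sorting the probabilities select the same classes.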

3. TensorRT inference

3.1 Installation preparation

Download the TensorRT release that matches your CUDA version:
https://developer.nvidia.com/nvidia-tensorrt-8x-download
A detailed installation tutorial is linked below for reference. Reading it is optional, since the steps that follow are sufficient:
https://blog.csdn.net/hjxu2016/article/details/122868139

I downloaded the build for CUDA 11.8, since that is the version I use.
After downloading and extracting the archive, add its lib directory to the PATH environment variable so that programs can find the DLLs.

pip installation
The python directory of the extracted package contains several wheel files. For details, see the release notes:
https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html

Next, change into that directory and install the wheel that matches your Python version:
pip install tensorrt-8.6.1-cp38-none-win_amd64.whl

After pip reports success, check that the package imports correctly:

import tensorrt as trt

If the import fails because a DLL cannot be found even though the environment variable is already configured, the likely cause is that the IDE was not restarted after the change and therefore has not loaded the new environment variables. In my case, reopening PyCharm made the error go away, and the installation was confirmed to work.
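A quick way to diagnose this is to check, from inside the running Python process, whether the TensorRT lib directory is visible on PATH. The following is a minimal sketch. `find_on_path` is a hypothetical helper, and the file name `nvinfer.dll` is an assumption about which library the Windows package ships, so adjust it to whatever your lib directory contains.

```python
import os


def find_on_path(filename, path_value=None):
    """Return the directories on PATH that contain `filename`."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for directory in path_value.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


# "nvinfer.dll" is an assumed file name. An empty list means the current
# process (for example an IDE started before PATH was changed) cannot see it.
print(find_on_path("nvinfer.dll"))
```

Running this once in a fresh terminal and once inside the IDE shows whether the two see the same PATH.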

3.2 Converting the ONNX model to TensorRT

The conversion uses the trtexec tool from the installation package; other conversion methods can be found by searching online. For convenience, add the directory containing the tool to the PATH environment variable as well.
Run the command below to convert the model. A usage reference that covers the common parameters is at https://blog.csdn.net/qq_43673118/article/details/123547503. Note that the default precision is FP32.
trtexec --onnx=resnet50.onnx --saveEngine=resnet_engine.trt
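The same conversion can be scripted. The sketch below only builds the trtexec argument list and leaves running it to the caller. `build_trtexec_cmd` is a hypothetical helper, and `--fp16` is shown as an optional trtexec flag for half precision, which the conversion in this article does not use.

```python
import shlex


def build_trtexec_cmd(onnx_path, engine_path, fp16=False):
    """Build the trtexec argument list for an ONNX-to-engine conversion."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")  # allow FP16 kernels; the default build is FP32
    return cmd


cmd = build_trtexec_cmd("resnet50.onnx", "resnet_engine.trt")
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True), with trtexec on PATH.
```

Passing the arguments as a list avoids shell quoting problems when the model path contains spaces.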

3.3 TensorRT inference

The official inference example is linked below; the API it uses is somewhat dated and produces warnings:
https://github.com/NVIDIA/TensorRT/blob/main/quickstart/SemanticSegmentation/tutorial-runtime.ipynb
It uses the CUDA driver through PyCUDA, which needs to be installed:
pip install pycuda
The inference code below is my own.
API reference:
https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Core/Engine.html

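import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context on import
import time
import torch
from PIL import Image
from torchvision import transforms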


# Create a TensorRT runtime and deserialize the engine
def load_engine(engine_file_path):
    with open(engine_file_path, "rb") as f, trt.Runtime(trt.Logger(trt.Logger.WARNING)) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


# Image preprocessing
def preprocess(input_file):
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

    # Load the image
    img = Image.open(input_file)

    # Apply the preprocessing
    img_t = transform(img)

    # Add a batch dimension: NCHW, shape (1, 3, 224, 224)
    return img_t.numpy()[None, :]


def infer(engine, input_file):
    input_image = preprocess(input_file)
    t1 = time.time()
    with engine.create_execution_context() as context:

        # Set the input shape. There is only one input, so one call is enough.
        # The call is optional here: the ONNX model was exported with a fixed shape
        # rather than a dynamic one, so the shape is always (1, 3, 224, 224).
        # It is included only to show how set_input_shape is used.
        # input0_shape = context.get_tensor_shape("input0")
        # Old API: context.set_binding_shape(engine.get_binding_index("input"), img.size())
        context.set_input_shape("input0", input_image.shape)

        # Allocate host (CPU) and device (GPU) buffers
        bindings = []
        # Iterate over the input and output tensors
        for binding in engine:
            # binding_idx = engine.get_binding_index(binding)
            size = trt.volume(context.get_tensor_shape(binding))  # element count, 1 * 3 * 224 * 224 for the input
            dtype = trt.nptype(engine.get_tensor_dtype(binding))  # matching NumPy dtype

            # Old API: engine.binding_is_input(binding)
            if engine.get_tensor_mode(binding) == trt.TensorIOMode.INPUT:
                input_buffer = np.ascontiguousarray(input_image)
                input_memory = cuda.mem_alloc(input_image.nbytes)
                bindings.append(int(input_memory))
            else:
                output_buffer = cuda.pagelocked_empty(size, dtype)
                output_memory = cuda.mem_alloc(output_buffer.nbytes)
                bindings.append(int(output_memory))

        # stream
        stream = cuda.Stream()

        # Transfer input data from CPU to the GPU.
        cuda.memcpy_htod_async(input_memory, input_buffer, stream)
        # Run inference
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)

        # Transfer prediction output from GPU to CPU.
        cuda.memcpy_dtoh_async(output_buffer, output_memory, stream)
        # Synchronize the stream
        stream.synchronize()

    t2 = time.time()

    # Load the class labels
    with open('imagenet_class/imagenet_class.txt') as f:
        classes = [line.strip() for line in f.readlines()]

    # Sort the logits in descending order; the output buffer is flat, shape (1000,)
    out = torch.from_numpy(output_buffer)
    out_sorted, indices = torch.sort(out, descending=True)
    percentage = torch.nn.functional.softmax(out, dim=0) * 100

    # Top-5 predictions
    top5_list = [(classes[idx], percentage[idx].item()) for idx in indices[:5]]

    # Print the elapsed time and the results
    print("time: ", t2 - t1)
    print(top5_list)


def run():
    engine_file_path = "resnet_engine.trt"
    input_file = "./imagenet_class/YellowLabradorLooking_new.jpg"

    engine = load_engine(engine_file_path)
    for i in range(10):
        infer(engine, input_file)


run()

The results are similar to those of ONNX Runtime. TensorRT still appears to have an edge, and since ResNet-50 is a small model, the advantage is likely to be larger on bigger models.
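For reference, the buffer sizes that the allocation loop in `infer` works with can be computed by hand. The following is a small sketch, assuming float32 tensors (4 bytes per element) and the fixed shapes used in this article: (1, 3, 224, 224) for the input and (1, 1000) for the output. `buffer_nbytes` is a hypothetical helper that mirrors the element count from `trt.volume` multiplied by the element size.

```python
import math


def buffer_nbytes(shape, itemsize=4):
    """Bytes needed for a dense tensor of `shape`; itemsize=4 assumes float32."""
    return math.prod(shape) * itemsize


input_bytes = buffer_nbytes((1, 3, 224, 224))   # 150528 elements
output_bytes = buffer_nbytes((1, 1000))         # 1000 elements
print(input_bytes, output_bytes)
```

Both buffers are small, so allocating them on every call costs little here, but the allocations are still included in the time that `infer` reports.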