9.4 tensorRT Advanced (4) Packaging Series: Using pybind11 to Develop Extension Modules for Python

Preface

I went through Teacher Du's from-scratch tensorRT high-performance deployment course before, but I didn't take notes and have forgotten a lot, so this time I'm going through it again and taking notes.

This lesson covers the tensorRT advanced topic of using pybind11 to develop extension modules for Python.

Please see the mind map below for the course syllabus

[Image: course syllabus mind map]

1. pybind11

In this section we learn how to write C++ extension modules for Python using pybind11. Specifically:

1. It implements the inference wrapper for yolov5.

2. A C++ class is wrapped and exposed as a corresponding Python class.

3. Most of the underlying work is implemented in C++, combining the computing performance of C++ with the convenience of Python.

Let's look directly at the code, starting with demo.py:

import yolo
import os
import cv2

# Build the TensorRT engine from the ONNX file if it does not exist yet
if not os.path.exists("yolov5s.trtmodel"):
    yolo.compileTRT(
        max_batch_size=1,
        source="yolov5s.onnx",
        output="yolov5s.trtmodel",
        fp16=False,
        device_id=0
    )

infer = yolo.Yolo("yolov5s.trtmodel")
if not infer.valid:
    print("invalid trtmodel")
    exit(0)

# commit() enqueues the image asynchronously; get() blocks until the boxes are ready
image = cv2.imread("rq.jpg")
boxes = infer.commit(image).get()

for box in boxes:
    l, t, r, b = map(int, [box.left, box.top, box.right, box.bottom])
    cv2.rectangle(image, (l, t), (r, b), (0, 255, 0), 2, 16)

cv2.imwrite("detect.jpg", image)

demo.py demonstrates how to use the compiled yolo extension library to run inference with a YOLO model; the yolo module itself is compiled from C++.

Now let's look at the corresponding C++ code; the main thing to learn here is how to use the third-party library pybind11:

#include <opencv2/opencv.hpp>
#include <common/ilogger.hpp>
#include "builder/trt_builder.hpp"
#include "app_yolo/yolo.hpp"
#include "pybind11.hpp"

using namespace std;
namespace py = pybind11;

class YoloInfer {
public:
	YoloInfer(
		string engine, Yolo::Type type, int device_id, float confidence_threshold, float nms_threshold,
		Yolo::NMSMethod nms_method, int max_objects, bool use_multi_preprocess_stream
	){
		instance_ = Yolo::create_infer(
			engine, 
			type,
			device_id,
			confidence_threshold,
			nms_threshold,
			nms_method, max_objects, use_multi_preprocess_stream
		);
	}

	bool valid(){
		return instance_ != nullptr;
	}

	shared_future<ObjectDetector::BoxArray> commit(const py::array& image){
		if(!valid())
			throw py::buffer_error("Invalid engine instance, please make sure construction succeeded");

		if(!image.owndata())
			throw py::buffer_error("Image must own its data; slices such as image[1:-1, 1:-1] are unsupported, use image.copy() instead");

		// Wrap the numpy buffer as a cv::Mat without copying (HWC, BGR, uint8 assumed)
		cv::Mat cvimage(image.shape(0), image.shape(1), CV_8UC3, (unsigned char*)image.data(0));
		return instance_->commit(cvimage);
	}

private:
	shared_ptr<Yolo::Infer> instance_;
}; 

bool compileTRT(
    int max_batch_size, string source, string output, bool fp16, int device_id, int max_workspace_size
){
    TRT::set_device(device_id);
    return TRT::compile(
        fp16 ? TRT::Mode::FP16 : TRT::Mode::FP32,
        max_batch_size, source, output, {}, nullptr, "", "", max_workspace_size
    );
}

PYBIND11_MODULE(yolo, m){

    py::class_<ObjectDetector::Box>(m, "ObjectBox")
		.def_property("left",        [](ObjectDetector::Box& self){return self.left;},       [](ObjectDetector::Box& self, float nv){self.left = nv;})
		.def_property("top",         [](ObjectDetector::Box& self){return self.top;},        [](ObjectDetector::Box& self, float nv){self.top = nv;})
		.def_property("right",       [](ObjectDetector::Box& self){return self.right;},      [](ObjectDetector::Box& self, float nv){self.right = nv;})
		.def_property("bottom",      [](ObjectDetector::Box& self){return self.bottom;},     [](ObjectDetector::Box& self, float nv){self.bottom = nv;})
		.def_property("confidence",  [](ObjectDetector::Box& self){return self.confidence;}, [](ObjectDetector::Box& self, float nv){self.confidence = nv;})
		.def_property("class_label", [](ObjectDetector::Box& self){return self.class_label;},[](ObjectDetector::Box& self, int nv){self.class_label = nv;})
		.def_property_readonly("width",  [](ObjectDetector::Box& self){return self.right - self.left;})
		.def_property_readonly("height", [](ObjectDetector::Box& self){return self.bottom - self.top;})
		.def_property_readonly("cx",     [](ObjectDetector::Box& self){return (self.left + self.right) / 2;})
		.def_property_readonly("cy",     [](ObjectDetector::Box& self){return (self.top + self.bottom) / 2;})
		.def("__repr__", [](ObjectDetector::Box& obj){
			return iLogger::format(
				"<Box: left=%.2f, top=%.2f, right=%.2f, bottom=%.2f, class_label=%d, confidence=%.5f>",
				obj.left, obj.top, obj.right, obj.bottom, obj.class_label, obj.confidence
			);	
		});

    py::class_<shared_future<ObjectDetector::BoxArray>>(m, "SharedFutureObjectBoxArray")
		.def("get", &shared_future<ObjectDetector::BoxArray>::get);

    py::enum_<Yolo::Type>(m, "YoloType")
		.value("V5", Yolo::Type::V5)
		.value("V3", Yolo::Type::V3)
		.value("X", Yolo::Type::X);

	py::enum_<Yolo::NMSMethod>(m, "NMSMethod")
		.value("CPU",     Yolo::NMSMethod::CPU)
		.value("FastGPU", Yolo::NMSMethod::FastGPU);

    py::class_<YoloInfer>(m, "Yolo")
		.def(py::init<string, Yolo::Type, int, float, float, Yolo::NMSMethod, int, bool>(), 
			py::arg("engine"), 
			py::arg("type")                 = Yolo::Type::V5, 
			py::arg("device_id")            = 0, 
			py::arg("confidence_threshold") = 0.4f,
			py::arg("nms_threshold") = 0.5f,
			py::arg("nms_method")    = Yolo::NMSMethod::FastGPU,
			py::arg("max_objects")   = 1024,
			py::arg("use_multi_preprocess_stream") = false
		)
		.def_property_readonly("valid", &YoloInfer::valid, "Infer is valid")
		.def("commit", &YoloInfer::commit, py::arg("image"));

    m.def(
		"compileTRT", compileTRT,
		py::arg("max_batch_size"),
		py::arg("source"),
		py::arg("output"),
		py::arg("fp16")                         = false,
		py::arg("device_id")                    = 0,
		py::arg("max_workspace_size")           = 1ul << 28
	);
}

The PYBIND11_MODULE macro is the core of pybind11. It defines a Python extension module and binds C++ classes and functions into it so that they can be used from Python.

Here we define a module named yolo and use m as a handle to the module. The details of this module definition are as follows:

1. ObjectBox class binding

py::class_<ObjectDetector::Box>(m, "ObjectBox")
    ...

Here we use py::class_ to define a Python class named ObjectBox, which corresponds to ObjectDetector::Box in C++. Within this binding, .def_property defines the read-write properties of Box via getter/setter lambdas, .def_property_readonly defines read-only derived properties (width, height, cx, cy), and .def binds the Python magic method __repr__, used for printing box information.
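Incidentally, since each of these getter/setter pairs just forwards to a public data member, pybind11's def_readwrite and def_readonly can express the same binding more concisely. A minimal sketch of an equivalent binding (assuming, as the lambdas already imply, that Box's members are public):

// Equivalent, more concise form: def_readwrite binds a public member
// directly, generating the getter and setter for us.
py::class_<ObjectDetector::Box>(m, "ObjectBox")
    .def_readwrite("left",        &ObjectDetector::Box::left)
    .def_readwrite("top",         &ObjectDetector::Box::top)
    .def_readwrite("right",       &ObjectDetector::Box::right)
    .def_readwrite("bottom",      &ObjectDetector::Box::bottom)
    .def_readwrite("confidence",  &ObjectDetector::Box::confidence)
    .def_readwrite("class_label", &ObjectDetector::Box::class_label)
    // Derived values still need a lambda, since they are computed
    .def_property_readonly("width", [](ObjectDetector::Box& self){ return self.right - self.left; });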

2. SharedFutureObjectBoxArray class binding

py::class_<shared_future<ObjectDetector::BoxArray>>(m, "SharedFutureObjectBoxArray")
    .def("get", &shared_future<ObjectDetector::BoxArray>::get);

Here we define a Python class named SharedFutureObjectBoxArray for the shared_future<ObjectDetector::BoxArray> type. This lets us submit work and fetch YOLO detection results asynchronously from Python.
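Note that get() blocks the calling thread until inference completes. If other Python threads should keep running during that wait, pybind11 can release the GIL around the blocking call; a variant of the binding (a sketch, not what the original code does) would look like:

// Variant that releases the Python GIL while get() blocks, so other
// Python threads are not stalled while waiting for inference results.
py::class_<shared_future<ObjectDetector::BoxArray>>(m, "SharedFutureObjectBoxArray")
    .def("get", &shared_future<ObjectDetector::BoxArray>::get,
         py::call_guard<py::gil_scoped_release>());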

3. Enumeration binding

py::enum_<Yolo::Type>(m, "YoloType")
    .value("V5", Yolo::Type::V5)
    ...

py::enum_<Yolo::NMSMethod>(m, "NMSMethod")
    ...

Here we define two Python enumeration classes, YoloType and NMSMethod, used to specify the YOLO model type and the NMS method.
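With the binding as written, Python accesses the members as yolo.YoloType.V5, yolo.NMSMethod.FastGPU, and so on. If you also wanted them at module scope, pybind11's export_values() can be chained on; a small optional sketch:

// Optional: export_values() additionally exposes each member at module
// scope, so Python can write yolo.V5 as well as yolo.YoloType.V5.
py::enum_<Yolo::Type>(m, "YoloType")
    .value("V5", Yolo::Type::V5)
    .value("V3", Yolo::Type::V3)
    .value("X",  Yolo::Type::X)
    .export_values();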

4. YoloInfer class binding

py::class_<YoloInfer>(m, "Yolo")
    ...

This is the most important part: we expose the YoloInfer class to Python under the name Yolo. The binding covers the constructor with its keyword arguments and defaults (engine, type, device_id, etc.), the read-only valid property, and the commit method used for inference, which is bound to YoloInfer::commit.
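One caveat: commit() reinterprets the incoming numpy array as a contiguous HxWx3 uint8 (CV_8UC3) image without verifying it. A defensive variant of the method, shown here only as a sketch (commit_checked is a hypothetical name, not part of the original class), could validate the buffer first:

// Hypothetical defensive version of YoloInfer::commit(): check that the
// numpy buffer really is an HxWx3 uint8 array before wrapping it as CV_8UC3.
shared_future<ObjectDetector::BoxArray> commit_checked(const py::array& image){
    if(image.ndim() != 3 || image.shape(2) != 3)
        throw py::buffer_error("Expected an HxWx3 image array");

    if(!py::isinstance<py::array_t<unsigned char>>(image))
        throw py::buffer_error("Expected dtype uint8");

    cv::Mat cvimage(image.shape(0), image.shape(1), CV_8UC3, (unsigned char*)image.data(0));
    return instance_->commit(cvimage);
}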

5. compileTRT function binding

m.def(
    "compileTRT", compileTRT,
    py::arg("max_batch_size"),
    ...
);

Finally, we bind the C++ compileTRT function as a Python function, which allows us to compile a model with TensorRT from Python.

In general, what PYBIND11_MODULE(yolo, m) defines gives us a complete Python interface for YOLO model inference and related operations. This way, the efficient YOLO inference code written in C++ can be used directly from Python, while we still enjoy Python's flexibility and ease of use.

The Makefile also needs corresponding modifications. The main changes are as follows:

1. Add Python's header file path to the include paths

include_paths := src              \
	src/tensorRT                  \
    $(cuda_home)/include/cuda     \
	$(cuda_home)/include/tensorRT \
	$(cpp_pkg)/opencv4.2/include  \
	$(cuda_home)/include/protobuf \
	/datav/software/anaconda3/include/python3.9

2. Add Python's library path to the library paths

library_paths := $(cuda_home)/lib64 $(syslib) $(cpp_pkg)/opencv4.2/lib /datav/software/anaconda3/lib

3. Add the Python library to the link list

link_sys       := stdc++ dl protobuf python3.9

4. Compile the result into a dynamic library (note that the output name yolo.so must match the module name declared in PYBIND11_MODULE(yolo, m), otherwise import yolo will fail)

$(workdir)/$(name) : $(cpp_objs) $(cu_objs)
	@echo Link $@
	@mkdir -p $(dir $@)
	@$(cc) -shared $^ -o $@ $(link_flags)

The complete Makefile content is as follows:

cc        := g++
name      := yolo.so
workdir   := workspace
srcdir    := src
objdir    := objs
stdcpp    := c++11
cuda_home := /usr/local/cuda-11.6
syslib    := /home/jarvis/anaconda3/envs/yolov8/lib
cpp_pkg   := /usr/local/include
trt_home  := /opt/TensorRT-8.4.1.5
pro_home  := /home/jarvis/lean/protobuf-3.11.4
cuda_arch := 
nvcc      := $(cuda_home)/bin/nvcc -ccbin=$(cc)

# Locate cpp sources and define their dependency .mk files
cpp_srcs := $(shell find $(srcdir) -name "*.cpp")
cpp_objs := $(cpp_srcs:.cpp=.cpp.o)
cpp_objs := $(cpp_objs:$(srcdir)/%=$(objdir)/%)
cpp_mk   := $(cpp_objs:.cpp.o=.cpp.mk)

# Locate cu sources and define their dependency .mk files
cu_srcs := $(shell find $(srcdir) -name "*.cu")
cu_objs := $(cu_srcs:.cu=.cu.o)
cu_objs := $(cu_objs:$(srcdir)/%=$(objdir)/%)
cu_mk   := $(cu_objs:.cu.o=.cu.mk)

# Libraries needed for OpenCV and CUDA
link_cuda      := cudart cudnn
link_trtpro    := 
link_tensorRT  := nvinfer nvinfer_plugin
link_opencv    := opencv_core opencv_imgproc opencv_imgcodecs
link_sys       := stdc++ dl protobuf python3.8
link_librarys  := $(link_cuda) $(link_tensorRT) $(link_sys) $(link_opencv)

# Header file paths. Note: no spaces are allowed after the trailing backslashes
# Write paths only; -I is added automatically below
include_paths := src              \
	src/tensorRT                  \
    $(cuda_home)/include     \
	$(trt_home)/include \
	$(cpp_pkg)/opencv4  \
	$(pro_home)/include\
	/home/jarvis/anaconda3/envs/yolov8/include/python3.8

# Library file paths; write paths only, -L is added automatically below
library_paths := $(cuda_home)/lib64 $(syslib) $(cpp_pkg)/opencv4.2/lib /usr/local/lib $(trt_home)/lib $(pro_home)/lib

# Join the library paths into a single string, e.g. a b c => a:b:c
# so that LD_LIBRARY_PATH=a:b:c
empty := 
library_path_export := $(subst $(empty) $(empty),:,$(library_paths))

# Expand the include/library/link lists, adding -I, -L, -l automatically
run_paths     := $(foreach item,$(library_paths),-Wl,-rpath=$(item))
include_paths := $(foreach item,$(include_paths),-I$(item))
library_paths := $(foreach item,$(library_paths),-L$(item))
link_librarys := $(foreach item,$(link_librarys),-l$(item))

# For other GPUs, change -gencode=arch=compute_75,code=sm_75 to your card's compute capability
# Look up the compute capability of your GPU here: https://developer.nvidia.com/zh-cn/cuda-gpus#compute
# On Jetson Nano, if -m64 is reported as an unknown option, remove -m64; this does not affect the result
cpp_compile_flags := -std=$(stdcpp) -w -g -O0 -m64 -fPIC -fopenmp -pthread
cu_compile_flags  := -std=$(stdcpp) -w -g -O0 -m64 $(cuda_arch) -Xcompiler "$(cpp_compile_flags)"
link_flags        := -pthread -fopenmp -Wl,-rpath='$$ORIGIN'

cpp_compile_flags += $(include_paths)
cu_compile_flags  += $(include_paths)
link_flags        += $(library_paths) $(link_librarys) $(run_paths)

# If a header file changes, these included .mk files make the dependent cpp/cu files recompile automatically
ifneq ($(MAKECMDGOALS), clean)
-include $(cpp_mk) $(cu_mk)
endif

$(name)   : $(workdir)/$(name)

all       : $(name)
run       : $(name)
	@cd $(workdir) && python demo.py $(run_args)

$(workdir)/$(name) : $(cpp_objs) $(cu_objs)
	@echo Link $@
	@mkdir -p $(dir $@)
	@$(cc) -shared $^ -o $@ $(link_flags)

$(objdir)/%.cpp.o : $(srcdir)/%.cpp
	@echo Compile CXX $<
	@mkdir -p $(dir $@)
	@$(cc) -c $< -o $@ $(cpp_compile_flags)

$(objdir)/%.cu.o : $(srcdir)/%.cu
	@echo Compile CUDA $<
	@mkdir -p $(dir $@)
	@$(nvcc) -c $< -o $@ $(cu_compile_flags)

# Generate the dependency .mk files for cpp sources
$(objdir)/%.cpp.mk : $(srcdir)/%.cpp
	@echo Compile depends C++ $<
	@mkdir -p $(dir $@)
	@$(cc) -M $< -MF $@ -MT $(@:.cpp.mk=.cpp.o) $(cpp_compile_flags)
    
# Generate the dependency .mk files for cu sources
$(objdir)/%.cu.mk : $(srcdir)/%.cu
	@echo Compile depends CUDA $<
	@mkdir -p $(dir $@)
	@$(nvcc) -M $< -MF $@ -MT $(@:.cu.mk=.cu.o) $(cu_compile_flags)

# Clean target
clean :
	@rm -rf $(objdir) $(workdir)/$(name) $(workdir)/*.trtmodel $(workdir)/*.onnx

# Prevent these targets from being treated as files
.PHONY : clean run $(name)

# Export the dependency library paths so the program can run
export LD_LIBRARY_PATH:=$(library_path_export)

OK! Let’s first execute make run

[Figure 1: make run error]

The error message is as follows:

relocation R_X86_64_TPOFF32 against symbol ... can not be used when making a shared object; recompile with -fPIC

The problem is that libprotobuf.a is a static library that was built without the -fPIC option, so its object code cannot be linked into a shared object.

It turns out the protobuf we compiled earlier has always been a static library, so we need to recompile protobuf to produce a dynamic library.

Building the dynamic library is straightforward: just add the shared-library option to the original static-library CMake command, as shown below:

cmake . -Dprotobuf_BUILD_TESTS=OFF -Dprotobuf_BUILD_SHARED_LIBS=ON

For details, please refer to: Ubuntu20.04 software installation guide, protobuf library compiled under Linux

After the compilation is completed, we re-specify the path of protobuf in the Makefile and re-execute make run. The running effect is as follows:

[Figure 2: make run successful]

Compilation takes a while. You can see that yolo.so is built successfully. Next we run demo.py. Note that we run it via make run: the run target in the Makefile first changes into the workspace directory and then executes python demo.py, as shown below:

run       : $(name)
	@cd $(workdir) && python demo.py $(run_args)

Why not manually cd workspace and then run python demo.py? Because the Makefile exports the necessary environment variables; if demo.py is run directly from the command line without them, the program may fail to find the required shared libraries or other dependencies, producing avoidable errors.

You can of course export the necessary environment variables yourself. Running demo.py through the Makefile produces the following:

[Figure 3: running demo.py]

The inference result looks like this:

[Figure 4: inference result]

OK! The above demonstrates how to write an extension library for Python in C++.

When Python alone is not efficient enough, or when some functionality is easier to write in C++, consider writing the library in C++ and calling it from Python: the performance is high enough and your productivity stays high as well. Rather than using the Python version of tensorRT or the Python version of CUDA, it is better to write the CUDA and tensorRT code directly in C++; it performs better than the Python versions and gives you more control and convenience. (Suggestion from Teacher Du)

2. Supplementary knowledge

2.1 pybind11 introduction

pybind11 is a C++11 library for creating bindings for Python. It provides a simple interface so that C++ classes and functions can be used in Python without the need for an intermediate layer like SWIG or Boost.Python

GitHub address: https://github.com/pybind/pybind11

Here are some of the main features of pybind11:

1. Ease of use: bindings between Python and C++ are easy to create

2. Header-only: pybind11 is a header-only library, which means nothing needs to be pre-compiled; you just include the header and start writing binding code

3. Type conversion: pybind11 automatically handles many type conversions between C++ and Python (see the sketch after this list)

4. Extensibility: Python extensions can be created for C++ classes and functions, with support for inheritance, overloading and other C++ features

5. Performance: compared with other binding generators, pybind11 performs very well

6. Compatibility with modern C++: pybind11 uses the C++11 standard, which makes it fit naturally with modern C++ code
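To make these points concrete, here is a minimal, self-contained pybind11 module, independent of the YOLO project (the module name example and the functions in it are made up for illustration). It also shows the automatic STL type conversion mentioned in point 3:

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>   // enables automatic std::vector <-> list conversion

#include <vector>

namespace py = pybind11;

// Two trivial C++ functions to expose to Python
int add(int a, int b){
    return a + b;
}

std::vector<int> make_range(int n){
    std::vector<int> v;
    for(int i = 0; i < n; ++i)
        v.push_back(i);
    return v;   // returned to Python as a plain list
}

// Defines a Python module named "example"; the compiled file must be
// named example.so (or example.cpython-XY-....so) for `import example` to work
PYBIND11_MODULE(example, m){
    m.doc() = "minimal pybind11 example";
    m.def("add", &add, py::arg("a"), py::arg("b"));
    m.def("make_range", &make_range, py::arg("n"));
}

One common way to build it, assuming pybind11 was installed via pip, is g++ -O3 -shared -fPIC $(python3 -m pybind11 --includes) example.cpp -o example.so. Afterwards, import example; example.add(1, 2) returns 3 and example.make_range(3) returns [0, 1, 2].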

Summary

In this course, we learned to use pybind11 to write C++ extension modules for Python. This lets us call high-performance inference code written in C++ directly from Python while retaining Python's flexibility and convenience, which is very useful in day-to-day development.


Origin blog.csdn.net/qq_40672115/article/details/132655894