A practical summary of TensorRT

Contents
1. TensorRT installation
2. Generating an engine from PyTorch
3. TRT inference in practice

1. TensorRT installation
1) Xavier
On Xavier, TensorRT is installed as part of DeepStream. A DeepStream application brings deep neural networks and other complex processing tasks into a stream-processing pipeline to enable near-real-time analysis of video and other sensor data. Extracting meaningful insights from these sensors creates opportunities to improve operational efficiency and safety. For example, cameras are currently the most widely used IoT sensor: they are in our homes, on the streets, in parking lots, in shopping malls, in warehouses, in factories; they are everywhere. The potential uses of video analytics are enormous: access control, loss prevention, automated checkout, surveillance, security, automated inspection (QA), parcel sorting (smart logistics), traffic control/engineering, industrial automation, and so on.
More specifically, a DeepStream application is a set of modular plugins connected to form a processing pipeline. Each plugin represents a functional block, for example inference with TensorRT or multi-stream decoding. Hardware-accelerated plugins interact with the underlying hardware (where applicable) to deliver optimal performance; for example, decoding plugins use NVDEC, and inference plugins run on the GPU or DLA. Each plugin can be instantiated as many times as needed in the pipeline.

2) PC
(1) Check the CUDA and cuDNN versions
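On the PC, nvcc --version reports the CUDA toolkit version. If PyTorch is already installed, a quick way to check the versions from Python (the exact calls below are just one convenient option, not taken from the original post) is:

import torch

print(torch.version.cuda)               # CUDA version this PyTorch build was compiled against
print(torch.backends.cudnn.version())   # cuDNN version, e.g. 8302 for cuDNN 8.3.2
print(torch.cuda.get_device_name(0))    # GPU model, useful when choosing the TensorRT package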

(2) According to your CUDA version and operating system (Ubuntu/Windows), download the matching TensorRT package from https://developer.nvidia.com/nvidia-tensorrt-8x-download and install it after extracting the archive.

On Windows 10, add the TensorRT bin directory and lib directory to the environment variables, then restart the computer.

On Linux, add the TensorRT paths to the environment variables in ~/.bashrc (the lib directory to LD_LIBRARY_PATH and the bin directory to PATH).


Then run source ~/.bashrc to make the changes take effect.
(3) Test the installation
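A minimal sanity check from Python, assuming the TensorRT Python wheel shipped in the package has been installed with pip:

import tensorrt as trt

print(trt.__version__)            # should print the installed TensorRT version, e.g. 8.x.x

# Creating a Builder exercises the native libraries and the CUDA driver
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
print("fp16 support:", builder.platform_has_fast_fp16)
print("int8 support:", builder.platform_has_fast_int8)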

Note: the model file (plan file) generated by the TRT builder is not portable. A plan file generated on a server GPU cannot run on an embedded board, and even between server cards it does not work across GPU architectures: an engine generated on a Tesla T4 cannot run on a P40. Cards of the same architecture are compatible; for example, an engine generated on a P40 can run on a P4, but a warning will be reported.

2. Generating an engine from PyTorch

The common route is PyTorch -> ONNX -> engine/trt/plan; the PyTorch-to-ONNX conversion is not covered in detail here.
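For reference, a minimal sketch of the export step, using a torchvision ResNet-18 purely as a stand-in for your own model (the model, input shape and opset version here are assumptions, not taken from the original post):

import torch
import torchvision

model = torchvision.models.resnet18().eval()   # stand-in model; replace with your own
dummy = torch.randn(1, 3, 224, 224)            # assumed input shape

torch.onnx.export(
    model, dummy, "model.onnx",
    opset_version=13,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # optional: dynamic batch dimension
)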
There are two ways to convert the ONNX model to an engine:
1) Writing code against the TensorRT API

Some terms:
Builder (network metadata): the entry point of model building; both the TensorRT internal representation of the network and the executable engine are produced by this object's member methods.
BuilderConfig (network metadata options): holds the build options of the model, such as whether to enable fp16 mode, int8 mode, and so on.
Network (computation-graph content): the body of the network. When building a network with the API, layers are added to it one by one and the input and output tensors are marked; in the other two workflows a Parser loads the network from the ONNX file, so there is no need to add layers by hand.
SerializedNetwork: the TRT internal representation of the model network; it can be used to generate an executable inference engine, or be serialized and saved as a file for convenient reuse later.
Engine: the inference engine, the core of model computation; it can be thought of as the code segment of an executable program.
Context: the context for computation on the GPU, analogous to a process on the CPU; it is the object that actually executes the inference engine.
Buffer (host and device memory): data has to be moved from the CPU side to the GPU side, inference is performed, and the results are moved from the GPU back to the CPU; these steps involve allocating, copying and freeing memory.
Execute: the actual invocation of the compute kernels to perform inference; this is done with the help of the CUDA library or its Python bindings.
Aftermath: some remaining housekeeping that still has to be done by hand.
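A minimal sketch of this build workflow with the TensorRT 8.x Python API (the file names "model.onnx" and "model.plan" are placeholders, and error handling is kept to a minimum):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)                                        # Builder

# ONNX models require an explicit-batch network
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))      # Network
parser = trt.OnnxParser(network, logger)                             # Parser fills the Network

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse the ONNX model")

config = builder.create_builder_config()                             # BuilderConfig
config.max_workspace_size = 1 << 30                                  # 1 GB build scratch space (newer releases use set_memory_pool_limit)
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)                            # enable fp16 mode

serialized = builder.build_serialized_network(network, config)       # SerializedNetwork
with open("model.plan", "wb") as f:
    f.write(serialized)                                              # save the plan file for later inference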

The step in which the builder generates the engine is called serialization, and loading the engine in TensorRT's inference mode is called deserialization. The main purpose of serialization is to convert the model into a form suited to the TensorRT framework, and the resulting engine can be saved for later inference. (Serialization often takes a while, sometimes several minutes, so the engine is usually generated and saved to a file first and then loaded directly at inference time. Different machines generally need to generate their own engine files locally, since the hardware, the location of the library files and the resulting optimizations may all differ.)
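A sketch of the deserialize-and-infer side, assuming the plan file saved above, a 1x3x224x224 float input, a 1x1000 output, and pycuda for the memory handling (the shapes and file name are placeholders):

import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit   # creates a CUDA context

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Deserialize: turn the saved plan file back into an executable engine
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()                   # Context

# Buffers: host (CPU) and device (GPU) memory for the input and output
h_input = np.random.rand(1, 3, 224, 224).astype(np.float32)   # placeholder input
h_output = np.empty((1, 1000), dtype=np.float32)              # placeholder output shape
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
stream = cuda.Stream()

# H2D copy -> execute -> D2H copy
cuda.memcpy_htod_async(d_input, h_input, stream)
context.execute_async_v2([int(d_input), int(d_output)], stream.handle)
cuda.memcpy_dtoh_async(h_output, d_output, stream)
stream.synchronize()

print("top-1 class:", int(h_output.argmax()))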

2) Using the trtexec tool
First add the directory containing trtexec to the PATH environment variable.

trtexec --onnx=<path to the ONNX file> --saveEngine=<name of the engine file> --fp16/--int8/--best
The --best flag is equivalent to enabling --int8 and --fp16 at the same time.
In general, enabling only fp16 roughly doubles the speed with almost no loss of accuracy, while enabling --int8 costs a lot of accuracy; int8 is faster than fp16, but not necessarily twice as fast.
Using --int8 here is not recommended, because trtexec only does a token calibration; if you really want to deploy an int8 model you have to write the calibration yourself.
--useDLACore=0 --allowGPUFallback moves the operators that the DLA supports (such as convolution and pooling) onto the DLA and falls back to the GPU for unsupported operators.
--shapes=<ONNX input node name>:1x3xHxW is used for ONNX models with a dynamic batch, for example
--shapes=0:1x3x512x512, where 0 is the name of the ONNX input node and both H and W are 512. (A sketch of the corresponding dynamic-shape setup in the Python API is shown below.)
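When building a dynamic-shape engine through the Python API rather than trtexec, the equivalent step is attaching an optimization profile to the builder config; a rough sketch, where the input name "input" and the min/opt/max shapes are assumptions:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()
# ... parse the ONNX model into the network as in the earlier sketch ...

# One optimization profile: minimum / optimal / maximum shapes for the dynamic input
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 512, 512), (4, 3, 512, 512), (8, 3, 512, 512))
config.add_optimization_profile(profile)

# At inference time the concrete shape is then fixed on the execution context, e.g.:
# context.set_binding_shape(0, (1, 3, 512, 512))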


PASSED will be displayed if the generation is successful.

trtexec runs inference many times and then reports timing statistics. Among the many numbers reported: Enqueue Time is the time spent enqueuing the GPU work; H2D Latency is the time to copy the network input from host (main) memory to device (GPU) memory; D2H Latency is the time to copy the network output from device memory back to the host; GPU Compute Time is the actual network inference time.
3. TRT inference in practice
Taking YOLOX as an example.

Run the command:

./yolox  ../yolox_x_1018.engine  -i  ../../../../assets/cyline_20220607111827.avi

The official TensorRT documentation also shows how to instantiate a logger for inference.
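As a rough Python sketch of that idea, a custom logger can be derived from the ILogger interface; for most cases the built-in trt.Logger(trt.Logger.WARNING) used in the sketches above is enough, and the filtering policy below is just an example:

import tensorrt as trt

class MyLogger(trt.ILogger):
    """Custom logger that only surfaces warnings and errors."""
    def __init__(self):
        trt.ILogger.__init__(self)

    def log(self, severity, msg):
        if severity in (trt.ILogger.Severity.INTERNAL_ERROR,
                        trt.ILogger.Severity.ERROR,
                        trt.ILogger.Severity.WARNING):
            print(f"[TRT] {severity}: {msg}")

logger = MyLogger()
runtime = trt.Runtime(logger)   # can be passed anywhere a trt.Logger is expected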

References:
1. https://blog.csdn.net/Tosonw/article/details/104154090
2. https://zhuanlan.zhihu.com/p/502032016
3. https://zhuanlan.zhihu.com/p/527238167
4. https://zhuanlan.zhihu.com/p/571164208
5. https://zhuanlan.zhihu.com/p/467239946
6. https://blog.csdn.net/qq_36936443/article/details/124458745 (this link explains things in great detail)
7. https://blog.51cto.com/u_11495341/3036153
8. NVIDIA TensorRT official documentation: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html
