TensorRT cheat sheet

TensorRT Cheat Checklist

Each item below pairs a question (Q) with the recommended solution (A).
Q: How do you check the accuracy of TensorRT's results, find the layers with insufficient precision, and optimize the computation graph?
A: Use Polygraphy. It can run the same model under TensorRT and a reference framework (e.g. ONNX Runtime), compare outputs layer by layer, and pinpoint where the error is introduced.
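A minimal command-line sketch, assuming an ONNX model at the placeholder path model.onnx; the tolerance values are illustrative:

```bash
# Run the model under TensorRT and ONNX Runtime, compare every layer's
# output, and report the layers that exceed the given tolerances.
polygraphy run model.onnx --trt --onnxrt \
    --trt-outputs mark all \
    --onnx-outputs mark all \
    --atol 1e-3 --rtol 1e-3
```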
Q: How do you analyze the inference time of individual layers and find the ones that cost the most?
A: Capture a timeline of the execution phase with Nsight Systems and inspect it visually; TensorRT emits NVTX ranges, so each layer's kernels can be identified on the timeline.
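A sketch of a typical capture, assuming a serialized engine at the placeholder path model.plan and using trtexec as the workload:

```bash
# Record a timeline of the inference run, then open the generated
# trt_report file in the Nsight Systems GUI to see which kernels dominate.
nsys profile -o trt_report trtexec --loadEngine=model.plan --iterations=100
```

trtexec's --dumpProfile flag also prints a per-layer timing table without the GUI.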
Q: The engine build takes too long; how do you save time across repeated builds?
A: Use a timing cache: TensorRT records the kernel timing results of one build, and later builds reuse them instead of re-timing every tactic.
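A minimal sketch with the TensorRT Python API; the cache file name timing.cache is arbitrary:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Reuse a cache from an earlier build if one exists, otherwise start empty.
try:
    with open("timing.cache", "rb") as f:
        cache = config.create_timing_cache(f.read())
except FileNotFoundError:
    cache = config.create_timing_cache(b"")
config.set_timing_cache(cache, ignore_mismatch=False)

# ... define the network and build the engine with this config ...

# Persist the cache so the next build skips re-timing the same kernels.
with open("timing.cache", "wb") as f:
    f.write(cache.serialize())
```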
Q: Dynamic Shape performance degrades when the min-opt-max span is large. What can be done?
A:
1. Create multiple OptimizationProfiles at build time, keeping each profile's shape range as narrow as possible so TensorRT can optimize for it.
2. At inference time, select the profile matching the actual input shape (see the sketch after this list). The drawback is increased memory usage.
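A minimal sketch with the TensorRT Python API, assuming a network with a dynamic-shape input named "input" (hypothetical) and NCHW image shapes; the batch ranges are illustrative:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
# ... network definition with a dynamic-shape input named "input" ...

# Several narrow profiles instead of one profile spanning batch 1..32.
for lo, hi in [(1, 8), (9, 32)]:
    profile = builder.create_optimization_profile()
    profile.set_shape("input",
                      min=(lo, 3, 224, 224),
                      opt=(hi, 3, 224, 224),
                      max=(hi, 3, 224, 224))
    config.add_optimization_profile(profile)

# At inference time, switch to the profile that covers the actual shape:
# context.set_optimization_profile_async(profile_index, stream_handle)
```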
Q: How do you overlap computation with data copies to raise GPU utilization?
A: Use multiple CUDA streams: issue copies and kernels asynchronously on separate streams, order them with CUDA events, and keep host buffers in pinned (page-locked) memory so the copies can actually run asynchronously.
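A minimal sketch using the cuda-python runtime bindings; the buffer size and the commented-out TensorRT call are placeholders:

```python
from cuda import cudart

n_bytes = 1 << 20  # illustrative buffer size

# Pinned host memory is required for cudaMemcpyAsync to be truly async.
_, h_ptr = cudart.cudaHostAlloc(n_bytes, cudart.cudaHostAllocDefault)
_, d_ptr = cudart.cudaMalloc(n_bytes)
_, stream_copy = cudart.cudaStreamCreate()
_, stream_compute = cudart.cudaStreamCreate()
_, event = cudart.cudaEventCreate()

# Stage the next batch's copy on one stream while the current batch
# computes on the other; the event orders copy -> compute.
cudart.cudaMemcpyAsync(d_ptr, h_ptr, n_bytes,
                       cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, stream_copy)
cudart.cudaEventRecord(event, stream_copy)
cudart.cudaStreamWaitEvent(stream_compute, event, 0)  # 0 = default flags
# context.execute_async_v2(bindings, stream_compute)  # enqueue inference here
cudart.cudaStreamSynchronize(stream_compute)
```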
Q: How do you make one engine usable from multiple threads?
A: Create multiple execution contexts: each context runs inference independently on top of the same engine, so threads never share mutable state.
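A minimal sketch, assuming a serialized engine at the placeholder path model.plan; each thread still needs its own buffers and stream:

```python
import threading
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

def worker(thread_id):
    # The engine is shared read-only; the context holds the mutable state.
    context = engine.create_execution_context()
    # ... allocate per-thread buffers and run context.execute_async_v2(...) ...

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```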
Q: How do you optimize kernel calls and reduce launch-bound situations, where the gap between the CPU issuing a call and the GPU actually executing it is too long, with most of the time lost to preparation and data copies?
A:
1. Use CUDA Graphs: capture the whole launch sequence once and replay it with a single call (see the sketch after this list).
2. Complete most of the launch preparation in advance.
3. Optimize the CUDA workflow itself.
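A minimal CUDA Graph sketch with the cuda-python bindings, assuming a ready execution context and a bindings list built elsewhere:

```python
from cuda import cudart

_, stream = cudart.cudaStreamCreate()

# Warm-up run so lazy initialization happens outside the capture.
context.execute_async_v2(bindings, stream)
cudart.cudaStreamSynchronize(stream)

# Capture one full inference into a graph, then replay it cheaply.
cudart.cudaStreamBeginCapture(
    stream, cudart.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal)
context.execute_async_v2(bindings, stream)
_, graph = cudart.cudaStreamEndCapture(stream)
# Note: the instantiate signature differs across cuda-python versions.
_, graph_exec = cudart.cudaGraphInstantiate(graph, 0)

for _ in range(100):
    cudart.cudaGraphLaunch(graph_exec, stream)  # one launch per inference
cudart.cudaStreamSynchronize(stream)
```

trtexec's --useCudaGraph flag enables the same optimization from the command line.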
Q: Some layers' selected algorithms (tactics) introduce large errors. How do you block those selections?
A: Through the algorithm selector:
1. First use a tool such as Polygraphy to find the layers whose tactic results are not acceptable.
2. Mask those tactics for the affected layers via the algorithm selector (see the sketch after this list).
3. Rebuild the engine.
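A minimal sketch of an algorithm selector in the TensorRT Python API; the layer name suspect_layer and the banned tactic id are hypothetical:

```python
import tensorrt as trt

class MaskBadTactics(trt.IAlgorithmSelector):
    def __init__(self):
        trt.IAlgorithmSelector.__init__(self)

    def select_algorithms(self, context, choices):
        if context.name == "suspect_layer":
            # Allow every tactic except the one found to lose accuracy.
            return [i for i, c in enumerate(choices)
                    if c.algorithm_variant.tactic != 0x1234]  # hypothetical id
        return list(range(len(choices)))  # default: allow all tactics

    def report_algorithms(self, contexts, choices):
        pass  # could log the final per-layer choices here

# config = builder.create_builder_config()
# config.algorithm_selector = MaskBadTactics()
# ... rebuild the engine with this config ...
```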
Q: You want to update the model's weights, but don't want to rebuild the engine?
A: Use Refit: build the engine with the REFIT flag, then swap in new weights through a Refitter.
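A minimal sketch with the TensorRT Python API, assuming an engine at the placeholder path model.plan that was built with trt.BuilderFlag.REFIT; the layer name conv1 and the weight shape are hypothetical:

```python
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Swap in new convolution weights without rebuilding the engine.
refitter = trt.Refitter(engine, logger)
new_kernel = np.random.rand(64, 3, 3, 3).astype(np.float32)  # hypothetical shape
refitter.set_weights("conv1", trt.WeightsRole.KERNEL, new_kernel)
assert refitter.refit_cuda_engine()  # True once all required weights are set
```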
Q: Memory and GPU-memory usage during build/runtime is too large; how do you reduce it?
A: In the BuilderConfig, tactic sources from the cuBLAS, cuBLASLt, and cuDNN libraries are enabled by default; when TensorRT tunes kernels, it picks the best-performing implementation, including ones from those libraries, and bakes it into the engine. You can manually mask one or more of the libraries, forbidding their algorithms from being selected. This saves host and GPU memory and shortens engine build time; the drawback is that it may degrade engine performance or cause the build to fail. Later TensorRT versions decouple from these external libraries entirely.
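A minimal sketch with the TensorRT Python API, keeping only cuBLAS enabled as an external tactic source:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Mask cuBLASLt and cuDNN; pass 0 instead to forbid all external libraries.
config.set_tactic_sources(1 << int(trt.TacticSource.CUBLAS))
```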

Origin: blog.csdn.net/weixin_41817841/article/details/127859247