Tencent and NVIDIA open source TPAT, an automatic TensorRT plugin generation tool

On March 25, 2022, TPAT, an automatic TensorRT plugin generation tool jointly developed by Tencent and NVIDIA, was officially open sourced.

TensorRT is currently the most widely used GPU inference framework, but it supports only a limited set of operators, so users are forced to handwrite plugins for the operators it lacks. TPAT supports all operators in the Open Neural Network Exchange (ONNX) format and generates TensorRT plugins end to end, eliminating that manual effort while delivering performance comparable to handwritten plugins.

TPAT GitHub address: https://github.com/Tencent/TPAT

Background

TensorRT is the fastest GPU inference engine available today, enabling low-latency, high-throughput deployment of deep learning models on GPUs. Developed and maintained by NVIDIA, it supports mainstream deep learning frameworks such as Caffe, TensorFlow, MXNet, and PyTorch, and almost all GPU inference services in the industry use it.

However, TensorRT also has a shortcoming: its deployment process is cumbersome. A model delivered by an algorithm engineer must be deployed and brought online by a systems engineer, which is time-consuming and labor-intensive. In the traditional TensorRT workflow, handwriting plugins is usually the most time-consuming part.


Difficulties of handwriting TensorRT operator plugins

• TensorRT officially supports only a very limited set of common operators (Conv/FC/BN/ReLU…); for unsupported operators, users must implement handwritten plugins.
• Writing a plugin requires GPU and CUDA knowledge, and even an NVIDIA engineer usually needs 1-2 weeks to implement one operator; if a model contains multiple unsupported operators, writing and debugging the plugins one by one takes even more time.

TPAT overview

TPAT automates the generation of TensorRT plugins, so deploying and launching a model with TensorRT becomes an essentially streamlined process that no longer requires manual work: the handwritten-plugin step is taken over by TPAT. TPAT needs only 30-60 minutes to generate an operator plugin automatically (this time is spent searching for a high-performance CUDA kernel for the operator), which makes TensorRT a true end-to-end inference framework.

TPAT Highlights

• Coverage: supports all ONNX/TensorFlow/PyTorch operators
• Fully automatic: end-to-end, fully automatic generation of user-specified TensorRT plugins
• High performance: most operators outperform their handwritten plugin counterparts

Architecture design

TPAT takes the ONNX model supplied by the user, together with the names of the operators that need TensorRT plugins and the batch size. Built on the TVM deep learning compiler, it runs AutoTune on these fixed-shape operators and automatically generates high-performance CUDA kernels. The CUDA kernel and the necessary runtime parameters are then filled into a TensorRT plugin template to produce a dynamic link library, which can be loaded directly into TensorRT and run.
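As an illustration of the kernel-generation step, the following is a minimal Python sketch of tuning a fixed-shape ONNX model with TVM's Relay frontend and auto_scheduler (Ansor), the components TPAT builds on, and then dumping the generated CUDA source. The file name, input name, shape, and tuning budget are all hypothetical; this is not TPAT's actual code.

    import onnx
    import tvm
    from tvm import relay, auto_scheduler

    # Load the ONNX model and pin the input shape: TPAT tunes fixed-shape operators.
    onnx_model = onnx.load("model.onnx")   # hypothetical file name
    shape_dict = {"input": (32, 128)}      # hypothetical input name and shape
    mod, params = relay.frontend.from_onnx(onnx_model, shape=shape_dict)

    target = tvm.target.Target("cuda")

    # Extract the tunable tasks and search for high-performance CUDA kernels.
    tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
    tuner.tune(auto_scheduler.TuningOptions(
        num_measure_trials=200,  # small budget, for illustration only
        measure_callbacks=[auto_scheduler.RecordToFile("tuning.json")],
    ))

    # Compile with the best schedules found; the resulting CUDA source is the
    # kind of kernel that gets embedded into a TensorRT plugin template.
    with auto_scheduler.ApplyHistoryBest("tuning.json"):
        with tvm.transform.PassContext(
            opt_level=3, config={"relay.backend.use_auto_scheduler": True}
        ):
            lib = relay.build(mod, target=target, params=params)

    # Inspect the generated CUDA kernel source.
    print(lib.get_lib().imported_modules[0].get_source())

This kind of schedule search is performed once per operator and batch size, which is where the 30-60 minutes of generation time mentioned above is spent.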

Performance data for some TPAT operators

TPAT was used both to automatically generate operators that TensorRT 7.2 does not support and to optimize native TensorRT 7.2 operators with poor performance:

Figure: comparison against handwritten plugins

Figure: optimization of native TensorRT operators

Tests on operators from internal business models showed that TPAT outperformed the CUDA engineers' handwritten plugins in almost every case, while the end-to-end design greatly reduces the labor required. Compared with TensorRT's native operator implementations, TPAT's performance is no worse, and its AutoTune capability can even improve native TensorRT operators that perform poorly.

TPAT open source

Follow-up plans for the TPAT open source project:

• Multi-precision support for operators, including Float16 and Int8
• Subgraph optimization using TPAT
• Support for dynamic shapes

Attachment: TPAT use case

Supporting the OneHot operator with TPAT (TensorRT-7.2.2.3)

• Input: an ONNX model containing the OneHot operator, the name of the OneHot node, and the batch size
• TPAT generates a high-performance CUDA kernel with the help of TVM's Relay and AutoScheduler components
• Once the plugin template has been filled in, a ready-to-use dynamic link library for the OneHot plugin is generated directly; a sketch of loading such a library follows this list
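To show what loading the generated library looks like, here is a minimal sketch using the TensorRT 7.2 Python API: the .so registers its plugin on load, after which the modified ONNX model can be parsed and built into an engine. The library and model file names are hypothetical, not names produced by TPAT.

    import ctypes
    import tensorrt as trt

    # Loading the TPAT-generated library registers its plugin with TensorRT.
    ctypes.CDLL("./tpat_onehot.so")  # hypothetical library name

    TRT_LOGGER = trt.Logger(trt.Logger.INFO)
    trt.init_libnvinfer_plugins(TRT_LOGGER, "")

    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ONNX model whose OneHot node now points at the plugin.
    with open("model_with_plugin.onnx", "rb") as f:  # hypothetical file name
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse the ONNX model")

    # Build the engine (TensorRT 7.x API; max_workspace_size is deprecated in 8.x).
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1 GiB
    engine = builder.build_engine(network, config)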

Source: www.oschina.net/news/188262/tencent-nvidia-tpat-open-source