TensorRT uses the GPU for acceleration. GPUs are naturally suited to parallel computing, so increasing the batch size is one of the common ways to optimize TensorRT inference.
trtexec defaults to batchsize=1, so let's run a few experiments and observe the effect of varying it.
The model is the ONNX file downloaded directly from this website.
After getting the ONNX file, we need to convert it into a TensorRT engine file:
/opt/TensorRT-7.1.3.4/bin/trtexec --onnx=ctdet_coco_dlav0_512.onnx --saveEngine=ctdet_coco_dlav0_512_256.trt --best --batch=256 --workspace=4096
If the conversion succeeds, the trt file is produced directly.
Next, test performance:
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=256
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=128
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=64
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=32
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=16
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=8
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=4
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=2
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=1
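The nine invocations above can be scripted as a loop. A minimal sketch (same binary and engine paths as above) that prints each command, so you can inspect them or pipe the output to `sh` to actually run the benchmarks:

```shell
# Print the benchmark command for each batch size; pipe to `sh` to run.
TRTEXEC=/opt/TensorRT-7.1.3.4/bin/trtexec
ENGINE=ctdet_coco_dlav0_512_256.trt
N=0
for bs in 256 128 64 32 16 8 4 2 1; do
    echo "$TRTEXEC --loadEngine=$ENGINE --batch=$bs"
    N=$((N + 1))
done
```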
This article only examines total compute time and throughput; the results are summarized as follows:
BatchSize    Total Compute Time (s)    Throughput (qps)
1            2.94049                   364.344
2            2.93884                   725.753
4            2.77403                   1365.88
8            1.65327                   1619.66
16           0.922257                  1818.5
32           0.479588                  1848.01
64           0.238533                  1859.18
128          0.12173                   1842.8
256          0.0641229                 1817.42
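For reference, the table can also be rebuilt directly as a pandas DataFrame (values copied from above), which avoids depending on a separate data.txt file:

```python
import pandas as pd

# Measurements from the table above (TensorRT 7.1.3.4, --best engine).
df = pd.DataFrame({
    "batchsize": [1, 2, 4, 8, 16, 32, 64, 128, 256],
    "total compute time(s)": [2.94049, 2.93884, 2.77403, 1.65327,
                              0.922257, 0.479588, 0.238533, 0.12173, 0.0641229],
    "throughput(qps)": [364.344, 725.753, 1365.88, 1619.66,
                        1818.5, 1848.01, 1859.18, 1842.8, 1817.42],
})

# Throughput peaks at batch=64 and then flattens out.
best = df.loc[df["throughput(qps)"].idxmax()]
print(int(best["batchsize"]), best["throughput(qps)"])  # → 64 1859.18
```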
First, visualize the compute time (Total Compute Time):
import pandas as pd
import matplotlib.pyplot as plt

# data.txt holds the tab-separated results table above
df = pd.read_csv("data.txt", sep="\t")
x = df['batchsize'].values
y1 = df['total compute time(s)'].values
y2 = df['throughput(qps)'].values

plt.plot(x, y1, 'ro--')
plt.xlabel('batchsize')
plt.ylabel('total compute time(s)')
# annotate each point with (batchsize, time)
for a, b in zip(x, y1):
    plt.text(a + 15, b - 0.15, '(%d,%.4f)' % (a, b), ha='center', va='bottom',
             fontdict={'size': 10, 'color': 'g'})
plt.show()
Similarly, visualize throughput
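A sketch of that throughput plot, mirroring the snippet above but with the measured values inlined from the table (the Agg backend and output filename are my own choices, so it runs headless; use plt.show() interactively instead):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the script runs headless
import matplotlib.pyplot as plt

# Values copied from the results table above.
x = [1, 2, 4, 8, 16, 32, 64, 128, 256]
qps = [364.344, 725.753, 1365.88, 1619.66, 1818.5,
       1848.01, 1859.18, 1842.8, 1817.42]

plt.plot(x, qps, 'ro--')
plt.xlabel('batchsize')
plt.ylabel('throughput(qps)')
# annotate each point with (batchsize, throughput)
for a, b in zip(x, qps):
    plt.text(a + 15, b, '(%d,%.1f)' % (a, b), ha='center', va='bottom',
             fontdict={'size': 10, 'color': 'g'})
plt.savefig("throughput.png")
```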
From the two plots above, we can see that increasing the batch size sharply reduces the total compute time, while throughput saturates quickly: it peaks around batch=64 and stays essentially flat beyond that.
Careful readers will notice that --best was used when converting the ONNX file to a trt engine.
This flag is documented as:
--best  Enable all precisions to achieve the best performance (default = disabled)
That is mixed precision: neither pure fp32, nor fp16, nor int8.
Next, compare the impact of different precisions on compute time and throughput.
Three experiments were run: best, fp16, and int8.
The results were saved into three files:
best.txt
fp16.txt
int8.txt
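The three engines behind those result files can be built with trtexec's precision flags. A sketch that prints the build commands (the output engine names are my own; note that --int8 without a calibration cache uses placeholder dynamic ranges, so it measures speed, not accuracy):

```shell
# Print one build command per precision mode; pipe to `sh` to run.
BIN=/opt/TensorRT-7.1.3.4/bin/trtexec
ONNX=ctdet_coco_dlav0_512.onnx
N=0
for flag in best fp16 int8; do
    echo "$BIN --onnx=$ONNX --saveEngine=ctdet_${flag}.trt --${flag} --batch=256 --workspace=4096"
    N=$((N + 1))
done
```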
The visualization is as follows:
import pandas as pd
import matplotlib.pyplot as plt

# each file holds the tab-separated results table for one precision mode
df_int8 = pd.read_csv("int8.txt", sep="\t")
df_best = pd.read_csv("best.txt", sep="\t")
df_fp16 = pd.read_csv("fp16.txt", sep="\t")

x = df_int8['batchsize'].values
y1_int8 = df_int8['total compute time(s)'].values
y1_best = df_best['total compute time(s)'].values
y1_fp16 = df_fp16['total compute time(s)'].values

plt.plot(x, y1_int8, 'ro--', label='int8')
plt.plot(x, y1_best, 'bo--', label='best')
plt.plot(x, y1_fp16, 'go--', label='fp16')
plt.xlabel('batchsize')
plt.ylabel('total compute time(s)')
plt.legend()
plt.tight_layout()  # call after plotting so it can adjust the final layout
plt.show()