Performance comparison of TensorRT under different batch sizes

TensorRT uses the GPU for acceleration, and GPUs are naturally suited to parallel computing, so increasing the batch size is one of the common ways to improve TensorRT inference performance.

trtexec defaults to batchsize=1, so below we run a few experiments with different batch sizes and observe the results.

The model is an ONNX file downloaded directly from this website.

After getting the ONNX file, we need to convert it into a TensorRT engine file:

/opt/TensorRT-7.1.3.4/bin/trtexec --onnx=ctdet_coco_dlav0_512.onnx --saveEngine=ctdet_coco_dlav0_512_256.trt --best --batch=256 --workspace=4096

If the conversion succeeds, the .trt engine file is produced.

Next, test the performance at different batch sizes:

/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=256
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=128
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=64
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=32
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=16
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=8
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=4
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=2
/opt/TensorRT-7.1.3.4/bin/trtexec --loadEngine=ctdet_coco_dlav0_512_256.trt --batch=1

This article only examines throughput and total compute time; the results are summarized as follows:

BatchSize    Total Compute Time (s)    Throughput (QPS)
1            2.94049                   364.344
2            2.93884                   725.753
4            2.77403                   1365.88
8            1.65327                   1619.66
16           0.922257                  1818.5
32           0.479588                  1848.01
64           0.238533                  1859.18
128          0.12173                   1842.8
256          0.0641229                 1817.42
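To feed these numbers into the plotting scripts, they can be saved as a tab-separated data.txt. A minimal sketch of that file, assuming the column names the script below expects (columns separated by a tab character):

batchsize	total compute time(s)	throughput(qps)
1	2.94049	364.344
2	2.93884	725.753
4	2.77403	1365.88
8	1.65327	1619.66
16	0.922257	1818.5
32	0.479588	1848.01
64	0.238533	1859.18
128	0.12173	1842.8
256	0.0641229	1817.42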
First, visualize the calculation time (Total Compute Time):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# load the tab-separated benchmark results
df = pd.read_csv("data.txt", sep="\t")
x = df['batchsize'].values
y1 = df['total compute time(s)'].values
y2 = df['throughput(qps)'].values

# plot total compute time against batch size
plt.plot(x, y1, 'ro--')
plt.xlabel('batchsize')
plt.ylabel('total compute time(s)')
# annotate each point with (batchsize, time)
for a, b in zip(x, y1):
    plt.text(a + 15, b - 0.15, '(%d,%.4f)' % (a, b), ha='center', va='bottom',
             fontdict={'size': 10, 'color': 'g'})
plt.show()

Similarly, visualize the throughput.
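A minimal sketch of that plot, reusing the same data.txt and the throughput(qps) column from the script above:

import pandas as pd
import matplotlib.pyplot as plt

# load the same tab-separated results file as the compute-time plot
df = pd.read_csv("data.txt", sep="\t")
x = df['batchsize'].values
y2 = df['throughput(qps)'].values

# plot throughput against batch size
plt.plot(x, y2, 'bo--')
plt.xlabel('batchsize')
plt.ylabel('throughput(qps)')
# annotate each point with (batchsize, qps)
for a, b in zip(x, y2):
    plt.text(a + 15, b + 20, '(%d,%.1f)' % (a, b), ha='center', va='bottom',
             fontdict={'size': 10, 'color': 'g'})
plt.show()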

From the two plots above, we can see that increasing the batch size has a large effect on the total compute time, while the throughput saturates fairly quickly as the batch size grows.

Careful readers will notice that --best was used when converting the ONNX file to the trt engine.

This parameter is described as follows:

--best                      Enable all precisions to achieve the best performance (default = disabled)

It enables mixed precision: neither pure FP32, nor pure FP16 or INT8.

Next, compare the impact of different precision (quantization) modes on compute time and throughput.

Three experiments were run: best, fp16, and int8.
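The fp16 and int8 engines can be built the same way as the best engine, presumably with commands along these lines (using trtexec's --fp16 and --int8 precision flags; the output file names here are only illustrative). Note that int8 without a calibration cache uses placeholder dynamic ranges, so this measures speed rather than accuracy.

/opt/TensorRT-7.1.3.4/bin/trtexec --onnx=ctdet_coco_dlav0_512.onnx --saveEngine=ctdet_coco_dlav0_512_256_fp16.trt --fp16 --batch=256 --workspace=4096
/opt/TensorRT-7.1.3.4/bin/trtexec --onnx=ctdet_coco_dlav0_512.onnx --saveEngine=ctdet_coco_dlav0_512_256_int8.trt --int8 --batch=256 --workspace=4096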

The results are summarized in three files:

best.txt

fp16.txt

int8.txt

The visualization is as follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# load the results for each precision mode (tab-separated)
df_int8 = pd.read_csv("int8.txt", sep="\t")
df_best = pd.read_csv("best.txt", sep="\t")
df_fp16 = pd.read_csv("fp16.txt", sep="\t")

x = df_int8['batchsize'].values
y1_int8 = df_int8['total compute time(s)'].values
y1_best = df_best['total compute time(s)'].values
y1_fp16 = df_fp16['total compute time(s)'].values

# compare total compute time across precision modes
plt.plot(x, y1_int8, 'ro--', label='int8')
plt.plot(x, y1_best, 'bo--', label='best')
plt.plot(x, y1_fp16, 'go--', label='fp16')

plt.xlabel('batchsize')
plt.ylabel('total compute time(s)')
plt.legend()
plt.tight_layout()  # tighten spacing once the figure is complete
plt.show()
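The same comparison can be made for throughput; a minimal sketch, assuming each of the three files also contains the throughput(qps) column used earlier:

import pandas as pd
import matplotlib.pyplot as plt

# load the per-precision results (tab-separated)
df_int8 = pd.read_csv("int8.txt", sep="\t")
df_best = pd.read_csv("best.txt", sep="\t")
df_fp16 = pd.read_csv("fp16.txt", sep="\t")

x = df_int8['batchsize'].values

# compare throughput across precision modes
plt.plot(x, df_int8['throughput(qps)'].values, 'ro--', label='int8')
plt.plot(x, df_best['throughput(qps)'].values, 'bo--', label='best')
plt.plot(x, df_fp16['throughput(qps)'].values, 'go--', label='fp16')

plt.xlabel('batchsize')
plt.ylabel('throughput(qps)')
plt.legend()
plt.show()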

From the plot above, you can see that best performs almost the same as int8, and both are better than fp16.


Origin blog.csdn.net/zhou_438/article/details/112823818