Torch TorchScript Onnx TensorRT 推理速度对比实验

本博文主要在ubuntu上进行主流推理框架在ubuntu上的速度对比实验，代码来源于pytorch-classifier，是博主自己整理的一个基于pytorch图像分类框架，这是详细的介绍基于pytorch实现的图像分类代码，这是代码的实践案例使用pytorch实现花朵分类，本博文也是pytorch-classifier v1.2版本更新日志.

pytorch-classifier v1.2 更新日志

新增export.py,支持导出(onnx, torchscript, tensorrt)模型.

metrice.py支持onnx,torchscript,tensorrt的推理.

 此处在predict.py中暂不支持onnx,torchscript,tensorrt的推理的推理,原因是因为predict.py中的热力图可视化没办法在onnx、torchscript、tensorrt中实现,后续单独推理部分会额外写一部分代码.
 在metrice.py中,onnx和torchscript和tensorrt的推理也不支持tsne的可视化,那么我在metrice.py中添加onnx,torchscript,tensorrt的推理的目的是为了测试fps和精度.
 所以简单来说,使用metrice.py最好还是直接用torch模型,torchscript和onnx和tensorrt的推理的推理模型后续会写一个单独的推理代码.

main.py,metrice.py,predict.py,export.py中增加–device参数,可以指定设备.
优化程序和修复一些bug.

训练命令:

python main.py --model_name efficientnet_v2_s --config config/config.py --batch_size 128 --Augment AutoAugment --save_path runs/efficientnet_v2_s --device 0 \
--pretrained --amp --warmup --ema --imagenet_meanstd

GPU 推理速度测试 sh脚本:

batch_size=1 # 1 2 4 8 16 32 64
python metrice.py --task fps --save_path runs/efficientnet_v2_s --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --half --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --model_type torchscript --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --half --model_type torchscript --batch_size $batch_size
python export.py --save_path runs/efficientnet_v2_s --export onnx --simplify --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --model_type onnx --batch_size $batch_size
python export.py --save_path runs/efficientnet_v2_s --export onnx --simplify --half --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --model_type onnx --batch_size $batch_size
python export.py --save_path runs/efficientnet_v2_s --export tensorrt --simplify --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --model_type tensorrt --batch_size $batch_size
python export.py --save_path runs/efficientnet_v2_s --export tensorrt --simplify --half --batch_size $batch_size 
python metrice.py --task fps --save_path runs/efficientnet_v2_s --model_type tensorrt --half --batch_size $batch_size

CPU 推理速度测试 sh脚本:

python export.py --save_path runs/efficientnet_v2_s --export onnx --simplify --dynamic --device cpu
batch_size=1
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --model_type torchscript --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --model_type onnx --batch_size $batch_size
batch_size=2
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --model_type torchscript --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --model_type onnx --batch_size $batch_size
batch_size=4
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --model_type torchscript --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --model_type onnx --batch_size $batch_size
batch_size=8
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --model_type torchscript --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --model_type onnx --batch_size $batch_size
batch_size=16
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --model_type torchscript --batch_size $batch_size
python metrice.py --task fps --save_path runs/efficientnet_v2_s --device cpu --model_type onnx --batch_size $batch_size

各导出模型在cpu和gpu上的fps实验:

实验环境:

System	CPU	GPU	RAM	Model
Ubuntu20.04	i7-12700KF	RTX-3090	32G DDR5 6400	efficientnet_v2_s

GPU

model	Torch FP32 FPS	Torch FP16 FPS	TorchScript FP32 FPS	TorchScript FP16 FPS	ONNX FP32 FPS	ONNX FP16 FPS	TensorRT FP32 FPS	TensorRT FP16 FPS
batch-size 1	93.77	105.65	233.21	260.07	177.41	308.52	311.60	789.19
batch-size 2	94.32	108.35	208.53	253.83	166.23	258.98	275.93	713.71
batch-size 4	95.98	108.31	171.99	255.05	130.43	190.03	212.75	573.88
batch-size 8	94.03	85.76	118.79	210.58	87.65	122.31	147.36	416.71
batch-size 16	61.93	76.25	75.45	125.05	50.33	69.01	87.25	260.94
batch-size 32	34.56	58.11	41.93	72.29	26.91	34.46	48.54	151.35
batch-size 64	18.64	31.57	23.15	38.90	12.67	15.90	26.19	85.47

CPU

model	Torch FP32 FPS	Torch FP16 FPS	TorchScript FP32 FPS	TorchScript FP16 FPS	ONNX FP32 FPS	ONNX FP16 FPS	TensorRT FP32 FPS	TensorRT FP16 FPS
batch-size 1	27.91	Not Support	46.10	Not Support	79.27	Not Support	Not Support	Not Support
batch-size 2	25.26	Not Support	24.98	Not Support	45.62	Not Support	Not Support	Not Support
batch-size 4	14.02	Not Support	13.84	Not Support	23.90	Not Support	Not Support	Not Support
batch-size 8	7.53	Not Support	7.35	Not Support	12.01	Not Support	Not Support	Not Support
batch-size 16	3.07	Not Support	3.64	Not Support	5.72	Not Support	Not Support	Not Support

总结

我们可以看到在GPU上，tensorrt的速度是无须质疑的，在fp16的加持下，fps在batch_size为1的时候达到恐怖的789.19，就算在batch_size为64的时候，fps也可以达到85.47，比其他高出几倍之多.在CPU上，我们可以看到选用ONNX推理是比较合适的。

主流推理框架在ubuntu上的速度对比实验