AI模型部署-TensorRT模型INT8量化的Python实现

AI模型部署：TensorRT模型INT8量化的Python实现

本文首发于公众号【DeepDriving】，欢迎关注。

概述

目前深度学习模型的参数在训练阶段基本上都是采用32位浮点（FP32）来表示，以便能有更大的动态范围用于在训练过程中更新参数。然而在推理阶段，采用FP32的精度会消耗较多的计算资源和内存空间，为此，在部署模型的时候往往会采用降低模型精度的方法，用16位浮点（FP16）或者8位有符号整型（INT8）来表示。从FP32转换为FP16一般不会有什么精度损失，但是FP32转换为INT8则可能会造成较大的精度损失，尤其是当模型的权重分布在较大的动态范围内时。

虽然有一定的精度损失，但是转换为INT8也会带来很多好处，比如减少对存储空间、内存、CPU的占用，提升计算吞吐量等，这在计算资源受限的嵌入式平台是很有意义的。

把模型参数张量从FP32转换为INT8，也就是将浮点张量的动态范围映射到[-128,127]的范围，可以使用下面的公式：

$x_{q}=Clip(Round(x_{f}/scale))$

其中， $Cl i p$ 和 $R o u n d$ 分别代表截断和取整操作。从上面的公式可以看出，把FP32转换为INT8的关键是需要设置一个比例因子 $sc a l e$ 来做映射，这个映射过程就叫做量化，上面的公式是对称量化的公式。

量化的关键在于寻找一个合适的比例因子，使得量化后的模型精度尽量接近原始模型。对模型进行量化的方式有两种：

训练后量化（Post-training quantization，PTQ）是在模型训练好后，再通过一个校准（Calibration）流程去计算比例因子，从而实现量化过程。
量化感知训练（Quantization-aware training，QAT）是在模型训练过程中去计算比例因子，允许在训练过程中补偿量化和反量化操作带来的精度误差。

本文只介绍如何调用TensorRT的Python接口实现INT8量化。关于INT8量化的理论知识，由于牵涉的内容比较多，等我有空再专门写一篇文章来做介绍。

TensorRT INT8量化的具体实现

TensorRT中的校准器

在训练后量化过程中，TensorRT需要计算模型中每个张量的比例因子，这个过程被称为校准。校准过程中需要提供具有代表性的数据，以便TensorRT在这个数据集上运行模型然后收集每个张量的统计信息用于寻找一个最佳的比例因子。寻找最佳比例因子需要平衡离散化误差（随着每个量化值表示的范围变大而变大）和截断误差（其值被限制在可表示范围的极限内）这两个误差源，TensorRT提供了几种不同的校准器：

IInt8EntropyCalibrator2：当前推荐的熵校准器，默认情况下校准发生在层融合之前，推荐用于CNN模型中。
IInt8MinMaxCalibrator：该校准器使用激活分布的整个范围来确定比例因子，默认情况下校准发生在层融合之前，推荐用于NLP任务的模型中。
IInt8EntropyCalibrator：该校准器是TensorRT最原始的熵校准器，默认情况下校准发生在层融合之后，目前已不推荐使用。
IInt8LegacyCalibrator：该校准器需要用户进行参数化，默认情况下校准发生在层融合之后，不推荐使用。

TensorRT构建INT8模型引擎时，会执行下面的步骤：

构建一个32位的模型引擎，然后在校准数据集上运行这个引擎，然后为每个张量激活值的分布记录一个直方图；
从直方图构建一个校准表，为每个张量计算出一个比例因子；
根据校准表和模型的定义构建一个INT8的引擎。

校准的过程可能会比较慢，不过第二步生成的校准表可以输出到文件并可以被重用，如果校准表文件已存在，那么校准器就直接从该文件中读取校准表而无需执行前面两步。另外，与引擎文件不同的是，校准表是可以跨平台使用的。因此，我们在实际部署模型过程中可以先在带通用GPU的计算机上生成校准表，然后在Jetson Nano等嵌入式平台上去使用。为了编码方便，我们可以用Python编程来实现INT8量化过程来生成校准表。

具体实现

1. 加载校准数据

首先定义一个数据加载类用于加载校准数据，这里的校准数据为JPG格式的图片，图片读取后需要根据模型的输入数据要求进行缩放、归一化、交换通道等预处理操作：

class CalibDataLoader:
    def __init__(self, batch_size, width, height, calib_count, calib_images_dir):
        self.index = 0
        self.batch_size = batch_size
        self.width = width
        self.height = height
        self.calib_count = calib_count
        self.image_list = glob.glob(os.path.join(calib_images_dir, "*.jpg"))
        assert (
            len(self.image_list) > self.batch_size * self.calib_count
        ), "{} must contains more than {} images for calibration.".format(
            calib_images_dir, self.batch_size * self.calib_count
        )
        self.calibration_data = np.zeros((self.batch_size, 3, height, width), dtype=np.float32)

    def reset(self):
        self.index = 0

    def next_batch(self):
        if self.index < self.calib_count:
            for i in range(self.batch_size):
                image_path = self.image_list[i + self.index * self.batch_size]
                assert os.path.exists(image_path), "image {} not found!".format(image_path)
                image = cv2.imread(image_path)
                image = Preprocess(image, self.width, self.height)
                self.calibration_data[i] = image
            self.index += 1
            return np.ascontiguousarray(self.calibration_data, dtype=np.float32)
        else:
            return np.array([])

    def __len__(self):
        return self.calib_count

预处理操作代码如下：

def Preprocess(input_img, width, height):
    img = cv2.cvtColor(input_img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (width, height)).astype(np.float32)
    img = img / 255.0
    img = np.transpose(img, (2, 0, 1))
    return img

2. 实现校准器

想要实现校准器的功能，需继承TensorRT提供的四个校准器类中的一个，然后重写父校准器的几个方法：

get_batch_size: 用于获取batch的大小
get_batch: 用于获取一个batch的数据
read_calibration_cache: 用于从文件中读取校准表
write_calibration_cache: 用于把校准表从内存中写入文件中

由于我需要量化的是CNN模型，所以选择继承IInt8EntropyCalibrator2校准器：

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

class Calibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data_loader, cache_file=""):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.data_loader = data_loader
        self.d_input = cuda.mem_alloc(self.data_loader.calibration_data.nbytes)
        self.cache_file = cache_file
        data_loader.reset()

    def get_batch_size(self):
        return self.data_loader.batch_size

    def get_batch(self, names):
        batch = self.data_loader.next_batch()
        if not batch.size:
            return None
        # 把校准数据从CPU搬运到GPU中
        cuda.memcpy_htod(self.d_input, batch)

        return [self.d_input]

    def read_calibration_cache(self):
        # 如果校准表文件存在则直接从其中读取校准表
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        # 如果进行了校准，则把校准表写入文件中以便下次使用
        with open(self.cache_file, "wb") as f:
            f.write(cache)
            f.flush()

3. 生成INT8引擎

关于生成FP32模型引擎的流程我之前在一篇文章里专门介绍过，不过那篇文章里是用C++实现的。调用Python接口实现其实更简单，具体代码如下：

def build_engine():
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    config = builder.create_builder_config()
    parser = trt.OnnxParser(network, TRT_LOGGER)
    assert os.path.exists(onnx_file_path), "The onnx file {} is not found".format(onnx_file_path)
    with open(onnx_file_path, "rb") as model:
        if not parser.parse(model.read()):
            print("Failed to parse the ONNX file.")
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return None

    print("Building an engine from file {}, this may take a while...".format(onnx_file_path))

    # build tensorrt engine
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 * (1 << 30))  
    if mode == "INT8":
        config.set_flag(trt.BuilderFlag.INT8)
        calibrator = Calibrator(data_loader, calibration_table_path)
        config.int8_calibrator = calibrator
    else mode == "FP16":
        config.set_flag(trt.BuilderFlag.FP16)

    engine = builder.build_engine(network, config)
    if engine is None:
        print("Failed to create the engine")
        return None
    with open(engine_file_path, "wb") as f:
        f.write(engine.serialize())

    return engine

上面的代码首先用OnnxParser去解析模型，然后通过config设置引擎的精度。如果是构建INT8引擎，那么需要设置相应的Flag，并且要把之前实现的校准器对象传入其中，这样在构建引擎时TensorRT就会自动读取校准数据去生成校准表。

测试结果

为了验证INT8量化的效果，我用YOLOv5的几个模型在GeForce GTX 1650 Ti显卡上做了一下对比测试。不同精度的推理耗时测试结果如下：

模型	输入尺寸	模型精度	推理耗时（ms）
yolov5s.onnx	640x640	INT8	7
yolov5m.onnx	640x640	INT8	10
yolov5l.onnx	640x640	INT8	15
yolov5s.onnx	640x640	FP32	12
yolov5m.onnx	640x640	FP32	23
yolov5l.onnx	640x640	FP32	45

yolov5l模型FP32和INT8精度的目标检测结果分别如下面两张图片所示：

可以看到，检测结果还是比较接近的。

参考资料

https://developer.nvidia.com/zh-cn/blog/tensorrt-int8-cn/
https://developer.nvidia.com/blog/chieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-861/developer-guide/index.html#working-with-int8
https://github.com/xuanandsix/Tensorrt-int8-quantization-pipline.git