AI model deployment: Python implementation of TensorRT model INT8 quantization

This article was first published on the WeChat public account [DeepDriving]; you are welcome to follow it.

Overview

At present, the parameters of deep learning models are generally represented as 32-bit floating point (FP32) values during the training phase, so that a larger dynamic range is available when updating parameters during training. In the inference stage, however, FP32 precision consumes more computing resources and memory. For this reason, when deploying a model, reduced precision is often used, representing the parameters as 16-bit floating point (FP16) or 8-bit signed integer (INT8) values. Converting from FP32 to FP16 generally causes no loss of accuracy, but converting from FP32 to INT8 may cause a larger loss of accuracy, especially when the weights of the model are distributed over a large dynamic range.

Although there is some loss of accuracy, converting to INT8 also brings many benefits, such as reducing the occupation of storage space and memory and improving computing throughput. This is very meaningful on embedded platforms with limited computing resources.

To convert a model parameter tensor from FP32 to INT8, that is, to map the dynamic range of the floating-point tensor to the range [-128, 127], the following formula can be used:

$x_q = \mathrm{Clip}(\mathrm{Round}(x_f / scale))$

Here, Clip and Round denote the truncation and rounding operations respectively, and scale is the scale factor used for the mapping. As can be seen from the formula, converting FP32 to INT8 boils down to choosing a scale factor and mapping values through it. This mapping process is called quantization, and the formula above is the symmetric quantization formula.
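As a concrete illustration of the formula above, here is a minimal NumPy sketch of symmetric quantization. The helper name and the use of the maximum absolute value to derive the scale are illustrative assumptions, not TensorRT's exact procedure:

import numpy as np

def symmetric_quantize(x_f):
    # Derive the scale from the maximum absolute value so that the largest
    # magnitude maps to 127 (an illustrative choice; TensorRT's calibrators
    # choose the scale by minimizing information loss instead).
    scale = np.abs(x_f).max() / 127.0
    # Round to the nearest integer, then clip (truncate) into [-128, 127].
    x_q = np.clip(np.round(x_f / scale), -128, 127).astype(np.int8)
    return x_q, scale

weights = np.random.randn(4, 4).astype(np.float32)
x_q, scale = symmetric_quantize(weights)
print(x_q)
# Dequantization approximately recovers the original values: x_f ≈ x_q * scale
print(x_q.astype(np.float32) * scale)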

The key to quantization is to find an appropriate scale factor so that the accuracy of the quantized model is as close as possible to that of the original model. There are two ways to quantize a model:

  • Post-training quantization (PTQ) calculates the scale factors through a calibration process after the model has been trained, thereby achieving quantization.

  • Quantization-aware training (QAT) calculates the scale factors during the model training process, allowing the accuracy error caused by the quantization and dequantization operations to be compensated for during training.

This article only introduces how to call the TensorRT Python interface to implement INT8 quantization. As for the theory behind INT8 quantization, since it involves a lot of content, I will write a dedicated article to introduce it when I have time.

Specific implementation of TensorRT INT8 quantization

Calibrator in TensorRT

During post-training quantization, TensorRT needs to calculate a scale factor for each tensor in the model; this process is called calibration. Calibration requires representative input data so that TensorRT can run the model on this dataset, collect statistics for each tensor, and find an optimal scale factor. Finding the optimal scale factor requires balancing two sources of error: discretization error (which grows as the range represented by each quantized value becomes larger) and truncation error (where values are clamped to the limits of the representable range). TensorRT provides several different calibrators:

  • IInt8EntropyCalibrator2: the currently recommended entropy calibrator. By default, calibration occurs before layer fusion; it is recommended for CNN models.

  • IInt8MinMaxCalibrator: this calibrator uses the entire range of the activation distribution to determine the scale factor. By default, calibration occurs before layer fusion; it is recommended for NLP tasks.

  • IInt8EntropyCalibrator: the original entropy calibrator in TensorRT. By default, calibration occurs after layer fusion; its use is currently deprecated.

  • IInt8LegacyCalibrator: this calibrator requires user parameterization. By default, calibration occurs after layer fusion; it is not recommended.
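In the Python API, these calibrators are exposed as base classes in the tensorrt module, and a user-defined calibrator is expected to subclass one of them. A quick sanity check, assuming the TensorRT Python bindings are installed:

import tensorrt as trt

# The four calibrator interfaces are available as classes to inherit from:
print(trt.IInt8EntropyCalibrator2)
print(trt.IInt8MinMaxCalibrator)
print(trt.IInt8EntropyCalibrator)
print(trt.IInt8LegacyCalibrator)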

When TensorRT builds an INT8 model engine, it performs the following steps:

  1. Build an FP32 (32-bit) model engine, run this engine on the calibration dataset, and record a histogram of the distribution of activation values for each tensor;
  2. Construct a calibration table from the histograms and calculate a scale factor for each tensor;
  3. Build the INT8 engine based on the calibration table and the model definition.

The calibration process may be slow, but the calibration table generated in the second step can be written to a file and reused. If the calibration table file already exists, the calibrator reads the calibration table directly from the file without performing the first two steps. In addition, unlike engine files, calibration tables can be used across platforms. Therefore, during actual model deployment, we can first generate the calibration table on a general-purpose computer with a GPU and then use it on an embedded platform such as Jetson Nano. For coding convenience, we can use Python to implement the INT8 quantization process and generate the calibration table.
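The calibration table is typically a small plain-text file (a version header line followed by one tensor-name/hex-scale entry per line), so it is easy to inspect before copying it to the target device. A minimal check, assuming the table was saved as calibration.cache:

# Assumed file name; inspect the calibration table produced on the build machine
with open("calibration.cache") as f:
    print(f.read())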

Implementation
1. Load calibration data

First define a data loading class for loading the calibration data. The calibration data here consists of images in JPG format. After an image is read, it needs to be preprocessed according to the model's input requirements: scaling, normalization, channel reordering, and so on:

import glob
import os

import cv2
import numpy as np


class CalibDataLoader:
    def __init__(self, batch_size, width, height, calib_count, calib_images_dir):
        self.index = 0
        self.batch_size = batch_size
        self.width = width
        self.height = height
        self.calib_count = calib_count
        self.image_list = glob.glob(os.path.join(calib_images_dir, "*.jpg"))
        assert (
            len(self.image_list) > self.batch_size * self.calib_count
        ), "{} must contains more than {} images for calibration.".format(
            calib_images_dir, self.batch_size * self.calib_count
        )
        self.calibration_data = np.zeros((self.batch_size, 3, height, width), dtype=np.float32)

    def reset(self):
        self.index = 0

    def next_batch(self):
        if self.index < self.calib_count:
            for i in range(self.batch_size):
                image_path = self.image_list[i + self.index * self.batch_size]
                assert os.path.exists(image_path), "image {} not found!".format(image_path)
                image = cv2.imread(image_path)
                image = Preprocess(image, self.width, self.height)
                self.calibration_data[i] = image
            self.index += 1
            return np.ascontiguousarray(self.calibration_data, dtype=np.float32)
        else:
            return np.array([])

    def __len__(self):
        return self.calib_count

The preprocessing operation code is as follows:

def Preprocess(input_img, width, height):
    img = cv2.cvtColor(input_img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (width, height)).astype(np.float32)
    img = img / 255.0
    img = np.transpose(img, (2, 0, 1))
    return img
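With the loader and the preprocessing function in place, a loader instance can be created as follows. The directory, resolution, and batch settings here are hypothetical and should match your own model and dataset:

# Hypothetical settings: 640x640 input, 1 image per batch, 100 calibration batches
data_loader = CalibDataLoader(
    batch_size=1,
    width=640,
    height=640,
    calib_count=100,
    calib_images_dir="./calibration_images",
)
print("Number of calibration batches:", len(data_loader))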
2. Implement the calibrator

To implement a calibrator, you need to inherit from one of the four calibrator classes provided by TensorRT and override several methods of the parent class:

  • get_batch_size: returns the batch size used for calibration
  • get_batch: returns one batch of calibration data
  • read_calibration_cache: reads the calibration table from a file
  • write_calibration_cache: writes the calibration table from memory to a file

Since the model I need to quantize is a CNN, I choose to inherit from the IInt8EntropyCalibrator2 calibrator:

import os

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context needed for memory allocation and copies

class Calibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data_loader, cache_file=""):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.data_loader = data_loader
        self.d_input = cuda.mem_alloc(self.data_loader.calibration_data.nbytes)
        self.cache_file = cache_file
        data_loader.reset()

    def get_batch_size(self):
        return self.data_loader.batch_size

    def get_batch(self, names):
        batch = self.data_loader.next_batch()
        if not batch.size:
            return None
        # Copy the calibration data from host (CPU) memory to device (GPU) memory
        cuda.memcpy_htod(self.d_input, batch)

        return [self.d_input]

    def read_calibration_cache(self):
        # If the calibration table file already exists, read it directly from the file
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        # If calibration was performed, write the calibration table to a file for later reuse
        with open(self.cache_file, "wb") as f:
            f.write(cache)
            f.flush()
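The calibrator can then be constructed from the data loader defined earlier. The cache file name here is a hypothetical choice; the file will be created during the first calibration run:

# Hypothetical cache path; reused on later runs if it already exists
calibration_table_path = "calibration.cache"
calibrator = Calibrator(data_loader, calibration_table_path)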
3. Generate INT8 engine

I have already introduced the process of generating an FP32 model engine in a previous article, but that article used the C++ interface. The Python interface is actually simpler to use. The specific code is as follows:

def build_engine():
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    config = builder.create_builder_config()
    parser = trt.OnnxParser(network, TRT_LOGGER)
    assert os.path.exists(onnx_file_path), "The onnx file {} is not found".format(onnx_file_path)
    with open(onnx_file_path, "rb") as model:
        if not parser.parse(model.read()):
            print("Failed to parse the ONNX file.")
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return None

    print("Building an engine from file {}, this may take a while...".format(onnx_file_path))

    # build tensorrt engine
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 * (1 << 30))  
    if mode == "INT8":
        config.set_flag(trt.BuilderFlag.INT8)
        calibrator = Calibrator(data_loader, calibration_table_path)
        config.int8_calibrator = calibrator
    elif mode == "FP16":
        config.set_flag(trt.BuilderFlag.FP16)

    engine = builder.build_engine(network, config)
    if engine is None:
        print("Failed to create the engine")
        return None
    with open(engine_file_path, "wb") as f:
        f.write(engine.serialize())

    return engine

In the above code, an OnnxParser is first used to parse the ONNX model, and then the engine precision is set through the builder config. When building an INT8 engine, you need to set the corresponding flag and pass in the previously implemented calibrator object, so that TensorRT will automatically read the calibration data and generate the calibration table while building the engine.
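The build_engine function relies on a few module-level variables (TRT_LOGGER, onnx_file_path, engine_file_path, and mode, plus the data_loader and calibration_table_path from the earlier snippets). A minimal driver with hypothetical file names might look like this:

# Hypothetical paths and settings; adjust them to your own model and environment
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
onnx_file_path = "yolov5s.onnx"
engine_file_path = "yolov5s_int8.engine"
mode = "INT8"  # or "FP16"

if __name__ == "__main__":
    engine = build_engine()
    if engine is not None:
        print("Engine serialized to {}".format(engine_file_path))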

Test Results

In order to verify the effect of INT8 quantization, I ran a comparative test on several YOLOv5 models on a GeForce GTX 1650 Ti graphics card. The inference time results at different precisions are as follows:

Model         Input size   Precision   Inference time (ms)
yolov5s.onnx  640x640      INT8        7
yolov5m.onnx  640x640      INT8        10
yolov5l.onnx  640x640      INT8        15
yolov5s.onnx  640x640      FP32        12
yolov5m.onnx  640x640      FP32        23
yolov5l.onnx  640x640      FP32        45

The object detection results of the yolov5l model at FP32 and INT8 precision are shown in the following two pictures:

It can be seen that the detection results at the two precisions are quite close.

