AI model deployment: Python implementation of TensorRT model INT8 quantization
This article was first published on the public account [DeepDriving]. You are welcome to follow it.
Overview
At present, the parameters of deep learning models are basically represented in 32-bit floating point (FP32) during the training phase, so that a larger dynamic range is available for updating parameters during training. In the inference stage, however, FP32 precision consumes more computing resources and memory. For this reason, the precision of a model is often reduced when it is deployed, using 16-bit floating point (FP16) or 8-bit signed integers (INT8) to represent it. Converting from FP32 to FP16 generally causes no loss of accuracy, but converting from FP32 to INT8 may cause a larger loss of accuracy, especially when the weights of the model are distributed over a large dynamic range.
Although there is some loss of accuracy, converting to INT8 also brings many benefits, such as reducing CPU usage, storage space, and memory footprint, and improving computing throughput. This is very meaningful on embedded platforms with limited computing resources.
To convert a model parameter tensor from FP32 to INT8, that is, to map the dynamic range of the floating-point tensor to the range [-128, 127], you can use the following formula:
$$x_{q} = \mathrm{Clip}(\mathrm{Round}(x_{f} / scale))$$
Here, Clip and Round denote the truncation and rounding operations respectively. As the formula shows, converting FP32 to INT8 amounts to choosing a scale factor, scale, for the mapping. This mapping process is called quantization. The formula above is the symmetric quantization formula.
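As a rough illustration of the symmetric formula, the following pure-Python sketch quantizes a handful of values and maps them back. The helper names `quantize` and `dequantize` and the scale of 0.01 are assumptions chosen for this example; they are not part of TensorRT.

```python
def quantize(values, scale, qmin=-128, qmax=127):
    """Symmetric quantization: x_q = Clip(Round(x_f / scale))."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map the integer codes back to approximate floating-point values."""
    return [q * scale for q in q_values]


# Hypothetical scale: one integer step corresponds to 0.01 in floating point.
scale = 0.01
x_f = [-3.0, -1.27, 0.0, 0.3, 2.0]

x_q = quantize(x_f, scale)
print(x_q)  # [-128, -127, 0, 30, 127]
print(dequantize(x_q, scale))
```

With this scale, values whose magnitude exceeds roughly 1.27 are clipped to the ends of the INT8 range, which is where the accuracy loss on widely distributed weights comes from.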
The key to quantization is to find an appropriate scale factor so that the accuracy of the quantized model is as close as possible to that of the original model. There are two ways to quantize a model:
- Post-training quantization (PTQ) calculates the scale factor through a calibration process after the model has been trained.
- Quantization-aware training (QAT) calculates the scale factor during model training, allowing the accuracy error caused by the quantization and dequantization operations to be compensated during training.
This article only covers how to call the TensorRT Python interface to implement INT8 quantization. The theory behind INT8 quantization involves a lot of material, so I will write a dedicated article on it when I have time.
Specific implementation of TensorRT INT8 quantization
Calibrator in TensorRT
During post-training quantization, TensorRT needs to calculate the scale factor of each tensor in the model. This process is called calibration. Calibration requires representative data so that TensorRT can run the model on this dataset and collect statistics for each tensor in order to find an optimal scale factor. Finding the optimal scale factor requires balancing two sources of error: discretization error (which grows as the range represented by each quantized value becomes larger) and truncation error (where values are clamped to the limits of the representable range). TensorRT provides several different calibrators:
- `IInt8EntropyCalibrator2`: The currently recommended entropy calibrator. By default, calibration occurs before layer fusion. Recommended for CNN models.
- `IInt8MinMaxCalibrator`: This calibrator uses the entire range of the activation distribution to determine the scale factor. By default, calibration occurs before layer fusion. Recommended for NLP models.
- `IInt8EntropyCalibrator`: This is TensorRT's original entropy calibrator. By default, calibration occurs after layer fusion, and its use is currently deprecated.
- `IInt8LegacyCalibrator`: This calibrator requires user parameterization. By default, calibration occurs after layer fusion, and it is not recommended.
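The trade-off between the two error sources can be made concrete with a toy experiment. The sketch below is not TensorRT's calibration algorithm; `error_components` and the activation values are hypothetical, and the code only compares a full-range threshold (in the spirit of a min-max choice) with a tighter threshold that clips outliers.

```python
def error_components(values, threshold):
    """Return (rounding_error, clipping_error) as sums of squared errors.

    The threshold is the largest magnitude mapped onto the INT8 range.
    """
    scale = threshold / 127.0
    rounding = 0.0
    clipping = 0.0
    for v in values:
        q = max(-128, min(127, round(v / scale)))
        err = (v - q * scale) ** 2
        if abs(v) > threshold:
            clipping += err  # value was cut off at the range limit
        else:
            rounding += err  # value was only rounded to the nearest step
    return rounding, clipping


# Hypothetical activations: dense values in [-1, 1] plus two outliers at +/-8.
activations = [i / 1000.0 for i in range(-1000, 1001)] + [8.0, -8.0]

full_range = error_components(activations, threshold=8.0)
clipped = error_components(activations, threshold=1.0)
print("threshold 8.0: rounding=%.4f clipping=%.4f" % full_range)
print("threshold 1.0: rounding=%.4f clipping=%.4f" % clipped)
```

Covering the full range removes the clipping error but makes every quantization step coarser, while the tighter threshold does the opposite. A calibrator has to pick a point between these extremes.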
When building an INT8 model engine, TensorRT performs the following steps:
- Build a 32-bit model engine, then run this engine on the calibration dataset and record a histogram of the distribution of activation values for each tensor;
- Build a calibration table from the histograms and calculate a scale factor for each tensor;
- Build an INT8 engine from the calibration table and the model definition.
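To give a feel for how a histogram can turn into a scale factor, here is a deliberately simplified sketch of the first two steps. It assumes a coverage rule (keep 99% of the recorded values inside the range), which is a stand-in and not the entropy criterion TensorRT uses; the function names and the tensor name `conv1_output` are hypothetical.

```python
def build_histogram(values, num_bins, max_value):
    """Step 1: histogram of absolute activation values in [0, max_value]."""
    hist = [0] * num_bins
    bin_width = max_value / num_bins
    for v in values:
        idx = min(int(abs(v) / bin_width), num_bins - 1)
        hist[idx] += 1
    return hist, bin_width


def scale_from_histogram(hist, bin_width, coverage=0.99):
    """Step 2: smallest threshold covering `coverage` of the values, as a scale."""
    total = sum(hist)
    running = 0
    for idx, count in enumerate(hist):
        running += count
        if running >= coverage * total:
            threshold = (idx + 1) * bin_width
            return threshold / 127.0
    return len(hist) * bin_width / 127.0


# Hypothetical activations recorded for one tensor during calibration.
activations = [0.1] * 995 + [10.0] * 5
hist, bin_width = build_histogram(activations, num_bins=100, max_value=10.0)

# A toy "calibration table" mapping a tensor name to its scale factor.
calibration_table = {"conv1_output": scale_from_histogram(hist, bin_width)}
print(calibration_table)
```

Here the five outliers at 10.0 are ignored and the threshold ends up at 0.2, so the scale is much finer than a full-range choice would give.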
The calibration process may be slow, but the calibration table generated in the second step can be written to a file and reused. If the calibration table file already exists, the calibrator reads the table directly from the file and the first two steps are skipped. In addition, unlike engine files, calibration tables can be used across platforms. Therefore, in actual deployment, we can first generate the calibration table on a general-purpose computer with a GPU and then use it on an embedded platform such as the Jetson Nano. For coding convenience, we can implement the INT8 quantization process in Python to generate the calibration table.
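The reuse logic described above can be sketched in a few lines. This is only an illustration of the idea, not TensorRT's internals: `load_or_calibrate`, `fake_calibration` and the cache contents are made up for the example.

```python
import os
import tempfile


def load_or_calibrate(cache_path, run_calibration):
    """Return (table_bytes, calibrated); calibrated is False when the cache was reused."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read(), False
    table = run_calibration()
    with open(cache_path, "wb") as f:
        f.write(table)
    return table, True


def fake_calibration():
    # Placeholder for the slow calibration run; the bytes are made up.
    return b"tensor_a: 0.0123\n"


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "calibration.cache")
    first = load_or_calibrate(path, fake_calibration)
    second = load_or_calibrate(path, fake_calibration)

print(first[1], second[1])  # True False
```

The first call runs the (fake) calibration and writes the file; the second call finds the file and skips calibration entirely.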
Implementation
1. Load calibration data
First, define a data loading class for loading the calibration data. The calibration data here are images in JPG format. After an image is read, it needs to be preprocessed (resized, normalized, channel-swapped, and so on) according to the input requirements of the model:
import glob
import os

import cv2
import numpy as np


class CalibDataLoader:
    """Loads JPG calibration images and serves them as preprocessed batches."""

    def __init__(self, batch_size, width, height, calib_count, calib_images_dir):
        self.index = 0
        self.batch_size = batch_size
        self.width = width
        self.height = height
        self.calib_count = calib_count
        self.image_list = glob.glob(os.path.join(calib_images_dir, "*.jpg"))
        assert (
            len(self.image_list) > self.batch_size * self.calib_count
        ), "{} must contain more than {} images for calibration.".format(
            calib_images_dir, self.batch_size * self.calib_count
        )
        # Reusable host buffer holding one batch in NCHW layout.
        self.calibration_data = np.zeros((self.batch_size, 3, height, width), dtype=np.float32)

    def reset(self):
        self.index = 0

    def next_batch(self):
        # Return the next batch, or an empty array once calib_count batches have been served.
        if self.index < self.calib_count:
            for i in range(self.batch_size):
                image_path = self.image_list[i + self.index * self.batch_size]
                assert os.path.exists(image_path), "image {} not found!".format(image_path)
                image = cv2.imread(image_path)
                image = Preprocess(image, self.width, self.height)
                self.calibration_data[i] = image
            self.index += 1
            return np.ascontiguousarray(self.calibration_data, dtype=np.float32)
        else:
            return np.array([])

    def __len__(self):
        return self.calib_count
The preprocessing operation code is as follows:
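def Preprocess(input_img, width, height):
    # OpenCV loads images in BGR order; convert to RGB.
    img = cv2.cvtColor(input_img, cv2.COLOR_BGR2RGB)
    # Resize to the model input size and convert to float32.
    img = cv2.resize(img, (width, height)).astype(np.float32)
    # Normalize pixel values to [0, 1].
    img = img / 255.0
    # Reorder from HWC to CHW.
    img = np.transpose(img, (2, 0, 1))
    return img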
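To see what the channel swap, normalization and layout change do to the data, the following sketch applies the same three steps to a tiny nested-list image without OpenCV or NumPy. The resize step is left out, and `preprocess_pixels` is a hypothetical helper written only for this illustration.

```python
def preprocess_pixels(bgr_image):
    """Apply BGR->RGB, scaling to [0, 1] and HWC->CHW to a nested-list image."""
    height = len(bgr_image)
    width = len(bgr_image[0])
    # Swap the channel order and normalize each pixel.
    rgb = [[[px[2] / 255.0, px[1] / 255.0, px[0] / 255.0] for px in row] for row in bgr_image]
    # Move the channel axis to the front: [channel][row][column].
    return [[[rgb[y][x][c] for x in range(width)] for y in range(height)] for c in range(3)]


# A 1x2 image in BGR order: one pure-blue pixel and one pure-red pixel.
image = [[[255, 0, 0], [0, 0, 255]]]
chw = preprocess_pixels(image)
print(chw)  # [[[0.0, 1.0]], [[0.0, 0.0]], [[1.0, 0.0]]]
```

The output has three planes (R, G, B), each of shape 1x2, which matches the `(3, height, width)` slices that the data loader writes into its batch buffer.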
2. Implement the calibrator
To implement a calibrator, you need to inherit from one of the four calibrator classes provided by TensorRT and then override several methods of the parent class:
- `get_batch_size`: used to get the batch size
- `get_batch`: used to get one batch of data
- `read_calibration_cache`: used to read the calibration table from a file
- `write_calibration_cache`: used to write the calibration table from memory to a file
Since the model I need to quantize is a CNN, I choose to inherit from the `IInt8EntropyCalibrator2` calibrator:
import os

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA context on import


class Calibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data_loader, cache_file=""):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.data_loader = data_loader
        # Device buffer large enough for one batch of calibration data.
        self.d_input = cuda.mem_alloc(self.data_loader.calibration_data.nbytes)
        self.cache_file = cache_file
        data_loader.reset()

    def get_batch_size(self):
        return self.data_loader.batch_size

    def get_batch(self, names):
        batch = self.data_loader.next_batch()
        if not batch.size:
            # Returning None signals that the calibration data is exhausted.
            return None
        # Copy the calibration data from the CPU to the GPU.
        cuda.memcpy_htod(self.d_input, batch)
        return [self.d_input]

    def read_calibration_cache(self):
        # If the calibration table file exists, read the calibration table directly from it.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        # If calibration was performed, write the calibration table to a file for reuse.
        with open(self.cache_file, "wb") as f:
            f.write(cache)
            f.flush()
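The calling pattern of these methods can be imitated without TensorRT or a GPU. The sketch below assumes that the builder keeps requesting batches until `get_batch` returns `None`; `ToyCalibrator` and `drive_calibration` are hypothetical stand-ins, and plain lists take the place of device pointers.

```python
class ToyCalibrator:
    """Stand-in that follows the same method contract without TensorRT or a GPU."""

    def __init__(self, batches):
        self.batches = list(batches)
        self.index = 0

    def get_batch_size(self):
        return len(self.batches[0])

    def get_batch(self, names):
        if self.index >= len(self.batches):
            return None  # no more calibration data
        batch = self.batches[self.index]
        self.index += 1
        return [batch]  # one entry per network input


def drive_calibration(calibrator, input_names):
    """Hypothetical driver loop: keep asking for batches until None is returned."""
    consumed = 0
    while True:
        bindings = calibrator.get_batch(input_names)
        if bindings is None:
            break
        consumed += 1
    return consumed


calibrator = ToyCalibrator([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
num_batches = drive_calibration(calibrator, ["images"])
print(num_batches)  # 3
```

This is why the data loader returns an empty array after `calib_count` batches: it gives `get_batch` a way to tell the caller that calibration is finished.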
3. Generate INT8 engine
I introduced the process of generating an FP32 model engine in a previous article, but there it was implemented in C++. Using the Python interface is actually simpler. The specific code is as follows:
# TRT_LOGGER, onnx_file_path, engine_file_path, mode, data_loader and
# calibration_table_path are expected to be defined at module level.
def build_engine():
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    config = builder.create_builder_config()
    parser = trt.OnnxParser(network, TRT_LOGGER)
    assert os.path.exists(onnx_file_path), "The onnx file {} is not found".format(onnx_file_path)
    with open(onnx_file_path, "rb") as model:
        if not parser.parse(model.read()):
            print("Failed to parse the ONNX file.")
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return None

    print("Building an engine from file {}, this may take a while...".format(onnx_file_path))

    # Build the TensorRT engine with a 1 GiB workspace limit.
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 * (1 << 30))
    if mode == "INT8":
        config.set_flag(trt.BuilderFlag.INT8)
        calibrator = Calibrator(data_loader, calibration_table_path)
        config.int8_calibrator = calibrator
    elif mode == "FP16":
        config.set_flag(trt.BuilderFlag.FP16)

    # Note: build_engine is deprecated in newer TensorRT releases
    # in favor of build_serialized_network.
    engine = builder.build_engine(network, config)
    if engine is None:
        print("Failed to create the engine")
        return None
    with open(engine_file_path, "wb") as f:
        f.write(engine.serialize())
    return engine
The code above first uses OnnxParser to parse the model and then uses config to set the precision of the engine. If you are building an INT8 engine, you need to set the corresponding flag and pass in the previously implemented calibrator object, so that TensorRT automatically reads the calibration data and generates a calibration table when building the engine.
Test Results
To verify the effect of INT8 quantization, I ran a comparison test with several YOLOv5 models on a GeForce GTX 1650 Ti graphics card. The inference times at different precisions are as follows:
Model | Input size | Model precision | Inference time (ms) |
---|---|---|---|
yolov5s.onnx | 640x640 | INT8 | 7 |
yolov5m.onnx | 640x640 | INT8 | 10 |
yolov5l.onnx | 640x640 | INT8 | 15 |
yolov5s.onnx | 640x640 | FP32 | 12 |
yolov5m.onnx | 640x640 | FP32 | 23 |
yolov5l.onnx | 640x640 | FP32 | 45 |
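The speedup implied by these measurements can be worked out directly. The short sketch below only restates the numbers from the table and divides the FP32 time by the INT8 time for each model.

```python
# Inference times in milliseconds, copied from the table above.
times_ms = {
    "yolov5s": {"INT8": 7, "FP32": 12},
    "yolov5m": {"INT8": 10, "FP32": 23},
    "yolov5l": {"INT8": 15, "FP32": 45},
}

speedups = {name: t["FP32"] / t["INT8"] for name, t in times_ms.items()}
for name, ratio in speedups.items():
    print("{}: {:.2f}x".format(name, ratio))
# yolov5s: 1.71x
# yolov5m: 2.30x
# yolov5l: 3.00x
```

In this test the benefit of INT8 grows with model size, from about 1.7x for yolov5s to 3x for yolov5l.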
The object detection results of the yolov5l model at FP32 and INT8 precision are shown in the following two pictures:
As can be seen, the detection results are still relatively close.
References
- https://developer.nvidia.com/zh-cn/blog/tensorrt-int8-cn/
- https://developer.nvidia.com/blog/chieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/
- https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-861/developer-guide/index.html#working-with-int8
- https://github.com/xuanandsix/Tensorrt-int8-quantization-pipline.git