TensorRT INT8 quantization principle and implementation (very detailed)

Table of contents

1. What is model quantization?

2. Why model quantization?

3. What is the goal of model quantization?

4. Necessary conditions for model quantization

5. Classification of model quantization

5.1 Linear Quantization and Nonlinear Quantization

5.2 Layer-by-layer quantization, group-by-group quantization and channel-by-channel quantization

5.3 N-bit quantization

5.4 Weight Quantization and Weight Activation Quantization

5.4.1 The concept of weight and activation

5.4.2 Weight Quantization and Weight Activation Quantization

5.4.3 Activation quantization modes

5.5 Quantization during training and quantization after training

6. The Mathematical Basis of Quantization

6.1 Fixed-point and floating-point numbers

6.2 Linear Quantization (Linear Mapping)

6.2.1 Quantization

6.2.2 Dequantization

7. TensorRT INT8 quantization principle

7.1 What is TensorRT?

7.2 The premise of using TensorRT INT8 quantization

7.3 INT8 quantization process

7.4 INT8 Calibration

7.4.1 Why calibration is needed

7.4.2 Purpose of INT8 Calibration

7.4.3 How to implement INT8 calibration

7.5 Accuracy and speed improvement after quantization

7.6 Summary

8. Implementing TensorRT INT8 quantization in C++

8.1 Program flow

8.2 Calibrator

8.3 BatchStream

9. Quantization effect test

9.1 Test environment

9.2 Physical performance

9.2.1 Engine size

9.2.2 Power consumption

9.2.3 GPU memory usage

9.3 Detection performance

9.3.1 Inference Speed

9.3.2 Detection Accuracy

9.4 Summary

10. Appendix

sampleINT8.cpp

References


1. What is model quantization?

        "Model quantization" is made up of two words, model and quantization. To understand model quantization accurately, we need to look at what each of the two words means.

        In the context of computer vision and deep learning, the model usually refers to a convolutional neural network, which is used to extract visual features from images or video.

        Quantization refers to the process of approximating a signal's continuous values by a finite number of discrete values; it can be understood as a form of information compression. When this concept is applied to computer systems, quantization has several related terms, with low precision probably being the most general. Regular precision generally uses FP32 (32-bit floating point, single precision) to store model weights; low precision means FP16 (half-precision floating point), INT8 (8-bit fixed-point integer) and other numerical formats. At present, low precision usually refers to INT8, so some people also call quantization "fixed-pointing", although strictly speaking the representable range is narrowed. Fixed-point quantization specifically refers to linear quantization whose scale is a power of 2, which is a more practical quantization method.

        In short, what we usually call model quantization is a model compression technique that converts floating-point storage (and computation) into integer storage (and computation). For example, a weight or bias that originally had to be represented with FP32 only needs a single INT8 value after INT8 quantization.

        Note: the following mainly focuses on INT8 quantization.

2. Why model quantization?

        Existing deep learning frameworks, such as TensorFlow, PyTorch, Caffe and MXNet, usually use FP32 precision to represent weights, biases, activation values and so on when training deep neural networks. While the performance of deep learning models keeps improving, the computation becomes more and more complex, and the computing overhead and memory requirements gradually increase. AlexNet, with only 8 layers, requires 61 million parameters and 729 million floating-point operations, and consumes about 233MB of memory. The parameters of the later VGG-16 reached 138 million, with 15.6 billion floating-point operations and about 553MB of memory. To overcome the vanishing gradient problem of deep networks, He et al. proposed ResNet, which for the first time achieved a top-5 classification error below 5% in the ILSVRC competition. The relatively shallow ResNet-50 already has 25 million parameters, as many as 4.12 billion floating-point operations, and a memory cost of about 102MB.

Network       Model size (MB)   GFLOPS
AlexNet       214               0.72
VGG-13        532               11.3
VGG-16        552               15.6
VGG-19        576               19.6
ResNet-50     102               4.12
ResNet-101    178               7.84
ResNet-152    240               23.1
GoogleNet     27                1.6
InceptionV3   89                6
MobileNet     38                0.58
SqueezeNet    30                0.84

Table 1 Model size and number of floating point operations of different models

        Huge numbers of parameters mean larger memory storage, and more floating-point operations mean higher training cost and longer computation time, which greatly limits deployment on resource-constrained devices such as smartphones and smart wristbands. As shown in Table 2, the inference time of these deep models on a Samsung Galaxy S6 far exceeds that on a Titan X desktop graphics card; the real-time performance is poor and cannot meet the needs of practical applications.

Model       Samsung Galaxy S6   Titan X
AlexNet     117                 0.54
GoogleNet   273                 1.83
VGG-16      1926                10.67

Table 2 Inference time of different models on different devices (unit: ms)

3. What is the goal of model quantization?

 Figure 1 Numerical memory usage and computing power consumption with different precisions

  1. Smaller model size.
  2. Lower computing power consumption.
  3. Lower memory usage.
  4. Faster calculation speed.
  5. Nearly unchanged inference accuracy.

4. Necessary conditions for model quantization

        Does quantization necessarily speed up computation? The answer is no: many quantization algorithms fail to deliver substantial speedups.

        Let us introduce a concept: theoretical peak performance. In the field of high-performance computing it is generally defined as the number of operations that can be completed per clock cycle multiplied by the chip frequency.

        What kind of quantization method can bring potential, realizable speed improvements? We conclude that two conditions need to be met:

  1. Computation on the quantized values has a higher peak performance on the deployment hardware.
  2. The additional computation (overhead) introduced by the quantization algorithm is small.

        Accurately understanding these conditions requires some background in high-performance computing, which is omitted here for space. We directly give the following conclusion: the quantization methods known to have a good chance of delivering speedups fall mainly into the following three categories.

  1. Binarization, which can use simple bit operations to process many values at the same time. From NVIDIA GPUs to the x86 platform, 1-bit computation offers a theoretical performance improvement of 5 to 128 times, while introducing only an extra quantization operation that can itself enjoy SIMD (Single Instruction, Multiple Data) acceleration.
  2. Linear quantization, which can be subdivided into asymmetric and symmetric. NVIDIA GPU, x86 and ARM platforms all support 8-bit computation, with efficiency improvements ranging from 1 to 16 times. Tensor Cores even support 4-bit computation, which is also a very promising direction. Since the extra quantization/dequantization computations introduced by linear quantization are standard vector operations, they can also be accelerated with SIMD and add little overhead.
  3. Logarithmic quantization, a special quantization method. Multiplying two powers with the same base is equivalent to adding their exponents, which reduces the computational intensity; at the same time, addition is turned into an exponent calculation. However, I have not seen acceleration libraries implementing logarithmic quantization on the three major platforms, so the acceleration effect may not be obvious; logarithmic quantization is only used on some dedicated chips.

5. Classification of model quantization

5.1 Linear Quantization and Nonlinear Quantization

        According to whether the mapping function is linear, quantization can be divided into two categories: linear quantization and nonlinear quantization. This article mainly discusses linear quantization.

5.2 Layer-by-layer quantization, group-by-group quantization and channel-by-channel quantization

        According to the granularity of quantization (the range of shared quantization parameters), it can be divided into layer-by-layer quantization, group-by-group quantization, and channel-by-channel quantization.

  • Layer-by-layer quantization, with a layer as the unit: the weights of the entire layer share one set of scaling factor S and offset Z;
  • Group-by-group quantization, with a group as the unit: each group uses its own set of S and Z;
  • Channel-by-channel quantization, with a channel as the unit: each channel uses its own set of S and Z.

        When group=1, group-by-group quantization is equivalent to layer-by-layer quantization; when group=num_filters (i.e., depthwise convolution), group-by-group quantization is equivalent to channel-by-channel quantization.

5.3 N-bit quantization

        According to the number of bits required to store a weight element, it can be divided into 8-bit quantization, 4-bit quantization, 2-bit quantization, and 1-bit quantization.

5.4 Weight Quantization and Weight Activation Quantization

5.4.1 The concept of weight and activation

        Let's look at a simple deep learning network, as shown in Figure 2

Figure 2 Schematic diagram of deep learning network dimensions 

        The filter holds the weights, and the input and output data are the activation values of the previous layer and the current layer, respectively. Assuming the input data is [3,224,224] and the filter is [2,3,3,3], the output data, computed with the following formulas, is [2,222,222]:

OH=(H+2P-FH)/S + 1

OW=(W+2P-FW)/S + 1

         Therefore, there are 2*3*3*3=54 weights (excluding bias), 3*224*224=150528 activation values in the previous layer, and 2*222*222=98568 activation values in the next layer; obviously the number of activation values is much larger than the number of weights.
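        As a quick check of the formulas above, here is a minimal C++ sketch (my own illustration, not part of the original post) that recomputes the output size and the weight/activation counts for this example:

#include <iostream>

// Convolution output size: OH = (H + 2P - FH) / S + 1 (same for the width).
int convOutSize(int in, int pad, int filter, int stride)
{
    return (in + 2 * pad - filter) / stride + 1;
}

int main()
{
    // Example from the text: input [3,224,224], filter [2,3,3,3], P = 0, S = 1.
    const int C = 3, H = 224, W = 224;
    const int K = 2, FH = 3, FW = 3;
    const int OH = convOutSize(H, 0, FH, 1); // 222
    const int OW = convOutSize(W, 0, FW, 1); // 222

    std::cout << "weights: " << K * C * FH * FW << std::endl;          // 54
    std::cout << "input activations: " << C * H * W << std::endl;      // 150528
    std::cout << "output activations: " << K * OH * OW << std::endl;   // 98568
}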

         For background on deep learning networks, see: "Introduction to Deep Learning - Theory and Implementation Based on Python" Reading Notes - Nicholson's Blog (CSDN).

5.4.2 Weight Quantization and Weight Activation Quantization

        According to the parameters that need to be quantized, they can be classified into two categories: weight quantization and weight activation quantization.

  • Weight quantization, where only the weights in the network are quantized. Since the weights of a network are saved with the model, the corresponding quantization parameters S and Z can be derived from the weights in advance, without needing an extra calibration dataset. Generally, during inference the number of weight values is much smaller than the number of activation values, so quantizing only the weights can already bring reasonable compression and acceleration benefits.
  • Weight-activation quantization, which quantizes not only the weights but also the activation values. Since the range of an activation layer is usually not easy to obtain in advance, it needs to be computed during inference or roughly estimated from the model.

5.4.3 Activation quantization modes

        According to the quantization method of the activation value, it can be divided into online quantization and offline quantization.

  • Online quantization, where the S and Z of the activation values are computed dynamically from the actual activations during inference;
  • Offline quantization, where the S and Z of the activation values are determined in advance with the help of a small calibration dataset. Since there is no need to compute the quantization parameters dynamically, the inference speed of offline quantization is usually faster.

        The following three methods are generally used to determine relevant quantization parameters.

  • Exponential smoothing, which feeds the calibration dataset into the model, collects the output feature map of each quantized layer, computes the S and Z values of each batch, and updates S and Z through exponential smoothing.
  • Histogram truncation, which addresses the fact that, when computing the quantization parameters S and Z, some feature maps contain distant outliers that make the max very large; the histogram is truncated, e.g., the largest 1% of the data is discarded and the value at the 1% cut-off point is used as the max when computing the quantization parameters.
  • KL-divergence calibration, which computes the KL divergence (also known as relative entropy, used to describe the difference between two distributions) to evaluate the difference between the distributions before and after quantization, and searches for the quantization parameters S and Z that minimize the KL divergence. This approach is used in TensorRT.

 5.5 Quantization during training and quantization after training

        Post-Training Quantization (PTQ) does not require retraining, so it is a lightweight quantization method. In most cases PTQ is sufficient to achieve INT8 quantization with performance close to FP32. However, it also has limitations, especially for lower-bit activation quantization such as 4-bit or 2-bit. This is where training-time quantization comes in.

        Quantization during training, also called Quantization-Aware Training (QAT), can achieve high-accuracy low-bit quantization, but its drawbacks are also obvious: the training code needs to be modified, and the quantization error of the gradient during backpropagation is large, which can easily prevent convergence.

        This article mainly discusses post-training quantization .

6. The Mathematical Basis of Quantization

6.1 Fixed-point and floating-point numbers

        The quantization process can be divided into two parts: converting the model from FP32 to INT8, and using INT8 for inference. This section explains the arithmetic behind these two parts. Without an understanding of the underlying arithmetic principles, it is common to get confused when considering the details of quantization.

        Even people working in computer science rarely think about how arithmetic operations are actually carried out. Since quantization bridges fixed-point and floating-point representations, it is necessary to understand their basics before touching the related research and solutions.

        Both fixed-point and floating-point are numerical representations. The difference between them is where the point that separates the integer part from the fractional part is located. Fixed-point reserves a specific number of digits for integers and decimals, while floating-point reserves a specific number of digits for the significand and exponent.

          Fixed-point                   Floating-point
Format    IIIII.FFFFF                   significand × base^exponent
Decimal   12345.78901, 00123.90000      1.2345678901×10^4, 1.239×10^2
Hex       A1C7D.FF014, 00000.000FD      A.1C7DFF014×16^4, F.D×16^-4
Binary    10111.01011, 00110.00000      1.011101011×2^4, 1.1×2^2

Table 3 Formats and examples of fixed-point and floating-point

        Among the built-in data types of an instruction set, fixed point is an integer and floating point is a binary format. In general, fixed point at the instruction-set level is contiguous, because it is an integer and the gap between two adjacent representable numbers is 1. Floating point represents real numbers, and its numerical gap is determined by the exponent, so it has a very wide range of values. It also follows that the spacing of floating-point values is uneven: within the same exponent range the number of representable values is the same, so the closer a value is to zero, the more accurately it can be represented. For example, [1,2) contains the same number of representable floating-point values as [0.5,1), [2,4), [4,8), and so on. In addition, the value of a fixed-point number coincides exactly with the true value we want to represent, while a floating-point number may deviate from the true value we want to represent.
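        A small C++ sketch (my own illustration, not from the original post) makes this concrete by printing the gap between adjacent FP32 values at different magnitudes; the spacing grows with the exponent:

#include <cmath>
#include <cstdio>
#include <initializer_list>

int main()
{
    // Gap between a float and the next representable float, at several magnitudes.
    for (float x : {0.5f, 1.0f, 2.0f, 4.0f, 1024.0f, 1.0e8f})
    {
        float next = std::nextafterf(x, INFINITY);
        std::printf("x = %-10g gap to the next float = %g\n", x, next - x);
    }
}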

        Value range                                  Number of possible values
FP32    [-(2-2^-23)×2^127, (2-2^-23)×2^127]          2^32
INT32   [-2^31, 2^31-1]                              2^32

Table 4 FP32 and INT32 value range and the number of possible values

 Figure 3 Schematic diagram of the relationship between floating-point numbers and fixed-point numbers

        For example, suppose that only two values can be represented within each exponent range; then a value in the range [2^0, 2^1) can only be expressed as 1 or 1.5 when converted to such a floating-point number.

Exponent   Range of true values   Representable FP32 values   Maximum error
0          [2^0, 2^1)             {1, 1.5}                    ≈0.5
3          [2^3, 2^4)             {8, 12}                     ≈4

Table 5 Examples of different value gaps of floating point numbers

6.2 Linear Quantization (Linear Mapping)

6.2.1 Quantization

        TensorRT uses linear quantization, which can be expressed by the following mathematical expressions:

X_{\mathrm{int}}=\mathrm{clip}\left(\left\lfloor\frac{X}{S}\right\rceil+Z;\,-2^{b-1},\,2^{b-1}-1\right)

        Here X is the original FP32 value; Z is the zero point of the mapping; S is the scaling factor (scale); \left\lfloor\cdot\right\rceil denotes a rounding function (round to nearest, round up, round down, etc.); b is the quantization bit width; and X_{\mathrm{int}} is the resulting quantized integer value.

        The clip function is as follows:

clip(x;a,c)=\begin{cases} a,& \text{ if } x<a, \\ x,& \text{ if } a\leq x\leq c, \\ c,& \text{ if } x> c. \end{cases}

        According to whether the parameter Z is zero, linear quantization can be divided into two categories: symmetric quantization and asymmetric quantization. TensorRT uses symmetric quantization, that is, Z=0.

Figure 4 Symmetric signed quantization, symmetric unsigned quantization and asymmetric quantization 
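        Before moving on to dequantization, here is a minimal C++ sketch of the symmetric case (Z = 0) described above. This is my own illustration: the scale choice S = max|x|/127 is one common convention for symmetric INT8, not necessarily the rule TensorRT itself applies.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Symmetric linear quantization (Z = 0): x_int = clip(round(x / S), -128, 127).
int8_t quantizeSymmetric(float x, float scale)
{
    int q = static_cast<int>(std::lround(x / scale));
    q = std::max(-128, std::min(127, q));
    return static_cast<int8_t>(q);
}

// One common way to pick S for symmetric INT8: S = max(|x|) / 127.
float chooseScale(const float* data, size_t n)
{
    float maxAbs = 0.f;
    for (size_t i = 0; i < n; ++i)
        maxAbs = std::max(maxAbs, std::fabs(data[i]));
    return maxAbs / 127.f;
}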

6.2.2 Dequantization

        According to the quantization formula, it is not difficult to deduce that the inverse quantization formula is as follows:

X=S(X_{\mathrm{int}}-Z)=S\left(\mathrm{clip}\left(\left\lfloor\frac{X}{S}\right\rceil+Z;\,-2^{b-1},\,2^{b-1}-1\right)-Z\right)

        When Z=0, X_{\min}=-2^{b-1}S and X_{\max}=(2^{b-1}-1)S.

        It can be seen that when S is large the quantization domain expands, but a single INT8 value then represents a wider range of FP32 values, so the error between the INT8 value and the FP32 value (the quantization error) increases; when S is small, the quantization error decreases, but the quantization domain also shrinks and more parameters are clipped away.

        For example, assume Z=0 and round down.

S     FP32 range represented by INT8=1   Maximum error   Quantization domain
10    [10, 20)                            ≈10             [-1280, 1280)
100   [100, 200)                          ≈100            [-12800, 12800)

Table 6 Effect of different scaling factors
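        As a small numerical check of Table 6 (my own sketch, not part of the original post): with Z = 0 and floor rounding, every FP32 value in [S·q, S·(q+1)) maps to the same INT8 value q, so the maximum dequantization error is roughly S and the quantization domain is [-128·S, 128·S).

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// Dequantization for symmetric quantization (Z = 0): x ≈ S * x_int.
float dequantize(int8_t q, float scale) { return scale * q; }

int main()
{
    const float x = 195.f; // an example FP32 value
    for (float scale : {10.f, 100.f})
    {
        // Round down, as in Table 6.
        int8_t q = static_cast<int8_t>(std::floor(x / scale));
        std::printf("S = %-4g x = %g -> q = %d -> dequantized = %g (error %g)\n",
                    scale, x, q, dequantize(q, scale), x - dequantize(q, scale));
    }
}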

7. TensorRT INT8 quantization principle

7.1 What is TensorRT?

        At the heart of NVIDIA® TensorRT™ is a C++ library that facilitates high-performance inference on NVIDIA Graphics Processing Units (GPUs). It is designed to work in a complementary manner with training frameworks like TensorFlow, Caffe, PyTorch, MXNet, etc. It is specifically focused on running already trained networks quickly and efficiently on GPUs to generate results. Some training frameworks (such as TensorFlow) have integrated TensorRT, so it can be used to accelerate inference within the framework. 

Figure 5 TensorRT is a programmable inference accelerator

7.2 The premise of using TensorRT INT8 quantization

  1. The hardware must be an NVIDIA graphics card with compute capability greater than or equal to 6.1. The compute capability of NVIDIA GPUs can be looked up at: CUDA GPU | NVIDIA Developer
  2. The software has restrictions on platforms, compilers, etc.; some network layers cannot be used, and there are certain restrictions on the inputs and outputs of the supported layers. For details, refer to TensorRT-Support-Matrix-Guide.pdf

 Figure 6 Graphics card models supported by TensorRT quantization

Figure 7 Network layer supported by TensorRT quantization

 Figure 8 Platforms and compilers supported by TensorRT quantization

7.3 INT8 quantization process

        The basic formula for convolution is as follows.

Y=WX+b

         Where X is the output of the previous layer, that is, the original input or the activation value of the previous layer; W is the weight of the current layer; b is the bias of the current layer; Y is the output of the current layer, that is, the activation value of the current layer.

        The official TensorRT document tells us that the bias can be ignored during the quantization process, as shown in Figure 9, so the basic formula of convolution is simplified to the following form.

Y=WX

 Figure 9 TensorRT official website documentation states that bias can be omitted

        After removing the bias, the whole quantization process is actually very simple, see Figure 10-14 below for details.

  1. Convert the activation values and weights from FP32 to INT8 by linear mapping;
  2. Perform the convolution, producing INT32 activation values; storing them directly as INT8 would cause too much accumulated loss;
  3. Requantize the result back to INT8 as the input of the next layer;
  4. At the last layer of the network, dequantize back to FP32.

 Figure 10 FP32 convolution layer inference flow

 Figure 11 FP32 convolution layer inference flow - detailed view

Figure 12 INT8 convolution layer inference flow - quantization

 Figure 13 INT8 convolution layer inference flow - activation and requantization

Figure 14 INT8 convolution layer inference flow - dequantization

       The key parts of the whole flow are the quantization from FP32 to INT8, the requantization from INT32 to INT8, and the dequantization from INT8 back to FP32. All three are based on the linear quantization (linear mapping) described in Section 6.2.

       Quantization and dequantization need no further explanation, but requantization deserves a closer look. As mentioned above, TensorRT's quantization ignores the bias, which simplifies the flow; but when a layer really does need a bias, how is that handled? This mainly affects the requantization step; see Figures 15-16 for the details.

 Figure 15 INT8 convolution layer inference flow - requantization with bias

Figure 16 Official pseudocode
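        To make the requantization step concrete, here is a minimal sketch (my own illustration under the assumption of per-tensor symmetric scales; it is not TensorRT's actual kernel code). An INT32 accumulator produced from inputs with scale S_x and weights with scale S_w represents the real value acc·S_x·S_w, so converting it to the next layer's INT8 domain with scale S_y amounts to multiplying by S_x·S_w/S_y, rounding and clipping; an FP32 bias, if present, can be folded in before the rescaling.

#include <algorithm>
#include <cmath>
#include <cstdint>

// Requantize an INT32 accumulator to INT8 (symmetric, per-tensor scales).
// 'acc' represents the real value acc * sIn * sW; the returned INT8 value
// represents the same real value under the output scale sOut.
int8_t requantize(int32_t acc, float sIn, float sW, float sOut, float bias = 0.f)
{
    // Fold the (optional) FP32 bias back in before rescaling.
    float real = static_cast<float>(acc) * sIn * sW + bias;
    int q = static_cast<int>(std::lround(real / sOut));
    q = std::max(-128, std::min(127, q));
    return static_cast<int8_t>(q);
}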

7.4 INT8 calibration

7.4.1 Why calibration is needed

        First of all, be clear that INT8 calibration is only needed when activation quantization is used; readers can review Section 5.4 for the reason.

        Why is this process needed? There are three main points:

1. The activation values of a network are not stored in the network parameters; they are produced at runtime, so it is hard to determine their range in advance.

2. Recalling the analysis in 6.2.2: when S is large the quantization domain expands, but a single INT8 value then represents a wider range of FP32 values, so the quantization error increases; when S is small the quantization error decreases, but the quantization domain shrinks and more parameters are clipped away.

3. But why does this work for all kinds of models? If shrinking the quantization domain caused a more pronounced accuracy drop for some models, wouldn't the accuracy after INT8 quantization necessarily drop sharply?

        Not really. Recalling 6.1, quantization converts floating-point values to fixed-point values. Because the density of representable floating-point values is uneven, a huge number of representable values lie near zero, roughly 2^31 of them, about half of all representable values. Therefore floating-point values near zero are represented more accurately, values far from the origin are more likely to be noise, and the weights and activations of a network are mostly distributed around zero. Appropriately shrinking the quantization domain is therefore almost guaranteed to improve quantization accuracy.

7.4.2 Purpose of INT8 calibration

        The above analysis makes the goal clear: INT8 calibration is a trade-off. It looks for suitable scaling parameters so that the quantized INT8 values represent the original FP32 values as accurately as possible, while not discarding too many non-noise parameters far from zero.

 Figure 17 Official INT8 calibration illustration

7.4.3 How to implement INT8 calibration

7.4.3.1 Activation distribution before calibration

        As an example, feed the same batch of images through different models; from different network layers we obtain the corresponding activation distributions, see Figure 18.

 Figure 18 Activation distributions of the same data on different networks and layers (official)

        The distributions are all different, so how do we choose the optimal threshold?

        This requires a quantitative metric. Recalling 5.4.3, the common approaches are exponential smoothing, histogram truncation and KL-divergence calibration; TensorRT uses KL-divergence calibration.

7.4.3.2 Principle of the KL-divergence calibration method

        KL divergence is also called relative entropy. The name suggests a relationship with cross entropy and information entropy, and indeed there is one. The KL formula is:

        KL(p\|q)=H(p,q)-H(p)=\sum_{k=1}^{N}p_{k}\log_{2}\frac{1}{q_{k}}-\sum_{k=1}^{N}p_{k}\log_{2}\frac{1}{p_{k}}=\sum_{k=1}^{N}p_{k}\log_{2}\frac{p_{k}}{q_{k}}

        Here p is the true distribution and q is the non-true distribution, i.e., the model distribution or an approximation of p.

        So relative entropy = cross entropy - information entropy. What, then, are cross entropy and information entropy?

        Information entropy measures the disorder of a random variable's distribution, or the uncertainty of the whole system; the more disordered the variable or the more uncertain the system, the larger the entropy. Entropy is maximal when the distribution is uniform.

        Cross entropy measures, under a given true distribution, how much effort is needed to remove the system's uncertainty using the strategy specified by the non-true distribution. Cross entropy is always greater than or equal to information entropy.

        Relative entropy measures the difference between the true distribution and the non-true distribution.

        For details, see this article:

如何通俗的解释交叉熵与相对熵? - 知乎 (zhihu.com)

        Now the problem is simple: our goal is to adjust the quantization domain, which in effect changes the true distribution, so that the relative entropy between the modified true distribution before quantization and the distribution after quantization is as small as possible.

 7.4.3.3 Implementation

  1. Prepare a calibration dataset of about 500 images (TensorRT's official recommendation);
  2. Run the FP32 network on the calibration dataset and collect histograms of the activation values;
  3. Repeatedly adjust the threshold, compute the relative entropy for each candidate, and take the optimum.

        The official pseudocode is as follows:

 Figure 19 Official pseudocode for KL calibration

  1. Divide the histogram collected on the calibration set into 2048 bins (official recommendation);
  2. For i in [128, 2048), repeat steps 3-5:
  3. Accumulate all values in the bins after bin i into bin i-1, normalize the first i bins, and use the result as the distribution P (the "true" distribution);
  4. Quantize P to obtain Q and normalize it;
  5. Compute the relative entropy of P and Q;
  6. Take the i with the smallest relative entropy; the threshold is T = (i + 0.5) × bin width. (A simplified code sketch of this search follows below.)
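        The following C++ sketch shows that search in a simplified form (my own illustration, not NVIDIA's implementation): P is the clipped histogram with the tail folded into its last bin, Q collapses those i bins into 128 levels and expands them back, and the i with the smallest KL divergence gives the threshold.

#include <cmath>
#include <cstddef>
#include <vector>

// KL(P||Q) = sum_k p_k * log2(p_k / q_k); bins where p_k == 0 contribute nothing.
static double klDivergence(const std::vector<double>& p, const std::vector<double>& q)
{
    double kl = 0.0;
    for (size_t k = 0; k < p.size(); ++k)
        if (p[k] > 0 && q[k] > 0)
            kl += p[k] * std::log2(p[k] / q[k]);
    return kl;
}

static void normalize(std::vector<double>& v)
{
    double s = 0.0;
    for (double x : v) s += x;
    if (s > 0)
        for (double& x : v) x /= s;
}

// Search the clipping threshold over a histogram of |activation| values
// (2048 bins recommended). Returns T = (bestI + 0.5) * binWidth.
double searchThreshold(const std::vector<double>& hist, double binWidth)
{
    const int nBins = static_cast<int>(hist.size());
    const int nQuant = 128; // number of positive INT8 levels
    double bestKL = 1e300;
    int bestI = nQuant;

    for (int i = nQuant; i < nBins; ++i)
    {
        // P: the first i bins, with everything beyond bin i folded into bin i-1.
        std::vector<double> P(hist.begin(), hist.begin() + i);
        for (int k = i; k < nBins; ++k) P[i - 1] += hist[k];

        // Q: collapse the i bins of P into 128 levels, then expand back,
        // spreading each level's mass uniformly over its non-empty source bins.
        std::vector<double> Q(i, 0.0);
        const double groupSize = static_cast<double>(i) / nQuant;
        for (int g = 0; g < nQuant; ++g)
        {
            const int start = static_cast<int>(g * groupSize);
            const int end = static_cast<int>((g + 1) * groupSize);
            double mass = 0.0;
            int nonEmpty = 0;
            for (int k = start; k < end; ++k)
            {
                mass += P[k];
                if (P[k] > 0) ++nonEmpty;
            }
            if (nonEmpty == 0) continue;
            for (int k = start; k < end; ++k)
                if (P[k] > 0) Q[k] = mass / nonEmpty;
        }

        normalize(P);
        normalize(Q);
        const double kl = klDivergence(P, Q);
        if (kl < bestKL) { bestKL = kl; bestI = i; }
    }
    return (bestI + 0.5) * binWidth;
}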

 7.4.3.4 Activation distribution after calibration

 Figure 20 Distribution after calibration 1 (official)

 Figure 21 Distribution after calibration 2 (official)

 Figure 22 Distribution after calibration 3 (official)

7.5 Accuracy and speed improvement after quantization

        The accuracy after acceleration barely drops.

 Figure 23 Accuracy after quantization (official)

         Comparing the speed-up after quantization, the results show that the acceleration effect differs across GPUs and improves as the batch size increases.

 Figure 24 Speed-up after quantization (official)

7.6 Summary

  • An automatic, parameter-free FP32-to-INT8 conversion method;
  • The quantization threshold is chosen by minimizing the KL divergence;
  • Accuracy is almost unchanged after quantization, and speed improves substantially.

8. Implementing TensorRT INT8 quantization in C++

8.1 Program flow

        What TensorRT does is really just one thing: convert a model trained in another framework into an Engine, and then use that Engine for inference. The supported frameworks include ONNX, TensorFlow and others, see Figure 25.

 Figure 25 TensorRT workflow (official)

        Now for the concrete details. Readers can also refer to the official sample, sampleINT8; since the code is long it is placed in the appendix for interested readers.

 Figure 26 TensorRT INT8 program flow chart

  1. Create the Builder, and use the Builder to create the Network that stores the model information;
  2. Use the Network to create the Parser, which parses the model information from the ONNX file and passes it back into the Network;
  3. Use the Builder to create a Profile for setting dynamic dimensions, obtaining the dynamic dimension information from the dynamic bindings;
  4. Create the Calibrator used to calibrate the model, loading the calibration dataset through a BatchStream;
  5. Use the Builder to create the Config that configures engine generation, including the Calibrator and the Profile;
  6. The Builder generates the Engine, together with the calibration parameters calParameter, from the model information in the Network and the parameters in the Config;
  7. Load the test dataset through a BatchStream, feed it to the Engine, and obtain the final result.

        Note in particular the Calibrator and the BatchStream: both classes need to be rewritten according to the needs of the project, and the core builder calls are sketched below.
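        The fragment below is condensed from constructNetwork() in the sampleINT8.cpp attached in the appendix; it is not a standalone program, just the calls that enable INT8 mode and attach the entropy calibrator before the engine is built (variable names follow the sample).

// Enable INT8 mode and attach the entropy calibrator before building the engine
// (condensed from constructNetwork() in sampleINT8.cpp, see the appendix).
std::unique_ptr<IInt8Calibrator> calibrator;
config->setFlag(BuilderFlag::kINT8);

MNISTBatchStream calibrationStream(mParams.calBatchSize, mParams.nbCalBatches,
    "train-images-idx3-ubyte", "train-labels-idx1-ubyte", mParams.dataDirs);
calibrator.reset(new Int8EntropyCalibrator2<MNISTBatchStream>(
    calibrationStream, 0, mParams.networkName.c_str(), mParams.inputTensorNames[0].c_str()));
config->setInt8Calibrator(calibrator.get());

// Build the serialized engine with the INT8 configuration.
SampleUniquePtr<IHostMemory> plan{builder->buildSerializedNetwork(*network, *config)};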

8.2 Calibrator

        To feed the calibration dataset into TensorRT we need the IInt8Calibrator abstract class. TensorRT provides four kinds of IInt8Calibrator:

  1. IInt8EntropyCalibrator2: the calibrator best suited to convolutional networks (CNNs), and the one used in this article;
  2. IInt8MinMaxCalibrator: better suited to natural language processing (NLP);
  3. IInt8EntropyCalibrator: deprecated;
  4. IInt8LegacyCalibrator: deprecated, and requires the user to set parameters manually.

        The functionality an IInt8Calibrator implements is also simple:

  1. getBatchSize: returns the batch size used during calibration;
  2. getBatch: supplies the inputs used during calibration;
  3. writeCalibrationCache: since calibration takes quite a long time, this function writes the calibration result to a local file so that it can be read back directly next time;
  4. readCalibrationCache: reads the locally saved calibration file; it is called automatically while the engine is being generated.

         Without further ado, here is the code. My own project code is not convenient to share, so the official code is shown instead; what I changed were mainly some function parameters, and the overall functionality is the same.

/*
 * Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#ifndef ENTROPY_CALIBRATOR_H
#define ENTROPY_CALIBRATOR_H

#include "BatchStream.h"
#include "NvInfer.h"

//! \class EntropyCalibratorImpl
//!
//! \brief Implements common functionality for Entropy calibrators.
//!
template <typename TBatchStream>
class EntropyCalibratorImpl
{
public:
    EntropyCalibratorImpl(
        TBatchStream stream, int firstBatch, std::string networkName, const char* inputBlobName, bool readCache = true)
        : mStream{stream}
        , mCalibrationTableName("CalibrationTable" + networkName)
        , mInputBlobName(inputBlobName)
        , mReadCache(readCache)
    {
        nvinfer1::Dims dims = mStream.getDims();
        mInputCount = samplesCommon::volume(dims);
        CHECK(cudaMalloc(&mDeviceInput, mInputCount * sizeof(float)));
        mStream.reset(firstBatch);
    }

    virtual ~EntropyCalibratorImpl()
    {
        CHECK(cudaFree(mDeviceInput));
    }

    int getBatchSize() const noexcept
    {
        return mStream.getBatchSize();
    }

    bool getBatch(void* bindings[], const char* names[], int nbBindings) noexcept
    {
        if (!mStream.next())
        {
            return false;
        }
        CHECK(cudaMemcpy(mDeviceInput, mStream.getBatch(), mInputCount * sizeof(float), cudaMemcpyHostToDevice));
        ASSERT(!strcmp(names[0], mInputBlobName));
        bindings[0] = mDeviceInput;
        return true;
    }

    const void* readCalibrationCache(size_t& length) noexcept
    {
        mCalibrationCache.clear();
        std::ifstream input(mCalibrationTableName, std::ios::binary);
        input >> std::noskipws;
        if (mReadCache && input.good())
        {
            std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(),
                std::back_inserter(mCalibrationCache));
        }
        length = mCalibrationCache.size();
        return length ? mCalibrationCache.data() : nullptr;
    }

    void writeCalibrationCache(const void* cache, size_t length) noexcept
    {
        std::ofstream output(mCalibrationTableName, std::ios::binary);
        output.write(reinterpret_cast<const char*>(cache), length);
    }

private:
    TBatchStream mStream;
    size_t mInputCount;
    std::string mCalibrationTableName;
    const char* mInputBlobName;
    bool mReadCache{true};
    void* mDeviceInput{nullptr};
    std::vector<char> mCalibrationCache;
};

//! \class Int8EntropyCalibrator2
//!
//! \brief Implements Entropy calibrator 2.
//!  CalibrationAlgoType is kENTROPY_CALIBRATION_2.
//!
template <typename TBatchStream>
class Int8EntropyCalibrator2 : public IInt8EntropyCalibrator2
{
public:
    Int8EntropyCalibrator2(
        TBatchStream stream, int firstBatch, const char* networkName, const char* inputBlobName, bool readCache = true)
        : mImpl(stream, firstBatch, networkName, inputBlobName, readCache)
    {
    }

    int getBatchSize() const noexcept override
    {
        return mImpl.getBatchSize();
    }

    bool getBatch(void* bindings[], const char* names[], int nbBindings) noexcept override
    {
        return mImpl.getBatch(bindings, names, nbBindings);
    }

    const void* readCalibrationCache(size_t& length) noexcept override
    {
        return mImpl.readCalibrationCache(length);
    }

    void writeCalibrationCache(const void* cache, size_t length) noexcept override
    {
        mImpl.writeCalibrationCache(cache, length);
    }

private:
    EntropyCalibratorImpl<TBatchStream> mImpl;
};

#endif // ENTROPY_CALIBRATOR_H

         As can be seen from the code, the calibrator class does not implement getBatchSize and getBatch directly; instead it delegates to the TBatchStream template class. That is what BatchStream is for.

8.3 BatchStream

        The BatchStream class derives from IBatchStream. Its job is to read data and labels from a given dataset, perform the preprocessing, and iterate over the data and labels with the required batch size. Specifically:

  1. reset: set the starting batch index;
  2. next: advance the index by one to read the next batch, until the dataset has been fully traversed;
  3. skip: jump to the batch at a given index;
  4. getBatch: get the data of the current batch;
  5. getLabels: get the labels of the current batch;
  6. getBatchesRead: get the current index;
  7. getBatchSize: get the batch size;
  8. getDims: get the dimensions of the current data;
  9. readDataFile: read the data from the dataset;
  10. readLabelsFile: read the labels from the dataset.

        The first eight are functions already defined in IBatchStream; the last two are private helper functions and are not mandatory.

        Again, the official code is shown below; it contains a class rewritten for the MNIST task, and readers will also need to rewrite it for their own needs.

/*
 * Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#ifndef BATCH_STREAM_H
#define BATCH_STREAM_H

#include "NvInfer.h"
#include "common.h"
#include <algorithm>
#include <stdio.h>
#include <vector>

class IBatchStream
{
public:
    virtual void reset(int firstBatch) = 0;
    virtual bool next() = 0;
    virtual void skip(int skipCount) = 0;
    virtual float* getBatch() = 0;
    virtual float* getLabels() = 0;
    virtual int getBatchesRead() const = 0;
    virtual int getBatchSize() const = 0;
    virtual nvinfer1::Dims getDims() const = 0;
};

class MNISTBatchStream : public IBatchStream
{
public:
    MNISTBatchStream(int batchSize, int maxBatches, const std::string& dataFile, const std::string& labelsFile,
        const std::vector<std::string>& directories)
        : mBatchSize{batchSize}
        , mMaxBatches{maxBatches}
        , mDims{3, {1, 28, 28}} //!< We already know the dimensions of MNIST images.
    {
        readDataFile(locateFile(dataFile, directories));
        readLabelsFile(locateFile(labelsFile, directories));
    }

    void reset(int firstBatch) override
    {
        mBatchCount = firstBatch;
    }

    bool next() override
    {
        if (mBatchCount >= mMaxBatches)
        {
            return false;
        }
        ++mBatchCount;
        return true;
    }

    void skip(int skipCount) override
    {
        mBatchCount += skipCount;
    }

    float* getBatch() override
    {
        return mData.data() + (mBatchCount * mBatchSize * samplesCommon::volume(mDims));
    }

    float* getLabels() override
    {
        return mLabels.data() + (mBatchCount * mBatchSize);
    }

    int getBatchesRead() const override
    {
        return mBatchCount;
    }

    int getBatchSize() const override
    {
        return mBatchSize;
    }

    nvinfer1::Dims getDims() const override
    {
        return Dims{4, {mBatchSize, mDims.d[0], mDims.d[1], mDims.d[2]}};
    }

private:
    void readDataFile(const std::string& dataFilePath)
    {
        std::ifstream file{dataFilePath.c_str(), std::ios::binary};

        int magicNumber, numImages, imageH, imageW;
        file.read(reinterpret_cast<char*>(&magicNumber), sizeof(magicNumber));
        // All values in the MNIST files are big endian.
        magicNumber = samplesCommon::swapEndianness(magicNumber);
        ASSERT(magicNumber == 2051 && "Magic Number does not match the expected value for an MNIST image set");

        // Read number of images and dimensions
        file.read(reinterpret_cast<char*>(&numImages), sizeof(numImages));
        file.read(reinterpret_cast<char*>(&imageH), sizeof(imageH));
        file.read(reinterpret_cast<char*>(&imageW), sizeof(imageW));

        numImages = samplesCommon::swapEndianness(numImages);
        imageH = samplesCommon::swapEndianness(imageH);
        imageW = samplesCommon::swapEndianness(imageW);

        // The MNIST data is made up of unsigned bytes, so we need to cast to float and normalize.
        int numElements = numImages * imageH * imageW;
        std::vector<uint8_t> rawData(numElements);
        file.read(reinterpret_cast<char*>(rawData.data()), numElements * sizeof(uint8_t));
        mData.resize(numElements);
        std::transform(
            rawData.begin(), rawData.end(), mData.begin(), [](uint8_t val) { return static_cast<float>(val) / 255.f; });
    }

    void readLabelsFile(const std::string& labelsFilePath)
    {
        std::ifstream file{labelsFilePath.c_str(), std::ios::binary};
        int magicNumber, numImages;
        file.read(reinterpret_cast<char*>(&magicNumber), sizeof(magicNumber));
        // All values in the MNIST files are big endian.
        magicNumber = samplesCommon::swapEndianness(magicNumber);
        ASSERT(magicNumber == 2049 && "Magic Number does not match the expected value for an MNIST labels file");

        file.read(reinterpret_cast<char*>(&numImages), sizeof(numImages));
        numImages = samplesCommon::swapEndianness(numImages);

        std::vector<uint8_t> rawLabels(numImages);
        file.read(reinterpret_cast<char*>(rawLabels.data()), numImages * sizeof(uint8_t));
        mLabels.resize(numImages);
        std::transform(
            rawLabels.begin(), rawLabels.end(), mLabels.begin(), [](uint8_t val) { return static_cast<float>(val); });
    }

    int mBatchSize{0};
    int mBatchCount{0}; //!< The batch that will be read on the next invocation of next()
    int mMaxBatches{0};
    Dims mDims{};
    std::vector<float> mData{};
    std::vector<float> mLabels{};
};

#endif // BATCH_STREAM_H

        This covers the overall program flow and the key parts of the implementation.

9. Quantization effect test

        Finally, to measure the actual effect of quantization, I selected models of different complexity for testing: AlexNet, ResNet-50 and VGG-13. Their parameter counts and FLOPS (floating-point operations per second) can be found in Table 1 of Chapter 2 and in Figure 27 below.

 a. Comparison of parameter counts

 b. Comparison of computation

Figure 27 Comparison of the three models

        Referring to Chapter 3, the comparison items are clear: engine size (engine size differs somewhat from model size, and comparing engine size is more meaningful), power consumption, GPU memory usage, inference speed and detection accuracy. The first three affect the hardware requirements, and we group them as physical performance; the last two reflect the engine's detection quality and form the second group, detection performance.

         In addition, power consumption and GPU memory usage are read with NVIDIA's command-line tool; entering the following command in an Anaconda prompt prints detailed information about the graphics card, see Figure 28.

nvidia-smi -l 2   # the -l parameter sets the refresh interval in seconds

 Figure 28 Graphics card information

Field descriptions:

  • GPU: index of the GPU in this machine (numbered from 0 when there are multiple cards); in the figure the GPU index is 0

  • Fan: fan speed (0%-100%); N/A means there is no fan

  • Name: GPU model; in the figure it is a Tesla T4

  • Temp: GPU temperature (an overheated GPU lowers its clock frequency)

  • Perf: GPU performance state, from P0 (maximum performance) to P12 (minimum performance); in the figure it is P0

  • Persistence-M: persistence mode status; persistence mode consumes more power but reduces the startup time of new GPU applications; in the figure it is off

  • Pwr: Usage/Cap: power consumption; Usage is the current draw, Cap is the maximum

  • Bus-Id: GPU bus information, domain:bus:device.function

  • Disp.A: Display Active, whether the GPU's display output is initialized

  • Memory-Usage: GPU memory usage

  • Volatile GPU-Util: GPU utilization

  • Uncorr. ECC: whether ECC error checking and correction is enabled, 0/disabled, 1/enabled

  • Compute M.: compute mode, 0/DEFAULT, 1/EXCLUSIVE_PROCESS, 2/PROHIBITED

  • Processes: the GPU memory usage, process ID and GPU used by each process

        The results can also be written directly to a file with the following command; for details see GPU之nvidia-smi命令详解 - 简书 (jianshu.com).

nvidia-smi -l 2 --format=csv --filename=gpucost.csv --query-gpu=timestamp,memory.total,memory.used

9.1 Test environment

  • GPU NVIDIA GeForce RTX 2060
  • CUDA 10.2.89
  • CUDNN 7.6.5
  • TensorRT 7.2.1

9.2 Physical performance

       Note: except when BatchSize or DataSize is itself the variable under test, all other tests use BatchSize=5 and DataSize=500 images.

9.2.1 Engine size

        Combined with Figure 27a, the engine size is positively correlated with the model's parameter count, and it drops markedly as the quantization bit width decreases: about 80% from FP32 to INT8, and about 50% from FP16 to INT8.

 Figure 29 Engine size

9.2.2 Power consumption

        Combined with Figure 27b, the overall trend is that the larger the model's FLOPs, the larger the relative drop in power consumption.

  Figure 30 Power consumption

9.2.3 GPU memory usage

        Combined with Figure 27b, for computation-heavy models the drop in GPU memory usage is obvious, but the upper limit appears to be about 50%.

 Figure 31 GPU memory usage

9.3 Detection performance

        Detection performance is what we care about most; after all, the main purpose of quantization is to improve speed while maintaining accuracy.

9.3.1 Inference speed

        From the chart we can see that the inference speed after INT8 quantization improves significantly over both FP32 and FP16. Relative to FP32, the higher the model's FLOPs, the better the speed-up, with VGG-13 reaching a 7x improvement; from FP16 to INT8, however, the gain has little to do with model complexity and is slightly below 2x.

  Figure 32 Inference speed

 9.3.2 Detection accuracy

        The accuracy on the dataset after INT8 quantization is almost unchanged compared with FP32, which directly validates the feasibility of INT8 quantization.

        On this basis, we further test how the size of the calibration dataset and the inference BatchSize affect detection accuracy. First, the size of the calibration dataset: as the calibration set grows, accuracy drops slightly. Considering that the official recommendation is 500 calibration images, whether other factors are involved can be verified further when there is a chance.

Accuracy (%), BatchSize=5        FP32     FP16     INT8 (calibration set size)
                                                   100      300      500      700      900
Network   AlexNet                87.84%   87.73%   88.64%   88.52%   88.52%   88.52%   88.41%
          ResNet50               97.61%   97.61%   97.73%   97.61%   97.61%   97.50%   97.39%
          VGG-13                 97.39%   97.39%   97.39%   97.27%   97.16%   97.16%   97.05%
Accuracy change from FP32 to INT8 (%)
          AlexNet                                  0.80%    0.68%    0.68%    0.68%    0.57%
          ResNet50                                 0.11%    0.00%    0.00%    -0.11%   -0.23%
          VGG-13                                   0.00%    -0.11%   -0.23%   -0.23%   -0.34%

Table 7 Detection accuracy test

 Figure 33 Detection accuracy test

        We then compared the effect of BatchSize. The chart shows the inference speed first rising and then falling, peaking at BatchSize=8. According to the official data there should be no falling stage; I suspect it is limited by the graphics card's compute capability and could be re-tested on other cards when there is a chance.

DataSize=500                     BatchSize
                                 1         2         4         8         16        32        64        128
Per-image inference time (us)
Network   AlexNet                132.199   122.701   122.666   119.574   122.919   121.464   142.217   177.197
          ResNet50               476.866   446.152   436.84    405.513   448.45    473.267   454.716   433.675
          VGG-13                 617.144   618.869   614.11    609.547   625.358   633.083   634.107   691.968
Change in per-image inference time relative to BatchSize=1 (%)
          AlexNet                0.00%     7.18%     7.21%     9.55%     7.02%     8.12%     -7.58%    -34.04%
          ResNet50               0.00%     6.44%     8.39%     14.96%    5.96%     0.75%     4.64%     9.06%
          VGG-13                 0.00%     -0.28%    0.49%     1.23%     -1.33%    -2.58%    -2.75%    -12.12%

9.4 Summary

  1. Both the FP32-to-FP16 and the FP16-to-INT8 conversion reduce the engine size by about 50%, and both effectively reduce power consumption and GPU memory usage;
  2. FP32-to-INT8 conversion greatly increases inference speed, roughly in proportion to the model's FLOPs, while FP16-to-INT8 only gives about a 2x speed-up;
  3. Accuracy after INT8 quantization is almost the same as FP32, but drops slightly as the calibration dataset grows (the latter point is uncertain);
  4. Inference speed after INT8 quantization increases with BatchSize, but is limited by the GPU's compute capability (the latter point is uncertain).

10. Appendix

sampleINT8.cpp

/*
 * Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

//!
//! SampleINT8.cpp
//! This file contains the implementation of the sample. It creates the network using
//! the caffe model.
//! It can be run with the following command line:
//! Command: ./sample_int8 [-h or --help] [-d=/path/to/data/dir or --datadir=/path/to/data/dir]
//!

#include "BatchStream.h"
#include "EntropyCalibrator.h"
#include "argsParser.h"
#include "buffers.h"
#include "common.h"
#include "logger.h"

#include "NvCaffeParser.h"
#include "NvInfer.h"
#include <cuda_runtime_api.h>

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>

using samplesCommon::SampleUniquePtr;

const std::string gSampleName = "TensorRT.sample_int8";

//!
//! \brief The SampleINT8Params structure groups the additional parameters required by
//!         the INT8 sample.
//!
struct SampleINT8Params : public samplesCommon::CaffeSampleParams
{
    int nbCalBatches;        //!< The number of batches for calibration
    int calBatchSize;        //!< The calibration batch size
    std::string networkName; //!< The name of the network
};

//! \brief  The SampleINT8 class implements the INT8 sample
//!
//! \details It creates the network using a caffe model
//!
class SampleINT8
{
public:
    SampleINT8(const SampleINT8Params& params)
        : mParams(params)
        , mEngine(nullptr)
    {
        initLibNvInferPlugins(&sample::gLogger.getTRTLogger(), "");
    }

    //!
    //! \brief Function builds the network engine
    //!
    bool build(DataType dataType);

    //!
    //! \brief Runs the TensorRT inference engine for this sample
    //!
    bool infer(std::vector<float>& score, int firstScoreBatch, int nbScoreBatches);

    //!
    //! \brief Cleans up any state created in the sample class
    //!
    bool teardown();

private:
    SampleINT8Params mParams; //!< The parameters for the sample.

    nvinfer1::Dims mInputDims; //!< The dimensions of the input to the network.

    std::shared_ptr<nvinfer1::ICudaEngine> mEngine; //!< The TensorRT engine used to run the network

    //!
    //! \brief Parses a Caffe model and creates a TensorRT network
    //!
    bool constructNetwork(SampleUniquePtr<nvinfer1::IBuilder>& builder,
        SampleUniquePtr<nvinfer1::INetworkDefinition>& network, SampleUniquePtr<nvinfer1::IBuilderConfig>& config,
        SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, DataType dataType);

    //!
    //! \brief Reads the input and stores it in a managed buffer
    //!
    bool processInput(const samplesCommon::BufferManager& buffers, const float* data);

    //!
    //! \brief Scores model
    //!
    int calculateScore(
        const samplesCommon::BufferManager& buffers, float* labels, int batchSize, int outputSize, int threshold);
};

//!
//! \brief Creates the network, configures the builder and creates the network engine
//!
//! \details This function creates the network by parsing the caffe model and builds
//!          the engine that will be used to run the model (mEngine)
//!
//! \return Returns true if the engine was created successfully and false otherwise
//!
bool SampleINT8::build(DataType dataType)
{

    auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger()));
    if (!builder)
    {
        return false;
    }

    if ((dataType == DataType::kINT8 && !builder->platformHasFastInt8())
        || (dataType == DataType::kHALF && !builder->platformHasFastFp16()))
    {
        return false;
    }

    auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0));
    if (!network)
    {
        return false;
    }

    auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    if (!config)
    {
        return false;
    }

    auto parser = SampleUniquePtr<nvcaffeparser1::ICaffeParser>(nvcaffeparser1::createCaffeParser());
    if (!parser)
    {
        return false;
    }

    auto constructed = constructNetwork(builder, network, config, parser, dataType);
    if (!constructed)
    {
        return false;
    }

    ASSERT(network->getNbInputs() == 1);
    mInputDims = network->getInput(0)->getDimensions();
    ASSERT(mInputDims.nbDims == 3);

    return true;
}

//!
//! \brief Uses a caffe parser to create the network and marks the
//!        output layers
//!
//! \param network Pointer to the network that will be populated with the network
//!
//! \param builder Pointer to the engine builder
//!
bool SampleINT8::constructNetwork(SampleUniquePtr<nvinfer1::IBuilder>& builder,
    SampleUniquePtr<nvinfer1::INetworkDefinition>& network, SampleUniquePtr<nvinfer1::IBuilderConfig>& config,
    SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, DataType dataType)
{
    mEngine = nullptr;
    const nvcaffeparser1::IBlobNameToTensor* blobNameToTensor
        = parser->parse(locateFile(mParams.prototxtFileName, mParams.dataDirs).c_str(),
            locateFile(mParams.weightsFileName, mParams.dataDirs).c_str(), *network,
            dataType == DataType::kINT8 ? DataType::kFLOAT : dataType);

    for (auto& s : mParams.outputTensorNames)
    {
        network->markOutput(*blobNameToTensor->find(s.c_str()));
    }

    // Calibrator life time needs to last until after the engine is built.
    std::unique_ptr<IInt8Calibrator> calibrator;

    config->setAvgTimingIterations(1);
    config->setMinTimingIterations(1);
    config->setMaxWorkspaceSize(1_GiB);
    if (dataType == DataType::kHALF)
    {
        config->setFlag(BuilderFlag::kFP16);
    }
    if (dataType == DataType::kINT8)
    {
        config->setFlag(BuilderFlag::kINT8);
    }
    builder->setMaxBatchSize(mParams.batchSize);

    if (dataType == DataType::kINT8)
    {
        MNISTBatchStream calibrationStream(mParams.calBatchSize, mParams.nbCalBatches, "train-images-idx3-ubyte",
            "train-labels-idx1-ubyte", mParams.dataDirs);
        calibrator.reset(new Int8EntropyCalibrator2<MNISTBatchStream>(
            calibrationStream, 0, mParams.networkName.c_str(), mParams.inputTensorNames[0].c_str()));
        config->setInt8Calibrator(calibrator.get());
    }

    if (mParams.dlaCore >= 0)
    {
        samplesCommon::enableDLA(builder.get(), config.get(), mParams.dlaCore);
        if (mParams.batchSize > builder->getMaxDLABatchSize())
        {
            sample::gLogError << "Requested batch size " << mParams.batchSize
                              << " is greater than the max DLA batch size of " << builder->getMaxDLABatchSize()
                              << ". Reducing batch size accordingly." << std::endl;
            return false;
        }
    }

    // CUDA stream used for profiling by the builder.
    auto profileStream = samplesCommon::makeCudaStream();
    if (!profileStream)
    {
        return false;
    }
    config->setProfileStream(*profileStream);

    SampleUniquePtr<IHostMemory> plan{builder->buildSerializedNetwork(*network, *config)};
    if (!plan)
    {
        return false;
    }

    SampleUniquePtr<IRuntime> runtime{createInferRuntime(sample::gLogger.getTRTLogger())};
    if (!runtime)
    {
        return false;
    }

    mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(
        runtime->deserializeCudaEngine(plan->data(), plan->size()), samplesCommon::InferDeleter());
    if (!mEngine)
    {
        return false;
    }

    return true;
}

//!
//! \brief Runs the TensorRT inference engine for this sample
//!
//! \details This function is the main execution function of the sample. It allocates the buffer,
//!          sets inputs and executes the engine.
//!
bool SampleINT8::infer(std::vector<float>& score, int firstScoreBatch, int nbScoreBatches)
{
    float ms{0.0f};

    // Create RAII buffer manager object
    samplesCommon::BufferManager buffers(mEngine, mParams.batchSize);

    auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
    if (!context)
    {
        return false;
    }

    MNISTBatchStream batchStream(mParams.batchSize, nbScoreBatches + firstScoreBatch, "train-images-idx3-ubyte",
        "train-labels-idx1-ubyte", mParams.dataDirs);
    batchStream.skip(firstScoreBatch);

    Dims outputDims = context->getEngine().getBindingDimensions(
        context->getEngine().getBindingIndex(mParams.outputTensorNames[0].c_str()));
    int64_t outputSize = samplesCommon::volume(outputDims);
    int top1{0}, top5{0};
    float totalTime{0.0f};

    while (batchStream.next())
    {
        // Read the input data into the managed buffers
        ASSERT(mParams.inputTensorNames.size() == 1);
        if (!processInput(buffers, batchStream.getBatch()))
        {
            return false;
        }

        // Memcpy from host input buffers to device input buffers
        buffers.copyInputToDevice();

        cudaStream_t stream;
        CHECK(cudaStreamCreate(&stream));

        // Use CUDA events to measure inference time
        cudaEvent_t start, end;
        CHECK(cudaEventCreateWithFlags(&start, cudaEventBlockingSync));
        CHECK(cudaEventCreateWithFlags(&end, cudaEventBlockingSync));
        cudaEventRecord(start, stream);

        bool status = context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr);
        if (!status)
        {
            return false;
        }

        cudaEventRecord(end, stream);
        cudaEventSynchronize(end);
        cudaEventElapsedTime(&ms, start, end);
        cudaEventDestroy(start);
        cudaEventDestroy(end);

        totalTime += ms;

        // Memcpy from device output buffers to host output buffers
        buffers.copyOutputToHost();

        CHECK(cudaStreamDestroy(stream));

        top1 += calculateScore(buffers, batchStream.getLabels(), mParams.batchSize, outputSize, 1);
        top5 += calculateScore(buffers, batchStream.getLabels(), mParams.batchSize, outputSize, 5);

        if (batchStream.getBatchesRead() % 100 == 0)
        {
            sample::gLogInfo << "Processing next set of max 100 batches" << std::endl;
        }
    }

    int imagesRead = (batchStream.getBatchesRead() - firstScoreBatch) * mParams.batchSize;
    score[0] = float(top1) / float(imagesRead);
    score[1] = float(top5) / float(imagesRead);

    sample::gLogInfo << "Top1: " << score[0] << ", Top5: " << score[1] << std::endl;
    sample::gLogInfo << "Processing " << imagesRead << " images averaged " << totalTime / imagesRead << " ms/image and "
                     << totalTime / batchStream.getBatchesRead() << " ms/batch." << std::endl;

    return true;
}

//!
//! \brief Cleans up any state created in the sample class
//!
bool SampleINT8::teardown()
{
    //! Clean up the libprotobuf files as the parsing is complete
    //! \note It is not safe to use any other part of the protocol buffers library after
    //! ShutdownProtobufLibrary() has been called.
    nvcaffeparser1::shutdownProtobufLibrary();
    return true;
}

//!
//! \brief Reads the input and stores it in a managed buffer
//!
bool SampleINT8::processInput(const samplesCommon::BufferManager& buffers, const float* data)
{
    // Fill data buffer
    float* hostDataBuffer = static_cast<float*>(buffers.getHostBuffer(mParams.inputTensorNames[0]));
    std::memcpy(hostDataBuffer, data, mParams.batchSize * samplesCommon::volume(mInputDims) * sizeof(float));
    return true;
}

//!
//! \brief Scores model
//!
int SampleINT8::calculateScore(
    const samplesCommon::BufferManager& buffers, float* labels, int batchSize, int outputSize, int threshold)
{
    float* probs = static_cast<float*>(buffers.getHostBuffer(mParams.outputTensorNames[0]));

    int success = 0;
    for (int i = 0; i < batchSize; i++)
    {
        float *prob = probs + outputSize * i, correct = prob[(int) labels[i]];

        int better = 0;
        for (int j = 0; j < outputSize; j++)
        {
            if (prob[j] >= correct)
            {
                better++;
            }
        }
        if (better <= threshold)
        {
            success++;
        }
    }
    return success;
}

//!
//! \brief Initializes members of the params struct using the command line args
//!
SampleINT8Params initializeSampleParams(const samplesCommon::Args& args, int batchSize)
{
    SampleINT8Params params;
    // Use directories provided by the user, in addition to default directories.
    params.dataDirs = args.dataDirs;
    params.dataDirs.emplace_back("data/mnist/");
    params.dataDirs.emplace_back("int8/mnist/");
    params.dataDirs.emplace_back("samples/mnist/");
    params.dataDirs.emplace_back("data/samples/mnist/");
    params.dataDirs.emplace_back("data/int8/mnist/");
    params.dataDirs.emplace_back("data/int8_samples/mnist/");

    params.batchSize = batchSize;
    params.dlaCore = args.useDLACore;
    params.nbCalBatches = 10;
    params.calBatchSize = 50;
    params.inputTensorNames.push_back("data");
    params.outputTensorNames.push_back("prob");
    params.prototxtFileName = "deploy.prototxt";
    params.weightsFileName = "mnist_lenet.caffemodel";
    params.networkName = "mnist";
    return params;
}

//!
//! \brief Prints the help information for running this sample
//!
void printHelpInfo()
{
    std::cout << "Usage: ./sample_int8 [-h or --help] [-d or --datadir=<path to data directory>] "
                 "[--useDLACore=<int>]"
              << std::endl;
    std::cout << "--help, -h      Display help information" << std::endl;
    std::cout << "--datadir       Specify path to a data directory, overriding the default. This option can be used "
                 "multiple times to add multiple directories."
              << std::endl;
    std::cout << "--useDLACore=N  Specify a DLA engine for layers that support DLA. Value can range from 0 to n-1, "
                 "where n is the number of DLA engines on the platform."
              << std::endl;
    std::cout << "batch=N         Set batch size (default = 32)." << std::endl;
    std::cout << "start=N         Set the first batch to be scored (default = 16). All batches before this batch will "
                 "be used for calibration."
              << std::endl;
    std::cout << "score=N         Set the number of batches to be scored (default = 1800)." << std::endl;
}

int main(int argc, char** argv)
{
    if (argc >= 2 && (!strncmp(argv[1], "--help", 6) || !strncmp(argv[1], "-h", 2)))
    {
        printHelpInfo();
        return EXIT_SUCCESS;
    }

    // By default we score over 57600 images starting at 512, so we don't score those used to search calibration
    int batchSize = 32;
    int firstScoreBatch = 16;
    int nbScoreBatches = 1800;

    // Parse extra arguments
    for (int i = 1; i < argc; ++i)
    {
        if (!strncmp(argv[i], "batch=", 6))
        {
            batchSize = atoi(argv[i] + 6);
        }
        else if (!strncmp(argv[i], "start=", 6))
        {
            firstScoreBatch = atoi(argv[i] + 6);
        }
        else if (!strncmp(argv[i], "score=", 6))
        {
            nbScoreBatches = atoi(argv[i] + 6);
        }
    }

    if (batchSize > 128)
    {
        sample::gLogError << "Please provide batch size <= 128" << std::endl;
        return EXIT_FAILURE;
    }

    if ((firstScoreBatch + nbScoreBatches) * batchSize > 60000)
    {
        sample::gLogError << "Only 60000 images available" << std::endl;
        return EXIT_FAILURE;
    }

    samplesCommon::Args args;
    samplesCommon::parseArgs(args, argc, argv);

    SampleINT8 sample(initializeSampleParams(args, batchSize));

    auto sampleTest = sample::gLogger.defineTest(gSampleName, argc, argv);

    sample::gLogger.reportTestStart(sampleTest);

    sample::gLogInfo << "Building and running a GPU inference engine for INT8 sample" << std::endl;

    std::vector<std::string> dataTypeNames = {"FP32", "FP16", "INT8"};
    std::vector<std::string> topNames = {"Top1", "Top5"};
    std::vector<DataType> dataTypes = {DataType::kFLOAT, DataType::kHALF, DataType::kINT8};
    std::vector<std::vector<float>> scores(3, std::vector<float>(2, 0.0f));
    for (size_t i = 0; i < dataTypes.size(); i++)
    {
        sample::gLogInfo << dataTypeNames[i] << " run:" << nbScoreBatches << " batches of size " << batchSize
                         << " starting at " << firstScoreBatch << std::endl;

        if (!sample.build(dataTypes[i]))
        {
            if (!samplesCommon::isDataTypeSupported(dataTypes[i]))
            {
                sample::gLogWarning << "Skipping " << dataTypeNames[i]
                                    << " since the platform does not support this data type." << std::endl;
                continue;
            }
            return sample::gLogger.reportFail(sampleTest);
        }
        if (!sample.infer(scores[i], firstScoreBatch, nbScoreBatches))
        {
            return sample::gLogger.reportFail(sampleTest);
        }
    }

    auto isApproximatelyEqual = [](float a, float b, double tolerance) { return (std::abs(a - b) <= tolerance); };
    const double tolerance{0.01};
    const double goldenMNIST{0.99};

    if ((scores[0][0] < goldenMNIST) || (scores[0][1] < goldenMNIST))
    {
        sample::gLogError << "FP32 accuracy is less than 99%: Top1 = " << scores[0][0] << ", Top5 = " << scores[0][1]
                          << "." << std::endl;
        return sample::gLogger.reportFail(sampleTest);
    }

    for (unsigned i = 0; i < topNames.size(); i++)
    {
        for (unsigned j = 1; j < dataTypes.size(); j++)
        {
            if (scores[j][i] != 0.0f && !isApproximatelyEqual(scores[0][i], scores[j][i], tolerance))
            {
                sample::gLogError << "FP32(" << scores[0][i] << ") and " << dataTypeNames[j] << "(" << scores[j][i]
                                  << ") " << topNames[i] << " accuracy differ by more than " << tolerance << "."
                                  << std::endl;
                return sample::gLogger.reportFail(sampleTest);
            }
        }
    }

    if (!sample.teardown())
    {
        return sample::gLogger.reportFail(sampleTest);
    }

    return sample::gLogger.reportPass(sampleTest);
}

References

模型量化详解_WZZ18191171661的博客 - CSDN博客

模型量化了解一下? - 知乎 (zhihu.com)

神经网络量化简介 (qq.com)

Nvidia TensorRT文档——开发者指南 - 简书 (jianshu.com)

https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf

如何通俗的解释交叉熵与相对熵? - 知乎 (zhihu.com)

GPU之nvidia-smi命令详解 - 简书 (jianshu.com)

[1] 高晗, 田育龙, 许封元, 等. 深度学习模型压缩与加速综述[J]. 软件学报, 2021, 32(1): 25.

[2] Nagel M, Fournarakis M, Amjad R A, et al. A White Paper on Neural Network Quantization. 2021.
