Getting started with CUDA and TensorRT

CUDA

Official tutorial: CUDA C++ Programming Guide (nvidia.com)

1. Basic knowledge

First look at the relationship between graphics card, GPU, and CUDA:

Introduction to Graphics Card, GPU and CUDA - Wu Yiqi's Blog - CSDN Blog

Latency: the time it takes for an instruction to complete and return its result;

Throughput: the number of instructions processed per unit time;

CPUs

(figure)

The design is latency oriented;

It mainly has the following characteristics:

1. Large memory: multi-level cache structure improves memory access speed;

2. Complex control logic: branch prediction (for if-else branches) and pipeline data forwarding;

3. Powerful compute units: complex integer and floating-point operations are fast;

GPUs

(figure)

The design is throughput-oriented;

It mainly has the following characteristics:

1. Small caches: designed to improve memory throughput;

2. Simple control: no branch prediction, no data forwarding;

3. Simplified compute units: a large number of threads are used to tolerate (hide) latency, and heavy pipelining achieves high throughput;

Note: Video memory serves the same purpose as system memory: both are storage for temporarily holding data, but video memory serves the GPU while system memory serves the CPU;

Summary:

For a single complex instruction, the CPU's latency is more than 10 times lower than the GPU's;

Per unit time, the GPU can execute more than 10 times as many instructions as the CPU;

Questions to consider:

What kind of problems are suitable for GPU?

Compute-intensive problems: the amount of numerical computation far outweighs the memory accesses, so memory-access latency can be hidden by computation;

Data-parallel problems: a large task can be broken into small tasks that execute the same instructions, so there is little need for complex control flow;

CUDA

CUDA: launched by NVIDIA in 2007, with the original aim of giving the GPU an easy-to-use programming interface so that developers would not need to learn complex shading languages or graphics-processing primitives;

OpenCL: It is an open standard for parallel programming of heterogeneous platforms released in 2008, and it is also a programming framework;

The following is the specific structure diagram of CUDA programming:

(figure)

Among them, Device represents the GPU, Host represents the CPU, and Kernel represents the function running on the GPU;

Terminology: the memory-model hierarchy:

  • Each thread processor (SP) has its own registers;
  • Each SP also has its own local memory; registers and local memory are private to the individual thread;
  • Each streaming multiprocessor (SM) has its own shared memory, which can be accessed by all threads in the thread block;
  • All SMs of a GPU share one global memory, which can be accessed by threads from different thread blocks (see the sketch below);
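
A minimal sketch of where these memory spaces show up in device code (the kernel and variable names are illustrative, not from the original):

__global__ void memoryScopes(float* out)        // "out" points to global memory (visible to all blocks)
{
    int reg = threadIdx.x;                      // register: private to this thread
    float local_buf[8];                         // local memory: also private to this thread
    __shared__ float tile[32];                  // shared memory: visible to every thread in this block

    local_buf[0] = static_cast<float>(reg);
    if (threadIdx.x < 32)
        tile[threadIdx.x] = local_buf[0];       // threads of the same block can share data here
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];              // results go back to global memory for the host to read
}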

Terminology: software mapping:

In CUDA, the hardware concepts map to software concepts as follows:

The thread processor (SP) corresponds to a thread;

The streaming multiprocessor (SM) corresponds to a thread block;

The device (GPU) corresponds to the grid of thread blocks;

A kernel can only be executed on one GPU at a time;

The concept of thread block:

Divide the thread array into multiple blocks. Threads in a block cooperate through shared memory, atomic operations, and barrier synchronization. Threads in different blocks cannot cooperate;

The concept of a grid (a combination of parallel thread blocks):

Kernels in CUDA are executed by a grid of threads, each thread has an index used to compute memory addresses and make control decisions;

Each thread block and each thread is given an ID, and every thread uses its index to determine which data it should process, as shown in the sketch below;
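
For example, in a one-dimensional grid of one-dimensional blocks, each thread can combine its block ID and thread ID into a unique global index (a sketch; the kernel name is illustrative):

__global__ void indexExample(float* data, int n)
{
    // blockIdx.x  : which block this thread belongs to
    // blockDim.x  : number of threads per block
    // threadIdx.x : this thread's position within its block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index
    if (i < n)                                       // guard against threads past the end of the data
        data[i] = 2.0f * data[i];
}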

The concept of a warp (thread bundle):

SMs adopt a SIMT (Single Instruction, Multiple Thread) architecture. The warp is the most basic execution unit: one warp contains 32 parallel threads that execute the same instruction on different data. The warp is essentially the smallest unit of execution scheduling on the GPU;

When a kernel is executed, the thread blocks in the grid are assigned to SMs; the threads of one thread block can only be scheduled on a single SM, while one SM can generally schedule multiple thread blocks, so a large number of threads may be distributed across different SMs. Each thread has its own program counter and status registers and executes the instruction on its own data, which is what SIMT means;

Since the warp size is 32, the number of threads in a block is usually set to a multiple of 32;

(figure)

Case: Vector Addition

Description: Vector addition satisfies the conditions for GPU execution: the computation is simple, it parallelizes easily, and memory access is light;

Implement the addition of two vectors in the CPU:

void vecAdd(float* A, float* B, float* C, int n)
{
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

A loop is used to add the elements of the two vectors one by one and store each result in the output vector;

Implement the addition of two vectors in the GPU:

It is mainly divided into the following steps:

  • Allocate memory space for data

    cudaError_t cudaMalloc (void **devPtr, size_t size): The two parameters are the address and the requested memory size;

    Function: Allocate objects in the global memory of the device;

    cudaError_t cudaFree (void *devPtr): the parameter is the pointer to the device memory to be released

    Function: release the object from the global memory of the device;

  • Define the kernel function (computation function)

  • data transmission

    This involves a data copy, the function is cudaError_t cudaMemcpy (void *dst, const void *src, size_t count, cudaMemcpyKind kind)

    Four options supported by cudaMemcpyKind: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice

    Function: the transmission of memory data between the host end and the device end;
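
Putting the steps above together, a minimal host-side sketch might look like this (the kernel definition appears later in this section; error checking is omitted):

#include <cuda_runtime.h>

__global__ void vecAddKernel(float* A, float* B, float* C, int n);  // defined later in this section

void vecAddOnDevice(float* h_A, float* h_B, float* h_C, int n)
{
    size_t size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Step 1: allocate space in device global memory
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Step 3: copy the inputs host -> device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Step 2: launch the kernel (computation function)
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);

    // Step 3: copy the result device -> host, then release device memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}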

The definition and call of the kernel function:

  • Functions executed on the GPU;
  • Generally modified by the identifier __global__;
  • The function is called with <<<parameter 1, parameter 2>>>, where the parameters describe the number of threads used by the kernel and how they are organized;
  • Threads are organized as a grid: each grid consists of several thread blocks, and each thread block consists of several threads;
  • The execution configuration of the kernel must be declared at the call site;
  • When programming, you must first allocate enough space for the arrays and variables used in the kernel and only then call the kernel; otherwise an error will occur during the GPU computation;

Identifiers for CUDA programming:

__global__ : the kernel's return type must be void; this is also the most commonly used identifier;

(figure)

CUDA programming process:

(figure)

There are several ways to compile, such as compiling file by file, compiling the whole CUDA file into a dynamic library, or using CMake (the most common)!

Code:

Implementation of vector addition in CPU:

void vecAdd(float* A, float* B, float* C, int n) {
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}

Implementation of vector addition in GPU:

__global__ void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

// The kernel is defined above; below is the call
vecAddKernel <<< blockPerGrid, threadPerBlock >>> (da, db, dc, n);

After compiling in the actual environment, run the two codes to see the time-consuming:

(figure)

It can be seen that, in the AI Studio environment, the GPU is about 9 times faster than the CPU for a vector with 100,000 elements;

Case: Matrix multiplication

As the most common calculation in deep learning tasks, matrix multiplication is also the focus of GPU optimization;

Ordinary matrix multiplication multiplies a row by a column to obtain one output element, i.e., one thread is responsible for computing one element;

(figure)

The main problem here is that data is read from global memory repeatedly, so a lot of time is wasted on memory reads;

Optimization idea:

Put the data accessed multiple times into the shared memory, reduce the number of repeated readings, and make full use of the advantages of low latency of the shared memory;

Memory read speed in CUDA:

  • Per-thread registers (about 1 cycle)
  • Per-block shared memory (about 5 cycles)
  • Per-grid global memory (about 500 cycles)
  • Per-grid constant memory (about 5 cycles)

(figure)

Shared memory in CUDA:

Concept: a special type of memory whose contents are explicitly declared and used in the source code;

(it resides on the processor chip, is accessed at higher speed through memory-access instructions, and is also known as scratchpad memory)

Features:

  • Its read speed is comparable to cache; on many GPUs, shared memory and the cache use the same piece of hardware, and the split between them is configurable;
  • Shared memory belongs to a thread block and can be accessed by all threads in that block;
  • There are two ways to allocate shared memory: static and dynamic;
  • Shared memory is only tens of KB in size; overusing it reduces the parallelism of the program;

Instructions:

  • Use the __shared__ keyword;
  • Note that tiles may need overlapping (halo) data: elements on the boundary must also be copied in;

Here is a thread synchronization function - __syncthreads():

Concept: a CUDA built-in function that acts as a barrier for all threads in a block; it is used for intra-block synchronization and communication;

Two ways to apply for shared memory:

1. Static mode:

__shared__ int s[64];

The shared-memory size is known at compile time;

2. Dynamic mode:

extern __shared__ int s[];

The shared-memory size is not known at compile time; the size in bytes is passed as the third parameter of the kernel launch;
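
A minimal sketch of the dynamic case (the kernel name and sizes are illustrative):

__global__ void dynamicSharedKernel(int n)
{
    extern __shared__ int s[];          // size is not known here; it is set at launch time
    int tid = threadIdx.x;
    if (tid < n)
        s[tid] = tid;                   // each thread fills one element
    __syncthreads();                    // make the writes visible to the whole block
}

// Launch: the third <<<>>> parameter is the shared-memory size in bytes.
// dynamicSharedKernel<<<1, 64, 64 * sizeof(int)>>>(64);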

Tiled matrix multiplication:

Principle: decompose the kernel's execution into multiple phases, so that the data accesses in each phase concentrate on one subset (tile) of Md and Nd;

(figure)

Of course, the built-in function __syncthreads() must be used to ensure that all elements of a tile are loaded before they are used;

Theoretical speedup: the original matrix multiplication performs 2mnk global-memory accesses, while the tiled version only needs 2mnk/block_size, a speedup of block_size times. Taking the synchronization calls and the shared-memory reads and writes into account, the actual speedup is lower than this;

Code:

Code for matrix multiplication under CPU:

void  multiplicateMatrixOnHost(float *array_A, float *array_B, float *array_C, int M_p, int K_p, int N_p)
{
	for (int i = 0; i < M_p; i++)
	{
		for (int j = 0; j < N_p; j++)
		{
			float sum = 0;
			for (int k = 0; k < K_p; k++)
			{
				sum += array_A[i*K_p + k] * array_B[k*N_p + j];
			}
			array_C[i*N_p + j] = sum;
		}
	}
}

Matrix multiplication code under GPU:

// Below: GPU implementation without shared memory
__global__ void multiplicateMatrixOnDevice(float *array_A, float *array_B, float *array_C, int M_p, int K_p, int N_p)
{
	int ix = threadIdx.x + blockDim.x*blockIdx.x;	// column index
	int iy = threadIdx.y + blockDim.y*blockIdx.y;	// row index

	if (ix < N_p && iy < M_p)
	{
		float sum = 0;
		for (int k = 0; k < K_p; k++)
		{
			sum += array_A[iy*K_p + k] * array_B[k*N_p + ix];
		}
		array_C[iy*N_p + ix] = sum;
	}
}

// Below: GPU implementation using shared memory
__global__ void matrixMultiplyShared(float *A, float *B, float *C,
	int numARows, int numAColumns, int numBRows, int numBColumns, int numCRows, int numCColumns)
{
	__shared__ float sharedM[BLOCK_SIZE][BLOCK_SIZE];
	__shared__ float sharedN[BLOCK_SIZE][BLOCK_SIZE];

	int bx = blockIdx.x;
	int by = blockIdx.y;
	int tx = threadIdx.x;
	int ty = threadIdx.y;

	int row = by * BLOCK_SIZE + ty;
	int col = bx * BLOCK_SIZE + tx;

	float Csub = 0.0;

	for (int i = 0; i < (int)(ceil((float)numAColumns / BLOCK_SIZE)); i++)
	{
		if (i*BLOCK_SIZE + tx < numAColumns && row < numARows)
			sharedM[ty][tx] = A[row*numAColumns + i * BLOCK_SIZE + tx];
		else
			sharedM[ty][tx] = 0.0;

		if (i*BLOCK_SIZE + ty < numBRows && col < numBColumns)
			sharedN[ty][tx] = B[(i*BLOCK_SIZE + ty)*numBColumns + col];
		else
			sharedN[ty][tx] = 0.0;
		__syncthreads();			// synchronize: wait until the tile is fully loaded

		for (int j = 0; j < BLOCK_SIZE; j++)
			Csub += sharedM[ty][j] * sharedN[j][tx];
		__syncthreads();			// synchronize: wait before loading the next tile
	}

	if (row < numCRows && col < numCColumns)
		C[row*numCColumns + col] = Csub;
}

// In CUDA's built-in cuBLAS library, cublasSgemm() can be called directly to perform matrix multiplication (it uses shared memory internally)

Here is the running time:

(figure)

Conclusion:

The GPU does achieve some speedup, but not a large one, because the measured time also includes the data copies. When multiplying large matrices, the copy time becomes negligible, and the GPU is then far faster than the CPU;

2. Advanced Learning

CUDA Stream

Concept: CUDA Stream is the execution queue of tasks on the GPU, and all CUDA operations (kernel, memory copy, etc.) are executed on the stream;

Types:

1. Implicit stream: also called the default stream or NULL stream;

All CUDA operations run in the implicit stream by default; GPU work in the implicit stream is synchronous with CPU-side computation, i.e., serial;

(figure)

2. Explicit stream: a stream explicitly created by the application;

GPU tasks in an explicit stream are asynchronous with respect to CPU-side computation, and GPU tasks in different explicit streams also execute asynchronously (in parallel) with respect to one another;

(figure)

Simple code example:

// Create two streams
cudaStream_t stream[2];		// stream objects
for (int i = 0; i < 2; ++i){
	cudaStreamCreate(&stream[i]);
}
float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);
...
// Two streams, three commands per stream
for (int i = 0; i < 2; ++i){
	// Copy data from host memory to device memory
	cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size, cudaMemcpyHostToDevice, stream[i]);
	// Run the kernel
	MyKernel<<<grid, block, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);
	// Copy the result from device memory back to host memory
	cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size, cudaMemcpyDeviceToHost, stream[i]);
}
// Synchronize the streams
for (int i = 0; i < 2; i++){
	cudaStreamSynchronize(stream[i]);
	...
}
// Destroy the streams
for (int i = 0; i < 2; ++i){
	cudaStreamDestroy(stream[i]);
}

Advantages:

  • CPU computation and kernel execution run in parallel;
  • CPU computation and data transfers run in parallel;
  • Data transfers and kernel execution run in parallel;
  • Multiple kernels can execute in parallel;

Important notes:

GPU tasks in explicit streams execute asynchronously with respect to the host, so you must take care of synchronization explicitly when using streams;

The following interfaces synchronize streams:

cudaStreamSynchronize(): synchronize a single stream;

cudaDeviceSynchronize(): Synchronize all streams on the device;

cudaStreamQuery(): Query whether a stream task is completed;

Let's look at a case: Data transmission and GPU computing are parallelized through streams

(figure)

Note: As can be seen from the task execution sequence in the figure, the data transfers do not overlap with each other, because CPU-GPU data transfer goes over the PCIe bus, and PCIe transfers are performed sequentially;

Questions:

1. Why is CUDA Stream effective?

  • The slow PCIe bus is the bottleneck: it leaves the GPU idle during data transfers, and multiple streams allow data transfer and kernel computation to overlap;
  • A single kernel often cannot use the whole GPU's compute capacity, and multiple streams let several kernels run at the same time;
  • More streams are not always better; as with multi-core CPUs, there is a practical limit on the useful number;

There is also an optimization strategy here, which is to merge small tasks into large tasks;

(figure)

It should also be noted that when the default stream and explicit streams are used together, an extra flag is needed at compile time;

nvcc --default-stream per-thread ./stream_test.cu -o stream_per-thread

CUDA Event

CUDA Event inserts an event into a stream, similar to a marker bit that records whether the stream has executed up to that point. An event has two states: completed and not completed;

The most common usage is to measure time:

// Use events to measure time
float time_elapsed = 0;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);	// record the current time
mul<<<blocks, threads, 0, 0>>>(dev_a, NUM);
cudaEventRecord(stop, 0);	// record the current time

cudaEventSynchronize(start);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time_elapsed, start, stop); // compute the time difference

cudaEventDestroy(start);
cudaEventDestroy(stop);
printf("Elapsed time: %f (ms)\n", time_elapsed);

CUDA synchronous operations (divided into four categories):

  • device synchronize: has the largest impact; the CPU must wait until all kernels on the device have finished before continuing;
  • stream synchronize: affects a single stream and the CPU; the CPU waits for that stream to finish before continuing;
  • event synchronize: affects the CPU; a more fine-grained form of synchronization;
  • synchronizing across streams using an event: advanced control (see the sketch below);
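
A minimal sketch of the last case, cross-stream synchronization with an event (the kernels and launch configuration are illustrative; d_x is assumed to be a device pointer with at least 32 floats; error checking omitted):

#include <cuda_runtime.h>

__global__ void kernelA(float* x) { x[threadIdx.x] += 1.0f; }
__global__ void kernelB(float* x) { x[threadIdx.x] *= 2.0f; }

void crossStreamSync(float* d_x)
{
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    cudaEvent_t event;
    cudaEventCreate(&event);

    kernelA<<<1, 32, 0, stream1>>>(d_x);     // work in stream1
    cudaEventRecord(event, stream1);         // mark this point in stream1

    cudaStreamWaitEvent(stream2, event, 0);  // stream2 waits until the event completes
    kernelB<<<1, 32, 0, stream2>>>(d_x);     // guaranteed to run after kernelA

    cudaStreamSynchronize(stream2);
    cudaEventDestroy(event);
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
}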

NVVP

Concept: NVIDIA Visual Profiler (NVVP) is a cross-platform CUDA program performance analysis tool launched by NVIDIA;

It mainly has the following characteristics:

  • With CUDA installed, no additional installation is required;
  • With a graphical interface, you can quickly find the performance bottleneck in the program;
  • Display CPU and GPU operations in the form of a timeline;
  • You can view various software parameters (speed) and hardware parameters (L1 cache hit rate) of data transmission and kernel;

To use the visualization tool on Windows, you may also need to follow this article for installation:

Package | Windows10 CUDA10.2 JDK8 environment to install NVidia Visual Profiler (nvvp) installation bug notes - 1LOVESJohnny's Blog - CSDN Blog

Cublas

Concept: cuBLAS is an implementation of BLAS that allows users to tap NVIDIA GPU compute resources. When using cuBLAS, the application allocates the GPU memory needed for the matrices or vectors, loads the data, calls the desired cuBLAS functions, and then copies the results from GPU memory back to the host; cuBLAS also provides helper functions for writing data to and reading data from the GPU;

Learning site: cuBLAS (nvidia.com)

Explanation: this part mainly covers linear-algebra routines on the GPU, for vector-scalar, vector-matrix, and matrix-matrix operations. Since it will not be needed in the short term, it is not studied in depth here;
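
Still, as a minimal usage sketch (not an official example): computing C = A x B with cublasSgemm, assuming column-major storage (which cuBLAS expects) and device pointers that were allocated and filled beforehand:

#include <cublas_v2.h>
#include <cuda_runtime.h>

// A is MxK, B is KxN, C is MxN, all column-major; d_A, d_B, d_C are device pointers.
void gemmWithCublas(const float* d_A, const float* d_B, float* d_C, int M, int N, int K)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f;
    const float beta  = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                M, N, K,
                &alpha,
                d_A, M,    // leading dimension of A
                d_B, K,    // leading dimension of B
                &beta,
                d_C, M);   // leading dimension of C

    cublasDestroy(handle);
}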

cuDNN

Concept: NVIDIA cuDNN is a GPU-accelerated library for deep neural networks that emphasizes performance, ease of use, and low memory overhead, and it can be integrated into higher-level machine learning frameworks;

Learning site: NVIDIA cuDNN Documentation

Implementation steps:

// 1. Create a cuDNN handle
cudnnStatus_t cudnnCreate(cudnnHandle_t *handle)
// 2. From the host, call functions that run on the device
e.g. convolution functions such as cudnnConvolutionForward
// 3. Destroy the cuDNN handle
cudnnStatus_t cudnnDestroy(cudnnHandle_t handle)
// 4. Set / get the CUDA stream associated with the cuDNN handle
cudnnStatus_t cudnnSetStream( cudnnHandle_t handle, cudaStream_t streamId)
cudnnStatus_t cudnnGetStream( cudnnHandle_t handle, cudaStream_t *streamId)
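
A minimal sketch of this handle lifecycle (descriptor setup and the actual cudnnConvolutionForward call are omitted, as is error checking):

#include <cudnn.h>
#include <cuda_runtime.h>

int main()
{
    cudnnHandle_t handle;
    cudnnCreate(&handle);              // 1. create the cuDNN handle

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudnnSetStream(handle, stream);    // 4. run cuDNN work on a specific CUDA stream

    // 2. set up tensor/filter/convolution descriptors and call
    //    cudnnConvolutionForward(...) here (omitted in this sketch)

    cudnnDestroy(handle);              // 3. destroy the cuDNN handle
    cudaStreamDestroy(stream);
    return 0;
}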

3. TensorRT learning

Basic concepts

First of all, it is necessary to clarify TensorRT's positioning: it is an inference framework:

(figure)

It has the following characteristics:

  • High-performance deep learning inference optimizer and acceleration library;
  • Low latency and high throughput;
  • Deploy to hyperscale data centers, embedded or automotive products;
  • Compared with other inference frameworks, being closed source is also one of its characteristics;

The workflow is mainly divided into two steps: building (optimizing) the engine and executing the engine;

(figure)

The most common optimizations are quantization (lower precision) and operator fusion (for example, fusing convolution, pooling, and activation layers into a single layer)

Usage

Its use process is divided into the following steps:

(figure)

There are two ways to build the network. One is API construction: each layer of the network is rebuilt in code, which is relatively tedious. The other is to use a Parser: each supported framework has a corresponding loading interface, so the network structure can be built with only a few lines of code;

model conversion

ONNX to TRT: https://github.com/onnx/onnx-tensorrt/tree/6872a9473391a73b96741711d52b98c2c3e25146

PyTorch to TRT: NVIDIA-AI-IOT/torch2trt: An easy to use PyTorch to TensorRT converter (github.com)

TensorFlow to TRT: tensorflow/tensorflow/compiler/tf2tensorrt at 1cca70b80504474402215d2a4e55bc44621b691d · tensorflow/tensorflow (github.com)

Here is a conversion tool website: https://convertmodel.com/

Specific conversion tricks need to be explored in practice, but it is usually best to convert the model to ONNX first and then convert the ONNX model to TRT;
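
As a hedged example, the trtexec command-line tool that ships with TensorRT can perform the ONNX-to-engine conversion directly; the flags below reflect common usage and may vary between TensorRT versions:

trtexec --onnx=model.onnx --saveEngine=model.trt --fp16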

simple case

Many cases are given in the official source code:

https://github.com/NVIDIA/TensorRT/tree/release/6.0/samples/opensource/sampleMNIST

This is one of the MNIST digit recognition examples;

After running the case in AIStudio, you can get the following results:

(figure)

4. Advanced TensorRT

plugin usage

Purpose:

1. The operators supported by TRT are limited; plugins make it possible to implement unsupported operators;

2. Plugins also allow deeper optimization, such as fusing operators;

Workflow:

(figure)

Explanation of the API:

First of all, be clear that there are two types of plugins:

  • Dynamic Shape: the input dimensions are dynamic; inherit from the IPluginV2DynamicExt base class;
  • Static Shape: the input dimensions are static; inherit from the IPluginV2IOExt base class;

Constructor:

1. Used in the network-definition stage: the constructor that the PluginCreator calls when creating the plugin; it needs to receive the weights and parameters. It can also be used in the clone stage, or a separate constructor can be written for clone;

MyCustomPlugin(int in_channel, nvinfer1::Weights const& weight, nvinfer1::Weights const& bias);

2. Used in the deserialization stage: passes the serialized weights and parameters into the plugin and creates it;

MyCustomPlugin(void const* serialData, size_t serialLength);

3. Pay attention to delete the default constructor;

MyCustomPlugin() = delete;

Destructor:

The destructor should call terminate(); the terminate function releases memory that this op allocated earlier;

MyCustomPlugin::~MyCustomPlugin() {
	terminate();
}

Output related functions:

1. Get the number of outputs of the layer;

int getNbOutputs() const;

2. According to the number of inputs and input dimensions, obtain the dimension of the index-th output;

nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims* inputs, int nbInputDims); 

3. Given the number of inputs and their data types, get the data type of the index-th output;

nvinfer1::DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const; 

Serialization and deserialization related functions:

1. Returns how many bytes must be written to the buffer during serialization;

size_t getSerializationSize() const;

2. The serialization function writes the parameter weight of the plugin into the buffer;

void serialize(void* buffer) const;

3. Obtain the type and version of the plugin for deserialization;

const char* getPluginType() const;
const char* getPluginVersion() const;

Initialization, configuration and destruction functions:

1. The initialization function, executed before the plugin is run; it generally allocates memory for the weights and copies them in;

int initialize(); 

2. The terminate function releases device memory that was allocated by initialize();

void terminate(); 

3. Release the resources occupied by the entire plugin;

void destroy();

4. Configures the plugin op; checks that the number and types of the inputs and outputs are correct;

void configurePlugin(const nvinfer1::PluginTensorDesc* in, int nbInput, const nvinfer1::PluginTensorDesc* out, int nbOutput);

5. Determine whether the input/output of the pos index supports the format/data type specified by inOut[pos].format and inOut[pos].type;

bool supportsFormatCombination(int pos, const nvinfer1::PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const;

Runtime related functions:

1. Gets the amount of device memory (workspace) the plugin needs; it is best not to call cudaMalloc inside the plugin's enqueue;

size_t getWorkspaceSize(int maxBatchSize) const;

2. The inference function;

int enqueue(int batchSize, const void* const* inputs, void** outputs, void* workspace, cudaStream_t stream);

IPluginCreator related functions:

1. Get the plugin name and version to identify the creator;

const char* getPluginName() const; 
const char* getPluginVersion() const; 

2. Use PluginFieldCollection to create a plugin, take out the weights and parameters required by the op one by one, and then call the first constructor mentioned above:

const nvinfer1::PluginFieldCollection* getFieldNames(); 
nvinfer1::IPluginV2* createPlugin(const char* name, const nvinfer1::PluginFieldCollection* 
fc);

3. Deserialization: calls the deserialization constructor to create the plugin;

nvinfer1::IPluginV2* deserializePlugin(const char* name, const void* serialData, size_t serialLength);

It is recommended to refer to the official case study to consolidate the implementation process of the code;

optimization

First of all, you can understand the specific definitions of FP32 and FP16 types:

Reference: ARM CPU performance optimization: the difference between FP32, FP16 and BF16 - Zhihu (zhihu.com)

The meaning of INT8 quantization:

Convert a floating-point model to low-precision INT8 values for computation in order to speed up inference;

(figure)

The principle of INT8 and FP16 accelerated reasoning:

Through special instructions or hardware support, more FP16 and INT8 operations can be executed per clock cycle than FP32 operations;

Why doesn't INT8 quantization significantly lose precision?

Because neural networks have a certain robustness;

Reason: training data is generally noisy, and training a neural network is largely a process of extracting useful information from noise; the loss introduced by reduced-precision computation can be viewed as just another kind of noise;

Classification of INT8 quantization:

Dynamic symmetric quantization algorithm (ONNX quantization, torch dynamic quantization)

Dynamic Asymmetric Quantization Algorithm (Google Gemmlowp)

Static symmetric quantization algorithm (torch static quantization, TensorRT, NCNN)

Dynamic symmetric quantization algorithm:

(figure)

Calculation formula:

scale = |max| * 2/256;

real_value = scale * quantized_value;

Where real_value is the real value (float type), quantized_value is the result of INT8 quantization (char type)

Advantages: the algorithm is simple, and the quantization step takes a short time;

Disadvantages: it can waste bit width and hurt accuracy; if the data is not symmetric around zero, part of the 8-bit range goes unused;
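
A minimal sketch of this dynamic symmetric scheme (the function names and the clamping/rounding details are illustrative, not taken from any particular library):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// scale = |max| * 2 / 256; quantized_value = round(real_value / scale), clamped to [-128, 127].
std::vector<int8_t> quantizeSymmetric(const std::vector<float>& x, float& scale)
{
    float absMax = 0.0f;
    for (float v : x) absMax = std::max(absMax, std::fabs(v));

    scale = (absMax > 0.0f) ? absMax * 2.0f / 256.0f : 1.0f;   // one INT8 step, in real-value units
    std::vector<int8_t> q(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        int v = static_cast<int>(std::round(x[i] / scale));
        q[i] = static_cast<int8_t>(std::min(127, std::max(-128, v)));
    }
    return q;
}

// Dequantization: real_value ~ scale * quantized_value.
float dequantize(int8_t q, float scale) { return scale * q; }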

Dynamic asymmetric quantization algorithm:

(figure)

Calculation formula:

scale = |max - min| / 256;

real_value = scale * (quantized_value - zero_point);

Among them, real_value is the real value (float type), quantized_value is the result of INT8 quantization (char type), and zero_point is the zero point value

Advantages: no waste of bit width, guaranteed accuracy;

Disadvantages: the algorithm is more complicated, and the quantization step takes a long time;

Static symmetric quantization algorithm:

(figure)

Dynamic quantization: the |max| statistic is computed on the fly during inference;

Static quantization: a scaling threshold computed in advance (via calibration) is used during inference, and data beyond the threshold is clipped;

Advantages: the algorithm is the simplest, quantization takes the shortest time, and accuracy is preserved;

Disadvantages: building the calibration (quantization) network is more cumbersome;

The KL divergence is mainly used to calculate the quantized threshold. You can refer to the following articles:

TensorRT INT8 quantization principle and implementation (very detailed) - Nicholson07's Blog - CSDN Blog

Source code example: TensorRT/samples/opensource/sampleINT8 at release/7.2 · NVIDIA/TensorRT · GitHub

Deploying INT8 quantization at scale:

(figure)

Summary

1. For deep neural network inference, TRT can make full use of the GPU's computing power and save GPU memory;

2. It is worth starting from the official sample code, trying to swap in an existing model, and then learning the API in more depth to build networks directly;

3. To use custom components (plugins), you need at least a basic understanding of CUDA's architecture and common properties;

4. The recommended quantization modes are FP16 (few settings to define, a clear speedup, and little impact on accuracy) and INT8 (greater potential, but may reduce accuracy);

5. An engine cannot be reused across GPUs of different architectures or across different software versions; a new one must be generated;


Source: https://blog.csdn.net/weixin_40620310/article/details/130194428