Quantization, pruning, and compression of deep learning models

fp16 refers to a data type that uses 2 bytes (16 bits) for encoding and storage;

fp32 refers to a data type that uses 4 bytes (32 bits);

Advantages of training in fp16 compared to fp32:

  • 1. Reduced memory usage: fp16 tensors take up half the memory of fp32 tensors, so a larger batch_size can be used
  • 2. Faster computation: the speedup is only available on recent GPUs with dedicated fp16 hardware, and I have not experienced this benefit myself; some papers report that fp16 training can be 2-8x faster than fp32 (see the sketch after this list)
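As a hedged illustration of fp16 training, here is a minimal mixed-precision sketch using PyTorch's torch.cuda.amp; the model, data, and hyperparameters are placeholders chosen for the example, not taken from this post.

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

# Placeholder model and optimizer; assumes a CUDA GPU is available.
model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

for _ in range(10):
    x = torch.randn(32, 512, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with autocast():  # run the forward pass in fp16 where it is numerically safe
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)
    scaler.update()
```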

| Data type | Range |
| --- | --- |
| float16 | -65504 ~ 65504 |
| float32 | approximately -3.4×10^38 ~ 3.4×10^38 |
| int8 | -2^7 ~ 2^7-1 (-128 ~ 127) |
| uint8 | 0 ~ 2^8-1 (0 ~ 255) |
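The ranges in the table above can be checked directly in PyTorch with torch.finfo and torch.iinfo, for example:

```python
import torch

# Representable ranges of the data types listed above.
print(torch.finfo(torch.float16).min, torch.finfo(torch.float16).max)  # -65504.0 65504.0
print(torch.finfo(torch.float32).max)                                  # ~3.4028235e+38
print(torch.iinfo(torch.int8).min, torch.iinfo(torch.int8).max)        # -128 127
print(torch.iinfo(torch.uint8).min, torch.iinfo(torch.uint8).max)      # 0 255
```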

1. Purpose of quantization

When developing machine learning applications, it is important to use computing resources efficiently, both on the server side and on the device. To support more efficient deployment on servers and edge devices, support for model quantization is becoming increasingly important.

Quantization uses 8-bit integer (int8) instructions to reduce model size and speed up inference (lower latency), and can be the difference between a model that meets its quality-of-service goals, or even fits within the resources available on a mobile device, and one that does not. It also makes it possible to deploy larger, more accurate models even in resource-constrained environments.

Quantization is primarily a technique for speeding up inference, and quantized operators support only the forward pass. In deep learning, quantization means storing tensors that were originally kept in floating point with fewer bits, and carrying out computations that were originally done in floating point with those lower-precision values.

 

2. Introduction to quantization

Quantization is primarily a technique for speeding up inference, and quantized operators support only the forward pass. Quantization refers to performing computations and memory accesses with lower-precision data, usually int8, compared to a floating-point implementation.

Performance gains can be achieved in several important areas:

  • model size reduced by roughly 4x;
  • 2-4x less memory bandwidth;
  • 2-4x faster inference, due to the memory-bandwidth savings and faster int8 arithmetic.
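As a rough, hedged sketch of the "about 4x smaller" claim, one can compare the serialized size of an fp32 model with its dynamically quantized counterpart; the layer sizes and the use of dynamic quantization here are arbitrary choices for illustration.

```python
import os
import torch
from torch import nn

# Placeholder fp32 model; only the nn.Linear weights get quantized to int8.
model_fp32 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

def size_mb(model, path="tmp_size_check.pt"):
    """Serialize the state_dict and report its size on disk in MB."""
    torch.save(model.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"fp32: {size_mb(model_fp32):.2f} MB, int8: {size_mb(model_int8):.2f} MB")
```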

However, quantization does not come for free. Fundamentally, quantization introduces approximations, so the resulting network is slightly less accurate. The techniques described below try to minimize the gap between full floating-point accuracy and quantized accuracy.

3. Quantization methods

3.1 Dynamic Quantization

The simplest quantization method supported by PyTorch is called dynamic quantization. It converts not only the weights to int8 but also the activations to int8, on the fly, just before performing the computation (hence "dynamic"). Computations are therefore carried out with efficient int8 matrix-multiplication and convolution implementations, which makes them faster. However, the activations are still read from and written to memory in floating-point format.
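A minimal sketch of dynamic quantization using PyTorch's torch.quantization.quantize_dynamic API; the two-layer model is a placeholder for illustration.

```python
import torch
from torch import nn

# Placeholder fp32 model.
model_fp32 = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))

# Weights of nn.Linear modules are converted to int8 ahead of time;
# activations are quantized on the fly at inference time.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
print(model_int8(x).shape)  # forward pass runs int8 matrix multiplications
```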

3.2 Post-Training Static Quantization

Performance (latency) can be improved further by converting the network to use both integer arithmetic and int8 memory accesses. Static quantization adds an extra step: batches of data are first fed through the network to compute the distributions of the different activations. This information is then used to decide how each activation should be quantized at inference time. Importantly, this extra step lets quantized values be passed directly between operations, instead of converting them to float and back to int between every operation, which yields a significant speedup.
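A minimal sketch of eager-mode post-training static quantization in PyTorch; the tiny conv model, the random calibration data, and the "fbgemm" backend are assumptions made for the example.

```python
import torch
from torch import nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fp32 -> int8 at the input
        self.conv = nn.Conv2d(3, 16, 3)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> fp32 at the output

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

model = TinyModel().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 server backend
prepared = torch.quantization.prepare(model)  # insert observers

# Calibration: feed representative batches so the observers record activation ranges.
for _ in range(10):
    prepared(torch.randn(8, 3, 32, 32))

quantized = torch.quantization.convert(prepared)  # swap modules for int8 versions
print(quantized(torch.randn(1, 3, 32, 32)).shape)
```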

3.3 Quantization-Aware Training (QAT)

Quantization-aware training (QAT) is the most accurate of the three methods. With QAT, all weights and activations are "fake-quantized" during both the forward and backward passes of training: floating-point values are rounded to simulate int8 values, but all computations are still carried out in floating point. All weight updates during training are therefore made with "awareness" of the fact that the model will eventually be quantized, so after quantization this method usually yields higher accuracy than the other two.
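A minimal sketch of quantization-aware training in PyTorch's eager mode; the model, the random data, the short training loop, and the "fbgemm" backend are placeholders for illustration.

```python
import torch
from torch import nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(128, 10)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyModel().train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
prepared = torch.quantization.prepare_qat(model)  # insert fake-quantization modules

optimizer = torch.optim.SGD(prepared.parameters(), lr=0.01)
for _ in range(5):  # fine-tune while the fake-quant nodes simulate int8 rounding
    x = torch.randn(32, 128)
    y = torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(prepared(x), y)
    loss.backward()
    optimizer.step()

quantized = torch.quantization.convert(prepared.eval())  # produce the real int8 model
```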

4. Introduction to PTQ and QAT

Quantization methods can be divided into quantization-aware training (QAT) and post-training quantization (PTQ), according to whether the quantization parameters are tuned with training. The difference between the two workflows is shown in the figure below (QAT on the left, PTQ on the right):

[Figure: QAT workflow (left) vs. PTQ workflow (right)]

Quantization-aware training (QAT) quantizes a trained model and then retrains it. Since fixed-point values cannot be used directly in the backward gradient computation, in practice fake-quantization nodes are inserted before certain ops; during training these nodes record the clipping ranges of the data flowing through those ops, which are then used to configure the quantized nodes when the model is deployed. The best quantization parameters are obtained by continuously optimizing accuracy during training. Because the model has to be trained, this method places higher technical demands on the people doing the work.
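To illustrate what a fake-quantization node does, here is a small sketch using torch.fake_quantize_per_tensor_affine; the scale and zero point are arbitrary values chosen for the example.

```python
import torch

x = torch.randn(5)

# Values are rounded onto the int8 grid defined by (scale, zero_point), but the
# result is still stored as fp32, so gradients can flow during training.
x_fq = torch.fake_quantize_per_tensor_affine(
    x, scale=0.1, zero_point=0, quant_min=-128, quant_max=127
)
print(x)
print(x_fq)  # same dtype (float32), but snapped to multiples of 0.1
```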

Post-training quantization (PTQ) calibrates the trained model with a batch of calibration data and converts the trained FP32 network directly into a fixed-point network, without any retraining of the original model. The quantization process only requires tuning a few hyperparameters; it is simple and fast because no training is involved, so this method is widely used in device-side and cloud-side deployment scenarios.

| Model type | Preferred option | Why |
| --- | --- | --- |
| LSTM/RNN | dynamic quantization | Throughput is dominated by the compute/memory bandwidth of the weights |
| BERT/Transformer | dynamic quantization | Throughput is dominated by the compute/memory bandwidth of the weights |
| CNN | static quantization | Throughput is limited by activation memory bandwidth |
| CNN | quantization-aware training | When static quantization cannot reach the required accuracy |


Origin blog.csdn.net/xs1997/article/details/131747158