[Deep Learning] Model Quantization - Notes/Experiments

Contents

1. Quantization methods
    1. Asymmetric quantization
    2. Symmetric quantization
    3. Random quantization
2. Types of quantization
    1. Dynamic quantization
    2. Static quantization
    3. Quantization-aware training
3. Quantization in practice
    1. Experimental configuration
    2. Experimental results
    3. Experimental code
4. References


Deep learning works well on many tasks such as image classification, object detection, semantic segmentation, and natural language processing. In industrial deployment, however, there are often strict constraints on the model. To run models effectively on mobile or embedded devices, the model size must be compressed. Two methods are commonly used:

(1) Design more efficient model architectures, such as MobileNet and SqueezeNet;

(2) Reduce the size of an existing network through compression, encoding, and similar techniques.

If a model's parameters are stored as FP32, each value occupies 32 bits. Converting the parameters to int8 or uint8 shrinks the model to roughly 1/4 of its original size. In addition, the computational cost of 8-bit arithmetic is far lower than that of 32-bit arithmetic.
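As a quick sanity check of the 4x storage argument, the per-element sizes can be compared directly in PyTorch (a minimal sketch, not part of the original experiments):

import torch

x_fp32 = torch.randn(1000)                      # FP32: 4 bytes per element
x_int8 = torch.zeros(1000, dtype=torch.int8)    # int8: 1 byte per element

print(x_fp32.element_size(), x_int8.element_size())   # 4 1
print(x_fp32.numel() * x_fp32.element_size())          # 4000 bytes
print(x_int8.numel() * x_int8.element_size())          # 1000 bytes, i.e. 1/4 of the FP32 size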

1. Quantization methods

Quantization methods: asymmetric quantization, symmetric quantization, random quantization.

1. Asymmetric quantization

Assume the floating-point input range is (X_min, X_max) and the quantized range is (0, N_levels - 1); for 8-bit quantization, N_levels is 256. The scale and zero point are computed as follows:

\Delta =(X\_max-X\_min)/255

z=-X\_min/\Delta

After getting the scale and zero point, for any input x, the quantization calculation process is:

x_{int}=round(\frac{x}{\Delta })+z

x_Q=clamp(0, N\_levels-1, x_{int})

The corresponding inverse quantization (dequantization) formula is:

x_{float}=(x_Q-z)\Delta

Note: for one-sided distributions such as (2.5, 3.5), the range must first be relaxed to (0, 3.5) and then quantized; for extremely one-sided distributions this costs accuracy.
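The formulas above can be written out as a small helper to make the rounding and clamping explicit. This is a minimal sketch (not from the original post) of the asymmetric scheme for uint8:

import torch

def asymmetric_quantize(x, n_levels=256):
    # relax the range so that 0.0 is representable, then compute scale and zero point
    x_min = min(x.min().item(), 0.0)
    x_max = max(x.max().item(), 0.0)
    scale = (x_max - x_min) / (n_levels - 1)
    zero_point = round(-x_min / scale)
    # quantize: scale, round, shift by the zero point, clamp to [0, n_levels - 1]
    x_int = torch.round(x / scale) + zero_point
    x_q = torch.clamp(x_int, 0, n_levels - 1).to(torch.uint8)
    return x_q, scale, zero_point

def asymmetric_dequantize(x_q, scale, zero_point):
    # x_float = (x_q - z) * scale
    return (x_q.float() - zero_point) * scale

x = torch.randn(10)
x_q, scale, z = asymmetric_quantize(x)
x_hat = asymmetric_dequantize(x_q, scale, z)
print((x - x_hat).abs().max())    # quantization error, on the order of scale / 2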

2. Symmetric quantization

The scale calculation formula is as follows:

\Delta =\frac{max(abs(x))}{N\_levels-1}

Symmetric quantization is simpler: it fixes the zero point at 0. The quantization formula is as follows:

x_{int}=round(\frac{x}{\Delta })

x_Q=clamp(-N\_levels/2, N\_levels/2-1, x_{int})

Inverse quantization formula:

x_{float}=x_Q\Delta

3. Random quantization

Random quantization is similar to asymmetric quantization, except that random noise is introduced during quantization (before rounding). The parameter calculation and the dequantization process are the same as for asymmetric quantization and are not repeated here.
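The text above does not specify the exact noise form, so stochastic rounding is used here as one common choice; this is an illustrative sketch, written as a drop-in change to the asymmetric helper above:

import torch

def stochastic_round(x):
    # round up with probability equal to the fractional part, down otherwise
    return torch.floor(x + torch.rand_like(x))

# inside asymmetric_quantize, replace torch.round(x / scale) with:
# x_int = stochastic_round(x / scale) + zero_point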

2. Types of quantization

PyTorch offers three quantization modes: dynamic quantization, static quantization, and quantization-aware training.

1. Dynamic quantization

Once a network is trained, its weights are fixed, so the quantization parameters for the weights can be computed ahead of time; for the activations, however, the scaling factor is computed dynamically for each input (hence "dynamic"). Dynamic quantization performs the worst of the three approaches.

Dynamic quantization is often used for very large models.
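For reference, dynamic quantization in PyTorch is essentially a one-liner. The sketch below quantizes only the nn.Linear layers of the test model, which is also why the MobileNetV2 numbers in section 3.2 barely change: that model is almost entirely convolutional.

import torch
import torchvision

model = torchvision.models.mobilenet_v2(num_classes=10).eval()
# dynamically quantize the fully connected layers; convolutions stay in FP32
model_dynamic = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)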

2. Static quantization

The difference between static and dynamic quantization lies in how the input (activation) scaling factors are computed. A statically quantized model goes through a calibration step before use: a small set of representative inputs is prepared (for an image classification model, some images; other tasks are analogous), the prepared model runs predictions on them, and the scaling factors of the quantized model are adjusted according to the distribution of this input data.

After calibration, both the weight and activation scaling factors are fixed (hence "static"). Static quantization generally performs better than dynamic quantization and is often used for medium and large models.
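A minimal FX graph mode sketch of this flow, assuming the torch==1.11 environment from section 3.1 (newer releases additionally require an example_inputs argument to prepare_fx), with random tensors standing in for the real calibration images and the 'fbgemm' backend chosen as an example:

import torch
import torchvision
from torch.quantization.quantize_fx import prepare_fx, convert_fx

float_model = torchvision.models.mobilenet_v2(num_classes=10).eval()
qconfig_dict = {"": torch.quantization.get_default_qconfig('fbgemm')}

# insert observers that record the ranges of activations
prepared = prepare_fx(float_model, qconfig_dict)

# calibration: run a few representative batches through the observed model
with torch.no_grad():
    for _ in range(10):
        prepared(torch.randn(32, 3, 32, 32))

# freeze the observed scales/zero points and swap in quantized modules
model_static = convert_fx(prepared)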

3. Quantization-aware training

Quantization-aware training (QAT) achieves the highest accuracy of the three approaches. It simulates quantization directly inside the training loop, which removes the need for a calibration step after training. This slows training down but usually yields higher accuracy; the experimental code in section 3.3 uses this approach.

3. Quantization in practice

PyTorch supports asymmetric linear quantization in per-tensor and per-channel modes. Per tensor means all values in a tensor share a single scaling factor (scale); per channel means different channels of the tensor have different scaling factors.
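The difference is easy to see with the low-level quantization ops (an illustration, not part of the original experiments):

import torch

x = torch.randn(3, 4)

# per tensor: a single scale and zero point for the whole tensor
xq_tensor = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.qint8)

# per channel: one scale and zero point per slice along `axis`
scales = torch.tensor([0.1, 0.05, 0.2])
zero_points = torch.zeros(3, dtype=torch.int64)
xq_channel = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.qint8)

print(xq_tensor.q_scale())                 # one scale
print(xq_channel.q_per_channel_scales())   # one scale per channel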

1. Experimental configuration

The experimental environment is as follows:

torch==1.11.0

Windows 10, CPU only

FX graph mode quantization

The torchvision.models.mobilenet_v2 model is used as the test model. Its parameters are not trained; the CIFAR-10 data is only used to measure the running speed of the model after quantization, and the model is fused before timing.

2. Experimental results

Note: dynamic quantization does not support quantizing convolutions, so only the fully connected layer is quantized; its parameter size and time consumption are therefore almost unchanged.

Model                         Parameters (.pt size, KB)   Time (s)
Original model                8971                        68.94
Dynamic quantization          8671                        68.38
Static quantization           2293                        45.94
Quantization-aware training   2293                        46.68

3. Experimental code

There is quite a lot of experimental code; below is the test code for quantization-aware training. The rest of the code can be obtained from the link in the references at the end of the article:

import torch
import torchvision
from torchvision.transforms import ToTensor, Normalize, Compose
from torch.quantization.quantize_fx import convert_fx, prepare_qat_fx, fuse_fx
import time

transforms = Compose(
    [
        ToTensor(),
        Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ]
)

# assumes CIFAR-10 has already been downloaded to ./data
cifar10_dataset = torchvision.datasets.CIFAR10(root='data', transform=transforms, download=False)
data_loader = torch.utils.data.DataLoader(cifar10_dataset,
                                          batch_size=32,
                                          shuffle=True,
                                          num_workers=0)

model = torchvision.models.mobilenet_v2(num_classes=10)
qconfig_dict = {"": torch.quantization.get_default_qat_qconfig('qnnpack')}

# insert fake-quantization modules for quantization-aware training
model.train()
model_quant = prepare_qat_fx(model, qconfig_dict)

ce_loss = torch.nn.CrossEntropyLoss()
learning_rate = 0.01
optimizer = torch.optim.SGD(model_quant.parameters(), momentum=0.1, lr=learning_rate)
epochs = 1
for i in range(epochs):
    for idx, (image, label) in enumerate(data_loader):
        logits = model_quant(image)
        loss = ce_loss(logits, label)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if idx % 100 == 0:
            print("epoch: {}, loss: {}".format(i, loss.detach().numpy()))
            break  # only a speed test, so stop after the first batch

# convert the fake-quantized model to a real int8 model
model_quant = convert_fx(model_quant)

torch.save(model_quant.state_dict(), "model_qat.pt")

# fuse and time the quantized model on the full dataset
model_quant.eval()
model_quant = fuse_fx(model_quant)

beg = time.time()
with torch.no_grad():
    for idx, (image, label) in enumerate(data_loader):
        logits = model_quant(image)

end = time.time()
print('time consume: ', end - beg)
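The model sizes reported in section 3.2 can be read off the saved state_dict file, e.g. (a small helper, not in the original code):

import os
print("model_qat.pt size: {:.0f} KB".format(os.path.getsize("model_qat.pt") / 1024))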

4. References

1. A developer-friendly guide to model quantization with PyTorch
2. Introduction to quantization on PyTorch
3. PyTorch quantization test code: MobileNetV2 speed test on CIFAR-10
4. NCNN model quantization practice: portrait matting with the MODNet model

Source: blog.csdn.net/qq_40035462/article/details/123745290