[Strongly recommended] YOLOv7 deployment accelerated by 590%, BERT deployment accelerated by 622%: this open-source automatic compression tool deserves a bookmark!

Guide

As we all know, computer vision (CV) is one of the fields with the highest share of enterprise AI applications. To reduce costs, engineers have been exploring a variety of model compression techniques to produce "more accurate, smaller, and faster" AI models for deployment. In natural language processing (NLP), as model accuracy keeps improving, model sizes keep growing as well; pre-trained models such as BERT and GPT have become an obstacle to deploying enterprise NLP models.

This article introduces a low-cost, high-return automatic model compression tool (ACT, Auto Compression Toolkit). Without modifying the training source code, tens of minutes of quantization training can greatly reduce model size, lower memory usage, and speed up inference while preserving accuracy, making it easier to bring AI models into production.

Using ACT's knowledge-distillation-based quantization training on the YOLOv7 model, the INT8 quantized model is 75% smaller than the original FP32 model, and inference on NVIDIA GPUs is 5.89 times faster. Using ACT's unstructured sparsity plus distillation on the PP-HumanSeg model, inference on ARM CPUs is up to 1.49 times faster than before compression.

Table 1: Compression and inference acceleration results of the automatic compression tool on CV models


Using ACT's structured pruning, distillation, and quantization methods to train the ERNIE 3.0 model, the model after INT8 quantization is reduced in size by 185% compared with the original FP32 model, and inference on NVIDIA GPUs is 6.37 times faster.

Table 2: Compression and inference acceleration results of the automatic compression tool on NLP models


This article provides a technical walkthrough of the following six aspects. The full text is about 3,900 words, and the expected reading time is 5 minutes.

One

Four-step YOLOv7 automatic compression practice

Two

Four-step BERT automatic compression practice

Three

Inference Deployment of Models

Four

Live class registration

Five

Interpretation of Automatic Compression Technology

Six

Future work outlook

You're welcome to Star the project on GitHub to follow it:

https://github.com/PaddlePaddle/PaddleSlim/tree/develop/example/auto_compression

Join the automatic compression technology exchange group

Membership benefits

  • Get the link to the live lecture that explains this automatic compression upgrade in detail

  • Get the learning gift pack organized by the compression team, including:

  1. Deep learning courses

  2. A collection of top-conference papers on model compression

  3. Videos of previous live courses on compression

  4. Model compression architect training course materials from a Baidu Highest Award winner


How to join the group

  1. Scan the QR code to follow the official account, fill out the questionnaire, and join the WeChat group

  2. Before the live broadcast, the benefits will be posted in the group announcement


01

Four-step YOLOv7 automatic compression practice

1. Prepare the prediction model: export the ONNX model.

git clone https://github.com/WongKinYiu/yolov7.git
cd yolov7
python export.py --weights yolov7-tiny.pt --grid

2. Prepare training data & define DataLoader: prepare data in COCO or VOC format and define the data preprocessing modules. The data preprocessing Reader is configured as follows:

import paddle

train_dataset = paddle.vision.datasets.ImageFolder(
    global_config['image_path'], transform=yolo_image_preprocess)
train_loader = paddle.io.DataLoader(
    train_dataset,
    batch_size=1,
    shuffle=True,
    drop_last=True,
    num_workers=0)

3. Define the configuration file: define the configuration file for quantization training. Distillation holds the distillation parameters, Quantization holds the quantization parameters, and TrainConfig holds the number of training iterations, the optimizer, and other training settings. For specific hyperparameter settings, please refer to the ACT hyperparameter setting document.

Distillation:  # distillation settings
  alpha: 1.0   # weight of the distillation loss
  loss: soft_label

Quantization:  # quantization settings
  use_pact: true  # whether to use the PACT quantization algorithm
  activation_quantize_type: 'moving_average_abs_max'   # activation quantization method; 'moving_average_abs_max' is recommended
  quantize_op_types:  # OP types to quantize, e.g. conv2d, depthwise_conv2d, mul, matmul_v2
  - conv2d
  - depthwise_conv2d

TrainConfig:   # training settings
  train_iter: 3000   # number of training iterations
  eval_iter: 1000    # interval (in iterations) between accuracy evaluations
  learning_rate: 0.00001  # learning rate
  optimizer_builder:  # optimizer settings
    optimizer:
      type: SGD
    weight_decay: 4.0e-05

4. Start compression: ACT quantization training can be launched with two lines of code. When starting ACT, you need to pass in the model directory (model_dir), model file name (model_filename), parameter file name (params_filename), output directory for the compressed model (save_dir), compression configuration (config), the training dataloader, and the accuracy evaluation callback eval_callback.

from paddleslim.auto_compression import AutoCompression
ac = AutoCompression(
        model_dir=global_config["model_dir"],
        model_filename=global_config["model_filename"],
        params_filename=global_config["params_filename"],
        save_dir=FLAGS.save_dir,
        config=all_config,
        train_dataloader=train_loader,
        eval_callback=eval_function)
ac.compress()

02

Four-step BERT automatic compression practice

1. Prepare the prediction model

A Paddle model can skip this step and be compressed directly. A PyTorch model can be converted with either of the following two methods; once the conversion is done, compression can start.

  • Use PyTorch2Paddle (in the X2Paddle toolbox) to convert the PyTorch dynamic graph model directly into a PaddlePaddle static graph model (the code below uses this method);

  • Use ONNX2Paddle: save the PyTorch dynamic graph model in ONNX format and then convert it into a PaddlePaddle static graph model (a minimal sketch follows the code below).

import torch
import numpy as np

# Put the PyTorch model in eval mode
torch_model.eval()
# Build example inputs
input_ids = torch.zeros([batch_size, max_length]).long()
token_type_ids = torch.zeros([batch_size, max_length]).long()
attention_msk = torch.zeros([batch_size, max_length]).long()
# Run the conversion
from x2paddle.convert import pytorch2paddle
pytorch2paddle(torch_model,
               save_dir='./x2paddle_cola/',
               jit_type="trace",
               input_examples=[input_ids, attention_msk, token_type_ids])
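
For the second route, a minimal sketch is shown below. It is an illustrative assumption rather than the exact code from the ACT example: the ONNX file name is a placeholder, and the onnx2paddle call should be checked against the X2Paddle documentation (the equivalent CLI is x2paddle --framework=onnx --model=model.onnx --save_dir=pd_model).

import torch

# Export the PyTorch model to ONNX first; the example inputs fix the traced shapes
torch.onnx.export(
    torch_model,
    (input_ids, attention_msk, token_type_ids),
    "model.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    opset_version=11)

# Then convert the ONNX model into a PaddlePaddle static graph model
from x2paddle.convert import onnx2paddle
onnx2paddle("model.onnx", save_dir="./x2paddle_cola_onnx/")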

2. Prepare training data & define DataLoader. In this case, the automatic compression experiment uses the GLUE dataset by default, and PaddleNLP will download the corresponding dataset automatically.

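As a rough sketch of what this step involves (the dataset name, tokenizer, field names, and batching details below are illustrative assumptions, not the exact example code):

import functools
import paddle
from paddlenlp.data import DataCollatorWithPadding
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import AutoTokenizer

# PaddleNLP downloads the GLUE CoLA dataset automatically
train_ds, dev_ds = load_dataset("glue", "cola", splits=("train", "dev"))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def convert_example(example, tokenizer, max_length=128):
    # Tokenize the sentence and keep its label
    features = tokenizer(example["sentence"], max_seq_len=max_length)
    features["labels"] = example["labels"]
    return features

train_ds = train_ds.map(functools.partial(convert_example, tokenizer=tokenizer))
dev_ds = dev_ds.map(functools.partial(convert_example, tokenizer=tokenizer))

# Pad each batch to a uniform length
collate_fn = DataCollatorWithPadding(tokenizer)
train_dataloader = paddle.io.DataLoader(
    train_ds, batch_size=32, shuffle=True, collate_fn=collate_fn)
eval_dataloader = paddle.io.DataLoader(
    dev_ds, batch_size=32, collate_fn=collate_fn)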

3. Define the configuration file. If no specific compression strategy is specified for a model with a Transformer encoder structure, automatic compression will automatically choose structured pruning plus quantization. To configure a particular compression strategy yourself, refer to the ACT hyperparameter setting document.

### Training configuration
train_config = {
    "epochs": 3,               ### number of compression training epochs
    "eval_iter": 855,          ### evaluate once every this many training iterations
    "learning_rate": 1.0e-6,   ### learning rate during compression training
    "optimizer_builder": {     ### optimizer settings
        "optimizer": {"type": "AdamW"},
        "weight_decay": 0.01   ### weight decay
    },
    "origin_metric": 0.6006    ### accuracy of the model before compression, used to verify that the converted model and the dataloader are correct
}

4. Start compression. ACT quantization training can be launched with two lines of code. When starting ACT, you need to pass in the model directory (model_dir), model file name (model_filename), parameter file name (params_filename), output directory for the compressed model (save_dir), compression configuration (config), the dataloaders, and the accuracy evaluation callback eval_callback.

### Call the automatic compression API
from paddleslim.auto_compression import AutoCompression

ac = AutoCompression(
    model_dir='./x2paddle_cola',
    model_filename='model.pdmodel',
    params_filename='model.pdiparams',
    save_dir=save_dir,
    config={'TrainConfig': train_config},
    train_dataloader=train_dataloader,
    eval_callback=eval_function,
    eval_dataloader=eval_dataloader)
ac.compress()

The above is simplified key code. For a quick hands-on experience, follow the sample documentation and code at
https://github.com/PaddlePaddle/PaddleSlim/tree/develop/example/auto_compression/pytorch_huggingface.

After training completes, the model.pdmodel and model.pdiparams files are generated under the save_dir path. At this point the compression training is done; see the next section for inference deployment.

03

Inference Deployment of Models

Based on the compressed model, developers can directly use the FastDeploy inference deployment toolkit to complete deployment. With FastDeploy, developers can switch among backends such as Paddle Inference, Paddle Lite, TensorRT, OpenVINO, ONNX Runtime, and RKNN with a single line of code, enabling deployment on different hardware.
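
As a rough illustration of backend switching (a sketch only; option names can vary slightly between FastDeploy versions, and the model paths here are placeholders):

import fastdeploy as fd

option = fd.RuntimeOption()
option.set_model_path("output/model.pdmodel", "output/model.pdiparams")
option.use_gpu()

# Switch the inference backend with a single call as needed:
option.use_trt_backend()           # TensorRT on NVIDIA GPU
# option.use_ort_backend()         # ONNX Runtime
# option.use_openvino_backend()    # OpenVINO on Intel CPU

runtime = fd.Runtime(option)
# results = runtime.infer({"input_name": input_array})  # feed numpy inputs by name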

04

Live class registration

If you want to learn about AI model compression strategies and understand the algorithms and capabilities of the automatic compression tool in more depth, scan the QR code to join the group and follow our live broadcast!

Live broadcast time:  20:30-21:30 on November 7 (next Monday) and November 8 (next Tuesday), 2022. Welcome to scan the QR code to register.


05

Interpretation of Automatic Compression Technology

1

Motivation and thinking behind developing the automatic model compression tool

Model pruning is an important means of model compression. In practice, it presents the following two difficulties:

1) Directly applying pruning causes a relatively large accuracy loss, which cannot meet accuracy requirements

  • Structured pruning removes unimportant neurons from the network. Although the model is retrained after pruning, some information in the pre-trained model is usually difficult to recover, so accuracy drops after pruning. Adding the pre-training data back for retraining would help, but greatly increases the cost of pruning.

2) Model pruning requires modifying the training code; the operation is complicated and the technical threshold is high

Structured pruning consists of the following 3 steps:

  • Compute the importance of each neuron according to predefined rules;

  • Prune the model's neurons based on their importance;

  • Retrain the pruned model.

These steps require developers to call the pruning-related interfaces directly in the original training code and run them step by step. Real projects are usually quite complex, so modifying the training code is technically involved and time-consuming.
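
To make the first two steps concrete, here is a minimal, self-contained sketch of L1-norm channel importance and pruning applied to a single convolution weight (the criterion and shapes are illustrative; a real structured pruner must also update attention heads and the layers connected to the pruned channels):

import numpy as np

def channel_importance(conv_weight):
    # conv_weight has shape [out_channels, in_channels, kh, kw];
    # use the L1 norm of each output channel as its importance score
    return np.abs(conv_weight).reshape(conv_weight.shape[0], -1).sum(axis=1)

def prune_channels(conv_weight, ratio=0.25):
    # Keep the most important (1 - ratio) fraction of output channels
    scores = channel_importance(conv_weight)
    num_keep = int(conv_weight.shape[0] * (1 - ratio))
    keep_idx = np.sort(np.argsort(-scores)[:num_keep])
    return conv_weight[keep_idx], keep_idx

w = np.random.randn(64, 32, 3, 3).astype("float32")
pruned_w, kept = prune_channels(w, ratio=0.25)   # 48 of 64 channels remain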

Model quantization is one of the main means of speeding up model inference. In practice, it presents the following three difficulties:

1) The distribution of model activation values is uneven, resulting in large quantization errors

Over-training is one cause of unevenly distributed activations. For example, during the iteration of YOLOv6s, the training schedule is usually extended so that the model converges better. But this also brings hidden risks, such as overfitting on the COCO dataset and extreme value distributions in some layers, which increase quantization noise. We analyzed the per-layer quantization accuracy of each Conv in YOLOv6s and found that some layers lost accuracy particularly badly. As a result, the accuracy of YOLOv6s on the validation set dropped by 10% after offline quantization, which could not meet business requirements.
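
The per-layer analysis described above can be sketched as follows: fake-quantize one convolution's weights at a time and measure how much the network output shifts. The model, input, and error metric here are illustrative stand-ins for the real detection model and mAP:

import paddle

model = paddle.vision.models.resnet18(pretrained=False)
model.eval()
x = paddle.randn([1, 3, 224, 224])
baseline = model(x)

def fake_quant(w, num_bits=8):
    # Symmetric abs-max quantize-dequantize of a weight tensor
    qmax = 2 ** (num_bits - 1) - 1
    scale = float(paddle.abs(w).max()) / qmax
    return paddle.clip(paddle.round(w / scale), -qmax, qmax) * scale

for name, layer in model.named_sublayers():
    if isinstance(layer, paddle.nn.Conv2D):
        original = layer.weight.clone()
        layer.weight.set_value(fake_quant(original))
        error = float(paddle.mean((model(x) - baseline) ** 2))
        layer.weight.set_value(original)  # restore the FP32 weights
        print(f"{name}: output MSE when only this layer is quantized = {error:.6f}")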

2) The task complexity is high, and the model accuracy is greatly affected by the quantization error

The higher the task complexity, the greater the accuracy loss caused by model quantization. Object detection combines the two tasks of object localization and object classification, so its overall complexity is relatively high and its accuracy is affected more by quantization. Ordinary offline quantization cannot change the numerical distribution of activation values; it can only adapt the quantization scale to that distribution. When activation values are unevenly distributed, the quantization error of offline quantization becomes large.

3) Quantization training needs to modify the training code, which is complex and technically difficult

Compared with offline quantization (post-training quantization), quantization training reduces the accuracy drop. During training, the quantization-aware training method continuously adjusts the distribution of activation values to make it more suitable for quantization. However, the cost of quantization training is relatively high, in two respects. On the one hand, the labor cost is high: implementing quantization training requires modifying the model network and training code to insert simulated quantization operations. On the other hand, the time cost is high: the full training set must be loaded during training.

2

Automatic model compression tool: structured pruning and quantization analysis

ACT supports automatically combining compression algorithms for NLP models. ACT inspects the model structure; if it is a Transformer-type model, it automatically selects "structured pruning" and "quantization" and applies them in series. A technical breakdown of these two modules follows:

1) The structured pruning technique consists of the following four steps:

  • Construct the teacher model: load the inference model and copy it in memory; the copy serves as the teacher model.

  • Construct the structured pruning model: reorder the original model's parameters and attention heads by importance so that the more important ones come first, then structurally prune the model by removing a proportion of the less important parameters and attention heads. The pruned model is used as the student model for compression training.

  • Add the distillation loss: automatically analyze the model structure and use the output of the last operator containing trainable parameters as the distillation node.

  • Distillation training: the output of the original model supervises the output of the structurally pruned model during pruning training, completing the overall compression process.
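
A minimal sketch of the soft-label distillation loss used in the last two steps (the temperature and tensor shapes are illustrative; ACT chooses the distillation nodes automatically as described above):

import paddle
import paddle.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=1.0):
    # Soft-label distillation: the student matches the teacher's softened distribution
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, axis=-1)
    log_student = F.log_softmax(student_logits / t, axis=-1)
    return F.kl_div(log_student, soft_teacher, reduction="mean") * (t * t)

student_logits = paddle.randn([8, 10])
teacher_logits = paddle.randn([8, 10])
loss = distill_loss(student_logits, teacher_logits)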


Figure: structured pruning combined with distillation to implement the pruning operation

2) The quantization technique:

  • Automatic selection of the quantization strategy: ACT includes two quantization strategies, offline quantization and quantization training, and chooses between them automatically when quantizing an NLP model. It first runs a small amount of offline quantization; if the accuracy loss is large, it switches to compressing the model with distillation-based quantization training; if the accuracy loss is small, it quantizes via an offline-quantization hyperparameter search.


Figure: quantization strategy selection for NLP models

  • Distillation quantization training consists of the following three steps (this technique is often used for CV tasks):

a) Construct the teacher model: load the inference model file and copy the inference model in memory; the copy serves as the teacher model in knowledge distillation, and the original model serves as the student model.

b) Add the loss: automatically analyze the model structure and find a layer suitable for adding the distillation loss, usually the last layer with trainable parameters. For example, if a detection model's head has multiple branches, the last conv of each branch is used as a distillation node.

c) Distillation training: the teacher model supervises the sparse training or quantization training of the original model through the distillation loss, completing the model compression process.


Figure: animation of the quantization distillation training process

ACT also supports more capabilities, including offline-quantization hyperparameter search, automatic algorithm combination, and hardware awareness, to meet the various compression needs of CV and NLP models. For details on these capabilities and the application of ACT in more scenarios, please refer to the introduction on the automatic compression tool's home page.


Such a useful project deserves a Star; everyone is welcome to try it out!

https://github.com/PaddlePaddle/PaddleSlim/tree/develop/example/auto_compression

06

Future work outlook

The ACT automatic compression tool will support automatic compression of more AI models (Transformer, FastSpeech2, etc.). We will continue to upgrade ACT to further reduce post-compression accuracy loss, improve compression efficiency, and verify structured pruning and unstructured sparsity in more scenarios, aiming for the best possible compression and acceleration experience. The ACT automatic compression tool will also support more deployment paths, including the various inference backends in FastDeploy such as Paddle Inference, Paddle Lite, and ONNX Runtime, to further ease the production deployment of AI models.

[Description of table metrics]

Test environment and supplementary notes: mAPval in the tables refers to the metric reported in the paper corresponding to each model. For example, YOLOv5 is tested on the COCO test set, and MobileNetV3 on the ImageNet dataset.

[More exciting live broadcast recommendations]



Origin blog.csdn.net/u014333051/article/details/127699009