Using OpenVINO to implement INT8 quantization and inference acceleration for the RT-DETR model

Author: Intel Edge Computing Innovation Ambassador Yan Guojin

RT-DETR is an improvement built on the DETR architecture: a real-time, end-to-end detector that achieves more efficient training and inference through a series of new techniques and algorithms. In previous articles, "Deploying the RT-DETR model based on the OpenVINO™ Python API | Developer Practice", "Deploying the RT-DETR model based on the OpenVINO™ C++ API | Developer Practice", and "Deploying the RT-DETR model based on the OpenVINO™ C# API | Developer Practice", we walked through deploying the RT-DETR model with the OpenVINO™ Python, C++ and C# APIs, covering deployment both with and without post-processing included in the model, which provides a good reference for anyone using the RT-DETR model.

However, in practical testing we found that when running the RT-DETR model with OpenVINO™ on the CPU, the inference speed tops out at roughly 3 to 4 FPS, which is far from enough for video prediction. Since the latest OpenVINO™ release, 2023.1.0, does not yet support the RT-DETR model's operators on the GPU, we did not test on the iGPU in the previous articles. To improve inference speed, in this article we use OpenVINO to perform INT8 quantization of the RT-DETR model and accelerate inference through model optimization; in addition, with guidance from OpenVINO engineers, we modify the OpenVINO source code, recompile the official libraries, and thereby enable GPU support for the RT-DETR model.

All the code used in this project has been open-sourced on GitHub and is collected in the OpenVINO-CSharp-API project. The project directory link is:

https://github.com/guojin-yan/OpenVINO-CSharp-API/tree/csharp3.0/tutorial_examples

You can also access the RT-DETR project directly; the project link is:

https://github.com/guojin-yan/RT-DETR-OpenVINO.git

Chapter 1 Using OpenVINO to implement RT-DETR model INT8 quantization

Post-training model optimization is the use of special methods to transform the model into a more hardware-friendly representation without the need for retraining or fine-tuning. Currently the most popular and widely used method is INT8 quantization, which has the following advantages:

  1. It's easy to use.
  2. It won't affect accuracy much.
  3. It provides significant performance improvements.
  4. It works well with a lot of stock hardware since most of them support 8-bit computing natively.

INT8 quantization reduces the precision of model weights and activations to 8 bits, shrinking the model footprint by nearly 4x, lowering the memory bandwidth required for inference, and significantly improving inference speed. The quantization process is completed offline, before actual inference, and quantization with OpenVINO requires neither the training dataset nor the training code from the source deep-learning framework.
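
To make the idea concrete, the toy example below is an illustration only (a simple symmetric scheme with a single scale factor, not the exact per-layer algorithm NNCF applies); it shows how a float32 tensor maps to int8 and why the storage drops by roughly 4x:

import numpy as np

# Toy symmetric quantization: map float32 values to int8 with a single scale factor.
# Simplified illustration, not the exact per-layer scheme used by NNCF.
weights_fp32 = np.random.randn(4, 4).astype(np.float32)
scale = np.abs(weights_fp32).max() / 127.0                      # largest magnitude maps to 127
weights_int8 = np.clip(np.round(weights_fp32 / scale), -128, 127).astype(np.int8)
weights_restored = weights_int8.astype(np.float32) * scale      # approximate reconstruction

print(weights_fp32.nbytes, weights_int8.nbytes)                 # 64 bytes vs 16 bytes (~4x smaller)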

To make it easier to reproduce the INT8 quantization of the RT-DETR model, we provide a complete Notebook file that users can follow step by step. The full code for INT8 quantization of the RT-DETR model with OpenVINO has been uploaded to GitHub. The file link is:

https://github.com/guojin-yan/RT-DETR-OpenVINO/blob/master/optimize/openvino-convert-and-optimize-rt-detr.ipynb

To make the project easier to reproduce, a demonstration video has also been recorded and published on Bilibili. The video link is:

https://www.bilibili.com/video/BV11N411T7m5/

1.1  Neural Network Compression Framework (NNCF)

The Neural Network Compression Framework (NNCF) provides a post-training quantization API, available in Python, that is designed to reuse the model training or validation code usually available for models in source frameworks such as PyTorch or TensorFlow. The NNCF API is cross-framework and currently supports models in the following frameworks: OpenVINO, PyTorch, TensorFlow 2.x, and ONNX. Post-training quantization of models in the OpenVINO intermediate representation is currently the most mature in terms of supported methods and model coverage.

The NNCF API offers two ways to perform post-training INT8 quantization:

  1. Basic quantization: the simplest way to apply INT8 quantization to a model. It works for models in the OpenVINO, PyTorch, TensorFlow 2.x, and ONNX frameworks, and only a representative calibration dataset is required.
  2. Quantization with accuracy control: an advanced flow that applies INT8 quantization while keeping the accuracy metric under control through a validation function. Currently only OpenVINO models are supported; in addition to the calibration dataset, a validation dataset is required to compute the accuracy metric. A minimal sketch of both entry points follows this list.
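
As a preview, the sketch below shows how the two entry points are invoked. It is a sketch only: det_ov_model and quantization_dataset are the objects built in the sections that follow, while validation_dataset and validation_fn are placeholders you would supply yourself for the accuracy-controlled variant.

import nncf

# 1. Basic quantization: only a representative calibration dataset is required.
quantized_model = nncf.quantize(det_ov_model, quantization_dataset)

# 2. Quantization with accuracy control (OpenVINO models only): additionally needs
#    a validation dataset and a validation function that returns the metric value.
#    validation_dataset and validation_fn are placeholders for your own objects.
# quantized_model = nncf.quantize_with_accuracy_control(
#     det_ov_model,
#     calibration_dataset=quantization_dataset,
#     validation_dataset=validation_dataset,
#     validation_fn=validation_fn,
#     max_drop=0.01,      # tolerated drop of the accuracy metric
# )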

1.2 Preparing the calibration dataset

In this experiment we only perform basic quantization, so only a calibration dataset needs to be prepared. The RT-DETR pre-trained models are trained on the COCO dataset, so we use the COCO validation split as the calibration data. To make building this dataset easier, we use the API provided by the ultralytics framework.

1.2.1 Downloading the COCO validation dataset

The COCO validation dataset can be downloaded directly from the official website or with the following code:

from pathlib import Path
from zipfile import ZipFile

# download_file is the helper used throughout the OpenVINO notebooks (notebook_utils module).
from notebook_utils import download_file

DATA_URL = "http://images.cocodataset.org/zips/val2017.zip"
LABELS_URL = "https://github.com/ultralytics/yolov5/releases/download/v1.0/coco2017labels-segments.zip"
CFG_URL = "https://raw.githubusercontent.com/ultralytics/ultralytics/8ebe94d1e928687feaa1fee6d5668987df5e43be/ultralytics/datasets/coco.yaml"
CACHE_URL = "https://github.com/guojin-yan/RT-DETR-OpenVINO/releases/download/Model2.0/val2017.cache"

OUT_DIR = Path('./datasets')

DATA_PATH = OUT_DIR / "val2017.zip"
LABELS_PATH = OUT_DIR / "coco2017labels-segments.zip"
CFG_PATH = OUT_DIR / "coco.yaml"
CACHE_PATH = OUT_DIR / "coco/labels/val2017.cache"

# Download the images, labels, dataset config and pre-built label cache.
download_file(DATA_URL, DATA_PATH.name, DATA_PATH.parent)
download_file(LABELS_URL, LABELS_PATH.name, LABELS_PATH.parent)
download_file(CFG_URL, CFG_PATH.name, CFG_PATH.parent)

# Unpack the archives only once.
if not (OUT_DIR / "coco/labels").exists():
    with ZipFile(LABELS_PATH, "r") as zip_ref:
        zip_ref.extractall(OUT_DIR)
    with ZipFile(DATA_PATH, "r") as zip_ref:
        zip_ref.extractall(OUT_DIR / 'coco/images')

download_file(CACHE_URL, CACHE_PATH.name, CACHE_PATH.parent)

1.2.2 The Validator wrapper

The YOLOv8 model repository uses a Validator wrapper that represents the accuracy-validation pipeline. It creates the data loader and the evaluation metrics, and updates the metrics for each batch produced by the data loader. It is also responsible for data preprocessing and result post-processing. Since the RT-DETR model is likewise trained on the COCO dataset, for convenience we reuse the YOLOv8 environment to configure the data here.

A configuration must be provided when the validator class is initialized. We use the default settings, but you can override individual parameters to test on custom data. The YOLOv8 model exposes the ValidatorClass attribute, so we create the validator instance through the model.

# Import paths below follow the ultralytics version used in the notebook and may
# differ slightly between ultralytics releases.
from ultralytics import YOLO
from ultralytics.cfg import get_cfg
from ultralytics.data.utils import check_det_dataset
from ultralytics.utils import DEFAULT_CFG, ops

args = get_cfg(cfg=DEFAULT_CFG)
args.data = str(CFG_PATH)

# Use a YOLOv8n model only to obtain a pre-configured detection validator.
YOLO_MODEL = "yolov8n"
models_dir = Path('./models')
yolo_model = YOLO(models_dir / f'{YOLO_MODEL}.pt')

det_validator = yolo_model.ValidatorClass(args=args)
det_validator.data = check_det_dataset(args.data)
det_data_loader = det_validator.get_dataloader("./datasets/coco", 1)

# Configure COCO-specific evaluation settings.
det_validator.is_coco = True
det_validator.class_map = ops.coco80_to_coco91_class()
det_validator.names = yolo_model.model.names
det_validator.metrics.names = det_validator.names
det_validator.nc = yolo_model.model.model[-1].nc

1.2.3 Converting the dataset for quantization

In the previous step we used the Validator wrapper to create the validation dataset, but before it can be used for quantization it has to be converted. NNCF provides the nncf.Dataset() interface for this: it wraps the data loader from the Validator together with a transformation function that extracts the model input from each data item.

from typing import Dict
import nncf

def transform_fn(data_item: Dict):
    # Extract the preprocessed image tensor that will be fed to the model.
    input_tensor = det_validator.preprocess(data_item)['img'].numpy()
    return input_tensor

quantization_dataset = nncf.Dataset(det_data_loader, transform_fn)

1.3 Defining the model accuracy validation method

To observe how the model's prediction accuracy changes before and after quantization, we define a custom accuracy-test method here:

import numpy as np
import torch
import openvino as ov
from tqdm import tqdm
# As above, the exact import path of ConfusionMatrix depends on the ultralytics version.
from ultralytics.utils.metrics import ConfusionMatrix

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def rtdetr_result(preds_box, preds_score):
    # Convert the raw RT-DETR outputs (300 queries with normalized cxcywh boxes and
    # per-class logits) into [x1, y1, x2, y2, score, class_id] rows for the validator.
    results = []
    for i in range(300):
        scores = preds_score[0, i, :]
        score = sigmoid(np.max(np.array(scores)))
        if score < 0.0001:
            continue
        cx = preds_box[0, i, 0] * 640.0
        cy = preds_box[0, i, 1] * 640.0
        w = preds_box[0, i, 2] * 640.0
        h = preds_box[0, i, 3] * 640.0
        result = [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h, score, np.argmax(scores)]
        results.append(result)
    return [torch.tensor(results)]

def test(model: ov.Model, core: ov.Core, data_loader: torch.utils.data.DataLoader, validator, num_samples: int = None):
    # Reset the validator state before a new evaluation run.
    validator.seen = 0
    validator.jdict = []
    validator.stats = []
    validator.batch_i = 1
    validator.confusion_matrix = ConfusionMatrix(nc=validator.nc)
    compiled_model = core.compile_model(model)
    for batch_i, batch in enumerate(tqdm(data_loader, total=num_samples)):
        if num_samples is not None and batch_i == num_samples:
            break
        batch = validator.preprocess(batch)
        results = compiled_model(batch["img"])
        # RT-DETR has two outputs: predicted boxes and per-class scores.
        preds_box = torch.from_numpy(results[compiled_model.output(0)])
        preds_score = torch.from_numpy(results[compiled_model.output(1)])
        preds = rtdetr_result(preds_box, preds_score)
        validator.update_metrics(preds, batch)
    stats = validator.get_stats()
    return stats

1.4 Model quantization

1.4.1 Model quantization implementation

First we set up the quantization call. The nncf.quantize() function provides the interface for model quantization; its main inputs are the OpenVINO model and the calibration dataset.
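
The call below takes det_ov_model, the OpenVINO IR of the RT-DETR model read earlier in the notebook; a minimal sketch of that read step (the IR file name is a placeholder for wherever your converted model lives) looks like:

import openvino as ov

core = ov.Core()
# Placeholder path: point this at the RT-DETR IR converted earlier in the notebook.
det_model_path = models_dir / "rtdetr_r50vd_6x_coco.xml"
det_ov_model = core.read_model(det_model_path)

With the model and the calibration dataset in hand, the quantization call itself is: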

quantized_det_model = nncf.quantize(
    det_ov_model,
    quantization_dataset,
    preset=nncf.QuantizationPreset.MIXED
)

With the above call, quantization of the RT-DETR model is complete.
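
In practice the quantized model is also serialized to disk so that benchmark_app and the deployment code can load it later; a minimal sketch (the output file name is an assumption, chosen only to match the int8_model_det_path variable used in the benchmark command below):

import openvino as ov

# Save the quantized model as OpenVINO IR; the file name here is illustrative.
int8_model_det_path = models_dir / "rtdetr_r50vd_6x_coco_int8.xml"
ov.save_model(quantized_det_model, str(int8_model_det_path), compress_to_fp16=False)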

1.4.2 Accuracy comparison before and after quantization

The previous section defined the accuracy-test method, so here we run the accuracy test before and after quantization. We evaluate on 1,000 samples of the validation set and print the prediction accuracy of both models:

NUM_TEST_SAMPLES = 1000  # number of validation images used for the comparison

fp_det_stats = test(det_ov_model, core, det_data_loader, det_validator, num_samples=NUM_TEST_SAMPLES)
int8_det_stats = test(quantized_det_model, core, det_data_loader, det_validator, num_samples=NUM_TEST_SAMPLES)

print("FP32 model accuracy")
print_stats(fp_det_stats, det_validator.seen, det_validator.nt_per_class.sum())

print("INT8 model accuracy")
print_stats(int8_det_stats, det_validator.seen, det_validator.nt_per_class.sum())
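
The print_stats helper used above comes from the notebook and is not reproduced in this article; a sketch of what it does (the metric keys follow the dictionary returned by the ultralytics validator and may vary between ultralytics versions) could look like:

def print_stats(stats: dict, total_images: int, total_objects: int):
    # Print the aggregate detection metrics returned by validator.get_stats().
    print(f"Images: {total_images}, labeled objects: {total_objects}")
    print(f"mAP@0.5      = {stats.get('metrics/mAP50(B)', float('nan')):.4f}")
    print(f"mAP@0.5:0.95 = {stats.get('metrics/mAP50-95(B)', float('nan')):.4f}")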

The printed results show the prediction accuracy of the model before and after quantization. To make the change easier to see, the accuracy values can also be plotted as a bar chart; it shows that the quantized model's prediction accuracy drops only slightly, which satisfies our requirements for quantization.

1.4.3 Speed comparison before and after quantization

OpenVINO provides the Benchmark App performance-testing tool, which lets developers quickly measure the performance of an OpenVINO model on different hardware platforms. Because the GPU operators in the current OpenVINO release do not yet support the RT-DETR model, we only run the benchmark on the CPU here.

First, test the model before quantization by running the following command:

!benchmark_app -m {det_model_path} -d {device.value} -api async -shape "[1,3,640,640]"

The output shows the inference speed of the FP32 model on the CPU: it reaches only 2.67 FPS. Next, test the quantized model with the following command:

!benchmark_app -m {int8_model_det_path} -d {device.value} -api async -shape "[1,3,640,640]" -t 15

The output shows the inference speed of the quantized model on the CPU: it reaches 8.82 FPS, roughly a 3.3x speed-up over the FP32 model.

Note: the quantization walkthrough above only covers a few key steps. For the complete quantization procedure and the order in which it is implemented, please refer to the Notebook linked above.

Chapter 2 Accelerating RT-DETR inference on the Intel iGPU and speed comparison

2.1 Implementing RT-DETR inference acceleration on the Intel iGPU

Because the GPU operators in the current OpenVINO release do not support the RT-DETR model, iGPU-accelerated inference cannot be performed directly. To improve inference speed, we submitted an issue on the OpenVINO GitHub repository, modified the source code following the guidance given there, recompiled OpenVINO from source, and updated the dynamic-link-library references; this enables accelerated inference on the iGPU. The issue link is:

[Bug]: There was an error compiling the RT-DETR model on the GPU platform using OpenVINO. · Issue #20871 · openvinotoolkit/openvino (github.com)

When using it, refer to the official solution in the issue, modify the source code accordingly, and then build OpenVINO from source following the official build instructions to obtain the modified dynamic link libraries.

For convenience, we provide pre-built Windows dynamic link libraries in the project, which can be downloaded as follows:

wget https://github.com/guojin-yan/RT-DETR-OpenVINO/releases/download/Model2.0/openvino_new_build.rar

After downloading the pre-built dynamic link libraries, develop as described in the previous article "Deploying the RT-DETR model based on the OpenVINO™ C++ API | Developer Practice", replace the dynamic-link-library references with the files downloaded here, and set the inference device to GPU; GPU inference can then be performed.
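
Once the rebuilt runtime libraries are in place, selecting the iGPU is just a matter of compiling the model for the "GPU" device. For example, in Python (the IR path is a placeholder):

import openvino as ov

core = ov.Core()
model = core.read_model("rtdetr_r50vd_6x_coco.xml")   # placeholder IR path
compiled_model = core.compile_model(model, "GPU")      # "GPU" selects the Intel iGPU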

2.2 CPU and iGPU inference speed comparison

In the previous step we enabled accelerated inference of the RT-DETR model on the iGPU with OpenVINO, and in the previous chapter we implemented INT8 quantization. Finally, we run a comparison experiment: the FP32 and INT8 models are inferred on both the CPU and the iGPU to measure inference speed. To avoid chance results, each reported time is the average of 100 inference runs, as shown in the following tables:

CPU: 11th Gen Intel Core i7-1165G7

iGPU: Intel Iris Xe Graphics (integrated)

rtdetr_r50vd_6x_coco

| Inference device | Precision | Load model (ms) | Image processing (ms) | Data loading (ms) | Model inference (ms) | Result processing (ms) | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CPU | FP32 | 712.51 | 3.28 | 1.67 | 360.52 | 0.75 | 2.77 |
| CPU | INT8 | 1058.57 | 2.98 | 1.57 | 123.07 | 0.8 | 8.13 |
| iGPU | FP32 | 623.3 | 3.38 | 1.15 | 133.54 | 0.48 | 7.49 |
| iGPU | INT8 | 674.61 | 3.2 | 1.2 | 83.14 | 0.48 | 12.03 |

rtdetr_r50vd_m_6x_coco

| Inference device | Precision | Load model (ms) | Image processing (ms) | Data loading (ms) | Model inference (ms) | Result processing (ms) | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CPU | FP32 | 513.9 | 3.16 | 1.62 | 269.58 | 0.84 | 3.71 |
| CPU | INT8 | 695.34 | 3.04 | 1.47 | 86.65 | 0.75 | 11.54 |
| iGPU | FP32 | 671.21 | 3.39 | 1.13 | 107.73 | 0.5 | 9.28 |
| iGPU | INT8 | 636.24 | 3.09 | 1.99 | 59.09 | 0.41 | 16.92 |

rtdetr_r34vd_6x_coco

| Inference device | Precision | Load model (ms) | Image processing (ms) | Data loading (ms) | Model inference (ms) | Result processing (ms) | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CPU | FP32 | 519.14 | 3.29 | 1.65 | 236.62 | 0.84 | 4.23 |
| CPU | INT8 | 756.66 | 2.91 | 1.47 | 82.99 | 0.74 | 12.05 |
| iGPU | FP32 | 664.2 | 3.1 | 2.19 | 80.36 | 0.45 | 12.44 |
| iGPU | INT8 | 679.55 | 3.21 | 2.22 | 56 | 0.46 | 17.86 |

rtdetr_r18vd_6x_coco

| Inference device | Precision | Load model (ms) | Image processing (ms) | Data loading (ms) | Model inference (ms) | Result processing (ms) | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CPU | FP32 | 405.11 | 3.29 | 1.66 | 166.65 | 0.77 | 6.00 |
| CPU | INT8 | 588.84 | 2.86 | 1.45 | 60.16 | 0.71 | 16.62 |
| iGPU | FP32 | 630.04 | 4.19 | 2.61 | 63.69 | 0.48 | 15.70 |
| iGPU | INT8 | 640.31 | 2.97 | 1.91 | 42.86 | 0.49 | 23.33 |

Test hardware: the CPU is an 11th Gen Intel Core i7-1165G7 and the iGPU is its integrated Intel Iris Xe Graphics. Plotting the inference times and FPS for the different devices and precisions shows clearly that quantization and iGPU acceleration each bring roughly a 1x to 3x improvement in inference speed on every device tested. At best, inference reaches about 23 FPS.
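
The FPS comparison can be reproduced directly from the last column of the tables above; a minimal matplotlib sketch using those numbers:

import matplotlib.pyplot as plt
import numpy as np

# FPS values taken from the tables above.
models = ["r50vd", "r50vd_m", "r34vd", "r18vd"]
fps = {
    "CPU FP32":  [2.77, 3.71, 4.23, 6.00],
    "CPU INT8":  [8.13, 11.54, 12.05, 16.62],
    "iGPU FP32": [7.49, 9.28, 12.44, 15.70],
    "iGPU INT8": [12.03, 16.92, 17.86, 23.33],
}

x = np.arange(len(models))
width = 0.2
for i, (label, values) in enumerate(fps.items()):
    plt.bar(x + (i - 1.5) * width, values, width, label=label)
plt.xticks(x, models)
plt.ylabel("FPS")
plt.title("RT-DETR inference speed with OpenVINO (i7-1165G7)")
plt.legend()
plt.show()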

Chapter 3 Summary

In this article we used NNCF, the model-optimization tool for OpenVINO, to implement INT8 quantization of the RT-DETR model, achieving roughly a 3x to 4x inference speed-up with only a minimal loss of accuracy. The quantized model is about a quarter of the original size, which not only speeds up inference but also reduces the memory needed at inference time, which matters greatly for deployment on edge devices.

In addition, with guidance from the OpenVINO GitHub community, we solved the problem of the RT-DETR model failing to run on the Intel iGPU and enabled accelerated inference on iGPU devices with OpenVINO, improving inference speed by a further 1x to 3x.
