Developer Practice | Segment Everything? Segment Anything Quantization and Acceleration in Practice

Author: Yang Yicheng

I. Introduction

"Split everything, and everyone loses their jobs together!"——Recently, such a sentence has become popular on social media! This is the Segment Anything Model ("SAM" for short). What exactly are SAMs? What functions does it have? Is it really that powerful? Let's learn more through this article together!

SAM is a powerful artificial intelligence image segmentation model launched by Meta AI Lab. It can automatically identify which image pixels belong to an object and apply processing to each object in an image. It can be widely used for analyzing scientific images, editing photos, and more.

A complete SAM application consists of an image encoder plus a mask decoder and a prompt encoder, and these can be exported as independent static models. Most of the compute load and inference latency are concentrated in the image encoder, so improving the encoder's execution efficiency has become one of the main optimization directions for SAM applications.

Figure: SAM model task pipeline
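To make this pipeline concrete, the following minimal sketch shows how the two exported static models could be driven with the OpenVINO™ Python API. The file names, input shapes, and the decoder's input names are illustrative assumptions, not the notebook's exact code:

import numpy as np
import openvino.runtime as ov

core = ov.Core()
# Hypothetical file names for the two exported static models
encoder = core.compile_model("sam_image_encoder.xml", "CPU")
decoder = core.compile_model("sam_mask_predictor.xml", "CPU")

# Stand-in for a preprocessed 1024x1024 NCHW image tensor
preprocessed_image = np.zeros((1, 3, 1024, 1024), dtype=np.float32)

# Heavy step: the image encoder runs once per image to produce embeddings
image_embeddings = encoder([preprocessed_image])[encoder.output(0)]

# Light step, repeated per user prompt: prompt encoder + mask decoder.
# Input names are assumptions; the real exported decoder may take more
# inputs (e.g. mask_input, orig_im_size).
masks = decoder({
    "image_embeddings": image_embeddings,
    "point_coords": np.array([[[512.0, 512.0]]], dtype=np.float32),
    "point_labels": np.array([[1.0]], dtype=np.float32),
})

Because the heavy encoder runs only once per image while prompts are cheap to re-run, quantizing the encoder is where acceleration pays off most.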

This article focuses on demonstrating how to use the NNCF model compression tool in OpenVINO™ to quantize and compress the SAM encoder, achieving a performance improvement on the CPU.

II. Introduction to quantization

Before getting hands-on, we need to introduce the concept of quantization. Quantization maps the representation range of model parameters from FP32 down to INT8 or INT4 without changing the model structure, using a smaller bit width to represent the same information. This compresses the model size and reduces memory consumption. At the same time, while the network executes, the runtime automatically dispatches to instruction sets or kernel functions optimized for low-bit data on the hardware platform, improving performance.

Figure: Representation bit width of different precision data
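As a concrete illustration (an added example, not from the original article), the sketch below applies the basic affine quantization formula, q = round(x / scale) + zero_point, to a toy FP32 tensor:

import numpy as np

def quantize_int8(x: np.ndarray):
    # Affine (asymmetric) quantization: map [x.min(), x.max()] onto [-128, 127]
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-128.0 - x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    # Recover an approximation of the original FP32 values
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([0.1, -1.2, 3.4, 0.0], dtype=np.float32)
q, scale, zp = quantize_int8(x)
print(q)                         # 8-bit codes
print(dequantize(q, scale, zp))  # values close to x, with small rounding error

Each value now occupies one byte instead of four; the small rounding error is the accuracy cost that calibration and accuracy control (described below) are designed to manage.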

Intel's AVX512 VNNI extended instruction set compresses the INT8 matrix multiply-accumulate operation that originally required three clock cycles into a single clock cycle, and the latest AMX instruction set stacks multiple VNNI modules to multiply per-cycle performance yet again.

Figure: Instruction set optimization of INT8 matrix multiply-accumulate operations

III. NNCF post-training quantization mode

NNCF, short for Neural Network Compression Framework, is the solution dedicated to model compression and acceleration within the OpenVINO™ toolchain. It includes quantization, pruning, binarization, and other model compression algorithms, and it can be invoked in two modes: post-training quantization (PTQ) and quantization-aware training (QAT). Training-time compression requires the original training script and dataset, whereas post-training quantization can compress an already-trained model file directly, without a training script or labeled dataset. PTQ is also a new feature officially released with NNCF in OpenVINO™ 2023.0, and it takes only the following two steps:

1. Prepare the calibration dataset. The calibration data is only used to compute the range and distribution of values during quantization, so no label data is needed. For an image recognition task, for example, sending in roughly 200-300 image files is enough. We also need to define a DataLoader object and a transform_fn data conversion function: the DataLoader reads each element of the calibration dataset, and transform_fn converts each element it reads into a direct input for OpenVINO™ model inference.

import nncf
import torch

calibration_loader = torch.utils.data.DataLoader(...)

def transform_fn(data_item):
    # Drop the label: only the input tensor is needed for calibration
    images, _ = data_item
    return images

calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)

2. Run model quantization. First import the model object, then bind it to the calibration dataset with the nncf.quantize() interface to launch the quantization task. NNCF supports multiple model object types, including openvino.runtime.Model, torch.nn.Module, onnx.ModelProto, and tensorflow.Module.

model = ...  # an OpenVINO / ONNX / PyTorch / TF model object
quantized_model = nncf.quantize(model, calibration_dataset)

3. (Optional) Accuracy control mode. If the accuracy of the model exported by NNCF in default mode drops more than expected, we can complete post-training quantization with the accuracy control mode instead. In this mode we must additionally provide labeled test data, which is used to evaluate how strongly each layer of the model contributes to the accuracy loss during quantization (its sensitivity). Using that ranking, layers are rolled back to their original precision one by one until the model meets the expected accuracy. In this way we can compress the model size as much as possible while guaranteeing accuracy, striking a balance between performance and accuracy; a sketch follows the link below. For the specific method, please refer to:

https://docs.openvino.ai/nightly/quantization_w_accuracy_control.html
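For reference, a minimal sketch of this mode based on the NNCF documentation is shown below; the validation function body and the max_drop value are illustrative assumptions:

import nncf
import torch

# Accuracy control requires labeled data in addition to the calibration set
validation_loader = torch.utils.data.DataLoader(...)
validation_dataset = nncf.Dataset(validation_loader, transform_fn)

def validate(model, validation_data):
    # Hypothetical metric function: run inference over validation_data
    # and return a single accuracy score for the given model
    ...

quantized_model = nncf.quantize_with_accuracy_control(
    model,
    calibration_dataset=calibration_dataset,
    validation_dataset=validation_dataset,
    validation_fn=validate,
    max_drop=0.01,  # tolerate at most this metric drop (assumed value)
)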

IV. Segment Anything + NNCF in practice

Next, let's walk step by step through using NNCF's PTQ mode to quantize the SAM encoder.

Project address: 237-segment-anything.ipynb in the openvinotoolkit/openvino_notebooks repository on GitHub

1. Define the data loader

This example uses coco128 as the calibration dataset, which contains 128 images in .jpg format. When quantizing an ONNX or IR static model, the data loader must be a torch DataLoader class, so here we subclass torch.utils.data.Dataset to build a dataset class. It must contain a __getitem__ method for traversing each object in the dataset and a __len__ method for obtaining the number of objects; finally, we create the data loader with the torch.utils.data.DataLoader method.

from pathlib import Path

import cv2
import torch
import torch.utils.data as data

class COCOLoader(data.Dataset):
    def __init__(self, images_path):
        self.images = list(Path(images_path).iterdir())

    def __getitem__(self, index):
        # Read one image and convert OpenCV's BGR layout to RGB
        image_path = self.images[index]
        image = cv2.imread(str(image_path))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        return image

    def __len__(self):
        return len(self.images)

coco_dataset = COCOLoader(OUT_DIR / 'coco128/images/train2017')
calibration_loader = torch.utils.data.DataLoader(coco_dataset)

2. Define the data format conversion module

The next step is to define the data conversion module. We can call the previously defined preprocess_image function to complete the preprocessing. Note that a single data object returned through calibration_loader is a torch tensor, which the OpenVINO™ Python interface does not accept, so we must first cast it to numpy format.

def transform_fn(image_data):
    # Cast the torch tensor to numpy, then apply SAM's image preprocessing
    image = image_data.numpy()
    processed_image = preprocess_image(np.squeeze(image))
    return processed_image

calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)
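preprocess_image is defined earlier in the notebook; for readers following along, a simplified sketch of SAM's standard preprocessing is shown below. The normalization constants come from the official SAM repository, but treat the exact function body as an assumption rather than the notebook's code:

import cv2
import numpy as np

def preprocess_image(image: np.ndarray, target_length: int = 1024) -> np.ndarray:
    # Resize so the longest side equals SAM's expected input size (1024)
    scale = target_length / max(image.shape[:2])
    h = int(image.shape[0] * scale + 0.5)
    w = int(image.shape[1] * scale + 0.5)
    resized = cv2.resize(image, (w, h)).astype(np.float32)
    # Normalize with SAM's pixel mean / std (RGB order)
    mean = np.array([123.675, 116.28, 103.53], dtype=np.float32)
    std = np.array([58.395, 57.12, 57.375], dtype=np.float32)
    resized = (resized - mean) / std
    # Pad bottom / right edges up to a square 1024 x 1024 canvas
    padded = np.zeros((target_length, target_length, 3), dtype=np.float32)
    padded[:h, :w] = resized
    # HWC -> NCHW, with a leading batch dimension
    return padded.transpose(2, 0, 1)[None]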

3. Run NNCF quantization

To preserve the accuracy of the quantized model, here we use the original FP32 ONNX model as the input object rather than the FP16 IR model, then pass it to the nncf.quantize interface for quantization. The interface has two particularly important optional parameters, described after the code:

import nncf
from openvino.runtime import Core, serialize

core = Core()
# Load the FP32 ONNX model (rather than the FP16 IR) to preserve accuracy
model = core.read_model(onnx_encoder_path)
quantized_model = nncf.quantize(model,
                                calibration_dataset,
                                model_type=nncf.parameters.ModelType.TRANSFORMER,
                                preset=nncf.common.quantization.structs.QuantizationPreset.MIXED)
ov_encoder_path_int8 = "sam_image_encoder_int8.xml"
serialize(quantized_model, ov_encoder_path_int8)
  • model_type: the model category, used to enable a special quantization strategy; for Transformer-like models, NNCF prioritizes preserving accuracy.
  • preset: the quantization scheme. The default, PERFORMANCE, applies symmetric quantization to both weights and activations, which helps model performance. Here, to improve accuracy, we use the MIXED mode, which quantizes weights symmetrically and activations asymmetrically; it suits models containing non-ReLU or asymmetric activation layers.

Since the network structure of the SAM encoder is relatively complex, and quantization traverses the parameters of every layer several times, quantization takes comparatively long; please be patient. A hardware device with 32 GB or more of memory is recommended. If memory is insufficient, you can pass the subset_size=100 parameter to reduce the amount of calibration data appropriately, as in the variation below.
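For example, a variation of the call above that limits calibration to 100 samples (subset_size is an optional nncf.quantize parameter):

quantized_model = nncf.quantize(model,
                                calibration_dataset,
                                subset_size=100,  # fewer calibration samples, lower memory use
                                model_type=nncf.parameters.ModelType.TRANSFORMER,
                                preset=nncf.common.quantization.structs.QuantizationPreset.MIXED)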

4. Model accuracy comparison

Next, we compare the inference results of the INT8 and FP16 models:

Figure: prompt mode FP16 – INT8 result comparison

Figure: auto mode FP16 – INT8 result comparison

As can be seen, in both prompt and auto modes, the accuracy of the INT8 model is almost unchanged relative to the FP16 model.

Note: In auto mode, the mask will use randomly generated colors.

5. Performance comparison

Finally, we compare the performance figures using the benchmark_app tool that ships with OpenVINO™; an example invocation is sketched below, followed by the FP16 and INT8 reports:
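Assuming the IR files are named as generated above (the FP16 file name is an assumption), the two runs might look like:

benchmark_app -m sam_image_encoder.xml -d CPU       # FP16 encoder
benchmark_app -m sam_image_encoder_int8.xml -d CPU  # INT8 encoder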

[ INFO ] Execution Devices:['CPU']
[ INFO ] Count:            60 iterations
[ INFO ] Duration:         75716.93 ms
[ INFO ] Latency:
[ INFO ]    Median:        14832.33 ms
[ INFO ]    Average:       14780.77 ms
[ INFO ]    Min:           10398.47 ms
[ INFO ]    Max:           16725.65 ms
[ INFO ] Throughput:   0.79 FPS

Figure: Benchmark results (FP16)

[ INFO ] Execution Devices:['CPU']
[ INFO ] Count:            72 iterations
[ INFO ] Duration:         68936.14 ms
[ INFO ] Latency:
[ INFO ]    Median:        11281.87 ms
[ INFO ]    Average:       11162.87 ms
[ INFO ]    Min:           6736.09 ms
[ INFO ]    Max:           12547.48 ms
[ INFO ] Throughput:   1.04 FPS

Figure: Benchmark results (INT8)

As can be seen, on the CPU the INT8 model's throughput improved by about 30% compared with FP16, and the model size was compressed from the original 350 MB to under 100 MB.

V. Summary

Given SAM's excellent automatic segmentation capability, it is expected that more and more application scenarios will deploy this technology. During productization, developers often care most about the balance between performance and accuracy, and quantization is a comparatively cost-effective way to achieve it. By quantizing and compressing the Segment Anything encoder, the OpenVINO™ NNCF tool significantly improves the model's execution efficiency and reduces its footprint without affecting model accuracy.
