Q-YOLO

Efficient Inference with Q-YOLO for Real-time Object Detection

Real-time object detection plays a vital role in various computer vision applications. However, deploying real-time object detectors on resource-constrained platforms poses challenges due to high computational and memory requirements. This paper describes a low-bit quantization method for building an efficient one-stage detector, called Q-YOLO, which effectively addresses the performance degradation caused by the imbalanced activation distribution in traditionally quantized YOLO models. Q-YOLO introduces a fully end-to-end post-training quantization (PTQ) pipeline with a well-designed unilateral histogram-based (UH) activation quantization scheme, which determines the maximum truncation value via histogram analysis by minimizing the mean squared error (MSE) of quantization.

Extensive experiments on the COCO dataset demonstrate the effectiveness of Q-YOLO, outperforming other PTQ methods while achieving a more favorable balance between accuracy and computational cost. This research helps to promote efficient deployment of object detection models on resource-constrained edge devices, enabling real-time detection while reducing computation and memory overhead.

Real-time object detection is a key component in various computer vision applications such as multi-object tracking, autonomous driving, and robotics. The development of real-time object detectors, especially YOLO-based detection, has achieved remarkable performance in terms of accuracy and speed.

For example, the YOLOv7-E6 object detector achieves 55.9% mAP on COCO 2017, outperforming both the transformer-based detector SwinL + Cascade Mask R-CNN and the convolution-based detector ConvNeXt-XL + Cascade Mask R-CNN in both speed and accuracy. Despite their success, the computational cost during inference remains a challenge for real-time object detectors on resource-constrained edge devices such as mobile CPUs or GPUs, limiting their practical use. Substantial efforts have been made in network compression to enable efficient online inference, including enhanced network design, network architecture search, network pruning, and network quantization. Quantization, in particular, has become very popular for deployment on AI chips because it represents networks in a low-bit format. There are two mainstream quantization approaches: quantization-aware training (QAT) and post-training quantization (PTQ). Although QAT generally achieves better results than PTQ, it requires training and optimizing all model parameters during quantization. The need for pre-training data and massive GPU resources makes QAT difficult to execute. PTQ, on the other hand, is a more practical method for quantizing real-time object detectors.

To examine low-bit quantization for real-time object detection, we first establish a PTQ baseline using YOLOv5, a state-of-the-art object detector. Through empirical analysis on the COCO 2017 dataset, it is observed that the performance after quantization drops significantly, as shown in the table above. For example, 4-bit quantized YOLOv5s with Percentile achieves only 7.0% mAP, resulting in a performance gap of 30.4% compared to the original real-valued model.

The performance drop of quantized YOLO is found to be attributable to an imbalance in the activation distribution. As shown in the graph below, activation values concentrate heavily near the lower bound, while occurrences above zero fall off sharply. With a fixed truncation scheme such as MinMax, representing activation values that occur with extremely low probability consumes a considerable share of the limited integer bit width, leading to further information loss.

In view of the above issues, we introduce Q-YOLO, a fully end-to-end PTQ quantization architecture for real-time object detection, as shown in the figure below. Q-YOLO quantizes the backbone, neck, and head modules of the YOLO model, while using standard MinMax quantization for the weights. To address the problem of unbalanced activation distributions, a new method called unilateral histogram-based (UH) activation quantization is introduced. UH iteratively determines, via the histogram, the maximum truncation value that minimizes the quantization error. This technique significantly reduces calibration time, effectively resolves quantization-induced discrepancies, and keeps activation quantization stable. By reducing the information loss in activation quantization, accurate and reliable low-bit real-time object detection performance is ensured.

Network quantization process. We first review the main steps of the post-training quantization (PTQ) process. First, the network is either trained with full-precision, floating-point weights and activations, or provided as a pretrained model. Subsequently, the numerical representations of weights and activations are appropriately transformed for quantization. Finally, the fully quantized network is deployed on integer-arithmetic hardware or emulated on GPUs, enabling efficient inference with reduced memory and compute requirements while maintaining a reasonable level of accuracy.
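As a concrete illustration of the transformation step above, here is a minimal NumPy sketch of uniform affine quantize/dequantize (the toy tensor and grid parameters are illustrative, not from the paper):

```python
import numpy as np

def quantize(x, scale, zero_point, num_bits=8):
    # Map float values onto the unsigned integer grid [0, 2^b - 1].
    qmin, qmax = 0, 2 ** num_bits - 1
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int64)

def dequantize(q, scale, zero_point):
    # Map grid indices back to approximate float values.
    return scale * (q.astype(np.float64) - zero_point)

# Toy tensor and a grid chosen to cover roughly [-1, 2].
x = np.array([-1.0, 0.0, 0.5, 2.0])
scale, zero_point = 3.0 / 255, 85
x_hat = dequantize(quantize(x, scale, zero_point), scale, zero_point)
# Within the covered range, the round trip is accurate to half a step.
```

Inference frameworks fold these two operations into integer kernels; the float round trip here only emulates the numerical effect.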

Quantization range setting. Quantization range setting is the process of establishing the upper and lower clipping thresholds of the quantization grid, denoted u and l, respectively. The key trade-off in range setting is the balance between two types of error: clipping error and rounding error.

Clipping error occurs when data is truncated to fit within the predefined grid limits; such truncation loses information and reduces the precision of the quantized representation. Rounding error, on the other hand, arises from the inaccuracy introduced by the rounding operation itself.
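In standard uniform quantization (a sketch of the usual definitions; the paper's exact notation may differ), a value x is quantized and dequantized as

x̂ = s · clamp(round(x / s), l / s, u / s),  with step size s = (u − l) / (2^b − 1)

for b bits. Values outside [l, u] incur clipping error |x − x̂| that grows with the distance to the nearer threshold, while values inside the range incur at most s/2 of rounding error. Widening the range therefore reduces clipping at the cost of a larger step size, and hence more rounding error.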

These errors propagate through the network and affect the overall accuracy of the quantized model. The following methods provide different trade-offs between the two quantities.

MinMax

In the experiments, the MinMax method is used for weight quantization, where the clipping thresholds are simply the tensor extremes: l_x = min(x) and u_x = max(x).

This way there are no clipping errors. However, this method is sensitive to outliers since strong outliers may cause excessive rounding errors.
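A minimal sketch of MinMax range setting, with toy numbers (not from the paper) showing how a single outlier inflates the step size and thus the rounding error on all the small values:

```python
import numpy as np

def minmax_range(x):
    # MinMax range setting: thresholds are the tensor extremes,
    # so no value is ever clipped.
    return float(x.min()), float(x.max())

w = np.array([-0.4, -0.1, 0.0, 0.2, 6.0])  # 6.0 acts as a strong outlier
l_x, u_x = minmax_range(w)
step = (u_x - l_x) / (2 ** 8 - 1)
# Without the outlier the range would be [-0.4, 0.2] and the 8-bit step
# roughly 10x smaller, i.e. much finer resolution for the typical weights.
```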

Mean Squared Error (MSE)

One way to alleviate the problem of large outliers is to employ MSE-based range setting. In this method, l_x and u_x are chosen to minimize the mean squared error (MSE) between the original tensor and the quantized tensor.
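A hedged sketch of MSE-based range setting; the text does not specify the exact search procedure, so this version simply grid-searches shrink factors of the MinMax range and keeps the candidate with the lowest error:

```python
import numpy as np

def quant_dequant(x, l, u, num_bits=8):
    # Uniform quantization to the range [l, u], then dequantization.
    s = (u - l) / (2 ** num_bits - 1)
    q = np.clip(np.round((x - l) / s), 0, 2 ** num_bits - 1)
    return l + s * q

def mse_range(x, num_bits=8, num_steps=100):
    # Grid-search shrink factors of the MinMax range; alpha = 1.0
    # recovers plain MinMax, smaller alpha trades clipping for resolution.
    best_l, best_u, best_err = x.min(), x.max(), np.inf
    for alpha in np.linspace(0.01, 1.0, num_steps):
        l, u = alpha * x.min(), alpha * x.max()
        err = np.mean((x - quant_dequant(x, l, u, num_bits)) ** 2)
        if err < best_err:
            best_l, best_u, best_err = l, u, err
    return best_l, best_u

# Example: at 4 bits, a tensor with one strong outlier benefits
# from a tightened range.
rng = np.random.RandomState(0)
w = np.concatenate([rng.randn(1000), [20.0]])
l_mse, u_mse = mse_range(w, num_bits=4)
```

Because the MinMax range itself is among the candidates, the result is never worse than MinMax in MSE terms.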

Unilateral Histogram-based (UH) Activation Quantization

To address the problem of unbalanced activation values, a new method called unilateral histogram-based (UH) activation quantization is proposed. First, the activation values after forward propagation on a calibration dataset are studied empirically. As shown in the plot above, the values concentrate around the lower bound, while occurrences above zero are significantly rarer. Further analysis of the activation values reveals an empirical lower limit of -0.2785. This phenomenon can be attributed to the widespread use of the SiLU (Swish) activation function in the YOLO series, whose minimum value is approximately -0.2785.
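A sketch of the UH idea under stated assumptions: Q-YOLO evaluates candidate cutoffs against a histogram of the calibration activations, whereas this simplified version scores the raw activation values directly; the fixed lower clip value -0.2785 is the SiLU minimum noted above:

```python
import numpy as np

def uh_calibrate(acts, num_bits=8, num_candidates=128):
    # Unilateral search: the lower clip value is fixed at -0.2785 (the
    # minimum of SiLU), and only the upper cutoff u is searched, keeping
    # the candidate with the lowest mean squared quantization error.
    lower = -0.2785
    qmax = 2 ** num_bits - 1
    best_u, best_err = acts.max(), np.inf
    for u in np.linspace(acts.max() / num_candidates, acts.max(), num_candidates):
        s = (u - lower) / qmax
        q = np.clip(np.round((acts - lower) / s), 0, qmax)
        err = np.mean((acts - (lower + s * q)) ** 2)
        if err < best_err:
            best_u, best_err = u, err
    return lower, best_u

# SiLU-shaped calibration activations: x * sigmoid(x) never drops below ~-0.2785.
x = np.random.RandomState(1).randn(10000) * 2.0
acts = x / (1.0 + np.exp(-x))
lower, upper = uh_calibrate(acts, num_bits=8)
```

Searching only one side is what makes calibration fast: the asymmetric lower tail is known in advance, so the search space is halved.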

Experimentation and Visualization

Comparison of symmetric versus asymmetric activation quantization. Asymmetric denotes an asymmetric activation quantization scheme, while symmetric denotes symmetric quantization of the activations. To verify the actual acceleration benefits of the quantization scheme, we conducted inference speed tests on GPU and CPU platforms. For the GPU, we chose the commonly used NVIDIA RTX 4090 and NVIDIA Tesla T4, which are typically used for inference tasks in computing centers.

Due to limited CPU resources, only Intel CPUs were tested: the i7-12700H and the i9-10900, both x86. For deployment, TensorRT and OpenVINO were chosen. The whole process consists of converting the weights from the torch framework into ONNX models with QDQ nodes and then deploying them to the specific inference framework. The inference mode is single-image serial inference with an image size of 640x640. Since most current inference frameworks only support symmetric 8-bit quantization, a symmetric 8-bit scheme had to be chosen, which costs only a very small drop in accuracy compared to the asymmetric scheme. As shown in the table below, the speedup is significant, especially for the larger YOLOv7 model, where the GPU speedup over the full-precision model exceeds 3x. This shows that applying quantization to real-time detectors can bring substantial acceleration.


Origin blog.csdn.net/qq_29788741/article/details/131848337