Developer Practice | How to use low-bit quantization technology to further improve large model inference performance


Author | Yang Yicheng

Typesetting | Li Qing


Low-bit quantization has long been one of the most effective optimizations for meeting the performance requirements of large language models (LLMs) at deployment time. This article explores how low-bit quantization helps improve LLM inference performance, and how the new version of OpenVINO™ strengthens its support for low-bit quantization technology.

Large model performance bottleneck

Compared with the growth in computation, the inference speed of large models is more strongly affected by memory bandwidth (memory bound), that is, by memory read/write efficiency. Large models have enormous parameter counts, and the volume of memory accesses far exceeds what the memory bandwidth can supply: the speed at which the model's weights can be read and written cannot keep up with the hardware's computational throughput for the operators, so the compute resources cannot be fully utilized and performance suffers.
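A back-of-the-envelope estimate makes this concrete. The parameter count, bandwidth, and precision figures below are illustrative assumptions, not measured values from this article:

# Rough estimate of per-token weight-read time for an LLM decode step.
# All numbers below are illustrative assumptions, not measurements.
params = 7e9                  # a 7B-parameter model
bytes_fp16 = 2                # bytes per weight at fp16
bytes_int4 = 0.5              # bytes per weight at int4
bandwidth = 50e9              # assumed ~50 GB/s memory bandwidth

# Each generated token must stream (roughly) all weights through memory once.
print(f"fp16: ~{params * bytes_fp16 / bandwidth:.2f} s per token just to read weights")
print(f"int4: ~{params * bytes_int4 / bandwidth:.2f} s per token just to read weights")

Under these assumptions the weight traffic alone dominates each decode step, which is why shrinking the weights pays off far more than adding compute.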


Figure: Comparison between memory bound and compute bound

Low-bit quantization technology

Low-bit quantization compresses model parameters from fp32/fp16 to a lower bit-width representation, shrinking the model size without changing the parameter count or noticeably affecting the accuracy of the model's output. This relieves the pressure of reading and writing data from cache and memory and thereby improves inference performance. Because the weight volume of a single layer in a large model is usually much larger than that layer's input data (activations), quantization schemes for large models often quantize only the weights (weight-only quantization) and leave the input data untouched; this achieves a good compression ratio while preserving the output quality as much as possible, offering the best quantization "cost-effectiveness".
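The sketch below shows the core idea of weight-only, per-group asymmetric int4 quantization in plain NumPy. It is an illustration of the concept under simple assumptions, not the NNCF or OpenVINO™ implementation:

# Weight-only, per-group asymmetric int4 quantization: weights are stored as
# 4-bit codes plus one scale/zero-point pair per group; activations are untouched.
import numpy as np

def quantize_weights_int4(w, group_size=64):
    rows, cols = w.shape                      # cols must be divisible by group_size
    g = w.reshape(rows, cols // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0            # 4 bits -> 16 levels (0..15)
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(g / scale) + zero_point, 0, 15).astype(np.uint8)
    return q, scale, zero_point

def dequantize_int4(q, scale, zero_point, shape):
    # At run time the weights are expanded back to floating point for the matmul.
    return ((q.astype(np.float32) - zero_point) * scale).reshape(shape)

w = np.random.randn(128, 256).astype(np.float32)
q, scale, zp = quantize_weights_int4(w)
w_hat = dequantize_int4(q, scale, zp, w.shape)
print("mean absolute error:", np.abs(w - w_hat).mean())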


Figure: Weight compression diagram

It has been shown that conventional int8 weight quantization has very little impact on the accuracy of large models. To push toward more aggressive precisions such as int4 and nf4, researchers have explored dedicated weight-quantization algorithms, the most representative being GPTQ. Put simply, GPTQ quantizes the parameters in a block one by one; after each parameter is quantized, the remaining unquantized parameters in the block are adjusted appropriately to compensate for the accuracy loss caused by quantization. GPTQ requires a calibration dataset to be prepared, so it is also a PTQ (Post-Training Quantization) technique.
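The toy sketch below illustrates only the error-compensation idea described above. Real GPTQ works block by block and derives the compensation from second-order (Hessian) statistics collected on the calibration set; the uniform redistribution used here is a deliberately crude stand-in:

# Toy illustration of sequential quantization with error compensation.
# Not the real GPTQ update rule: the rounding error of each weight is simply
# spread evenly over the weights that have not been quantized yet.
import numpy as np

def quantize_row_with_compensation(w_row, n_bits=4):
    w = w_row.astype(np.float64).copy()
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12        # one symmetric scale for the row
    for j in range(len(w)):
        q = np.clip(np.round(w[j] / scale), -qmax, qmax) * scale
        err = w[j] - q                            # error introduced by rounding w[j]
        w[j] = q
        remaining = len(w) - (j + 1)
        if remaining:
            w[j + 1:] += err / remaining          # crude stand-in for GPTQ's Hessian-weighted update
    return w

row = np.random.randn(128)
print("max deviation after compensation:", np.abs(row - quantize_row_with_compensation(row)).max())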

OpenVINO™ 2023.2 support for int4 models

Compared with version 2023.1, OpenVINO™ 2023.2 introduces comprehensive support for int4 models and the related quantization technology, mainly in two areas:

1. CPU and iGPU support native int4 model inference

The OpenVINO™ toolkit can now directly read int4 models quantized by NNCF, and it can also convert models quantized with HuggingFace's AutoGPTQ library so that they can be read and compiled. Since the current OpenVINO™ back-end hardware cannot operate directly on the int4 data format, the OpenVINO™ runtime dequantizes the int4 weights during model execution and performs the computation at FP16 or BF16 precision. In short: the model is stored at int4 precision and computed at fp16 precision, trading compute cost for savings in space and I/O and improving overall efficiency. This works precisely because the performance bottleneck of large models comes mainly from being memory bound: higher data read/write efficiency reduces the pressure on memory bandwidth and memory capacity.
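A minimal sketch of loading and compiling such a model with the OpenVINO™ Python API follows; the IR path is a hypothetical placeholder:

# Load an int4-compressed OpenVINO IR and compile it for the CPU.
# The weights stay int4 in the file and in memory; execution uses fp16/bf16 kernels.
import openvino as ov

core = ov.Core()
model = core.read_model("llm_int4.xml")           # hypothetical path to an int4 IR
compiled_model = core.compile_model(model, "CPU")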


Figure: Model structure after NNCF weight compression

2. The NNCF tool supports an int4 mixed-precision quantization strategy (Weights Compression)

The GPTQ method mentioned above is a data-driven quantization scheme that requires a calibration dataset to be prepared in advance, which can be done with HuggingFace's Transformers and AutoGPTQ libraries. To help developers shorten the compression time for LLM models and lower the barrier to quantization, NNCF introduced a weight-compression mode for int4 and nf4 precision in version 2.7.0. This is a data-free mixed-precision quantization algorithm: no calibration dataset is needed, and weight compression is applied only to the Linear and Embedding layers of the LLM. The entire process takes just one line of code:

compressed_model = compress_weights(model, mode=CompressWeightsMode.NF4, group_size=64, ratio=0.9)

Here, model is a PyTorch or OpenVINO™ model object; mode is the quantization mode, which can be CompressWeightsMode.NF4, CompressWeightsMode.INT4_ASYM/INT4_SYM, or other modes. To improve quantization efficiency, Weights Compression uses a grouped quantization strategy, so the group size is configured through group_size; for example, group_size=64 means that 64 channels share the same set of quantization parameters (zero point and scale). In addition, since data-free int4 quantization introduces some accuracy loss, Weights Compression also supports a mixed-precision strategy to balance model size and accuracy: the ratio value controls how many weights are compressed to int4, while the weights most sensitive to accuracy are kept in int8. For example, with ratio=0.9, 90% of the weights are represented in int4 and 10% in int8. Developers can tune this parameter based on the output quality of the quantized model.
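A slightly fuller version of the one-liner, with the surrounding imports and model I/O, might look like the sketch below; the file names are placeholders:

# Read an fp16 OpenVINO IR, apply data-free int4 weight compression with NNCF,
# and save the compressed model back to disk.
import openvino as ov
from nncf import CompressWeightsMode, compress_weights

core = ov.Core()
model = core.read_model("llm_fp16.xml")           # hypothetical fp16 IR

compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.INT4_SYM,            # or NF4 / INT4_ASYM
    group_size=64,                                # 64 channels share one scale/zero point
    ratio=0.9,                                    # ~90% of weights in int4, the rest in int8
)

ov.save_model(compressed_model, "llm_int4.xml")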

During quantization, NNCF searches layer by layer, comparing the fake-quantized weights with the original floating-point weights (https://github.com/openvinotoolkit/nncf/blob/5eee3bc293da2e94b30cb8dd19da9f20fce95f02/nncf/quantization/algorithms/weight_compression/openvino_backend.py#L409C5-L409C5) to measure the error that quantization would introduce in each layer; based on this ranking and the user-defined ratio value, the weights with relatively low loss are compressed to int4.
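Conceptually, the selection can be pictured as in the sketch below: score each layer with a simple quantization error, keep the most sensitive layers in int8, and compress the rest to int4 according to the ratio. This mirrors the idea only; the actual scoring criterion lives in the NNCF source linked above:

# Conceptual mixed-precision assignment: layers with the lowest estimated
# quantization error are compressed to int4, the rest stay in int8.
import numpy as np

def quantization_error(w, n_bits=4):
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    w_hat = np.round(w / scale) * scale           # fake-quantize and reconstruct
    return float(np.abs(w - w_hat).mean())

def assign_precisions(layers, ratio=0.9):
    """layers: {name: weight array}. Returns {name: 'int4' or 'int8'}."""
    errors = {name: quantization_error(w) for name, w in layers.items()}
    ordered = sorted(errors, key=errors.get)      # least sensitive layers first
    n_int4 = int(len(ordered) * ratio)
    return {name: ("int4" if i < n_int4 else "int8") for i, name in enumerate(ordered)}

layers = {f"linear_{i}": np.random.randn(256, 256) for i in range(10)}
print(assign_precisions(layers, ratio=0.9))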

Chinese language model practice

With the release of OpenVINO™ 2023.2, an int4 compression example for large language models has been added to the openvino_notebooks repository (https://github.com/OpenVINO-dev-contest/openvino_notebooks/tree/main/notebooks/254-llm-chatbot). This time it specifically includes examples for Chinese LLMs, covering the currently popular ChatGLM2 and Qwen models. In this notebook, developers can see how to export an OpenVINO™ IR model from a HuggingFace repository, apply low-bit quantization with the NNCF tool, and finally build a complete chatbot.
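For reference, the general export-and-compress path can be sketched with optimum-intel and NNCF as below. The notebook contains the model-specific conversion details for ChatGLM2 and Qwen; the model ID and output path here are placeholders:

# Export a HuggingFace causal-LM checkpoint to OpenVINO IR, then compress its
# weights to int4 with NNCF. Model ID and output path are placeholders.
import nncf
import openvino as ov
from optimum.intel import OVModelForCausalLM

model_id = "your-org/your-chat-model"             # placeholder HuggingFace model ID
ov_model = OVModelForCausalLM.from_pretrained(model_id, export=True)

compressed = nncf.compress_weights(
    ov_model.model, mode=nncf.CompressWeightsMode.INT4_SYM, group_size=64, ratio=0.9
)
ov.save_model(compressed, "llm_int4.xml")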


Figure: Comparison of space occupied by fp16 and int4 models

As the screenshot above shows, after NNCF int4 quantization, qwen-7b-chat can be compressed to roughly one third of the size of the original fp16 model, which allows a laptop with 16 GB of memory to run the compressed model smoothly. In addition, the LLM can also be deployed on the integrated graphics of a Core CPU to improve performance and reduce the load on the CPU.
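With optimum-intel, targeting the integrated GPU is mainly a matter of changing the device string, as in the sketch below; the model directory and prompt are placeholders, assuming the IR and tokenizer were saved together:

# Run the compressed model on the integrated GPU and generate a reply.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_dir = "path/to/int4-model"                  # placeholder directory with IR + tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = OVModelForCausalLM.from_pretrained(model_dir, device="GPU")   # iGPU; use "CPU" otherwise

inputs = tokenizer("What is low-bit quantization?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))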


Figure: Notebook running effect

Summary

The int4 weight-quantization support in OpenVINO™ 2023.2 comprehensively improves the running performance of large models on Intel platforms while reducing their storage and memory footprint. It lowers the barrier for developers to deploy large models and makes it feasible to run localized large-language-model applications on ordinary PCs.

Reference links

https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/254-llm-chatbot

https://github.com/openvinotoolkit/nncf/tree/5eee3bc293da2e94b30cb8dd19da9f20fce95f02/nncf/quantization/algorithms/weight_compression
