Smaller is better: Q8-Chat, an efficient generative AI experience on Intel Xeon CPUs

Large language models (LLMs) are taking the machine learning world by storm. Thanks to their transformer architecture, LLMs have an uncanny ability to learn from large amounts of unstructured data such as text, images, video, or audio. They perform extremely well on a wide variety of task types, whether extractive tasks like text classification or generative tasks like text summarization and text-to-image generation.

As the name suggests, LLMs are large models, typically with more than 10 billion parameters, and some, such as the BLOOM model, with more than 100 billion. LLMs require a lot of computing power to meet the low-latency requirements of certain scenarios (such as search or conversational applications). Such computing power is usually only available on high-end GPUs, and unfortunately the associated costs can be prohibitive for many organizations, making it difficult for them to use state-of-the-art LLMs in their applications.

In this article, we discuss optimization techniques that help reduce LLM size and inference latency, so that LLMs can run efficiently on Intel CPUs.

Getting Started with Quantization

LLMs are usually trained with 16-bit floating-point parameters (i.e., FP16 or BF16), so storing one weight or activation value requires 2 bytes of memory. In addition, floating-point arithmetic is more complex and slower than integer arithmetic and requires additional computing power.

Quantization is a model compression technique that aims to solve both problems by reducing the range of values that model parameters can take. For example, you can quantize a model to a lower precision such as 8-bit integers (INT8) to shrink its bit width and replace complex floating-point operations with simpler, faster integer operations.

In short, quantization rescales model parameters to a smaller range of values. When successful, it shrinks your model by at least a factor of 2, without any impact on model accuracy.
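To make the idea concrete, here is a minimal sketch of symmetric INT8 quantization of a single tensor (purely illustrative; real toolkits add calibration, zero points, and per-channel scales):

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric INT8 quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = w.abs().max() / 127.0                         # one FP scale for the whole tensor
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    """Recover an approximation of the original floating-point values."""
    return q.float() * scale

w = torch.randn(4, 4)                                     # FP32 weights
q, scale = quantize_int8(w)
print(q.element_size())                                   # 1 byte per value after quantization
print((w - dequantize(q, scale)).abs().max())             # small quantization error
```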

You can quantize during training, known as quantization-aware training (QAT), which is usually more accurate. If you need to quantize an already trained model, you can use post-training quantization (PTQ), which is faster and requires less computing power.

There are many quantization tools available. For example, PyTorch has built-in support for quantization. You can also use the Hugging Face Optimum Intel library, which provides developer-friendly QAT and PTQ APIs.
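As an illustration, PyTorch's built-in dynamic quantization can be applied to a trained model in a few lines. This is a generic, hedged sketch (the model and layer selection are arbitrary examples, not the recipe used later in this post):

```python
import torch
from transformers import AutoModelForCausalLM

# Load a small causal language model in full precision (model name is just an example).
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Post-training dynamic quantization: the weights of nn.Linear layers are stored as INT8,
# and activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized model can be used as a drop-in replacement for inference.
print(quantized_model)
```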

Quantizing LLMs

Recent studies [1][2] have shown that current quantization techniques do not work well with LLMs. LLMs exhibit a particular phenomenon: in every layer and for every token, some activation channels show amplitude anomalies, i.e., their activation values have much larger magnitudes than those of the other channels. As an example, the image below is from the OPT-13B model; you can see that, across all tokens, one channel has much larger activation values than all the others. This phenomenon exists in every transformer layer of the model.

Source: SmoothQuant paper

To date, the best activation quantization technique is per-token quantization, which either truncates the outliers or underflows the small-magnitude activations; both significantly degrade model quality. Quantization-aware training requires additional training, which is impractical in most cases due to the lack of computational resources and data.
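The toy example below (a simplified illustration, not code from the cited papers) shows the dilemma: with per-token quantization, one outlier channel sets the scale for the whole token, so every small-magnitude channel underflows to zero, while clipping the outlier instead would destroy the information it carries:

```python
import torch

def per_token_int8(x: torch.Tensor) -> torch.Tensor:
    """Quantize each token (row) with its own scale, then dequantize for comparison."""
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0    # per-token scale
    q = torch.clamp(torch.round(x / scale), -127, 127)
    return q * scale

# One token with eight "normal" channels and one outlier channel, as observed in LLMs.
x = torch.tensor([[0.02, -0.03, 0.01, 0.04, -0.02, 0.03, -0.01, 0.02, 60.0]])
print(per_token_int8(x))   # the outlier survives, but every small channel collapses to 0.0
```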

SmoothQuant [3][4] is a new quantization technique that solves this problem. It applies a joint mathematical transformation to the weights and activations, which reduces the ratio between outlier and non-outlier activation values at the cost of increasing that ratio for the weights. This transformation makes the layers of the transformer "quantization-friendly" and enables 8-bit quantization without hurting model quality. As a result, SmoothQuant helps produce smaller, faster models that run well on Intel CPU platforms.

Source: SmoothQuant paper
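Conceptually, SmoothQuant divides each activation channel by a per-channel smoothing factor and multiplies the matching weight row by the same factor, so the layer's output is unchanged while the activation outliers shrink. Here is a minimal sketch of that idea, following the per-channel formula from the SmoothQuant paper (alpha is the migration strength, typically around 0.5); it is a simplified illustration, not Intel's implementation:

```python
import torch

def smooth(x: torch.Tensor, w: torch.Tensor, alpha: float = 0.5):
    """
    x: activations, shape (tokens, in_features)
    w: weights, shape (in_features, out_features)
    Returns (x_s, w_s) such that x_s @ w_s == x @ w, with activation outliers
    partially migrated into the weights.
    """
    act_max = x.abs().amax(dim=0)                       # per input channel
    weight_max = w.abs().amax(dim=1)                    # per input channel
    s = act_max.pow(alpha) / weight_max.pow(1 - alpha)  # smoothing factors
    return x / s, w * s.unsqueeze(1)

x = torch.randn(16, 8)
x[:, 3] *= 50                                           # channel 3 is an outlier
w = torch.randn(8, 4)
x_s, w_s = smooth(x, w)
print(torch.allclose(x @ w, x_s @ w_s, atol=1e-3))      # mathematically equivalent
print(x.abs().amax(dim=0)[3], x_s.abs().amax(dim=0)[3]) # outlier magnitude is much smaller
```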

Now, let's see how SmoothQuant works on popular LLMs.

Quantizing LLMs with SmoothQuant

Our partners at Intel have quantized several LLMs with SmoothQuant-O3: OPT 2.7B and 6.7B [5], LLaMA 7B [6], Alpaca 7B [7], Vicuna 7B [8], BloomZ 7.1B [9], and MPT-7B-chat [10]. They also evaluated the accuracy of the quantized models using EleutherAI's language model evaluation harness.
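For reference, an evaluation of this kind can be run with the harness' Python API. The sketch below is an assumption-laden example (entry points and task names vary across harness versions, and this is not necessarily the setup Intel used):

```python
# pip install lm-eval   (EleutherAI's lm-evaluation-harness)
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                                            # Hugging Face causal LM backend
    model_args="pretrained=facebook/opt-2.7b,dtype=bfloat16",
    tasks=["lambada_openai", "hellaswag", "winogrande"],   # example task selection
)
print(results["results"])
```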

The table below summarizes their findings. The second column shows the number of tasks whose performance improved after quantization. The third column shows the average performance degradation across tasks (* a negative value indicates that the model's average performance actually improved after quantization). You can find the detailed results at the end of this article.

(Table: summary of quantization results per model)

As you can see, the OPT models are great candidates for SmoothQuant quantization. The quantized models are roughly 2x smaller than the pretrained 16-bit models. Most metrics improve, and those that don't are only slightly degraded.

For LLaMA 7B and BloomZ 7.1B, the picture is more mixed. The models are compressed by a factor of about 2, and the metrics improve for about half of the tasks. Again, the other half is only slightly affected, with a single task showing more than 3% relative degradation.

The obvious benefit of working with smaller models is a significant reduction in inference latency. This video demonstrates real-time text generation, with a batch size of 1, using the MPT-7B-chat model on a single-socket, 32-core Intel Sapphire Rapids CPU.

In this example, we ask the model: "What is the role of Hugging Face in democratizing NLP?". The program sends the following prompt to the model: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: What is the role of Hugging Face in democratizing NLP? ASSISTANT:"

(Video: real-time text generation with MPT-7B-chat on an Intel Sapphire Rapids CPU)
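A similar interactive setup can be reproduced with the streaming utilities in transformers. The sketch below is a plain BF16 baseline, with generation settings that are assumptions, not the optimized INT8 pipeline shown in the video:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "mosaicml/mpt-7b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: What is the role of Hugging Face in democratizing NLP? ASSISTANT:"
)
inputs = tokenizer(prompt, return_tensors="pt")

# Print tokens to stdout as soon as they are generated, as in the video.
streamer = TextStreamer(tokenizer, skip_prompt=True)
model.generate(**inputs, streamer=streamer, max_new_tokens=256, do_sample=True, temperature=0.7)
```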

This example shows the additional latency gains brought by 8-bit quantization on 4th-generation Xeon processors, resulting in a very short generation time per token. This level of performance definitely makes it possible to run LLMs on CPU platforms, giving customers more IT flexibility and a better price/performance ratio than ever before.

Chat experience on Xeon CPUs

Clement, the CEO of Hugging Face, recently said that more companies would benefit from focusing on smaller, domain-specific models that are cheaper to train and run. The reduction of fine-tuning and inference costs in production creates new opportunities. As we showed above, high-quality quantization brings a high-quality chat experience to Intel CPU platforms, without the need for enormous LLMs and complex AI accelerators.

We've worked with Intel to create Q8-Chat (pronounced "cute chat"), an exciting new demo application hosted on Spaces. Q8-Chat offers a chat experience similar to ChatGPT, while only requiring a single-socket, 32-core Intel Sapphire Rapids CPU (with a batch size of 1).

Try the Space here: https://intel-q8-chat.hf.space

Next steps

We are working on integrating the Intel Neural Compressor into Hugging Face Optimum Intel, so that Optimum Intel can take advantage of this new quantization technique. Once this is done, you will be able to reproduce our results with just a few lines of code.
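In the meantime, post-training dynamic quantization is already exposed through Optimum Intel's Neural Compressor interface. The snippet below is a hedged sketch of what "a few lines of code" looks like today: it applies standard dynamic INT8 quantization, not SmoothQuant, and the exact API may differ between optimum-intel versions:

```python
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b")

# Dynamic INT8 post-training quantization (no calibration dataset required).
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(quantization_config=quantization_config, save_directory="opt-2.7b-int8")
```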

Stay tuned.

The future belongs to 8-bit!

This article is guaranteed to be ChatGPT-free.

Acknowledgements

This article was co-authored with Ofir Zafrir, Igor Margulis, Guy Boudoukh, and Moshe Wasserblat from Intel Labs. Special thanks to them for their valuable comments and cooperation.

Appendix: Detailed Results

Negative values indicate improved performance after quantization.

(Tables: detailed per-task evaluation results for each quantized model)

Original English: https://hf.co/blog/generative-ai-models-on-intel-cpu

Author: Julien Simon

Translator: Matrix Yao (Yao Weifeng), a deep learning engineer at Intel, working on the application of transformer-family models to data of various modalities and on the training and inference of large-scale models.

Proofreading/Typesetting: zhongdongy (阿东)


Origin: blog.csdn.net/HuggingFace/article/details/130838580