Use BigDL-LLM to Instantly Accelerate Inference for LLMs with Tens of Billions of Parameters | The Hottest Large Models

Authors: Huang Shengsheng, Huang Kai, Dai Jinquan (Intel Corporation)
Qubit | WeChat Official Account QbitAI

We are entering a new era of AI driven by Large Language Models (LLMs), which are playing an increasingly important role in various applications such as customer service, virtual assistants, content creation, programming assistance, and more.

However, as LLMs continue to grow in scale, the resources required to run them keep increasing, and they run ever more slowly, which poses considerable challenges for AI application developers.

To this end, Intel recently released BigDL-LLM [1], an open source large-model library that helps AI developers and researchers accelerate and optimize large language models on Intel® platforms and improves the experience of using large language models on Intel® platforms.


The following shows the real-time performance of Vicuna-33b-v1.3 [2], a large language model with 33 billion parameters, accelerated by BigDL-LLM and running on a server equipped with an Intel® Xeon® Platinum 8468 processor.

The actual speed of running a 33-billion-parameter large language model on a server equipped with an Intel® Xeon® Platinum 8468 processor (real-time screen recording)

BigDL-LLM: An Open Source Large Language Model Acceleration Library for Intel® Platforms

BigDL-LLM is an acceleration library optimized for large language models; it is part of the open source BigDL project and is released under the Apache 2.0 license.

It provides various low-precision optimizations (such as INT4/INT5/INT8) and can leverage multiple hardware acceleration technologies integrated into Intel® CPUs (AVX/VNNI/AMX, etc.) as well as the latest software optimizations, enabling large language models to be optimized more efficiently and to run faster on Intel® platforms.

One of the important features of BigDL-LLM is that, for models based on the Hugging Face Transformers API, only one line of code needs to be changed to accelerate the model. In principle, any Transformers model can be run, which is very friendly to developers familiar with the Transformers API.

In addition to the Transformers API, many people also use LangChain to develop large language model applications.

To this end, BigDL-LLM also provides an easy-to-use LangChain integration [3], so that developers can easily use BigDL-LLM to develop new applications or migrate existing applications based on the Transformers API or the LangChain API.

In addition, for general PyTorch large language models (models that use neither the Transformers nor the LangChain API), the BigDL-LLM optimize_model API can accelerate them with a single call. For details, please refer to the GitHub README [4] and the official documentation [5].
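As an illustration, a minimal sketch of this one-call acceleration might look as follows (the torch.load step and the model path here are assumptions made for this sketch, not code from the article; refer to the README [4] for the exact usage):

import torch
from bigdl.llm import optimize_model

# Load an ordinary PyTorch LLM first (path and loading method are hypothetical)
model = torch.load('/path/to/pytorch_model.pt')
# A single call applies BigDL-LLM's low-precision optimizations (INT4 by default)
model = optimize_model(model)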

BigDL-LLM also provides a large number of acceleration samples for commonly used open source LLMs (e.g., samples using the Transformers API [6] and samples using the LangChain API [7]), as well as tutorials (including supporting Jupyter notebooks) [8], so that developers can try it out and get started quickly.

Installation and use: a simple installation process and an easy-to-use API

Installing BigDL-LLM is as easy as executing the single command shown below.

pip install --pre --upgrade bigdl-llm[all]


Using BigDL-LLM to accelerate large models is also very easy (here we use only the Transformers-style API as an example).

To accelerate a model with the BigDL-LLM Transformers-style API, only the model-loading part needs to be changed; everything afterwards works exactly the same as with the original Transformers.

Loading a model with the BigDL-LLM API is almost identical to using the Transformers API: the user only needs to change the import and set load_in_4bit=True in the from_pretrained parameters.

BigDL-LLM performs 4-bit low-precision quantization while loading the model, and uses various software and hardware acceleration technologies to optimize its execution during subsequent inference.

# Load Hugging Face Transformers model with INT4 optimizations
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)

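After loading, inference proceeds exactly as with stock Transformers. As a minimal illustration (the tokenizer loading and the example prompt below are assumptions added here, not code from the original article):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('/path/to/model/')
input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids
# generate() works on the INT4-optimized model exactly as on the original
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))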

Example: Quickly implement a voice assistant application based on a large language model

The following takes the "voice assistant", a common LLM application scenario, as an example to show how BigDL-LLM can be used to quickly implement an LLM application. Typically, the workflow of a voice assistant application is divided into the following two parts:


Figure 1. Schematic diagram of the voice assistant workflow

  1. Speech recognition: use a speech recognition model (the Whisper model [9] in this example) to convert the user's speech into text;

  2. Text generation: use the text output from step 1 as the prompt, and use a large language model (Llama2 [10] in this example) to generate a reply.

The following is the process used in this article to build a voice assistant application with BigDL-LLM and LangChain [11].

In the speech recognition stage, the first step is to load the preprocessor processor and the speech recognition model recog_model. Whisper, the recognition model used in this example, is a Transformers model.

Simply use AutoModelForSpeechSeq2Seq from BigDL-LLM and set the parameter load_in_4bit=True, and the model is loaded and accelerated with INT4 precision, significantly reducing model inference time.

from transformers import WhisperProcessor
from bigdl.llm.transformers import AutoModelForSpeechSeq2Seq
processor = WhisperProcessor.from_pretrained(recog_model_path)
recog_model = AutoModelForSpeechSeq2Seq.from_pretrained(recog_model_path, load_in_4bit=True)


The second step is speech recognition: first use the processor to extract input features from the input speech, then use the recognition model to predict tokens, and finally use the processor again to decode the tokens into natural language text.

# frame_data and audio come from the microphone-capture step (not shown here);
# forced_decoder_ids sets Whisper's language/task, e.g. via
# processor.get_decoder_prompt_ids()
input_features = processor(frame_data,
                           sampling_rate=audio.sample_rate,
                           return_tensors="pt").input_features
predicted_ids = recog_model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]


In the text generation stage, first use BigDL-LLM's TransformersLLM API to create a LangChain language model (TransformersLLM is a LangChain LLM integration defined in BigDL-LLM).

You can load any Hugging Face Transformers model using this API.

llm = TransformersLLM.from_model_id(
    model_id=llm_model_path,
    model_kwargs={"temperature": 0,
                  "max_length": args.max_length,
                  "trust_remote_code": True},
)


Then, create a normal conversation chain LLMChain, passing the already created llm as an input parameter.

# This part of the code is the same as in a standard LangChain use case
voiceassistant_chain = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=True,
    memory=ConversationBufferWindowMemory(k=2),
)

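Note that the prompt object used above is defined elsewhere in the full example. As a purely illustrative assumption, a minimal template compatible with ConversationBufferWindowMemory (which stores past turns under the "history" key by default) could look like this:

from langchain.chains import LLMChain
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate

# Hypothetical template; the real sample defines its own prompt
template = "{history}\nHuman: {human_input}\nAI:"
prompt = PromptTemplate(input_variables=["history", "human_input"],
                        template=template)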

This chain records the dialogue history and formats it appropriately into the prompt that the large language model uses to generate a reply. At this point, you only need to pass the text produced by the recognition model as the "human_input". The code is as follows:

response_text = voiceassistant_chain.predict(human_input=text,
                                             stop="\n\n")


Finally, by putting the speech recognition and text generation steps in a loop, you can hold a multi-turn conversation with this "voice assistant" (a rough sketch follows). Visit the link [12] at the end of this article to see the complete sample code and try it on your own machine. Use BigDL-LLM to quickly build your own voice assistant!
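Such a loop might look roughly like this (the listen() helper standing in for microphone capture is a placeholder invented here; the complete sample [12] handles audio capture and exit conditions properly):

while True:
    # listen() is a hypothetical stand-in for the microphone-capture code
    frame_data, audio = listen()
    # Speech recognition (as above)
    input_features = processor(frame_data,
                               sampling_rate=audio.sample_rate,
                               return_tensors="pt").input_features
    predicted_ids = recog_model.generate(input_features,
                                         forced_decoder_ids=forced_decoder_ids)
    text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    # Text generation (as above)
    response_text = voiceassistant_chain.predict(human_input=text, stop="\n\n")
    print(response_text)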

About the Author

Huang Shengsheng, Senior AI Architect at Intel Corporation; Huang Kai, AI Framework Engineer at Intel Corporation; and Dai Jinquan, Intel Fellow, Global CTO of Big Data Technology, and founder of the BigDL project, all work on big data and AI.

*This article is published with the authorization of Qubit; the views expressed are solely those of the authors.
