A simple model refactor greatly improves the performance of ChatGLM | The most "in" large models

Authors of this article:
Zhao Zhen, Luo Cheng, Li Tingqian, Zou Wenyi

Introduction

Since large language models (LLMs) became a hot topic, a large number of Chinese LLMs have emerged and are being actively deployed and optimized on various platforms. ChatGLM is one of the widely acclaimed mainstream Chinese language models.

However, because ChatGLM is not yet a native model of the Transformers ecosystem, the official Optimum extension library still lacks support for it.

This article provides a convenient way to refactor the model architecture using the OpenVINO™ opset.

The solution includes optimized nodes customized for ChatGLM; these nodes are highly optimized with Intel® Advanced Matrix Extensions (Intel® AMX) intrinsics and Multi-Head Attention (MHA) fusion.

Note that this article only presents an optimized solution based on creating an OpenVINO™ stateful model for ChatGLM. The solution is platform-dependent: it requires a 4th Gen Intel® Xeon® Scalable processor [1] (code-named Sapphire Rapids) with built-in Intel® AMX. The authors do not commit to any maintenance of this solution.


Introduction to ChatGLM Models

Examining the source code of the original ChatGLM model [2] shows that ChatGLM is not compatible with Optimum ModelForCausalLM; instead, it defines a new class, ChatGLMForConditionalGeneration [3].

The pipeline loop of this model contains three main modules (Embedding, the GLMBlock layers [4], and lm_logits); the structure is shown below:

Figure 1 ChatGLM model structure

As shown in the figure above, the full pipeline actually requires two different execution graphs from the model: for the first inference, which consumes the input prompt, the GLMBlock layers do not need a KV cache as input; from the second iteration onward, the previous results of the QKV attention mechanism become the input for the current round of model inference.

As the generated sequence grows, a large number of large memory copies occur between the model's inputs and outputs during pipeline inference.

Taking the default ChatGLM-6B model configuration [5] as an example, the memory copy between the input and output arrays resembles the following pseudocode; the copy overhead is determined by the model parameter hidden_size and the number of iterations:

// pseudocode: every generation step copies the whole past-KV from the model output back to its input
while (!eos_token_generated && cur_len < max_seq_len) {
    memcpy(model_inp, model_outp, num_layer * 2 * sizeof(model_outp) * hidden_size);
    model_outp.push_back(gen_token);
}

Therefore, the two key issues this article addresses are:

  • How to optimize the model inference pipeline to eliminate memory copies between model inputs and outputs

  • How to optimize the GLMBlock module by redesigning the execution graph

Build an OpenVINO™ stateful model to achieve significant optimization

First, analyze the structure of the GLMBlock layer and encapsulate a class that invokes the OpenVINO™ opset according to the workflow below. Then serialize the graph data into an IR model (.xml, .bin).

Figure 2 Building the OpenVINO™ stateful model for ChatGLM

For how to build an OpenVINO™ stateful model, and how to build a model with opset using the model-creation sample provided by OpenVINO™, please refer to the links at the end of this article.

ChatGLM's custom attention mechanism is the part that this article focuses on and optimizes.

The main idea is to build a global context structure inside the model that appends and stores the past-KV results after each iteration. This removes the overhead of copying past-KV through the model's inputs and outputs, while Rotary Embedding and Multi-Head Attention are implemented with intrinsic-level optimizations.
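
To illustrate the pattern, here is a minimal, hypothetical sketch of keeping state inside an OpenVINO™ model with the ReadValue/Assign opset nodes. It is not the actual ChatGLM graph: a toy running-sum state stands in for the real per-layer past-KV tensors and custom attention nodes, the variable id "kv_state" and all shapes are invented for the example, and it assumes the opset8 read_value/assign factories and the ov.Model constructor that accepts sink nodes.

# Minimal sketch (not the actual ChatGLM graph): a state Variable kept inside the
# model via ReadValue/Assign, so it never crosses the model boundary between steps.
import numpy as np
from openvino.runtime import Core, Model, serialize
from openvino.runtime import opset8 as ops

hidden = 8                                                    # toy size, not ChatGLM's
x = ops.parameter([1, hidden], np.float32, name="x")

init = ops.constant(np.zeros([1, hidden], dtype=np.float32))  # initial state value
state = ops.read_value(init, "kv_state")                      # reads what Assign last wrote
new_state = ops.add(state, x)                                 # stand-in for "append past-KV"
sink = ops.assign(new_state, "kv_state")                      # writes the state back

model = Model(results=[ops.result(new_state)], sinks=[sink], parameters=[x], name="stateful_sketch")
serialize(model, "stateful_sketch.xml", "stateful_sketch.bin")  # IR files (.xml, .bin)

req = Core().compile_model(model, "CPU").create_infer_request()
for _ in range(3):
    out = req.infer({"x": np.ones([1, hidden], dtype=np.float32)})
    # the state accumulates across infer() calls with no host-side input/output copies

Applied per layer to the real past-KV tensors and combined with the custom attention nodes, this same ReadValue/Assign pattern is what the stateful-model approach relies on to drop the memcpy loop shown earlier.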

Intel® AMX is a matrix-multiplication accelerator built into 4th Gen Intel® Xeon® Scalable processors. It processes bf16 or int8 matrix multiply-add operations faster, significantly improving inference and training performance by accelerating tensor processing. With Intel® AMX intrinsics (single instructions that perform multiple operations to accelerate computation), highly optimized operators such as Attention and Rotary Embedding in the ChatGLM model are implemented, and bf16 instructions are used for the multiply-add operations, preserving the floating-point exponent bits while improving computational efficiency.
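
As a side note on the exponent claim: bf16 is simply the upper 16 bits of an fp32 value (1 sign bit, 8 exponent bits, 7 mantissa bits), so it keeps fp32's dynamic range and gives up mantissa precision. The small NumPy sketch below is illustrative only and is unrelated to the AMX kernels themselves.

# Illustrative only: emulate fp32 -> bf16 -> fp32 by keeping the top 16 bits.
import numpy as np

def fp32_to_bf16_bits(x):
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return ((bits + 0x8000) >> 16).astype(np.uint16)   # simple rounding; no NaN handling

def bf16_bits_to_fp32(b):
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.1415927, 1.0e30, 1.0e-30], dtype=np.float32)
print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))
# magnitudes like 1e30 and 1e-30 survive (same exponent range); 3.1415927 becomes ~3.140625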

At the same time, this solution compresses the weights of the fully connected layers to int8 precision, while the actual computation still uses bf16. There is therefore no need to quantize the model through post-training quantization (PTQ) or quantization-aware training (QAT). This weight-compression approach reduces the model's storage footprint and memory-bandwidth load; because the computation remains in floating point, it causes no overflow and no loss of model accuracy.
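
The NumPy sketch below illustrates the general idea of weight-only compression (a hypothetical symmetric per-output-channel scheme, not necessarily the exact one used here; the real kernels dequantize into bf16 inside the AMX GEMM): weights are stored as int8 plus a float scale and expanded back to floating point only at compute time, so activations are never quantized and no PTQ/QAT pass is needed.

# Weight-only int8 compression sketch (hypothetical scheme): store int8 weights
# plus per-row float scales, keep the arithmetic in floating point.
import numpy as np

def compress_fc_weight(w):
    # symmetric per-output-channel quantization into [-127, 127]
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_int8, scale.astype(np.float32)

def fc_forward(x, w_int8, scale):
    # dequantize on the fly; only storage and memory traffic stay int8
    w = w_int8.astype(np.float32) * scale
    return x @ w.T

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096), dtype=np.float32)
x = rng.standard_normal((1, 4096), dtype=np.float32)
w8, s = compress_fc_weight(w)
print(np.max(np.abs(fc_forward(x, w8, s) - x @ w.T)))   # small dequantization error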


Create an OpenVINO™ stateful model for ChatGLM

Please configure the hardware and software environment as described below, then follow these steps to optimize ChatGLM:

Hardware Requirements

4th Gen Intel® Xeon® Scalable processors (code-named Sapphire Rapids) or later, with built-in Intel® AMX

Software Verification Environment

Ubuntu 22.04.1 LTS

Python 3.10.11 for the OpenVINO™ Runtime Python API

GCC 11.3.0 for building OpenVINO™ Runtime

cmake 3.26.4

Build OpenVINO™ from source

  • Install system dependencies and set up the environment

  • Create and activate a Python virtual environment

$ conda create -n ov_py310 python=3.10 -y
$ conda activate ov_py310

  • Install Python dependencies

$ pip install protobuf transformers==4.30.2 cpm_kernels "torch>=2.0" sentencepiece pandas
  • Compile OpenVINO™ with GCC 11.3.0

  • Clone OpenVINO™ and update its submodules

$ git clone https://github.com/luo-cheng2021/openvino.git -b luocheng/chatglm_custom
$ cd openvino && git submodule update --init --recursive

  • Install the Python dependencies needed to build the Python wheel

$ python -m pip install -U pip 
$ python -m pip install -r ./src/bindings/python/src/compatibility/openvino/requirements-dev.txt
$ python -m pip install -r ./src/bindings/python/wheel/requirements-dev.txt

  • Create compilation directory

$ mkdir build && cd build

  • Compile OpenVINO™ using CMake

$ cmake .. -DENABLE_LLMDNN=ON \
    -DBUILD_PYTHON_TESTS=ON \
    -DENABLE_DEBUG_CAPS=OFF  \
    -DCMAKE_BUILD_TYPE=Release \
    -DENABLE_INTEL_MYRIAD_COMMON=OFF \
    -DENABLE_INTEL_GNA=OFF \
    -DENABLE_OPENCV=OFF \
    -DENABLE_CPPLINT=ON \
    -DENABLE_CPPLINT_REPORT=OFF \
    -DENABLE_NCC_STYLE=OFF \
    -DENABLE_TESTS=ON \
    -DENABLE_OV_CORE_UNIT_TESTS=OFF \
    -DENABLE_INTEL_CPU=ON \
    -DENABLE_INTEL_GPU=OFF \
    -DENABLE_AUTO=OFF \
    -DENABLE_AUTO_BATCH=OFF \
    -DENABLE_MULTI=OFF \
    -DENABLE_HETERO=OFF \
    -DENABLE_PROFILING_ITT=ON \
    -DENABLE_SAMPLES=ON \
    -DENABLE_PYTHON=ON \
    -DENABLE_TEMPLATE=OFF  \
    -DENABLE_OV_ONNX_FRONTEND=OFF \
    -DENABLE_OV_PADDLE_FRONTEND=OFF \
    -DENABLE_OV_PYTORCH_FRONTEND=OFF \
    -DENABLE_OV_TF_FRONTEND=OFF \
    -DENABLE_OPENVINO_DEBUG=OFF \
    -DENABLE_CPU_DEBUG_CAPS=ON \
    -DCMAKE_INSTALL_PREFIX=`pwd`/install \
    -DCMAKE_INSTALL_RPATH=`pwd`/install/runtime/3rdparty/tbb/lib:`pwd`/install/runtime/3rdparty/hddl/lib:`pwd`/install/runtime/lib/intel64 \
    -Dgflags_Dir=`pwd`/../thirdparty/gflags/gflags/cmake
$ make --jobs=$(nproc --all)
$ make install

  • Install the Python wheels built for OpenVINO™ Runtime and the openvino-dev tools

$ pip install ./install/tools/openvino*.whl

  • (Optional) Check the system GCC version and the Conda runtime GCC version. As shown below, if the system GCC version is higher than the Conda GCC version, upgrade the Conda GCC to the same version to satisfy the OpenVINO™ Runtime requirements.

##check system (OpenVINO compiling env) gcc version
$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
##check conda python (runtime env for OpenVINO later) gcc version
$ python
Python 3.10.11 (main, May 16 2023, 00:28:57) [GCC 11.2.0] on linux
##If sys gcc ver > conda gcc ver, upgrade conda gcc ver -> sys gcc ver
$ conda install -c conda-forge gcc=11.3.0

  • Convert the PyTorch model to OpenVINO™ IR

$ cd ..
$ python tools/gpt/gen_chatglm.py /path/to/pytorch/model /path/to/ov/IR


Build an inference pipeline for ChatGLM using the OpenVINO™ Runtime API

This article provides a sample that builds an inference pipeline using Transformers and the OpenVINO™ Runtime API. First, in test_chatglm.py, create a new class derived from transformers.PreTrainedModel.

Then, the forward function is updated to build the model inference pipeline with the OpenVINO™ Runtime Python API. The other member functions are migrated from ChatGLMForConditionalGeneration in modeling_chatglm.py [2].

This ensures that input preparation, set_random_seed, tokenizer/detokenizer, and the rest of the pipeline remain consistent with the source code of the original model.
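
As a rough illustration of that structure, here is a minimal, hypothetical sketch of a transformers.PreTrainedModel subclass whose forward() delegates to a compiled OpenVINO™ model. It is not the repository's test_chatglm.py: the class name, the IR input name "input_ids", and the single-output assumption are all invented for the example.

# Hypothetical sketch: wrap an OpenVINO IR behind a transformers.PreTrainedModel.
import torch
from openvino.runtime import Core
from transformers import PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast

class OVChatGLMWrapper(PreTrainedModel):              # illustrative name
    def __init__(self, config: PretrainedConfig, ir_path: str):
        super().__init__(config)
        self.request = Core().compile_model(ir_path, "CPU").create_infer_request()

    def forward(self, input_ids: torch.Tensor, **kwargs):
        # the stateful IR keeps past-KV internally, so only token ids are fed;
        # "input_ids" as the IR input name is an assumption of this sketch
        self.request.infer({"input_ids": input_ids.cpu().numpy()})
        logits = self.request.get_output_tensor(0).data
        return CausalLMOutputWithPast(logits=torch.from_numpy(logits.copy()))

The remaining member functions (such as prepare_inputs_for_generation and the generation loop) would then be migrated from ChatGLMForConditionalGeneration, as described above.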

To enable int8 weight compression, simply set the environment variable USE_INT8_WEIGHT=1. Because the fully connected layer weights were already compressed to int8 during model generation, the model can use the int8 weights directly at inference time, eliminating the step of compressing the model with a framework or quantization tool.

Please follow the steps below to test ChatGLM with the OpenVINO™ Runtime pipeline:

  • Run the bf16 model

$ python3 tools/gpt/test_chatglm.py /path/to/pytorch/model /path/to/ov/IR --use=ov
  • Run the int8 model

$ USE_INT8_WEIGHT=1 python3 tools/gpt/test_chatglm.py /path/to/pytorch/model /path/to/ov/IR --use=ov


Weight compression: reduce memory bandwidth usage and increase inference speed

This article uses VTune to compare memory bandwidth usage (Figures 3 and 4) and the CPI rate (Table 1) when the model weights use bf16 versus int8 precision. The comparison shows that compressing the model weights to int8 reduces both memory bandwidth usage and the CPI rate.

Figure 3 Memory bandwidth usage with bf16 model weights

Figure 4 Memory bandwidth usage with int8 model weights

Table 1 CPI rate with different model weight precisions

The Clockticks per Instructions Retired (CPI) event ratio, also known as Cycles per Instruction, is one of the basic performance metrics collected through hardware event-based sampling, a mode also known as Performance Monitoring Counter (PMC) analysis.

The ratio is calculated by dividing the number of non-halted processor clock cycles (Clockticks) by the number of instructions retired. The exact events each processor uses to count clock cycles and retired instructions may differ, but VTune Profiler knows which ones to use and computes the ratio correctly.

A CPI < 1 typically indicates an instruction-intensive application, while a CPI > 1 may indicate an application dominated by stalled clock cycles, often a memory-intensive one. For example, a run that retires 1.0 × 10⁹ instructions over 2.0 × 10⁹ non-halted clockticks has a CPI of 2.0, pointing to a stall- or memory-bound workload.

From this, we can conclude that language models such as ChatGLM place very high demands on memory bandwidth, and their performance is often limited by memory operations or bandwidth.

In many scenarios, eliminating memory-operation overhead greatly benefits performance. Compressing or otherwise slimming down the model without compromising accuracy is therefore an indispensable skill when optimizing such models. In addition, deployment on heterogeneous platforms and frameworks calls for similar optimizations, such as reducing data transfer between memory and device storage.

Therefore, in addition to compressing the model, it is also necessary to optimize the pipeline of the original PyTorch model's forward/generate and related functions. OpenVINO™ not only optimizes the model itself but also embodies this pipeline-optimization idea in the modified model structure (the KV cache is stored inside the model), and by optimizing the pipelines of frameworks such as Optimum-intel, it reduces memory copies and data movement.

Conclusion

Using the approach above, the authors redesigned the execution graph and optimized GLMBlock, eliminating the memory copies between the ChatGLM model's inputs and outputs so that the model runs efficiently.

As OpenVINO™ continues to evolve, the optimization work in this solution will be promoted and integrated into officially released versions, helping to expand support for more large language model use cases. Please refer to the official OpenVINO™ releases [6] and the Optimum-intel OpenVINO™ backend [7] for official, efficient support of large language models.


About the authors:

Zhao Zhen and Zou Wenyi are customer support engineers for Intel® OpenVINO™ development tools; Luo Cheng and Li Tingqian are AI framework engineers for Intel® OpenVINO™ development tools. All are engaged in the development and optimization of AI software tools.

OpenVINO™ stateful model construction:
https://docs.openvino.ai/2022.3/openvino_docs_OV_UG_network_state_intro.html

Build the model through opset:
https://github.com/openvinotoolkit/openvino/blob/master/samples/cpp/model_creation_sample/main.cpp

Reference link:

[1]https://www.intel.cn/content/www/cn/zh/events/accelerate-with-xeon.html

[2]https://huggingface.co/THUDM/chatglm-6b/blob/main/modeling_chatglm.py

[3]https://huggingface.co/THUDM/chatglm-6b/blob/main/modeling_chatglm.py#L1031

[4]https://huggingface.co/THUDM/chatglm-6b/blob/main/modeling_chatglm.py#L554

[5]https://huggingface.co/THUDM/chatglm-6b/blob/main/config.json

[6]https://www.intel.cn/content/www/cn/zh/developer/tools/openvino-toolkit/overview.html

[7]https://huggingface.co/docs/optimum/main/en/intel/index

*This article is published with QbitAI's authorization; the views expressed are solely the authors' own.

