From 0 to 1! How Dewu builds a general-purpose large model training and inference platform

1. Background

Recently, the release of GPT-class large models has had a striking impact on the field of natural language processing (NLP). In its wake, a series of open-source large models has also risen rapidly, and according to several evaluation organizations their performance is quite good. Some of these evaluations can be found in Huggingface's Open LLM Leaderboard, the large language model leaderboard released by UC Berkeley, and so on.

With the development of large models, training and deployment technology has become very important. We investigated fine-tuning techniques such as LoRA and QLoRA, and quantized deployment techniques such as GPTQ. After running minimal demos and verifying their effectiveness, we integrated these technologies into the KubeAI platform (Dewu's AI platform) so that everyone can get started quickly.

This article is mainly divided into two parts: technical theory and technical practice.

The theory part explains fine-tuning training and quantized inference: fine-tuning covers LoRA and QLoRA, and deployment covers GPTQ quantized inference. It also walks through the key code and presents performance tests for deployment.

In the practice part, we integrate these technologies into the KubeAI platform so that everyone can quickly try them out. Based on feedback from early users, training a large model and deploying it for inference can be completed in about one day.

2. LoRA and QLoRA training technology

2.1 LoRA Technology Introduction

LoRA stands for Low-Rank Adaptation of Large Language Models.

It is a technique developed by Microsoft researchers for fine-tuning large language models. Its GitHub address is https://github.com/microsoft/LoRA , and it is supported by HuggingFace's PEFT library: https://github.com/huggingface/peft .

Large language models have enormous parameter counts: GPT-3 has 175 billion parameters, and the LLaMA series comes in 7B, 13B, 33B, and 65B sizes, where even the smallest 7B model has 7 billion parameters. To adapt these models to specific business scenarios, they need to be fine-tuned. Fine-tuning them directly would require very expensive GPU resources because of the huge parameter count. LoRA is a technique for fine-tuning these large language models at low cost.

LoRA's approach is to freeze the parameters of the pre-trained large model, i.e. mark them as untrainable during fine-tuning, then add additional network layers to the model and train only the parameters of those additional layers. Since the number of trainable parameters becomes very small, large language models can be fine-tuned on low-cost GPUs.

Reference  https://arxiv.org/abs/2106.09685

LoRA injects trainable rank-decomposition matrices into each layer of the Transformer architecture. Compared with fine-tuning GPT-3 175B with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and GPU memory requirements by 3 times, while achieving quality comparable to or better than full fine-tuning.

Let's take Transformer's linear layer as an example to explain how LoRA works.

A linear layer in the Transformer model performs a matrix multiplication, for example Y = XW, where X is the input and W is the weight matrix, i.e. the parameters learned during training.

The LoRA method operates on a Transformer linear layer as follows:

  • A "bypass" is added next to each linear layer, consisting of a dimensionality reduction matrix A and a dimensionality enhancement matrix B. Low rank decomposition comes into play here, for example we have a 100x100 matrix C, we can decompose it into A and B by low rank decomposition (assuming rank is set to 1), where A is a 100x1 matrix and B is a 1x100 matrix . In this way, the original matrix C with 10,000 parameters is decomposed into matrices A and B with a total of 200 parameters.

  • During the training process, the weight matrix W of the original linear layer remains unchanged, and only the dimensionality reduction matrix A and dimensionality enhancement matrix B are trained.

  • At inference time, the product of matrices B and A is merged into the weight matrix W of the original linear layer ahead of time, so the bypass adds no extra inference latency.

  • For general tasks, ranks of 1, 2, 4, 8, and 16 are sufficient.
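
To make the mechanism concrete, here is a minimal PyTorch sketch of a LoRA-adapted linear layer (this is not the PEFT implementation; names such as LoRALinear are our own, chosen purely for illustration). It freezes the original weight, trains only the low-rank matrices A and B, and shows how B @ A can be merged into W before inference.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen nn.Linear (simplified sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the original weight W
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Bypass: A reduces dimensionality (in_features -> r), B expands it back (r -> out_features)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Y = X W^T (frozen) + X A^T B^T * scaling (trainable bypass)
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

    @torch.no_grad()
    def merge(self):
        # Fold B @ A into W so inference uses a single matmul with no extra latency
        self.base.weight += (self.lora_B @ self.lora_A) * self.scaling


layer = LoRALinear(nn.Linear(100, 100), r=1)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 200: a 100x1 A plus a 1x100 B, versus 10,000 in the original weight
```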

2.2 LoRA key code reading

Having explained the key idea of LoRA, let's read the key code of the LoRA implementation in the latest version of PEFT. The core logic of LoRA is at: https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora.py

There are two core classes, one is LoraConfig and the other is LoraModel.

LoraConfig is the core configuration class of LoRA. It is used to configure LoraModel and contains the parameters that control its behavior.

The main parameters of this class are listed below, followed by a brief usage sketch:

  • r: the LoRA attention dimension, i.e. the rank mentioned above. The default value is 8.

  • target_modules: a list of module names to apply LoRA to.

  • lora_alpha: the LoRA alpha (scaling) parameter. The default value is 8.

  • lora_dropout: the dropout probability of the LoRA layers. The default value is 0.0.

  • bias: the LoRA bias type. Can be 'none', 'all' or 'lora_only'.
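
As a quick illustration, the sketch below builds a LoraConfig and wraps a base model with it (assuming a recent version of the peft and transformers libraries; the base model name is hypothetical, and target_modules depends on the model architecture).

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Hypothetical base model name, used purely for illustration
base_model = AutoModelForCausalLM.from_pretrained("your-org/your-7b-base-model")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank matrices
    lora_alpha=8,                         # scaling factor alpha
    lora_dropout=0.0,
    bias="none",
    target_modules=["q_proj", "v_proj"],  # module names vary by model architecture
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA parameters are trainable
```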

LoraModel is the core class of the LoRA module. The logic for freezing the base model's parameters, creating the bypass low-rank matrices, replacing modules, and merging weights all lives in this class. Let's walk through its key logic in light of the introduction above.

2.2.1 Initialization function

From the initialization function we can see that LoraModel inherits from torch.nn.Module, i.e. it is itself a PyTorch network module. The base_model argument is the base large model to be fine-tuned, and config contains the LoraConfig. During initialization, LoraModel sets its own forward method to the forward method of the base model.

2.2.2 Initialization: replacing the layers configured in target_modules with new LoraLayers, which adds the bypass low-rank matrices described above

This replacement step mainly does the following:

  • Find the modules named in target_modules (configured in LoraConfig) inside base_model (the large model).

  • Create a new LoraLayer. The new LoraLayer contains the original target module's layer and adds a parallel bypass alongside it, consisting mainly of the two low-rank matrices lora_A and lora_B.

  • Replace the original target_module layer with the newly created LoraLayer.

Through this step, the bypass low-rank matrices are added to the target_modules layers of the large model.
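
The sketch below illustrates the replacement idea in plain PyTorch (a simplified sketch, not the actual PEFT code): it looks up each target module by name and swaps it for a LoRA-wrapped version, reusing the LoRALinear class from the earlier sketch.

```python
import torch.nn as nn


def add_lora_bypass(model: nn.Module, target_modules, r=8, alpha=8):
    """Replace every nn.Linear whose name ends with a target suffix by a LoRA-wrapped layer."""
    for name, module in list(model.named_modules()):
        if not isinstance(module, nn.Linear):
            continue
        if not any(name.endswith(target) for target in target_modules):
            continue
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        # LoRALinear is the illustrative wrapper defined in the earlier sketch
        setattr(parent, child_name, LoRALinear(module, r=r, alpha=alpha))
    return model
```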

2.2.3 Initialization: freezing the parameters of the large model

During this step, all parameters except those of the newly added LoraLayer modules are frozen.
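
Conceptually this amounts to a loop like the following (a simplified sketch of the idea, not the exact PEFT source):

```python
def freeze_non_lora_parameters(model):
    # Only the bypass matrices (lora_A / lora_B) stay trainable; everything else is frozen
    for name, param in model.named_parameters():
        param.requires_grad = "lora_" in name
```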

2.2.4 Forward propagation: the computation after adding the bypass low-rank matrices (taking the Linear layer as an example)

The forward pass does the following:

  • Run the original linear layer of the large model's target_module to get the result.

  • Compute the bypass with the low-rank matrices lora_A and lora_B and add its output to the result.

That is the main logic; the remaining details can be studied directly in the code. The LoRA implementation in the PEFT library matches what is described in the paper.
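
In code, the forward pass of such a LoRA Linear layer looks roughly like this (a simplified sketch in the spirit of the PEFT implementation, not the library source; lora_A and lora_B are small bias-free nn.Linear modules):

```python
import torch.nn as nn
import torch.nn.functional as F


def lora_linear_forward(x, weight, bias, lora_A, lora_B, dropout: nn.Dropout, scaling: float):
    # lora_A: nn.Linear(in_features, r, bias=False); lora_B: nn.Linear(r, out_features, bias=False)
    # 1. Original frozen linear layer of the target module
    result = F.linear(x, weight, bias)
    # 2. Bypass: dimensionality reduction with lora_A, expansion with lora_B, then scale and add
    result = result + lora_B(lora_A(dropout(x))) * scaling
    return result
```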

2.3 QLoRA Technology Introduction

Although LoRA saves GPU memory to a certain extent and speeds up training, running large models in float16 still consumes a lot of GPU memory. For example, with the batch size set to the minimum, a single A100 (80GB) can only fine-tune the 7B series models; a 13B model normally requires about 120GB of GPU memory, and fine-tuning a 65B model requires more than 780GB.

For this reason, researchers at the University of Washington proposed QLoRA. In extreme cases a 33B model can be fine-tuned on a single 24GB GPU, and a 65B model on a single 48GB GPU. Of course, fine-tuning becomes slower in these configurations.

Paper reference https://arxiv.org/abs/2305.14314 .

The figure in the paper illustrates the difference between LoRA and QLoRA during fine-tuning. As the name suggests, QLoRA is essentially Quantization + LoRA: simply put, the large base model is compressed from 16-bit down to 4-bit, which reduces the GPU memory needed for training. QLoRA introduces three key techniques:

  • 4-bit NormalFloat: QLoRA uses NF4 (Normal Float 4-bit) to quantize and compress the pre-trained model. It is a 4-bit data type optimized for the fact that neural network weights typically follow a zero-centered normal distribution; weights are mapped into the range [-1, 1] based on the normal distribution. Compared with conventional 4-bit quantization, it loses less weight information and therefore preserves more of the model's accuracy.

  • Double quantization: a memory optimization that quantizes the quantization constants themselves a second time, further reducing memory usage while maintaining precision.

  • Paged optimizers: a memory-management technique that uses NVIDIA unified memory to perform automatic page transfers between CPU and GPU. When GPU memory runs short, part of the optimizer state can be temporarily paged out to CPU memory and paged back later, which avoids out-of-memory failures when training large models.

In our own measurements on the platform, training a 33B model requires a minimum of 26GB of GPU memory, but the batch size has to be set to 1, so training is slower. In practice the batch size can be increased appropriately, and with 4-bit quantization a 33B model can be trained with a modest amount of GPU resources. QLoRA also works well for 13B models, of course.

The latest version of the PEFT library has also added support for QLoRA; readers who are interested in the code can dig into it further.
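
As an illustration, a QLoRA fine-tuning setup typically looks like the sketch below (a minimal sketch assuming recent transformers, bitsandbytes and peft versions; the model name is hypothetical):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Hypothetical model name for illustration
model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-33b-base-model",
    quantization_config=bnb_config,
    device_map="auto",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)
model.print_trainable_parameters()
# The paged optimizer is typically enabled on the Trainer side,
# e.g. optim="paged_adamw_8bit" in TrainingArguments.
```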

3. Introduction to quantized inference

3.1 Introduction to GPTQ quantization

GPTQ (Generative Pretrained Transformer Quantization) is a post-training quantization method that can efficiently quantize models with tens of billions of parameters, compressing them to 3 or 4 bits per parameter without significant loss of accuracy. The paper is at https://arxiv.org/abs/2210.17323 .

Post-training quantization means quantizing a model after training is complete: the model's weights are converted from 32-bit floating point (or another higher-precision format) to a lower-precision format such as 4-bit integers. This greatly reduces the model's size and the amount of computation needed to run it, though it may introduce some loss of precision.
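
For reference, the sketch below shows one way to apply GPTQ post-training quantization using the open-source AutoGPTQ library (a minimal sketch assuming auto-gptq is installed; the model path and calibration text are placeholders):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name = "your-org/your-13b-finetuned-model"  # hypothetical path for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4 bits per parameter
    group_size=128,  # group size for the quantization scales
)

model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

# GPTQ is calibration-based: a small set of example inputs is needed
examples = [tokenizer("This is a sample calibration sentence.", return_tensors="pt")]
model.quantize(examples)

model.save_quantized("your-13b-gptq-4bit")  # the quantized model is then used for inference
```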

3.2 GPTQ quantization: measured comparison

Several quantization methods are currently used in industry, including GGML and GPTQ. In our measurements, GPTQ-quantized deployments showed lower accuracy loss and good performance.

We conducted a 4-bit quantization test on the 13B model and found that the comparison after GPTQ quantization is as follows:

4. Practice: large model training and inference on the KubeAI platform

Earlier we introduced large model training techniques, i.e. how LoRA and QLoRA work, and the steps of quantized deployment with GPTQ. We have integrated these steps into KubeAI's training and inference platform and provide 7B, 13B, and 33B model options. Select "GPT service / customized version (Finetune)" in KubeAI to try it out.

4.1 The training and inference workflow of the KubeAI platform

  • Model selection: the KubeAI platform provides three sizes (7B, 13B, 33B), and more will be added gradually.

  • Fine-tuning training currently supports the LoRA and QLoRA methods; other methods will be added in the future.

  • After training, two models are produced: the original 16-bit model and a GPTQ 4-bit quantized model (when QLoRA is used).

  • One-click deployment: after selecting the desired model, users can deploy it as a service with one click; a web page and API interface are provided so users can try out the results.

4.2 Steps for training and deploying a large model on KubeAI

  • Choose the large model; three versions are currently available (7B, 13B, 33B).

  • Upload the training data; the alpaca data format is currently supported (see the example after this list).

  • Configure the training parameters; usually only the batch size and training steps need to be set according to the available GPUs, and the rest can use the defaults.

  • Click to start training.

  • After training, select the model and click Deploy to deploy it as a service with one click.

  • After the service is deployed, click the access link to open the access page, where the corresponding API interface is also provided.
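
For reference, the alpaca training data format mentioned above is a JSON list of instruction / input / output records, roughly like the sketch below (the sample records are invented purely for illustration):

```python
import json

# Each record has an instruction, an optional input, and the expected output
samples = [
    {
        "instruction": "Classify the sentiment of the following product review.",
        "input": "The shoes arrived quickly and fit perfectly.",
        "output": "positive",
    },
    {
        "instruction": "Translate the sentence into English.",
        "input": "这双鞋的质量很好。",
        "output": "The quality of this pair of shoes is very good.",
    },
]

with open("train.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```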

4.3 Knowledge-base-backed inference on the KubeAI platform

  • The large inference model can be deployed offline and trained/optimized for specialized scenarios.

  • The text embedding (vector) model can be deployed offline and can also be trained and optimized for local scenarios.

  • Various data sources can be connected quickly, with support for pdf, txt, md, docx, csv and other file types.

  • Sentence segmentation and document parsing are optimized for Chinese usage scenarios (a general sketch follows this list).
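
To give a rough idea of how such knowledge-base inference works in general (this is not KubeAI's actual implementation), the sketch below chunks a document, embeds the chunks with a text vector model, retrieves the most relevant chunk for a question, and builds a prompt for the large model. The sentence-transformers library and the embedding model name are stand-ins chosen for illustration.

```python
import re
import numpy as np
from sentence_transformers import SentenceTransformer

# Stand-in embedding model; in practice this would be the offline-deployed text vector model
embedder = SentenceTransformer("moka-ai/m3e-base")


def split_into_chunks(text, max_len=200):
    """Naive Chinese-aware splitting on sentence-ending punctuation."""
    sentences = re.split(r"(?<=[。！？.!?])", text)
    chunks, current = [], ""
    for s in sentences:
        if len(current) + len(s) > max_len and current:
            chunks.append(current)
            current = ""
        current += s
    if current:
        chunks.append(current)
    return chunks


def retrieve(question, chunks, top_k=1):
    """Return the chunks whose embeddings are most similar to the question."""
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec
    best = np.argsort(-scores)[:top_k]
    return [chunks[i] for i in best]


document = "……"  # text extracted from an uploaded pdf/txt/md/docx/csv file
chunks = split_into_chunks(document)
context = "\n".join(retrieve("用户的问题", chunks))
prompt = f"Answer the question based on the following context.\nContext:\n{context}\nQuestion: 用户的问题"
# `prompt` is then sent to the deployed large model for generation
```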

5. Summary

We investigated the LoRA and QLoRA fine-tuning methods for large models, as well as inference deployment and GPTQ quantized deployment. The entire pipeline from fine-tuning to inference deployment has been integrated into the KubeAI platform so that everyone can experiment quickly. In addition, the platform supports uploading documents to a knowledge base and performing knowledge-base-backed inference.

Besides LoRA, QLoRA, and GPTQ, there are other training and inference techniques for large models. Because the large model community is so active, better fine-tuning and quantized-deployment technologies will certainly appear. We will keep following up, and if new methods outperform the currently supported ones in quality and performance, the platform will integrate them into the existing framework in a timely manner.

*Text/linggong

This article is an original article of Dewu Technology. For more exciting articles, please see: Dewu Technology Official Website

It is strictly forbidden to reprint without the permission of Dewu Technology, otherwise legal responsibility will be investigated according to law!
