Model talk: use INT8 quantized inference to run Meta's "open source leaked" large model (LLaMA)

Following up on the previous article, "Talking about Models: Quickly Get Started with the 'Open Source Leaked' Large Model (LLaMA) from the Metaverse Giant Meta", let's continue the topic and look at how to use INT8 quantization so that devices with less video memory can run the model.

Foreword

A few days ago, I came across the Zhihu question "How do you evaluate the leak of the LLaMA model?". Since I happened to be tinkering with it at the time, I ran the original model on the machine at hand, wrote a short answer, and attached a screenshot of the video memory it actually required.

While tinkering, I came across the community project pyllama mentioned in the previous article, which can run the model with much less video memory than the original release. Because the machine at hand has plenty of video memory, I did not reproduce and verify it at the time. Later, among the growing list of Zhihu answers, I saw several people (including highly upvoted answers and verified accounts) claim that this solution runs directly on an 8GB graphics card, so I did not take it too seriously, until I verified the project a few days ago and found the problem; around the same time, some readers reported that 8GB could not run it.

To be fair, the pyllama project does save video memory: the smallest model, which originally needs about 20GB of video memory, can be run with much less. However, the video memory it actually uses is still more than 8GB. In other words, if your graphics card has only 8GB, it really cannot run the model directly without splitting the model (or further tuning the runtime parameters).

To solve this problem, this article introduces another solution from the community: tloen/llama-int8, which really does let you run the 7B version of LLaMA on an 8GB graphics card, and the 13B version within 16GB of video memory.
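Before diving into the setup, here is a toy sketch (written by hand for illustration, not taken from tloen/llama-int8) of what INT8 weight quantization boils down to: per-row absmax scaling of the weights into the int8 range, which is the basic idea behind the LLM.int8() approach implemented by bitsandbytes (its extra outlier handling is omitted here):

import torch

# A fake weight matrix standing in for one linear layer of the model
# (FP32 here for simplicity; the real model uses FP16).
weights = torch.randn(4, 8)

# One scale per row: the largest absolute value in the row maps to 127.
scales = weights.abs().amax(dim=1, keepdim=True) / 127.0

# Quantize: each weight now takes 1 byte instead of 2.
q_weights = torch.round(weights / scales).to(torch.int8)

# Dequantize on the fly during the matrix multiplication; the error is small.
dequant = q_weights.float() * scales
print("max quantization error:", (weights - dequant).abs().max().item())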

To make it easy to use and to verify the effect, I have also added the INT8 inference scheme to the "LLaMA Playground" open source project mentioned earlier. Project address: soulteary/llama-docker-playground

Three ways to play with LLaMA in the Playground project

Model file download and integrity verification were covered in the previous article and will not be repeated here; this article only covers the newly added INT8 quantized inference scheme. Likewise, the official inference scheme and the community-provided pyllama scheme mentioned above will not be expanded on again; if you are interested, you can read the previous articles.

Let’s first prepare the Docker operating environment for the model.

Using the LLaMA Docker Playground project

As before, find a suitable directory and download the code of the "LLaMA Playground" project locally, either with git clone or by downloading the zip archive:

git clone https://github.com/soulteary/llama-docker-playground.git

# or 
curl -sL -o llama.zip https://github.com/soulteary/llama-docker-playground/archive/refs/heads/main.zip

Then we use Docker to build the base runtime environment on top of NVIDIA's official latest PyTorch image. Compared with pulling a pre-built image directly from DockerHub, building it ourselves saves quite a bit of time.

We can build a Docker environment capable of INT8 inference by executing the following command in the project directory:

docker build -t soulteary/llama:int8 . -f docker/Dockerfile.int8

Wait a moment for the image to build, and then we can start playing.
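Optionally, before loading any weights, you can run a quick generic PyTorch check from inside the container (not part of the project itself) to confirm that the GPU and its video memory are visible:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 2**30:.1f} GiB of video memory")
else:
    print("No CUDA device visible; check the --gpus flag and the NVIDIA Container Toolkit setup")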

Run the LLaMA model with INT8 quantization using Docker

Go to the directory that contains the models directory holding the model files, and use the following command to start the LLaMA project with INT8 quantization in one step:

docker run --gpus all --ipc=host --ulimit memlock=-1 -v `pwd`/models:/app/models -p 7860:7860 -it --rm soulteary/llama:int8

When the command is executed, the program automatically loads the model into video memory and starts a web UI. The output log will look similar to the following:

=============
== PyTorch ==
=============

NVIDIA Release 23.01 (build 52269074)
PyTorch Version 1.14.0a0+44dac51

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license


===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
Allocating transformer on host
Loading checkpoint 0
Loaded in 11.42 seconds with 7.12 GiB
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.

Once the model is running, we can visit port 7860 on the machine's IP address and start trying out the INT8 version of LLaMA. If you are running the program locally, simply open http://localhost:7860 in your browser and you will see the interface below.

A plain user interface

As before, type a question into the text box on the left and click the submit button. After the model "thinks" for a while, it will "make up" an answer it considers appropriate.
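The "set share=True in launch()" line in the log is printed by Gradio, so the web UI is presumably a small Gradio app wrapped around the model. A minimal sketch of that kind of wrapper (with the model call stubbed out; the project's actual webapp.py may be organized differently) looks like this:

import gradio as gr

def answer(prompt: str) -> str:
    # In the real app this would call the INT8-quantized LLaMA generator;
    # it is stubbed out here so the sketch runs on its own.
    return f"(model output for: {prompt})"

demo = gr.Interface(fn=answer, inputs=gr.Textbox(lines=4, label="Prompt"), outputs="text")
demo.launch(server_name="0.0.0.0", server_port=7860)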

To make it easy to compare with the other two solutions from the previous article, I ask the same simple question, "tell me more about zhihu". This time the answer is as follows:

7B model answer using INT8 mode

The results are still a bit worrying, but INT8 lets many more devices run LLaMA, and with some prompt engineering it could be a cost-effective solution. Of course, for any technical solution we should weigh its advantages and disadvantages objectively.

The log above clearly shows that, compared with the original version or the pyllama solution, the INT8 solution needs only 7.12GB of video memory to load the model, but the trade-off is a longer loading time: 11.42 seconds, more than double that of the original version. And it is not only model loading that slows down; the actual inference time also grows to several times that of the original program.

The memory usage of the 7B model running in INT8 mode

Also, while the model is running, the video memory actually in use may exceed 8GB, so if you have only 8GB of video memory you may need to tune the runtime parameters (for example, keeping prompts and generated outputs short).

Fri Mar 10 01:52:57 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  Off |
| 31%   41C    P2   123W / 450W |   8245MiB / 24564MiB |     49%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1301      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      1415      G   /usr/bin/gnome-shell               10MiB |
|    0   N/A  N/A      7080      C   python                           8220MiB |
+-----------------------------------------------------------------------------+
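If you want to reproduce numbers like these in your own scripts rather than eyeballing nvidia-smi, PyTorch's allocator statistics are enough. Here is a generic sketch; load_model below is a hypothetical stand-in for whatever loader you use (for example, the one from tloen/llama-int8), stubbed out so the snippet runs on its own. Note that max_memory_allocated counts only PyTorch's own allocations, so it reports a bit less than nvidia-smi, which also includes the CUDA context.

import time
import torch

def load_model():
    # Hypothetical placeholder: in practice this would build and load
    # the INT8-quantized LLaMA model.
    return torch.nn.Linear(4096, 4096).half().cuda()

torch.cuda.reset_peak_memory_stats()
start = time.time()
model = load_model()
print(f"Loaded in {time.time() - start:.2f} seconds")
print(f"Peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")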

Besides allowing devices with only 8-10GB of video memory to run this model, INT8 inference brings an extra benefit: consumer graphics cards with 16GB or more of video memory can run the next model size up, the 13B version.

Run the LLaMA 13B model with INT8 quantized inference

If you have a graphics card with more than 13GB of video memory, you can try to run the LLaMA 13B model.

To run the 13B version of the model, we can reuse the Docker image we just built, only adjusting the command parameters slightly:

docker run --gpus all --ipc=host --ulimit memlock=-1 -v `pwd`/models:/app/models -p 7860:7860 -it --rm soulteary/llama:int8 python webapp.py --ckpt_dir models/13B

After the command is executed, wait for a while, and we will see a log similar to the following:

...
Allocating transformer on host
Loading checkpoint 0
Loading checkpoint 1
Loaded in 22.13 seconds with 13.19 GiB
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.

As before, I ask the same little question:

13B model answer using INT8 mode

Compared with the 7B version, the answer of the 13B version is intuitively better.

The actual video memory resources consumed during model inference are shown in the figure:

Memory usage of 13B model running in INT8 mode

Fri Mar 10 01:54:28 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  Off |
| 31%   44C    P2   139W / 450W |  14841MiB / 24564MiB |     55%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1301      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      1415      G   /usr/bin/gnome-shell               10MiB |
|    0   N/A  N/A      7424      C   python                          14816MiB |
+-----------------------------------------------------------------------------+
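As a rough sanity check (back-of-the-envelope arithmetic, not an official figure), the numbers above line up with what INT8 storage implies: about one byte per parameter for the weights, with activations and the attention cache accounting for the rest during generation.

params_13b = 13_000_000_000

print(f"INT8 weights: {params_13b * 1 / 2**30:.1f} GiB")  # ~12.1 GiB, close to the ~13 GiB reported at load time
print(f"FP16 weights: {params_13b * 2 / 2**30:.1f} GiB")  # ~24.2 GiB, which is why FP16 13B does not fit in 16GB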

Other

On using INT8 quantization for model inference, NVIDIA's official technical blog has a post worth reading: "Achieving FP32 Accuracy for INT8 Inference Using Quantization-Aware Training with NVIDIA TensorRT", which covers the advantages and principles of this approach.

The key dependency of the project used in this article is bitsandbytes (originally facebookresearch/bitsandbytes; the actively maintained code now lives in the TimDettmers/bitsandbytes repository). If you want to optimize other projects at hand in the same way, it is worth considering.
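For reference, the core trick is quite generic. The following is a simplified sketch, based on the typical way bitsandbytes is used rather than the exact code of tloen/llama-int8, of swapping a model's nn.Linear layers for the library's 8-bit layers; the actual INT8 quantization happens when the model is moved to the GPU:

import torch.nn as nn
import bitsandbytes as bnb

def swap_linear_for_int8(module: nn.Module, threshold: float = 6.0) -> None:
    """Recursively replace nn.Linear layers with bitsandbytes Linear8bitLt layers."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            int8_layer = bnb.nn.Linear8bitLt(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,  # keep weights in INT8 at inference time
                threshold=threshold,     # outlier threshold from the LLM.int8() paper
            )
            # Copy the original FP16/FP32 weights; they are quantized to INT8
            # when the module is later moved to the GPU with .cuda()/.to("cuda").
            int8_layer.weight.data = child.weight.data
            if child.bias is not None:
                int8_layer.bias = child.bias
            setattr(module, name, int8_layer)
        else:
            swap_linear_for_int8(child, threshold)

# Usage sketch: build the model on the CPU in FP16, swap the layers, then
# move it to the GPU, which is where the quantization actually happens.
# model = MyModel().half()
# swap_linear_for_int8(model)
# model.cuda()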

Finally

Reading this far, readers with small video memory may still be sighing. In fact, besides these three solutions, an even more efficient local inference solution appeared last week. Perhaps in a later article we can talk about how to get started with it quickly.

The sound of the open source storm is approaching.

–EOF


This article is licensed under the "Attribution 4.0 International (CC BY 4.0)" agreement. You are welcome to reprint or reuse it, but please credit the source.

Author of this article: Su Yang

Creation time: March 13, 2023
Link to this article: https://soulteary.com/2023/03/13/quick-start-llama-model-created-by-meta-research-with-int8.html
