Tutorial on deploying Llama2 (MetaAI) large model under Linux system

Llama2 is Meta's latest open-source large language model. It was trained on 2 trillion tokens, its context length is doubled from LLaMA's 2,048 to 4,096 tokens so it can understand and generate longer texts, and it is released in three sizes: 7B, 13B, and 70B. It performs well on several benchmarks and, most importantly, can be used for both research and commercial purposes.

1. Preparation work

1. The model deployed in this article is Llama2-chat-13B-Chinese-50W (download page: https://huggingface.co/RicardoLee/Llama2-chat-13B-Chinese-50W).

2. Since most laptops cannot meet the hardware requirements of a model this size, this tutorial uses the AutoDL platform (a GPU-rental cloud) for deployment. Note: it is paid, but much cheaper than Alibaba Cloud.

2. Rent an instance on the autodl platform

Register an account and log in. Click "Console" in the upper-right corner to open your personal console, click "Container Instance" on the left, and then click "Rent a New Instance".

On the "Rent a New Instance" page, select "Pay-As-You-Go" as the billing method, "Beijing Area C" as the region, and a "V100-32GB" host as the GPU model.

Select "Basic Image" for the image : PyTorch/2.0.0/3.8(ubuntu20.04)/11.8

Finally, click "Create Now".

Wait a moment; after the status changes to "Running", click "Shut Down". (The next step restarts the instance in no-GPU mode, which is cheaper for downloading the model.)

3. Clone the large model Llama2 to the data disk

Click "More" on the right side of the instance and select "Start with no card model" . No GPU is required to download data, and the price is lower if you choose the cardless mode to start up.

After it boots, click "JupyterLab" in the shortcut tools to open JupyterLab.

Here, autodl-tmp is the data disk, used for larger files; the other three directories live on the system disk. In this tutorial, the Llama2 model files are stored in autodl-tmp.

Next, create a new folder named "Llama2" to store the execution scripts.
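The folder creation and directory switch can also be done from a JupyterLab terminal (a minimal sketch; it assumes the terminal starts in the default home directory of the autodl image, where autodl-tmp is mounted):

```shell
mkdir -p Llama2   # folder for the demo scripts uploaded later
cd autodl-tmp     # data disk, where the model weights will be cloned
```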

Then enter autodl-tmp and download Llama2-chat-13B-Chinese-50W by running the following commands in order.

1. Install git-lfs (the model's weight files are stored as Git LFS objects, so git-lfs is required for the clone; the final command enables the LFS hooks once per user)

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install

(Screenshot of the running result.)

2. Clone the large model Llama2 to the data disk

A proxy/VPN connection is needed during cloning. You can use the cloud platform's built-in "academic acceleration" feature by running the following command:

source /etc/network_turbo
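Under the hood, this script enables acceleration by exporting HTTP/HTTPS proxy environment variables into the current shell, which is why it must be run with source. A sketch of how to verify it took effect, and how to turn it off again later if the proxy interferes with something else:

```shell
env | grep -i proxy            # should now list http_proxy / https_proxy
unset http_proxy https_proxy   # disables the acceleration in this shell
```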

Then run the following command to clone the model:

git clone https://huggingface.co/RicardoLee/Llama2-chat-13B-Chinese-50W

After running for a while, the clone may get stuck or report an error. Checking the file browser on the left shows that three files were not downloaded.

The three files are relatively large: 9.6 GB, 9.6 GB, and 6.4 GB. If they cannot be downloaded because of network problems, you can download the model from the Hugging Face website on your own machine and then upload it to the cloud platform. (A VPN is needed to reach Hugging Face; the original author offers to share the source files by private message if necessary.)

Or run the following commands to download the three shards individually (remember to cd into the Llama2-chat-13B-Chinese-50W/ directory first; wget's -c option can be added to resume an interrupted download):

wget https://huggingface.co/RicardoLee/Llama2-chat-13B-Chinese-50W/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/RicardoLee/Llama2-chat-13B-Chinese-50W/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/RicardoLee/Llama2-chat-13B-Chinese-50W/resolve/main/pytorch_model-00003-of-00003.bin

(Screenshot of the running result; if the speed is too slow, enable academic acceleration.)

After the download completes, check the timestamps shown for the files. If a file is listed as modified "N months ago", the download succeeded; if it shows a very recent time (such as "3 minutes ago"), something went wrong during the download and the file needs to be downloaded again.
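File sizes are another reliable signal (a minimal sketch; the directory path is the clone location used above and is otherwise an assumption). Real weight shards are several gigabytes, while a failed download often leaves a small HTML error page saved under the shard's filename:

```shell
# check_shards DIR: print a verdict for every weight shard in DIR;
# anything under ~1 GB is almost certainly not real model weights.
check_shards() {
    for f in "$1"/pytorch_model-*.bin; do
        size=$(stat -c %s "$f" 2>/dev/null || echo 0)
        if [ "$size" -lt 1000000000 ]; then
            echo "re-download: ${f##*/} ($size bytes, too small)"
        else
            echo "looks complete: ${f##*/} ($size bytes)"
        fi
    done
}

check_shards /root/autodl-tmp/Llama2-chat-13B-Chinese-50W
```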

4. Download and deploy gradio

With the popularity of conversational AI such as ChatGPT, a framework called Gradio has also become popular. Gradio can expose an HTTP service with input and output components, letting a conversational AI project run quickly; it is designed for rapidly putting a visual interface on AI projects.

1. Download the execution files gradio_demo.py and requirements.txt

Open https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/inference/gradio_demo.py, download gradio_demo.py and requirements.txt to your local machine, and upload them into the Llama2 folder.

2. Change the torch version in requirements.txt to 2.0.0, then install from requirements.txt

Change the torch version in requirements.txt to 2.0.0 so that it matches the PyTorch 2.0.0 preinstalled in the image. Remember to press Ctrl+S to save after editing.

Switch to the Llama2 directory and run the following command to install the dependencies:

pip install -r requirements.txt

(Screenshot of the running result; if an error is reported, enable academic acceleration.)

3. Comment out lines 59, 60, and 61 in gradio_demo.py and install the required packages manually

Comment out lines 59, 60, and 61 in gradio_demo.py, then manually install the packages that gradio_demo.py imports:

Install Gradio:

pip install gradio -i http://pypi.douban.com/simple/  --trusted-host pypi.douban.com

Install bitsandbytes:

pip install bitsandbytes

Install accelerate:

pip install accelerate

Install scipy:

pip install scipy

After completing the above steps, close JupyterLab and shut the instance down.

5. Start the instance in GPU mode and run the model

Return to the AutoDL console and click "Start" (a normal start this time, with GPU). After booting, click "JupyterLab" in the shortcut tools to open JupyterLab.

First cd into the Llama2 folder and enable academic acceleration.
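In terminal commands this looks like the following (the /root/Llama2 path is an assumption based on where the folder was created in step 3; adjust it if yours lives elsewhere):

```shell
cd /root/Llama2             # folder containing gradio_demo.py
source /etc/network_turbo   # academic acceleration, as in step 3
```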

Run the large model:

python gradio_demo.py --base_model /root/autodl-tmp/Llama2-chat-13B-Chinese-50W --tokenizer_path /root/autodl-tmp/Llama2-chat-13B-Chinese-50W --gpus 0

Running result:

Click the link in the output (the one highlighted in the red box in the screenshot) to open the conversation page.

At this point, you have successfully deployed Llama2-chat-13B-Chinese-50W!

6. Possible problems

1. In step 5 (starting in GPU mode and running the model), an error appears after entering and running the command.

Error content:

Vocab of the base model: 49954
Vocab of the tokenizer: 49954
Traceback (most recent call last):
  File "gradio_demo.py", line 298, in <module>
    user_input = gr.Textbox(
AttributeError: 'Textbox' object has no attribute 'style'

Solution: Open the gradio_demo.py file and delete the highlighted content on lines 301 and 302 (the .style(...) call chained onto the Textbox; newer Gradio releases removed Textbox.style, which is what triggers this AttributeError). After deleting, press Ctrl+S to save.

Run it again and the error disappears.

Thanks for reading; a like, favorite, and follow are appreciated!

Reprinted from: blog.csdn.net/m0_52625549/article/details/134239972