LLM Series | 02: Vicuna Introduction and Model Deployment Test

Introduction

In the dark of the moon you can see the fishing lamp, a single firefly of light in the solitude; a gentle breeze stirs the waves and scatters it into a river full of stars. Hello friends, I am the editor of the WeChat public account "Xiao Chuang You Ji Machine Learning": the little boy who sells steel wool. Today's article mainly introduces the Vicuna model, deploys a service based on the official 13B model, and tests it with some dialogues.
For more articles and updates, please follow the WeChat public account: Xiaochuang Youji Machine Learning. Future posts will continue to cover topics such as model acceleration, model deployment, model compression, LLMs, and AI art, so stay tuned.


Vicuna model

As of April 2023, the Vicuna team has only released Vicuna-7B and Vicuna-13B; the hands-on tests later in this article are based on Vicuna-13B. Vicuna-13B is a model obtained by supervised fine-tuning of LLaMA-13B. The fine-tuning data comes from user conversations shared on ShareGPT.com, about 70K items in total. ShareGPT is a ChatGPT data-sharing website where users upload ChatGPT answers they find interesting. A preliminary evaluation using GPT-4 as the judge shows that Vicuna-13B achieves more than 90% of the quality of OpenAI ChatGPT and Google Bard, while outperforming other models such as LLaMA and Stanford Alpaca more than 90% of the time. Training Vicuna-13B costs about $300. The team also released the training code: https://github.com/lm-sys/FastChat and an online demo: https://chat.lmsys.org/. Spoiler alert: judging from the actual tests later in this article, the results are only so-so, so it is best to manage expectations up front.

Overview

The overall process of Vicuna is as follows:

[Figure: overall Vicuna workflow]

First, about 70,000 conversations were collected from ShareGPT.com (a website where users share their ChatGPT conversations). Next, the training script provided by Alpaca was further optimized to better handle multi-turn dialogues and long sequences. Compared with Alpaca, Vicuna extends the sequence length from 512 to 2048 during training and addresses the resulting memory pressure with gradient checkpointing and flash attention; it also adjusts the training loss to account for multi-turn dialogue and fine-tunes only on the model's outputs. Vicuna was trained on 8 A100s for one day using PyTorch FSDP. To provide a demo service, the Vicuna team implemented a lightweight distributed serving system.

For evaluation, an initial assessment of model outputs was performed by creating 80 diverse questions and using GPT-4 as the judge. To compare the output quality of two different models, for each question the outputs of the two models are combined into a single prompt, which is then sent to GPT-4; GPT-4 judges which model provided the better response. The detailed comparison against LLaMA, Alpaca, and ChatGPT is as follows:

[Figure: GPT-4-judged comparison of Vicuna with LLaMA, Alpaca, and ChatGPT]

Training

Vicuna researchers used the public API of ShareGPT.com to collect ChatGPT conversation data shared by about 70,000 users, and then fine-tuned the LLaMA-13B model to obtain Vicuna-13B. To ensure data quality, HTML is converted back to markdown and some inappropriate or low-quality samples are filtered out. Additionally, lengthy dialogues are broken into smaller parts to fit the model's maximum context length.
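
As an illustration of the "split long conversations" step, here is a rough sketch. The function name, the turn format, and the 2048 limit are assumptions for illustration; the official cleaning code is in the data_cleaning scripts referenced later in this article.

# Rough sketch: cut a multi-turn conversation into chunks that fit the context window.
def split_conversation(turns, tokenizer, max_len=2048):
    chunks, current, current_len = [], [], 0
    for turn in turns:  # each turn looks like {"from": "human" or "gpt", "value": "..."}
        n = len(tokenizer.encode(turn["value"]))
        if current and current_len + n > max_len:
            chunks.append(current)  # close the current chunk before it overflows
            current, current_len = [], 0
        current.append(turn)
        current_len += n
    if current:
        chunks.append(current)
    return chunks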

The training method makes the following improvements on top of Stanford Alpaca:

  • Memory optimization: to let Vicuna understand long contexts, the maximum context length was extended from 512 in Alpaca to 2048, which greatly increases GPU memory requirements. The Vicuna team relieves this memory pressure with gradient checkpointing and flash attention.

  • Multi-turn dialogue: the training loss is adjusted to account for multi-turn conversations, and the fine-tuning loss is computed only on the chatbot's outputs, i.e. fine-tuning only on the chatbot's replies (a minimal sketch of this label masking is given after this list).

  • Reducing cost with spot instances: the dataset is 40x larger and the training sequences are 4x longer than Alpaca's, which poses a considerable challenge for training cost. Vicuna uses SkyPilot managed spot instances to cut costs by exploiting cheaper spot instances, with automatic recovery from preemption and automatic zone switching. This cuts the training cost of the 7B model from about $500 to about $140, and that of the 13B model from about $1,000 to about $300.
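
The label-masking idea mentioned above can be sketched as follows. The -100 ignore value is the standard convention understood by PyTorch's cross-entropy loss; the helper name and span representation are assumptions for illustration, not the exact FastChat implementation.

# Sketch: compute the fine-tuning loss only on the chatbot's tokens by masking
# everything else with -100, which PyTorch's CrossEntropyLoss ignores.
IGNORE_INDEX = -100

def build_labels(input_ids, assistant_spans):
    # input_ids: list of token ids for the whole multi-turn conversation
    # assistant_spans: list of (start, end) index pairs covering the chatbot's replies
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]  # keep the loss only on these tokens
    return labels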

Evaluation

Evaluating an AI chatbot is extremely challenging, since it requires examining language understanding, reasoning, and context awareness. As AI chatbots get smarter, currently open benchmarks may no longer be sufficient. For example, SOTA chatbots can already answer self-instruct, the evaluation dataset used by Stanford Alpaca, so effectively that it is difficult for humans to discern differences in performance. There are other limitations as well: training/test data contamination and the potentially high cost of creating new benchmarks. To address these issues, the Vicuna researchers propose a GPT-4-based evaluation framework to automatically assess chatbot performance.

First, eight categories of questions are designed, such as Fermi problems, role-playing scenarios, and coding/math tasks, to test chatbot performance from various angles. Through careful prompt engineering, GPT-4 is able to generate diverse, challenging questions that baseline models struggle with. Ten questions are chosen per category, and answers are collected from five chatbots: LLaMA, Alpaca, ChatGPT, Bard, and Vicuna. GPT-4 is then asked to rate the quality of each answer in terms of helpfulness, relevance, accuracy, and level of detail. The finding is that GPT-4 not only produces relatively consistent scores but can also explain in detail why it gives them (detailed example link). However, it was also noticed that GPT-4 is not very good at judging coding and math problems.
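
As a concrete illustration, a pairwise GPT-4 judgment can be set up roughly as follows. The prompt wording and scoring format here are illustrative, not the exact prompts used by the Vicuna team, and the snippet assumes the pre-1.0 openai Python package.

# Sketch of pairwise judging with GPT-4: both answers go into one prompt,
# and GPT-4 is asked to score them on the criteria mentioned above.
import openai

def judge_pair(question, answer_a, answer_b):
    prompt = (
        f"Question: {question}\n\n"
        f"Assistant A's answer:\n{answer_a}\n\n"
        f"Assistant B's answer:\n{answer_b}\n\n"
        "Rate the helpfulness, relevance, accuracy, and level of detail of each "
        "answer on a scale of 1-10, then explain your reasoning."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]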

[Figures: GPT-4 evaluation results for the five chatbots]

Preparation

Environment installation

Following the FastChat tutorial, first git clone the repository and then install the fastchat package:

pip3 install -e .
The -e flag (--editable <path/url>) installs a project in editable mode (i.e. setuptools "develop mode") from a local project path or a VCS URL.

The package name after installation is fschat:

pip3 show  fschat
Name: fschat
Version: 0.2.2
Summary: An open platform for training, serving, and evaluating large language model based chatbots.
Home-page:
Author:
Author-email:
License:
Location: /opt/python3.10.11/lib/python3.10/site-packages
Requires: accelerate, fastapi, gradio, markdown2, numpy, prompt-toolkit, requests, rich, sentencepiece, shortuuid, tokenizers, torch, transformers, uvicorn, wandb

Model download

We release Vicuna weights as delta weights to comply with the LLaMA model license.

All LLaMA-based models can only be released as delta weights. Download the delta weights from the official address, add them to the original LLaMA weights (which need to be obtained separately), and you recover the weights of the released model.

# https://huggingface.co/lmsys/vicuna-13b-delta-v0

curl -Lo pytorch_model-00001-of-00003.bin https://huggingface.co/lmsys/vicuna-13b-delta-v0/resolve/main/pytorch_model-00001-of-00003.bin

# SHA256: 627062721346c21f30b679de08edd99abba409d3b37289419480c1d48f5e492a
curl -Lo pytorch_model-00002-of-00003.bin https://huggingface.co/lmsys/vicuna-13b-delta-v0/resolve/main/pytorch_model-00002-of-00003.bin

# SHA256: fe31b044e7b4034d0bf9adea93f1d2ef4e0fa02511914b4c19e72bdcfcacca6b
curl -Lo pytorch_model-00003-of-00003.bin https://huggingface.co/lmsys/vicuna-13b-delta-v0/resolve/main/pytorch_model-00003-of-00003.bin
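
To make sure the shards were not corrupted in transit, you can check them against the SHA256 values above. A small helper using only the Python standard library might look like this (the file name passed in is just an example):

# Compute a file's SHA256 and compare it with the published value.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("pytorch_model-00002-of-00003.bin"))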

Model fusion

Denote the downloaded delta model as vicuna-13b-delta-v0. You also need to download the original LLaMA-13B model and convert it to the Hugging Face format (hf for short), denoted llama-13b-hf:

python3 transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir /home/model_zoo/llama --model_size 13B --output_dir /home/model_zoo/llama/llama-13b-hf

Note that the official releases include two delta versions, delta-v0 and delta-v1.1. According to the official notes, v1.1 refactors the tokenization and the separator: the separator is changed from the original "###" to the EOS token "</s>". This change makes it easier to determine when generation should stop and improves compatibility with other libraries.

Run the following command in the FastChat directory:

python3 -m fastchat.model.apply_delta \
    --base /path/to/llama-13b \
    --target /output/path/to/vicuna-13b \
    --delta /home/model_zoo/vicuna/vicuna-13b-delta-v0

For example, for Vicuna-7B:

python3 -m fastchat.model.apply_delta \
    --base /path/to/llama-7b \
    --target /output/path/to/vicuna-7b \
    --delta lmsys/vicuna-7b-delta-v1.1

For Vicuna-13B:

python3 -m fastchat.model.apply_delta \
    --base /path/to/llama-13b \
    --target /output/path/to/vicuna-13b \
    --delta lmsys/vicuna-13b-delta-v1.1

Since the official recommendation is to use the v1.1 deltas, the commands actually run here are:

python3 -m fastchat.model.apply_delta \
    --base /home/model_zoo/llama/7B/hugging_face_format/ \
    --target /home/model_zoo/vicuna/vicuna-7b \
    --delta /home/model_zoo/vicuna/vicuna-7b-delta-v1.1
python3 -m fastchat.model.apply_delta \
    --base /home/model_zoo/llama/llama-13b-hf/ \
    --target /home/model_zoo/vicuna/vicuna-13b \
    --delta /home/model_zoo/vicuna/vicuna-13b-delta-v1.1

If the above command is applied directly to the v0 delta, an error is reported:

RuntimeError: The size of tensor a (32000) must match the size of tensor b (32001) at non-singleton dimension 0

At this point the vicuna-13b model is available under /home/model_zoo/vicuna/vicuna-13b.

Model inference (command-line mode)

Single GPU

Vicuna-13B requires about 28GB of GPU memory.
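As a rough sanity check on that figure: 13 billion parameters stored in fp16 take about 13e9 × 2 bytes ≈ 26 GB, and activations plus framework overhead account for the remaining couple of gigabytes.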

python3 -m fastchat.serve.cli --model-path /home/model_zoo/vicuna/vicuna-13b

If you find that some irrelevant text is printed in the output, the fix comes from the official response in this issue: Tokenizer issues.

Download special_tokens_map.json and tokenizer_config.json from https://huggingface.co/lmsys/vicuna-13b-delta-v0/tree/main and use them to replace the special_tokens_map.json and tokenizer_config.json under /home/model_zoo/vicuna/vicuna-13b. Run the command again and the model's output no longer contains that irrelevant text; it is relatively clean.

Multi-GPU

If a single GPU does not have enough memory, you can use model parallelism to aggregate the memory of multiple GPUs on the same machine.

python3 -m fastchat.serve.cli --model-path /home/model_zoo/vicuna/vicuna-13b --num-gpus 2

CPU only

If you want to run purely on the CPU, you need about 60GB of RAM.

python3 -m fastchat.serve.cli --model-path /home/model_zoo/vicuna/vicuna-13b --device cpu
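
If you would rather call the merged model directly from Python instead of going through the FastChat CLI, a minimal sketch with Hugging Face transformers is shown below. The prompt template is a simplification (FastChat applies its own conversation template internally), and a transformers version with LLaMA support plus accelerate is assumed.

# Load the merged vicuna-13b weights and generate a reply for a single prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/model_zoo/vicuna/vicuna-13b"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "USER: Briefly introduce the Vicuna model. ASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Strip the prompt tokens and decode only the newly generated part.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))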

Model inference (Web UI mode)

To serve the model through a web UI, you need to set up three components:

  1. the web server, which provides the user interface
  2. the model workers, which host the models
  3. the controller, which coordinates the web server and model workers

Start the controller

python3 -m fastchat.serve.controller --host 0.0.0.0

Start the model worker

python3 -m fastchat.serve.model_worker --model-path /home/model_zoo/vicuna/vicuna-13b --model-name vicuna-13b --host 0.0.0.0

When the process finishes loading the model, you will see "Uvicorn running on ...".

Send a test message

python3 -m fastchat.serve.test_message --model-name vicuna-13b

The returned result:

Models: ['vicuna-13b']
worker_addr: http://localhost:21002
USER: Tell me a story with more than 1000 words.
ASSISTANT: Once upon a time, in a small village nestled in the heart of a dense forest, there lived a young girl named Maria. She was an orphan

Start the Gradio web server

python3 -m fastchat.serve.gradio_web_server --port 8809

Now you can open the browser and chat with the model.
To keep the log output separate, the controller, the model worker, and the Gradio web server are each started in their own terminal window.

Fine-tuning

Data

Vicuna was created by fine-tuning a LLaMA base model on about 70,000 user-shared conversations collected from ShareGPT via its public APIs.

To ensure data quality, the team converted the HTML back to markdown and filtered out some inappropriate or low-quality samples. In addition, lengthy conversations are split into smaller segments that fit the model's maximum context length. For the specific cleaning code, see: data_cleaning

For various considerations, the Vicuna team does not release the ShareGPT dataset for the time being. If you want to try the fine-tuning code, you can run it on the dummy questions in dummy.json; you can also follow the same format and plug in your own data (a sketch of the format is shown below).
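
The records follow a ShareGPT-style schema; the sketch below shows what one training record might look like when written out with Python, but check playground/data/dummy.json in the repository for the authoritative field names.

# Build one ShareGPT-style training record and write it to a JSON file.
import json

record = {
    "id": "identity_0",
    "conversations": [
        {"from": "human", "value": "Who are you?"},
        {"from": "gpt", "value": "I am Vicuna, a chatbot fine-tuned from LLaMA."},
        {"from": "human", "value": "What can you do?"},
        {"from": "gpt", "value": "I can chat with you and answer your questions."},
    ],
}

with open("my_data.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)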

Code and hyperparameters

The team fine-tuned the model using code from Stanford Alpaca, with some modifications to support gradient checkpointing and Flash attention. In addition, the team also used similar hyperparameters to Stanford Alpaca.

Local fine-tuning of Vicuna-7B

torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path ~/model_weights/llama-7b  \
    --data_path playground/data/dummy.json \
    --bf16 True \
    --output_dir output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

If you encounter OOM problems during the model saving process, you can refer to: solutions


Original post: blog.csdn.net/ljp1919/article/details/130449483