Deploying the Baichuan2 large language model

Baichuan2 is the new generation of open-source large language models released by Baichuan Intelligence, trained on a high-quality corpus of 2.6 trillion tokens. It achieves the best results among models of the same size on several authoritative Chinese, English, and multilingual benchmarks, both general-purpose and domain-specific. The release includes 7B and 13B Base and Chat versions, as well as 4-bit quantized versions of the Chat models.

Model download

Base models

Baichuan2-7B-Base: https://huggingface.co/baichuan-inc/Baichuan2-7B-Base

Baichuan2-13B-Base: https://huggingface.co/baichuan-inc/Baichuan2-13B-Base

Alignment models

Baichuan2-7B-Chat: https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat

Baichuan2-13B-Chat: https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat

Alignment models, 4-bit quantized

Baichuan2-7B-Chat-4bits: https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat-4bits

Baichuan2-13B-Chat-4bits: https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat-4bits

Clone the code

git clone https://github.com/baichuan-inc/Baichuan2

Install dependencies

pip install -r requirements.txt

Calling the model

Calling from Python

Chat model inference example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

# Load the tokenizer and the Chat model in bfloat16, spread across the available GPUs
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan2-13B-Chat", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-13B-Chat", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained("baichuan-inc/Baichuan2-13B-Chat")

# Ask the model to explain the saying "温故而知新" ("review the old to learn the new")
messages = []
messages.append({"role": "user", "content": "解释一下“温故而知新”"})
response = model.chat(tokenizer, messages)
print(response)
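
The same chat() interface also supports multi-turn conversation. A minimal sketch, assuming the usual user/assistant roles; the follow-up question below is only an illustrative example:

# Append the assistant's reply, then ask a follow-up question in the same conversation
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "能再举一个生活中的例子吗？"})  # illustrative follow-up: "Can you give an everyday example?"
response = model.chat(tokenizer, messages)
print(response)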

Base model inference example:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan2-13B-Base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-13B-Base", device_map="auto", trust_remote_code=True)

# Few-shot continuation prompt ("poem title -> author"); the model should continue with the author of 夜雨寄北
inputs = tokenizer('登鹳雀楼->王之涣\n夜雨寄北->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

Specifying device_map='auto' when loading the model will use all available GPUs.

If you need to restrict which devices are used, set the CUDA_VISIBLE_DEVICES environment variable, e.g. export CUDA_VISIBLE_DEVICES=0,1 to use GPUs 0 and 1.
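
Equivalently, the restriction can be applied from inside a Python script, as long as the environment variable is set before torch initializes CUDA. A minimal sketch:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # must run before torch touches CUDA

import torch
from transformers import AutoModelForCausalLM

# With only GPUs 0 and 1 visible, device_map="auto" shards across just those two devices
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat", device_map="auto", torch_dtype=torch.float16, trust_remote_code=True)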

Command-line demo

python cli_demo.py

This command-line tool is designed for Chat models; it cannot be used to call the Base models.

Web demo

The web demo relies on streamlit. Running the following command starts a web service locally; open the address printed to the console in a browser to access it.

streamlit run web_demo.py

This web demo is likewise designed for Chat models and cannot be used to call the Base models.

Quantization

Baichuan2 supports two modes: online quantization and offline quantization.

Online quantization

For online quantization, Baichuan2 supports 8-bit and 4-bit quantization, and the usage is similar to that in the Baichuan-13B project. You only need to load the model into CPU memory first, then call the quantize() interface to quantize it, and finally call cuda() to copy the quantized weights to GPU memory. The code to load the whole model is very simple; take Baichuan2-7B-Chat as an example:

8-bit online quantization:

model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat", torch_dtype=torch.float16, trust_remote_code=True)
model = model.quantize(8).cuda() 

4-bit online quantization:

model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat", torch_dtype=torch.float16, trust_remote_code=True)
model = model.quantize(4).cuda() 

Note that users usually add device_map="auto" when calling from_pretrained. When using online quantization, this parameter must be removed, otherwise an error will be reported.
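
Putting the pieces together, a minimal end-to-end sketch of chatting with the 4-bit online-quantized model; it reuses the same chat() interface shown earlier, and the prompt is only an example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan2-7B-Chat", use_fast=False, trust_remote_code=True)
# No device_map="auto" here: the model is first loaded into CPU memory, quantized, then moved to the GPU
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat", torch_dtype=torch.float16, trust_remote_code=True)
model = model.quantize(4).cuda()
model.generation_config = GenerationConfig.from_pretrained("baichuan-inc/Baichuan2-7B-Chat")

messages = [{"role": "user", "content": "你好"}]
print(model.chat(tokenizer, messages))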

Offline quantization

For users' convenience, Baichuan2 provides an offline-quantized 4-bit version, Baichuan2-7B-Chat-4bits, for download. Loading the Baichuan2-7B-Chat-4bits model is very simple; just execute:

model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat-4bits", device_map="auto", trust_remote_code=True)

Baichuan2 does not provide an 8-bit offline-quantized version, because the Hugging Face transformers library already provides a corresponding API (load_in_8bit, which requires the bitsandbytes package) that makes it easy to save and load 8-bit quantized models. Users can save and load 8-bit models themselves as follows:

# Quantize to 8-bit at load time (model_id is the original checkpoint path or Hub id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto", trust_remote_code=True)
# Save the quantized weights, then reload them later from the saved directory
model.save_pretrained(quant8_saved_dir)
model = AutoModelForCausalLM.from_pretrained(quant8_saved_dir, device_map="auto", trust_remote_code=True)

CPU deployment

The Baichuan2 models support CPU inference, but it should be emphasized that inference on the CPU is relatively slow. Modify the model loading as follows:

model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat", torch_dtype=torch.float32, trust_remote_code=True)

Model fine-tuning

Dependency installation

git clone https://github.com/baichuan-inc/Baichuan2.git
cd Baichuan2/fine-tune
pip install -r requirements.txt

If you want to use lightweight fine-tuning methods such as LoRA, you additionally need to install peft.

If you want to use xFormers to accelerate training, you additionally need to install xFormers. Likely install commands are sketched below.
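
For reference, the extra packages would typically be installed from PyPI like this (package names assumed; the repository does not pin versions here):

pip install peft
pip install xformers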

Single-machine training

hostfile=""
deepspeed --hostfile=$hostfile fine-tune.py  \
    --report_to "none" \
    --data_path "data/belle_chat_ramdon_10k.json" \
    --model_name_or_path "baichuan-inc/Baichuan2-7B-Base" \
    --output_dir "output" \
    --model_max_length 512 \
    --num_train_epochs 4 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --save_strategy epoch \
    --learning_rate 2e-5 \
    --lr_scheduler_type constant \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --adam_epsilon 1e-8 \
    --max_grad_norm 1.0 \
    --weight_decay 1e-4 \
    --warmup_ratio 0.0 \
    --logging_steps 1 \
    --gradient_checkpointing True \
    --deepspeed ds_config.json \
    --bf16 True \
    --tf32 True

Lightweight fine-tuning

The code already supports lightweight fine-tuning methods such as LoRA. To use it, just add the following parameter to the script above:

--use_lora True

The specific configuration of LoRA can be found in the fine-tune.py script.
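
For orientation only, here is a sketch of what such a LoRA configuration with peft might look like; the exact hyperparameters and target modules used in fine-tune.py may differ, and W_pack is assumed to be the name of Baichuan's fused attention projection:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Base", trust_remote_code=True)

# Illustrative values only; check fine-tune.py for the settings actually used
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["W_pack"],  # assumed fused QKV projection module in Baichuan models
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints how many parameters LoRA actually trains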

After fine-tuning with LoRA, you can load the model using the following code:

from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained("output", trust_remote_code=True)
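
If you later want a single standalone checkpoint without the adapter indirection, peft's merge_and_unload() can fold the LoRA weights back into the base model. A minimal sketch continuing from the model loaded above (the output directory name is an assumption):

# Fold the LoRA weights into the base model and save a plain checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("output-merged")  # hypothetical output directory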


Reprinted from: blog.csdn.net/watson2017/article/details/134398511