Baichuan2 is a new generation of open-source large language models released by Baichuan Intelligence, trained on a high-quality corpus of 2.6 trillion tokens. It achieves the best results among models of the same size on multiple authoritative Chinese, English, and multilingual general and domain-specific benchmarks. The release includes Base and Chat versions at 7B and 13B, and provides 4-bit quantized builds of the Chat versions.
Model download
Base model
Baichuan2-7B-Base
https://huggingface.co/baichuan-inc/Baichuan2-7B-Base
Baichuan2-13B-Base
Alignment model
Baichuan2-7B-Chat
https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat
Baichuan2-13B-Chat
Alignment model, 4-bit quantization
Baichuan2-7B-Chat-4bits
https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat-4bits
Baichuan2-13B-Chat-4bits
Pull code
git clone https://github.com/baichuan-inc/Baichuan2
Install dependencies
pip install -r requirements.txt
Calling method
Python code call
Chat model inference example:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan2-13B-Chat", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-13B-Chat", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained("baichuan-inc/Baichuan2-13B-Chat")
messages = []
messages.append({"role": "user", "content": "解释一下“温故而知新”"})
response = model.chat(tokenizer, messages)
print(response)
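For a multi-turn conversation, each reply can be appended to the same messages list before the next call. The sketch below shows only the history bookkeeping. The helper name add_turn and the "assistant" role label are assumptions that do not appear in the example above, and the model call is left as a comment because it requires the loaded model.

```python
# Hypothetical helper (not part of the Baichuan2 repository): keeps the chat
# history in the list-of-dicts format passed to model.chat in the example above.
def add_turn(messages, role, content):
    """Append one turn to the history and return the same list."""
    messages.append({"role": role, "content": content})
    return messages


messages = []
add_turn(messages, "user", "解释一下“温故而知新”")
# response = model.chat(tokenizer, messages)  # requires the loaded model
response = "<model reply>"  # placeholder standing in for the real reply

# Assumption: previous replies are stored under the "assistant" role.
add_turn(messages, "assistant", response)
add_turn(messages, "user", "Give a concrete example.")
# response = model.chat(tokenizer, messages)  # the second call sees the full history
```

Keeping the history in one list means every call receives the earlier turns as context; truncate old turns if the conversation grows beyond the model's context window.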
Base model inference example:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan2-13B-Base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-13B-Base", device_map="auto", trust_remote_code=True)
inputs = tokenizer('登鹳雀楼->王之涣\n夜雨寄北->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
Loading a model with device_map='auto' uses all available GPUs.
To restrict which devices are used, set an environment variable such as export CUDA_VISIBLE_DEVICES=0,1 (which uses GPUs 0 and 1).
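The same restriction can be applied from inside a Python script rather than the shell. This is a minimal sketch: the helper visible_devices is hypothetical, and the variable has to be set before CUDA is initialized, so the safest place is before torch is imported.

```python
import os


# Hypothetical helper: formats a list of GPU indices as the comma-separated
# string that CUDA_VISIBLE_DEVICES expects.
def visible_devices(gpu_ids):
    return ",".join(str(i) for i in gpu_ids)


# Set this before importing torch so the CUDA runtime only sees GPUs 0 and 1.
os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices([0, 1])
print(os.environ["CUDA_VISIBLE_DEVICES"])  # prints 0,1
```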
Command line mode
python cli_demo.py
This command-line tool is designed for Chat scenarios and does not support Base models.
Web demo
The following command uses streamlit to start a local web service. Open the address printed to the console in a browser to access it.
streamlit run web_demo.py
This web demo is designed for Chat scenarios and does not support Base models.
Quantization
Baichuan2 supports two modes: online quantization and offline quantization.
Online quantization
For online quantization, Baichuan2 supports 8-bit and 4-bit modes. Usage is similar to the Baichuan-13B project: load the model into CPU memory first, call the quantize() interface to quantize it, then call cuda() to copy the quantized weights to GPU memory. Taking Baichuan2-7B-Chat as an example:
8-bit online quantization:
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat", torch_dtype=torch.float16, trust_remote_code=True)
model = model.quantize(8).cuda()
4-bit online quantization:
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat", torch_dtype=torch.float16, trust_remote_code=True)
model = model.quantize(4).cuda()
Note that users usually pass device_map="auto" to the from_pretrained interface. For online quantization this parameter must be removed, otherwise an error is raised.
Offline quantization
For convenience, Baichuan2 provides an offline-quantized 4-bit version, Baichuan2-7B-Chat-4bits, for download. Loading it takes a single call:
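As a rough guide to what each quantization level saves, the memory needed for the weights alone scales with the number of bits per weight. The sketch below is plain arithmetic, not a measurement: it assumes a nominal 7 billion parameters and ignores activations, the KV cache, and any layers left unquantized, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: nominal parameter count for a "7B" model, used for illustration.
NOMINAL_7B_PARAMS = 7e9

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(NOMINAL_7B_PARAMS, bits):.1f} GB")
# 16-bit weights: 14.0 GB
# 8-bit weights: 7.0 GB
# 4-bit weights: 3.5 GB
```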
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat-4bits", device_map="auto", trust_remote_code=True)
Baichuan2 does not provide an 8-bit offline-quantized version, because the Hugging Face transformers library already offers an API for saving and loading 8-bit quantized models. Users can save and load an 8-bit model themselves as follows (model_id and quant8_saved_dir are placeholders for the model ID and the save directory):
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto", trust_remote_code=True)
model.save_pretrained(quant8_saved_dir)
model = AutoModelForCausalLM.from_pretrained(quant8_saved_dir, device_map="auto", trust_remote_code=True)
CPU deployment
The Baichuan2 models support CPU inference, although inference on CPU is relatively slow. Modify the model loading call as follows:
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat", torch_dtype=torch.float32, trust_remote_code=True)
Model fine-tuning
Install dependencies
git clone https://github.com/baichuan-inc/Baichuan2.git
cd Baichuan2/fine-tune
pip install -r requirements.txt
To use lightweight fine-tuning methods such as LoRA, install peft as well.
To use xFormers for training acceleration, install xFormers as well.
Single-machine training
hostfile=""
deepspeed --hostfile=$hostfile fine-tune.py \
--report_to "none" \
--data_path "data/belle_chat_ramdon_10k.json" \
--model_name_or_path "baichuan-inc/Baichuan2-7B-Base" \
--output_dir "output" \
--model_max_length 512 \
--num_train_epochs 4 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 1 \
--save_strategy epoch \
--learning_rate 2e-5 \
--lr_scheduler_type constant \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--adam_epsilon 1e-8 \
--max_grad_norm 1.0 \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--logging_steps 1 \
--gradient_checkpointing True \
--deepspeed ds_config.json \
--bf16 True \
--tf32 True
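Under data-parallel training, the number of samples contributing to each optimizer step is the product of the per-device batch size, the gradient accumulation steps, and the number of GPUs. The sketch below computes it for the values in the script above; the function name and the GPU count of 8 are illustrative assumptions, since the script does not state how many GPUs are used.

```python
# Hypothetical helper: global batch size per optimizer step under data parallelism.
def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    return per_device_batch * grad_accum_steps * num_gpus


# 16 and 1 come from --per_device_train_batch_size and
# --gradient_accumulation_steps in the script; 8 GPUs is an assumed example.
print(effective_batch_size(16, 1, 8))  # prints 128
```

If fewer GPUs are available, raising --gradient_accumulation_steps keeps the effective batch size the same at the cost of more time per step.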
Lightweight fine-tuning
The code already supports lightweight fine-tuning methods such as LoRA. To enable it, add the following parameter to the script above:
--use_lora True
The specific configuration of LoRA can be found in the fine-tune.py script.
After fine-tuning with LoRA, load the model with the following code:
from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained("output", trust_remote_code=True)