Introduction to and fine-tuning of the baichuan-7B model

Introduction to baichuan-7B

On June 15, 2023, Baichuan Intelligence, the company founded by Sogou founder Wang Xiaochuan, released baichuan-7B, a Chinese-English pre-trained large model with 7 billion parameters.

baichuan-7B is based on the Transformer architecture. It is a 7-billion-parameter model trained on about 1.2 trillion tokens, supports both Chinese and English, and has a context window of 4096 tokens.

On the authoritative Chinese benchmarks C-Eval, AGIEval, and Gaokao, baichuan-7B outperforms other large models such as ChatGLM-6B by a clear margin, and it also leads LLaMA-7B significantly on the authoritative English benchmark MMLU.

C-Eval leaderboard

On the Chinese C-Eval benchmark, baichuan-7B achieved an overall score of 42.8, surpassing ChatGLM-6B's 38.9 and even outperforming some models with larger parameter counts.

Open-source links:

Hugging Face: https://huggingface.co/baichuan-inc/baichuan-7B

GitHub: https://github.com/baichuan-inc/baichuan-7B

ModelScope: https://modelscope.cn/models/baichuan-inc/baichuan-7B/summary

baichuan-7B inference

Create a predict.py file with the following content:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model from the local model directory
tokenizer = AutoTokenizer.from_pretrained("/data/sim_chatgpt/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/data/sim_chatgpt/baichuan-7B", device_map="auto", trust_remote_code=True)

# Few-shot style prompt: poem title -> author; the model should continue with the author of 夜雨寄北
inputs = tokenizer('登鹳雀楼->王之涣\n夜雨寄北->', return_tensors='pt')
inputs = inputs.to('cuda:0')

# Generate a continuation and decode it
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

Run the code:

python predict.py

GPU out-of-memory error

The GPU memory is not enough: the GPU has 16 GB, but loading the model takes about 27-28 GB (by default the weights load in fp32, roughly 7B parameters × 4 bytes ≈ 28 GB), so consider loading a quantized version instead.
Modify the model-loading line as follows:

model = AutoModelForCausalLM.from_pretrained("/data/sim_chatgpt/baichuan-7B", device_map="auto", load_in_4bit=True, trust_remote_code=True)

You also need to install the following two packages:

pip install bitsandbytes
pip install scipy
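
Equivalently, the 4-bit settings can be passed through an explicit BitsAndBytesConfig. The following is only a sketch based on the standard transformers/bitsandbytes integration, not code from the original project:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantized loading with an explicit config instead of the load_in_4bit flag
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16 over the 4-bit weights
)

tokenizer = AutoTokenizer.from_pretrained("/data/sim_chatgpt/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "/data/sim_chatgpt/baichuan-7B",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)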

The result after running is as follows:

Climbing the Stork Tower->Wang Zhihuan Sends the Rain to the North->

2. Du Fu
3. The night rain sent north to return home for more than ten days, and the family lived by the Luoqiao. "Crane Tower") Wang Zhihuan's "Climbing the Stork Tower" poem. Sunrise and river flowers are more red than fire: the Yellow River is far away, when is the return date, and the west wind is another year." 4, Relatives and friends in Luoyang are like asking each other? Do.
-----Du Fu "Recalling Li Bai in Spring" Wang Zhihuan "Liangzhou Ci (Yellow River Far Above the White Clouds)" "Liangzhou Ci (Yellow River Far Above the White Clouds)" (Wang Zhihuan) Yellow River Far Up

Open-source license
The baichuan-7B code is released under the Apache-2.0 license, and the model weights are under a separate license that allows free commercial use after a simple registration.

Although baichuan-7B performs well on some evaluation datasets, it cannot be used out of the box: it has not gone through a supervised fine-tuning (SFT) step, is not aligned with human intent, and often fails to understand the instructions you give it.

baichuan-7B fine-tuning

This fine-tuning is based on the reference project: https://github.com/wp931120/baichuan_sft_lora

Clone the project repository

git clone https://github.com/wp931120/baichuan_sft_lora.git
cd baichuan_sft_lora

Configure the environment

conda create -n baichuan-7b python=3.9
conda activate baichuan-7b
pip install -r requirements.txt

Dataset download

Download address: https://huggingface.co/datasets/BelleGroup/train_0.5M_CN/tree/main
This dataset contains 519,255 samples in total.
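
If you prefer to load the data through the datasets library instead of downloading the file manually, a quick way to inspect it is sketched below (the hub loader should resolve the JSON file automatically; otherwise point data_files at the downloaded file):

from datasets import load_dataset

# Load the BELLE 0.5M Chinese instruction dataset and inspect one record
dataset = load_dataset("BelleGroup/train_0.5M_CN", split="train")
print(len(dataset))   # total number of samples
print(dataset[0])     # look at the fields of a single record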

Fine-tuning process

  • First, the Baichuan LLM is quantized with QLoRA's NF4 data type and double quantization
  • Then LoRA adapters are attached for instruction fine-tuning (see the sketch below)
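
A minimal sketch of these two steps, assuming the standard transformers + bitsandbytes + peft stack used by the reference project (the target module name W_pack is the fused QKV projection in baichuan-7B's modeling code and should be verified there; the LoRA values match the parameters listed further below):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Step 1: load the base model in 4-bit with NF4 + double quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "/data/sim_chatgpt/baichuan-7B",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

# Step 2: attach LoRA adapters for instruction fine-tuning
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["W_pack"],  # baichuan-7B's fused QKV projection (verify against the model code)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA parameters are trainable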

Modify and run the sft_lora.py file

  • Set the model path in sft_lora.py to your own model path
  • Run the script with python sft_lora.py; GPU memory usage during training is about 9 GB

The relevant parameter settings in sft_lora.py are as follows (a sketch mapping them onto TrainingArguments follows the list):

  • Number of validation samples: VAL_SET_SIZE = 2000
  • Maximum text length: CUTOFF_LEN = 1024
  • Number of training epochs: num_train_epochs = 1
  • Training batch size: per_device_train_batch_size = 1
  • Validation batch size: per_device_eval_batch_size = 1
  • Learning rate: learning_rate = 3e-4
  • Gradient accumulation steps: gradient_accumulation_steps = 4
  • LoRA rank of the low-rank approximation: r = 8
  • Scaling hyperparameter for the low-rank matrices: lora_alpha = 16
  • Dropout for the LoRA layers: lora_dropout = 0.05
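
As a rough illustration, these settings map onto transformers TrainingArguments as sketched below; the output directory and the logging/evaluation/saving intervals are illustrative assumptions, not values taken from sft_lora.py:

from transformers import TrainingArguments

VAL_SET_SIZE = 2000   # validation samples split off from the dataset
CUTOFF_LEN = 1024     # maximum tokenized sequence length

training_args = TrainingArguments(
    output_dir="./baichuan-sft-lora",   # illustrative output path
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,      # effective batch size = 1 * 4
    learning_rate=3e-4,
    fp16=True,
    logging_steps=20,
    evaluation_strategy="steps",
    eval_steps=200,
    save_steps=200,
)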

Inference using the fine-tuned model

Training is fairly slow: it took about 48 hours to get through roughly 72,000 samples (18,000 optimizer steps × 4 samples per step due to gradient accumulation), with the loss settling at around 1.4. (After splitting off the 2,000 validation samples, the 517,255 training samples divided by the gradient accumulation factor of 4 give about 129,313 optimizer steps per epoch.)

Training loss: the loss curve (figure omitted) levels off at around 1.4.

Select a sample from the dataset:
Input: "Write a step-by-step guide for making pizza."
Output: "1. Prepare the dough: Mix flour, salt, yeast and water, knead the dough, and leave it in a warm and ventilated place to ferment. 2. Cut condiments: Prepare sauces, cheese and various condiments for pizza, such as vegetables, sausages, ham, etc. 3. Roll out the dough: Roll the fermented dough into a pancake shape, the thinner the better. 4. Add sauce: Spread the sauce evenly on the dough. 5. Sprinkle cheese and condiments: Sprinkle enough cheese and various condiments. 6. Baking: Put the pizza into the preheated oven, Bake for 10-15 minutes, or until the top is golden brown. 7. Slice and enjoy: Remove the pizza, cut into desired sizes, let cool slightly, and serve."

Let's first look at how the model behaves on this instruction before fine-tuning.

Without fine-tuning, the model acts purely as a text-continuation generator and does not follow the instruction.

We then run inference with the model after 18,000 training steps (roughly 72,000 samples, as above). The output is quite good, but inference is still very slow, taking about 1 to 2 minutes per response.

It can be seen that the fine-tuned model now understands the intent of the instruction.

Alternatively, you can directly download LoRA weights that have already been fine-tuned.
Download address: https://huggingface.co/wp931120x/baichuan_4bit_lora
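
A minimal inference sketch that loads the 4-bit base model and attaches these LoRA weights with peft; the adapter id below is the Hugging Face repo just mentioned, and the prompt is the example instruction from above (its formatting is an assumption and should match the template used during SFT):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/data/sim_chatgpt/baichuan-7B"
adapter_path = "wp931120x/baichuan_4bit_lora"   # or a local copy of the downloaded weights

tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_path, load_in_4bit=True, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter_path)  # attach the LoRA adapter
model.eval()

prompt = "Write a step-by-step guide for making pizza."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, repetition_penalty=1.1)
print(tokenizer.decode(output[0], skip_special_tokens=True))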

References:

https://zhuanlan.zhihu.com/p/637343740

https://zhuanlan.zhihu.com/p/637785176

https://github.com/wp931120/baichuan_sft_lora

https://huggingface.co/wp931120x/baichuan_4bit_lora

https://github.com/baichuan-inc/baichuan-7B/issues/23
