[AI in Practice] baichuan-7B: an open-source, commercially usable Chinese and English large language model, set up from scratch
Introduction to baichuan-7B
baichuan-7B is an open-source, commercially usable large-scale pre-trained language model developed by Baichuan Intelligence. Built on the Transformer architecture, the 7-billion-parameter model was trained on roughly 1.2 trillion tokens, supports both Chinese and English, and has a context window of 4096 tokens. It achieves the best results among models of the same size on the standard authoritative Chinese and English benchmarks (C-Eval/MMLU).
baichuan-7B Chinese evaluation
(C-Eval and Gaokao benchmark charts omitted)
baichuan-7B setup
-
1. Pull the Docker image
docker pull nvcr.io/nvidia/pytorch:21.08-py3
Note: requires CUDA 11.1 or above.
-
2. Create the Docker container
nvidia-docker run -it -d \
    --name baichuan_llm \
    -v /llm:/notebooks \
    -e TZ='Asia/Shanghai' \
    --shm-size 16G \
    nvcr.io/nvidia/pytorch:21.08-py3
Enter the container:
docker exec -it baichuan_llm env LANG=C.UTF-8 /bin/bash
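Optionally, verify that the GPUs are visible from inside the container using the PyTorch that ships with the NGC image. This quick sanity check is not part of the original steps:
import torch

print(torch.cuda.is_available())   # expect True
print(torch.cuda.device_count())   # expect the number of GPUs exposed to the container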
-
3. Download the code
cd /notebooks/
git clone https://github.com/baichuan-inc/baichuan-7B.git
-
4. Download the model weight files
cd baichuan-7B/
git clone https://huggingface.co/baichuan-inc/baichuan-7B
Note: git-lfs must be installed, otherwise the large weight files are cloned as pointer files instead of being downloaded.
-
5. Install the dependencies
pip install -r requirements.txt
-
6. Inference
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-7B", device_map="auto", trust_remote_code=True)
inputs = tokenizer('登鹳雀楼->王之涣\n夜雨寄北->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
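If GPU memory is tight, the checkpoint can also be loaded in half precision. This is a minimal variation of the snippet above, not part of the original post; torch_dtype is a standard transformers argument:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# load the weights in fp16 to roughly halve GPU memory usage
tokenizer = AutoTokenizer.from_pretrained("baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "baichuan-7B",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)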
-
Output
(model output screenshot omitted)
7. Training
-
Data preparation
Split the training corpus evenly into a number of UTF-8 text files that is a multiple of the total rank count, and place them in the corpus directory (data_dir by default). Each rank process reads a different subset of the files in the corpus directory, loads them entirely into memory, and then starts the training process. This is a simplified demonstration flow; for real training jobs it is recommended to adapt the data-preparation logic to your own needs. A sketch of such a split is shown below.
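The following sketch illustrates one way to do the split described above; the corpus path, shard naming, and shard count are illustrative assumptions, not part of the repository:
import os

def split_corpus(corpus_path, data_dir, num_shards):
    # distribute lines round-robin into num_shards UTF-8 files so the shards stay roughly even
    os.makedirs(data_dir, exist_ok=True)
    shards = [open(os.path.join(data_dir, f"shard_{i:02d}.txt"), "w", encoding="utf-8")
              for i in range(num_shards)]
    with open(corpus_path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            shards[i % num_shards].write(line)
    for s in shards:
        s.close()

# e.g. 16 total ranks (2 nodes x 8 GPUs): use a shard count that is a multiple of 16
split_corpus("corpus.txt", "data_dir", num_shards=16)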
Configure DeepSpeed
Modify config/hostfile. For multi-machine, multi-GPU training, configure the IP address of each node; DeepSpeed launches the other nodes over SSH, so passwordless SSH access between the nodes is needed. An example hostfile is shown below.
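A hostfile lists one node per line in DeepSpeed's standard "hostname slots=N" format; the IP addresses and GPU counts here are placeholders:
192.168.0.1 slots=8
192.168.0.2 slots=8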
Train
sh scripts/train.sh
-
References
https://huggingface.co/baichuan-inc/baichuan-7B/tree/main
https://github.com/baichuan-inc/baichuan-7B