baichuan-7B: The Best Open-Source, Commercially Usable Large Model for Chinese and English

Background

baichuan-7B is an open-source, commercially usable large-scale pre-trained language model developed by Baichuan Intelligence.

Based on the Transformer architecture, the 7-billion-parameter model was trained on about 1.2 trillion tokens. It supports both Chinese and English and has a context window of 4,096 tokens.

On the authoritative Chinese and English benchmarks (C-Eval and MMLU), it achieves the best results among models of the same parameter scale.
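
For a quick sense of how the released checkpoint is used, the snippet below loads baichuan-7B from the Hugging Face Hub and generates a continuation. It is a minimal sketch assuming the published model id baichuan-inc/baichuan-7B and a GPU with enough memory for the 7B weights; the generation settings are illustrative.

# Minimal usage sketch: load baichuan-7B from the Hugging Face Hub and generate text.
# Assumes the model id "baichuan-inc/baichuan-7B"; trust_remote_code is required because
# the model ships custom modeling code. device_map="auto" needs the `accelerate` package.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/baichuan-7B", device_map="auto", trust_remote_code=True
)

inputs = tokenizer("登鹳雀楼->王之涣\n夜雨寄北->", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))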

Advantages of baichuan-7B

  • Among models of the same size, baichuan-7B reaches the current SOTA level.
  • baichuan-7B is trained on its own Chinese-English bilingual corpus and is optimized for Chinese, reaching the SOTA level on C-Eval.
  • Unlike LLaMA, which completely prohibits commercial use, baichuan-7B is released under a more permissive open-source license that allows commercial use.

Data Collection

  • The raw data includes open-source Chinese and English corpora, Chinese internet data crawled in-house, and some high-quality knowledge-intensive data.
  • Following related work on data curation, frequency and quality are the two dimensions considered during data processing. The raw dataset is filtered at document and sentence granularity based on heuristic rules and quality-model scoring. Over the full corpus, locality-sensitive hashing is used to deduplicate at document and sentence granularity (a rough sketch of this kind of deduplication follows below).
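
The exact deduplication pipeline is not published; as a rough illustration of locality-sensitive-hashing-based near-duplicate filtering, the sketch below uses the open-source datasketch library (MinHash + MinHashLSH). The tokenization, threshold, and number of permutations are arbitrary choices for the example, not baichuan's settings.

# Rough sketch of LSH-based near-duplicate filtering (not the actual baichuan pipeline).
# Uses the open-source `datasketch` library; threshold and num_perm are illustrative.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.split():          # character n-grams would suit Chinese text better
        m.update(token.encode("utf-8"))
    return m

def dedup(docs):
    lsh = MinHashLSH(threshold=0.8, num_perm=128)  # ~80% Jaccard similarity counts as a duplicate
    kept = []
    for i, doc in enumerate(docs):
        m = minhash(doc)
        if lsh.query(m):                # a near-identical document was already kept
            continue
        lsh.insert(f"doc-{i}", m)
        kept.append(doc)
    return kept

print(len(dedup(["the quick brown fox", "the quick brown fox", "a different sentence"])))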

Model Structure

The overall model is based on the standard Transformer structure and adopts the same model design as LLaMA.

  • Position encoding: rotary embedding (RoPE)

    This is the position-encoding scheme adopted by most models at this stage and extrapolates well to longer contexts. Although the maximum sequence length during training is 4,096, the model extrapolates well to around 5,000 tokens in practice.

  • Activation: SwiGLU; the feed-forward hidden size is set to roughly 8/3 of the model hidden size, i.e. 11008.

  • Layer normalization: pre-normalization based on RMSNorm (minimal sketches of these components are given below).
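
To make these components concrete, here is a minimal PyTorch sketch of RMSNorm pre-normalization, a SwiGLU feed-forward block, and the core rotary-embedding rotation, written in the common LLaMA-style open-source formulation. It is not baichuan-7B's actual source code, and the module names are illustrative.

# Minimal PyTorch sketches of the components above (LLaMA-style formulation, not baichuan's exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization layer: scales by the root mean square instead of mean/variance."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with SwiGLU activation; 4096 -> 11008 -> 4096 matches the text above."""
    def __init__(self, dim: int = 4096, hidden_dim: int = 11008):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

def rotary_cos_sin(seq_len: int, head_dim: int, base: float = 10000.0):
    """Precompute the cos/sin tables used by rotary position embedding."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)
    emb = torch.cat((angles, angles), dim=-1)
    return emb.cos(), emb.sin()

def apply_rotary(x, cos, sin):
    """Rotate query/key vectors; x has shape (..., seq_len, head_dim)."""
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin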

Pre-training

The DeepSpeed framework is used for training, with many modifications made on top of the original LLaMA training setup to improve throughput, including:

  1. Operator optimization: use more efficient operators, such as Flash-Attention and the RMSNorm from NVIDIA Apex.
  2. Operator splitting: split some compute operators to reduce peak memory usage.
  3. Mixed precision: speed up computation without losing model accuracy.
  4. Training fault tolerance: joint optimization of the training platform and training framework, with IaaS + PaaS enabling minute-level fault localization and task recovery.
  5. Communication optimizations, including:
    1. Topology-aware collective-communication algorithms to avoid network congestion and improve communication efficiency.
    2. Adaptively setting the bucket size according to the number of GPUs to improve bandwidth utilization.
    3. Tuning when communication primitives are triggered, based on the model and cluster environment, so that computation and communication overlap.

With the above optimizations, the 7B model achieves a throughput of 182 TFLOPS on a thousand-card A800 cluster, with peak GPU compute utilization as high as 58.3%.
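
The actual training configuration has not been released. Purely as an illustration of where the optimizations above would surface in practice, a DeepSpeed configuration might look roughly like the sketch below; every value is a placeholder rather than baichuan's real setting.

# Illustrative DeepSpeed configuration touching the points above (all values are placeholders).
import deepspeed  # used via deepspeed.initialize in real training code

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},            # mixed precision (point 3)
    "zero_optimization": {
        "stage": 1,
        "overlap_comm": True,             # overlap computation and communication (point 5.3)
        "reduce_bucket_size": 5e8,        # bucket sizes tuned to bandwidth / number of cards (point 5.2)
        "allgather_bucket_size": 5e8,
    },
    "gradient_clipping": 1.0,
    "steps_per_print": 100,
}

# In real training code the config is handed to DeepSpeed together with the model:
# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)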

The final training loss curve can be found in the official repository (the figure is not reproduced here).

Experimental Results

C-Eval

C-Eval is a comprehensive Chinese evaluation benchmark for foundation models, covering 52 subjects and four levels of difficulty.

Using the dev split of the dataset as the source of few-shot examples, a 5-shot evaluation was run on the test split.

First, modify the two variables OPENMODEL_PATH and CEVAL_DATA_PATH in evaluate_zh.py, which are the path where the model (folder) is stored and the path of the C-Eval dataset, respectively. Then run the script below.

shot=5  # few-shot
gpu=0  # GPU id
split=test  # evaluate on the test split
model_id=baichuan-7b   # model to evaluate
task=ceval  # task name: ceval
echo gpu_idx-${gpu}-${model_id}_${task}_${split}_${shot}-shot
nohup python  evaluate_zh.py --gpu_idx ${gpu} --model_id ${model_id} --task ${task} --shot ${shot} --split ${split} --show_detail  > ${model_id}_${task}_${split}_${shot}-shot_record.txt 2>&1 &
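
For readers who want to see what the 5-shot prompt actually looks like, the sketch below builds one prompt for a single C-Eval test question from the dev CSV of the same subject. The column names (question, A-D, answer) and the prompt wording are assumptions based on the common C-Eval file format; this is not an excerpt of evaluate_zh.py.

# Illustrative 5-shot prompt construction for one C-Eval subject.
# Column names and prompt wording are assumptions, not taken from evaluate_zh.py.
import csv

def format_example(row, with_answer: bool = True) -> str:
    text = f"{row['question']}\nA. {row['A']}\nB. {row['B']}\nC. {row['C']}\nD. {row['D']}\n答案:"
    if with_answer:
        text += row["answer"]
    return text + "\n\n"

def build_prompt(dev_csv_path: str, test_row, shot: int = 5) -> str:
    with open(dev_csv_path, newline="", encoding="utf-8") as f:
        dev_rows = list(csv.DictReader(f))[:shot]    # first `shot` dev examples as demonstrations
    prompt = "".join(format_example(r) for r in dev_rows)
    return prompt + format_example(test_row, with_answer=False)

# The resulting prompt is fed to the model, and the generated option letter is compared
# with the ground-truth answer to compute accuracy.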

Results

| Model (5-shot) | Average | Avg (Hard) | STEM | Social Sciences | Humanities | Others |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 68.7 | 54.9 | 67.1 | 77.6 | 64.5 | 67.8 |
| ChatGPT | 54.4 | 41.4 | 52.9 | 61.8 | 50.9 | 53.6 |
| Claude-v1.3 | 54.2 | 39.0 | 51.9 | 61.7 | 52.1 | 53.7 |
| Claude-instant-v1.0 | 45.9 | 35.5 | 43.1 | 53.8 | 44.2 | 45.4 |
| moss-moon-003-base (16B) | 27.4 | 24.5 | 27.0 | 29.1 | 27.2 | 26.9 |
| Ziya-LLaMA-13B-pretrain | 30.2 | 22.7 | 27.7 | 34.4 | 32.0 | 28.9 |
| LLaMA-7B-hf | 27.1 | 25.9 | 27.1 | 26.8 | 27.9 | 26.3 |
| ChatGLM-6B | 34.5 | 23.1 | 30.4 | 39.6 | 37.4 | 34.5 |
| Falcon-7B | 25.8 | 24.3 | 25.8 | 26.0 | 25.8 | 25.6 |
| Open-LLaMA-v2-pretrain (7B) | 24.0 | 22.5 | 23.1 | 25.3 | 25.2 | 23.2 |
| TigerBot-7B base | 25.7 | 27.0 | 27.3 | 24.7 | 23.4 | 26.1 |
| Aquila-7B* | 25.5 | 25.2 | 25.6 | 24.6 | 25.2 | 26.6 |
| Bloom-7b | 22.8 | 20.2 | 21.8 | 23.3 | 23.9 | 23.3 |
| BLOOMZ-7B | 35.7 | 25.8 | 31.3 | 43.5 | 36.6 | 35.6 |
| baichuan-7B | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |

Gaokao

Gaokao is a dataset built from Chinese college entrance examination (Gaokao) questions, used to evaluate the language ability and logical reasoning ability of large language models.

Only the multiple-choice questions were kept, and after a random split a unified 5-shot evaluation was run on all models.

Results

Below are the results of the tests.

| Model | Average |
| --- | --- |
| Open-LLaMA-v2-pretrain | 21.41 |
| Ziya-LLaMA-13B-pretrain | 23.17 |
| Falcon-7B | 23.98 |
| TigerBot-7B base | 25.94 |
| LLaMA-7B | 27.81 |
| ChatGLM-6B | 21.41 |
| Bloom-7b | 26.96 |
| BLOOMZ-7B | 28.72 |
| Aquila-7B* | 24.39 |
| baichuan-7B | 36.24 |

AGIEval

AGIEval aims to evaluate a model's general ability in cognitive and problem-solving related tasks.

Only the four-option single-choice questions were kept, and after a random split a unified 5-shot evaluation was run on all models.

Results

| Model | Average |
| --- | --- |
| Open-LLaMA-v2-pretrain | 23.49 |
| Ziya-LLaMA-13B-pretrain | 27.64 |
| Falcon-7B | 27.18 |
| TigerBot-7B base | 25.19 |
| LLaMA-7B | 28.17 |
| ChatGLM-6B | 23.49 |
| Bloom-7b | 26.55 |
| BLOOMZ-7B | 30.27 |
| Aquila-7B* | 25.58 |
| baichuan-7B | 34.44 |

  • The Aquila results come from the official BAAI (Zhiyuan) website (https://model.baai.ac.cn/model-detail/100098) and are for reference only.

English Leaderboard

In addition to Chinese, the model's performance on English was also evaluated.

MMLU is an English evaluation dataset containing 57 multiple-choice tasks, covering elementary mathematics, US history, computer science, law, and more. Its difficulty ranges from high-school level to expert level, and it is currently one of the mainstream LLM evaluation benchmarks.

An open-source evaluation setup was adopted; the final 5-shot results are as follows:

Results

| Model | Humanities | Social Sciences | STEM | Other | Average |
| --- | --- | --- | --- | --- | --- |
| LLaMA-7B | 34.0 | 38.3 | 30.5 | 38.1 | 35.1 |
| Falcon-7B | - | - | - | - | 35.0 |
| mpt-7B | - | - | - | - | 35.6 |
| ChatGLM-6B | 35.4 | 41.0 | 31.3 | 40.5 | 36.9 |
| BLOOM-7B | 25.0 | 24.4 | 26.5 | 26.4 | 25.5 |
| BLOOMZ-7B | 31.3 | 42.1 | 34.4 | 39.0 | 36.1 |
| moss-moon-003-base (16B) | 24.2 | 22.8 | 22.4 | 24.4 | 23.6 |
| moss-moon-003-sft (16B) | 30.5 | 33.8 | 29.3 | 34.4 | 31.9 |
| baichuan-7B | 38.4 | 48.9 | 35.6 | 48.1 | 42.3 |

Summary

The baichuan-7B model is based on the standard Transformer architecture and adopts the same model design as LLaMA. Its core advantages are:

  • Among models of the same size, baichuan-7B reaches the current SOTA level.
  • baichuan-7B is trained on its own Chinese-English bilingual corpus and is optimized for Chinese, reaching the SOTA level on C-Eval.
  • Unlike LLaMA, which completely prohibits commercial use, baichuan-7B uses a more permissive open-source license that allows commercial use.

The article and example code are open-sourced on GitHub: GPT实战教程, covering all mainstream open-source LLMs.

WeChat official account: coding进阶. Follow it to get the latest hands-on GPT content.

Personal website: Jincheng's Blog

Zhihu: 无忌

References

  • https://github.com/baichuan-inc/baichuan-7B
  • https://huggingface.co/baichuan-inc/baichuan-7B
