LLM - Large Model Technical Report and Training Details: Baichuan 2

Table of contents

I. Introduction

2. Introduction - LLM related progress

1. The larger the model parameters, the stronger the model ability.

2. Open source models promote the rapid development of the LLM field

3. Open source models are concentrated in the English field and have limited capabilities in other languages.

4. Training data of 2.6 trillion tokens is far ahead

5. Chat models optimized to follow human instructions

6. Released intermediate checkpoints (CKPT) from the training process to promote research in the field

3. Pre-training - Baichuan pre-training related

1.Pre-training Data - more comprehensive and cleaner data

2.Architecture

3.Tokenizer - lower compression ratio

4.Position Embedding - choice of position embedding has little impact

5.Activations and Normalizations - Optimize performance efficiency

6.Optimizations - Make training more robust

7.Scaling Laws - Predict final loss based on scaling rate

8.Infrastructure - How to improve cluster GPU efficiency

4.Alignment - Alignment of Human Intentions

1.Supervised Fine-Tuning - Supervised fine-tuning, reward feedback and reinforcement

2.Reward Model - the greater the difference between responses, the more accurate the reward model

3.PPO - Optimizing language generation model

4.Training Detail - training parameter details

5.Safety - Model safety guarantee

1.Pre-training Stage - Strict data screening

2.Alignment Stage - red-blue adversarial optimization of prompts and responses

6.Evaluations - Multi-dimensional evaluation of the model

1.Evaluation method - generation and multiple-choice evaluation methods

2.Comparison models - open-source comparison models with reproducible results

3.Overall Performance - rich evaluation benchmarks

4.Vertical Domain Evaluations - Vertical Domain Evaluations

5.Math and Code – Mathematics and coding skills

6.Multilingual - multilingual ability

7.Safety Evaluations - safer and more reliable

8.Intermediate Checkpoints - Intermediate CKPT output

7.Related Work - Related work in the LLM field

8. Limitations and Ethical Considerations - Limitations and Ethical Considerations

9.More - More model related

1.Scaling laws - model system performance estimation

2.NormHead - more stable training

3.Training Dynamics - dynamic training and evaluation

4.Baichuan Harmless Evaluation Dataset - Baichuan Harmless Dataset

5.Details of MMLU and C-Eval - detailed evaluation information

6.Examples generated by Baichuan 2-13B-Chat - Model examples

10. Summary


I. Introduction

This article analyzes Baichuan 2 based on the technical report provided by Baichuan Inc.

◆  The Baichuan 2 series models are trained on 2.6 trillion tokens.

◆  LLMs demonstrate superior performance on a variety of natural language tasks from just a few examples of natural language instructions, reducing the need for extensive feature engineering.

◆  The most powerful LLMs are closed-source or limited in their capabilities in languages other than English.

2. Introduction - LLM related progress

1. The larger the model parameters, the stronger the model ability.

Large language models have advanced significantly, with model parameters growing from millions to billions or even trillions: ELMo (2018) and GPT-1 (2018) to GPT-3 (2020) and PaLM (2022). This dramatic increase in parameter size has brought correspondingly large improvements in language model capability, enabling more human-like fluency and the ability to perform a wide range of natural language tasks. With the release of ChatGPT in 2022, these models demonstrated strong language proficiency across a variety of domains, highlighting the potential of large language models to automate natural language generation and understanding tasks.

2. Open source models promote the rapid development of the LLM field

LLMs have led to exciting breakthroughs and applications, for example GPT-4, PaLM-2, and Claude, but these models are closed-source, and developers and researchers have limited access to them, which makes it difficult for the community to study and fine-tune such systems. Openness and transparency of models can accelerate progress in the field. LLaMA, Meta's open-sourced model family with up to 65B parameters, has greatly benefited the research community through its full release. The open-sourcing of LLaMA and other models such as OPT and BLOOM has accelerated research and progress in this field, giving birth to new models such as Alpaca and Vicuna.

3. Open source models are concentrated in the English field and have limited capabilities in other languages.

Most open-source large language models focus primarily on English. For example, the main data source of LLaMA is Common Crawl, which accounts for 67% of LLaMA's pre-training data but is filtered to English-only content. Other open-source LLMs, such as MPT and Falcon, also focus on English and have limited capabilities in other languages. This hinders the development and application of LLMs in specific languages such as Chinese.

4. Training data of 2.6 trillion tokens is far ahead

Baichuan 2 comes in two parameter sizes, 7 billion and 13 billion. Both models are trained on 2.6 trillion tokens, the most so far and more than twice that of Baichuan 1. With such a large amount of training data, Baichuan 2 achieves significant improvements over Baichuan 1. On general benchmarks such as MMLU, CMMLU, and C-Eval, Baichuan 2 performs nearly 30% better than Baichuan 1. Specifically, Baichuan 2 is optimized to improve performance on math and code problems: on the GSM8K and HumanEval evaluations, Baichuan 2 roughly doubles the results of Baichuan 1. In addition, Baichuan 2 shows strong performance in the medical and legal fields, outperforming other open-source models on benchmarks such as MedQA and JEC-QA.

5. Optimize the Chat model corresponding to human command release

Additionally, we released two chat models, Baichuan 2-7B-Chat and Baichuan 2-13B-Chat, optimized to follow human instructions. These models excel at dialogue and contextual understanding. We elaborate below on approaches to improving the safety of Baichuan 2. By open-sourcing these models, we hope to enable the community to further improve the safety of large language models and promote more research into responsible LLM development.

6. Announced the CKPT during the training process to promote research and development in the field

In addition, in the spirit of research collaboration and continuous improvement, we also released checkpoints of Baichuan 2 at different stages of training, from 200 billion tokens up to the full 2.6 trillion tokens. We find that even for a 7 billion parameter model, performance continues to improve after training on more than 2.6 trillion tokens. By sharing these intermediate results, we hope to give the community a deeper understanding of the training dynamics of Baichuan 2. Understanding these dynamics is key to uncovering the inner workings of large language models. We believe the release of these checkpoints will pave the way for further progress in this rapidly evolving field.

3. Pre-training - Baichuan pre-training related

1.Pre-training Data - more comprehensive and cleaner data

DataSource - a more comprehensive data set

During the data collection process, we aimed for comprehensive scale and representativeness of the data. We collect data from diverse sources, including general internet web pages, books, research papers, code repositories, and more, to build a broad system of world knowledge.

The top 10 data categories are: Technology, Business, Entertainment, Life, Health, Education, Culture, Code, Sports, and Engineering.

Data Processing - more fine-grained data cleaning

For data processing, we focus on data frequency and quality. Data frequency relies on clustering and deduplication. We built a large-scale deduplication and clustering system supporting both LSH-like features and dense embedding features. The system can cluster and deduplicate trillions of tokens of data within hours. Based on the clustering, individual documents, paragraphs, and sentences are deduplicated and scored, and these scores are then used for data sampling in pre-training. The training data sizes at different stages of data processing are as follows:

Exact deduplication - 29.89% removed

Heuristic filtering - 1.77% removed

Sentence-wise quality filter - 3.06% removed

Sentence-wise and paragraph-wise deduplication - 14.47% removed

Document-wise deduplication - 19.13% removed

2.Architecture

The model architecture of Baichuan 2 is based on the popular Transformer (Vaswani et al., 2017). Nonetheless, we made several modifications, which we detail below.

3.Tokenizer - lower compression ratio

The tokenizer needs to balance two key factors: a high compression rate for efficient inference, and an appropriately sized vocabulary to ensure adequate training of each word embedding. We considered both aspects and expanded the vocabulary size from 64,000 in Baichuan 1 to 125,696, aiming to strike a balance between computational efficiency and model performance.

We tokenize the data using byte-pair encoding (BPE) (Shibata et al., 1999) from SentencePiece (Kudo and Richardson, 2018). Specifically, we do not apply any normalization to the input text, and we do not add a dummy prefix as in Baichuan 1. We split numbers into individual digits to better encode numeric data. To handle code data containing extra whitespace, we add whitespace-only tokens to the tokenizer. Character coverage is set to 0.9999, with rare characters falling back to UTF-8 bytes.

We set the maximum token length to 32 to account for long Chinese phrases. The training data for the Baichuan 2 tokenizer comes from the Baichuan 2 pre-training corpus, with more sampled code examples and academic papers to improve coverage (Taylor et al., 2022). The following table shows a detailed comparison of Baichuan 2's tokenizer with other tokenizers.
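As an illustration only, the settings described above map roughly onto SentencePiece training options as shown below; the corpus path and the exact flag values Baichuan used are assumptions, not the published training script.

```python
import sentencepiece as spm

# Illustrative SentencePiece BPE training config mirroring the settings above
# (no normalization, no dummy prefix, digits split, byte fallback,
# whitespace-only pieces, character coverage 0.9999, max piece length 32).
# The corpus path is a hypothetical placeholder.
spm.SentencePieceTrainer.train(
    input="pretrain_corpus_sample.txt",      # hypothetical corpus file
    model_prefix="baichuan2_bpe_sketch",
    model_type="bpe",
    vocab_size=125696,
    character_coverage=0.9999,
    byte_fallback=True,                      # rare chars fall back to UTF-8 bytes
    split_digits=True,                       # numbers broken into single digits
    allow_whitespace_only_pieces=True,       # whitespace tokens for code data
    normalization_rule_name="identity",      # no normalization of input text
    add_dummy_prefix=False,                  # no dummy prefix
    max_sentencepiece_length=32,             # long Chinese phrases
)

sp = spm.SentencePieceProcessor(model_file="baichuan2_bpe_sketch.model")
print(sp.encode("Baichuan 2 于 2023 年发布", out_type=str))
```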

4.Position Embedding - Position Emd has little impact

Following Baichuan 1, we use Rotary Position Embedding (RoPE) (Su et al., 2021) for Baichuan 2-7B and ALiBi (Press et al., 2021) for Baichuan 2-13B. ALiBi is a newer positional encoding technique that has shown improved extrapolation performance. However, most open-source models use RoPE, and optimized attention implementations such as Flash Attention (Dao et al., 2022; Dao, 2023) currently suit RoPE better, because RoPE is multiplication-based and bypasses the need to pass an attention bias (attention_mask) into the attention operation. Nevertheless, in preliminary experiments, the choice of position embedding did not significantly affect model performance. To further study bias-based and multiplication-based attention, we apply RoPE to Baichuan 2-7B and ALiBi to Baichuan 2-13B, consistent with Baichuan 1.
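A minimal, illustrative RoPE sketch (not Baichuan's implementation) showing why it is multiplication-based and needs no additive attention bias:

```python
import torch

def rope_rotate(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq_len, dim).

    Minimal sketch: each pair of channels is rotated by an angle that grows
    with position, so relative position enters the q·k dot product
    multiplicatively instead of through an additive bias such as ALiBi.
    """
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(0, half).float() / half))
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8, 64)   # (positions, head_dim) toy example
print(rope_rotate(q).shape)  # torch.Size([8, 64])
```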

5.Activations and Normalizations - Optimize performance efficiency

We use the SwiGLU (Shazeer, 2020) activation function, a switch-gated variant of GLU (Dauphin et al., 2017) that shows improved results. However, SwiGLU has a "bilinear" layer with three parameter matrices, unlike the two matrices of a vanilla Transformer feed-forward layer, so we reduce the feed-forward hidden size from 4x the hidden size to 8/3x, rounded to a multiple of 128.
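A sketch of a SwiGLU feed-forward block following the 8/3 sizing rule above; the dimensions are illustrative, not Baichuan's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN sketch: three projections (gate, up, down) instead of two,
    so the inner size is shrunk from 4*d to 8/3*d, rounded up to a multiple
    of 128 as described in the text."""
    def __init__(self, d_model: int, multiple_of: int = 128):
        super().__init__()
        inner = int(d_model * 8 / 3)
        inner = multiple_of * ((inner + multiple_of - 1) // multiple_of)
        self.gate_proj = nn.Linear(d_model, inner, bias=False)
        self.up_proj = nn.Linear(d_model, inner, bias=False)
        self.down_proj = nn.Linear(inner, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

ffn = SwiGLUFeedForward(d_model=4096)
print(ffn.gate_proj.out_features)  # 8/3 * 4096 ≈ 10923, rounded up to 11008
```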

For the attention layer of Baichuan 2, we adopt the memory-efficient attention implemented by xFormers (Rabe and Staats, 2021). By leveraging xFormers' optimized attention with bias support, we can effectively incorporate ALiBi's bias-based positional encoding while reducing memory overhead. This provides performance and efficiency advantages for large-scale training of Baichuan 2.

We apply layer normalization (Ba et al., 2016) to the input of the Transformer block (pre-norm), which is more robust to warm-up schedules (Xiong et al., 2020). Furthermore, we use the RMSNorm implementation introduced by Zhang and Sennrich (2019), which only computes the variance of the input features, to improve efficiency.
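A minimal RMSNorm sketch matching the description above (variance-only normalization, no mean subtraction); dimensions are illustrative.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm sketch: normalize by the root-mean-square of the features,
    which is cheaper than full LayerNorm because no mean is computed."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

norm = RMSNorm(4096)
print(norm(torch.randn(2, 16, 4096)).shape)  # torch.Size([2, 16, 4096])
```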

6.Optimizations - Make training more robust

We train with the AdamW optimizer (Loshchilov and Hutter, 2017), with β1 and β2 set to 0.9 and 0.95 respectively. We use a weight decay of 0.1 and clip the gradient norm to 0.5. The model is warmed up over 2,000 linear steps to the maximum learning rate, after which cosine decay is applied down to the minimum learning rate. Parameter details and learning rates are as follows:
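A sketch of the optimizer and learning-rate schedule described above. The peak and minimum learning rates and the total step count below are placeholders (the real values depend on model size and appear in the report's table); the AdamW betas, weight decay, gradient clipping, and warmup/cosine shape follow the text.

```python
import math
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the full model
optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int, warmup: int = 2000, total: int = 100_000,
              min_ratio: float = 0.1) -> float:
    if step < warmup:                        # linear warmup to the peak LR
        return step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    cosine = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
    return min_ratio + (1 - min_ratio) * cosine  # cosine decay to the min LR

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```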

The entire model is trained using BFloat16 mixed precision. BFloat16 has a better dynamic range than Float16, which makes it more robust to the large values that are critical when training large language models. However, the low precision of BFloat16 can cause problems in some cases. For example, in some public RoPE and ALiBi implementations, the torch.arange operation produces collisions when the integer exceeds 256, preventing nearby positions from being distinguished. Therefore, we use full precision for some value-sensitive operations such as positional embeddings.

BFloat16

Both bfloat16 and float16 are 16-bit floating-point formats; their main difference is precision. The bfloat16 format uses 1 bit for the sign, 8 bits for the exponent, and 7 bits for the mantissa, while float16 uses 1 sign bit, 5 exponent bits, and 10 mantissa bits. This means bfloat16 has a larger representable range but lower precision, whereas float16 has higher precision but a smaller range. float32 has 32 bits: 1 sign bit, 8 exponent bits, and 23 mantissa bits, so it has higher precision and can represent a much wider range of values.

In terms of memory usage, float32 requires 4 bytes (32 bits), while float16 and bfloat16 require only 2 bytes (16 bits). Using 16-bit formats therefore saves memory and is better suited to memory-constrained settings.
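A small illustration of the torch.arange collision mentioned above: with only 7 mantissa bits, consecutive integers above 256 are no longer all representable in bfloat16, so position indices start to collapse onto each other.

```python
import torch

pos_bf16 = torch.arange(0, 512, dtype=torch.bfloat16)
pos_fp32 = torch.arange(0, 512, dtype=torch.float32)

print(torch.unique(pos_bf16).numel())  # < 512: some positions have collapsed
print(torch.unique(pos_fp32).numel())  # 512: all positions remain distinct
print(bool((pos_bf16[1:] == pos_bf16[:-1]).any()))  # True: neighbouring positions collide
```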

NormHead

To stabilize training and improve model performance, we normalize the output embeddings (the "head"), which we call NormHead. In our experiments, NormHead has two advantages. First, in preliminary experiments we found that the norm of the head is prone to instability: during training, the norm of rare token embeddings becomes smaller, which disturbs training dynamics, and NormHead significantly stabilizes them. Second, we find that semantic information is mainly encoded by the cosine similarity of embeddings rather than by their L2 distance. Since the usual linear classifier computes logits via a dot product, which mixes L2 norm and cosine similarity, NormHead removes the interference of the L2 norm when computing logits.
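A minimal sketch of a NormHead-style output layer consistent with the description above; the dimensions are illustrative and this is not Baichuan's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormHead(nn.Module):
    """Each row of the output embedding matrix is L2-normalized before the
    logit dot product, so logits depend on direction (cosine-like) rather
    than on the embedding norm."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(vocab_size, hidden_size))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        norm_weight = F.normalize(self.weight, dim=-1)   # unit-norm rows
        return F.linear(hidden_states, norm_weight)      # (..., vocab_size)

head = NormHead(hidden_size=4096, vocab_size=125696)
print(head(torch.randn(2, 16, 4096)).shape)  # torch.Size([2, 16, 125696])
```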

Max-z Loss

During training, we found that the logits of an LLM can become very large. Although the softmax function depends only on the relative values of the logits, very large logits cause problems during inference, because common implementations of repetition penalties apply a scalar directly to the logits; shrinking very large logits in this way can significantly change the post-softmax probabilities, making the model sensitive to the choice of repetition penalty hyperparameter. Inspired by NormSoftmax and PaLM's auxiliary z-loss, we add a max-z loss to normalize the logits:
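The exact formula appears in the report's figure; below is a minimal sketch of an auxiliary max-z style penalty on the largest logit, with an illustrative coefficient (not necessarily the value used in the report).

```python
import torch

def max_z_loss(logits: torch.Tensor, coeff: float = 2e-4) -> torch.Tensor:
    """Auxiliary penalty on the largest logit per position, added to the
    language-modelling loss to keep logits from growing without bound.
    The coefficient here is illustrative."""
    z_max = logits.max(dim=-1).values          # (batch, seq)
    return coeff * (z_max ** 2).mean()

logits = torch.randn(2, 16, 125696) * 30.0     # toy logits with a large scale
print(float(max_z_loss(logits)))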

The final training losses of baichuan2-7B and baichuan2-13B are shown in the figure below: 

7.Scaling Laws - Predict final loss based on scaling rate

Neural scaling laws, in which error decreases as a power function of training set size, model size, or both, provide confidence about final performance as training becomes increasingly expensive in deep learning and large language models. Before training large language models with billions of parameters, we first train several small models and fit a scaling law to them to guide training of the larger models.

We trained a range of model sizes from 10M to 3B parameters, from 1/1000 to 1/10 the size of the final model, training each for up to 1 trillion tokens with consistent hyperparameters and the same dataset as Baichuan 2. Based on the final losses of the different models, we obtain a mapping from training FLOPs to target loss.

To fit the scaling law of the model, we adopted the formula given by Henighan et al. (2020):
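Reconstructed from the description that follows, the fitted form is L_C = a · C^b + L_∞, where C is the training FLOPs, a and b are fitted constants, L_∞ is the irreducible loss, and a · C^b is the reducible power-law term.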

where L∞ is the irreducible loss and the first term is the reducible loss, formulated as a power-law scaling term. C is the training FLOPs and LC is the final loss of the model at that amount of compute. We use the curve_fit function from the SciPy library to fit the parameters. The final fitted scaling curves and the predicted final losses of the 7 billion and 13 billion parameter models are shown below. We can see that the fitted scaling law predicts the final loss of Baichuan 2 with high accuracy:

The scaling law of Baichuan 2. We trained models from 10 million to 3 billion parameters on 1 trillion tokens each. By fitting a power-law term to the loss as a function of training FLOPs, we predict the loss of training Baichuan 2-7B and Baichuan 2-13B on 2.6 trillion tokens. This fitting procedure accurately predicts the final models' losses (marked with two stars).
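The report mentions fitting the parameters with SciPy's curve_fit. Below is a minimal sketch of that fitting step, using made-up (FLOPs, loss) pairs in place of the small-model runs; the extrapolation target is also a placeholder.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(C, a, b, L_inf):
    # L_C = a * C^b + L_inf, the form described above
    return a * np.power(C, b) + L_inf

flops = np.array([1e19, 3e19, 1e20, 3e20, 1e21, 3e21])   # hypothetical runs
losses = np.array([2.90, 2.75, 2.60, 2.48, 2.38, 2.30])  # hypothetical losses

params, _ = curve_fit(scaling_law, flops, losses,
                      p0=[10.0, -0.05, 1.5], maxfev=20000)
a, b, L_inf = params
print(f"a={a:.3g}, b={b:.3g}, L_inf={L_inf:.3g}")

# Extrapolate to the compute budget of a larger model (placeholder value).
print("predicted final loss:", scaling_law(5e23, a, b, L_inf))
```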

8.Infrastructure - How to improve cluster GPU efficiency

Efficient utilization of existing GPU resources plays a crucial role in training and developing large language models today. To this end, we develop a co-designed solution of an elastic training framework and a smart cluster scheduling policy. Since our GPUs are shared among multiple users and tasks, the specific behavior of each task is unpredictable, which often leaves idle GPU nodes in the cluster. Considering that a single machine equipped with 8 A800 GPUs can fully meet the memory requirements of the Baichuan 2-7B and Baichuan 2-13B models, the main design criterion of our training framework is machine-level elasticity: the resources supporting a task can be dynamically adjusted according to cluster status, which serves as the basis of our intelligent scheduling algorithm.

To meet the requirement of machine-level elasticity, our training framework integrates tensor parallelism and data parallelism: we set up tensor parallelism within each machine and use ZeRO-sharded data parallelism for elastic scaling across machines. Additionally, we employ a tensor-splitting technique that splits certain computations, such as the cross-entropy over the large vocabulary, to reduce peak memory consumption. This allows us to meet the memory requirements without increasing computation and communication, making the system more efficient. To further speed up training without affecting model accuracy, we use mixed-precision training: forward and backward computations are performed in BFloat16, while optimizer updates are performed in Float32.

Furthermore, to effectively scale our training cluster to thousands of GPUs, we integrate the following techniques to avoid communication efficiency degradation:

◆  Topology-aware distributed training

In large-scale clusters, network connections often span multiple layers of switches. We strategically arrange the ranks of distributed training to minimize traffic across different switches, thereby reducing latency and improving overall training efficiency.

◆  ZeRO’s hybrid and hierarchical partitioning

By partitioning parameters across GPUs, ZeRO-3 reduces memory consumption at the expense of additional all-gather communication. When scaling to thousands of GPUs, this approach leads to significant communication bottlenecks. To solve this problem, we propose a hybrid and hierarchical partitioning scheme: our framework first partitions the optimizer states across all GPUs and then adaptively decides which layers need to activate ZeRO-3 and whether to partition parameters hierarchically.

By integrating these strategies, our system trains the Baichuan 2-7B and Baichuan 2-13B models efficiently on 1,024 NVIDIA A800 GPUs, achieving a computational efficiency of over 180 TFLOPS per GPU.

Tips:

TFLOPS stands for tera floating-point operations per second, i.e., trillions of floating-point operations per second. It is used to evaluate a computer's computing power, especially for scientific computing workloads dominated by floating-point operations. The FP32 performance of an NVIDIA RTX 3060 is roughly 12.5 TFLOPS, while China's "Tianhe-2" supercomputer delivers on the order of tens of PFLOPS, i.e., tens of thousands of TFLOPS.

4.Alignment - Alignment of Human Intentions

Baichuan 2 also goes through an alignment process, resulting in two chat models: Baichuan 2-7B-Chat and Baichuan 2-13B-Chat. The alignment process consists of two main components: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

1.Supervised Fine-Tuning - Supervised fine-tuning, reward feedback and reinforcement

In the supervised fine-tuning phase, we use human labelers to annotate prompts collected from different data sources. Each prompt is labeled as helpful or harmless according to key principles similar to Claude's. To verify data quality, we use cross-validation: an authoritative annotator checks the quality of a batch of samples annotated by a specific group of crowd workers, rejecting any batch that does not meet our quality standards. We collected over 100k supervised fine-tuning samples and trained the base model on them. Next, we describe the reinforcement learning process of the RLHF method used to further improve results. The entire RLHF process, including RM and RL training, is shown in the figure below:

RLHF

RLHF stands for Reinforcement Learning from Human Feedback, a method that uses reinforcement learning algorithms to optimize language models based on human feedback.

RLHF mainly consists of the following three parts:

        A pretrained language model that can generate natural language text or perform other tasks.

        A reward model learned from human feedback to evaluate the output quality and compliance of language models.

         A reinforcement learning algorithm for training language models that updates the parameters of the language model using guidance from the reward model.

The implementation process of RLHF can be divided into the following three stages:

        In the supervised fine-tuning stage, annotated data is used to initially train the language model to adapt to specific tasks and domains.

        In the reward model training phase, human feedback data is used to train the reward model to capture human preferences and evaluations of the language model output.

        In the RL fine-tuning stage, reinforcement learning algorithms are used to further train the language model to maximize the expected value of the reward model.

This process can be repeated multiple times to more fully align the language model with human values.

2.Reward Model - the greater the difference between responses, the more accurate the reward model

We designed a three-tier classification system for all prompts, consisting of 6 primary categories, 30 secondary categories, and over 200 tertiary categories. From the user's perspective, the goal is for the classification system to comprehensively cover all types of user needs. From the reward-model training perspective, the prompts within each category should be sufficiently diverse to ensure the reward model generalizes well.

Given a prompt, Baichuan 2 models of different sizes and stages (SFT, PPO) generate responses to enhance response diversity. Only responses generated by the Baichuan 2 model family are used in RM training; responses from other open-source datasets and proprietary models did not improve the reward model's accuracy, which also underlines the internal consistency of the Baichuan model series from another perspective. The loss function used to train the reward model is consistent with that of InstructGPT. The performance of the trained reward model is consistent with LLaMA 2, indicating that the greater the score difference between two responses, the higher the reward model's discrimination accuracy, as shown in the following table:
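The report states that the RM loss follows InstructGPT. Below is a minimal sketch of that pairwise ranking loss, with toy reward values standing in for actual reward-model outputs.

```python
import torch
import torch.nn.functional as F

def reward_pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """InstructGPT-style pairwise ranking loss: push the reward of the
    preferred response above that of the rejected one.
    r_chosen / r_rejected are scalar rewards per comparison, shape (batch,)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy rewards standing in for reward-model outputs on preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.1])
r_rejected = torch.tensor([0.7, 0.5, -0.4])
print(float(reward_pairwise_loss(r_chosen, r_rejected)))
```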

3.PPO - Optimizing language generation model

After obtaining the reward model, we use the PPO algorithm to train our language model. We employ four models: the actor model (responsible for generating responses), the reference model (used to compute the KL penalty, with frozen parameters), the reward model (providing an overall reward for the entire response, with frozen parameters), and the critic model (which learns per-token values).

4.Training Detail - training parameter details

During RLHF training, the critic model is first warmed up for 20 training steps. Subsequently, both the critic and actor models are updated with the standard PPO algorithm. For all models, we use gradient clipping of 0.5, a constant learning rate of 5e-6, and a PPO clipping threshold ε = 0.1. We set the KL penalty coefficient β = 0.2, decaying to 0.005 over the course of training. We train all chat models for 350 iterations, resulting in Baichuan 2-7B-Chat and Baichuan 2-13B-Chat.
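A toy sketch of two ingredients referenced above: a KL-shaped per-token reward against the reference model (β) and the PPO clipped surrogate objective (ε = 0.1). Tensor shapes and values are placeholders, not the actual RLHF pipeline.

```python
import torch

def kl_shaped_reward(reward: torch.Tensor, logp_actor: torch.Tensor,
                     logp_ref: torch.Tensor, beta: float = 0.2) -> torch.Tensor:
    """Subtract a KL penalty against the reference model from the RM reward."""
    return reward - beta * (logp_actor - logp_ref)

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, clip_eps: float = 0.1) -> torch.Tensor:
    """Standard PPO clipped surrogate objective (returned as a loss to minimize)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

logp_new = torch.randn(4, 32)                    # toy per-token log-probs
logp_old = logp_new + 0.05 * torch.randn(4, 32)
advantages = torch.randn(4, 32)
print(float(ppo_clip_loss(logp_new, logp_old, advantages)))
```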

5.Safety - Model safety guarantee

We believe that improvements in model safety come not only from constraints during data cleaning or alignment, but also from leveraging positive knowledge and identifying negative knowledge throughout all training stages. Guided by this idea, we enhance model safety throughout the Baichuan 2 training process.

1.Pre-training Stage - Strict data screening

During the pre-training phase, we pay close attention to data security. The entire pre-training dataset undergoes a rigorous data filtering process designed to enhance security. We have designed a system of rules and models to eliminate harmful content such as violence, pornography, racism, hate speech, etc. Additionally, we curate a Chinese-English bilingual dataset that contains millions of web pages from hundreds of reputable websites representing various positive value domains, including policy, law, vulnerable groups, general values, traditional virtues and other fields. We also increased the sampling probability of this dataset.

2.Alignment Stage - red-blue confrontation optimization Prompt and Response

We built a red-teaming process consisting of 6 attack types and 100+ fine-grained safety value categories, with a team of 10 expert annotators who used traditional internet security experience to initialize the safety alignment prompts. Relevant snippets were retrieved from the pre-training dataset to create responses, resulting in approximately 1K annotated examples for initialization.

The expert annotation team guided a 50-person outsourced annotation team through red-blue adversarial rounds against the initially aligned model, producing 200K attack prompts. Using a specialized multi-value supervised sampling method, we maximized the use of the attack data to generate responses at different safety levels. Safety was also considered during the RL optimization stage:

At the beginning of security hardening, the DPO approach effectively leverages a limited amount of annotated data to improve performance on specific vulnerability issues.

PPO safety reinforcement training was then conducted using a reward model that integrates helpfulness and harmlessness objectives.

DPO / PPO

In the field of natural language processing, DPO and PPO are both optimization algorithms used to train large language models.

DPO (Direct Preference Optimization) is a new training method designed to directly optimize the consistency of language models with human preferences. This approach can more precisely control the behavior of the model and align it with human preferences, but how to achieve a large-scale, data-efficient, robust, and scalable approach remains a challenge.

PPO (Proximal Policy Optimization) is a proximal policy optimization algorithm from reinforcement learning, designed to achieve higher data efficiency and robustness when training large language models. Compared with the TRPO algorithm, PPO uses first-order optimization and can estimate the policy and value functions with a single shared neural network, reducing model complexity and training time. A key advantage of PPO for training large language models is the clipped probability ratio, which limits how much the policy can change in a single update, keeping the policy smooth and stable while maintaining high sample efficiency.

In short, DPO and PPO are both optimization algorithms for large language model training, but their application scenarios and purposes differ slightly: DPO directly optimizes the consistency of model behavior with human preferences, while PPO aims to improve data efficiency and robustness while controlling model complexity and maintaining sample efficiency.

6.Evaluations - Multi-dimensional evaluation of the model

In this section, we report zero-shot or few-shot results of the pre-trained base models on standard benchmarks. We evaluate Baichuan 2 on free-form generation tasks and multiple-choice tasks.

1.evaluate method - evaluation method of generation and selection

◆  Free-form generation

The model is given some sample inputs (shots) and then generates continuations to obtain results, such as question answering, translation and other tasks.

◆  Multiple choice

The model is given a question and multiple choices, and the task is to select the most suitable candidate.

2.compare model - open source comparison model with reproducible results

Given the diversity of tasks and examples, we incorporated the open-source evaluation frameworks lm-evaluation-harness and OpenCompass into our in-house implementation to allow fair comparison with other models. The models we chose for comparison are of similar size to Baichuan 2, are open source, and have reproducible results:

◆ LLAMA

A language model trained by Meta on 1 trillion tokens, with a context length of 2,048. We evaluate LLaMA 7B and LLaMA 13B.

◆  LLAMA2

The successor model to LLaMA 1, trained on 2 trillion tokens with a better mixture of data.

◆  Baichuan1

Baichuan 1-7B is trained on 1.2 trillion tokens and Baichuan 1-13B on 1.4 trillion tokens. Both focus on English and Chinese.

◆ ChatGLM2-6B

A chat language model that performs well on several benchmarks.

◆ MPT-7B

An open-source LLM trained on 1 trillion tokens of English text and code.

◆ Falcon-7B

A series of LLMs trained on 1 trillion tokens and enhanced with a curated corpus. It is provided under the Apache 2.0 license.

◆ Vicuna-13B 

A language model trained by fine-tuning LLaMA-13B on a conversational dataset generated by ChatGPT. We adopt the results reported on its website.

◆ Chinese-Alpaca-Plus-13B

A language model trained by fine-tuning LLAMA13B on a conversational dataset generated by ChatGPT.

◆ XVERSE-13B

A 13B multilingual large language model trained on more than 1.4 trillion tokens.

3.Overall Performance - rich evaluation benchmarks

This section describes the overall performance of the Baichuan 2 base model compared to other similarly sized models. We choose 8 benchmarks for comparison:

MMLU (Massive Multitask Language Understanding) consists of a series of multiple-choice questions on academic subjects.

C-Eval is a comprehensive Chinese assessment benchmark containing over 10k multiple-choice questions.

CMMLU is also a general assessment benchmark specifically designed to assess LLM’s knowledge and reasoning abilities in the Chinese language and cultural context.

AGIEval is a human-centered benchmark designed to assess general human abilities such as cognition and problem solving.

Gaokao is an assessment framework that utilizes Chinese high school entrance exam questions.

BBH is a suite of challenging BIG-Bench tasks on which language models had not outperformed the average human rater.

GSM8K is a mathematics-focused evaluation benchmark.

HumanEval  is a docstring-to-code dataset consisting of 164 coding questions testing different aspects of programming logic.

For CMMLU and MMLU, we adopt the official implementations and use 5-shot evaluation. For BBH, we use 3-shot evaluation. For C-Eval, Gaokao, and AGIEval, we only select multiple-choice questions with four candidates for better evaluation. For GSM8K, we use the 4-shot test derived from OpenCompass. We also include the results of GPT-4 and GPT-3.5-Turbo. Unless otherwise stated, the results in this article were obtained with our in-house evaluation tools. The overall results are shown in the table below. Compared with other open-source models of similar size, Baichuan 2 has clear performance advantages, especially on math and code problems, where our model achieves significant improvements over Baichuan 1:

4.Vertical Domain Evaluations - Vertical Domain Evaluations

We also evaluated baichuan2 in vertical domains, and we chose the legal and medical domains because they have been extensively studied in recent years. In the legal domain, we report JEC-QA scores collected from the Chinese National Judicial Examination. It contains multiple choice and multiple answer questions. For compatibility with our assessment kit, we only test multiple choice questions.

In the medical domain, we report scores on two medical benchmarks, MedQA and MedMCQA, as well as average scores on medical-related subjects from C-Eval, MMLU, and CMMLU. Specifically, MedQA is collected from the professional medical board examinations of the United States and China, including three subsets, namely USMLE, MCMLE, and TWMLE, and we report the results of USMLE and MCMLE with five candidates; MedMCQA is collected from Indian medical entrance examinations, and we evaluate the multiple-choice questions and report scores on the development set. The medical-related subjects include: (1) clinical medicine and basic medicine from C-Eval (val); (2) clinical knowledge, anatomy, college medicine, college biology, nutrition, virology, medical genetics, and professional medicine from MMLU; (3) anatomy, clinical knowledge, college medicine, genetics, nutrition, traditional Chinese medicine, and virology from CMMLU. Furthermore, all these datasets are evaluated in 5-shot.

Baichuan 2-7B-Base surpasses models such as GPT-3.5 Turbo, ChatGLM 2-6B, and LLaMA 2-7B in the Chinese legal field, second only to GPT-4. Compared with Baichuan 1-7B, Baichuan 2-7B-Base improves by nearly 10 points. In the medical field, Baichuan 2-7B-Base outperforms models such as ChatGLM 2-6B and LLaMA 2-7B, showing significant improvements over Baichuan 1-7B. Similarly, Baichuan 2-13B-Base surpasses all models other than GPT-4 in the Chinese legal field. In the medical field, Baichuan 2-13B-Base outperforms models such as XVERSE-13B and LLaMA 2-13B. Compared with Baichuan 1-13B-Base, Baichuan 2-13B-Base also shows significant improvements:

5.Math and Code – Mathematics and coding skills

This section covers math and code performance. We use GSM8K (4-shot) and MATH (4-shot) to assess mathematical ability; MATH contains 12,500 harder math problems. To evaluate the models' coding capabilities, we report scores on HumanEval (0-shot) and MBPP (3-shot). HumanEval is a series of programming tasks covering language understanding, reasoning, algorithms, and simple mathematics, used to evaluate model correctness and problem-solving ability. MBPP consists of a dataset of 974 short Python functions and program text descriptions, along with test cases used to verify their functional correctness.

We use OpenCompass to evaluate the models' mathematical and coding capabilities. In mathematics, Baichuan 2-7B-Base surpasses models such as LLaMA 2-7B. In code, it outperforms models of the same size, such as ChatGLM 2-6B. Compared with Baichuan 1-7B, Baichuan 2-7B-Base shows significant improvements. In mathematics, Baichuan 2-13B-Base surpasses all models of the same size and approaches the level of GPT-3.5 Turbo. In code, Baichuan 2-13B-Base outperforms models such as LLaMA 2-13B and XVERSE-13B. Compared with Baichuan 1-13B-Base, Baichuan 2-13B-Base shows significant improvement.

◆ N shot

In meta-learning or transfer learning, "N-shot" describes how model performance is evaluated. "1-shot" means the model sees only one example of a target category before it must learn and predict that category, while "5-shot" means the model sees five examples. Both terms are commonly used in few-shot learning, where the goal is to train a model to learn from very few samples (one or a handful). This matters for many real-world tasks, where predictions must often be made from very few samples. In this context, performance on N-shot tasks is often used to evaluate a model's ability to generalize, i.e., how accurately it predicts on new, previously unseen data.

6.Multilingual - multilingual ability

We used Flores-101 to assess multilingual ability. Flores-101 covers 101 languages from around the world, with data drawn from domains such as news, travel guides, and books. We chose the official languages of the United Nations (Arabic (ar), Chinese (zh), English (en), French (fr), Russian (ru), and Spanish (es)), plus German (de) and Japanese (ja), as test languages. We conducted 8-shot tests on seven subtasks in Flores-101, namely zh-en, zh-fr, zh-es, zh-ar, zh-ru, zh-ja, and zh-de, with evaluation performed using OpenCompass.

In the multilingual domain, Baichuan 2-7B-Base outperforms all models of the same size on all seven tasks and shows significant improvement over Baichuan 1-7B. Baichuan 2-13B-Base outperforms similarly sized models on 4 of the seven tasks. On the zh-en and zh-ja tasks, it surpasses GPT-3.5 Turbo and reaches the level of GPT-4. Compared with Baichuan 1-13B-Base, Baichuan 2-13B-Base achieves significant improvements on the zh-ar, zh-ru, and zh-ja tasks. Although GPT-4 still dominates in the multilingual domain, open-source models are catching up quickly. On the zh-en task, Baichuan 2-13B-Base slightly outperforms GPT-4.

Tips:

8-shot testing refers to an evaluation method for a large language model (LLM) in which 8 relevant examples are provided before the model gives its predicted answer. These examples help the model understand the question it needs to answer or the task it needs to accomplish. In machine learning, this approach is called few-shot learning: the model learns and predicts from a limited number of examples.

7.Safety Evaluations - safer and more reliable

The previous Safety section described efforts to improve the safety of Baichuan 2. However, prior work has shown that helpfulness and harmlessness can be two sides of a seesaw: as harmlessness increases, helpfulness may decrease somewhat. We therefore evaluate both factors before and after safety alignment. The figure below shows the helpfulness and harmlessness of Baichuan 2 before and after safety alignment. We can see that our safety alignment process does not compromise helpfulness while significantly improving harmlessness:

Baichuan helpfulness and harmlessness before and after safety alignment. The x-axis shows the metric before safety alignment and the y-axis the metric after. We see that after this process, helpfulness remains largely unchanged, while harmlessness improves significantly through the safety efforts (more mass in the upper triangle).

We then evaluate the safety of the pre-trained models using the Toxigen dataset. As with LLaMA 2, we use the cleaned version from the SafeNLP project, which distinguishes neutral and hateful types for 13 minority groups, forming a 6-shot dataset consistent with the original Toxigen prompt format. Our decoding parameters use temperature 0.1 and top-p 0.9 nucleus sampling. We evaluate with the fine-tuned HateBERT version optimized on Toxigen. As shown in the table below, compared with LLaMA 2, the Baichuan 2-7B and Baichuan 2-13B models have certain safety advantages:

Inspired by BeaverTails (Ji et al.), we built the Baichuan Harmless Evaluation Dataset (BHED), covering 7 major safety categories: bias/discrimination, insults/profanity, illegal/unethical content, physical health, mental health, financial privacy, and sensitive topics, to evaluate the safety of our chat models. To ensure comprehensive coverage within each category, we asked human annotators to generate 1,400 data samples. These were further expanded through self-instruction and cleaned by humans for fluency, resulting in 70,000 samples in total, 10,000 per category. We used these samples to evaluate different models, and the results are shown in the table below. We can see that Baichuan 2 performs as well as or better than other chat models in our safety evaluation:

8.Intermediate Checkpoints - Intermediate CKPT output

We also release the intermediate checkpoints of the 7B model, from the 220-billion-token checkpoint to the 2,640-billion-token checkpoint, which is the final output of Baichuan 2-7B-Base. We examined their performance on several benchmarks, and the results are shown in the figure below. As shown, Baichuan 2 improves consistently as training proceeds. Even after 2.6 trillion tokens, there appears to be ample room for further gains. This is consistent with previous work on scaling LLMs, showing that data size is a key factor:

7.Related Work - Related work in the LLM field

The field of language models has experienced a renaissance in recent years, triggered in large part by the development of deep neural networks and Transformers. Kaplan et al. proposed scaling laws for large-model pre-training. By systematically analyzing how model performance improves as parameter and data sizes grow, they provided a blueprint for the current era of large models with hundreds of billions of parameters.

Building on these scaling laws, organizations such as OpenAI, Google, Meta, and Anthropic have engaged in a computational arms race to create ever larger LLMs, such as OpenAI's 175-billion-parameter proprietary language model GPT-3. The few-shot and even zero-shot capabilities of LLMs cover most natural language understanding tasks, from code generation to mathematical problem solving and even open-world scenarios. Specialized scientific LLMs such as Galactica have also emerged, demonstrating the potential of large models to absorb technical knowledge. However, raw parameter count alone does not determine model capability: Chinchilla showed that scaling model capacity according to the number of tokens, rather than just parameters, yields better sample efficiency. In parallel with the growth of private LLMs, academic and non-profit efforts have developed open-source alternatives such as BLOOM, OPT, and Pythia. Although some open-source large language models contain as many as 175 billion parameters, most are trained on only 500 billion tokens or less, even though a 7-billion-parameter model can still improve significantly after training on trillions of tokens. Among these open-source models, LLaMA and its successor LLaMA 2 stand out for their performance and transparency, and the community has quickly optimized them for better inference speed and various applications.

In addition to these base models, many chat models have been proposed to follow human instructions. Most of them fine-tune the base models to align with humans, as OpenAI did. These chat models show significant improvements in understanding human instructions and solving complex tasks. To further improve alignment, Ouyang et al. incorporated the RLHF approach, reinforcement learning from human feedback, which learns from human preferences by training a reward model on human-rated outputs. Other methods such as direct preference optimization (DPO) and reinforcement learning from AI feedback (RLAIF) have also been proposed to improve RLHF in efficiency and effectiveness.

8. Limitations and Ethical Considerations - Limitations and Ethical Considerations

Like other large language models, Baichuan 2 also faces ethical challenges. It is prone to bias and toxicity, especially given that much of its training data comes from the internet. Despite our best efforts to mitigate these issues using benchmarks such as Toxigen, the risks cannot be eliminated, and toxicity tends to increase with model size. Furthermore, the knowledge of the Baichuan 2 models is static and may be outdated or incorrect, which poses challenges in fields that require up-to-date information, such as medicine or law. While optimized for safety in Chinese and English, the models have limitations in other languages and may not fully capture biases associated with non-Chinese cultures. There is also potential for abuse, as the models can be used to generate harmful or misleading content. Although we try our best to balance safety and usefulness, some safety measures may appear overly cautious, affecting usability for certain tasks. We encourage users to use the Baichuan 2 models responsibly and ethically. We will continue to address these issues and release updated versions in the future.

9.More - More model related

1.Scaling laws - model system performance estimation

LLM scaling laws are quantitative relationships that predict machine learning system performance (such as training loss or model quality) from computing and learning resources (such as data volume, model size, and compute). These relationships usually take the form of a scale relation: as a resource such as data volume increases, system performance changes in a predictable way. Because such relationships are obtained through extensive experiments and statistical analysis, they are called "scaling laws". For example, a common scaling law is the data scaling law: holding other conditions fixed, a model's predictive performance (such as accuracy) generally improves as the amount of training data grows. This has been widely verified across machine learning tasks. Scaling laws are important for predicting and optimizing machine learning system performance: by studying and understanding them, we can better understand how models behave and how to maximize performance with limited resources. We use 7 models to fit the scaling law of Baichuan 2; parameter details are shown in the table below:

Through multiple sets of controlled experiments with different parameters, we can also use statistical learning methods to predict the model's performance metrics at different parameter scales.

2.NormHead - more stable training

By performing a KNN retrieval task on word embeddings (given a query word, retrieve the K nearest words), we find that semantic information is mainly encoded by the cosine similarity of embeddings rather than by their L2 distance. That is, the KNN results under cosine similarity are semantically related words, while the KNN results under L2 distance are largely meaningless. Since the standard linear classifier computes logits via a dot product, which mixes L2 norm and cosine similarity, we propose computing logits in terms of angle only: we normalize the output embedding so that the dot product is not affected by the embedding norm. The three quantities involved are listed below, followed by a small numerical sketch.

◆  L2 distance

◆  Dot product

◆  Cosine similarity
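A toy numerical sketch (not from the report) of the three quantities above, showing how the dot product mixes direction (cosine) with magnitude (L2 norm), which is exactly the interference NormHead removes.

```python
import torch
import torch.nn.functional as F

a = torch.tensor([3.0, 4.0])        # norm 5
b = torch.tensor([0.6, 0.8])        # same direction as a, norm 1

l2_distance = torch.dist(a, b)                        # depends on magnitudes
dot_product = torch.dot(a, b)                         # = |a| * |b| * cos(theta)
cosine_sim = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))

print(float(l2_distance))   # 4.0 -> large, although directions are identical
print(float(dot_product))   # 5.0 -> scales with the norms
print(float(cosine_sim))    # 1.0 -> pure directional similarity
```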

To verify this, we performed ablation experiments in which we added or removed the normalization before softmax and trained a 7B model for 12k steps. All hyperparameters and data are the same as for Baichuan 2-7B. The training loss is shown in the figure below. We can see that without NormHead, training is very unstable at the beginning; in contrast, after normalizing the head, training becomes very stable, leading to better performance:

3.Training Dynamics - dynamic training and evaluation

In this section, we analyze the training dynamics of our models. We save checkpoints of Baichuan 2-7B and Baichuan 2-13B every 1,000 steps and evaluate these intermediate results on the C-Eval development set, MMLU, CMMLU, JEC-QA, GSM8K, and HumanEval. The results are shown below. As shown in the figure, both the 7B and 13B models show considerable gains as training proceeds. However, on general benchmarks such as MMLU and C-Eval, improvements appear to plateau after 2 trillion tokens. In contrast, the GSM8K math task keeps achieving consistent gains even beyond 2 trillion tokens. This suggests that training FLOPs may be closely related to improvements in mathematical problem solving, which could be studied further:

4.Baichuan Harmless Evaluation Dataset - Baichuan Harmless Dataset

We propose the Baichuan Harmless Evaluation Dataset (BHED) to evaluate the chat models, as described above. Here, we introduce the principles and examples of BHED. The seven main safety categories are bias and discrimination, insults and profanity, illegal/unethical content, physical health, mental health, financial privacy, and sensitive topics. To ensure diversity within each category, multiple sub-dimensions were considered:

◆  Bias/discrimination covers nationality, ethnicity, race/skin color, group, occupation, gender, region, industry, etc., to ensure data diversity.

◆  Insults/profanity includes explicit and implicit insults as well as internet verbal abuse.

◆  Illegal/unethical content includes criminal law, civil law, economic law, international law, traffic regulations, local administrative regulations, etc.

◆  Physical health covers health knowledge, medical advice, and discrimination related to physical health.

◆  Mental health includes emotional health, cognitive and social health, self-esteem and self-worth, coping with stress and adaptability, psychological counseling, and discrimination against groups with mental health problems.

◆  Financial privacy includes real estate, personal debt, bank information, income, stock recommendations, etc. Privacy includes personal information, family information, occupational information, contact details, private life, etc.

◆  Sensitive topics include racial hatred, international political issues, legal loopholes, human-AI relationships, etc.

We collected 10k prompts for each category; some examples are shown in the table below:

5.Details of MMLU and C-Eval - detailed evaluation information

C-Eval

MMLU

6.Examples generated by Baichuan 2-13B-Chat - Model examples

Translation

Code implementation

Math

  Choose

  Bilingual judgment  

10. Summary

I have compiled and translated this Baichuan 2 technical report, which runs to nearly 20,000 words, and I have benefited a lot from it. The report is detailed, covering everything from data preparation and data processing to model training, intent alignment, and the subsequent feedback, evaluation, and optimization. For each step, I have added a summary of that paragraph's content after the '-' in the headings, so you can navigate via the table of contents as needed. In the previous Baichuan 2 express post, we compared the differences between Baichuan 1 and Baichuan 2:

Based on the content of this article, we have distilled and summarized some insights below. Of course, the entire report is full of useful information; the parts not highlighted here are not unimportant.

Data preprocessing

The model's data spans multiple categories across multiple domains, covering almost every aspect of our lives from first-level to third-level fields. At the same time, Baichuan performed fine-grained deduplication, screening, and safety checks across different text granularities, including sentences, paragraphs, and documents. The rich domains and sound filtering mechanism yield relatively "purer" raw data, making the model's raw material more reliable and the resulting model more effective. When constructing data for our own business scenarios, we should likewise pay attention to finer-grained cleaning and filtering to improve the effect of fine-tuning methods such as LoRA. The picture below shows the step-by-step filtering and cleaning of the data, retaining the final high-quality portion:

Influence of data volume

The article mentions several times that Baichuan 2 uses 2.6 trillion tokens, nearly double that of Baichuan 1. The vocabulary size has also increased from Baichuan 1's 64,000 to 125,696, and it is the model trained on the most high-quality tokens among models of the same size. The dataset verification below also shows that training on more data yields better evaluation results. This is aligned with human cognition: the more knowledge we have, the better we perform across more knowledge domains. Therefore, when we use open-source models for work in our own vertical fields, it is critical to curate more high-quality training corpora in the relevant domains. The picture below shows Baichuan 2 obtaining its pre-training knowledge base from multiple domains:

Choice of Base and Chat

Baichuan 2 introduces an alignment process to produce the corresponding Chat models. The alignment process mainly includes two components: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). It optimizes the model to follow human instructions, making it better at dialogue and context understanding, while also involving a lot of safety work. The Base model comes from pre-training and mainly learns knowledge; the Chat model uses SFT and RLHF to align with human intentions, relying on prompt instructions and reward feedback to teach the model how to use the learned knowledge, and how to use it correctly. For SFT tasks in a business vertical, practice and manual evaluation show that the Chat models outperform the Base models; you can try and verify this yourself. In addition to Base and Chat, more lightweight quantized models are also provided:

 Similarity between NormHead and Cos

One structural difference between Baichuan 2 and Baichuan 1 is switching the final lm_head to NormHead. The report above explains the reason for choosing NormHead: semantic similarity relies more on cosine similarity, while L2 distance appears comparatively irrelevant. This coincides with the blogger's earlier use of cosine similarity for model evaluation. You can also try using cosine similarity in evaluation; the vector-selection strategy used there is FirstAndLast (see the earlier post "LLM evaluation effect By Cos"). The corresponding vectors can be obtained from the hidden_states output by the model for the cosine calculation:
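A minimal sketch of computing such a cosine score from hidden_states with the transformers library; the model name is a placeholder, and simple mean pooling of the last hidden layer stands in for the blogger's FirstAndLast selection strategy, which is described in the linked post.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "model-name-placeholder"  # hypothetical model identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

def embed(text: str) -> torch.Tensor:
    """Return a sentence vector by mean-pooling the last hidden layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1].mean(dim=1).squeeze(0)

a, b = embed("今天天气很好"), embed("今天天气不错")
print(float(torch.cosine_similarity(a, b, dim=0)))
```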

◆  Utilization of Scaling laws

Scaling laws were a term I encountered for the first time after working with LLMs for a while. They are mainly used in machine learning to predict system performance from computing and learning resources, and here they are used to estimate certain metrics or parameters. Through multiple sets of controlled experiments, one can construct several control groups to fit and predict multiple metrics along multiple dimensions. Baichuan 2 successfully predicted the final training loss of Baichuan 2-7B and Baichuan 2-13B using models of different sizes. This is somewhat similar to traditional machine learning's A/B tests or ablation experiments, but the idea of predicting the behavior of more or larger models by curve fitting is worth learning from.


Origin blog.csdn.net/BIT_666/article/details/133035120