[Paper Notes] Baichuan 2: Open Large-scale Language Models


Abstract

Large language models (LLMs) demonstrate excellent performance on natural language tasks, reducing the need for extensive feature engineering. But most capable LLMs are English-focused or closed-source. This technical report introduces Baichuan 2, a family of large-scale multilingual models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens. Baichuan 2 matches or exceeds other open-source models of similar size on public benchmarks such as MMLU, CMMLU, GSM8K, and HumanEval, and performs well in vertical domains such as medicine and law. We release all pre-trained model checkpoints to help the research community better understand the Baichuan 2 training process.

Introduction

The importance and progress of large-scale language models are highlighted, especially the challenges and opportunities in open source and multilingual processing. Baichuan 2, a new multi-language model, aims to address these challenges and provide a valuable resource to the community.

  1. Progress in Large Language Models: In recent years, the field of large language models (LLMs) has made significant progress. Model size has grown from millions of parameters (such as ELMo and GPT-1) to billions or even trillions of parameters (such as GPT-3, PaLM, and Switch Transformers). This growth in scale substantially improves the ability of language models to perform a variety of natural language tasks fluently and in a more human-like way. In particular, OpenAI's ChatGPT demonstrates the ability of these models to generate human-like text across a variety of domains.
  2. Open-source vs. closed models: Despite exciting breakthroughs and applications of LLMs, most leading LLMs (such as GPT-4, PaLM-2, and Claude) remain closed source. This limits developers' and researchers' access to the full model parameters, making it difficult for the community to study or fine-tune these systems in depth. In contrast, fully open-source LLMs (such as LLaMA, OPT, Bloom, MPT, and Falcon) give researchers free access to the models, accelerating research and progress in this field.
  3. Lack of Chinese models: Most open-source large language models focus mainly on English. For example, LLaMA's main data source, Common Crawl, makes up 67% of LLaMA's pre-training data but is filtered to contain only English content. Other open-source LLMs such as MPT and Falcon also focus primarily on English and have limited capabilities in other languages.
  4. Introduction to Baichuan 2: This technical report introduces Baichuan 2, a family of large-scale multilingual language models. Baichuan 2 comes in two sizes: Baichuan 2-7B (7 billion parameters) and Baichuan 2-13B (13 billion parameters). Both models were trained on 2.6 trillion tokens. Baichuan 2 achieves significant improvements on various benchmarks and also performs well in the medical and legal domains.
  5. Open Source : To promote research collaboration and continuous improvement, the authors also release checkpoints of Baichuan 2 at various training stages from 200 billion tokens to the full 2.6 trillion tokens. The release of these intermediate results is intended to provide the community with a deeper understanding of Baichuan 2 training.

Pre-training

Data

This part covers the sources and processing of the data used to pre-train Baichuan 2. Comprehensive, high-quality data is key to training large language models.

Data sources

During data acquisition, the goal is the comprehensiveness and representativeness of the data. Data comes from a variety of sources, including general Internet web pages, books, research papers, and code repositories, to build an extensive system of world knowledge.

Data cleaning

  • Data frequency: The focus is on the frequency and quality of data. Data frequency relies on clustering and deduplication. To this end, a large-scale deduplication and clustering system was built that can cluster and deduplicate trillions of tokens of data in a matter of hours.
  • Cluster-based labeling: Based on clustering, individual documents can be labeled as high, medium, or low frequency, which helps optimize the quality and diversity of the data.
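As a rough illustration of such a pipeline (not the authors' system), here is a minimal near-duplicate filter using MinHash LSH from the `datasketch` library; the shingle size, similarity threshold, and toy documents are all placeholder assumptions:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    words = text.split()
    # Hash overlapping word 3-grams (the shingle size is an arbitrary choice here).
    for i in range(max(1, len(words) - 2)):
        m.update(" ".join(words[i:i + 3]).encode("utf-8"))
    return m

docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumps over the lazy dog today",  # near-duplicate of d1
    "d3": "completely unrelated text about language models",
}

lsh = MinHashLSH(threshold=0.5, num_perm=128)   # estimated-Jaccard threshold
kept = []
for doc_id, text in docs.items():
    m = minhash(text)
    if not lsh.query(m):       # no near-duplicate kept so far -> keep this one
        lsh.insert(doc_id, m)
        kept.append(doc_id)
print(kept)                    # d2 is very likely filtered out as a duplicate
```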

Model architecture

Baichuan 2's model architecture is based on the popular Transformer. However, the authors made several modifications to it.

Tokenizer:

  • Tokenizers need to balance two key factors: a high compression ratio for efficient inference, and an appropriately sized vocabulary to ensure adequate training of each word embedding.
  • To balance computational efficiency and model performance, the vocabulary size of Baichuan 2 is expanded from 64,000 in Baichuan 1 to 125,696.
  • The vocabulary size and text compression ratio of Baichuan 2's tokenizer are compared with other models in a table; a lower compression ratio is better.
  • Baichuan 2 uses byte-pair encoding (BPE) (Shibata et al., 1999) from SentencePiece (Kudo and Richardson, 2018) for tokenization. Specifically, it does not apply any normalization to the input text and, as in Baichuan 1, does not add a dummy prefix. It splits numbers into individual digits to better encode numerical data. To handle code data containing extra spaces, it adds space-only tokens to the tokenizer. Character coverage is set to 0.9999, with rare characters falling back to UTF-8 bytes.
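A minimal sketch of how such a tokenizer could be trained with the SentencePiece Python API, assuming a plain-text corpus file; the corpus path and the exact space-token list are placeholders, not values from the paper:

```python
import sentencepiece as spm

# Train a BPE tokenizer roughly matching the reported settings.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                   # placeholder corpus path
    model_prefix="baichuan2_like",
    model_type="bpe",
    vocab_size=125696,                    # expanded vocabulary
    character_coverage=0.9999,            # rare characters fall back to bytes
    byte_fallback=True,
    split_digits=True,                    # numbers become individual digits
    normalization_rule_name="identity",   # no normalization of input text
    add_dummy_prefix=False,               # no dummy prefix
    remove_extra_whitespaces=False,
    allow_whitespace_only_pieces=True,
    user_defined_symbols=["  ", "    "],  # space-only tokens for code (assumed)
)

sp = spm.SentencePieceProcessor(model_file="baichuan2_like.model")
print(sp.encode("x = 12345", out_type=str))   # digits split into single tokens
```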

Tokenizer comparison

Positional encoding:

Baichuan 2 uses Rotary Positional Embedding (RoPE) for its 7B version and ALiBi (Press et al., 2021) for its 13B version. ALiBi is a newer positional encoding technique that has shown improved extrapolation performance.

  • Choice of open-source models: Most open-source models use RoPE for positional embeddings. Optimized attention implementations such as Flash Attention are currently better suited to RoPE, because RoPE is multiplication-based and bypasses the need to pass an attention_mask to the attention operation.
  • Preliminary experiments : In preliminary experiments, the choice of position embeddings did not significantly affect the performance of the model.

To facilitate further research on bias-based and multiplication-based attention, the authors applied RoPE on Baichuan 2-7B and ALiBi on Baichuan 2-13B, which is consistent with Baichuan 1.
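To make the bias-based side of this contrast concrete, here is a minimal PyTorch sketch of the ALiBi bias (our reconstruction, not the authors' code; the slope formula is exact when the head count is a power of two):

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear bias added to attention logits (no learned positions)."""
    # Geometric slopes 2^(-8/n), 2^(-16/n), ...; exact when n_heads is a power of 2.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]       # (seq, seq); negative for past tokens
    # In causal attention only the lower triangle (rel <= 0) is ever used,
    # so more distant keys receive a larger negative bias.
    return slopes[:, None, None] * rel[None, :, :]   # (heads, seq, seq)

# Usage: logits = q @ k.transpose(-1, -2) / sqrt(d) + alibi_bias(n_heads, seq_len)
```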

Activation functions and normalization

  • Activation function: Baichuan 2 uses SwiGLU (Shazeer, 2020) as its activation function. SwiGLU is a switch-activated variant of GLU (Dauphin et al., 2017) that exhibits better results. Unlike the traditional Transformer feed-forward layer, which has only two matrices, SwiGLU contains three parameter matrices; to compensate, the hidden size is reduced from 4x the model dimension to 8/3x, with appropriate rounding (see the sketch after this list).
  • Attention layer: Baichuan 2 adopts the memory-efficient attention implemented by xFormers (Rabe and Staats, 2021). By leveraging xFormers' optimized attention with bias support, the model can effectively integrate ALiBi's bias-based positional encoding while reducing memory overhead. This brings performance and efficiency benefits to large-scale training.
  • Normalization :
    • Baichuan 2 applies layer normalization on the input of the Transformer block, which is more robust to warmup.
    • Furthermore, the model is implemented using RMSNorm, which only calculates the variance of the input features to improve efficiency.
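As a concrete illustration of these two components, here is a minimal PyTorch sketch of a SwiGLU feed-forward layer and RMSNorm; rounding the hidden size up to a multiple of 128 is our illustrative choice, not a figure from the report:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalizes by the root mean square only (no mean subtraction)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """Gated feed-forward with three matrices: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(dim * 8 / 3)                 # 8/3x instead of the usual 4x
        hidden = 128 * ((hidden + 127) // 128)    # rounded up (illustrative choice)
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```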

Optimizer

  • Hyperparameters:
    • The AdamW optimizer was used during training, with β1 set to 0.9 and β2 set to 0.95.
    • A weight decay of 0.1 is used, and the gradient norm is clipped to 0.5.
    • The learning rate is first warmed up to its maximum over 2,000 steps of linear scaling, and then decayed to the minimum learning rate with a cosine schedule.
    • The entire model is trained in BFloat16 mixed precision. BFloat16 has a better dynamic range than Float16, but its low precision can cause problems in some settings; therefore, full precision is used for certain value-sensitive operations, such as positional embeddings.
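A minimal PyTorch sketch of this optimizer and schedule; the peak learning rate, total step count, minimum-LR ratio, and the toy model are placeholders, not values from the report:

```python
import math
import torch

model = torch.nn.Linear(512, 512)        # stand-in for the real model
peak_lr, warmup, total, min_ratio = 2e-4, 2_000, 100_000, 0.1   # placeholders

def lr_lambda(step: int) -> float:
    if step < warmup:
        return step / warmup             # linear warmup to the peak LR
    progress = (step - warmup) / max(1, total - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_ratio + (1.0 - min_ratio) * cosine   # cosine decay to a floor

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside each training step, before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```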

Training parameters

NormHead

In order to stabilize training and improve model performance, the output embeddings are normalized.

Insight:

1. Output embeddings are prone to instability: during training, the norms of rare token embeddings shrink, interfering with training.

2. For the KNN retrieval task, we find that semantic information is mainly encoded by the cosine similarity of embeddings rather than by L2 distance. A standard linear classifier computes logits via dot product, which mixes L2 distance and cosine similarity. NormHead removes the interference of the L2 norm when computing logits. Ablation experiments in the appendix verify that training becomes very unstable when NormHead is removed.
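A minimal sketch of what such a normalized head can look like in PyTorch (our reconstruction from the description above, not the authors' code): the rows of the output-embedding matrix are L2-normalized before the logit dot product, so logits reflect cosine similarity rather than embedding norms.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormHead(nn.Module):
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(vocab_size, dim))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Normalize each output embedding row, then take the dot product,
        # removing the L2-norm component from the logits.
        return F.linear(hidden, F.normalize(self.weight, p=2, dim=-1))
```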

Comparison of training with and without NormHead

Max-z loss

Insight:

During training with DeepSpeed, logits were observed to become very large. The author searched the DeepSpeed issues and found a similar bug reported there. So besides occurring during pretraining, could DeepSpeed also exhibit this kind of logit drift during SFT?
Because of the repetition penalty, large logits cause problems during inference: very large logits significantly change the post-softmax probabilities, making the model sensitive to the choice of repetition-penalty hyperparameter.
Inspired by NormSoftmax, the Baichuan team adds a max-z auxiliary loss to address this:
$$\mathcal{L}_{\text{max-z}} = 2 \times 10^{-4} \cdot z^{2}$$

where $z$ is the maximum logit value.
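A minimal sketch of this auxiliary loss, assuming $z$ is taken per position and the penalty is averaged over the batch (the reduction is our assumption):

```python
import torch

def max_z_loss(logits: torch.Tensor, coeff: float = 2e-4) -> torch.Tensor:
    # logits: (batch, seq, vocab); z is the largest logit at each position.
    z = logits.max(dim=-1).values
    # Quadratic penalty that pulls the maximum logit back toward zero.
    return coeff * (z ** 2).mean()
```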

Scaling Laws

The importance of scaling laws in predicting the performance of large models, and how they guide the training of these models to achieve the desired results, is emphasized.

  • Scaling laws describe how a model's error decreases as a function of training set size, model size, or both. These laws become increasingly important as the training of deep learning models becomes increasingly computationally intensive.
  • Before training the large models (with billions of parameters), the authors first trained smaller models to establish a scaling law for training larger models. Using the same dataset and consistent hyperparameters as Baichuan 2, models ranging from 10 million to 3 billion parameters were trained on up to 1 trillion tokens.
  • The scaling law is formulated using the formula provided by Henighan et al. (2020), which describes the relationship between training FLOPs and target loss. The formula consists of an irreducible loss and a reducible loss, where the reducible loss is a power-law scaling term (see the formula after this list).
  • The fitted scaling law is able to accurately predict the final loss of the Baichuan 2 model. Specifically, the scaling law accurately predicted the ultimate losses of the Baichuan 2-7B and Baichuan 2-13B models.
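In symbols (notation ours, following the Henighan et al. form referenced above):

$$\mathcal{L}_C = a\,C^{b} + \mathcal{L}_{\infty}$$

where $C$ is training FLOPs, $\mathcal{L}_{\infty}$ is the irreducible loss, and $a\,C^{b}$ (with $b < 0$) is the reducible loss that shrinks as a power law in compute.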

Under the given data, the larger the model, the smaller the loss. What seems strange is that these three models converge at 600B at the same time.

Infrastructure

DeepSpeed was modified to optimize training; details are skipped here.

Alignment stage

The alignment process of Baichuan 2 consists of two main components: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

1 Supervised fine-tuning

In the supervised fine-tuning phase, human annotators annotate prompts collected from various data sources. Each prompt is labeled as helpful or harmless according to key principles similar to Claude (2023). To verify data quality, cross-validation is used: an authoritative annotator checks the quality of batches of samples annotated by a specific crowdsourcing group, filtering out low-quality data. A total of about 100,000 SFT samples were used in the SFT stage.

2 Reward Model:

We designed a three-tier classification system for all prompts, including 6 primary categories, 30 secondary categories, and over 200 tertiary categories. From the user's perspective, the classification system should comprehensively cover all types of user needs. From the perspective of reward model training, the prompts within each category should have enough diversity to ensure that the reward model generalizes well.

Details:

Accuracy of the reward model on pair data with different score gaps

Insight:

The RM data uses responses generated by Baichuan 2 models of different sizes and stages (SFT, PPO) to enhance response diversity. Only responses generated by the Baichuan 2 model family were used in RM training; responses from other open-source datasets and proprietary models did not improve the reward model's accuracy. This also emphasizes, from another angle, the internal consistency of the Baichuan model series: the preference data must be generated by the SFT or PPO model itself, so that RM training comes from the same distribution as the data seen by PPO. It illustrates the importance of identically distributed data for model accuracy.

3 PPO:

After obtaining the reward model, the language model is trained with Proximal Policy Optimization (PPO). Four models are used: an actor model (responsible for generating responses), a reference model (with fixed parameters, used to compute the KL penalty), a reward model (providing an overall reward for the entire response), and a critic model (designed to learn per-token values).

4 Training details:

During RLHF training, the critic model is first warmed up for 20 training steps. Subsequently, both the critic and actor models are updated through the standard PPO algorithm. For all models, gradient clipping of 0.5, a constant learning rate of 5e-6, and a PPO clipping threshold ε = 0.1 are used. Training runs for 350 iterations for all chat models, resulting in Baichuan 2-7B-Chat and Baichuan 2-13B-Chat.
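For reference, here is a minimal sketch of the clipped PPO policy loss such an update uses, with ε = 0.1 as reported; the tensor shapes and advantage estimates are assumed inputs, not details from the paper:

```python
import torch

def ppo_policy_loss(logprobs: torch.Tensor,
                    old_logprobs: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.1) -> torch.Tensor:
    # Probability ratio between the current actor and the sampling-time actor.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    # Clipping keeps each update within (1 - eps, 1 + eps) of the old policy.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```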

PPO iteration process

Evaluation and Summary

It seems that when Baichuan 2 is trained on the same number of tokens as Baichuan 1, the C-Eval and CMMLU scores are actually not much different, indicating that the main improvement of Baichuan 2 comes from the data; brute-force scaling of data really can work miracles!
