Baichuan's Large Model Know-How

Hello friends, I am rumor.

Training large models is an empirical endeavor that spans data cleaning, the underlying framework, algorithmic strategy, and more. Every link in the chain has pitfalls, so knowing how to avoid them and how to make technology choices is extremely valuable: it saves a great deal of compute and time. To put it bluntly, know-how is just stacks of cash.


Baichuan Intelligence recently released the 7B and 13B versions of Baichuan2. Many readers may have been too busy watching the launch buzz to notice that a technical report was released along with the models, and it is packed with useful information. So I went through it to walk you through the know-how Baichuan has accumulated. There are also a few points I don't fully understand; I hope they spark some thoughts and that we can discuss them in the comments.

Pre-training

Data

Data diversity

  1. When collecting data from different sources, it is best to build a category taxonomy. This gives better control over the overall data distribution and makes it easier to add or remove data later.

  2. For clustering and deduplication, either LSH (locality-sensitive hashing) features or dense vectors can be used. LSH is faster, but dense vectors encode semantics better. The catch is that deduplication needs a hard threshold: deduplicating too aggressively hurts diversity and reduces generalization. Baichuan's approach is therefore to remove only part of the duplicates and to score the remaining samples, using the scores as sampling weights during pre-training (a minimal MinHash LSH sketch appears after the pipeline figure below).

The overall deduplication pipeline is shown below (what I don't quite understand is why document-level deduplication is placed last; doing it earlier should significantly reduce the amount of sentence- and paragraph-level work):

[Figure: Baichuan2's deduplication and filtering pipeline]
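The report gives no code, but a minimal near-deduplication sketch with MinHash LSH might look like the following. This uses the datasketch library; the word 3-gram shingling and the 0.8 Jaccard threshold are my assumptions, not Baichuan's actual settings.

```python
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    """Hash a document's word 3-grams into a MinHash signature."""
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for i in range(max(len(tokens) - 2, 1)):
        m.update(" ".join(tokens[i:i + 3]).encode("utf-8"))
    return m

# toy corpus standing in for the real pre-training data
corpus = {
    "doc_a": "the quick brown fox jumps over the lazy dog",
    "doc_b": "the quick brown fox jumps over the lazy dog today",
    "doc_c": "large language models are trained on web scale data",
}

# 0.8 is an assumed Jaccard threshold, not a figure from the report
lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = {}
for doc_id, text in corpus.items():
    sig = minhash_signature(text)
    if lsh.query(sig):          # a near-duplicate has already been kept, so drop this one
        continue
    lsh.insert(doc_id, sig)
    kept[doc_id] = text

print(kept.keys())              # doc_b should be flagged as a near-duplicate of doc_a
```

Baichuan keeps it softer than a hard drop: only part of the duplicates are removed, and the rest are down-weighted through sampling probabilities during pre-training.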

Data quality

  1. Using sentence-level classifiers for filtering is standard practice in the industry, but the report does not detail what data the classifiers were trained on or what labeling criteria were used (a toy sketch follows this list).

  2. For content safety, rules and models are used to scrub harmful content, and additional data sources with positive values are identified and given a higher sampling probability.
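Since the report doesn't describe the quality classifier, here is only a toy illustration of the idea: a sentence-level filter built from TF-IDF features and logistic regression with scikit-learn. The labels, features, and threshold are all made up; a real filter would be trained on far more data with carefully defined quality criteria.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy labels: 1 = keep, 0 = filter out; the real training data and criteria are not disclosed
sentences = [
    "The report describes the pre-training pipeline in detail.",
    "Photosynthesis converts light energy into chemical energy.",
    "click here buy now !!! best price $$$",
    "asdf qwer zxcv lorem lorem lorem",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(sentences, labels)

def keep_sentence(sentence: str, threshold: float = 0.5) -> bool:
    # keep the sentence if the predicted "quality" probability exceeds the threshold
    return clf.predict_proba([sentence])[0, 1] >= threshold

print(keep_sentence("The model is trained on curated web text."))
```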

Model architecture

Tokenizer

The difficulty with the tokenizer is balancing compression ratio against vocabulary size. For example, several Chinese characters that frequently appear together can be merged into a single token, which makes inference faster; but once merged, the embeddings of the individual characters may be under-trained, and their semantic representations will be weaker when they combine with other words.

Baichuan therefore used BPE, chose a compromise vocabulary size of about 120,000, and disclosed the following details (a training sketch follows the list):

  1. No normalization is performed on the original data

  2. Numbers are split into individual digits so numerical data is handled better

  3. For code data, dedicated space tokens are added.

  4. Character coverage is 0.9999, with only a small amount of byte fallback (a way to avoid OOV: unknown characters are encoded as UTF-8 byte tokens)
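These settings map fairly directly onto SentencePiece's BPE trainer. The sketch below is my reading of the report, not Baichuan's released training code; the corpus path is a placeholder, and how exactly the dedicated space tokens for code were added is not specified, so that part is only noted in a comment.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",                  # placeholder path to the training text
    model_prefix="baichuan_like_bpe",
    model_type="bpe",
    vocab_size=120000,                   # the "compromise" vocabulary size
    normalization_rule_name="identity",  # 1. no normalization of the raw text
    split_digits=True,                   # 2. numbers split into individual digits
    remove_extra_whitespaces=False,      # 3. keep whitespace for code; the report's dedicated
                                         #    space tokens would need extra handling
    character_coverage=0.9999,           # 4. coverage 0.9999
    byte_fallback=True,                  #    unknown characters fall back to UTF-8 byte tokens
)
```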

Position encoding

Because models need to extrapolate to longer contexts, there has been a lot of recent work on positional encoding; the most popular schemes are RoPE and ALiBi. Baichuan uses both, because its experiments showed that the choice of positional encoding does not significantly affect model performance, and each scheme was paired with a matching speed optimization (a minimal RoPE sketch follows the list):

  1. RoPE + Flash Attention

  2. ALiBi + xFormers
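For reference, here is a minimal RoPE implementation in PyTorch using the common "split-half" rotation convention. This is a generic sketch rather than Baichuan's code; in practice it is fused with FlashAttention (or xFormers for ALiBi) for speed.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding for x of shape (seq_len, n_heads, head_dim)."""
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # per-dimension rotation frequencies and per-position angles
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)  # (seq_len, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # rotate each (x1, x2) pair by its position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 8, 64)   # (seq_len, n_heads, head_dim)
q_rotated = apply_rope(q)    # keys get the same treatment
```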

Activation function

The better-performing SwiGLU activation is used. Since SwiGLU has three weight matrices and therefore more parameters, Baichuan shrinks the FFN hidden size (from 4x the model dimension to 8/3x, then rounded to a multiple of 128).
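A sketch of a SwiGLU FFN with this reduced hidden size is below; it mirrors the common LLaMA-style implementation rather than Baichuan's released code, and the rounding follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int, multiple_of: int = 128):
        super().__init__()
        hidden = int(8 * d_model / 3)                                       # 4x shrunk to 8/3x
        hidden = multiple_of * ((hidden + multiple_of - 1) // multiple_of)  # round up to a multiple of 128
        self.gate = nn.Linear(d_model, hidden, bias=False)
        self.up = nn.Linear(d_model, hidden, bias=False)
        self.down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(gate(x)) * up(x), then project back to d_model
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLUFFN(d_model=4096)   # hidden size becomes 11008 (8/3 * 4096, rounded up to 128)
```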

Normalization

  1. Layer normalization is applied to the input of the Transformer blocks (pre-norm), which is more robust to the warm-up schedule.

  2. The RMSNorm implementation is adopted: it only computes the (uncentered) variance of the input features, which improves computational efficiency (see the sketch after this list).
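RMSNorm drops LayerNorm's mean-centering and rescales only by the root mean square of the features, which is where the efficiency gain comes from. A minimal version:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # no mean subtraction: scale by the root mean square over the last dimension only
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight
```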

Mixed precision

BF16 is used because its wider dynamic range makes training more stable; however, full precision is kept for components such as the positional encodings and optimizer states.
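In PyTorch terms this roughly corresponds to running the forward and backward passes under bf16 autocast while the master weights and optimizer states stay in fp32. This is a generic mixed-precision sketch; the report itself lists which specific components Baichuan keeps in full precision.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()          # stand-in model; parameters stay in fp32
optimizer = torch.optim.AdamW(model.parameters())   # optimizer states kept in full precision

x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()                    # matmuls run in bf16 (no GradScaler needed)
loss.backward()                                      # gradients land on the fp32 parameters
optimizer.step()
```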

Improving stability

  1. NormHead: normalizes the output (head) embeddings. First, the norms of low-frequency token embeddings shrink during training, and normalizing them improves stability. Second, by clustering the output embeddings, Baichuan found that cosine distance groups semantically similar tokens together while L2 distance does not; normalization removes the L2-norm term from the dot product when computing logits. The experiments clearly show better and more stable loss convergence.

  2. Max-z loss: during training Baichuan found that the model's logits became very large, which makes decoding less robust to hyperparameters (e.g., temperature), so a max-z loss is added to push the logits down (sketch below).
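As I read it, NormHead amounts to L2-normalizing the rows of the output embedding before computing logits, and max-z adds a penalty on the largest logit per position. A sketch of both follows; the 2e-4 coefficient is a placeholder of mine, not necessarily the value used in the report.

```python
import torch
import torch.nn.functional as F

def normhead_logits(hidden: torch.Tensor, head_weight: torch.Tensor) -> torch.Tensor:
    # NormHead: L2-normalize each output-embedding row so the logit depends on the
    # direction of the embedding rather than on its L2 norm
    return hidden @ F.normalize(head_weight, dim=-1).t()

def max_z_loss(logits: torch.Tensor, coeff: float = 2e-4) -> torch.Tensor:
    # penalize the largest logit per position so logits stay small and decoding
    # remains robust to temperature / top-p; coeff is a placeholder value
    z = logits.max(dim=-1).values
    return coeff * (z ** 2).mean()

hidden = torch.randn(4, 256)          # (tokens, d_model), toy sizes
head_weight = torch.randn(1000, 256)  # (vocab, d_model)
logits = normhead_logits(hidden, head_weight)
aux_loss = max_z_loss(logits)
```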

Note: among the pre-training optimizations I skipped the Infra part, since I don't understand it as well.

Alignment

SFT

  1. Data quality: quality is controlled through spot checks. Batches of data are sampled for inspection, and unqualified data is sent back.

  2. Data volume: 100k samples (there is already a fair amount of open-source SFT data, so I'm not sure what Baichuan's consideration was here)

Reward Model

  1. Prompt diversity: a taxonomy with 200+ fine-grained categories was built to cover user needs as broadly as possible, while the diversity of prompts within each category was also improved, which boosts generalization.

  2. Response diversity: responses are generated by Baichuan models of different sizes and training stages; other open-source models are not used (they were shown not to improve RM accuracy)

PPO

  1. The critic model is warmed up in advance

  2. Gradient clipping is applied to improve RL stability (a sketch of both tricks follows this list)
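Both tricks are easy to express in code. The sketch below is generic PPO-style structure, not Baichuan's implementation: the loss helpers are hypothetical stand-ins, and the clip value of 0.5 and the warm-up step count are assumptions.

```python
import torch

def ppo_update(policy, critic, optimizer, batch, step: int, critic_warmup_steps: int = 20):
    # compute_value_loss / compute_policy_loss are hypothetical helpers standing in
    # for the usual PPO value loss and clipped-surrogate policy loss
    value_loss = compute_value_loss(critic, batch)
    policy_loss = compute_policy_loss(policy, batch)

    # critic warm-up: update only the value function for the first few steps
    loss = value_loss if step < critic_warmup_steps else value_loss + policy_loss

    optimizer.zero_grad()
    loss.backward()
    # gradient clipping for RL stability (0.5 is an assumed max norm)
    torch.nn.utils.clip_grad_norm_(
        list(policy.parameters()) + list(critic.parameters()), max_norm=0.5
    )
    optimizer.step()
```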

Safety

Since the models are open source, Baichuan is very thorough about content safety, including:

  1. Hiring 10 professional reviewers to build 100+ safety categories

  2. Building 200K attack prompts with a 50-person annotation team

  3. Generating highly diverse responses to the attack prompts

Summary

Baichuan2 is a big improvement over the first version, with performance on reasoning tasks roughly doubled, and it is the open-source model trained on the largest Chinese corpus so far. Friends who have used it are welcome to share feedback on how it performs in the comments~



I am rumor, a punk and geeky AI algorithm girl.

Bachelor's degree from Beihang University, NLP algorithm engineer, Google Developer Expert

Welcome to follow me; I'll help you learn and keep grinding

Let’s spin, jump and blink together in the era of artificial intelligence

"Thanks for the open source, happy for free."a601e984dd2a051e76c757830cfbf896.png
