How to upgrade and design large models: ChatGLM, LLAMA, Baichuan and LLM structure analysis


This article provides a systematic perspective and sorts out the key elements of large-scale pre-training models by deeply analyzing the upgrade paths of ChatGLM, LLAMA and Baichuan models, and discussing the structure selection of large-scale language models. We hope that this knowledge can provide powerful reference and guidance for everyone to build more powerful, flexible and efficient large-scale pre-training models in actual projects.

4509cea1523e093f6f4b7ba906a84e3f.png

Introduction

At present, large language models have made significant breakthroughs in various fields, from ChatGLM, LLAMA to Baichuan, etc., they have demonstrated amazing performance in processing various natural language tasks. However, as research deepens and application requirements continue to expand, these large models need to be continuously upgraded and optimized to meet higher performance requirements and a wider range of application scenarios.

In this process, as researchers and practitioners, we need to explore in depth: What is the path to upgrading large models? What challenges were faced during the upgrade process? What means and methods are used to achieve the upgrade? This blog aims to conduct an in-depth discussion on this, sort out the upgrade process of models such as ChatGLM, LLAMA and Baichuan, analyze the reasons behind it, and show how to optimize and upgrade large models.

9a0818a9b62dce57d934b1a67e3d3abd.png

77a0fc316a5c9e5d5e8e6bb064e96b02.png

ChatGLM upgrade path

First, compare the results on major benchmarks before and after the ChatGLM upgrade. Compared with ChatGLM-6B, ChatGLM2-6B achieves roughly a 20-30% improvement on each benchmark:

  MMLU

be564ab15ce9c7ab81a20c031ce3ed9f.png

The Chat model is tested using the zero-shot CoT (Chain-of-Thought) method, and the Base model is tested using the few-shot answer-only method.

  C-Eval

88d7449429d44345a998213ddd0550c5.png

The Chat model is tested using the zero-shot CoT method, and the Base model is tested using the few-shot answer only method.

  GSM8K

1ed2a273fdc7a39db7caa89d7c4aab00.png

All models are tested using the few-shot CoT method. The CoT prompt comes from http://arxiv.org/abs/2201.11903

500 of the GSM8K questions and the CoT prompts were translated with a translation API and manually proofread.

  BBH

3ff9417cd83be0de905c3b3103fcb04c.png

All models are tested using the few-shot CoT method. The CoT prompt comes from https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/cot-prompts

  ChatGLM

ChatGLM-6B is an open-source dialogue language model that supports both Chinese and English. It is based on the General Language Model (GLM) architecture and has 6.2 billion parameters. Combined with model quantization, it can be deployed locally on consumer-grade graphics cards (a minimum of 6GB of video memory at the INT4 quantization level). ChatGLM-6B uses technology similar to ChatGPT and is optimized for Chinese question answering and dialogue. After bilingual Chinese-English training on about 1T tokens, supplemented by supervised fine-tuning, feedback bootstrapping, reinforcement learning from human feedback, and other techniques, the 6.2-billion-parameter ChatGLM-6B can already generate answers that are quite consistent with human preferences.

 General Language Model (GLM) architecture address: https://github.com/THUDM/GLM

6a7bb6fd74073bf102b122d94a628a6f.png

df4fc6a33fb2e95a5912d2eaad3ff4b6.png

For relevant analysis, see: https://zhuanlan.zhihu.com/p/627832567?spm=ata.21736010.0.0.1ee417b1JxcVsy

  ChatGLM2

ChatGLM2-6B is the second-generation version of the open source Chinese-English bilingual dialogue model ChatGLM-6B. While retaining many excellent features of the first-generation model such as smooth dialogue and low deployment threshold, ChatGLM2-6B introduces the following new features:

  1. More powerful performance: Based on the development experience of the first-generation ChatGLM model, we have comprehensively upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses the hybrid objective function of GLM, and has undergone pre-training on 1.4T Chinese and English tokens plus human preference alignment training. Evaluation results show that, compared with the first-generation model, ChatGLM2-6B's performance on datasets such as MMLU (+23%), C-Eval (+33%), GSM8K (+571%), and BBH (+60%) has improved greatly, making it highly competitive among open-source models of the same size.

  2. Longer context: Based on FlashAttention, we extended the context length of the base model from ChatGLM-6B's 2K to 32K, and used a context length of 8K for training during the dialogue stage. For even longer contexts, we release the ChatGLM2-6B-32K model. LongBench evaluation results show that ChatGLM2-6B-32K has a clear competitive advantage among open-source models of the same size.

  3. More efficient inference: Based on Multi-Query Attention, ChatGLM2-6B has faster inference and lower GPU memory usage: under the official implementation, inference is 42% faster than the first generation, and under INT4 quantization the dialogue length supported by 6GB of video memory increases from 1K to 8K.

  4. A more open license: ChatGLM2-6B weights are fully open to academic research, and free commercial use is also permitted after completing the registration questionnaire.

ChatGLM-6B address: https://github.com/THUDM/ChatGLM-6B

GLM address: https://github.com/THUDM/GLM

Evaluation result address: https://github.com/THUDM/ChatGLM2-6B#%E8%AF%84%E6%B5%8B%E7%BB%93%E6%9E%9C

FlashAttention address: https://github.com/Dao-AILab/flash-attention

LongBench address: https://github.com/THUDM/LongBench

Multi-Query Attention address: https://arxiv.org/abs/1911.02150

▐ Upgrade process

  • Model structure

Model structure change: ChatGLM2 returns from the Prefix-LM to a pure decoder-only structure, i.e., during SFT all content is generated autoregressively following a single gMASK token placed at the beginning;

The code comparison is as follows:

f3cf967c7c90bfe665208fd7d2a5f710.png

The diagram is as follows:

0205093523d3fc3a9b1ff64ff22b751d.png

ChatGLM2:

f7beffa8eed9d51a1e49f1fd3f99cf29.png

So what can this change bring?

The answer is to greatly improve the training efficiency of the model.

Image source: https://github.com/THUDM/ChatGLM2-6B/issues/16

25c22ba164b1ef310cceb71aad3ce765.png

When processing a multi-turn dialogue with three rounds Q1A1, Q2A2, Q3A3, a Prefix-LM needs to construct three samples:

  1. Q1->A1

  2. Q1A1Q2->A2

  3. Q1A1Q2A2Q3->A3

This way of constructing data causes serious data expansion, which hurts training efficiency.

In contrast, a decoder-only model can take advantage of the causal mask (each token can see the real input of all previous tokens) to fit multiple dialogue rounds into a single sample:

47296ba08a8d8b5966253e5fac285c7f.png

  1. Sample construction: Q1 A1 Q2 A2 Q3 A3

  2. Loss calculation: only the A1, A2, and A3 spans contribute to the loss (a minimal packing sketch follows below)
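As a minimal sketch (not ChatGLM2's actual preprocessing code; the tokenizer call and helper name are illustrative), the packing and loss masking described above can be implemented like this:

def build_multiturn_sample(turns, tokenizer, ignore_index=-100):
    """Pack a list of (question, answer) pairs into one decoder-only sample.

    The causal mask lets each answer attend to the full history, while the
    label mask (ignore_index) restricts the loss to answer tokens only.
    """
    input_ids, labels = [], []
    for question, answer in turns:
        q_ids = tokenizer.encode(question, add_special_tokens=False)
        a_ids = tokenizer.encode(answer, add_special_tokens=False)
        input_ids += q_ids + a_ids
        labels += [ignore_index] * len(q_ids) + a_ids   # loss only on A1/A2/A3
    return input_ids, labels

With -100 as the ignore index, loss functions such as PyTorch's CrossEntropyLoss skip the question tokens automatically.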

Taking a closer look, what is the essential difference between session-level training and split-round training?

1. Session-level training: one effect is that the equivalent batch size becomes larger (a single batch fits more samples), and all rounds generated from the same conversation fall within one batch.

  2. At the session level, the gradients from different rounds are averaged, whereas the split-round construction sums them. This is equivalent to a larger learning rate, and it also changes the weight distribution of tokens across rounds as well as the gradient-norm calculation.

Let us use a simplified example for quantitative analysis. Assume the two training samples are:

1. Question: A Answer: xx

2. Question: A Answer: xx Question: B Answer: xx Question: C Answer: xx

With session-level training, the combined gradient is (Ga + (Ga + Gb + Gc)/3)/2, so the weights on A, B, and C are 2/3, 1/6, and 1/6 respectively.

With split-round training, the combined gradient is (Ga + Ga + (Ga + Gb)/2 + (Ga + Gb + Gc)/3)/4, so the weights on A, B, and C are 17/24, 5/24, and 1/12 respectively.
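A quick check of this arithmetic with exact fractions (a throwaway sketch; the per-sample gradients are represented symbolically as dictionaries, following the formulas above):

from fractions import Fraction as Fr

def combine(*samples):
    """Average per-sample gradients, each given as {round: weight}."""
    total = {}
    for grads in samples:
        for name, w in grads.items():
            total[name] = total.get(name, Fr(0)) + Fr(w)
    return {name: w / len(samples) for name, w in total.items()}

# Session-level: sample 1 contributes Ga; sample 2 averages its three rounds.
session = combine({'A': 1}, {'A': Fr(1, 3), 'B': Fr(1, 3), 'C': Fr(1, 3)})
print(session)   # {'A': Fraction(2, 3), 'B': Fraction(1, 6), 'C': Fraction(1, 6)}

# Split rounds: four samples, each averaging its loss over the answers it contains.
split = combine({'A': 1}, {'A': 1},
                {'A': Fr(1, 2), 'B': Fr(1, 2)},
                {'A': Fr(1, 3), 'B': Fr(1, 3), 'C': Fr(1, 3)})
print(split)     # {'A': Fraction(17, 24), 'B': Fraction(5, 24), 'C': Fraction(1, 12)}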

Judging from the weight distribution above, session-level training gives the earlier rounds less weight than split-round training does. This is also reasonable, because in most scenarios the opening rounds are similar and repetitive.

  • Sequence length

Sequence length: The pre-trained model is trained at 32K length, and the SFT fine-tuned model is trained at 8K length;

7ee6f410e7ec6c443629165ad1caba03.png

In addition, on July 31, Zhipu AI released ChatGLM2-6B-32K, a large model optimized for long contexts that is fine-tuned based on ChatGLM2-6B and can better handle contexts up to 32K in length.

Previously, when ChatGLM2-6B was first released, it was officially stated that the model supports context input of up to 32K. However, LM-SYS testing showed that ChatGLM2-6B performed poorly once the length exceeded 8K; see "Evaluation and summary of large language models supporting ultra-long context input": ChatGLM2-6B did badly, and the strongest models are still the commercial GPT-3.5 and Claude-1.3 (Address: https://www.datalearner.com/blog/1051688222070709).

8a5fe989ea2a8775845e862e91ab408b.jpeg

Specifically, ChatGLM2-6B-32K updates the position encoding based on position interpolation and uses a 32K context length during dialogue-stage training. In actual use, the official recommendation is: if your context is basically within 8K, use ChatGLM2-6B; if you need to handle contexts longer than 8K, use ChatGLM2-6B-32K.

For an introduction to position interpolation, see the blog: RoPE rotation position coding in-depth analysis: theoretical derivation, code implementation, length extrapolation (Address: https://zhuanlan.zhihu.com/p/645263524)

ChatGLM2-6B address: https://www.datalearner.com/ai-models/pretrained-models/ChatGLM2-6B

  • Operator optimization

Operator optimization: Flash Attention and Multi-Query Attention improve the speed of training & inference;

edca9a39dbc625e4f473dcab44a76e08.png

This time, expanding the ChatGLM2-6B context from 2K to 32K also relied on a technology called FlashAttention. FlashAttention is a fast, memory-efficient, exact attention algorithm: it is IO-aware, tiling the attention computation into blocks that fit in fast on-chip SRAM so that the large intermediate attention matrices never have to be fully materialized in slow HBM. This greatly reduces memory usage and memory traffic while keeping the result identical to standard attention.

3146f0a9e92863b237d393e36bdeecc8.png

LLAMA upgrade path

First, compare the results on major benchmarks before and after the LLAMA upgrade. Compared with LLAMA, LLAMA2 achieves roughly a 10-30% improvement on each benchmark:

MMLU

c0b58065c102336c6fa3908cb77874c6.png

GSM8K

e18bda1809fe5adf5157d1fdbf819dda.png

  LLAMA

LLaMA (Large Language Model Meta AI) is an open and efficient foundation language model released by Meta AI, available in four sizes: 7B, 13B, 33B, and 65B. Its training data comes entirely from public datasets, with no proprietary data, which keeps the work open-source-compatible and reproducible. The entire training dataset contains approximately 1.4T tokens after tokenization.

In terms of performance, LLaMA is excellent: the 13-billion-parameter LLaMA model outperforms GPT-3 (175 billion parameters) "on most benchmarks" and can run on a single V100 GPU, while the largest 65-billion-parameter LLaMA model is comparable to Google's Chinchilla-70B and PaLM-540B.

Regarding the training set, LLaMA-65B and LLaMA-33B were trained on 1.4 trillion tokens, while the smallest model, LLaMA-7B, was trained on 1 trillion tokens.

Model structure:

  1. Pre-LayerNorm with RMSNorm (Root Mean Square Layer Normalization)

  2. RoPE rotary position encoding (replacing absolute/relative position encoding)

  3. SwiGLU activation function (replacing ReLU), from "GLU Variants Improve Transformer"

  LLAMA2

The introduction on the official page is as follows:

e578d58b0dc8909b78e7dacae91b7c74.png

In terms of model structure, there are two main upgrade points:

  1. Training tokens increased from 1.4T to 2T

  2. Sequence length increased from 2K to 4K

In the SFT stage, LLAMA2 emphasizes the importance of data quality, eliciting the model's instruction-following ability with roughly 20K high-quality instruction examples.

In the RLHF stage, LLAMA2 does more work and explains the RLHF process in greater detail: Meta built a preference dataset of over a million comparisons and trained two independent reward models.

The entire LLAMA2 paper is interpreted as follows:

3b362dafd795e94ce006068d8929a836.png

The training process of the LLAMA2-Chat model is as shown below, which mainly includes three steps: pre-training, SFT, and RLHF:

41455e483b70821b9a639c3cab606a41.png

  • Pre-training

Major improvements in LLAMA2 include more powerful data cleaning, updated data combinations, a 40% increase in total training tokens, doubling the context length, and the use of grouped query attention (GQA) to improve inference scalability for larger models.

c528c28b5a113623f59ad8e396480ada.png

Model structure:

  1. RMSNorm

  2. SwiGLU

  3. RoPE

  4. 4K sequence length

  5. Grouped-query attention (GQA) for the larger models (34B/70B)

  • SFT

The authors found that many third-party SFT datasets were lacking in diversity and quality, so they focused on collecting their own high-quality SFT data.

They observed that using fewer but higher quality examples from their own vendor-based annotation efforts significantly improved results compared to using millions of examples from third-party datasets. They found that tens of thousands of SFT annotations were sufficient to achieve high-quality results, with a total of 27,540 annotations collected.

  • RLHF

We focus on three core steps: human preference data collection, the reward model, and iterative training.

Human preference data collection

c07f4d0519317feb46296193b60d5986.png

8fa9422d1dbefa8e013a0486cbffc079.png

The preference data is shown in Table 6, which includes Meta's self-built dataset of about 1.4 million comparisons. Compared with the open-source datasets, the self-built data has more turns and longer conversations.

Reward model

LLAMA2 trains two independent reward models (Helpfulness RM/Safety RM).

Motivation: research has found (Bai et al., 2022a) that there is sometimes a trade-off between helpfulness and safety, which makes it hard for a single reward model to perform well on both.

To solve this problem, the authors trained two independent reward models, one optimized for helpfulness (the Helpfulness RM) and the other optimized for safety (the Safety RM). This achieves better results on helpfulness and safety respectively, making Llama 2-Chat better aligned with human preferences during reinforcement learning from human feedback (RLHF) and improving the helpfulness and safety of the generated answers.

Loss function

1862f1d55aa85e5113e14c2c59fa62de.png

The margin m(r) is a discrete function of the preference rating. The authors use a larger margin for pairs whose responses differ more and a smaller margin for pairs with similar responses (as shown in Table 27). They found that this margin component improves the accuracy of the helpfulness reward model, especially on samples where the gap between the two responses is larger.

feb0327a385990e1232302cf234a16a5.png
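A minimal sketch of this ranking loss with margin (assuming scalar reward scores per response; the function and variable names are illustrative, not Meta's code):

import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_scores, rejected_scores, margin):
    """L = -log(sigmoid(r_chosen - r_rejected - m(r))), averaged over the batch."""
    return -F.logsigmoid(chosen_scores - rejected_scores - margin).mean()

# Example: two preference pairs, the second with a larger margin (clearer gap).
chosen = torch.tensor([1.2, 0.8])
rejected = torch.tensor([0.5, -0.3])
margin = torch.tensor([0.0, 1.0])
print(reward_ranking_loss(chosen, rejected, margin))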

Iterative training

LLAMA2 uses two reinforcement learning algorithms: PPO and rejection sampling algorithm.

The main differences between these two reinforcement learning algorithms are:

  1. Breadth: in rejection sampling, the model explores K samples for a given prompt, while in PPO there is only one generation.

  2. Depth: in PPO, the sample at training step t is a function of the model policy after the gradient update at step t-1. In rejection sampling fine-tuning, all outputs are sampled under the model's initial policy to collect a new dataset, which is then fine-tuned on in an SFT-like way. However, because iterative model updates are applied, the essential difference between the two algorithms is not that pronounced.

Up to RLHF (V4), LLAMA2 used only rejection sampling fine-tuning. After that, the two methods were combined sequentially: PPO is applied on top of the rejection-sampling checkpoint before sampling again. LLAMA2 performs rejection sampling only with the largest 70B Llama 2-Chat model; the smaller models are fine-tuned on data rejection-sampled from the large model, thereby transferring the capability of the large model to the smaller ones.

e7568e4ea59b36e7c95289879addaf6c.png

6f860507ab6c17f003d16826b6ed0b90.png

Baichuan upgrade path

First, compare the results on major benchmarks before and after the upgrade. Compared with the Baichuan-7B model, Baichuan-13B achieves nearly a 20% improvement on each benchmark:

C-Eval (Address: https://cevalbenchmark.com/index.html?spm=ata.21736010.0.0.1ee417b1JxcVsy#home)

360eb9374b66a5e9671953e16085eefc.png

MMLU (Address: https://arxiv.org/abs/2009.03300)

d2a010da2fe0e97c4565e11be882bd46.png

Note: The official evaluation plan of MMLU is adopted.

CMMLU

27d5fd3f565013ba7e753cc47fb6bbf7.png

Description: CMMLU is a comprehensive Chinese evaluation benchmark specifically used to evaluate the knowledge and reasoning capabilities of language models in the Chinese context. Its official evaluation plan is adopted.

  Baichuan-7B

Baichuan-7B is an open-source, commercially usable large-scale pre-trained language model developed by Baichuan Intelligence. Based on the Transformer architecture, this 7-billion-parameter model was trained on approximately 1.2 trillion tokens, supports both Chinese and English, and has a context window length of 4096. It achieves the best results among models of the same size on standard Chinese and English benchmarks (C-Eval/MMLU).

The Baichuan model structure is similar to LLAMA, and the following optimizations have been made:

  • Tokenizer

Following common academic practice, Byte-Pair Encoding (BPE) from SentencePiece is used as the tokenization algorithm, with the following optimizations:

  1. Most open-source models are optimized mainly for English, which makes them inefficient on Chinese corpora. We train the tokenizer on a multilingual corpus of 20 million entries, mainly Chinese and English, significantly improving the compression rate for Chinese.

  2. For mathematics, we follow the approach of LLaMA and Galactica and split numbers into individual digits, avoiding inconsistent number tokenization; this is very helpful for mathematical ability.

  3. For rare characters (such as special symbols), UTF-8 byte encoding is supported as a fallback, achieving full coverage of unknown characters.

  4. We analyzed the compression rate of different tokenizers on the corpus, as shown in the table below. Our tokenizer is clearly better than open-source models such as LLaMA and Falcon, and compared with other Chinese tokenizers of comparable compression rate, it offers higher training and inference efficiency (a tokenizer-training sketch follows after the table).

2cc6dbc0f0827ae62527fb55586a04a9.png
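As a rough sketch of how the digit-splitting and UTF-8 byte-fallback behavior described above can be configured when training a SentencePiece BPE model (the corpus path and vocabulary size are placeholders, not Baichuan's actual settings):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder: mixed Chinese/English corpus
    model_prefix="bilingual_bpe",
    model_type="bpe",
    vocab_size=64000,            # placeholder vocabulary size
    split_digits=True,           # every digit becomes its own token (helps math)
    byte_fallback=True,          # rare characters fall back to UTF-8 bytes
    character_coverage=0.9995,
)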

  • Operator optimization

Use a more efficient operator: Flash-Attention, the same as ChatGLM2

  Baichuan-13B

Baichuan-13B is an open-source, commercially usable large language model with 13 billion parameters, developed by Baichuan Intelligence after Baichuan-7B. It achieves the best results among models of the same size on authoritative Chinese and English benchmarks. This release includes two versions: pre-trained (Baichuan-13B-Base) and aligned (Baichuan-13B-Chat). Baichuan-13B has the following characteristics:

  1. Larger size, more data: Baichuan-13B further expands the parameter count to 13 billion on the basis of Baichuan-7B, and was trained on 1.4 trillion tokens of high-quality corpus, 40% more than LLaMA-13B; it is currently the open-source 13B-scale model trained on the most data. It supports both Chinese and English, uses ALiBi position encoding, and has a context window length of 4096.

  2. Pre-trained and aligned models released together: the pre-trained model is a "base" for developers, while most ordinary users need an aligned model with dialogue ability. Therefore this release also includes the aligned model (Baichuan-13B-Chat), which has strong dialogue capability, works out of the box, and can be deployed with just a few lines of code.

  3. More efficient inference: to support a wider range of users, INT8 and INT4 quantized versions are also open-sourced. Compared with the non-quantized version, they greatly lower the hardware threshold for deployment with almost no loss in quality, and can be deployed on consumer-grade graphics cards such as the NVIDIA 3090.

  4. Open source, free for commercial use: Baichuan-13B is not only fully open to academic research; developers can also use it commercially for free after applying by email and obtaining the official commercial license.

Model details

f0604622e1ae2d31e00990c1d8e819a9.png

▐ Upgrade process

  1. Parameter count: Baichuan-13B roughly doubles the number of parameters compared with Baichuan-7B; a larger parameter count means more capacity for knowledge. Combined with more training data (1.2T -> 1.4T tokens), the general knowledge of the base model is improved;

  2. Position encoding: changed from RoPE to ALiBi, which allows a certain degree of length extrapolation (tip: with methods such as position interpolation, RoPE can extrapolate over a longer range);

3faa3f14f742f5082154b97844ccd553.png

How to build a good base model?

After an in-depth discussion on the upgrade of ChatGLM, LLAMA, and Baichuan large language models, we will further expand the scope of discussion and explore the key capabilities required for large models, the technical means required to realize these capabilities, and the design method of the model structure. This will provide us with a powerful reference and guidance for building and optimizing large models in practical applications.

The following sections discuss three aspects. First, we analyze the core capabilities required of large pre-trained models, such as length extrapolation and general knowledge; second, we introduce the advanced technologies and methods used to achieve these capabilities, including pre-training strategies, optimization algorithms, and loss functions; finally, we discuss model structure and analyze how to choose an appropriate LLM (Large Language Model) structure to achieve a high-performance large model.

The purpose of this section is to provide you with a comprehensive perspective and understand the key elements of large models so that you can build more powerful, flexible and efficient large-scale pre-training models in actual projects.

▐ Capabilities required of large models and how to upgrade them

Through the analysis of the upgrade process of large language models such as ChatGLM, LLAMA, and Baichuan, it can be found that their improvements are mainly focused on the improvement of basic knowledge capabilities and supported sequence length changes. In this section, we will focus on sorting out and summarizing the upgrade strategies for these two key capabilities.

  • Basic knowledge

The improvement of basic knowledge capabilities covers many areas, and we can understand these areas through the following commonly used evaluation sets:

  1. English knowledge—MMLU

  2. Chinese Knowledge—C-Eval

  3. Reasoning — GSM8k/BBH

  4. Code — HumanEval/MBPP

  5. Mathematics—MATH

The author believes that the main strategy for upgrading basic knowledge capabilities is to increase the amount of model parameters and training data, so that the model can better fit knowledge in related fields through a larger amount of parameters and data.

In this process, the most important thing is the quality of the training data. The following are common ways to clean the data:

  • Invalid data, dirty data filtering

Some data is invalid, such as meaningless or templated text (HTML boilerplate, Lorem ipsum, etc.). Even for multilingual corpora, extracting clean text from websites for language modeling is extremely challenging, but it must be done: next-token-prediction (NTP) training means the data itself needs to be a good reflection of the real language world. Data-cleaning tools such as justext and trafilatura can effectively remove HTML template text while striking a balance between reducing noise (improving precision) and retaining all valid content (improving recall). Another effective way to handle invalid data in web corpora is to filter by metadata. For example, when OpenAI built the WebText corpus for GPT-2, it crawled only outbound links from Reddit posts with at least 3 karma. This heuristic helps reduce noise in the dataset while ensuring data quality.

  • Document length filter

On the one hand, considering NTP (next-token prediction), removing very short documents (texts with fewer than roughly 100 tokens) from the corpus helps the model capture dependencies in continuous text and removes noise. On the other hand, since most language models today are based on the Transformer architecture, it is useful to pre-split very long documents into contiguous segments of the required length.

  • Machine-generated data filtering

One goal of training a language model is to capture the distribution of human language, but web-crawled datasets contain a large amount of machine-generated text, such as text generated by existing language models, OCR output, and machine-translated text. For example, data from http://patents.google.com makes up a substantial share of the C4 corpus; that site uses machine translation to translate patents from patent offices around the world into English. In addition, web corpora contain OCR-generated text from scanned books and documents. OCR systems are imperfect, so their output follows a different distribution from natural English (they often make predictable errors such as misspellings and entirely missing words); this matters and is hard to handle, and scanned PDFs are a particular headache. Although machine-generated text is difficult to identify, tools such as ctrl-detector can be used to detect it. When preprocessing a corpus for language modeling, it is important to characterize and document the presence of machine-generated text in the corpus.

  • Remove duplicates

Datasets created by scraping raw text from the Internet often contain the same sequence repeated many times. For example, in the paper "Deduplicating Training Data Makes Language Models Better", the authors found that in the C4 dataset one 50-word sequence is repeated 60,000 times. Training on a deduplicated dataset is faster and makes the model less prone to memorization, which is undesirable. Recently, researchers have also shown that language models trained on repeated data are vulnerable to privacy attacks, in which an adversary generates sequences from the trained model and detects which ones were memorized from the training set. In "Deduplicating Training Data Mitigates Privacy Risks in Language Models", the authors show that the rate at which a language model regenerates a training sequence grows super-linearly with the number of times the sequence appears in the training set; for example, a sequence that appears 10 times in the training data is on average generated 1,000 times more often than a sequence that appears only once. Deduplication can be performed at different granularities, from exact-match deduplication to fuzzy deduplication, and tools such as deduplicate-text-datasets and datasketch can help reduce and remove redundant text from the corpus. As many researchers have pointed out, the deduplication process requires substantial computing resources (CPU and RAM) because of the size of web-crawled datasets, so it is recommended to run such computations in a distributed environment.
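For illustration, a small fuzzy-deduplication sketch using datasketch's MinHash/MinHashLSH (the threshold and word-level shingling are simplistic placeholders; production pipelines shard this work across machines):

from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)   # Jaccard threshold for near-duplicates
corpus = {"doc1": "the quick brown fox jumps over the lazy dog",
          "doc2": "the quick brown fox jumped over the lazy dog"}

for key, text in corpus.items():
    m = minhash(text)
    duplicates = lsh.query(m)                   # keys of already-seen near-duplicates
    if duplicates:
        print(f"dropping {key}, near-duplicate of {duplicates}")
    else:
        lsh.insert(key, m)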

  • Clean contaminated data

This part is somewhat controversial; there are no very detailed standards yet, and many organizations are fairly pragmatic about it. In NLP, data cleaning in this sense mainly refers to separating training data from test data. For large language models this is challenging, because both the training and test data originate from the Internet, and it is hard to guarantee that they do not overlap. Evaluation of large language models usually relies on benchmark data, such as question-answer pairs; if these appear in the training data, benchmark performance will be overestimated. A decontamination step is therefore needed, removing the parts of the training data that overlap with benchmark datasets so the evaluation remains meaningful. When researchers at OpenAI created the WebText dataset, they decontaminated it by excluding all Wikipedia content, because Wikipedia data was widely used in their benchmark datasets. Another example is researchers at EleutherAI, who developed the lm-eval harness package to decontaminate benchmark datasets. In practice, we need to pay attention to two types of data contamination:

  1. Input and output contamination: In this case, data with the same label as the downstream task exists in the pre-training corpus. For tasks such as language modeling, the task label is the target text. If the target text appears in the pre-training corpus, the model may tend to copy the text rather than actually solving the task.

  2. Input contamination: This refers to situations where labels are not included in the evaluation samples, which can also lead to overestimation of performance on downstream tasks. When conducting zero-shot and few-shot evaluations, if there is data overlapping with popular benchmark tasks in the pre-training data set, we must pay attention to data decontamination.

  • Toxicity and Bias Control

Despite their rich diversity, online corpora are often rife with toxic and biased content. For example, in "RealToxicityPrompts", the authors used the Perspective API to show that 2.1% of OpenWebText and 4.3% of WebText have toxicity scores above 50%. Therefore, when training language models you must be vigilant and use tools such as the Perspective API to filter toxic content from the pre-training data, to prevent the model from exhibiting bias or generating harmful content in downstream applications. One solution is to filter out text containing words from a "bad words" list, as the authors of C4 did. Another example is the PILE dataset, whose researchers used spamscanner to classify harmful content. However, such filtering must be done with extreme caution and with downstream applications in mind, lest the filter disproportionately retain voices aligned with hegemonic viewpoints. A thorough analysis of derogatory content and gender/religious bias is necessary before using the data to pre-train a language model.

  • Control of Personally Identifiable Information

When collecting large data sets, it is critical to understand the legal issues associated with the data set instances, especially when dealing with personally identifiable information (PII) such as real names, organization names, medical records, Social Security numbers, etc. Depending on the application, masking or deleting this information is necessary before pre-training the language model. Tools like presidio and pii-codex provide processes for detecting, analyzing, and processing personally identifiable information in text data. These tools can help ensure that personal information in data sets is handled appropriately to comply with relevant privacy regulations and protect user privacy.
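A toy sketch of the masking idea only (the regex patterns are simplified placeholders; in practice, dedicated tools such as presidio or pii-codex are far more robust):

import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3,4}[-.\s]?\d{4}\b"),
}

def mask_pii(text):
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Contact Alice at alice@example.com or 555-123-4567."))
# Contact Alice at [EMAIL] or [PHONE].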

  • Sequence length

The sequence length supported by large language models is mainly affected by two aspects:

  1. Maximum length of training phase

  2. Model length extrapolability

The first is the maximum length used in the training stage: distributed training strategies such as DeepSpeed can reduce the model's memory footprint and thus increase the training sequence length.

The second is the model's length extrapolation, which is achieved through the design of the position encoding; see the model structure design section for implementation methods.

▐ Model structure design

After sorting out the key capabilities required for large language models and the corresponding upgrade strategies, this section will focus on the design method of large model structures. We’ll dive into how to build efficient and powerful large pre-trained models.

d1708d1f2848f7a02e11d15949725a50.png

  • Tokenizer

Referring to the tokenizer design described for Baichuan above, the tokenizer needs to handle complex Chinese and English text well.

  1. Most open-source models are optimized mainly for English, which makes them inefficient on Chinese corpora. Baichuan trains its tokenizer on a multilingual corpus of 20 million entries, mainly Chinese and English, significantly improving the compression rate for Chinese.

  2. For mathematics, it follows the approach of LLaMA and Galactica and splits numbers into individual digits, avoiding inconsistent number tokenization; this is very helpful for mathematical ability.

  3. For rare characters (such as special symbols), UTF-8 byte encoding is supported as a fallback, achieving full coverage of unknown characters.

  4. Comparing the compression rate of different tokenizers on the corpus (see the table below), Baichuan's tokenizer is clearly better than open-source models such as LLaMA and Falcon, and compared with other Chinese tokenizers of comparable compression rate, it offers higher training and inference efficiency.

5d2bafc2cc25d278cee35256b685da3c.png

  • LayerNorm

LayerNorm placement comes in two flavors: Pre-LN and Post-LN. Studies have found that Post-LN is unstable during training, so current large models almost all adopt the Pre-LN scheme.

2db4a27d35ca138ddc1a269efd6bdea3.png

LayerNorm calculation method

First calculate the mean and variance:

6d758643ab7bde1375562d0eba7147d1.png

Then calculate the normalization:

5401bfd19b56845427e3d7571850a766.png

where γ is a learnable scale parameter, initialized to 1.

RMSNorm calculation method

RMSNorm assumes the mean is 0 and normalizes only by the root mean square; training is faster and the difference in quality is small.

9e15ac1d2bccf55b711a51011b8e479f.png
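A minimal PyTorch sketch of RMSNorm matching the formula above (an epsilon is added for numerical stability; the learnable weight plays the role of LayerNorm's γ):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))   # scale, initialized to 1
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms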

  • MLP

The MLP subsection mainly involves the selection of activation functions.

ReLU

ReLU is a very popular activation function, and its mathematical expression is as follows:

aa6be89d14f8198a6a88e1d629af1a89.png

2d9a7e225c345a14d5d3ee14da53a692.png


GELU

The mathematical expression of Gaussian Error Linear Units (GELUS) is as follows:

dfa371f45f5713e2d4c2ee752b953cab.png

GELU can also be approximated using the following equation:

67e59a407e02cc5c320b15cded074c51.png

GELU multiplies the input by a mask of 0s and 1s, where the mask is generated stochastically with a probability that depends on the input itself. Assuming the input is x, as x decreases the input is more likely to be dropped, so the activation transformation depends randomly on the input value.

53222df9d90caee43674c616995215ec.png

The GeLU code in Bert is as follows:

import tensorflow as tf

def gelu(input_tensor):
    # exact GELU: x * Phi(x), where Phi is the standard normal CDF
    cdf = 0.5 * (1.0 + tf.math.erf(input_tensor / tf.sqrt(2.0)))
    return input_tensor * cdf

SwiGLU&GeGLU

SwiGLU and GeGLU are both activation-function variants explored in Noam Shazeer's paper "GLU Variants Improve Transformer".

Specifically, you first need to understand the Gated Linear Unit (GLU), whose basic form is:

e98d284a499d98b25e5c2384ab01b25f.png

where ⊗ denotes element-wise multiplication. SwiGLU and GeGLU are variants of GLU, defined as follows:

731d64ad2e96ac8eb8d56eac4bb120a7.png

in:

f4d99775b866cd85b3772eacb5e6a82a.png

efdc4bc4f1c6a1522e372b539ad7806b.png

c18b3bd4c1d06bf2d90a290d1fdc5296.png

The author does not spend much time on the principle or motivation behind these activation functions; the paper itself is an empirical comparison of various activation variants. The results show that SwiGLU and GeGLU achieve the lowest error, and they have since been widely adopted in large models (a SwiGLU sketch follows below).
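As referenced above, a minimal LLaMA-style SwiGLU feed-forward sketch (bias-free linear layers, with Swish(x) = x·sigmoid(x) as the gate; the layer names are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # SwiGLU(x) = (Swish(x W_gate) * x W_up) W_down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))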

  • Attention

The Attention layer mainly optimizes the Attention operator to accelerate model reasoning and deployment.

FlashAttention

For detailed introduction, please see: https://zhuanlan.zhihu.com/p/626079753

c6f3483066a511795d531cf566bb8684.png

Motivation: When the input sequence (sequence length) is long, the calculation process of Transformer is slow and memory-consuming. This is because the time and memory complexity of self-attention will grow quadratically as the sequence length increases.

The intermediate results S and P of standard attention (see below) usually have to be read from and written to high-bandwidth memory (HBM), and both require O(N^2) memory. This article analyzes:

  1. FlashAttention: the number of HBM accesses is O(N^2 * d^2 / M), where M is the SRAM size

  2. Standard attention: the number of HBM accesses is O(N * d + N^2)

Since d^2 / M is typically much smaller than 1 (for example, N=1024, d=64 in GPT-2, with on-chip SRAM on the order of 100KB), FlashAttention accesses HBM far less often and is therefore much faster. The following figure shows the GFLOPs, HBM accesses, and runtime comparison of forward + backward between the two on GPT-2 (A100 GPU):

158fe4e33c2632bcf8d2cf671b265033.jpeg

The storage units in the GPU mainly include HBM and SRAM: HBM has a large capacity but slow access speed, while SRAM has a small capacity but a high access speed. For example: A100 GPU has 40-80GB HBM, with a bandwidth of 1.5-2.0TB/s; each of the 108 streaming multi-core processors has 192KB of on-chip SRAM, and the bandwidth is estimated to be about 19TB/s. As can be seen, on-chip SRAM is an order of magnitude faster than HBM, but is many orders of magnitude smaller.

In summary, the purpose of FlashAttention is not to save FLOPs but to reduce accesses to HBM. Crucially, FlashAttention produces exactly the same results as standard attention during training and inference, so it is transparent to users, which other acceleration methods cannot guarantee (a usage sketch follows below).
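A hedged usage sketch: recent PyTorch versions (2.x) expose fused attention kernels, including a FlashAttention backend on supported GPUs, behind F.scaled_dot_product_attention, and the numerical result matches standard attention:

import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim), e.g. N=1024, d=64 as in the GPT-2 example above
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # may dispatch to a FlashAttention kernel
print(out.shape)   # torch.Size([1, 8, 1024, 64])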

Multi Query Attention

Paper address: https://arxiv.org/abs/1911.02150

MQA is a new Attention mechanism proposed in 2019, which can speed up the decoder to generate tokens while ensuring the effect of the model.

be47f30950083aed3246062901172862.png

As can be seen from the above chart, MQA's speed increase on the encoder is not very obvious, but the speed increase on the decoder is very significant.

a07f4d028d6966f286d2d8c3ac460f58.jpeg

As explained in the paper, MQA lets all heads share the same key and value matrices, with each head keeping only its own query parameters, which greatly reduces the number of parameters in the key and value projections.

In other words, MQA factors the key and value projections out of the per-head parameters and stores them as shared parameters, while the query remains per-head: each head has its own unique query parameters.

Code:

The implementation is simple: instead of projecting the key and value to n_heads separate heads, each is projected to a single head_dim, so the fused QKV projection outputs d_model + 2 * head_dim instead of 3 * d_model.

# Multi-Head Attention
self.Wqkv = nn.Linear(                        # [key point] how Multi-Head Attention builds the projection
    self.d_model,
    3 * self.d_model,                         # query, key, value -> 3 matrices, hence 3 * d_model
    device=device
)

query, key, value = qkv.chunk(                # [key point] each tensor is (1, 512, 768)
    3,
    dim=2
)


# Multi-Query Attention
self.Wqkv = nn.Linear(                                # [key point] how Multi-Query Attention builds the projection
    d_model,
    d_model + 2 * self.head_dim,                      # only the query keeps per-head vectors (1 * d_model);
    device=device,                                    # key and value no longer have separate head vectors
)

query, key, value = qkv.split(                        # query -> (1, 512, 768)
    [self.d_model, self.head_dim, self.head_dim],     # key   -> (1, 512, 96)
    dim=2                                             # value -> (1, 512, 96)
)

That is, the dimensions of K and V are converted from d_model to self.head_dim

In MQA, the query still has 8 heads, while the key and value vectors have only one shared "head".

This matches the paper's statement that all heads share a single set of key and value parameters.

The remaining question is how this single shared head serves all 8 query heads. The code relies on broadcasting in matmul: every query head is multiplied by the same key/value tensor, which realizes the parameter sharing:

import math
import torch
from einops import rearrange

def scaled_multihead_dot_product_attention(
        query,                 # (b, s, d_model)
        key,                   # (b, s, d_model), or (b, s, head_dim) if multiquery
        value,
        n_heads,
        multiquery=False,
):
    q = rearrange(query, 'b s (h d) -> b h s d', h=n_heads)        # (1, 512, 768) -> (1, 8, 512, 96)
    kv_n_heads = 1 if multiquery else n_heads
    k = rearrange(key, 'b s (h d) -> b h d s', h=kv_n_heads)       # (1, 512, 768) -> (1, 8, 96, 512) if not multiquery
                                                                    # (1, 512, 96)  -> (1, 1, 96, 512) if multiquery
    v = rearrange(value, 'b s (h d) -> b h s d', h=kv_n_heads)     # (1, 512, 768) -> (1, 8, 512, 96) if not multiquery
                                                                    # (1, 512, 96)  -> (1, 1, 512, 96) if multiquery

    softmax_scale = 1.0 / math.sqrt(q.size(-1))                    # scale by 1/sqrt(head_dim)
    attn_weight = q.matmul(k) * softmax_scale                      # (1, 8, 512, 512)
    attn_weight = torch.softmax(attn_weight, dim=-1)               # (1, 8, 512, 512)

    out = attn_weight.matmul(v)                                    # (1, 8, 512, 512) x (1, 1, 512, 96) = (1, 8, 512, 96)
    out = rearrange(out, 'b h s d -> b s (h d)')                   # (1, 512, 768)

    return out, attn_weight
  • Position encoding

Here we introduce RoPE and ALiBi, the position encodings most commonly used in large models. In terms of choice, RoPE is generally preferred, since longer length extrapolation can be achieved through position interpolation and related methods.

RoPE

For details, see "In-depth Analysis of RoPE Rotation Position Coding: Theoretical Derivation, Code Implementation, and Length Extrapolation" (Address: https://zhuanlan.zhihu.com/p/645263524). The key conclusions are given here:

Method to realize:

6923593f1034005edf32fab29ce52dc6.png

where ⊗ denotes element-wise multiplication.

Advantages: Incorporate relative position information through absolute encoding

Length extrapolation: position interpolation and adjusting the rotary base allow near-lossless length extrapolation.
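A minimal RoPE application sketch (pairing adjacent even/odd dimensions; conventions differ slightly across implementations, so treat this as illustrative rather than any particular model's code):

import torch

def apply_rope(x, base=10000):
    """x: (batch, seq_len, n_heads, head_dim), head_dim must be even."""
    b, s, h, d = x.shape
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)     # (d/2,) frequencies
    angles = torch.outer(torch.arange(s, dtype=torch.float32), theta)     # (s, d/2) position * frequency
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin      # rotate each (even, odd) pair by its angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out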

ALiBi

2bf1ee1dcf28ed51b444013caf3dc715.png

Method to realize:

The approach of this paper is to add no position embeddings at all, and instead add a static, non-learned bias to the attention scores, as shown above:

e576e6c0e322b4cd178947186f311776.png

On top of the query-key dot product, a constant negative bias is added: the position one step before the current token gets -1, the position two steps before gets -2, and so on. These constants are multiplied by a head-specific slope m; for an n-head attention model, m follows a fixed geometric sequence starting at 2^(-8/n) (a slope-computation sketch follows below).

For example, for an 8-head attention model, the m sequence is: 1/2^1, 1/2^2, ..., 1/2^8.

For a 16-head attention model, the m sequence is: 1/2^0.5, 1/2^1, 1/2^1.5, ..., 1/2^8.
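A small sketch of the standard slope schedule for a power-of-two number of heads (the ALiBi paper interleaves two such sequences when the head count is not a power of two):

def alibi_slopes(n_heads):
    """Geometric sequence starting at 2**(-8 / n_heads); assumes n_heads is a power of two."""
    start = 2 ** (-8.0 / n_heads)
    return [start ** (i + 1) for i in range(n_heads)]

print(alibi_slopes(8))    # 1/2, 1/4, ..., 1/256
print(alibi_slopes(16))   # 2**-0.5, 2**-1, ..., 2**-8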

Advantage:

  1. Reduces the number of Embeddings required for training and speeds up training

  2. Compared with original position encoding, it has better length extrapolation.

▐ Training data & parameter count

For details, see "LLM Training Guide: Token and Model Parameter Preparation" (Address: https://zhuanlan.zhihu.com/p/636812912). The key conclusions are given here. When the model calculation amount increases, the amount of training data and parameters will increase. Should maintain a year-on-year increase:

0b7d377352d10bf1376eaac5ad84e302.png

6d6b7cf976c0542333754a7ddd5b3649.png

Summary

After in-depth discussions on the upgrade of ChatGLM, LLAMA and Baichuan large language models, as well as a comprehensive analysis of LLM structure selection, we can draw the following conclusions:

  1. The upgrade process of large-scale pre-training models is mainly reflected in the improvement of basic knowledge capabilities and supported sequence length changes. By increasing the amount of model parameters and optimizing the quality of training data, the model can better fit knowledge in various fields and further improve model performance; by increasing the training length and adjusting positional encoding extrapolation, longer sequences are supported.

  2. In terms of model structure design, choosing an appropriate LLM structure is crucial to achieving a high-performance large-scale pre-trained model. Introducing an appropriate LayerNorm and activation function improves training stability; introducing efficient operators such as Flash Attention and Multi-Query Attention significantly improves computational efficiency while maintaining model performance; and introducing RoPE or ALiBi position encoding improves the model's length extrapolation.

  3. When building and optimizing large-scale pre-training models, we should not only pay attention to the performance and computing efficiency of the model, but also pay attention to issues such as data quality, deduplication, decontamination, toxicity and bias control, and personal information protection. This will help make the model more secure, robust and reliable in practical applications.

In short, this article provides a systematic perspective and sorts out the key elements of large-scale pre-training models by deeply analyzing the upgrade paths of ChatGLM, LLAMA and Baichuan models, and discussing the structure selection of large-scale language models. We hope that this knowledge can provide powerful reference and guidance for everyone to build more powerful, flexible and efficient large-scale pre-training models in actual projects.


Reference link

  1. Tricks for fine-tuning sample construction of large models (Address: https://zhuanlan.zhihu.com/p/636812912)

  2. https://github.com/facebookresearch/llama

  3. https://github.com/baichuan-inc/Baichuan-7B

  4. https://github.com/THUDM/ChatGLM2-6B/tree/main

  5. https://arxiv.org/pdf/2002.05202.pdf

  6. https://zhuanlan.zhihu.com/p/634236135

  7. https://zhuanlan.zhihu.com/p/626079753


Team introduction

We are the recommendation algorithm team of the Intelligent Strategy Team of the FC Technology Department under Taotian Group. We are mainly responsible for the research and development and optimization of mobile Tmall recommendation and advertising algorithms, providing users with more accurate recommendation services and improving user experience and satisfaction. In addition, the team is also committed to innovative applications of AI technology, such as intelligent shopping guides, and actively explores innovative business practices.
