[Natural Language Processing] [Large Model] GLM-130B: An open-source bilingual pre-trained language model

GLM-130B: An open-source bilingual pre-trained language model
《GLM-130B: An open bilingual pre-trained model》

Paper: https://arxiv.org/pdf/2210.02414.pdf

Related Blogs
[Natural Language Processing] [Large Model] ChatGLM-6B Model Structure Code Analysis (Standalone Version)
[Natural Language Processing] [Large Model] LaMDA: A Language Model for Conversational Applications
[Natural Language Processing] [Large Model] DeepMind's large model Gopher
[Natural Language Processing] [Large Model] Chinchilla: a large language model with optimal training and computing utilization
[Natural Language Processing] [Large Model] Inference tooling test of the large language model BLOOM
[Natural Language Processing] [Large Model] GLM-130B: An open-source bilingual pre-trained language model
[Natural Language Processing] [Large Model] Introduction to 8-bit matrix multiplication for large Transformers
[Natural Language Processing] [Large Model] BLOOM: A 176B-parameter open-access multilingual language model
[Natural Language Processing] [Large Model] PaLM: A large language model based on Pathways
[Natural Language Processing] [chatGPT series] Large language models can improve themselves
[Natural Language Processing] [ChatGPT Series] WebGPT: Browser-assisted question answering with human feedback
[Natural Language Processing] [ChatGPT Series] FLAN: Fine-tuned language models are zero-shot learners
[Natural Language Processing] [ChatGPT Series] Where does the intelligence of ChatGPT come from?
[Natural Language Processing] [ChatGPT Series] Emergence of Large Models

1. Introduction

Large language models (LLMs), especially those with more than 100B parameters, exhibit attractive scaling laws, in which zero-shot and few-shot capabilities suddenly emerge. GPT-3, with 175B parameters, was the first to study LLMs at the 100B scale: using 32 labeled examples, it can significantly outperform the fully supervised BERT-Large model on a variety of benchmarks. However, GPT-3 itself and how it was trained are still not publicly available. Training a high-quality LLM at this scale and sharing the model and training process with everyone is therefore very valuable.

Our goal is to pre-train an open-source and highly accurate 100B-scale model. In the process, we gradually realized that, compared with training a 10B model, training a dense LLM of more than 100B parameters faces many unexpected technical and engineering challenges, such as pre-training efficiency, stability, and convergence. Similar difficulties also occurred in the training of OPT-175B and BLOOM-176B, further demonstrating the importance of GPT-3 as a pioneering study.

In this paper, we introduce the pre-training of the 100B-scale model GLM-130B, including the engineering efforts, model design choices, training strategies for efficiency and stability, and quantization for reducing inference costs. Because it is widely recognized that enumerating all possible designs for training a 100B-scale LLM is computationally unaffordable, we present not only the successful parts of training GLM-130B but also many of the failed options, so that others can learn from them. Training stability is a key factor in whether a model of this scale can be trained successfully. Unlike the manually adjusted learning rate in OPT-175B and the embedding norm used in BLOOM-176B, we experimented with various options and found that the embedding gradient shrink strategy can significantly stabilize the training of GLM-130B.

Specifically, GLM-130B is a bilingual bidirectional dense model with 130 billion parameters, pre-trained on 400B tokens on a cluster of 96 NVIDIA DGX-A100 (8×40G) nodes between May 6 and July 3, 2022. Instead of using a GPT-style architecture, we adopt the General Language Model (GLM) algorithm to take advantage of bidirectional attention and the autoregressive blank-filling objective. Table 1 above compares GLM-130B, GPT-3, OPT-175B, BLOOM-176B and PaLM 540B.

Overall, the conceptual independence and engineering efforts allow GLM-130B to outperform GPT-3 on a wide range of benchmarks, and in many cases PaLM 540B as well, whereas OPT-175B and BLOOM-176B do not show performance beyond GPT-3. For zero-shot performance, GLM-130B outperforms GPT-3 175B (+5.0%), OPT-175B (+6.5%) and BLOOM-176B (+13.0%) on LAMBADA, and performs roughly three times better than GPT-3 on BIG-bench-lite. For the 5-shot MMLU task, it outperforms GPT-3 175B (+0.9%) and BLOOM-176B (+12.7%). As a bilingual LLM that includes Chinese, it significantly outperforms ERNIE TITAN 3.0 260B on 7 zero-shot CLUE datasets (+24.26%) and 5 zero-shot FewCLUE datasets (+12.75%). Importantly, GLM-130B, as an open model, is significantly less biased and less toxic than other 100B-scale models.

Finally, our goal in designing GLM-130B is to allow more people to conduct 100B-scale LLM research. First, compared to OPT and BLOOM with 175B+ parameters, the 130B size allows inference on a single A100 (8×40G) server. Second, to further reduce the GPU requirement, we quantize GLM-130B to INT4 precision without quantization-aware training, whereas OPT and BLOOM can only reach INT8. Thanks to a unique property of the GLM-130B architecture, its INT4 quantization introduces a negligible performance drop, e.g. -0.74% on LAMBADA and even +0.05% on MMLU, so it still outperforms the uncompressed GPT-3. This allows GLM-130B to perform fast inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) while preserving performance, the most affordable GPUs required for a 100B-scale LLM so far.

2. Design options of GLM-130B

1. The structure of GLM-130B

**GLM as the backbone.** Most recent 100B-scale LLMs, such as GPT-3, PaLM, OPT and BLOOM, follow a GPT-style architecture, i.e., a decoder-only autoregressive language model. In GLM-130B, we instead try to exploit the potential of a bidirectional GLM as the backbone network.

GLM is a transformer-based language model that uses autoregressive blank filling as its training objective. Briefly, for a text sequence $\textbf{x}=[x_1,\dots,x_n]$, text spans $\{\textbf{s}_1,\dots,\textbf{s}_m\}$ are sampled from it, where each $\textbf{s}_i=[s_{i,1},\dots,s_{i,l_i}]$ denotes a span of consecutive tokens and is replaced by a single mask token, forming $\textbf{x}_{corrupt}$. The model is asked to recover the spans autoregressively. To allow interaction between corrupted spans, their visibility to each other is determined by a randomly sampled permutation. The pre-training objective is defined as:
$$\mathcal{L}=\max_{\theta}\mathbb{E}_{\textbf{z}\sim Z_m}\Big[\sum_{i=1}^m\log\prod_{j=1}^{l_i} p_\theta\big(s_{i,j}\mid\textbf{x}_{corrupt},\textbf{s}_{z_{<i}},\textbf{s}_{i,<j}\big)\Big]$$
where $Z_m$ denotes the set of all permutations of the $m$ spans and $\textbf{s}_{z_{<i}}$ denotes $[\textbf{s}_{z_1},\dots,\textbf{s}_{z_{i-1}}]$.
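
For illustration only, the corruption step described above can be sketched as follows; the token list, the span positions, and the `[sop]`/`[eop]` wrapper tokens are simplifying assumptions rather than the authors' actual data pipeline:

```python
import random

def corrupt_for_blank_filling(tokens, spans, mask_token="[MASK]", sos="[sop]", eos="[eop]"):
    """Toy sketch of GLM-style autoregressive blank filling.

    `tokens` is a list of tokens, `spans` a list of (start, length) pairs.
    Each span is replaced by a single mask token in the corrupted input;
    the spans themselves become autoregressive targets in a random order.
    """
    # Part A: the corrupted sequence, with every span collapsed to one [MASK]
    corrupted, cursor = [], 0
    for start, length in sorted(spans):
        corrupted.extend(tokens[cursor:start])
        corrupted.append(mask_token)
        cursor = start + length
    corrupted.extend(tokens[cursor:])

    # Part B: the spans to recover, shuffled so later spans may attend to
    # earlier (already generated) ones, as in the permutation z ~ Z_m
    order = list(range(len(spans)))
    random.shuffle(order)
    targets = [[sos] + tokens[s:s + l] + [eos] for s, l in (spans[i] for i in order)]
    return corrupted, targets

corrupted, targets = corrupt_for_blank_filling(list("ABCDEFGH"), spans=[(1, 2), (5, 2)])
print(corrupted)  # ['A', '[MASK]', 'D', 'E', '[MASK]', 'H']
print(targets)    # the two spans in a random order, wrapped in [sop]/[eop]
```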

GLM-130B's bidirectional attention over the unmasked context distinguishes it from GPT-style LLMs, which use unidirectional attention. To support both understanding and generation, it mixes two corruption objectives, each indicated by a special mask token:

  • [MASK]: short blanks within the sentence, whose lengths add up to a certain portion of the input;
  • [gMASK]: a long blank of random length at the end of the sentence, with the prefix provided as context.

Conceptually, the blank-filling objective with bidirectional attention enables more effective comprehension of context than GPT-style models: when using [MASK], GLM-130B behaves like BERT and T5; when using [gMASK], it behaves like PrefixLM.
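
A minimal sketch of the resulting attention pattern, under the assumption that the corrupted context (including the mask token) comes first and the span being generated follows it; shapes and function names are illustrative:

```python
import torch

def glm_attention_mask(context_len: int, gen_len: int) -> torch.Tensor:
    """Return an (L, L) boolean mask where True means "may attend".

    The first `context_len` positions (the corrupted input, including the
    [MASK]/[gMASK] token) use full bidirectional attention; the remaining
    `gen_len` positions (the span being filled in) attend to the whole
    context and causally to previously generated tokens.
    """
    L = context_len + gen_len
    mask = torch.zeros(L, L, dtype=torch.bool)
    mask[:, :context_len] = True                        # everyone sees the context
    causal = torch.ones(gen_len, gen_len).tril().bool() # lower-triangular block
    mask[context_len:, context_len:] = causal           # generated part is causal
    return mask

print(glm_attention_mask(context_len=3, gen_len=2).int())
```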

Empirically, GLM-130B achieves a record 80.2% accuracy on zero-shot LAMBADA, better than both GPT-3 and PaLM. By setting the attention mask appropriately, the unidirectional variant of GLM-130B is comparable to GPT-3 and OPT-175B.

**Layer Normalization.** The main challenge in training LLMs is training instability, and a proper choice of LN helps stabilize training. We experimented with the existing practices Pre-LN, Post-LN and Sandwich-LN, none of which was sufficient to stabilize GLM-130B.

Our subsequent search focused on Post-LN because it performs well on downstream tasks, even though it is unstable for GLM-130B. Fortunately, the newly proposed DeepNorm yields promising training stability. Specifically, given the number of layers N of GLM-130B, we adopt
$$\text{DeepNorm}(\textbf{x})=\text{LayerNorm}(\alpha\cdot\textbf{x}+\text{Network}(\textbf{x}))$$
where $\alpha=(2N)^{\frac{1}{2}}$, and Xavier initialization scaled by $(2N)^{-\frac{1}{2}}$ is applied to the ffn, v_proj and out_proj layers. Additionally, all bias terms are initialized to zero. Figure 3 above shows the training stability of GLM-130B.
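
A hedged PyTorch sketch of the DeepNorm residual described above; the sublayer here is a stand-in for the attention or FFN block, and the α and initialization-scale formulas follow the text:

```python
import torch
from torch import nn

class DeepNormResidual(nn.Module):
    """Post-LN residual in the DeepNorm style: LayerNorm(alpha * x + f(x))."""

    def __init__(self, hidden: int, sublayer: nn.Module, num_layers: int):
        super().__init__()
        self.alpha = (2 * num_layers) ** 0.5      # alpha = (2N)^(1/2)
        beta = (2 * num_layers) ** -0.5           # init scale = (2N)^(-1/2)
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(hidden)
        # Xavier init scaled by beta on the sublayer's weight matrices
        # (the text applies this to ffn, v_proj and out_proj); biases start at 0.
        for p in self.sublayer.parameters():
            if p.dim() > 1:
                nn.init.xavier_normal_(p, gain=beta)
            else:
                nn.init.zeros_(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.alpha * x + self.sublayer(x))

block = DeepNormResidual(hidden=64, sublayer=nn.Linear(64, 64), num_layers=70)
out = block(torch.randn(2, 10, 64))
```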

**Position encoding and FFN.** We tested different options for positional encoding (PE) and FFN in terms of training stability and downstream performance. For positional encoding, GLM-130B adopts Rotary Positional Encoding (RoPE) instead of ALiBi. To improve the FFN in the Transformer, we choose GLU with the GeLU activation as the replacement.
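
As a rough illustration of the GLU-with-GeLU (GeGLU) feed-forward choice, a minimal sketch follows; the layer names and hidden sizes are assumptions for demonstration, and the rotary embedding itself is omitted:

```python
import torch
from torch import nn
import torch.nn.functional as F

class GeGLUFFN(nn.Module):
    """FFN with GLU gating and GeLU activation: W_down(GeLU(x W_gate) * (x W_up))."""

    def __init__(self, hidden: int, ffn_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(hidden, ffn_hidden)  # gated branch (GeLU applied)
        self.w_up = nn.Linear(hidden, ffn_hidden)    # linear branch
        self.w_down = nn.Linear(ffn_hidden, hidden)  # projection back to hidden size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.gelu(self.w_gate(x)) * self.w_up(x))

ffn = GeGLUFFN(hidden=64, ffn_hidden=256)
y = ffn(torch.randn(2, 10, 64))
```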

2. Pre-training settings of GLM-130B

Inspired by recent work, the GLM-130B pre-training objective not only includes self-supervised GLM autoregressive blank filling but also multi-task learning on a small portion of tokens. This helps improve downstream zero-shot performance.

**Self-supervised blank filling (97% of tokens).** Recall that GLM-130B uses both [MASK] and [gMASK] for this task. Specifically, in 30% of the training tokens, [MASK] is used to mask consecutive spans for blank filling; the span lengths follow a Poisson distribution ($\lambda=3$) and add up to 15% of the input. For the other 70% of tokens, the prefix of each sequence is kept as context and [gMASK] is used to mask the rest, with the masked length sampled from a uniform distribution. The pre-training data includes 1.2T tokens of the English Pile corpus, 1.0T tokens of the Chinese WudaoCorpora, and 250G of Chinese corpora crawled from the web (including online forums, encyclopedias and Q&A).
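
A toy sketch of the [MASK] span-length sampling described above (Poisson with λ=3, stopping once roughly 15% of the sequence is covered); span placement and overlap handling are simplified away:

```python
import numpy as np

def sample_mask_spans(seq_len: int, mask_ratio: float = 0.15, lam: float = 3.0, seed: int = 0):
    """Draw span lengths from Poisson(lambda=3) until ~mask_ratio of the tokens are covered."""
    rng = np.random.default_rng(seed)
    budget = int(seq_len * mask_ratio)      # number of tokens to mask in total
    lengths, covered = [], 0
    while covered < budget:
        length = max(1, int(rng.poisson(lam)))
        length = min(length, budget - covered)   # truncate the last span to fit the budget
        lengths.append(length)
        covered += length
    return lengths

print(sample_mask_spans(seq_len=512))  # span lengths summing to 76 tokens (15% of 512)
```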

**Multi-task instruction pre-training (MIP, 5% of tokens).** T5 and ExT5 suggest that multi-task learning during pre-training is helpful for fine-tuning, so we propose to include instruction-prompted datasets covering language understanding, generation and information extraction in GLM-130B pre-training.

Compared with recent work that uses multi-task prompted fine-tuning to improve zero-shot task transfer, MIP accounts for only 5% of tokens and is placed in the pre-training stage so that other general abilities of the LLM, such as unconditional free generation, are not destroyed. Specifically, we include 74 prompted datasets.

3. Platform-aware parallel strategy and model configuration

GLM-130B was trained for 60 days on a cluster of 96 DGX-A100 (8×40G) servers. The goal was to train on as many tokens as possible, since recent studies have shown that most LLMs are under-trained.

**3D parallel strategy.** Data parallelism and tensor model parallelism are the standard practice for training billion-scale models. To further handle the huge GPU memory requirement and the drop in overall GPU utilization caused by applying tensor parallelism across nodes, we incorporate pipeline model parallelism to form a 3D parallel strategy.

Pipeline parallelism divides the model into sequential stages across each parallel group; to further minimize the "bubbles" introduced by the pipeline, we use the PipeDream-Flush implementation from DeepSpeed to train GLM-130B with a relatively large global batch size of 4224, which reduces wasted time and GPU memory. With 4-way tensor parallelism and 8-way pipeline parallelism, we achieve 135 TFLOP/s per GPU (40G).
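
For intuition, the parallel layout implied by these numbers can be checked with some back-of-the-envelope arithmetic; this is a sanity check, not the actual launch configuration:

```python
# 96 DGX-A100 nodes x 8 GPUs each = 768 GPUs in total.
total_gpus = 96 * 8

tensor_parallel = 4        # 4-way tensor parallelism (kept within a node)
pipeline_parallel = 8      # 8-way pipeline parallelism
model_parallel = tensor_parallel * pipeline_parallel   # 32 GPUs hold one model replica

data_parallel = total_gpus // model_parallel           # 24 replicas training in parallel
samples_per_replica = 4224 // data_parallel            # global batch 4224 -> 176 per replica

print(total_gpus, model_parallel, data_parallel, samples_per_replica)  # 768 32 24 176
```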

**GLM-130B configuration.** Our goal is to enable this 100B-scale LLM to run on a single DGX-A100 (8×40G) node in FP16 precision. Based on the hidden state dimension of 12,288 adopted from GPT-3, the resulting model size must be no larger than 130B parameters. To maximize GPU utilization, we configure the model based on the platform and the corresponding parallel strategy. To avoid insufficient memory utilization in the middle stages caused by the extra word embeddings at the two ends, we remove one transformer layer from each of those two stages to balance the pipeline partition, leaving $9\times 8-2=70$ transformer layers in GLM-130B.
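
A rough parameter-count check that 70 layers at hidden size 12,288 indeed land near 130B; the vocabulary size and GeGLU FFN width below are assumptions made only for this estimate, and biases and LayerNorms are ignored:

```python
hidden = 12288          # hidden dimension adopted from GPT-3, per the text
layers = 9 * 8 - 2      # 70 transformer layers after balancing the pipeline
vocab = 150_000         # assumption: bilingual vocabulary of roughly this size
ffn_hidden = 32_768     # assumption: GeGLU FFN inner width

attention = 4 * hidden * hidden        # Q, K, V and output projections
ffn = 3 * hidden * ffn_hidden          # gate, up and down projections (GeGLU)
per_layer = attention + ffn
embeddings = 2 * vocab * hidden        # input and output embedding tables

total = layers * per_layer + embeddings
print(f"{total / 1e9:.1f}B parameters")  # ~130B under these assumptions
```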

During the 60-day access to the cluster, we trained GLM-130B on 400B tokens with a fixed sequence length of 2048. For the [gMASK] training objective we use a context window of 2048 tokens; for the [MASK] and multi-task objectives we use a context window of 512 and concatenate 4 samples to reach a length of 2048. We warm up the batch size from 192 to 4224 over the first 2.5% of samples. We use AdamW as the optimizer with $\beta_1$ and $\beta_2$ set to 0.9 and 0.95 and a weight decay of 0.1. Over the first 0.5% of samples, the learning rate is warmed up from $10^{-7}$ to $8\times 10^{-5}$ and then decayed by a $10\times$ cosine schedule. We use a dropout rate of 0.1 and a gradient clipping value of 1.0.
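
A hedged sketch of the learning-rate schedule just described; reading "decayed by a 10× cosine schedule" as a cosine decay down to one tenth of the peak is an assumption:

```python
import math

def lr_at(step: int, total_steps: int, warmup_frac: float = 0.005,
          lr_min: float = 1e-7, lr_peak: float = 8e-5) -> float:
    """Linear warm-up from lr_min to lr_peak, then cosine decay to lr_peak / 10 (assumed)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return lr_min + (lr_peak - lr_min) * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    lr_floor = lr_peak / 10
    return lr_floor + 0.5 * (lr_peak - lr_floor) * (1 + math.cos(math.pi * progress))

# AdamW settings from the text would be: betas=(0.9, 0.95), weight_decay=0.1
print(lr_at(0, 100_000), lr_at(500, 100_000), lr_at(100_000, 100_000))
```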

3. Training stability of GLM-130B

Training stability is a decisive factor in GLM-130B's final quality, which is largely affected by the number of tokens the model sees. Therefore, given the limits on compute, a trade-off between efficiency and stability must be made for the floating-point (FP) format: low-precision floating-point formats improve computing efficiency, but are prone to overflow and underflow, which can cause the training to collapse.

**Mixed precision.** We follow the common practice of mixed precision, i.e. FP16 for the forward and backward passes and FP32 for optimizer states and master weights, to reduce GPU memory usage and improve training efficiency. Similar to OPT-175B and BLOOM-176B, this choice causes GLM-130B training to face frequent loss spikes, which become more and more frequent as training progresses. Precision-related spikes often have no clear cause: some recover on their own; others are accompanied by a sudden surge in the gradient norm, and eventually the loss spikes or even becomes NaN.

OPT-175B dealt with this by manually skipping data and adjusting hyperparameters; BLOOM-176B used the embedding norm technique. We spent months studying these spikes and realized that several problems appear as the transformer scales up:

First, if Pre-LN is used, the value scale of the transformer's main branch can be extremely large in deeper layers. In GLM-130B this is addressed by using DeepNorm-based Post-LN, which keeps the value scale bounded at all times.

Second, as the model scales up, the attention scores grow so large that they exceed the range of FP16. There are few options in LLMs to overcome this. In BLOOM-176B, the BF16 format is used instead of FP16 due to its wider value range on NVIDIA Ampere GPUs. However, because BF16 is converted to FP32 for gradient accumulation, it consumed about 15% more GPU memory than FP16 in our experiments, and more importantly it is not supported on other GPU platforms. Another option from BLOOM-176B is to apply the embedding norm, but it harms model performance.

**Embedding layer gradient shrink.** Our empirical study shows that the gradient norm can serve as an informative indicator of training collapse. Specifically, we find that a training collapse usually lags behind a spike in the gradient norm by a few training steps. Such spikes are usually caused by abnormal gradients of the embedding layer: we observe that its gradient norm is several orders of magnitude larger than that of the other layers in the early training stage, and that it also fluctuates dramatically early on. Vision models handle this by freezing the patch projection layer; unfortunately, we cannot freeze the embedding layer in a language model.

Finally, we found that gradient shrink on the embedding layer helps overcome loss spikes and stabilizes GLM-130B training. It was first used in the multimodal transformer model CogView. Specifically, letting $\alpha$ be the shrink factor, it can be easily implemented as
$$\text{word\_embedding}=\text{word\_embedding}\times\alpha+\text{word\_embedding.detach()}\times(1-\alpha)$$
Empirically, setting $\alpha=0.1$ helps avoid most spikes with negligible loss of speed.
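
In PyTorch terms, the shrink can be applied right after the embedding lookup; a minimal sketch with illustrative module names:

```python
import torch
from torch import nn

class ShrinkEmbedding(nn.Module):
    """Embedding whose gradient is scaled by alpha without changing the forward value."""

    def __init__(self, vocab_size: int, hidden: int, alpha: float = 0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.alpha = alpha

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        emb = self.embed(input_ids)
        # The forward value is unchanged (alpha*x + (1-alpha)*x == x), but only
        # the alpha-scaled branch carries gradient back into the embedding table.
        return emb * self.alpha + emb.detach() * (1 - self.alpha)

layer = ShrinkEmbedding(vocab_size=1000, hidden=64, alpha=0.1)
out = layer(torch.randint(0, 1000, (2, 16)))
```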

In fact, the final GLM-130B training run experienced only 3 late-stage loss divergences, while it failed numerous times due to hardware failures.

4. Inference of GLM-130B on RTX 2080 Ti

The main goal of GLM-130B is to lower the hardware requirement for accessing 100B-scale LLMs without sacrificing efficiency or effectiveness.

The 130B model size makes it possible to run the complete GLM-130B on a single A100 (40G×8) machine, rather than requiring high-end A100 (80G×8) machines as OPT-175B and BLOOM-176B do. To speed up inference, we also implement GLM-130B with FasterTransformer. Compared with the PyTorch implementation of BLOOM-176B in Hugging Face, GLM-130B's decoding inference is 7-8.4× faster on the same single A100 server.

**INT4 quantization for popular GPUs.** To further support popular GPUs, GLM-130B is compressed as much as possible while preserving its performance advantages, in particular through quantization.

It is common practice to quantize both model weights and activations to INT8. However, our analysis suggests that LLM activations may contain extreme outliers. Such emergent outliers are also found in OPT-175B and BLOOM-176B, but there they affect only about 0.1% of feature dimensions, so the problem can be solved by decomposing the matrix multiplication.

In contrast, about 30% of the activation dimensions of GLM-130B contain outliers, which makes the technique above far less efficient. Therefore, we decided to focus on quantizing only the model weights while keeping the activations in FP16 precision. We simply use post-training absmax quantization and dynamically convert the weights back to FP16 at runtime, which introduces a small computational overhead but greatly reduces GPU memory usage.

Excitingly, GLM-130B reaches full INT4 weight quantization, whereas existing successes only achieve the INT8 level. Compared with INT8, the INT4 version halves the required GPU memory again, down to 70GB, which allows GLM-130B inference on 4×RTX 3090 Ti (24G) or 8×RTX 2080 Ti (11G). The left side of Table 2 above shows that, without any post-training, the INT4 version of GLM-130B suffers almost no performance degradation and maintains its advantage over GPT-3 on common benchmarks.
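
A minimal sketch of post-training absmax weight quantization with runtime dequantization, as described above; per-output-row scales, the [-7, 7] integer range and int8 storage (instead of packed 4-bit values) are simplifying assumptions:

```python
import torch

def absmax_quantize_int4(weight: torch.Tensor):
    """Symmetric absmax quantization of a weight matrix to 4-bit integer levels.

    Assumption: one scale per output row; INT4 values live in [-7, 7] and are
    stored here in an int8 tensor for simplicity (real kernels pack two per byte).
    """
    scale = weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(weight / scale), -7, 7).to(torch.int8)
    return q, scale

def dequant_matmul(x, q, scale):
    """Dequantize weights on the fly to the activation dtype (FP16 in GLM-130B)."""
    w = q.to(x.dtype) * scale.to(x.dtype)
    return x @ w.t()

w = torch.randn(128, 64)              # would be an FP16 linear weight on GPU
q, s = absmax_quantize_int4(w)
y = dequant_matmul(torch.randn(2, 64), q, s)
```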

**Scaling law of GLM's INT4 weight quantization.** The right side of Figure 5 above shows the trend of performance with model size, indicating a scaling law in GLM's INT4 weight-quantization performance. We investigated the underlying mechanism unique to GLM. The weight value distributions, plotted on the left side of Figure 5, directly affect quantization quality: a linear layer with broadly distributed values requires larger quantization bins, which leads to more precision loss. The widely distributed values of the attn-dense and w2 matrices explain why INT4 quantization fails for BLOOM. Conversely, GLM has much narrower value distributions than GPT models of similar size, and as the GLM model size increases, the gap between the INT4 and FP16 versions narrows further.

5. Results

We evaluate GLM-130B following common settings for LLMs such as GPT-3 and PaLM. In addition to English, GLM-130B, as a bilingual model, is also evaluated on Chinese benchmarks.

**Discussion on the scope of zero-shot learning in GLM-130B.** Since GLM-130B has been trained with MIP, we clarify the scope of its zero-shot evaluation here. In fact, "zero-shot" seems to have controversial interpretations, with no consensus in the community. We follow an influential related survey, in which zero-shot learning is a test-time setting whose goal is to assign an unseen class label to a test image, where the involvement of unseen class labels is key. We therefore derive the following criteria for selecting GLM-130B's zero-shot evaluation datasets:

  • English: 1) for tasks with fixed labels (such as natural language inference), no datasets from such tasks are evaluated; 2) for tasks without fixed labels (question answering, topic classification), only datasets with an obvious domain transfer from those in MIP are considered;
  • Chinese: all datasets can be evaluated, since this constitutes zero-shot cross-lingual transfer;

**Filtering test datasets.** Following the practice of previous work and the guidelines above, we filter out and avoid reporting evaluation results on potentially contaminated datasets. For LAMBADA and CLUE, we found minimal overlap under a 13-gram setting. Pile, MMLU and BIG-bench are either held out from, or were released after, the crawling of our pre-training corpora.
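
A small sketch of the kind of 13-gram overlap check implied here; whitespace tokenization and the exact matching rule are assumptions, not the authors' exact filtering code:

```python
def ngrams(tokens, n=13):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(example_text: str, train_ngrams: set, n: int = 13) -> bool:
    """Flag an evaluation example if any of its 13-grams appears in the training corpus."""
    return bool(ngrams(example_text.split(), n) & train_ngrams)

# train_ngrams would be built by streaming over the pre-training corpus once;
# evaluation examples whose 13-grams overlap it are dropped before reporting results.
```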

1. Language Modeling

**LAMBADA.** LAMBADA is a dataset that tests last-word language modeling ability. GLM-130B achieves a zero-shot accuracy of 80.2% with its bidirectional attention, setting a new record on LAMBADA.

**Pile.** The Pile test set comprises a series of language modeling benchmarks. Compared with GPT-3 and Jurassic-1, GLM-130B achieves the best weighted BPB on their 18 shared test sets, demonstrating its strong language modeling ability.

2. Massive Multi-Task Language Understanding (MMLU)

MMLU is a diverse benchmark consisting of 57 multiple-choice question-answering tasks covering human knowledge from high-school to expert level. It was released after the crawling of the Pile and is an ideal benchmark for evaluating few-shot learning in LLMs. The GPT-3 results are taken from the MMLU paper, and BLOOM-176B is tested using the same prompts as GLM-130B.

As shown in Figure 6 above, after seeing about 300B tokens, GLM-130B's few-shot (5-shot) performance on MMLU approaches GPT-3's (43.9). It continues to rise as training progresses, reaching an accuracy of 44.8 when training ends. This is consistent with the observation that most existing LLMs are far from adequately trained.

3. BIG-bench

BIG-bench benchmarks challenging tasks involving model reasoning, knowledge and commonsense. Since evaluating all 150 tasks is time-consuming for LLMs, we report results on BIG-bench-lite, an official subset of 24 tasks. As seen in Figure 7 and Table 4 above, GLM-130B outperforms GPT-3 175B and even PaLM 540B in the zero-shot setting. This is probably thanks to GLM-130B's bidirectional context attention and MIP, which have been shown to improve zero-shot results on unseen tasks. As the number of shots increases, GLM-130B's performance keeps rising and remains ahead of GPT-3.

**Limitations and discussion.** In the experiments above, we observe that GLM-130B's performance growth with the number of few-shot samples is not as significant as GPT-3's. Here we try to understand this phenomenon intuitively.

First, the bidirectional nature of GLM-130B yields strong zero-shot performance, bringing it close to the few-shot upper bound of models of the same scale. Second, this could also be due to a flaw of the existing MIP paradigm, which involves only zero-shot prediction during training and thus biases GLM-130B toward stronger zero-shot learning but relatively weaker in-context few-shot performance. To correct this bias, a potential solution we propose is, should we have the chance to continue pre-training GLM-130B, to use MIP with in-context samples of various shot counts rather than zero-shot samples only.

Finally, although PaLM 540B uses the same GPT-style architecture as GPT-3, its relative few-shot improvement from in-context learning is significantly larger than GPT-3's. We conjecture that this further acceleration of performance growth comes from PaLM's high-quality and diverse private training corpus.

4. Chinese Language Comprehension Evaluation (CLUE)

We evaluate GLM-130B's Chinese zero-shot performance on the Chinese NLP benchmarks CLUE and FewCLUE. Note that we do not include any Chinese downstream tasks in MIP. So far we have completed testing on parts of the two benchmarks, covering 7 CLUE and 5 FewCLUE datasets. We compare GLM-130B with ERNIE Titan 3.0, the largest existing Chinese monolingual model with 260B parameters, and follow its setting of reporting zero-shot results on the dev sets. GLM-130B outperforms ERNIE Titan 3.0 on the 12 tasks. Interestingly, GLM-130B performs at least 260% better than ERNIE on the two abstractive MRC datasets, probably because GLM-130B's pre-training objective naturally fits the abstractive MRC format.


Origin blog.csdn.net/bqw18744018044/article/details/129132457