LLMs: "GLM-130B: AN OPEN BILINGUAL PRE-TRAINED MODEL" (Translation and Commentary)

Overview: On March 10, 2023, the 100B-scale dialogue model ChatGLM entered invited internal testing, and the 6-billion-parameter ChatGLM-6B model was open-sourced. The ChatGLM algorithm (GLM-130B + SFT) follows the design ideas of ChatGPT: code pre-training is injected into the 100B-scale base model GLM-130B (an autoregressive pre-trained model with multiple training objectives), and techniques such as supervised fine-tuning (SFT) are used to align the model with human intent. GLM-130B is an open bilingual (Chinese and English) bidirectional dense model with 130 billion parameters, built on the General Language Model (GLM) architecture. It is designed to support inference of a 100B-scale model on a single A100 (40G x 8) or V100 (32G x 8) server.

>> Pre-trained on 400 billion bilingual tokens: GLM-130B was pre-trained on over 400 billion tokens, roughly 200 billion English and 200 billion Chinese.

>> Architecture: GLM backbone + 70 Transformer layers + 2,048 sequence length + autoregressive blank infilling + two mask tokens + RoPE + DeepNorm + GeLU: GLM-130B is built on the General Language Model (GLM) architecture. It has 70 Transformer layers, a hidden dimension of 12,288, a maximum sequence length of 2,048, and an icetk-based bilingual tokenizer with 150,000 tokens. GLM-130B uses autoregressive blank infilling as its main pre-training objective: random consecutive text spans are masked and then predicted autoregressively. In training, GLM-130B uses two different mask tokens ([MASK] and [gMASK]) for short-span and long-text generation, respectively. It also adopts the recently proposed Rotary Positional Encoding (RoPE), DeepNorm layer normalization, and the GeLU (Gaussian Error Linear Unit) activation function.

Table of Contents

"GLM-130B: AN OPEN BILINGUAL PRE-TRAINED MODEL" (Translation and Commentary)

ABSTRACT

1. INTRODUCTION

2. THE DESIGN CHOICES OF GLM-130B

2.1 GLM-130B’S ARCHITECTURE

2.2 GLM-130B’S PRE-TRAINING SETUP

2.3 PLATFORM-AWARE PARALLEL STRATEGIES AND MODEL CONFIGURATIONS

3. THE TRAINING STABILITY OF GLM-130B

7. CONCLUSION AND LESSONS

ACKNOWLEDGEMENT


"GLM-130B: AN OPEN BILINGUAL PRE-TRAINED MODEL" (Translation and Commentary)

Links

Official site: ChatGLM

Article: GLM-130B: An Open Bilingual Pre-Trained Model | GLM-130B

GitHub: https://github.com/THUDM/ChatGLM-6B

Paper: https://openreview.net/pdf?id=-Aw0rrrPUF

Timeline

GLM-130B released: August 4, 2022

ChatGLM (100B-scale dialogue model) begins invited internal testing: March 10, 2023

Authors

Tsinghua University and Zhipu AI

ABSTRACT

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B—the largest Chinese language model—across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post training, with almost no performance loss, making it the first among 100B-scale models and more importantly, allowing its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.

1. INTRODUCTION

Large language models (LLMs), particularly those with over 100 billion (100B) parameters (Brown et al., 2020; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022; Wang et al., 2021), have presented attractive scaling laws (Wei et al., 2022b), where emergent zero-shot and few-shot capabilities suddenly arose. Among them, GPT-3 (Brown et al., 2020) with 175B parameters pioneers the study of 100B-scale LLMs by strikingly generating better performance with 32 labeled examples than the fully-supervised BERT-Large model on a variety of benchmarks. However, both GPT-3 (and many other closed-sourced 100B-scale ones)—the model itself—and how it can be trained, have been thus far intransparent to the public. It is of critical value to train a high-quality LLM of such scale with both the model and training process shared with everyone.

We thus aim to pre-train an open and highly-accurate 100B-scale model with ethical concerns in mind. Over the course of our attempt, we have come to realize that pre-training a dense LLM at such a scale raises numerous unexpected technical and engineering challenges compared to training 10B-scale models, in terms of pre-training efficiency, stability, and convergence. Similar difficulties have also been observed in training other 100B-scale models such as OPT-175B and BLOOM-176B.

In this work, we introduce the pre-training of a 100B-scale model—GLM-130B, in terms of engineering efforts, model design choices, training strategies for efficiency and stability, and quantization for affordable inference. As it has been widely realized that it is computationally unaffordable to empirically enumerate all possible designs for training 100B-scale LLMs, we present not only the successful part for training GLM-130B but also many of the failed options and lessons learned. Particularly, the training stability is the decisive factor in the success of training models of such a scale. Different from practices such as manually adjusting learning rates in OPT-175B and using embedding norm in the sacrifice of performance in BLOOM-176B, we experiment with various options and find the strategy of embedding gradient shrink can significantly stabilize the training of GLM-130B.

Specifically, GLM-130B is a bilingual (English and Chinese) bidirectional dense model with 130 billion parameters, pre-trained over 400 billion tokens on a cluster of 96 NVIDIA DGX-A100 (8×40G) GPU nodes between May 6 and July 3, 2022. Instead of using the GPT-style architecture, we adopt the General Language Model (GLM) algorithm (Du et al., 2022) to leverage its bidirectional attention advantage and autoregressive blank infilling objective. Table 1 summarizes the comparison between GLM-130B, GPT-3 and another two open-source efforts—OPT-175B and BLOOM-176B, as well as PaLM 540B (Chowdhery et al., 2022)—a 4× larger model—as a reference.

Altogether, the conceptual uniqueness and engineering efforts enable GLM-130B to exhibit performance that surpasses the level of GPT-3 on a wide range of benchmarks (in total 112 tasks) and also outperforms PaLM 540B in many cases, while outperformance over GPT-3 has not been observed in OPT-175B and BLOOM-176B (Cf. Figure 1 left). For zero-shot performance, GLM-130B is better than GPT-3 175B (+5.0%), OPT-175B (+6.5%), and BLOOM-176B (+13.0%) on LAMBADA (Paperno et al., 2016), and achieves 3× better performance than GPT-3 on Big-bench-lite (Srivastava et al., 2022). For the 5-shot MMLU (Hendrycks et al., 2021) tasks, it is better than GPT-3 175B (+0.9%) and BLOOM-176B (+12.7%). As a bilingual LLM also in Chinese, it offers significantly better results than ERNIE TITAN 3.0 260B (Wang et al., 2021)—the largest Chinese LLM—on 7 zero-shot CLUE (Xu et al., 2020) datasets (+24.26%) and 5 zero-shot FewCLUE (Xu et al., 2021) ones (+12.75%). Importantly, as summarized in Figure 1 right, GLM-130B as an open model is associated with significantly less bias and generation toxicity than its 100B-scale counterparts.

Finally, we design GLM-130B to empower as many people as possible to conduct 100B-scale LLM studies. First, instead of using 175B+ parameters as OPT and BLOOM, the 130B size is decided because such a size supports inference on a single A100 (8×40G) server. Second, to further lower the GPU requirements, we quantize GLM-130B into INT4 precision without post training while OPT and BLOOM can only reach INT8. Due to a unique property of the GLM architecture, GLM-130B’s INT4 quantization introduces negligible performance degradation, e.g., -0.74% on LAMBADA and even +0.05% on MMLU, making it still better than the uncompressed GPT-3. This enables GLM-130B’s fast inference with performance guarantee on a server of 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G), the most affordable GPU required for using 100B-scale LLMs to date.

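As a rough illustration of what weight-only INT4 quantization looks like, the sketch below applies symmetric round-to-nearest absmax quantization per output channel. The function names, the per-channel granularity, and the INT8 storage container are illustrative assumptions; this is not the actual GLM-130B quantization kernel.

import torch

def quantize_weight_int4_rtn(w: torch.Tensor):
    """Symmetric round-to-nearest INT4 weight quantization (per output channel).
    A simplified illustration, not the exact GLM-130B implementation."""
    # per-row absmax scale; the symmetric INT4 range used here is [-7, 7]
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)  # stored in int8 containers
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale.to(torch.float16)

# usage: quantize one weight matrix and check the reconstruction error
w = torch.randn(1024, 1024, dtype=torch.float16)
q, s = quantize_weight_int4_rtn(w.float())
err = (dequantize(q, s) - w).abs().mean()
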
We open-source the model checkpoints, code, training logs, related toolkits, and lessons learned.

2. THE DESIGN CHOICES OF GLM-130B

The architecture of a machine learning model defines its inductive bias. However, it has been realized that it is computationally unaffordable to explore various architectural designs for LLMs. We introduce and explain the unique design choices of GLM-130B.

2.1 GLM-130B’S ARCHITECTURE

GLM as Backbone. Most recent 100B-scale LLMs, such as GPT-3, PaLM, OPT, and BLOOM, follow the traditional GPT-style (Radford et al., 2019) architecture of decoder-only autoregressive language modeling. In GLM-130B, we instead make an attempt to explore the potential of a bidirectional GLM—General Language Model (Du et al., 2022)—as its backbone.

GLM is a transformer-based language model that leverages autoregressive blank infilling as its training objective. Briefly, for a text sequence x = [x_1, ..., x_n], text spans {s_1, ..., s_m} are sampled from it, each of which s_i denotes a span of consecutive tokens [s_{i,1}, ..., s_{i,l_i}] and is replaced (i.e., corrupted) with a single mask token to form x_corrupt. The model is asked to recover them autoregressively. To allow interactions between corrupted spans, their visibility to each other is decided by a randomly sampled permutation on their order.


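The corruption step described above can be sketched as follows; the [sop]/[eop] delimiter tokens and the function interface are illustrative assumptions rather than the exact GLM implementation.

import random

def corrupt_for_blank_infilling(tokens, spans, mask_token="[MASK]",
                                start_token="[sop]", end_token="[eop]"):
    """Toy illustration of GLM-style autoregressive blank infilling.
    `spans` is a list of (start, length) index pairs to corrupt."""
    spans = sorted(spans)
    corrupted, targets, cursor = [], [], 0
    for start, length in spans:
        corrupted.extend(tokens[cursor:start])
        corrupted.append(mask_token)            # replace the whole span with one mask token
        targets.append(tokens[start:start + length])
        cursor = start + length
    corrupted.extend(tokens[cursor:])
    random.shuffle(targets)                     # random permutation decides mutual visibility
    part_b = []
    for span in targets:                        # each span is then recovered autoregressively
        part_b += [start_token] + span + [end_token]
    return corrupted + part_b

# e.g. corrupt_for_blank_infilling("the quick brown fox".split(), [(1, 2)])
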
GLM’s bidirectional attention over unmasked (i.e., uncorrupted) contexts distinguishes GLM-130B from GPT-style LLMs in which the unidirectional attention is used. To support both understanding and generation, it mixes two corruption objectives, each indicated by a special mask token:

[MASK]: short blanks in sentences whose lengths add up to a certain portion of the input.

[gMASK]: random-length long blanks at the end of sentences with prefix contexts provided.

Conceptually, the blank infilling objective with bidirectional attention enables a more effective comprehension of contexts than GPT-style models: when using [MASK], GLM-130B behaves as BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020); when using [gMASK], GLM-130B behaves similarly to PrefixLM (Liu et al., 2018; Dong et al., 2019).

Empirically, GLM-130B offers a record-high accuracy of 80.2% on zero-shot LAMBADA by outperforming both GPT-3 and PaLM 540B in Figure 2. By setting the attention mask, GLM-130B’s unidirectional variant is comparable to GPT-3 and OPT-175B. Our observations are in line with existing findings (Liu et al., 2018; Dong et al., 2019).

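The distinction between bidirectional context and autoregressive generation can be made concrete with a toy attention mask; the function below is a minimal sketch under the assumption that the uncorrupted context occupies the first context_len positions, not the exact GLM-130B masking code.

import torch

def glm_attention_mask(context_len: int, total_len: int) -> torch.Tensor:
    """Illustrative GLM-style mask: context tokens attend bidirectionally to each
    other; generated tokens attend to the full context plus previously generated
    tokens (causal). Returns a [total_len, total_len] boolean matrix where True
    means "may attend"."""
    mask = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))  # causal base
    mask[:, :context_len] = True   # every position sees the whole bidirectional context
    return mask

# With [gMASK], context_len covers the given prefix (PrefixLM-like behaviour);
# with [MASK], the context is the corrupted input and each span is generated causally.
# e.g. glm_attention_mask(context_len=4, total_len=6)
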

Layer Normalization (LN, Ba et al. (2016)). Training instability is one major challenge for training LLMs (Zhang et al., 2022; Scao et al., 2022; Chowdhery et al., 2022) (Cf. Figure 10 in Appendix for collapses in training several 100B-scale models). A proper choice of LNs can help stabilize the training of LLMs. We experiment with existing practices, e.g., Pre-LN (Xiong et al., 2020), Post-LN (Ba et al., 2016), Sandwich-LN (Ding et al., 2021), which are unfortunately incapable of stabilizing our GLM-130B test runs (Cf. Figure 3 (a) and Appendix B.2 for details).

Our search is later focused on Post-LN due to its favorable downstream results in preliminary experiments though it does not stabilize GLM-130B. Fortunately, one of the attempts on Post-LN initialized with the newly-proposed DeepNorm (Wang et al., 2022b) generates promising training stability. Specifically, given the number of GLM-130B’s layers N, we adopt DeepNorm(x) = LayerNorm(α · x + Network(x)), where α = (2N)^(1/2), and apply the Xavier normal initialization with the scaling factor of (2N)^(-1/2) to ffn, v_proj and out_proj. Additionally, all bias terms are initialized to zero. Figure 3 shows it significantly benefits the training stability of GLM-130B.

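A minimal PyTorch sketch of the DeepNorm recipe stated above follows; the module and function names are illustrative, while the residual scaling alpha = (2N)^(1/2), the (2N)^(-1/2) initialization gain, and the zero-initialized biases come directly from the text.

import torch
from torch import nn

N_LAYERS = 70                       # number of transformer layers in GLM-130B
ALPHA = (2 * N_LAYERS) ** 0.5       # DeepNorm residual scaling, (2N)^(1/2)
BETA = (2 * N_LAYERS) ** -0.5       # initialization scaling factor, (2N)^(-1/2)

class DeepNormResidual(nn.Module):
    """y = LayerNorm(alpha * x + sublayer(x)), i.e. Post-LN with DeepNorm."""
    def __init__(self, hidden_size: int, sublayer: nn.Module, alpha: float = ALPHA):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(hidden_size)
        self.alpha = alpha

    def forward(self, x):
        return self.norm(self.alpha * x + self.sublayer(x))

def init_proj(linear: nn.Linear):
    # Xavier normal scaled by (2N)^(-1/2) for ffn, v_proj and out_proj; zero bias
    nn.init.xavier_normal_(linear.weight, gain=BETA)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)
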

Positional Encoding and FFNs. We empirically test different options for positional encoding (PE) and FFN improvements in terms of both training stability and downstream performance (Cf. Appendix B.3 for details). For PEs in GLM-130B, we adopt Rotary Positional Encoding (RoPE, Su et al. (2021)) rather than ALiBi (Press et al., 2021). To improve FFNs in Transformer, we pick GLU with the GeLU (Hendrycks & Gimpel, 2016) activation as the replacement.

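For the FFN part, a GLU block with GeLU activation can be sketched as below; the inner dimension shown in the usage comment is an assumption for illustration, and the exact projection layout in GLM-130B may differ.

import torch
from torch import nn

class GeGLUFFN(nn.Module):
    """FFN(x) = W_down(GeLU(W_gate x) * (W_up x)), i.e. the GLU variant with GeLU gating."""
    def __init__(self, hidden_size: int, ffn_size: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, ffn_size)
        self.up = nn.Linear(hidden_size, ffn_size)
        self.down = nn.Linear(ffn_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, x):
        return self.down(self.act(self.gate(x)) * self.up(x))

# e.g. GeGLUFFN(hidden_size=12288, ffn_size=32768); the ffn_size here is an
# illustrative assumption, not a value stated in the text.
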

2.2 GLM-130B’S PRE-TRAINING SETUP

Inspired by recent works (Aribandi et al., 2022; Wei et al., 2022a; Sanh et al., 2022), the GLM-130B pre-training objective includes not only the self-supervised GLM autoregressive blank infilling but also multi-task learning for a small portion of tokens. This is expected to help boost its downstream zero-shot performance.

Self-Supervised Blank Infilling (95% tokens). Recall that GLM-130B uses both [MASK] and [gMASK] for this task. Each training sequence is applied with one of them independently at a time. Specifically, [MASK] is used to mask consecutive spans in 30% of training sequences for blank infilling. The lengths of spans follow a Poisson distribution (λ = 3) and add up to 15% of the input. For the other 70% sequences, the prefix of each sequence is kept as context and [gMASK] is used to mask the rest of it. The masked length is sampled from the Uniform distribution.

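A minimal sketch of this per-sequence objective sampling follows, assuming a simple loop for the Poisson span lengths and a uniform prefix split for [gMASK]; the helper name and return format are illustrative.

import numpy as np

def sample_objective(seq_len: int = 2048, rng=np.random.default_rng()):
    """Pick the corruption recipe for one training sequence (illustrative)."""
    if rng.random() < 0.3:
        # [MASK]: short spans with Poisson(lambda=3) lengths, ~15% of tokens masked in total
        budget, lengths = int(0.15 * seq_len), []
        while sum(lengths) < budget:
            lengths.append(max(1, int(rng.poisson(3))))
        return ("[MASK]", lengths)
    else:
        # [gMASK]: keep a prefix as context and mask the rest; masked length ~ Uniform
        prefix_len = int(rng.uniform(0, seq_len))
        return ("[gMASK]", seq_len - prefix_len)
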

The pre-training data includes 1.2T Pile (train split) (Gao et al., 2020) English, 1.0T Chinese WudaoCorpora (Yuan et al., 2021), and 250G Chinese corpora (including online forums, encyclopedia, and QA) we crawl from the web, which form a balanced composition of English and Chinese contents.

Multi-Task Instruction Pre-Training (MIP, 5% tokens). T5 (Raffel et al., 2020) and ExT5 (Aribandi et al., 2022) suggest that multi-task learning in pre-training can be more helpful than fine-tuning; we thus propose to include a variety of instruction-prompted datasets, covering language understanding, generation, and information extraction, in GLM-130B’s pre-training.

Compared to recent works (Wei et al., 2022a; Sanh et al., 2022) that leverage multi-task prompted fine-tuning to improve zero-shot task transfer, MIP only accounts for 5% tokens and is set in the pre-training stage to prevent spoiling LLMs’ other general ability, e.g., unconditional free generation. Specifically, we include 74 prompted datasets from (Sanh et al., 2022; Wang et al., 2022a), listed in Appendix C and Table 12. GLM-130B users are suggested to avoid evaluating its zero-shot and few-shot capabilities on these datasets according to the criterion illustrated in Section 5.

2.3 PLATFORM-AWARE PARALLEL STRATEGIES AND MODEL CONFIGURATIONS

GLM-130B is trained on a cluster of 96 DGX-A100 GPU (8×40G) servers with a 60-day access. The goal is to pass through as many tokens as possible, as a recent study (Hoffmann et al., 2022) suggests that most existing LLMs are largely under-trained.

The 3D Parallel Strategy. The data parallelism (Valiant, 1990) and tensor model parallelism (Shoeybi et al., 2019) are the de facto practices for training billion-scale models (Wang & Komatsuzaki, 2021; Du et al., 2022). To further handle the huge GPU memory requirement and the decrease in overall GPU utilization resulting from applying tensor parallelism between nodes (40G rather than 80G A100s are used for training GLM-130B), we combine the pipeline model parallelism with the other two strategies to form a 3D parallel strategy.

The pipeline parallelism divides the model into sequential stages for each parallel group, and to further minimize bubbles introduced by the pipeline, we leverage the PipeDream-Flush (Narayanan et al., 2021) implementation from DeepSpeed (Rasley et al., 2020) to train GLM-130B with a relatively large global batch size (4,224) to reduce time and GPU memory waste. Through both numerical and empirical examinations, we adopt 4-way tensor parallelism and 8-way pipeline parallelism (Cf. Appendix B.4 for details). Following the calculation in (Chowdhery et al., 2022), we report a hardware FLOPs utilization (HFU) of 43.3% and a model FLOPs utilization (MFU) of 32.5% due to re-materialization.

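The numbers above imply the following layout arithmetic (96 nodes of 8 GPUs, 4-way tensor and 8-way pipeline parallelism); the per-replica sample count is simple division over the stated global batch size and does not reflect the micro-batch size actually used.

# Parallel layout implied by the stated numbers (arithmetic only, no framework code).
gpus = 96 * 8                                                # 768 GPUs in total
tensor_parallel = 4
pipeline_parallel = 8
model_parallel_group = tensor_parallel * pipeline_parallel   # 32 GPUs hold one model replica
data_parallel = gpus // model_parallel_group                 # 768 / 32 = 24 replicas
global_batch = 4224
samples_per_replica = global_batch // data_parallel          # 4224 / 24 = 176 samples per step
print(data_parallel, samples_per_replica)
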
GLM-130B Configurations. We aim to enable our 100B-scale LLM to run on a single DGX-A100 (40G) node in FP16 precision. Based on the hidden state dimension of 12,288 we adopt from GPT-3, the resultant model size has to be no more than 130B parameters, hence GLM-130B. To maximize GPU utilization, we configure the model based on the platform and its corresponding parallel strategy. To avoid insufficient memory utilization in the middle stages due to the additional word embedding at both ends, we balance the pipeline partition by removing one layer from the first and last stages, leaving 9×8-2=70 transformer layers in GLM-130B.

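The 9×8-2=70 layer balance can be sketched as a partition function; the assumption that exactly the first and last stages each drop one transformer layer to host the word embeddings follows the description above, while the function itself is an illustrative reconstruction.

def pipeline_partition(num_stages: int = 8, layers_per_stage: int = 9):
    """Assign transformer layers to pipeline stages; the first and last stage
    each give up one layer to make room for the input/output word embeddings."""
    stages = []
    for stage in range(num_stages):
        n = layers_per_stage
        if stage == 0 or stage == num_stages - 1:
            n -= 1                      # embedding lives here, so one fewer transformer layer
        stages.append(n)
    assert sum(stages) == 9 * 8 - 2     # 70 transformer layers in GLM-130B
    return stages

# pipeline_partition() -> [8, 9, 9, 9, 9, 9, 9, 8]
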

During the 60-day access to the cluster, we manage to train GLM-130B for 400 billion tokens (roughly 200 billion each for Chinese and English) with a fixed sequence length of 2,048 per sample. For the [gMASK] training objective, we use a context window of 2,048 tokens. For the [MASK] and multi-task objectives, we use a context window of 512 and concatenate four samples together to fit the 2,048 sequence length. We warm up the batch size from 192 to 4,224 over the first 2.5% of samples. We use AdamW (Loshchilov & Hutter, 2019) as our optimizer with β1 and β2 set to 0.9 and 0.95, and a weight decay value of 0.1. We warm up the learning rate from 10^-7 to 8×10^-5 over the first 0.5% of samples, then decay it by a 10× cosine schedule. We use a dropout rate of 0.1 and clip gradients using a clipping value of 1.0 (Cf. Table 11 for the full configurations).

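A sketch of the warm-up and decay schedules as stated, assuming linear warm-up for both the batch size and the learning rate and reading "10× cosine schedule" as a cosine decay to one tenth of the peak rate; these interpretations are assumptions where the text does not pin down the exact shapes.

import math

def lr_at(progress: float, peak: float = 8e-5, start: float = 1e-7,
          warmup: float = 0.005) -> float:
    """Learning rate as a function of training progress in [0, 1] (illustrative).
    Linear warm-up from `start` to `peak`, then cosine decay towards peak / 10."""
    if progress < warmup:
        return start + (peak - start) * progress / warmup
    t = (progress - warmup) / (1.0 - warmup)   # 0 -> 1 over the decay phase
    floor = peak / 10.0                        # "10x" decay target (assumption)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * t))

def batch_size_at(progress: float, warmup: float = 0.025) -> int:
    """Batch size ramp from 192 to 4,224 over the first 2.5% of samples."""
    if progress >= warmup:
        return 4224
    return int(192 + (4224 - 192) * progress / warmup)
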

3. THE TRAINING STABILITY OF GLM-130B

The training stability is the decisive factor in GLM-130B’s quality, which is also largely impacted by the number of tokens it passes through (Hoffmann et al., 2022). Thus, given the computing usage constraint, there has to be a trade-off between efficiency and stability with regard to floating-point (FP) formats: low-precision FP formats (e.g., 16-bit precision—FP16) improve computing efficiency but are prone to overflow and underflow errors, resulting in training collapses.

Mixed-Precision. We follow the common practice of a mixed-precision (Micikevicius et al., 2018) strategy (Apex O2), i.e., FP16 for forwards and backwards and FP32 for optimizer states and master weights, to reduce the GPU memory usage and improve training efficiency. Similar to OPT-175B and BLOOM-176B (Cf. Figure 10 in Appendix), the training of GLM-130B faces frequent loss spikes resulting from this choice, which tend to become increasingly frequent as the training goes on. The precision related spikes are often without clear reasons: some recover on their own; others come with a portent of suddenly soaring gradient norm and eventually a spike or even NaN in loss. OPT-175B attempted to fix them by manually skipping data and adjusting hyper-parameters; BLOOM-176B did so via the embedding norm technique (Dettmers et al., 2021). We spent months to empirically investigate the spikes and realize that a few issues emerge when transformers scale up:

First, the transformer main branch’s value scale can be extremely large in deeper layers if using Pre-LN. This is addressed in GLM-130B by using DeepNorm based Post-LN (Cf. Section 2.1), which makes the value scale always bounded.

Second, the attention scores grow so large that they exceed FP16's range as the model scales up. There are a few options to overcome this issue in LLMs. In CogView (Ding et al., 2021), PB-Relax is proposed to remove bias terms and deduct the extremum value in attention computation to avoid the problem, which unfortunately does not help avoid divergence in GLM-130B. In BLOOM-176B, the BF16 format is used instead of FP16, due to its wide range of values on NVIDIA Ampere GPUs (i.e., A100). However, BF16 consumes ∼15% more run-time GPU memory than FP16 in our experiments due to its conversion to FP32 in gradient accumulation, and more importantly it is not supported on other GPU platforms (e.g., NVIDIA Tesla V100), limiting the accessibility of the produced LLMs. Another option from BLOOM-176B is to apply embedding norm with BF16, but at the sacrifice of a significant penalty on model performance, as they notice that embedding norm can harm the model’s zero-shot learning (Cf. Section 4.3 in (Scao et al., 2022)).

Embedding Layer Gradient Shrink (EGS). Our empirical search identifies that the gradient norm can serve as an informative indicator of training collapses. Specifically, we find that a training collapse usually lags behind a “spike” in gradient norm by a few training steps. Such spikes are usually caused by the embedding layer’s abnormal gradients, as we observe that its gradient norm is often several orders of magnitude larger than those of other layers in GLM-130B’s early stage training (Cf. Figure 4 (a)). In addition, it tends to fluctuate dramatically in the early training. The problem is handled in vision models (Chen et al., 2021) via freezing the patch projection layer. Unfortunately, we cannot freeze the training of the embedding layer in language models.

Finally, we find the gradient shrink on embedding layers could overcome loss spikes and thus stabilize GLM-130B’s training. It is first used in the multi-modal transformer CogView (Ding et al., 2021). Let α be the shrinking factor, the strategy can be easily implemented via word_embedding = word_embedding ∗ α + word_embedding.detach() ∗ (1 − α). Figure 4 (b) suggests that empirically, setting α = 0.1 wipes out most spikes we would have met, with negligible latency.

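The one-line formula above maps directly onto a small PyTorch wrapper; this is a minimal sketch with α = 0.1 as in the text, and the module name is illustrative.

import torch
from torch import nn

class ShrinkEmbedding(nn.Module):
    """Embedding layer gradient shrink (EGS): the forward value is unchanged,
    but only a fraction alpha of the gradient flows back into the embedding."""
    def __init__(self, vocab_size: int, hidden_size: int, alpha: float = 0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.alpha = alpha

    def forward(self, input_ids):
        emb = self.embedding(input_ids)
        # word_embedding * alpha + word_embedding.detach() * (1 - alpha)
        return emb * self.alpha + emb.detach() * (1.0 - self.alpha)
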
In fact, the final GLM-130B training run only experiences three late-stage loss divergence cases, though it fails numerous times due to hardware failures. For the three unexpected spikes, it turns out further shrinking the embedding gradient can still help stabilize the GLM-130B training. See the training notes and Tensorboard logs in our code repository for details.

7. CONCLUSION AND LESSONS

We introduce GLM-130B, a bilingual pre-trained language model that aims to facilitate open and inclusive LLM research. GLM-130B’s technical and engineering undertakings generate insight into LLMs’ architectures, pre-training objectives, training stability and efficiency, and affordable inference. Altogether, it contributes to the high quality of GLM-130B in terms of both language performance on 112 tasks and ethical results on bias and toxicity benchmarks. Our experiences of both success and failure are condensed into the lessons for training 100B-scale LLMs, attached in the Appendix B.10.

ACKNOWLEDGEMENT

This research was supported by Natural Science Foundation of China (NSFC) 61825602, 62276148 and Zhipu.AI. We thank all our collaborators and partners from the Knowledge Engineering Group (KEG), Parallel Architecture & Compiler technology of Mobile, Accelerated, and Networked systems Group (PACMAN), Natural Language Processing Group (THUNLP) at Tsinghua University, and Zhipu.AI.
