Overview of large model technology development - (1)

The text content is based on the paper "A Survey of Large Language Models".


Paper title: A Survey of Large Language Models
Paper link: https://arxiv.org/pdf/2303.18223v10.pdf

Because the paper is very long, my article is divided into several parts. The directory is as follows:

Overview of the development of large-scale model technology - (1)
Overview of the development of large-scale model technology - (2)
Overview of the development of large-scale model technology - (3)
Overview of the development of large-scale model technology - (4)

1 Introduction

Language is an essential human ability for expression and communication, developed in early childhood and evolving throughout life. However, for machines to understand and use language to communicate like humans, they need the support of powerful artificial intelligence algorithms. This goal has been a longstanding research challenge.

Language modeling (LM) is one of the major technical approaches to advancing machine language intelligence. LMs aim to build generative probabilistic models of word sequences to predict the probability of future (or missing) tokens. Research on LM has received extensive attention and has gone through four main development stages.

The first stage is the statistical language model (SLM), which builds word prediction models based on statistical learning methods, predicting the next word from the most recent context (e.g., n-gram models). SLMs have achieved good results in information retrieval and natural language processing, but they suffer from data sparsity, which makes accurate estimation of high-order language models difficult.

The second stage is the neural language model (NLM), which uses neural networks to model the probabilities of word sequences. NLMs introduced the concept of distributed word representations and improved performance on NLP tasks by learning effective features of words or sentences. The emergence of NLMs had an important impact on representation learning for language models.

The third stage is the pre-trained language model (PLM), represented by BERT and the GPT series. These models learn general, context-aware word representations by pre-training on large-scale unlabeled corpora. Through the pre-training and fine-tuning paradigm, PLMs achieve significant performance gains on many NLP tasks.

The fourth stage is the large language model (LLM), which improves performance by scaling up the model size or data size. LLMs have demonstrated a surprising ability to solve complex tasks and have sparked a rethinking of the possibilities of artificial general intelligence (AGI). The rapid development of LLMs is driving innovation in the field of AI research.

Representative examples at each stage:
SLM: text classification, part-of-speech tagging, syntactic analysis
NLM: autoencoders, variational autoencoders, generative adversarial networks, denoising autoencoders
PLM: BERT, GPT, RoBERTa, ALBERT
LLM: GPT-3, GPT-4, ChatGPT, InstructGPT

The rapid progress of LLMs has had a significant impact on the AI community and has opened new research directions in areas such as language processing, information retrieval, and computer vision. Despite some challenges and unsolved problems, LLMs are expected to be one of the key technologies for building artificial general intelligence.

This article combines a translation of the paper's overview with my own understanding to present an overview of the development of large-scale model technology. The structure of the article is as follows:

Section 2: background on LLMs, including terminology, settings, resources, and an organizational overview
Section 3: a summary of the publicly available resources needed to develop LLMs
Section 4: an introduction to large model pre-training techniques
Section 5: an introduction to large model adaptation
Section 6: an introduction to large model knowledge utilization
Section 7: an introduction to performance evaluation of large models
Section 8: a summary and discussion of the future of large models

2. Overview of large models

2.1 Large model background

Large language models (LLMs) generally refer to Transformer language models with hundreds of billions of parameters, trained on large-scale text data. LLMs have demonstrated strong capabilities in natural language understanding and in solving complex tasks through text generation. Before discussing how LLMs work, we first introduce their basic background, including scaling laws, emergent abilities, and key techniques.

2.1.1 Scaling laws of LLMs

The scaling laws of LLMs describe how model performance improves in a predictable way as model size, data size, and training compute increase. Currently, most LLMs are built on the Transformer architecture and employ similar pre-training objectives such as language modeling. Research shows that scaling up LLMs can greatly improve model performance. Two representative scaling laws have been proposed.

The first is the KM scaling law, proposed by Kaplan et al., which uses power-law relationships to describe how neural language model performance depends on model size (N), dataset size (D), and training compute (C). The KM scaling law characterizes this dependence through three basic formulas:

L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \sim 0.076, \quad N_c \sim 8.8 \times 10^{13}
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \sim 0.095, \quad D_c \sim 5.4 \times 10^{13}
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \sim 0.050, \quad C_c \sim 3.1 \times 10^{8}

where L(·) denotes the cross-entropy loss. These three laws were obtained by fitting model performance over a range of data sizes (22M to 23B tokens), model sizes (768M to 1.5B non-embedding parameters), and amounts of training compute. They show that model performance depends strongly on these three factors.
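To get a feel for how these power laws are used, here is a minimal sketch that plugs example sizes into L(N) and L(D). The constants come from the formulas above; the chosen values of N and D are arbitrary illustration values, not from the paper.

```python
# Minimal sketch: evaluating the KM power-law loss curves from the constants above.
# The example sizes N and D are arbitrary illustration values.

def km_loss(x: float, x_c: float, alpha: float) -> float:
    """Power-law loss L(x) = (x_c / x)^alpha."""
    return (x_c / x) ** alpha

# Constants from the KM scaling law formulas above.
N_C, ALPHA_N = 8.8e13, 0.076
D_C, ALPHA_D = 5.4e13, 0.095

if __name__ == "__main__":
    N = 1.5e9   # example: 1.5B non-embedding parameters
    D = 23e9    # example: 23B training tokens
    print(f"L(N) = {km_loss(N, N_C, ALPHA_N):.3f}")  # roughly 2.3
    print(f"L(D) = {km_loss(D, D_C, ALPHA_D):.3f}")  # roughly 2.1
```

Increasing N or D in this sketch makes the corresponding loss term shrink smoothly, which is exactly the regularity the KM law captures.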

The second is the Chinchilla scaling law, proposed by Hoffmann et al. They conducted rigorous experiments over a wider range of model sizes (70M to 16B parameters) and data sizes (5B to 500B tokens), and fitted a different form of scaling law to guide compute-optimal training of LLMs. The Chinchilla scaling law shows how to optimally allocate a compute budget between model size and data size by minimizing the following loss function:

L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

where E = 1.69, A = 406.4, B = 410.7, α = 0.34, and β = 0.28. Optimizing the loss L(N, D) under the constraint C ≈ 6ND yields the optimal allocation of the compute budget between model size and data size:

N_{opt}(C) = G\left(\frac{C}{6}\right)^{a}, \quad D_{opt}(C) = G^{-1}\left(\frac{C}{6}\right)^{b}

where a = α/(α+β), b = β/(α+β), and G is a scaling coefficient that can be computed from A, B, α, and β.
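To make the compute-optimal allocation concrete, the sketch below numerically minimizes L(N, D) under the constraint C ≈ 6ND, using the fitted constants quoted above. The compute budget is an arbitrary example value chosen only for illustration, and the crude grid search stands in for the closed-form solution.

```python
# Minimal sketch: numerically finding a compute-optimal (N, D) split under the
# Chinchilla parametric loss L(N, D) = E + A/N^alpha + B/D^beta with C ~= 6*N*D.
# The compute budget below is an arbitrary example value.

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n: float, d: float) -> float:
    return E + A / n**ALPHA + B / d**BETA

def optimal_split(c: float, steps: int = 20000) -> tuple[float, float]:
    """Grid-search N in log-space; D is then fixed by the constraint D = C / (6N)."""
    best_n, best_d, best_loss = 0.0, 0.0, float("inf")
    for i in range(1, steps):
        n = 10 ** (6 + 9 * i / steps)   # sweep N from 1e6 to 1e15 parameters
        d = c / (6 * n)
        l = loss(n, d)
        if l < best_loss:
            best_n, best_d, best_loss = n, d, l
    return best_n, best_d

if __name__ == "__main__":
    budget = 1e23  # example compute budget in FLOPs
    n_opt, d_opt = optimal_split(budget)
    print(f"N_opt ~ {n_opt:.3e} params, D_opt ~ {d_opt:.3e} tokens")
```

Running the sketch with different budgets shows how both the optimal parameter count and the optimal token count grow as the available compute grows.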

The study of these scaling laws reveals the close relationship between LLM performance and model size, data size, and the amount of computation, providing guidance for further research on and optimization of LLMs.

2.1.2 Emergent abilities of LLMs

The emergent abilities of LLMs are formally defined as "abilities that are not present in small models but arise in large models," and they are one of the most striking features distinguishing LLMs from previous PLMs. Emergent abilities also show a notable pattern: when the scale reaches a certain level, performance rises significantly above the random level. By analogy, this pattern is closely related to the phenomenon of phase transitions in physics. In principle, emergent abilities can be associated with a number of complex tasks, but we focus on the general abilities that can be applied to solving a variety of tasks. Here, we briefly introduce three typical emergent abilities of LLMs and representative models that exhibit them.

  • In-context learning: the in-context learning (ICL) ability was formally introduced by GPT-3: assuming the language model is provided with a natural language instruction and/or several task demonstrations, it can generate the expected output for test instances by completing the word sequence of the input text, without additional training or gradient updates. Among the GPT series models, the 175B GPT-3 exhibits strong in-context learning ability in general, while GPT-1 and GPT-2 do not. This ability also depends on the specific downstream task: for example, in-context learning emerges on arithmetic tasks such as 3-digit addition and subtraction for the 13B GPT-3, whereas even the 175B GPT-3 does not perform well on the Persian question answering task.
  • Instruction following: by fine-tuning on a mixture of tasks formatted as natural language instructions (instruction tuning), LLMs can perform well on new tasks that are also described in the form of instructions, which improves their generalization ability.
  • Step-by-step reasoning: small language models often struggle with complex tasks that involve multiple reasoning steps, such as math word problems. With the chain-of-thought (CoT) prompting strategy, LLMs can solve such tasks by using prompts that include intermediate reasoning steps leading to the final answer. For complex tasks, LLMs can thus work out the output step by step, much like sending "continue" to ChatGPT to let it keep reasoning. A sketch of both prompt styles is shown after this list.
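To make the difference between in-context learning and chain-of-thought prompting concrete, here is a minimal sketch of the two prompt styles. The demonstrations are well-known illustrative examples, not evaluation data, and no particular model or API is assumed.

```python
# Minimal sketch of two prompt styles; the demonstrations are illustrative examples.

# In-context learning (ICL): a few input-output demonstrations, then a new query.
# The model is expected to continue the pattern without any gradient updates.
icl_prompt = """Translate English to French.
sea otter -> loutre de mer
cheese -> fromage
hello ->"""

# Chain-of-thought (CoT): the demonstration includes intermediate reasoning steps,
# nudging the model to reason step by step before giving the final answer.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many apples do they have?
A:"""

print(icl_prompt)
print(cot_prompt)
```

Both prompts are plain text fed to the model as-is; the only difference is whether the demonstrations show just the answers or also the intermediate reasoning.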

2.1.3 Key techniques of LLMs

  1. Scaling : larger model sizes, larger data sizes, and more training compute generally lead to higher model capacity. Scaling laws can be used to allocate computing resources more efficiently.
  2. Training : Due to the huge size of LLMs, it is very challenging to successfully train capable LLMs. Distributed training algorithms and optimization frameworks are needed to support the learning of large-scale models.
  3. Ability eliciting : eliciting the latent abilities of LLMs by designing appropriate task instructions or in-context learning strategies, enabling them to exhibit superior performance on specific tasks.
  4. Alignment fine-tuning : In order to align with human values, alignment fine-tuning methods are needed to ensure that the content generated by LLMs is beneficial, honest and harmless.
  5. Tool Manipulation : Utilize external tools to complement LLMs for certain tasks, such as using a calculator for accurate calculations or using a search engine for up-to-date information.

These key techniques play an important role in developing LLMs with general and powerful learning capabilities.

2.2 Technical evolution of the GPT series models

The GPT series of models has gone through several stages of technical evolution. The early exploration phase produced the GPT-1 and GPT-2 models, which adopt generative pre-training on the Transformer architecture. In the capacity-leap stage, the GPT-3 model introduced the concept of in-context learning (ICL) and scaled the parameter count to 175B, showing excellent performance and reasoning ability. In the capability enhancement phase, GPT-3's abilities were further improved by training on code data and aligning with human preferences. Finally, based on these enhancement techniques, the GPT-3.5 and GPT-4 models were launched; ChatGPT is optimized for dialogue, while GPT-4 is stronger at solving complex tasks and has improved safety.

The figure below shows a timeline of large language models (with more than 10B parameters) released in recent years. The timeline is largely determined by the release date of a model's technical paper (e.g., the date it was submitted to arXiv). If there is no corresponding paper, the date is set to the earliest time the model was publicly released or announced. LLMs with publicly available model checkpoints are marked in yellow. Due to space constraints, the figure only includes LLMs with publicly reported evaluation results.

[Figure: timeline of existing large language models (with more than 10B parameters); models with publicly available checkpoints are marked in yellow]

3. Supporting resources for LLMs

Developing or reproducing LLMs is by no means an easy task, given the technical difficulties and enormous computational resource demands. A feasible approach is to learn from existing LLMs and reuse publicly available resources for incremental development or experimental research. In this section, we briefly summarize the publicly available resources for developing LLMs, including model checkpoints and APIs, corpora, and library resources.

A model checkpoint is a saved intermediate state of the model during training, so that it can be loaded and used later. In deep learning, training usually runs over many iterations, and the model parameters are updated in each one. Saving checkpoints periodically avoids losing training progress when the process is unexpectedly interrupted, and also facilitates subsequent model evaluation and deployment.
A model checkpoint usually contains the model's weight parameters, the optimizer state, and other training-related information. By saving a checkpoint, the model can be reloaded when needed to continue training or run inference from the previous training state. This saves training time, facilitates model debugging and optimization, and allows the model to be used flexibly in different environments, as in the sketch below.
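As a minimal illustration of what a checkpoint contains, the sketch below saves and restores model weights and optimizer state with PyTorch. The tiny model, optimizer settings, and file path are placeholder examples, not anything from the paper.

```python
import torch
import torch.nn as nn

# Minimal checkpointing sketch; the model, optimizer, and path are placeholders.
model = nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Save the intermediate training state (weights, optimizer state, progress info).
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "step": 1000,  # example training-progress marker
    },
    "checkpoint.pt",
)

# Later: reload the checkpoint and resume training or run inference
# from the previously saved state.
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model_state_dict"])
optimizer.load_state_dict(ckpt["optimizer_state_dict"])
```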

3.1 Publicly available model checkpoints and APIs

Considering the huge cost of model pre-training, well-trained model checkpoints are crucial for the research community when studying and developing LLMs. Since parameter scale is a key factor to consider when using LLMs, we classify these public models into two scale levels (tens of billions of parameters and hundreds of billions of parameters), which helps users choose appropriate resources according to their budget. In addition, for inference, we can directly use the public APIs to perform tasks without running the model locally. Next, we introduce publicly available model checkpoints and APIs.

Models with tens of billions of parameters

With the exception of LLaMA (the largest version contains 65B parameters) and NLLB (the largest version contains 54.5B parameters), most models in this category have parameter sizes ranging from 10B to 20B. Other models in this range include mT5, PanGu-α, T0, GPT-NeoX-20B, CodeGen, UL2, Flan-T5, and mT0.

Models with hundreds of billions of parameters

Only a few models in this category have been released publicly. For example, OPT, OPT-IML, BLOOM, and BLOOMZ have almost the same number of parameters as GPT-3 (175B version), while GLM and Galactica have 130B and 120B parameters, respectively. Among them, the open release of OPT (175B version) aims to enable researchers to conduct reproducible large-scale research. For the study of cross-lingual generalization, BLOOM (176B version) and BLOOMZ (176B version) can be used as base models because they perform well on multilingual language modeling tasks. OPT-IML, which has been fine-tuned with instructions, may be a good candidate for studying the effect of instruction tuning. Models of this scale typically require thousands of GPUs or TPUs for training. For example, OPT (175B version) used 992 A100-80GB GPUs, while GLM (130B version) used a cluster of 96 NVIDIA DGX-A100 (8×40G) GPU nodes.

Public APIs for LLMs

APIs provide a more convenient way for ordinary users to use LLMs without running the model locally. As a representative interface for using LLMs, the APIs of the GPT-series models are widely used in academia and industry. OpenAI provides seven main interfaces for the GPT-3 series models: ada, babbage, curie, davinci (the most powerful version in the GPT-3 series), text-ada-001, text-babbage-001, and text-curie-001. Among them, the first four can be further fine-tuned on OpenAI's servers. In particular, babbage, curie, and davinci correspond to the GPT-3 (1B), GPT-3 (6.7B), and GPT-3 (175B) models, respectively. In addition, there are two Codex-related APIs: code-cushman-001 (a powerful multilingual version of Codex (12B)) and code-davinci-002.

In addition, the GPT-3.5 series includes a base model, code-davinci-002, and three enhanced versions: text-davinci-002, text-davinci-003, and gpt-3.5-turbo-0301. Notably, gpt-3.5-turbo-0301 is the interface for calling ChatGPT. OpenAI has also released the corresponding APIs for GPT-4, including gpt-4, gpt-4-0314, gpt-4-32k, and gpt-4-32k-0314. In general, the choice of API depends on the specific application scenario and response requirements; detailed usage can be found on the OpenAI project website. A minimal call sketch is shown below.
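For reference, here is a minimal sketch of calling these APIs with the openai Python package (the pre-v1 interface that was current when these models were released). The prompt and parameter values are illustrative, and a valid API key is assumed.

```python
import openai  # pre-v1 interface of the openai Python package

openai.api_key = "YOUR_API_KEY"  # placeholder; a real key is required

# Completion-style call to a GPT-3.5 text model (illustrative parameters).
completion = openai.Completion.create(
    model="text-davinci-003",
    prompt="Briefly explain what a language model is.",
    max_tokens=64,
)
print(completion["choices"][0]["text"])

# Chat-style call to the ChatGPT interface.
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    messages=[{"role": "user", "content": "Briefly explain what a language model is."}],
)
print(chat["choices"][0]["message"]["content"])
```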

3.2 Commonly used corpora

Compared with earlier PLMs, LLMs require more training data covering a wide range of content due to having more parameters. To meet this need, more and more research-ready training datasets have been released. In this section, several commonly used corpora for training LLMs are briefly summarized. We categorize these corpora into six groups according to their content type: Books, CommonCrawl, Reddit Links, Wikipedia, Code, and Others.

The figure below shows the statistics of commonly used data sources
[Figure: statistics of commonly used data sources for LLM pre-training]

Books

BookCorpus, a dataset commonly used in previous small-scale models such as GPT and GPT-2, contains over 11,000 books covering various topics and genres such as fiction and biography. Another large-scale book corpus is Project Gutenberg, which includes more than 70,000 literary works, including novels, prose, poetry, drama, history, science, philosophy, and other types of works. It is currently one of the largest open source book collections and is used for training MT-NLG and LLaMA. As for Books1 and Books2 used in GPT-3, they are much larger than BookCorpus, but have not been publicly released so far.

CommonCrawl

CommonCrawl is one of the largest open-source web crawl databases, containing petabytes of data, and has been widely used as training data for existing LLMs. Because the full dataset is so large, existing research mainly extracts subsets of web pages from a specific period. However, web data contains a large amount of noise and low-quality information, so data preprocessing is required before use. Four filtered datasets based on CommonCrawl are commonly used in existing research: C4, CC-Stories, CC-News, and RealNews. The Colossal Clean Crawled Corpus (C4) includes five variants: en (806G), en.noclean (6T), realnewslike (36G), webtextlike (17G), and multilingual (38T). The en version has been used for pre-training T5, LaMDA, Gopher, and UL2. Multilingual C4, also known as mC4, was used for mT5. CC-Stories (31G) consists of a subset of CommonCrawl data whose content is presented in the form of stories. The original CC-Stories is no longer available, so a reproduced version, CC-Stories-R, is used instead. In addition, two news corpora extracted from CommonCrawl, REALNEWS (120G) and CC-News (76G), are also commonly used as pre-training data.
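As a practical aside, the filtered C4 corpus can be streamed with the Hugging Face datasets library rather than downloaded in full. The sketch below is a minimal example; the "allenai/c4" hub identifier and the streaming flag are assumptions based on the public hub release, and only a few records are printed.

```python
from datasets import load_dataset  # pip install datasets

# Minimal sketch: stream the English split of C4 without downloading the full corpus.
# The "allenai/c4" hub identifier is an assumption based on the public release.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    print(example["text"][:200])  # each record carries the page text in a "text" field
    if i >= 2:
        break
```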

Reddit links

Reddit is a social media platform on which users submit links and text posts that others can upvote or downvote. Posts with many upvotes are usually considered useful and can be used to create high-quality datasets. WebText is a well-known corpus composed of highly upvoted links from Reddit, but it is not publicly available; an easily accessible open-source alternative is OpenWebText. Another corpus extracted from Reddit is PushShift.io, a continuously updated dataset containing historical data since Reddit's creation. Pushshift provides not only monthly data dumps but also useful utilities that let users search, summarize, and run preliminary investigations on the entire dataset, making it easy to collect and process Reddit data.

Wikipedia

Wikipedia is an online encyclopedia containing a large number of high-quality articles covering a wide variety of topics. Most of these articles are written in an expository style (with supporting references) and span many languages and fields. The English-only version of Wikipedia is widely used in most LLMs (e.g., GPT-3, LaMDA, and LLaMA). Wikipedia is also available in multiple languages, so it can be used in multilingual settings.

Code

To collect code data, existing work mainly crawls open-source licensed code from the Internet. The main sources are public code repositories under open-source licenses (such as GitHub) and code-related question-and-answer platforms (such as StackOverflow). Google has publicly released the BigQuery dataset, which contains a large number of open-source licensed code snippets in various programming languages and serves as a representative code dataset. CodeGen uses BIGQUERY, a subset of the BigQuery dataset, to train its multilingual version (CodeGen-Multi).

Others

Pile is a large-scale, diverse, open-source text dataset containing 800GB of data from multiple sources, including books, websites, code, scientific papers, and social media platforms. It is built from 22 diverse, high-quality subsets. The Pile dataset is widely used in models of different parameter scales, such as GPT-J (6B), CodeGen (16B), and Megatron-Turing NLG (530B). In addition, ROOTS consists of various smaller datasets (1.61TB of text in total) covering 59 different languages (including natural languages and programming languages), and has been used to train BLOOM.

In practice, pre-training LLMs usually requires a mixture of different data sources (see the figure below), rather than a single corpus. Existing studies therefore usually mix several off-the-shelf datasets (such as C4, OpenWebText, and Pile) and then perform further processing to obtain the pre-training corpus. Furthermore, to train LLMs adapted to specific applications, it is also important to extract data from relevant sources (such as Wikipedia and BigQuery) to enrich the corresponding information in the pre-training data. For a quick reference to the data sources used in existing LLMs, we list the pre-training corpora of three representative LLMs:

GPT-3 (175B) was pre-trained on 300B tokens from a mixed dataset including CommonCrawl, WebText2, Books1, Books2, and Wikipedia.
PaLM (540B) used a pre-training dataset of 780B tokens drawn from social media conversations, filtered web pages, books, GitHub, multilingual Wikipedia, and news.
LLaMA extracts training data from various sources, including CommonCrawl, C4, GitHub, Wikipedia, books, ArXiv, and StackExchange. LLaMA (6.7B) and LLaMA (13B) were trained on 1.0T tokens, while LLaMA (32.5B) and LLaMA (65B) were trained on 1.4T tokens.

[Figure: ratios of various data sources in the pre-training corpora of representative LLMs]
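To illustrate what "mixing several data sources" can look like in practice, here is a minimal sketch that samples training documents from multiple corpora according to fixed weights. The source names and weights are hypothetical illustration values, not the actual mixtures reported for any model.

```python
import random

# Minimal sketch: sampling pre-training documents from a weighted mixture of sources.
# Source names, documents, and weights are hypothetical illustration values.
sources = {
    "web": ["web doc 1", "web doc 2", "web doc 3"],
    "books": ["book excerpt 1", "book excerpt 2"],
    "code": ["code file 1", "code file 2"],
}
weights = {"web": 0.6, "books": 0.3, "code": 0.1}

def sample_document(rng: random.Random) -> str:
    """Pick a source according to its mixture weight, then a document from that source."""
    names = list(weights)
    source = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    return rng.choice(sources[source])

rng = random.Random(0)
batch = [sample_document(rng) for _ in range(5)]
print(batch)
```

Real pre-training pipelines work with streamed shards rather than in-memory lists, but the weighted-sampling idea is the same.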

3.3 Library resources

In this section, we briefly introduce a range of available library resources for developing LLMs.

Transformers

Transformers is an open-source Python library for building models with the Transformer architecture, developed and maintained by Hugging Face. It provides a simple and user-friendly API that makes it easy to use and customize various pre-trained models. It has a large and active community of users and developers, and its models and algorithms are regularly updated and improved. A minimal usage sketch is shown below.
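As a minimal illustration, the sketch below loads a small pre-trained model from the Hugging Face hub and generates a continuation. GPT-2 is used here only because it is small and public, and the generation parameters are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: load a small public checkpoint (GPT-2) and generate a continuation.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Large language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```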

DeepSpeed

DeepSpeed is a deep learning optimization library developed by Microsoft (compatible with PyTorch), which has been used to train several LLMs such as MT-NLG and BLOOM. It supports various optimization techniques for distributed training, such as memory optimization (the ZeRO technique, gradient checkpointing) and pipeline parallelism. A minimal configuration sketch is shown below.
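For a rough sense of how such a library is wired into a training script, here is a minimal sketch of wrapping a PyTorch model with DeepSpeed and a ZeRO configuration. The model, batch size, and config values are placeholder examples rather than recommended settings, and actually running it requires a distributed launch environment (e.g., the deepspeed launcher).

```python
import torch.nn as nn
import deepspeed  # pip install deepspeed

# Placeholder model; in practice this would be a large Transformer.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Example DeepSpeed config enabling mixed precision and ZeRO stage-2
# memory optimization; all values are illustrative placeholders.
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model into an engine that handles
# distributed training, mixed precision, and ZeRO partitioning.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```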

Megatron-LM

Megatron-LM is a deep learning library developed by NVIDIA for training large-scale language models. It also provides a wealth of distributed training optimization techniques, including model and data parallelism, mixed precision training, and FlashAttention. These optimization techniques can greatly improve training efficiency and speed, and realize efficient distributed training across GPUs.

JAX

JAX is a Python library developed by Google for high-performance machine learning research. It allows users to perform computations on arrays with hardware acceleration (GPU or TPU), runs efficiently on a variety of devices, and supports features such as automatic differentiation and just-in-time compilation, as in the small example below.
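As a small illustration of the automatic differentiation and JIT compilation just mentioned, the sketch below differentiates and compiles a toy loss function; the function itself is an arbitrary example.

```python
import jax
import jax.numpy as jnp

# Toy example: a simple quadratic "loss" over a parameter vector.
def loss(params: jnp.ndarray) -> jnp.ndarray:
    return jnp.sum(params ** 2)

grad_loss = jax.grad(loss)       # automatic differentiation
fast_grad = jax.jit(grad_loss)   # just-in-time compilation for CPU/GPU/TPU

params = jnp.array([1.0, -2.0, 3.0])
print(fast_grad(params))         # -> [2., -4., 6.]
```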

Colossal-AI

Colossal-AI is a deep learning library developed by HPC-AI Tech for training large-scale AI models. It is implemented on top of PyTorch and supports a rich set of parallel training strategies. In addition, it can use the heterogeneous memory management methods proposed by PatrickStar. Recently, a ChatGPT-style model called ColossalChat was developed with Colossal-AI based on LLaMA, and two versions (7B and 13B) were publicly released.

BMTrain

BMTrain is a library developed by OpenBMB to efficiently train large-scale parameter models, emphasizing code simplicity, low resource usage and high availability. BMTrain has incorporated several common LLMs (such as Flan-T5 and GLM) into its ModelCenter, and developers can use these models directly.

FastMoE

FastMoE is a training library dedicated to MoE (mixture of experts) models. It is developed based on PyTorch and designed with efficiency and user-friendliness in mind. FastMoE simplifies the process of converting Transformer models to MoE models, and supports data parallelism and model parallelism during training.
