The evolution of open-source large language models: early innovations


Although the industry initially emphasized proprietary models, the release of popular language models such as GPT-3 prompted the LLM research community to release open-source variants. The earliest open-source language models lagged behind the best proprietary models in performance, but they laid the foundation for greater transparency in LLM research and paved the way for powerful later models such as LLaMA-2.

This three-part series traces the development of open-source large language models. This first article explores the initial stage of that history: the early open-source LLMs that are crucial to understanding the open-source LLM revolution, since all subsequent open-source models build on them. The next two parts will cover more recent open-source LLMs and explore how imitation and alignment techniques can be used to improve model performance.

(The author of this article is Cameron R. Wolfe, Director of AI at Rebuy and Ph.D. in deep learning. The following content is compiled and published by OneFlow with authorization. Please contact the author for reprinting. Original text: https://cameronrwolfe.substack.com/p/the-history-of-open-source-llms-early)


Author | Cameron R. Wolfe

Compiled by OneFlow

Translation | Yang Ting, Wan Zilin


(Quoted from [12, 20])

The history of language models can be traced back to early models such as GPT and GPT-2, and to recurrent neural network techniques (such as ULMFiT) that predate Transformer-based language models. Although language models have been around for a long time, they only recently began to gain real popularity. The release of GPT-3 brought language models into the public eye for the first time: by combining self-supervised pre-training with in-context learning, GPT-3 achieved remarkable few-shot performance on many tasks, attracting widespread attention.


(Quoted from [1])

The widespread recognition of GPT-3 spurred the development of large language models (LLMs). Soon after, research on language model alignment produced even more impressive models, such as InstructGPT and ChatGPT. The remarkable performance of these models sparked enormous interest in language modeling and generative artificial intelligence.

Although these early large language models were very powerful, most of them were closed source. As language models gained recognition, many powerful LLMs were accessible only through paid APIs (such as the OpenAI API), and only a few groups or laboratories had the ability to research and develop such models. This closed-source development model differs sharply from common AI research practice, which generally encourages openness and sharing to promote progress.

“Due to the limitations of closed source, it is difficult for researchers to understand how and why these large language models work, hindering progress on improving model robustness and mitigating harmful behavior such as bias.” — Quoted from [4]

1

How language models work

Open source LLM research promotes transparency and sharing, creating an environment where researchers can collaborate and innovate faster. In short, the beauty of open source LLM research is that it allows us to study these incredible models, making it possible to gain a deep understanding of how they work. There are no unknown tricks behind paid APIs or black boxes. Open source LLM allows us to look at the code, conduct experiments, and even try out our own ideas and modify them - we have full access to the underlying model!

“To conduct reproducible research and jointly push the field of AI forward, more members of the community need access to these models.” — Quoted from [4]

However, to truly study such models, we first need to understand the basic principles behind them. The following overview aims to provide a (relatively) comprehensive understanding of how LLMs work.

The language modeling objective


Pre-training with the language modeling objective

The core of language modeling is next-token prediction (also known as the standard language modeling objective), which is used to train almost all language models. To train a language model with next-token prediction, we need a large corpus of raw text. Using this corpus, we train the model by: i) sampling some text from the dataset; and ii) training the model to predict the next word. Because the ground-truth next word can be derived from the raw text itself, next-token prediction is a form of self-supervised learning.

What is a token? Next-token prediction can be roughly understood as predicting the next word in a sequence given the preceding words as context. However, tokens are not exactly the same as words. When a language model receives text as input, the raw text is first tokenized (that is, converted into a sequence of discrete words or subwords). See below.


Convert raw text into a sequence of tokens

The tokenizer associated with a language model typically has a fixed-size vocabulary: the set of all tokens that can be produced from a text sequence.
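
To make this concrete, below is a minimal sketch of tokenization using the Hugging Face transformers library and GPT-2's BPE tokenizer (used here purely as a convenient, open example; the exact subword splits in the comments are illustrative):

```python
from transformers import AutoTokenizer

# Load a BPE tokenizer (GPT-2's tokenizer is a common choice for GPT-style models).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Open source LLMs promote transparency."
token_ids = tokenizer.encode(text)                    # raw text -> list of integer token ids
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # ids -> human-readable subword tokens

print(tokens)                     # e.g. ['Open', 'Ġsource', 'ĠLL', 'Ms', ...] (illustrative)
print(token_ids)                  # integer ids, one per token
print(tokenizer.vocab_size)       # fixed-size vocabulary (50257 for GPT-2)
print(tokenizer.decode(token_ids))  # ids can be mapped back to the original text
```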

Predict the next token. Given a sequence of tokens, the language model has an embedding layer that stores a unique, learnable vector embedding for every token in the tokenizer's vocabulary. Using this embedding layer, we convert each token in the input sequence into its corresponding vector embedding, forming a sequence of token vectors. See below.

Tokenization and embedding of raw text data

After adding a positional embedding to each token, we can pass this sequence of token vectors to a decoder-only Transformer (explained in more detail later). The model transforms each token vector and produces a corresponding output vector for each token. Notably, the number of output vectors is the same as the number of input vectors. See below.


Process tokens using a decoder-only Transformer
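
As a small illustration of this "one output vector per input token" property, the sketch below runs a pre-trained decoder-only Transformer (GPT-2, again just a convenient stand-in) over a short input and inspects the shapes:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")   # decoder-only Transformer (GPT-2) without the LM head

inputs = tokenizer("Open source LLMs promote transparency.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The embedding layer maps each token id to a learned vector; the Transformer
# blocks then produce exactly one output vector per input token.
print(inputs["input_ids"].shape)        # [1, seq_len]
print(outputs.last_hidden_state.shape)  # [1, seq_len, hidden_dim] -- same seq_len as the input
```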

Now that each token has an output representation, we can predict the next token: for each token in the sequence, we simply take its output vector and use it to predict the token that follows it. An example of this process is shown below. In practice, for efficiency, the next-token prediction objective is computed for all tokens in the sequence simultaneously (and for all sequences in the mini-batch).


Computing the next-token prediction training objective
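
A minimal sketch of this training objective, assuming we already have model logits of shape [batch, seq_len, vocab_size] (random tensors stand in for a real model here): the labels are simply the input sequence shifted by one position, and a single cross-entropy call scores every position in the mini-batch at once.

```python
import torch
import torch.nn.functional as F

# Suppose `logits` holds one next-token distribution per position,
# produced by a decoder-only Transformer.
batch, seq_len, vocab_size = 2, 8, 100
logits = torch.randn(batch, seq_len, vocab_size)
token_ids = torch.randint(0, vocab_size, (batch, seq_len))

# The target for position t is the token at position t + 1, so we shift:
# positions 0..seq_len-2 predict tokens 1..seq_len-1.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = token_ids[:, 1:].reshape(-1)

# One cross-entropy call covers every position in every sequence of the
# mini-batch -- this is the "all tokens simultaneously" efficiency.
loss = F.cross_entropy(pred, target)
print(loss)
```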

Because of the causal (or masked) self-attention mechanism, each output token vector only considers the current and previous tokens in the sequence when computing its representation. If we used bidirectional self-attention instead, each output vector would be computed by looking at the entire sequence of token vectors, allowing the model to solve the next-token prediction task by simply copying the next token in the sequence. Therefore, to predict the next token, we must use masked self-attention. So what is self-attention, and what is a Transformer? These questions are discussed in depth below.

A quick note: the term "language model" is sometimes used to refer to models that are not solely focused on next-token prediction. For example, some consider BERT [18] a "language model", but it is trained with a Cloze-style objective and is not a generative model. Language models that focus on next-token prediction are therefore often called "causal" language models. The terms (generative model and causal language model) are used interchangeably below to refer to models that focus on next-token prediction.

Transformer architecture and its variants


(Quoted from [17])

All language models use some variant of the Transformer model. This architecture was originally proposed by Google researchers to solve sequence-to-sequence tasks. However, the architecture has subsequently been extended to solve a variety of different problems, from assessing the semantic similarity of text to image classification. The original form of the Transformer architecture consisted of two components:

  • Encoder: each encoder block performs bidirectional self-attention and a pointwise feed-forward transformation, separated by residual connections and LayerNorm.

  • Decoder: each decoder block performs causal self-attention, cross-attention (i.e., attention between decoder and encoder tokens), and a pointwise feed-forward transformation, again separated by residual connections and LayerNorm.

When both components of the architecture are present, the encoder processes the input sequence and produces an output sequence. The decoder then generates its own output sequence based on the encoder's output sequence as input. In other words, the encoder processes the entire input sequence to form a representation, and the decoder uses this representation as context when generating output. In summary, a Transformer takes a sequence as input and produces a new sequence as output.


(Quoted from [17])

Decoder-only and encoder-only Transformers. Almost all causal language models use the decoder-only Transformer as their underlying architecture, which is simply the standard Transformer with the encoder removed (see the figure above). Additionally, the cross-attention in each decoder block is removed, since there is no encoder to attend to. Alternatively, we can build an encoder-only architecture by using only the encoder portion of the model. Encoder-only architectures such as BERT [18] excel at a variety of discriminative natural language tasks but cannot be used to generate text.

Why choose a decoder? The choice to build a language model using a decoder-only architecture (rather than an encoder-only or a full encoder-decoder Transformer) is not an arbitrary decision. Instead, this choice is driven by the use of next token prediction when training the language model. Using a masked self-attention mechanism in the decoder ensures that the model cannot look at subsequent tokens in the sequence when predicting the next token. Otherwise, the next token prediction will be meaningless because the model can just copy the next token. See below.


Using causal self-attention to predict the next token
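
The sketch below shows, with random query/key/value tensors, how causal (masked) self-attention is typically implemented: scores for positions after the current token are set to negative infinity before the softmax, so each output vector can only attend to the current and previous tokens. This is a simplified single-head version, not any particular model's implementation.

```python
import torch
import torch.nn.functional as F

seq_len, d = 5, 16
q = torch.randn(seq_len, d)
k = torch.randn(seq_len, d)
v = torch.randn(seq_len, d)

scores = q @ k.T / d ** 0.5   # [seq_len, seq_len] raw attention scores

# Causal (masked) self-attention: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))

attn = F.softmax(scores, dim=-1)   # entries above the diagonal become 0
output = attn @ v                  # each output vector sees only current + previous tokens
print(attn)
```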

To avoid cheating when predicting the next token, an encoder-only or encoder-decoder Transformer would have to exclude the true next token from its input sequence. To do this, we could: i) ingest a prefix; and ii) predict the token that follows the prefix. However, this approach is inefficient, because we can only predict one next token at a time. In contrast, thanks to masked self-attention, a decoder-only model can receive the complete token sequence and apply the language modeling objective to every token in the sequence at once. Furthermore, some studies [12] show that decoder-only architectures perform best at next-token prediction.

How to generate text? Based on the decoder-only architecture described above, text generation follows a simple autoregressive process. We just keep predicting the next token, add it to the input, and repeat the process. As shown below:


Generate text using a language model
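
A minimal sketch of this autoregressive loop, using GPT-2 from Hugging Face transformers as a stand-in model and greedy decoding (real deployments usually use smarter sampling strategies):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("Open source language models", return_tensors="pt")

# Autoregressive generation: predict the next token, append it to the input, repeat.
for _ in range(20):
    with torch.no_grad():
        logits = model(input_ids).logits          # [1, seq_len, vocab_size]
    next_token = logits[0, -1].argmax()           # greedy: pick the most likely next token
    input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```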

Training and using language models

To better understand language models, we need a quick look at how these models are typically trained and applied in practice. While a great deal of research has been done in this area, most language models are trained with a few standard techniques, shown in the figure below.


LLM training components (Quoted from [19])

Language models can be trained in many different ways. This article focuses on pre-training, alignment, and in-context learning. Together, these three aspects cover most of what is needed to understand how language models are trained and used in practice.

Pre-training is the initial step in creating an LLM and is by far the most compute-intensive. Starting from a randomly initialized LLM, we train the model with the language modeling objective on a large corpus of raw text curated from many different sources. Previous research [1] has shown that by pre-training a very large model (with many parameters) on a large-scale dataset, we obtain a base model that can accurately complete a variety of tasks simply by performing next-token prediction. To get the best results, we need to scale both the data and the model size.

What else do we need? As GPT-3 [1] and Chinchilla [15] show, language models can achieve strong performance through pre-training alone. However, before the introduction of models like ChatGPT, LLMs were not widely popular, because merely predicting the next token is not especially compelling. Although accurate next-token prediction can produce reasonable text, models often generate output that is repetitive, simple, and of limited value. We need ways to make LLMs produce output that humans find more valuable and interesting.


Figure 4: Metadata results on the API distribution. Note that results are combined across model sizes due to dataset size limitations; see Appendix E.2 for an analysis that includes model size. Compared to GPT-3, the PPO models are better suited as customer assistants, better at following explicit constraints and correct instructions, and less prone to "hallucination" (fabricating information in tasks such as summarization). (Quoted from [19])

Alignment refers to fine-tuning a language model so that it better matches the expectations of human users. This is mainly achieved with two techniques: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). The desired behavior of a language model depends heavily on the context or application in which it is deployed, and alignment is a general-purpose tool for fine-tuning any language model to behave in a particular way. Recent research suggests that models do not learn new knowledge during alignment; rather, alignment teaches them how to better format and present the knowledge they acquired during pre-training.
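
As a rough illustration of the SFT side of alignment (RLHF is considerably more involved), the sketch below fine-tunes a causal LM on a single prompt/response pair, computing the next-token loss only on the response tokens. The prompt and response text are made up for illustration, and GPT-2 is again just a convenient stand-in model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "User: Summarize the benefits of open source LLMs.\nAssistant:"
response = " They enable transparency, reproducibility, and community-driven research."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# Supervised fine-tuning: the standard next-token loss, but only on the response
# tokens (label -100 is ignored by the loss), so the model learns to produce the
# demonstrated behavior rather than to reproduce the prompt.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(full_ids, labels=labels).loss
loss.backward()   # one gradient step of SFT (optimizer omitted for brevity)
print(loss.item())
```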

Applying the LLM. After pre-training and fine-tuning (or alignment), the final step is to specialize the model for our desired application. This may require additional fine-tuning on domain-specific data, but it does not always require more training. In fact, we can achieve a lot through in-context learning alone, as shown in the figure below.


(Quoted from [1])

Simply put, in-context learning refers to using a single general-purpose base model (such as a pre-trained LLM) to solve a variety of problems. Because language models have a generic text-to-text interface, this is easy to do: we simply construct a textual prompt describing the problem and provide it as input to the LLM. See below.


Different prompt variations for solving arithmetic problems

Next, the language model generates an answer to the question as its output, so we can solve different problems simply by modifying the input prompt. The process of constructing good prompts is called prompt engineering, which can be divided into two parts (a minimal prompt sketch follows this list):

  • Practical prompt engineering

  • Advanced prompt engineering
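
Here is a minimal sketch of what a few-shot prompt for the arithmetic example above might look like. GPT-2 is again used only as a stand-in (it is far too small to answer such prompts reliably), so the point is the prompt format rather than the answer quality.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# A few-shot prompt: the task is specified entirely through the text itself,
# and the model is expected to continue the pattern.
few_shot_prompt = """Answer the arithmetic question.

Q: What is 12 + 7?
A: 19

Q: What is 25 + 30?
A: 55

Q: What is 41 + 8?
A:"""

# Any causal LM with a text-completion interface can consume such a prompt.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer.encode(few_shot_prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=5, do_sample=False)

# Print only the newly generated tokens (the model's "answer").
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:]))
```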

2

Initial attempts at open-source LLMs

Given the high cost of pre-training, it took the research community some time to push for the creation of open-source LLMs; in the meantime, proprietary models such as GPT-3 became the standard. However, once the first few models were released, progress on open-source LLMs opened the floodgates and became unstoppable (almost too fast). Below, we look at some of the early models; more recent open-source LLM releases will be covered later in this series.

GPT-NeoX-20B

GPT-NeoX-20B [6] is one of the earliest open-source LLMs. Developed by EleutherAI, it has 20 billion parameters. GPT-NeoX-20B builds on the original GPT-Neo model (2.7 billion parameters) [22] and was pre-trained on the Pile dataset, demonstrating impressive few-shot learning performance on a variety of natural language benchmarks (comparable to GPT-3). Although this model is small compared to GPT-3 (20 billion vs. 175 billion parameters), it was the largest open-source language model released at the time. Moreover, all of the code and weights used to train and evaluate the model were released under the Apache 2.0 license, allowing commercial use.


(Quoted from [8])

Model architecture. GPT-NeoX-20B adopts the standard decoder-only Transformer architecture but modifies it in two ways:

  • RoPE embeddings

  • Parallel attention and feedforward layers

RoPE embeddings (shown above) are an improvement over standard positional embeddings that provide a different way of injecting positional information into the self-attention operation. This approach strikes a better balance between absolute and relative position information and is used in many other models (e.g., PaLM [9] and Falcon-40B [10]) to improve performance on tasks with long sequence lengths. In addition, running the attention and feed-forward layers in parallel (see the figure below) increases training throughput by 15% with minimal degradation in performance.


Execute attention and feed-forward layers in parallel
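
The sketch below shows the idea in simplified PyTorch: instead of applying attention and then the feed-forward network sequentially, both read the same (normalized) input and their outputs are added to the residual stream in one step. This is only an illustration of the parallel formulation; GPT-NeoX-20B's real blocks also include rotary embeddings, causal masking, and other details omitted here.

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Sketch of a Transformer block with parallel attention and feed-forward
    layers: both sub-layers read the same input, and their outputs are summed,
    instead of running one after the other."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln_attn = nn.LayerNorm(d_model)
        self.ln_ff = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, attn_mask=None):
        # Standard block:  x = x + attn(ln(x));  x = x + ff(ln(x))
        # Parallel block:  x = x + attn(ln(x)) + ff(ln(x))
        h = self.ln_attn(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        return x + a + self.ff(self.ln_ff(x))

x = torch.randn(2, 10, 64)
block = ParallelBlock(d_model=64, n_heads=4)
print(block(x).shape)   # [2, 10, 64]
```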

Interestingly, the authors created a custom tokenizer for GPT-NeoX-20B. It is similar to the GPT-2 tokenizer [11], but it was retrained on the Pile, a large and diverse text corpus, and modified to tokenize whitespace characters more consistently. The resulting tokenizer is not only trained on a high-quality corpus but is also especially effective at tokenizing code (i.e., text with lots of whitespace). As a result, some open-source models (such as MPT-7B [5]) still use this tokenizer today.


(Quoted from [6])

Performance. GPT-NeoX-20B was compared against GPT-3 and other open-source models such as GPT-J. As the evaluation results show, GPT-NeoX-20B performs very well on common language modeling tasks (even relative to proprietary models); see the figure above. Notably, while GPT-3 tends to deliver the best overall performance, GPT-NeoX-20B performs remarkably well for its size, even outperforming proprietary models with similar parameter counts.


The performance of GPT-NeoX-20B is not state-of-the-art, but considering its size, the model performs exceptionally well, even compared to more recent models!

Open Pre-trained Transformer (OPT) language models


The details of the Open Pre-trained Transformer (OPT) library have already been discussed in a previous article.

OPT overview. OPT was proposed by Meta AI with the aim of making powerful LLMs openly available to the public. The library includes multiple LLMs of different sizes, with parameter counts ranging from 125 million to 175 billion. The models were pre-trained on curated datasets drawn from sources including Reddit, the Pile, and BooksCorpus. The largest model, OPT-175B, was one of the first large open-source LLMs. In addition, the models come with a code repository and even a logbook detailing the pre-training process for all models. Although the OPT models cannot be used commercially, they are a highly influential resource that did much to promote the open availability of LLMs for research.

Impact of OPT. The OPT language models were one of the first attempts to make LLMs openly available to the research community, aiming to move LLMs out from behind paid APIs and make them fully open source. In addition, OPT's open-source training code provides an efficient training framework that uses common techniques such as FSDP and tensor parallelism. This code achieves roughly 17% better resource utilization than results published directly by NVIDIA, making it a valuable resource for training LLMs.


(Quoted from [5])

The training notes and logbook released with OPT provide a wealth of (previously unavailable) insight into the LLM training process. Through these resources, we can better understand the full cost of training an LLM and the many problems that can arise along the way (e.g., loss spikes, hardware failures). These difficulties of training LLMs became a hot topic of discussion and have (for the most part) been addressed by subsequent work on open-source LLMs. See the figure above.


(Quoted from [4])

How does OPT perform? When it was proposed, OPT-175B was extensively compared with popular models of the time and found to achieve performance comparable to GPT-3 in zero-shot and few-shot settings; see the figure above. Overall, OPT's performance is not outstanding: it is generally agreed that the model lags behind proprietary models in quality. Despite this mediocre performance, OPT was an important step for AI research and significantly increased interest in open-source LLMs. This impact matters, because the dominance of proprietary models was increasingly being accepted as the new normal.
 

BLOOM: an open multilingual language model 

“Research labs in academia, nonprofits, and smaller companies find it difficult to create, study, or even use LLMs, which are freely accessible only to a handful of industrial labs with the necessary resources and exclusive access.” ——Quoted from [12]

BLOOM is a 176-billion-parameter LLM trained as part of a large-scale open collaboration of AI researchers (more than 1,000 participants) called the BigScience Research Workshop. The workshop ran for one year (May 2021 to May 2022) and aimed to: i) create a large-scale multilingual text dataset; and ii) train a multilingual LLM on that dataset. The resulting model, slightly larger than GPT-3 and open-sourced under the Responsible AI License (RAIL), can generate text in 46 natural languages and 13 programming languages.

The dataset developed for training BLOOM, called the ROOTS corpus, consists of 498 Hugging Face datasets covering 46 natural languages and 13 programming languages and containing over 1.6 TB of text. The distribution of this dataset across languages is shown below.


(Quoted from [12])

After obtaining the raw data, the authors applied a series of different quality filters to remove non-natural language text. The exact filtering components used depend on the source of the data, and these components are further elaborated in Section 3.1.3 of [12]. However, the entire processing pipeline has a common goal: to filter out as much low-quality text as possible.


(Quoted from [12])

BLOOM uses a standard decoder-only Transformer architecture. However, as shown in the figure above, BLOOM makes a few modifications to this architecture, such as:

  • ALiBi [13]: instead of standard positional embeddings, ALiBi adds a distance-dependent bias to the attention scores, which helps the model generalize to context lengths longer than those seen during training (see the sketch after this list).

  • Embedding layer normalization: an extra layer normalization is applied immediately after the model's embedding layer, which was empirically found to improve training stability.
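
To make the ALiBi idea concrete, here is a small sketch of the bias term it adds to attention scores, assuming the number of heads is a power of two (the slope schedule follows the geometric sequence described in [13]); BLOOM's actual implementation differs in its details.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Sketch of the ALiBi attention bias: a fixed, head-specific linear penalty
    on attention scores that grows with the distance between query and key
    positions (no positional embeddings are learned)."""
    # Head-specific slopes: a geometric sequence, shown here for the simple
    # case where n_heads is a power of two.
    slopes = torch.tensor([2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)])
    # distance[i, j] = j - i (only j <= i matters once the causal mask is applied)
    positions = torch.arange(seq_len)
    distance = positions[None, :] - positions[:, None]          # [seq_len, seq_len]
    return slopes[:, None, None] * distance[None, :, :]          # [n_heads, seq_len, seq_len]

# The bias is simply added to the raw attention scores before the softmax:
# scores = q @ k.T / sqrt(d) + alibi_bias(...)   (plus the causal mask)
print(alibi_bias(n_heads=8, seq_len=5)[0])
```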

Overall, BLOOM is not radically different from most LLMs. Notably, in [12] the authors conducted a comprehensive analysis of different Transformer architectures (encoder-only, encoder-decoder, and decoder-only models) and found that decoder-only models (used by almost all causal language models) achieve the best performance after pre-training.

"The research results show that after pre-training, the causal decoder-only model performs best, further validating the decision to choose SOTA LLM." ——Quoted from [12]

How does BLOOM perform? Compared with other open-source LLMs, BLOOM performs relatively well. On natural language benchmarks, it achieves results comparable to or better than OPT, and it performs especially well on machine translation tasks thanks to its multilingual training corpus. See the figure below.


(Quoted from [12])

However, BLOOM's performance still falls short of the top proprietary models. For example, on the HumanEval benchmark (see the figure below), BLOOM's coding ability is far behind alternatives such as Codex [14]. Furthermore, comparing BLOOM with models like Chinchilla [15] and PaLM [9] makes it clear that the open-source models of this era performed far worse than their proprietary counterparts. In other words, despite BLOOM's scale, open-source LLM research still lagged behind.


(Quoted from [12])

Other important models

This article has tried to summarize the important models proposed in the early days of open-source LLM research, but there are a few other notable models worth mentioning.

GPT-J [21] is a 6-billion-parameter, English-only causal language model released before GPT-NeoX-20B [6]. Like GPT-NeoX-20B, it was pre-trained on the Pile dataset, and at the time of its release it was the largest publicly available GPT-3-style language model.

(Quoted from [20])

GLM [20] is more of a pre-training objective than a traditional language model. It explores the idea of unifying different pre-training techniques (such as those of BERT, T5, and GPT) by introducing an autoregressive blank-infilling objective: masked spans in a sentence are predicted autoregressively, similar to how a language model predicts the next token. See the figure above. Although the models explored are quite small (<1 billion parameters), GLM performs well on several popular natural language processing benchmarks, outperforming BERT, T5, and GPT.

3

Future directions


The evolution of open source LLM research

Considering that the models produced by these initial open-source LLM attempts lagged far behind proprietary models in performance, it is reasonable to ask: how can their performance be improved? As this area of research has developed, we have seen active exploration in two main directions:

  • Creating better base LLMs

  • Fine-tuning (i.e., aligning and imitating) open-source LLMs

Given that open-source LLMs are accessible to everyone, research in these areas has progressed remarkably quickly: in less than a year, we went from OPT to near-state-of-the-art models such as LLaMA-2 and Falcon-40B [10].

“We believe that the greatest potential for improving open source models lies in addressing the difficult challenge of creating better underlying LMs.”—quoted from [16]

During this period, these two research directions proceeded in parallel, and each produced techniques that are valuable to AI practitioners.

In the following article, I will provide an overview of these two fields and their respective key contributions, exploring how initial open source LLM attempts evolved into highly capable models like LLaMA-2.

References

[1] Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.

[2] Rae, Jack W., et al. "Scaling language models: Methods, analysis & insights from training gopher." arXiv preprint arXiv:2112.11446 (2021).

[3] Smith, Shaden, et al. "Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model." arXiv preprint arXiv:2201.11990 (2022).

[4] Zhang, Susan, et al. “OPT: Open Pre-trained Transformer Language Models.” arXiv preprint arXiv:2205.01068 (2022).

[5] “Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable Llms.” MosaicML, 5 May 2023, www.mosaicml.com/blog/mpt-7b.

[6] Black, Sid, et al. "Gpt-neox-20b: An open-source autoregressive language model." arXiv preprint arXiv:2204.06745 (2022).

[7] Gao, Leo, et al. "The pile: An 800gb dataset of diverse text for language modeling." arXiv preprint arXiv:2101.00027 (2020).

[8] Su, Jianlin, et al. "Roformer: Enhanced transformer with rotary position embedding." arXiv preprint arXiv:2104.09864 (2021).

[9] Chowdhery, Aakanksha, et al. "Palm: Scaling language modeling with pathways." arXiv preprint arXiv:2204.02311 (2022).

[10] “Introducing Falcon LLM”, Technology Innovation Institute, 7 June 2023, https://falconllm.tii.ae/.

[11] Radford, Alec, et al. "Language Models are Unsupervised Multitask Learners."

[12] Scao, Teven Le, et al. "Bloom: A 176b-parameter open-access multilingual language model." arXiv preprint arXiv:2211.05100 (2022).

[13] Press, Ofir, Noah A. Smith, and Mike Lewis. "Train short, test long: Attention with linear biases enables input length extrapolation." arXiv preprint arXiv:2108.12409 (2021).

[14] Chen, Mark, et al. "Evaluating large language models trained on code." arXiv preprint arXiv:2107.03374 (2021).

[15] Hoffmann, Jordan, et al. "Training compute-optimal large language models." arXiv preprint arXiv:2203.15556 (2022).

[16] Gudibande, Arnav, et al. "The false promise of imitating proprietary llms." arXiv preprint arXiv:2305.15717 (2023).

[17] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

[18] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

[19] Ouyang, Long, et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35 (2022): 27730-27744.

[20] Du, Zhengxiao, et al. "Glm: General language model pretraining with autoregressive blank infilling." arXiv preprint arXiv:2103.10360 (2021).

[21] Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 billion parameter autoregressive language model, 2021.

[22] Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large scale autoregressive language modeling with MeshTensorflow.


Notes
 

1. After LLaMA-2 was proposed, it officially replaced Falcon-40B as the state of the art among open-source LLMs. More on this in part two of this series!

2. Currently, the most commonly used tokenization technology in LLM is byte pair encoding tokenization (https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt).

3. These tasks take sequences as input and produce sequences as output, such as language translation or text summarization.

4. This simply means that the same feedforward transformation is applied individually to the embedding of each token vector in the input sequence.

5. A residual connection simply means that we add the input of a module to its output. In other words, if a module performs an operation given by the function f(x), the same operation with a residual connection takes the form g(x) = f(x) + x.

6. What this means is that, given a starting input sequence, we sequentially: i) generate an output; ii) add this output to our input sequence; iii) repeat.

7. Following the OPT proposal, Meta AI has continued to contribute to open-source LLM research, releasing a variety of models such as OPT-IML, LLaMA, LIMA, and LLaMA-2.

8. For almost all of the languages it covers (such as Spanish, French, and Arabic), BLOOM is the first language model with over 100B parameters trained on that language.

9. Research on fine-tuning open-source LLMs does not diminish the value of creating better base LLMs; better base models also benefit from fine-tuning!
