Large models that "burn money": a preliminary look at cost breakdown and inference optimization methods

Editor's note: The cost of large models has always been a major concern. This article focuses on the cost of training large language models (LLMs), and briefly introduces what LLM is and some techniques for optimizing the performance of large model inference.

Although it is difficult to predict exactly how LLMs will develop in the future, it is certain that once the cost problem is solved, LLMs will become an indispensable part of our lives!

The following is the translation. Enjoy!

Author | Dmytro Nikolaiev (Dimid)

Compiled by | Yue Yang

In the past, machine learning was considered a complex, specialized technology that only a few people could understand. However, as machine learning applications have become more powerful, public interest has grown, resulting in a flood of content about artificial intelligence. The first climax came in November 2022 with ChatGPT, and a second wave followed when GPT-4 was released in March 2023, leaving people amazed by what these networks can do.

Artificial intelligence has attracted a great deal of public attention, and a huge amount of AI content has appeared on the Internet. Some of it is undoubtedly valuable, but a significant amount spreads fear and misinformation, such as claims that artificial intelligence will replace all human jobs or promises of neural-network secrets that can make huge fortunes. Therefore, it is increasingly important to dispel misconceptions about machine learning and large language models (LLMs) and to provide valuable content that helps people better understand these technologies.

This article discusses something that is often overlooked or misunderstood in machine learning today: the cost of training large language models. It also briefly introduces what an LLM is and some techniques that can be used to optimize large-model inference. Through this overview, I hope to convince readers that these technologies did not come out of thin air; understanding the scale of the data and the underlying computation helps us better understand these powerful tools.

Most of the time, this article relies on Meta AI's recently published LLaMA paper [1], as it clearly shows the amount of data and computation the team used to train these models. The article is divided into the following sections:

  • First, a brief introduction to what the latest LLMs are;

  • Then, a discussion of the cost of training these models;

  • Finally, a brief overview of some techniques for optimizing model inference.

As you delve into the world of large language models, you'll find that they can be very simple and complex at the same time.

01 Introduction to Large Language Models

Before we discuss the fees and costs associated with training a large language model (LLM), let's first briefly define what a language model is.

Figure 1: The number of parameters of several language models released in 2018-2019 (from the DistilBERT paper). Today's LLMs typically have tens to hundreds of billions of parameters.

In simple terms, a language model is a machine learning algorithm designed to understand or generate natural human language. Recently, generative language models have become more and more popular, including the GPT series developed by OpenAI: ChatGPT, GPT-4, etc. (GPT stands for Generative Pre-trained Transformer, so named because it is based on the Transformer architecture [2]).

There are also models that are less well known but still important, for example GPT-3 (175B) [3], BLOOM (176B) [4], Gopher (280B) [5], Chinchilla (70B) [6] and LLaMA (65B) [7], where B denotes billions of parameters. Many of these models also have versions with fewer parameters.

Some popular architectures for LLMs. Image courtesy of the author

There is currently no official information on the number of parameters of ChatGPT, and especially GPT-4, but they appear to be of a comparable scale.

These models are "trained" by using large amounts of textual data, enabling them to learn the complex patterns and structures of natural language. However, the task they solve during training is quite simple: predict the next word (or token) in a sequence.

Such a model is known as an autoregressive model, which means it uses its past outputs as input for future predictions and generates its output step by step. This can be seen in a sample output from ChatGPT:

ChatGPT generating a reply (GIF captured by the author while using ChatGPT)

You can see that ChatGPT generates its answer step by step, and the generated pieces are sometimes incomplete word fragments; these fragments are called tokens.

At each step, the model appends its previous output to the current input and continues generating until it reaches a special End-of-Sequence (EOS) token. For simplicity, the prompt is omitted here and whole words are used as tokens. This process can be illustrated as follows:

Text generation with an autoregressive model. Image courtesy of the author
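To make this loop concrete, below is a minimal sketch of greedy autoregressive decoding in Python, using a small open model (GPT-2 via the Hugging Face transformers library) as a stand-in. The model choice, the greedy decoding, and the 20-token limit are assumptions made only for this sketch; ChatGPT works on the same principle but at a far larger scale and with more sophisticated sampling.

```python
# A minimal sketch of autoregressive (next-token) generation, assuming GPT-2 as a
# stand-in model; ChatGPT works on the same principle at a much larger scale.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Large language models are", return_tensors="pt").input_ids

for _ in range(20):                                # generate at most 20 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits           # scores over the whole vocabulary
    next_id = logits[0, -1].argmax()               # greedy choice of the next token
    if next_id.item() == tokenizer.eos_token_id:   # stop at the End-of-Sequence token
        break
    # append the new token and feed the extended sequence back into the model
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```

Each iteration feeds the whole extended sequence back into the model, which is part of the reason inference is expensive and why the optimization techniques discussed later in this article matter.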

This simple mechanism, combined with an enormous amount of data (more than a person could read in a lifetime), enables the model to generate coherent and contextually appropriate text, mimicking the way humans write.

If we are only talking about generative models here, what about the other model families?

The reason is simple: text generation is one of the hardest tasks to solve and one of the most impressive to humans. ChatGPT gained one million users within five days [8], faster than any app before it, and the momentum continues [9].

So-called encoder models [10] (the BERT family) may be less exciting to the public, but they can also solve a variety of problems at a human level and help with tasks such as text classification [11] and named entity recognition (NER) [12].

I won't give concrete examples of what large language models can do, because such examples are already all over the web. The best way is to try ChatGPT yourself, and there are also excellent prompt collections such as Awesome ChatGPT Prompts. Despite their amazing capabilities, large language models currently have some limitations. The most common and important ones include:

  • Bias and static knowledge: since LLMs are trained on data from many sources, they can inadvertently learn and reproduce the biases present in that data. Furthermore, their knowledge is static: they cannot adapt to new data or update their knowledge in real time without retraining.

  • Incomplete understanding of the input and misinformation: while LLMs can generate human-like text, they do not always fully understand the context of the input. Moreover, the autoregressive way in which the output is generated does not prevent the model from producing falsehoods or meaningless content.

  • High resource consumption: training LLMs requires enormous computing resources, which leads to high training costs and energy consumption. This can limit the development of LLMs by smaller companies or individual researchers.

These and other shortcomings are hot topics in the AI research community. It is worth mentioning that the field is developing so fast that it is hard to predict which shortcomings or limitations will be overcome within a few months, but there is no doubt that new ones will emerge.

Earlier models simply increased the number of parameters, but it is now considered better to train smaller models on more data for a longer time. This reduces model size and therefore the cost of using the model later.

Now that we have a general understanding of LLMs, let's move on to the main part of this article: estimating the cost of training a large language model.

02 Estimating the cost of training machine learning models, and LLMs in particular

To estimate the cost of training a large language model, three key factors must be considered:

  • data

  • computing resources

  • and the architecture (or the algorithm itself)

Let us now dig into these three aspects and understand their impact on training cost.

2.1 Data

LLMs require large amounts of data to learn the patterns and structure of natural language. Estimating the cost of data can be challenging because companies often combine data accumulated over years of business operations with open-source datasets.

Also, keep in mind that data needs to be cleaned, labeled, organized, and stored. At the scale of LLMs, data management and processing costs add up quickly, especially when you factor in the infrastructure, tools, and data engineers these tasks require.

As a concrete example, LLaMA was trained on a dataset containing 1.4 trillion tokens, with a total size of 4.6 TB!
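A quick back-of-the-envelope check of those figures; treating one token as roughly one word and assuming a fast reader at 300 words per minute are rough assumptions made only for this sketch:

```python
# Rough arithmetic behind the dataset figures quoted above.
dataset_bytes = 4.6e12                 # 4.6 TB of raw text
num_tokens = 1.4e12                    # 1.4 trillion tokens

print(f"~{dataset_bytes / num_tokens:.1f} bytes of raw text per token")   # ~3.3

# How long would it take one person to read this much text?
words_per_minute = 300                 # fast reader
minutes = num_tokens / words_per_minute             # assume ~1 word per token
years = minutes / (60 * 8 * 365)                     # reading 8 hours a day, every day
print(f"~{years:,.0f} years of full-time reading")   # roughly 27,000 years
```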

The training datasets of the LLaMA models (Table 1 from the LLaMA paper)

The smaller models (7B and 13B) were trained on 1T tokens, while the larger models (33B and 65B) used the full dataset of 1.4T tokens.

Training loss of the LLaMA models as a function of the number of training tokens (from the LLaMA paper)

By now it should be clear that we are not exaggerating when we say how large these datasets are, and why training such models was not possible a decade ago. The question of computing resources, however, is even more interesting.

2.2 Computing resources

Computing resources account for a large share of the total cost of training an LLM. Training a large language model requires massive amounts of computation, and because of the need for heavy parallel processing, powerful graphics processing units (GPUs) are used. NVIDIA releases new GPUs every year, and a single high-end data-center GPU costs tens of thousands of dollars.

If cloud computing services are used instead, the cost of training these models can also be staggering, easily reaching millions of dollars, especially considering the need to iterate over various configurations.

Going back to the LLaMA paper: the authors used 2,048 GPUs, each with 80 GB of memory, and even with that much computing power it took about 21 days to train the largest 65B model.

The amount of compute used to train the LLaMA models (from the LLaMA paper)

The NVIDIA A100 GPU used by the authors is a common choice for neural network training today. Google Cloud Platform offers such GPUs at $3.93 per hour.

Price of NVIDIA A100 GPU

So let's do a quick calculation:
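Here is that calculation in Python, using the GPU count, training duration, and hourly price quoted above (on-demand pricing with no discounts is an assumption of this sketch):

```python
# Estimated cloud bill for the largest LLaMA training run, using the numbers above:
# 2,048 A100 80GB GPUs for roughly 21 days at $3.93 per GPU-hour on GCP.
num_gpus = 2048
days = 21
price_per_gpu_hour = 3.93             # USD, on-demand A100 price quoted above

gpu_hours = num_gpus * days * 24
cost = gpu_hours * price_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours  ->  ${cost:,.0f}")
# 1,032,192 GPU-hours  ->  $4,056,515  (roughly four million dollars for a single run)
```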

Four million dollars is not something every researcher can afford, right? And that is for a single run! For comparison, this blog post [13] estimates the training cost of GPT-3 at 355 GPU-years and about 4.6 million US dollars.

2.3 Architecture (and Infrastructure)

The development of a state-of-the-art LLM also requires skilled researchers and engineers to design a reasonable architecture and properly configure the training process. The architecture is the foundation of the model and determines how it learns and generates text.

Designing, implementing, and controlling these architectures requires expertise across several areas of computer science. Engineers and researchers capable of delivering such projects can command salaries in the hundreds of thousands of dollars. Note that the technology stack required to train an LLM can be quite different from that of a "classic" machine learning engineer.

The infrastructure surrounding a machine learning system, from the paper "Hidden Technical Debt in Machine Learning Systems" [14]

Training an LLM is a very difficult and resource-intensive engineering problem. Let us now briefly discuss some ways to make LLM inference more efficient and cost-effective.

03 Optimizing language model inference

3.1 Do we really need optimization?

Inference is the process of using a trained language model to generate predictions or responses, usually exposed as an API or web service. Given how resource-hungry LLMs are, they must be optimized for efficient inference.

For example, the GPT-3 model has 175 billion parameters, which corresponds to 700 GB of float32 numbers. The activations require roughly the same amount of memory again, and note that we are talking about RAM.

To make predictions without using any optimization techniques, we would need 16 A100 GPUs with 80GB of video memory!
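The memory figures above follow from simple arithmetic. A small sketch, counting only the float32 weights and using decimal gigabytes:

```python
# Why GPT-3 needs so much memory: 175 billion parameters stored as float32.
import math

params = 175e9                         # GPT-3 parameter count
bytes_per_param = 4                    # float32 = 4 bytes

weight_bytes = params * bytes_per_param
print(f"weights alone: {weight_bytes / 1e9:.0f} GB")              # ~700 GB

a100_memory = 80e9                     # one A100 with 80 GB
print(f"A100s just to hold the weights: {math.ceil(weight_bytes / a100_memory)}")
# 9 GPUs for the weights alone; the count grows further once activations
# and framework overhead are included.
```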

There are several popular techniques that can help reduce memory requirements and model latency, including model parallelism, model quantization, and more.

3.2 Model Parallelism

Model parallelism [15] distributes the computation of a single model across multiple GPUs and can be used for both training and inference. Splitting a model's layers or parameters across multiple devices can significantly increase overall inference speed and is often used in practice.
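A minimal sketch of the simplest, pipeline-style form of this idea in PyTorch, assuming two CUDA devices are available; production systems (e.g. Megatron-LM, DeepSpeed) split the work in far more sophisticated ways:

```python
# Naive model parallelism: put the first half of the layers on GPU 0 and the
# second half on GPU 1, moving activations between devices in forward().
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    def __init__(self, dim=4096, layers_per_device=4):
        super().__init__()
        make_block = lambda: nn.Sequential(*[nn.Linear(dim, dim) for _ in range(layers_per_device)])
        self.part1 = make_block().to("cuda:0")   # first half of the layers
        self.part2 = make_block().to("cuda:1")   # second half of the layers

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        x = self.part2(x.to("cuda:1"))           # transfer activations to the second GPU
        return x

model = TwoDeviceModel()
out = model(torch.randn(8, 4096))
print(out.shape)                                  # torch.Size([8, 4096])
```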

3.3 Model Quantization

Model quantization [16] involves reducing the precision of model values such as the weights. By converting floating-point numbers to lower-precision integers, quantization can achieve significant memory savings and faster computation without a substantial loss of model quality. Here is a simple idea: using float16 instead of float32 halves the memory footprint. It turns out that model weights can even be converted to int8 with almost no loss of accuracy.
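A minimal numerical illustration of the idea, assuming a simple per-tensor int8 scheme; production libraries use more sophisticated methods, but the memory saving comes from the same place:

```python
# Post-training quantization in miniature: map float32 weights to int8 with a
# single per-tensor scale, cutting memory per weight from 4 bytes to 1.
import numpy as np

weights = np.random.randn(5).astype(np.float32)   # stand-in for a model weight tensor
scale = np.abs(weights).max() / 127               # map the largest magnitude to +/-127

quantized = np.round(weights / scale).astype(np.int8)
restored = quantized.astype(np.float32) * scale   # approximate reconstruction

print(weights)
print(restored)                                   # close to the original values
print(f"memory: {weights.nbytes} bytes -> {quantized.nbytes} bytes")
```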

3.4 Other techniques

Optimizing LLM inference remains an active area of research. Other techniques include:

  • Knowledge distillation [17]: training a smaller student model to mimic the behavior of a larger teacher model (a small sketch of the distillation loss follows this list);

  • Parameter pruning [18]: removing redundant or unimportant parameters from the model to reduce its size and computational requirements;

  • Using frameworks such as ONNX Runtime (ORT) [19] to optimize computation graphs through techniques like operator fusion and constant folding.
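As promised above, here is a minimal sketch of the knowledge-distillation loss; the temperature and weighting values are illustrative assumptions rather than values from any particular paper:

```python
# Knowledge distillation in miniature: the student matches the teacher's softened
# output distribution in addition to the usual cross-entropy on the true labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                     # soft targets from the teacher
    hard = F.cross_entropy(student_logits, labels)  # hard targets from the data
    return alpha * soft + (1 - alpha) * hard

# Toy usage: a batch of 4 examples over a 10-class output.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```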

Overall, optimizing the inference of large language models is an important aspect of LLM deployment. By applying various optimization techniques, developers can ensure that their LLMs are not only powerful and accurate, but also cost-effective and scalable.

04 Why did OpenAI open ChatGPT to the public?

Given the high cost of training and serving large language models, one might wonder why OpenAI made ChatGPT publicly available for free. While we cannot know OpenAI's exact motivations, we can analyze the benefits and the strategic reasoning behind this decision.

First, by making a state-of-the-art LLM available to everyone, OpenAI has earned a strong reputation. Demonstrating how large language models work in practice has attracted the attention of investors, customers, and the technology community at large.

Second, OpenAI's mission revolves around the creation and development of artificial intelligence. By opening ChatGPT to the public, the company can be seen as moving closer to fulfilling that mission and preparing society for the changes ahead. Opening up such powerful AI tools encourages innovation and drives AI research forward; such progress can lead to more efficient models, more diverse applications, and new solutions. However, neither the architecture of ChatGPT nor that of GPT-4 has been made public, but that is a topic for another discussion.

While the costs of training and maintaining large language models are undoubtedly significant, opening access to ChatGPT not only increases their popularity and demonstrates OpenAI's leadership in artificial intelligence, but also allows the company to collect more data to train even more powerful models. This strategy lets OpenAI continue to advance its mission and, in a way, makes a remarkable contribution to the development of artificial intelligence and LLM technology.

Asking ChatGPT why OpenAI offers ChatGPT for free

05 Conclusion

As this article has shown, the cost of training a large language model is influenced by many factors: not only expensive computing resources, but also expertise in areas such as large-scale data management and model architecture development.

Today's LLMs generally have tens to hundreds of billions of parameters, are trained on trillions of tokens, and cost millions of dollars to train.

Hopefully you now have a better sense of the cost of training and running large language models, as well as of their limitations and pitfalls.

The field of natural language processing has moved from its multi-year "ImageNet era" [20] to the era of generative models. The widespread adoption of generative language models promises to revolutionize many industries and aspects of our lives. Although it is hard to predict these changes precisely, we can be sure that LLMs will have a significant impact on the world.

Personally, I would rather see us train "smarter" models, not just "bigger" ones. By exploring more elegant ways to develop and deploy LLMs, we can push the boundaries of AI and NLP, opening the door to more innovative solutions and a bright future for the field.

END

References

1. https://ai.facebook.com/blog/large-language-model-llama-meta-ai/

2. https://huggingface.co/course/chapter1/4

3. https://en.wikipedia.org/wiki/GPT-3

4. https://bigscience.huggingface.co/blog/bloom

5. https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval

6. https://arxiv.org/abs/2203.15556

7. https://ai.facebook.com/blog/large-language-model-llama-meta-ai/

8. https://twitter.com/gdb/status/1599683104142430208

9. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/

10. https://huggingface.co/course/chapter1/5

11. https://paperswithcode.com/task/text-classification

12. https://paperswithcode.com/task/named-entity-recognition-ner

13. https://lambdalabs.com/blog/demystifying-gpt-3

14. https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf

15. https://colossalai.org/docs/concepts/paradigms_of_parallelism/

16. https://huggingface.co/docs/optimum/concept_guides/quantization

17. https://neptune.ai/blog/knowledge-distillation

18. https://analyticsindiamag.com/a-beginners-guide-to-neural-network-pruning/

19. https://onnxruntime.ai/

20. https://thegradient.pub/nlp-imagenet/

This article is authorized by the original author and compiled by Baihai IDP. If you need to reprint the translation, please contact us for authorization.

Original link:

https://towardsdatascience.com/behind-the-millions-estimating-the-scale-of-large-language-models-97bd7287fb6b
