OpenAI's GPT-4 has just been "open-sourced" by industry insiders once again!

The leak covers very specific details, including GPT-4's architecture, training and inference infrastructure, parameter count, training dataset, token count, cost, and its Mixture-of-Experts (MoE) design.

In particular, it covers how OpenAI weighs the trade-offs behind different engineering decisions, and how it gets past the biggest bottleneck in large-model inference.

Where did such a weighty leak come from?

The authors of the article are Dylan Patel and Gerald Wong, two contributors at SemiAnalysis.

It is worth mentioning that Dylan Patel was also one of the people behind the earlier leak of Google's internal document ("We have no moat, and neither does OpenAI"), which caused an uproar across the industry.

DeepMind CEO Demis Hassabis recently confirmed the authenticity of that leaked Google engineer document in an interview with The Verge.

Evidently, Dylan Patel does have some inside channels, which lends today's leak a bit more credibility.

Li Zhifei, CEO of Mobvoi (出门问问), also commented on it

Many companies could build GPT-4

In the view of the article's author, the reason OpenAI keeps GPT-4 closed is not to protect humanity from destruction by AI, but because what it has built is reproducible.

He even predicts that in the future, every major internet company and leading AI startup in both China and the United States will be able to build a model that matches or even surpasses GPT-4.

But he also concedes that GPT-4 is a masterpiece from OpenAI, condensing ingenious design, a complex architecture, and all kinds of clever engineering trade-offs.

OpenAI's most durable moat is its feedback from real users, the industry's top engineering talent, and the continued lead conferred by its first-mover advantage.

Model architecture

First of all, the author believes that GPT-4 contains a total of roughly 1.8 trillion parameters across 120 layers, whereas GPT-3 has only about 175 billion parameters.

In other words, the scale of GPT-4 is more than 10 times that of GPT-3.

Earlier rumors online claimed GPT-4 has 100 trillion parameters; that claim has since been debunked

To keep costs reasonable, OpenAI built GPT-4 as a Mixture-of-Experts (MoE) model.

Specifically, GPT-4 has 16 experts, each with roughly 111 billion MLP parameters, and two of those experts are routed to on every forward pass.

Although the literature discusses many advanced algorithms for choosing which experts each token is routed to, the algorithm OpenAI reportedly uses for GPT-4 is actually quite simple.
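
The article does not spell out what that simple algorithm is. For reference, a common simple choice in the MoE literature is top-2 softmax gating; the NumPy sketch below is purely illustrative and makes no claim about OpenAI's actual implementation.

```python
import numpy as np

# A minimal top-2 softmax-gating sketch, purely illustrative: the article only
# says the routing is "very simple" and does not specify OpenAI's actual function.
def top2_route(tokens, gate_weights):
    logits = tokens @ gate_weights                        # (num_tokens, num_experts) gate scores
    top2 = np.argsort(logits, axis=-1)[:, -2:]            # indices of the 2 highest-scoring experts
    sel = np.take_along_axis(logits, top2, axis=-1)       # their scores
    weights = np.exp(sel) / np.exp(sel).sum(axis=-1, keepdims=True)  # softmax over the 2 selected
    return top2, weights

# toy usage: 4 tokens, hidden size 8, 16 experts
rng = np.random.default_rng(0)
expert_ids, expert_weights = top2_route(rng.normal(size=(4, 8)), rng.normal(size=(8, 16)))
print(expert_ids)        # which 2 of the 16 experts each token is sent to
print(expert_weights)    # how their outputs would be mixed
```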

In addition, roughly 55 billion parameters are shared for the attention mechanism.

Each forward inference pass (generating one token) therefore uses only about 280 billion parameters and 560 TFLOPs.

This stands in stark contrast to a purely dense model, which would need all of its roughly 1.8 trillion parameters and about 3,700 TFLOPs on every forward pass.
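
As a rough sanity check on those figures, here is the back-of-the-envelope arithmetic, assuming the usual rule of thumb of about 2 FLOPs per active parameter per generated token; whatever unit convention the leak uses, the point is that only around 15% of the total parameters are touched for each token.

```python
# Back-of-the-envelope arithmetic behind the quoted figures, assuming ~2 FLOPs
# per active parameter per generated token (a common rule of thumb).
expert_params = 111e9          # per MLP expert (quoted)
experts_per_token = 2          # top-2 routing (quoted)
shared_attention = 55e9        # shared attention parameters (quoted)
active = experts_per_token * expert_params + shared_attention
print(f"active parameters per token: ~{active / 1e9:.0f}B")   # ~277B, i.e. the ~280B quoted
print(f"fraction of the 1.8T total:  {active / 1.8e12:.0%}")  # ~15%
print(f"compute per token: ~{2 * active:.2e} FLOPs")          # vs ~{2 * 1.8e12:.1e} for a dense 1.8T model
```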

Dataset composition

OpenAI trained GPT-4 on roughly 13 trillion tokens.

Those 13 trillion tokens are not all unique: because high-quality tokens are scarce, the dataset includes multiple epochs over the same data.

It also includes millions of rows of instruction fine-tuning data, from Scale AI and from internal sources.

However, the author says he could not find much information about the RLHF data.

The context length (seqlen) during pre-training was 8k; the 32k version was fine-tuned from the pre-trained 8k version.

The batch size was ramped up gradually over several days on the cluster, with OpenAI ultimately using a batch size of 60 million tokens.

Of course, since not every expert sees every token, this works out to "only" about 7.5 million tokens per expert.
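
The arithmetic behind that per-expert figure, assuming tokens are spread evenly by the top-2 routing:

```python
# Per-expert batch size implied by the quoted numbers, assuming uniform routing.
global_batch_tokens = 60_000_000
num_experts, experts_per_token = 16, 2
print(global_batch_tokens * experts_per_token / num_experts)   # 7,500,000.0 tokens per expert
```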

Parallelism strategy

The parallelization strategy matters a great deal when training on A100 GPUs.

OpenAI uses 8-way tensor parallelism, because that is the limit imposed by NVLink.

Beyond that, the author has heard that OpenAI uses 15-way pipeline parallelism.

In theory, 15 pipeline stages is rather a lot once data communication and compute time are taken into account.

But given the limits of memory capacity, that many pipeline stages makes sense.

With pipeline and tensor parallelism alone, the FP16 parameters come to about 30 GB per GPU.

Once the KV cache and other overheads are added, this architecture makes sense in theory, provided most of OpenAI's GPUs are 40 GB A100s.
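
That ~30 GB figure follows directly from the parameter count and the degree of parallelism; a quick check, assuming FP16 (2 bytes per parameter) and ignoring KV cache, activations, and optimizer state:

```python
# Rough per-GPU weight footprint under 8-way tensor x 15-way pipeline parallelism.
total_params = 1.8e12
tensor_parallel, pipeline_stages = 8, 15
bytes_per_gpu = total_params * 2 / (tensor_parallel * pipeline_stages)   # FP16 = 2 bytes/param
print(f"~{bytes_per_gpu / 1e9:.0f} GB of weights per GPU")               # ~30 GB, tight on a 40 GB A100
```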

OpenAI may be using ZeRO Stage 1, and possibly block-level FSDP or hybrid sharded data parallelism.

Why not full-model FSDP? Probably because of the higher communication cost.

Although OpenAI has a high-speed network between most nodes, it does not cover all nodes.

At the very least, some parts of the cluster likely have much lower interconnect bandwidth than others.

However, the author says he does not quite understand how OpenAI avoids huge pipeline "bubbles" in every batch with such a high degree of pipeline parallelism; most likely, OpenAI simply eats that cost.

Training cost

Training GPT-4 consumed roughly 2.15e25 FLOPs, running on about 25,000 A100s for 90 to 100 days at a utilization of 32% to 36%.

This extremely low utilization was partly due to the large number of failures, which forced training to restart from earlier checkpoints, and partly due to overheads such as the pipeline bubbles mentioned above.

The wasted training cost in this case is extremely high.

Another reason is that all-reduce among so many GPUs is very expensive.
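
Those headline numbers are mutually consistent; a quick check, assuming the A100's roughly 312 TFLOPS dense FP16/BF16 peak (the GPU count, duration, and 32-36% utilization are as quoted above):

```python
# Sanity check on the ~2.15e25 FLOPs figure, assuming ~312 TFLOPS A100 peak.
gpus, peak_flops, mfu, days = 25_000, 312e12, 0.34, 95
total = gpus * peak_flops * mfu * days * 86_400
print(f"{total:.2e} FLOPs")    # ~2.2e25, consistent with the quoted ~2.15e25
```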

This chart assumes that inefficiencies come from not being able to fuse every operation, from the memory bandwidth the attention mechanism requires, and from hardware overhead equivalent to the parameter reads. In reality, even with an optimized library such as Nvidia's FasterTransformer, the total overhead can be even larger

The author suspects that this cluster is really a group of smaller clusters with weaker networking between them, i.e. non-blocking 800G/1.6T connectivity within each section of the cluster, but only 200G/400G between the sections.

If OpenAI's cloud compute costs roughly $1 per A100-hour, then under these conditions the training run cost about $63 million.

That excludes all of the experiments, failed runs, and other costs such as data collection, RLHF, and staffing.

If you take into account the factors just mentioned, the real cost is much higher.

It also presumes that someone else buys the chips, networking, and data centers, bears the capital expenditure of building the systems, and leases them to OpenAI.

Today, however, at $2 per H100-hour, the same pre-training could be done on about 8,192 H100s in just 55 days, at a cost of around $21.5 million.
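
The cost arithmetic behind those two figures, using the article's assumed rental rates and the quoted GPU counts and durations:

```python
# Cost = GPUs x days x 24h x hourly rate (the rates are the article's assumptions).
def train_cost(gpus, days, usd_per_gpu_hour):
    return gpus * days * 24 * usd_per_gpu_hour

print(f"A100 run: ~${train_cost(25_000, 95, 1.0) / 1e6:.0f}M")  # ~$57M at 95 days; the ~$63M quoted implies ~105 days
print(f"H100 run: ~${train_cost(8_192, 55, 2.0) / 1e6:.1f}M")   # ~$21.6M, matching the quoted ~$21.5M
```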

The figure shows the parameter and token counts of several publicly known advanced models. The line is Google DeepMind's Chinchilla scaling observation (smoothed, with large error bars); each point on the line shows the theoretical FLOPs needed to train a model with that parameter count on that many tokens

The author notes, however, that by the end of this year at least nine companies will have H100 clusters larger than the one described above.

Not all of them will devote an entire cluster to training a single model, but any that do will have models larger than GPT-4.

For example, Meta will have more than 100,000 H100s by the end of this year, although a considerable portion will be spread across its data centers for inference.

Even so, its largest single cluster will exceed 25,000 H100s.

In short, by the end of this year, many companies will have enough computing resources to train GPT-4-sized models.

The table shows the theoretically optimal cost of training each model on Nvidia A100s, without accounting for the staffing, MLOps tooling, data collection/preprocessing, failure recovery, one-shot/few-shot learning examples, inference, and many other costs involved

Trade-offs in Mixture-of-Experts models

MoE (Mixture of Experts) is a great way to reduce the number of parameters used during inference while still increasing the total parameter count.

That larger total is needed so that each training token can encode more information, because obtaining enough high-quality tokens is very difficult.

If OpenAI were truly chasing optimal performance, they would have needed to train on twice as many tokens to get there.

That being said, OpenAI made quite a few trade-offs.

For example, MoE is very hard to deal with at inference time, because not every part of the model is used for every token generated.

This means that some parts may be dormant while other parts are working.

This situation can significantly reduce utilization when servicing users.

Research has shown that using 64-128 experts yields better loss than using 16 experts, but that is just research.

There are many reasons to use relatively few experts; one reason OpenAI chose 16 is that larger numbers of experts struggle to generalize across many tasks.

More experts also make convergence harder to achieve.

For such an enormous training run, OpenAI chose to be conservative about the number of experts.

Using fewer experts also helps their inference infrastructure; there are all sorts of difficult trade-offs involved in moving to a Mixture-of-Experts inference architecture.

The author starts with the basic trade-offs of LLM inference, then turns to the problems OpenAI faces and the choices it has made.

Inference trade-offs

Before getting into those trade-offs, an aside: after talking with every LLM company, the author found that NVIDIA's FasterTransformer inference library is quite bad, and TensorRT even more so.

That means that if Nvidia does not change course, people will need to build their own solutions from scratch.

There are three main trade-offs in large language model inference, along the dimensions of batch size (the number of users served concurrently) and the number of chips used:

1. Latency

The model must respond within a reasonable latency; nobody wants to wait several seconds in a chat application before output starts arriving. Prefill (processing the input tokens) and decode (generating the output tokens) take different amounts of time.

2. Throughput

The model must output a certain number of tokens per second. Humans need about 30 tokens per second. For various other use cases, both lower and higher throughputs are acceptable.

3. Utilization

The hardware running the model must achieve high utilization, or the cost becomes prohibitive. Accepting higher latency and lower throughput lets more user requests be batched together for higher utilization, but it also makes everything harder.

The key to LLM inference is balancing memory bandwidth against compute.

Theoretical bandwidth requirements of LLMs: the largest model that can run on an iPhone 14 is roughly 1 billion FP16 parameters, or roughly 4 billion int4 parameters; this is the fundamental limit for smartphone-based LLMs, and anything larger simply cannot be used

Simply put, every parameter has to be read, and each comes with 2 FLOPs of associated compute.

As a result, the ratio on most chips (an H100 SXM has only 3 TB/s of memory bandwidth but 2,000 TFLOPS of FP8 compute) is completely out of balance for inference at a batch size of 1.

If there is only one user (batch size 1), the memory bandwidth required to read each parameter each time a token is generated dominates the inference time, while the computation time is almost negligible.

To serve many users efficiently, the batch size must exceed 1 so that multiple users share the cost of reading the parameters. For example, with a batch size of 256 or 512, you get 512 or 1,024 FLOPs per byte of memory read.

This ratio is closer to the H100's balance between memory bandwidth and FLOPS. This helps achieve higher utilization, but at the cost of higher latency.
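
A roofline-style check of that argument, assuming 2 FLOPs per parameter per token and 1 byte per parameter read (which is what the 512/1,024 figures above imply):

```python
# Roofline arithmetic: how the batch size moves decoding toward the chip's balance point.
h100_flops = 2_000e12     # ~2,000 TFLOPS FP8 (quoted)
h100_bandwidth = 3e12     # ~3 TB/s HBM (quoted)
print(f"H100 balance point: ~{h100_flops / h100_bandwidth:.0f} FLOPs per byte")  # ~667

for batch in (1, 256, 512):
    print(f"batch {batch}: {2 * batch} FLOPs per byte of weights read")
# batch 1 is hopelessly bandwidth-bound; 256-512 straddles the balance point
```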

Memory capacity is considered by many to be a major bottleneck for LLM inference, since large models require multiple chips for inference, and higher memory capacities mean they can fit on fewer chips.

However, it is actually better to use more chips so that latency is lower, throughput is increased, and larger batch sizes can be used for higher utilization.

GPT-4 inference trade-offs and infrastructure

As noted above, GPT-4 inference is already difficult, and being an MoE model introduces a whole new set of difficulties.

Each forward pass that generates a token can be routed to a different set of experts, which creates problems for the trade-off between throughput, latency, and utilization at larger batch sizes.

OpenAI's GPT-4 has 16 experts, and each forward pass is routed to 2 of them.

That means that with a batch size of 8, each expert's parameter reads may serve a batch of only 1.

Worse, this could mean that one expert has a batch size of 8 while other experts have batch sizes of 4, 1, or 0.

With each token generated, the routing algorithm sends the forward pass in a different direction, so token-to-token latency and per-expert batch sizes vary significantly.

Inference infrastructure is one of the main reasons OpenAI chose a smaller number of experts: with more experts, memory bandwidth becomes the bottleneck for inference.

OpenAI's inference clusters regularly reach batch sizes of 4k+, which means that even with the best possible load balancing across experts, each expert sees a batch of only about 500. Reaching that requires an enormous amount of usage.
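
A toy illustration of both points, using uniformly random routing purely for illustration (the real router is learned, and the ~500 figure is just the quoted 4k+ batch spread over 16 experts with top-2 routing):

```python
import numpy as np

# Route 8 concurrent requests to 2 of 16 experts each (random here, for illustration
# only) and see how many requests each expert actually serves.
rng = np.random.default_rng(0)
counts = np.zeros(16, dtype=int)
for _ in range(8):
    for expert in rng.choice(16, size=2, replace=False):
        counts[expert] += 1
print(counts)            # most experts see 0 or 1 requests despite a "batch of 8"

# Arithmetic behind the ~500 figure for a large batch:
print(4096 * 2 / 16)     # 512.0 tokens per expert at a 4k batch with top-2 routing
```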

According to the author, OpenAI runs inference on clusters of 128 GPUs, with several such clusters spread across multiple data centers and geographies.

Inference uses 8-way tensor parallelism and 16-way pipeline parallelism. Each node of 8 GPUs holds only about 130B parameters, i.e. less than 30 GB per GPU at FP16 and less than 15 GB at FP8/int8.

That makes it possible to run inference on 40 GB A100s, as long as the KV cache across all the batches does not grow too large.
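
A rough check of the per-GPU weight footprint under that layout (roughly 1.8T parameters split across 16 pipeline stages of 8-way tensor parallelism):

```python
# Per-GPU weight footprint for the described inference layout (weights only).
total_params = 1.8e12
stages, tensor_parallel = 16, 8
params_per_gpu = total_params / (stages * tensor_parallel)
print(f"FP16:     ~{params_per_gpu * 2 / 1e9:.0f} GB per GPU")   # ~28 GB, under a 40 GB A100
print(f"FP8/int8: ~{params_per_gpu * 1 / 1e9:.0f} GB per GPU")   # ~14 GB
```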

The layers containing the various experts are not split across different nodes, because that would make network traffic too irregular, and recomputing the KV cache before every generated token would be far too expensive.

For any future scaling of MoE models and conditional routing, the biggest difficulty is how to handle routing around the KV cache.

The model has 120 layers, so they could simply be distributed evenly across 15 nodes, but because the first node also has to handle data loading and embedding, it makes sense to put fewer layers on the head node of the inference cluster.

There are also rumors about "speculative decoding" (covered below), which would likewise explain why the head node needs to hold fewer layers.

Inference cost

Compared with the 175-billion-parameter Davinci model, GPT-4 costs 3 times as much, even though its feed-forward parameters grow by only 1.6 times.

This is mainly because GPT-4 requires a larger cluster and achieves lower utilization.

The authors estimate that serving GPT-4 at an 8k sequence length costs $0.0049 per 1,000 tokens on 128 A100s, and $0.0021 per 1,000 tokens on 128 H100s.

Note that this assumes fairly high utilization and keeps the batch size high.
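
Working backwards from those prices, here is the cluster throughput they imply, assuming the same $1/A100-hour and $2/H100-hour rental rates used earlier in the article:

```python
# Throughput implied by the quoted per-token costs and assumed GPU rental rates.
def implied_tokens_per_second(gpus, usd_per_gpu_hour, usd_per_1k_tokens):
    cluster_usd_per_hour = gpus * usd_per_gpu_hour
    return cluster_usd_per_hour / usd_per_1k_tokens * 1_000 / 3_600

print(f"128 A100s: ~{implied_tokens_per_second(128, 1.0, 0.0049):,.0f} tokens/s")  # ~7,300
print(f"128 H100s: ~{implied_tokens_per_second(128, 2.0, 0.0021):,.0f} tokens/s")  # ~33,900
```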

But OpenAI's utilization is clearly sometimes very low.

The author hypothesizes that OpenAI shuts clusters down during off-peak hours, reconfigures the nodes to resume training smaller test models, and experiments with various new techniques, all to reduce inference cost.

Had OpenAI not done so, their utilization would have been lower and their costs would have more than doubled.

Multi-query attention

In addition, OpenAI is also using Multi-Query Attention (MQA).

Paper address: https://arxiv.org/pdf/1911.02150.pdf

In short, MQA needs only a single key/value head, which significantly shrinks the memory footprint of the KV cache.

Even so, the 32k-context GPT-4 definitely cannot run on 40 GB A100s, and the 8k version is capped in its maximum batch size.
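
To give a feel for the scale involved, here is some illustrative KV-cache arithmetic; the 120 layers come from the article, but the head count and head dimension below are hypothetical placeholders, not leaked numbers.

```python
# Illustrative KV-cache arithmetic. kv_heads=96 and head_dim=128 are hypothetical
# placeholders, NOT leaked GPT-4 numbers; 120 layers is from the article.
def kv_cache_bytes(seq_len, layers=120, kv_heads=96, head_dim=128, bytes_per_value=2):
    # 2 tensors (K and V) per layer, per KV head, per token
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len

full_mha = kv_cache_bytes(32_768)               # every attention head keeps its own K/V
mqa = kv_cache_bytes(32_768, kv_heads=1)        # multi-query: a single shared K/V head
print(f"32k context, one sequence: MHA ~{full_mha / 1e9:.0f} GB vs MQA ~{mqa / 1e9:.1f} GB")
```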

Continuous batching

OpenAI implements variable batch sizes and continuous batching.

This allows some flexibility on maximum latency while optimizing inference cost.
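
A toy sketch of the idea behind continuous batching, with scheduling details invented purely for illustration (this is not OpenAI's implementation): finished sequences free their slot immediately, and queued requests join on the very next decoding step rather than waiting for the whole batch to drain.

```python
from collections import deque

# Toy continuous-batching scheduler: a finished sequence frees its slot at once and
# a queued request joins on the next decode step. Invented for illustration only.
def continuous_batching(requests, max_batch=4):
    waiting = deque(requests)          # (request_id, tokens_still_to_generate)
    active, step = {}, 0
    while waiting or active:
        while waiting and len(active) < max_batch:   # fill any free slots
            rid, n = waiting.popleft()
            active[rid] = n
        step += 1                                    # one decode step = one token per active request
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                print(f"step {step}: request {rid} done, slot freed")
                del active[rid]

continuous_batching([("a", 2), ("b", 5), ("c", 3), ("d", 4), ("e", 1)])
```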

Speculative Decoding

It has been revealed that OpenAI uses "speculative decoding" in GPT-4 inference, although the author cannot be 100% sure of this.

The variation in latency from token to token, and the difference when doing simple retrieval tasks versus more complex tasks, seems to suggest this is possible, though there are still too many variables to be sure.

Here, the author explains the idea by adapting (with modifications and some added detail) text from the DeepMind study "Accelerating LLM Inference with Staged Speculative Decoding".

Using an LLM typically involves two phases.

The first is prefill: the prompt text is run through the model to build the KV cache and produce the logits (the probability distribution over possible output tokens) for the first output token. This phase is usually fast because the entire prompt can be processed in parallel.

The second phase is decoding: a token is chosen from the output logits and fed back into the model, which then produces the logits for the next token. This is repeated until the desired number of tokens has been generated.

Because decoding must happen sequentially, the weights have to be streamed through the compute units each time a single token is generated, so the arithmetic intensity of this second phase (i.e. FLOPs of compute per byte of memory bandwidth) is extremely low when it runs at small batch sizes. Decoding is therefore usually the most expensive part of autoregressive generation.

This is why input tokens are much cheaper than output tokens in OpenAI's API pricing.

The basic idea of "speculative decoding" is to use a smaller, faster draft model to decode several tokens in advance, and then feed them into the large oracle model as a single batch.

If the draft model's predictions are correct, i.e. the larger model agrees with them, several tokens can be decoded with a single batch, saving a great deal of memory bandwidth and time.

However, if the larger model rejects a token predicted by the draft model, the remaining batch is discarded and the algorithm naturally reverts to standard token-by-token decoding.

"Speculative decoding" may also be accompanied by a rejection sampling scheme to sample from the original distribution. It's worth noting that this is only useful in small-batch settings where bandwidth is the bottleneck.

Speculative decoding, which trades computation for bandwidth, is an attractive performance engineering target for two key reasons:

First, it does not degrade model quality. Second, the speedups it provides are usually orthogonal to other approaches, because they come from converting "sequential execution" into "parallel execution".

The method described above speculates on a single sequence as one batched prediction. However, this does not scale well to large batch sizes or to draft models that are poorly aligned with the large model.

Intuitively, the probability that the two models agree on long contiguous sequences of tokens is exponentially low, which means the gains from speculative decoding fall off quickly as arithmetic intensity scales up.

The author believes that if OpenAI does use speculative decoding, it is probably only applied to sequences of around 4 tokens.

As an aside, the whole conspiracy theory that OpenAI has "nerfed" GPT-4 and degraded its quality may simply be down to the oracle model accepting lower-probability sequences from the speculative-decoding draft model.

It has also been speculated that Bard uses speculative decoding, because Google waits for an entire sequence to finish generating before sending it to the user, but the author believes that guess is simply wrong.

Visual multimodality

Visual multimodal capabilities are the least impressive part of GPT-4, at least compared to leading research.

Of course, no one has yet commercialized the results of multimodal LLM research.

According to the author, GPT-4's vision capability is a visual encoder separate from the text encoder, with cross-attention; the architecture is similar to Flamingo, and it adds further parameters on top of GPT-4's 1.8T.

GPT-4's multimodal capability is fine-tuned with about 2 trillion tokens after text pre-training.

Reportedly, OpenAI originally wanted to train the vision model from scratch, but the approach was not mature enough, so they had no choice but to fine-tune from the text-trained model.

The next-generation model, GPT-5, will reportedly train its vision model from scratch and be able to generate images, and perhaps even audio.

One of the main purposes of this visual capability is to allow autonomous agents to read web pages and transcribe the content of images and videos.

It is worth mentioning that the data OpenAI used to train the multimodal model includes "joint data" (LaTeX/text), web-page screenshots, and YouTube videos (sampled frames, with Whisper run to obtain transcripts).

One interesting fact about LLM over-optimization is that vision models have a different IO cost from text models: data loading for the vision model costs roughly 150 times as much IO as for the text model.

Figure: data-loading IO cost of the vision model compared with the text model

Each vision-model token is 600 bytes, versus 4 bytes per text token.
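
That ratio follows directly from the quoted per-token sizes:

```python
# The ~150x IO figure above, from the quoted per-token sizes.
print(600 / 4)   # 150.0 — bytes per vision token vs bytes per text token
```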

This means a great deal of work is needed on image compression. That matters enormously to hardware vendors, who are optimizing their hardware 2-3 years out around LLM use cases and ratios.

They may find themselves in a world where each model has powerful visual and audio capabilities.

They may find their architectures poorly suited to that world.

In general, architectures will certainly evolve beyond the simplified text-based dense and MoE models we see today.

References:

https://www.semianalysis.com/p/gpt-4-architecture-infrastructure
