Details of GPT-4 have been leaked


The author says that details of GPT-4 have been leaked; their credibility is unknown. Key points:

- GPT-4 is more than 10x the size of GPT-3. It is thought to have about 1.8 trillion parameters across 120 layers.
- GPT-4 is a mixture-of-experts model, but not with the 8 experts mentioned earlier: it uses 16. Research has shown that using 64 to 128 experts yields better loss than 16 experts, but that is pure research. One reason OpenAI chose 16 experts is that a larger number of experts has difficulty generalizing across many tasks; more experts can also make convergence harder to reach.
- The context length (seqlen) in the pre-training phase is 8k. The 32k-seqlen version of GPT-4 is the result of fine-tuning the 8k model after pre-training.
- To parallelize across all the A100 GPUs, they used 8-way tensor parallelism, since that is the limit imposed by NVLink.
- If their cloud cost is about $1 per A100-hour, the training cost for this run is roughly $63M.
- GPT-4 inference costs 3x that of the 175B-parameter Davinci, mainly because GPT-4 requires larger clusters and achieves lower utilization. Its cost is estimated at $0.0049 per 1K tokens. (The current API price for GPT-4 is about $0.03 per 1K tokens.)
- A conspiracy theory holds that the recent drop in GPT-4 quality may simply be because they let the oracle model accept lower-probability sequences from the speculative-decoding model.
- Inference runs on clusters of 128 GPUs, using 8-way tensor parallelism and 16-way pipeline parallelism. Each 8-GPU node holds only about 130 billion parameters, i.e. less than 30GB per GPU at FP16 and less than 15GB at FP8/int8 (see the sketch after this list).
- Visual multimodality uses a vision encoder separate from the text encoder, with cross-attention; the architecture is similar to Flamingo. This adds more parameters on top of GPT-4's 1.8T. After text pre-training, it is fine-tuned on roughly another 2 trillion tokens. OpenAI had hoped to train the vision model from scratch, but the approach was not mature enough, so they reduced risk by starting from text. Part of the training data is joint data (rendered LaTeX/text), screenshots of web pages, and frames sampled from YouTube videos with Whisper transcriptions of the surrounding audio.
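For a sense of what those per-GPU memory figures imply, here is a quick back-of-the-envelope sketch using only the numbers above; the bytes-per-parameter values (2 for FP16, 1 for FP8/int8) are standard sizes, not figures from the thread:

```python
# Rough per-GPU weight footprint for the 128-GPU inference setup described
# above (8-way tensor parallel x 16-way pipeline parallel).

TOTAL_PARAMS = 1.8e12      # ~1.8T parameters
TP, PP = 8, 16             # tensor-parallel x pipeline-parallel degrees

print(f"cluster size: {TP * PP} GPUs")        # 128 GPUs

params_per_node = TOTAL_PARAMS / PP           # one pipeline stage per 8-GPU node
params_per_gpu = params_per_node / TP

for name, bytes_per_param in [("FP16", 2), ("FP8/int8", 1)]:
    gb = params_per_gpu * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.0f} GB of weights per GPU")
# -> ~28 GB at FP16 and ~14 GB at FP8/int8, consistent with the
#    "<30GB per GPU at FP16, <15GB at FP8/int8" figures above.
#    (The quoted ~130B per node is slightly higher than 1.8T / 16 ~ 112B.)
```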

A translation of the full tweet thread follows for reference. GPT-4 details have been leaked. Everything is here:

Number of parameters: GPT-4 is more than 10x the size of GPT-3. It is believed to have about 1.8 trillion parameters across 120 layers.

Mixture of experts - confirmed: OpenAI keeps costs reasonable by using a mixture-of-experts (MoE) model. They use 16 experts, each with about 111 billion MLP parameters, and each forward pass is routed to 2 of these experts.

MoE routing: While the literature discusses many advanced algorithms for choosing which experts each token is routed to, OpenAI's routing approach for GPT-4 is said to be fairly simple. There are about 55 billion shared parameters for attention.

Inference: Each forward pass (generating 1 token) uses only about 280 billion parameters and about 560 TFLOPs, compared with the ~1.8 trillion parameters and ~3700 TFLOPs a purely dense model would need per forward pass.
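As a quick sanity check, the ~280B active parameters per token follow directly from the per-expert figures above, and since per-token compute scales with active parameters, the ratio between the dense and MoE compute numbers should match the ratio of total to active parameters. A rough sketch using only numbers from the thread:

```python
# Back-of-the-envelope check of the per-token figures quoted above,
# using only the parameter counts from the thread.

EXPERTS = 16
MLP_PARAMS_PER_EXPERT = 111e9   # ~111B MLP parameters per expert
SHARED_ATTN_PARAMS = 55e9       # ~55B shared attention parameters
EXPERTS_PER_TOKEN = 2           # each forward pass routes to 2 experts

active = EXPERTS_PER_TOKEN * MLP_PARAMS_PER_EXPERT + SHARED_ATTN_PARAMS
total = EXPERTS * MLP_PARAMS_PER_EXPERT + SHARED_ATTN_PARAMS

print(f"active params per token: ~{active/1e9:.0f}B")   # ~277B  (thread: ~280B)
print(f"total params:            ~{total/1e12:.2f}T")    # ~1.83T (thread: ~1.8T)

# Dense-to-MoE compute ratio, to compare with the quoted ~3700 vs ~560:
print(f"dense/MoE compute ratio: ~{total/active:.1f}x")  # ~6.6x, and 3700/560 ~ 6.6x
```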

Dataset: GPT-4 was trained on about 13 trillion tokens. These are not unique tokens; repeated epochs are counted in that figure.

Epochs: 2 epochs for text data and 4 epochs for code data. There are also millions of rows of instruction fine-tuning data from ScaleAI and from internal sources.

GPT-4 32K: The context length (seqlen) in the pre-training phase is 8k. The 32k-seqlen version of GPT-4 is the result of fine-tuning the 8k model after pre-training.

Batch size: The batch size was gradually ramped up over a few days on the cluster, but by the end OpenAI was using a batch size of 60 million tokens! Of course, that is "only" 7.5 million tokens seen by each expert, since not every expert sees every token. For the real batch size, divide this number by the seqlen. (Enough with these misleading numbers already.)

Parallelism strategy: To parallelize across all their A100 GPUs, they used 8-way tensor parallelism, since that is the limit imposed by NVLink, plus 15-way pipeline parallelism. (They probably used ZeRO Stage 1, and may have used block-level FSDP.)

Training cost: GPT-4's training compute was about 2.15e25 FLOPs, on roughly 25,000 A100s running for 90 to 100 days, at an MFU of about 32% to 36%. This extremely low utilization is partly due to the huge number of failures that required restarting from checkpoints. If their cloud cost is about $1 per A100-hour, the training cost for this run would be roughly $63M. (Today, the same pre-training could be done in about 55 days on roughly 8,192 H100s for about $21.5M, at $2 per H100-hour.) A quick back-of-the-envelope check of these figures appears below.

Mixture-of-experts tradeoffs: Several MoE tradeoffs were made. For example, MoE is hard to handle at inference time, because not every part of the model is used for every token generated; some parts may sit idle while others are in use, which can really hurt utilization when serving users. Researchers have shown that using 64 to 128 experts yields better loss than 16 experts, but that is pure research. There are several reasons to choose fewer experts: one reason OpenAI chose 16 is that more experts have difficulty generalizing across many tasks, and more experts can also make convergence harder to reach. For such a large training run, OpenAI chose to be more conservative in the number of experts.

GPT-4 inference cost: GPT-4 costs 3x as much as the 175B-parameter Davinci, mainly because GPT-4 requires larger clusters and achieves lower utilization. The estimated cost is $0.0049 per 1K tokens for GPT-4 at 8k seqlen on 128 A100s, and $0.0021 per 1K tokens at 8k seqlen on 128 H100s. Note that this assumes fairly high utilization and large batch sizes.

Multi-query attention: OpenAI uses MQA (multi-query attention), like everyone else. With only 1 key/value head, the memory footprint of the KV cache is significantly reduced. Even so, GPT-4 at 32k seqlen definitely cannot run on 40GB A100s, and the maximum batch size at 8k is limited.

Continuous batching: OpenAI implements both variable batch sizes and continuous batching, to allow some bound on maximum latency while optimizing inference cost.
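Here is the back-of-the-envelope check of the batch-size, utilization, and cost figures above. The ~312 TFLOP/s dense BF16 peak per A100 is taken from the public spec sheet, not from the thread, and the result is only a rough consistency check:

```python
# Rough consistency check of the training numbers quoted above.
# Assumption: A100 dense BF16 peak of ~312 TFLOP/s (spec sheet figure,
# not stated in the thread).

A100_PEAK = 312e12          # FLOP/s, dense BF16
TRAIN_FLOPS = 2.15e25
GPUS_A100 = 25_000

for days in (90, 100):
    seconds = days * 86400
    mfu = TRAIN_FLOPS / (GPUS_A100 * A100_PEAK * seconds)
    cost = GPUS_A100 * 24 * days * 1.0        # at ~$1 per A100-hour
    print(f"{days} days: MFU ~{mfu:.0%}, cost ~${cost/1e6:.0f}M")
# -> ~35% MFU / ~$54M at 90 days, ~32% MFU / ~$60M at 100 days
#    (thread quotes 32-36% MFU and ~$63M)

# H100 comparison quoted in the thread: ~8,192 H100s for ~55 days at ~$2/hour.
print(f"H100 run: ~${8192 * 24 * 55 * 2.0 / 1e6:.1f}M")   # ~$21.6M (thread: ~$21.5M)

# Batch size: 60M tokens per batch, top-2 routing over 16 experts, 8k seqlen.
tokens_per_batch = 60e6
per_expert = tokens_per_batch * 2 / 16
sequences = tokens_per_batch / 8192
print(f"~{per_expert/1e6:.1f}M tokens per expert, ~{sequences:,.0f} sequences per batch")
# -> ~7.5M tokens per expert (as quoted), ~7,324 sequences as the "real" batch size
```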
Visual multimodality: This is a vision encoder separate from the text encoder, with cross-attention; the architecture is similar to Flamingo. It adds more parameters on top of GPT-4's 1.8T. After text pre-training, it is fine-tuned on roughly another 2 trillion tokens. OpenAI had hoped to train the vision model from scratch, but the approach was not mature enough, so they chose to reduce risk by starting from text. One of the main purposes of this visual capability is to read web pages and transcribe the content of images and videos. Part of the data they train on is joint data (rendered LaTeX/text), screenshots of web pages, and frames sampled from YouTube videos with Whisper transcriptions of the surrounding audio.

Speculative decoding: OpenAI may be using speculative decoding for GPT-4 inference (not 100% certain). The idea is to use a smaller, faster model to decode several tokens ahead of time and then feed them to a large oracle model as a single batch. If the small model's predictions are right, i.e. the big model agrees, several tokens can be decoded in one batch. But if the big model rejects a token predicted by the draft model, the rest of the batch is discarded and the big model continues on its own. The conspiracy theory about the recent drop in GPT-4 quality may simply be that they let the oracle model accept lower-probability sequences from the speculative-decoding model. (A minimal sketch of the idea appears at the end of this section.)

Inference architecture: Inference runs on clusters of 128 GPUs; there are multiple such clusters in several data centers in different locations. Each uses 8-way tensor parallelism and 16-way pipeline parallelism. Each 8-GPU node holds only about 130 billion parameters, i.e. less than 30GB per GPU at FP16 and less than 15GB at FP8/int8. The model has 120 layers, so it fits across 15 different nodes. [The first node probably has fewer layers, since it also needs to compute the embeddings.]

Based on these numbers: if OpenAI were aiming for Chinchilla optimality, they should have trained on 2x as many tokens [let alone surpassing it], which suggests they are struggling to obtain enough high-quality data.

Why no FSDP? The likely reason is that some of their hardware infrastructure is of an older generation. This is common in on-premises compute clusters, because organizations usually upgrade infrastructure in several "waves" to avoid halting operations entirely. With this much pipeline parallelism, they are likely to suffer from "batch bubbles", slight idle times between batches, just like everyone else.

Again: no magic. They know what they're doing, but it's not magic.
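As referenced in the speculative-decoding paragraph above, here is a minimal sketch of the general idea. The model interfaces are hypothetical and the acceptance rule is a simplified greedy one; this is not OpenAI's implementation, and production systems typically accept draft tokens based on probability ratios rather than exact greedy agreement, which is what the "lower-probability sequences" remark alludes to.

```python
# Minimal sketch of greedy speculative decoding (simplified; hypothetical
# interfaces). A small "draft" model proposes k tokens; the large "oracle"
# model checks them and accepts the longest prefix it agrees with.

from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],    # greedy next-token fn of the small model
    oracle_next: Callable[[List[int]], int],   # greedy next-token fn of the large model
    k: int = 4,
) -> List[int]:
    # 1) Draft model proposes k tokens autoregressively (cheap).
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) Oracle verifies each proposed position. In a real system all k+1
    #    positions are scored in a single batched forward pass; here we call
    #    oracle_next per position for clarity.
    accepted = []
    ctx = list(prefix)
    for t in proposal:
        o = oracle_next(ctx)
        if o != t:
            # Oracle disagrees: keep its token, discard the rest of the draft.
            accepted.append(o)
            return accepted
        accepted.append(t)
        ctx.append(t)

    # 3) All k draft tokens accepted; the oracle contributes one bonus token.
    accepted.append(oracle_next(ctx))
    return accepted
```

When the draft model is usually right, each call emits several tokens for roughly the cost of one large-model pass; when it is wrong, only the agreeing prefix is kept.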


Source: blog.csdn.net/cq20110310/article/details/131671723