GPT-4's secrets are out! In the digital intelligence era, low-code development helps AI model architectures evolve

Foreword

A few hours ago, Dylan Patel and Gerald Wong of SemiAnalysis released a technical deep-dive on GPT-4, covering its architecture, parameter count, training cost, training dataset, and more.

Background

With the advent of the digital intelligence era, AI is playing an increasingly important role across industries. Behind every AI system, the model architecture is considered one of the key factors determining its performance and effectiveness. Among these systems, ChatGPT-4.0, as a leading AI model, has attracted widespread attention in this context.

The model architecture of ChatGPT-4.0 is like a majestic building, carefully designed and optimized. It uses advanced deep learning techniques to understand conversations and generate meaningful responses, giving interactions a feel close to talking with a real person while delivering an excellent user experience.

Not just technology 

An AI model architecture is not merely a technical artifact; it also embodies an understanding and simulation of how humans think and communicate. The quality of the architecture directly determines how effective and adaptable the technology is, and as AI continues to develop, architecture optimization has become one of the hottest research topics.

In this context, low-code development platforms offer new support for AI model development. Take the JNPF rapid development platform as an example: it gives AI developers a convenient development environment and tooling, greatly reducing the cost of development and debugging. Developers can quickly build customized AI applications through simple drag-and-drop and configuration.

Low-code development platforms have had a positive impact on AI model architecture in the digital intelligence era, mainly in the following respects:

First, low-code development platforms provide higher development efficiency. Optimizing AI model architecture often requires a lot of experimentation and debugging, and traditional development methods are inefficient and complex. The low-code development platform simplifies the development process and frees developers from tedious code writing, allowing them to focus on model design and optimization, which greatly improves development efficiency.

Secondly, low-code development platforms promote innovation in AI model architecture. Traditional development methods often require large teams and complex technical support, which limits innovation and attempts in model architecture. The simplicity and ease of use of the low-code development platform enables more developers to participate in the innovation of AI technology and promotes the continuous evolution and breakthrough of model architecture.

Finally, low-code development platforms offer greater flexibility and scalability. The application scenarios of AI technology are ever-changing, and the requirements for model architecture are also different. The low-code development platform enables developers to flexibly customize and expand according to specific needs through modular design and rich component libraries, providing more possibilities for realizing various AI applications.

Summary

To sum up, the low-code development platform has had a positive impact on AI model architecture in the digital intelligence era. It has become an integral part of the development of AI technology by enabling efficient development methods, promoting innovation, and providing flexibility and scalability. In the future development, the low-code development platform will continue to promote the evolution and breakthrough of AI model architecture, bringing more surprises and progress to AI applications in the digital intelligence era.

Note: Parts of this article draw on leading international thinking models and theories, expanded with personal experience.

Information summary 

The main details about GPT-4, summarized from Yam Peleg's tweet:

Parameter count: GPT-4 is more than 10 times the size of GPT-3, estimated at roughly 1.8 trillion parameters across about 120 layers.

MoE architecture: GPT-4 uses a Mixture-of-Experts (MoE) architecture, and this part has been confirmed. To keep costs under control, OpenAI uses 16 experts, each an MLP with about 111 billion parameters; each forward pass is routed to 2 of these experts.

MoE routing: Although the public literature describes many advanced routing schemes, including how each token selects its experts, GPT-4's routing is reportedly quite simple. In addition, roughly 55 billion parameters are shared for attention.
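
To make the two MoE items above concrete, here is a minimal, hypothetical top-2 MoE layer in PyTorch. The shapes and the simple "pick the 2 highest gate scores" router are illustrative assumptions consistent with the description above, not OpenAI's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Top-2 Mixture-of-Experts layer: a linear gate scores the experts
    for each token, and the two best experts' MLP outputs are mixed."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # the router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                           # x: (tokens, d_model)
        scores = self.gate(x)                       # (tokens, n_experts)
        top_w, top_i = scores.topk(self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)            # mixing weights for the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e          # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# usage: y = MoELayer()(torch.randn(4, 512))
```

At GPT-4 scale, each `expert` would be an MLP of roughly 111 billion parameters; the dimensions here are shrunk so the sketch runs anywhere.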

Inference: Each forward pass (generating one token) uses about 280 billion parameters and ~560 TFLOPs, in sharp contrast to the ~1.8 trillion parameters and ~3,700 TFLOPs that a purely dense model would need per forward pass.
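
A quick back-of-envelope check shows how the ~280 billion figure follows from the numbers already quoted above (all values approximate):

```python
# Active parameters per forward pass, using only figures quoted in this post.
expert_params   = 111e9   # parameters per expert MLP
experts_per_tok = 2       # top-2 routing
attn_shared     = 55e9    # shared attention parameters

active = experts_per_tok * expert_params + attn_shared
print(f"active params per forward pass ≈ {active / 1e9:.0f}B")  # ≈ 277B, matching the ~280B claim
```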

Training dataset: GPT-4 was trained on roughly 13 trillion tokens. This is not the count of distinct tokens but the total across epochs: the text data was trained for 2 epochs and the code data for 4 epochs.

GPT-4 32K: Pre-training used an 8K context length; the 32K version was fine-tuned from the 8K pre-trained model.

Batch size: The batch size was ramped up gradually over the first few days of training. OpenAI's final batch size reached 60 million tokens, which works out to roughly 7.5 million tokens per expert, though not every expert sees every token.
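
One way to arrive at the ~7.5 million tokens-per-expert figure, assuming the top-2 routing described earlier:

```python
# With top-2 routing, a 60M-token batch produces 120M expert assignments
# spread over 16 experts.
batch_tokens     = 60e6
routes_per_token = 2
n_experts        = 16

per_expert = batch_tokens * routes_per_token / n_experts
print(f"tokens per expert ≈ {per_expert / 1e6:.1f}M")  # ≈ 7.5M, matching the figure above
```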

Parallelism strategy: Constrained by NVLink, OpenAI trained GPT-4 with 8-way tensor parallelism and 15-way pipeline parallelism.
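
Those two numbers imply the layout of each model replica; the replica count below is a derived estimate, not something stated in the leak:

```python
# Implied cluster layout from the stated parallelism degrees.
tensor_parallel   = 8      # limited by NVLink within a node
pipeline_parallel = 15
gpus              = 25_000 # A100s reportedly used for training

gpus_per_replica = tensor_parallel * pipeline_parallel  # 120 GPUs per model copy
replicas = gpus // gpus_per_replica                     # remaining parallelism is data-parallel
print(gpus_per_replica, replicas)                       # 120 GPUs per replica, ~208 replicas
```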

Training cost: Training GPT-4 took about 2.15e25 FLOPs, roughly 90-100 days on about 25,000 A100s (at 32%-36% MFU). At roughly $1 per A100-hour, the training cost comes to about $63 million (perhaps only $21.5 million if trained on H100s today).
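
Reproducing the cost estimate under the stated ~$1-per-A100-hour assumption; the quoted $63 million presumably reflects a slightly longer run or higher hourly rate:

```python
# Training cost at the midpoint of the 90-100 day range.
gpus, days, usd_per_gpu_hour = 25_000, 95, 1.0

cost = gpus * days * 24 * usd_per_gpu_hour
print(f"≈ ${cost / 1e6:.0f}M")  # ≈ $57M, the same ballpark as the quoted ~$63M
```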

MoE trade-offs: Using MoE introduces many trade-offs, notably harder inference: not all of the model works on every generated token, so at any moment some experts are busy while others sit idle, which is wasteful in terms of utilization. Studies also show that 64-128 experts achieve better loss than 16 experts.
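
A rough way to see the utilization problem: with top-2 routing, only 2 of the n experts' MLPs do useful work for any given token, yet all of them must stay resident in memory:

```python
# Fraction of expert weights doing useful work per token under top-2 routing.
for n_experts in (16, 64, 128):
    print(n_experts, f"-> {2 / n_experts:.1%} of expert MLPs active per token")
```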

GPT-4 inference cost: about 3 times that of the 175-billion-parameter Davinci (the GPT-3/3.5 series), mainly because GPT-4 needs larger clusters that run at lower utilization. The estimate is about $0.0049 per 1k tokens (on 128 A100s).
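
The per-token price and cluster size together imply a serving throughput. This again assumes ~$1 per A100-hour, which the leak does not state for inference:

```python
# Implied throughput of a 128-A100 serving cluster at $0.0049 per 1k tokens.
gpus, usd_per_gpu_hour, usd_per_1k = 128, 1.0, 0.0049

cluster_usd_per_sec = gpus * usd_per_gpu_hour / 3600
tokens_per_sec = cluster_usd_per_sec / usd_per_1k * 1000
print(f"{tokens_per_sec:,.0f} tokens/s across the cluster")  # ≈ 7,256 tokens/s
```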

MQA: Multi-Query Attention. Like everyone else, OpenAI uses MQA. Because only one key/value head is needed, GPU memory use (the KV cache) drops sharply, but the 32K context version still cannot run on a 40GB A100 (see the sketch below).

Continuous batching: OpenAI uses variable batch sizes and continuous batching, making it possible to balance inference cost against inference speed.
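
A minimal PyTorch sketch of the Multi-Query Attention item above: all query heads share a single key/value head, shrinking the KV cache by a factor of the head count. Shapes are illustrative assumptions, not OpenAI's implementation:

```python
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    """MQA: n_heads query heads attend over one shared key/value head."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q   = nn.Linear(d_model, d_model)           # per-head queries
        self.kv  = nn.Linear(d_model, 2 * self.d_head)   # one shared K and one shared V head
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q = self.q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)  # (b, h, t, d)
        k, v = self.kv(x).split(self.d_head, dim=-1)     # (b, t, d) each, shared by all heads
        att = (q @ k.transpose(-2, -1).unsqueeze(1)) / self.d_head ** 0.5    # (b, h, t, t)
        att = att.softmax(dim=-1)
        y = (att @ v.unsqueeze(1)).transpose(1, 2).reshape(b, t, -1)
        return self.out(y)

# usage: y = MultiQueryAttention()(torch.randn(2, 16, 512))
```

The memory saving comes from caching only one K and one V tensor per layer during generation instead of one per head.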

Vision multi-modality: GPT-4's multi-modal component is a separate vision encoder attached with cross-attention, which expands GPT-4 from about 1.8 trillion to roughly 2 trillion parameters. The vision model was trained from scratch and is not yet mature. One purpose of the vision capability is to let agents read web pages and transcribe what is in images and video. Part of the training data consisted of LaTeX sources and screenshots; there were also YouTube videos, including transcripts produced with Whisper and sampled frames.
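
A minimal sketch of the cross-attention wiring described above, where text tokens attend to vision-encoder outputs; the module name and shapes are hypothetical:

```python
import torch
import torch.nn as nn

class VisionCrossAttention(nn.Module):
    """Text hidden states query image features produced by a vision encoder."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_h, vision_h):
        # queries come from the text stream; keys/values from image features
        out, _ = self.attn(query=text_h, key=vision_h, value=vision_h)
        return text_h + out  # residual connection back into the text stream

# usage: y = VisionCrossAttention()(torch.randn(1, 16, 512), torch.randn(1, 64, 512))
```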

Inference architecture: Inference runs on clusters of 128 GPUs, with multiple such clusters in different regions. Each node has 8 GPUs and holds about 130 billion parameters of the model; in other words, under about 30GB per GPU at FP16 and under about 15GB at FP8/int8.
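
A straight division reproduces the quoted per-GPU ballpark:

```python
# Per-GPU memory math for the serving nodes described above.
node_params, gpus_per_node = 130e9, 8

per_gpu = node_params / gpus_per_node                 # ≈ 16.25B parameters per GPU
print(f"{per_gpu * 2 / 2**30:.1f} GiB at FP16")       # ≈ 30.3 GiB (2 bytes/param)
print(f"{per_gpu * 1 / 2**30:.1f} GiB at FP8/int8")   # ≈ 15.1 GiB (1 byte/param)
```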


Source: blog.csdn.net/sdgfafg_25/article/details/131663696