How much computing power is needed for a large model with hundreds of billions of parameters?

Author | Owen Zhu  

Produced | NPCon (New Programmer Conference)

Compared with narrow artificial intelligence, general artificial intelligence can serve a much wider range of scenarios and achieve a higher degree of logical understanding and tool use through large models that cut across domains, disciplines, tasks, and modalities. In 2023, with continuous breakthroughs in large language model (LLM) technology, large models brought new light to the exploration of higher-level general artificial intelligence, and the field entered a period of rapid development. In China, large models are blossoming everywhere, with new ones emerging one after another.

To take the lead in the era of the "Hundred Models Contest", AI development teams need to focus on overcoming enormous challenges in computing power, algorithms, and data. Development efficiency and training speed are the key factors that keep a large model competitive in the market, and they will remain a core strength in the future.

Owen Zhu, AI architect in Inspur Information's Artificial Intelligence and High-Performance Application Software Department, spoke at the first NPCon: AI Model Technology and Application Summit, jointly organized by CSDN and New Programmer. He presented a computing power system solution for AI large models in the new round of the AIGC industrial revolution, and emphasized that comprehensive optimization of computing power, algorithms, data, and system architecture plays a vital role in large model training.

This talk covers three parts:

1. The computing power bottleneck in the era of the "Hundred Models Contest"

2. To refine a large model, first sharpen your tools

3. The ceiling of large models: computing efficiency determines speed

Note: For the live video, see the "CSDN Video Account".


The computing power bottleneck in the era of the "Hundred Models Contest"

The core technology of large model development consists of two parts: pre-training and alignment (value alignment). Pre-training requires a large amount of data so that the model converges faster and performs better. Alignment is not exactly the same as reinforcement learning; it uses a variety of methods and strategies to optimize the model's output, so that the AI learns how to communicate and express itself from feedback in its interactions with people. Together, these two parts are the core elements that determine the quality of a large model.

At present, a model's basic capability depends on data, model parameters, and computing power. The more parameters a model has and the more training data it is fed, the stronger its generalization ability. When resources are limited and both cannot be increased, how should we choose? OpenAI's research concluded that it is better to increase the number of model parameters first rather than the amount of data: under the same compute budget, a 200-billion-parameter model trained on 100 billion tokens performs better than a 100-billion-parameter model trained on 200 billion tokens.
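As a quick sanity check (my own illustration, not a figure from the talk), the training compute of a dense Transformer is commonly approximated as C ≈ 6·N·D FLOPs, where N is the parameter count and D is the number of training tokens. The two configurations above cost essentially the same compute, so the choice is purely about how to allocate a fixed budget:

```python
# Illustrative sketch (not from the talk): compare the training compute of the two
# configurations using the common approximation C ~= 6 * N * D FLOPs for a dense
# Transformer, where N = parameters and D = training tokens.

def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute of a dense Transformer in FLOPs."""
    return 6 * params * tokens

config_a = train_flops(params=100e9, tokens=200e9)  # 100B-parameter model, 200B tokens
config_b = train_flops(params=200e9, tokens=100e9)  # 200B-parameter model, 100B tokens

print(f"100B params x 200B tokens: {config_a:.2e} FLOPs")
print(f"200B params x 100B tokens: {config_b:.2e} FLOPs")
# Both come out to ~1.2e+23 FLOPs: the compute budget is identical, so the question
# is how to split it between parameters and data.
```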

It can be seen that the number of parameters is an important indicator of a model's capability. Once the parameter count crosses a certain threshold, the model's capability improves in leaps, showing abilities in language comprehension, generation, and logical reasoning. This is what we call the emergent ability of a model.

How large does a model need to be before emergent capabilities appear?


As things stand, 10 billion parameters is the threshold at which a model begins to show emergent capabilities, and models with 100 billion parameters show them more strongly. But this does not mean the competition will inevitably escalate to the trillion-parameter scale, because existing large models have not yet been fully trained. For example, GPT-3 was trained on only about 1-2 tokens per parameter, while DeepMind's research shows that fully training a large model requires about 20 tokens per parameter.

Therefore, many of today's 100-billion-parameter models still need roughly 10 times more training data before their performance can reach a better level.
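A back-of-the-envelope calculation (my own illustration, using the 20-tokens-per-parameter rule of thumb cited above) shows what full training would demand:

```python
# Illustrative sketch: apply the "20 tokens per parameter" rule of thumb cited above
# to estimate the tokens and compute needed to fully train a model of a given size.

TOKENS_PER_PARAM = 20          # DeepMind's rule of thumb for "fully trained"
FLOPS_PER_PARAM_TOKEN = 6      # common approximation for dense Transformer training

def full_training_budget(params: float) -> tuple[float, float]:
    tokens = TOKENS_PER_PARAM * params
    flops = FLOPS_PER_PARAM_TOKEN * params * tokens
    return tokens, flops

for params in (10e9, 100e9, 175e9):
    tokens, flops = full_training_budget(params)
    print(f"{params / 1e9:.0f}B params -> {tokens / 1e12:.1f}T tokens, {flops:.1e} FLOPs")

# A 100B-parameter model would need ~2T tokens -- roughly 10x the 100-200B tokens
# many current models are trained on, which is the gap described above.
```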

Whether we increase the number of model parameters or the amount of data, computing power remains the core driving force behind improvements in large model capability: "enough" computing power is needed to support "sufficiently accurate" model generalization.

At present, the computing power equivalent required for large model training keeps rising: from GPT-3 to GPT-4, the computing power equivalent increased by 68 times. The greater the computing power equivalent, the smaller the cross entropy, and the stronger the model. As the number of training tokens, the number of parameters, and the amount of computation grow, the loss of the language model decreases smoothly, which means that the accuracy of a large language model can continue to improve as computation, parameter scale, and token count expand.
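This smooth decrease is usually described with a power law. The sketch below uses the commonly cited functional form with illustrative coefficients (my own placeholders, not figures from the talk) just to show the qualitative trend:

```python
# Illustrative sketch: loss versus training compute is commonly modeled as a power
# law, L(C) = (C_c / C) ** alpha. The coefficients below are placeholders chosen
# for illustration, not values from the talk.

ALPHA = 0.05    # illustrative exponent
C_C = 3.1e8     # illustrative reference compute, in PetaFLOP/s-days

def loss(compute_pf_days: float) -> float:
    return (C_C / compute_pf_days) ** ALPHA

for compute in (1e3, 1e4, 1e5, 1e6):
    print(f"{compute:.0e} PF-days -> predicted loss {loss(compute):.3f}")

# Every 10x increase in compute lowers the predicted loss by the same multiplicative
# factor (10 ** -ALPHA, i.e. ~11%), which is the smooth decrease described above.
```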


To refine a large model, first sharpen your tools

The capabilities of large models come from extensive engineering practice, and the engineering challenges of pre-training are enormous. This shows up in several ways. First, the evolution of AI large models places extreme demands on cluster parallel-computing efficiency, on-chip storage, bandwidth, and low-latency memory access; the planning, construction, performance optimization, and computing power scheduling of a 10,000-GPU AI platform are all hard problems. Second, large-scale training routinely runs into hardware failures, gradient explosions, and other problems rarely encountered in small-scale training. Third, the lack of engineering experience makes it difficult for enterprises to improve model quality quickly.

As one of the earliest companies to invest in large models, Inspur Information was the first in the industry to launch a Chinese AI mega model, "Yuan 1.0", with 245.7 billion parameters. This hands-on innovation at the 100-billion-parameter scale has given Inspur Information practical technical experience in the large model field and a professional R&D team that provides the industry with reference designs for AI computing power systems.

In terms of computing power efficiency, to cope with the complex computation patterns and low cluster utilization typical of large model training, Yuan 1.0 adopted a three-dimensional parallelism strategy combining tensor parallelism, pipeline parallelism, and data parallelism for its large-scale distributed training. Using 266 NVLink-connected servers with 8 A100 GPUs each, training took about 15 days with a per-GPU computing efficiency of about 44%. A total of 180 billion tokens were trained, and the final loss converged to 1.73, significantly lower than other language models in the industry such as GPT-3.
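As a rough cross-check (my own estimate; the peak FLOPS figure and the recomputation factor below are assumptions, not numbers from the talk), the reported scale is broadly consistent with a roughly two-week run:

```python
# Rough cross-check of the reported training run (my own estimate; the peak FLOPS
# value and the recomputation factor are assumptions, not numbers from the talk).

PARAMS = 245.7e9           # Yuan 1.0 parameter count
TOKENS = 180e9             # training tokens
FLOPS_PER_PARAM_TOKEN = 8  # ~6 for forward+backward, plus ~2 for activation recomputation (assumed)
GPUS = 266 * 8             # 266 servers x 8 A100 GPUs
PEAK_FLOPS = 312e12        # assumed A100 peak mixed-precision throughput, FLOP/s
EFFICIENCY = 0.44          # reported per-GPU computing efficiency

total_flops = FLOPS_PER_PARAM_TOKEN * PARAMS * TOKENS
sustained_flops = GPUS * PEAK_FLOPS * EFFICIENCY
days = total_flops / sustained_flops / 86400
print(f"Estimated training time: {days:.1f} days")  # ~14 days, close to the reported ~15
```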


For the first time, a collaborative model-structure design method optimizing for both efficiency and accuracy was proposed, with in-depth optimization of the deep learning framework, training cluster I/O, and communication. Even with only a 2x200 Gb/s interconnect per node, Yuan 1.0 reached a computing power efficiency of 45%, which is world-leading. At the level of cluster high-speed interconnect, full line-rate networking of the whole cluster was achieved on top of native RDMA and the network topology was optimized, effectively eliminating the bottlenecks of hybrid parallel computing and keeping the cluster in its best state throughout large model training.


The Ceiling of Large Models: Computing Efficiency Determines Speed

At present, the computing power gap between China's large models and the industry's most advanced ones remains wide. In terms of computing power equivalent, GPT-4 has reached 248,842 PD (PetaFLOP/s-days), while most mainstream large models in China have used only a few thousand PD, a gap of nearly a hundred times.
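For reference (my own conversion, not from the talk), one PD here is one PetaFLOP/s sustained for a day, so the figures translate into raw FLOPs roughly as follows:

```python
# Illustrative conversion (my own, not from the talk): 1 PD = 1 PetaFLOP/s-day,
# i.e. 1e15 FLOP/s sustained for 86,400 seconds.

FLOPS_PER_PD = 1e15 * 86400   # = 8.64e19 FLOPs

gpt4_pd = 248_842             # computing power equivalent cited for GPT-4
domestic_pd = 3_000           # assumed representative value for "a few thousand PD"

print(f"GPT-4:          {gpt4_pd * FLOPS_PER_PD:.2e} FLOPs")   # ~2.1e+25 FLOPs
print(f"Domestic model: {domestic_pd * FLOPS_PER_PD:.2e} FLOPs")
print(f"Ratio: {gpt4_pd / domestic_pd:.0f}x")                   # close to a hundred times
```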

At the same time, there is also a wide gap between China's large models and the industry's most advanced ones in terms of algorithms and data. On the algorithm side, although open source has given domestic large models a good opportunity to catch up, the capabilities of open-source models such as LLaMA still hit a "ceiling" compared with top self-developed models such as GPT-4.


On the data side, there is a significant gap in both scale and quality between Chinese and English datasets. Compared with English corpora of hundreds of billions of words, the data available to Chinese large models is only on the order of tens of billions of words, and little of it is open sourced; most remains closed.

Developing large models and general artificial intelligence is a very complex piece of systems engineering. We urgently need to find, at the system level, the optimal path toward a healthy ecosystem for large models: drawing on hands-on experience and accelerating model development by building efficient and stable intelligent computing systems.



Source: blog.csdn.net/dQCFKyQDXYm3F8rB0/article/details/132439633