Give the AI Large-Model Dome a Pillar of Storage Power


Before a temple can be built, pillars strong enough to bear the weight of its roof must first be erected.

In his annotations to the Shuowen Jiezi, Duan Yucai wrote that the pillar is the "master" of a house. In other words, the pillar is the most fundamental part of a building: if the pillars are not sound, even the finest carved beams and painted rafters come to naught.

Today, the industry is pouring its heart into building a dome called the AI large model. The versatility and strong generalization of pre-trained large models have shown every industry the dawn of intelligence and ignited enthusiasm across the economy. At the recent World Artificial Intelligence Conference in Shanghai, more than 30 Chinese large AI models were unveiled; nationwide, a "war of a hundred models" is in full swing. According to the "China AI Large Model Map Research Report" issued by the New Generation Artificial Intelligence Development Research Center of the Ministry of Science and Technology and other institutions, China ranks second in the world in the number of large models developed and leads globally in some vertical fields.

Looking up at the dome of the large model matters. But now is also the time to ask whether the pillars beneath that dome are solid and reliable, and how much weight they can bear. Beyond the two established AI infrastructures of network transmission and computing power, the supporting role of storage power in large-model development is drawing growing attention.

(Zhou Yuefeng, President of Huawei's Data Storage Product Line)

On July 14, Huawei held its AI storage product launch for the era of large models. At the event, Huawei laid out in detail the challenges that large models pose to the storage foundation, along with its answers in technology, products, and ecosystem.

In a keynote titled "New Data Paradigm Unleashes New AI Momentum", Zhou Yuefeng, President of Huawei's Data Storage Product Line, said: "In the era of large models, data determines the height of AI intelligence. As the carrier of data, storage has become a key infrastructure for AI large models. Huawei Data Storage will keep innovating, providing diverse solutions and products for the era of AI large models, and working with partners to drive AI empowerment across all industries."

While the world is absorbed in raising large-model domes, the storage industry must build the pillars that can hold up an intelligent world. Facing the era of large models, Huawei Storage has chosen to shoulder its share of that responsibility.


Hypothesis: What Would the Era of Large Models Look Like Without Storage Support?

We all know that many problems remain to be solved in the development of large AI models. For example, the Chinese-language corpora and datasets available for training are insufficient; large models lean too heavily on manual tuning, which drives deployment costs too high; and they depend on high-end computing power, which is scarce.

Beyond these problems, we must face up to another fact: if large models lack suitable storage products and storage resources, the outcome may not be optimistic. In Huawei's view, AI development faces four major data storage challenges across its different fields and stages.

The first is that data collection is too slow. Large models train on data at enormous scale, much of it unstructured, so AI training must copy large volumes of raw data across regions from many sources. If this process is too convoluted and inefficient, AI development stalls; in particular, it severely limits large-model adoption in industries whose data sits in large local silos.

Second, the data preprocessing cycle is long. AI training begins with heavy data preprocessing, and as large-model datasets balloon, the preprocessing workload grows with them. For a typical 100 TB large-model dataset, preprocessing often takes more than 10 days, about 30% of the entire AI data pipeline. Without targeted help from storage, the workload, hours, and computing power consumed by preprocessing will keep rising as models grow, making large-model training ever harder.
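As a rough illustration (our own back-of-the-envelope arithmetic, not a Huawei figure), here is what a 100 TB corpus implies at an assumed end-to-end preprocessing rate; both the throughput and the worker count are made-up knobs:

```python
# Back-of-the-envelope estimate of preprocessing time for a large training
# corpus. All rates here are illustrative assumptions, not vendor figures.

DATASET_TB = 100           # corpus size cited in the article
THROUGHPUT_MB_S = 120      # assumed per-worker rate: read + clean + write back
WORKERS = 1                # scale this up to see the effect of parallelism

total_mb = DATASET_TB * 1024 * 1024
seconds = total_mb / (THROUGHPUT_MB_S * WORKERS)
print(f"{seconds / 86400:.1f} days")   # ~10.1 days at these assumptions
```

At these assumed numbers, a single pipeline needs about ten days, matching the order of magnitude quoted above; parallelism, and the storage bandwidth to feed it, is what shrinks the cycle.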


Third, datasets load slowly and training is easily interrupted. Large models have enormous parameter counts and training datasets, so all sorts of conditions can disturb dataset loading and interrupt, or even restart, model training. Especially when training complex model architectures, halting data loading and frequent errors drive work overhead up sharply.

For example, according to public estimates, OpenAI trained GPT-4 on roughly 25,000 A100 GPUs for 90 to 100 days, yet its model FLOPs utilization was only 32% to 36%, chiefly because frequent failures forced restarts from checkpoints. If this problem goes unsolved, the continued growth of large models will pour endless computing and human resources into recovering from data failures, making large models unaffordable to operate. The sketch below shows where the checkpoint interval enters this picture.
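To make the checkpoint trade-off concrete, here is a minimal training-loop sketch, assuming PyTorch; the path and interval are hypothetical. The interval bounds how many steps a failure can destroy, while how fast a checkpoint saves and reloads depends directly on storage bandwidth:

```python
# Minimal sketch of periodic checkpointing in a training loop (assumed
# PyTorch). A failure costs at most SAVE_EVERY steps of redone work, but
# every save and every resume is a burst of storage I/O.
import torch

CKPT_PATH = "ckpt.pt"    # hypothetical location on shared storage
SAVE_EVERY = 500         # steps between checkpoints

def train(model, optimizer, data_loader, start_step=0):
    for step, batch in enumerate(data_loader, start=start_step):
        loss = model(batch).mean()        # stand-in for a real loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step % SAVE_EVERY == 0:        # bursty write to storage
            torch.save({"step": step,
                        "model": model.state_dict(),
                        "optim": optimizer.state_dict()}, CKPT_PATH)

def resume(model, optimizer):
    state = torch.load(CKPT_PATH)         # reload speed bounds recovery time
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"]
```

Shrinking SAVE_EVERY cuts the work lost per failure but multiplies checkpoint I/O, which is exactly where storage performance enters the utilization math.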

In addition, there is the challenge of poor real-time performance and accuracy in model inference. Once a large model is deployed for inference, it needs access to the latest data at all times, yet the current mainstream approach of retraining takes too long and costs too much. Left unsolved, this challenge greatly diminishes the effect of large-model inference and deployment, and with it the final realization of intelligence.

Clearly, in the era of large models, companies must compete not only on algorithms, computing power, and data, but also on storage power: on storage resources themselves, on how precisely the storage system meets large-model needs, and on the hardware-software fit that improves training and inference on the data side. Without the pillar of storage power, just as without AI computing power, data, or deployment scenarios, the era of large models simply cannot stand.

Building a Storage Pillar for AI Large Models


Fortunately, the storage industry has already acted on this reality. For AI storage in the era of large models, Huawei provides capabilities matched to each challenge. For data collection, data-weaving capability delivers a globally unified data view and scheduling across systems, regions, and clouds (a toy sketch of this idea follows below). For long preprocessing cycles, near-storage computing lets data be prepared where it is stored, freeing up AI computing resources. For training interruptions, technologies such as preprocessing acceleration and AI training/inference acceleration aim for zero waiting during training.
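As a toy illustration of the data-weaving idea (our own sketch, not Huawei's implementation), a global data view boils down to a catalog that maps logical dataset names onto data left in place across systems and regions:

```python
# Toy "global unified data view": data stays where it lives; the catalog
# holds only metadata, and a scheduler can pick the nearest replica
# instead of copying everything first. All names below are hypothetical.
from dataclasses import dataclass

@dataclass
class Location:
    system: str   # e.g. "nas-beijing", "s3-shanghai"
    path: str     # where the data physically resides

class GlobalDataView:
    def __init__(self):
        self._catalog: dict[str, list[Location]] = {}

    def register(self, dataset: str, loc: Location) -> None:
        # Registration records metadata only; no bytes move.
        self._catalog.setdefault(dataset, []).append(loc)

    def locate(self, dataset: str) -> list[Location]:
        # A training job asks for a dataset by name, not by copy.
        return self._catalog[dataset]

view = GlobalDataView()
view.register("corpus-zh", Location("nas-beijing", "/data/corpus"))
view.register("corpus-zh", Location("s3-shanghai", "bucket/corpus"))
print(view.locate("corpus-zh"))
```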

These long-cultivated technical capabilities and targeted problem-solving ideas come together in products and solutions that can meet the storage challenges of large models and satisfy their development and deployment requirements. This is how Huawei Storage builds pillars for the large-model dome.

These capabilities have now condensed into the two products released at the event: OceanStor A310, a deep-learning data lake storage system offering leading performance across the whole AI pipeline, and FusionCube A3000, a training/inference hyper-converged appliance that greatly lowers the barrier to using AI.

For AI large models whose data volumes keep expanding and whose architectures keep changing, a high-performance, purpose-built data storage foundation is a precondition for development. OceanStor A310 deep-learning data lake storage was born for exactly this. With ultra-high scalability, high mixed-workload performance, and lossless multi-protocol interoperability, it manages massive AI data across the whole pipeline of collection, preprocessing, training, and inference.


Facing the industry trend of converging AI computing and HPC, OceanStor A310 provides same-source data analysis for AI, HPC, and big data scenarios. It scales to a maximum of 4,096 nodes, and a single 5U enclosure delivers an industry-leading 400 GB/s of bandwidth and 12 million IOPS. It supports lossless multi-protocol interoperability, enabling zero-copy data access and improving end-to-end pipeline efficiency by 60%. Through near-storage computing, OceanStor A310 preprocesses training data in place, raising preprocessing efficiency by 30%. Its global file system (GFS) reaches raw data scattered across regions, simplifying data collection. Together, these release Huawei's storage power for AI large models and address, in one stroke, the data problems and storage challenges across the whole AI development pipeline. A quick scaling calculation from the quoted figures appears below.
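Taking the quoted per-enclosure figures at face value, a quick calculation (ours, admittedly assuming optimistic linear scaling) shows what even a modest cluster implies; the cluster size below is hypothetical:

```python
# Rough scaling arithmetic from the spec figures quoted above. Assumes
# linear scaling; real clusters lose some efficiency to networking and
# metadata overhead.
BANDWIDTH_PER_FRAME_GBS = 400    # GB/s per 5U enclosure, as cited
IOPS_PER_FRAME = 12_000_000      # IOPS per enclosure, as cited
frames = 16                      # hypothetical cluster (stated max: 4,096 nodes)

print(f"{frames * BANDWIDTH_PER_FRAME_GBS:,} GB/s aggregate bandwidth")
print(f"{frames * IOPS_PER_FRAME:,} aggregate IOPS")
# At 6,400 GB/s, streaming a 2 TB checkpoint takes well under a second
# in theory; storage bandwidth directly bounds load and recovery times.
```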

In the future, AI will be woven into thousands of industries and reshape them. It follows that hyper-converged appliances able to adapt to more industry and application scenarios will become a hard requirement of industrial intelligentization.


To this end, Huawei launched FusionCube A3000, a training/inference hyper-converged appliance aimed at industry large-model training and inference scenarios. For models at the tens-of-billions-of-parameters scale, it integrates OceanStor A300 high-performance storage nodes, training/inference nodes, switching equipment, AI platform software, and management and O&M software, giving large-model partners a move-in-ready deployment experience: one-stop delivery, out-of-the-box use, and deployment completed within 2 hours. It can fairly be said to open the last mile of large-model deployment.

Both training/inference nodes and storage nodes can scale out independently to match models of different sizes. Meanwhile, FusionCube A3000 uses high-performance containers to let multiple model training and inference tasks share GPUs, raising resource utilization from 40% to over 70%. A toy model of why sharing helps appears below.
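Why does sharing lift utilization? A crude model (ours, with illustrative numbers): a dedicated GPU idles whenever its single task stalls on data, while co-located containers can fill each other's gaps:

```python
# Toy utilization model for GPU sharing. Treats each task as keeping the
# GPU busy some fraction of the time, with stalls assumed independent; a
# crude approximation, with numbers chosen only for illustration.

def utilization(busy_fractions: list[float]) -> float:
    # Probability that at least one co-located task is using the GPU.
    idle = 1.0
    for f in busy_fractions:
        idle *= (1.0 - f)
    return 1.0 - idle

print(f"dedicated: {utilization([0.40]):.0%}")              # 40%
print(f"2 sharers: {utilization([0.40, 0.40]):.0%}")        # 64%
print(f"3 sharers: {utilization([0.40, 0.40, 0.40]):.0%}")  # 78%
```

Under this toy model, co-locating two or three 40%-busy tasks pushes the GPU past the 70% mark, consistent with the direction of the figures above.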

With these two products in place, AI large models gain strong support, whether in the exploratory training that pushes the ceiling of intelligence or in the scenario-based deployment that puts intelligence to work.

The pillar of storage power is thus erected beneath the dome of the AI large-model era.


The Future: As the Storage Pillar Grows, the AI Dome Reaches for the Sky

How can the storage industry offer sustained support for the deep, long-term development of large AI models? Huawei gave its own answer at the launch as well. In Huawei's view, the large-model boom should not be a mere short-term opportunity for the storage industry; over the long run, AI and storage should promote and assist each other, forming a virtuous cycle of lasting, positive development.

To this end, Huawei will first invest actively in the future, continuing its research and preparation in AI data storage. At the launch, Zhou Yuefeng held a dialogue with Zhang Ji, a member of Huawei's "Genius Youth" program, on how Huawei can improve storage capabilities in data collection, training, and inference to better support AI development and deployment.

For example, to address secure cross-region data transfer, Huawei is researching a technology called the "data shelter". It fully encapsulates the data together with its access rights and credential information, ensuring the data stays in a safe, trusted environment throughout the transfer. A conceptual sketch of the idea follows.
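As a conceptual sketch only (our own toy, not Huawei's design), the core of the idea is sealing the payload, its access policy, and its credentials into one envelope that opens only for an authorized party; this sketch assumes the `cryptography` package:

```python
# Toy "data shelter" envelope: payload, access rights, and credentials
# travel as one sealed unit. Illustrative only; a real design would add
# attestation of the receiving environment, key management, auditing, etc.
import json
from cryptography.fernet import Fernet

def seal(payload: bytes, policy: dict, key: bytes) -> bytes:
    envelope = {"policy": policy,            # access rights ride with data
                "payload": payload.hex()}
    return Fernet(key).encrypt(json.dumps(envelope).encode())

def unseal(token: bytes, key: bytes, requester: str) -> bytes:
    envelope = json.loads(Fernet(key).decrypt(token))
    if requester not in envelope["policy"]["allowed"]:
        raise PermissionError(f"{requester} may not open this shelter")
    return bytes.fromhex(envelope["payload"])

key = Fernet.generate_key()
token = seal(b"training shard 7", {"allowed": ["site-shanghai"]}, key)
print(unseal(token, key, "site-shanghai"))   # b'training shard 7'
```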

This foresight into how AI technology will evolve, backed by R&D investment, will be key to Huawei's continued expansion of the industrial space in AI storage, and a fresh driving force for the storage industry as it embraces the opportunities of the AI era.

On the other hand, to grow into the industrial space opened up by AI large models, the storage industry must lean on cooperation and ecosystem building, so that users can be offered comprehensive, industry-specific solutions.

Driven by this ecosystem, the diverse and complex software and hardware needs that users encounter while developing and applying large models can be met continuously, ensuring that models are developed and deployed smoothly.


Overall, facing the boom in AI large models, Huawei Storage not only offers storage solutions that solve immediate problems and meet resource needs, but also looks to the future, continuing to innovate and grow in storage technology and ecosystem. The AI large model and the storage foundation stand in the same relation as dome and pillars: as the pillars rise, so does the dome, and the limits of intelligence can be broken again and again.

To help raise the dome of the AI large model, Huawei Storage delivers value in three respects, truly shouldering its industry responsibility as a pillar.

First, facing real problems in AI training and deployment, such as slow data collection and interrupted training, Huawei Storage answers with better technology and concrete coping strategies, building a bridge between the two technical fields of storage and AI and connecting supply with demand.

Second, Huawei provides a storage foundation with richer resources and more sensible utilization for large-model training and deployment. This helps optimize the total cost of large models and raises the odds that industries of every kind will adopt and adapt them.

Third, the open cooperation Huawei champions in storage lets the storage ecosystem and the AI ecosystem reinforce each other, inviting more software and hardware companies into the large-model opportunity to share the dividends of the intelligent era, and thereby advancing the storage industry across the board.

Out of these values, a pillar of AI storage is gradually taking shape, rooted deep in the earth and reaching up to the sky.

Use this pillar well, and it supports the development of large AI models, improving efficiency across the whole pipeline from training to inference.

Let this pillar grow, and it helps thousands of industries intelligentize, forming new infrastructure for the intelligent era.

Stand on this pillar, and the dawn of the fourth industrial revolution comes into view.


Source: blog.csdn.net/R5A81qHe857X8/article/details/131733455