Why distributed converged storage has a major role to play in the data revolution of the AI era

Some people say that the rise of artificial intelligence applications such as ChatGPT marks the arrival of the singularity of the AI era.

That may well be true. The current AIGC wave has made people keenly aware of the leap in productivity that AI can bring. This year, large AI models have become a focus for industry users, and even users in finance, media, advertising and marketing have begun experimenting with them.

However, the success of OpenAI's ChatGPT owes much not only to the integration of AI technologies such as the Transformer, but also to efficient support from the underlying infrastructure. As AI applications deepen, more and more users realize that the AI era will also bring a data revolution: how to efficiently store and process massive, diverse data, how to manage the data lifecycle efficiently, how to choose the right data precision for large AI models, and so on.

As a result, distributed converged storage is beginning to play an important role in the data infrastructure behind AI applications. Powerful distributed converged storage will become the cornerstone of smart applications in the AI era, address data pain points in the training and inference of AI applications, and give data storage a key part in unlocking AI productivity.

The Data Revolution in the AI Era

In recent years, the overall growth of the distributed converged storage market has accelerated markedly.

The reason is the rise of new application scenarios represented by big data and AI, which depend on large amounts of unstructured data. Gartner predicts that by 2025, artificial intelligence will become one of the most important factors driving infrastructure decisions, which will lead to a tenfold increase in infrastructure requirements.

Over the past ten years, the rise of AI applications first brought a revolution in computing power: diverse, heterogeneous computing became the general trend, and intelligent computing centers became the direction of data center construction. Over the next ten years, as AI applications deepen and the scale of data keeps growing, a data revolution is coming that will have a profound impact on the development of data infrastructure.

First, large AI model applications represented by AIGC are moving rapidly toward multimodality. For example, OpenAI's GPT-4 was a multimodal large model from the start, which means that data such as audio and video will be brought in, data sets will grow exponentially, and the demands placed on data storage will change fundamentally.

Li Hui, General Manager of Inspur Information's storage product line, said plainly that large AI models will have a fundamental impact on data infrastructure. First, large models will become multimodal: the data set after screening will reach the PB level, and the data volume before screening will be larger still. Second, as large-model applications deepen, access from a massive number of terminals will generate heavy inference demand, and the latency requirements on data infrastructure will keep rising.

Second, AI applications in many industries are entering a more advanced stage, and their demand for storage performance will be intense. For example, the penetration rate of L2 autonomous driving is rising, and as the industry advances from L2 to L3, the performance required for training keeps climbing. Likewise, the emergence of vehicle-road coordination scenarios has further raised the performance demands on data infrastructure.

"In scenarios such as vehicle-road coordination, intelligent manufacturing, and intelligent medical care, data processing performance and timeliness are the core challenges at present," said Liu Ximeng, Deputy General Manager of Inspur Information's storage product line.

Third, the growing scale of AI applications and the diversification of scenarios will make data processing more complex and pose serious challenges for data interoperability and energy efficiency in data infrastructure. For example, vehicle-road coordination is a typical intelligent application built on device-edge-cloud linkage, where data often has to flow, be transmitted, and be used across multiple scenarios.

Therefore, the industry generally believes that the AI era will accelerate the transformation of data infrastructure. Given the general trend toward unstructured data, distributed converged storage will play a major role in this transformation.

Why Distributed Converged Storage Has a Major Role to Play

Faced with the torrent of unstructured data, distributed converged storage is favored mainly for advantages such as high scalability and high reliability, which let it cope with the challenges of massive data.

Beyond these advantages, distributed converged storage has also kept evolving in recent years, advancing in protocol convergence, performance, and security to meet the new storage needs created by applications such as big data and AI.

The first important feature of distributed converged storage is multi-protocol convergence. In addition to the early convergence of block, file, and object protocols, many distributed converged storage products have also begun to incorporate big data protocols.

Why does this multi-protocol convergence matter? Its importance can be understood from the data processing pipeline. AI applications often involve multiple protocols and long data processing chains, with mixed workloads intertwined and data sets copied back and forth between stages. As a result, data processing efficiency is low and performance cannot meet the needs of AI applications.

Take scientific research and education as an example. Work in this field today typically combines computing, AI, and big data, and efficient data processing is both the foundation and the key. Liu Ximeng explained that building data sets is a very painful process in many research and education scenarios because the data has to be replicated, and replicating dozens of PB of data takes many days. If the protocols are converged, data replication can be eliminated and data processing efficiency is greatly improved.
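To see why replicating dozens of PB "takes many days", a back-of-the-envelope estimate helps. The Python sketch below simply divides data set size by sustained throughput. The function name, the 30 PB data set size, and the throughput figures are illustrative assumptions, not numbers from Liu Ximeng or Inspur Information.

```python
def copy_days(dataset_pb: float, throughput_gb_per_s: float) -> float:
    """Days needed to copy `dataset_pb` petabytes at a sustained rate.

    Uses decimal units (1 PB = 1e15 bytes, 1 GB = 1e9 bytes) and ignores
    protocol overhead, retries, and metadata operations.
    """
    if dataset_pb < 0 or throughput_gb_per_s <= 0:
        raise ValueError("size must be >= 0 and throughput must be > 0")
    seconds = dataset_pb * 1e15 / (throughput_gb_per_s * 1e9)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Hypothetical 30 PB data set at two assumed sustained copy rates.
    for rate in (10, 100):
        print(f"30 PB at {rate} GB/s: {copy_days(30, rate):.1f} days")
```

Even at an assumed sustained 100 GB/s, a 30 PB copy takes about 3.5 days, and at 10 GB/s it takes about 35 days, which is why accessing a single copy through several protocols is attractive.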

The second important feature of distributed converged storage is intelligence and agility. Despite the explosive growth in data volume, the amount of data actually used for analysis is still very small. Relevant data show that the average retention rate of acquired data is only 2%, and a large amount of data has never been analyzed or used. Distributed converged storage needs to process data intelligently and in real time to meet the performance requirements of AI applications.

The third important feature of distributed converged storage is improved security. As AI applications become widespread, various security issues are being exposed. As the last line of defense for data, distributed converged storage must strengthen its data protection capabilities accordingly.

At present, almost all distributed converged storage products are iterating quickly to keep up with the changing storage demands of new applications such as big data and AI. AS13000G7, Inspur Information's new generation of distributed converged storage, can be regarded as representative of this evolution.

While guaranteeing security and reliability, Inspur Information AS13000G7 is built around an "all-in-one" converged architecture and emphasizes three capabilities: extreme convergence, extreme performance, and extreme capacity. It comes in several product forms, including general-purpose distributed converged storage, high-density video distributed converged storage, and performance-oriented all-flash distributed converged storage.

In terms of converged architecture, Inspur Information AS13000G7 is the first to let a single storage system support multiple interface protocols and multiple data storage applications (files, objects, big data, video, and so on), with multi-protocol access to one copy of the data. For the varied data processing needs of AI applications, this avoids fragmented management workflows, data copying, and complex performance tuning, so that the entire data processing pipeline runs on one distributed converged storage system.

"For example, when creating data sets in teaching and scientific research scenarios, one copy of data can be accessed through multiple protocols without data replication, which greatly reduces the capacity pressure that replication brings," Liu Ximeng said.

In terms of extreme performance, Inspur Information AS13000G7 is based on 4th Gen Intel Xeon Scalable processors, supports the PCIe 5.0 high-speed bus and DDR5 cache, is equipped with self-developed NVMe SSDs, and achieves end-to-end joint optimization through coordination between drives and controllers. Compared with the previous generation, performance improves by 40%.

Inspur Information AS13000G7 has also reached a new level in extreme capacity. A single cluster can scale to a maximum of 10240 nodes, and a single file system supports hundreds of billions of files. Built on the iCap intelligent space management engine, it uses intelligent capacity algorithms such as industry-leading 32+2 wide-stripe erasure coding, intelligent balancing, compression and deduplication, multi-source zero-copy, and soft copy to raise storage space utilization to over 94%.
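The figure of "over 94%" is consistent with the basic arithmetic of a 32+2 layout, assuming "32+2" means 32 data fragments plus 2 parity fragments per stripe (the usual erasure coding notation; the source does not spell this out). The Python sketch below compares usable-capacity fractions under that assumption. The function names and the comparison layouts are illustrative, and real systems also lose some space to metadata and reserved capacity.

```python
from fractions import Fraction


def usable_fraction(data_fragments: int, parity_fragments: int) -> Fraction:
    """Fraction of raw capacity that holds user data in a k+m stripe."""
    if data_fragments <= 0 or parity_fragments < 0:
        raise ValueError("need data_fragments > 0 and parity_fragments >= 0")
    return Fraction(data_fragments, data_fragments + parity_fragments)


def replication_fraction(copies: int) -> Fraction:
    """Fraction of raw capacity that holds user data with n full copies."""
    if copies <= 0:
        raise ValueError("need at least one copy")
    return Fraction(1, copies)


if __name__ == "__main__":
    # Comparison layouts are illustrative, not taken from the product.
    layouts = {
        "32+2 erasure coding": usable_fraction(32, 2),
        "4+2 erasure coding": usable_fraction(4, 2),
        "3-way replication": replication_fraction(3),
    }
    for name, frac in layouts.items():
        print(f"{name}: {float(frac):.1%} usable")
```

Under this assumption, 32 / (32 + 2) is about 94.1% usable, compared with about 66.7% for a 4+2 layout and 33.3% for three-way replication. The wider stripe buys capacity efficiency at the cost of touching more nodes per read or rebuild.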

As one of the fastest-growing storage manufacturers in the world, Inspur Information has worked in distributed converged storage for many years and has stayed at the forefront of the market, ranking first in both installed capacity and sales. The release of its new generation of distributed converged storage, AS13000G7, not only sets a benchmark for this product category but also signals that innovative distributed converged storage products are taking on a major role in the market.

A Promising Future for Distributed Converged Storage

For many years, centralized storage was the undisputed protagonist. Although distributed converged storage has been under development for years, only now has it met a real market opportunity, and it is beginning to flourish in industry scale, growth rate, and product innovation.

In Li Hui's view, distributed converged storage will evolve toward a "data center operating system + storage base" form and become an important platform supporting the digital transformation of enterprises.

From the perspective of product form, integrated software-and-hardware appliances will be the mainstream form of distributed converged storage, and products will move closer to specific scenarios. According to the "White Paper on the Development of Distributed Converged Storage" issued by the Distributed Converged Storage Industry Alliance, China's distributed converged storage market reached the tens of billions in 2021, with integrated appliances accounting for as much as 91%. Software and hardware collaboration delivers end-to-end high reliability, high performance, and integrated operation and maintenance.

In addition, distributed converged storage will be used in a wider range of scenarios, with product forms tailored more closely to them. For extreme-capacity scenarios involving massive, multimodal unstructured data, such as smart cities and smart transportation, which generate massive real-time data and analysis workloads, AS13000G7-MS60 provides broadly compatible, cost-effective, and highly reliable storage services. AS13000G7-MN24 provides industry-leading performance and data processing capabilities for real-time data analysis scenarios such as autonomous driving.

From the perspective of technological innovation, distributed converged storage will become more closely tied to major trends such as AI applications. Data storage sits at the bottom layer of infrastructure and has traditionally been far removed from applications. However, as AI applications mature, technological innovation in distributed converged storage will be linked more closely with upper-layer applications.

"System-level deduplication and compression technologies are very important innovations in distributed converged storage. Taking large AI models as an example, in the data collection stage, distributed converged storage can use system-level compression technology to identify data and use AI to reduce the storage space occupied and improve data quality. There are still many underlying technologies worth researching and exploring," Li Hui said.

From the perspective of data center architecture, the trend toward separating storage and compute will affect distributed converged storage in many ways. In particular, the rapid development of the CXL protocol and DPUs will make the role and status of distributed converged storage more prominent. Li Hui said plainly: "In addition to the separation of storage and computing in the data center, cloud and data will also move toward decoupling. In hybrid cloud or multi-cloud mode, how to make data flow and be shared more easily is users' core demand. Decoupling cloud and data undoubtedly helps data flow."

Taken together, the "White Paper on the Development of Distributed Converged Storage" predicts that over the next three years, China's distributed converged storage market will maintain a growth rate of 40%, and that distributed converged storage will be widely deployed to meet demand in cloud, big data analytics, AI and other scenarios, making it a cornerstone of the AI era. As the No. 1 manufacturer in China by sales volume in the distributed converged storage market, Inspur Information has a deep understanding of product innovation and future trends in this field. With the arrival of AS13000G7, Inspur Information is expected to accelerate the adoption of distributed converged storage across industries and bring a steady stream of data vitality to the digital transformation and intelligent upgrading of thousands of industries.

Origin blog.csdn.net/dobigdata/article/details/130871964