White Paper | Distributed Storage Development White Paper (2023)

On December 1, at the 2023 Cloud Native Industry Conference, the Cloud Computing and Big Data Research Institute of the China Academy of Information and Communications Technology (CAICT) released the "Distributed Storage Development White Paper (2023)" jointly with Huawei, Dell Technologies, IBM, and other members of the distributed storage industry alliance.

1. The need for data intelligence

(1) Large model training requires massive amounts of unstructured data, which places higher demands on the efficiency of data storage and flow.

(2) As a key component of computing power interconnection, data flow is the basis for unlocking the value of computing power resources and a key link in achieving collaboration between data and computing.

2. Industry analysis

(1) A solid data foundation: the distributed storage market shows steady growth.

China's distributed storage market in 2022 is estimated at 20.5 billion yuan, with a compound annual growth rate of 15%. Of this, storage solutions that integrate software and hardware account for 91.3% of the market, mainly serving unstructured data needs in scenarios such as AI large models and big data lakes.
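To make the growth figure concrete, here is a minimal sketch of what a 15% compound annual growth rate implies. It assumes the rate applies forward from the 2022 base and stays constant, which the summary does not state; the function name, the three-year horizon, and the projected values are illustrative only and are not figures from the white paper.

```python
def project_market_size(base, cagr, years):
    """Return sizes for year 0..years, compounding once per year."""
    return [base * (1 + cagr) ** n for n in range(years + 1)]


BASE_2022 = 20.5          # billion yuan, from the white paper summary
CAGR = 0.15               # 15%, from the white paper summary
INTEGRATED_SHARE = 0.913  # share of integrated software-hardware solutions

# Assumption: the 15% rate holds unchanged for each year after 2022.
projection = project_market_size(BASE_2022, CAGR, 3)
for offset, size in enumerate(projection):
    print(f"{2022 + offset}: {size:.1f} billion yuan")

# Size of the integrated software-hardware segment in the base year.
integrated_2022 = BASE_2022 * INTEGRATED_SHARE
print(f"Integrated segment, 2022: {integrated_2022:.1f} billion yuan")
```

Under these assumptions the market would grow by roughly half in three years, since 1.15 cubed is about 1.52.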

(2) Industry ecosystem landscape: close cooperation across the ecosystem.

Across the entire distributed storage industry chain, every segment of the ecosystem is growing in scale, and product forms and service types are diversifying.

(3) Media and protocols are being upgraded at an accelerating pace, and all-flash and converged forms are developing rapidly.

Thanks to flash memory performance, high-speed lossless RDMA networking, a compression software stack, and other all-flash design elements, distributed all-flash storage, a new storage product form, delivers stable sub-millisecond access latency.

Distributed storage has also developed a new form: distributed converged storage. A single distributed storage system serves multiple protocols at the same time and supports protocol interoperability. This reduces data relocation and duplicate storage, improves data processing efficiency by 35%, and cuts energy consumption costs by approximately 20%.

3. Scenario interpretation

The application scenarios of distributed storage are becoming increasingly diverse. The white paper focuses on emerging application scenarios and on development trends in typical ones. The scenarios covered are AI large models, integrated big data lake and warehouse, digital pathology, bioinformatics analysis, quantitative trading, edge computing, and data networks.

| Scenario | Characteristics | Distributed storage advantages |
| --- | --- | --- |
| AI large model | Large data volume, parallel data processing, diverse data formats, massive numbers of small files, high reliability and high availability | Massive storage space with online expansion, efficient data flow through protocol interoperability, and performance support for massive numbers of small files |
| Integrated big data lake and warehouse | Transaction support, open data formats, separation of storage and computing, support for multiple workloads, BI support | Unified data storage layer, unified metadata layer, cache acceleration, and unified computing scheduling |
| Digital pathology | Large slice files, large data volume, long data retention periods, and difficult data management | Secondary compression of pathological images, tiered data storage, concurrent retrieval of massive numbers of slices, innovative cold-data storage media, and multi-protocol interoperability |
| Bioinformatics analysis | Large data volume, high bandwidth, low latency, high reliability, and the need to work with GPU and other high-concurrency computing clusters | Massive data support, performance matched to business needs, and full data life cycle management |
| Quantitative trading | Large volume of basic quantitative data, "AI + machine learning" quantitative trading becoming the industry mainstream, many data types, and low signal-to-noise ratio | Massive data support, elastic expansion, GPU storage pass-through, unified namespace |
| Edge computing | Ultra-low latency, data security, flexibility and scalability, high reliability, cloud-edge collaboration, and edge intelligence | Long-term low-cost data storage, fast retrieval, multi-protocol interoperability, support for big data analysis, and data security |
| Data network | Cross-region, cross-architecture, cross-service-provider, large data volume | The storage layer builds cross-domain and cross-cloud data flow capabilities; a unified multi-cloud data foundation expands data sharing applications; a global file system forms a data interconnection network |

4. Technical perspective

(1) In terms of architecture, development is toward converged workloads, higher density, and faster networks.

(2) In terms of functions, development is toward scenario-specific lossless compression and multi-active disaster recovery.

(3) In terms of hardware, development is toward all-flash, high efficiency, and energy saving.

(4) In terms of ecosystem, development is toward open integration with cloud storage and storage pass-through.

5. Joint development and win-win results

(1) In terms of ecosystem, build an ecosystem of open cloud storage interconnection and computing power interconnection.

(2) In terms of industry, promote distributed storage innovation and build AI data engines.

(3) In terms of standards, improve the standards and evaluation system to promote the healthy development of the industry.

Download link:

Link: https://pan.baidu.com/s/1Urcb1VCrcqMkb4UgTkHvcQ?pwd=pqcu

Origin blog.csdn.net/iamonlyme/article/details/134869329