A New Generation of Distributed Converged Storage: All In One Across Data Scenarios

1. Summary

        On May 11, 2023, the Guangzhou stop of Inspur Information's national tour officially kicked off. At the event, Inspur released the AS13000G7, a new generation of distributed converged storage. It adopts an extreme-convergence architecture so that a single storage system can serve four types of unstructured data in an "All In One" fashion. According to the announcement, data storage capacity is increased by 300% and I/O performance is improved by 100%. It also provides lossless access through four unstructured-data protocols, delivers "All In One" across multiple scenarios, and accelerates the release of the value of data elements.

2. Introduction

        In the era of smart computing, computing power is productivity, and data is both the core factor of production and the basis for AI training and smart applications. In thousands of intelligent application scenarios such as AIGC, intelligent driving, intelligent manufacturing, and intelligent healthcare, PB-scale or even EB-scale multimodal data is a key element. For example, for the recently popular large AI models, parameter counts are massive and growing rapidly, and the data involved is increasingly diverse.

        The GPT-3 large language model has 175 billion parameters, and the more recent GPT-4 is reported to exceed a trillion parameters. At the same time, data types have become richer: in addition to text, image, audio, and video data are also required. Facing massive, polymorphic data across thousands of industries, enterprises need high-end storage that turns complexity into simplicity.

        In scenarios such as autonomous driving, astronomical observation, and gene sequencing, a single data processing workflow usually involves storage and access through multiple protocols, such as file, object, and big data. Take astronomical observation as an example: a complete processing workflow involves four steps (data collection, data preprocessing, data analysis, and result preservation), and different stages use different access protocols. Traditional distributed storage supports only single-protocol access, so customers must deploy several storage systems at the same time and convert and copy data whenever they switch protocols. This wastes storage space, increases storage costs, and greatly reduces data processing efficiency.

3. Architecture Introduction

        The new generation of distributed converged storage is a single cluster system that simultaneously supports four protocols (file, object, big data, and video) to achieve data convergence. It also supports four types of storage media (flash, disk, tape, and optical disc) to achieve management convergence. It supports application scenarios including infrastructure cloudification, structured data, and unstructured data, and it supports full-lifecycle management, with data flowing freely and efficiently among the four storage tiers of hot, warm, cold, and ice, realizing "one storage architecture supports one data center".

4. Key technologies 

        First, storage resources are integrated and interoperable, and data is shared globally

        The distributed converged storage platform builds a global, unified storage resource pool in which data and metadata are managed uniformly. Data is shared across the different protocols (NFS/CIFS/HDFS/S3), and only one copy of each created file's data and metadata is kept, effectively reducing the cost of storing duplicate data.

        Second, multi-protocol integration and intercommunication, zero copy of data

        Based on the characteristics of the traditional NFS, CIFS, HDFS, and S3 storage protocols, the distributed converged storage platform is designed around a unified storage architecture. The protocol layer requires no data conversion or copying, no gateways or plug-ins, and no changes on the compute side or in the application layer. Applications access data directly and transparently with each protocol's native semantics, greatly improving data processing efficiency.
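
        The single-copy, multi-protocol idea can be illustrated with a minimal Python sketch. It is not the AS13000G7 implementation: the class `UnifiedStore`, its methods, and the path-to-key mapping are hypothetical names chosen for illustration. The point is only that every protocol view resolves to the same canonical key, so a write through one protocol is readable through the others without conversion or copying.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedStore:
    """Toy single-copy store: one record per object, several protocol views."""
    # Canonical key is (namespace, relative path); this layout is an assumption.
    _objects: dict = field(default_factory=dict)

    @staticmethod
    def _from_nfs(path: str) -> tuple:
        # "/export/dir/file" -> ("export", "dir/file")
        namespace, _, rest = path.lstrip("/").partition("/")
        return namespace, rest

    @staticmethod
    def _from_hdfs(uri: str) -> tuple:
        # "hdfs://cluster/export/dir/file" -> ("export", "dir/file")
        path = uri.split("://", 1)[1].partition("/")[2]
        namespace, _, rest = path.partition("/")
        return namespace, rest

    def s3_put(self, bucket: str, key: str, data: bytes) -> None:
        # The S3 bucket plays the role of the namespace.
        self._objects[(bucket, key)] = data

    def nfs_read(self, path: str) -> bytes:
        return self._objects[self._from_nfs(path)]

    def hdfs_read(self, uri: str) -> bytes:
        return self._objects[self._from_hdfs(uri)]

    def copies(self) -> int:
        # Number of stored records, regardless of how many protocols read them.
        return len(self._objects)


store = UnifiedStore()
store.s3_put("export", "frames/0001.fits", b"raw telescope frame")
nfs_view = store.nfs_read("/export/frames/0001.fits")
hdfs_view = store.hdfs_read("hdfs://cluster/export/frames/0001.fits")
```

        In this sketch a frame written as an object during data collection is read back as a file and as an HDFS path during later stages, while `store.copies()` stays at one.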

        Third, native semantic support, zero semantic loss

        Semantic loss is the main reason why traditional protocol-interworking solutions have not been commercially viable. File, HDFS, and object services have different semantics because they serve different usage scenarios, for example file snapshots, object multipart uploads, and HDFS Ranger authorization. Because their storage architectures and metadata management are inconsistent, traditional interworking solutions cannot fully support the semantics of every protocol and usually require adaptation and modification of the upper layer, resulting in semantic loss. The distributed converged storage platform implements unified multi-protocol metadata management on a unified storage architecture, so each protocol accesses the storage system with its native semantics and without loss, transparently to applications.

        Fourth, permission interoperability, multi-protocol permission linkage

        Because files, objects, and HDFS manage permissions differently, permission management in traditional protocol-interworking solutions is rather chaotic and permissions cannot interoperate, which causes great inconvenience for user access.

        Based on the different access forms and isolation restrictions of Windows users, Unix users, and object users, a user mapping mechanism is designed so that different types of users share permissions, breaking down the isolation barriers between them. Permissions for unstructured data are managed in a unified way: each piece of data has a single set of permission information, and a permission change made through any one protocol takes effect on the other protocols at the same time, achieving true real-time permission linkage.
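
        A minimal sketch of user mapping plus a single shared permission record might look as follows. The class `PermissionService`, its method names, and the sample identifiers (SID, UID, access key) are all made up for illustration and do not describe the product's actual interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class PermissionService:
    """Toy model: one identity map and one ACL table shared by all protocols."""
    # (protocol, protocol-specific id) -> unified user; all ids are hypothetical.
    user_map: dict = field(default_factory=dict)
    # path -> {unified user -> set of rights}; stored once, not per protocol.
    acls: dict = field(default_factory=dict)

    def map_user(self, protocol: str, external_id: str, unified: str) -> None:
        self.user_map[(protocol, external_id)] = unified

    def grant(self, protocol: str, external_id: str, path: str, right: str) -> None:
        user = self.user_map[(protocol, external_id)]
        self.acls.setdefault(path, {}).setdefault(user, set()).add(right)

    def revoke(self, protocol: str, external_id: str, path: str, right: str) -> None:
        user = self.user_map[(protocol, external_id)]
        self.acls.get(path, {}).get(user, set()).discard(right)

    def allowed(self, protocol: str, external_id: str, path: str, right: str) -> bool:
        # Unmapped identities resolve to None and therefore have no rights.
        user = self.user_map.get((protocol, external_id))
        return right in self.acls.get(path, {}).get(user, set())


svc = PermissionService()
svc.map_user("cifs", "S-1-5-21-1000", "alice")   # Windows SID (made up)
svc.map_user("nfs", "1000", "alice")             # Unix UID (made up)
svc.map_user("s3", "AKIAEXAMPLE", "alice")       # object access key (made up)
svc.grant("nfs", "1000", "/export/report.csv", "read")
```

        Because the three identities map to one unified user and the ACL is stored once, the grant issued through NFS is immediately visible to CIFS and S3 checks, and a later revoke through any protocol removes it for all of them.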

        Fifth, redundant protection, data security and reliability

        The platform supports a comprehensive data protection strategy, providing data redundancy protection at the cross-node, cross-rack, and other levels, so users need not worry about data loss caused by unexpected failures such as downtime or power outages. It also supports both replication and erasure-coding redundancy strategies, enabling timely and rapid data recovery and improving data reliability.
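
        To make the erasure-coding idea concrete, here is a toy single-parity (k+1) code in Python. It is deliberately simpler than what a production system would use: XOR parity can rebuild only one lost shard, whereas schemes that tolerate several simultaneous failures typically rely on more general codes such as Reed-Solomon. The function names are illustrative only.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized shards."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(shards: list) -> bytes:
    """Return one parity shard for a list of equally sized data shards."""
    return reduce(xor_bytes, shards)


def recover(shards: list, parity: bytes) -> list:
    """Rebuild at most one missing shard (marked as None) from the parity."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost shard")
    repaired = list(shards)
    if missing:
        present = [s for s in shards if s is not None]
        # XOR of the parity with all surviving shards yields the lost shard.
        repaired[missing[0]] = reduce(xor_bytes, present, parity)
    return repaired


data = [b"node", b"rack", b"disk"]      # three data shards on three nodes
parity = encode(data)                   # stored on a fourth node
damaged = [data[0], None, data[2]]      # the second node fails
restored = recover(damaged, parity)
```

        Compared with keeping full replicas, the parity shard costs only one extra shard of space for the whole stripe, which is why erasure coding is attractive for capacity-oriented tiers.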

        Sixth, hierarchical storage of data to reduce storage costs

        With the explosive growth of data, a single form of storage can no longer satisfy users' needs for both high performance and low cost. AS13000 provides flexible tiering policies that place data on high-performance media or relatively low-cost media according to the configured policy and the data's access frequency (heat), making reasonable use of storage space, reducing storage costs, and responding quickly to users' data storage needs.
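
        A tiering policy of this kind can be sketched in a few lines of Python. The thresholds below are invented for illustration, and idle time since last access is used as a simple stand-in for data heat; the actual policy options of AS13000 are not described in this article.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; a real policy would be set by the administrator.
TIER_RULES = [
    ("hot", timedelta(days=7)),
    ("warm", timedelta(days=90)),
    ("cold", timedelta(days=365)),
]


def choose_tier(last_access: datetime, now: datetime) -> str:
    """Pick a storage tier from the time elapsed since the last access."""
    idle = now - last_access
    for tier, limit in TIER_RULES:
        if idle <= limit:
            return tier
    return "ice"  # anything older than the last threshold goes to the archive tier


now = datetime(2023, 5, 11)
placements = {
    days: choose_tier(now - timedelta(days=days), now)
    for days in (1, 30, 200, 1000)
}
```

        With these assumed thresholds, data touched yesterday stays on the hot tier, while data idle for several years is placed on the ice tier.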

        Finally, feature-level interoperability, efficient and convenient

        With a unified feature architecture and operation interface, feature-level interoperability is achieved, and unified value-added services are provided externally, such as unified quota, unified QoS, unified tiered storage, unified recycle bin, and unified metadata retrieval. Once configured, a setting takes effect immediately and simultaneously across protocols such as NFS, CIFS, S3, and HDFS.

5. Highlights

        A set of storage architecture integrates massive polymorphic data

        As digital transformation deepens, applications of massive polymorphic data are growing rapidly, and so is the demand for converged data storage. How to store polymorphic data such as videos and pictures for longer, at lower cost, with higher efficiency and greater reliability is a challenge for the industry.

        The new generation of distributed converged storage supports converged storage with a four-in-one architecture. Users can purchase one storage system and get four storage services: file, object, big data, and video. Different unstructured storage services can access the same data. With converged storage, space utilization efficiency is increased by 200%, and a single storage architecture can efficiently support a data center, meeting performance requirements while helping enterprises reduce TCO.

        At the same time, for massive multimodal scenarios, Inspur Information has built a dedicated high-density product: it adopts a 4U60 disk configuration and supports 20 TB large-capacity hard drives, so the capacity of a single node exceeds 1 PB and one node is worth three. In addition, based on 32+2 wide erasure coding and data reduction technology, hard disk utilization reaches as high as 94%.
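
        The capacity and utilization figures can be checked with simple arithmetic. The sketch below uses only the numbers quoted in the text and assumes decimal units (1 PB = 1000 TB); the variable names are illustrative. The 94% figure is consistent with the data fraction of a 32+2 erasure-coding stripe, before any effect of data reduction is considered.

```python
DRIVES_PER_NODE = 60                  # 4U60 chassis, from the text
DRIVE_TB = 20                         # 20 TB drives, from the text
DATA_SHARDS, PARITY_SHARDS = 32, 2    # 32+2 erasure coding, from the text

raw_tb = DRIVES_PER_NODE * DRIVE_TB   # raw capacity of one node in TB
raw_pb = raw_tb / 1000                # decimal units assumed

# Fraction of each stripe that holds user data rather than parity.
ec_efficiency = DATA_SHARDS / (DATA_SHARDS + PARITY_SHARDS)

print(f"raw capacity per node: {raw_tb} TB ({raw_pb} PB)")
print(f"32+2 EC space efficiency: {ec_efficiency:.1%}")
```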

        A set of storage platforms to accelerate data processing and flow

        Whether it is route decision-making in autonomous driving, precision marketing on e-commerce platforms, or digital-intelligence applications such as online medical consultation, all of them depend on collecting massive unstructured data such as images, text, and video, training and building models on it, and then analyzing it and making decisions. Real-time data was forecast to account for 25% of the global datasphere by 2023. Take high-precision maps as an example: they are generally refreshed from data that collection vehicles gather and send back every day for analysis. Each vehicle collects tens of terabytes of data per day and transmits GPS, trajectory, speed, latitude and longitude, and other data in real time, processing tens of millions of bits per second. Performance has become the never-ending pursuit of smart applications.

        To improve performance, the new generation of distributed converged storage promotes disk-controller collaboration and full-link, end-to-end performance optimization within one storage platform, allowing data to move efficiently among the hot, warm, cold, and ice storage tiers. AS13000G7 is equipped with fourth-generation Intel Xeon CPUs and self-developed PCIe 5.0 NVMe SSDs, and achieves its performance gains through code-level joint optimization of technologies such as the RDMA protocol, dedicated CPU cores, data partitioning, and random-to-sequential conversion. The bandwidth of a single node exceeds 50 GB/s, which is equivalent to transmitting 25 high-definition movies in one second. Compared with the previous generation, the performance of the new AS13000G7 has improved by at least 40%.
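
        For a sense of scale, the bandwidth figure can be turned into two quick estimates. The sketch below uses the 50 GB/s and 25-movies-per-second numbers from the text; the 10 TB daily volume per vehicle is an assumption taken as the low end of "tens of terabytes", and decimal units are assumed throughout.

```python
NODE_BANDWIDTH_GB_S = 50   # single-node bandwidth, from the text
MOVIES_PER_SECOND = 25     # comparison used in the text

# Movie size implied by the comparison in the text.
implied_movie_gb = NODE_BANDWIDTH_GB_S / MOVIES_PER_SECOND


def transfer_seconds(size_tb: float, bandwidth_gb_s: float = NODE_BANDWIDTH_GB_S) -> float:
    """Ideal time to move size_tb terabytes at the given bandwidth (no overhead)."""
    return size_tb * 1000 / bandwidth_gb_s


daily_vehicle_tb = 10      # assumption: low end of "tens of terabytes" per day
ingest_seconds = transfer_seconds(daily_vehicle_tb)

print(f"implied movie size: {implied_movie_gb} GB")
print(f"ideal ingest time for {daily_vehicle_tb} TB: {ingest_seconds} s")
```

        These are upper-bound figures for a single node at peak bandwidth; real ingest rates depend on the network, the workload mix, and the redundancy scheme.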

        A set of storage platforms ensures data security and reliability

        The new generation of distributed converged storage uses six layers of protection, spanning components, devices, whole-machine systems, core software, management software, and solutions, to keep services always online and data never lost. At the component level, high-reliability components are strictly selected, and components such as hard disks and SSDs are customized for reliability. At the cluster level, the fully symmetrical distributed architecture can scale out to a maximum of 10,240 nodes, and large-scale elastic EC lets the cluster tolerate the simultaneous failure of any 4 nodes.

        To counter virus and ransomware attacks, an end-to-end data security solution has been launched. First, it provides users with multiple layers of protection: production storage, active-active storage, and off-site backup. Second, it analyzes read and write behavior to predict ransomware activity, immediately terminates malicious behavior, and quickly restores data through high-density snapshot technology. Third, it integrates third-party anti-virus software to detect and remove ransomware. Finally, through data anti-tampering, physical isolation, encryption, and other technologies, viruses cannot get in or alter data, and data cannot be read or taken away, creating the last line of defense for data security.
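
        The article does not say how read and write behavior is analyzed. One commonly discussed heuristic, shown here purely as an assumed example and not as the product's method, is to flag a client that overwrites several files with high-entropy (encrypted-looking) data within a short window. The thresholds and function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; 8.0 means uniformly random bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def looks_like_ransomware(writes: list, entropy_threshold: float = 7.5,
                          min_files: int = 3) -> bool:
    """writes: (path, payload) overwrite events from one client in one time window."""
    suspicious = {path for path, payload in writes
                  if shannon_entropy(payload) >= entropy_threshold}
    return len(suspicious) >= min_files


# Payload with every byte value equally frequent stands in for encrypted data.
encrypted_like = bytes(range(256)) * 4
normal_text = b"quarterly sales report, region south\n" * 50

attack_window = [(f"/export/doc{i}.docx", encrypted_like) for i in range(3)]
normal_window = [(f"/export/doc{i}.txt", normal_text) for i in range(3)]
```

        A detector like this would only raise an alert; stopping the client and rolling back to a recent snapshot, as the text describes, would be separate steps.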
