Zhijiang Lab: How to build a storage layer for super-heterogeneous computing power clusters based on JuiceFS?

Today, high-performance computing combined with artificial intelligence is driving scientific innovation. For example, deciphering the genetic code of rice can move crop breeding from "selection by experiment" to "selection by computation", and in medicine, rapidly analyzing the interactions between molecules and proteins can uncover drug candidates that effectively intervene in diseases.

Zhijiang Laboratory is one of the driving forces behind this kind of scientific innovation. The laboratory is a new type of R&D institution organized as a public institution, led by the Zhejiang Provincial Government, supported by Zhejiang University and other institutions, with participation from enterprises; it provides new methods, tools, and means for research in these fields.

Computing power resources are naturally heterogeneous: computing power built on different technologies often comes from different system architectures or instruction sets, which leads to software incompatibility, raises the threshold for using that computing power, and makes it hard to utilize effectively. To solve this problem, Zhijiang Lab brings together various heterogeneous computing power resources into a huge "computing power pool". This article shares how Zhijiang Lab built a storage layer for its super-heterogeneous computing power clusters based on JuiceFS.

01- The digital reactor of Zhijiang Laboratory

The digital reactor is a large-scale scientific research facility at Zhijiang Laboratory, composed of both hardware and software. On the software side, the Yaoguang intelligent operating system developed by the lab is responsible for it.

The intelligent operating system has two key components. First, it provides a general computing platform that supports the application layer above it. Through this platform, users can develop applications for different domains, such as computational materials science, computational pharmacology, and computational astronomy.

Second, it implements a heterogeneous resource aggregation scheme. Zhijiang Lab operates multiple heterogeneous clusters, including CPU clusters, GPU clusters, and some supercomputing resources, each used in a different way. The heterogeneous resource aggregation scheme brings these different computing resources together in a unified way, enabling unified management and use.

The overall architecture is shown above. In Zhijiang Laboratory's computing and data center, we have deployed multiple heterogeneous clusters, including H3C's Aofei cluster, Zhijiang's own computing cluster, and a domestic Dawning (Sugon) cluster, and we have also incorporated edge computing scenarios into our control system. Through cluster device plug-ins, we abstract these different clusters in a unified way and aggregate their computing power. In the heterogeneous computing power federation, each computing power cluster is abstracted as a Kubernetes (K8s) cluster for management.

For the business instructions and various types of jobs issued by the upper layer, a meta-scheduler decides which cluster each job should be sent to. Based on scheduling policies such as computing priority, energy-consumption priority, and performance priority, it determines how each computing job is executed. At present, we have access to about 200P of AI computing power (PFLOPS; 1P equals one quadrillion floating-point operations per second) and 7,000 cores of HPC computing power.

The demand for storage on the computing power side

First: abstraction and unification of the storage layer. Many computing scenarios, including supercomputing and AI training, rely on POSIX interfaces, so we hope to provide services uniformly through the JuiceFS interface at this layer.

The second aspect: generality of the storage solution. The computing power clusters we connect to are heterogeneous, so the solution must be applicable across different heterogeneous clusters.

The third aspect: data orchestration. Our data has typical hot/cold characteristics. While a task is running, the data it uses is hot; once the task finishes, or after a few days, the data turns cold and is rarely read or modified.

The fourth aspect: storage performance. Read and write performance must be good, especially read performance for hot data. In a computing power cluster, computing resources are precious; if CPUs and GPUs sit idle waiting on slow data reads, it is a huge waste.

Storage solution selection

Solution 1: raw object storage (OSS) with S3FS, plus NAS

This solution has a big problem: the performance of using raw object storage directly is very poor. In addition, S3FS mount points over raw object storage frequently disappear for no apparent reason, and once a mount point is lost the container can no longer access the data; restoring it requires restarting the entire container, which is hugely disruptive to user services.

Because the cluster was small at the beginning and this solution was simple to deploy, the laboratory initially adopted it. However, as the cluster grew, especially with the digital reactor whose construction began last year, the environment evolved from a single cluster of about 10 nodes to more than 100 nodes, and at that scale this solution basically stopped working.

Solution 2: Alluxio + Fluid + OSS

After investigation, we found this solution's architecture relatively complex, involving many components. Zhijiang Lab is a super-heterogeneous multi-cluster environment, and Alluxio is not a strongly consistent file system; it is really just a caching glue layer. In such a multi-cluster environment it faces metadata inconsistency, and solving that problem is particularly difficult. Because upper-layer users' workloads are very diverse, we cannot dictate how users access data, and inconsistent data across clusters would cause serious problems. Secondly, OSS is still used at the bottom layer; once the data scale grows large enough, OSS's metadata performance becomes a major bottleneck for operations such as metadata synchronization and initializing the cache layer of a new cluster.

Solution 3: JuiceFS (finally adopted)

JuiceFS has very detailed community documentation and could be used out of the box, and it performed well both when we built our test cluster and in the final production deployment. In addition, JuiceFS supports CSI, can be deployed in containers, and adapts well to domestic hardware. We therefore chose JuiceFS as the storage base on our computing power side.

Advantages of using JuiceFS

First of all, JuiceFS offers a rich choice of metadata engines, such as Redis and TiKV, which gives it good metadata performance. Our lab currently uses a three-node TiKV cluster as the metadata engine. That configuration was set up last year and its performance is no longer quite sufficient, so we will gradually scale it up.

Initially we considered using Redis as the metadata engine, but found that Redis cannot scale horizontally. With TiKV, we can expand gradually as the file system grows, which suits us better.
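As a concrete illustration, here is a hedged sketch of formatting a JuiceFS volume against a three-node TiKV metadata cluster and an OSS bucket; the PD endpoints, bucket, and credentials are placeholders, not our actual configuration:

```bash
# Sketch: create a JuiceFS volume with TiKV as the metadata engine and
# OSS as the data store. All endpoints and credentials are placeholders.
juicefs format \
    --storage oss \
    --bucket https://example-bucket.oss-cn-hangzhou.aliyuncs.com \
    --access-key <ACCESS_KEY> \
    --secret-key <SECRET_KEY> \
    "tikv://pd-1:2379,pd-2:2379,pd-3:2379/jfs-vol" \
    jfs-vol
```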

Second, in a cross-cluster environment, JuiceFS provides file atomicity and consistency: a file written in cluster A is immediately visible in cluster B. Alluxio cannot do this; it needs to perform data synchronization events, and those events carry real overhead.

Third, JuiceFS has caching capability. A client can be configured with a cache directory, which greatly reduces the pressure on the computing power cluster's underlying storage.
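For illustration, a hedged sketch of such a mount; the cache path and size limit are placeholders (note that --cache-size is specified in MiB):

```bash
# Sketch: mount JuiceFS with a client-side cache directory.
# --cache-size is in MiB; 512000 MiB is roughly 500 GiB.
juicefs mount -d \
    --cache-dir /data/jfs-cache \
    --cache-size 512000 \
    "tikv://pd-1:2379,pd-2:2379,pd-3:2379/jfs-vol" \
    /jfs
```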

Fourth, JuiceFS is highly POSIX-compatible. We found Alluxio's actual compatibility is not as good, and its client performance is mediocre. Alluxio may be better suited as a unified access layer over heterogeneous data sources, and it works well for reading data, but it is not ideal when data must be written or modified frequently.

Fifth, the JuiceFS community is very active.

The results above were measured by ourselves in the laboratory environment with FIO: 16 threads, 4 MB blocks, 1 GB of data. The NAS numbers are missing from the chart because the NAS was still serving the production environment during the evaluation, with about 70 nodes running against it; the remaining bandwidth was so small that the benchmark could not meaningfully run.
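For reference, a roughly equivalent FIO invocation to the benchmark described above, assuming a sequential-read workload against a JuiceFS mount (the target directory is a placeholder):

```bash
# Sketch: 16 jobs, 4 MiB blocks, 1 GiB per job, sequential read.
fio --name=seq-read \
    --directory=/jfs/fio-test \
    --rw=read \
    --bs=4M \
    --size=1G \
    --numjobs=16 \
    --group_reporting
```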

02-Storage-computing separation architecture evolution

In the early days, the high-performance computing pipeline consisted of many stages, and data scattered across different storage systems hurt efficiency and convenience. To simplify data management and movement, we adopted a unified storage base as our storage infrastructure. Its core requirements are high reliability, low cost, and high throughput, so we chose object storage as the base. Storing data in object storage also makes hot/cold tiering easy, saving storage space.

However, letting computing clusters use raw object storage directly still has problems. The first is poor metadata performance: for example, listing files in a directory takes a very long time when the directory contains a large number of files. The second is high bandwidth consumption: the data lake is reachable only over an ordinary IP network rather than an RDMA high-speed network, so total bandwidth is limited.

Therefore, alongside the object storage we also built a metadata cluster running TiKV. On top of object storage and TiKV we built the JuiceFS distributed file system, and the computing power clusters read the file system's data through the JuiceFS client installed on each node. This overcomes the limitations of object storage, improving metadata performance and reducing bandwidth consumption.

To achieve efficient data transfer, we let users upload and download files through a file management system, whose bottom layer writes data into the underlying storage through the JuiceFS S3 gateway.
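As a hedged sketch, starting the S3 gateway looks roughly like this; the credentials and listen address are placeholders:

```bash
# Sketch: expose a JuiceFS volume over the S3 protocol for the file
# management system. Credentials and address are placeholders.
export MINIO_ROOT_USER=<gateway-user>
export MINIO_ROOT_PASSWORD=<gateway-password>
juicefs gateway \
    "tikv://pd-1:2379,pd-2:2379,pd-3:2379/jfs-vol" \
    0.0.0.0:9000
```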

Besides the data lake and the metadata cluster, we also built a cache cluster, deployed close to the computing cluster, whose main purpose is to deliver the best possible I/O performance. It solves the problem of moving data efficiently between the computing cluster and the object storage data lake at the base; users do not need to care whether their data lives in object storage or in the cache cluster.

The computing power system controls the data flow. The computing cluster and the cache cluster are connected by a 200 Gbps RDMA high-speed network. The BeeGFS high-speed parallel file system is deployed on the cache cluster and mounted as a directory on the computing cluster, so the computing cluster can use the cache system as if it were a local directory.

03- Productizing storage capabilities

Storage requirements and performance targets differ across business scenarios. To serve users more efficiently, we proposed productizing our storage capabilities. JuiceFS currently backs the following types of storage products.

General file storage

JuiceFS stores each user's data under a specific directory and generates a unique access path based on the user's organizational structure. Data isolation is achieved by mounting only that path into the user's container. Users can upload and download files through the web page, or operate on files with the commands and tools we provide.
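One way to realize this kind of per-user isolation is JuiceFS's --subdir mount option, which exposes only a sub-tree of the file system; a hedged sketch with an illustrative directory layout:

```bash
# Sketch: mount only one user's directory into their container, so the
# rest of the file system is invisible. The path layout is illustrative.
juicefs mount -d \
    --subdir /org/team-a/alice \
    "tikv://pd-1:2379,pd-2:2379,pd-3:2379/jfs-vol" \
    /mnt/userdata
```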

Storage volumes

In the initial construction stage, general file storage had a problem: poor capacity scalability. The capacity of the underlying object storage (OSS) cluster is limited, and as data volume grows, users cannot apply for more storage space. To solve this, we introduced the concept of storage volumes.

Storage volumes can be compared to cloud disks, and different storage volumes are equivalent to different types of cloud disks. For different storage types, we can package them into different storage volumes to meet the needs of users in different scenarios.

Scenarios that read and write a large number of small files frequently need a storage product with low latency and high throughput. To meet this need, we turned the previously built cache cluster into a high-speed storage volume and opened its file system directory directly to users. Users can thus use the high-speed storage without going through JuiceFS and feel its performance advantage more directly.

For users who need to store large amounts of data that is read infrequently, JuiceFS and object storage are combined into a standard storage volume. It offers large capacity and acceptable throughput, and, unlike the high-speed storage volume, supports network access across clusters.

Some users have even higher performance requirements, for example local-disk-class products, but they also need data persistence. In Kubernetes, if users write directly to a local disk there is a risk of data loss, such as on unexpected restarts or physical node failures, so a persistence scheme is needed. We carve out part of the local disk on specific nodes as a local storage volume for the user, and during job scheduling the task is placed on the node indicated by the storage volume the user specified.
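In Kubernetes terms, one way to express such a local storage volume is a local PersistentVolume whose node affinity pins consuming pods to the node that owns the disk. A minimal sketch, with all names and paths illustrative:

```bash
# Sketch: a local PV bound to node01; pods claiming it are scheduled there.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-vol-node01
spec:
  capacity:
    storage: 500Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-volume
  local:
    path: /mnt/local-disk/user-a     # carved-out space on node01's disk
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node01"]
EOF
```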

Different storage products also differ in capacity, throughput, and cross-cluster connectivity. For example, high-speed storage can be reached within a cluster but not across clusters. They also differ in capacity and cost: high-speed storage uses an all-flash cluster and is relatively expensive to build, while object storage is relatively cheap and offers large capacity. We therefore package different storage hardware capabilities into different storage products to fit users' different business scenarios.

Data orchestration

We also implemented data orchestration on top of JuiceFS. Administrators can upload commonly used datasets to a directory in the file system, and that directory is abstracted as a public dataset at the upper layer. Different users can mount these datasets when creating jobs. Ordinary users can also upload their own private datasets and preheat them with JuiceFS's warmup function.

We built a cache cluster inside the computing power cluster. Using the warmup command, a user's datasets can be preheated from the backend storage straight into the cache cluster next to the computing nodes. During heavy model training, users then interact directly with this high-performance cluster instead of the remote OSS cluster, obtaining much better performance.
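The preheating itself uses JuiceFS's warmup subcommand; a hedged sketch with an illustrative dataset path and concurrency:

```bash
# Sketch: pull a dataset into the local cache ahead of training,
# using 8 concurrent threads. The path is illustrative.
juicefs warmup -p 8 /jfs/datasets/public/imagenet
```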

This setup also reduces network bandwidth pressure on the object storage base. The cache directory can be configured with a capacity limit, and the entire cache eviction process is then managed automatically by the JuiceFS client, so for users this functionality is transparent and easy to use.

04- Problems encountered while using JuiceFS

File read performance

After choosing JuiceFS, we ran file reading performance tests internally and together with the algorithm team. At the time, the results showed JuiceFS reads consistently much slower than NAS, so we set out to find out why.

We later discovered that with JuiceFS on TiKV metadata, operations such as listing a directory return entries in effectively random order; unlike NAS and some other file systems, the order is not guaranteed to be consistent. If an algorithm selects files based on listing order, or the code assumes the selected files are fixed across runs, this breaks that assumption.

In scenarios with a large number of small files, metadata overhead is considerable. If metadata is not cached in memory, every access must fetch it from the metadata engine, which is a large overhead compared to having a cache. This problem taught us that in such scenarios, the file index should be organized explicitly. For example, an algorithm may need to process hundreds of thousands or even millions of files. To keep algorithm training consistent, first build your own index file and index directory tree over these files; each training run then reads the index file directly instead of calling a list-directory operation, guaranteeing that the set of files under the folder stays consistent throughout training.
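A minimal sketch of this index-file approach, with illustrative paths: build the sorted index once, then have training jobs iterate over it instead of listing the directory:

```bash
# Sketch: snapshot a stable, sorted file list once.
find /jfs/datasets/project-x -type f | sort > /jfs/datasets/project-x.index

# Training jobs read paths from the index, never from the directory listing.
while IFS= read -r f; do
    process "$f"   # placeholder for the actual per-file work
done < /jfs/datasets/project-x.index
```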

Editor's note: the slow read performance here was mainly related to the user's usage pattern. After evaluation, the unordered directory-listing behavior was not changed. If other users run into similar problems, please feel free to raise them.

TiKV garbage collection did not run

During use we hit a problem where TiKV never performed garbage collection. We use only this one file system; it reported a capacity of 106 TiB and about 140 million files, yet TiKV occupied 2.4 TiB, which is clearly abnormal.

According to the official documentation, with Redis for example, about 100 million files should occupy only about 30 GiB. After investigation we found that TiKV as the metadata engine was not being garbage collected: the garbage collection metrics in our dashboards were all empty. The likely cause was that we had deployed only TiKV, not TiDB, yet TiKV's garbage collection actually depends on TiDB to drive it. This is an easily overlooked pitfall.

Editor's note: JuiceFS added a background GC task for TiKV in PRs #3262 and #3432, which fixes this problem. The fixes were merged in v1.0.4.

JuiceFS client memory usage is high

When mounting the JuiceFS client, we set the cache cluster's mount as the cache directory and configured its capacity very high, with a theoretical upper limit of 50 TiB.

In that situation the JuiceFS client periodically scans the cache directory and builds an in-memory index so that it knows which data is already cached, and this consumes quite a lot of memory. If the cache directory is very large, we recommend disabling this scan.

When testing small-file random I/O we felt the performance was fine, but sequential I/O revealed a bigger problem. For example, creating a 500 MB file with the dd command generated a surprising number of writes to the object store; the storage consumed and the operations hitting object storage far exceeded what creating a 500 MB file should require.

Further investigation showed that after enabling the -o writeback_cache mount parameter, sequential writes can turn into random writes, degrading overall sequential write performance. This parameter is only suitable for scenarios dominated by small random writes; using it anywhere else can cause serious problems. This is another point to watch out for when using JuiceFS.
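For clarity, this is a FUSE option passed at mount time; as a sketch, and only for workloads dominated by small random writes:

```bash
# Sketch: enable the FUSE writeback cache. Avoid this for sequential-write
# workloads, where it degraded throughput badly in our tests.
juicefs mount -d -o writeback_cache \
    "tikv://pd-1:2379,pd-2:2379,pd-3:2379/jfs-vol" /jfs
```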

Editor's note: this problem mainly affects the scenario of using NAS as the cache. It has been optimized in 1.1 Beta, which greatly reduces memory usage and speeds up scanning. JuiceFS added --cache-scan-interval in #2692 to let users customize the scan interval, scan only once at startup, or disable scanning entirely. Users who cache on local disks need no adjustment.

05-Follow-up planning: richer and more diverse storage products

Multi-level

We will provide more tiers of software and hardware products and productize these capabilities as different storage volumes to meet users' storage needs in different scenarios.

Isolation

  • At present there are data security risks: all user data lives in one large file system and is mounted onto bare metal via hostPath, so a user with node login permission can actually access the entire file system's data. To solve this, we plan to switch to CSI mode combined with customized paths, which avoids these isolation problems (see the sketch after this list).

  • We will also launch quota management. When users consume storage products, there must be a hard limit on the capacity a user can use, and an accurate way to see how much they actually use. Checking capacity with the du command is expensive and inconvenient; the quota management feature will solve this.

  • For metering and billing, we need to know the traffic users generate and the energy consumed, and bill by actual capacity used, which again requires good capacity management.
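A hedged sketch of the planned CSI approach: a statically provisioned PersistentVolume served by the JuiceFS CSI driver that mounts only one user's subdirectory. All names and the secret are illustrative:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: jfs-user-alice
spec:
  capacity:
    storage: 10Ti
  accessModes: ["ReadWriteMany"]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: csi.juicefs.com
    volumeHandle: jfs-user-alice
    nodePublishSecretRef:
      name: juicefs-secret       # holds the metadata URL and credentials
      namespace: kube-system
  mountOptions:
  - subdir=/org/team-a/alice     # restrict the mount to this user's path
EOF
```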

Monitoring & O&M

  • We currently mount JuiceFS directly on physical machines. Each mount exposes a monitoring port; our production cluster talks to these ports to build a monitoring system that collects all monitoring data (see the sketch after this list).

  • Our data disaster recovery and migration capabilities are still relatively weak. A typical scenario we encountered: an existing cluster runs out of capacity and a new cluster must be brought online. How to migrate data between the old and new clusters, which migration method to use, and how to keep the business uninterrupted without affecting production users are all difficult questions, so we plan to work on solutions to strengthen this area.

  • We are also building, on top of JuiceFS and the CSI plug-in, a general capability for dynamically mounting different storage clients. In production, users need to adjust mount parameters because different parameters suit different business products, but adjusting them directly can interrupt everything on the physical node. With dynamic mounting, users would only need a brief, controlled switch of the business, with no restarts or other disruption.
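As a sketch of the monitoring setup: each JuiceFS mount exposes Prometheus metrics (by default on port 9567), and a scrape job collects them; node addresses below are placeholders:

```bash
# Check the metrics endpoint on a node with a JuiceFS mount.
curl -s http://127.0.0.1:9567/metrics | head

# A matching Prometheus scrape job (appended to prometheus.yml).
cat <<'EOF' >> prometheus.yml
  - job_name: juicefs
    static_configs:
      - targets: ["node01:9567", "node02:9567"]
EOF
```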

Some functional expectations for JuiceFS:

  1. Volume management capability (quotas, users, permissions). The volume capability previously implemented part of the quota function, but what we need more is management based on user and administrator permissions. Today JuiceFS is mounted as one large file system, and creating a separate file system per user would be too expensive, so we run one large file system and coordinate different users' permissions through different directories. What is missing is a unified, centralized user permission management system; we still rely on Linux permissions, which are distributed across nodes and hard to administer. We are considering whether the metadata engine could serve as a centralized database for managing users and permissions, which would make volume management much more product-ready.

  2. Distributed caching capability. It would make full use of the machines' physical resources and local disks, utilized through the computing nodes and the intra-cluster network. As we understand it, the commercial edition provides this feature, but the community edition does not yet.

  3. Mount point hot-update capability. Different scenarios need different mount parameters, but unmounting and remounting is too heavy an operation and disrupts users' business. Users can accept a brief read interruption, but not restarting the business container or interrupting algorithm training. We are researching and developing this capability internally.

Editor's note: the directory quota requirement has been implemented in 1.1 Beta; watch next week's release announcement for details. Distributed caching and mount point updates are currently available in the commercial edition, and both features are on the community edition roadmap.

If this is helpful to you, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)

Origin my.oschina.net/u/5389802/blog/10082842