Parallel consistent hash file system CHFS based on local persistent memory

Summary

        CHFS is an ad hoc parallel file system that utilizes the persistent memory of compute nodes. Its design is based entirely on a highly scalable distributed key-value store using consistent hashing. CHFS improves the scalability of parallel data access and metadata performance by eliminating dedicated metadata servers, sequential execution, and centralized data management. The implementation takes advantage of multi-core and many-core CPUs, high-performance networking, and remote direct memory access through the Mochi-Margo library. On a cluster of 4 persistent-memory nodes, CHFS performed 9.9 times faster than the state-of-the-art DAOS distributed object store and 6.0 times faster than GekkoFS on the IOR hard-write benchmark. CHFS also exhibits better scalability and performance than BeeOND and GekkoFS in terms of both bandwidth and metadata. CHFS is a promising building block for the HPC storage layer.

Introduction

        Leading supercomputers are used not only for computationally intensive scientific applications but also for data-intensive large-scale data analysis and machine learning applications. Storage performance has always been an issue because CPU/GPU performance improves faster than storage performance. To close this gap, leading supercomputers (such as Oak Ridge National Laboratory’s Summit [19], AIST’s ABCI, Tokyo Institute of Technology’s Tsubame, and University of Tsukuba’s Cygnus) have introduced node-local storage in compute nodes, such as NVMe SSDs and persistent memory. However, effectively utilizing node-local storage is a challenging problem that has been addressed by several teams. An ephemeral (ad hoc) distributed file system uses the local storage of compute nodes for the duration of a job. Existing ephemeral distributed file system designs rely either on block-based local file system storage or on databases with log-structured merge-tree (LSM-tree) data structures. Applying these methods to byte-addressable persistent memory is not the best option to effectively exploit its performance advantages [9].

        This paper proposes the design of an ad hoc parallel file system called CHFS that uses node-local persistent memory. Key features of the design include exploiting the low latency and high bandwidth of byte-addressable persistent memory, as well as parallel file access and metadata performance that scale with the number of compute nodes. The design stores data in a persistent in-memory key-value store in persistent memory to take advantage of its low latency and high bandwidth. Since persistent memory is byte-addressable, there is no need to use data structures optimized for block devices. Additionally, to reduce metadata access overhead, no dedicated metadata server is used. Historically, storage performance has been improved by decoupling metadata management from file data management [3, 8, 32]; in that case, client processes can access file data in parallel after accessing the metadata. However, the single metadata server approach limits scalability in the number of client processes. To overcome this limitation, researchers have studied distributed metadata servers [10, 20, 24]. To further reduce metadata access overhead, this study proposes a method to access file data directly, without first accessing metadata (see the sketch after the contribution list below). Furthermore, to avoid limiting scalability, sequential execution and centralized data management are excluded from the design. The proposed file system is based on a highly scalable distributed key-value store with consistent hashing. The contributions of this study include:

  • An ad hoc parallel file system for the node-local persistent memory of compute nodes in leading supercomputers, with scalable metadata performance and scalable file access performance.

  • A simple file system design based on a distributed key-value store, which can be flexibly applied to other distributed key-value stores.

  • An open-source implementation that efficiently utilizes multi- and many-core CPUs, high-performance networking, and remote direct memory access (RDMA), for research, experimentation, and production.

  • A flexible backend design that supports both a persistent in-memory key-value store and a POSIX file system for flash and other block devices.
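
To make the idea of accessing file data without a metadata lookup more concrete (as referenced in the introduction above), the following minimal sketch shows how a client could map a file offset to a chunk key that can be hashed directly to a server. The chunk size and the "path:index" key format are illustrative assumptions, not necessarily CHFS's actual scheme.

```c
/* Minimal sketch: mapping a file offset to a chunk key.
 * CHUNK_SIZE and the "path:index" key format are assumptions for
 * illustration only. */
#include <stdio.h>
#include <stdint.h>

#define CHUNK_SIZE (64 * 1024)   /* assumed fixed chunk size */

/* Build the key for the chunk containing byte 'offset' of file 'path',
 * and return the offset within that chunk. */
static void chunk_key(const char *path, uint64_t offset,
                      char *key, size_t keylen, uint64_t *chunk_off)
{
    uint64_t index = offset / CHUNK_SIZE;   /* which chunk */
    *chunk_off = offset % CHUNK_SIZE;       /* position inside the chunk */
    snprintf(key, keylen, "%s:%llu", path, (unsigned long long)index);
}

int main(void)
{
    char key[256];
    uint64_t off_in_chunk;

    /* A write at byte offset 200000 of /data/a.out lands in chunk 3. */
    chunk_key("/data/a.out", 200000, key, sizeof(key), &off_in_chunk);
    printf("key = %s, offset in chunk = %llu\n",
           key, (unsigned long long)off_in_chunk);
    return 0;
}
```

Because the key is derived purely from the path and offset, a client can compute it locally and contact the responsible server directly, with no metadata server in the path.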

2 Related work

        The performance and characteristics of persistent memory have been reported by Yang et al. [39]. To provide a portable programming interface, the Persistent Memory Development Kit (PMDK) was developed [21, 27]. Pmemkv [22] is a PMDK library that implements an in-memory key-value store for persistent memory.
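
As a rough illustration of the pmemkv C API mentioned above, the hedged sketch below opens a database with the "cmap" engine, stores a value, and reads it back. The pool path, engine choice, and key format are assumptions for illustration; configuration keys (for example, a size or creation flag for file-backed pools) vary across pmemkv versions.

```c
#include <stdio.h>
#include <string.h>
#include <libpmemkv.h>

/* Callback invoked by pmemkv_get() with the stored value. */
static void print_value(const char *value, size_t bytes, void *arg)
{
    (void)arg;
    printf("got: %.*s\n", (int)bytes, value);
}

int main(void)
{
    pmemkv_config *cfg = pmemkv_config_new();
    /* "path" may point to a DAX device (devdax) or to a pool file on an
     * fsdax-mounted file system; the devdax path here is an assumption. */
    pmemkv_config_put_string(cfg, "path", "/dev/dax0.0");

    pmemkv_db *db;
    if (pmemkv_open("cmap", cfg, &db) != PMEMKV_STATUS_OK) {
        fprintf(stderr, "pmemkv_open failed: %s\n", pmemkv_errormsg());
        return 1;
    }

    const char *key = "/data/a.out:0";     /* illustrative chunk key */
    const char *val = "chunk contents";
    pmemkv_put(db, key, strlen(key), val, strlen(val));
    pmemkv_get(db, key, strlen(key), print_value, NULL);

    pmemkv_close(db);
    return 0;
}
```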

        DAOS is an advanced distributed object store designed for persistent memory and flash devices [4, 15]. DAOS running on Intel's Wolf system achieved the best I/O performance on the June 2020 IO500 list [11]. It is not an ad hoc distributed file system utilizing node-local persistent memory; instead, it assumes dedicated storage nodes equipped with persistent memory and flash devices.

        Orion [38] is a distributed file system for persistent memory. It leverages RDMA for performance and has only one metadata server, which limits the scalability of metadata performance when the number of client processes increases.

        Octopus [16] is a persistent-memory distributed file system with distributed metadata management. It proposes collect-dispatch transactions to support file system operations that require distributed transactions, such as mkdir, mknod, and rmdir. In contrast, CHFS does not require distributed transactions because each file system operation accesses only a single key-value pair, which is more efficient.

        There are also several ad hoc distributed file systems [5], such as Gfarm/BB [33], BurstFS [37], UnifyFS [18], BeeOND [34], and GekkoFS [36]. Gfarm/BB exploits storage locality by creating files in node-local storage. To manage file locations, it has a single metadata server with a hot backup. BurstFS and UnifyFS also exploit storage locality, writing data in a log-structured format to node-local storage; their metadata is managed by a distributed key-value store. These systems focus primarily on checkpoint write-intensive workloads, while CHFS is optimized for both read and write operations. BeeOND uses distributed metadata servers to improve metadata performance, while CHFS uses no dedicated metadata servers at all to improve metadata performance further. GekkoFS likewise has no dedicated metadata server; it has only storage servers, where metadata is stored in RocksDB [6] and file data is stored in the local file system. RocksDB is not optimized for persistent memory [17]. In GekkoFS, metadata and file data are distributed by hash value modulo the number of servers. GekkoFS is similar to CHFS, but CHFS is designed entirely on top of a distributed key-value store, where all data is stored. CHFS has a relatively simple design and can be flexibly applied to other distributed key-value stores. Thanks to consistent hashing, CHFS can still serve most data when servers join or leave the distributed key-value store.

        Burst buffers were introduced in parallel file systems to absorb bursts of parallel write accesses from high-performance computing applications. A burst buffer is primarily used as a write-back cache, and in production systems burst buffers are deployed on dedicated storage nodes. The Infinite Memory Engine (IME) [7] can be shared among multiple users and processes and used as temporary storage for data analysis while data is in transit. IME relies on a backend parallel file system for metadata management, which limits metadata performance.

        SymphonyFS [19] is proposed as a write-back cache for parallel file systems. It leverages node-local NVMe SSDs as temporary cache storage and uses block-based log-structured cache management to support a single shared file access pattern. It also relies on a backend parallel file system for metadata management, which limits metadata performance. The metadata performance of the BeeGFS-based cache file system [1] is also subject to similar limitations.

        In Lustre, persistent client-side caching is another way to leverage node-local storage. It applies a hierarchical storage management mechanism to cache files in node-local storage. Because it caches complete files, single-shared-file access performance is not improved.

        There are several proposals for distributed hash tables using peer-to-peer technology. Chord [31] is a scalable lookup protocol based on consistent hashing for N distributed nodes, whereas the original consistent hashing [12] requires the full list of N nodes to be known by every node. Chord needs log(N) routing table entries in each node and log(N) steps to find the target node. Chord and other peer-to-peer systems are optimized for scalability in the number of peer nodes, while CHFS is optimized for metadata and parallel data access performance.

3 CHFS design

        Refer to otatebe/chfs: CHFS parallel and distributed file system for node-local persistent memory (github.com).
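
The repository linked above contains the actual design; as a rough, hedged illustration of how consistent hashing can assign chunk keys to servers, the sketch below places each server on a hash ring and routes a key to the first server at or after the key's position, wrapping around if necessary. The FNV-1a hash and the server names are stand-ins, not CHFS's actual implementation.

```c
#include <stdio.h>
#include <stdint.h>

/* FNV-1a: a stand-in hash function for illustration. */
static uint64_t hash64(const char *s)
{
    uint64_t h = 1469598103934665603ULL;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Pick the server for a key: the server whose ring position is the
 * smallest value greater than or equal to the key's position; if none,
 * wrap around to the server with the smallest position. */
static const char *locate(const char *key, const char **servers, int n)
{
    uint64_t kh = hash64(key);
    const char *succ = NULL, *first = NULL;
    uint64_t succ_pos = UINT64_MAX, first_pos = UINT64_MAX;

    for (int i = 0; i < n; i++) {
        uint64_t sh = hash64(servers[i]);
        if (sh >= kh && sh <= succ_pos) { succ = servers[i]; succ_pos = sh; }
        if (sh <= first_pos)            { first = servers[i]; first_pos = sh; }
    }
    return succ ? succ : first;   /* wrap around the ring */
}

int main(void)
{
    const char *servers[] = { "node0", "node1", "node2", "node3" };
    const char *keys[] = { "/data/a.out:0", "/data/a.out:1", "/data/b.txt:0" };

    for (int i = 0; i < 3; i++)
        printf("%-16s -> %s\n", keys[i], locate(keys[i], servers, 4));
    return 0;
}
```

With this placement, adding or removing one server only remaps the keys that fall on that server's arc of the ring, which is why a consistent-hashing store can keep serving most data when servers join or leave.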

4 CHFS implementation

        Refer to otatebe/chfs: CHFS parallel and distributed file system for node-local persistent memory (github.com).
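
Since the implementation uses the Mochi-Margo library for RPC and RDMA, here is a minimal, hedged client-side sketch of the general Margo pattern: register an RPC, look up a server address, and forward a request. The RPC name "chfs_put", its argument types, the "ofi+tcp" transport, and the command-line server address are illustrative assumptions, not CHFS's actual wire protocol.

```c
#include <stdio.h>
#include <margo.h>
#include <mercury_macros.h>
#include <mercury_proc_string.h>

/* Hypothetical request/response types for an illustrative "chfs_put" RPC. */
MERCURY_GEN_PROC(chfs_put_in_t, ((hg_string_t)(key))((hg_string_t)(value)))
MERCURY_GEN_PROC(chfs_put_out_t, ((int32_t)(ret)))

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <server-address>\n", argv[0]);
        return 1;
    }

    /* Client-mode Margo instance; "ofi+tcp" is an assumed transport. */
    margo_instance_id mid = margo_init("ofi+tcp", MARGO_CLIENT_MODE, 0, 0);

    /* Register the RPC by name; the server must register the same name. */
    hg_id_t rpc_id = MARGO_REGISTER(mid, "chfs_put",
                                    chfs_put_in_t, chfs_put_out_t, NULL);

    hg_addr_t svr_addr;
    margo_addr_lookup(mid, argv[1], &svr_addr);

    hg_handle_t h;
    margo_create(mid, svr_addr, rpc_id, &h);

    chfs_put_in_t in;
    in.key = "/data/a.out:0";       /* illustrative chunk key */
    in.value = "chunk contents";
    margo_forward(h, &in);          /* send the request and wait for a reply */

    chfs_put_out_t out;
    margo_get_output(h, &out);
    printf("server returned %d\n", (int)out.ret);
    margo_free_output(h, &out);

    margo_destroy(h);
    margo_addr_free(mid, svr_addr);
    margo_finalize(mid);
    return 0;
}
```

On the server side, a handler registered under the same RPC name would unpack the key and value and issue a put into its local key-value store.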

5 Performance evaluation

This section will evaluate the performance of persistent memory and CHFS as follows:

(1) Performance evaluation of persistent memory,

(2) Performance comparison with DAOS and GekkoFS,

(3) Performance comparison with Gfarm/BB and BeeOND,

(4) Scalability evaluation of CHFS.


6 Conclusion

        This paper proposes the design of CHFS, an ad hoc parallel file system for the node-local persistent memory of compute nodes. It is based entirely on a highly scalable distributed key-value store with consistent hashing. It improves the scalability of parallel data access performance and metadata performance with the number of compute nodes by not using dedicated metadata servers, sequential execution, or centralized data management. It leverages the Mochi-Margo library to efficiently utilize multi- and many-core CPUs, high-performance networking, and remote direct memory access.

        An evaluation of persistent memory performance shows the advantage of pmemkv in devdax mode, which achieved 8.5 GiB/s put bandwidth, while POSIX write and pmemkv in fsdax mode achieved 5.8 GiB/s and 4.2 GiB/s, respectively.

        Performance evaluation on a 4-node persistent memory cluster shows that CHFS outperforms DAOS and GekkoFS by 9.9 times and 6.0 times respectively on the IOR hard write benchmark. The performance on the MDtest hard-write benchmark is 6.0 times and 4.4 times higher than that of DAOS and GekkoFS respectively. CHFS shows good scalability in terms of the number of computing nodes.

        In a performance comparison with Gfarm/BB and BeeOND using 10 Cygnus supercomputer nodes, all file systems showed good performance on the IOR easy benchmark, while only CHFS showed consistently good performance on the IOR hard benchmark. In the metadata benchmarks, CHFS showed 8.6 to 23.5 times better performance than Gfarm/BB and BeeOND except for the find benchmark; on the find benchmark, it outperformed Gfarm/BB and BeeOND by 3.1 to 3.2 times. These results show that CHFS has lower latency and higher throughput for metadata access than Gfarm/BB and BeeOND.

        Regarding scalability, CHFS shows better scalability and performance than BeeOND and GekkoFS in terms of both bandwidth and metadata. With 64 compute nodes, CHFS's IO500 bandwidth score is 17.3 times higher than with one compute node, and its IO500 metadata score is 15.8 times higher. Therefore, CHFS is a promising building block for the HPC storage layer.

        CHFS does not yet take full advantage of consistent hashing, which would be useful for fault tolerance and checkpoint support. Future work will focus on improving fault-tolerance support in CHFS.

Supplementary views

        At present, the project is contributed mainly by one person and is still at an experimental and testing stage, but with some modifications it can be tried as a node-local cache acceleration file system on the compute side. It is currently well suited to accelerating storage of intermediate temporary files generated during application runs.
