DAOS Distributed Asynchronous Object Storage|Architecture Design

        Distributed Asynchronous Object Storage (DAOS) is an open-source object store designed for large-scale distributed non-volatile memory (NVM). It takes advantage of next-generation NVM technologies such as Storage-Class Memory (SCM) and NVM Express (NVMe).

        DAOS is a scale-out object store that provides high-bandwidth, low-latency, and high-IOPS storage containers to high-performance computing applications, and supports next-generation data-centric workflows that combine simulation, data analysis, and machine learning.

        Unlike traditional storage stacks designed primarily for spinning media, DAOS has been built from the ground up for new NVM technologies. It is lightweight because it runs end-to-end in user space and bypasses the operating system entirely.

        DAOS moves away from the traditional I/O model, which was designed around high-latency block storage, to one that natively supports fine-grained data access, thereby unlocking the performance of next-generation storage technologies.

        Unlike traditional burst buffers, DAOS is an independent, high-performance, fault-tolerant storage tier that does not rely on another tier to manage metadata or provide data recovery. DAOS servers keep their metadata in persistent memory, while bulk data goes directly to NVMe SSDs.

1. DAOS Features

        DAOS relies on OFI (OpenFabrics Interfaces) to bypass the operating system and deliver DAOS operations to the DAOS storage servers. It makes full use of any remote direct memory access (RDMA) capability in the fabric for low-latency, high-message-rate user-space communication, and stores data in both persistent memory and NVMe SSDs.

        The key-value store interface of DAOS provides a unified storage model. I/O middleware libraries such as HDF5, MPI-IO and Apache Arrow, once ported to support the DAOS API natively, can take advantage of its rich API and advanced functions.

        DAOS also provides POSIX emulation. POSIX is no longer the foundation for new data models; instead, like other I/O middleware, the POSIX interface is built as a library on top of the DAOS backend API.

        DAOS I/O operations are logged and then inserted into a persistent index maintained in SCM. Each I/O is tagged with a specific timestamp and associated with a specific version of the dataset. Internally, no read-modify-write operation is performed: writes are non-destructive and insensitive to alignment. On a read request, the DAOS server traverses the persistent index and builds an aggregate (scatter-gather) RDMA descriptor, reconstructing the requested version of the data directly in the application-supplied buffer.
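
        The write and read paths described above can be illustrated with a small in-memory sketch. This is not DAOS code: the class `VersionedExtentIndex`, its methods, and the use of the name "epoch" for the timestamp are assumptions made for illustration, and a Python list stands in for the persistent index in SCM. It shows only the idea of appending records instead of overwriting, then rebuilding a requested version on read.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedExtentIndex:
    """Toy in-memory stand-in for the persistent index kept in SCM (hypothetical)."""

    # Each log entry is (epoch, offset, payload); entries are never modified.
    log: list = field(default_factory=list)

    def write(self, epoch: int, offset: int, payload: bytes) -> None:
        # Non-destructive: append a new record instead of read-modify-write.
        self.log.append((epoch, offset, bytes(payload)))

    def read(self, epoch: int, offset: int, length: int) -> bytes:
        # Rebuild the requested version in a fresh buffer for the caller.
        buf = bytearray(length)  # bytes never written read back as zeros
        visible = [rec for rec in self.log if rec[0] <= epoch]
        # Apply records oldest-first so newer epochs overwrite older ones.
        for _, rec_off, payload in sorted(visible, key=lambda rec: rec[0]):
            start = max(offset, rec_off)
            end = min(offset + length, rec_off + len(payload))
            if start < end:
                buf[start - offset:end - offset] = payload[start - rec_off:end - rec_off]
        return bytes(buf)


index = VersionedExtentIndex()
index.write(epoch=1, offset=0, payload=b"AAAAAAAA")
index.write(epoch=2, offset=3, payload=b"bb")  # unaligned 2-byte overwrite
snapshot_v1 = index.read(epoch=1, offset=0, length=8)
snapshot_v2 = index.read(epoch=2, offset=0, length=8)
```

        Because the second write is only appended, the version at epoch 1 stays readable after the overwrite, which is what makes the write path insensitive to alignment in this sketch.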

        SCM is mapped directly into the address space of the DAOS service, which manages the persistent index through direct load/store operations. Depending on the characteristics of each I/O, the DAOS service decides whether to store it in SCM or on NVMe:

  • Latency-sensitive I/O, such as application metadata and byte-granular data, is typically stored in SCM;
  • Checkpoints and bulk data are stored on NVMe.

        This approach allows DAOS to stream data to NVMe while maintaining an internal metadata index in SCM, delivering raw NVMe bandwidth for bulk data. The Persistent Memory Development Kit (PMDK) manages transactional access to SCM, and the Storage Performance Development Kit (SPDK) performs user-space I/O on NVMe devices.
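
        The SCM/NVMe split described above can be pictured as a simple placement rule. The sketch below is only an illustration: the function `choose_media`, the `Media` enum, and the 4 KiB cutoff are hypothetical, since the article does not say how the DAOS service actually decides.

```python
from enum import Enum


class Media(Enum):
    SCM = "scm"
    NVME = "nvme"


# Hypothetical cutoff; the article does not state any size threshold.
SMALL_IO_BYTES = 4096


def choose_media(size_bytes: int, is_metadata: bool) -> Media:
    """Pick a tier for one write, following the two rules listed above."""
    if is_metadata or size_bytes < SMALL_IO_BYTES:
        return Media.SCM   # latency-sensitive: metadata and byte-granular data
    return Media.NVME      # checkpoints and bulk data are streamed to NVMe


small = choose_media(size_bytes=64, is_metadata=False)
meta = choose_media(size_bytes=1 << 20, is_metadata=True)
bulk = choose_media(size_bytes=1 << 20, is_metadata=False)
```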

DAOS can provide:

  • Ultra-fine-grained, low-latency, and true zero-copy I/O
  • Non-blocking data and metadata operations to support I/O and compute overlap
  • Advanced data placement to address fault domains
  • Software-managed redundancy, supporting both replication and erasure coding, with online rebuild
  • End-to-end (E2E) data integrity
  • Scalable distributed transactions, providing reliable data consistency and automatic recovery
  • Dataset snapshot function
  • Security framework for managing access control of storage pools
  • Software-defined storage management to provision, configure, modify and monitor storage pools
  • Native support for I/O middleware libraries such as HDF5, MPI-IO, and POSIX on top of the DAOS data model and API, so that applications do not need to be ported to use the DAOS API directly
  • Apache Spark integration
  • Native producer/consumer workflows using a publish/subscribe API
  • Data indexing and query functions
  • In-storage computing to reduce data movement between storage and compute nodes
  • Disaster recovery tools
  • Seamless integration with the Lustre parallel file system, extensible to other parallel file systems, providing a unified namespace for data access across multiple storage tiers
  • Data mover for migrating datasets between DAOS pools, and between a parallel file system and DAOS in either direction

2. DAOS Components

        A data center may have hundreds of thousands of compute nodes, interconnected by a scalable high-performance fabric, where all nodes or a subset of nodes called storage nodes have direct access to NVM storage.

        A DAOS installation involves several components that can be centralized or distributed.

DAOS system and storage nodes

        A DAOS system is identified by a system name and consists of a group of DAOS storage nodes connected to the same fabric. Each DAOS storage node runs one DAOS service instance, which starts one DAOS I/O engine process per physical socket. Membership of these DAOS services is recorded in the system map, which assigns each I/O engine process a unique integer rank. Two different DAOS systems consist of two disjoint sets of DAOS servers and do not coordinate with each other.
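
        As a rough mental model of the system map, the sketch below registers one I/O engine per socket on each storage node and gives each a unique integer rank. The `SystemMap` class, the node names, and the sequential rank assignment are all assumptions for illustration; the article only states that ranks are unique integers.

```python
from dataclasses import dataclass, field


@dataclass
class SystemMap:
    """Toy system map: one entry per I/O engine, keyed by integer rank."""

    name: str
    engines: dict = field(default_factory=dict)  # rank -> (node, socket)

    def join(self, node: str, socket: int) -> int:
        # Hand out the next unused integer so every engine gets a unique rank.
        rank = len(self.engines)
        self.engines[rank] = (node, socket)
        return rank


system = SystemMap(name="example_system")  # hypothetical system name
for node in ("node-a", "node-b"):          # hypothetical storage nodes
    for socket in (0, 1):                  # one I/O engine per physical socket
        system.join(node, socket)
```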

DAOS service

        The DAOS service is a multi-tenant daemon running on each storage node's Linux instance (physical node, virtual machine or container). The service's I/O engine subprocesses export locally attached SCM and NVM storage over the network. The service listens on a management port (addressed by IP address and TCP port number) and on one or more fabric endpoints (addressed by network URI).

        DAOS services are configured through YAML files in /etc/daos, including the configuration of their I/O engine subprocesses. Service startup can be integrated with different daemon management or orchestration frameworks (systemd scripts, Kubernetes services, or parallel launchers like pdsh and srun).

I/O handling

        Within a DAOS I/O engine, storage is statically partitioned across multiple targets to improve concurrency. To avoid contention, each target has its own private storage, its own pool of service threads, and a dedicated network context that is directly addressable through the fabric, independently of the other targets hosted on the same storage node.

        SCM modules are usually configured in AppDirect interleaved mode. They are therefore presented to the operating system as a single PMEM namespace per socket (in fsdax mode). When N targets are configured per I/O engine, each target uses 1/N of the fsdax SCM capacity of its socket, independently of the other targets. Each target also uses a fraction of the capacity of the NVMe drives attached to this socket.
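
        The 1/N split is simple arithmetic, sketched below. The function name and the 768 GiB socket with 8 targets are hypothetical values chosen for illustration only.

```python
def scm_bytes_per_target(socket_scm_bytes: int, targets_per_engine: int) -> int:
    """Each of the N targets gets 1/N of the socket's fsdax SCM capacity."""
    if targets_per_engine < 1:
        raise ValueError("an I/O engine needs at least one target")
    return socket_scm_bytes // targets_per_engine


GIB = 1024 ** 3
# Hypothetical socket: 768 GiB of interleaved SCM shared by 8 targets.
per_target = scm_bytes_per_target(768 * GIB, targets_per_engine=8)
```

        With these example numbers each target ends up with 96 GiB of SCM.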

Target

        A target does not implement any internal data protection mechanism against storage media failure. A target is therefore a single point of failure and also the unit of fault. Each target has an associated dynamic state, which is either "up and running" or "down and not available".

        A target is also the unit of performance. The hardware components associated with a target, such as back-end storage media, CPU cores, and the network, have limited capability and capacity.

        The number of targets exported by a DAOS I/O engine instance is configurable and depends on the underlying hardware (the number of SCM modules and NVMe SSDs served by that I/O engine instance). The best practice is to make the number of targets of an I/O engine an integer multiple of the number of NVMe drives it serves.
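
        The sizing rule above can be checked mechanically. The helpers below are a hypothetical sketch, not part of any DAOS tool, and the 4-drive, 16-target example is made up.

```python
def is_recommended_target_count(targets: int, nvme_drives: int) -> bool:
    """True when the target count is a positive integer multiple of the drive count."""
    if targets < 1 or nvme_drives < 1:
        return False
    return targets % nvme_drives == 0


def candidate_target_counts(nvme_drives: int, max_targets: int) -> list:
    """List every target count up to max_targets that satisfies the rule."""
    return [n for n in range(1, max_targets + 1)
            if is_recommended_target_count(n, nvme_drives)]


# Hypothetical engine with 4 NVMe drives and at most 16 targets.
options = candidate_target_counts(nvme_drives=4, max_targets=16)
```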

Storage API, Application Interface and Tools

        Applications, users and administrators can interact with the DAOS system through two different client APIs.

        The management API provides an interface for administering the DAOS system. It is designed to integrate with vendor-specific storage management or open-source orchestration frameworks. The dmg command-line tool is built on top of the DAOS management API.

        The DAOS library (libdaos) implements the DAOS storage model and is aimed mainly at developers of applications and I/O middleware who want to store datasets in a DAOS system. The user-facing daos command is also built on top of this API, allowing users to manage datasets from the command line.

        Applications can access datasets stored in DAOS either directly through the native DAOS API, through I/O middleware libraries such as POSIX emulation, MPI-IO and HDF5, or through frameworks such as Spark or TensorFlow that have been integrated with the native DAOS storage model.

Agent

        The DAOS agent is a daemon that resides on the client nodes and interacts with the DAOS library to authenticate application processes. It is a trusted entity that can sign DAOS client credentials using certificates. The DAOS agent supports different authentication frameworks and communicates with the client library over a Unix domain socket.

GitHub: https://github.com/storagezhang

DAOS:  https://github.com/daos-stack/daos

This article is translated from https://daos-stack.github.io/overview/architecture

Origin blog.csdn.net/iamonlyme/article/details/132305285