A look at DAOS high-performance distributed storage

Introduction

  • The IO patterns of most applications today involve a growing proportion of metadata and unaligned, fragmented data. At the same time, the alignment constraints and high latency introduced by traditional storage software make performance for this kind of IO steadily worse. The combination of large-capacity storage class memory (SCM) and high-speed fabrics offers the best opportunity to redefine the storage stack and efficiently support today's IO-intensive applications.

  • With SCM, the design of the complete storage stack needs to be reconsidered. To unlock the performance of this new hardware, the new software stack adopts a shared-nothing design with byte-granular access, and it can scale to large distributed deployments. DAOS is a brand-new IO architecture based on SCM and NVMe fabric. It provides distributed storage services with consistency, availability, and resilience, without sacrificing performance, through a global object address space.

Legacy Parallel File System Limitations

  • Conventional parallel file systems are built on block devices and submit IO through the kernel's block interface. They optimize for disks through the IO scheduler, write merging, and similar techniques that suit the seek characteristics of disk drives, and then send large streaming workloads to the disk for higher bandwidth. But as new hardware such as 3D-XPoint emerges, with latency several orders of magnitude lower than traditional storage, a software stack designed for mechanical disks becomes a large overhead on these new devices.

  • Most parallel file systems provide RDMA capabilities, such as transferring data directly from the client's page cache to the server's buffer cache and then persisting it to block storage on the server. Because block device IO and network events lack a unified polling model, IO processing relies heavily on interrupts and multi-threaded concurrent RPC processing, so context switches during IO processing prevent it from exploiting the low latency of the network. The software stack of a traditional parallel file system, with its caches and distributed locks, can still run on 3D NAND/3D-XPoint storage devices and will achieve higher performance than on disk, but the overheads described above consume much of the benefit.

DAOS software architecture

  • DAOS (Distributed Asynchronous Object Storage) is an open-source object store built for non-volatile memory (NVM). DAOS provides key-value storage interfaces and features such as non-blocking I/O, multi-version data, and snapshots.

  • The DAOS storage system makes full use of next-generation NVM technologies such as SCM (Storage Class Memory) and NVMe (NVM Express). Using kernel-bypass technologies, it runs end to end in user mode and does not require any system calls during IO operations.

[Figure: DAOS software architecture]

  • As shown in the figure above, the DAOS core is divided into three parts: SCM and PMDK, NVMe and SPDK, and libfabric. In the first part, SCM and PMDK, DAOS uses SCM to store all metadata, application key indexes, and latency-sensitive small IO. DAOS makes system calls to initialize persistent memory at startup; for example, with a DAX-enabled file system, it maps the persistent memory file into the virtual address space. Once the system is up and running, DAOS can access the persistent memory device from user mode with ordinary memory instructions. Persistent memory devices are very fast but have low capacity and high cost, so they are well suited to storing metadata. For bulk data, DAOS uses NVMe devices, accessed through SPDK for kernel bypass: IO is submitted asynchronously to SPDK user-mode queues, and after the SPDK IO completes, DAOS creates an index for the data in persistent memory. libfabric is the last part of DAOS and is mainly responsible for high-performance networking, such as support for Omni-Path/IB network fabrics. It is a library defined in user mode that exports fabric communication services (the fabric API) to the applications that use it. It provides message-based asynchronous functions, including data transfer and network polling.

  • DAOS is distributed storage that runs in user mode, built on new hardware and network technology with kernel bypass. It currently supports SCM and NVMe and does not support mechanical disks.

  • DAOS is based on a client/server (C/S) model. The DAOS client is a library that can be integrated into the application, and it runs in the same address space as the application. The DAOS server is a fault-tolerant daemon process that accesses SCM and NVMe directly: it stores all metadata and small IO in SCM, and large IO in NVMe. The DAOS server does not rely on pthreads to handle concurrent IO requests; instead it uses User Level Threads (ULTs).
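
To make the ULT idea above concrete, here is a minimal sketch of cooperative user-level scheduling, using Python generators as a stand-in for ULTs. It is not DAOS code: `io_request` and `run` are hypothetical names, and the scheduler is a plain round-robin loop chosen only for illustration.

```python
from collections import deque

def io_request(name, steps, log):
    """A hypothetical IO handler written as a user-level thread."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # cooperative yield point, e.g. while waiting on a device

def run(ults):
    """Round-robin scheduler: no kernel threads, no preemption."""
    ready = deque(ults)
    while ready:
        ult = ready.popleft()
        try:
            next(ult)          # run until the ULT yields
            ready.append(ult)  # still has work: requeue it
        except StopIteration:
            pass               # ULT finished

log = []
run([io_request("A", 2, log), io_request("B", 3, log)])
print(log)
```

Each handler gives up the CPU only at its own `yield`, so switching between requests is a function call rather than a kernel context switch, which is the property the text attributes to ULTs.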

DAOS data storage strategy

  • DAOS exports stored objects to users through key-value or key-array APIs. To avoid scalability problems and the overhead of maintaining metadata (such as an object layout describing where the object's data lives), a DAOS object is identified by a unique 128-bit ID, and the same 128-bit ID also encodes the object's data distribution and data protection policy (replication or EC), among other information. DAOS computes the object's layout with a pseudo-random placement algorithm driven by the storage pool configuration. This is similar to Ceph's CRUSH algorithm.
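
The idea of an ID that carries its own protection policy, plus a layout that is computed rather than stored, can be sketched as follows. The bit layout here is an assumption made for illustration and is not the real DAOS encoding; rendezvous hashing is likewise swapped in as a simple placement algorithm, not the one DAOS uses.

```python
import hashlib

# Hypothetical 128-bit object ID layout (NOT the real DAOS encoding):
# top 8 bits = protection class, next 8 bits = shard count,
# low 112 bits = unique object number.
CLASS_REPLICATION = 1
CLASS_EC = 2

def make_oid(unique: int, obj_class: int, shards: int) -> int:
    assert 0 <= unique < (1 << 112)
    assert 0 <= obj_class < 256 and 0 < shards < 256
    return (obj_class << 120) | (shards << 112) | unique

def decode_oid(oid: int) -> tuple[int, int, int]:
    """Return (protection class, shard count, unique number)."""
    return (oid >> 120) & 0xFF, (oid >> 112) & 0xFF, oid & ((1 << 112) - 1)

def layout(oid: int, targets: list[str]) -> list[str]:
    """Rendezvous hashing: rank targets by hash(oid, target), keep the first `shards`."""
    _, shards, _ = decode_oid(oid)

    def score(target: str) -> bytes:
        return hashlib.sha256(oid.to_bytes(16, "big") + target.encode()).digest()

    return sorted(targets, key=score)[:shards]

pool = [f"target-{i}" for i in range(8)]
oid = make_oid(unique=42, obj_class=CLASS_REPLICATION, shards=3)
print(decode_oid(oid))
print(layout(oid, pool))
```

Any client or server holding the same pool configuration computes the same layout from the ID alone, so no per-object layout metadata has to be stored or looked up.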

  • In a DAOS server, the SCM used for metadata storage is attached directly to the memory bus, and the NVMe used for data storage is attached over PCIe. DAOS accesses memory-mapped SCM with memory load/store instructions and accesses NVMe from user mode through the SPDK API. If SCM or NVMe hardware fails, data or metadata is lost. To guard against data loss, DAOS provides replication and erasure coding to protect and recover data. When data protection is enabled, a DAOS object is either replicated or chunked into multiple data shards and parity shards, which are stored on different storage nodes. If a hardware or node failure occurs, the DAOS object remains accessible in degraded mode, and data recovery rebuilds from the other replicas or from the parity data.

  • Replication provides relatively high data redundancy. DAOS uses a primary-slave protocol for writes: the primary replica accepts the write request and then forwards it to the slave replicas to process the distributed transaction. The DAOS primary-slave model differs from the traditional replica model: the primary replica only forwards the RPC to the slave servers, and all replica nodes pull the data directly from the client's buffer via RDMA. DAOS uses a variant of the two-phase commit protocol. If one replica cannot apply the change, all replicas are notified to abandon it. If a server fails while handling a replicated write, DAOS excludes that node from the transaction, algorithmically selects a different healthy node as a replacement, and hands the transaction state over to that node. If the failed node later returns to normal, it catches up on the transaction state through the data recovery protocol and ignores its local transaction state. When health checking detects a node failure, it reports to the multi-node, Raft-based service. The service in each DAOS server then scans the object IDs, calculates the layout of each object, and finds all affected objects; it sends the affected object IDs to algorithmically selected fallback servers. A fallback node rebuilds the affected data by pulling it from the other replicas.
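
The all-or-nothing part of the write path described above can be sketched as a small in-memory simulation. This is a simplified illustration under stated assumptions, not the DAOS protocol: `Replica` and `replicated_write` are hypothetical names, copying the buffer stands in for an RDMA read, and replacement-node selection is left out.

```python
class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.staged = {}     # txn id -> data pulled from the client buffer
        self.committed = {}  # txn id -> data made visible

    def prepare(self, txn, client_buffer):
        # Phase 1: pull the data straight from the client's buffer
        # (a plain copy here, standing in for an RDMA read).
        if not self.healthy:
            return False
        self.staged[txn] = bytes(client_buffer)
        return True

    def commit(self, txn):
        self.committed[txn] = self.staged.pop(txn)

    def abort(self, txn):
        self.staged.pop(txn, None)


def replicated_write(txn, client_buffer, primary, slaves):
    """The primary forwards only the request; every replica pulls the data itself."""
    group = [primary] + slaves
    votes = [r.prepare(txn, client_buffer) for r in group]
    if all(votes):
        for r in group:
            r.commit(txn)   # Phase 2: everyone applies the change
        return True
    for r in group:
        r.abort(txn)        # one replica failed: everyone abandons it
    return False


a, b, c = Replica("a"), Replica("b"), Replica("c")
first = replicated_write(1, b"hello", a, [b, c])
c.healthy = False
second = replicated_write(2, b"world", a, [b, c])
print(first, second, a.committed)
```

In this sketch the second write is simply abandoned. As the text notes, DAOS goes further and moves the transaction to a replacement node instead of waiting for the failed one.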

  • Erasure coding provides a more space-efficient data protection strategy. Because the DAOS client is a lightweight library linked into the application process, EC encoding is performed on the client, so the node running the client process consumes more CPU. The DAOS client computes the parity, creates RDMA descriptors for the data fragments and parity fragments, and then sends an RPC request to the leader server of the parity group to coordinate the write. The write then proceeds much like a replicated write: the nodes participating in the EC write pull the data directly from the client's buffer, and DAOS EC also uses the two-phase commit protocol to guarantee atomic writes across nodes. When the written data is not aligned with stripe_size, most storage systems go through a read/encode/write cycle to keep data fragments and parity consistent. This is expensive (it causes amplification) and requires a distributed lock to guarantee read/write consistency. To avoid this overhead, DAOS relies on its multi-version data model and replicates the partially written data to the parity server, so the parity server can easily compute the parity from the replicated data. When a node fails during a read, DAOS provides degraded reads: the DAOS client first fetches the full stripe and reconstructs the lost data. The two-phase commit transaction is handed to a healthy server node, and the lost data is then rebuilt there.

  • DAOS handles three types of failure. The first is a service crash, which DAOS detects with SWIM, a gossip-like protocol. The second is NVMe failure, which DAOS detects by polling device state through SPDK. The third is storage media corruption, which DAOS detects by storing and verifying checksums. When the server receives a write request, it either verifies the checksum or simply stores the checksum together with the data. Server-side verification can be enabled or disabled according to performance requirements. When the application later reads the data, if the read is aligned with the original write, the server returns the data and checksum directly; otherwise, the DAOS server verifies the checksums of the data blocks involved in the read, computes a checksum for the data being read, and returns the data and that checksum to the client. If the DAOS client detects a checksum error during a read, it switches to a degraded read, either reading from another replica or reconstructing the data on the client (EC mode). The client also reports the checksum error to the server. The server collects all checksum errors, both those found by its own verification and scrubbing and those reported by clients.
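
Client-side encoding and a degraded read can be illustrated with the simplest possible code, a single XOR parity fragment. This is an assumption made to keep the sketch short: the source does not say which code DAOS uses, and production erasure coding typically uses Reed-Solomon codes that tolerate more than one loss. All function names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode_stripe(data: bytes, k: int) -> tuple[list[bytes], bytes]:
    """Split one full stripe into k data fragments plus one XOR parity fragment."""
    assert len(data) % k == 0, "sketch handles full, aligned stripes only"
    cell = len(data) // k
    frags = [data[i * cell:(i + 1) * cell] for i in range(k)]
    parity = frags[0]
    for frag in frags[1:]:
        parity = xor_bytes(parity, frag)
    return frags, parity

def reconstruct(frags: list[bytes], parity: bytes, lost: int) -> bytes:
    """Degraded read: rebuild fragment `lost` from the survivors and the parity."""
    out = parity
    for i, frag in enumerate(frags):
        if i != lost:
            out = xor_bytes(out, frag)
    return out

frags, parity = encode_stripe(b"ABCDEFGHIJKL", k=3)
recovered = reconstruct(frags, parity, lost=1)
print(recovered)
```

The degraded read has to touch every surviving fragment of the stripe, which is why the text describes it as fetching the full stripe before reconstructing the lost data.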
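
The read-side behaviour (verify, fall back to another replica, report the error) can be sketched as below. CRC32 from Python's `zlib` is used purely for illustration; the source does not say which checksum algorithm DAOS uses, and `store` and `read_verified` are hypothetical helpers.

```python
import zlib

def store(data: bytes) -> dict:
    """Keep the checksum next to the data, as on a write."""
    return {"data": bytearray(data), "csum": zlib.crc32(data)}

def read_verified(replicas: list[dict]) -> tuple[bytes, list[int]]:
    """Return the first replica whose checksum matches, plus the bad replica indexes."""
    bad = []
    for i, rep in enumerate(replicas):
        data = bytes(rep["data"])
        if zlib.crc32(data) == rep["csum"]:
            return data, bad
        bad.append(i)  # would be reported back to the server
    raise IOError("all replicas failed checksum verification")

replicas = [store(b"payload") for _ in range(3)]
replicas[0]["data"][0] ^= 0xFF  # simulate silent media corruption
data, reported = read_verified(replicas)
print(data, reported)
```

The list of bad replica indexes plays the role of the client's error report, which the server would combine with what its own verification and scrubbing find.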

DAOS data model

  • The DAOS data model contains two kinds of objects. One is array objects, which let an application represent a multi-dimensional array. The other is key/value objects, which provide a KV interface and a multi-level KV interface. In either form, data objects are versioned, allowing applications to easily roll back to previous versions of data. Each object belongs to a domain, the DAOS container. Each container has a private object address space, and its transaction processing is independent of the other containers in the pool.
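
A versioned key/value object inside a container can be sketched in a few lines. This is a toy model under stated assumptions, not the libdaos API: `Container`, `put`, `get`, and the integer `epoch` counter are hypothetical names used only to show how keeping every version makes reading an older state straightforward.

```python
class Container:
    """Toy container: a private key space with append-only versions."""

    def __init__(self):
        self.epoch = 0
        self.objects = {}  # key -> list of (epoch, value), oldest first

    def put(self, key, value):
        self.epoch += 1
        self.objects.setdefault(key, []).append((self.epoch, value))
        return self.epoch

    def get(self, key, epoch=None):
        """Return the newest value at or before `epoch` (latest if None)."""
        for e, v in reversed(self.objects.get(key, [])):
            if epoch is None or e <= epoch:
                return v
        return None

c1, c2 = Container(), Container()
e1 = c1.put("temp", 20)
e2 = c1.put("temp", 25)
print(c1.get("temp"), c1.get("temp", epoch=e1), c2.get("temp"))
```

Writes never overwrite older versions, so reading at an earlier epoch acts as a rollback, and the second container is unaffected by anything written to the first.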

  • DAOS supports access with POSIX semantics. POSIX is not a feature of the DAOS storage model itself, but a library built on top of the DAOS back-end API. A POSIX file system namespace lives inside a DAOS container. POSIX access is provided through a FUSE driver that uses the DAOS engine API (libdaos) and the DAOS file system API (libdfs) to access data.

Origin blog.csdn.net/iamonlyme/article/details/132306008