Introduction
-
The IO patterns of most applications today involve a growing proportion of metadata and small, unaligned data fragments. At the same time, the alignment constraints and high latency introduced by traditional storage software make performance worse and worse for these IO patterns. The combination of large-capacity persistent memory (SCM) and high-speed fabric hardware provides the best opportunity to redefine the storage stack and efficiently support today's IO-intensive applications.
-
With SCM, the design of the complete storage stack needs to be reconsidered. To unlock the performance of this new hardware, the new software stack adopts a shared-nothing design with a byte-granular interface, and it can scale to large distributed storage. DAOS is a brand-new IO architecture based on SCM, NVMe and a high-speed fabric. It provides distributed storage services with consistency, availability and resilience, without sacrificing performance, through a global object address space.
Legacy Parallel File System Limitations
-
Conventional parallel file systems are built on block devices and submit IO through the kernel block interface. They optimize disk access through the IO scheduler, write merging and similar techniques that suit the characteristics of disks (such as seek cost), and then send large streaming workloads to the disk drive for higher bandwidth. But as new hardware such as 3D-XPoint emerges, offering latency several orders of magnitude lower than traditional storage, the software stack designed for mechanical disks becomes a large overhead for these new devices.
-
Most parallel file systems provide RDMA capabilities, such as transferring data directly from the client into the page cache or buffer cache of the server, and then persisting it to block storage on the server. Because there is no unified polling model for block device IO and network events, IO processing relies heavily on interrupts and multi-threaded concurrency, so the context switches during IO processing prevent it from taking advantage of the low latency of the network. The software stack of a traditional parallel file system (RPC, caches, distributed locks) can still be used on 3D NAND/3D-XPoint storage devices and can achieve higher performance there.
DAOS software architecture
-
DAOS (Distributed Asynchronous Object Storage) is an open source, purpose-built object store based on non-volatile memory (NVM). DAOS provides key-value storage interfaces and features such as non-blocking I/O, data multi-versioning, and snapshots.
-
The DAOS storage system makes full use of next-generation NVM technologies, such as SCM (Storage Class Memory) and NVMe (NVM Express). Using kernel bypass technologies, it runs end-to-end in user mode and does not require any system calls during IO operations.
-
As shown in the figure above, the DAOS core is divided into three parts: SCM and PMDK, NVMe and SPDK, and libfabric. SCM and PMDK form the first part: DAOS uses SCM to store all metadata, application key indexes and latency-sensitive small IO. DAOS makes system calls to initialize the persistent memory at startup; for example, with DAX enabled on the file system, it maps the persistent memory file into its virtual address space. Once the system is up and running, DAOS can access the persistent memory device through memory instructions in user mode. Persistent memory devices are very fast, but have low capacity and high cost, so they are well suited to storing metadata. For the bulk data in distributed storage, DAOS uses NVMe devices through SPDK to achieve kernel bypass: IO is submitted asynchronously to SPDK user-mode queues, and after the SPDK IO completes, an index for the data is created in persistent memory. libfabric is the last part of DAOS. It is mainly responsible for high-performance networking, such as supporting Omni-Path/IB network architectures. libfabric is a user-mode library that exports fabric communication services to the applications that use it. It provides a message-based asynchronous API, including data transmission and network polling.
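The memory-mapping step described above can be illustrated with Python's standard `mmap` module. This is only a sketch on an ordinary temporary file: the file name and size are made up, and a real persistent-memory application would normally map a file on a DAX-enabled file system and use PMDK rather than this code. It shows the general pattern of paying for system calls once at setup and then reading and writing through plain memory accesses.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # mapping length; any multiple of the page size works


def open_mapped(path, size=PAGE):
    """Create (or open) a file of `size` bytes and map it into memory."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, size)      # one-time setup still uses system calls
        return mmap.mmap(fd, size)  # later accesses are plain memory reads/writes
    finally:
        os.close(fd)                # the mapping stays valid after the fd is closed


path = os.path.join(tempfile.mkdtemp(), "pmem.img")  # hypothetical file name

m = open_mapped(path)
m[0:5] = b"hello"   # store through the mapping, no write() call involved
m.flush()           # on real persistent memory, cache-flush instructions play this role
m.close()

m2 = open_mapped(path)   # map the same file again, as after a restart
restored = bytes(m2[0:5])
m2.close()
```

After the second mapping, `restored` holds the bytes written through the first one.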
-
DAOS is a distributed storage system built on new hardware and network technology that uses kernel bypass to run in user mode. It currently supports SCM and NVMe and does not support mechanical disks.
-
DAOS is based on a C/S (client/server) model. The DAOS client is a library that can be integrated into the application and runs in the same address space as the application. The DAOS server is a fault-tolerant daemon process that accesses SCM and NVMe directly: it stores all metadata and small IO in SCM, and large IO in NVMe. The DAOS server does not rely on pthreads to handle concurrent IO requests, but uses User Level Threads (ULT) to handle them.
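The idea behind user-level threads can be sketched with Python generators: many lightweight tasks share one OS thread and hand control back to a scheduler at explicit points, so no kernel context switch is needed between them. This is a toy illustration, not the threading library the DAOS server uses; `handle_io` and `run` are hypothetical names.

```python
from collections import deque


def handle_io(name, steps, log):
    """A user-level thread: does one step of work, then yields the CPU."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # cooperative switch point, e.g. while waiting on a device


def run(ults):
    """Round-robin scheduler that runs every ULT on a single OS thread."""
    ready = deque(ults)
    while ready:
        ult = ready.popleft()
        try:
            next(ult)          # resume the ULT until its next yield
            ready.append(ult)  # it still has work: back of the queue
        except StopIteration:
            pass               # the ULT finished


log = []
run([handle_io("read", 2, log), handle_io("write", 3, log)])
# log is now ["read:0", "write:0", "read:1", "write:1", "write:2"]
```

The two requests interleave step by step even though only one OS thread is running, which is the property that lets a server overlap many IO requests without relying on pthreads.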
DAOS data storage strategy
-
DAOS exports its stored objects to users through key-value or key-array APIs. To avoid scalability problems and the overhead of maintaining per-object metadata (such as an object layout describing where the object's data is located), a DAOS object is identified by a unique 128-bit ID, and this 128-bit ID also encodes the data distribution, the data protection policy (replication or EC) and other information. Based on the configuration of the storage pool, DAOS generates the object layout algorithmically from pseudo-random numbers. This approach is similar to the CRUSH algorithm in Ceph.
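A minimal sketch of algorithmic placement follows. It is not the DAOS placement algorithm or its real object ID format: the bit layout (replica count in the top 8 bits) is a hypothetical choice for illustration, and rendezvous (highest-random-weight) hashing is used as a stand-in technique. What it shows is the property the paragraph relies on: any client or server can recompute the layout from the object ID and the pool configuration alone, so no per-object layout metadata has to be stored.

```python
import hashlib

REPLICAS_BITS = 8  # hypothetical: top 8 bits of the 128-bit ID hold the replica count


def make_oid(replicas, unique):
    """Pack a replica count and a 120-bit unique value into one 128-bit ID."""
    assert 0 < replicas < 2 ** REPLICAS_BITS and 0 <= unique < 2 ** 120
    return (replicas << 120) | unique


def replicas_of(oid):
    """Decode the protection policy (here just a replica count) from the ID."""
    return oid >> 120


def layout(oid, targets):
    """Pick targets for an object by rendezvous (highest-random-weight) hashing."""
    def weight(target):
        digest = hashlib.sha256(f"{oid:032x}/{target}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    ranked = sorted(targets, key=weight, reverse=True)
    return ranked[:replicas_of(oid)]


pool = [f"target-{i}" for i in range(8)]  # made-up pool configuration
oid = make_oid(replicas=3, unique=42)
placement = layout(oid, pool)
```

Calling `layout(oid, pool)` again anywhere yields the same three targets, and removing a target that is not in the placement leaves the placement unchanged.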
-
The DAOS server uses SCM, attached directly to the memory bus, for metadata storage, and NVMe, attached over PCIe, for data storage. It uses memory load/store instructions to access the memory-mapped SCM, and uses the SPDK API to access NVMe from user mode. If SCM or NVMe suffers a hardware failure, data or metadata will be lost. To guard against data loss, DAOS provides replication and erasure coding to protect and restore data. When data protection is enabled, a DAOS object is either replicated or chunked into multiple data shards and parity shards, which are then stored on different storage nodes. If a hardware failure or node failure occurs, the DAOS object remains accessible in degraded mode, and data recovery restores the lost data from other replicas or from parity data.
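The sharding and degraded-mode access described above can be sketched with the simplest possible erasure code, a single XOR parity shard. This is an assumption made for illustration: the text does not say which code DAOS uses, and production systems typically use codes that tolerate more than one failure. The function names are hypothetical.

```python
def split(data, k):
    """Split data into k equal-size shards, zero-padding the last one."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(k)]


def xor(shards):
    """Byte-wise XOR of equal-length shards."""
    out = bytearray(len(shards[0]))
    for shard in shards:
        for i, b in enumerate(shard):
            out[i] ^= b
    return bytes(out)


def encode(data, k):
    """Return k data shards followed by one parity shard."""
    shards = split(data, k)
    return shards + [xor(shards)]


def degraded_read(shards, length):
    """Rebuild at most one missing shard (None) and return the original data."""
    missing = [i for i, s in enumerate(shards) if s is None]
    assert len(missing) <= 1, "single parity tolerates one failure"
    if missing:
        survivors = [s for s in shards if s is not None]
        shards = list(shards)
        shards[missing[0]] = xor(survivors)  # XOR of the rest restores the lost shard
    return b"".join(shards[:-1])[:length]


data = b"distributed object"
stored = encode(data, k=3)   # each entry would live on a different storage node
stored[1] = None             # simulate a failed storage node
recovered = degraded_read(stored, len(data))
```

With one shard missing, `recovered` still equals the original data, at the cost of reading all surviving shards.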
-
Replication provides relatively high data redundancy. DAOS adopts a primary-slave protocol for write operations: the primary replica is responsible for accepting write requests, and then forwards the requests to the slave replicas to process the distributed transaction. The DAOS primary-slave model differs from the traditional replica model in that the primary replica only forwards the RPC to the slave servers; all replica nodes fetch the request data directly from the client buffer through RDMA. DAOS uses a variant of the two-phase commit protocol: if one replica cannot apply the change, all replicas are notified to abandon the update. If a server fails while processing a replicated write, that node is excluded from the transaction, a different healthy node is selected by the algorithm as a replacement, and the previous transaction status is assigned to that node. If the failed node returns to normal at this point, it catches up on the transaction status through the data recovery protocol and ignores its local transaction status. When health checking detects a node failure, the failure is reported to the multi-node, Raft-based service. The Raft service in the DAOS server scans the object IDs, calculates the layout of each object, and finds all affected objects; the IDs of the affected objects are then sent to the fallback server selected by the algorithm. The fallback node rebuilds the affected data by pulling it from the other replicas.
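The write path above can be sketched as a plain two-phase commit in which the primary only coordinates and every replica reads the client buffer itself. This is a simplified model and not the DAOS variant of the protocol: the `Replica` class and `replicated_write` function are hypothetical, the RDMA read is replaced by a buffer copy, and the replacement-node step is not modeled.

```python
class Replica:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.pending, self.committed = {}, {}

    def prepare(self, txid, key, client_buffer):
        """Phase 1: pull the data from the client buffer and stage it."""
        if not self.healthy:
            return False
        self.pending[txid] = (key, bytes(client_buffer))  # stands in for an RDMA read
        return True

    def commit(self, txid):
        """Phase 2: make the staged change visible."""
        key, value = self.pending.pop(txid)
        self.committed[key] = value

    def abort(self, txid):
        self.pending.pop(txid, None)


def replicated_write(primary, slaves, txid, key, client_buffer):
    """The primary coordinates; every replica reads the client buffer itself."""
    group = [primary] + slaves
    votes = [r.prepare(txid, key, client_buffer) for r in group]
    if all(votes):
        for r in group:
            r.commit(txid)
        return True
    for r in group:
        r.abort(txid)  # one replica could not apply the change: all drop it
    return False


a, b, c = Replica("a"), Replica("b"), Replica("c")
ok = replicated_write(a, [b, c], txid=1, key="k", client_buffer=b"v1")
c.healthy = False
failed = replicated_write(a, [b, c], txid=2, key="k", client_buffer=b"v2")
```

The first write commits on all three replicas; the second is abandoned everywhere because one replica could not stage it, so every replica still holds `v1`.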
-
Erasure Coding provides a more space-efficient data protection strategy. The DAOS client is a lightweight library integrated into the application process, so EC encoding is performed on the client, and the node where the client process runs consumes more CPU resources. The DAOS client calculates the parity of the data, creates RDMA descriptors for the data fragments and parity blocks, and then sends an RPC request to the leader server of the parity group to coordinate the write. This write is similar to a replicated write: the nodes participating in the EC write fetch their data directly from the client buffer. DAOS EC also adopts the two-phase commit protocol to ensure that data is written atomically across nodes. When the written data does not match stripe_size, most storage systems go through a read/encode/write cycle to keep the data fragments and parity consistent. This operation is very expensive (because of amplification), and it requires a distributed lock to ensure read/write consistency. To avoid this overhead, DAOS relies on its multi-version data model and replicates the partially written data to the parity server, so the parity server can easily compute the parity from the replicated data. When a node fails during a read, DAOS provides degraded reads. The DAOS client first fetches the stripe data needed to reconstruct the lost data, uses the two-phase commit protocol to pass the transaction to the healthy server nodes, and then reconstructs the lost data.
-
DAOS handles three types of failure. The first is a service crash, which DAOS detects with SWIM, a gossip-like protocol. The second is NVMe failure, which DAOS detects by polling the device state through SPDK. The third is storage media corruption, which DAOS detects by storing and verifying checksums. When the server receives a write request, it either verifies the checksum or simply stores the checksum together with the data; server-side verification can be enabled or disabled according to performance requirements. When the application later reads the data, if the read is aligned with the earlier write, the server returns the data and checksum directly. Otherwise, the DAOS server verifies the checksums of the data chunks involved in the read, computes a checksum for the data being read, and returns the data and the checksum to the client. If the DAOS client detects a checksum error during a read, it falls back to a degraded read: it switches to another replica, or reconstructs the data on the client (EC mode). The client also reports the checksum error to the server. The server collects all checksum errors found by checksum verification and scrubbing, together with those reported by clients.
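The read-side checksum handling can be sketched as follows. CRC32 from Python's `zlib` is an assumption chosen for brevity (the text does not name the checksum algorithm), and `store` and `verified_read` are hypothetical helpers: the client verifies each replica in turn, switches to the next one on a mismatch, and keeps a list of the corrupted replicas so they can be reported.

```python
import zlib


def store(data):
    """Server side: keep the data together with its checksum."""
    return {"data": bytearray(data), "csum": zlib.crc32(data)}


def verified_read(replicas):
    """Client side: return data from the first replica whose checksum matches.

    Also returns the indexes of corrupted replicas so they can be reported.
    """
    corrupted = []
    for index, replica in enumerate(replicas):
        if zlib.crc32(bytes(replica["data"])) == replica["csum"]:
            return bytes(replica["data"]), corrupted
        corrupted.append(index)
    raise IOError("all replicas failed checksum verification")


replicas = [store(b"payload"), store(b"payload")]
replicas[0]["data"][0] ^= 0xFF   # simulate silent media corruption
data, bad = verified_read(replicas)
```

Here the first replica fails verification, so the read is served from the second one and index 0 is returned for reporting.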
DAOS data model
-
The DAOS data model contains two different object types. One is array objects, which allow an application to represent multi-dimensional arrays; the other is key/value objects, which provide a KV interface and a multi-level KV interface. In either form, data objects are versioned, allowing applications to easily roll back to previous versions of the data. Each object belongs to a domain (a DAOS container). Each container has a private object address space, and its transaction processing is independent of the other containers in the pool.
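Versioned objects can be pictured with a small key/value store that never overwrites a value in place. This is a conceptual sketch only: `VersionedKV` is a hypothetical class, and the integer version counter is an assumption rather than the way DAOS actually tracks versions.

```python
import bisect


class VersionedKV:
    """Key/value object that keeps every version of each key."""

    def __init__(self):
        self._history = {}  # key -> ([versions], [values]), versions ascending
        self._version = 0

    def put(self, key, value):
        """Append a new version instead of overwriting; return its number."""
        self._version += 1
        versions, values = self._history.setdefault(key, ([], []))
        versions.append(self._version)
        values.append(value)
        return self._version

    def get(self, key, version=None):
        """Read the value visible at `version` (default: the latest)."""
        versions, values = self._history.get(key, ([], []))
        if version is None:
            version = self._version
        pos = bisect.bisect_right(versions, version)
        if pos == 0:
            raise KeyError(key)
        return values[pos - 1]


kv = VersionedKV()
v1 = kv.put("temperature", 20)
v2 = kv.put("temperature", 25)
latest = kv.get("temperature")
previous = kv.get("temperature", version=v1)
```

Reading at `v1` returns the older value, which is the kind of access that makes rolling back to a previous version straightforward.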
-
DAOS supports access with POSIX semantics. POSIX is not a feature of the DAOS storage model itself, but a library built on top of the DAOS back-end API. A POSIX file system namespace is stored in a DAOS container. The POSIX API is served by a FUSE driver that uses the DAOS engine API (libdaos) and the DAOS file system API (libdfs) to access data.