Paper reading notes (Clover: a distributed key-value store that passively disaggregates compute and storage)

Notes on the paper "Disaggregating Persistent Memory and Controlling Them Remotely: An Exploration of Passive Disaggregated Key-Value Stores".

Background

In a traditional distributed storage system, each node contains both a compute part and a storage part, and a node can access its own local storage as well as the storage of remote nodes. The storage part has traditionally been built from SSDs or HDDs, but with the arrival of persistent memory (PM, non-volatile memory), more and more storage systems adopt this medium. The resulting organization is shown in the figure below:

traditional model

Problems

  • Within a single node, the compute and storage parts differ in processing speed, which prevents the node from reaching optimal performance

  • Poor scalability

  • There are issues with data consistency and reliability

The disaggregated model

To address the problems of traditional distributed storage systems, a model that separates (disaggregates) compute from storage has been proposed. Compared with the traditional model, it performs better in terms of resource management and scalability, and many data centers and cloud service platforms have adopted it.

In addition, a network technology called RDMA (Remote Direct Memory Access) is increasingly used in distributed systems. RDMA allows one node to access the memory of a remote node directly, bypassing the remote CPU, so it offers low latency and low CPU overhead; using it can greatly improve the performance of a distributed system.

Once compute and storage nodes are separated, management software has to run somewhere to maintain the system. Depending on where that management layer sits, and combined with PM as the storage medium and RDMA as the transport, two types of models are proposed: aDPM (active disaggregated PM) and pDPM (passive disaggregated PM). Here "active" and "passive" refer to how the data is managed.

aDPM

The architecture of aDPM is shown in the figure below

aDPM

In aDPM, the management layer runs on the storage node. This reduces latency, but to sustain high network bandwidth the storage node needs substantial processing power, which consumes a lot of energy. Moreover, even if the system uses RDMA, every access has to pass through the management layer before reaching memory, so RDMA's ability to access remote memory directly goes unused.

pDPM

Since aDPM still has these shortcomings, the authors consider moving the management software to the compute nodes, which yields the pDPM model. The architecture of pDPM is shown in the figure below:

pDPM

This mode solves the problem that RDMA cannot be fully exploited in aDPM: the storage node only needs a NIC that supports RDMA, and compute nodes can then access its memory directly. But the storage node now has no processing power of its own, so the next question is where data should be processed and managed. Starting from this question, three designs are proposed: pDPM-Direct, pDPM-Central, and Clover.

pDPM-Direct

The intuitive idea is to manage data on the compute nodes: each compute node reads and writes the storage nodes directly through one-sided RDMA. Its architecture is as follows:

pDPM-Direct

A brief look at how reads and writes work in this architecture:

Each piece of data is stored on the storage node as a KV entry. A KV entry contains a committed copy and an uncommitted copy of the value, and each copy carries a checksum to guarantee reliability. A rough sketch of both operations follows the list below.

  • When performing a read, the committed data in the KV entry is fetched and its checksum verified. If verification fails, the read is retried.

  • When performing a write, the KV entry is first locked, then the data is written to the uncommitted copy and then to the committed copy, and finally the lock is released.
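Below is a minimal C sketch of a pDPM-Direct KV entry and its read/write paths, purely as an illustration of the scheme above. The entry layout, the fixed value size, and the simple checksum are assumptions, and ordinary `memcpy` plus local atomics stand in for the one-sided RDMA reads, writes, and compare-and-swap a real system would issue.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define VALUE_SIZE 64

/* Simple 32-bit checksum (FNV-1a); a real system would likely use CRC32. */
static uint32_t checksum32(const uint8_t *buf, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 16777619u;
    }
    return h;
}

struct slot {                  /* one copy of the value plus its checksum */
    uint8_t  data[VALUE_SIZE];
    uint32_t crc;
};

struct kv_entry {              /* layout of one entry in storage-node PM  */
    uint64_t    lock;          /* taken with an RDMA compare-and-swap     */
    struct slot uncommitted;   /* written first                           */
    struct slot committed;     /* written second; readers only touch this */
};

/* Read: fetch the committed slot and retry until the checksum matches,
 * which guards against observing a half-finished write. */
void kv_read(const struct kv_entry *remote, uint8_t out[VALUE_SIZE])
{
    struct slot local;
    do {
        memcpy(&local, &remote->committed, sizeof(local));   /* RDMA read  */
    } while (checksum32(local.data, VALUE_SIZE) != local.crc);
    memcpy(out, local.data, VALUE_SIZE);
}

/* Write: lock the entry, write the uncommitted copy, then the committed
 * copy, then release the lock. The two full copies per entry are the
 * space overhead noted below. */
void kv_write(struct kv_entry *remote, const uint8_t val[VALUE_SIZE])
{
    struct slot local;
    memcpy(local.data, val, VALUE_SIZE);
    local.crc = checksum32(local.data, VALUE_SIZE);

    while (__sync_lock_test_and_set(&remote->lock, 1))       /* RDMA CAS   */
        ;                                                     /* spin       */
    memcpy(&remote->uncommitted, &local, sizeof(local));      /* RDMA write */
    memcpy(&remote->committed,   &local, sizeof(local));      /* RDMA write */
    __sync_lock_release(&remote->lock);                       /* unlock     */
}
```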

As you can see, the problems with this approach include:

  • Writes are slow, since each one must take a lock and write the data twice

  • Each piece of data has to be stored twice, which wastes space

pDPM-Central

pDPM-Direct effectively distributes data processing across the compute nodes. The opposite idea is to centralize data processing in a scheduler that sits between the compute nodes and the storage nodes; this is the approach taken by pDPM-Central. Its architecture is as follows:

pDPM-Central

A brief look at how reads and writes work in this architecture:

The scheduler's PM stores a mapping table, and each entry holds the address of one piece of data. A rough sketch of both request handlers follows the list below.

  • When a read is performed, the compute node sends an RPC to the scheduler. The scheduler locks the corresponding mapping-table entry, reads the data from the storage node, returns it to the compute node, and finally releases the lock on the entry.

  • When a write is performed, the compute node sends an RPC to the scheduler. The scheduler allocates space on a storage node for the data, writes the data into the allocated space, and finally updates its internal mapping table (which requires taking the entry's lock).
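As an illustration only, here is a minimal C sketch of the two RPC handlers such a scheduler might run. The mapping-table layout and the `alloc_on_storage` / `storage_read` / `storage_write` helpers are assumptions; local memory and `memcpy` stand in for the scheduler's real allocator and its RDMA accesses to the storage nodes.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>
#include <pthread.h>

#define VALUE_SIZE 64
#define MAX_KEYS   1024

/* Local memory stands in for storage-node PM reached over RDMA. */
static void *alloc_on_storage(size_t len)                    { return malloc(len); }
static void  storage_read(void *a, void *b, size_t n)        { memcpy(b, a, n); }
static void  storage_write(void *a, const void *b, size_t n) { memcpy(a, b, n); }

struct map_entry {
    pthread_mutex_t lock;   /* per-entry lock kept in the scheduler's PM */
    void           *addr;   /* where the current value lives on storage  */
};

/* The mapping table; entry locks are assumed to be initialized at startup. */
static struct map_entry mapping[MAX_KEYS];

/* Read RPC: lock the entry, fetch the value from storage, unlock. */
void rpc_get(uint32_t key, uint8_t out[VALUE_SIZE])
{
    struct map_entry *e = &mapping[key];
    pthread_mutex_lock(&e->lock);
    storage_read(e->addr, out, VALUE_SIZE);
    pthread_mutex_unlock(&e->lock);
}

/* Write RPC: allocate space on a storage node, write the value there,
 * then swing the mapping entry to the new location under its lock. */
void rpc_put(uint32_t key, const uint8_t val[VALUE_SIZE])
{
    void *new_addr = alloc_on_storage(VALUE_SIZE);
    storage_write(new_addr, val, VALUE_SIZE);

    struct map_entry *e = &mapping[key];
    pthread_mutex_lock(&e->lock);
    e->addr = new_addr;                  /* publish the new version */
    pthread_mutex_unlock(&e->lock);
}
```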

As you can see, the problems with this approach include:

  • Reads become slower because every request takes an extra hop through the scheduler

  • The scheduler's own CPU usage is very high, since it has to handle RPCs from all compute nodes, allocate space on storage nodes, and so on

  • The scheduler becomes a bottleneck in the system

Clover

Clover mixes the two approaches above: it separates data from metadata and manages them differently. Data management (the data layer) follows pDPM-Direct, i.e. reads and writes are performed directly by each compute node; metadata management (the metadata layer) follows pDPM-Central, i.e. operations such as space allocation and garbage collection are centralized in a metadata server (MS). Its architecture is shown in the figure below:

Clover

data layer

The basic operations the data layer must support are reading and writing data. Clover uses a data structure that requires no locks: each piece of data is stored as a linked list, where every node of the list is one version of the data, so the last node of the list is the newest version. In addition, each compute node keeps a cursor (similar to a pointer) for the data, which records the version it last accessed (not necessarily the latest one).

  • When performing a read, the compute node uses its cursor to find a position in that data's linked list, then traverses from there to the end of the list, which yields the latest version of the data.

  • When performing a write, a new node has to be appended to the data's list on the storage node. If the list has only this one node, the data is newly created, and the compute node simply points its cursor at the new node; if the list already has nodes, the data is being updated: the node representing the previous latest version is linked to the newly created node, and finally the cursor on the writing compute node is updated.

During a read, if the list is long and the cursor points at a version that is too old, the traversal can take too long. As an optimization, the storage node also keeps a pointer called a shortcut for each piece of data, which points at the latest version node of the corresponding data entry. In practice, traversing the list and following the shortcut are done in parallel, and whichever path reaches the latest data first wins. The sketch below illustrates the read and write paths.
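The following C sketch illustrates the version chain, the cursor, and a lock-free append, again only under stated assumptions: version nodes are ordinary in-memory structures, the key is assumed to already exist, the shortcut path is only mentioned in a comment, and a local compare-and-swap stands in for the one-sided RDMA CAS a compute node would use to link a new version.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define VALUE_SIZE 64

struct version_node {                /* lives in storage-node PM            */
    uint8_t              data[VALUE_SIZE];
    struct version_node *next;       /* NULL means this is the newest one   */
};

struct key_state {                   /* per-key state kept on a compute node */
    struct version_node *cursor;     /* version seen last (may lag behind)   */
};

/* Read: walk next pointers from the cursor to the tail; the tail is the
 * latest version. (Clover also chases the shortcut pointer in parallel
 * and uses whichever path finishes first.) */
void clover_read(struct key_state *ks, uint8_t out[VALUE_SIZE])
{
    struct version_node *n = ks->cursor;
    while (n->next != NULL)
        n = n->next;
    memcpy(out, n->data, VALUE_SIZE);
    ks->cursor = n;                  /* remember the version just read       */
}

/* Write: fill a new node (in space pre-allocated by the MS), then link it
 * behind the current tail with a compare-and-swap. If the CAS fails,
 * another writer appended first, so keep walking forward and retry. */
void clover_write(struct key_state *ks, struct version_node *newn,
                  const uint8_t val[VALUE_SIZE])
{
    memcpy(newn->data, val, VALUE_SIZE);
    newn->next = NULL;

    struct version_node *tail = ks->cursor;
    for (;;) {
        while (tail->next != NULL)
            tail = tail->next;                    /* find the current tail  */
        if (__sync_bool_compare_and_swap(&tail->next, NULL, newn))
            break;                                /* successfully linked    */
    }
    ks->cursor = newn;               /* point the local cursor at the new version */
}
```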

The organization form of the data layer is shown in the figure below:

Data plane

metadata layer

The metadata layer communicates only with the compute nodes and handles operations such as space management, garbage collection, and load balancing.

For space allocation, free space is organized by the MS into chunks. Each chunk's size matches a data-buffer size, so different chunks can have different sizes, and the chunks are kept in a free queue. When a compute node is about to perform a write, it asks the MS in the background to allocate a suitable chunk, and the MS hands one out from the free queue.

For garbage collection, after a write completes the compute node may need to retire some old version nodes, so in the background it sends a reclamation request to the MS. On receiving the request, the MS puts the previously allocated chunks back onto the free queue. A rough sketch of both operations follows.
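Here is a minimal C sketch of the MS-side bookkeeping described above: chunks on a free queue, handed out on allocation requests and returned on reclamation requests. The chunk descriptor, the single global lock, and the first-fit lookup are assumptions made purely for illustration; the real MS may organize its free space differently.

```c
#include <stddef.h>
#include <pthread.h>

struct chunk {
    void         *addr;    /* location of the space on a storage node */
    size_t        size;    /* matches one data-buffer size            */
    struct chunk *next;
};

static struct chunk    *free_queue;                          /* free queue in the MS */
static pthread_mutex_t  ms_lock = PTHREAD_MUTEX_INITIALIZER;

/* Allocation request from a compute node: hand out a free chunk whose
 * size matches the requested data-buffer size. */
struct chunk *ms_alloc(size_t size)
{
    pthread_mutex_lock(&ms_lock);
    struct chunk **pp = &free_queue, *c = NULL;
    while (*pp != NULL) {
        if ((*pp)->size == size) {     /* first chunk of the right size */
            c   = *pp;
            *pp = c->next;
            break;
        }
        pp = &(*pp)->next;
    }
    pthread_mutex_unlock(&ms_lock);
    return c;                          /* NULL if nothing suitable is free */
}

/* Reclamation request sent after garbage collection: put the chunk that
 * held a retired version back onto the free queue. */
void ms_reclaim(struct chunk *c)
{
    pthread_mutex_lock(&ms_lock);
    c->next    = free_queue;
    free_queue = c;
    pthread_mutex_unlock(&ms_lock);
}
```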

The organizational form of the above operations is shown in the figure below:

Metadata plane1

For reliability and load balancing, replicas of a data entry's historical versions may live on different storage nodes, and one version node can point to several next-version nodes even though they reside on different storage nodes. The general idea is shown in the figure below:

Metadata plane2

summary

Among the three pDPM designs above, Clover tries to combine the advantages of the other two. Experiments show that Clover does achieve low read and write latency, low energy consumption, and low cost, but its performance degrades when there are many write conflicts. In short, the Clover flavor of pDPM is worth considering when designing a distributed storage system.
