[Paper Reading] Research on Distributed Database Based on LevelDB


Source paper: Research and Implementation of Distributed Database Based on LevelDB (CNKI.net)

What was achieved?

Based on the key-value NoSQL database LevelDB, combined with the Raft consensus algorithm, data sharding, and load balancing, a distributed database built on LevelDB is designed and implemented.

The main work includes:

1. Modify the read strategy of the Raft algorithm from leader-only reads to reads that can also be served by followers. This reduces the burden on the leader, increases read throughput when the read load is much greater than the write load, and lowers average request latency. Add a pre-election (pre-vote) mechanism: before launching a formal election, a node first initiates a pre-election and only starts the formal election after receiving replies from a majority. This solves the problem of committed logs possibly being discarded by the Raft algorithm when the network is partitioned.

2. Implement key-range-based data distribution, which provides friendly support for sequential reads and writes; implement load balancing, dynamically adjusting system load through partition splitting and partition migration; add an intermediate layer so that a single storage instance can provide logically independent storage space for multiple users.
3. Design and implement the prototype system DLevel, which provides basic create, read, update, and delete functionality.

Basic concepts

CAP and BASE

CAP: data consistency (Consistency), service availability (Availability), and partition fault tolerance (Partition tolerance).

The CAP theorem is a fundamental theorem in the field of distributed systems. It states that a distributed system cannot satisfy all three of the following at the same time:

  • Consistency: The data of all nodes is consistent at any time
  • Availability: Read and write requests to the system always complete successfully
  • Partition tolerance: When a node crashes or a network partition causes message loss, the system can still provide external services that meet consistency and availability

The BASE theory comes from practical experience with distributed systems and relaxes strong consistency to eventual consistency. It stands for Basically Available, Soft State, and Eventually Consistent.

  • Basically available: when a node goes down or a network partition occurs in a distributed system, part of the availability (for example, response time or non-core functionality) may be sacrificed to keep the core service available
  • Soft state: strong consistency requires the data replicas on multiple nodes to be consistent at all times; soft state instead allows a certain delay in replica synchronization between nodes, so replicas may temporarily be in an inconsistent state and are not required to be fully consistent at every moment
  • Eventual consistency: although there may be moments of inconsistency, after a period of data synchronization the data replicas in the system eventually become consistent

LevelDB

image-20230211112355980

architecture diagram

It mainly includes six parts: the Memtable and Immutable Memtable in memory, and the SSTable files, Log file, Manifest file, and Current file on disk. Both the Memtable and the Immutable Memtable are implemented on top of a skip list; the difference is that the Immutable Memtable can no longer be modified. An SSTable is a data structure for storing KV data: it encapsulates key-value pairs and stores them sorted by key. SSTable files live on disk and are organized hierarchically, extending from the top level, Level 0, down to Level n; the SSTable data of a lower level Level n comes from merging (compacting) the SSTables of the level above it, Level n-1. The Manifest file records metadata for each SSTable, such as its start_key_ and end_key_, and the Current file records the name of the manifest currently in use.

LevelDB features:

  1. LevelDB uses mechanical disk as its main storage medium, unlike databases such as Redis and Memcached, which use memory as the main storage medium
  2. Data in LevelDB is kept sorted by key, and custom key comparators are supported
  3. Keys and values may be byte sequences of arbitrary length
  4. Snapshots are supported, data is automatically compressed with the Snappy compression algorithm, and both forward and backward iterators are available
  5. Atomic batch operations are supported, and the basic operation interfaces Put, Get, and Delete are provided (a usage sketch follows this list)
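
To make these interfaces concrete, here is a minimal usage sketch against the public LevelDB C++ API (the database path is an arbitrary example, not from the paper):

```cpp
#include <cassert>
#include <string>
#include "leveldb/db.h"

int main() {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;

  // Open (or create) a database directory.
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/dlevel_demo", &db);
  assert(s.ok());

  // Basic operations: Put, Get, Delete.
  s = db->Put(leveldb::WriteOptions(), "key1", "value1");
  std::string value;
  s = db->Get(leveldb::ReadOptions(), "key1", &value);
  s = db->Delete(leveldb::WriteOptions(), "key1");

  // Keys are kept in sorted order, so a range scan is just an iterator walk.
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  for (it->Seek("a"); it->Valid() && it->key().ToString() < "z"; it->Next()) {
    // it->key() / it->value() yield one ordered key-value pair
  }
  delete it;
  delete db;
  return 0;
}
```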

RPC

RPC is usually based on the client/server (C/S) model. The client sends a request carrying predefined parameters to call a program on the server; after receiving the request, the server parses the parameters, executes the corresponding program, and returns the result to the client.

System Requirements Analysis

Design goals


(1) Provide the basic functions of native LevelDB, including the basic operation interfaces Open, Close, Put, Get, Delete, and Scan;
(2) Support distributed storage, providing a storage service whose capacity scales beyond a single node;
(3) Guarantee strong consistency of data within a replica set through the Raft consensus algorithm: a successful write means the data has been written to a majority of the replica-set members, and once a write succeeds, all clients see the latest, consistent data;
(4) Under the premise of ensuring data consistency, make the system as available as possible and minimize the time the service is unavailable.

Functional Requirements

(1) Possess the basic functions of a distributed KV database

  • Open and close: Open(dbname), Close(dbname)
  • Data write: Put(key, value), writes the key-value pair into the database
  • Data query: Get(key), returns the value corresponding to the key
  • Data deletion: Delete(key), deletes the key-value pair corresponding to the key
  • Range query: Scan(start_key, end_key, std::map<std::string, std::string>, int limit), returns the key-value pairs within the specified range, up to limit entries.

From the above, it can be seen that this system does not map the upper-level KV interface to a relational model and does not support SQL queries; it only provides simple KV operations.
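
A minimal sketch of what the client-facing interface above might look like in C++; the class name DLevelClient and the Status type are illustrative assumptions, and only the operation signatures come from the paper:

```cpp
#include <map>
#include <string>

// Illustrative status type; the paper does not specify error handling.
struct Status {
  bool ok = true;
  std::string message;
};

// Hypothetical client wrapper exposing the five basic operations.
class DLevelClient {
 public:
  Status Open(const std::string& dbname);
  Status Close(const std::string& dbname);
  Status Put(const std::string& key, const std::string& value);
  Status Get(const std::string& key, std::string* value);
  Status Delete(const std::string& key);
  // Fill `result` with at most `limit` key-value pairs whose keys lie in
  // the range [start_key, end_key].
  Status Scan(const std::string& start_key, const std::string& end_key,
              std::map<std::string, std::string>* result, int limit);
};
```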

(2) Support replica sets, improve system reliability through data replication, and ensure system availability when some storage nodes are down;

(3) Load balancing, when the data distribution is unbalanced, the data distribution in the system is balanced by partition splitting and merging, partition migration and other methods;

(4) Horizontal expansion, when the storage capacity needs to be expanded, the storage capacity of the system is expanded by adding a Storage Server Group

Items (3) and (4) are basically not needed for the time being; load balancing and horizontal expansion may be added later.

Performance requirements

  1. Availability. When a metadata management node or storage node fails, the system can still guarantee service availability as long as no more than half of the nodes in the cluster have failed.
  2. Reliability. In this system, data is replicated to most or even all members of a replica set. As long as a majority of the nodes in the replica set survive, the data is not lost. Assuming each replica-set node fails with probability p and the replica set has n nodes, data is lost only when all nodes in the replica set fail, so the reliability of the system is 1 - p^n (a worked example follows this list).
  3. Maintainability.
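
As a quick worked example (the numbers are assumed, not taken from the paper): with a per-node failure probability of p = 0.01 and a replica set of n = 3 nodes,

$$
R = 1 - p^{n} = 1 - 0.01^{3} = 1 - 10^{-6} = 0.999999
$$

so three replicas already give six nines of reliability under this simple independence model.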

Overall system design

The architecture diagram is as follows:

image-20230212164247831

The system uses a client/server architecture. The server side mainly consists of the MetaInfo Server Group (the metadata management module) and the Storage Server Group (the storage module); the two interact through heartbeats.

The client first connects to the metadata management module over the network, which returns the information of the specific storage module; the client then sends its request to that storage module, and the storage module returns the result to the client after completing the requested operation.

? Is efficiency limited here because this is a single module? Or is the metadata management module itself a Raft group that is basically read-only, so every member can serve requests and the work is still distributed?

The metadata management module is responsible for partition information management, request routing, and load balancing.

Storage module: responsible for KV read and write requests. Each Raft group includes one primary (leader) replica node and several secondary (follower) replica nodes. The storage module consists of multiple storage server groups, and each group handles the read and write requests of a specific partition.

DLevel also includes a cluster management module, mainly implemented with Zookeeper. When the storage module and the metadata management module start, they create znodes under the directory specified in Zookeeper. If a node fails and can no longer communicate with Zookeeper, Zookeeper generates corresponding notification information, which is used for cluster management.
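
A minimal sketch of this registration step using the ZooKeeper C client; the znode path /dlevel/storage/server-1 and the server address are assumptions for illustration (the parent path is assumed to already exist):

```cpp
#include <cstdio>
#include <cstring>
#include <zookeeper/zookeeper.h>

static void watcher(zhandle_t*, int, int, const char*, void*) {}

int main() {
  // Connect to the ZooKeeper ensemble.
  zhandle_t* zh = zookeeper_init("127.0.0.1:2181", watcher, 30000,
                                 nullptr, nullptr, 0);
  if (zh == nullptr) return 1;

  const char* data = "storage-server-1";  // illustrative node payload
  char created_path[256];
  // An ephemeral znode disappears automatically when a dead server's session
  // expires, which is how the cluster learns about node failures.
  int rc = zoo_create(zh, "/dlevel/storage/server-1", data,
                      static_cast<int>(strlen(data)),
                      &ZOO_OPEN_ACL_UNSAFE, ZOO_EPHEMERAL,
                      created_path, sizeof(created_path));
  printf("zoo_create rc=%d path=%s\n", rc, rc == ZOK ? created_path : "");
  zookeeper_close(zh);
  return 0;
}
```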

For the Zookeeper-based design, the paper references: [22] TiKV. https://github.com/tikv/tikv; [23] Pegasus. https://github.com/XiaoMi/pegasus; [44] FoundationDB. https://www.foundationdb.org/.

Server

image-20230212164802758

Step 1: Start Zookeeper first so that the Storage Servers and Metainfo Servers can register;

Step 2: Storage Server initialization. Because the Storage Server cluster includes multiple groups, each group internally elects its leader with the Raft algorithm when it starts, and then each Storage Server creates its own znode in Zookeeper. // How do we know which servers form a group?

Step 3: Metainfo Server initialization. The Metainfo Server acts as the central node of the system; usually three nodes are configured and managed through the Raft algorithm, and the group elects a Leader. Each Metainfo Server then creates its own znode in Zookeeper.

Step 4: After the Storage Servers and Metainfo Servers start successfully, the two sides establish a heartbeat connection. In each heartbeat, a Storage Server attaches its own state information and uploads it to the Metainfo Server, which the Metainfo Server uses for cluster management and load balancing.

Step 5: The Metainfo Server initializes the partition-to-node mapping table.

Step 6: Startup is complete, and the system provides services to the outside.

Storage module

Turning LevelDB into a distributed version first requires adding a network communication module; in this paper, network communication is implemented through RPC. The replica set is implemented through the Raft consensus algorithm.

image-20230212165621047

The service access layer mainly consists of the RPC module and the command processing module. The access layer of a storage node may receive requests from multiple clients at the same time, so it needs a certain concurrent processing capability. It first performs simple processing on messages from clients, such as validating the message format and checking that the message content is well formed, and then passes them to the data synchronization layer.

The data synchronization layer is built around the distributed consensus algorithm Raft and synchronizes the requests received by the access layer across the group's replica set. It mainly consists of the consistency module, the state machine, and the log module. Messages processed by the access layer are first synchronized to all nodes of the replica set through the consistency module; usually the leader synchronizes its own data to the followers. The data exchanged during synchronization is generally the serialized operation log, which is then handed to the log module to parse and extract the concrete client request. Every node in the replica set holds the same log sequence; combined with the state machine, all nodes start in the same state and, after applying the same sequence of operations, remain in the same state, which yields a consistent replicated state machine. In this way the data synchronization layer guarantees strong data consistency through the Raft algorithm.

The data storage layer is mainly composed of the LevelDB storage engine.

Metadata Management Module

image-20230212172340568

Overall it is a Raft cluster composed of three MetaInfo Servers. Each MetaInfo Server is roughly divided into three layers: the access layer, the data synchronization layer, and the service layer. The service layer mainly implements the two functions of request routing and load balancing.

The metadata cluster stores the mapping table between partitions and storage nodes. After a client request reaches the metadata management cluster, the cluster looks up the mapping table and returns the specific storage cluster to the client.
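
A small sketch of how such a mapping table could be represented and queried; the type names and the use of std::map are assumptions, not the paper's code:

```cpp
#include <cstdint>
#include <map>
#include <string>

// A partition covers the key range [start_key, end_key) and is served by
// one storage server group.
struct Partition {
  std::string start_key, end_key;
  uint64_t group_id;
};

class RouteTable {
 public:
  void Add(const Partition& p) { table_[p.start_key] = p; }

  // Return the storage group responsible for `key`, or 0 if no partition
  // covers it.
  uint64_t Lookup(const std::string& key) const {
    auto it = table_.upper_bound(key);   // first partition starting after key
    if (it == table_.begin()) return 0;
    --it;                                // candidate partition containing key
    const Partition& p = it->second;
    return (key < p.end_key) ? p.group_id : 0;
  }

 private:
  std::map<std::string, Partition> table_;  // start_key -> partition
};
```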

Data Synchronization Module

image-20230212172944793

The client first sends a write request to the Leader. The Leader writes the operation log entry locally and then forwards it to the Followers. After a Follower writes the entry locally, it returns an acknowledgment to the Leader; once the Leader has received acknowledgments from a majority of Followers, it returns an acknowledgment to the client.
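
The same flow condensed into code; the types below are stand-ins for the paper's RaftNode/LogReplicator machinery and are only meant to show the ordering of the steps:

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct LogEntry { uint64_t term, index; std::string command; };

struct Follower {
  // Stand-in for the append_entries RPC; true means the follower persisted
  // the entry in its local log.
  bool AppendEntries(const LogEntry&) { return true; }
};

// Returns true once the write may be acknowledged to the client.
bool HandleWrite(std::vector<LogEntry>& log, std::vector<Follower>& followers,
                 uint64_t term, const std::string& command) {
  LogEntry entry{term, log.size() + 1, command};
  log.push_back(entry);                   // 1. leader appends to its own log
  size_t acks = 1;                        //    the leader counts as one copy
  for (Follower& f : followers)           // 2. forward the entry to followers
    if (f.AppendEntries(entry)) ++acks;
  // 3. commit and reply to the client once a majority holds the entry
  return acks > (followers.size() + 1) / 2;
}
```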

Configuration management module

Each cluster has multiple nodes, and multiple nodes in the same cluster run the same service and have some identical configurations.

If such parameters need to be modified, they must be changed on multiple nodes at the same time, which is error-prone in a distributed environment. This system mainly uses Zookeeper for cluster configuration management. Zookeeper is a distributed coordination service that solves coordination and management problems in a distributed environment by providing a simple architecture and API that make application development easier.

image-20230213094319899

Is this using Zookeeper to manage the metadata?

By watching the znodes, the metadata management cluster receives a change event whenever a node changes, thereby perceiving changes in the storage cluster and taking the corresponding countermeasures.
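
A companion sketch of the watch side using the ZooKeeper C client; the path /dlevel/storage is the same assumed path as in the registration sketch above:

```cpp
#include <cstdio>
#include <zookeeper/zookeeper.h>

// Fired when the children of the watched path change (a node joined or died).
static void OnChildrenChanged(zhandle_t*, int type, int, const char* path, void*) {
  if (type == ZOO_CHILD_EVENT)
    printf("membership under %s changed, reconcile the partition mapping\n", path);
  // ZooKeeper watches are one-shot: re-register the watch here.
}

void WatchStorageNodes(zhandle_t* zh) {
  struct String_vector children;
  if (zoo_wget_children(zh, "/dlevel/storage", OnChildrenChanged,
                        nullptr, &children) == ZOK) {
    for (int i = 0; i < children.count; ++i)
      printf("live storage node: %s\n", children.data[i]);
    deallocate_String_vector(&children);
  }
}
```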

Client module

image-20230212205735342

For our needs, SQL and transactions could be added on top of the user interaction module.

The cache module is used to cache the routing table. As long as the routing table does not change, requests do not need to go through the metadata management service every time; when the routing table changes, the metadata management cluster sends the client a request to update its routing table.

That is, a request to a given node can be routed directly from the cache without querying the metadata each time.

Read and write process

Write

image-20230212205943636

image-20230212210049583

Read

image-20230212210132400

To sum up: in our own architecture design we could also adopt a C/S architecture, interact with the metadata manager during client startup, and cache the routes, then read directly according to the cached contents. But how large does this table get?

If we have upper-level queries, should the client perform coarse-grained query processing? If the query is not sent to the server, is the cached routing table then cached for nothing?

System implementation

Storage module

image-20230212212300604

Data synchronization module implementation

Leader Election

To avoid the errors caused by network partitions, a pre-election is used to probe the network status of the cluster; the real election is started only after replies from a majority of nodes have been received.

The class diagram is as follows:

image-20230212213903939

It mainly includes three classes: the RaftNode class, the RaftNodeManager class, and the RaftService class.

The RaftNode class encapsulates a node in the Raft algorithm and defines the node's behavior, including node initialization (init), starting and stopping a node (shutdown and join), handling vote requests during leader election (handler_request_vote), adding and removing Raft group members (add_peer and remove_peer), resetting the election timer (reset_election_timeout), and so on.

The RaftService class is generated automatically by protoc from the protobuf file raft.proto. The raft.proto file mainly defines the interfaces of the two main RPC services, request_vote and append_entries, together with the format of the request parameters and the format of the response data.

The details are shown in Figure 4-12: the request_vote RPC request parameters include the node's server_id, the current term number, the term of the last log entry (last_log_term), and the index of the last log entry (last_log_index); the reply parameters include the current term number and whether the vote is granted. The pre_vote parameters are the same as those of request_vote. The pre_vote RPC is used to probe the network status between the nodes of the current cluster. It prevents the nodes in a minority partition from repeatedly initiating leader elections during a network partition and driving the term number up; otherwise, when the partitions merge, a node from the minority partition might be elected leader, discarding and overwriting the logs committed during the network partition.
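
Written out as plain structs (a sketch; in the paper these are protobuf messages in raft.proto, and the exact field names are assumptions), the request and reply look like this:

```cpp
#include <cstdint>

struct RequestVoteRequest {     // pre_vote uses exactly the same fields
  uint64_t server_id;           // candidate issuing the request
  uint64_t term;                // candidate's current term number
  uint64_t last_log_term;       // term of the candidate's last log entry
  uint64_t last_log_index;      // index of the candidate's last log entry
};

struct RequestVoteResponse {
  uint64_t term;                // responder's current term number
  bool vote_granted;            // whether the vote (or pre-vote) is granted
};
```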

The concrete implementation of RaftNode is completed in the RaftNodeImpl class. The RaftNode class calls the concrete implementation through the member variable RaftNodeImpl* impl_, which decouples the implementation from the class itself and hides the implementation details from users of the class.

As mentioned above, raft.proto also defines the append_entries RPC. Its request parameter format and response format are as follows:

image-20230212215605316

The append_entries RPC request parameters include the node's server_id, the current term number, the term of the previous log entry (prev_log_term), the index of the previous log entry (prev_log_index), and the log entries to be replicated in the current request. Note that the entries form a list, so multiple log entries can be submitted and replicated at once; the final field is the index of the log entry that the system has committed.
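
In the same spirit as the request_vote sketch above, the append_entries request and reply can be pictured as the following structs (field names assumed):

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Entry { uint64_t term, index; std::string data; };

struct AppendEntriesRequest {
  uint64_t server_id;           // the leader's id
  uint64_t term;                // the leader's current term number
  uint64_t prev_log_term;       // term of the entry preceding the new batch
  uint64_t prev_log_index;      // index of the entry preceding the new batch
  std::vector<Entry> entries;   // a list: several entries can be shipped at once
  uint64_t committed_index;     // highest log index known to be committed
};

struct AppendEntriesResponse {
  uint64_t term;                // follower's current term number
  bool success;                 // whether the entries were appended
};
```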

The RaftNodeManager class manages RaftNodes: it records the information of the RaftNodes in the group in a std::map, and is responsible for adding and removing nodes in the group and obtaining the membership information of the current group.

The election process is as follows: before sending the request_vote RPC, a node first sends the pre_vote RPC to probe the network status of the current cluster. The election process described in the Raft algorithm's pseudocode begins only after a majority of replies to the pre_vote RPC have been received.

image-20230212215757915
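
A simplified, self-contained sketch of this two-phase flow (the Peer and Node types are placeholders for the RPC stubs and RaftNode state; this is not the paper's code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct VoteReply { uint64_t term; bool granted; };

struct Peer {
  // Stand-ins for the pre_vote / request_vote RPC stubs generated by protoc.
  VoteReply PreVote(uint64_t, uint64_t term, uint64_t, uint64_t) { return {term, true}; }
  VoteReply RequestVote(uint64_t, uint64_t term, uint64_t, uint64_t) { return {term, true}; }
};

struct Node { uint64_t id = 1, term = 1, last_log_term = 1, last_log_index = 0; };

static bool Majority(size_t grants, size_t cluster_size) {
  return grants > cluster_size / 2;
}

void StartElection(Node& self, std::vector<Peer>& peers) {
  // Phase 1: pre-vote. Same arguments as request_vote, but the term is NOT
  // incremented, so a node isolated in a minority partition cannot inflate terms.
  size_t grants = 1;  // implicit grant from self
  for (Peer& p : peers)
    if (p.PreVote(self.id, self.term + 1, self.last_log_term, self.last_log_index).granted)
      ++grants;
  if (!Majority(grants, peers.size() + 1)) return;  // stay follower, retry later

  // Phase 2: the formal election, as in the Raft paper's pseudocode.
  ++self.term;
  size_t votes = 1;  // vote for self
  for (Peer& p : peers)
    if (p.RequestVote(self.id, self.term, self.last_log_term, self.last_log_index).granted)
      ++votes;
  if (Majority(votes, peers.size() + 1)) { /* become leader and start heartbeats */ }
}
```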

Log replication

The class diagram of the log replication module is shown in the figure. The classes related to log replication are mainly the LogStorage class, the LogReplicator class, the LogReplicatorGroup class, and the FSM class.

The LogStorage class is mainly responsible for storing logs. Its main operations include appending a single log entry (append_log_entry), appending log entries in batches (append_log_entries), matching the log with the Leader (match_log), removing log entries that are inconsistent with the Leader, and obtaining log entries that have not yet been synchronized.

The LogReplicator class is a dedicated instance that the Leader creates for each Follower during log replication to manage the replication to that Follower. It mainly records data such as the currently synchronized log index (log_index_), the next log index to synchronize (next_index_), the synchronization timeout (timeout_), the entries already synchronized, and so on. Its main behaviors include starting synchronization with the Follower, stopping synchronization (stop and join), checking consistency with the Follower's log and forcing it to catch up (catch_up), maintaining the heartbeat, and so on.
The FSM class implements the state machine that receives the events applied on each Raft node. When the Leader determines that a log entry has been synchronized to a majority of nodes, it applies the operation to the state machine; once applied to the state machine, the log entry is considered committed.

image-20230212215951058

The specific process of log synchronization

image-20230212221237085

Measures to improve performance:

  1. Log batch submission (batching): the Leader collects client requests up to a certain size and sends them to the Followers in batches, though the batch size limit has to be considered (see the sketch after this list)
  2. The Leader can send the log to the Followers and append it to its local log in parallel
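
A sketch of the batching idea in measure 1; the size threshold and class name are assumptions:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// The leader buffers serialized client commands and hands them to the log
// replicator as one append_entries batch once a byte threshold is reached.
class Batcher {
 public:
  explicit Batcher(size_t max_batch_bytes) : limit_(max_batch_bytes) {}

  // Returns a full batch once the accumulated size crosses the limit,
  // otherwise an empty vector (meaning: keep buffering).
  std::vector<std::string> Add(std::string command) {
    bytes_ += command.size();
    pending_.push_back(std::move(command));
    if (bytes_ < limit_) return {};
    std::vector<std::string> batch;
    batch.swap(pending_);  // ship the whole batch in one append_entries call
    bytes_ = 0;
    return batch;
  }

 private:
  size_t limit_;
  size_t bytes_ = 0;
  std::vector<std::string> pending_;
};
```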

Read mode

The standard reads and writes in the Raft algorithm can only go through the Leader; a Follower does not serve any external read or write requests, and if a client connects to a Follower the request is forwarded to the Leader. The Leader always has the latest committed log entries, so this design guarantees that the latest data is read every time. For read requests, however, allowing reads from Followers relieves the Leader's read burden to a certain extent and increases read efficiency; the key question is how to ensure that the data read is as fresh as possible.

image-20230213090921757



Origin blog.csdn.net/qq_47865838/article/details/129225454