In-depth analysis of the optimization and transformation of the Raft module in ZNBase (Part 1)

Author: Guan Yanxin

Overview

Yunxi Database (ZNBase) is a NewSQL distributed database open sourced by Inspur, featuring HTAP capabilities and a strongly consistent, highly available distributed architecture. In a highly available distributed system, a consensus algorithm is essential for keeping data consistent across the nodes of a cluster. Raft is a distributed consensus algorithm for managing log replication, and many distributed systems, including ZNBase, use it as the underlying consensus protocol. This series of articles introduces the implementation of the Raft consensus algorithm in ZNBase and analyzes in depth the five optimizations the ZNBase team made to the Raft protocol to meet its own business needs.

Introduction to Raft

Raft is a distributed consensus algorithm for managing replicated logs, proposed in a paper by Diego Ongaro and John Ousterhout of Stanford University. Before Raft, Paxos was the de facto standard for distributed consensus, but Paxos is notoriously difficult to understand. Raft was designed to simplify Paxos and make the consensus algorithm easier to understand and implement.

Both Paxos and Raft are distributed consensus algorithms, and their process resembles electing a leader: a candidate must persuade a majority of followers to vote for it, and once elected, the leader issues commands. The difference between Paxos and Raft lies in the details of the election process. Detailed explanations of the Raft algorithm are widely available in the community, so they are not repeated here.

Raft algorithm in ZNBase

Yunxi Database (ZNBase) is a distributed database and a member of the NewSQL family alongside OceanBase, CockroachDB, and TiDB. It has a strongly consistent, highly available distributed architecture, scales horizontally, provides enterprise-grade security features, is fully compatible with the PostgreSQL protocol, and offers users a complete distributed database solution. The overall architecture of ZNBase is shown in Figure 1:

Figure 1: Overall architecture of ZNBase

Strong consistency throughout ZNBase is achieved with the Raft algorithm. First, Raft ensures strong consistency among the distributed replicas of the data as well as consistency of external reads and writes. In short, ZNBase keeps multiple copies of the data, stored on different machines; when one of those machines fails, the database can still serve requests. In addition, ZNBase splits the data into multiple Ranges according to the keys of the inserted data, and the data in each Range is maintained by its own Raft group to keep the replicas consistent. Strictly speaking, then, ZNBase uses a Multi-Raft scheme.
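
To make the Multi-Raft idea concrete, the sketch below maps a key to the Range, and hence to the Raft group, responsible for it. The rangeDesc type, the node IDs, and the lookup function are illustrative assumptions only, not ZNBase's actual data structures:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// rangeDesc is a purely illustrative descriptor: a contiguous key span whose
// replicas form one Raft group (Multi-Raft = one Raft group per Range).
type rangeDesc struct {
	startKey, endKey []byte // data in [startKey, endKey)
	raftGroupID      uint64 // the Raft group that maintains this Range
	replicaNodes     []int  // nodes holding copies of this Range
}

// lookupRange finds the Range (and therefore the Raft group) responsible for
// a key; ranges are assumed sorted by startKey and non-overlapping.
func lookupRange(ranges []rangeDesc, key []byte) *rangeDesc {
	i := sort.Search(len(ranges), func(i int) bool {
		return bytes.Compare(ranges[i].endKey, key) > 0
	})
	if i < len(ranges) && bytes.Compare(ranges[i].startKey, key) <= 0 {
		return &ranges[i]
	}
	return nil
}

func main() {
	ranges := []rangeDesc{
		{startKey: []byte("a"), endKey: []byte("m"), raftGroupID: 1, replicaNodes: []int{1, 2, 3}},
		{startKey: []byte("m"), endKey: []byte("z"), raftGroupID: 2, replicaNodes: []int{2, 3, 4}},
	}
	r := lookupRange(ranges, []byte("k1"))
	fmt.Printf("key k1 -> Raft group %d on nodes %v\n", r.raftGroupID, r.replicaNodes)
}
```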

Specifically, the storage layer of ZNBase is built on RocksDB. With single-node RocksDB, ZNBase can quickly persist data to disk; when a single machine fails, the Raft algorithm quickly replicates the data to other machines. In this process, writes go through the Raft interface rather than directly into RocksDB. Through Raft, ZNBase becomes a distributed key-value storage system: as long as no more than half of the machines in a cluster fail, replica recovery completes automatically through Raft and the failure is transparent to the business.

In the early stages of the project, the Raft implementation in ZNBase used the open-source etcd-raft module, which mainly provides the following functions:

  • Leader election;
  • Member changes, which can be subdivided into adding nodes, removing nodes, leader transfer, and so on;
  • Log replication.

ZNBase uses the etcd-raft module for data replication: every data operation is ultimately converted into a Raft log entry, and the log replication mechanism synchronizes the operation safely and reliably to every node in the Raft group. In practice, according to the Raft protocol, a write can be considered safely committed once it has been replicated to a majority of the nodes.
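
For reference, the sketch below shows the standard way a storage layer drives the open-source etcd-raft library: a write is proposed to the Raft node instead of being written to the local store directly, and only entries that come back through the Ready channel as committed are applied. This is a generic, single-node example rather than ZNBase's actual write path; send and applyToStateMachine are placeholders for the network layer and the key-value store.

```go
package main

import (
	"context"
	"time"

	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

func main() {
	// In-memory Raft log storage; a real system persists this (e.g. in RocksDB).
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
	}
	// Single-node group purely for illustration; a Range would have several peers.
	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})
	defer n.Stop()

	// A data operation is proposed as a Raft log entry, not written directly.
	// In practice proposals are retried until a leader exists.
	go n.Propose(context.Background(), []byte("put k1=v1"))

	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			n.Tick() // drives elections and heartbeats
		case rd := <-n.Ready():
			// A real implementation also persists rd.HardState and handles rd.Snapshot.
			storage.Append(rd.Entries) // persist new log entries
			send(rd.Messages)          // ship messages to peer replicas
			for _, ent := range rd.CommittedEntries {
				if ent.Type == raftpb.EntryNormal && len(ent.Data) > 0 {
					applyToStateMachine(ent.Data) // safe: a quorum has the entry
				}
			}
			n.Advance()
		}
	}
}

// Placeholders standing in for the network layer and the KV store.
func send(msgs []raftpb.Message)      {}
func applyToStateMachine(data []byte) {}
```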

In subsequent production practice, however, the ZNBase R&D team gradually found that the etcd-raft module has a number of limitations, and carried out optimization work in the following areas:

  1. New Raft roles
  2. Leader affinity election
  3. Hybrid serialization
  4. Raft log separation and custom storage
  5. Separation of Raft heartbeats and data

The rest of this article focuses on the first point: the three new roles that the ZNBase team added to the Raft module to meet its own business needs.

ZNBase's improvements to the Raft module

New Raft roles

1. Strong synchronization role

To solve the problem of data synchronization across data centers deployed in different regions, support writing data in multiple locations, and achieve region-level disaster recovery, the ZNBase R&D team added a strong synchronization role to the etcd-raft module.

The specific measures are as follows:

  1. Add a strong synchronization flag for replicas, together with the logic to set and cancel the flag.
  2. Extend the log commit strategy. In the original etcd-raft module, the leader can commit a Raft log entry only after more than half of the replicas (including the leader itself) have acknowledged it. On top of this majority rule, ZNBase adds a commit condition for the strong synchronization role: a log entry can be committed only if all strong synchronization replicas have also acknowledged it (a minimal sketch of this check follows the list).
  3. Design the fault identification and handling mechanism shown in Figure 2 for the strong synchronization role: a strong synchronization failure is detected through a heartbeat timeout, the failed strong synchronization replica is ignored when committing logs, and the failure is recorded in the database log and reported to the user.
  4. Add hot-update support for the heartbeat timeout of the strong synchronization role.
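
A minimal sketch of the extended commit check is shown below. The replicaProgress structure and its flags are simplified assumptions standing in for etcd-raft's internal progress tracking; the point is that an index is committable only when a majority has acknowledged it and every non-faulty strong synchronization replica has acknowledged it as well:

```go
package sketch

// replicaProgress is a simplified stand-in for etcd-raft's per-replica
// progress tracking; strongSync and faulty are the added flags.
type replicaProgress struct {
	match      uint64 // highest log index known to be replicated on the replica
	strongSync bool   // replica is configured with the strong synchronization role
	faulty     bool   // strong synchronization failure detected via heartbeat timeout
}

// canCommit reports whether index may be committed: the original majority rule
// must hold, and every non-faulty strong synchronization replica must also
// have acknowledged the entry.
func canCommit(replicas map[uint64]replicaProgress, index uint64) bool {
	acked := 0
	for _, pr := range replicas {
		if pr.match >= index {
			acked++
		}
		// A healthy strong-sync replica that lacks the entry blocks the commit;
		// a faulty one is ignored until it recovers.
		if pr.strongSync && !pr.faulty && pr.match < index {
			return false
		}
	}
	return acked*2 > len(replicas) // majority rule (leader included in the map)
}
```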

Figure 2: Processing logic of the strong synchronization role

After the four changes above, the etcd-raft module gains a strong synchronization role with the following capabilities:

  • Allows users to configure or cancel the strong synchronization role for a specified replica without affecting the replica's original characteristics. For example, after a full-featured replica is configured as a strong synchronization replica, it still stores Raft logs and user data, votes, and participates in leader elections; when elected leader it serves reads and writes, and as a follower it can serve non-consistent reads.
  • Data on a strong synchronization replica stays synchronized with the leader.
  • Allows users to configure the heartbeat timeout for the strong synchronization role. If a strong synchronization replica fails, the Raft group resumes accepting writes once this timeout expires, and writes issued during the failure are also committed after the timeout. After identifying the failure, Raft temporarily cancels the replica's strong synchronization flag and automatically restores it once the failure is resolved. Failure and recovery information for strong synchronization roles is visible to the user.
  • Allows users to query the configuration status of strong synchronization roles from the SQL client.

2. Read-only role

Because ZNBase is an HTAP database, a relatively independent, special replica needs to be added to the Raft group that serves only reads (its storage engine can, for example, be replaced with a column store) in order to support OLAP. To add this special replica without affecting the original characteristics of the cluster, the ZNBase R&D team designed a new read-only role in Raft.

The specific implementation measures are as follows:

  1. Add a read-only role identifier, along with the logic to create, delete, and rebalance read-only replicas.
  2. Add the read path for read-only replicas: when a read-only replica receives a request, it serves the read if the replica's timestamp is not less than the request's timestamp; otherwise it retries, and once the retry limit is reached it returns a read-timeout error to the SQL client (a minimal sketch follows the list).
  3. Add hot-update support for parameters such as the timestamp update interval of read-only replicas and the maximum number of read retries.
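
A minimal sketch of that read path is shown below; serveReadOnly, replicaTimestamp, and errReadTimeout are hypothetical names used for illustration, not ZNBase's actual identifiers:

```go
package sketch

import (
	"errors"
	"time"
)

var errReadTimeout = errors.New("read-only replica is lagging: read timed out")

// serveReadOnly serves a read on a read-only replica: the read proceeds only
// once the replica's timestamp has caught up with the request timestamp;
// otherwise it retries up to maxRetries times and then reports a timeout.
func serveReadOnly(requestTS int64, replicaTimestamp func() int64, read func() ([]byte, error),
	maxRetries int, retryInterval time.Duration) ([]byte, error) {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if replicaTimestamp() >= requestTS {
			return read() // replica is fresh enough for this request's timestamp
		}
		time.Sleep(retryInterval) // wait for the replica's next timestamp update
	}
	return nil, errReadTimeout // returned to the SQL client after the retry limit
}
```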

After adding the read-only role to the etcd-raft module, the following functions become available:

  • Allows users to create, delete, or move read-only replicas at specified locations. Read-only replicas support load-based rebalancing and are moved to less loaded nodes.
  • A read-only replica stores Raft logs and user data, does not vote, does not participate in leader elections, and serves only reads.
  • Allows users to configure the timestamp update interval of read-only replicas and the maximum number of read retries, which can be used to tune read performance on read-only replicas.
  • Allows users to query the configuration of read-only roles from the SQL client.

3. Log-only role

In a two-data-center, three-replica deployment, ZNBase does not support an active-active mode: no matter how the replicas are placed, one data center always holds more than half of them. If that data center fails, the other data center cannot provide service because its remaining replicas do not form a majority. To solve this problem, improve ZNBase's disaster recovery capability, make full use of resources instead of leaving them idle, and improve the service capability of active-active data centers, the project team added a log-only role to the etcd-raft module.

The specific implementation measures are as follows:

  1. Log-only replicas participate in leader elections, have voting rights, and can become the leader. In a failure scenario where the ordinary replicas lack the latest logs, the log-only replica must be elected leader in order to restore cluster availability; it appends its logs to the other replicas so that they catch up, then initiates a leader transfer so that a replica holding the latest logs becomes leader and leaseholder, completing cluster recovery.
  2. Log-only replicas cannot send snapshots. Because log-only replicas contain no user data, sending a snapshot from one would cause other replicas to lose data, so it is prohibited.
  3. Log-only replicas cannot be leaseholders. Reading data from a log-only replica is prohibited, and when a log-only replica becomes leader, leadership is transferred to another replica as soon as that replica has the latest logs.
  4. Log-only replicas retain their logs longer. A log-only replica's logs can be used for failure recovery, so their retention time is extended. The original cleanup policy truncates logs once the number of truncatable log indexes reaches 100 or their size reaches 64 KB, and, when a node is down, once the pending logs exceed 4 MB. For log-only replicas, truncation requests are instead batched and deferred by a configurable number of hours. The default value is 24, i.e. truncation requests are deferred for 24 hours to retain the logs; the configurable range is [-1, MaxInt], where -1 means no extra retention and truncation follows the original logic.
  5. Log-only replicas restarted in a different location. When a log-only replica is restarted elsewhere, it crashes because it tries to apply the Commit value carried in heartbeat messages. The follower's heartbeat handling is therefore modified: if a log-only follower receives a heartbeat whose Commit value is greater than its actual lastIndex, it sets the Reject field of the heartbeat response to true and the RejectHint field to its actual lastIndex. When the leader receives a heartbeat response with Reject set to true, it resets that follower's Match and Next progress values to the actual values and appends the missing log entries to the replica (a minimal sketch follows this list).
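
A minimal sketch of that heartbeat exchange is shown below, written against the raftpb message fields named in the text (Commit, Reject, RejectHint). The surrounding functions and the Match/Next parameters are simplified assumptions, not ZNBase's actual code:

```go
package sketch

import "go.etcd.io/etcd/raft/v3/raftpb"

// handleHeartbeatLogOnly is the modified follower-side handling for a log-only
// replica restarted elsewhere: if the leader's Commit is ahead of what the
// follower actually holds, reject instead of advancing the commit index.
func handleHeartbeatLogOnly(m raftpb.Message, lastIndex uint64) raftpb.Message {
	resp := raftpb.Message{Type: raftpb.MsgHeartbeatResp, To: m.From, From: m.To}
	if m.Commit > lastIndex {
		resp.Reject = true          // cannot accept a commit index beyond our log
		resp.RejectHint = lastIndex // the log index this follower actually has
	}
	return resp
}

// onHeartbeatResp is the leader-side reaction: a rejected heartbeat response
// resets the follower's progress so the missing entries are appended.
func onHeartbeatResp(resp raftpb.Message, match, next *uint64) {
	if resp.Reject {
		*match = resp.RejectHint
		*next = resp.RejectHint + 1
		// The leader then replicates entries from *next onward to the replica.
	}
}
```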

SQL syntax support (logonly) was added for the log-only role. The following ALTER statement is a configuration example: table t has 3 replicas, with 2 full-featured replicas placed in Beijing and Jinan and 1 log-only replica placed in Tianjin:

ALTER TABLE t CONFIGURE ZONE USING num_replicas=2, num_logonlys=1, constraints='{"+region=beijing": 1,"+region=jinan": 1}', logonly_constraints='{"+region=tianjin":1}';

Take a two-data-center, three-replica deployment (one full-featured replica, one strong synchronization replica, and one log-only replica) as an example. The full-featured replica and the strong synchronization replica are stored on the higher-spec machines (or the majority of machines) of DC-1 and DC-2 respectively, while the log-only replica is stored on a lower-spec machine (or a small number of machines) in DC-1 or DC-2, with its logs incrementally replicated to a lower-spec machine (or a small number of machines) in the other data center. In the event of a data-center-level failure that loses two replicas (the full-featured replica and the log-only replica), the node storing the log-only replica in the other data center is started manually; that replica contains the data recovered from the incrementally replicated logs.

Disaster recovery after a data-center-level failure (the log-only replica on Node7 must be restarted manually)

By adding these three new roles to the Raft algorithm, ZNBase's cross-region disaster recovery and OLAP capabilities have been significantly enhanced.

Summary

This article introduced the important role the Raft consensus algorithm plays in the distributed NewSQL database ZNBase, along with the three new Raft roles that the ZNBase team designed for its own business characteristics and needs, improving ZNBase's remote disaster recovery and HTAP capabilities. Beyond the new Raft roles, the ZNBase R&D team also added leader affinity election, hybrid serialization, Raft log separation with custom storage, and separation of Raft heartbeats from data. These four improvements are analyzed in detail in the follow-up article:

In-depth analysis of the optimization and transformation of the Raft module in ZNBase (Part 2)

More details about ZNBase can be found at:

Official code repository: https://gitee.com/ZNBase/zn-kvs

ZNBase official website: http://www.znbase.com/

If you have any questions about related technologies or products, please submit an issue or leave a message in the community for discussion.

We also welcome more developers interested in distributed databases to join our team!

Contact email: [email protected]

Further reading

In-depth analysis of the optimization and transformation of the Raft module in ZNBase (Part 2)

How is the HTAP database implemented? Analysis of the column storage engine in ZNBase
