Open-sourcing Dragonboat, a multi-group Raft library capable of tens of millions of operations per second

Introducing Dragonboat, an open-source multi-group Raft library implemented in Go, released under the Apache-2.0 license. Everyone is welcome to try it out, and a star on GitHub is appreciated: https://github.com/lni/dragonboat

In layman's terms, this is a distributed consensus protocol library that applications can use to distribute and store data on multiple machines. As long as more than half of the machines are online, the data and services remain available. This avoids data and service outages caused by the failure of individual machines or the network, improving system availability. It provides a strong consistency guarantee called linearizability: externally, the multiple copies of the data behave as if there were a single copy, without the troublesome stale reads that are common in systems offering only eventual consistency.

Consensus libraries based on the Raft protocol are already used in many Internet backend systems. After talking to many users, the obstacle most commonly reported is the lack of a reliable, high-performance, out-of-the-box, general-purpose implementation of the consensus protocol that is almost entirely transparent to the application.

Advantages

Dragonboat addresses all of the application hurdles above:

  • One of the most thoroughly tested open-source Raft libraries; all implementation code, test tools, and test results are open source.
  • The open-source Raft library with the highest throughput: tens of millions of writes per second overall, with more than 400,000 writes per second per core on average.
  • One of the most feature-complete open-source Raft libraries; it makes no special assumptions about the application, maximizing versatility while remaining safe and reliable.

At the same time, the Go implementation of Dragonboat has received extensive performance optimization and polish. With industry demand for high-performance Go systems continuing to grow, it can also serve as a reference for engineers about to build such systems themselves.


Features and usage

Dragonboat implements nearly all of the features described in the Raft paper and is the most feature-complete Go implementation of the Raft protocol:

  • Leader election and log replication
  • Snapshotting
  • High-performance strongly consistent reads based on the ReadIndex protocol
  • Leadership transfer
  • Non-voting nodes
  • Node quiescing (silent nodes)
  • Quorum self-check
  • Application-transparent idempotent update support
  • Fully asynchronous execution and snapshotting

Dragonboat is very convenient to use. Unlike the etcd Raft library, Dragonboat does not require users to participate in any operations related to the internal state of the Raft protocol, minimizing both the cost of using Raft and the likelihood of human error. The user first implements an IStateMachine interface that describes how update and query requests are executed. Only four methods of this interface must be implemented: updating the StateMachine, querying it, and creating and restoring its snapshots. Project experience shows that for a simple in-memory key-value database, a prototype StateMachine can be written in a few minutes.
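For illustration, here is a minimal sketch of such a state machine for an in-memory key-value store. The method names and signatures below are stand-ins for the four roles described above; the exact IStateMachine interface is defined in the Dragonboat repository and may differ, so treat this purely as a sketch.

```go
package example

import (
	"encoding/gob"
	"encoding/json"
	"io"
	"sync"
)

// KVStateMachine is an illustrative in-memory key-value state machine.
type KVStateMachine struct {
	mu   sync.RWMutex
	data map[string]string
}

func NewKVStateMachine() *KVStateMachine {
	return &KVStateMachine{data: make(map[string]string)}
}

// Update applies a committed proposal, here a JSON-encoded {"Key":..., "Val":...}.
func (s *KVStateMachine) Update(cmd []byte) error {
	var kv struct{ Key, Val string }
	if err := json.Unmarshal(cmd, &kv); err != nil {
		return err
	}
	s.mu.Lock()
	s.data[kv.Key] = kv.Val
	s.mu.Unlock()
	return nil
}

// Lookup serves a read-only query against the current state.
func (s *KVStateMachine) Lookup(key []byte) ([]byte, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return []byte(s.data[string(key)]), nil
}

// SaveSnapshot writes the full state to w so a slow or new node can catch up.
func (s *KVStateMachine) SaveSnapshot(w io.Writer) error {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return gob.NewEncoder(w).Encode(s.data)
}

// RecoverFromSnapshot rebuilds the state from a previously saved snapshot.
func (s *KVStateMachine) RecoverFromSnapshot(r io.Reader) error {
	data := make(map[string]string)
	if err := gob.NewDecoder(r).Decode(&data); err != nil {
		return err
	}
	s.mu.Lock()
	s.data = data
	s.mu.Unlock()
	return nil
}
```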

With such an IStateMachine implementation in place, the application can issue requests through Dragonboat's API as needed. The system processes each request strictly according to the Raft protocol and finally applies it to the user's IStateMachine instance to complete the state update or query.
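Concretely, assuming the KVStateMachine sketch above, the application encodes its command into a byte slice and hands it to the library as a proposal; once the Raft group commits the proposal, the same bytes reach the state machine's Update method. The helper below is purely illustrative, and the command format is an application-level choice:

```go
package example

import "encoding/json"

// kvCmd is the payload format the KVStateMachine sketched above expects.
type kvCmd struct {
	Key string
	Val string
}

// encodeSet builds the bytes handed to the library as a proposal; after the
// Raft group commits it, the library passes the same bytes to Update.
func encodeSet(key, val string) ([]byte, error) {
	return json.Marshal(kvCmd{Key: key, Val: val})
}
```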

The detailed examples (with Chinese documentation) that ship with Dragonboat let users understand the entire usage flow in about ten minutes, observe up close the fault tolerance the consensus protocol brings to the system, and try out operations such as making proposals, performing strongly consistent reads, changing Raft group membership, and restarting nodes.

Design and Implementation Overview

The core component of Dragonboat is the NodeHost, deployed on each server in the network. Typically, each server runs one NodeHost instance, which allocates compute, storage, and communication resources and manages the member nodes of the various Raft groups running on that server.

NodeHost also provides a facade interface for the supported services. Through its API, users can initiate read/write or membership-change requests, start or stop a member node, request a leadership transfer, or query the current group membership. See the examples linked above to learn more about using NodeHost.
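As a rough orientation, the sketch below summarizes the kinds of operations such a facade exposes. All names, parameters, and return types here are hypothetical and only approximate the real NodeHost API; consult the repository and its examples for the actual methods.

```go
package example

import "context"

// raftFacade summarizes the kinds of operations a NodeHost-style facade
// exposes. Everything here is hypothetical; see the Dragonboat repository
// for the real NodeHost API.
type raftFacade interface {
	// StartNode launches a member node of the given Raft group on this host.
	// createSM returns the user's IStateMachine implementation (for example
	// the KVStateMachine sketched earlier) for that group and node.
	StartNode(groupID, nodeID uint64, peers map[uint64]string,
		createSM func(groupID, nodeID uint64) interface{}) error
	// StopNode stops a member node running on this host.
	StopNode(groupID, nodeID uint64) error
	// Propose submits an update; once the group commits it, the payload is
	// applied to the state machine.
	Propose(ctx context.Context, groupID uint64, cmd []byte) error
	// Read performs a strongly consistent (linearizable) read via ReadIndex.
	Read(ctx context.Context, groupID uint64, query []byte) ([]byte, error)
	// AddNode and RemoveNode request Raft group membership changes.
	AddNode(ctx context.Context, groupID, nodeID uint64, address string) error
	RemoveNode(ctx context.Context, groupID, nodeID uint64) error
	// TransferLeadership asks the current leader to hand over to the target node.
	TransferLeadership(groupID, targetNodeID uint64) error
	// GetMembership returns the current members of the group (node ID -> address).
	GetMembership(ctx context.Context, groupID uint64) (map[uint64]string, error)
}
```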

Each Raft member node contains the following instances:

  • Raft protocol state
  • Raft group membership
  • The application's IStateMachine
  • Session state to support idempotent updates

The application's IStateMachine is implemented by the user; everything else is implemented by the system and is completely transparent to the user.
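The sketch below, using entirely hypothetical type names, illustrates how these four pieces fit together inside a single member node; only the state machine comes from the application.

```go
package example

// appStateMachine stands in for the user-implemented IStateMachine.
type appStateMachine interface {
	Update(cmd []byte) error
	Lookup(query []byte) ([]byte, error)
}

// raftState is a placeholder for the Raft protocol state kept per node
// (current term, vote, commit index, and so on).
type raftState struct {
	Term        uint64
	Vote        uint64
	CommitIndex uint64
}

// memberNode sketches how the pieces listed above fit together inside a
// single Raft member node; only the state machine is supplied by the user,
// the rest is managed by the library and invisible to the application.
type memberNode struct {
	state      raftState         // Raft protocol state
	membership map[uint64]string // Raft group membership: node ID -> address
	sm         appStateMachine   // the application's IStateMachine
	sessions   map[uint64]uint64 // client session state enabling idempotent updates
}
```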

To natively support a large number of Raft groups, batching and pipelining optimizations are carefully applied throughout; the execution engine that drives the update execution of each Raft group is a good example. Taking proposals as an example, each Raft group is assigned to one of several execution shards to provide parallelism, and each shard is a multi-stage pipeline. Different kinds of processing (I/O-intensive, memory-access-intensive, and so on) run concurrently in different stages of the pipeline, fully exploiting concurrency to make message delivery, update execution, and other operations asynchronous.
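The toy model below illustrates only the sharding and pipelining idea; it is not Dragonboat's actual execution engine. Groups are hashed onto a fixed set of shards, and each shard runs its pipeline stages in separate goroutines connected by channels, so I/O-bound and memory-bound work for different batches can overlap.

```go
package example

import "sync"

const numShards = 16

// proposal represents a pending update for one Raft group.
type proposal struct {
	groupID uint64
	cmd     []byte
}

// engine is a toy model of a sharded, pipelined execution engine.
type engine struct {
	persistCh [numShards]chan proposal // stage 1: persist log entries (I/O bound)
	applyCh   [numShards]chan proposal // stage 2: apply committed entries (memory bound)
	wg        sync.WaitGroup
}

func newEngine() *engine {
	e := &engine{}
	for i := 0; i < numShards; i++ {
		e.persistCh[i] = make(chan proposal, 1024)
		e.applyCh[i] = make(chan proposal, 1024)
		e.wg.Add(2)
		go e.persistLoop(i) // each shard gets its own pipeline workers
		go e.applyLoop(i)
	}
	return e
}

// submit routes a proposal to its shard; groups on different shards are
// processed in parallel.
func (e *engine) submit(p proposal) {
	e.persistCh[p.groupID%numShards] <- p
}

func (e *engine) persistLoop(shard int) {
	defer e.wg.Done()
	for p := range e.persistCh[shard] {
		// A real engine would batch entries here and write them to stable
		// storage; this sketch simply hands the proposal to the next stage.
		e.applyCh[shard] <- p
	}
	close(e.applyCh[shard])
}

func (e *engine) applyLoop(shard int) {
	defer e.wg.Done()
	for p := range e.applyCh[shard] {
		// A real engine would apply committed entries to the group's
		// state machine here.
		_ = p
	}
}
```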

Dragonboat's log storage component is called LogDB. It uses RocksDB by default but can easily be replaced with another key-value store; the default RocksDB adapter is only about 350 lines of Go code. Message transport between NodeHosts is handled by a component called Raft RPC, with two default implementations based on TCP and gRPC, both of which support mutual TLS and can easily be adapted to other transports.
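For a sense of what such a pluggable log store needs to provide, here is a hypothetical minimal interface. Dragonboat's real LogDB interface is richer, so treat this only as a sketch of the core responsibilities an adapter built on RocksDB or another key-value store would have to cover.

```go
package example

// logEntry is a simplified Raft log entry.
type logEntry struct {
	Index uint64
	Term  uint64
	Cmd   []byte
}

// logStore is a hypothetical minimal interface for a pluggable Raft log
// storage backend in the spirit of LogDB.
type logStore interface {
	// SaveEntries durably appends log entries, typically in batches.
	SaveEntries(groupID, nodeID uint64, entries []logEntry) error
	// GetEntries returns the entries in [low, high) for replay or replication.
	GetEntries(groupID, nodeID uint64, low, high uint64) ([]logEntry, error)
	// SaveState durably records Raft state such as the current term and vote.
	SaveState(groupID, nodeID uint64, term, vote uint64) error
	// Compact drops entries up to the given index once a snapshot covers them.
	Compact(groupID, nodeID uint64, index uint64) error
	// Close releases the underlying resources.
	Close() error
}
```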


Testing and correctness checking

Dragonboat has undergone extremely rigorous testing. The following concrete numbers and facts should be more convincing than slogans about how reliable the software is:

  • Tens of millions of test runs per night, tens of billions in total over the past year, exercising node restarts and network partition recovery under randomized combinations of behaviors; these found and fixed multiple Raft implementation bugs, including some in etcd
  • Tens of thousands of lines of hand-written, high-coverage test code; the roughly 3,000 lines at the core of the Raft protocol are guarded by as many as 10,000 lines of tests
  • Ported and passes all of etcd's Raft-related tests, covering the low-probability corner cases etcd has accumulated over the years
  • Systematic testing methods: unit and integration tests, Jepsen tests, fuzz testing with random inputs, randomized behavior-combination tests, I/O error injection tests, and file-system power-loss tests
  • Comprehensive checks: linearizability, consistency of the application state machine state, consistency of Raft group membership, Raft group availability, and consistency of Raft log entries on disk

In the randomized behavior-combination tests, the I/O histories of some Raft groups are recorded so that the system's linearizability can be checked with Jepsen's Knossos tool. These histories have also been open sourced.


Performance analysis

Across three 22-core 2.8 GHz servers, with 16-byte payloads and 48 Raft groups, Dragonboat sustains 9 million writes per second, or 11 million mixed read/write operations per second at a 9:1 read-to-write ratio. High throughput is maintained even in cross-region, high-latency scenarios.


Increasing the number of actively used groups reduces throughput because batching becomes harder, but throughput remains on the order of millions of operations per second. A large number of idle groups has no significant effect on throughput.

Regarding Dragonboat's write latency: at 8 million 16-byte writes per second, the P99 write latency stays below 5 milliseconds. Because the ReadIndex protocol requires no disk writes, read latency is usually significantly lower than write latency.

Go's GC has the lowest stop-the-world (STW) pauses among mainstream languages, which matters greatly in scenarios sensitive to latency and latency variance. Under a load of tens of millions of requests per second, STW pauses are around 400 microseconds, and Go 1.12 is expected to roughly halve them again. In this test scenario there were on average 3 GC cycles per second, measured over 120 consecutive GC cycles.

Dragonboat is optimized for multi-group Raft; single-group performance has not been specifically optimized. The current version sustains 1.25 million 16-byte writes per second with a single Raft group, with an average latency of 1.3 milliseconds and a P99 latency of 2.6 milliseconds, while the three servers together use about nine 2.8 GHz CPU cores, i.e. three cores per server on average.
