Tens of billions of drills: distributed system testing in hell mode

This article takes the recently open-sourced Dragonboat multi-group Raft library as an example to show how a typical distributed system like Dragonboat is tested. Dragonboat is implemented in Go and can deliver more than 10 million reads and writes per second on common hardware, making it currently the fastest feature-complete multi-group Raft open source library on github.com. You are welcome to try it out, and please star the project to show your support: Dragonboat

The biggest misconception

One often sees a system touting its reliability by pointing out that some large-scale event used it, or that some internal project at some company runs on it, as if that settled the question; in the mouth of cheap PR, the production environment has become the test platform. In fact, as is well known, the number of nodes that unexpectedly crash and restart during any single event is tiny, annual disk failure rates are only around 2-4%, and many DevOps engineers run into a real network partition only a handful of times in their entire careers, so much so that the stories of partition-induced failures collected from first-tier companies over the years fill just a few pages. That some event or some project uses a piece of software is neither a sufficient nor a necessary condition for its reliability.

The truth is cruel. I once read the consensus library of one of the top four Internet companies in China, and within 30 minutes of reading the code I had found multiple bugs that could lose or corrupt data. The implementation is a typical case of running naked, with tests flatly refused. For software, any cheap marketing claim that cannot be verified with code is best neither spoken, looked at, nor listened to.

In contrast to cheap propaganda, Dragonboat treats the correctness of the system with awe, and honestly backs up its guarantees with a complete testing scheme, open source test code, and publicly verifiable test result data.

Routine testing

For routine testing, Dragonboat does the following:

  • The test coverage of each package in Dragonboat is basically above 90%. Given the way the Raft protocol is implemented, incorrectly deleting a single line of code will almost always trigger multiple test failures. These tests, together with the monkey testing described below, run in nightly builds with the race detector turned on.
  • Golang's much-criticized error handling style does make it easy to miss an error path through simple human negligence, but the many static analysis tools bundled in gometalinter can catch most such cases.
  • Fuzz testing every input may sound like overkill at first, but there are plenty of examples of bugs found after only tens of seconds of fuzzing, and the development of Dragonboat has run into this as well; a minimal fuzz test sketch follows.
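As an illustration of the fuzzing point above, here is a minimal sketch of a fuzz test using Go's native fuzzing support (go test -fuzz); the decodeEntry function is a hypothetical stand-in for any routine that parses untrusted bytes, not an actual Dragonboat API, and older tools such as dvyukov/go-fuzz are driven in much the same way.

```go
package codec

import "testing"

// decodeEntry is a hypothetical stand-in for a decoder under test.
func decodeEntry(data []byte) (index uint64, term uint64, err error) {
	// real decoding logic would live here
	return 0, 0, nil
}

func FuzzDecodeEntry(f *testing.F) {
	// Seed corpus: a few hand-written inputs the fuzzer mutates from.
	f.Add([]byte{})
	f.Add([]byte{0x01, 0x02, 0x03})
	f.Fuzz(func(t *testing.T, data []byte) {
		// The only requirement: never panic or hang on arbitrary input;
		// returning an error for malformed data is perfectly fine.
		_, _, _ = decodeEntry(data)
	})
}
```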

Self-checks based on the Raft protocol

The Raft protocol imposes strict requirements on its internal data, some of them obvious, for example:

  • The Index values of all entries must be continuous and strictly increasing
  • The Term values of all entries must be monotonically non-decreasing
  • Once Index and Term are determined, the content of the entry is uniquely determined

All of these provide opportunities for runtime self-checks at very small cost. Taking just the first item as an example, Dragonboat implements the check at multiple points, and the self-checks are transparent to the application:

  • When a node has completed the checks required by the protocol and is about to append entries to its log
  • After an entry has been committed and is about to be handed back by the Raft protocol for execution by the replicated state machine
  • When the replicated state machine is about to execute an entry and its Index is compared against the Index of the most recently executed entry

These self-checks cannot be turned off by any setting in Dragonboat, and they remain in force even during benchmarking.
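As a concrete illustration of the first invariant, below is a minimal sketch of such a runtime self-check, assuming a hypothetical Entry type; it is not Dragonboat's actual code, it only shows the shape of the check performed before entries are appended.

```go
package raft

import "fmt"

// Entry is a simplified, hypothetical Raft log entry.
type Entry struct {
	Index uint64
	Term  uint64
	Cmd   []byte
}

// checkAppend panics if the entries to be appended do not form a continuous,
// strictly increasing Index sequence starting right after lastIndex.
func checkAppend(lastIndex uint64, entries []Entry) {
	expected := lastIndex + 1
	for _, e := range entries {
		if e.Index != expected {
			// A violated invariant means a protocol implementation bug;
			// fail loudly rather than silently corrupting the log.
			panic(fmt.Sprintf("log index gap: expected %d, got %d",
				expected, e.Index))
		}
		expected++
	}
}
```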

Disk file IO test

To see how difficult it is to get disk file IO right, first consider two facts:

Assuming that the file system itself is reliable sounds justified, right? Unfortunately, that assumption is itself a high-risk move.

This summary chart from T. S. Pillai's work (the OSDI '14 paper "All File Systems Are Not Created Equal") visualizes how hard it is to get file IO right.

To cope with these disk file IO challenges, Dragonboat first of all chooses not to be clever and scatter hand-rolled file operations everywhere; it hands storage over to RocksDB as much as possible, and then subjects the RocksDB-based system to various tests:

  • Disk error injection testing using ScyllaDB's charybdefs
  • Power-down data integrity testing using an automatically controlled power switch

The former can simulate scenarios such as RocksDB receiving an error on the second of two read operations while trying to read the contents of an sst file, which helps check that Dragonboat handles such IO errors exactly as designed. The power-down test checks whether fsync is correctly configured (for example, whether fcntl(fd, F_FULLFSYNC) is used on macOS) and actually called, and whether the IO logic can lose data.
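To make the macOS point concrete, here is a small sketch of a durable-sync helper built on golang.org/x/sys/unix; it is illustrative only and not taken from Dragonboat or RocksDB, which handle this internally. On other platforms a plain f.Sync() is used instead, and the power-down test exists precisely to catch a missing or misplaced call that no code review is guaranteed to spot.

```go
//go:build darwin

package storage

import (
	"os"

	"golang.org/x/sys/unix"
)

// fullSync asks the drive itself to flush its write cache. On macOS a plain
// fsync only pushes data to the drive, not necessarily onto stable media,
// so durable writes need F_FULLFSYNC.
func fullSync(f *os.File) error {
	_, err := unix.FcntlInt(f.Fd(), unix.F_FULLFSYNC, 0)
	return err
}
```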

After reading the above, some may think this is making a mountain out of a molehill: what is so difficult about file operations that people have been doing since Turbo C? Well, do the authors of git and ZooKeeper have less experience with file manipulation than you, or are you simply more talented than they are?

as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don't know we don't know.

Donald Rumsfeld

Monkey Testing

Monkey Testing is sometimes referred to as Chaos Engineering; its purpose is to automatically test whether the system can still provide the service it was designed to provide while its components are failing. Unlike fuzz testing, which feeds random data as input, Monkey Testing / Chaos Engineering looks at how random disruptive events affect the system.

In Dragonboat's monkey testing, combinations of various random destructive events are injected into a multi-node test environment. Over more than a year of automated testing this produced tens of billions of Raft node restart events, and it uncovered and fixed many bugs in the Raft protocol implementation and its auxiliary functionality. The whole testing process, and the time it consumes, can fairly be called hell mode. Specifically, the random events injected during monkey testing are the following (a minimal injection-loop sketch follows the list):

  • Randomly stop any node
  • Randomly delete all of a node's Raft data
  • Randomly discard messages in transit
  • Randomly partition the network so that nodes are temporarily unable to communicate
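A minimal sketch of driving such events is shown below; the testNode interface, its methods, and the chosen probabilities are all hypothetical, they only illustrate the idea and are not Dragonboat's monkey test code.

```go
package monkey

import (
	"math/rand"
	"time"
)

// testNode is a hypothetical handle to one node of the test cluster.
type testNode interface {
	Stop()                     // kill the node
	Restart()                  // bring it back up
	RemoveData()               // wipe all Raft data on the node
	Partition(d time.Duration) // block its network traffic for d
	SetDropRate(p float64)     // randomly drop this fraction of messages
}

// injectFaults keeps applying random destructive events until stop is closed.
func injectFaults(nodes []testNode, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		case <-time.After(time.Duration(1+rand.Intn(5)) * time.Second):
		}
		n := nodes[rand.Intn(len(nodes))]
		switch rand.Intn(4) {
		case 0: // plain restart
			n.Stop()
			n.Restart()
		case 1: // restart with all data wiped; recovery must come from peers
			n.Stop()
			n.RemoveData()
			n.Restart()
		case 2: // temporary network partition
			n.Partition(10 * time.Second)
		case 3: // drop a fraction of in-flight messages for a while
			n.SetDropRate(0.1)
		}
	}
}
```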

With the above random destructive events being injected in large numbers, a large number of Raft group instances run concurrently in this multi-node test environment to exercise Raft reads and writes. The monkey testing environment also runs a built-in three-node Drummer system. The three Drummer nodes observe and maintain the health information of every Raft group; when they find that a Raft group member has failed, they perform a membership change on another node, adding and starting a new Raft member to replace the failed one.

The three-node Drummer is itself a system with no single point of failure, built on Dragonboat's Raft implementation, and the same random faults are injected into it during monkey testing. Drummer's business logic of monitoring and repairing the Raft groups therefore has to run while Drummer itself faces a large number of injected random destructive events, which further verifies the reliability of Dragonboat's Raft implementation under this concrete, realistic kind of business logic.
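A hedged sketch of the repair behaviour described above might look like the following; groupInfo, cluster, and their methods are hypothetical stand-ins rather than the actual Drummer API.

```go
package drummer

import "time"

// groupInfo is a hypothetical health record for one Raft group.
type groupInfo struct {
	GroupID     uint64
	FailedNodes []uint64 // members considered dead
	SpareNodes  []uint64 // nodes able to host a replacement
}

// cluster is a hypothetical view of the test cluster.
type cluster interface {
	GroupHealth() []groupInfo
	AddMember(groupID, newMemberID, targetNode uint64) error
	RemoveMember(groupID, memberID uint64) error
}

// repairLoop periodically replaces failed Raft group members. Membership
// changes go through Raft itself, so they only succeed while the group
// still has a quorum.
func repairLoop(c cluster, newID func() uint64, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		case <-time.After(10 * time.Second):
		}
		for _, g := range c.GroupHealth() {
			for i, dead := range g.FailedNodes {
				if i >= len(g.SpareNodes) {
					break // no spare node left for this group
				}
				if err := c.AddMember(g.GroupID, newID(), g.SpareNodes[i]); err != nil {
					continue // try again on the next pass
				}
				_ = c.RemoveMember(g.GroupID, dead)
			}
		}
	}
}
```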

With each node surviving for only a few minutes on average, random failure and restart tests on the order of tens of millions of node events can be completed every night on just a few servers. In this extremely harsh test environment, Raft read and write requests are continuously applied to the system, together with a large number of background Raft snapshot save and snapshot recovery operations, while strict after-the-fact checks ensure that:

  • Linearizability, checked with Jepsen's Knossos and with porcupine, is never violated
  • A Raft group remains available whenever a quorum of its members is available
  • The state of the user application's state machine is consistent across replicas
  • Raft group membership is consistent across nodes
  • The Raft entry logs saved on disk are consistent

A portion of the edn-format logs readable by Jepsen has been published, so anyone can check them with the linearizability checker of their choice: https://github.com/lni/knossos-data
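For readers who want to run such a check themselves, the sketch below shows the general shape of a linearizability check with the Go porcupine checker (github.com/anishathalye/porcupine); the single-register model and the tiny hand-written history are illustrative only, and parsing the published edn files into porcupine operations is left out.

```go
package main

import (
	"fmt"

	"github.com/anishathalye/porcupine"
)

// regInput describes one operation against a single shared register.
type regInput struct {
	write bool
	value int
}

func main() {
	// Sequential specification of a register: writes set the value,
	// reads must return the current value.
	model := porcupine.Model{
		Init: func() interface{} { return 0 },
		Step: func(state, input, output interface{}) (bool, interface{}) {
			in := input.(regInput)
			if in.write {
				return true, in.value
			}
			return output.(int) == state.(int), state
		},
		Equal: func(a, b interface{}) bool { return a == b },
	}
	// Two overlapping operations: a write of 42 and a read that saw 42.
	history := []porcupine.Operation{
		{ClientId: 0, Input: regInput{true, 42}, Output: 0, Call: 0, Return: 10},
		{ClientId: 1, Input: regInput{false, 0}, Output: 42, Call: 5, Return: 15},
	}
	fmt.Println("linearizable:", porcupine.CheckOperations(model, history))
}
```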
