Byzantine Fault Tolerance in Proof-of-stake protocols

If you are following the Casper research from the Ethereum Foundation, you will hear a lot of terms like “accountable safety” and “fault tolerance thresholds in an asynchronous timing model.” I want to sketch the important concepts in the underlying Byzantine Fault Tolerance theory so you have more context to some of the most exciting work in the Proof of Stake space today.

When talking about distributed consensus algorithms in the academic literature, we use an underlying model that assumes some properties about time (“how long will it take for this message to arrive?”) and some properties about the types of faults (“how can nodes in the protocol do the wrong thing?”). We do this so we can be very precise about the types of things we are talking about so we can in turn create formal proofs of various guarantees of the algorithms under investigation. These formal proofs are useful so we can convince ourselves and others that what we think will happen will actually happen. The possible number of states that a computer can get into, let alone a network of many of them, is so large that having some nifty tools from mathematics is a huge help when building real systems that do cool things.

Safety and liveness in protocols

Let’s quickly review what a protocol is so we all have the same context. Then we can talk about liveness and safety and what it means when either of those things have “failures” or “faults” (generally synonyms, btw). We can think of a protocol much like a game — there are a set of players and some rules. These rules dictate what the players should do in response to various events in the game. This “thing to do” usually results in a player changing some data they have, like the highest score so far or something. If our consensus protocol works well, then all players following all of the rules means that at any given time, every player will have the same value for their local state. This outcome becomes very important in decentralized protocols where you generally have a bunch of strangers on the internet who are just trying to take their best guess about what the nature of the world is.

Liveness refers to the fact that our consensus algorithm can’t get “stuck” meaning that as long as enough nodes (“players”) are still participating, progress towards consensus can be made. A “liveness failure” refers to the fact that we started in some state (like the genesis block) and then ran the protocol — processing messages we see “off the wire” from our peers and updating our local view of the global state accordingly — and then got “stuck”. Stuck here means that your computer could be happily waiting for more messages but there is in fact no message anyone (or some subset in some cases) could send you that will let you proceed with your calculation of the global state. You think about this like playing hopskotch, except you get stuck (potentially forever!) mid-jump between tiles 7 and 8.

Safety is a critical consensus property which just means that once a node has decided on some consensus value (according to the protocol) then that decision will also have been made across all nodes in the set of participants. Rephrased a little more simply, if you don’t have safety then you don’t have consensus. Our protocol fails if half of us think the current state is the number `1000` but the other half thinks it is `1234`. An important subtlety is that nodes can refrain from producing a consensus value — perhaps they need more time to figure out what the consensus state is from all the messages they have seen — but if a node is correctly following a protocol with safety then it will only commit to a particular value once it is certain that everyone else who is following the rules has done the same.

Timing models

The three main types of timing models usually used are the synchronous model, the asynchronous model and the partially synchronous model. Each of these models makes some guarantees about the length of time (“latency”) that can occur between the exchange of messages amongst nodes in a given round of the protocol execution. This categorization is important because in the distributed setting a single node cannot distinguish between a peer node that has failed and a peer node that is just taking a long time to respond. Taking this fact into account is central to designing robust algorithms for distributed consensus.

To take a more familiar example, let’s imagine we are sitting at a table with your friends playing a game of telephone. One player starts by whispering a word to the next player and then we all go around the table whispering the word to the next player in the sequence. The last player at the end reveals the word they hear to everyone and every player can compare their word to the starting word. Laughs usually ensue. We can actually distinguish two types of latency here: 1) the time it takes a player to whisper the word to the next player once they hear it; 2) the time it takes for the word to travel from player to player. The first kind of latency is important to our distributed computing context — imagine a “leader” node that hits an infinite loop due to a bug in their local code and never “passes the torch” so that the “followers” in the group can proceed in forming consensus without the failed leader — however, the second kind of latency is usually given more analysis because it usually dominates in absolute terms (and perhaps because authors assume we can write bug-free code?). To pass a message in the telephone game, you just have to whisper to your neighbor and this doesn’t seem to take very long. In the distributed computing context, you could be trying to reach a computer on the other side of the world at which point the time it takes to transmit a message starts to add up. Let us now look at each timing model.

In the synchronous model, there is some maximum value (“upper bound”) T on the time between when a node sends a message and when you can be certain that the receiving node hears the message. You also have an upper bound P on the relative difference in speed between nodes (so you can account for machines with slow processors). We can imagine some idealized setting where we only have a handful of computers (perhaps all in the same room) wired with perfectly reliable links who are exchanging messages. In this case, we would expect modern machines to be able to send messages down the wire in much less than a second; this value is our T. Given our use of modern machines, let’s say all with the same kind of processor direct from the fab, then we will have the same speed within some tolerance across nodes; our P value.

In the asynchronous model, we remove both upper bounds T and P. Messages can take arbitrarily long to reach peers and each node can take an arbitrary amount of time to respond. When we say arbitrary, we include “infinity” meaning that it takes forever for some event to occur. This model is important because it more accurately captures the public Internet where nodes on the network (including intermediate routers) fail and messages are dropped all the time.

The partially synchronous model in a mix of the two: upper bounds exist for T and P but the protocol designer does not know them and the task is designing mechanisms that still come to consensus in light of this fact. In practice, protocol implementers can achieve systems resembling this model given the realistic characteristics of modern networks/machines (messages usually get where they are going) and use of tactics like timeouts to indicate when a node should retry sending a message. There are several classes of algorithms from the traditional BFT literature (e.g. Paxos and the newer Raft) that take advantage of this insight.

How some of this applies in Casper

All of the above context is quite relevant to Casper, given that it aims to be a decentralized consensus protocol for the Ethereum blockchain. Not only do we have the typical class of problems — buggy nodes and spotty networks — we also now have actual incentives for nodes to misbehave. Successful attacks on Casper consensus could allow for someone to perform a double-spend attack or something I like to call a Soros attack (cf. George Soros and Black Wednesday) where the attacker shorts ETH and then inspires a massive panic sell-off when Twitter starts spreading the news that Ethereum has a consensus failure.

There is a lot more to say about the details of consensus in the decentralized setting and given the economic incentives involved with cryptocurrencies the space of attacks gets even bigger. I would suggest starting here to dive deeper: https://blog.cosmos.network/understanding-the-basics-of-a-proof-of-stake-security-model-de3b3e160710 and if you want to drink from the firehose you can join the research on https://ethresear.ch/.

https://hackernoon.com/byzantine-fault-tolerance-in-proof-of-stake-protocols-ab9746521b24