[Translation] Distributed systems

 Original link: Distributed systems

Introduction

I wish there was a text that would bring together the ideas behind many of the latest distributed systems - such as Amazon's Dynamo, Google's BigTable and MapReduce, Apache's Hadoop, and more.

In this text, I try to provide a more accessible introduction to distributed systems. To me, this means two things: introducing the key concepts you will need in order to have a good time reading more serious texts, and providing a narrative that covers things in enough detail that you get a gist of what's going on without getting bogged down in details. It's 2013, you have the internet, and you can selectively read more about the topics you find most interesting.

In my opinion, a big part of distributed programming is dealing with the impact of two consequences of distribution:

  • information travels at the speed of light
  • Independent things fail independently

In other words, the core of distributed programming is dealing with distance (duh!) and having more than one thing (duh!). These constraints define the possible system design space, and I hope that after reading this you have a better understanding of the interplay of distance, time, and consistency models.

This text focuses on the distributed programming and systems concepts you need to understand commercial systems in the data center. It's impossible to try to cover everything. You'll learn many key protocols and algorithms (covering, for example, some of the most cited papers in the discipline), including some new and exciting ways of looking at eventual consistency that haven't yet made it into college textbooks, such as CRDTs and the CALM theorem.

I hope you like it! Follow me on Github (or Twitter) if you want to say thanks. If you find a bug, please submit a pull request on Github.


1. Basics

Chapter 1 provides a high-level overview of distributed systems and introduces some important terms and concepts. It covers high-level goals such as scalability, availability, performance, latency, and fault tolerance; how these goals are difficult to achieve, and how abstractions and models and partitioning and replication come into play.

2. Up and down the levels of abstraction

Chapter 2 dives deeper into abstractions and impossibility results. It begins with a quote from Nietzsche, then introduces system models and the many assumptions that are made in a typical system model. It then discusses the CAP theorem and summarizes the FLP impossibility result. Then we turn to one of the implications of the CAP theorem, namely that one ought to explore other consistency models. A number of consistency models are then discussed.

3. Time and order

An important part of understanding distributed systems is understanding time and order. To the extent that we fail to understand and model time, our systems will fail. Chapter 3 discusses time and order, and clocks, as well as the various uses of time, order, and clocks (such as vector clocks and failure detectors).

4. Replication: Preventing divergence

Chapter 4 introduces the replication problem and the two basic ways in which it can be performed. It turns out that most of the relevant characteristics can be discussed with just this simple characterization. Then, replication methods for maintaining single-copy consistency are discussed, from the least fault tolerant (2PC) to Paxos.

5. Replication: accepting divergence

Chapter 5 discusses replication with weak consistency guarantees. It introduces a basic reconciliation scenario in which partitioned replicas try to reach agreement. It then discusses Amazon's Dynamo as an example of a system design with weak consistency guarantees. Finally, two perspectives on disorderly programming are discussed: CRDTs and the CALM theorem.

Appendix

The appendices include suggestions for further reading.


*: This is a lie. This article by Jay Kreps elaborates.

1. Distributed systems at a high level

Distributed programming is the art of using multiple computers to solve the same problem that can be solved on one computer.

Any computer system needs to accomplish two basic tasks:

  • storage and
  • computation

Distributed programming is the art of using multiple computers to solve the same problem that can be solved on a single computer - usually because the problem is no longer suitable for a single computer.

There's no real requirement for you to use a distributed system. If we had unlimited money and unlimited R&D time, we wouldn't need distributed systems. All computation and storage could be done on a magic box - a single, incredibly fast and incredibly reliable system that you pay someone else to build for you.

However, few people have unlimited resources. So they have to find their niche on some real-world cost-benefit curve. On a small scale, upgrading hardware is a viable strategy. However, as the problem size increases, you reach a point where either there is no hardware upgrade that can solve the problem on a single node, or the cost of upgrading the hardware is too high. At that point, welcome to the world of distributed systems.

The current reality is that the best value is in mid-range, commodity hardware - as long as maintenance costs are kept low by fault-tolerant software.

High-end hardware is mostly beneficial to the extent that it can replace slow network accesses with internal memory accesses. The performance advantage of high-end hardware is limited in tasks that require large amounts of communication between nodes.

As shown in the figure above, data from Barroso, Clidaras, and Hölzle show that the performance gap between high-end and commodity hardware decreases as the cluster size increases, assuming a uniform memory access pattern exists across all nodes.

Ideally, adding a new machine will linearly increase the performance and capacity of the system. But of course this is not possible, because there is some overhead due to having a separate computer. Data needs to be replicated between computers, computing tasks need to be coordinated, and so on. That's why it's worth studying distributed algorithms - they provide efficient solutions to specific problems, as well as guidance on what's possible, what's the lowest cost to implement correctly, and what's impossible.

This text focuses on distributed programming and systems that take place in a mundane but business-relevant environment: the data center. For example, I won't discuss specific issues that arise from having exotic network configurations, or issues that arise in a shared memory environment. Also, the focus is on exploring the system design space rather than optimizing any particular design - the latter being a more specialized topic.

What we want to achieve: Scalability and other goodies

Everything starts with the need to deal with size, I think.

Most things are trivial at a small scale - and the same problem becomes much harder once you surpass a certain size, volume, or other physical constraint. It's easy to lift a piece of chocolate, hard to lift a mountain. It's easy to count how many people are in a room, hard to count how many people are in a country.

So everything starts with size - scalability. Informally speaking, in a scalable system, as we move from small to large, things should not get incrementally worse. Here's another definition:

Scalability

is the ability of a system, network, or process to handle a growing amount of work, and its ability to be enlarged to accommodate that growth.

What is growing? Well, you can measure growth in almost any terms (number of people, electricity usage, etc.). But there are three particularly interesting things to look at:

  • Size scalability: adding more nodes should make the system linearly faster; growing the dataset should not increase latency
  • Geographic scalability: it should be possible to use multiple data centers to reduce the time it takes to respond to user queries, while dealing with cross-data-center latency in some sensible manner.
  • Administrative scalability: adding more nodes should not increase the administrative costs of the system (e.g. the administrators-to-machines ratio).

Of course, in a real system growth occurs on multiple different axes simultaneously; each metric captures only some aspect of growth.

A scalable system is one that continues to meet the needs of its users as scale increases. There are two particularly relevant aspects - performance and availability - which can be measured in many ways.

Performance (and latency)

Performance

is characterized by the amount of useful work accomplished by a computer system compared to the time and resources used.

Depending on the context, this may involve achieving one or more of the following:

  • Short response time / low latency for a given piece of work
  • High throughput (rate of processing work)
  • Low utilization of computing resources

There are tradeoffs involved in optimizing for any of these outcomes. For example, a system may achieve higher throughput by processing larger batches of work, thereby reducing operational overhead. The tradeoff would be longer response times for individual pieces of work due to batching.

I find that low latency - achieving a short response time - is the most interesting aspect of performance, because it has a strong connection with physical (rather than financial) limitations. It is harder to address latency using financial resources than the other aspects of performance.

There are a lot of very specific definitions of latency, but I really like the idea that the etymology of the word evokes:

Latency

The state of being latent; delay, a period between the initiation of something and its occurrence.

And what does "latent" mean?

Latent

From Latin latens, latentis, present participle of lateo ("lie hidden"). Existing or present but concealed or inactive.

This definition is great because it emphasizes that latency is actually the time between when something happens and when it has an effect or becomes visible.

For example, imagine that you are infected by an airborne virus that turns people into zombies. The incubation period is the time between when you are infected and when you turn into a zombie. This is the incubation period: what has already happened is hidden during this time.

Let's assume for a moment that our distributed system only performs one high-level task: given a query, it fetches all the data in the system and computes a single result. In other words, think of a distributed system as a data store with the ability to run a single deterministic computation (function) on its current content:

result = query(all data in the system)

For latency, then, what matters is not the amount of old data, but rather how quickly new data "takes effect" in the system. For example, latency could be measured by how long it takes from when data is written until it becomes visible to readers.

The other key point based on this definition is that if nothing happens, there is no "latent period". A system in which data doesn't change doesn't (and shouldn't) have a latency problem.
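To make that concrete, here is a small sketch (my own illustration, not from the original text) that measures latency as the time between issuing a write and the moment it becomes visible to a reader; the replica object and its propagation delay are made up for the example:

// Minimal sketch: measure how long a written value takes to become visible.
// The "replica" below is a stand-in that applies writes after a delay.
function makeReplica(propagationDelayMs) {
  var visible = {};
  return {
    write: function(key, value) {
      setTimeout(function() { visible[key] = value; }, propagationDelayMs);
    },
    read: function(key) { return visible[key]; }
  };
}

var replica = makeReplica(150); // pretend replication takes ~150 ms
var start = Date.now();
replica.write('x', 42);

var poll = setInterval(function() {
  if (replica.read('x') === 42) {
    console.log('write became visible after ~' + (Date.now() - start) + ' ms');
    clearInterval(poll);
  }
}, 10);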

In a distributed system, there is an insurmountable minimum latency: the speed of light limits the speed at which information can be transferred, and every operation of a hardware component has a minimum latency cost (e.g. memory and hard disk, but also CPU).

The impact of minimum latency on your queries depends on the nature of those queries and the physical distance the information needs to travel.

Availability (and fault tolerance)

The second aspect of a scalable system is availability.

availability

The fraction of time the system is in a normal operating state. If a user cannot access the system, it is said to be unavailable.

Distributed systems allow us to achieve desirable properties that are difficult to achieve with a single system. For example, a single machine cannot tolerate any failures because it either fails or it does not fail.

A distributed system can take a bunch of unreliable components and build a reliable system on top of it.

A system without redundancy is only as available as its underlying components. Systems with redundancy can tolerate partial failures and are therefore more available. It's worth noting that "redundancy" can mean different things depending on where you look at it - components, servers, data centers, etc.

Formally, availability is: Availability = uptime / (uptime + downtime).

From a technical point of view, availability mainly refers to fault tolerance. As the number of components increases, the probability of failure also increases, and the system should be able to compensate so that it does not become unreliable as the number of components increases.

For example:

Availability %             How much downtime is allowed per year?
90% ("one nine")           More than a month
99% ("two nines")          Less than 4 days
99.9% ("three nines")      Less than 9 hours
99.99% ("four nines")      Less than an hour
99.999% ("five nines")     About 5 minutes
99.9999% ("six nines")     About 31 seconds
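As a quick sanity check on this table, the allowed downtime follows directly from the availability formula above; a minimal sketch (my own, not from the original text):

// Downtime allowed per year for a given availability percentage,
// derived from: availability = uptime / (uptime + downtime).
function downtimePerYear(availabilityPercent) {
  var yearInHours = 365 * 24;
  return yearInHours * (1 - availabilityPercent / 100);
}

console.log(downtimePerYear(99.9).toFixed(2) + ' hours');            // ~8.76 hours ("three nines")
console.log((downtimePerYear(99.999) * 60).toFixed(1) + ' minutes'); // ~5.3 minutes ("five nines")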

Availability is broader than uptime in the sense that the availability of a service can also be affected by things like network failures or the bankruptcy of the company that owns the service (factors that have nothing to do with fault tolerance but can still affect system availability). But without knowing every specific aspect of the system, all we can do is design for fault tolerance.

What is fault tolerance?

fault tolerance

the ability of a system to behave in a well-defined manner after a failure

Fault tolerance boils down to the following: define the failures you expect, then design a system or algorithm that can tolerate those failures. You cannot tolerate failures that you have not considered.

What prevents us from achieving good things?

Distributed systems are limited by two physical factors:

  • Number of nodes (increase with required storage and computing power)
  • Distance between nodes (the speed at which information is transmitted is the speed of light in the best case)

Work within these constraints:

  • An increase in the number of independent nodes increases the probability of system failure (reducing availability and increasing management costs)
  • An increase in the number of independent nodes may increase inter-node communication requirements (reducing performance as scale increases)
  • An increase in geographic distance increases the minimum latency between remote nodes (reducing performance for some operations)

Beyond these tendencies - which are the result of physical constraints - there is a world of system design options.

Both performance and availability are defined by the external guarantees provided by the system. At a high level, you can think of these guarantees as a service level agreement (SLA) for your system: if I write data, how quickly can I access it elsewhere? What guarantees do I have about durability after the data is written? If I ask the system to run a calculation, how quickly will it return results? When a component fails or stops functioning, what does this do to the system?

There is another criterion, though not explicitly mentioned, but implicit: understandability. How understandable are the guarantees made? Of course, there is no easy measure of what is understandable.

I'm tempted to classify "understandability" as a physical limitation. After all, it is hard for us humans to reason about anything that has more moving parts than we have fingers. This is the difference between an error and an anomaly: an error is incorrect behavior, while an anomaly is unexpected behavior. If you were smarter, you'd have expected the anomaly to occur.

Abstractions and models

This is where abstractions and models come into play. Abstractions make things more manageable by removing real-world aspects that are not relevant to solving a problem. Models describe key properties of distributed systems in a precise manner. In the next chapter, I will discuss many kinds of models, such as:

  • System model (asynchronous/synchronous)
  • Failure Models (Crash Failure, Partitioning, Byzantine)
  • Consistency model (strong consistency, eventual consistency)

A good abstraction makes using the system easier to understand while capturing factors relevant to a particular purpose.

There is a tension between the reality of having many nodes and our desire for systems that "work like a single system". Often, the most familiar models (e.g. implementing shared memory abstractions on distributed systems) are too expensive.

A system that offers weaker guarantees has greater freedom of action and thus potentially higher performance - but this can also be difficult to reason about. People are better at reasoning about systems that work like a single system, rather than collections of nodes.

A common practice is to improve performance by exposing more details of system internals. For example, in a column store , users can (to some extent) infer the locality of key-value pairs within the system, and thus make decisions that affect typical query performance. Systems that hide these details are easier to understand (because they are more like a single unit with fewer details to consider), while systems that expose more real-world details are likely to be more performant (because they are closer to reality).

Writing a distributed system that acts like a single system in the face of several types of failures is difficult. Network latency and network partitions (e.g. total network failure between some nodes) mean that a system sometimes needs to make hard choices when these failures occur: whether to remain available but lose some crucial guarantees that cannot be enforced, or to play it safe and refuse service to clients when these failures occur.

The CAP theorem - which I discuss in the next chapter - summarizes some of these tensions. Ultimately, the ideal system meets programmer needs (clear semantics) and business needs (availability/consistency/latency).

Design techniques: partition and replicate

The way the dataset is distributed across multiple nodes is very important. In order to do any computation, we need to find the data and then operate on it.

There are two basic techniques that can be applied to datasets. It can be split across multiple nodes (partitioning) for more parallel processing. It can also be replicated or cached on different nodes to reduce the distance between client and server and provide higher fault tolerance (replication).

Divide and conquer - I mean, partition and replicate.

The following diagram illustrates the difference between these two concepts: Partitioned data (A and B below) is divided into independent collections, while replicated data (C below) is replicated to multiple locations.

This is an excellent way to solve any problem involving distributed computing. The key, of course, is to choose the right technology for your specific implementation; there are many algorithms for implementing replication and partitioning, each with different limitations and advantages, which need to be evaluated against your design goals.

Partitioning

Partitioning is the division of a dataset into smaller independent sets; this is to reduce the impact of dataset growth, since each partition is a subset of the data.

  • Partitioning improves performance by limiting the amount of data to examine and locating related data in the same partition.
  • Partitioning increases availability by allowing partitions to fail independently, increasing the number of nodes that need to fail.

Partitioning is also very much application-specific, so it is hard to say much about it without knowing the specifics. That's why most texts, including this one, focus on replication.

Partitioning is mostly about defining partitions in terms of what you consider to be the main access patterns, and dealing with the limitations that come with having independent partitions (e.g. inefficient access across partitions, different growth rates, etc.).
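As an illustration of the basic idea (a sketch of my own, not from the original text), the simplest partitioning scheme assigns each key to a partition by hashing it; real systems typically use more refined schemes such as consistent hashing to limit data movement when nodes are added or removed:

// Minimal hash partitioning sketch: each key is deterministically mapped
// to one of N partitions, so lookups only need to touch a single partition.
function partitionFor(key, partitionCount) {
  var hash = 0;
  for (var i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) | 0; // simple string hash
  }
  return Math.abs(hash) % partitionCount;
}

var partitions = [{}, {}, {}]; // three independent subsets of the data

function put(key, value) {
  partitions[partitionFor(key, partitions.length)][key] = value;
}

function get(key) {
  return partitions[partitionFor(key, partitions.length)][key];
}

put('user:1', 'Alice');
put('user:2', 'Bob');
console.log(get('user:1')); // 'Alice' - only one partition was consulted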

Replication

Replication is the copying of the same data to multiple machines; this enables more servers to participate in the computation.

To (mis)quote Homer Simpson:

To replication! The cause of, and solution to, all of life's problems.

Replication - copying or reproducing something - is the main way we can combat latency.

  • Replication improves performance by applying additional computing power and bandwidth to new copies of data
  • Replication increases availability by creating additional copies of data, increasing the number of nodes that need to fail

Replication is about providing extra bandwidth and caching at key locations. It also involves maintaining consistency according to some consistency model.

Replication allows us to achieve scalability, performance, and fault tolerance. Afraid of losing availability or reduced performance? Replicate the data to avoid a bottleneck or single point of failure. Slow computation? Replicate the computation across multiple systems. Slow I/O? Replicate the data to a local cache to reduce latency, or onto multiple machines to increase throughput.
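As a rough sketch of the idea (my own illustration, not from the original text), the most naive form of replication simply applies every write to every copy and serves reads from any copy; how and when the copies are kept in sync is exactly what a consistency model governs:

// Naive replication sketch: every write is applied to all replicas,
// reads can be served by any one of them. Keeping the copies in sync
// under failures and delays is the hard part (see the consistency models below).
function ReplicatedStore(replicaCount) {
  this.replicas = [];
  for (var i = 0; i < replicaCount; i++) {
    this.replicas.push({});
  }
}

ReplicatedStore.prototype.write = function(key, value) {
  this.replicas.forEach(function(replica) {
    replica[key] = value; // in reality this is a network call that may fail or lag
  });
};

ReplicatedStore.prototype.read = function(key) {
  var index = Math.floor(Math.random() * this.replicas.length); // read from any replica
  return this.replicas[index][key];
};

var store = new ReplicatedStore(3);
store.write('greeting', 'hello');
console.log(store.read('greeting')); // 'hello'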

Replication is also the source of many problems, as there are now multiple independent copies of data that must be kept in sync across multiple machines - which means making sure that replication follows a consistency model.

The choice of consistency model is very important: a good consistency model provides programmers with clear semantics (in other words, the properties it guarantees are easy to reason about), and meets business/design goals such as high availability or strong consistency.

There is only one consistency model for replication - strong consistency - that allows you to program as if the underlying data was not replicated. Other consistency models expose some of the internal details of replication to the programmer. However, a weaker consistency model can provide lower latency and higher availability - and is not necessarily harder to understand, just different.


Further reading

2. Up and down the levels of abstraction

In this chapter, we will travel up and down the levels of abstraction, look at some impossibility results (CAP and FLP), and then come back down for the sake of performance.

If you've done any programming, the concept of levels of abstraction is probably already familiar. You're always working at some level of abstraction, interfacing with a lower-level layer through some API, and possibly providing some higher-level API or user interface to your users. The seven-layer OSI model of computer networking is a good example.

Distributed programming, I dare say, is largely a consequence of dealing with distribution (obviously!). That is, there are many nodes in reality, and we want the system to work "like a single system". This means finding a good abstraction that balances possibility with understandability and performance.

What do we mean when we say X is more abstract than Y? First, X does not introduce anything new that is fundamentally different from Y. In fact, X might remove some aspects of Y or present them in a more tractable way. Second, X is in some sense easier to understand than Y, assuming that what X removes from Y is not important to the problem at hand.

As Nietzsche wrote:

Every concept originates through our equating what is unequal. No leaf ever wholly equals another, and the concept "leaf" is formed through an arbitrary abstraction from these individual differences, through forgetting the distinctions; and now it gives rise to the idea that in nature there might be something besides the leaves which would be "leaf" - some kind of original form after which all leaves have been woven, marked, copied, colored, curled, and painted, but by unskilled hands, so that no copy turned out to be a correct, reliable, and faithful image of the original form.

Abstraction, fundamentally, is false. Every situation is unique, and so is every node. But abstraction makes the world manageable: simpler problem statements - free from reality - are easier to analyze, and the solutions are broadly applicable as long as we don't overlook anything important.

Indeed, if the things we keep around are the essential ones, the results we derive will be widely applicable. This is why impossibility results are so important: they take the simplest possible formulation of a problem and demonstrate that it cannot be solved within some set of constraints or assumptions.

All abstraction ignores something and equates something unique in reality. The key is to get rid of everything that is not necessary. How do you know what is necessary? Well, you probably wouldn't know it beforehand.

Every time we exclude an aspect from a system's specification, we risk introducing a source of bugs and/or performance problems. That's why sometimes we need to go in the opposite direction and selectively bring in some aspects of real hardware and real world problems. Reintroducing some specific hardware characteristics (such as physical order) or other physical characteristics may be sufficient to obtain a sufficiently performing system.

With this in mind, what is the least amount of reality we can keep around while still working with something that is recognizably a distributed system? A system model is a specification of the characteristics we consider important; having specified one, we can then take a look at some impossibility results and challenges.

A system model

A key property of distributed systems is distribution. More specifically, programs in a distributed system are:

  • Running concurrently on separate nodes...
  • Connecting over a network that can introduce uncertainty and message loss...
  • And there is no shared memory or shared clock.

This has many implications:

  • Each node executes a program concurrently
  • Knowledge is local: nodes only have quick access to their local state, and any information about the global state may be out of date
  • Nodes can fail and recover independently
  • Messages may be delayed or lost (not related to node failure; it is difficult to distinguish between network failure and node failure)
  • Clocks are not synchronized between nodes (local timestamps do not correspond to global real-time order and cannot be easily observed)

A system model enumerates a number of assumptions related to the design of a particular system.

system model

A set of assumptions about the environment and facilities on which distributed system implementations depend

System models vary in their assumptions about the environment and facility. These assumptions include:

  • What capabilities do nodes have and how they can fail
  • How communication links work and how they can fail
  • properties of the overall system, such as assumptions about timing and ordering

A robust system model is one that makes the weakest assumptions: any algorithm written for such a system is very tolerant of different environments, since it makes few and very weak assumptions.

On the other hand, we can create a model of the system that is easy to reason about by making strong assumptions. As an example, assuming that nodes cannot fail means that our algorithm does not need to handle node failures. However, such a system model is unrealistic and thus difficult to apply in practice.

Let's look at the properties of Node, Link, Time and Order in more detail.

Nodes in our system model

Nodes serve as hosts for computing and storage. They have:

  • Ability to execute procedures
  • Ability to store data into volatile memory (which can be lost upon failure) and stable storage (which can be read after a failure)
  • Clock (which may or may not be assumed to be accurate)

Nodes execute deterministic algorithms: the local computation, the local state after the computation, and the messages sent are determined uniquely by the message received and the local state when the message was received.

There are a number of possible failure modes that describe how a node can fail. In practice, most systems assume a crash-recovery failure mode: that is, a node can only fail by crashing, and can (possibly) recover at some subsequent point in time.

One alternative is to assume that nodes can fail in any arbitrary way. This is known as Byzantine fault tolerance. Byzantine faults are rarely handled in real-world commercial systems, because algorithms that are resilient to arbitrary faults are more expensive to run and more complicated to implement. I will not discuss them here.

Communication links in our system model

Communication links connect individual nodes to each other and allow messages to be sent in either direction. Many books that discuss distributed algorithms assume that there is an individual link between each pair of nodes, that the links provide FIFO (first in, first out) ordering for messages, that they can only deliver messages that were actually sent, and that sent messages can be lost.

Some algorithms assume that the network is reliable: that messages are never lost and never delayed indefinitely. This may be a reasonable assumption for some real-world settings, but in general it is preferable to consider the network unreliable and subject to message loss and delay.

A network partition occurs when the network fails while the nodes themselves remain operational. When this happens, messages may be lost or delayed until the partition is repaired. Partitioned nodes may still be accessible by some clients, and so must be treated differently from crashed nodes. The diagram below illustrates the difference between a node failure and a network partition:

It is rare to make further assumptions about communication links. One could assume that links only work in one direction, or introduce different communication costs (e.g. latency due to physical distance) for different links. However, outside of long-distance links (WAN latency) these are rarely concerns in commercial environments, so I will not discuss them here; more detailed models of cost and topology allow for better optimization, at the price of added complexity.

Timing / ordering assumptions

One of the consequences of physical distribution is that each node experiences the world in a unique manner. This is inescapable, because information can only travel at the speed of light. If nodes are at different distances from each other, then any messages sent from one node to the others will arrive at different times and potentially in a different order at different nodes.

Timing assumptions are a convenient shorthand for capturing the degree to which we take this reality into account. The two main alternatives are:

Synchronous system model

Processes execute in lock-step; there is a known upper bound on message transmission delay; each process has an accurate clock

Asynchronous system model

No timing assumptions - e.g. processes execute at independent rates; there is no upper bound on message transmission delay; useful clocks do not exist

Synchronous system models impose many constraints on timing and sequencing. It basically assumes that nodes have the same experience: sent messages are always received within a certain maximum transmission delay, and processes execute in lockstep. This is convenient because it allows you as the system designer to make assumptions about timing and ordering that the asynchronous system model does not.

Asynchronicity is a non-assumption: it only assumes that you cannot depend on time (or "time sensor").

It is easier to solve problems in the synchronous system model, because assumptions about execution speed, maximum message transmission delay, and clock accuracy all help in solving problems: you can make inferences based on those assumptions, and rule out inconvenient failure scenarios by assuming they never occur.

Of course, it's not particularly realistic to assume a synchronous system model. Real-world networks suffer from failures, and there are no hard boundaries for message latency. Real-world systems are at best partially synchronous: they may occasionally work correctly and provide some upper bound, but there will also be times when messages are delayed indefinitely and clocks are out of sync. I won't really discuss algorithms for synchronous systems here, but you may come across them in many other introductory books because they are analytically easier (but not realistic).

The consensus problem

In the text that follows, we will vary the parameters of the system model. Next, we'll examine how varying two system properties:

  • whether or not network partitions are included in the failure model, and
  • synchronous vs. asynchronous timing assumptions

influences the system design choices, by discussing two impossibility results (FLP and CAP).

Of course, in order to have a discussion, we also need a problem to solve. The problem I'm going to discuss is the consensus problem.

If several computers (or nodes) agree on a value, they achieve consensus. More formally:

  1. Agreement: every correct process must agree on the same value.
  2. Integrity: every correct process decides at most one value, and if it decides some value, then it must have been proposed by some process.
  3. Termination: All processes eventually reach a decision.
  4. Validity: If all correct processes propose the same value V, then all correct processes decide on V.

The consensus problem is at the heart of many commercial distributed systems. After all, we want the reliability and performance of a distributed system without having to deal with the consequences of distribution (e.g. divergence between nodes), and solving the consensus problem makes it possible to solve several related, higher-level problems, such as atomic broadcast and atomic commit.

Two impossibility results

The first impossibility result, known as the FLP impossibility result, is an impossibility result that is particularly relevant to people who design distributed algorithms. The second - the CAP theorem - is a related result that is more relevant to practitioners: people who need to choose between different system designs but who are not directly concerned with the design of algorithms.

FLP impossibility result

I will only briefly summarize the FLP impossibility result, though it is considered to be more important in academic circles. The FLP impossibility result (named after the authors, Fischer, Lynch, and Paterson) examines the consensus problem (technically, the agreement problem, which is a very weak form of the consensus problem) under the asynchronous system model. It assumes that nodes can only fail by crashing, that the network is reliable, and that the typical timing assumptions of the asynchronous system model hold: e.g., there are no bounds on message delay.

Under these assumptions, the FLP result states: "there does not exist a (deterministic) algorithm for the consensus problem in an asynchronous system subject to failures, even if messages can never be lost, at most one process may fail, and it can only fail by crashing (stopping execution)."

This result means that there is no way to solve the consensus problem under a very minimal system model in a way that cannot be delayed forever. The argument is that if such an algorithm existed, then one could devise an execution of that algorithm in which it would remain undecided ("bivalent") for an arbitrarily long time by delaying message delivery - which is allowed in the asynchronous system model. Thus, such an algorithm cannot exist.

This impossibility result is important because it emphasizes that, given an asynchronous system model, algorithms for solving consensus problems must either give up safety or liveness when the bounds on message delivery are not guaranteed.

This insight is especially important to those who design algorithms because it imposes severe constraints on the problems we know can be solved in models of asynchronous systems. The CAP theorem is a related theorem, and even more relevant to practitioners: it makes slightly different assumptions (network failures instead of node failures), and has more explicit implications for practitioners when choosing system designs.

CAP theorem

The CAP theorem was initially a conjecture made by computer scientist Eric Brewer. It's a popular and fairly useful way to think about tradeoffs in the guarantees that a system design makes. It even has a formal proof by Gilbert and Lynch and no, Nathan Marz didn't debunk it, in spite of what a particular discussion site thinks.

The theorem states that among these three properties:

  • Consistency: All nodes see the same data at the same time.
  • Availability: Node failure does not prevent surviving nodes from continuing to operate.
  • Partition tolerance: The system can continue to operate despite message loss due to network and/or node failures.

Only two can be satisfied at the same time. We can even plot this into a nice graph, choosing two of the three properties, we get three system types corresponding to different intersections:

Note that the theorem states that the part in the middle (with all three properties) is unrealizable. Then we get three different system types:

  • Consistency + Availability (CA). Examples include full strict quorum protocols, such as two-phase commit.
  • Consistency + Partition tolerance (CP). Examples include majority quorum protocols in which minority partitions are unavailable, such as Paxos.
  • Availability + Partition tolerance (AP). Examples include protocols using conflict resolution, such as Dynamo.

The CA and CP system designs both offer the same consistency model: strong consistency. The only difference is that a CA system cannot tolerate any node failures, while a CP system can tolerate up to f faults given 2f+1 nodes under a non-Byzantine failure model. In other words, it can tolerate the failure of a minority f of the nodes as long as the majority f+1 stays up. The reason is simple:

  • A CA system cannot distinguish between node failures and network failures, and hence must stop accepting writes everywhere to avoid introducing divergence (multiple copies). It cannot tell whether a remote node is down or whether just the network connection is down: so the only safe thing to do is to stop accepting writes.
  • A CP system prevents divergence (e.g. maintains single-copy consistency) by forcing asymmetric behavior on the two sides of the partition. It only keeps the majority partition around, and requires the minority partition to become unavailable (e.g. stop accepting writes), which retains a degree of availability (the majority partition) and still ensures single-copy consistency.
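To make the arithmetic concrete, here is a small sketch (my own illustration, not from the original text) of the majority quorum reasoning a CP system relies on: with n = 2f + 1 nodes, any operation acknowledged by a majority survives the failure of up to f nodes, and at most one side of a partition can hold a majority:

// Majority quorum arithmetic for a CP system with n = 2f + 1 nodes.
function faultsTolerated(n) {
  return Math.floor((n - 1) / 2); // f
}

function quorumSize(n) {
  return Math.floor(n / 2) + 1; // a majority of n
}

function hasQuorum(responsiveNodes, n) {
  return responsiveNodes >= quorumSize(n);
}

var n = 5;                        // 2f + 1 with f = 2
console.log(faultsTolerated(n));  // 2
console.log(hasQuorum(3, n));     // true  - majority side keeps operating
console.log(hasQuorum(2, n));     // false - minority side must refuse writes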

I discuss this in detail in the chapter on replication, when I discuss Paxos. The important thing is that CP systems incorporate network partitions into their failure model and distinguish between a majority partition and a minority partition using an algorithm like Paxos, Raft, or viewstamped replication. CA systems are not partition-aware and have historically been more common: they typically use the two-phase commit algorithm and are common in traditional distributed relational databases.

Assuming partitioning occurs, the theorem reduces to a binary choice between availability and consistency.

I think four conclusions can be drawn from the CAP theorem:

First, many system designs used in early distributed relational database systems did not take partition tolerance into account (for example, they were CA designs). Partition tolerance is an important property for modern systems because if the system is geographically distributed (as in many large systems), the chances of network partitions are greatly increased.

Second, there is a tension between strong consistency and high availability during network partitions . The CAP theorem is a formulation of the trade-off between strong guarantees and distributed computing.

In one sense, it's pretty crazy to promise that a distributed system consisting of an interconnected unpredictable network of independent nodes "behaves in ways that are indistinguishable from non-distributed systems."

Strong consistency guarantees require us to give up availability during a partition. This is because divergence between two replicas that cannot communicate with each other cannot be prevented while both continue to accept writes on their side of the partition.

How can we solve this problem? Either by strengthening assumptions (assuming no partitions) or by reducing guarantees. Consistency can be traded off against availability (and the related ability for offline access and low latency). If "consistency" is defined as some degree below "all nodes see the same data at the same time", then we can have both availability and some (weaker) consistency guarantees.

Third, there is a tension between strong consistency and performance in normal operation .

Strong consistency/single-replica consistency requires nodes to communicate and agree on every operation. This causes high latency during normal operation.

If you can live with a consistency model that differs from the classic one, one that allows lag or divergence between replicas, then you can reduce latency during normal operations and maintain availability in the presence of partitions.

Operations can complete faster when fewer messages and nodes are involved. But the only way to achieve this is to relax the guarantee: let some nodes be contacted less often, which means the nodes may contain old data.

It also makes anomalies possible. You are no longer guaranteed to get the latest value. Depending on the kind of guarantees made, you might read a value that is older than expected, or even lose some updates.

Fourth - and somewhat indirectly - if we don't want to give up availability during network partitions, then we need to explore whether consistency models other than strong consistency are suitable for our purposes.

For example, even if user data is replicated to multiple data centers, and the link between those two data centers is temporarily down, we still want to allow users to use the website/service in many cases. This means that two different data sets need to be reconciled later, which is both a technical challenge and a business risk. But often, the technical challenges and business risks are manageable, so it's best to provide high availability.

Consistency and availability are not really binary choices unless you limit yourself to strong consistency. But strong consistency is just a consistency model: in this model, you have to give up availability to prevent multiple copies of the data from being active at the same time. As Brewer himself pointed out , the "2 out of 3" interpretation is misleading.

If you take away just one point from this discussion, it's this: "consistency" is not a single, well-defined property. remember:

ACID consistency !=
CAP consistency !=
Oatmeal consistency

Conversely, a consistency model is any guarantee a data store makes to the programs that use it.

consistency model

A contract between a programmer and a system in which the system guarantees that the results of operations on a data store will be predictable if the programmer follows some specific rules

The "C" in CAP is "strong consistency", but "consistency" is not a synonym for "strong consistency".

Let's look at some alternative consistency models.

Strong Consistency vs. Other Consistency Models

Consistency models can be divided into two types: strong consistency models and weak consistency models:

  • Strong consistency models (capable of maintaining a single copy)
    • Linearizable consistency
    • Sequential consistency
  • Weak consistency models (not strong)
    • Client-centric consistency models
    • Causal consistency: the strongest model available
    • Eventual consistency models

Strong consistency models guarantee that the apparent order and visibility of updates is equivalent to a non-replicated system. Weak consistency models make no such guarantees.

Note that this is by no means an exhaustive list. Again, the consistency model is just an arbitrary contract between the programmer and the system, so it can be pretty much anything.

Strong Consistency Model

The strong consistency model can be further divided into two similar but slightly different consistency models:

  • Linearizable consistency: Under linearizable consistency, all operations appear to have executed atomically in an order that is consistent with the global real-time ordering of operations. (Herlihy & Wing, 1991)
  • Sequential Consistency : Under sequential consistency, all operations appear to be executed atomically, in an order that is consistent with the order seen on individual nodes and that is equal on all nodes. (Lamport, 1979)

The key difference is that linearizability requires that the order in which operations take effect is equal to the actual real-time order of operations. Sequential consistency allows operations to be reordered as long as the order observed on each node remains consistent. The two can only be distinguished if one can observe all inputs and timings in the system; for clients interacting with nodes, the two are equivalent.

This difference might seem unimportant, but it's worth noting that sequential consistency is not composable.

The strong consistency model allows you as a programmer to replace a single server with a distributed cluster of nodes without any issues.

All other consistency models have anomalies (compared to a system that guarantees strong consistency), because they behave in a way that is distinguishable from a non-replicated system. Often, however, these anomalies are acceptable, either because we don't care about the occasional issue, or because we've written code that somehow deals with the inconsistencies after they occur.

Note that there is no universally applicable taxonomy for weak consistency models, since "not a strong consistency model" (eg, "distinguished in some way from non-replicating systems") can be pretty much anything.

Client-centric consistency models

A client-centric consistency model is one that involves the notion of a client or session. For example, a client-centric consistency model might guarantee that clients never see old versions of data items. Typically this is achieved by building additional caches in the client library so that if the client moves to a replica node that contains old data, the client library will return its cached value instead of the old value from the replica.

Clients may still see the old version of the data if they are on a replica node that doesn't contain the latest version, but they will never see the old version's value reappear anomalously (e.g. because they are connected to a different replica node). Note that there are many client-centric consistency models.
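Here is a minimal sketch of the client-side caching idea described above (the names and structure are my own, not those of any real client library): the client remembers the newest version it has seen for each key and refuses to go back in time when a replica returns older data:

// Client-side cache sketch for a "never see older data again"-style guarantee.
// Each value carries a version; the client never returns a version older
// than one it has already observed.
function SessionClient(replica) {
  this.replica = replica;   // hypothetical replica with get(key) -> {version, value}
  this.lastSeen = {};       // key -> newest {version, value} seen by this client
}

SessionClient.prototype.read = function(key) {
  var fromReplica = this.replica.get(key);
  var cached = this.lastSeen[key];
  if (cached && (!fromReplica || fromReplica.version < cached.version)) {
    return cached.value;    // replica is lagging: serve the cached, newer value
  }
  this.lastSeen[key] = fromReplica;
  return fromReplica && fromReplica.value;
};

// Usage with a lagging replica (lastSeen is set manually here for the demo;
// normally it would be filled in by the client's own reads and writes):
var staleReplica = { get: function(key) { return { version: 1, value: 'old' }; } };
var client = new SessionClient(staleReplica);
client.lastSeen['x'] = { version: 2, value: 'new' };
console.log(client.read('x')); // 'new' - the older replica value is never returned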

Eventual consistency

The eventual consistency model says that if you stop changing the value, then after an undefined amount of time, all replicas will agree on the value. This means that until then, the results between replicas are inconsistent in some undefined way. Since it is easily satisfiable (only liveness properties), it is useless without supplementary information.

Saying something is only eventually consistent is like saying "people die eventually". This is a very weak constraint, and we might wish for at least some more specific description:

First, how long does "eventually" refer to? It would be useful to have a hard lower bound, or at least an idea of ​​how long it usually takes for the system to converge to the same value.

Second, how do replicas agree on a value? A system that always returns "42" is eventually consistent: all replicas agree on the same value. But it doesn't converge to a useful value because it just keeps returning the same fixed value. Instead, we hope there is a better way. For example, one way to decide is to have the value with the largest timestamp always win.

So when vendors say "eventual consistency" they mean some more precise term like "eventually the last writer wins, and in the meantime reads the latest observed value" consistency. "How?" is important, because a bad approach can lead to lost writes - for example, if the clock on one node is set incorrectly and timestamps are used.
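For example, a "last writer wins" rule could be sketched as follows (my own illustration, not any particular system's implementation); note how a node with a misconfigured clock could silently win or lose against newer writes:

// "Last writer wins" sketch: replicas converge on the write with the
// largest timestamp. A node with a fast clock can silently overwrite
// newer data written by a node with a slow clock.
function lastWriterWins(a, b) {
  // a and b are {value, timestamp}; ties are broken arbitrarily here
  return a.timestamp >= b.timestamp ? a : b;
}

var fromReplica1 = { value: 'cart: [book]',        timestamp: 1700000000000 };
var fromReplica2 = { value: 'cart: [book, phone]', timestamp: 1700000005000 };

console.log(lastWriterWins(fromReplica1, fromReplica2).value); // 'cart: [book, phone]'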

I will examine both of these issues in more detail in the chapter on Replication Methods for Weak Consistency Models.


Further reading

3. Time and order

What is order and why is it important?

What do you mean "what is the order"?

I mean, why are we so obsessed with order? Why do we care if A happened before B? Why don't we care about other properties like "color"?

Well, my crazy friends, let's go back to the definition of a distributed system to answer this question.

As you may recall, I describe distributed programming as the art of using multiple computers to solve the same problem as a single computer.

This is actually at the heart of the fascination with order. Any system that does one thing at a time creates a complete sequence of operations. Just as people sequentially pass through a single door, each operation has a well-defined predecessor and successor. That's basically the programming model we try to keep.

The traditional model is: a program, a process, and a memory space run on a CPU. An operating system abstracts away the fact that there may be multiple CPUs and multiple programs, and that memory on a computer is actually shared among many programs. I'm not saying threaded programming and event-driven programming don't exist; it's just that they're special abstractions built on top of the "one/one/one" model. Programs are written to execute sequentially: start at the top and work your way to the bottom.

The reason "Order" gets so much attention as an attribute is because the easiest way to define "correctness" is to say "it works the same as it does on a single machine". And usually that means a) we run the same operations and b) run them in the same order - even with multiple machines.

The benefit of order preservation in distributed systems (as defined by a single system) is that they are generic. You don't need to care what the operations are, as they will be executed exactly as if on a single machine. That's great because you know you can use the same system no matter what the operation is.

In effect, a distributed program runs on multiple nodes; there are multiple CPUs and multiple operations flowing in. You can still assign a total order, but that requires accurate clocks or some form of communication. You could use a perfectly accurate clock to assign a timestamp to each operation, then use that to determine the total order. Or you might have some sort of communication system that assigns consecutive numbers like a total order.

Total and partial order

In a distributed system, the natural state is a partially ordered set . Neither the network nor individual nodes make any guarantees about relative order; but on each node, you can observe a local order.

A total order is a binary relation that defines an ordering for each element in a collection.

Two distinct elements are comparable when one of them is greater than the other . In a partially ordered set, some pairs of elements are not comparable, so a partially ordered does not specify the exact order of each item.

Both total order and partial order are transitive relations and antisymmetric relations . For all a, b, and c in the set X, the following statements hold in total and partial orders:

If a ≤ b and b ≤ a, then a = b (anti-symmetry);
If a ≤ b and b ≤ c, then a ≤ c (transitive);

However, a total order is total:

a ≤ b or b ≤ a for all a, b in X (totality)

whereas a partial order is only reflexive:

a ≤ a for all a in X (reflexivity)

Note that totality implies reflexivity; thus, a partial order is a weaker variant of a total order. For some elements in a partial order, the property of totality does not hold - in other words, some elements are not comparable.

Git branches are an example of a partial order. As you probably know, the git version control system allows you to create multiple branches from a base branch (such as the master branch). Each branch represents the history of source code changes based on a common ancestor:

[ branch A (1,2,0) ]  [ master (3,0,0) ]  [ branch B (1,0,2) ]
[ branch A (1,1,0) ]  [ master (2,0,0) ]  [ branch B (1,0,1) ]
                    \ [ master (1,0,0) ] /

Branches A and B are derived from a common ancestor, but there is no definite order between them: they represent different histories that cannot be reduced to a single linear history without extra work (merging). Of course, you could sort all the commits in some arbitrary order (say, first by ancestors, then by A before B or B before A) - but doing so would lose information, since forcing a non-existent total sequence.

In a system with only one node, there is bound to be a total order: instructions are executed in a specific observable order, and messages are processed in a separate program in a specific observable order. We already rely on this general order - it makes the execution of the program predictable. This order can be maintained in a distributed system, but at a high cost: communication is expensive, and time synchronization is difficult and fragile.

What is time?

Time is the source of order - it allows us to define the sequence of operations - which also has human understandable interpretation (a second, a minute, a day, etc.).

In a sense, time is like any other integer counter. It's just that it happens to be so important that most computers have a dedicated time sensor, also known as a clock. It's so important that we've figured out how to synthesize an approximation of the same counter using a number of imperfect physical systems, from candles to cesium atoms. By "synthetic", I mean that we can approximate the value of an integer counter at physically distant places via some physical property, without direct communication.

Timestamps are actually shorthand values ​​used to represent the state of the world from the beginning of the universe to the present moment - if an event happened at a particular timestamp, it could be affected by everything that happened before. This idea can be generalized to a causal clock that explicitly tracks causes (dependencies) instead of just assuming that everything that happened before a timestamp is correlated. Of course, the usual assumption is that we only need to care about the state of a particular system, not the state of the world as a whole.

Assuming that time progresses at the same rate everywhere - and that is a big assumption which I'll return to in a moment - time and timestamps have several useful interpretations when used in a program. The three interpretations are:

  • Order
  • Duration
  • Interpretation

Order. When I say that time is a source of order, I mean:

  • We can append timestamps to unordered events to sort them
  • We can use timestamps to enforce a specific order of operations or delivery of messages (e.g. an operation can be delayed if it arrives in the wrong order)
  • We can use the value of a timestamp to determine whether something happened chronologically before something else
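As a small illustration of the first point (a sketch of my own, not from the original text): given timestamps that are assumed to be comparable across nodes, events gathered from several nodes can be put into a single order simply by sorting on their timestamps:

// Ordering events from multiple nodes by timestamp (assumes the clocks are comparable).
var events = [
  { node: 'B', op: 'write x=2', timestamp: 1700000002000 },
  { node: 'A', op: 'write x=1', timestamp: 1700000001000 },
  { node: 'C', op: 'read x',    timestamp: 1700000003000 }
];

events.sort(function(a, b) { return a.timestamp - b.timestamp; });

events.forEach(function(e) { console.log(e.node, e.op); });
// A write x=1
// B write x=2
// C read x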

Interpretation - time as a universally comparable value. The absolute value of a timestamp can be interpreted as a date, which is useful for people. Given a timestamp from a log file for when an outage started, you can tell that it was last Saturday, when there was a thunderstorm.

Duration - Duration measured in units of time has some relevance to the real world. Algorithms usually don't care about the absolute value of the clock or its interpretation as a date, but they may use durations to make some judgments. In particular, the amount of time you wait can provide clues as to whether the system is partitioned or simply experiencing high latency.

By their nature, the components of a distributed system do not behave in predictable ways. They do not guarantee any particular order, speed of advancement, or lack of delay. Each node has a certain local order - execution is (roughly) sequential - but these local orders are independent of each other.

Imposing (or assuming) order is a way of reducing the space in which things can be executed and things can happen. Humans have a hard time reasoning when things can happen in any order - there are just too many permutations to consider.

Does time advance at the same speed everywhere?

We all have an intuitive concept of time based on our own experience as individuals. Unfortunately, that intuitive notion of time makes it easier to picture a total order rather than a partial order. It is easier to imagine a sequence in which things happen one after another, rather than concurrently. It is easier to reason about a single order of delivery of messages than to reason about messages arriving in different orders and with different delays.

However, when implementing a distributed system, we want to avoid making too strong assumptions about time and order, because the stronger the assumption, the more vulnerable the system is to problems with "time sensors" or on-board clocks. Also, enforcing order has a cost. The more time non-determinism we can tolerate, the more we can take full advantage of the advantages of distributed computing.

There are three common answers to the question "Does time pass at the same rate everywhere?" They are:

  • "Global Clock": Yes
  • "local clock": no, but
  • "No Clock": No!

These roughly correspond to the three time assumptions I mentioned in Chapter 2: synchronous system models have a global clock, partially synchronous models have local clocks, and asynchronous system models have no clocks available. Let's look at these in more detail.

Time with "global clock" assumption

The global clock assumes that there is a perfectly accurate global clock and that everyone has access to it. This is how we usually think about time, because in human interactions, small differences in timing don't really matter.

The global clock is basically the source of the overall order (the order of every operation on all nodes is known exactly, even if those nodes never communicate).

However, this is only an idealized view of the world: in reality, clocks can only be synchronized to a limited degree of accuracy. This is limited by the accuracy of commodity computer clocks, by latency if a clock synchronization protocol such as NTP is used, and fundamentally by the nature of spacetime .

Assuming the clocks on the distributed nodes are perfectly synchronized means assuming the clocks start at the same value and never drift. That's a good assumption, since you're free to use timestamps to determine a global total order - limited by clock drift rather than latency - but it's a non-trivial operational challenge and a potential source of anomalies. There are many different situations, such as a user accidentally changing the local time on a machine, or an outdated machine joining a cluster, or synchronized clocks drifting at a slightly different rate, etc., that can cause anomalies that are difficult to track down.

Still, there are some real-world systems that make assumptions about this. Facebook's Cassandra is an example of a system that assumes synchronized clocks. It uses timestamps to resolve write conflicts - the write with the newer timestamp wins. This means that if the clock drifts, new data may be ignored or overwritten by old data; again, this is an operational challenge (from what I've heard, people are very aware of this). Another interesting example is Google's Spanner : the paper describes their TrueTime API, which not only synchronizes time, but also estimates worst-case clock drift.

Time based on "local clock" assumption

The second, and perhaps more reasonable assumption is that each machine has its own clock, but no global clock. This means that you cannot use your local clock to determine whether a remote timestamp is before or after your local timestamp; in other words, you cannot meaningfully compare timestamps from two different machines.

The local clock assumption corresponds more closely to the real world. It assigns a partial order: events on each system are ordered, but events cannot be ordered across systems by using a clock alone.

However, on a single machine you can use timestamps to order events; you can use timeouts on a single machine as long as you are careful not to let the clock jump around. Of course, on a machine controlled by the end user, this might assume too much: for example, a user might accidentally change their date to a different value when looking up a date using the operating system's date controls.

Time without a clock assumption

Finally, there is the notion of logical time. Here, we don't use a clock at all, but trace cause and effect in other ways. Remember, a timestamp is just shorthand for the state of the world at that point - so we can use counters and communications to determine whether something happened before, after, or at the same time as something else.

This way, we can determine the sequence of events between different machines, but can't make any assumptions about time intervals, and can't use timeouts (since we're assuming no "time sensor"). This is a partial order: events can be ordered on a single system using counters and no communication, but ordering events between systems requires a message exchange.

One of the most cited papers in distributed systems is Lamport's paper on time, clocks, and the ordering of events. A vector clock is a generalization of this concept (which I'll cover in more detail), and it is a way to keep track of cause and effect without using a clock. Cassandra's cousins Riak (Basho) and Voldemort (LinkedIn) use vector clocks rather than assuming that nodes have access to a perfectly accurate global clock. This allows those systems to avoid the clock accuracy issues mentioned earlier.

When clocks are not used, the maximum precision with which events can be sequenced across remote machines is limited by communication delays.

How is time used in distributed systems?

What is the benefit of time?

  1. Time can define sequence in the system (no communication required)
  2. Time can define the boundary conditions for the algorithm

The order of events is very important in distributed systems because many properties of distributed systems are defined in terms of the order of operations/events:

  • Correctness depends on (agreement on) the correct ordering of events, e.g. serializability in a distributed database
  • Order can be used as a decision factor when resource contention occurs, e.g. if there are two orders for a widget, the first order is completed first and the second order is canceled

A global clock allows operations on two different machines to be sequenced without requiring the two machines to communicate directly. Without a global clock, we need to communicate to determine the order.

Time can also be used to define the boundary conditions of the algorithm - specifically, to distinguish between "high latency" and "server or network link failure". This is a very important use case; in most real-world systems, timeouts are used to determine whether a remote machine is failing, or is simply experiencing high network latency. Algorithms that make this determination are called fault detectors; I'll discuss them shortly.

vector clock (time in causal order)

Previously, we discussed different assumptions about the speed of time progress in distributed systems. Assuming we can't achieve accurate clock synchronization - or with the goal that our systems should not be sensitive to time synchronization issues, how do we order things?

Lamport clocks and vector clocks are alternatives to physical clocks that rely on counters and communications to determine the order of events in a distributed system. These clocks provide a counter that can be compared between different nodes.

Lamport clocks are simple. Each process maintains a counter using the following rules:

  • Whenever a process does work, increment the counter
  • Whenever a process sends a message, include the counter
  • When a message is received, set the counter to max(local_counter, received_counter) + 1

Expressed in code form:

function LamportClock() {
  this.value = 1;
}

// Read the current counter value, e.g. to timestamp an event
LamportClock.prototype.get = function() {
  return this.value;
};

// Rule 1: increment the counter whenever the process does work
LamportClock.prototype.increment = function() {
  this.value++;
};

// Rule 3: on receiving a message, set the counter to
// max(local_counter, received_counter) + 1
LamportClock.prototype.merge = function(other) {
  this.value = Math.max(this.value, other.value) + 1;
};

A Lamport clock allows counters to be compared across different systems, with one caveat: Lamport clocks define a partial order. If timestamp(a) < timestamp(b), then either:

  • a may have happened before b, or
  • a may be incomparable with b

This is known as the clock consistency condition: if one event comes before another, then that event's logical clock comes before the other's. If a and b are from the same causal history - e.g. either both timestamp values were produced on the same process, or b is a response to the message sent in a - then we know that a happened before b.

Intuitively, this is because Lamport clocks can only carry information about one timeline/history; thus, comparing Lamport timestamps in systems that never communicate with each other can cause concurrent events to appear to be ordered when they are not.

Imagine a system that initially splits into two mutually independent subsystems that never communicate with each other.

For all events in each independent system, if a happens before b, then ts(a) < ts(b); however, if you pick two events from different independent systems (e.g., events that are not causally related), there is no way to compare their relative order to make any meaningful statement. While each part of the system assigns timestamps to events, these timestamps are not related to each other. Two events may appear to be in order even though there is no correlation between them.

However - and this is still a useful property - from the point of view of a single machine, any message sent with ts(a) will receive a response with ts(b) such that ts(b) > ts(a).

A vector clock is an extension of the Lamport clock that maintains an array of N logical clocks [ t1, t2, ... ] - one per node. Rather than incrementing a common counter on internal events, each node increments its own logical clock within the vector. The update rules are therefore:

  • Whenever a process does work, increment its own logical clock value in the vector
  • Whenever a process sends a message, include the full vector of logical clocks
  • When a message is received:
    • update each element in the vector to max(local_value, received_value)
    • increment the logical clock value representing the current node

Again, expressed in code:

function VectorClock(value) {
  // Represented as a hash keyed by node id: e.g. { node1: 1, node2: 3 }
  this.value = value || {};
}

VectorClock.prototype.get = function() {
  return this.value;
};

// Increment the logical clock entry of the given node
VectorClock.prototype.increment = function(nodeId) {
  if(typeof this.value[nodeId] == 'undefined') {
    this.value[nodeId] = 1;
  } else {
    this.value[nodeId]++;
  }
};

// On receiving a message, take the pairwise maximum of the two vectors
VectorClock.prototype.merge = function(other) {
  var result = {}, last,
      a = this.value,
      b = other.value;
  // Iterate over the combined keys, filtering out duplicates
  (Object.keys(a)
    .concat(Object.keys(b)))
    .sort()
    .filter(function(key) {
      var isDuplicate = (key == last);
      last = key;
      return !isDuplicate;
    }).forEach(function(key) {
      result[key] = Math.max(a[key] || 0, b[key] || 0);
    });
  this.value = result;
};

This illustration ( source ) shows a vector clock:

Each node (A, B, C) tracks a vector clock. As events occur, they are timestamped with the current value of the vector clock. Examining a vector clock such as { A: 2, B: 4, C: 1 }, we can accurately identify which messages (potentially) influenced that event.
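
To make that comparison concrete, here is a minimal sketch (my own illustration, not code from Riak or Voldemort) of how two vector clock values can be compared: one event happened before another if every entry of its clock is less than or equal to the other's and at least one entry is strictly smaller; if neither clock dominates the other, the events are concurrent.

function compareVectorClocks(a, b) {
  // a and b are plain objects keyed by node id, e.g. { A: 2, B: 4 }
  var keys = Object.keys(a).concat(Object.keys(b)),
      aLessOrEqual = true,
      bLessOrEqual = true;
  keys.forEach(function(key) {
    var av = a[key] || 0,
        bv = b[key] || 0;
    if (av > bv) { aLessOrEqual = false; }
    if (bv > av) { bLessOrEqual = false; }
  });
  if (aLessOrEqual && bLessOrEqual) { return 'equal'; }
  if (aLessOrEqual) { return 'before'; }   // a happened before b
  if (bLessOrEqual) { return 'after'; }    // b happened before a
  return 'concurrent';                     // neither dominates: concurrent events
}

// { A: 2, B: 4, C: 1 } vs { A: 3, B: 4, C: 1 } -> 'before'
console.log(compareVectorClocks({ A: 2, B: 4, C: 1 }, { A: 3, B: 4, C: 1 }));
// { A: 2, B: 4, C: 1 } vs { A: 1, B: 5, C: 1 } -> 'concurrent'
console.log(compareVectorClocks({ A: 2, B: 4, C: 1 }, { A: 1, B: 5, C: 1 }));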

The problem with vector clocks is mainly that they require one entry per node, which means they can get very large for large systems. Various techniques have been employed to reduce the size of the vector clock (either by doing periodic garbage collection or by limiting the size to reduce accuracy).

We've seen how to track order and causality without a physical clock. Now, let's look at how durations of time can be used as a cutoff.

Failure detectors (time for cutoff)

As I said before, the length of the wait time can provide clues as to whether the system is partitioned or is simply experiencing high latency. In this case, we don't need to assume a perfectly accurate global clock - it's enough to have a sufficiently reliable local clock.

Given a program running on one node, how does it know that the remote node has failed? In the absence of precise information, we can deduce that an unresponsive remote node has failed after a reasonable amount of time has elapsed.

But what is a "reasonable amount"? It depends on the latency between the local node and the remote node. Instead of explicitly specifying an algorithm with a specific value (which is bound to go wrong in some cases), it's better to use an appropriate abstraction to handle it.

Fault detectors are a way to abstract away assumptions about exact timing. Failure detectors are implemented using heartbeat messages and timers. Processes exchange heartbeat messages. If a message response is not received before the timeout occurs, the process suspects another process.
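
As a rough illustration (not modeled on any particular system), a naive timeout-based failure detector might look like the sketch below: record the last heartbeat received from each peer, and suspect any peer that has been silent for longer than the timeout.

function FailureDetector(timeoutMs) {
  this.timeoutMs = timeoutMs;
  this.lastHeartbeat = {}; // node id -> time of last heartbeat received
}

// Call this whenever a heartbeat message arrives from a peer
FailureDetector.prototype.heartbeat = function(nodeId) {
  this.lastHeartbeat[nodeId] = Date.now();
};

// A peer is suspected if we have not heard from it within the timeout
FailureDetector.prototype.isSuspected = function(nodeId) {
  var last = this.lastHeartbeat[nodeId];
  if (typeof last === 'undefined') { return true; }
  return (Date.now() - last) > this.timeoutMs;
};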

A timeout-based failure detector runs the risk of being either overly aggressive (declaring a node to have failed too early) or overly conservative (taking a long time to detect a crash). How accurate do failure detectors need to be in order to be usable?

Chandra et al. (1996) discuss failure detectors in the context of solving the consensus problem - a particularly relevant problem, since it underlies most replication schemes where replicas need to agree in spite of delays and network partitions.

They describe fault detectors using two properties, completeness and accuracy:

Strong completeness.

Every crashed process is eventually suspect by every correct process.

Weak completeness.

Every process that crashes is eventually suspected by some correct process.

Strong accuracy.

None of the correct processes can be suspected.

weak accuracy.

Some correct processes are never in doubt.

Completeness is easier to achieve than accuracy; indeed, all failure detectors of importance achieve it - all you need to do is not wait forever before suspecting someone. Chandra et al. show that a failure detector with weak completeness can be transformed into one with strong completeness (by broadcasting information about suspected processes), which allows us to concentrate on the spectrum of accuracy properties.

Avoiding false suspicions about non-faulty processes is difficult unless you can assume that there is a hard maximum on message delays. That assumption can be made in a synchronous system model - so failure detectors can have strong accuracy in such systems. Under system models that do not impose hard bounds on message delay, failure detection can at best be eventually accurate.

Chandra et al. show that even a very weak failure detector - the eventually weak failure detector ⋄W (eventual weak accuracy + weak completeness) - can be used to solve the consensus problem. The diagram below (from the paper) illustrates the relationship between system models and problem solvability:

As you can see above, certain problems cannot be solved without failure detectors in an asynchronous system. This is because without a failure detector (or strong assumptions about time bounds, such as a synchronization system model), it is impossible to determine whether a remote node has crashed or is simply experiencing high latency. This distinction is important for any system pursuing single-copy consistency: failed nodes can be ignored because they do not cause divergence, but partitioned nodes cannot be safely ignored.

How to implement a fault detector? Conceptually, a simple failure detector doesn't do much beyond detecting failures when a timeout expires. The most interesting part has to do with how to tell if a remote node has failed.

Ideally, we would like the failure detector to be able to adapt to changing network conditions and avoid hardcoding timeout values ​​in it. For example, Cassandra uses a Phi cumulative failure detector , which is a failure detector that outputs a suspect level (a value between 0 and 1) rather than a binary "up" or "down" judgment. This allows applications using fault detectors to make autonomous decisions based on the trade-off between accurate and early detection.

Timing, Sequence and Performance

Earlier, I mentioned the need to pay for order. What do I mean?

If you're writing a distributed system, chances are you have more than one computer. The natural (and realistic) view of the world is a partial order, not a total order. You can convert a partial order to a total order, but this requires communication, waits, and imposes a limit on the number of computers that can do work at any given point in time.

All clocks are approximations, bounded either by network latency (logical time) or by physics. Keeping even a simple integer counter in sync across multiple nodes is a challenge.

While time and order are often discussed together, time itself is not a particularly useful property. Algorithms don't really care about timing, but more about more abstract properties:

  • causal order of events
  • Failure detection (e.g. approximation of an upper bound on message passing)
  • Consistent snapshots (e.g. the ability to inspect the state of the system at a point in time; not discussed here)

It is possible to impose a total order, but it is expensive: it requires everyone to proceed at the common (slowest) speed. Often the easiest way to ensure that events are delivered in some defined order is to nominate a single (bottleneck) node through which all operations pass.

Is timing/ordering/synchronicity really necessary? It depends on the situation. In some use cases, we want each intermediate operation to move the system from one consistent state to another. For example, in many cases we want the responses we get from the database to be representative of all available information, and we want to avoid dealing with issues where the system might return inconsistent results.

But in other cases we may not need as much timing/ordering/synchronization. For example, if you're running a long-running computation and don't really care what the system does until the last moment - then you don't need much synchronization as long as you can guarantee the answer is correct.

Synchronization is generally applied to all operations, but only a few cases actually have an impact on the final result. When is order required for correctness? The CALM theorem - which I discuss in the final chapter - provides an answer.

In other cases, it is acceptable to give an answer that only represents the best known estimate, that is, one based on only part of the information contained in the system. In particular, during a network partition one may need to answer queries using only part of the system. In other use cases, end users cannot really distinguish between a relatively recent answer that is cheap to obtain and an answer that is guaranteed to be correct but expensive to compute. For example, does user X have N or N+1 Twitter followers? Or are movies A, B, and C the absolute best answers for some query? Doing a cheaper, mostly correct "best effort" can be acceptable.

In the next two chapters, we examine replication for fault-tolerant strongly consistent systems—systems that provide strong guarantees while becoming increasingly resilient. These systems provide a solution for the first case: when you need guarantees of correctness and are willing to pay for it. We then discuss systems with weak consistency guarantees that are still usable in the face of partitions but can only give "best effort" answers.


further reading

Lamport Clock, Vector Clock

Fault detection

snapshot

Causality

4. Replication

The replication problem is one of many problems in distributed systems. I've chosen to focus on it over other problems such as leader election, failure detection, mutual exclusion, consensus and global snapshots because it is often the part that people are most interested in. For example, one way in which parallel databases are differentiated is in terms of their replication features. Furthermore, replication provides a context for many subproblems, such as leader election, failure detection, consensus and atomic broadcast.

Replication is a group communication problem. What arrangement and communication pattern gives us the performance and availability characteristics we desire? How can we ensure fault tolerance, durability and non-divergence in the face of network partitions and simultaneous node failures?

Again, there are many ways to replicate. The approach I'm taking here is just from the perspective of a possible high-level pattern for a system with replication capabilities. Visually, this helps keep the discussion focused on the overall pattern rather than specific messaging. My goal here is to explore the design space, not to explain the specifics of each algorithm.

First let's define what replication looks like. Let's assume we have an initial database, and clients will make requests to change the state of the database.

Scheduling and communication patterns can be broken down into the following phases:

  1. (request) The client sends a request to the server
  2. (synchronous) The synchronous part of replication occurs
  3. (response) The response is returned to the client
  4. (asynchronous) The asynchronous part of the replication occurs

This model is loosely based on this article. Note that the pattern of messages exchanged in each portion of the task depends on the specific algorithm: I am intentionally trying to get by without discussing the specific algorithm.

Given these stages, what type of communication patterns can we create? How does the pattern we choose affect performance and availability?

synchronous replication

The first mode is synchronous replication (also known as active replication, or eager replication, or push replication, or pessimistic replication). Let's draw what it looks like:

Here we can see three distinct phases: First, the client sends the request. Next, the part of what we call synchronous replication kicks in. This term refers to the fact that the client is blocked - waiting for a reply from the system.

During the sync phase, the first server contacts the other two and waits until it receives replies from all the other servers. Finally, it sends a response to the client, notifying its outcome (such as success or failure).

All of this seems simple enough. Without discussing the details of the algorithm used in the synchronous phase, what can we say about this particular arrangement of communication? First, observe that this is a write N-of-N approach: before a response is returned, the update has to be seen and acknowledged by every server in the system.

From a performance standpoint, this means that the system will only be as slow as the slowest server. At the same time, the system is very sensitive to changes in network latency, because it needs to wait for each server to reply before proceeding.

Given the N-of-N approach, the system cannot tolerate the loss of any server. When one server is lost, the system can no longer write to all nodes, so it cannot proceed. In this design, it might be possible to provide read-only access to data, but not allow modification after a node failure.

This arrangement can provide very strong durability guarantees: when the response comes back, the client can be sure that all N servers have received, stored, and acknowledged the request. In order to lose one accepted update, all N replicas need to be lost, which is a very good guarantee.
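
A toy sketch of the N-of-N write path (an in-memory simulation of the pattern, not real networking code): the coordinator only acknowledges the client once every replica has acknowledged the update, so the response time is dictated by the slowest replica and a single unreachable replica makes the write fail.

// Each replica is represented by a function that returns a Promise of an ack.
function synchronousWrite(replicas, update) {
  // Wait for *all* replicas before replying to the client
  return Promise.all(replicas.map(function(replica) {
    return replica(update);
  })).then(function() {
    return { ok: true };
  });
}

// Example: three in-memory replicas that append updates to a log
var logs = [[], [], []];
var replicas = logs.map(function(log) {
  return function(update) {
    log.push(update);
    return Promise.resolve('ack');
  };
});

synchronousWrite(replicas, { key: 'x', value: 1 }).then(function(result) {
  console.log(result, logs); // the client is only acknowledged after all three acks
});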

asynchronous replication

Let's contrast this with the second pattern - asynchronous replication (also known as passive replication, or pull replication, or delayed replication). As you might have guessed, this is the opposite of synchronous replication:

Here, the master server (/leader/coordinator) immediately sends a reply to the client. It might store the updates locally, but it doesn't do any significant work synchronously, and the client doesn't need to wait for more rounds of communication between the servers.

At a later stage, the asynchronous part of the replication task begins. Here, the master server contacts other servers using some communication pattern, and the other servers update their copies of the data. The exact details depend on the algorithm used.

What can we say about this particular arrangement without going into algorithmic details? Well, it is a write 1-of-N approach: the response is returned immediately, and update propagation happens sometime later.

From a performance perspective, this means that the system is fast: clients don't need to spend extra time waiting for the system's internals to perform work. The system is also more tolerant to network latency, since fluctuations in internal latency do not cause clients to wait extra.

This arrangement can only provide weak or probabilistic guarantees of durability. If nothing goes wrong, the data is eventually replicated to all N machines. However, if the only server containing the data is lost before then, the data will be permanently lost.

Given the 1-of-N approach, the system can remain available as long as at least one node is running (at least in theory, although in practice the load may be too high). A purely lazy approach like this offers no guarantees of durability or consistency; you may be allowed to write to the system, but if any failure occurs, there is no guarantee that you will be able to read what you wrote.
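
For contrast, a sketch of the 1-of-N pattern under the same toy setup: the primary acknowledges as soon as its own copy is updated and propagates the update in the background, so an ill-timed failure of the primary can lose the write.

// Minimal simulation of asynchronous (1-of-N) replication.
function asynchronousWrite(primaryLog, backupLogs, update) {
  primaryLog.push(update);          // durable on one node only
  setTimeout(function() {           // propagation happens later...
    backupLogs.forEach(function(log) {
      log.push(update);             // ...and never happens if the primary is lost first
    });
  }, 0);
  return { ok: true };              // the client is acknowledged immediately
}

var primary = [], backups = [[], []];
console.log(asynchronousWrite(primary, backups, { key: 'x', value: 1 }));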

Finally, it's worth noting that passive replication cannot guarantee that all nodes in the system always contain the same state. If you accept writes at multiple locations and do not require those nodes to synchronously agree, then you run the risk of divergence: reads may return different results from different locations (especially after node failure and recovery), and there is no way to enforce global constraints (which requires communicating with everyone).

I didn't really mention the communication pattern during reads (rather than writes), because the pattern of reads is really determined from the pattern of writes: during reads, you want to contact as few nodes as possible . We discuss this in more detail in the context of quorum.

We've only discussed two basic arrangements and no specific algorithms. Yet we've already been able to say quite a bit about the possible communication patterns, as well as their performance, durability guarantees and availability characteristics.

Overview of the main replication methods

After discussing the two basic replication methods: synchronous and asynchronous, let's look at the main replication algorithms.

There are many different ways to classify replication techniques. The second difference I want to introduce (after synchronous vs asynchronous) is:

  • Replication method to prevent divergence (single-replica system)
  • Replication method with risk of divergence (multi-master system)

The first group of methods has the property of "operating as a single system". In particular, the system ensures that only one replica is active in the event of a partial failure. Additionally, the system ensures that replicas are always consistent. This is called a consensus problem.

If all processes (or computers) agree on a value, it is said to achieve consensus. More formally:

  1. Agreement: Every correct process must agree on the same value.
  2. Integrity: Every correct process decides at most one value, and if it decides a value, that value must have been proposed by some process.
  3. Termination: All processes eventually reach a decision.
  4. Validity: If all correct processes propose the same value V, then all correct processes decide on V.

Mutual exclusion, leader election, multicast, and atomic broadcast are all instances of the more general consensus problem. A replicated system that maintains single-copy consistency needs to solve the consensus problem in some way.

Replication algorithms that maintain the consistency of a single replica include:

  • 1n messages (asynchronous primary/backup)
  • 2n messages (synchronous primary/backup)
  • 4n messages (2PC, multiple Paxos)
  • 6n messages (3PC, Paxos repeated leader elections)

These algorithms differ in their fault tolerance (for example, the types of failures they can tolerate). I categorized them simply by the number of messages exchanged during the execution of the algorithm, because I thought it would be interesting to try to find an answer to the question "What do we gain by increasing message exchanges?"

The diagram below, adapted from Google's Ryan Barrett, describes some aspects of the different options:

In the diagram above, the consistency, latency, throughput, data loss, and failover characteristics can actually be traced back to two different replication methods: synchronous replication (eg, waiting for a response) and asynchronous replication. Performance is worse when you wait, but reliability is stronger. The difference in throughput between 2PC and quorum systems will become apparent when we discuss partition (and latency) tolerance.

In this diagram, algorithms with weak (/eventual) consistency are grouped together ("gossip"). However, I will discuss weakly consistent replication methods - gossip and (partial) quorum systems - in more detail. "Transaction" rows are really more about global predicate evaluation, which is not supported in systems with weak consistency (although local predicate evaluation can be supported).

It is worth noting that systems with weak consistency requirements have fewer general-purpose algorithms and more techniques that can be selectively applied. Since a system that does not enforce single-copy consistency is free to act like a distributed system consisting of multiple nodes, there are fewer obvious objectives to fix and more focus on giving people a way to reason about the characteristics of the system they have.

for example:

  • A client-centric consistency model attempts to provide clearer consistency guarantees while allowing for divergence to occur.
  • CRDTs (Convergent and Commutative Replicated Data Types) exploit semilattice properties (associative, commutative, idempotent) of certain state and operation-based data types.
  • Confluence analysis (as used in the Bloom language) uses information about the monotonicity of computations to maximally exploit disorder.
  • PBS (Probabilistically Bounded Staleness) uses simulations and information gathered from real-world systems to characterize the expected behavior of partial quorum systems.

I'll talk about all of this in detail later, but first let's look at a replication algorithm that maintains the consistency of a single copy.

Primary/Backup Replication

Primary/backup replication (also known as primary copy, master-slave replication, or log shipping) is probably the most common replication method, and the most basic algorithm. All updates are performed on the primary server, and a log of operations (or the resulting changes) is shipped across the network to the backup replicas. There are two variants:

  • asynchronous primary/backup replication and
  • Synchronous primary/backup replication

The synchronous version requires two messages ("update" + "acknowledged receipt"), while the asynchronous version requires only one message ("update").

P/B (primary/backup) replication is very common. For example, MySQL replication uses the asynchronous variant by default. MongoDB also uses P/B (with some additional failover procedures). All operations are performed on the primary server, which serializes them to a local log that is then replicated asynchronously to the backup servers.

As we discussed earlier in the context of asynchronous replication, any asynchronous replication algorithm can only provide weak consistency guarantees. In MySQL replication, this manifests as replication lag: an asynchronous backup is always at least one operation behind the master. If the primary server fails, updates that have not been sent to the backup will be lost.

The synchronous variant of primary/backup replication ensures that writes are stored on other nodes before returning them to clients - but at the expense of waiting for responses from other replicas. However, it is worth noting that even this variant only provides weak guarantees. Consider the following simple failure scenario:

  • The primary server receives write operations and sends them to the backup server
  • The backup server persists and confirms the write operation
  • Then, before sending an acknowledgment to the client, the primary server fails

The client now assumes that the commit failed, but the backup was committed; this would be incorrect if the backup was promoted to primary. Manual cleanup may be required to reconcile failed primary nodes or inconsistent backups.

Of course, I'm simplifying here. While all primary/backup replication algorithms follow the same general message pattern, they differ in how they handle failover, replicas being offline for long periods of time, etc. However, in this scheme, there is no way to be resilient to untimely failure of the master node.

The key point about primary/backup (log shipping) schemes is that they can only offer best-effort guarantees (e.g. they are prone to lost or incorrect updates if a node fails at an inopportune time). Additionally, primary/backup schemes are prone to split-brain, where a failover caused by a temporary network issue leaves both the primary and the backup active at the same time.

To prevent untimely failures from causing consistency guarantees to be violated, we need to add another round of message passing, which leads to a two-phase commit protocol (2PC).

Two Phase Commit (2PC)

Two-phase commit (2PC) is a protocol used in many classical relational databases. For example, MySQL Cluster (not to be confused with regular MySQL) provides synchronous replication using 2PC. The following diagram illustrates the message flow:

[ Coordinator ] -> OK to commit?        [ Peers ]
                <- Yes / No

[ Coordinator ] -> Commit / Rollback    [ Peers ]
                <- ACK

In the first phase (voting phase), the coordinator sends updates to all participants. Each participant processes the update and votes on whether to commit or abort. When voting to commit, participants store updates in staging areas (write-ahead logs). Updates are considered interim until the second phase is complete.

In the second phase (decision making), the coordinator decides on the outcome and informs each participant. If all participants vote to commit, the update is fetched from the staging area and becomes permanent.

Setting up a second phase before committing changes is useful because it allows the system to roll back updates if a node fails. In contrast, in primary/backup ("1PC"), there is no step to rollback operations, and replicas may diverge when some nodes fail while others succeed.

2PC is prone to blocking situations, as the failure of a single node (participant or coordinator) blocks progress until the node recovers. Recovery is usually achieved through a second phase, during which other nodes are informed of the state of the system. Note that 2PC assumes that data in each node's stable storage is never lost, and that no node crashes forever. Data loss can still occur if data in stable storage is corrupted in a crash.

The details of the recovery process during a node failure are complex, so I won't go into the specifics. The main tasks include ensuring that writes to disk are durable (e.g., flushed to disk rather than cache), and ensuring correct recovery decisions are made (e.g., learning the outcome of a round and then redoing or undoing updates locally).
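
As a toy, in-memory sketch of the coordinator's two phases (my own illustration; it ignores persistence, timeouts and recovery, which is where most of the real complexity lives):

// Each participant exposes prepare(update) -> 'yes'/'no', commit() and rollback().
function twoPhaseCommit(participants, update) {
  // Phase 1 (voting): every participant stages the update and votes
  var votes = participants.map(function(p) { return p.prepare(update); });
  var allYes = votes.every(function(v) { return v === 'yes'; });

  // Phase 2 (decision): commit only on a unanimous 'yes', otherwise roll back
  participants.forEach(function(p) {
    if (allYes) { p.commit(); } else { p.rollback(); }
  });
  return allYes ? 'committed' : 'aborted';
}

// Example participant keeping a staging area (a stand-in for a write-ahead log)
function makeParticipant() {
  var data = {}, staged = null;
  return {
    prepare: function(update) { staged = update; return 'yes'; },
    commit: function() { data[staged.key] = staged.value; staged = null; },
    rollback: function() { staged = null; },
    read: function(key) { return data[key]; }
  };
}

var peers = [makeParticipant(), makeParticipant()];
console.log(twoPhaseCommit(peers, { key: 'x', value: 1 })); // 'committed'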

As we learned in the chapter on CAP, 2PC is a CA protocol - it is not partition tolerant. The failure model handled by 2PC does not include network partitions; the prescribed way to recover from node failure is to wait until the network partition heals. If the coordinator fails, there is no safe way to promote a new coordinator; manual intervention is required instead. 2PC is also very sensitive to latency: since it is a write N-of-N approach, a write cannot proceed until the slowest node acknowledges it.

2PC strikes a nice balance between performance and fault tolerance, which is why it is popular in relational databases. However, newer systems often use partition-tolerant consensus algorithms because such algorithms can provide automatic recovery from temporary network partitions, as well as handle internode latency increases more gracefully.

Let's look at a partition-tolerant consensus algorithm.

Partition-tolerant consensus algorithms

Partition-tolerant consensus algorithms are as far as we will go in terms of fault-tolerant algorithms that maintain single-copy consistency. There is a further class of fault-tolerant algorithms: those that tolerate arbitrary (Byzantine) faults, including node failures caused by malicious behavior. Such algorithms are rarely used in commercial systems because they are more expensive to run and more complicated to implement - so I will ignore them here.

When it comes to partition-tolerant consensus algorithms, the most well-known algorithm is the Paxos algorithm. However, it is widely considered to be difficult to implement and explain, so I will focus on the Raft algorithm, which is a more recent (around early 2013) algorithm designed to be easier to teach and implement. Let's first look at the general properties of network partitions and partition-tolerant consensus algorithms.

What is a network partition?

A network partition is the failure of a network link to one or more nodes. The nodes themselves remain active and may even be able to receive requests from clients on their side of the network partition. As we learned earlier - while discussing the CAP theorem - network partitions do happen, and not all systems handle them gracefully.

Network partitions are tricky because during a network partition there is no way to distinguish between a remote node failure and a node being unreachable. If a network partition occurs without a node failure, the system is divided into two simultaneously active partitions. The two diagrams below illustrate how a network partition is similar to a node failure.

A system of 2 nodes where one node fails with a network partition:

A system of 3 nodes, failure or network partition:

A system that enforces single-replica consistency must have some means of breaking symmetry: otherwise, it will split into two independent systems that can diverge from each other and can no longer maintain the illusion of a single replica.

In systems that enforce single-replica consistency, network partition tolerance requires that the system keep only one partition alive during a network partition, since there is no way to prevent divergence during a network partition (e.g., the CAP theorem).

majority decision

This is why partition-tolerant consensus algorithms rely on majority votes. Requiring a majority of nodes - rather than all nodes (as in 2PC) - to agree on an update allows a minority of nodes to be down, slow, or unreachable due to a network partition. As long as (N/2 + 1)-of-N nodes are up and reachable, the system can continue to operate.

A partition-tolerant consensus algorithm uses an odd number of nodes (e.g. 3, 5, or 7). With only two nodes, a clear majority cannot be reached after a failure. For example, if the number of nodes is three, the system can tolerate one node failure; if the number of nodes is five, the system can tolerate two node failures.
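
In code form, the majority size and the number of tolerated failures follow directly from N (a trivial arithmetic sketch):

// Majority quorum size and tolerated failures for an N-node cluster
function majority(n) { return Math.floor(n / 2) + 1; }
function toleratedFailures(n) { return n - majority(n); }

[3, 5, 7].forEach(function(n) {
  console.log(n + ' nodes: majority ' + majority(n) +
              ', tolerates ' + toleratedFailures(n) + ' failure(s)');
});
// 3 nodes: majority 2, tolerates 1 failure
// 5 nodes: majority 3, tolerates 2 failures
// 7 nodes: majority 4, tolerates 3 failures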

When a network partition occurs, the partitions behave asymmetrically. At most one partition can contain a majority of the nodes. Minority partitions stop processing operations to prevent divergence during the partition, while the majority partition can remain active. This ensures that only a single copy of the system state remains active.

Majorities are also useful because they tolerate dissent: nodes can vote differently if there is interference or failure. However, since there can only be one majority decision, temporary dissent can at best prevent the protocol from proceeding (abandoning liveness), but cannot violate the single-replica consistency criterion (the safety property).

Roles

There are two ways to organize a system: all nodes can have the same responsibility, or nodes can have separate, distinct roles.

Replicated consensus algorithms often choose to assign different roles to each node. Having a single fixed leader or master server is an optimization that makes the system more efficient because we know that all updates must go through that server. Nodes that are not leaders only need to forward their requests to the leader.

Note that having different roles does not mean that the system cannot recover from failure of the leader (or any other role). Just because roles are fixed during normal operation, doesn't mean that after a failure it cannot be recovered by reassigning roles (e.g. through the leader election phase). Nodes can reuse the results of leader election until node failure and/or network partition occurs.

Both Paxos and Raft use distinct node roles. In particular, they have a leader node (called a "proposer" in Paxos) that is responsible for coordination during normal operation. During normal operation, the rest of the nodes are followers (called "acceptors" or "voters" in Paxos).

Epochs

In both Paxos and Raft, each period of normal operation is called an epoch (a "term" in Raft). During each epoch only one node is designated as the leader (a similar system was used in Japan, where the era name changes upon the transition of emperors).

After a successful election, the same leader coordinates until the end of the epoch. As shown in the diagram above (taken from the Raft paper), some elections may fail, causing the epoch to end immediately.

Epochs act as a logical clock, allowing other nodes to identify when an outdated node starts communicating - nodes that were partitioned or out of operation will have a smaller epoch number than the current one, and their commands are ignored.

Leader changes via duels

During normal operation, a partition-tolerant consensus algorithm is rather simple. As we saw earlier, if we didn't care about fault tolerance we could simply use 2PC. Most of the complexity really arises from ensuring that once a consensus decision has been reached it is never lost, and that the protocol can handle leader changes caused by network or node failures.

All nodes start as followers; one node is elected leader at the start. During normal operation, the leader maintains a heartbeat, which allows the followers to detect if the leader fails or becomes partitioned.

When a node detects that the leader has become unresponsive (or, in the initial case, that no leader exists), it switches to an intermediate state (called "candidate" in Raft) in which it increments the term/epoch value by one, initiates a leader election, and competes to become the new leader.

In order to be elected leader, a node must receive a majority of the votes. One way to assign votes is simply first-come-first-served; this way, a leader will eventually be elected. Adding a random amount of waiting time between attempts at getting elected reduces the number of nodes simultaneously trying to get elected.

Numbered proposals within an epoch

During each epoch, the leader proposes one value at a time to be voted upon. Within each epoch, each proposal is numbered with a unique, strictly increasing number. The followers (voters/acceptors) accept the first proposal they receive for a particular proposal number.

Normal operation

During normal operation, all proposals go through the leader node. When a client submits a proposal (e.g. an update operation), the leader contacts all nodes in the quorum. If no competing proposals exist (based on the responses from the followers), the leader proposes the value. If a majority of the followers accept the value, then the value is considered to be accepted.

Since it is possible that another node is also attempting to act as a leader, we need to ensure that once a single proposal has been accepted, its value can never change. Otherwise a proposal that has already been accepted might be reverted by a competing leader. Lamport states this as:

P2: If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v.

Ensuring that this property holds requires that both followers and proposers are constrained by the algorithm from ever changing a value that has been accepted by a majority. Note that "the value can never change" refers to the value of a single execution (or run / instance / decision) of the protocol. A typical replication algorithm will run multiple executions of the algorithm, but most discussions of the algorithm focus on a single run to keep things simple. We want to prevent the decision history from being altered or overwritten.

In order to enforce this property, the proposer must first ask the followers for their (highest-numbered) accepted proposal and value. If the proposer finds out that a proposal already exists, then it must simply complete this execution of the protocol, rather than making its own proposal. Lamport states this as:

P2b. If a proposal with value v is chosen, then every higher-numbered proposal issued by any proposer has value v.

More specifically:

P2c. For any v and n, if a proposal with value v and number n is issued [by a leader], then there is a set S consisting of a majority of acceptors [followers] such that either (a) no acceptor in S has accepted any proposal numbered less than n, or (b) v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the followers in S.

This is the core of the Paxos algorithm, as well as of the algorithms derived from it. The value to be proposed is not chosen until the second phase of the protocol. Proposers must sometimes simply retransmit a previously made decision to ensure safety (e.g. clause b in P2c) until they reach a point where they know that they are free to impose their own proposal value (e.g. clause a).

If multiple previous proposals exist, the value of the highest-numbered proposal is proposed. Proposers may only attempt to impose their own value if there are no competing proposals at all.

To ensure that no competing proposals emerge between the time the proposer asks each acceptor about its most recent value and the proposal itself, the proposer asks the followers not to accept proposals with lower proposal numbers than the current one.

Putting the pieces together, reaching a decision using Paxos requires two rounds of communication:

[ Proposer ] -> Prepare(n)                                 [ Followers ]
             <- Promise(n; previous proposal number
                and previous value accepted)

[ Proposer ] -> AcceptRequest(n, own value, or the value   [ Followers ]
                associated with the highest proposal
                number reported by the followers)
             <- Accepted(n, value)

The prepare phase allows the proposer to learn of any competing or previous proposals. The second phase is where either a new value or a previously accepted value is proposed. In some cases - such as when two proposers are active at the same time (dueling), when messages are lost, or when a majority of the nodes have failed - no proposal is accepted by a majority. But this is acceptable, because the rules for deciding which value to propose converge toward a single value (the one with the highest proposal number from the previous attempt).
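
To make the follower's side of these two rounds concrete, here is a minimal single-decree acceptor sketch (my own illustration; it omits the proposer, persistence, and retries):

// Single-decree Paxos acceptor (follower) - a sketch, not a full implementation.
function Acceptor() {
  this.promisedN = -1;     // highest proposal number promised
  this.acceptedN = -1;     // highest proposal number accepted
  this.acceptedValue = null;
}

// Phase 1: Prepare(n) -> Promise(n; previously accepted proposal, if any)
Acceptor.prototype.onPrepare = function(n) {
  if (n > this.promisedN) {
    this.promisedN = n;    // promise not to accept proposals numbered below n
    return { promised: true, acceptedN: this.acceptedN, acceptedValue: this.acceptedValue };
  }
  return { promised: false };
};

// Phase 2: AcceptRequest(n, value) -> Accepted(n, value)
Acceptor.prototype.onAcceptRequest = function(n, value) {
  if (n >= this.promisedN) {
    this.promisedN = n;
    this.acceptedN = n;
    this.acceptedValue = value;
    return { accepted: true, n: n, value: value };
  }
  return { accepted: false };
};

The proposer side (not shown) would collect promises from a majority and, if any of them report a previously accepted value, propose the highest-numbered one of those values instead of its own - which is exactly the constraint that P2c describes.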

Indeed, per the FLP impossibility result, this is the best we can do: algorithms that solve the consensus problem must give up either safety or liveness when the guarantees regarding bounds on message delivery do not hold. Paxos gives up liveness: it may have to delay decisions indefinitely until there is no competing leader and a majority of nodes accept a proposal. This is preferable to violating the safety guarantees.

Of course, implementing this algorithm is much harder than it sounds. There are many small concerns which add up to a fairly significant amount of code, even in the hands of experts. These include:

  • practical optimizations:
    • avoiding repeated leader election by using a leadership lease (rather than heartbeats)
    • avoiding repeated propose messages in the stable state, where the leader identity does not change
  • Ensures that followers and proposers do not lose items in stable storage, and that results stored in stable storage are not subject to subtle corruption (e.g. disk corruption)
  • Enable changes in cluster membership in a safe manner (e.g. Paxos-based methods rely on a majority of nodes always intersecting, which is not true if membership can change arbitrarily)
  • Procedures for safely and efficiently updating new replicas to the latest state after a crash, disk loss, or new node configuration
  • Procedures for snapshotting and garbage collecting data that is needed to guarantee safety after a reasonable period of time (e.g., to balance storage requirements with fault tolerance requirements)

Google's Paxos Made Live paper details some of these challenges.

Partition-tolerant consensus algorithms: Paxos, Raft, ZAB

Hope this gives you some idea of ​​how partition tolerant consensus algorithms work. I encourage you to read one of the papers in the further reading section to grasp the specifics of the different algorithms.

Paxos . Paxos is one of the most important algorithms in writing strongly consistent partition-tolerant replication systems. It is used in many Google systems, including the Chubby lock manager , used by BigTable / Megastore , as well as Google Filesystem and Spanner .

Paxos, named after the Greek island of Paxos, was originally proposed by Leslie Lamport in a 1998 paper called "The Part-Time Parliament". It is generally considered hard to implement, and there is a series of papers from companies with considerable distributed systems expertise explaining further practical details (see Further reading). You might want to read Lamport's comments on this issue here and here .

These issues are mainly related to the description of Paxos in single-round consensus decision-making, but real-world implementations usually want to run multiple rounds of consensus in an efficient manner. This has led to the development of many extensions to the core protocol that anyone wishing to build a Paxos-based system needs to understand. Additionally, there are other practical challenges, such as how to facilitate cluster membership changes.

ZAB . ZAB - Zookeeper Atomic Broadcast Protocol for Apache Zookeeper. Zookeeper is a system that provides coordination primitives for distributed systems and is used by many Hadoop-centric distributed systems for coordination (eg HBase , Storm , Kafka ). Zookeeper is basically the open source community version of Chubby. Atomic broadcasting is technically a different problem than pure consensus, but it still falls into the category of fault-tolerant algorithms that ensure strong consistency.

Raft . Raft is the latest (2013) addition to this class of algorithms. It is designed to be easier to teach than Paxos while providing the same guarantees. In particular, the different parts of the algorithm are more clearly separated, and the paper also describes a mechanism for cluster membership changes. It has recently been adopted in etcd , which was inspired by ZooKeeper.

Replication methods with strong consistency

In this chapter, we looked at strongly consistent replication methods. Starting with a comparison of synchronous and asynchronous work, we gradually introduce increasingly complex fault-tolerant algorithms. Here are some key features of each algorithm:

primary/backup

  • Single, static master node
  • Copy the log, the slave node does not participate in the execution operation
  • No upper limit on replication latency
  • Not Partition Tolerant
  • Manual/ad-hoc failover, not fault tolerant, "hot backup" only

two-phase commit

  • Unanimous vote: commit or abort
  • static master node
  • 2PC cannot survive if both the coordinator and one node fail during commit
  • No partition tolerance, sensitive to tail delay

Paxos protocol

  • majority vote
  • dynamic master node
  • Able to tolerate n/2-1 simultaneous failures as part of the protocol
  • less sensitive to tail delay

further reading

Primary/backup and 2PC

Paxos

Raft and ZAB

5. Replication: weak consistency model protocols

Now that we've taken a look at protocols that can enforce single-copy consistency under an increasingly realistic set of failure scenarios, let's turn our attention to the world of options that opens up once we let go of the requirement of single-copy consistency.

By and large, it is hard to come up with a single dimension that defines or characterizes the protocols that allow replicas to diverge. Most such protocols are highly available; the key question is more whether the end users find the guarantees, abstractions, and APIs useful for their purposes, in spite of the fact that replicas may diverge when node and/or network failures occur.

Why haven't weakly consistent systems been more popular?

As I stated in the introduction, I think a large part of distributed programming is about dealing with the implications of two consequences of distribution:

  • information travels at the speed of light
  • independent things fail independently

The limitation on the speed at which information travels means that nodes experience the world in different and unique ways. Computation on a single node is easy because everything happens in a predictable global total order. Computation on a distributed system is difficult because there is no global total order.

For the longest time (e.g. decades of research), we've solved this problem by introducing a global total order. I've discussed the many methods for achieving strong consistency by creating order (in a fault-tolerant manner) where there is no naturally occurring total order.

Of course, the problem is that enforcing order is expensive. This is especially true in large-scale internet systems, where a system needs to remain available. A system that enforces strong consistency doesn't behave like a distributed system: it behaves like a single system, which is bad for availability during a partition.

Furthermore, for each operation a majority of the nodes often must be contacted - and often not just once, but twice (as you saw in the discussion of 2PC). This is particularly painful in systems that need to be geographically distributed in order to provide adequate performance for a global user base.

So behaving like a single system by default is perhaps not desirable.

Perhaps what we want is a system where we can write code that does not use expensive coordination, yet still returns a "usable" value. Instead of having a single truth, we allow different replicas to diverge from each other - both to remain efficient and to tolerate partitions - and then try to find a way to deal with the divergence somehow.

Eventual consistency expresses this idea: nodes may diverge from each other for some period of time, but eventually they will agree on a value.

Within the set of systems providing eventual consistency, there are two types of system designs:

Eventual consistency with probabilistic guarantees. This type of system can detect conflicting writes at some later point, but does not guarantee that the results are equivalent to some correct sequential execution. In other words, conflicting updates will sometimes result in a newer value being overwritten by an older one, and some anomalies can be expected to occur during normal operation (or during partitions).

In recent years, the most influential system design in this category has been Amazon's Dynamo, which I will discuss as an example of a system offering eventual consistency with probabilistic guarantees.

Eventual consistency with strong guarantees. This type of system guarantees that the results converge to a common value equivalent to some correct sequential execution. In other words, such systems do not produce any anomalous results; without any coordination you can build replicas of the same service, and those replicas can communicate in any pattern and receive the updates in any order, and as long as they all see the same information they will eventually agree on the end result.

CRDTs (convergent replicated data types) are data types that are guaranteed to converge to the same value in spite of network delays, partitions, and message reordering. They are provably convergent, but the data types that can be implemented as CRDTs are limited.

The CALM (Consistency as Logical Monotonicity) conjecture is another expression of the same principle: it equates logical monotonicity with convergence. If we can conclude that something is logically monotonic, then it is also safe to run without coordination. Convergence analysis - especially when applied to the Bloom programming language - can be used to guide programmers when and where to use coordination techniques from strongly consistent systems and when it is safe to do without coordination.

Reconcile different order of operations

What would a system that doesn't enforce single-replica consistency look like? Let's make this more concrete by looking at a few examples.

Perhaps the most obvious feature of systems that do not enforce single-replica consistency is the ability to allow replicas to diverge from each other. This means that there is no strictly defined communication pattern: replicas can be separated from each other and still be available and accept writes.

Let's imagine a system of three replicas, each separate from the others. For example, these replicas may be in different data centers and cannot communicate for some reason. During detach, each replica remains available to accept reads and writes from some clients:

[Client] -> [A]

--- partition ---

[Client] -> [B]

--- partition ---

[Client] -> [C]

After a period of time, partitions recover and replicas exchange information. They receive different updates from different clients and diverge from each other, so some form of coordination is required. What we want to happen is that all replicas converge to the same result.

[A] \
  --> [merge]
[B] /     |
        |
[C] ----[Merge]---> Result

Another way to think about a system with weak consistency guarantees is to imagine a set of clients sending messages to two replicas in some order. Since there is no coordinating protocol that enforces a single total order, messages can be delivered to the two replicas in different orders:

[Client] --> [A] 1, 2, 3
[Client] --> [B] 2, 3, 1

Essentially, this is why we need a coordination protocol. For example, suppose we're trying to concatenate a string, and the operations in messages 1, 2, and 3 are:

1: { operation: concat('Hello') }
2: { operation: concat('World') }
3: { operation: concat('!') }

Then, without coordination, A will produce "HelloWorld!" while B will produce "World!Hello":

A: concat(concat(concat('', 'Hello'), 'World'), '!') = 'HelloWorld!'
B: concat(concat(concat('', 'World'), '!'), 'Hello') = 'World!Hello'

This is of course incorrect. What we want to happen is that the replicas converge to the same result.

Keeping these two examples in mind, let's first look at Amazon's Dynamo to establish a benchmark, and then discuss some novel ways to build systems with weak consistency guarantees, such as CRDT and the CALM theorem.

Amazon Dynamo

Amazon's Dynamo system design (2007) is probably the best known system that provides weak consistency guarantees but high availability. It is the basis for many other real-world systems, including LinkedIn's Voldemort, Facebook's Cassandra, and Basho's Riak.

Dynamo is an eventually consistent, highly available key-value store. A key-value store is like a big hash table: clients can use set(key, value) to store a value under a key and get(key) to retrieve it. A Dynamo cluster consists of N peer nodes; each node has a set of keys it is responsible for storing.

Dynamo prioritizes availability over consistency; it does not guarantee single-copy consistency. Instead, replicas may diverge when values are written; when a key is read, there is a read reconciliation phase that attempts to reconcile differences between replicas before returning the value to the client.

For many functions on Amazon, avoiding failures is more important than ensuring that the data is completely consistent, because failures can lead to loss of business and loss of reputation. Furthermore, if the data is not particularly critical, weakly consistent systems can provide better performance and higher availability than traditional relational databases at a lower cost.

Since Dynamo is a complete system design, there are many different parts to consider besides the core replication task. The diagram below illustrates some tasks, specifically how writes are routed to a node and written to multiple replicas.

[client]
  |
(map keys to nodes)
  |
  V
[Node A]
  |     \
(synchronous replication tasks: minimum durability)
  |        \
[Node B] [Node C]
  ^
  |
(conflict detection; asynchronous replication tasks: ensure partition/recovery nodes can recover)
  |
  V
[node D]

After looking at the initial acceptance of write operations, we'll look at how to detect conflicts, as well as asynchronous replica synchronization tasks. Due to the high-availability design, nodes may be temporarily unavailable (down or partitioned), so replica synchronization tasks are required to ensure that nodes can catch up relatively quickly after a failure.

consistent hashing

Whether we are reading or writing, the first thing we need to do is locate where the data is stored in the system. This requires a mapping of keys to nodes.

In Dynamo, keys are mapped to nodes using a hashing technique called consistent hashing (which I won't discuss in detail). The main idea is to map a key to the set of nodes responsible for it, by doing a simple computation on the client. This means that clients can locate keys without querying the system for each key's location; this saves system resources, since hashing is usually faster than performing a remote procedure call.
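
The following is a toy illustration of the idea (a bare hash ring with a deliberately simplistic string hash, nothing like Dynamo's actual partitioning code): nodes are placed at positions on a ring, and a key is assigned to the first node found clockwise from the key's own position.

// Toy consistent hash ring.
function hash(str) {
  var h = 0;
  for (var i = 0; i < str.length; i++) {
    h = (h * 31 + str.charCodeAt(i)) % 360; // positions on a 0..359 "ring"
  }
  return h;
}

function Ring(nodes) {
  // Each node gets a position on the ring based on its name
  this.positions = nodes.map(function(node) {
    return { pos: hash(node), node: node };
  }).sort(function(a, b) { return a.pos - b.pos; });
}

// A key belongs to the first node at or after the key's position (wrapping around)
Ring.prototype.nodeFor = function(key) {
  var pos = hash(key);
  for (var i = 0; i < this.positions.length; i++) {
    if (this.positions[i].pos >= pos) { return this.positions[i].node; }
  }
  return this.positions[0].node; // wrap around the ring
};

var ring = new Ring(['A', 'B', 'C']);
console.log(ring.nodeFor('user:1234')); // locate a key without querying the cluster

In the real Dynamo design, a key maps to a set of N successive nodes on the ring rather than to a single node, and virtual nodes are used to balance load; this sketch omits both.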

partial quorum

Once we know where the key should be stored, we need to do some work to keep the value persistent. This is a synchronous task; the reason we write values ​​to multiple nodes at once is to provide a higher level of durability (e.g. against immediate failure of a node).

Like Paxos or Raft, Dynamo uses quorums for replication. However, Dynamo's quorums are loose (partial) quorums, not strict (majority) quorums.

Informally, a strict quorum system is a quorum system with the property that any two quorums (sets) overlap in the quorum system. Requiring a majority to vote in favor before accepting an update ensures that only one history is accepted, since each majority quorum must overlap on at least one node. For example, Paxos relies on this property.

Partial quorums do not have this property; this means that no majority is required, and different subsets of the quorum may contain different versions of the same data. The user can choose the number of nodes to write and read from:

  • the user can choose the number of nodes (W-of-N) that must acknowledge a write for it to succeed; and
  • the user can specify the number of nodes (R-of-N) to contact during a read operation.

W and R specify the number of nodes that need to participate in a write or a read. Writing to more nodes makes writes slightly slower but increases the probability that the value is not lost; reading from more nodes increases the probability that the value read is up to date.

It is generally recommended that R + W > N, because this means that the read and write quorums overlap in at least one node - making it less likely that stale values are returned. A typical configuration is N = 3 (e.g. a total of three copies of each value); this means that the user can choose between:

 R = 1, W = 3;
 R = 2, W = 2; or
 R = 3, W = 1

More generally, assume again R + W > N:

  • R = 1, W = N: fast reads, slow writes
  • R = N, W = 1: fast writes, slow reads
  • R = N/2 and W = N/2 + 1: balanced for both

N rarely exceeds 3, because keeping so many copies of large amounts of data can become expensive!
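
As a rough sketch of what choosing W and R means operationally (an in-memory simulation with integer versions; a real system would contact replicas concurrently and handle failures and retries), consider:

// Partial quorum writes and reads over an array of replica objects ({ data: {} }).
function quorumWrite(replicas, w, key, value, version) {
  var acks = 0;
  replicas.forEach(function(replica) {
    if (acks < w) {
      replica.data[key] = { value: value, version: version };
      acks++;
    }
  });
  return acks >= w; // write succeeds once W replicas have acknowledged
}

function quorumRead(replicas, r, key) {
  var newest = null;
  replicas.slice(0, r).forEach(function(replica) {
    var entry = replica.data[key];
    if (entry && (!newest || entry.version > newest.version)) {
      newest = entry;
    }
  });
  return newest; // may be stale if R + W <= N or if membership has changed
}

var replicas = [{ data: {} }, { data: {} }, { data: {} }];  // N = 3
quorumWrite(replicas, 2, 'x', 'hello', 1);                  // W = 2
console.log(quorumRead(replicas, 2, 'x'));                  // R = 2 overlaps with the write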

As I mentioned before, the Dynamo paper has inspired many other similar designs. They all use the same partial quorum-based replication approach, but differ in their default values for N, W, and R:

  • Basho's Riak (N = 3, R = 2, W = 2 by default)
  • LinkedIn's Voldemort (N=2 or 3, R=1, W=1 default)
  • Apache Cassandra (N=3, R=1, W=1 default)

There is one more detail: when sending a read or write request, do you ask all N nodes to respond (Riak), or only the minimum number needed to satisfy the requirement (e.g. R or W; Voldemort)? The "send to all" approach is faster and less sensitive to latency (since it only waits for the fastest R or W of the N nodes), but less efficient; the "send to minimum" approach is more sensitive to latency (since the latency of communicating with a single node delays the operation), but more efficient (fewer messages / connections overall).

What happens when the read and write quorums overlap, e.g. (R + W > N)? Specifically, it is often claimed that this results in "strong consistency".

Is R + W > N the same as "strong consistency"?

No.

It's not completely unreasonable: a system where R + W > N can detect read/write conflicts, since any read quorum and any write quorum share a member. E.g. at least one node is in both quorums:

   1     2   N/2+1    N/2+2    N
  [...] [R]  [R + W]   [W]    [...]

This guarantees that a previous write will be seen by a subsequent read. However, this only holds as long as the nodes in N never change. Hence, Dynamo doesn't qualify, because in Dynamo the cluster membership can change if nodes fail.

Dynamo is designed to be always writable. It has a mechanism that handles node failures by adding a different, unrelated server into the set of nodes responsible for certain keys when the original server is down. This means that the quorums are no longer guaranteed to always overlap. Even R = W = N would not qualify, since while the quorum sizes are equal to N, the nodes in those quorums can change during a failure. Concretely, during a partition, if a sufficient number of nodes cannot be reached, Dynamo will add new nodes to the quorum from nodes that are unrelated to the particular key but are accessible.

Furthermore, Dynamo does not handle partitions in the way that a strong consistency model would require: namely, writes are allowed on both sides of a partition, which means that at least for some time the system does not act as a single copy. Therefore, calling R + W > N "strong consistency" is misleading; the guarantee is merely probabilistic - which is not what strong consistency means.

Conflict detection and read repair

A system that allows replicas to diverge must have a way to eventually reconcile two different values. As briefly mentioned in the partial quorum approach, one approach is to detect collisions on read and then apply some conflict resolution. But how is this possible?

Typically, this is done by tracking the causal history of the data and supplementing it with some metadata. Clients must preserve metadata information when reading data from the system and return metadata values ​​when writing to the database.

We've already encountered a way to do this: vector clocks can be used to represent the history of a value. In fact, this is how the original Dynamo was designed to detect conflicts.

However, using a vector clock is not the only option. If you look at many real system designs, you can infer how they work by looking at the metadata they track.

No metadata . When a system does not track metadata and only returns the value (e.g. via a client API), it cannot really do anything special about concurrent writes. A common rule is that the last writer wins: in other words, if two writers write at the same time, only the value of the last write survives.

Timestamps . Here, the value with the higher timestamp wins. However, if time is not carefully synchronized, many strange things can happen, such as old data from a system with a faulty or fast clock overwriting newer values. Facebook's Cassandra is a Dynamo variant that uses timestamps instead of vector clocks.

version number . Version numbers may avoid some of the problems associated with using timestamps. Note that when there are multiple histories, the minimal mechanism to accurately track causality is a vector clock, not a version number.

Vector clock . Using vector clocks, concurrent and stale updates can be detected. Read repair can then be performed, although in some cases (concurrent changes) we need to ask the client to choose a value. This is because if changes are concurrent and we don't know much about the data (like a simple key-value store), then asking is better than randomly dropping data.

When reading a value, the client contacts R of the N nodes and asks them for the latest value for the key. It takes all the responses and discards the ones that are strictly older (detected by comparing the vector clock values). If there is only one unique vector clock + value pair, it returns that pair. If there are multiple vector clock + value pairs that were edited concurrently (i.e. are incomparable), then all of those values are returned.

As shown above, read repair may return multiple values. This means that client/application developers must occasionally choose a value based on some use case specific criteria to handle these cases.
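
Sketching that read-side logic (again an illustration, not Dynamo's or Riak's actual code): keep only the versions whose vector clocks are not dominated by another version's clock, and return all the remaining siblings to the client.

// true if clock a >= b on every entry and > on at least one (a dominates b)
function dominates(a, b) {
  var keys = Object.keys(a).concat(Object.keys(b)),
      greaterSomewhere = false;
  for (var i = 0; i < keys.length; i++) {
    var key = keys[i],
        av = a[key] || 0,
        bv = b[key] || 0;
    if (av < bv) { return false; }
    if (av > bv) { greaterSomewhere = true; }
  }
  return greaterSomewhere;
}

// Given versions [{ clock: {...}, value: ... }], drop the strictly older ones
function readRepair(versions) {
  return versions.filter(function(candidate) {
    return !versions.some(function(other) {
      return dominates(other.clock, candidate.clock);
    });
  });
}

// One obsolete version and two concurrent ones: the client gets both siblings
console.log(readRepair([
  { clock: { A: 1 },       value: 'old' },
  { clock: { A: 2 },       value: 'newer' },
  { clock: { A: 1, B: 1 }, value: 'concurrent edit' }
]));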

Also, a key component of a practical vector clock system is that the clock cannot grow forever - so there needs to be a program that periodically garbage collects the clock in a safe manner to balance fault tolerance with storage requirements.

Replica synchronization: gossip and Merkle trees

Given that the Dynamo system is designed to be tolerant to node failures and network partitions, it needs a way to handle nodes rejoining the cluster, either because of rejoining after a partition, or because a failed node was replaced or partially recovered.

Replica synchronization is used to update nodes to the latest state after a failure and to periodically synchronize replicas with each other.

Gossip is a probabilistic technique for synchronizing replicas. The pattern of communication (e.g. which node contacts which node) is not predetermined. Instead, nodes have some probability p of attempting to synchronize with each other. Every t seconds, each node picks a node to communicate with. This provides an additional mechanism, beyond the synchronous task (e.g. partial quorum writes), to keep replicas up to date.

Gossip is scalable and has no single point of failure, but can only provide probabilistic guarantees.
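
A sketch of one gossip round under these assumptions (the merge here is a simplistic highest-version-wins per key; real anti-entropy would exchange summaries such as the Merkle trees described next):

// Each round, every node picks one random peer and the two merge their stores.
function gossipOnce(nodes) {
  nodes.forEach(function(node) {
    var peer = nodes[Math.floor(Math.random() * nodes.length)];
    if (peer === node) { return; }
    mergeStores(node.store, peer.store);
    mergeStores(peer.store, node.store);
  });
}

// Key-wise merge: keep the entry with the higher version number
function mergeStores(target, source) {
  Object.keys(source).forEach(function(key) {
    if (!target[key] || source[key].version > target[key].version) {
      target[key] = source[key];
    }
  });
}

var nodes = [
  { store: { x: { value: 1, version: 1 } } },
  { store: {} },
  { store: { y: { value: 2, version: 1 } } }
];
gossipOnce(nodes); // repeated rounds make the stores converge with high probability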

In order to make the information exchange during replica synchronization efficient, Dynamo uses a technique called Merkle trees, which I will not cover in detail. The key idea is that a data store can be hashed at multiple different levels of granularity: a hash representing the whole content, one covering half of the keys, one covering a quarter of the keys, and so on.

By maintaining this fairly granular hashing, nodes can compare their data store contents much more efficiently than with a naive technique. Once the nodes have identified which keys have different values, they exchange the necessary information to bring the replicas up to date.

Dynamo in practice: probabilistically bounded staleness (PBS)

That pretty much covers the design of the Dynamo system:

  • consistent hashing to determine key placement
  • partial quorums for reading and writing
  • conflict detection and read repair via vector clocks, and
  • gossip for replica synchronization

How can we characterize the behavior of such a system? A fairly recent paper by Bailis et al. (2012) describes an approach called PBS (probabilistically bounded staleness), which uses simulation and data collected from real-world systems to characterize the expected behavior of such a system.

PBS estimates the degree of inconsistency by using information about the rate of anti-entropy (gossip), network latency, and local processing latency to estimate the expected level of read consistency. It has been implemented in Cassandra, where timing information is piggybacked on other messages and an estimate is computed based on a sample of that information in a Monte Carlo simulation.

Based on the paper, during normal operation eventually consistent data stores are often faster and can read a consistent state within tens or hundreds of milliseconds. The table below illustrates the amount of time needed for a 99.9% probability of consistent reads, given different R and W settings, based on empirical timing data from LinkedIn (SSDs and 15k RPM disks) and Yammer:

For example, in the Yammer case, going from R=1, W=1 to R=2, W=1 reduces the inconsistency window from 1352 ms to 202 ms, while keeping read latencies (32.6 ms) lower than the fastest strict quorum (R=3, W=1; 219.27 ms).

For more details, check out the PBS website and related papers.

out-of-order programming

Let's go back to the examples of the situations we'd like to resolve. The first scenario involved three different nodes holding different values after a partition; we wanted the nodes to converge to the same value after the partition healed. Amazon's Dynamo makes this possible by reading from R out of N nodes and then performing read reconciliation.

In the second example, we considered a more specific operation: string concatenation. It turns out that there is no known technique for making string concatenation converge to the same value without imposing an order on the operations (that is, without expensive coordination). However, there are operations that can safely be applied in any order, where a simple register cannot. As Pat Helland wrote:

...operation-centric work can be made commutative (with the right operations and the right semantics), while simple READ/WRITE semantics do not commute.

For example, consider a system that implements a simple accounting system with debit and credit operations in two different ways:

  • using a register with read and write operations, and
  • using an integer data type with native debit and credit operations

The latter implementation knows more about the internals of the data type, and so it can preserve the intent of the operations even when they are reordered. Debits and credits can be applied in any order, and the end result is the same:

100 + Credits(10) + Credits(20) = 130 and
100 + Credits(20) + Credits(10) = 130

However, writing a fixed value cannot be done in an arbitrary order: if the writes are reordered, one write will overwrite the other:

100 + write(110) + write(130) = 130 but
100 + write(130) + write(110) = 110
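A small sketch (Python, purely illustrative) of the same contrast: the credit operations commute, so both orders end at 130, while the register writes do not, so the final value depends on which write is applied last.

from functools import reduce

def apply_ops(initial, ops):
    # Apply a list of operations to an initial balance, in the given order.
    return reduce(lambda balance, op: op(balance), ops, initial)

credit10 = lambda balance: balance + 10    # operation-centric: commutes
credit20 = lambda balance: balance + 20
write110 = lambda balance: 110             # register write: does not commute
write130 = lambda balance: 130

print(apply_ops(100, [credit10, credit20]),
      apply_ops(100, [credit20, credit10]))   # 130 130
print(apply_ops(100, [write110, write130]),
      apply_ops(100, [write130, write110]))   # 130 110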

Let's look at the example from the beginning of this chapter, but with different operations. In this case, a client sends messages to two nodes, which see the operations in different orders:

[Client] --> [A] 1, 2, 3
[Client] --> [B] 2, 3, 1

Instead of using string concatenation, suppose we want to find the maximum value of a set of integers (e.g. MAX()). Messages 1, 2 and 3 are:

1: {Operation: take the maximum value (previous value, 3) }
2: {Operation: take the maximum value (previous value, 5) }
3: {Operation: take the maximum value (previous value, 7) }

Then, without coordination, both A and B converge to 7, for example:

A: max(max(max(0, 3), 5), 7) = 7
B: max(max(max(0, 5), 7), 3) = 7

In both cases, the two replicas see the updates in a different order, but we are able to merge the results in a way that yields the same answer regardless of order. Thanks to the merge procedure we use (max), the result converges to the same answer in both cases.

It is likely not possible to write a merge procedure that works for all data types. In Dynamo, a value is an opaque binary blob, so the best the system can do is expose the conflicting values and push the resolution of each conflict to the application.

However, if we know that the data is of some more specific type, handling these kinds of conflicts becomes possible. CRDTs are data structures designed so that replicas always converge, as long as they see the same set of operations (in any order).

CRDTs: convergent replicated data types

CRDTs (Convergent Replicated Data Types) exploit knowledge of the commutative and associative laws of specific operations on specific data types.

In order for a set of operations to converge to the same value in an environment where replicas only communicate occasionally, the operations need to be order-independent and insensitive to (message) duplication/redelivery. Thus, the operations need to be:

  • associative ( a+(b+c)=(a+b)+c ), so that grouping doesn't matter
  • commutative ( a+b=b+a ), so that order of application doesn't matter
  • idempotent ( a+a=a ), so that duplication doesn't matter

It turns out that these structures are already well known in mathematics: they are join (or meet) semilattices.

A lattice is a partially ordered set with a distinct top (least upper bound) and a distinct bottom (greatest lower bound). A semilattice is like a lattice, but has only a distinct top or a distinct bottom. A join semilattice is one with a distinct top (least upper bound), and a meet semilattice is one with a distinct bottom (greatest lower bound).

Any data type that can be expressed as a semilattice can be implemented as a data structure with guaranteed convergence. For example, computing the max() of a set of values will always return the same result regardless of the order in which the values were received, as long as all values are eventually received, because the max() operation is associative, commutative, and idempotent.
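A quick sketch (Python, illustrative) of the same point with set union as the merge operator: because union is associative, commutative, and idempotent, any merge order and any amount of repeated merging produces the same result.

def merge(a, b):
    return a | b   # set union: associative, commutative, idempotent

r1, r2, r3 = {"a"}, {"b"}, {"a", "c"}

one_order   = merge(merge(r1, r2), r3)
other_order = merge(r3, merge(r2, r1))
with_dupes  = merge(one_order, one_order)   # merging twice changes nothing

print(one_order == other_order == with_dupes == {"a", "b", "c"})   # True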

Here are two lattices drawn out: one for a set, where the merge operator is union(items), and one for a strictly increasing integer counter, where the merge operator is max(values):

   { a, b, c }              7
  /      |    \            /  \
{a, b} {b,c} {a,c}        5    7
  |  \  /  | /           /   |  \
 {a}  {b}  {c}          3    5   7

With a data type that can be expressed as a semilattice, you can have replicas communicate in any pattern and receive updates in any order, and as long as they all eventually see the same information, they will agree on the result. This is a powerful property that can be guaranteed as long as the prerequisites hold.

However, expressing a data type as a semilattice usually requires some level of interpretation. Many data types have operations that are not in fact order-independent. For example, adding items to a set is associative, commutative, and idempotent. However, if we also allow items to be removed from the set, then we need some way to resolve conflicting operations, such as add(A) and remove(A). What does it mean to remove an element if the local copy never added it? This resolution has to be specified in an order-independent manner, and there are several different choices with different tradeoffs.
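One possible order-independent resolution is a two-phase-set style design (in the spirit of the two-phase set that appears in the list below). The sketch here (Python, illustrative) records adds and removes in two grow-only sets, so a remove always wins and an element can never be re-added after removal.

class TwoPhaseSet:
    # Sketch: adds and removes each go into a grow-only set. An element is
    # present if it was added and never removed; merging unions both sets.
    def __init__(self):
        self.added, self.removed = set(), set()

    def add(self, x):
        self.added.add(x)

    def remove(self, x):
        self.removed.add(x)

    def value(self):
        return self.added - self.removed

    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed

a, b = TwoPhaseSet(), TwoPhaseSet()
a.add("x")
b.remove("x")                 # concurrent add and remove on different replicas
a.merge(b); b.merge(a)
print(a.value(), b.value())   # set() set(): both replicas agree, remove wins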

This means that several familiar data types have more specialized implementations as CRDTs, each making different tradeoffs in order to resolve conflicting operations in an order-independent manner. Unlike a key-value store, which simply deals with registers (e.g. blobs that are opaque from the system's perspective), someone using CRDTs must use the right data type to avoid anomalies.

Some examples of different data types specified as CRDTs include:

  • Counters
    • Grow-only counter (merge = max(values); payload = single integer) - a sketch of this one follows the list
    • Positive-negative counter (consists of two grow-only counters, one for increments and one for decrements)
  • Registers
    • Last-write-wins register (timestamps or version numbers; merge = max(ts); payload = blob)
    • Multi-valued register (vector clocks; merge = keep both values when they are concurrent)
  • Sets
    • Grow-only set (merge = union(items); payload = set; no removal)
    • Two-phase set (consists of two sets, one for adds and one for removes; elements can only be added and removed once)
    • Unique set (an optimized version of the two-phase set)
    • Last-write-wins set (merge = max(ts); payload = set)
    • Positive-negative set (each set item has a positive-negative counter)
    • Observed-remove set
  • Graphs and text sequences (see the paper)
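Here is a minimal sketch (Python, illustrative) of the grow-only counter mentioned first in the list above: each replica increments only its own slot, and merge takes the element-wise max, so increments are never lost and repeated merges change nothing.

class GCounter:
    # Sketch of a grow-only counter: per-replica slots, merge = element-wise max.
    def __init__(self, replica_id):
        self.id = replica_id
        self.slots = {}

    def increment(self, n=1):
        self.slots[self.id] = self.slots.get(self.id, 0) + n

    def merge(self, other):
        for rid, count in other.slots.items():
            self.slots[rid] = max(self.slots.get(rid, 0), count)

    def value(self):
        return sum(self.slots.values())

a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b); b.merge(a)
print(a.value(), b.value())   # 5 5, regardless of merge order or repetition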

To ensure anomaly-free operation, you need to find the right data type for your particular application. For example, if you know that you will only remove an item once, then a two-phase set works; if you will only ever add items to a set and never remove them, then a grow-only set works.

Not all data structures have known CRDT implementations, but in a recent (2011) survey paper by Shapiro et al., CRDT implementations for booleans, counters, sets, registers, and graphs are presented.

Interestingly, the register implementations correspond directly to what key-value stores use: a last-write-wins register uses timestamps or some equivalent and simply converges to the value with the largest timestamp, while a multi-valued register corresponds to the Dynamo strategy of retaining, exposing, and reconciling concurrent changes. For the details, I recommend the paper in the further reading section of this chapter.

CALM theorem

CRDT data structures are based on the recognition that data structures expressible as semilattices are convergent. But programming is about more than just evolving state, unless you are only implementing a data store.

Clearly, order-independence is an important property of any convergent computation: if the order in which data items are received affects the result of the computation, then the computation cannot be performed without order guarantees.

However, in many programming models, the order of statements does not play a significant role. For example, in the MapReduce model , both Map and Reduce tasks are specified as stateless tuple processing tasks that need to be run on datasets. How and in what order data is routed to tasks is not explicitly specified, but the batch job scheduler is responsible for scheduling tasks to run on the cluster.

Likewise, in SQL, we specify the query, but not how the query will be executed. A query is just a declarative description of a task, and the job of the query optimizer is to figure out an efficient way to execute the query (across multiple machines, databases, and tables).

Of course, these programming models are not as forgiving as general-purpose programming languages. MapReduce tasks need to be expressed as stateless tasks in acyclic dataflow programs; SQL statements can perform fairly complex calculations, but many things are difficult to express in them.

However, it is clear from these two examples that there are many kinds of data processing tasks that are amenable to being expressed in declarative languages, where the order of execution is not clearly specified. Programming models that express the desired result but leave the exact order of statements to the optimizer usually have unordered semantics. This means that such programs may potentially execute without coordination, since they depend on the inputs received, not necessarily the specific order in which the inputs were received.

The key point is that such programs may be safe to execute without coordination. Without a clear rule characterizing what is safe to execute without coordination and what is not, we cannot implement a program while remaining certain that the result is correct.

This is what the CALM theorem is about. The CALM theorem is based on the recognition of a link between logical monotonicity and useful forms of eventual consistency (e.g. confluence / convergence). It states that logically monotonic programs are guaranteed to be eventually consistent.

Then, if we know that some computation is logically monotonic, we know that it is also safe to execute without coordination.

To understand this better, we need to contrast monotonic logic (or monotonic computation) with non-monotonic logic (or non-monotonic computation).

monotonicity

If a sentence φ is a consequence of a set of premises Γ, then it can also be inferred from any set Δ of premises that extends Γ.

Most standard logical frameworks are monotonic: inferences made in a framework like first-order logic, once logically valid, cannot be negated by new information. Non-monotonic logic is a system that does not have this property, in other words, certain conclusions can be negated by learning new knowledge.

In the field of artificial intelligence, non-monotonic logics are associated with defeasible reasoning - reasoning with partial information, where new knowledge can invalidate previous assertions. For example, if we learn that some particular animal is a bird, we assume that it can fly; but if we later learn that it is a penguin, then we must revise our conclusion.

Monotonicity concerns the relationship between premises (or facts about the world) and conclusions (or assertions about the world). In a monotonic logic, we know that our results are retraction-free: monotone computations do not need to be recomputed or coordinated; the answer only gets more accurate over time. Once we know that a particular animal is a bird (and we are reasoning using monotonic logic), we can safely conclude that it can fly, and nothing we learn afterwards invalidates that conclusion.

While any computation that produces a human-facing result can be interpreted as an assertion about the world (e.g. that the value of "foo" is "bar"), it is difficult to determine whether a computation expressed in a von Neumann machine style programming model is monotonic, because it is not exactly clear what the relationships between facts and assertions are, or whether those relationships are monotonic.

However, there are programming models for which monotonicity can be determined. In particular, relational algebra (the theoretical basis of SQL) and Datalog provide highly expressive languages with well-understood semantics.

Both basic Datalog and relational algebra (even with recursion) are known to be monotonic. More specifically, computations expressed using a certain set of basic operators (selection, projection, natural join, Cartesian product, union, and recursive Datalog without negation) are known to be monotonic, while using more advanced operators (negation, set difference, division, universal quantification, aggregation) introduces non-monotonicity.

This means that computations expressed in those systems using a large class of operators (e.g. map, filter, join, union, intersection) are logically monotonic; any computation using only those operators is also monotonic and thus safe to run without coordination. Expressions that use negation or aggregation, on the other hand, are not safe to run without coordination.
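A toy illustration of the difference (Python, not Datalog or Bloom; purely to show the intuition): as new facts arrive, the output of a monotone query like a selection only grows, so earlier answers never have to be retracted, whereas an aggregate computed over a partial set of facts is simply wrong once more facts show up.

# Facts observed at two points in time; the later set is a superset.
facts_t1 = {("flight", "SFO", "JFK")}
facts_t2 = facts_t1 | {("flight", "SFO", "HEL")}

def departures_from_sfo(facts):        # selection: logically monotonic
    return {f for f in facts if f[1] == "SFO"}

def count_flights(facts):              # aggregation: non-monotonic
    return len(facts)

# Monotone query: the earlier answer is still contained in the later one.
print(departures_from_sfo(facts_t1) <= departures_from_sfo(facts_t2))   # True

# Aggregate: the earlier answer (1) is invalidated once more facts arrive.
print(count_flights(facts_t1), count_flights(facts_t2))                 # 1 2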

It is important to be aware of the connection between nonmonotonicity and performing expensive operations in distributed systems. Specifically, both distributed aggregation and coordination protocols can be viewed as a form of negation. As Joe Hellerstein writes :

To establish the truth of a negated predicate in a distributed setting, an evaluation strategy has to start "counting to 0" to determine emptiness, and wait until the distributed counting process has definitely terminated. Aggregation is a generalization of this idea.

and:

This idea can also be viewed from the other direction. Coordination protocols are themselves aggregations, since they involve voting: two-phase commit requires a unanimous vote, Paxos consensus requires a majority, and Byzantine agreement requires a two-thirds majority. Waiting requires counting.

If we can express computations in a way that we can test for monotonicity, then we can do static analysis of the whole program, detecting which parts are eventually consistent and can run without coordination (i.e. monotonic parts), and which parts are not (non-monotonic parts).

Note that this requires a different kind of language, since it is hard to make these inferences for traditional programming languages in which sequence, selection, and iteration are at the core. This is why the Bloom language was designed.

What are the benefits of non-monotonicity?

The distinction between monotonicity and non-monotonicity is interesting. For example, adding two numbers is monotonic, but computing the aggregate over two nodes containing the numbers is not. What is the difference between them? One of them is a calculation (adding two numbers), while the other is an assertion (calculating an aggregate).

How is a calculation different from an assertion? Let's consider the query "Is pizza a vegetable?" To answer this question, we need to get to the core: When can something be inferred to be true (or not)?

There are several accepted answers, each corresponding to a different set of assumptions about the information we have and how we should act on it - and we have accepted different answers in different contexts.

In everyday reasoning, we make what's called the open-world assumption : we assume we don't know everything, and therefore cannot draw conclusions from lack of knowledge. That is, any statement may be true, false or unknown.

                          OWA +                 |  OWA +
                          Monotonic logic       |  Non-monotonic logic
Can derive P(true)      | can assert P(true)    |  cannot assert P(true)
Can derive P(false)     | can assert P(false)   |  cannot assert P(false)
Cannot derive P(true)   | unknown               |  unknown
or P(false)

When making open-world assumptions, we can only safely assert what we can infer from what we know. Our information about the world is assumed to be incomplete.

Let's first look at the case where we know our reasoning is monotonic. In this case, any (potentially incomplete) knowledge we have is not invalidated by learning new knowledge. So if we can deduce that a sentence is true based on some reasoning, such as "anything with two tablespoons of tomato sauce is a vegetable" and "pizza has two tablespoons of tomato sauce", then we can conclude that "pizza is a vegetable". The same holds if we can deduce that a sentence is false.

However, if we cannot infer anything - for example, we have a knowledge set that contains information about customers, but not about pizza or vegetables - then by the open world assumption we must say that we cannot draw any conclusions.

If we are reasoning non-monotonically, then anything we know right now can potentially be invalidated. So we cannot safely conclude anything, even if we can currently deduce true or false from what we know.

However, in the context of databases, and in many computer science applications, we prefer to draw firmer conclusions. This means assuming the so-called closed world assumption : that everything that cannot be proven to be true is assumed to be false. This means that there is no need to explicitly declare false. In other words, we assume that the database of facts we have is complete (minimal), so anything not in it can be assumed to be false.

For example, under the closed-world assumption, if our database has no record of a flight between San Francisco and Helsinki, then we can safely conclude that no such flight exists.

We need one more thing to be able to make definite assertions: logical circumscription.

Circumscription is a formalized rule of conjecture. Domain circumscription conjectures that the known entities are all there are. We need to be able to assume that the known entities are all there are in order to reach a definite conclusion.

                          CWA +                 |  CWA +
                          Circumscription +     |  Circumscription +
                          Monotonic logic       |  Non-monotonic logic
Can derive P(true)      | can assert P(true)    |  can assert P(true)
Can derive P(false)     | can assert P(false)   |  can assert P(false)
Cannot derive P(true)   | can assert P(false)   |  can assert P(false)
or P(false)

In particular, non-monotonic inference requires this assumption. We can only make a confident assertion if we assume that we have complete information, since additional information could otherwise invalidate the assertion.

What does this mean in practice? First, monotonic logic can reach definite conclusions as soon as a sentence can be derived to be true (or false). Second, non-monotonic logic requires an additional assumption: that the known entities are all there are.

So why are two operations that look equivalent on the surface different? Why is adding two numbers monotonic, while computing an aggregate over two nodes is not? Because the aggregate does not merely compute a sum; it also asserts that it has seen all of the values. And the only way to guarantee that is to coordinate across the nodes and ensure that the node performing the computation has really seen all of the values in the system.

Thus, in order to handle non-monotonicity, one needs to either use distributed coordination to ensure that assertions are made only after all of the information is known, or make assertions with the caveat that the conclusion may be invalidated later.

Handling non-monotonicity is important for reasons of expressiveness. It means being able to express non-monotonic things; for example, it is nice to be able to say that the total of some column is X. The system must detect that this kind of computation requires a global coordination boundary to ensure that all of the entities have been seen.

Purely monotonic systems are rare. It seems that most applications operate under the closed-world assumption even when they have incomplete data, and we humans are fine with that. When a database tells you that there is no direct flight between San Francisco and Helsinki, you will probably treat this as "according to this database, there is no direct flight", but you do not rule out the possibility that such a flight nevertheless exists in reality.

Really, this issue only becomes interesting when replicas can diverge (e.g. during a partition, or due to delays during normal operation). Then a more specific consideration is needed: whether the answer is based on just the current node, or on the totality of the system.

Furthermore, since non-monotonicity is caused by making an assertion, it seems likely that many computations can proceed for a long time and only apply coordination at the point where some result or assertion is passed to a third-party system or end user. Certainly a total order does not need to be enforced on every single read and write operation in a system if those reads and writes are simply part of a long-running computation.

Bloom language

The Bloom language is a language designed to take advantage of the CALM theorem. It is a Ruby DSL whose formal basis is a temporal logic programming language called Dedalus.

In Bloom, each node has a database consisting of collections and lattices. Programs are expressed as unordered sets of statements that interact with collections (sets of facts) and lattices (CRDTs). Statements are unordered by default, but one can also write non-monotonic functions.

Take a look at the Bloom website and tutorials to learn more about Bloom.


read more

The CALM theorem, confluence analysis, and Bloom

Joe Hellerstein's talk at RICON 2012 is a good introduction to the topic, as is Neil Conway's talk at Basho. For Bloom in particular, see Peter Alvaro's talk at Microsoft.

CRDTs (convergent replicated data types)

Marc Shapiro's presentation at Microsoft is a good starting point for understanding CRDTs.

Dynamo; PBS; optimistic replication

6. Further reading and appendices

If you've gotten this far, thank you.

If you like this book, follow me on Github (or Twitter). I'd love to hear that I've had some kind of positive impact. "Create more value than you capture" and all that.

Many thanks to: logpath, alexras, globalcitizen, graue, frankshearar, roryokane, jpfuentes2, eeror, cmeiklejohn, stevenproctor, eos2102 and steveloughran for their help! Of course, any omissions and mistakes are my responsibility!

It's worth noting that my chapter on eventual consistency is fairly Berkeley-centric; I'd like to change that. I also skipped one prominent use case: consistent snapshots. There are also some topics that I should expand on further: namely, an explicit discussion of safety and liveness properties and a more detailed discussion of consistent hashing. I'm going to Strange Loop 2013 though, so whatever.

If this book had a sixth chapter, it would probably be about how to use and process large amounts of data. It seems that the most common type of "big data" computing is processing a large data set with a simple program. I'm not sure what the next chapters will be (maybe high performance computing, given the current focus is on feasibility), but I'll probably know in a few years.

Books on Distributed Systems

Distributed Algorithms (Lynch)

This is probably the most frequently recommended book on distributed algorithms. I would also recommend it, but with a caveat. It's very comprehensive, but written for a graduate audience, so you'll spend a lot of time reading about synchronous systems and shared-memory algorithms before getting to the things most interesting to a practitioner.

An Introduction to Reliable and Secure Distributed Programming (Cachin, Guerraoui, and Rodrigues)

For practitioners, this is a fun one: it's short and contains implementations of actual algorithms.

Replication: Theory and Practice

If you're interested in replication, this book is great. The chapter on replication in this text is largely based on a synthesis of the interesting parts of this book plus more recent readings.

Distributed Systems: An Algorithmic Approach (Ghosh)

Introduction to Distributed Algorithms (Tel)

Transactional Information Systems: Theory, Algorithms, and Practice of Concurrency Control and Recovery (Weikum & Vossen)

This book is about traditional transactional information systems, such as native relational databases. There are two chapters on distributed transactions at the end, but the focus of this book is on transaction processing.

Transaction Processing: Concepts and Techniques (by Gray and Reuter)

A classic. I found Weikum & Vossen to be more up to date.

seminal papers

The Edsger W. Dijkstra Prize in Distributed Computing is awarded annually to recognize an outstanding paper on the principles of distributed computing. Follow the link for the full list, which includes the following classic papers:

Microsoft Academic Search has a list of top publications in distributed and parallel computing ordered by citation count - this may be an interesting list to skim for more classics.

Here is an additional list of some recommended papers:

system
