[Repost] From distributed consistency to CAP theory and BASE theory

 

 

Statement of the problem

In computer science, distributed consistency is an important problem that has been widely explored and studied. Let's first look at three business scenarios.

1. Ticket sales at the railway station

Suppose our end user is a traveler who often takes the train. Usually he buys a ticket at the station's ticket office, takes it to the ticket gate, boards the train, and begins a pleasant journey; everything seems harmonious. Now imagine his destination is Hangzhou, and only one ticket remains for a certain train bound there. At that very moment, another passenger at a different window may buy the same ticket. If the ticketing system does not guarantee consistency, both purchases succeed, and at the gate one of the passengers will be told his ticket is invalid. (Of course, the modern Chinese railway ticketing system rarely behaves this way.) The example shows that the end user's requirement on the system is very simple:

"Please sell me the ticket; if there are no tickets left, tell me at the moment of sale that none are available."

This imposes a strict consistency requirement on the ticketing system: the system's data (here, the number of remaining tickets on the train to Hangzhou) must be exactly the same at every ticket window, at every moment.
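The requirement can be sketched in code. Below is a minimal, hypothetical in-memory model (not a real ticketing system): a lock makes the check-and-decrement atomic, so when two windows race for the last ticket, exactly one sale succeeds.

```python
import threading

class TicketCounter:
    """Toy model of the shared ticket count (hypothetical class).

    The lock serializes check-and-decrement, so at most one buyer
    can take the last ticket -- the strict consistency the scenario demands.
    """
    def __init__(self, remaining):
        self.remaining = remaining
        self._lock = threading.Lock()

    def sell(self):
        with self._lock:            # serialize all ticket windows
            if self.remaining > 0:
                self.remaining -= 1
                return True         # ticket sold
            return False            # "no more tickets"

counter = TicketCounter(remaining=1)
results = []
threads = [threading.Thread(target=lambda: results.append(counter.sell()))
           for _ in range(2)]       # two windows, one ticket
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # exactly one True and one False
```

Without the lock, both threads could pass the `remaining > 0` check before either decrements, which is precisely the double-sale the scenario describes.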

2. Bank transfer

Suppose our end user is a fresh graduate who, on receiving his first month's salary, goes to the bank to send money home. After he completes the transfer at the counter, the clerk kindly reminds him: "Your transfer will arrive within N business days!" The graduate is a little frustrated, but tells the clerk: "Fine, I don't care how long it takes, as long as not a cent goes missing!" This has become almost every user's most basic requirement of a modern banking system.

3. Online shopping

Suppose our end user is an online shopping expert. Seeing a product he likes with an inventory of 5, he quickly confirms the purchase, fills in the delivery address, and places the order. Yet at the moment the order is submitted, the system may tell him: "Insufficient inventory!" Most consumers then blame themselves for being too slow and letting someone else snatch their beloved product.

In fact, engineers with experience building online shopping systems know that the inventory shown on the product detail page is usually not the product's true inventory; only when the order is actually placed does the system check the real inventory. But the user rarely notices or cares.

 

Interpretation of the problem

From the three examples above, you can see that end users have different data consistency requirements when using different computer products:

1. Some systems must both respond to users quickly and guarantee that the data they serve is accurate for every client at all times, like the railway station's ticketing system

2. Some systems must guarantee absolutely reliable data for users: consistency may be delayed, but strict consistency must hold in the end, like a bank's transfer system

3. Some systems may show the user data that is, strictly speaking, "wrong", but at some point in the overall workflow they verify the data accurately, so the user suffers no real loss, like the online shopping system

 

The distributed consistency problem

A central problem a distributed system must solve is data replication. Many developers have run into the following in daily work: client C1 updates a value K in the system from V1 to V2, but client C2 cannot immediately read the latest value of K; it takes a while before the new value becomes readable. This is normal, because replication between databases introduces delay.

The need for data replication in distributed systems generally arises for two reasons:

1. To increase availability, preventing system-wide unavailability caused by a single point of failure

2. To improve overall performance: through load balancing, replicas distributed across different locations can all serve user requests

The benefits data replication brings to distributed systems in availability and performance are self-evident. But the consistency challenge it introduces is something every system developer must face.

The so-called distributed consistency problem refers to data inconsistency that may arise between different data nodes once a replication mechanism is introduced in a distributed environment, and that applications cannot resolve on their own. Put simply, data consistency means that when one replica of the data is updated, the other replicas must be updated as well; otherwise the replicas diverge.

So how do we solve this problem? One idea: "since the problem is caused by replication delay, block the write until replication to all copies completes, and only then acknowledge it." This does solve the problem, and some system architectures use exactly this approach. But while it fixes consistency, it creates a new one: write performance. If your workload has many write requests, each subsequent write is blocked behind the replication of the previous one, and overall system throughput drops sharply.
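The blocking idea can be sketched as follows; `Replica` and `SyncReplicatedStore` are hypothetical toy classes, with `time.sleep` standing in for replication latency. The write returns only after every copy is updated, which is exactly where the write-performance cost comes from.

```python
import time

class Replica:
    """Hypothetical data copy with a fixed replication delay."""
    def __init__(self, name, delay):
        self.name, self.delay, self.data = name, delay, {}

    def apply(self, key, value):
        time.sleep(self.delay)     # simulated network/replication latency
        self.data[key] = value

class SyncReplicatedStore:
    """A write is acknowledged only after every replica has applied it,
    so a read from any replica sees the new value -- at the cost of
    write latency accumulating across all replicas."""
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        for r in self.replicas:    # block until every copy is updated
            r.apply(key, value)

    def read(self, key):
        return self.replicas[0].data.get(key)  # any replica would do

store = SyncReplicatedStore([Replica("r1", 0.01), Replica("r2", 0.03)])
start = time.time()
store.write("K", "V2")
elapsed = time.time() - start      # at least 0.04 s: the price of blocking
```

Every read is now consistent, but write latency is the sum of all replica delays; with many concurrent writers, throughput collapses as described above.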

In general, there is no distributed consistency solution that satisfies every property we would like a distributed system to have. How to guarantee data consistency without crippling system performance is therefore a key consideration and trade-off in every distributed system design. Out of this trade-off, levels of consistency were born:

1. Strong consistency

This level best matches user intuition: whatever the system writes is what it reads back. The user experience is excellent, but implementing it usually has a large impact on system performance.

2. Weak consistency

This level only constrains the system as follows: it does not promise that a value can be read immediately after a successful write, nor exactly how long it will take for the data to become consistent, but it tries to ensure that the data reaches a consistent state within some time bound (for example, at the level of seconds).

3. Eventual consistency

Eventual consistency is a special case of weak consistency: the system guarantees that a consistent state will be reached within some period of time. It is singled out here because it is the most widely embraced model within weak consistency, and the model the industry favors most for data consistency in large-scale distributed systems.
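The difference between the levels can be illustrated with a toy eventually consistent store (all names hypothetical): writes land on a primary immediately, followers lag behind until a background sync runs, and after the sync all copies agree.

```python
from collections import deque

class EventuallyConsistentStore:
    """Toy model: writes hit the primary and are queued for followers.
    Until the queue drains, a follower may return stale data (weak
    consistency); once it drains, all copies agree (eventual consistency)."""
    def __init__(self, n_followers):
        self.primary = {}
        self.followers = [{} for _ in range(n_followers)]
        self.pending = deque()

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))   # replicate later, not now

    def read_follower(self, i, key):
        return self.followers[i].get(key)   # may lag behind the primary

    def sync(self):
        while self.pending:                 # one background replication pass
            key, value = self.pending.popleft()
            for f in self.followers:
                f[key] = value

store = EventuallyConsistentStore(n_followers=2)
store.write("K", "V2")
stale = store.read_follower(0, "K")   # None: follower not yet updated
store.sync()
fresh = store.read_follower(0, "K")   # "V2": state has converged
```

A strongly consistent system would never expose the `stale` read; an eventually consistent one accepts it in exchange for fast, non-blocking writes.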

 

Problems in a distributed environment

The distributed system architecture has been accompanied by many difficulties and challenges since its inception:

1. Communication anomalies

The evolution from centralized to distributed architecture inevitably introduces the network, and with it the problems of the network's inherent unreliability. Nodes in a distributed system must communicate over the network, so every communication carries the risk of network unavailability: failures of hardware or services such as fiber links, routers, or DNS can prevent a communication from completing. Moreover, even when communication between nodes works normally, its latency far exceeds that of a stand-alone operation. In modern computer architecture, single-machine memory access latency is on the order of nanoseconds (typically about 10 ns), while a normal network round trip takes roughly 0.1 to 1 ms, about 10^5 times the memory access latency. This huge gap affects the sending and receiving of messages, so message loss and message delay become very common.

2. Network partition

When the network misbehaves, latency between some nodes keeps increasing, until only some of the nodes that make up the distributed system can communicate normally while others cannot. We call this phenomenon a network partition. When partitions appear, the system splits into small local clusters; in extreme cases, these clusters independently carry out functions that originally required the whole system, including transaction processing on data, which poses a very serious challenge to consistency.

3. Tri-state

From the two points above, we know that various problems may arise in the network of a distributed environment. As a result, every request and response in a distributed system carries a notion of three states: success, failure, and timeout. In a traditional stand-alone system, a function call yields a perfectly clear answer: success or failure. In a distributed system, because the network is unreliable, most communications still receive a success or failure response, but when the network misbehaves, a timeout may occur. Timeouts typically arise in two situations:

(1) Due to network problems, the request never reached the receiver: the message was lost in transit

(2) The receiver got the request and processed it, but the response was lost on its way back to the sender

When such a timeout occurs, the initiator of the communication cannot determine whether the current request was processed successfully.
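The three-state outcome can be modeled with a toy in-process "server" thread (all names hypothetical); a queue timeout stands in for a network timeout:

```python
import queue
import threading
import time

def remote_call(server_delay, timeout):
    """Toy model of a request/response over an unreliable link: the
    caller observes one of three states -- success, failure, timeout --
    and a timeout leaves it unknown whether the server did the work.
    (A real failure, e.g. connection refused, would be a third branch.)"""
    replies = queue.Queue()

    def server():
        time.sleep(server_delay)   # network transit + processing time
        replies.put("ok")          # the reply may arrive too late

    threading.Thread(target=server, daemon=True).start()
    try:
        return ("success", replies.get(timeout=timeout))
    except queue.Empty:
        # Ambiguous outcome: the request may have been lost before the
        # server saw it, or processed with the reply lost on the way back.
        return ("timeout", None)

fast = remote_call(server_delay=0.01, timeout=1.0)   # ("success", "ok")
slow = remote_call(server_delay=0.5, timeout=0.05)   # ("timeout", None)
```

In the `slow` case the "server" actually did the work; the caller simply cannot know that, which is exactly the ambiguity described above.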

4. Node failure

Node failure is another common problem in a distributed environment: a server node in the system goes down or becomes a "zombie". Experience says that every node can fail, and in a large system, failures happen every day.

 

Distributed transactions

With the development of distributed computing, transactions have also been widely applied in the distributed field. On a stand-alone database we can easily build a transaction processing system that satisfies the ACID properties. In a distributed database, however, the data is scattered across different machines, and performing distributed transaction processing over it is a serious challenge.

A distributed transaction is one in which the transaction's participants, the servers supporting the transaction, the resource servers, and the transaction manager sit on different nodes of a distributed system. A distributed transaction typically involves operations on multiple data sources or business systems.

The most typical distributed transaction scenario is an inter-bank transfer: it calls two remote banking services, the withdrawal service of the local bank and the deposit service of the target bank. The services themselves are stateless and independent of each other, yet together they make up one complete distributed transaction. If the withdrawal from the local bank succeeds but the deposit service fails for some reason, the system must roll back to the state before the withdrawal; otherwise the user will find that his money has vanished.

From this example we can see that a distributed transaction is composed of multiple distributed operation sequences, such as the withdrawal service and deposit service above; these sequences are usually called sub-transactions. A distributed transaction can therefore be defined as a nested transaction, and it inherits the ACID transaction properties. But because the sub-transactions of a distributed transaction execute on different nodes, building a distributed transaction processing system that guarantees ACID is considerably more complex.
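The rollback in the transfer example can be sketched with a compensating action. This is a simplified, hypothetical model (the class and function names are invented); real distributed transaction systems typically use protocols such as two-phase commit rather than ad-hoc compensation.

```python
class BankService:
    """Hypothetical in-memory stand-in for a remote banking service."""
    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        if amount > self.balance:
            raise RuntimeError("insufficient funds")
        self.balance -= amount

    def deposit(self, amount, fail=False):
        if fail:                    # simulate the remote service failing
            raise RuntimeError("deposit service unavailable")
        self.balance += amount

def transfer(source, target, amount, simulate_failure=False):
    """If the deposit sub-transaction fails, a compensating deposit
    undoes the withdrawal, so no money disappears."""
    source.withdraw(amount)                            # sub-transaction 1
    try:
        target.deposit(amount, fail=simulate_failure)  # sub-transaction 2
    except RuntimeError:
        source.deposit(amount)      # compensate: roll back sub-transaction 1
        return False
    return True

local, remote = BankService(100), BankService(0)
ok = transfer(local, remote, 30, simulate_failure=True)   # rolled back
ok2 = transfer(local, remote, 30)                         # succeeds
```

After the failed transfer the local balance is back to 100 and the remote balance is still 0; only the successful transfer moves the money.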

 

CAP theory

CAP is a classic theorem of distributed systems. It tells us that a distributed system cannot simultaneously satisfy all three of the following basic requirements: consistency (C), availability (A), and partition tolerance (P); at most two of them can be satisfied at once.

1. Consistency

In a distributed environment, consistency is the property that data remains identical across multiple replicas. Under the consistency requirement, when a system in a consistent state performs an update, it must guarantee that its data is still in a consistent state afterwards.

For a system that places data replicas on different distributed nodes, if an update to the first node's data succeeds but the second node's copy is not updated accordingly, then reading from the second node still returns the old value (so-called dirty data); this is the typical case of distributed data inconsistency. If, whenever an update to a data item completes successfully, every user can read its latest value, the system is said to have strong consistency.

2. Availability

Availability means the service the system provides must remain continuously usable: every user request receives a result within a bounded time. The key phrases here are "within a bounded time" and "returns a result".

"Within a bounded time" means that for a user request, the system must return the corresponding result within a specified time limit; beyond that limit the system is considered unavailable. This limit is a performance target set when the system is designed and varies greatly between systems, but in any case the system must define a reasonable response time for user requests, or users will lose confidence in it.

"Returns a result" is the other crucial requirement: after processing a user request, the system must return a well-formed response that clearly reflects the outcome of the request, success or failure, rather than something that leaves the user confused.

3. Partition tolerance

Partition tolerance constrains a distributed system as follows: when it encounters any network partition failure, it must still be able to provide service that satisfies consistency and availability, unless the entire network environment has failed.

A network partition occurs when the nodes of a distributed system sit in different subnetworks (separate machine rooms or remote networks) and, for some special reason, those subnetworks become disconnected from each other while each subnetwork's internal network still works, splitting the system's overall network environment into several isolated regions. Note that each node joining or leaving a distributed system can also be viewed as a special kind of network partition.

Since a distributed system cannot satisfy consistency, availability, and partition tolerance all at once, one of the three must be given up.

A table summarizes the options:

Choice   Explanation
CA       Give up partition tolerance and strengthen consistency and availability; in effect, the choice made by a traditional single-node database
AP       Give up consistency (strong consistency, that is) in favor of partition tolerance and availability; the choice of many distributed system designs, for example many NoSQL systems
CP       Give up availability in favor of consistency and partition tolerance; almost never chosen, since a network problem would render the entire system unavailable

One point must be made clear: for a distributed system, partition tolerance is a baseline requirement. Since it is a distributed system, its components must be deployed on different nodes (otherwise it would not be distributed at all), so subnetworks inevitably arise; and for a distributed system, network problems are an anomaly that is certain to occur sooner or later. Partition tolerance is therefore a problem every distributed system must face and handle. Hence system architects usually devote their effort to finding a balance between C (consistency) and A (availability) according to the characteristics of the business.

 

BASE theory

BASE is an acronym for Basically Available, Soft state, and Eventually consistent. BASE theory is the result of trading off consistency against availability in CAP; it was distilled from distributed practice in large-scale Internet systems and evolved gradually from the CAP theorem. Its core idea: even when strong consistency cannot be achieved, each application can, according to its own business characteristics, adopt an appropriate approach to bring the system to eventual consistency. Let's look at the three elements of BASE:

1. Basically available

Basic availability means that when unforeseeable failures occur, the distributed system is allowed to lose part of its availability. Note that this is by no means equivalent to the system being unavailable. For example:

(1) Loss in response time: under normal conditions an online search engine returns results to the user within 0.5 seconds, but under failure the response time grows by 1 to 2 seconds

(2) Loss in functionality: under normal conditions, shoppers on an e-commerce site can complete almost every order smoothly, but during a holiday sales peak, purchases surge, and to protect the stability of the system some consumers may be redirected to a degraded page
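Example (2) can be sketched as a trivial load-shedding check (all names hypothetical; real systems use admission control, rate limiters, or circuit breakers for this):

```python
def handle_order(place_order, overloaded):
    """Sketch of basic availability: during a traffic peak the system
    does not go down; it sheds load by redirecting some users to a
    degraded page (a deliberate, partial loss of functionality)."""
    if overloaded():
        return "degraded_page"      # functionality loss, not an outage
    return place_order()            # normal path: the order completes

# Normal load: the order goes through; peak load: the user sees the fallback.
normal = handle_order(lambda: "order_confirmed", lambda: False)
peak = handle_order(lambda: "order_confirmed", lambda: True)
```

The point is the trade: some users get a worse experience, but the system as a whole stays available instead of collapsing under the surge.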

2. Soft state

Soft state means the system's data is allowed to exist in intermediate states, and such an intermediate state is considered not to harm the system's overall availability; that is, delay is allowed while data is synchronized between replicas on different nodes.

3、最终一致性

Eventual consistency emphasizes that all replicas of a piece of data, after a period of synchronization, finally reach a consistent state. Its essence is that the system need only guarantee that the data eventually agrees; it does not need to guarantee strong consistency in real time.

In summary, BASE theory targets large-scale, highly available, scalable distributed systems, and it stands in contrast to ACID, the traditional transaction model: where ACID demands strong consistency, BASE sacrifices strong consistency to gain availability, allowing data to be inconsistent for a period of time as long as it eventually converges. At the same time, in real distributed scenarios, different business units and components have different data consistency requirements, so in practice the architecture of a distributed system usually combines ACID properties with the BASE approach.
