SpringCloud Alibaba - Distributed transaction theory (CAP theorem and BASE theory)

Table of contents

1. Distributed transaction theory

1.1. Distributed transaction issues

1.2. What is distributed transaction

1.3. Ideas for solving distributed transactions

1.3.1. CAP theorem

a) Consistency

b) Availability

c) Partition tolerance (partition fault tolerance)

Question 1: Why can only two of the three be satisfied at most?

Analysis summary

Question 2: Which mode does the ElasticSearch cluster belong to?

1.3.2. BASE theory

a) Based on BASE theory, solved through AP mode

b) Based on BASE theory, solved through CP mode

Analysis: Common features of the above methods


1. Distributed transaction theory


1.1. Distributed transaction issues

In the monolithic architectures we studied previously, there is usually a single service that accesses a single database directly. The business logic is relatively simple, and the database's own transactions are enough to guarantee ACID (atomicity, consistency, isolation, and durability). In a microservice architecture, however, the business logic is often more complicated: one business operation may span multiple services, each with its own database. In that case, the transactional guarantees of each individual database are not enough to guarantee ACID for the business operation as a whole.

For example, suppose there are three microservices: an order service, an account service, and an inventory service. When a user places an order, the order service creates the order, then calls the account service to deduct the user's balance, and finally calls the inventory service to deduct the product's stock.

Each microservice in this flow has its own independent database and its own independent transaction. What we want is for the order operation as a whole to be atomic: either every step succeeds or every step fails.

In practice, if the order service and account service succeed but the inventory service fails, nothing is rolled back. Why?

  1. Each service is independent and the calls are made synchronously, one after another. Even if the inventory service, which runs last, throws an exception, the order service and account service have no built-in way to learn that their own work should be undone.
  2. Because each service is independent, its transaction is independent too. By the time the order service and account service return, their local transactions have already been committed and can no longer be rolled back, so the services never reach a consistent outcome.
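The two reasons above can be illustrated with a minimal sketch. It assumes three in-memory stand-ins for the services' databases; `LocalDatabase`, `run_local_transaction`, and the update functions are hypothetical names invented for this illustration, not part of any framework.

```python
class LocalDatabase:
    """Hypothetical stand-in for one service's private database."""

    def __init__(self, data):
        self.committed = dict(data)

    def run_local_transaction(self, update):
        # Work on a copy so that a failing update leaves committed data untouched.
        working = dict(self.committed)
        update(working)              # may raise
        self.committed = working     # local commit: final from this point on


def create_order(db):
    db["orders"] += 1

def deduct_balance(db):
    db["balance"] -= 100

def deduct_stock(db):
    if db["stock"] < 1:
        raise RuntimeError("out of stock")
    db["stock"] -= 1


order_db = LocalDatabase({"orders": 0})
account_db = LocalDatabase({"balance": 500})
inventory_db = LocalDatabase({"stock": 0})

def place_order():
    order_db.run_local_transaction(create_order)       # commits on its own
    account_db.run_local_transaction(deduct_balance)   # commits on its own
    inventory_db.run_local_transaction(deduct_stock)   # fails: no stock left

try:
    place_order()
    outcome = "success"
except RuntimeError:
    outcome = "failed"

# The business operation failed, yet the first two local commits remain.
print(outcome, order_db.committed, account_db.committed, inventory_db.committed)
```

The order and the balance deduction stay committed even though the operation as a whole failed, which is exactly the inconsistency described above.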

1.2. What is distributed transaction

In a distributed system, a single business operation spans multiple services and databases. Each service's local transaction is a branch transaction. A distributed transaction must ensure that all branch transactions end in the same final state: either all of them succeed or all of them fail.

The difficulty is that the individual transactions are isolated from one another and cannot perceive each other's outcome. A branch that has already committed therefore cannot be rolled back when another branch fails, so consistency cannot be guaranteed.

1.3. Ideas for solving distributed transactions

1.3.1. CAP theorem

The CAP theorem states that a distributed system has three properties:

  1. Consistency
  2. Availability
  3. Partition tolerance

It further states that these three properties cannot all be satisfied at the same time; at most two of them can be.

Why? To answer that, we first need to understand what each property means.

a) Consistency

Consistency means that no matter which node of the distributed system a user accesses, the data returned must be the same.

For example, suppose there are two nodes in a master-slave relationship. The first node holds a data item k1 with the value 0, and so does the second. If k1 is modified on the first node, the change must be synchronized to the second node to preserve consistency. In a distributed system, therefore, data synchronization must be completed promptly whenever data is modified.

b) Availability

Availability means that when users access any healthy node in the cluster, they must get a response, not rejection or timeout.

For example, suppose a cluster contains three nodes. Under normal circumstances, users can access any of them without problems. If requests sent to one of the nodes suddenly start being blocked or rejected, that node cannot be accessed normally, which means it is unavailable.

c) Partition tolerance (partition fault tolerance)

Let's first look at "partition": because of a network failure or some other cause, some nodes in the distributed system lose their connection to the rest and form separate partitions.

For example, suppose there are three nodes: node1, node2, and node3. Normally node1 and node2 can reach each other, and so can node2 and node3. Now node3 loses its connection to node2 because of a network problem, so node3 can no longer synchronize data with the other two nodes. Two partitions are formed: node1 and node2 in one, node3 in the other. If a user now writes data to node1, the data is synchronized to node2, but node3 is unaware of the change, so the data in the two partitions becomes inconsistent.

"Partition tolerance" means that the system as a whole must continue to serve requests whether or not the cluster is partitioned. Even when a partition occurs, users must still be able to access the system.

Question 1: Why can only two of the three be satisfied at most?

AP (availability, partition tolerance): As described above, partition tolerance requires that users can still access the system when a partition occurs. If every node keeps answering, including node3, which cannot synchronize, the data returned will be inconsistent.

CP (consistency, partition tolerance): What if consistency must be preserved? Then, until the network between node3 (the node cut off by the network failure) and node2 is restored, every request to node3 has to be blocked with a message like "Wait a moment, the data here is not synchronized yet!". Consistency is preserved, but node3 is a healthy node that users cannot access, so availability is not.

CA (consistency, availability): To satisfy both availability and consistency, every node must keep answering and every node must hold the same data, which is only possible if partitions never occur.
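The AP and CP choices can be made concrete with a small simulation of the three-node example. This is only a sketch under stated assumptions: `Cluster`, `Node`, `isolate`, and `Unavailable` are hypothetical names, and the CP rule used here (a node answers only while it can see a majority of the cluster) is one common way to decide which side of a partition keeps serving, not something the CAP theorem itself prescribes.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request to protect consistency."""


class Node:
    def __init__(self, name):
        self.name = name
        self.data = {"k1": 0}
        self.peers = []              # nodes this node can currently reach


class Cluster:
    def __init__(self, mode):
        self.mode = mode             # "AP" or "CP"
        self.nodes = {n: Node(n) for n in ("node1", "node2", "node3")}
        for node in self.nodes.values():
            node.peers = [p for p in self.nodes.values() if p is not node]

    def isolate(self, name):
        # Simulate a network partition that cuts one node off from the rest.
        lost = self.nodes[name]
        for node in self.nodes.values():
            node.peers = [p for p in node.peers if p is not lost]
        lost.peers = []

    def _can_serve(self, node):
        if self.mode == "AP":
            return True              # always answer, even with stale data
        # CP (assumed rule): answer only when a majority is visible.
        return 1 + len(node.peers) > len(self.nodes) / 2

    def write(self, name, key, value):
        node = self.nodes[name]
        if not self._can_serve(node):
            raise Unavailable(name)
        node.data[key] = value
        for peer in node.peers:      # synchronize to every reachable node
            peer.data[key] = value

    def read(self, name, key):
        node = self.nodes[name]
        if not self._can_serve(node):
            raise Unavailable(name)
        return node.data[key]


ap = Cluster("AP")
ap.isolate("node3")
ap.write("node1", "k1", 1)
ap_stale_read = ap.read("node3", "k1")   # node3 answers, but with old data

cp = Cluster("CP")
cp.isolate("node3")
cp.write("node1", "k1", 1)
try:
    cp.read("node3", "k1")
    cp_node3_answered = True
except Unavailable:
    cp_node3_answered = False            # node3 refuses instead of lying
```

In AP mode node3 stays reachable and returns the outdated value; in CP mode node3 rejects the request, so no user ever sees inconsistent data, at the cost of availability.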

Analysis summary

Partitions, however, are unavoidable. The nodes of a distributed system must be connected over a network, and no network can be guaranteed to be 100% healthy. Since partitions will occur sooner or later, the real choice is between A (availability) and C (consistency).

Question 2: Which mode does the ElasticSearch cluster belong to?

When a network failure disconnects a node from the others, the ES cluster enters a warning state. After a period of time the failed node is removed from the cluster, and the data shards it held are redistributed to the remaining healthy nodes. The failed node can then no longer be accessed, which means availability is sacrificed, so ES follows the CP mode.

1.3.2. BASE theory

We know that partitions are inevitable in a distributed system, so a choice has to be made between consistency and availability. Both properties are important, though. What if you do not want to give up either one completely? BASE theory addresses exactly this problem.

BASE theory contains three ideas:

  1. Basically Available: When a failure occurs in a distributed system, partial availability is allowed to be lost, that is, core availability is guaranteed.
  2. Soft State: Within a certain period of time, intermediate states are allowed to occur, such as temporary inconsistent states.
  3. Eventually Consistent: Although strong consistency cannot be guaranteed, data consistency will eventually be achieved after the soft state ends.

BASE theory is in effect a compromise between availability and consistency. When a failure occurs, the system may sacrifice part of its availability or become temporarily unavailable, and some data may be inconsistent during that period. Once the network has recovered, the node rejoins the cluster and the shards are redistributed, so the system becomes fully available again and reaches eventual consistency.
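The three BASE ideas can be seen together in a minimal sketch of a primary that keeps accepting writes while its replica is unreachable. The names `Primary`, `Replica`, `link_up`, and `backlog` are assumptions made for this illustration only.

```python
class Replica:
    def __init__(self):
        self.data = {}


class Primary:
    def __init__(self, replica):
        self.data = {}
        self.replica = replica
        self.link_up = True
        self.backlog = []            # updates not yet delivered to the replica

    def write(self, key, value):
        self.data[key] = value       # basically available: the write is accepted
        self.backlog.append((key, value))
        self.sync()

    def sync(self):
        if not self.link_up:
            return                   # soft state: the replica lags behind for now
        while self.backlog:
            key, value = self.backlog.pop(0)
            self.replica.data[key] = value


replica = Replica()
primary = Primary(replica)

primary.link_up = False              # network failure
primary.write("k1", 1)
primary.write("k1", 2)
during_failure = dict(replica.data)  # replica has not seen either write

primary.link_up = True               # network recovers
primary.sync()
after_recovery = dict(replica.data)  # replica has caught up with the primary
```

While the link is down the two copies disagree, which is the soft state. After the backlog is replayed in order, both copies hold the same data, which is eventual consistency.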

So how can distributed transactions be solved on the basis of BASE theory?

a) Based on BASE theory, solved through AP mode

This approach preserves availability and sacrifices a certain degree of consistency.

For example, suppose there are many sub-transactions. Each one executes and commits on its own. Some succeed and some fail, so the data is temporarily inconsistent; that is, the system is in a soft state. After they have all finished, the sub-transactions check with one another: "Did you succeed?" "I succeeded." "I failed here...".

If the comparison shows that one sub-transaction failed while others have already committed, that is not a problem: the committed ones perform the reverse operation. For example, a sub-transaction that inserted a record simply deletes it again. In this way eventual consistency is still achieved.

This is a distributed transaction solution based on the AP approach.
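A minimal sketch of this commit-first, compensate-later idea follows. It assumes each sub-transaction is described by a pair of functions, the action and its reverse operation; `run_with_compensation` and the order, balance, and stock functions are hypothetical names, and a plain dictionary stands in for the three services' data.

```python
def run_with_compensation(steps):
    """steps: list of (action, compensation) pairs of zero-argument callables."""
    succeeded = []
    any_failed = False
    for action, compensate in steps:
        try:
            action()                     # each branch commits immediately
            succeeded.append(compensate)
        except Exception:
            any_failed = True            # remember the failure, keep going
    if any_failed:
        # Soft state ends here: undo every branch that had committed.
        for undo in reversed(succeeded):
            undo()
    return not any_failed


state = {"orders": [], "balance": 500, "stock": 0}

def create_order():
    state["orders"].append("order-1")

def delete_order():
    state["orders"].remove("order-1")

def deduct_balance():
    state["balance"] -= 100

def refund_balance():
    state["balance"] += 100

def deduct_stock():
    if state["stock"] < 1:
        raise RuntimeError("out of stock")
    state["stock"] -= 1

def restore_stock():
    state["stock"] += 1


ok = run_with_compensation([
    (create_order, delete_order),
    (deduct_balance, refund_balance),
    (deduct_stock, restore_stock),
])
```

No branch ever waits for another, so availability is preserved. The price is the window between the commits and the compensations, during which other readers can observe the order and the reduced balance.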

b) Based on BASE theory, solved through CP mode

This approach preserves consistency and sacrifices a certain degree of availability.

For example, suppose there are many sub-transactions. Instead of committing as soon as it has executed, each sub-transaction waits for the others and checks whether they have finished. If all of them executed without problems, they all commit together. If any one of them fails, they all roll back together.

Because every sub-transaction has to wait for the others, the services are only weakly available during that time: the resources involved stay locked and cannot be accessed.
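The execute-wait-then-commit flow, including the locking that causes weak availability, might look like the sketch below. `Branch`, `ResourceLocked`, and `run_together` are hypothetical names chosen for this illustration; a real resource would hold a database lock rather than a boolean flag.

```python
class ResourceLocked(Exception):
    """Raised when a resource is read while its branch is still undecided."""


class Branch:
    def __init__(self, value):
        self.value = value           # last committed value
        self.pending = None          # executed but not yet committed
        self.locked = False

    def execute(self, delta):
        new_value = self.value + delta
        if new_value < 0:
            raise ValueError("insufficient amount")
        self.pending = new_value
        self.locked = True           # stays locked until commit or rollback

    def read(self):
        if self.locked:
            raise ResourceLocked()
        return self.value

    def commit(self):
        if self.pending is not None:
            self.value = self.pending
        self.pending = None
        self.locked = False

    def rollback(self):
        self.pending = None
        self.locked = False


def run_together(work):
    """work: list of (branch, delta). Commit all branches or none of them."""
    executed = []
    try:
        for branch, delta in work:
            branch.execute(delta)
            executed.append(branch)
    except ValueError:
        for branch in executed:
            branch.rollback()
        return False
    for branch in executed:
        branch.commit()
    return True


account = Branch(500)
stock = Branch(0)

# While a branch waits for the others, its resource cannot be read.
account.execute(-100)
try:
    account.read()
    readable_while_waiting = True
except ResourceLocked:
    readable_while_waiting = False
account.rollback()

all_committed = run_together([(account, -100), (stock, -1)])
```

Because the stock deduction fails, the balance deduction is rolled back as well, and no reader ever observes a half-finished order. The blocked read in the middle is the availability that was given up.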

Analysis: Common features of the above methods

In both approaches the sub-transactions need to communicate with one another and learn each other's execution status. A transaction coordinator is therefore needed to help the sub-transactions of a distributed transaction communicate and perceive one another's status.

Take the earlier order scenario: the user places an order, which calls the order service and then the account and inventory services. A transaction coordinator is added to this flow.

If strong consistency is required, the order service does not commit after it executes. Instead, it first reports its execution status (success or failure) to the transaction coordinator. The account service and inventory service then execute and report their status to the coordinator as well. If the coordinator finds that any of them failed, it notifies all of them to roll back, so that every service ends in the same state.

In this process, each sub-transaction is called a branch transaction, and the set of associated branch transactions as a whole is called a global transaction.
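The coordinator's bookkeeping can be sketched as follows. This is an illustration under assumptions, not the API of any particular framework: `TransactionCoordinator`, `begin`, `report`, `decide`, and the `xid` identifier are hypothetical names.

```python
import uuid


class TransactionCoordinator:
    def __init__(self):
        # Global transaction id -> {branch name: "success" or "failure"}
        self.global_transactions = {}

    def begin(self):
        xid = uuid.uuid4().hex
        self.global_transactions[xid] = {}
        return xid

    def report(self, xid, branch_name, succeeded):
        status = "success" if succeeded else "failure"
        self.global_transactions[xid][branch_name] = status

    def decide(self, xid):
        statuses = self.global_transactions[xid].values()
        if all(status == "success" for status in statuses):
            return "commit"
        return "rollback"


def place_order(coordinator, branch_results):
    """branch_results: list of (branch name, succeeded) in execution order."""
    xid = coordinator.begin()                    # start the global transaction
    for name, succeeded in branch_results:
        coordinator.report(xid, name, succeeded) # each branch reports its status
    return coordinator.decide(xid)               # one decision for all branches


coordinator = TransactionCoordinator()
failed_order = place_order(
    coordinator, [("order", True), ("account", True), ("inventory", False)]
)
good_order = place_order(
    coordinator, [("order", True), ("account", True), ("inventory", True)]
)
```

Each call to `place_order` is one global transaction, and every reported entry is one branch transaction. The branches never talk to each other directly; they only learn the shared decision from the coordinator.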

Origin blog.csdn.net/CYK_byte/article/details/133514663