Art~Distributed transaction CAP and BASE theory, 2PC, 3PC, TCC model

Preface

In a traditional monolithic or MVC-style system, data consistency is guaranteed by a relational database and its ACID transactions: you open a transaction, perform the updates, and then either commit or roll back. Data access technologies and frameworks such as Spring make this even more convenient, because you only need to focus on the business logic that changes the data.

However, as QPS and business volume keep growing, an all-in-one architecture can no longer cope with the data volume and traffic, and the business has to be split horizontally. After the split, a single operation may span two or more databases. Local transactions cannot cover that case, so distributed transactions are needed to keep the data consistent.

Before getting to practice, it is worth covering some theory. Just as traditional relational databases have ACID, distributed transactions have their own characteristic properties.

CAP theory

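The CAP theory can be regarded as a yardstick for evaluating the reliability of a distributed transaction design. It names three properties, of which a distributed system can fully guarantee at most two at the same time.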

Consistency

Refers to whether all replicas of the data in a distributed system hold the same value at the same moment.

Availability

Refers to whether the cluster as a whole can respond to requests normally after some nodes in the cluster fail.

Partition Tolerance

Refers to whether the system can keep working when the network splits it into groups of nodes that cannot communicate with each other in time.
In practice this amounts to a time limit on communication: if the nodes cannot reach a consistent state within that limit, a partition has occurred, and the system has to decide how to proceed, typically by giving up either consistency or availability for the affected requests.
Since real network hardware will always suffer communication failures such as delay and packet loss, partition tolerance must be provided.
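The trade-off among the three properties can be illustrated with a minimal sketch. The `CapReplica` class, the `write` function, and the `mode` parameter are hypothetical names invented for this illustration. They do not come from any library, and real systems detect partitions with timeouts rather than an explicit list of reachable nodes.

```python
class CapReplica:
    def __init__(self, name):
        self.name = name
        self.value = None


class Unavailable(Exception):
    """Raised when a CP-style write cannot reach every replica."""


def write(replicas, reachable, value, mode):
    """Write `value` through the replicas listed in `reachable`.

    mode="CP": refuse the write unless every replica is reachable.
    mode="AP": accept the write on whatever replicas are reachable.
    """
    if mode == "CP" and len(reachable) < len(replicas):
        raise Unavailable("partition detected, write rejected")
    for replica in reachable:
        replica.value = value


a, b = CapReplica("a"), CapReplica("b")
write([a, b], [a, b], 1, mode="CP")   # no partition: both replicas updated

# Partition: only replica a is reachable.
try:
    write([a, b], [a], 2, mode="CP")
    cp_available = True
except Unavailable:
    cp_available = False              # consistency kept, availability lost

write([a, b], [a], 2, mode="AP")      # accepted, but replica b is now stale
ap_consistent = a.value == b.value
```

Under the partition, the CP-style write is rejected while the AP-style write succeeds and leaves the two replicas with different values.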

BASE thought

The BASE theory is an extension of the CAP theory. The basic idea is that even if strong consistency cannot be achieved, an application can use appropriate methods to achieve eventual consistency.
It can also be understood simply as the bottom line for a distributed system.

Basically Available

It means that when a distributed system fails, it is allowed to lose part of its availability as long as the core functions remain available. Rate limiting and service degradation are typical examples.
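As a rough illustration of rate limiting and degradation, the sketch below keeps the core function of a page available while shedding excess requests and dropping a non-core feature. All names here (`RateLimiter`, `product_page`, `recommendations`) are hypothetical, and the counter-based limiter is deliberately simplistic. A production limiter would reset per time window.

```python
class RateLimiter:
    """Allow at most `limit` calls in total (no time window, for brevity)."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def allow(self):
        if self.count < self.limit:
            self.count += 1
            return True
        return False


def recommendations(user_id):
    # Simulates a failed non-core dependency.
    raise TimeoutError("recommendation service is down")


def product_page(user_id, limiter):
    if not limiter.allow():
        return {"status": "busy"}              # rate limiting: shed excess load
    page = {"status": "ok", "product": "book"}  # core function stays available
    try:
        page["recommendations"] = recommendations(user_id)
    except TimeoutError:
        page["recommendations"] = []           # degradation: drop non-core data
    return page


limiter = RateLimiter(limit=2)
pages = [product_page("u1", limiter) for _ in range(3)]
```

The first two requests still get the product even though recommendations are unavailable, and the third is turned away by the limiter.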

Soft State

Refers to allowing the system to have intermediate states, as long as they do not affect its overall availability. Distributed storage generally keeps several copies of each piece of data, and allowing a delay while those copies synchronize between nodes is one manifestation of soft state.

Eventual Consistency

It means that all copies of the data in the system reach a consistent state after a certain period of time.
The consistency in CAP is strong consistency. This level matches users' intuition best and gives the best user experience, but implementing it costs a lot in system performance. Weak consistency is the opposite. Eventual consistency in BASE can be regarded as a special case of weak consistency.
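Soft state and eventual consistency can be sketched with a primary that replicates asynchronously. The classes and the `replicate` function below are hypothetical and exist only for illustration. The point is that a read from the replica may be stale for a while, but the copies converge once pending updates are delivered.

```python
from collections import deque


class Primary:
    def __init__(self):
        self.data = {}
        self.pending = deque()   # updates not yet shipped to the replica

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))


class Follower:
    def __init__(self):
        self.data = {}


def replicate(primary, follower):
    """Ship every pending update, in order, to the follower."""
    while primary.pending:
        key, value = primary.pending.popleft()
        follower.data[key] = value


primary, follower = Primary(), Follower()
primary.write("stock", 9)
stale_read = follower.data.get("stock")   # soft state: follower lags behind
replicate(primary, follower)
fresh_read = follower.data.get("stock")   # copies have converged
```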

Comparison summary

The point of discussing CAP theory and BASE thinking is that even if strong consistency cannot be achieved, we can still use appropriate methods to achieve eventual consistency.
For a microservice architecture, it is recommended to maintain consistency in a looser way, namely eventual consistency. Developers can choose among different solutions for achieving eventual consistency according to the business scenario.

Strong consistency solution

2PC

A 2PC distributed transaction is divided into two stages.

  • The first stage, the preparation stage:
    The coordinator sends the proposal to the participants and collects their feedback (the node that makes the proposal is the coordinator, and the nodes that take part in the decision are the participants). Each participant executes the transaction logic sent by the coordinator, records its own redo log and undo log, and reports success to the coordinator if no error occurred.
  • The second stage, the commit stage:
    The coordinator decides whether to commit or abort the transaction based on the feedback. If every participant reported success, the coordinator tells the participants to commit. If any participant returned an error, the coordinator notifies the other participants to roll back.

In short, the coordinator makes a proposal, asks each participant whether it accepts, and then commits or aborts the transaction based on the participants' feedback.
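The two stages can be sketched in a few lines. This is a single-process simulation under simplifying assumptions: no network, no timeouts, no crash recovery. `Participant` and `two_phase_commit` are hypothetical names, not a real framework API.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Execute the work locally, keep it uncommitted, then vote.
        self.state = "prepared" if self.can_commit else "failed"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def two_phase_commit(participants):
    # Stage 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Stage 2: commit only if every vote was yes, otherwise roll back all.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "rolled_back"


good = [Participant("order"), Participant("payment")]
good_result = two_phase_commit(good)

bad = [Participant("order"), Participant("payment", can_commit=False)]
bad_result = two_phase_commit(bad)
```

A single No vote is enough to roll back every participant, including those that voted Yes.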

Problems with 2PC

Problem 1: Synchronous blocking
During execution, all participants block on the transaction. While participants hold shared resources, any other node that needs those resources has to wait.

Problem 2: Single point of failure
Once the coordinator fails, the participants remain blocked. This is worst in the second stage: if the coordinator fails there, all participants keep their resources locked and cannot complete the rest of the transaction.

Problem 3: Data inconsistency
If a local network failure occurs after the coordinator starts sending commit requests, or the coordinator itself fails while sending them, only some of the participants receive the request. Those participants commit, while the ones that never received the request cannot, so the data becomes inconsistent.
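Problems 2 and 3 can be reproduced with a small simulation of a coordinator that crashes partway through the second stage. The `Node` class and the `crash_after` parameter are hypothetical and used only to force the failure at a chosen point.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.state = "prepared"   # voted yes, resources still locked

    def commit(self):
        self.state = "committed"


def send_commits(nodes, crash_after):
    """Deliver commit to the first `crash_after` nodes, then simulate a crash."""
    for index, node in enumerate(nodes):
        if index == crash_after:
            return False          # coordinator died; remaining nodes hear nothing
        node.commit()
    return True


nodes = [Node("order"), Node("payment"), Node("stock")]
finished = send_commits(nodes, crash_after=1)
states = [node.state for node in nodes]
```

After the crash, the first node has committed while the other two are still prepared and holding their locks, which is exactly the inconsistent, blocked situation described above.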

3PC

Compared with 2PC, 3PC makes two main changes: it introduces a timeout mechanism for both the coordinator and the participants, and it splits the preparation stage into two.

3PC consists of three stages: CanCommit, PreCommit, and DoCommit. The specific steps are as follows.

CanCommit stage: The coordinator sends a CanCommit request to the participants. A participant returns a Yes response if it can perform the transaction and a No response otherwise. If any participant cannot perform the transaction, the coordinator terminates the transaction immediately, which reduces the time participants spend blocked.

PreCommit stage: The coordinator decides, based on the participants' responses, whether the transaction can enter PreCommit.
If the coordinator received a Yes response from every participant, it dispatches the task so that the participants execute the transaction logic.
If any participant sent a No response, or the coordinator does not receive a participant's ACK within the timeout after dispatching the task, the coordinator notifies the participants to abort the transaction.

DoCommit stage: The transaction is committed or aborted, and each participant records its own redo log and undo log. If the coordinator receives an error, or times out without receiving a Committed response, it notifies the other participants to roll back.

As can be seen, 3PC mainly targets the single point of failure and synchronous blocking problems of 2PC. By adding one more inquiry round before execution, it shortens blocking time, raises the success rate of transaction execution, and lowers the rollback rate. It mitigates these problems rather than eliminating them, and data can still become inconsistent when the network partitions.
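The three stages can be sketched as a single-process simulation. The names are hypothetical, and a missing ACK is modelled with a simple `ack=False` flag rather than a real timer, so this shows only the control flow, not real timeout handling.

```python
class Participant3:
    def __init__(self, name, vote=True, ack=True):
        self.name = name
        self.vote = vote      # answer in the CanCommit stage
        self.ack = ack        # False simulates a PreCommit timeout (no ACK)
        self.state = "init"

    def can_commit(self):
        return self.vote      # cheap check only, nothing is executed yet

    def pre_commit(self):
        if not self.ack:
            return False
        self.state = "pre_committed"   # transaction logic executed, not committed
        return True

    def do_commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def three_phase_commit(participants):
    # Stage 1 (CanCommit): ask before doing any work.
    if not all(p.can_commit() for p in participants):
        for p in participants:
            p.abort()
        return "aborted_in_can_commit"
    # Stage 2 (PreCommit): execute and wait for ACKs.
    acks = [p.pre_commit() for p in participants]
    if not all(acks):
        for p in participants:
            p.abort()
        return "aborted_in_pre_commit"
    # Stage 3 (DoCommit): make the result permanent.
    for p in participants:
        p.do_commit()
    return "committed"


ok = three_phase_commit([Participant3("order"), Participant3("payment")])
early = three_phase_commit([Participant3("order"), Participant3("payment", vote=False)])
late = three_phase_commit([Participant3("order"), Participant3("payment", ack=False)])
```

In the `early` case the transaction stops before any participant executes anything, which is where the reduction in blocking time comes from.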

Eventually consistent solution

Compensation mode

The basic idea is to use an additional compensation service to coordinate the microservices that must stay consistent. The compensation service calls each microservice in order. If one call fails, all the calls that already completed are cancelled. Each microservice that must stay consistent provides a compensation operation for this purpose.

The example involves two microservices, the order microservice and the payment microservice, both of which provide compensation operations. If the payment service fails, the order created earlier by the order service must be cancelled.

Key point

For the compensation service, the operation records of all services are critical, because they are the prerequisite for performing cancellation.

For example, the order service and the payment service need to keep detailed operation records and logs. These help determine which step failed and in what state, which in turn fixes the scope of compensation and supplies the business data needed to compensate.

If only the order service fails, then only one service needs to be compensated. If the payment service also fails, the two services are rolled back.

The compensation operation needs business data such as the business serial number, the account number, and the amount paid. In theory the unique business serial number is enough to complete the compensation, but providing more data makes the microservices more robust.
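The order and payment example can be sketched as follows. Everything here is hypothetical: the function names, the shape of the log entries, and the `balance` argument standing in for a real payment check. The sketch only shows how operation records keyed by a serial number drive the cancellation of steps that already completed.

```python
class PaymentFailed(Exception):
    pass


def create_order(log, serial):
    log.append({"serial": serial, "step": "order", "status": "done"})


def cancel_order(log, serial):
    log.append({"serial": serial, "step": "order", "status": "cancelled"})


def pay(log, serial, amount, balance):
    if amount > balance:
        log.append({"serial": serial, "step": "payment", "status": "failed"})
        raise PaymentFailed(serial)
    log.append({"serial": serial, "step": "payment", "status": "done"})


def place_order(log, serial, amount, balance):
    completed = []   # compensation operations for steps that already succeeded
    try:
        create_order(log, serial)
        completed.append(cancel_order)
        pay(log, serial, amount, balance)
        return "success"
    except PaymentFailed:
        # Undo completed steps in reverse order.
        for compensate in reversed(completed):
            compensate(log, serial)
        return "compensated"


log = []
result = place_order(log, "S-1", amount=100, balance=50)
history = [(entry["step"], entry["status"]) for entry in log]
```

The log shows the order being created, the payment failing, and the order then being cancelled by the compensation step.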

Solution

The key to implementing the compensation mode is to record the complete business flow, that is, every step of the business operation together with its data, so that the compensation operation can obtain the business data it needs from that record.

From the status recorded in the business flow, the compensation service can determine the scope of compensation, and the business data needed during compensation can be read from the same record.

The compensation service is itself invoked like any other service, so its calls can also fail. Some robustness mechanism is needed to ensure that compensation succeeds, and the compensation operations must be idempotent.

(1) Service restart: If the cause of the failure is not transient but a business error, the problem needs to be corrected and the operation executed again.

(2) Immediate retry: For transient failures such as network faults or database locks, retrying largely ensures that the task eventually executes.

(3) Scheduled retry: The call is repeated on a timer, usually with an upper limit on the number of attempts. Once the limit is reached, no further retry is made.

If service restart, immediate retry, scheduled retry, and similar strategies cannot solve the problem, the relevant staff must be notified to handle it, which is the manual intervention mode.
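A bounded retry around an idempotent compensation operation might look like the sketch below. `RefundService` is a hypothetical stand-in for a real payment endpoint, and its `failures_before_success` parameter exists only to simulate transient errors. The retry here is immediate, and a scheduled retry would add a delay between attempts.

```python
class TransientError(Exception):
    """A failure that may succeed on retry, such as a network glitch."""


class RefundService:
    """Hypothetical refund endpoint that fails a fixed number of times first."""

    def __init__(self, failures_before_success):
        self.failures_left = failures_before_success
        self.refunded = set()     # serial numbers already compensated
        self.balance = 0

    def refund(self, serial, amount):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise TransientError("temporary failure")
        if serial in self.refunded:
            return                # idempotent: a repeated call changes nothing
        self.balance += amount
        self.refunded.add(serial)


def compensate_with_retry(service, serial, amount, max_attempts):
    """Return the attempt number that succeeded, or None once the limit is hit."""
    for attempt in range(1, max_attempts + 1):
        try:
            service.refund(serial, amount)
            return attempt
        except TransientError:
            continue
    return None                   # caller should escalate to manual intervention


flaky = RefundService(failures_before_success=2)
first = compensate_with_retry(flaky, "S-1", 100, max_attempts=5)   # third attempt succeeds
repeat = compensate_with_retry(flaky, "S-1", 100, max_attempts=5)  # duplicate call

broken = RefundService(failures_before_success=10)
gave_up = compensate_with_retry(broken, "S-2", 100, max_attempts=3)
```

Because the refund is keyed by the business serial number, the duplicate call does not refund twice, and the `None` result marks the point where manual intervention would take over.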

TCC mode

A complete TCC business consists of one main service and several slave services. The main service initiates and completes the entire business process.

Each slave service provides three interfaces: Try, Confirm, and Cancel.

Try interface: Completes all business rule checks and reserves business resources.

Confirm interface: Actually executes the business. It performs no business checks of its own and only uses the business resources reserved in the Try phase. The operation must be idempotent.

Cancel interface: Releases the business resources reserved in the Try phase. It must also be idempotent.

For example, suppose the order system is split into two scenarios: order placement and order payment.

(1) Try stage: attempt to execute the business.

On the one hand, all business checks are completed. For placing an order, for example, this means verifying that the product is available and that the user's account balance is sufficient.

On the other hand, business resources are reserved, for example by freezing the part of the user's account balance that will pay for the order, so that no concurrent process can deduct it and make the later payment impossible.

(2) Confirm stage: execute the business.

If everything went normally in the Try stage, the order is placed and the payment amount is deducted from the user's account.

(3) Cancel stage: cancel the execution of the business.

Release the business resources reserved in the Try stage. If the Try stage succeeded only partially, for example the product was available and the order was placed normally but the account balance was insufficient and the freeze failed, the product order must be cancelled and the reserved product released.
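The three stages can be sketched with a stock service and an account service. This is an in-memory illustration under stated assumptions: `Stock`, `Account`, and `place_order` are hypothetical names, and there is no transaction manager, persistence, or retry. Confirm and Cancel are written so that calling them twice has no further effect.

```python
class Stock:
    def __init__(self, quantity):
        self.quantity = quantity
        self.reserved = {}        # order id -> reserved count

    def try_reserve(self, order_id, count):
        if self.quantity < count:
            return False
        self.quantity -= count
        self.reserved[order_id] = count
        return True

    def confirm(self, order_id):
        self.reserved.pop(order_id, None)                 # consume the reservation

    def cancel(self, order_id):
        self.quantity += self.reserved.pop(order_id, 0)   # give it back


class Account:
    def __init__(self, balance):
        self.balance = balance
        self.frozen = {}          # order id -> frozen amount

    def try_freeze(self, order_id, amount):
        if self.balance < amount:
            return False
        self.balance -= amount
        self.frozen[order_id] = amount
        return True

    def confirm(self, order_id):
        self.frozen.pop(order_id, None)                   # frozen money is spent

    def cancel(self, order_id):
        self.balance += self.frozen.pop(order_id, 0)      # unfreeze


def place_order(order_id, stock, account, count, amount):
    # Try: check and reserve on every service.
    if not stock.try_reserve(order_id, count):
        return "cancelled"
    if not account.try_freeze(order_id, amount):
        stock.cancel(order_id)    # Cancel: release what Try already reserved
        return "cancelled"
    # Confirm: use only the resources reserved by Try.
    stock.confirm(order_id)
    account.confirm(order_id)
    return "confirmed"


stock, account = Stock(3), Account(50)
failed = place_order("O-1", stock, account, count=1, amount=100)
after_failure = (stock.quantity, account.balance)
succeeded = place_order("O-2", stock, account, count=1, amount=30)
account.cancel("O-2")             # late duplicate Cancel has no effect
```

The first order mirrors the partial-success case described above: the product is reserved, the freeze fails, and the reservation is released again.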

TCC summary

The TCC service framework does not need to record a detailed business flow. The logic that performs the Confirm and Cancel operations is provided by the business services themselves.

When implementing the TCC mode, the most important work is designing a stable, highly available, and scalable TCC transaction manager.

In a cross-service business operation, Try first locks the business resources in each service to reserve them. Only if the reservation succeeds can the subsequent operations proceed normally. Confirm then performs the business operation on the resources locked in the Try phase, and Cancel rolls back when an operation fails.

TCC requires the business side to implement the corresponding functions, so the development cost is relatively high. A recommended TCC framework is:
http://github.com/protera/spring-cloud-rest-tcc;

Origin blog.csdn.net/Shangxingya/article/details/115041755