[Distributed] How to ensure consistency

Any discussion of consistency has to start with transactions. The word is now closely associated with databases, but it is worth noting that, strictly speaking, transactions are not a database-only concept. From the database point of view, a transaction can be defined as follows:

A transaction is a program execution unit that accesses, and possibly updates, various data items in a database. Transactions are usually produced by the execution of user programs written in a high-level database manipulation language or programming language (such as SQL, C++, or Java), and are delimited by statements (or function calls) such as begin transaction and end transaction. A transaction consists of all the operations performed between begin transaction and end transaction.

To make transactions easier to reason about and implement, the ACID properties were abstracted. In other words, a system that provides the ACID properties provides transactions.

  • Atomicity: all the steps of a transaction are treated as one action; either every step completes, or none of them do.
  • Consistency: when a transaction completes, all data must be left in a consistent state.
  • Isolation: mainly used to implement concurrency control.
    Isolation ensures that concurrently executing transactions behave as if they ran one after another; through isolation, one unfinished transaction cannot affect another unfinished transaction.
  • Durability: once a transaction has committed, its changes to the data in the database are permanent; even a subsequent database failure must not affect them.

Implementing ACID on a single machine is not especially difficult; it can usually be guaranteed with mechanisms such as locks, timestamps, or sequential logs. One point worth adding: isolation is generally achieved with a locking mechanism, while atomicity, consistency, and durability can all be achieved with logs. The Raft consensus algorithm likewise uses a log to ensure consistency.
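As a toy illustration of the point that logs can provide atomicity and durability, here is a minimal redo-log sketch in Java. The class and method names are invented for illustration: writes are appended to a log first and are applied to the data only once a COMMIT record exists, so an uncommitted transaction leaves no visible trace.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a redo-style log: mutations are appended to the
// log first and only applied to the data map when the transaction's
// log entries end with a COMMIT marker.
class WalSketch {
    // Each entry is {txId, key, value}; {txId, "COMMIT", ""} marks commit.
    private final List<String[]> log = new ArrayList<>();
    private final Map<String, String> data = new HashMap<>();

    void logWrite(String txId, String key, String value) {
        log.add(new String[]{txId, key, value});
    }

    void commit(String txId) {
        log.add(new String[]{txId, "COMMIT", ""});
        replay(txId);
    }

    // Apply every logged write of a committed transaction. A transaction
    // that never committed (e.g. the process crashed first) is never
    // replayed, so its writes stay invisible: all-or-nothing.
    private void replay(String txId) {
        boolean committed = log.stream()
                .anyMatch(e -> e[0].equals(txId) && e[1].equals("COMMIT"));
        if (!committed) return;
        for (String[] e : log)
            if (e[0].equals(txId) && !e[1].equals("COMMIT"))
                data.put(e[1], e[2]);
    }

    String read(String key) { return data.get(key); }
}
```

A real storage engine would persist the log to disk before acknowledging commit, which is what makes the changes durable across crashes.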

In a distributed environment, however, things are more complicated than on a single machine, because network communication is unreliable. For multiple nodes on different networks to stay consistent, factors such as network latency and network failures must be taken into account.

Local transactions

Before discussing data consistency under microservices, a brief word on the background. A traditional single-node application uses one RDBMS as its data source. The application starts a transaction, performs its CRUD operations, and commits or rolls back; everything happens inside one local transaction, with transaction support provided directly by the resource manager (RM). Data consistency is guaranteed within that local transaction.


Distributed transactions

Two Phase Commit (2PC)

The two-phase commit (2PC) protocol was proposed on the basis of the industry's first distributed transaction standard, the X/Open DTP specification. As the name implies, it negotiates a commit through two phases.

The X/Open DTP model describes the roles involved in a distributed transaction:

  1. AP: the application program, which triggers the distributed transaction. It uses special transaction instructions (XA instructions), which are taken over by the TM and sent to the relevant RMs for execution.
  2. RM: the resource manager, generally a database; each RM executes only the instructions that concern it. At the programming level this maps to ODBC, ADO.NET, JDBC, and so on.
  3. TM: the transaction manager (or transaction coordinator), which receives the instructions initiated by the AP, schedules and coordinates all RMs participating in the transaction, and ensures that the transaction completes or rolls back as a whole.

The two-phase commit protocol was first used to implement distributed transactions inside databases, and most database distributed transactions today use the XA protocol. In 2PC, the commit of a transaction is divided into two phases:

  1. Prepare phase. All resource managers (RMs) participating in the transaction are asked to reserve resources and do their preparation work, including persisting logs and locking resources; each RM then returns its result to the transaction manager (TM). This phase takes up most of the time of the whole transaction.
  2. Commit phase. The transaction manager (TM) decides whether to commit or roll back based on the results of the previous phase. Only if every resource manager (RM) agreed to commit does the TM notify all RMs to formally commit the transaction; otherwise the TM notifies all RMs to abort it.
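The two phases above can be sketched as a small coordinator loop. This is an illustrative model only, not a real XA implementation; the Participant interface and all names are assumptions:

```java
import java.util.List;

// Participant stands in for a resource manager (RM); the coordinator
// plays the transaction manager (TM).
interface Participant {
    boolean prepare();  // phase 1: persist log, lock resources, vote
    void commit();      // phase 2, commit branch
    void rollback();    // phase 2, abort branch
}

class TwoPhaseCoordinator {
    // Returns true if the global transaction committed.
    boolean run(List<? extends Participant> rms) {
        boolean allVotedYes = true;
        for (Participant rm : rms) {            // phase 1: collect votes
            if (!rm.prepare()) { allVotedYes = false; break; }
        }
        for (Participant rm : rms) {            // phase 2: one decision for all
            if (allVotedYes) rm.commit(); else rm.rollback();
        }
        return allVotedYes;
    }
}

// Trivial in-memory participant used only to exercise the flow.
class InMemoryParticipant implements Participant {
    final boolean vote;
    String state = "init";
    InMemoryParticipant(boolean vote) { this.vote = vote; }
    public boolean prepare() { return vote; }
    public void commit() { state = "committed"; }
    public void rollback() { state = "rolled-back"; }
}
```

Note how a single "no" vote in phase 1 forces every participant to roll back, which is exactly the all-or-nothing decision the protocol is designed to produce.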

(Figures: the prepare phase; two-phase commit, commit case; two-phase commit, rollback case.)

The essence of the two-phase protocol is that, by splitting the commit into two stages, it minimizes the window in which an unreliable commit can fail. In a real two-phase commit, the first phase takes up most of the transaction's time, while actually committing in the second phase is almost instantaneous; that is the cleverness of the design.

But in practice we rarely use the two-phase commit protocol to guarantee transactionality. Why?

  1. In real scenarios, very few businesses need strong consistency; what is most commonly used is eventual consistency based on the BASE theory.
  2. The two-phase commit protocol needs to lock resources, which costs performance, so it is not suitable for high-concurrency scenarios.
  3. The two-phase commit protocol introduces a transaction manager (TM), which increases system complexity, and most developers are not well versed in TM and RM internals.

Moreover, two-phase commit cannot fully guarantee data consistency and suffers from synchronous blocking, so an optimized version, three-phase commit (3PC), was devised.

Three Phase Commit (3PC)

3PC splits the prepare phase in two, giving three phases: CanCommit (ask whether each participant could commit), PreCommit (do the actual preparation), and DoCommit (commit or abort), and it adds timeouts on the participant side so that a participant is not blocked forever waiting for the coordinator. Even so, 3PC can only guarantee data consistency in most cases, not all.

So are distributed transactions with 2PC or 3PC suitable for transaction management under microservices? The answer is no, for three reasons:

  • Microservices do not access each other's data directly; they call each other through RPC (e.g. Dubbo) or HTTP APIs (e.g. Spring Cloud), so it is no longer possible for a TM to manage the microservices' RMs in a unified way.
  • Different microservices may use completely different kinds of data sources. If a microservice uses a database that does not support transactions, such as some NoSQL stores, transactions are out of the question.
  • Even if every microservice's data source supports transactions, wrapping many microservices in one large transaction means that transaction lives orders of magnitude longer than a local transaction. Such long-running, cross-service transactions hold many locks and make data unavailable, seriously hurting system performance.

Clearly, traditional distributed transactions can no longer meet the transaction management requirements of a microservice architecture. Since traditional ACID transactions cannot be satisfied, transaction management under microservices must follow a new rule: the BASE theory.

The BASE theory was proposed by eBay architect Dan Pritchett. It is an extension of the CAP theorem; its core idea is that even when strong consistency cannot be achieved, an application should reach eventual consistency in a way appropriate to it. BASE stands for Basically Available, Soft state, and Eventual consistency.

  • Basically available: when a distributed system fails, it is allowed to lose part of its availability, as long as the core functions remain available.
  • Soft state: the system is allowed to pass through intermediate states that do not affect its overall availability. In distributed storage, a piece of data generally has at least three replicas, and the lag allowed while replicas synchronize across nodes is an embodiment of soft state.
  • Eventual consistency: all replicas of the data in the system reach a consistent state after some period of time. Weak consistency is the opposite of strong consistency, and eventual consistency is a special case of weak consistency.

Eventual consistency in BASE is the fundamental requirement for transaction management under microservices. Transaction management based on microservices cannot achieve strong consistency, but it must guarantee eventual consistency. So which methods can do that? By implementation principle there are two main families: event notification and compensation. Event notification divides into the reliable event notification pattern and the best-effort notification pattern; compensation divides into the TCC pattern and the business compensation pattern. All four patterns can achieve eventual consistency of data under microservices.

Reliable Event Notification Pattern

Synchronous events

The design idea of the reliable event notification pattern is easy to understand: after the main service finishes, it passes the result to the downstream service through an event (usually via a message queue); the downstream service consumes the message and completes its business, thereby achieving message consistency between the main service and the downstream service. The first and simplest idea is synchronous event notification, where business processing and message sending execute synchronously. See the code and sequence diagram below for the logic.

public void trans() {
    try {
        // 1. Update the database; a failure here throws an exception
        boolean result = dao.update(data);
        // 2. If the database update succeeded, send the message;
        //    a failure here also throws an exception
        if (result) {
            mq.send(data);
        }
    } catch (Exception e) {
        // Roll back on any exception
        rollback();
    }
}

The logic above looks airtight: if the database operation fails, the method exits without sending the message; if sending the message fails, the database is rolled back; if both succeed, the business succeeds and the message reaches the downstream consumer. On closer inspection, though, synchronous message notification has two shortcomings.

  1. Under a microservice architecture there can be network I/O problems or server crashes. If such a problem occurs at step 7 of the sequence diagram, the main service either is not notified that the message was delivered (network problem) or cannot go on to commit its transaction (crash). The main service then concludes that message delivery failed and rolls back its own business, but the message has in fact already been consumed by the downstream service, leaving the two services inconsistent. (The two sequence diagrams for these scenarios are omitted.)
  2. The event service (here, the message service) is too tightly coupled to the business: if the message service is unavailable, the business is unavailable. The event service should be decoupled from the business and run independently and asynchronously; alternatively, try to send the message after the business completes, and if sending fails, degrade to asynchronous sending.

Asynchronous events

Local event service:

To solve the problems of synchronous events described above, an asynchronous event notification pattern was developed: business services and the event service are decoupled, events are handled asynchronously, and a separate event service guarantees reliable delivery.
When the business executes, the event is written into a local event table within the same local transaction, and delivery is attempted at the same time. If delivery succeeds, the event is deleted from the event table. If it fails, the event service periodically processes the failed events asynchronously and in bulk, re-delivering until each event is delivered correctly and then deleting it from the table. This maximizes the chance that delivery succeeds, and when the first delivery fails, the asynchronous event service still ensures the event is delivered at least once.
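The flow just described can be sketched as follows. The in-memory event table and the MessageSender interface are stand-ins for a real database table and message queue; all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the local-event-table flow: the event row is written in the
// same local transaction as the business data (modeled here as one
// method), delivery is attempted immediately, and a periodic sweep
// re-delivers anything still in the table.
class LocalEventTable {
    interface MessageSender { boolean send(String event); } // may fail

    private final List<String> eventTable = new ArrayList<>();
    private final MessageSender sender;

    LocalEventTable(MessageSender sender) { this.sender = sender; }

    // Business commit: persist data + event "atomically", then try to deliver.
    void commitBusiness(String event) {
        eventTable.add(event);                 // same local transaction as the data
        if (sender.send(event)) {
            eventTable.remove(event);          // delivered: delete the event row
        }
    }

    // Periodic job: re-deliver every event still in the table
    // (at-least-once delivery).
    void redeliverPending() {
        eventTable.removeIf(sender::send);
    }

    List<String> pending() { return eventTable; }
}
```

Because the sweep keeps retrying until the send succeeds, the same event can reach the consumer more than once, which is why idempotent consumption (discussed later in this section) is required.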

However, using a local event service for reliable event notification also has drawbacks: the business is still somewhat coupled to the event service (at the first, synchronous delivery), and more seriously, the local transaction has to take on the extra event-table operations, which puts pressure on the database. In a high-concurrency scenario, every business operation generates a corresponding event-table operation, which nearly halves the database's available throughput; that is unacceptable. For this reason the reliable event notification pattern was developed further, and the external event service came into view.

External event service:

The external event service goes a step beyond the local event service by separating the event service from the main business service, so that the main business service no longer has any strong dependency on it.
Before committing, the business service sends the event to the event service, which only records it and does not yet deliver it. After committing or rolling back, the business service notifies the event service, which then delivers or deletes the event. There is no need to worry that the business system might crash after commit or rollback and fail to send the confirmation: the event service periodically fetches all undelivered events and queries the business system, then delivers or deletes each event based on the business system's answer.
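A minimal sketch of this pre-record / confirm / sweep protocol, with invented names and an in-memory store standing in for the real event service:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of the external-event-service protocol: the business
// pre-registers the event, then confirms (deliver) or aborts (delete);
// a periodic sweep resolves events whose confirmation was lost by
// asking the business system for the transaction outcome.
class ExternalEventService {
    enum State { RECORDED, SENT, DELETED }

    private final Map<String, State> events = new HashMap<>();

    void record(String id)     { events.put(id, State.RECORDED); } // before business commit
    void onCommit(String id)   { events.put(id, State.SENT); }     // deliver the event
    void onRollback(String id) { events.put(id, State.DELETED); }  // drop the event

    // Periodic job: for each still-unresolved event, ask the business
    // system whether its transaction committed, then deliver or delete.
    void sweep(Function<String, Boolean> businessCommitted) {
        events.replaceAll((id, st) ->
            st == State.RECORDED
                ? (businessCommitted.apply(id) ? State.SENT : State.DELETED)
                : st);
    }

    State state(String id) { return events.get(id); }
}
```

The `businessCommitted` callback is the query interface the text says the business system must expose; the sweep is what removes the business system's strong dependency on ever sending its confirmation.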

Although an external event service decouples the business system from the event system, it brings extra work: compared with a local event service, it adds two network round trips (before commit, and after commit/rollback), and the business system must also provide a separate query interface so the event system can determine the status of undelivered events.

Notes on the reliable event notification pattern:

Two things need attention in the reliable event pattern: (1) events must be sent correctly; (2) events may be consumed more than once.

Correct sending can be ensured by the asynchronous message service. But because events may be sent repeatedly, the consumer must guarantee that the same event is not consumed twice; in short, event consumption must be idempotent.

If the event itself is an idempotent state event, such as an order-status notification (placed, paid, shipped, and so on), the ordering of events must be checked, generally by timestamp: once a newer message has been consumed, an older one that arrives later is simply discarded. If a global timestamp cannot be provided, a globally unique sequence number should be considered instead.

Events without idempotency are generally action events, such as "deduct 100" or "deposit 200". For these, the event ID and its result should be persisted, and the event ID should be looked up before the event is consumed: if it has already been consumed, return the stored result directly; if it is a new message, execute it and store the result.
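A sketch of such an idempotent consumer for a hypothetical "deduct" action event; the persisted (eventId, result) store is modeled with an in-memory map, and all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Idempotent consumption of non-idempotent action events: persist the
// event id together with its result, and on a duplicate delivery
// return the stored result instead of executing the action again.
class IdempotentConsumer {
    private final Map<String, Integer> processed = new HashMap<>(); // eventId -> result
    private int balance;

    IdempotentConsumer(int initialBalance) { this.balance = initialBalance; }

    int handleDeduct(String eventId, int amount) {
        Integer prior = processed.get(eventId);
        if (prior != null) return prior;   // already consumed: replay stored result
        balance -= amount;                 // execute the action exactly once
        processed.put(eventId, balance);
        return balance;
    }

    int balance() { return balance; }
}
```

In a real system the (eventId, result) record and the balance update would have to be written in the same local transaction, otherwise a crash between them reintroduces the duplicate-execution problem.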

Best-Effort Notification Pattern

Compared with reliable event notification, the best-effort notification pattern is much easier to understand. Its hallmark is that after the business service commits its transaction, it sends the message a limited number of times (a maximum retry count is set), for example three times; if all three sends fail, it stops, so messages may be lost. The main business side must therefore provide a query interface so the downstream service can recover lost messages. Best-effort notification offers relatively poor timeliness (the system may remain in a soft state for a long time), so it cannot be used by systems with high timeliness requirements on data consistency. It is typically used for notifications across business platforms or to third-party services, such as bank notifications or merchant notifications, which will not be expanded on here.
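The bounded-retry idea can be sketched like this; the Sender interface and the retry limit are illustrative assumptions:

```java
// Best-effort notification: retry up to a fixed limit, then give up
// and accept possible message loss; the receiver recovers lost
// messages via a query interface (not shown here).
class BestEffortNotifier {
    interface Sender { boolean send(String msg); } // may fail transiently

    private final Sender sender;
    private final int maxAttempts;

    BestEffortNotifier(Sender sender, int maxAttempts) {
        this.sender = sender;
        this.maxAttempts = maxAttempts;
    }

    // Returns true if any attempt within the limit succeeded.
    boolean notifyOnce(String msg) {
        for (int i = 0; i < maxAttempts; i++) {
            if (sender.send(msg)) return true;
        }
        return false; // give up: the message may be lost
    }
}
```

A production version would usually space the retries out (e.g. 1 min, 5 min, 30 min) rather than retry in a tight loop, but the bounded-attempts decision is the essence of the pattern.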

Business Compensation Pattern

Next, the two compensation patterns. Their biggest difference from the event notification patterns is that in a compensation pattern the upstream service depends on the running results of the downstream services, whereas in an event notification pattern it does not. First, the business compensation pattern, which is a pure compensation pattern. Its design idea is that each business commits normally as it is invoked, and when some service fails, all the upstream services it depends on perform a compensating operation. For example, Xiao Ming sets off from Hangzhou on a business trip to New York and needs to book a train ticket from Hangzhou to Shanghai plus a flight from Shanghai to New York. If, after successfully buying the train ticket, he finds the flight is sold out that day, then rather than stay an extra day in Shanghai he may cancel the train ticket and instead fly to Beijing and transfer to New York; so he cancels the train ticket to Shanghai. In this example, buying the Hangzhou-Shanghai train ticket is service a and buying the Shanghai-New York flight is service b; the business compensation pattern compensates service a (cancelling the train ticket to Shanghai) when service b fails.

The compensation pattern requires every service to provide a compensation interface, and this compensation is generally incomplete: even after the compensating operation, the record of the cancelled train ticket is still stored in the database and can be traced (typically a status field such as "Cancelled" serves as the marker); after all, committed production data generally cannot be physically deleted.

The biggest drawback of the business compensation pattern is that the soft state lasts a long time: the timeliness of data consistency is very low, and the services involved may frequently be in a mutually inconsistent state.
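The travel example can be sketched as follows: each step commits as it runs, and a later failure triggers reverse-order compensation that only flips a status flag (the "incomplete" compensation described above). All names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Pure business-compensation flow: run the steps in order; when one
// fails, compensate the already-committed steps in reverse order.
class CompensationSaga {
    interface Step {
        boolean execute();
        void compensate();
    }

    static boolean run(List<? extends Step> steps) {
        List<Step> done = new ArrayList<>();
        for (Step s : steps) {
            if (!s.execute()) {
                // Downstream failed: compensate completed steps in reverse.
                for (int i = done.size() - 1; i >= 0; i--) done.get(i).compensate();
                return false;
            }
            done.add(s);
        }
        return true;
    }
}

// Illustrative ticket record: compensation flips the status flag
// rather than deleting the row.
class Ticket implements CompensationSaga.Step {
    final boolean available;
    String status = "none";
    Ticket(boolean available) { this.available = available; }
    public boolean execute() {
        if (!available) return false;
        status = "booked";
        return true;
    }
    public void compensate() { status = "cancelled"; }
}
```

The long soft-state window criticized above corresponds to the interval between a step's `execute()` committing and its later `compensate()` call: during that time the train ticket is visibly "booked" even though the overall trip may still fail.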

TCC

Because of the flaws of the two-phase commit protocol, TCC was introduced for distributed transactions. TCC is short for the three operations Try (reserve), Confirm, and Cancel, and it consists of a reservation phase followed by a confirmation-or-cancellation phase. The general flow of a TCC distributed transaction is as follows:

The TCC pattern is an optimized business compensation pattern that achieves complete compensation: after compensating, no compensation record is left, as if nothing had happened. Its soft-state window is also very short, because TCC is a two-phase pattern: the second-phase Confirm operation runs only if the first phase (Try) succeeds on every service, and otherwise the compensating Cancel operation runs; no real business processing is performed in the Try phase.

  • Reservation phase: the transaction initiator sends a request to every business party in the transaction, asking it to reserve business resources, and each party responds.
  • Commit phase: the transaction initiator collects the replies. If all are OK, it tells every party to confirm the transaction; if at least one party returned a non-OK result, it tells every party to cancel.

The TCC pattern thus proceeds in two phases:

  • Try: the business service performs all business checks and reserves the necessary business resources.
  • If Try succeeds on every service, Confirm is executed. Confirm performs no business checks (they were already done in Try) and only uses the resources reserved in the Try phase to do the business processing. Otherwise Cancel is executed, which releases the resources reserved in the Try phase.



This may still feel abstract, so here is a concrete example:

Xiao Ming transfers 100 yuan online from China Merchants Bank (CMB) to China Guangfa Bank (CGB). This can be seen as two services: service a deducts 100 yuan from Xiao Ming's CMB account, and service b deposits 100 yuan into Xiao Ming's CGB account.

Service a (deduct 100 yuan from Xiao Ming's CMB account):

try: update cmb_account set balance=balance-100, freeze=freeze+100 where acc_id=1 and balance>=100;

confirm: update cmb_account set freeze=freeze-100 where acc_id=1;

cancel: update cmb_account set balance=balance+100, freeze=freeze-100 where acc_id=1;

Service b (deposit 100 yuan into Xiao Ming's CGB account):

try: update cgb_account set freeze=freeze+100 where acc_id=1;

confirm: update cgb_account set balance=balance+100, freeze=freeze-100 where acc_id=1;

cancel: update cgb_account set freeze=freeze-100 where acc_id=1;

Specific notes:

In service a's try phase, two things happen: (1) the business check, here verifying that Xiao Ming's account holds at least 100 yuan; (2) reserving the resource, by moving 100 yuan from the balance into frozen funds.

In service a's confirm phase, no business check is performed (the try phase already did it); since the transfer has succeeded, the frozen funds are deducted.

In service a's cancel phase, the reserved resource is released: the 100 yuan of frozen funds is returned to the balance.

In service b's try phase, the resource is reserved by freezing 100 yuan.

In service b's confirm phase, the resource reserved in the try phase is used: the 100 yuan of frozen funds is moved into the balance.

In service b's cancel phase, the resource reserved in the try phase is released: 100 yuan is subtracted from the frozen funds.
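The SQL above for service a can be mirrored in a small Java sketch (illustrative only, not production accounting code): try freezes the funds after the business check, confirm consumes the freeze, and cancel restores the balance.

```java
// Java mirror of the TCC operations for service a: the CMB account.
class TccAccount {
    private int balance;
    private int frozen = 0;

    TccAccount(int balance) { this.balance = balance; }

    // Try phase: business check + reserve the resource.
    boolean tryFreeze(int amount) {
        if (balance < amount) return false; // check: enough money?
        balance -= amount;                  // move money from balance...
        frozen += amount;                   // ...into frozen funds
        return true;
    }

    // Confirm phase: consume the reserved resource; no re-check needed.
    void confirm(int amount) { frozen -= amount; }

    // Cancel phase: release the reservation and restore the balance.
    void cancel(int amount) {
        frozen -= amount;
        balance += amount;
    }

    int balance() { return balance; }
    int frozen()  { return frozen; }
}
```

Service b would be the symmetric case (try freezes an incoming 100, confirm moves it into the balance), as in the cgb_account statements above.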

As this simple example shows, the TCC pattern is more complex than the pure business compensation pattern, since every service has to implement the Confirm and Cancel interfaces.

TCC is essentially a compensating transaction. As the operations show, each business party must register three operations for the current transaction: reserve, confirm, and cancel, each coded by the participating business itself; at the coding level, each party provides three operation interfaces. For consistency, the confirm and cancel operations must be idempotent, because both may be retried by the system or by human intervention.

Operationally, TCC is more like a programming model: it targets the business layer, so its performance is much higher than two-phase commit, which mainly targets the database layer. The main application scenario of the two-phase commit protocol today is still inside databases, so it essentially relies on the database's locking mechanism, which is one of the important reasons it is rarely used in high-concurrency Internet applications.

Summary

The table below compares the four patterns discussed, summarizing the properties described above:

  Pattern                     | Depends on downstream result | Soft-state duration | Typical use
  ----------------------------|------------------------------|---------------------|--------------------------------------------
  Reliable event notification | No                           | Short               | Asynchronous business within one platform
  Best-effort notification    | No                           | Possibly long       | Cross-platform / third-party notifications
  TCC                         | Yes                          | Very short          | Business needing complete compensation
  Business compensation       | Yes                          | Long                | Flows where each step commits immediately

Source

How to ensure data consistency in a distributed environment?
Several ways to ensure data consistency in a distributed environment

Origin blog.csdn.net/weixin_44231544/article/details/126517019