Transaction Processing in Microservices


Hi everyone, happy Labor Day! I hope you all find time to relax during the holiday, but don't forget to keep improving yourself during the break.

Reference books:

"Phoenix Architecture"
"Microservice Architecture Design Pattern"

Introduction

One of the biggest concerns under a microservice architecture is how to implement transactions across multiple services. Transactions are an essential element of every enterprise application. This article introduces local transactions, global transactions, shared transactions, and distributed transactions in detail, explained from the perspective of a relational database (MySQL).

Local Transactions

A local transaction is a transaction that operates on a single transactional resource only and does not require coordination by a global transaction manager.

Local transactions are the most basic transaction solution and apply only to the scenario where a single service uses a single data source (in effect, a database transaction). When explaining database transactions, we usually start from the four attributes, the "ACID" properties. The author of "Phoenix Architecture", however, argues that these four properties are not orthogonal: A, I, and D are means, while C is the end; the former are the cause, the latter the effect (I personally find this explanation excellent). The attributes are explained as follows:

Transaction processing appears in almost every information system. It exists to ensure that all data in the system meets expectations and that interrelated data never contradict each other, that is, to ensure the consistency of data state (Consistency).

According to classic database theory, achieving this goal requires three properties working together:

  • Atomicity (Atomic): within the same business process, the transaction guarantees that modifications to multiple data items either all take effect or are all revoked together.
  • Isolation: across different business processes, the transaction guarantees that the data each business reads and writes are independent of one another and do not affect each other.
  • Durability: the transaction guarantees that all successfully committed data modifications are correctly persisted, with no data loss.

The implementation principles of these transaction properties trace back to the ARIES theory (Algorithms for Recovery and Isolation Exploiting Semantics). Redo logs, undo logs, and WAL (write the log first) are all explained in that paper, and I recommend reading the original text.

Atomicity and Durability

Atomicity and durability are closely related, so they are explained together. Atomicity guarantees that a transaction's multiple operations either all take effect or none do, with no intermediate state; durability guarantees that once a transaction takes effect, its modifications will not be revoked or lost for any reason.

But how does the database guarantee atomicity and durability? Early databases adopted a mechanism called Commit Logging. To understand it, consider an example (quoted from "Phoenix Architecture"):

Purchasing a book requires modifying three pieces of data: subtracting the payment from the user's account, adding the payment to the merchant's account, and marking one book in the commodity warehouse as shipped. Because the writes pass through intermediate states, the following crash scenarios can occur:

  • Crash after writing, before the transaction commits: the program has not finished modifying the three data items, but the database has already written one or two of the changes to disk when a crash occurs. After restarting, the database must have a way to know that an incomplete shopping operation happened before the crash, and restore the modified data on disk to its unmodified state, guaranteeing atomicity.
  • Crash before writing, after the transaction commits: the program has modified all three data items, but the database has not yet written all three changes to disk when a crash occurs. After restarting, the database must have a way to know that a complete shopping operation happened before the crash, and rewrite the data that did not make it to disk, guaranteeing durability.

Since intermediate states and crashes are unavoidable, atomicity and durability can only be ensured by taking recovery measures after a crash. This data recovery operation is called "Crash Recovery" (also known as Failure Recovery or Transaction Recovery).

To make crash recovery possible, writing data to disk cannot simply change some value in some row and column of a table the way a program changes a variable in memory. Instead, all the information required to modify the data, including which data were modified, in which memory page and disk block the data physically reside, what value was changed to what value, and so on, must first be written to disk in the form of a log, that is, as sequential, append-only file writes (the most efficient way to write). Only after all log records are safely on disk, and the database sees the "Commit Record" in the log marking the transaction as successfully committed, does it modify the real data according to the log. After the modification, an "End Record" is added to the log to indicate that the transaction has completed persistence. This transaction implementation is called "Commit Logging".

The principle by which Commit Logging guarantees durability and atomicity is not hard to see. First, once the Commit Record has been written to the log, the transaction is successful: even if the real data have not yet been modified, the system can redo them from the log information after a restart, which guarantees durability. Second, if a crash occurs before the Commit Record is successfully written, the whole transaction fails: after the system restarts, it will see log records without a Commit Record, mark that part of the log as rolled back, and the transaction will look as if it never happened, which guarantees atomicity.
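To make the ordering concrete, here is a minimal sketch of the Commit Logging write path, assuming simplified stand-in interfaces (Change, Log, Storage); none of these names come from a real database.

```java
// A minimal sketch of the Commit Logging write path, with illustrative types.
import java.util.List;

interface Change { }
interface Log { void append(Object record); }
interface Storage { void apply(Change change); }

class CommitLogging {
    record CommitRecord(long txId) {}
    record EndRecord(long txId) {}

    void commit(long txId, List<Change> changes, Log log, Storage storage) {
        // Before the Commit Record reaches disk, data pages must not be touched.
        for (Change c : changes) log.append(c);    // sequential, append-only writes
        log.append(new CommitRecord(txId));        // transaction now counts as committed
        for (Change c : changes) storage.apply(c); // real modifications happen only now
        log.append(new EndRecord(txId));           // marks persistence as complete
    }
}
```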

Although Commit Logging guarantees atomicity and durability, it has a huge inherent defect: all real modifications to the data must happen after the transaction commits, that is, after the Commit Record is written to the log. Before that, even if disk I/O is idle, even if a transaction modifies a huge amount of data that occupies a large memory buffer, no matter the reason, it is never allowed to modify the data on disk before the transaction commits. This is the premise of Commit Logging, but it is very detrimental to database performance.

So how to improve this performance becomes the focus. You may already guess the answer: the ARIES mentioned in the introduction makes its debut here. ARIES proposed the "Write-Ahead Logging" improvement, where "Write-Ahead" means allowing changed data to be written to disk in advance, before the transaction commits.

[Figure: the ARIES paper's description of the WAL mechanism]
Translated, the gist is as follows:

In computer science, write-ahead logging (WAL) is a family of techniques that provide atomicity and durability (two of the ACID properties) in database systems. It can be seen as an implementation of an "event sourcing" architecture, where the state of a system is the result of incoming events evolving from an initial state. The write-ahead log is an append-only secondary disk-resident structure used for crash and transaction recovery. Changes are first recorded in the log and must be written to stable storage before the changes can be written to the database.

The main functions of the write-ahead log can be summarized as:

  • Allow the page cache to buffer updates to disk-resident pages while still guaranteeing durability semantics in the larger context of the database system.
  • Persist all operations to disk until the cached copies of the pages affected by those operations are themselves synchronized to disk. Every operation that modifies database state must be logged to disk before the contents of the associated pages are modified.
  • Allow lost in-memory changes to be reconstructed from the operation log in case of a crash.

In systems using WAL, all modifications are written to the log before being applied. Usually redo and undo information are stored in the log.

The purpose of this can be illustrated with an example. Imagine a program doing something while the machine it's running on loses power. When restarting, the program may need to know whether the operation it was performing succeeded, partially succeeded, or failed. If a write-ahead log is used, a program can examine that log and compare what it should have done in the event of an unexpected power outage with what it actually did. On the basis of this comparison, the program can decide to undo what it has started, finish what it has started, or leave it as it is.

After a certain number of operations, the program should perform a checkpoint (checkpointing is a fault-tolerance technique that saves a snapshot of the application state so that, in the event of a failure, the application can restart from that point; it is especially important for long-running applications on failure-prone systems), writing all changes specified in the WAL to the database and clearing the log.

Even after translation, some details may still be unclear, so here is the "Phoenix Architecture" author's reading of the paper:

ARIES proposed the "Write-Ahead Logging" improvement: "Write-Ahead" means that changed data is allowed to be written before the transaction commits.

Write-Ahead Logging classifies when changed data may be written to disk, relative to the transaction commit point, along two dimensions: FORCE and STEAL.

  • FORCE: if, when a transaction commits, the changed data must be written to disk at the same time, the strategy is called FORCE; if this is not required, it is NO-FORCE. In reality most databases adopt NO-FORCE: as long as the log exists, the changed data can be persisted at any time, so from the perspective of disk I/O performance there is no need to force data writes immediately.
  • STEAL: if changed data are allowed to be written to disk in advance, before the transaction commits, the strategy is called STEAL; if not, NO-STEAL. From the perspective of disk I/O performance, writing data early makes use of idle I/O resources and saves memory in the database cache.

Commit Logging allows NO-FORCE but not STEAL: if part of the changed data were written to disk before the transaction committed, then once the transaction rolled back or a crash occurred, that prematurely written data would become erroneous.

Write-Ahead Logging allows NO-FORCE and also allows STEAL. It does so by adding another kind of log, the Undo Log. Before changed data is written to disk, an Undo Log record must be written first, noting which data is modified and from what value to what value, so that the changes written in advance can be erased according to the Undo Log during rollback or crash recovery. Undo Log is now generally translated as "rollback log", and the log recorded earlier for replaying data changes during crash recovery is correspondingly named Redo Log, the "redo log". With the addition of the Undo Log, Write-Ahead Logging performs the following three phases during crash recovery (sketched in code after the list):

  • Analysis phase (Analysis): scan the log from the last checkpoint (a point before which all changes that should be persisted are known to be safely on disk), find all transactions without an End Record, and build the set of transactions to be recovered. This set typically comprises two parts: the Transaction Table and the Dirty Page Table.
  • Redo phase (Redo): repeat history (Repeat History) based on the set produced by the analysis phase. Concretely: find all logs containing a Commit Record and write their data modifications to disk; when a transaction's writes are complete, add an End Record to the log and remove it from the to-be-recovered set.
  • Rollback phase (Undo): handle whatever remains of the recovery set after analysis and redo. The remaining transactions, which need to be rolled back, are called Losers. Based on the Undo Log, the data they wrote to disk in advance is rewritten, thereby rolling back these Loser transactions.
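A minimal sketch of these three phases, assuming a simplified in-memory log model; LogRecord, RecoveryManager, and the page-write stub are illustrative, not a real database's types.

```java
// A minimal sketch of ARIES-style crash recovery over a simplified log.
import java.util.*;

enum RecordType { UPDATE, COMMIT, END }

record LogRecord(long txId, RecordType type,
                 String page, String undoValue, String redoValue) {}

class RecoveryManager {
    void recover(List<LogRecord> log) {
        // Analysis: collect transactions that never reached an End Record.
        Set<Long> updated = new HashSet<>();    // transactions with changes
        Set<Long> committed = new HashSet<>();  // Commit Record seen, no End Record
        for (LogRecord r : log) {
            switch (r.type()) {
                case UPDATE -> updated.add(r.txId());
                case COMMIT -> committed.add(r.txId());
                case END    -> { updated.remove(r.txId()); committed.remove(r.txId()); }
            }
        }
        // Redo: repeat history for transactions that did reach a Commit Record.
        for (LogRecord r : log) {
            if (r.type() == RecordType.UPDATE && committed.contains(r.txId())) {
                writePage(r.page(), r.redoValue());
            }
        }
        // Undo: remaining "loser" transactions (no Commit Record) are rolled
        // back in reverse log order, erasing changes written early under STEAL.
        List<LogRecord> reversed = new ArrayList<>(log);
        Collections.reverse(reversed);
        for (LogRecord r : reversed) {
            if (r.type() == RecordType.UPDATE
                    && updated.contains(r.txId())
                    && !committed.contains(r.txId())) {
                writePage(r.page(), r.undoValue());
            }
        }
    }

    private void writePage(String page, String value) { /* flush to data page */ }
}
```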

Depending on whether the database allows FORCE and STEAL, four combinations arise. From the perspective of optimizing disk I/O, NO-FORCE plus STEAL undoubtedly performs best; from the perspective of algorithm and log complexity, NO-FORCE plus STEAL is also undoubtedly the most complex. The relationship between the four combinations and the Undo Log and Redo Log is shown in the figure below:

[Figure: the four FORCE/STEAL combinations and the logs each requires]
In fact, the paper also mentions another approach: the shadow page (in computer science, shadow paging is a technique that provides atomicity and durability in database systems; a page here is a unit of physical storage, typically on the order of 1 to 64 KiB). The idea is somewhat like the copy-on-write approach of Java's CopyOnWriteArrayList.

The general idea of Shadow Paging is that changes are still written to the data on disk, but instead of modifying the original data in place, a copy is made first: the original data is preserved and the copy is modified. During the transaction, the modified data therefore exists in two versions at once, the data before modification and the data after, which is also the origin of the name "shadow". When the transaction commits successfully and all data modifications have been persisted, the last step is to change the reference pointer, switching it from the original data to the newly modified copy. This final "modify the pointer" step is treated as an atomic operation: with modern disks, a single pointer write can be regarded as hardware-guaranteed never to end up "half changed". So Shadow Paging can also guarantee atomicity and durability. It implements transactions more simply than Commit Logging, but once isolation and concurrent locking enter the picture, its transaction concurrency capability is rather limited, so it is not widely used in high-performance databases.

So far we have covered how the database guarantees atomicity and durability. If this topic interests you, read the original paper. Next, let's look at how the database guarantees isolation.

Isolation

Transaction isolation means that a transaction's execution, that is, the operations it performs and the data it uses, is isolated from other concurrent transactions, so that concurrently executing transactions cannot interfere with one another.

What happens if isolation is not guaranteed? Assume account A holds 200 yuan and account B holds 0 yuan, and A transfers 50 yuan to B twice, each transfer executed as a separate transaction. Without isolation, the two transactions' statements can interleave:

```sql
UPDATE accounts SET money = money - 50 WHERE NAME = 'AA';
UPDATE accounts SET money = money + 50 WHERE NAME = 'BB';
```

From the definition of isolation we can see that isolation is closely tied to concurrency: if there were no concurrency and all transactions ran serially, no isolation would be needed, since such access is naturally isolated. In reality, though, concurrency is unavoidable. How do we achieve serializable data access under concurrency? Databases offer two solutions: the first is locking (the one that comes to mind easily), the second is MVCC.

Locks

In a database, besides contention for traditional computing resources (CPU, RAM, I/O, and so on), data itself is a resource shared by many users. To ensure data consistency, concurrent access must be controlled, which is why locks exist. The lock mechanism also underpins the various isolation levels of MySQL, and lock conflicts are an important factor in the performance of concurrent database access, so locks are especially important, and especially complicated, for databases. Modern databases provide the following three kinds of locks (a JDBC sketch follows the list):

  • Write lock (Write Lock, also known as exclusive lock, eXclusive Lock, abbreviated X-Lock): if data holds a write lock, only the transaction holding that lock may write the data; while the write lock is held, other transactions may neither write the data nor place a read lock on it.
  • Read lock (Read Lock, also known as shared lock, Shared Lock, abbreviated S-Lock): multiple transactions may place read locks on the same data; once a read lock is present, the data cannot acquire a write lock, so other transactions cannot write it, but they can still read it. If only one transaction holds a read lock on the data, it is allowed to upgrade the lock directly to a write lock and then write the data.
  • Range lock (Range Lock): place an exclusive lock directly on a range, so that no data in the range can be written, for example InnoDB's gap locks. (Note the difference between "a range cannot be written" and "a batch of data cannot be written": do not mistake a range lock for a set of exclusive locks. With a range lock, not only can the existing data in the range not be modified, but no data can be inserted into or deleted from the range either, which a set of exclusive locks cannot achieve.)
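Here is a hedged JDBC sketch showing how these locks surface in MySQL InnoDB; the connection URL, credentials, and the accounts table are assumptions carried over from the transfer example.

```java
// A hedged sketch of the three lock kinds on MySQL InnoDB via JDBC.
import java.sql.*;

public class LockDemo {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test", "user", "pass")) {
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                // S-lock (shared/read lock); LOCK IN SHARE MODE before MySQL 8.0.
                st.executeQuery(
                    "SELECT money FROM accounts WHERE name = 'AA' FOR SHARE");
                // X-lock (exclusive/write lock): blocks other X- and S-locks.
                st.executeQuery(
                    "SELECT money FROM accounts WHERE name = 'BB' FOR UPDATE");
                // A range predicate can additionally take gap/next-key locks
                // under REPEATABLE READ, blocking inserts into the range too.
                st.executeQuery(
                    "SELECT * FROM accounts WHERE money BETWEEN 0 AND 100 FOR UPDATE");
                conn.commit();   // all locks are released at commit
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```
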
MVCC

MVCC (Multiversion Concurrency Control), as the name implies, implements database concurrency control by managing multiple versions of data rows. The technique enables consistent reads under InnoDB's transaction isolation levels: when querying rows that another transaction is updating, you can see their values from before the update, so queries need not wait for the other transaction to release its locks. "Version" is the key word in that sentence. You can think of the version as two invisible fields on every row in the database, CREATE_VERSION and DELETE_VERSION, each recording a transaction ID, where transaction IDs are globally and strictly increasing. Data is then written according to the following rules:

  • Inserting data: CREATE_VERSION records the inserting transaction's ID; DELETE_VERSION is empty.
  • Deleting data: DELETE_VERSION records the deleting transaction's ID; CREATE_VERSION is empty.
  • Modifying data: treat the modification as "delete the old data, insert the new data". The original data is copied first; the original's DELETE_VERSION records the modifying transaction's ID, and its CREATE_VERSION is empty. The new copy's CREATE_VERSION records the modifying transaction's ID, and its DELETE_VERSION is empty.

Now, when another transaction wants to read the changed data, the version it sees is decided by its isolation level (see the sketch after the list):

  • Repeatable read: always read the record whose CREATE_VERSION is less than or equal to the current transaction ID; if multiple versions still qualify, take the latest one (the largest transaction ID).
  • Read committed: always take the newest version, that is, the most recently committed version of the record.
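A minimal sketch of the visibility rule just described, assuming each row version simply exposes the two hidden fields; it deliberately ignores commit status and the other details of InnoDB's real ReadView logic.

```java
// A minimal sketch of MVCC version selection under two isolation levels.
import java.util.*;

enum IsolationLevel { READ_COMMITTED, REPEATABLE_READ }

record RowVersion(long createVersion, Long deleteVersion, String data) {}

class MvccReader {
    // versions: all versions of one logical row, in any order
    Optional<RowVersion> read(List<RowVersion> versions,
                              long currentTxId, IsolationLevel level) {
        return versions.stream()
            .filter(v -> v.deleteVersion() == null)           // not deleted
            .filter(v -> level == IsolationLevel.READ_COMMITTED
                         || v.createVersion() <= currentTxId) // RR: versions <= my tx ID
            .max(Comparator.comparingLong(RowVersion::createVersion)); // take newest
    }
}
```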

That concludes the introduction to isolation. Next, let's look at global transactions.

Global Transactions

A global transaction is a transaction managed and coordinated by a transaction manager. Another definition is: a transaction in a scenario where a single service uses multiple data sources. Note that, in theory, a true global transaction has no "single service" constraint.

A global transaction is a DTP-model transaction. The DTP model (X/Open Distributed Transaction Processing Reference Model) is a distributed transaction standard defined by the X/Open organization: it defines the specification and API interfaces, and vendors provide the concrete implementations.
Its core is to define the communication interfaces between the global transaction manager (Transaction Manager, which coordinates global transactions) and the local resource managers (Resource Manager, which drive local transactions).

X/Open DTP defines three components and two protocols:

  • AP (Application Program): the application, that is, a program that uses DTP.
  • RM (Resource Manager): the resource manager, here a DBMS or a message-server management system; the application controls resources through the resource manager.
  • TM (Transaction Manager): the transaction manager, responsible for coordinating and managing transactions; it provides the AP with programming interfaces and manages the resource managers.
  • XA protocol: the interface through which the transaction manager and the resource managers communicate.
  • TX protocol: the interface between the application (or application server) and the transaction manager.

We focus on the XA protocol, in which transaction commit is split into a two-phase process (a coordinator sketch follows the list):

  • Preparation phase: also known as the voting phase. The coordinator asks all participants of the transaction whether they are ready to commit; a participant that is ready replies Prepared, otherwise Non-Prepared. "Preparation" here is not the everyday sense of the word: for a database, the preparation operation records the content of all the transaction's commit operations in the redo log. It differs from a real commit in a local transaction only in that it does not yet write the final Commit Record. This means that data persistence is complete but isolation has not been released: locks are still held, keeping the data isolated from observers outside the transaction.
  • Commit phase: also known as the execution phase. If the coordinator received Prepared from every participant in the previous phase, it first durably records its own transaction state as Commit, then sends a Commit instruction to all participants, who commit immediately. If any participant replied Non-Prepared, or any participant timed out without replying, the coordinator durably records its state as Abort and sends an Abort instruction to all participants, who roll back immediately. For a database, the commit operation in this phase should be very light: it only persists one Commit Record, which usually completes quickly. Only on receiving Abort must the prepared data be cleaned up according to the rollback log, which can be a relatively heavy operation.
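A hedged sketch of the coordinator's two phases; Participant is an illustrative interface, not the real XA API.

```java
// A minimal sketch of a two-phase commit coordinator.
import java.util.List;

interface Participant {
    boolean prepare(long txId);  // write redo log, keep locks; true = Prepared
    void commit(long txId);      // persist the Commit Record, release locks
    void rollback(long txId);    // erase prepared changes via the rollback log
}

class TwoPhaseCommitCoordinator {
    void run(long txId, List<Participant> participants) {
        boolean allPrepared = true;
        // Phase 1 (voting): ask every participant to prepare.
        for (Participant p : participants) {
            try {
                if (!p.prepare(txId)) { allPrepared = false; break; }
            } catch (RuntimeException timeoutOrFailure) {
                allPrepared = false;   // no reply counts as Non-Prepared
                break;
            }
        }
        persistDecision(txId, allPrepared); // coordinator's own durable decision
        // Phase 2 (commit/abort): broadcast the decision to all participants.
        for (Participant p : participants) {
            if (allPrepared) p.commit(txId);
            else p.rollback(txId);
        }
    }

    private void persistDecision(long txId, boolean commit) { /* local log */ }
}
```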

These two phases together are called the "two-phase commit" (2 Phase Commit, 2PC) protocol. For it to successfully guarantee consistency, some additional prerequisites must hold:

  • The network must be assumed reliable during the short commit phase, that is, no messages are lost then. It is also assumed that no erroneous messages are delivered at any point: messages may be lost, but never corrupted. XA's design goal is not to solve problems like the Byzantine generals. In two-phase commit, a failure in the voting phase can be remedied (rolled back), while a failure in the commit phase cannot (the commit or rollback outcome will no longer change; a crashed node can only be waited on to recover), so this phase should be as short as possible, which is also a way of limiting exposure to network risk.
  • Nodes that drop out due to network partitions, machine crashes, or other reasons must be assumed to eventually recover, never to stay down permanently. Since the complete redo log was already written during the preparation phase, a crashed machine can, once recovered, find the prepared-but-uncommitted transaction data in its log, query the coordinator for the transaction's status, and decide whether the next step is a commit or a rollback.

The coordinator and participants above can be played by the databases themselves, without the application's intervention; the coordinator is generally elected among the participants, and the application acts only as a client of the database. The interaction sequence of two-phase commit is shown in the figure below.

[Figure: two-phase commit interaction sequence]
Two-phase commit is simple in principle and not hard to implement, but it has several very significant drawbacks:

  • Single point of failure: the coordinator plays a pivotal role in two-phase commit. While waiting for participants' replies, the coordinator can apply a timeout, so a participant may safely go down; but a participant waiting for the coordinator's instruction cannot time out. If it is the coordinator, not a participant, that goes down, all participants are affected: until the coordinator recovers and sends a Commit or Rollback instruction, every participant must wait.
  • Performance: during two-phase commit, all participants are effectively bound into a single scheduled whole. The process involves two rounds of remote calls and three rounds of data persistence (writing redo logs in the preparation phase, the coordinator persisting its decision, and writing the Commit Record in the commit phase), and the whole thing lasts until the slowest participant in the cluster finishes, which makes 2PC's performance usually poor.
  • Consistency risk: as mentioned, two-phase commit rests on prerequisites; when the assumptions of network stability and crash recovery do not hold, consistency can still be broken. Crash recovery needs no further comment: in 1985, Fischer, Lynch, and Paterson proposed the "FLP impossibility" result, proving that if crashed nodes may never recover, no distributed protocol can guarantee reaching a correct consensus result. FLP stands alongside the CAP theorem as a foundational theory of distributed systems. The consistency risk from network instability means the commit phase, though very short, is still a window of real danger: after deciding the transaction can commit, the coordinator persists its state and commits its own transaction; if the network then fails and the Commit instruction can no longer reach all participants, part of the data (the coordinator's) is committed while part (some participants') is neither committed nor rolled back, and the data become inconsistent.

To mitigate some of 2PC's defects, 3PC (three-phase commit) was later proposed. Three-phase commit splits the original preparation phase in two, called CanCommit and PreCommit, and renames the commit phase DoCommit. The new CanCommit phase is an inquiry: the coordinator asks each participating database to evaluate, based on its own state, whether the transaction is likely to complete successfully. The reason for splitting the preparation phase is that preparation is a heavy operation: once the coordinator sends the prepare message, each participant immediately starts writing redo logs and locking the data resources involved, so if some participant then declares it cannot commit, everyone has done a round of useless work. Adding an inquiry round first means that a positive answer gives more confidence that the transaction will commit, making the probability of rollback smaller. Hence, in scenarios where the transaction must roll back, three-phase commit usually performs much better than two-phase; but in scenarios where the transaction commits normally, both perform poorly, the three-phase version even slightly worse because of the extra inquiry round.

Also because the probability of a failure rollback becomes smaller: in three-phase commit, if the coordinator goes down after the PreCommit phase, that is, a participant never receives the DoCommit message, the participant's default strategy is to commit the transaction rather than roll back or keep waiting. This effectively removes the coordinator's single-point risk. The three-phase commit sequence is shown in the figure below.
[Figure: three-phase commit operation sequence]

Shared Transactions

A shared transaction (Share Transaction) is a transaction in which multiple services share the same data source.

This approach is not recommended, so I won't go into detail; it is enough to know that this type of transaction exists.

Distributed Transactions

Distributed Transaction here refers specifically to the transaction mechanism in which multiple services access multiple data sources at once. Note the difference from the "distributed transaction" of the DTP model: the "distributed" in DTP is relative to data sources and does not involve services.

At present there are several solutions for distributed transactions: reliable event queues, TCC transactions, and SAGA transactions. For reasons of space, this article explains only SAGA.

SAGA Transactions

Maintain data consistency across multiple services by using asynchronous messages to coordinate a series of local transactions.

Saga is a mechanism for maintaining data consistency in the microservice architecture, which can avoid problems caused by distributed transactions. A saga represents a system operation that requires updating data in multiple services.

The implementation of a saga contains the logic that coordinates its steps. When a saga is started by a system command, the coordination logic must select and notify the first saga participant to perform a local transaction. Once that transaction completes, the coordinator selects and invokes the next participant, and this continues until the saga has executed every step. If any local transaction fails, the saga must execute compensating transactions in reverse order. There are several ways to build a saga's coordination logic:

  • Choreography: distribute the saga's decision-making and sequencing logic among the saga's participants, which communicate by exchanging events.
  • Orchestration: centralize the saga's decision-making and sequencing logic in a saga orchestrator class, which sends command messages to each saga participant, instructing those participant services to perform specific operations (local transactions).

Choreography-based Saga

One way to implement a saga is choreography. With choreography, there is no central coordinator telling the saga participants what to do; instead, participants subscribe to each other's events and respond accordingly. Let's use an example to explain choreography-based sagas in detail:

Case: a user places an order with a store on Meituan (Meituan is mentioned only to give an intuitive picture; the case is actually quoted from "Microservice Architecture Design Pattern", whose example application resembles Meituan).

From the software's point of view, when the user places the order, the order-creation flow (Create Order Saga) proceeds roughly as follows (an event-handler sketch follows the list):

  1. The Order Service creates an Order in the APPROVAL_PENDING state and publishes the OrderCreated event.
  2. The Consumer Service consumes the OrderCreated event, verifies that the consumer can place the order, and publishes the ConsumerVerified event.
  3. The Kitchen Service consumes the OrderCreated event, validates the order, creates a kitchen work order (Ticket) in the CREATE_PENDING state, and publishes the TicketCreated event.
  4. The Accounting Service consumes the OrderCreated event and creates a CreditCardAuthorization in the PENDING state.
  5. The Accounting Service consumes the TicketCreated and ConsumerVerified events, charges the consumer's credit card, and publishes the CreditCardAuthorized event.
  6. The Kitchen Service consumes the CreditCardAuthorized event and changes the Ticket's state to AWAITING_ACCEPTANCE.
  7. The Order Service receives the CreditCardAuthorized event, changes the Order's status to APPROVED, and publishes the OrderApproved event.
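A minimal sketch of one choreography participant, the Kitchen Service, reacting to the events above; EventBus and TicketRepository are illustrative interfaces, not a real messaging library's API.

```java
// A minimal sketch of a choreography-style saga participant.
import java.util.function.Consumer;

record Event(String type, long orderId) {}

interface EventBus {
    void subscribe(String eventType, Consumer<Event> handler);
    void publish(Event event);
}

interface TicketRepository {
    void create(long orderId, String state);
    void updateState(long orderId, String state);
}

class KitchenService {
    KitchenService(EventBus bus, TicketRepository tickets) {
        bus.subscribe("OrderCreated", e -> {
            tickets.create(e.orderId(), "CREATE_PENDING");      // local transaction
            bus.publish(new Event("TicketCreated", e.orderId()));
        });
        bus.subscribe("CreditCardAuthorized", e ->
            tickets.updateState(e.orderId(), "AWAITING_ACCEPTANCE"));
        bus.subscribe("CreditCardAuthorizationFailed", e ->     // compensation path
            tickets.updateState(e.orderId(), "REJECTED"));
    }
}
```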

The Create Order Saga must also handle the scenario where a saga participant rejects the Order and publishes some failure event; for example, authorization of the consumer's credit card may fail. The saga must then execute compensating transactions to undo what has already been done. The event flow in that case is:

  1. The Order Service creates an Order in the APPROVAL_PENDING state and publishes the OrderCreated event.
  2. The Consumer Service consumes the OrderCreated event, verifies that the consumer can place the order, and publishes the ConsumerVerified event.
  3. The Kitchen Service consumes the OrderCreated event, validates the order, creates a Ticket in the CREATE_PENDING state, and publishes the TicketCreated event.
  4. The Accounting Service consumes the OrderCreated event and creates a CreditCardAuthorization in the PENDING state.
  5. The Accounting Service consumes the TicketCreated and ConsumerVerified events, attempts to charge the consumer's credit card (which fails), and publishes the CreditCardAuthorizationFailed event.
  6. The Kitchen Service consumes the CreditCardAuthorizationFailed event and changes the Ticket's state to REJECTED.
  7. The Order Service consumes the CreditCardAuthorizationFailed event and changes the Order's status to REJECTED.

The data interaction in the case above relies on publish/subscribe messaging, so we must consider some issues of inter-service communication.

The first problem is ensuring that a saga participant updates its local database and publishes its event as parts of one database transaction. Every step of a choreography-based saga updates the database and publishes an event; for example, in the Create Order Saga, the Kitchen Service receives the ConsumerVerified event, creates a Ticket, and publishes the TicketCreated event. The database update and the event publication must be atomic, so transactional messaging must be used when publishing events.
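One common way to achieve that atomicity is the transactional outbox pattern: write the business change and the event into the same local transaction, and let a separate relay publish rows from the outbox table afterwards. A hedged sketch, with illustrative table names:

```java
// A hedged sketch of transactional messaging via an outbox table.
import java.sql.*;

class TransactionalPublisher {
    void createTicket(Connection conn, long orderId) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement insertTicket = conn.prepareStatement(
                 "INSERT INTO ticket(order_id, state) VALUES (?, 'CREATE_PENDING')");
             PreparedStatement insertEvent = conn.prepareStatement(
                 "INSERT INTO outbox(event_type, order_id) VALUES ('TicketCreated', ?)")) {
            insertTicket.setLong(1, orderId);
            insertTicket.executeUpdate();
            insertEvent.setLong(1, orderId);
            insertEvent.executeUpdate();
            conn.commit();   // the update and the event become visible atomically
        } catch (SQLException e) {
            conn.rollback(); // neither the ticket nor the event is persisted
            throw e;
        }
    }
}
```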

The second problem is ensuring that saga participants can map each received event to their own data. For example, when the Order Service receives a CreditCardAuthorized event, it must be able to look up the corresponding order. The solution is for participants to publish events containing a correlation ID that lets the other participants act on the right data. For example, the participants of the Create Order Saga can use orderId as the correlation ID passed from one participant to the next: the Accounting Service publishes a CreditCardAuthorized event that includes the orderId from the TicketCreated event; when the Order Service receives the CreditCardAuthorized event, it uses the orderId to retrieve the corresponding Order, and the Kitchen Service likewise uses the event's orderId to retrieve the corresponding Ticket.

Choreography-based sagas have the following benefits:

  • Simplicity: services publish events when they create, update, or delete business objects.
  • Loose coupling: participants subscribe to events and have no direct knowledge of each other.

Choreography-based sagas also have several drawbacks:

  • Harder to understand: unlike with orchestration, there is no single place in the code where the saga is defined; the logic of a choreographed saga is distributed across each service's implementation, so developers sometimes find it hard to see how a particular saga works.
  • Cyclic dependencies between services: saga participants subscribe to each other's events, which often creates cyclic dependencies, for example Order Service → Accounting Service → Order Service. While not necessarily a problem in itself, cyclic dependency is considered a poor design style.
  • Risk of tight coupling: each saga participant must subscribe to all events that affect it. For example, the Accounting Service must subscribe to every event that can cause the consumer's credit card to be charged or refunded, so its internal code risks having to evolve in lockstep with the order lifecycle implemented by the Order Service.

Orchestration-based Saga

Orchestration is the other way to implement a saga. With an orchestrated saga, the developer defines an orchestrator class whose sole responsibility is to tell the saga's participants what to do. The saga orchestrator communicates with the participant services in a command/asynchronous-response style: to execute a step of the saga, it sends a command message to a participant telling it what operation to perform; when the participant service completes the operation, it sends a reply message back to the orchestrator, which processes the message and decides the saga's next action.

Let's redo the case from the choreography section as an orchestrated design. The saga is orchestrated by the CreateOrderSaga class, which calls the saga participants using asynchronous request/response. The class tracks the process and sends command messages to participants such as the Kitchen Service and the Consumer Service; it reads reply messages from its reply channel and then determines the next step (if any) in the saga.

The Order Service first creates (instantiates) an Order object and a CreateOrderSaga orchestrator object. The flow under normal circumstances is as follows (an orchestrator sketch follows the list):

  1. The saga orchestrator sends the VerifyConsumer command to the Consumer Service.
  2. The Consumer Service replies with a ConsumerVerified message.
  3. The saga orchestrator sends the CreateTicket command to the Kitchen Service.
  4. The Kitchen Service replies with a TicketCreated message.
  5. The saga orchestrator sends the AuthorizeCard command to the Accounting Service.
  6. The Accounting Service replies with a CardAuthorized message.
  7. The saga orchestrator sends the ApproveTicket command to the Kitchen Service.
  8. The saga orchestrator sends the ApproveOrder command to the Order Service.
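A minimal sketch of the orchestrator's control flow over these steps; CommandChannel is an illustrative stand-in for a real command/reply messaging channel, and replies are modeled as plain strings.

```java
// A minimal sketch of the CreateOrderSaga orchestrator's control flow.
interface CommandChannel {
    // send a command to a service and wait for its reply (asynchronous in reality)
    String sendAndAwaitReply(String service, String command, long orderId);
}

class CreateOrderSaga {
    private final CommandChannel channel;
    CreateOrderSaga(CommandChannel channel) { this.channel = channel; }

    void run(long orderId) {
        if (!"ConsumerVerified".equals(
                channel.sendAndAwaitReply("ConsumerService", "VerifyConsumer", orderId)))
            { reject(orderId); return; }
        if (!"TicketCreated".equals(
                channel.sendAndAwaitReply("KitchenService", "CreateTicket", orderId)))
            { reject(orderId); return; }
        if (!"CardAuthorized".equals(
                channel.sendAndAwaitReply("AccountingService", "AuthorizeCard", orderId)))
            { compensate(orderId); return; }   // failure after earlier steps succeeded
        channel.sendAndAwaitReply("KitchenService", "ApproveTicket", orderId);
        channel.sendAndAwaitReply("OrderService", "ApproveOrder", orderId);
    }

    private void compensate(long orderId) {
        // compensating commands run in reverse order of the completed steps
        channel.sendAndAwaitReply("KitchenService", "RejectTicket", orderId);
        reject(orderId);
    }

    private void reject(long orderId) {
        channel.sendAndAwaitReply("OrderService", "RejectOrder", orderId);
    }
}
```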

Note the last step: the saga orchestrator sends a command message to the Order Service even though it is itself a component of the Order Service. In principle the Create Order Saga could approve the order by updating it directly, but for consistency it treats the Order Service as just another participant.

The flow above describes the normal case, but a saga can encounter many scenarios; for example, it may fail because the Consumer Service, the Kitchen Service, or the Accounting Service fails.

So is there an approach that can describe all possible scenarios well? The answer: model the saga as a state machine.

A state machine consists of a set of states and a set of transitions between them, triggered by events. Each transition can have an action, which for a saga is the invocation of a participant. Transitions are triggered by the completion of the local transactions that saga participants execute; the current state and the specific outcome of the local transaction determine the transition and the action performed (if any). Effective testing strategies also exist for state machines, so modeling a saga as a state machine makes it easier to design, implement, and test.

Modeled as a state machine, the Create Order Saga contains the following states:

  • Verifying Consumer: the initial state; the saga is waiting for the Consumer Service to verify that the consumer can place the order.
  • Creating Ticket: the saga is waiting for a reply to the CreateTicket command.
  • Authorizing Card: waiting for the Accounting Service to authorize the consumer's credit card.
  • Order Approved: a final state, indicating that the saga completed successfully.
  • Order Rejected: a final state, indicating that the Order was rejected by one of the participants.

The state machine also defines a number of transitions. For example, from the Creating Ticket state it moves to either Authorizing Card or Order Rejected: on a success reply to the CreateTicket command it transitions to Authorizing Card; if the Kitchen Service cannot create the Ticket, it transitions to Order Rejected.

The state machine's initial action is to send the VerifyConsumer command to the Consumer Service, whose response triggers the next transition. If the consumer is verified successfully, the saga creates a Ticket and transitions to the Creating Ticket state; if verification fails, the saga rejects the Order and transitions to the Order Rejected state. Driven by the responses of the saga participants, the state machine passes through further transitions until it reaches one of the final states, Order Approved or Order Rejected.
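A minimal sketch of that state machine, assuming replies are plain strings and leaving out the commands sent on each transition; the structure is illustrative, not a framework API.

```java
// A minimal sketch of the Create Order Saga as a state machine.
enum SagaState { VERIFYING_CONSUMER, CREATING_TICKET, AUTHORIZING_CARD,
                 ORDER_APPROVED, ORDER_REJECTED }

class CreateOrderStateMachine {
    SagaState state = SagaState.VERIFYING_CONSUMER;

    // Each participant reply triggers exactly one transition.
    void onReply(String reply) {
        state = switch (state) {
            case VERIFYING_CONSUMER -> "ConsumerVerified".equals(reply)
                ? SagaState.CREATING_TICKET : SagaState.ORDER_REJECTED;
            case CREATING_TICKET -> "TicketCreated".equals(reply)
                ? SagaState.AUTHORIZING_CARD : SagaState.ORDER_REJECTED;
            case AUTHORIZING_CARD -> "CardAuthorized".equals(reply)
                ? SagaState.ORDER_APPROVED : SagaState.ORDER_REJECTED;
            case ORDER_APPROVED, ORDER_REJECTED -> state; // final states: no change
        };
    }
}
```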

Orchestration-based Saga has the following benefits:

  • Simpler dependencies: orchestration introduces no cyclic dependencies. The saga orchestrator calls the participants, but the participants do not call the orchestrator, so the orchestrator depends on the participants and not the other way round.
  • Less coupling: each service implements an API for the orchestrator to call, so it does not need to know about the events other saga participants publish.
  • Better separation of concerns and simpler business logic: the saga's coordination logic is localized in the orchestrator. Domain objects are simpler and need no knowledge of the sagas they take part in. For example, with an orchestration-style saga the Order class knows nothing about any saga, so it has a simpler state-machine model: during the Create Order Saga it transitions directly from APPROVAL_PENDING to APPROVED, with no intermediate states corresponding to the saga's steps. The business logic is therefore simpler.

Disadvantages of orchestration-based Saga:

  • There is a risk of centralizing too much business logic in the orchestrator, leading to an architecture in which a smart orchestrator tells dumb services what to do. Fortunately, you can avoid this problem by designing the orchestrator to be responsible only for sequencing, containing no other business logic.

Handling the Isolation Problem

In the section on local transactions we said that isolation can be achieved in two ways: locking or MVCC. A saga itself does not provide the isolation property of ACID transactions, because the updates made by each of a saga's local transactions become visible to other sagas as soon as that local transaction commits. This causes two problems: first, other sagas can change the data this saga accesses while it is still executing; second, other sagas can read its data before it has finished updating, and may thus be exposed to inconsistent data. In fact, you can consider that a saga satisfies only three of the properties, ACD:

  • Atomicity: the saga implementation ensures that either all transactions are executed or all changes are undone.
  • Consistency: referential integrity within a service is handled by its local database; referential integrity across services is handled by the services themselves.
  • Durability: handled by the local databases.

The lack of isolation can cause the following anomalies:

  • Lost update: one saga overwrites changes made by another saga without first reading them.

For example:
  1. The first step of the Create Order Saga creates an Order.
  2. While that saga is executing, a Cancel Order Saga cancels the Order.
  3. The final step of the Create Order Saga approves the Order.
In this case the Create Order Saga ignores the update made by the Cancel Order Saga and overwrites it.

  • Dirty read: a transaction or saga reads updates made by a saga that has not yet completed.

For example, a Cancel Order Saga might consist of:
  1. Consumer Service: increase the available credit
  2. Order Service: change the order status to cancelled
  3. Delivery Service: cancel the delivery
Another saga reading between these steps would see a partially applied cancellation.

  • Fuzzy or non-repeatable read: two different steps of the same saga read the same data and get different results, because another saga has performed an update in between.

Although sagas do not support transaction isolation, developers can achieve isolation through additional coding. A 1998 paper titled "Semantic ACID properties in multidatabases using remote procedure calls and update propagations" describes how to handle the lack of transaction isolation in multi-database architectures without using distributed transactions. Its countermeasures are as follows (a semantic-lock sketch follows the list):

  • Semantic lock: an application-level lock.
  • Commutative updates: design update operations so they can be executed in any order.
  • Pessimistic view: reorder the steps of a saga to minimize business risk.
  • Reread value: prevent lost updates by rereading data and verifying that it is unchanged before overwriting it.
  • Version file: record the updates so they can be reordered.
  • By value: use each request's business risk to dynamically select a concurrency mechanism.
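As an illustration of the first countermeasure, here is a hedged sketch of a semantic lock: a *_PENDING state serves as an application-level lock that other sagas must respect; what to do on conflict (fail, queue, or wait) is a design choice of the application.

```java
// A hedged sketch of the "semantic lock" countermeasure.
class Order {
    enum State { APPROVAL_PENDING, APPROVED, REJECTED, CANCELLED }

    State state = State.APPROVAL_PENDING;  // PENDING acts as the semantic lock

    void cancel() {
        if (state == State.APPROVAL_PENDING)
            // another saga still owns this order: fail fast (or queue the request)
            throw new IllegalStateException("Order is locked by a pending saga");
        state = State.CANCELLED;
    }

    void approve() { state = State.APPROVED; } // releases the semantic lock
}
```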

The implementations of these countermeasures are not described here; you can find corresponding examples on GitHub, which, combined with this article, should bring you some gains.

Origin: blog.csdn.net/a_ittle_pan/article/details/130448879