Transaction problems in distributed systems

Introduction

Today, with distributed systems and microservice architectures widespread, failed calls between services have become routine. How to handle these failures and keep data consistent is a hard problem that microservice design cannot avoid. Different business scenarios call for different solutions. Common approaches are:

  1. Blocking retry;
  2. Traditional 2PC and 3PC transactions;
  3. Queues with asynchronous processing in the background;
  4. TCC compensating transactions;
  5. Local message table (asynchronous guarantee);
  6. MQ transactions.

This article focuses on the items other than 2PC and 3PC. Plenty of material on traditional 2PC and 3PC transactions is already available online, so it is not repeated here.

Blocking retry

In a microservice architecture, blocking retry is a fairly common approach. Pseudocode example:

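m := db.Insert(sql)

err := requestWithRetry("B-Service", m)

// requestWithRetry POSTs body to url, making up to three attempts.
func requestWithRetry(url string, body interface{}) error {
  var err error
  for i := 0; i < 3; i++ {
    _, err = request.POST(url, body)
    if err == nil {
      break
    }
    log.Print(err)
  }
  return err
}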

As shown above, when a request to service B's API fails, up to three attempts are made in total. If all three fail, the code logs the error and either continues or returns an error to the caller. This approach brings the following problems:

  1. The call to service B succeeds, but because of a network timeout the current service believes it failed and retries, so service B ends up with two identical records.

  2. The call to service B fails. Because service B is unavailable, it still fails after three attempts. The record that the current service inserted into the DB earlier in the code becomes dirty data.

  3. Retrying increases the latency that the upstream sees for this call. If the downstream is under heavy load, retrying amplifies the pressure on it.

The first problem is solved by making service B's API idempotent.

The second problem can be handled by a scheduled background job that corrects the data, but this is not a very good approach.

The third problem is an unavoidable price of using blocking retry to improve consistency and availability.

Blocking retry suits scenarios where the business is not sensitive to consistency. Where data consistency is required, additional mechanisms must be introduced.
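The idempotence requirement on service B's API can be illustrated with a minimal sketch. It assumes the caller sends a unique request ID with every call and repeats the same ID on a retry; OrderStore, Create and the ID format are hypothetical names invented for this illustration, and the in-memory map stands in for whatever storage service B really uses.

```go
package main

import (
	"fmt"
	"sync"
)

// OrderStore is a hypothetical stand-in for service B's storage.
// It remembers which request IDs it has already applied.
type OrderStore struct {
	mu      sync.Mutex
	applied map[string]bool // request ID -> already processed
	orders  []string
}

func NewOrderStore() *OrderStore {
	return &OrderStore{applied: make(map[string]bool)}
}

// Create applies the request once. A retry carrying the same requestID
// is acknowledged without inserting a second record.
func (s *OrderStore) Create(requestID, payload string) (created bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.applied[requestID] {
		return false
	}
	s.applied[requestID] = true
	s.orders = append(s.orders, payload)
	return true
}

// Count reports how many records were actually stored.
func (s *OrderStore) Count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.orders)
}

func main() {
	store := NewOrderStore()
	// The caller timed out on the first attempt and retries with the same ID.
	store.Create("req-1", "order A")
	store.Create("req-1", "order A")
	fmt.Println(store.Count()) // 1
}
```

In a real service the set of applied request IDs would have to live in the same database as the records, typically under a unique constraint, so that the check and the insert cannot drift apart.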

Asynchronous queue

As a solution evolves, introducing a queue is a common and better approach. For example:

m := db.Insert(sql)

err := mq.Publish("B-Service-topic",m)

After the current service writes the data into the DB, it pushes a message to MQ, and an independent service consumes the message and runs the business logic. Although MQ is far more stable than an ordinary business service, the call that pushes the message to MQ can still fail, for example because of network problems or because the current service goes down. You then hit the same problem as with blocking retry: the DB write succeeds but the push fails.

In theory, any code in a distributed system that involves calls to multiple services is exposed to this. Over a long enough period of operation, call failures are certain to occur. This is one of the difficulties of distributed system design.

TCC compensating transactions

When transactional guarantees are required and the services cannot easily be decoupled, TCC compensating transactions are a better choice.

TCC divides each service call into 2 phases and 3 operations:

  • Phase one, Try operation: check business resources and reserve them, for example check the inventory and reserve stock.
  • Phase two, Confirm operation: commit the reservation made by the Try operation, for example turn the reserved stock into an actual deduction.
  • Phase two, Cancel operation: when a Try operation fails, release the reserved resources, for example return the reserved stock to the inventory.

TCC requires each service to implement APIs for the three operations above. An operation that a participating service used to complete in a single call must now be completed in two phases using three operations.
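What the three operations might look like inside an inventory service can be sketched as follows. This is only an illustration under simple assumptions: a single product, an in-memory counter instead of a database, and a transaction ID passed to every call. The Inventory type and its fields are hypothetical, and the sketch deliberately ignores duplicate and out-of-order calls.

```go
package main

import (
	"errors"
	"fmt"
)

// Inventory is a hypothetical TCC participant for a single product.
type Inventory struct {
	available int            // stock that can still be reserved
	reserved  map[string]int // transaction ID -> reserved quantity
}

func NewInventory(stock int) *Inventory {
	return &Inventory{available: stock, reserved: make(map[string]int)}
}

// Try checks the stock and reserves qty for the transaction.
func (inv *Inventory) Try(txID string, qty int) error {
	if qty > inv.available {
		return errors.New("insufficient stock")
	}
	inv.available -= qty
	inv.reserved[txID] = qty
	return nil
}

// Confirm turns the reservation into a real deduction:
// the stock was already taken out of available, so only the
// reservation record is dropped.
func (inv *Inventory) Confirm(txID string) {
	delete(inv.reserved, txID)
}

// Cancel returns the reserved quantity to the available stock.
func (inv *Inventory) Cancel(txID string) {
	inv.available += inv.reserved[txID]
	delete(inv.reserved, txID)
}

func main() {
	inv := NewInventory(10)
	_ = inv.Try("tx-1", 3)
	inv.Cancel("tx-1") // reservation released, stock back to 10
	_ = inv.Try("tx-2", 4)
	inv.Confirm("tx-2") // reservation committed
	fmt.Println(inv.available) // 6
}
```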

For example, a store application needs to call inventory service A, amount service B, and points service C, as in the following pseudocode:

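m := db.Insert(sql)
_, aErr := A.Try(m)
_, bErr := B.Try(m)
_, cErr := C.Try(m)
if aErr != nil || bErr != nil || cErr != nil {
  // At least one Try failed: release every reservation.
  A.Cancel()
  B.Cancel()
  C.Cancel()
} else {
  // Every Try succeeded: commit the reservations.
  A.Confirm()
  B.Confirm()
  C.Confirm()
}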

The code calls the Try APIs of services A, B, and C to check and reserve resources. If they all succeed, it commits with the Confirm operations. If a Try fails (the C service's, for example), it calls the Cancel APIs of A, B, and C to release the reserved resources.

TCC solves the problem of data consistency across multiple services and multiple databases in a distributed system. However, the TCC approach still has some problems that need attention in practice, including the call failures discussed in the previous sections.

Empty release

In the code above, if the C.Try() call fails, the C.Cancel() call that follows releases a resource that was never locked. The call looks redundant but is necessary, because the current service cannot tell whether the failed call actually locked C's resources. If Cancel were skipped and the Try had in fact succeeded, with the failure reported only because of the network, C's resources would stay locked and never be released.

Empty releases occur often in production. A service that implements the TCC transaction APIs should therefore support them: a Cancel for a transaction that holds no reservation should succeed without releasing anything.

Timing

If C.Try() fails in the code above, C.Cancel() is called next. Because of the network, the C.Cancel() request may reach service C first and the C.Try() request later. The Cancel is then an empty release, and the late Try locks C's resources, which are never released.

Therefore, service C should reject a Try() that arrives after the resource has been released. In practice, a unique transaction ID can be used to tell a first Try() from a Try() that arrives after the release.
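The two rules above, accepting an empty release and rejecting a Try that arrives after the release, can be sketched on the participant side by recording a state per transaction ID. The sketch below is an assumption-laden illustration: Participant, its locked counter and the in-memory state map are hypothetical, and a real service would persist the state alongside the resource.

```go
package main

import (
	"errors"
	"fmt"
)

type txState int

const (
	stateTried txState = iota + 1
	stateCancelled
)

var ErrAlreadyCancelled = errors.New("transaction already cancelled")

// Participant is a hypothetical service C that records the state
// of every transaction ID it has seen.
type Participant struct {
	locked int // number of resources currently locked
	states map[string]txState
}

func NewParticipant() *Participant {
	return &Participant{states: make(map[string]txState)}
}

// Try locks one resource unless the transaction was already cancelled.
func (p *Participant) Try(txID string) error {
	switch p.states[txID] {
	case stateCancelled:
		return ErrAlreadyCancelled // late Try after Cancel: refuse to lock
	case stateTried:
		return nil // duplicate Try: already locked
	}
	p.locked++
	p.states[txID] = stateTried
	return nil
}

// Cancel releases the lock if there is one. If nothing was locked
// (empty release) it still records the cancellation, so that a Try
// arriving later can be rejected.
func (p *Participant) Cancel(txID string) {
	if p.states[txID] == stateTried {
		p.locked--
	}
	p.states[txID] = stateCancelled
}

func main() {
	c := NewParticipant()
	c.Cancel("tx-1")           // Cancel overtakes Try: empty release
	err := c.Try("tx-1")       // the late Try is rejected
	fmt.Println(err, c.locked) // transaction already cancelled 0
}
```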

Call failed

Calls to Cancel and Confirm can fail too, typically for network reasons.

A failed Cancel() or Confirm() leaves the resource locked and never released. Common solutions are:

  1. Blocking retry. This has the same problems as before, such as downtime and persistent failure.

  2. Write to a log or queue, and have a separate asynchronous service process the entries automatically or with manual intervention. This has problems as well: the write to the log or queue can itself fail.

In theory, any two pieces of code that do not run as one atomic transaction have an intermediate state between them, and wherever there is an intermediate state, failure is possible.

Local message table

The local message table was originally proposed by eBay. It places the local message table and the business data table in the same database, so that a local transaction can provide the transactional guarantees.

Concretely, a message record is inserted in the same local transaction that inserts the business data. The follow-up operations are then carried out. If they succeed, the message is deleted; if they fail, the message is kept. An asynchronous process watches the remaining messages and keeps retrying them.

The local message table is a good idea, and it can be used in several ways:

With MQ

Sample pseudo code:

messageTx := tc.NewTransaction("order")
messageTxSql := messageTx.TryPlan("content")

// Run the business SQL and the message SQL in one local transaction.
m, err := db.InsertTx(sql, messageTxSql)
if err != nil {
  return err
}

aErr := mq.Publish("B-Service-topic", m)
if aErr != nil { // push to MQ failed
  messageTx.Confirm() // update the message status to confirm
} else {
  messageTx.Cancel() // delete the message
}

// Asynchronously process confirm messages and keep pushing them.
func OnMessage(task *Task) {
  err := mq.Publish("B-Service-topic", task.Value())
  if err == nil {
    task.Cancel() // delete the message once the push succeeds
  }
}

In the code above, messageTxSql is the SQL statement that inserts a row into the local message table:

insert into `tcc_async_task` (`uid`,`name`,`value`,`status`) values (?,?,?,?)

It is executed in the same transaction as the business SQL, so the two either both succeed or both fail.

If the transaction succeeds, the message is pushed to the queue. If the push succeeds, messageTx.Cancel() is called to delete the local message; if the push fails, the message is marked as confirm. The status column of the local message table has two states, try and confirm. OnMessage picks up messages in either state and retries them.

The local transaction guarantees that the message and the business data are written to the database together. After that, whether the service goes down or the push fails because of the network, the asynchronous monitor can follow up and make sure the message is eventually pushed to MQ.

MQ, in turn, is responsible for getting the message to the consumer service. With MQ's QoS policy, the consumer service either processes the message or passes it on to the next business queue, which preserves the integrity of the transaction.
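The flow described above can be condensed into a small runnable sketch. It is not the tcc library used in the pseudocode: DB, InsertTx, Relay and Publisher are hypothetical names, the database is an in-memory struct, and the "transaction" is simply both writes happening in one function call. The point is only to show a message surviving a failed push and being deleted after a later successful one.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a row of the hypothetical local message table.
type Message struct {
	ID     int
	Value  string
	Status string // "try" or "confirm"
}

// DB is an in-memory stand-in for a database that holds both the
// business table and the message table.
type DB struct {
	orders   []string
	messages map[int]*Message
	nextID   int
}

func NewDB() *DB { return &DB{messages: make(map[int]*Message)} }

// InsertTx models the local transaction: the business row and the
// message row are written together.
func (db *DB) InsertTx(order string) *Message {
	db.nextID++
	msg := &Message{ID: db.nextID, Value: order, Status: "try"}
	db.orders = append(db.orders, order)
	db.messages[msg.ID] = msg
	return msg
}

// Publisher pushes a value to MQ; it is an assumption of this sketch.
type Publisher func(value string) error

var errMQDown = errors.New("mq unavailable")

// Relay tries to push every stored message. Pushed messages are
// deleted, failed ones are marked confirm and kept for the next round.
// It returns the number of messages still pending.
func (db *DB) Relay(publish Publisher) int {
	for id, msg := range db.messages {
		if err := publish(msg.Value); err != nil {
			msg.Status = "confirm"
			continue
		}
		delete(db.messages, id)
	}
	return len(db.messages)
}

func main() {
	db := NewDB()
	db.InsertTx("order-1")

	down := func(string) error { return errMQDown }
	up := func(string) error { return nil }

	fmt.Println(db.Relay(down)) // 1: push failed, message kept
	fmt.Println(db.Relay(up))   // 0: push succeeded, message deleted
}
```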

With service calls

Sample pseudo code:

messageTx := tc.NewTransaction("order")
messageTxSql := messageTx.TryPlan("content")

// Run the business SQL and the message SQL in one local transaction.
body, err := db.InsertTx(sql, messageTxSql)
if err != nil {
  return err
}

aErr := request.POST("B-Service", body)
if aErr != nil { // call to B-Service failed
  messageTx.Confirm() // update the message status to confirm
} else {
  messageTx.Cancel() // delete the message
}

// Asynchronously process confirm or try messages and keep calling B-Service.
func OnMessage(task *Task) {
  // request.POST("B-Service", body)
}

This example combines the local message table with a direct call to another service, without MQ. Asynchronous retries backed by a local message table, which keeps the messages reliable, solve the problems caused by blocking retries. This pattern is common in day-to-day development.

If there is no local DB write, you can write only to the local message table; the message is likewise processed in OnMessage:

messageTx := tc.NewTransaction("order")
messageTx.Try("content")
aErr := request.POST("B-Service", body)
// ....

Message expired

Configure the handlers for Try and Confirm messages in the local message table:

TCC.SetTryHandler(OnTryMessage)
TCC.SetConfirmHandler(OnConfirmMessage)

In the message handler, check whether the current message task has existed for too long. For example, if it has been retried for an hour and still fails, consider sending an email, an SMS, or a log alert so that someone can intervene manually.

func OnConfirmMessage(task *tcc.Task) {
  if time.Since(task.CreatedAt) > time.Hour {
    task.Cancel() // delete the message and stop retrying
    // doSomeThing(): raise an alert for manual intervention
    return
  }
}

The Try handler must additionally check whether the message is too new, because a message in the Try state may have just been created and not yet been confirmed or deleted. Handling it then would duplicate the normal business logic, so calls that succeed would be retried as well. To avoid this as far as possible, check how recently the message was created and skip it if it is too new.

The retry mechanism inevitably depends on the idempotence of the downstream API. The system would therefore still work without this check, but it is worth designing in to avoid interfering with normal requests.
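The two age checks can be combined into one small decision function, sketched below. The one-hour limit comes from the text; the five-second grace period for fresh Try messages is an assumed value, and decide, Action, tryGrace and maxAge are hypothetical names rather than part of the tcc library. For simplicity the sketch applies the expiry check to messages in either state.

```go
package main

import (
	"fmt"
	"time"
)

// Action is what a handler does with a message.
type Action string

const (
	Skip   Action = "skip"   // too new, normal flow may still be running
	Retry  Action = "retry"  // repeat the push or the service call
	Expire Action = "expire" // stop retrying and alert a human
)

const (
	tryGrace = 5 * time.Second // assumed grace period for fresh try messages
	maxAge   = time.Hour       // from the text: give up after an hour
)

// decide picks the action for a message with the given status and age.
func decide(status string, age time.Duration) Action {
	switch {
	case age > maxAge:
		return Expire
	case status == "try" && age < tryGrace:
		return Skip
	default:
		return Retry
	}
}

func main() {
	fmt.Println(decide("try", time.Second))     // skip
	fmt.Println(decide("confirm", time.Minute)) // retry
	fmt.Println(decide("confirm", 2*time.Hour)) // expire
}
```

The grace period only needs to be longer than the normal request takes to complete; since the downstream API is idempotent anyway, choosing it too short costs a redundant call rather than correctness.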

Independent message service

The independent message service is an upgraded version of the local message table: the table is split out into a service of its own. A message is added to the message service before any other operation; it is deleted if the subsequent operations succeed, and confirmed if they fail.

Asynchronous logic then monitors the messages and handles them, in essentially the same way as for the local message table. However, because adding a message to the message service cannot be combined with a local operation in one transaction, the message can be added successfully and the subsequent operation then fail. Such a message is useless.

Consider the following scenario:

  err := request.POST("Message-Service",body)
  if err!=nil {
    return err
  }
  aErr := request.POST("B-Service",body)
  if aErr!=nil {
    return aErr
  }

For such a useless message, the message service has to check whether the operation behind it actually succeeded. If not, it deletes the message; otherwise it carries on with the subsequent logic. Compared with the try and confirm states of the local message table, the message service has one more state that precedes them: prepare.
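One way to read the prepare state is sketched below: every message starts in prepare, and a periodic check-back asks the business service whether the operation behind the message succeeded. This is an interpretation of the description above, not a documented protocol. MessageService, CheckBack and the succeeded callback are hypothetical, and the storage is an in-memory map.

```go
package main

import "fmt"

// Status is the state of a message in the message service.
type Status string

const (
	StatusPrepare Status = "prepare" // added, outcome of the business operation unknown
	StatusTry     Status = "try"     // business operation succeeded, follow-up pending
	StatusConfirm Status = "confirm" // follow-up failed, retry asynchronously
)

// MessageService is an in-memory stand-in for the independent message service.
type MessageService struct {
	messages map[string]Status
}

func NewMessageService() *MessageService {
	return &MessageService{messages: make(map[string]Status)}
}

// Add registers a message before the business operation runs.
func (s *MessageService) Add(id string) { s.messages[id] = StatusPrepare }

// CheckBack resolves messages stuck in prepare. The succeeded callback
// asks the business service whether the operation for that message
// went through. Useless messages are deleted, the rest move on to try.
func (s *MessageService) CheckBack(succeeded func(id string) bool) {
	for id, st := range s.messages {
		if st != StatusPrepare {
			continue
		}
		if succeeded(id) {
			s.messages[id] = StatusTry
		} else {
			delete(s.messages, id)
		}
	}
}

func main() {
	svc := NewMessageService()
	svc.Add("m1") // business operation later succeeded
	svc.Add("m2") // business operation later failed
	svc.CheckBack(func(id string) bool { return id == "m1" })
	fmt.Println(len(svc.messages), svc.messages["m1"]) // 1 try
}
```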

MQ transaction

Some MQ implementations support transactions, RocketMQ for example. An MQ transaction can be seen as a concrete implementation of the independent message service, and the logic is the same.

Before any other operation, a message is delivered to MQ. If the subsequent operations succeed, the message is confirmed (Confirm); if they fail, it is deleted (Cancel). MQ transactions also have a prepare status, and MQ consumption handling logic is required to confirm whether the business operation succeeded.

Summary

In the practice of distributed systems, any scenario that must guarantee data consistency requires additional mechanisms.

The advantages of TCC are that it works at the business service layer, does not rely on a specific database, is not coupled to a specific framework, and allows fairly flexible granularity for resource locks, which makes it well suited to microservice scenarios. The disadvantages are that every service has to implement three APIs, which means intrusive and substantial changes to the business code, and that the various failure cases all have to be handled. It is hard for developers to cover every case on their own. A mature framework, such as Alibaba's Fescar (since renamed Seata), can greatly reduce the cost.

The advantages of the local message table are that it is simple, does not require other services to change, works well with both service calls and MQ, and is practical in most business scenarios. The disadvantage is that the local database gains an extra message table that is coupled with the business tables.

The advantage of MQ transactions and the independent message service is that they extract a shared service to solve the transaction problem. This avoids giving every service its own message table coupled to the service, which would add to the processing complexity of the service itself. The disadvantages are that few MQs support transactions, and that calling an API to add a message before every operation increases the latency of the overall call, which is unnecessary overhead in the majority of cases where the call succeeds normally.

Origin blog.csdn.net/a159357445566/article/details/108985924