Data Consistency When Microservices Communicate Through a Messaging System

Foreword

 

Microservices are a hot topic at the moment. Today, let's talk about a thorny issue in microservice design: how to ensure data consistency between microservices. Any discussion of distributed transactions inevitably starts from the CAP theorem.

 

 

The CAP theorem states that a distributed system cannot simultaneously guarantee all three of the following properties:

 

1. Consistency: every node sees the same, most recent copy of the data.

2. Availability: the system remains available for data reads and updates.

3. Partition tolerance: in practical terms, a partition is a time limit on communication. If the system cannot reach data consistency within that limit, a partition is considered to have occurred, and the current operation must choose between C and A.

 

According to the theorem, a distributed system can satisfy at most two of the three properties, never all three. (The description of CAP above comes from Wikipedia.) Ensuring data consistency between microservices has therefore always been an open topic; in essence, it is a question of how to make this trade-off.

 

 

This account has previously published a series of articles on transaction consistency under a microservice architecture, covering BASE theory, two-phase commit, three-phase commit, the reliable event pattern, the TCC pattern, the compensation pattern, and so on. For more detail, see: Data Consistency Guarantee under Microservice Architecture (1), Data Consistency Guarantee under Microservice Architecture (2), and Data Consistency Guarantee under Microservice Architecture (3). Today I will focus on just one scenario: when microservices communicate through a messaging system, how can data consistency between them be guaranteed?

 

 

1. How the problem arises:

Data Consistency Issues in Microservice Architecture

 

Let's use the following example to frame the problem. Take deploying a product from a public cloud marketplace [1] as an example. When a user wants to deploy an existing product, say Redis, on the public cloud, the user first finds the corresponding Redis product in the marketplace. When the user clicks publish, the marketplace records the request, while a separate backend module, which we will call the deployment module, is actually responsible for the deployment. Once the product is deployed successfully, the deployment module and the marketplace synchronize the final state.

 

[1] Public cloud marketplace: this refers to a simple model, similar to Alibaba Cloud's image marketplace or Amazon AWS's AMI marketplace. In an image marketplace, a user picks a product of interest, such as MySQL, pays, and publishes it. This saves the user from manually downloading the installation package into their own cloud environment and then going through the tedious process of installing, configuring, and starting it. In the marketplace, all the user needs to do is some necessary configuration and then click start to release the product. In general this amounts to purchasing an image and then starting an instance from it.

 

The above describes the ideal case; the general flow is as follows:

 

 

Here, the marketplace and the deployment module are both independent microservices. After a platform user requests to activate a product, the marketplace first performs a series of initialization steps and sends a deployment request to the deployment module. When the deployment succeeds or fails, the deployment module records the result, and the marketplace records the status in its own local database. Since the marketplace and the deployment module each exist as separate microservices with their own local transactions, we can no longer rely on local transaction control to guarantee the atomicity of the overall operation. And so the problems begin:

 

If the marketplace module sends a request to the deployment module and then hits a database exception (a broken network connection to the database, a database failover, and so on), the marketplace reports an error to the front end saying the deployment failed, while in reality the deployment module has quietly started an instance for the user in the background.
The same problem occurs if the marketplace microservice crashes after sending the request to the deployment module: the marketplace's database never records the user's activation request, yet the deployment module has already started a product instance.

 

Now suppose the public cloud platform limits each user (say, a trial user) to at most 5 product instances. The user will only be able to start 4 more instances through the marketplace, because one instance was deployed successfully by the deployment module without the marketplace knowing about it. This is a serious data inconsistency. So how do we solve this kind of business inconsistency problem?

 

 

2. Introducing a messaging framework to solve the data inconsistency problem

 

Here we use Kafka as the messaging framework and meet the requirements through an event mechanism.

 

When delivering messages with Kafka, we inevitably have to face the possibility of message loss. Let's first look at the main flow we implemented, and then discuss how to guarantee delivery and consumption of messages at the business level.

 

Processing on the message sender side

 

The process is handled as follows:

 

 

 

Let's analyze how this design can meet our needs:

 

In the marketplace module, the writes to the Product record and the Event record are performed in a single local transaction, which guarantees the consistency of the local operations.
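As a rough illustration, the sketch below shows the idea of writing the business record and the event record in one local transaction using plain JDBC. The table names, column names, and status values are assumptions for illustration, not the article's actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class MarketService {

    public void activateProduct(String userId, String productId, String jdbcUrl) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            conn.setAutoCommit(false);          // both inserts commit or roll back together
            try {
                // 1. business record: the user's activation request
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO PRODUCT (ID, USER_ID, PRODUCT_ID, STATUS) VALUES (?, ?, ?, 'DEPLOYING')")) {
                    ps.setString(1, UUID.randomUUID().toString());
                    ps.setString(2, userId);
                    ps.setString(3, productId);
                    ps.executeUpdate();
                }
                // 2. event record: picked up later by the event-publishing timer
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO EVENT (EVENT_ID, PAYLOAD, STATUS) VALUES (?, ?, 'NEW')")) {
                    ps.setString(1, UUID.randomUUID().toString());
                    ps.setString(2, "{\"userId\":\"" + userId + "\",\"productId\":\"" + productId + "\"}");
                    ps.executeUpdate();
                }
                conn.commit();
            } catch (Exception e) {
                conn.rollback();                // neither row becomes visible if anything fails
                throw e;
            }
        }
    }
}
```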

 

Suppose an exception occurs in the marketplace module after the product is activated but before the event is published, for example a crash or a lost database connection. By design, the event-publishing timer and the product activation service are separate operations, so an unexpected crash does not affect the data already committed to the database. The next time the service comes up, the event-publishing timer finds the unpublished rows in the Event table, publishes them, and updates the message status to PUBLISHED.
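Here is a minimal sketch of such an event-publishing timer, assuming a `deploy-request` topic, an EVENT table with EVENT_ID/PAYLOAD/STATUS columns, and a MySQL connection string; all of these names are illustrative. It periodically picks up rows still marked NEW, publishes them to Kafka, and only then marks them PUBLISHED, so a crash before the status update at worst causes a resend rather than a lost message.

```java
import java.sql.*;
import java.util.Properties;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventPublishTimer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        String jdbcUrl = "jdbc:mysql://localhost:3306/market";   // assumed connection string

        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            try (Connection conn = DriverManager.getConnection(jdbcUrl);
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT EVENT_ID, PAYLOAD FROM EVENT WHERE STATUS = 'NEW'")) {
                while (rs.next()) {
                    String eventId = rs.getString("EVENT_ID");
                    // the event id doubles as the message key, so the consumer can deduplicate
                    producer.send(new ProducerRecord<>("deploy-request", eventId, rs.getString("PAYLOAD")))
                            .get();                              // block until the broker acks
                    try (PreparedStatement ps = conn.prepareStatement(
                            "UPDATE EVENT SET STATUS = 'PUBLISHED' WHERE EVENT_ID = ?")) {
                        ps.setString(1, eventId);
                        ps.executeUpdate();                      // if this fails, the event is simply re-sent next round
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();                             // the next tick retries from the table
            }
        }, 0, 5, TimeUnit.SECONDS);
    }
}
```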

 

 

What if the failure happens while updating the status in the database? At that point the message has already been sent to the Kafka broker, so the next time the service comes up the message will be sent again; but because of the uniqueness of the message key, the deployment module recognizes these as duplicates and can simply ignore them.

 

When the product is deployed successfully and the marketplace's event listener receives the notification, a crash may still occur while it is preparing to update the database. After the service restarts, the event listener resumes consuming from the last committed message offset and updates the Event table accordingly.

 

Processing on the message receiver side

 

Now let's look at how the deployment module, as the receiver, processes the messages it gets from the Kafka broker.

 

The following is a flow chart of how the deployment module processes messages; the deployment process itself is shown as a simplified schematic. In real scenarios, deploying and updating state is a complex process and may rely on polling to complete.

 

 

After the deployment module's event listener receives a notification, it calls the deployment Service directly, updates the business data in Deploy_table, and updates the message status in Event_Table. Meanwhile, the deployment module's own event timer periodically reads from Event_Table and publishes the results to the Kafka broker, and the marketplace module performs its own business operations when it receives those notifications.
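A minimal sketch of the deployment module's listener follows, again with assumed topic, group, and table names. Offsets are committed only after the record has been handed to a placeholder deployment service, which matches the recovery behavior described earlier: after a crash, consumption resumes from the last committed offset.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DeployEventListener {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "deploy-module");
        props.put("enable.auto.commit", "false");                // commit offsets only after processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("deploy-request"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // record.key() is the same event id the market module used as the message key
                    DeployService.handle(record.key(), record.value());
                }
                consumer.commitSync();                           // offsets advance only after the DB work is done
            }
        }
    }
}

class DeployService {
    static void handle(String eventId, String payload) {
        // 1. If eventId is already in Deploy_table, this is a duplicate delivery: do nothing here.
        // 2. Otherwise deploy the product, then in one local transaction:
        //    - insert the deployment record into Deploy_table
        //    - insert a "deploy finished" event into Event_Table for the deploy-side timer to publish back
    }
}
```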

 

The principles and reasoning behind this design are the same as on the marketplace side, so they are not repeated here.

 

 

 

3. Introducing compensation + idempotency to guarantee reliable message delivery

 

As mentioned above, message systems such as Kafka cannot, on their own, guarantee that every message is delivered, so we must also handle message loss at the business level. Let's now discuss how to make message delivery reliable from a business perspective.

 

Here we introduce a compensation mechanism plus idempotent operations. In the previous steps we already persisted each Event to the database; on top of that, the following steps make the message reliable at the business level:

 

1. Extend the Event table fields

 

We add two new fields, count and updateTime, to the Event table to record how many times a message has been sent or retried and when. Under normal circumstances count is 1, meaning the message was sent exactly once.
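For concreteness, the row shape might look roughly like the sketch below; only the count and updateTime fields come from the text, everything else is an assumption for illustration.

```java
public class EventRecord {
    String eventId;                 // unique id, also usable as the Kafka message key
    String payload;                 // business payload to deliver
    String status;                  // e.g. NEW -> PUBLISHED -> CONFIRMED (status values assumed)
    int count;                      // how many times the message has been sent or retried; 1 in the normal case
    java.sql.Timestamp updateTime;  // when the message was last sent or retried
}
```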

 

2. Timing compensation plus error retry

 

We also add an abnormal-event publishing timer that runs every 2 minutes (the interval here is just an example; in practice it should be longer than the normal business processing time) and queries the Event table for messages whose status is PUBLISHED. If a message's updateTime is more than two minutes old, we pessimistically assume the message was lost, resend it, update updateTime, and increase count by 1 (see the sketch after step 3 below).

 

3. The last line of defense: reconciliation records, manual intervention

 

If a message has been resent more than 5 times, we conclude that the messaging system cannot deliver it at the moment; as a last resort the message is recorded and handled during the daily manual reconciliation.
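Here is a minimal sketch combining steps 2 and 3, with assumed table, column, topic, and status names (the article's count field is shown as SEND_COUNT) and MySQL-flavored SQL.

```java
import java.sql.*;
import java.util.Properties;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompensationTimer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        String jdbcUrl = "jdbc:mysql://localhost:3306/market";   // assumed

        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            try (Connection conn = DriverManager.getConnection(jdbcUrl);
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT EVENT_ID, PAYLOAD, SEND_COUNT FROM EVENT " +
                         "WHERE STATUS = 'PUBLISHED' AND UPDATE_TIME < NOW() - INTERVAL 2 MINUTE")) {
                while (rs.next()) {
                    String eventId = rs.getString("EVENT_ID");
                    if (rs.getInt("SEND_COUNT") >= 5) {
                        // last line of defense: flag the row for the daily manual reconciliation
                        try (PreparedStatement ps = conn.prepareStatement(
                                "UPDATE EVENT SET STATUS = 'NEED_MANUAL_CHECK' WHERE EVENT_ID = ?")) {
                            ps.setString(1, eventId);
                            ps.executeUpdate();
                        }
                        continue;
                    }
                    // pessimistically assume the message was lost and resend it
                    producer.send(new ProducerRecord<>("deploy-request", eventId, rs.getString("PAYLOAD"))).get();
                    try (PreparedStatement ps = conn.prepareStatement(
                            "UPDATE EVENT SET SEND_COUNT = SEND_COUNT + 1, UPDATE_TIME = NOW() WHERE EVENT_ID = ?")) {
                        ps.setString(1, eventId);
                        ps.executeUpdate();
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();                             // the next tick retries
            }
        }, 2, 2, TimeUnit.MINUTES);
    }
}
```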

 

 

4. Idempotent deduplication

 

What is idempotency? Because of the retry and compensation mechanisms, it is inevitable that some messages will be received more than once, and making the interfaces idempotent improves data consistency. In programming, an idempotent operation is one whose effect after any number of executions is the same as the effect of executing it once.

 

Because of the scheduled compensation mechanism, the consumer of a message must also make the deployment operation idempotent: when the same message is delivered multiple times, it must effectively be executed only once. Here, if the deployment module finds that a message has already been processed, it simply reads the stored execution result from the database and pushes that result back to the broker, which preserves idempotency.
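A minimal sketch of that consumer-side deduplication, assuming a DEPLOY_TABLE keyed by event id and a `deploy-result` topic; both names are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IdempotentDeployHandler {

    private final KafkaProducer<String, String> producer;
    private final String jdbcUrl;

    public IdempotentDeployHandler(KafkaProducer<String, String> producer, String jdbcUrl) {
        this.producer = producer;
        this.jdbcUrl = jdbcUrl;
    }

    public void handle(String eventId, String payload) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT RESULT FROM DEPLOY_TABLE WHERE EVENT_ID = ?")) {
            ps.setString(1, eventId);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    // duplicate delivery: the work was already done, so just re-announce the stored result
                    producer.send(new ProducerRecord<>("deploy-result", eventId, rs.getString("RESULT"))).get();
                    return;
                }
            }
        }
        // first delivery: actually deploy, then record the result and the outgoing event
        // in one local transaction (omitted; see the listener sketch above)
        doDeploy(eventId, payload);
    }

    private void doDeploy(String eventId, String payload) {
        // placeholder for the real deployment work and local-transaction bookkeeping
    }
}
```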

 

Now let's analyze how this strategy guarantees message delivery:

 

Every message, at the moment it is generated, is recorded in the database, so it cannot simply be lost.

 

The abnormal-event publishing timer regularly checks the Event table for messages that have received no response; such messages are considered lost and are compensated and resent. If a message still fails after 5 consecutive attempts, it is treated as an exception, recorded, and handed over to manual reconciliation.

 

For the deployment module (the consumer), if a message is lost, the marketplace module never receives a response (the status of the corresponding Event record is never updated), so the outcome is the same as in the previous point: compensation resends the message, and if the retransmission limit is exceeded, the reconciliation-record logic is triggered.

 

 

4. Summary

 

In this article, by using a messaging system for communication between microservices and making a few design changes, we not only ensure that the logic runs correctly in the normal case (well over 99.9% of the time) but also preserve data consistency in the extreme cases. This meets our business needs, and by relying on mature message middleware it also greatly improves the throughput of the system.

 

To compensate for the fact that Kafka itself may lose messages in extreme cases, we adjusted the business design so that the message, and therefore the business, remains reliable even then. Kafka is only used as an example here; if you are concerned about Kafka's own delivery guarantees, you can also consider popular messaging frameworks such as RabbitMQ or RocketMQ.

 

In a nutshell, this scheme ensures consistency along the following four dimensions:

 

 

Local transactions ensure the consistency of business persistence and message persistence.

The timer ensures the consistency of message persistence and message delivery.

The message middleware ensures the consistency of message delivery and consumption.

Business compensation + idempotency ensures consistency under message failure.

 

 

 

The downside of this solution is the extra coding it requires: it adds a significant amount of work to every microservice involved and introduces more intermediate states. It is not suitable for business scenarios with strict latency requirements. (It does fit the scenario in this article, because activating a product involves operating containers, which is a time-consuming process in itself.)

 

Data consistency is an unavoidable topic in microservice architecture design. Guaranteeing eventual consistency of the data is itself a compromise with respect to the CAP theorem. The pros and cons of this solution cannot be judged in the abstract; they depend on the scenario, and whatever fits the scenario is best.

 

Therefore, when splitting business logic into microservices, we should try to avoid designs that "may cause consistency problems". If such designs keep appearing, it may be time to reconsider the overall design.

 

 

About the author:

Li Xiaofei

Expert member of EAII-Enterprise Architecture Innovation Research Institute

Currently a senior development engineer at Puyuan Information and a member of Puyuan's new-generation digital enterprise cloud platform development team, responsible for server-side support of the new-generation cloud platform. Previously worked at Emerson Network Power and Tibco CDC as a team leader, successfully leading the development of several projects, with rich experience in cloud technologies. Hobbies: photography, ball sports, and cycling, including a successful ride across the Sichuan-Tibet highway.

 

 

 

About EAII

 

EAII (Enterprise Architecture Innovation Institute) is dedicated to software architecture innovation and practice, accelerating the digital transformation of enterprises.
