Thoughts on data synchronization between microservices

Bored on a weekend, so here is a blog post about data synchronization between services, mainly about the points that need attention. There are no concrete business scenarios or examples here.

PS: These are purely personal ramblings; if there are mistakes or omissions, please point them out. Now, down to business.

Business Process

The main business process is as follows:

  • The user performs an operation and saves data to service A; after service A saves successfully, it synchronizes part of the data to service B; service B receives the data, saves it successfully, and the process ends.

Here we discuss data synchronization from service A to service B. We have to guarantee the following two points:

1. Accuracy of the data (you go to the bank and deposit 10,000, but your balance only goes up by 100: would you stand for it? Conversely, you deposit 100 and the balance goes up by 10,000: would the bank stand for it?)

2. Efficiency of the program (the user cannot be left waiting forever for a save to complete; this is what the later versions below deal with).

Version 1 - Guaranteed data accuracy

Drawing diagrams is too time-consuming, so the diagrams are saved for the final version; the earlier versions are explained in words here.

Service A pseudocode:

Start distributed transaction {
    // data verification, takes ~20ms
    // save business data to the DB, takes ~10ms

    // get the data that needs to be synchronized, RPC call to service B's save method, takes ~40ms

} if everything succeeds, commit the transaction; if anything fails, roll back the transaction

Service B pseudocode:

// a successful save takes about 40ms
save method {

    // parse and verify the data, ~30ms

    // verification succeeds: save the data, ~10ms

    // verification fails: return an error and notify service A that the save failed

} save succeeds, commit the transaction; operation fails, roll back the transaction
As the business keeps growing, the save method accumulates more and more branches and auxiliary logic, and gradually turns into something like this:

save method {

    if (case 1) {

    } else if (case 2) {

    } else if (case 3) {
        // remote calls to other services, or fetch supporting business data from the cache
        // verify the data
        // record various logs
        // save the data
        // update the status of table xx
    }
} save succeeds, commit the transaction; operation fails, roll back the transaction

  In the end, this interface went from under 100ms to more than 500ms. Worse, at traffic peaks many users simply could not operate: the save spinner kept spinning, timeouts of every kind appeared, and the user experience got worse and worse (take Tomcat as an example: it handles requests with an internal thread pool, resources are limited, and when earlier requests have not released their resources, later requests are either rejected or left waiting). A lot of servers were added along the way, but problems still occurred from time to time, and complaint letters kept coming in. One day the boss called the technical manager into the office: "Can you fix this or not? If not, get out!" The technical manager wiped the sweat from his brow: "We can, we can." Boss: "Good, then get it fixed!" (which shows how serious the matter had become).

Version 2 - Introduce message middleware (using RabbitMQ as an example)

The technical manager called a group of developers into the office, discussed for a long time, and decided to introduce message middleware: it would not only shave the traffic peaks but also decouple the code. The details and points of attention were hashed out, the whole team worked overtime for a few days, and the similar business points were reworked one by one on top of the message middleware. The code became roughly like this:

Service A pseudocode:

Start the transaction {
    // data verification, takes ~20ms
    // save business data to the DB, takes ~10ms
    // publish the data to be synchronized to the message middleware, ~20ms
    if (send unsuccessful) {
        // throw an exception
    }
} save succeeds, commit the transaction; operation fails, roll back the transaction

Service B pseudocode:

Fetch data from the message middleware {
    execute the save method
}
save method {
    // business logic
} save succeeds, commit the transaction; operation fails, roll back the transaction

Consumer-side exception handling and other MQ-specific details are covered in the final version.

For the first couple of days after going live, everything was perfect! Response time was back to the original 100ms, the whole team was delighted, and this year's year-end bonus looked safe.

However, one day a flood of complaints suddenly came in, saying that the data for xxx did not match. Xiao Wang, who was in charge of this business, was stunned. When the technical manager opened the RabbitMQ console, he saw that this queue was producing at an average of 1000 messages/s but consuming at only 500/s, and tens of thousands of messages had piled up. One after another, several other main businesses hit the same problem, and a large amount of data that needed to be synchronized accumulated in MQ.

Add more servers? With a production-to-consumption ratio of 2:1, would the boss agree to that expense? And this was not just the daily traffic peak: as the number of users grows, at that ratio the cost is no small amount. The technical manager lived up to his title and immediately thought of using multithreading to process the messages, calling the responsible developers together for a meeting. And so we arrive at the final version...

Final version

The middleware here is RabbitMQ, and the database used for saving data is MySQL (InnoDB engine).

Note 1. Service A sends data to the middleware

Introducing message middleware and the producer-consumer model decouples the code and shaves traffic peaks, but it also makes the program more complex and raises data consistency issues. We have to consider the following problems.

Persistence of middleware data

What if MQ crashes and the data is lost? We need to configure persistence for the queue and its messages.
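A rough sketch of what that configuration can look like with Spring AMQP (all queue, exchange and routing-key names here are invented for illustration): the queue and the exchange are declared durable, and the messages themselves must be sent as persistent (delivery mode 2), which Spring AMQP does by default.

```java
import org.springframework.amqp.core.Binding;
import org.springframework.amqp.core.BindingBuilder;
import org.springframework.amqp.core.DirectExchange;
import org.springframework.amqp.core.Queue;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SyncMqConfig {

    // durable = true: the queue definition and its persistent messages survive a broker restart
    @Bean
    public Queue syncQueue() {
        return new Queue("service-b.sync.queue", true);
    }

    // durable exchange, not auto-deleted when the last queue unbinds
    @Bean
    public DirectExchange syncExchange() {
        return new DirectExchange("service-a.sync.exchange", true, false);
    }

    @Bean
    public Binding syncBinding() {
        return BindingBuilder.bind(syncQueue()).to(syncExchange()).with("sync.save");
    }
}
```

Declaring the queue, exchange and binding as beans like this (on the producer side too) also helps with the "consumer not started yet" case described below.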

Whether to use publisher confirmation

Calling the RabbitMQ client's send method does not mean the data has arrived in the queue. The client API call only tells us the data was handed to the broker; it does not guarantee that the data reached the exchange (note that an exchange has no persistence of its own), that the exchange routed it to the queue, or that the queued data has actually been persisted. If MQ goes down at any point along the way, messages may be lost.
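A minimal sketch of publisher confirms with Spring AMQP, assuming Spring Boot 2.2+ with spring.rabbitmq.publisher-confirm-type=correlated; the exchange/routing-key names and the idea of correlating by business ID are illustrative assumptions, not a fixed recipe:

```java
import org.springframework.amqp.rabbit.connection.CorrelationData;
import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class ConfirmedSender {

    private final RabbitTemplate rabbitTemplate;

    public ConfirmedSender(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
        // Called by the broker once it has (or has not) taken responsibility for the message
        rabbitTemplate.setConfirmCallback((correlation, ack, cause) -> {
            if (!ack) {
                // Not confirmed: leave the record marked as unsynced and let the
                // scheduled compensation job (Note 6) re-send it later
                System.err.println("message not confirmed: " + cause);
            }
        });
    }

    public void send(String businessId, Object payload) {
        // Correlate the broker's confirm with our own business ID
        rabbitTemplate.convertAndSend("service-a.sync.exchange", "sync.save", payload,
                new CorrelationData(businessId));
    }
}
```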

Another case: we usually let the consumer side create the queue automatically via the @RabbitListener annotation and bind it to an exchange (or declare them with @Bean methods). Right after the project first goes live, if the consumer has not been started yet and the producer starts producing, the program that calls the send API will not report any error, but the queue does not exist yet, so the messages sent are simply lost. ===> I have run into this myself; of course, we can also create the queue manually (or declare it on the producer side as well).

Both of these situations can lead to data loss.

Disk data loss

This is the extreme of extremes: for example, the server's disk is damaged (it takes bad luck to hit this, but it has happened to real companies), or a programmer or ops engineer accidentally deletes the persisted data.

In this case a large amount of data may be lost. Generally we back up a copy of the data on the producer side before synchronizing it to MQ.

If we have a scheduled-task compensation mechanism in place, it can also cover this case.

Note 2. Service B consumes data from middleware

Whether to acknowledge manually

By default RabbitMQ acknowledges automatically: as soon as the consumer fetches a message, it is removed from the queue. If there is no scheduled-task compensation mechanism, manual acknowledgement must be added (you cannot guarantee that your service never hangs, your database never goes down, or your code has no other exceptions); otherwise, when a problem occurs, the data is lost.
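A sketch of manual acknowledgement with Spring AMQP, assuming the listener container is switched to manual ack mode (for example spring.rabbitmq.listener.simple.acknowledge-mode=manual); the queue name is the invented one from earlier:

```java
import com.rabbitmq.client.Channel;
import org.springframework.amqp.core.Message;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class SyncConsumer {

    @RabbitListener(queues = "service-b.sync.queue")
    public void onMessage(Message message, Channel channel) throws Exception {
        long deliveryTag = message.getMessageProperties().getDeliveryTag();
        try {
            // parse, verify and save the data here (the save method from the pseudocode)
            channel.basicAck(deliveryTag, false);           // ack only after the save succeeded
        } catch (Exception e) {
            // requeue = false here; whether to requeue, dead-letter or drop on failure
            // depends on the business (see "Manual acknowledgement" under Repeat consumption)
            channel.basicNack(deliveryTag, false, false);
        }
    }
}
```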

The number of messages fetched from MQ at a time (prefetch_count)

The default may differ between client versions; we can set this value explicitly on the consumer side.

Fair dispatch

Each server's configuration and performance are different, but RabbitMQ uses round-robin dispatch by default. With two consumers A and B and 100 messages, each gets 50.

Fifty each sounds fair: fifty for you, fifty for me (but you are a junior programmer; if you and a senior programmer each get half the work, how does that feel? You will still be grinding away days after they have finished). What we want instead is to distribute according to capacity: whoever finishes first takes the next task, and the more capable consumer does more work.
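With Spring AMQP this "distribute according to work" behaviour comes from a small prefetch count: a consumer is not handed another message until it has acknowledged the one in hand. A hedged configuration sketch (bean-based; the same can be done with spring.rabbitmq.listener.simple.prefetch=1 in application properties):

```java
import org.springframework.amqp.core.AcknowledgeMode;
import org.springframework.amqp.rabbit.config.SimpleRabbitListenerContainerFactory;
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ListenerConfig {

    @Bean
    public SimpleRabbitListenerContainerFactory rabbitListenerContainerFactory(ConnectionFactory cf) {
        SimpleRabbitListenerContainerFactory factory = new SimpleRabbitListenerContainerFactory();
        factory.setConnectionFactory(cf);
        factory.setPrefetchCount(1);                        // one unacked message at a time => fair dispatch
        factory.setAcknowledgeMode(AcknowledgeMode.MANUAL); // pairs with the manual ack shown earlier
        return factory;
    }
}
```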

Repeat consumption

There is the following pseudocode:

// Step 1: insert the data

// Step 2: record the log

// Step 3: update the status

// no error reported: acknowledge manually

Let's consider the following situations:

1. Turn on exception retry.

Suppose execution has reached step 3 and the database being updated suddenly has a problem. When the retry kicks in later, the database is fine again, but the insert in step 1 gets executed a second time. We have to make sure that consuming the same message more than once does not lead to duplicated operations (for example: attach a message ID, and if that ID has already been consumed, do not insert again; a sketch follows after this list).

2. Manual acknowledgement

If an error is reported at step 3, the message is not acknowledged, so it still exists in the message queue and will be re-delivered (of course, what to do after an error is up to us: requeue it for another consumer, remove it from the queue, acknowledge it anyway, and so on; how to handle a failure has to be decided per business).

3. Scheduled-task compensation

If the synchronization status in service A is still false, the message will be delivered to the middleware again after a while. So if the database operation in step 3 failed the first time, the same piece of data will come around again later.
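Putting the three points above together, a common way to make the consumer idempotent is to attach a unique message ID on the producer side and record consumed IDs in a table with a unique index. A minimal sketch, assuming a hypothetical consumed_message table with a unique key on message_id (ideally the ID insert and the business save run in the same local transaction):

```java
import org.springframework.dao.DuplicateKeyException;
import org.springframework.jdbc.core.JdbcTemplate;

public class IdempotentSaver {

    private final JdbcTemplate jdbc;

    public IdempotentSaver(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    /** Returns true if this messageId was processed for the first time. */
    public boolean saveOnce(String messageId, Runnable businessSave) {
        try {
            // consumed_message.message_id has a UNIQUE index, so a re-delivered message
            // fails here instead of running steps 1-3 a second time
            jdbc.update("INSERT INTO consumed_message (message_id) VALUES (?)", messageId);
        } catch (DuplicateKeyException e) {
            return false; // already consumed once: just ack and skip
        }
        businessSave.run(); // step 1 insert, step 2 log, step 3 status update
        return true;
    }
}
```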

Note 3. The task is handed over to a thread pool for processing

Thread pool parameter settings

First, we should set the main thread pool parameters sensibly: core pool size, maximum pool size, queue capacity (which defaults to Integer.MAX_VALUE for the usual unbounded queue), and so on. These parameters cannot be nailed down in one go: we observe the rejection rate, thread utilisation and number of queued tasks in production and adjust them (each machine's configuration may differ, so analyse each machine specifically), so that the machine is used as fully as possible without bringing the service down (for example, if you leave the queue unbounded and too many tasks arrive at once, memory can be exhausted).

Second, you must know about the thread pool's rejection policy (the default is to reject by throwing an exception; see the implementations of java.util.concurrent.RejectedExecutionHandler). When a task exceeds what the pool can take, it is rejected. Should a rejected task be run by the calling thread instead (not a bad option, since the task is not discarded), should the oldest task be discarded, should the new task be discarded, or should a rejection exception be thrown? All of this has to be thought through.
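A minimal sketch of building such a pool explicitly with the JDK's ThreadPoolExecutor; every number here is a placeholder to be tuned against the rejection rate, thread utilisation and queue depth observed in production:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SaveExecutor {

    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            8,                               // core pool size
            16,                              // maximum pool size
            60, TimeUnit.SECONDS,            // keep-alive for idle non-core threads
            new ArrayBlockingQueue<>(1000),  // bounded queue: avoids unbounded memory growth
            new ThreadPoolExecutor.CallerRunsPolicy());
            // CallerRunsPolicy: when the pool is full, the submitting thread (the MQ listener
            // thread) runs the task itself, which slows consumption down instead of dropping data

    public void submitSave(Runnable saveTask) {
        pool.execute(saveTask);
    }
}
```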

Note 4. The thread pool executes save asynchronously

Guarantee the order of data

Assume the user saves data and it is synchronized to service B (assumption: for the same user, not only can values be modified, rows can also be added and deleted), and we have the following pseudocode:

// delete the original rows (delete)

// insert the new rows (insert)

We must consider whether one user's data can be processed by multiple threads at the same time, and the order of the data must be guaranteed during execution. For example:

1. The client has an auto-save feature that saves every 10 seconds, and right after an auto-save the user modifies the data and saves again immediately.

2. The user's last synchronization did not succeed, the scheduled compensation task starts to synchronize it, and the user is modifying the data at that very moment.

Both situations lead to the "new" and the "old" data being produced within a very short interval, or even delivered to the message middleware at the same time (PS: "new" and "old" here stand for the new data and the original/previous data).

An abnormal scenario:

Machine 1 gets "old" first and machine 2 gets "new" (each machine's thread pool has plenty of other tasks). But machine 1's CPU is running hot, the machine's performance is poor, or the thread handling "old" just keeps missing its CPU time slice. The result is that the two end up executing the save method at the same time, or "old" even runs after "new". What happens then?

Suppose the two save methods run concurrently: machine 1's delete and insert interleave with machine 2's delete and insert, and with all the possible permutations, who knows what the data ends up looking like.

For example:

Situation 1:
machine 1 (old data) - delete
machine 2 (new data) - delete
machine 1 (old data) - insert
machine 2 (new data) - insert
// in the new version a row of old data had been removed, and now it has been inserted right back

Situation 2:
machine 2 (new data) - delete
machine 2 (new data) - insert
machine 1 (old data) - delete
machine 1 (old data) - insert
// what on earth, did I change it for nothing? what ends up stored is still the old data

Other cases are not listed one by one.

If "new" simply executes before "old", the result is the same as situation 2 above.

How should we deal with this? We have to use a distributed lock. The key is the data's identifier, and the value can be the data's timestamp, which tells new from old. If "old" grabs the lock first, "new" simply waits, and the data comes out fine. If "new" grabs the lock first (executes first), then when "old" gets its turn we compare its timestamp with the one already stored, and if it is smaller we do not save it. (A sketch of this is given below, after the note on lock granularity.)

A note on locking: lock only on the individual user's business ID. Even when service B runs on a single machine, we cannot simply do this:

public synchronized void save(){....}

The lock granularity would be far too coarse: it blocks threads that have nothing to do with each other and hurts the program's throughput.
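A rough sketch of the lock-plus-timestamp idea, assuming a Redis-based lock via Spring Data Redis setIfAbsent and a hypothetical user_sync table whose last_sync_time column records the newest version already saved (the key names, table, columns and timeouts are all made up, and a production lock would need proper timeout/renewal handling):

```java
import java.time.Duration;

import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.jdbc.core.JdbcTemplate;

public class OrderedSaver {

    private final StringRedisTemplate redis;
    private final JdbcTemplate jdbc;

    public OrderedSaver(StringRedisTemplate redis, JdbcTemplate jdbc) {
        this.redis = redis;
        this.jdbc = jdbc;
    }

    /** dataTime is the business timestamp carried in the message, used to tell old from new. */
    public void save(String userId, long dataTime, Runnable deleteAndInsert) throws InterruptedException {
        String lockKey = "sync:lock:" + userId; // lock per user, not on the whole save method
        // crude spin until we own the lock; simplified on purpose (no renewal, no owner check)
        while (!Boolean.TRUE.equals(
                redis.opsForValue().setIfAbsent(lockKey, String.valueOf(dataTime), Duration.ofSeconds(30)))) {
            Thread.sleep(50);
        }
        try {
            // only succeeds if this message is newer than what is already saved
            // (assumes the user_sync row was created when the user first saved in service A)
            int updated = jdbc.update(
                    "UPDATE user_sync SET last_sync_time = ? WHERE user_id = ? AND last_sync_time < ?",
                    dataTime, userId, dataTime);
            if (updated == 0) {
                return; // an equal or newer version is already there: drop this stale message
            }
            deleteAndInsert.run(); // the delete + insert pair from the pseudocode above
        } finally {
            redis.delete(lockKey);
        }
    }
}
```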

Ensure that delete and update operations use a MySQL index

First of all, we have to know that when InnoDB performs a delete or update without using an index, it effectively locks the whole table (its row locks are placed on the index records it scans, so a full-table scan locks every row). Without an index the full-table scan itself is already slow, and on top of that the table is locked, so other threads and even other services end up queuing behind you; how is anyone supposed to get any work done?

Query optimization is not discussed here.

Consider using an auto-incrementing primary key

> Every `InnoDB` table has a special index called the clustered index that stores row data. Typically, the clustered index is synonymous with the primary key. To get the best performance from queries, inserts, and other database operations, it is important to understand how `InnoDB` uses the clustered index to optimize common lookup and DML operations.
>
> - When you define a `PRIMARY KEY` on a table, `InnoDB` uses it as the clustered index. A primary key should be defined for each table. If there is no logical unique and non-null column or set of columns to use as the primary key, add an auto-increment column. Auto-increment column values are unique and are added automatically as new rows are inserted.
> - If you do not define a `PRIMARY KEY` for a table, `InnoDB` uses the first `UNIQUE` index with all key columns defined as `NOT NULL` as the clustered index.
> - If a table has no `PRIMARY KEY` or suitable `UNIQUE` index, `InnoDB` generates a hidden clustered index named `GEN_CLUST_INDEX` on a synthetic column that contains row ID values. Rows are ordered by the row ID that `InnoDB` assigns. The row ID is a 6-byte field that increases monotonically as new rows are inserted, so rows ordered by the row ID are physically in insertion order.

Suppose the primary key is unordered. Since each new key value is roughly random, each new record has to be inserted somewhere in the middle of an existing index page, and MySQL has to move data around to fit the new record in (in InnoDB's B+tree the row data lives in the leaf nodes). The target page may even have been flushed to disk and evicted from the cache, in which case it has to be read back from disk first. All of this adds a lot of overhead, and the frequent moves and page splits leave a lot of fragmentation, so the index ends up not compact; later the table has to be rebuilt and its pages refilled with OPTIMIZE TABLE.

Whether it is a B-tree (row data stored in the nodes of the clustered index tree) or a B+tree (row data stored only in the leaf nodes), this problem exists whenever an unordered primary key is used.

Note:

Many other databases also use B-tree family indexes, such as MongoDB and PostgreSQL, and similar issues can arise there. We should always consider the reorganization cost that unordered keys impose on the clustered index tree.

Data Deletion Optimization

Physically deleting data again triggers reorganization of the index tree, which is friendly neither to the database nor to our program's efficiency. We can add a deleted status column instead, so that "delete" only updates the status. If the data really has to be removed, a scheduled task can purge it in the middle of the night when traffic is low.
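A small sketch of this logical-delete idea (table name, column and schedule are invented, and @EnableScheduling is assumed to be enabled): deleting in business code only flips a flag via an indexed update, and a night-time job physically purges flagged rows in small batches.

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class LogicalDelete {

    private final JdbcTemplate jdbc;

    public LogicalDelete(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    /** "Delete" in business code is just an indexed UPDATE of a status flag. */
    public void delete(long id) {
        jdbc.update("UPDATE sync_data SET deleted = 1 WHERE id = ?", id);
    }

    /** Physically remove flagged rows at 3 a.m., in small batches, while traffic is low. */
    @Scheduled(cron = "0 0 3 * * *")
    public void purge() {
        int rows;
        do {
            rows = jdbc.update("DELETE FROM sync_data WHERE deleted = 1 LIMIT 1000");
        } while (rows > 0);
    }
}
```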

Note 5. Update the synchronization result

In service A, the sync status is set to false every time the user saves; once synchronization succeeds, we flip the status. As long as this update hits an index it is generally very fast, so service B can simply call service A's update remotely; going through middleware for this step would only add complexity.

Note 6. Back up the data that needs to be synchronized

Backing up the data that needs to be synchronized can solve two problems:

1. Data loss caused by MQ disk failure.

2. Any other abnormality that prevents data from reaching service B: the scheduled task periodically asks service A to synchronize again, and the backup ensures service A still has the data to synchronize.

For data that has already been synchronized successfully and is no longer needed by the business, we can periodically migrate the backup elsewhere or delete it.
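Tying Notes 5 and 6 together, the compensation mechanism mentioned throughout could look roughly like this: service A keeps a backup row with a synced flag, and a scheduled task periodically re-publishes anything still unsynced after a grace period. Everything here (table, columns, interval, exchange name) is an assumption for illustration; this re-delivery is exactly why the consumer has to be idempotent (Note 2).

```java
import java.util.List;
import java.util.Map;

import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class SyncCompensationJob {

    private final JdbcTemplate jdbc;
    private final RabbitTemplate rabbitTemplate;

    public SyncCompensationJob(JdbcTemplate jdbc, RabbitTemplate rabbitTemplate) {
        this.jdbc = jdbc;
        this.rabbitTemplate = rabbitTemplate;
    }

    /** Every 5 minutes, re-send backed-up records that are still unsynced after 10 minutes. */
    @Scheduled(fixedDelay = 300_000)
    public void resendUnsynced() {
        List<Map<String, Object>> rows = jdbc.queryForList(
                "SELECT id, payload FROM sync_backup "
              + "WHERE synced = 0 AND created_at < NOW() - INTERVAL 10 MINUTE LIMIT 500");
        for (Map<String, Object> row : rows) {
            // may re-deliver a message that actually arrived but whose status update was lost,
            // which is fine as long as the consumer is idempotent
            rabbitTemplate.convertAndSend("service-a.sync.exchange", "sync.save", row.get("payload"));
        }
    }
}
```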

Summary

  • When synchronizing data we have to consider all kinds of anomalies (code exceptions, machine failures) and whether they affect the synchronized data; in the end it all comes down to guaranteeing the accuracy of the synchronized data, and with a scheduled-task compensation mechanism we can guarantee eventual consistency. While guaranteeing accuracy we also have to care about the program's efficiency and give users a friendly experience. When adopting a new technology you need to know its pitfalls, just as with MQ: it can lose data and it can deliver the same data more than once. When using multithreading you have to preserve the order of the data, and when the same data is modified concurrently you have to add a lock (CAS, Lock or synchronized on a single machine, a distributed lock across machines) and keep the lock granularity under control.

Origin blog.csdn.net/qq_41221596/article/details/132390578