Data consistency has been discussed thousands of times

Today, let's look at a common question through a practical case:

In existing systems, caching is often used to improve query efficiency and reduce database pressure. Especially in distributed, high-concurrency scenarios, a large number of requests hitting MySQL directly can easily cause performance problems.

One day the boss finds you...

Boss: I heard you know caching?

You: Watch me work.

You designed the most common kind of caching scheme and used it to optimize the user points feature. However, while you were sleeping soundly, the system quietly performed the following operations:

1. Thread A, following the business logic, updates the points of user ID 1 to 100.

2. Thread B, following the business logic, updates the points of user ID 1 to 200.

3. At the database level, locks ensure the ACID properties, so there is no concurrency between thread A and thread B there. Whether the final value in the database is 100 or 200, we assume it is correct.

4. Assume thread B updates the database after thread A; the value in the database is then 200.

5. When threads A and B write the new value back to the cache, thread A may well touch the cache after thread B (network call latency is unpredictable). The cached value is then set to 100, and the cache and the database are now inconsistent. (The shared write path is sketched after this list.)
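In code, the write path both threads ran looks roughly like this (a minimal sketch; redis and db are hypothetical helpers, not part of the original article):

public void updatePoints(long userId, int points) {
    db.updatePoints(userId, points);                        // serialized by database locks: 100, then 200
    redis.set("points:" + userId, String.valueOf(points));  // network call: arrival order not guaranteed
}

The database writes are serialized, but the two cache writes race over the network, which is exactly how the stale 100 wins.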

The next morning you receive a user complaint. What do you do: manually fix the points value, or delete the database and run away?

Whenever two operations in different physical locations touch the same data, consistency problems arise; this is an unavoidable pain point of distributed systems.

1 What is data consistency?

Data consistency usually comes up in the context of storage systems: master-slave MySQL, distributed storage systems, and so on. How do we keep their data consistent?

For example, master-slave consistency and replica consistency guarantee that accesses to such a master-slave database return consistent data, whether at different times or within the same request: an access that returns result A will not return result B the next time.

2 CAP theorem

When it comes to data consistency, we must talk about the CAP theorem.

The CAP theorem was proposed by Brewer in 2000. It holds that distributed systems face three core concerns in their design and deployment:

Consistency: a transaction's ACID guarantees constrain the data so that it remains in a consistent state after execution; in a distributed system, every user should read the latest value after an update.

Availability: every operation always returns a result within a bounded time. The result may be success or failure, but it arrives within that bound.

Partition tolerance: for performance and scalability, consider whether the data can be partitioned across multiple nodes.

The CAP theorem holds that a storage system that provides data services cannot simultaneously satisfy data consistency, data availability, and partition tolerance.

Why? Once data is partitioned, distributed nodes must communicate, and there will be moments when a node has completed only part of an operation. Until that communication finishes, the data across nodes is inconsistent. To guarantee consistency over that window, the data must be protected while the communication completes, which makes the operations that access it unavailable.

Looking at it the other way around, if you want to guarantee both consistency and availability, the data cannot be partitioned. Put simply, all data must live in a single database with no splitting, which is unacceptable for internet applications with large data volumes and high concurrency.

3 Data consistency model

Based on the CAP theorem, some distributed systems improve reliability and fault tolerance by replicating data, that is, by storing copies of the data on different machines. Commonly used consistency models include:

Strong consistency: After the data update is completed, any subsequent access will return the latest data. This is almost impossible to achieve in a distributed network environment.

Weak consistency: The system does not guarantee that access after data update will get the latest data. Some special conditions need to be met before the client can obtain the latest data.

Eventual consistency: a special case of weak consistency, which guarantees that users will eventually be able to read the update that an operation made to a specific piece of data.

4 How to ensure data consistency?

Back to the opening problem: if you think it through, you may find that whether you write the MySQL database first and then delete the Redis cache, or delete the cache first and then write the database, data inconsistency can occur.

(1) Delete the cache first

1. The Redis cache entry is deleted first, but before the new value has been written to MySQL, another thread comes to read the data;

2. Finding the cache empty, that thread reads the old value from the MySQL database and writes it into the cache (as in the read path sketched below); the cache now holds dirty data;

3. After the database is then updated, Redis and MySQL are found to be inconsistent.
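For reference, a minimal sketch of the cache-aside read path behind step 2, assuming a Jedis field named jedis and a hypothetical db DAO; the 300-second TTL is an illustrative choice:

public Integer readPoints(long userId) {
    String key = "points:" + userId;
    String cached = jedis.get(key);
    if (cached != null) {
        return Integer.valueOf(cached);             // cache hit
    }
    Integer points = db.queryPoints(userId);        // may still observe the pre-update value
    jedis.setex(key, 300, String.valueOf(points));  // re-cache, with a TTL as a safety net
    return points;
}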

(2) Delete the cache afterwards

1. If you write the database first and then delete the cache, the writing thread may unluckily crash before the delete, so the cache is never removed;

2. Subsequent reads then hit the old cache directly, which ultimately leads to inconsistent data;

3. Because writes and reads are concurrent and their order cannot be guaranteed, the cache and the database can end up inconsistent. The write path, with the crash window marked, is sketched below.
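The write path for this ordering is tiny (jedis and db as in the earlier sketches):

public void writePoints(long userId, int points) {
    db.updatePoints(userId, points);
    // If the process dies here, the stale cache entry survives until its TTL expires.
    jedis.del("points:" + userId);
}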

Solution 1: Distributed lock

In daily development, distributed locks are a common solution. Wrapping the cache operation and the database operation into one logical unit under a distributed lock ensures data consistency. The process is:

1. Each thread that wants to operate the cache and database must first apply for a distributed lock;

2. If the lock is successfully obtained, the database and cache operations are performed, and the lock is released after the operation is completed;

3. If the lock is not obtained then, depending on the business, the thread can block and wait, poll for the lock, or return immediately.

See the process below:
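A minimal sketch of this flow, assuming the Jedis client; the PointsDao interface, lock-key naming, random token, and 10-second lock TTL are illustrative assumptions rather than something the original scheme prescribes:

import java.util.Collections;
import java.util.UUID;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class LockedCacheWriter {

    // Hypothetical DAO standing in for the MySQL access layer
    interface PointsDao {
        void updatePoints(long userId, int points);
    }

    private final Jedis jedis = new Jedis("localhost", 6379);
    private final PointsDao db;

    LockedCacheWriter(PointsDao db) {
        this.db = db;
    }

    public boolean writePoints(long userId, int points) {
        String lockKey = "lock:points:" + userId;
        String token = UUID.randomUUID().toString();  // identifies this lock holder
        // SET ... NX PX 10000: acquire only if the key is absent, auto-expire in 10 s
        String ok = jedis.set(lockKey, token, SetParams.setParams().nx().px(10_000));
        if (ok == null) {
            return false;  // lock not obtained: block, poll, or return, per the business
        }
        try {
            db.updatePoints(userId, points);  // database operation
            jedis.del("points:" + userId);    // cache operation
        } finally {
            // Release only our own lock, atomically, via Lua
            String script = "if redis.call('get', KEYS[1]) == ARGV[1] then "
                          + "return redis.call('del', KEYS[1]) else return 0 end";
            jedis.eval(script, Collections.singletonList(lockKey),
                       Collections.singletonList(token));
        }
        return true;
    }
}

The random token plus the Lua release guard keeps a thread whose lock has already expired from deleting a lock that another thread now holds.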

Distributed locks are one way to handle this distributed transaction, but they reduce system performance to some degree, and the lock design must account for machine crashes and deadlocks.

Solution 2: Delayed double deletion

Perform a redis.del(key) operation both before and after writing the database, and choose a reasonable sleep duration.

The pseudo code is as follows:

public void write(String key, Object data) throws InterruptedException {
    redis.delKey(key);    // 1. delete the cache first
    db.updateData(data);  // 2. write the database
    Thread.sleep(500);    // 3. wait for in-flight reads to finish (tune to read latency)
    redis.delKey(key);    // 4. delete the cache again
}

Specific steps:

1. Delete cache first

2. Then write the database

3. Sleep for 500 milliseconds (set this based on how long a read request takes)

4. Delete cache again

Let’s look at the previous case under this scheme:

Thread T1 deletes the cache and then updates the DB. If thread T2 reads the old value from the DB before T1's update lands, T2 writes that old value back into the Redis cache.

T1 then waits for a while and deletes the Redis cache a second time. From then on, any thread that finds the cache empty queries the DB for the latest value and re-caches it, keeping MySQL and the Redis cache consistent.

On top of this, the cache must also be given an expiration time to guarantee eventual consistency: the moment a cached entry expires, the next read goes to the database and re-caches the result.

The worst case for this double-delete-plus-TTL strategy is that data stays inconsistent for up to the cache expiration time, and write operations take longer.

However, this solution leaves one more problem: how do we ensure that the second cache deletion, after the database write, actually succeeds?

If that deletion fails, inconsistency can reappear, so a retry plan is needed.
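A minimal retry sketch, assuming deletion failures surface as exceptions; production systems often hand the key to a message queue instead of blocking the writer, as Solution 3 below does:

public void deleteWithRetry(String key, int maxAttempts) throws InterruptedException {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            jedis.del(key);
            return;                        // deleted successfully
        } catch (Exception e) {
            if (attempt == maxAttempts) {
                // Give up: alert, or enqueue the key for asynchronous retry
                throw new IllegalStateException("cache delete failed: " + key, e);
            }
            Thread.sleep(100L * attempt);  // simple linear backoff
        }
    }
}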

Solution 3: Asynchronously update the cache (based on the MySQL binlog)

1. For update operations on the data, subscribe to the MySQL binlog and consume the changes incrementally;

2. Publish the change messages to a message queue;

3. Consume the messages from the queue and apply the incremental changes to Redis.

The effect is:

Read the Redis cache: hot data is all in Redis;

Write MySQL: inserts, deletes, and updates all go to MySQL;

Update the Redis data: MySQL's data operations are recorded in the binlog and propagated to Redis via the message queue in a timely manner.

In this way, as soon as a write, update, or delete occurs in MySQL, the corresponding binlog message is pushed through the queue, and a consumer updates Redis according to the binlog records.

In fact, this mechanism is very similar to MySQL's master-slave replication, which also achieves data consistency through the binlog.
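As a sketch of the consumer side, suppose a binlog subscriber such as Canal already publishes row changes to a Kafka topic named points-binlog as "userId:points" strings; the topic name and message format here are assumptions for illustration:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import redis.clients.jedis.Jedis;

public class BinlogCacheUpdater {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "cache-updater");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Jedis jedis = new Jedis("localhost", 6379)) {
            consumer.subscribe(Collections.singletonList("points-binlog"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    String[] parts = record.value().split(":");
                    // Replay the DB change into Redis in binlog order
                    jedis.set("points:" + parts[0], parts[1]);
                }
            }
        }
    }
}

Because changes are applied in binlog order, Redis converges to the same final state as MySQL.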

As for the retry plan needed in Solution 2, Solution 3 can provide it: run a subscriber on the database binlog, extract the affected keys and data, and handle them in a separate consumer. If a cache deletion fails, send a message to the message queue; the consumer picks it up and retries the deletion.


Thanks for reading~

Author: JD Retail Li Zeyang

Source: JD Cloud Developer Community. Please indicate the source when reprinting.
