Cache questions interviewers love to ask: how to ensure data consistency, plus common cache design patterns



Foreword

Caching is very common in today's projects. While it brings convenience, it also brings some well-known problems; used carelessly, it can produce unexpected results.

In interviews, the problems caused by caching are also favorite topics for interviewers. Today I will discuss the following common questions with you:

  • How do you ensure data consistency between the database and the cache?
  • Should you operate the database first, or the cache first?
  • Should you invalidate the cache, or update it?
  • What are the common cache design patterns?

Main text

The general cache query flow

I believe everyone is familiar with this cache query flow; it should be the most widely used one today: check the cache first, and on a miss, read the database and backfill the cache.

What most people may not know is that this flow actually has a name: the Cache Aside Pattern, one of the cache design patterns.

The flow above is the query path of the Cache Aside Pattern. Its update path is: update the database first, then invalidate (delete) the corresponding cache entry.
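The two paths can be sketched in a few lines. This is a minimal illustration using plain dicts in place of a real database and a cache such as Redis; the function names `read` and `write` are my own, not from any library.

```python
# Minimal Cache Aside sketch: `db` and `cache` are plain dicts standing
# in for a real database and a cache such as Redis.

db = {"user:1": "alice"}
cache = {}

def read(key):
    # Query path: try the cache first; on a miss, read the database
    # and populate the cache before returning.
    if key in cache:
        return cache[key]
    value = db.get(key)
    if value is not None:
        cache[key] = value
    return value

def write(key, value):
    # Update path: update the database first, then invalidate
    # (delete) the cache entry -- do not rewrite the cache in place.
    db[key] = value
    cache.pop(key, None)

print(read("user:1"))   # miss -> loads "alice" into the cache
write("user:1", "bob")  # updates the database, deletes the cache entry
print(read("user:1"))   # miss again -> loads "bob"
```

Note that `write` deletes the cache entry instead of updating it; the reasons for both ordering choices are analyzed below.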

 

This update flow raises two questions:

1) Why operate the database first? Can we operate the cache first?

2) Why invalidate the cache? Can we update the cache instead?

Let's analyze them one by one.

Operate the database first, or the cache first?

Operate the database first

Consider two concurrent requests, one write and one read:

1) The write request updates the database.

2) Before the cache is invalidated, the read request hits the cache and returns the old, now-stale value.

3) The write request invalidates the cache; the next read repopulates it with fresh data.

Dirty-data window: from the moment the database is updated until the cache is invalidated. This window is small, usually no more than a few milliseconds.

Operate the cache first

Again consider two concurrent requests, one write and one read:

1) The write request deletes the cache entry.

2) Before the database is updated, the read request misses the cache, reads the old value from the database, and writes it back into the cache.

3) The write request updates the database; the cache now holds stale data.

Dirty-data window: from the moment the database is updated until the data is next updated. This window is highly uncertain:

1) If the next update to the data arrives soon, the cache is invalidated quickly and the dirty-data period is very short.

2) If the next update takes a long time to arrive, the cache serves stale data the whole time, and the window can be very long.

Conclusion: both orderings can produce dirty data, but operating the database first and the cache second is the better choice: even under extreme concurrency, only a small amount of short-lived dirty data can appear.
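The cache-first race can be reproduced step by step. This is a hand-fixed interleaving over plain dicts, purely for illustration of the worst case described above:

```python
# Simulating the "operate the cache first" race: one write request (W)
# and one read request (R) interleave badly.

db = {"k": "old"}
cache = {"k": "old"}

# W step 1: the write request deletes the cache entry first.
cache.pop("k", None)

# R: the read request misses the cache, reads the *old* value from
# the database, and writes it back into the cache.
value = cache.get("k")
if value is None:
    value = db["k"]
    cache["k"] = value

# W step 2: the write request now updates the database.
db["k"] = "new"

# The cache is stale, and stays stale until the data is next updated.
print(db["k"], cache["k"])  # new old
```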

 

 

Invalidate the cache, or update it?

Update the cache

Consider two concurrent write requests, A and B:

1) Request A updates the database.

2) Request B updates the database, then updates the cache.

3) Request A updates the cache, overwriting B's value with its own stale one.

Analysis: the database holds request B's data while the cache holds request A's, so the database and cache are inconsistent.

Invalidate the cache

Again consider two concurrent write requests, A and B:

1) Request A updates the database, then deletes the cache entry.

2) Request B updates the database, then deletes the cache entry.

Analysis: since the cache entry is deleted rather than rewritten, it does not matter how the deletes interleave; the next read repopulates the cache from the database, so no inconsistency arises.

Conclusion: from the cases above, invalidating the cache is clearly the better approach.
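Both write-write cases can be replayed side by side. Again this is a hand-fixed interleaving over plain dicts, only to make the contrast concrete:

```python
# Two concurrent write requests A and B, first with "update the cache",
# then with "invalidate the cache", using the same bad interleaving.

# --- Update the cache: A and B interleave badly ---
db, cache = {}, {}
db["k"] = "A"       # A updates the database
db["k"] = "B"       # B updates the database
cache["k"] = "B"    # B updates the cache
cache["k"] = "A"    # A updates the cache (stale overwrite)
print(db["k"], cache["k"])  # B A -> inconsistent

# --- Invalidate the cache: the same interleaving is harmless ---
db, cache = {}, {}
db["k"] = "A"         # A updates the database
db["k"] = "B"         # B updates the database
cache.pop("k", None)  # B deletes the cache entry
cache.pop("k", None)  # A deletes the cache entry (no-op)
# The next read misses and repopulates from the database.
value = cache.get("k") or db["k"]
cache["k"] = value
print(db["k"], cache["k"])  # B B -> consistent
```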

 

 

How to ensure data consistency between the database and the cache

 

In the cases above, dirty data can appear whether you operate the database first or the cache first. Is there a way to avoid it?

 

The answer is yes. Because the database and the cache are two different data sources, keeping them consistent is a typical distributed-transaction scenario, which can be solved by introducing distributed transactions. Common options include 2PC, TCC, and MQ transactional messages.

 

However, introducing distributed transactions inevitably hurts performance, which runs against our original purpose of introducing the cache: improving performance.

 

Therefore, in practice we usually do not guarantee strong consistency between the cache and the database; instead we make certain trade-offs and guarantee eventual consistency of the two.

 

If dirty data really is unacceptable, the more reasonable choice is to drop the cache and query the database directly.

 

A common scheme for guaranteeing eventual consistency between the database and the cache works as follows:

1) Update the database; the database generates a binlog entry.

2) Subscribe to and consume the binlog, and invalidate the corresponding cache entries.

3) If a cache invalidation in step 2 fails, introduce a retry mechanism, e.g. re-deliver the failed keys through MQ, and consider whether idempotency needs to be guaranteed.
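The three steps can be sketched as below. This is only a shape of the scheme: the event format, the `invalidate` failure flag, and a `queue.Queue` standing in for MQ are all made up for the demo; a real system would consume an actual binlog stream (e.g. via a tool like Canal) and a real message queue.

```python
# Sketch of steps 1-3: consume binlog-like change events, invalidate
# the cache, and retry failures via a queue standing in for MQ.
import queue

cache = {"user:1": "stale", "user:2": "stale"}
retry_mq = queue.Queue()

def invalidate(key, fail=False):
    # `fail=True` simulates a transient cache outage.
    if fail:
        raise ConnectionError("cache unavailable")
    cache.pop(key, None)

def consume_binlog(events):
    # Each event is (key, fail) -- a change row plus a simulated fault.
    for key, fail in events:
        try:
            invalidate(key, fail)
        except ConnectionError:
            retry_mq.put(key)  # hand the failed key to MQ for retry

def retry_consumer():
    # Idempotent: deleting an already-deleted key is a no-op, so
    # redelivered messages from MQ are safe to replay.
    while not retry_mq.empty():
        invalidate(retry_mq.get())

consume_binlog([("user:1", False), ("user:2", True)])
retry_consumer()
print(cache)  # {} -- both entries invalidated after the retry
```

The idempotency note in the comment is exactly why step 3 favors cache *deletion* on retry: replaying a delete can never reintroduce stale data.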

 

 

Safety net: when an unexpected problem occurs, raise an alert promptly so a human can step in.

Human intervention is the ultimate fallback. Behind most applications that look polished on the outside is a group of hard-working programmers constantly repairing dirty data and fixing bugs.

 

 

 

So far we have covered Cache Aside and the common problems that come with it.

 

Next, let's look at the other cache design patterns: Read Through, Write Through, and Write Behind Caching.

 

 

Read/Write Through

 

In Cache Aside, the application layer has to deal with two data sources, the cache and the database, which adds complexity to the application layer. Could it deal with just one?

 

Read/Write Through solves exactly this problem. In this mode, the application layer talks only to the cache, and the cache itself operates and maintains the database.

 

This pattern makes the application layer simpler and the code more concise.

 

 

Read Through 

 

When the application layer queries data and the cache misses, the cache itself queries the database, writes the result into the cache, and finally returns the result to the application layer.


Write Through

 

When the application layer updates data, the cache writes the update through to the database. When the cache hits, the cache write and the database write must be coordinated so that both succeed together.
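The two Through modes can be sketched together as one cache object that the application talks to exclusively. The class name `ThroughCache` and its `get`/`put` methods are illustrative, not a real library API:

```python
# Read Through + Write Through sketch: the application uses only the
# cache object, which reads from and writes to the database itself.

class ThroughCache:
    def __init__(self, db):
        self.db = db    # a dict standing in for the real database
        self.data = {}  # the cached entries

    def get(self, key):
        # Read Through: on a miss, the cache itself loads the value
        # from the database and stores it before returning.
        if key not in self.data:
            value = self.db.get(key)
            if value is not None:
                self.data[key] = value
        return self.data.get(key)

    def put(self, key, value):
        # Write Through: write the database and the cache together.
        # In a real system these two writes must succeed or fail as one.
        self.db[key] = value
        self.data[key] = value

db = {"user:1": "alice"}
c = ThroughCache(db)
print(c.get("user:1"))                # alice (loaded through the cache)
c.put("user:1", "bob")
print(db["user:1"], c.get("user:1"))  # bob bob (both updated in step)
```

Note how the application code at the bottom never touches `db` directly; that is the whole point of the pattern.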


Write Behind Caching

 

Write Behind is also known as Write Back. From the application layer's perspective it looks like Write Through: the application still deals with only one data source, the cache. The differences are:

 

Write Through writes data through to the database synchronously and immediately. The advantage is simplicity; the disadvantage is that every modification must also be written to the database, so writes are slower.

 

Write Behind writes data to the database asynchronously, in batches, after a delay. The advantages: 1) the application layer only writes to the cache, so writes feel extremely fast; 2) the asynchronous write-back can merge multiple I/O operations into one, reducing the number of I/Os.

 

The disadvantages: 1) higher complexity; 2) if the system loses power while updates are still buffered and not yet written to the database, that data is lost.

 

The core flow of Write Behind: writes go only to the cache and mark the entry dirty; a background process later flushes the dirty entries to the database in batches.
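That flow can be sketched with a dirty set and an explicit flush. In a real implementation the flush would run on a timer or when the dirty set grows large; here it is called by hand:

```python
# Write Behind (Write Back) sketch: writes only touch the cache and
# mark the entry dirty; `flush` later writes dirty entries to the
# database in one batch.

db = {}
cache = {}
dirty = set()

def write(key, value):
    cache[key] = value
    dirty.add(key)  # fast path: no database I/O on write

def flush():
    # Batch write-back of all dirty entries. Several writes to the
    # same key since the last flush collapse into a single I/O.
    for key in dirty:
        db[key] = cache[key]
    dirty.clear()

write("k", "v1")
write("k", "v2")   # overwrites in cache; still one pending write-back
write("j", "x")
print(db)          # {} -- nothing has reached the database yet
flush()
print(db == {"k": "v2", "j": "x"})  # True
```

The `print(db)` before the flush also shows the pattern's risk: if the process died at that point, `v2` and `x` would be lost.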


Because of its complexity, the Write Behind pattern is rarely implemented in business applications, but thanks to the performance it brings, plenty of excellent software uses this design, for example the page cache in Linux and the InnoDB storage engine in MySQL.

 

The page cache in Linux uses the write-back mechanism: a user write only puts the data into the page cache and marks it dirty, without actually writing to disk. At some later point, the kernel writes the dirty pages in the page cache back to disk.

 

Wikipedia has a flow chart of Write Back; the "lower memory" in that figure can be simply understood as the database (the hard disk).


Finally

 

Recently I have organized my original articles by topic and collected them in one place: Original Summary. Future articles will be added to that index as well; readers who like my articles can bookmark it for reference.

 

When your talent cannot yet support your ambition, calm down and study. I hope you gain something from my writing.

 

Original writing is not easy. If you think this article is well written and helpful, let me know with a [Like] and encourage me to write better articles.

 


Origin blog.csdn.net/v123411739/article/details/114803998