Multi-level cache architecture design

Preface

In the reader community (50+) of Nien, a 40-year-old architect, many readers have obtained interview opportunities at first-tier Internet companies such as Alibaba, NetEase, Youzan, Xiyin, Baidu, and Didi, where a very important interview question keeps coming up:

  • In a 20W (200,000) QPS scenario, how should the server architecture be designed?
  • In a 10W (100,000) QPS scenario, how should the cache architecture be designed?

Nien reminds everyone that cache architecture, cache capacity planning, cache eviction, and data consistency across multi-level caches are core architectural knowledge and key problems in production.

In addition, Nien has been coaching readers on resumes and architecture career transitions. A few days ago, while coaching the resume of an L9 at Meituan, these caching problems came up again, and solutions needed to be provided in two forms:

  • First: learning materials.
  • Second: architectural building blocks (reusable "wheels").

For these reasons, Nien will give you a systematic review based on the "JD Server-Side Application Multi-Level Cache Architecture Scheme" and "Youzan Transparent Multi-level Cache Solution (TMC)", so that during interviews you can fully show off your "technical muscles" and leave the interviewer deeply impressed.

This question and its reference answer are also included in V105 of the "Nien Java Interview Collection", for the reference of later readers, and to improve everyone's 3-high (high concurrency, high availability, high performance) architecture, design, and development skills.

For the PDFs of "Nien Architecture Notes", "Nien High Concurrency Trilogy", and "Nien Java Interview Collection", please visit the official account [Technical Freedom Circle].

High concurrency scenario analysis

Generally speaking, 10W (100K) or 20W (200K) QPS can be absorbed with a distributed cache.

For example, take a Redis cluster with 6 masters and 6 slaves: the masters serve reads and writes, while the slaves act purely as hot standbys and serve no traffic.

Under this 6-master, 6-slave architecture, each master can handle roughly 30K-40K QPS on average, so the cluster as a whole can sustain about 180K-240K QPS.

Moreover, if QPS reaches 1 million, cache capacity and concurrent read/write throughput can be scaled out by adding machines to the Redis cluster: the 6-master, 6-slave architecture can be expanded to 30 masters and 30 slaves (rough arithmetic: at ~35K QPS per master, 30 masters give roughly 1M QPS).

At the same time, the cached data is shared across application instances, and the master-slave architecture provides high availability.

Question: How to solve the cache hotspot (hot key) problem?

Once a cache hotspot appears, for example 100K QPS hitting the same key, the traffic concentrates on a single Redis instance, which may drive that instance's CPU load too high.

In this case, adding more nodes to the Redis cluster does not fundamentally solve the problem, because a single key still lives on a single instance. So what is an effective way to solve the hot-key problem? One very effective means is a local cache, for two main reasons: a local cache takes load off the single overloaded Redis server, and local in-memory access is faster, because the data sits directly in the application's memory with no network round trip.

The essence of a local cache is multiple copies: trading space for time. By replicating the hot data into multiple cache copies, requests are spread out, relieving the pressure that a cache hotspot puts on a single cache server.
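For a feel of what a "local cache" means in code, here is a minimal sketch using Guava (the sizes and TTL are illustrative assumptions, not values from the article):

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

import java.util.concurrent.TimeUnit;

public class LocalCacheSketch {
    public static void main(String[] args) {
        // Each application instance holds its own bounded copy of the hot data:
        // that is the "multiple copies, space for time" trade-off in practice.
        Cache<String, String> localCache = CacheBuilder.newBuilder()
                .maximumSize(10_000)                   // cap JVM memory usage
                .expireAfterWrite(5, TimeUnit.SECONDS) // short TTL bounds staleness
                .build();

        localCache.put("hotKey", "hotValue");
        // Served straight from process memory: no network hop, no Redis load.
        System.out.println(localCache.getIfPresent("hotKey"));
    }
}
```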

Everything has its trade-offs, though.

So what problems does introducing a local cache bring? There are two main ones:

  • Data consistency
  • Local cache data pollution

Nien's earlier article, "Scenario question: Suppose 100,000 people arrive in a sudden rush, how does your system avoid an avalanche?", already gave a thorough review of both problems, based on Youzan's Transparent Multi-level Cache solution (TMC).

However, as future senior architects, we should learn from many schools of thought and broaden our technical horizons.

Therefore, here is another walkthrough, this time based on the "Multi-level Cache Architecture Scheme for JD Server-Side Applications", originally published by the JD Cloud technical team.

Universal multi-level caching solution

JD's server-side multi-level cache architecture is in fact a commonly used two-level cache architecture:

(1) L1 cache: local cache (Guava)

(2) L2 cache: distributed cache (Redis)

The cache access flow in this two-level architecture:

  • Requests hit the application's local cache first.
  • On a local-cache miss, the request falls through to the Redis cluster, and the result is then cached locally.

This flow follows the cache-aside pattern: read the cache first and, on a miss, load from the lower layer and backfill.
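Here is a minimal sketch of the read path, assuming Guava for the L1 cache and the Jedis client for Redis; class, host, and method names are illustrative, not from the JD article:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import redis.clients.jedis.Jedis;

import java.util.concurrent.TimeUnit;

public class TwoLevelCacheReader {
    // L1: in-process cache, small and short-lived to bound staleness.
    private final Cache<String, String> l1 = CacheBuilder.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(5, TimeUnit.SECONDS)
            .build();
    // L2: shared Redis (single connection for brevity; use a pool in practice).
    private final Jedis redis = new Jedis("localhost", 6379);

    public String get(String key) {
        // 1. Try the local L1 cache first.
        String value = l1.getIfPresent(key);
        if (value != null) {
            return value;
        }
        // 2. On an L1 miss, fall back to the Redis L2 cache.
        value = redis.get(key);
        if (value != null) {
            // 3. Backfill L1 so subsequent reads stay in-process.
            l1.put(key, value);
        }
        // (On an L2 miss, full cache-aside would load from the DB and backfill Redis.)
        return value;
    }
}
```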

For details on data consistency between the DB and the Redis cache under the cache-aside pattern, please read Nien's "Java High Concurrency Core Programming, Volume 3 (Enhanced Edition)".

As noted earlier, introducing a local cache brings two main problems:

  • Data consistency
  • Local cache data pollution

Multi-level cache data consistency issues

How do we keep the levels of a multi-level cache consistent? The common approach is to synchronize the local cache with the Redis cache either through a publish-subscribe mechanism or through an RPC communication mechanism in an underlying component:

  • JD.com uses a publish-subscribe model.
  • Youzan uses an RPC communication mechanism in its underlying component.
  • J2Cache uses a publish-subscribe model.

Let's look at publish-subscribe first. Going deeper, there are two modes:

  • Push mode: each channel maintains a list of subscribed clients; when a message is published, the list is traversed and the message is pushed to every subscriber.
  • Pull mode: the sender puts the message into a mailbox, and every client subscribed to that mailbox can fetch it at any time; the message is deleted only after all clients have successfully received it in full.

First, let's look at how JD.com handles the data consistency problem: a multi-level cache synchronization scheme.

  1. The operations backend saves the data, writes it into the Redis cache, and uses Redis's publish-subscribe feature to publish a change message.
  2. The business application cluster, as the message subscriber, deletes the corresponding local cache entry when it receives the change message (see the sketch after this list).
  3. When C-side traffic arrives and the entry is missing from the local cache, it is loaded from Redis into the local cache.
  4. To guard against the Redis cache being lost in extreme cases, a scheduled task can periodically reload the data into Redis.
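A minimal sketch of steps 1 and 2, assuming the Jedis client and a Guava local cache; the channel name `cache.invalidate` and the key are illustrative assumptions, not JD's actual names:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

public class CacheInvalidationSketch {
    private static final Cache<String, String> LOCAL_CACHE =
            CacheBuilder.newBuilder().maximumSize(10_000).build();

    // Step 1: publisher side (the operations backend), after updating the data.
    static void publishChange(Jedis jedis, String key, String json) {
        jedis.set(key, json);                   // update the L2 (Redis) cache
        jedis.publish("cache.invalidate", key); // notify all application nodes
    }

    // Step 2: subscriber side, run by every application instance on a
    // dedicated thread (subscribe() blocks while listening).
    static void listenForChanges(Jedis jedis) {
        jedis.subscribe(new JedisPubSub() {
            @Override
            public void onMessage(String channel, String key) {
                // Evict the stale local copy; the next read reloads from Redis.
                LOCAL_CACHE.invalidate(key);
            }
        }, "cache.invalidate");
    }
}
```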

Second, let's look at how Youzan handles data consistency: a communication module keeps the nodes in sync.

For a detailed introduction, see Nien's article "Scenario question: Suppose 100,000 people arrive in a sudden rush, how does your system avoid an avalanche?", which reviews Youzan's Transparent Multi-level Cache solution (TMC) in depth.

In addition, there are mature two-level cache middlewares in the industry that use message queues (RocketMQ/Kafka) to keep the local cache and the distributed cache consistent. For details of that architecture, see Nien's video "Hands-on with a 100W-QPS Three-Level Cache Component".

JD's choice of publish-subscribe cache synchronization component

JD uses Redis's channel mechanism to synchronize the local cache with the Redis cache. In Redis's channel mechanism, publish-subscribe is a push model:

  • The SUBSCRIBE command subscribes the client to one or more channels, so it is notified whenever a message is published on any of them.
  • The PUBLISH command sends a message to a channel; every client subscribed to that channel receives a corresponding notification.

Redis's publish-subscribe is also asynchronous: when a message is published to a channel, Redis pushes it to all subscribed clients asynchronously, so the publisher is not blocked waiting for delivery and can continue with other work. This asynchrony improves the system's concurrency and efficiency.

What is the cache pollution problem?


We covered data consistency above; now let's look at cache pollution.

Cache pollution refers to data that lingers in the cache and will never actually be accessed again, yet still occupies cache space.

If the volume of such data is large, or it even fills the cache, then every time new data is written, old data must first be evicted, which adds time overhead to every cache operation.

Therefore, the key to solving cache pollution is identifying data that is accessed only once, or only rarely, and filtering it out for eviction first. The core tool for this is the cache eviction strategy.

The eviction strategies commonly used in caches are mainly:

  • Random
  • LRU
  • LFU

(1) Random: select entries to evict at random, mainly via volatile-random and allkeys-random. Random eviction cannot filter out data that is no longer accessed, so it may still cause cache pollution.

(2) LRU: when cache space runs out, evict the least recently used entry, i.e. the one whose last access is oldest. This keeps the most recently used entries in the cache, improving response time and throughput. Because LRU only considers how recently data was accessed, however, it cannot quickly filter out data that is accessed just once.
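To make LRU concrete, here is the textbook sketch built on `LinkedHashMap` with access ordering; it illustrates the policy only and is not JD's or Guava's actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Evicts the least recently accessed entry once capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        // accessOrder = true: every get() moves the entry to the tail,
        // so the head is always the least recently used entry.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // drop the LRU entry on overflow
    }
}
```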

(3) LFU optimizes on top of LRU: when selecting entries to evict, it first evicts those with the lowest access count, and among entries with the same count, it evicts the one least recently accessed.
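And a deliberately simple LFU sketch that applies exactly this rule (lowest access count first, oldest access as the tie-breaker); the O(n) eviction scan is fine for illustration but far too slow for a real cache, which would use an approximate LFU structure instead:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal LFU: evict the key with the fewest accesses; among equals,
// evict the one whose last access is oldest.
public class LfuCache<K, V> {
    private final int capacity;
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Long> counts = new HashMap<>();
    private final Map<K, Long> lastAccess = new HashMap<>();
    private long clock = 0;  // logical time for recency tie-breaks

    public LfuCache(int capacity) {
        this.capacity = capacity;
    }

    public V get(K key) {
        if (!values.containsKey(key)) {
            return null;
        }
        touch(key);
        return values.get(key);
    }

    public void put(K key, V value) {
        if (!values.containsKey(key) && values.size() >= capacity) {
            evict();
        }
        values.put(key, value);
        touch(key);
    }

    private void touch(K key) {
        counts.merge(key, 1L, Long::sum);
        lastAccess.put(key, ++clock);
    }

    private void evict() {
        K victim = null;
        for (K k : values.keySet()) {
            if (victim == null
                    || counts.get(k) < counts.get(victim)
                    || (counts.get(k).equals(counts.get(victim))
                        && lastAccess.get(k) < lastAccess.get(victim))) {
                victim = k;
            }
        }
        values.remove(victim);
        counts.remove(victim);
        lastAccess.remove(victim);
    }
}
```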

In actual business applications, both LRU and LFU strategies are applied.

LRU and LFU focus on different access characteristics: LRU emphasizes the recency of access, while LFU emphasizes the frequency of access.

Under normal circumstances, real application workloads show good temporal locality, so LRU is the more widely used strategy.

In scan-query scenarios, however, LFU handles cache pollution effectively and should be preferred.

JD's local cache uses Guava, so its eviction policy is LRU: it emphasizes recency, exploits temporal locality well, and fits most data scenarios.

For most local caches, however, Caffeine is the recommended choice: its policy combines LRU and LFU, so it retains the good temporal locality that covers most data scenarios while also tracking access frequency, which avoids pollution in scan-query scenarios. For the internal principles and architecture of Caffeine, and why it outperforms Guava, see Nien's video "100W-QPS Three-Level Cache Component".
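A minimal Caffeine configuration for comparison (sizes, TTL, and the loader are illustrative assumptions); its Window TinyLFU policy tracks both recency and frequency, which is what keeps one-off scans from flushing out genuinely hot entries:

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

import java.time.Duration;

public class CaffeineSketch {
    public static void main(String[] args) {
        LoadingCache<String, String> cache = Caffeine.newBuilder()
                .maximumSize(10_000)
                .expireAfterWrite(Duration.ofMinutes(5))
                .build(CaffeineSketch::loadFromL2);  // loader runs on a miss

        // First call loads via the loader; repeats are served from memory.
        System.out.println(cache.get("hotKey"));
    }

    // Hypothetical loader; in a real two-level setup this would read Redis.
    private static String loadFromL2(String key) {
        return "valueFor:" + key;
    }
}
```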

Considerations for multi-level cache architectures

  1. The local cache occupies JVM memory inside the Java process, so it is not suitable for large volumes of data, and its size must be budgeted.
  2. If the business can tolerate brief inconsistency, the local cache suits read-heavy scenarios best.
  3. Whatever the update strategy, active or passive, the local cache should always have an expiry (TTL).
  4. Consider a scheduled task that re-syncs the cache, to guard against data loss in extreme cases.
  5. In RPC calls, local cache pollution must be avoided; a sensible eviction strategy addresses this.
  6. When the application restarts, the local cache starts empty, so mind the timing of loading from the distributed cache.
  7. When publish-subscribe is used for consistency, a channel that does not persist messages means a lost message becomes a missed local-cache deletion; the high availability of these messages must therefore be addressed.
  8. When a local-cache entry expires, use locking (e.g. synchronized or a per-key lock) so that a single thread loads from the Redis cache, avoiding redundant concurrent loads (see the sketch after this list).
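On point 8, Guava's `Cache.get(key, loader)` already provides this per-key locking: concurrent callers for the same key block while a single thread runs the loader. A minimal sketch (the Redis loader is an illustrative assumption; note that a Guava loader must not return null):

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import redis.clients.jedis.Jedis;

import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

public class SingleFlightLoadSketch {
    private final Cache<String, String> l1 = CacheBuilder.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(5, TimeUnit.SECONDS)
            .build();

    public String get(String key, Jedis redis) throws ExecutionException {
        // For a given key, Guava runs the loader in exactly one thread;
        // other threads block and reuse the result, so a local-cache miss
        // does not stampede Redis (or the database behind it).
        return l1.get(key, () -> {
            String value = redis.get(key);
            if (value == null) {
                throw new IllegalStateException("miss in L2 for " + key);
            }
            return value;
        });
    }
}
```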

In closing: if you run into problems, you can ask the old architect Nien for advice.

The road to architecture is full of ups and downs

Architecture is different from senior development: architecture problems are open-ended, and there is no standard answer to an architecture question.

Because of this, many friends, despite spending a great deal of energy and money, unfortunately never complete the upgrade to architect in their entire careers.

Therefore, during an architecture upgrade or transition, if you really cannot find an effective solution, you can come to Nien, the 40-year-old architect, for help.

Yesterday, for example, a reader was designing the golden-path (core transaction link) architecture for an e-commerce site. At first he could not find an approach, but after ten minutes of voice guidance from Nien, it all suddenly became clear.

References

https://it.sohu.com/a/696701644_121438385

https://blog.csdn.net/crazymakercircle/article/details/128533821

Recommended reading

" Ten billions of visits, how to design a cache architecture "

" Message Push Architecture Design "

" Alibaba 2: How many nodes do you deploy?" How to deploy 1000W concurrency? "

" Meituan 2 Sides: Five Nines High Availability 99.999%. How to achieve it?" "

" NetEase side: Single node 2000Wtps, how does Kafka do it?" "

" Byte Side: What is the relationship between transaction compensation and transaction retry?" "

" NetEase side: 25Wqps high throughput writing Mysql, 100W data is written in 4 seconds, how to achieve it?" "

" How to structure billion-level short videos? " "

" Blow up, rely on "bragging" to get through JD.com, monthly salary 40K "

" It's so fierce, I rely on "bragging" to get through SF Express, and my monthly salary is 30K "

" It exploded...Jingdong asked for 40 questions on one side, and after passing it, it was 500,000+ "

" I'm so tired of asking questions... Ali asked 27 questions while asking for his life, and after passing it, it's 600,000+ "

" After 3 hours of crazy asking on Baidu, I got an offer from a big company. This guy is so cruel!" "

" Ele.me is too cruel: Face an advanced Java, how hard and cruel work it is "

" After an hour of crazy asking by Byte, the guy got the offer, it's so cruel!" "

" Accept Didi Offer: From three experiences as a young man, see what you need to learn?" "

"Nien Architecture Notes", "Nien High Concurrency Trilogy", "Nien Java Interview Guide" PDF, please go to the following official account [Technical Freedom Circle] to get ↓↓↓
