Comprehensive analysis of classic cache application problems

1 Introduction

As the Internet evolves from simple one-way browsing to personalized, social requests driven by each user's profile, products must analyze and compute over massive amounts of user and relationship data. For back-end services, this means that every user request needs to query the user's profile and a large amount of relationship data; in most scenarios, this information must also be aggregated, filtered, and sorted before being returned to the user.

The CPU is the final execution unit for information processing and program execution. Imagine scaling time so that a single CPU clock cycle lasts one second: at that scale, a main-memory access takes on the order of minutes, while a disk or network I/O takes days or even months.

It is clear that I/O is several orders of magnitude slower than the CPU and memory. If all data were fetched from the database, a request involving multiple database operations would greatly increase response time and could not provide a good user experience.

For web applications in large-scale, high-concurrency scenarios, caching matters even more, and a higher cache hit rate means better performance. Introducing a cache system is essential for reducing response latency and improving user experience, and a well-designed cache architecture is the cornerstone of a high-concurrency system.

The idea of caching is based on the following points:

  • Principle of temporal locality: programs tend to access the same data multiple times within a period of time. For example, a popular product or a trending news item may be viewed by millions or even tens of millions of users. Caching enables efficient reuse of previously retrieved or computed data.

  • Trading space for time: for most systems, the full data set is stored in MySQL or HBase, but their access efficiency is too low for hot paths, so a high-speed access layer is added to speed things up. For example, Redis can serve roughly 110,000 reads per second and 81,000 writes per second.

  • Trade-off between performance and cost: the high-speed access layer raises costs, so performance and cost must be balanced in system design. For example, at the same cost, an SSD offers 10 to 30 times the capacity of memory, but its read and write latency is 50 to 100 times higher.

The introduction of caching will bring the following advantages to the system:

  • Improve request performance

  • Reduce network congestion

  • Reduce service load

  • Enhance scalability

Similarly, introducing caching also brings the following disadvantages:

  • It undoubtedly increases system complexity; development, operations, and maintenance all become significantly more complex.

  • The high-speed access layer costs more than database storage.

  • Since the same piece of data exists in both the cache and the database, and the cache may even hold multiple copies internally, double-writing across these copies can become inconsistent, and the cache system itself faces availability and partitioning problems.

Cache architecture design is full of pitfalls, some obvious and some hidden. An improper design can lead to slower requests and degraded performance, data inconsistency, lower system availability, and even a cache avalanche that leaves the entire system unable to serve traffic.

2. The main cache storage modes

The three modes each have advantages and disadvantages and suit different business scenarios; there is no single optimal mode.

● Cache Aside

Write:  when updating the database, delete the corresponding cache entry; the cache is repopulated on the next read.

Read:  read the cache first; on a cache miss, read the database, write the result back into the cache, and return it to the caller.

Features: this is a lazy-loading approach in which the database is the source of truth. In slightly more complex caching scenarios, the cached value is not simply a row fetched from the database; it may require querying other tables and performing expensive computation before the final value can be produced. This mode suits businesses that need relatively high data consistency, or whose cached values are complex and expensive to rebuild. For example, suppose a cached value is derived from multiple fields across multiple tables, is modified 100 times per minute, but is read only once per minute. Because this mode only deletes the cache on writes, the value is recomputed at most once per minute, greatly reducing overhead.
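As a concrete illustration, here is a minimal Cache Aside sketch in Java using a Jedis-style Redis client; the TTL and the loadFromDb/updateDb helpers are illustrative assumptions standing in for the application's real database access code.

```java
import redis.clients.jedis.Jedis;

public class CacheAsideExample {
    private static final int TTL_SECONDS = 300; // illustrative: cache entries expire after 5 minutes

    private final Jedis jedis = new Jedis("localhost", 6379);

    // Read path: try the cache first, fall back to the database and repopulate the cache.
    public String read(String key) {
        String value = jedis.get(key);
        if (value != null) {
            return value;                         // cache hit
        }
        value = loadFromDb(key);                  // cache miss: load from the source of truth
        if (value != null) {
            jedis.setex(key, TTL_SECONDS, value); // write back so later reads hit the cache
        }
        return value;
    }

    // Write path: update the database first, then delete the stale cache entry.
    public void write(String key, String newValue) {
        updateDb(key, newValue);
        jedis.del(key);                           // next read lazily reloads the fresh value
    }

    // Hypothetical database helpers; replace with real DAO/ORM calls.
    private String loadFromDb(String key) { return "value-of-" + key; }
    private void updateDb(String key, String value) { /* e.g. UPDATE ... WHERE id = key */ }
}
```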

● Read/Write Through

Write:  if the data already exists in the cache, update the database; if it does not, update both the cache and the database.

Read:  on a cache miss, the cache service loads the data from the database and writes it into the cache.

Features:

Read/Write Through is friendly to hot data and is especially suitable for workloads where hot and cold data are clearly separated.

1) Simplifies application code

In Cache Aside mode, the application code remains complex and directly dependent on the database; if multiple applications handle the same data, the logic may even be duplicated. Read Through moves this data-access code from the application into the cache layer, which greatly simplifies the application and gives a cleaner abstraction over database operations.

2) Better read scalability

In most cases, once a cached entry expires, multiple parallel user threads end up hitting the database; with millions of cache entries and thousands of parallel requests, the load on the database rises significantly. Read/Write Through ensures that the application never hits the database directly for these entries, keeping database load to a minimum.

3) Better write performance

In Write Through mode, the application quickly updates the cache and returns, leaving the cache service to propagate the write to the database. When database writes cannot keep pace with cache updates, a rate-limiting mechanism can also be used to schedule database writes during off-peak hours, reducing pressure on the database.

4) Automatically refreshes the cache on expiration

Read/Write Through allows the cache to automatically reload objects from the database when they expire, so the application does not have to hit the database during peak hours; the latest data is always in the cache.
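A minimal Read/Write Through sketch, assuming the application talks only to a cache-service class that owns both the Redis client and a hypothetical database layer; it follows the common interpretation in which every write keeps the cache and the database in sync.

```java
import redis.clients.jedis.Jedis;

/** The application calls only get/put; the cache service decides when to touch the database. */
public class ReadWriteThroughCache {
    private static final int TTL_SECONDS = 600; // illustrative TTL

    private final Jedis jedis = new Jedis("localhost", 6379);

    // Read Through: on a miss the cache service itself loads from the database and caches the result.
    public String get(String key) {
        String value = jedis.get(key);
        if (value == null) {
            value = loadFromDb(key);
            if (value != null) {
                jedis.setex(key, TTL_SECONDS, value);
            }
        }
        return value;
    }

    // Write Through: the cache service keeps the cache and the database in sync on every write.
    public void put(String key, String value) {
        jedis.setex(key, TTL_SECONDS, value);
        writeDb(key, value);
    }

    // Hypothetical database helpers owned by the cache layer, not by the application.
    private String loadFromDb(String key) { return "value-of-" + key; }
    private void writeDb(String key, String value) { /* e.g. UPDATE ... */ }
}
```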

● Write Behind Caching (asynchronous cache write)

Write: only the cache is updated; the cache service flushes the change to the database asynchronously.

Read: on a cache miss, the cache service loads the data and writes it into the cache.

Features: this mode offers the highest write performance; the database is refreshed asynchronously on a schedule, so there is a comparatively high risk of data loss. It suits write-heavy workloads where writes can be merged. In this mode, data is read and updated through the cache, but unlike Read/Write Through, updates are not propagated to the database immediately. Instead, once an update is performed in the cache service, the service tracks a list of dirty records and periodically flushes the current set of dirty records to the database. As a further optimization, the cache service coalesces these dirty records: if the same record is updated, that is, marked dirty, multiple times within the buffer window, only the last update is written. For values that change very frequently, such as stock prices in financial markets, this greatly improves performance: if a price changes 100 times per second, 30 x 100 = 3,000 updates occur in 30 seconds, and coalescing reduces them to a single database write.
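The sketch below illustrates the dirty-record tracking and coalescing described above, using a plain in-memory map and a scheduled flush; the 30-second interval and the loadFromDb/writeDb helpers are illustrative assumptions, not a production write-behind implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Writes go to an in-memory map; a background task periodically flushes coalesced dirty records. */
public class WriteBehindCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // One entry per key: repeated updates to the same key are naturally coalesced to the latest value.
    private final Map<String, String> dirty = new ConcurrentHashMap<>();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    public WriteBehindCache() {
        // Flush the dirty set every 30 seconds; only the last value per key reaches the database.
        flusher.scheduleAtFixedRate(this::flush, 30, 30, TimeUnit.SECONDS);
    }

    public void put(String key, String value) {
        cache.put(key, value);   // the write completes as soon as the cache is updated
        dirty.put(key, value);   // mark the record dirty; earlier pending values are overwritten
    }

    public String get(String key) {
        return cache.computeIfAbsent(key, this::loadFromDb); // load on miss
    }

    private void flush() {
        for (Map.Entry<String, String> e : dirty.entrySet()) {
            writeDb(e.getKey(), e.getValue());       // one database write per key per interval
            dirty.remove(e.getKey(), e.getValue());  // keep entries that changed again during the flush
        }
    }

    // Hypothetical database helpers.
    private String loadFromDb(String key) { return "value-of-" + key; }
    private void writeDb(String key, String value) { /* e.g. batched UPDATE */ }
}
```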

3. Seven classic cache problems

Each problem below is described together with its common solutions.

  1   Centralized cache expiration

Centralized cache expiration mostly becomes a problem under high concurrency. If a large amount of cached data expires within a short window, queries fall through to the database and the pressure on it spikes. For example, when a batch of train or air tickets goes on sale, the system loads them into the cache in one go with a fixed, pre-configured expiration time. When that time arrives, the hot data misses the cache all at once and the system noticeably slows down.

Solution:

  • Use a base time plus a random offset to reduce how often expiration times coincide and avoid collective expiration (see the sketch below). That is, when setting the expiration time for the same class of business data, add a random value to the base expiration time so that entries expire at scattered moments. Requests to the database are then spread out as well, avoiding the pressure of everything expiring at the same instant.
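A minimal sketch of the base-time-plus-jitter idea with a Jedis-style client; the 30-minute base TTL and 5-minute jitter window are illustrative values to be tuned per business.

```java
import java.util.concurrent.ThreadLocalRandom;
import redis.clients.jedis.Jedis;

public class JitteredTtlExample {
    private static final int BASE_TTL_SECONDS = 30 * 60;   // illustrative base expiration: 30 minutes
    private static final int MAX_JITTER_SECONDS = 5 * 60;  // up to 5 extra random minutes

    private final Jedis jedis = new Jedis("localhost", 6379);

    // Spread out expirations so a batch of keys loaded together does not expire together.
    public void cacheWithJitter(String key, String value) {
        int ttl = BASE_TTL_SECONDS + ThreadLocalRandom.current().nextInt(MAX_JITTER_SECONDS);
        jedis.setex(key, ttl, value);
    }
}
```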

  2   Cache penetration

Cache penetration refers to abnormal access patterns that always query keys that do not exist at all, so every request hits the database, for example querying non-existent users or non-existent product ids. Occasional incorrect input from users is not a big problem. However, if an attacker controls a batch of compromised machines and continuously requests keys that are not in the cache, it can seriously degrade system performance, affect normal users, and may even bring the database down. When designing a system we usually consider only normal access patterns, so this situation is easy to overlook.

Solution:

  • The first solution: when a query for non-existent data reaches the database for the first time and the database has no data, still write the key into the cache with a specially agreed value indicating "empty". Subsequent requests for this key then return null directly from the cache. For robustness, always set an expiration time on such empty-value keys, in case real data is written for the key later.

  • The second solution: build a BloomFilter that records the full set of existing keys. When data is accessed, the BloomFilter is consulted first to determine whether the key exists; if it does not, the request returns immediately without querying the cache or the database at all. For example, a database incremental-log parsing framework (such as Alibaba's Canal) can consume incremental data and write it into the BloomFilter. All BloomFilter operations run in memory, so performance is very high; achieving a 1% false-positive rate takes about 1.2 bytes per record on average. Note that a standard BloomFilter supports insertion but not deletion, so for deleted keys it should be combined with the empty-value caching above. Redis (via the RedisBloom module) provides Bloom filters with custom parameters: they can be created with bf.reserve, which takes error_rate (the false-positive rate) and initial_size (the expected number of elements). The lower the error_rate, the more space is required; if the actual number of elements exceeds initial_size, the false-positive rate rises.
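A minimal sketch of the first solution (caching an agreed empty value), assuming a Jedis-style client; the sentinel string and TTLs are illustrative. For the second solution, the RedisBloom module's bf.reserve / bf.add / bf.exists commands would sit in front of this read path.

```java
import redis.clients.jedis.Jedis;

public class NullValueCacheExample {
    private static final String NULL_MARKER = "__NULL__"; // agreed sentinel for "this key has no data"
    private static final int NULL_TTL_SECONDS = 60;       // short TTL so real data can appear later
    private static final int TTL_SECONDS = 600;

    private final Jedis jedis = new Jedis("localhost", 6379);

    public String read(String key) {
        String cached = jedis.get(key);
        if (cached != null) {
            return NULL_MARKER.equals(cached) ? null : cached; // hit, possibly a cached "empty" value
        }
        String value = loadFromDb(key);
        if (value == null) {
            // Key does not exist in the database either: cache the sentinel so the next
            // malicious or incorrect request for this key never reaches the database.
            jedis.setex(key, NULL_TTL_SECONDS, NULL_MARKER);
            return null;
        }
        jedis.setex(key, TTL_SECONDS, value);
        return value;
    }

    private String loadFromDb(String key) { return null; /* hypothetical DB lookup */ }
}
```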

  3   Cache avalanche

A cache avalanche occurs when all or part of the cache machines go down for some reason, causing a flood of requests to fall onto the database and eventually overwhelm it. For example, if the cache service crashes right at a request peak, all requests that would have hit the cache now hit the database; if the database cannot cope, it goes down after alarms fire, and once it is restarted the incoming requests knock it over again.

Solution:

  • Beforehand: design the cache for high availability and deploy Redis as a cluster. Add a switch in front of database access for important business data: when the database is found to be congested and response times exceed a threshold, flip the switch so that part or all of the database requests fail fast (see the sketch after this list).

  • During the incident: introduce a multi-level cache architecture and add cache replicas, for example a local ehcache cache. Introduce rate-limiting and degradation components, and monitor and alert on the cache in real time. Recover promptly by replacing machines or services; various automatic failover strategies can also be used to close abnormal interfaces, stop edge services, and disable some non-core features, ensuring that core functions keep running in extreme scenarios.

  • Afterwards: Redis supports two persistence mechanisms that can be used together. Use AOF to ensure data is not lost, as the first choice for recovery, and use RDB for cold backups of varying depth; if the AOF files are lost or corrupted, RDB can still be used for fast recovery. In addition, back up the RDB data to a remote cloud service, so that even if the server's memory and disk data are lost at the same time, the data can still be pulled from the remote copy for disaster recovery.
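As a rough illustration of the database-access switch mentioned in the first bullet, the sketch below shows a minimal fail-fast guard; the thresholds, the consecutive-slow-query counter, and the runDbQuery helper are illustrative assumptions rather than a full circuit breaker.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** A crude fail-fast switch: when the database looks overloaded, reject requests immediately. */
public class DbFailfastSwitch {
    private static final int SLOW_THRESHOLD_MS = 500;  // a query slower than this counts as "slow"
    private static final int MAX_SLOW_IN_A_ROW = 20;   // trip the switch after this many slow queries

    private final AtomicBoolean open = new AtomicBoolean(false);  // true = fail fast, skip the database
    private final AtomicInteger consecutiveSlow = new AtomicInteger();

    public String query(String sqlKey) {
        if (open.get()) {
            throw new IllegalStateException("database degraded, failing fast"); // caller degrades or falls back
        }
        long start = System.currentTimeMillis();
        String result = runDbQuery(sqlKey);
        long elapsed = System.currentTimeMillis() - start;
        if (elapsed > SLOW_THRESHOLD_MS) {
            if (consecutiveSlow.incrementAndGet() >= MAX_SLOW_IN_A_ROW) {
                open.set(true); // operators (or a timer) reset the switch once the database recovers
            }
        } else {
            consecutiveSlow.set(0);
        }
        return result;
    }

    public void reset() { open.set(false); consecutiveSlow.set(0); }

    private String runDbQuery(String sqlKey) { return "row-for-" + sqlKey; } // hypothetical DB call
}
```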

  4   Cache data inconsistency

When the same piece of data lives in both the cache and the database, the two copies will inevitably diverge at times. If a multi-level cache architecture is introduced, there are multiple cache copies, and inconsistencies can also arise between them. When the cache machine's bandwidth is saturated or the data-center network fluctuates, a cache update can fail and new data never reaches the cache, leaving the cache and the DB inconsistent. During cache rehashing, a cache machine that misbehaves repeatedly, going online and offline several times, triggers multiple rehash updates; a piece of data then exists on several nodes while each rehash updates only one of them, leaving dirty data on some cache nodes. As another example: the data changes, the cache is deleted first, and the database is about to be modified but has not been yet. A request arrives, finds the cache empty, queries the database, reads the old value, and puts it into the cache. The update then completes in the database, and the database and the cache now hold different values.

Solution:

  • Set key expiration times as short as practical, so the cache expires sooner and fresh data is loaded from the DB. This cannot guarantee strong consistency, but it does guarantee eventual consistency.

  • Introduce a retry mechanism for failed cache updates. For example, after several consecutive retries fail, write the operation into a retry queue; when the cache service becomes available again, delete these keys from the cache so that the next query re-plants them from the database.

  • Delayed double delete: first delete the cached entry, then, after writing to the database and waiting for about one second (the exact delay should be tuned to the latency of the business logic), delete the cache again (see the sketch after this list). This removes any dirty data re-planted within that window.

  • Eventual consistency via binlog: decouple the client from the cache and have the application write directly to the database. The database produces binlog entries, the Canal middleware reads the binlog and, throttled by a rate-limiting component, publishes the changes to an MQ; the application consumes the MQ and applies the changes to the Redis cache.

  • Serialize per key: when updating data, route the operation by the data's unique identifier into an in-JVM queue; when a read finds the data missing from the cache, route the "read data + update cache" operation by the same identifier into the same queue. This mildly asynchronizes read requests, so pay attention to read timeouts: every read must return within its deadline, which must be validated against your own workload, and you may need to deploy multiple services that each handle a share of the update operations. If 100 modification operations for the same datum are queued in one in-memory queue and each takes 10 ms, the last read request may wait 10 * 100 = 1000 ms = 1 s for its data, causing long read blocking.
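A minimal sketch of the delayed double delete described above, using a Jedis pool and a scheduler; the one-second delay and the updateDb helper are illustrative and should be tuned to the business's actual read latency.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class DelayedDoubleDeleteExample {
    private static final long DELETE_DELAY_MS = 1000; // tune to the read + repopulate latency of the business

    private final JedisPool pool = new JedisPool("localhost", 6379);
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void update(String key, String newValue) {
        try (Jedis jedis = pool.getResource()) {
            jedis.del(key);             // 1. delete the cache before touching the database
        }
        updateDb(key, newValue);        // 2. write the new value to the database
        // 3. delete again after a delay, wiping any stale value that a concurrent
        //    reader may have re-planted between steps 1 and 2.
        scheduler.schedule(() -> {
            try (Jedis jedis = pool.getResource()) {
                jedis.del(key);
            }
        }, DELETE_DELAY_MS, TimeUnit.MILLISECONDS);
    }

    private void updateDb(String key, String value) { /* hypothetical DB update */ }
}
```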

  5   Concurrent data contention

When online traffic is very heavy, concurrent contention for cached data occurs. In a high-concurrency scenario, if a cached entry happens to expire and the concurrent requests are not coordinated, they all hit the database, putting great pressure on it and, in severe cases, triggering a cache avalanche. High-concurrency contention can also cause data inconsistency. For example, when multiple Redis clients set the same key concurrently, a key whose initial value is 1 is supposed to be changed to 2, then 3, then 4, ending at 4; if the operations instead arrive in the order 4, 3, 2, the key ends up as 2.

Solution:

Distributed lock + timestamp

A distributed lock can be implemented on top of Redis or ZooKeeper: when a key is accessed with high concurrency, requests must acquire the lock first. Message middleware can also be introduced, putting the Redis set operations into a message queue; in short, turn parallel reads and writes into serial ones to avoid resource contention. The ordering problem of key operations can be solved with a timestamp. In most scenarios the data written to the cache is queried from the database, so a timestamp field can be maintained when data is written to the database and carried along with the query result. When writing the cache, compare the data's timestamp with the timestamp of the value already in the cache, and skip the write if it is older, preventing old data from overwriting new data.
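A minimal sketch of the two techniques, assuming a Jedis-style client: a simple lock acquired with SET NX EX and released via a Lua check-and-delete, plus a timestamp guard before overwriting the cache. Key names and TTLs are illustrative, and this is not a full Redlock-style implementation.

```java
import java.util.Collections;
import java.util.UUID;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class ConcurrencyControlExample {
    private final Jedis jedis = new Jedis("localhost", 6379);

    // Try to take a simple distributed lock: SET key token NX EX ttl.
    // Returns the lock token on success, null if someone else holds the lock.
    public String tryLock(String lockKey, int ttlSeconds) {
        String token = UUID.randomUUID().toString();
        String reply = jedis.set(lockKey, token, SetParams.setParams().nx().ex(ttlSeconds));
        return "OK".equals(reply) ? token : null;
    }

    // Release the lock only if we still own it (check-and-delete done atomically in Lua).
    public void unlock(String lockKey, String token) {
        String script =
            "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end";
        jedis.eval(script, Collections.singletonList(lockKey), Collections.singletonList(token));
    }

    // Timestamp guard: only overwrite the cached value if the incoming data is newer.
    // Under heavy contention, run this while holding the lock above so the check and set are not interleaved.
    public void writeIfNewer(String key, String value, long dataTimestamp) {
        String cachedTs = jedis.get(key + ":ts");
        if (cachedTs == null || dataTimestamp > Long.parseLong(cachedTs)) {
            jedis.set(key, value);
            jedis.set(key + ":ts", String.valueOf(dataTimestamp));
        }
    }
}
```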

  6   Hot key problem

In most Internet systems data is divided into hot and cold; keys with a high access frequency are called hot keys, for example hot news and hot comments. When an unexpected event occurs, a huge number of users access the suddenly hot data in an instant. The cache node holding that data hits its NIC, bandwidth, or CPU limit under the traffic, so cache access becomes slow, freezes, or the node even goes down; requests then fall onto the database and the whole service eventually becomes unavailable. Examples include hundreds of thousands or millions of Weibo users chasing a breaking piece of gossip at the same time, online promotions such as flash sales, Double 11, 618 and Spring Festival events, and sudden events such as a celebrity marriage, divorce, or affair.

Solution:

To handle such extremely hot keys, they must first be identified. For important holidays and online promotions, likely hot keys can be estimated in advance from experience. For unexpected events, which cannot be estimated in advance, Spark or Flink stream computing can be used to discover newly emerging hot keys in time; for data published earlier that gradually ferments into a hot key, Hadoop offline batch jobs can find the high-frequency keys in recent history. Statistics can also be collected or reported from the client side. Once a hot key is found, there are several options. First, the hot key can be spread out. A Redis cluster has a fixed 16384 hash slots; each key's CRC16 value modulo 16384 determines its slot. A hot key named hotkey can therefore be split into hotkey#1, hotkey#2, hotkey#3, ... hotkey#n, so the n copies are scattered across multiple cache nodes, and the client accesses a randomly suffixed copy on each request, breaking up the traffic for the hot key (see the sketch below).

Second, the key name can be kept unchanged and a multi-copy, multi-level cache architecture can be prepared in advance, for example using ehcache or a HashMap: once a hot key is detected, load it into the JVM of the service and serve requests for that key directly from the JVM. Third, if there are many hot keys, the monitoring system can track the cache's SLA in real time and reduce the impact of hot keys through rapid capacity expansion.
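A minimal sketch of the key-splitting idea, assuming a plain Jedis-style client for brevity (a real Redis Cluster deployment would use a cluster-aware client such as JedisCluster so each suffixed copy lands in its own slot); the copy count and TTL are illustrative.

```java
import java.util.concurrent.ThreadLocalRandom;
import redis.clients.jedis.Jedis;

public class HotKeySplitExample {
    private static final int COPIES = 8;        // illustrative copy count, sized to the expected traffic
    private static final int TTL_SECONDS = 60;

    private final Jedis jedis = new Jedis("localhost", 6379);

    // Write the same value under hotkey#0 .. hotkey#N-1 so the copies land on different cluster slots.
    public void setHotValue(String hotKey, String value) {
        for (int i = 0; i < COPIES; i++) {
            jedis.setex(hotKey + "#" + i, TTL_SECONDS, value);
        }
    }

    // Each read picks a random suffix, spreading the load across the cache nodes holding the copies.
    public String getHotValue(String hotKey) {
        int suffix = ThreadLocalRandom.current().nextInt(COPIES);
        return jedis.get(hotKey + "#" + suffix);
    }
}
```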

  7   Big key problem

Sometimes unreasonable design leads to very large objects in the cache. These big keys cause data migration to stall. In terms of memory allocation, when a particularly large key needs to grow, a large block of memory has to be requested at once, which also causes stalls; when the big object is deleted, the memory is reclaimed all at once and the stall happens again. In normal development, try to avoid producing big keys; if the cache's latency fluctuates heavily, a big key is a likely culprit. The developer must then locate the source of the big key and refactor the relevant business code. Redis officially provides tooling to scan for big keys (for example, redis-cli --bigkeys), which can be used directly.

Solution:

  • If the data is stored in Redis, for example business data stored as a set, a big key whose set holds tens of thousands of elements takes a long time to write, causing Redis to stall. In this case the data structure can be reworked; the client can also serialize and build the value for such big keys before writing to the cache, and then write it in one operation via restore.

  • Split a big key into multiple keys to minimize the existence of big keys (see the sketch below). At the same time, since a big key that falls through to the DB takes a long time to reload, such keys can be given special treatment, for example a longer expiration time, and, other things being equal, they should be evicted from the cache last.
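A minimal sketch of splitting one big set-style key into hash buckets, assuming a Jedis-style client; the bucket count and key naming are illustrative.

```java
import redis.clients.jedis.Jedis;

/** Split one huge set-like key into N smaller buckets so no single key becomes a "big key". */
public class BigKeySplitExample {
    private static final int BUCKETS = 64; // illustrative bucket count; size it to keep each bucket small

    private final Jedis jedis = new Jedis("localhost", 6379);

    private String bucketKey(String bigKey, String member) {
        int bucket = Math.abs(member.hashCode() % BUCKETS);
        return bigKey + ":" + bucket; // e.g. followers:user42:17
    }

    // Add a member to the bucket it hashes to instead of to one giant set.
    public void add(String bigKey, String member) {
        jedis.sadd(bucketKey(bigKey, member), member);
    }

    // Membership test only touches the (small) bucket that could contain the member.
    public boolean contains(String bigKey, String member) {
        return jedis.sismember(bucketKey(bigKey, member), member);
    }
}
```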
