Distributed Cache in Large Internet Companies: Best Practices and Production Cases

1. Core Elements of Cache Design

When we decide to use a cache in an application, we need to design it carefully. A cache architecture looks simple, but it is not: it involves many subtle principles, and a cache used improperly can cause production incidents and even serious failures such as service avalanches.

1. Capacity planning

  • Size of cached objects

  • Number of cached objects

  • Eviction strategy

  • Cache data structures

  • Peak reads per second

  • Peak writes per second

2. Performance optimization

  • Threading model

  • Warm-up method

  • Cache sharding

  • Ratio of hot to cold data

3. High availability

  • Replication model

  • Failover

  • Persistence strategy

  • Cache rebuild

4. Cache monitoring

  • Cache service monitoring

  • Cache capacity monitoring

  • Cache request monitoring

  • Cache response time monitoring

5. Points to note

  • Whether cache penetration can occur

  • Whether large objects are stored

  • Whether the cache is used to implement distributed locks

  • Whether cache-side scripts (Lua) are used

  • Whether race conditions are avoided

2. Best Practices for Cache Design

Good Practice 1

A cache mainly consumes server memory, so before using one, evaluate the size of the data the application needs to cache, including the cache data structures, object sizes, object counts, and expiration times. Then, based on the business situation, estimate the capacity that will be used over a future period, and request and allocate cache resources according to that evaluation. Otherwise resources will be wasted, or the cache space will be insufficient.
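
To make the evaluation concrete, here is a minimal back-of-the-envelope sketch; every number in it (entry count, key/value sizes, per-entry overhead) is an illustrative assumption, not a figure from this article.

```java
// Rough capacity estimate for a planned cache; all constants below are assumptions.
public class CacheCapacityEstimate {
    public static void main(String[] args) {
        long entries          = 10_000_000L; // expected number of cached objects
        long avgKeyBytes      = 40;          // average key size
        long avgValueBytes    = 300;         // average serialized value size
        long overheadPerEntry = 80;          // rough per-entry metadata overhead

        long totalBytes = entries * (avgKeyBytes + avgValueBytes + overheadPerEntry);
        System.out.printf("Estimated cache footprint: %.1f GB%n", totalBytes / 1e9);
        // Leave headroom (keep usage well under instance memory), and plan for
        // roughly double the memory if RDB fork/copy-on-write must be accommodated.
    }
}
```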

Good Practice 2

It is recommended to keep services that use the cache separate: use different cache instances for core and non-core services so that they are physically isolated. If possible, use a separate instance or cluster for each service to reduce the chance of applications affecting one another. The author has often heard of companies that shared a cache across applications, leading to production incidents in which cached data was overwritten and corrupted.

Good Practice 3

Calculate the number of cache instances the application needs based on the memory size each instance provides. A company generally establishes a cache operations team, which virtualizes the cache resources into multiple instances of the same memory size.

For example, if each instance has 4 GB of memory, an application can request as many instances as it needs; such an application must then shard its data (for details, see section 4.4.3 of "Scalable Service Architecture: Framework and Middleware"). Note that if the RDB backup mechanism is used and each instance uses 4 GB of memory, the system needs more than 8 GB of memory per instance: RDB backup forks a child process and relies on copy-on-write, so in the worst case the memory pages are fully duplicated, and roughly double the memory must be reserved.

Good Practice 4

A cache is generally used to speed up database reads: the cache is accessed first, and the database only afterwards, so the cache operation timeout setting is very important. The author once saw a case at an Internet company where an operations error set the cache timeout too long, which exhausted the service's thread pool and ultimately led to a service avalanche.

Good Practice 5

Monitoring must be added to every cache instance. This is very important: we need reliable monitoring of slow queries, large objects, and memory usage.

Good Practice 6

We do not recommend that multiple businesses share one cache instance, but for cost-control reasons this situation often occurs. In that case, require through coding standards that every application's keys carry a unique prefix, and design for isolation, so that caches do not overwrite one another.
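
A minimal sketch of such key namespacing follows; the application prefix and key scheme are illustrative assumptions.

```java
// Namespacing cache keys per application so shared instances cannot collide.
public final class CacheKeys {
    private static final String APP_PREFIX = "order-service:"; // unique per application

    private CacheKeys() {}

    public static String userKey(long userId) {
        return APP_PREFIX + "user:" + userId; // e.g. "order-service:user:42"
    }
}
```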

Good Practice 7

Every cached key must have an expiration time set, and the expiration times must not be concentrated at one moment; otherwise the cache will fill up memory, or a cache avalanche will occur.
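
One common way to spread expirations out is to add random jitter to the TTL. Below is a sketch using the Jedis client; the base TTL and jitter range are illustrative assumptions.

```java
import java.util.concurrent.ThreadLocalRandom;
import redis.clients.jedis.Jedis;

// Adds jitter to the TTL so a batch of keys does not expire at the same instant.
public class JitteredTtl {
    public static void setWithJitter(Jedis jedis, String key, String value) {
        int baseSeconds = 600;                                   // base TTL (illustrative)
        int jitter = ThreadLocalRandom.current().nextInt(0, 61); // 0..60 extra seconds
        jedis.setex(key, baseSeconds + jitter, value);           // SET with expiration
    }
}
```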

Good Practice 8

Data that is accessed infrequently should not be placed in the cache. As we said before, the main purpose of using a cache is to improve read performance.

A colleague once designed a scheduled batch-processing system. Because the batch job computes against a large data model, he kept the model in each node's local cache and consumed update messages from a message queue to keep the local copies fresh. But the model is used only once a month, so using the cache this way is wasteful.

Since it is a batch task, the right approach is to partition the task and process it in batches, using divide-and-conquer, step-by-step computation to obtain the final result.

Good Practice 9

Cached values should not be too large, especially in Redis. Because Redis processes commands on a single thread, serving an oversized value for a single key blocks the processing of all other requests.

Good Practice 10

For keys that hold many fields or members, avoid whole-collection operations such as HGETALL; they block the server and affect other applications' access.
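
A sketch of iterating a large hash incrementally with HSCAN instead of HGETALL follows; it assumes Jedis 3.x-style imports, and the page size is illustrative.

```java
import java.util.Map;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;

// Walks a large hash in small pages so no single command blocks the server.
public class LargeHashScan {
    public static void scanHash(Jedis jedis, String hashKey) {
        ScanParams params = new ScanParams().count(100); // ~100 fields per round trip
        String cursor = ScanParams.SCAN_POINTER_START;   // "0"
        do {
            ScanResult<Map.Entry<String, String>> page = jedis.hscan(hashKey, cursor, params);
            for (Map.Entry<String, String> field : page.getResult()) {
                // process field.getKey() / field.getValue()
            }
            cursor = page.getCursor();
        } while (!"0".equals(cursor));
    }
}
```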

Good Practice 11

A cache is generally used to speed up queries in a transactional system. When a large amount of data must be updated, especially in batch jobs, use batch mode (pipelining), although this scenario is rare.
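
A sketch of such batching with a Redis pipeline follows (Jedis client; the TTL is an illustrative assumption).

```java
import java.util.Map;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

// Queues many writes and flushes them in one round trip instead of one per key.
public class BatchWrite {
    public static void writeAll(Jedis jedis, Map<String, String> entries) {
        Pipeline pipeline = jedis.pipelined();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            pipeline.setex(e.getKey(), 600, e.getValue()); // 10-minute TTL (illustrative)
        }
        pipeline.sync(); // send all queued commands and wait for their replies
    }
}
```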

Good Practice 12

Unless the performance requirements are truly extreme, prefer a distributed cache over a local cache. A local cache is replicated across the service's nodes, and at any given moment the replicas may be inconsistent. If the cached value acts as a switch, and requests in the distributed system can be retried, a duplicated request may land on two nodes where one node's switch is on and the other's is off. If the processing is not idempotent, the request is handled twice and, in severe cases, causes financial loss.

Good Practice 13

When writing to the cache, write only completely correct data. If part of the data is valid and part is invalid, it is better to give up caching than to write partial data into the cache; otherwise null pointers, program exceptions, and similar failures will follow.

Good Practice 14

Under normal circumstances, read in the order: cache first, then database; write in the order: database first, then cache.
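
A minimal cache-aside sketch of this ordering follows; loadFromDb and writeToDb are hypothetical stand-ins for real DAO calls, and the TTL is illustrative.

```java
import redis.clients.jedis.Jedis;

// Cache-aside: read cache -> database; write database -> cache.
public class CacheAside {
    private final Jedis jedis;

    public CacheAside(Jedis jedis) { this.jedis = jedis; }

    public String read(String key) {
        String cached = jedis.get(key);    // 1. try the cache first
        if (cached != null) return cached;
        String fromDb = loadFromDb(key);   // 2. fall back to the database
        if (fromDb != null) {
            jedis.setex(key, 600, fromDb); // 3. repopulate the cache with a TTL
        }
        return fromDb;
    }

    public void write(String key, String value) {
        writeToDb(key, value);             // 1. database first
        jedis.setex(key, 600, value);      // 2. then the cache
    }

    private String loadFromDb(String key) { /* hypothetical DAO call */ return null; }
    private void writeToDb(String key, String value) { /* hypothetical DAO call */ }
}
```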

Good Practice 15

When using a local cache (such as Ehcache), strictly control the number of cached objects and their life cycle. Because of the JVM's characteristics, too many cached objects severely affect JVM performance and can even cause an out-of-memory error.
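
One way to enforce such bounds is sketched below, using the Ehcache 3 builder API (3.5+); the cache name, entry cap, and TTL are illustrative assumptions.

```java
import java.time.Duration;
import org.ehcache.Cache;
import org.ehcache.CacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ExpiryPolicyBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;

// A local cache with a hard entry cap and a bounded lifetime for every entry.
public class BoundedLocalCache {
    public static Cache<Long, String> build() {
        CacheManager manager = CacheManagerBuilder.newCacheManagerBuilder()
                .withCache("users", CacheConfigurationBuilder
                        .newCacheConfigurationBuilder(Long.class, String.class,
                                ResourcePoolsBuilder.heap(10_000))              // at most 10k objects
                        .withExpiry(ExpiryPolicyBuilder
                                .timeToLiveExpiration(Duration.ofMinutes(30)))) // bounded life cycle
                .build(true);
        return manager.getCache("users", Long.class, String.class);
    }
}
```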

Good Practice 16

When using a cache, there must be degradation handling, especially on key business links: when the cache has a problem or becomes unavailable, the request must fall back to the database so the business flow can continue.
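
A sketch of such degradation follows; loadFromDb is a hypothetical DAO call.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.exceptions.JedisException;

// Falls back to the database when the cache errors out, instead of failing the request.
public class DegradingReader {
    private final Jedis jedis;

    public DegradingReader(Jedis jedis) { this.jedis = jedis; }

    public String read(String key) {
        try {
            String cached = jedis.get(key);
            if (cached != null) return cached;
        } catch (JedisException e) {
            // Cache is down or slow: log the error and fall through to the database.
        }
        return loadFromDb(key); // the source of truth keeps the business flow alive
    }

    private String loadFromDb(String key) { /* hypothetical DAO call */ return null; }
}
```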

3. Production Cases of Common Caching Problems

Case 1

Symptom: An application's database load spikes instantaneously.

Cause: A large number of the application's cache keys were given the same fixed expiration time. When they all expired, the database was hit simultaneously for a period of time, putting great pressure on it.

Summary: A cache must be designed deliberately, with full consideration of how to avoid common problems such as cache penetration, cache avalanche, and cache concurrency. Especially for high-concurrency caches, randomize key expiration times: for example, set the expiration to 10 seconds + random(2), i.e. a random value between 10 and 12 seconds (see the TTL jitter sketch under Good Practice 7).

Case 2

Symptom: During a migration, core operations were performed twice, once in each of the old and new systems.

Cause: During the migration, duplicate traffic entered different nodes. Because the migration switch was stored in a local cache, the switch state of the nodes was inconsistent at the moment the switch was flipped: some were on and some were off. Traffic hitting different nodes was therefore processed twice, once under the switch-on logic and once under the switch-off logic.

Summary: Avoid storing a migration switch in a local cache; the switch should be marked on the stateful order itself.

Case 3

Symptom: A module was designed to use a cache to accelerate database reads, but the database load did not drop noticeably.

Cause: The data requested by this module's users did not exist in the database; it was invalid data, so every request missed the cache and penetrated through to the database, and the volume was large.

Summary: A cache must be designed deliberately, with full consideration of how to avoid common problems such as cache penetration, cache avalanche, and cache concurrency. Especially for high-concurrency caches, cache invalid keys as well, to absorb malicious attacks or queries for nonexistent data, whether intentional or unintentional.
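
A sketch of caching misses with a sentinel value follows; the sentinel string and both TTLs are illustrative assumptions, and loadFromDb is a hypothetical DAO call.

```java
import redis.clients.jedis.Jedis;

// Caches "this key does not exist" so invalid keys stop penetrating to the database.
public class PenetrationGuard {
    private static final String NULL_SENTINEL = "__NULL__";
    private final Jedis jedis;

    public PenetrationGuard(Jedis jedis) { this.jedis = jedis; }

    public String read(String key) {
        String cached = jedis.get(key);
        if (NULL_SENTINEL.equals(cached)) return null; // known miss: skip the database
        if (cached != null) return cached;
        String fromDb = loadFromDb(key);
        if (fromDb == null) {
            jedis.setex(key, 60, NULL_SENTINEL);       // short TTL for misses
        } else {
            jedis.setex(key, 600, fromDb);
        }
        return fromDb;
    }

    private String loadFromDb(String key) { /* hypothetical DAO call */ return null; }
}
```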

Case 4

Symptom: The monitoring system alerts that a single hash key in Redis occupies a huge amount of space.

Cause: The application used a hash key. The hash key itself had an expiration time, but the key-value pairs inside it did not, so the hash grew without bound.

Summary: When designing with Redis, if a large number of key-value pairs must be stored, use string keys and set an expiration time on each key; do not store an unbounded collection in a single hash key. In fact, whether you are designing a cache, in-memory structures, or a database, whenever you use a collection data structure you must set a maximum bound on it to avoid exhausting memory. The most common failure of this kind is an out-of-memory error caused by an overflowing collection.
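
A sketch of the string-keys-with-TTL alternative follows; the key scheme and TTL are illustrative assumptions.

```java
import redis.clients.jedis.Jedis;

// One string key per member, each with its own TTL, instead of one unbounded hash.
public class BoundedStorage {
    public static void saveProfile(Jedis jedis, long userId, String profileJson) {
        // Each entry expires on its own rather than accumulating forever in a hash.
        jedis.setex("user:profile:" + userId, 3600, profileJson);
    }
}
```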

Case 5

Symptom: A business project's logic was interrupted by a cache outage, leaving data inconsistent.

Cause: A Redis master/standby failover caused the application's connections to Redis to fail momentarily, and the application had no cache degradation in place.

Summary: For core business, there must be a degradation plan when using a cache. A common degradation solution is to reserve enough capacity at the database level so that when part of the cache has problems, requests can temporarily fall back to the database and the business logic continues without interruption. This requires strict capacity evaluation; see Chapter 3 of "Distributed Service Architecture: Principles, Design, and Practice".

Case 6

Symptom: An application's system load rises and responses slow down; the application is found to be running frequent GC, even logging OutOfMemoryError: GC overhead limit exceeded.

Cause: This is a legacy project that uses the Hibernate ORM framework with the second-level cache enabled, backed by Ehcache. The number of cached objects was not limited in Ehcache, so as it grew, memory became tight and GC ran frequently.

Summary: When using local caches (such as Ehcache, OSCache, or plain application memory), strictly control the number of cached objects and their life cycle.

Case 7

Symptom: A normally running application suddenly alarms that its thread count is too high, and a memory overflow occurs soon afterwards.

Cause: The number of cache connections reached the maximum limit, so the application could not connect to the cache, and the operation timeout was set too large, so all services accessing the cache sat waiting for cache operations to return. Because the cache was overloaded and could not process any requests, these callers kept waiting without timing out and never degraded to the database. In BIO mode the service's thread pools fill up, and so do its callers' thread pools; in NIO mode the service's load climbs, responses slow down, and the service can eventually be overwhelmed.

Summary: When using a remote cache (such as Redis or Memcached), the operation timeout setting is critical. Since we generally design the cache as a means of accelerating database reads, and cache operations can be degraded, a short cache timeout is recommended; if a concrete number must be given, it should be within about 100 milliseconds.
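
A sketch of configuring such a short timeout with a Jedis connection pool follows; the host, port, and pool sizes are illustrative assumptions.

```java
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

// A pool whose connection/socket timeout keeps callers from waiting on a sick cache.
public class ShortTimeoutPool {
    public static JedisPool build() {
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMaxTotal(64);       // cap connections so overload surfaces quickly
        config.setMaxWaitMillis(100); // do not queue forever waiting for a connection
        int timeoutMillis = 100;      // connection and socket timeout, per the guideline
        return new JedisPool(config, "cache.example.internal", 6379, timeoutMillis);
    }
}
```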

Case 8

Symptom: A project used a cache to store business data; after going live an error occurred, and the developers were at a loss.

Cause: The developers did not know how to discover, troubleshoot, locate, and resolve caching problems.

Summary: A degradation plan should be part of the cache design, and when a problem occurs, degradation should be the first response. Also design complete monitoring and alerting so developers can quickly discover a cache problem, then locate and resolve it.

Case 9

Symptom: A project that used a cache passed development and testing, but once in production the service exhibited unpredictable problems.

Cause: The application's cache keys conflicted with those of another application, so the two overwrote each other and logic errors occurred.

Summary: When using a cache, there must be an isolation design: either use different cache instances for physical isolation, or give each application's cache keys a distinct prefix for logical isolation (see the key-prefix sketch under Good Practice 6).
