Nginx + Redis + Ehcache: A Summary of a Large-scale, Highly Concurrent, Highly Available Three-tier Cache Architecture

Nginx

As a middleware, nginx is often used for traffic distribution. Nginx also has its own cache (limited in capacity), which we can use to cache hot data so that user requests hit the cache and return directly, reducing the traffic that reaches the backend servers.

1. Template engine

Usually we can use a template engine such as FreeMarker or Velocity to absorb a large number of requests:

  1. A small system may render all pages on the server side in advance and put them in the cache; subsequent requests for the same page can then be returned directly, without querying the data source or doing any data processing.

  2. For a system with a large number of pages, re-rendering every page whenever a template changes is clearly not desirable. Instead, with nginx+lua (OpenResty), the template is stored separately in the nginx cache, and the data used for rendering is also stored in the nginx cache, with a cache expiration time set to keep the templates as fresh as possible.

2. Double-layer nginx to improve the cache hit rate

When multiple nginx instances are deployed, the cache hit rate of each one may be very low unless some data routing strategy is added. Therefore a double-layer nginx deployment can be used:

  1. The distribution-layer nginx is responsible for the traffic distribution logic and strategy. Based on rules it defines itself, such as hashing on productId and taking the result modulo the number of backend nginx servers, it routes requests for a given product to a fixed backend nginx server (see the routing sketch after this list).

  2. The backend nginx instances cache hot data in their own cache areas (the distribution layer itself may need only a single instance).
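
Below is a minimal Java sketch of the distribution-layer routing rule described above: hash the productId and take it modulo the number of backend nginx servers, so requests for the same product always land on the same backend cache. In production this logic would live in lua inside OpenResty; the host list here is made up.

```java
import java.util.List;

public class DistributionRouter {
    // hypothetical backend nginx addresses
    private final List<String> backendNginx = List.of(
            "http://192.168.0.11", "http://192.168.0.12", "http://192.168.0.13");

    // hash productId, take it modulo the backend count, and return the chosen backend
    public String route(long productId) {
        int index = (Long.hashCode(productId) & Integer.MAX_VALUE) % backendNginx.size();
        return backendNginx.get(index);
    }

    public static void main(String[] args) {
        // the same productId always maps to the same backend, which keeps its cache hot
        System.out.println(new DistributionRouter().route(10001L));
    }
}
```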

 

Redis

If the requested data is not cached in nginx, the request falls through to the redis cache. Redis can cache the full data set, and its concurrency and availability can be improved through horizontal scaling.

1. Persistence mechanism

Persistence means that the data in redis memory is written to disk; the disk files can then be periodically uploaded to cloud storage services such as S3 (AWS) or ODPS (Alibaba Cloud).

If both the RDB and AOF persistence mechanisms are enabled, AOF is used to rebuild the data when redis restarts, because the data in AOF is more complete. It is recommended to enable both: use AOF as the first choice for data recovery to ensure no data is lost, and use RDB for cold backups at different points in time, so that data can still be recovered quickly if the AOF file is lost, damaged, or otherwise unavailable.

A pitfall from practice: if you try to restore data from RDB while the AOF switch is turned on, the restore keeps failing, because redis always loads data from AOF first (if AOF is temporarily turned off, the restore works fine). In that case, first stop redis and disable AOF, copy the RDB file to the corresponding directory, and start redis; then hot-modify the configuration with "config set appendonly yes", which automatically generates an AOF file from the current in-memory data; finally stop redis again, re-enable AOF in the configuration file, and start redis once more, after which it starts normally with the data intact.
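
A small sketch of the "hot enable AOF" step from the pitfall above, using the Jedis client (the same thing can be done with CONFIG SET appendonly yes in redis-cli). Host and port are placeholders.

```java
import redis.clients.jedis.Jedis;

public class EnableAofOnTheFly {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            // redis starts generating an AOF file from the data currently in memory
            jedis.configSet("appendonly", "yes");
            System.out.println("AOF enabled at runtime; an AOF file of the current data set is being written");
        }
    }
}
```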

  1. RDB

    Performs periodic persistence of the data in redis; each persistence run produces a snapshot of the full data set. It has relatively little impact on redis performance, and recovery from an RDB snapshot after a failure is fast.

  2. AOF

    Writes commands to a log file in append-only mode. When redis restarts, the entire data set can be rebuilt by replaying the write commands in the AOF log. (In practice, each logged write first goes to the Linux OS cache, and redis calls fsync once per second to flush the OS cache to disk.) AOF has some performance impact on redis but preserves data integrity as far as possible. Redis uses a rewrite mechanism to keep the AOF file from growing too large, rebuilding the commands from the current in-memory data when appropriate.

2. Redis cluster

  1. replication

    A one-master, multi-slave architecture: the master node handles writes and synchronizes data to the slave nodes (asynchronously), while the slave nodes handle reads. It is mainly used for horizontal scaling with read-write separation. The master node in this architecture must have persistence enabled; otherwise, if the master goes down and restarts, its memory is empty and that empty data set is replicated to the slaves, wiping out all data.

  2. sentinel

    Sentinel is a very important component in the redis cluster architecture. It monitors whether the redis master and slave processes are working properly, sends alert messages to notify administrators when a redis instance fails, and automatically fails over to a slave node when the master goes down, notifying clients of the new master address. Sentinel needs at least 3 instances to be robust itself and to be able to reach a majority in quorum voting when performing a failover.

    The biggest limitation of the first two architectures is that every node holds the same data, so they cannot store massive data sets. The sentinel setup is therefore used when the amount of data is small.

  3. redis cluster

    Redis cluster supports multiple master nodes, and each master node can have multiple slave nodes attached. If a master dies, one of its slaves is automatically promoted to master. Note that in the redis cluster architecture the slave nodes are mainly used for high availability and master-standby failover; if the slaves must also serve reads, the configuration can be changed (and the jedis source code modified to support read-write separation in that setup). Under redis cluster, masters can be scaled out arbitrarily, so read and write throughput can be improved simply by adding masters. Slave nodes can be migrated automatically (so that each master has slaves as evenly as possible), and the redundant slaves across the architecture provide higher system availability (a client sketch follows this list).
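
A minimal Jedis sketch of a client talking to the redis cluster architecture described above. The node addresses are placeholders; the client only needs a few seed nodes to discover the rest of the cluster and route keys to the right master.

```java
import java.util.HashSet;
import java.util.Set;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

public class ClusterClientSketch {
    public static void main(String[] args) {
        Set<HostAndPort> nodes = new HashSet<>();
        nodes.add(new HostAndPort("127.0.0.1", 7000));
        nodes.add(new HostAndPort("127.0.0.1", 7001));
        nodes.add(new HostAndPort("127.0.0.1", 7002));

        // JedisCluster discovers the full topology from the seed nodes
        try (JedisCluster cluster = new JedisCluster(nodes)) {
            cluster.set("product:10001", "{\"id\":10001,\"name\":\"demo\"}");
            System.out.println(cluster.get("product:10001"));
        }
    }
}
```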

 

Ehcache

The tomcat JVM heap cache (ehcache) mainly guards against a large-scale redis failure. If redis goes down on a large scale and a large amount of nginx traffic flows directly into the data production service, this last-level heap cache can still serve some requests and prevent all of them from hitting the DB directly.
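
A minimal Ehcache 3 sketch of the in-heap cache that sits inside the tomcat JVM: a small, bounded heap cache for hot product data, used only as a last line of defence. The cache name, key/value types, and entry limit are placeholders.

```java
import org.ehcache.Cache;
import org.ehcache.CacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;

public class LocalHeapCache {
    public static void main(String[] args) {
        // bound the number of entries so the heap cache cannot exhaust the JVM heap
        CacheManager cacheManager = CacheManagerBuilder.newCacheManagerBuilder()
                .withCache("productCache",
                        CacheConfigurationBuilder.newCacheConfigurationBuilder(
                                Long.class, String.class,
                                ResourcePoolsBuilder.heap(10_000)))
                .build(true);

        Cache<Long, String> productCache =
                cacheManager.getCache("productCache", Long.class, String.class);

        productCache.put(10001L, "{\"id\":10001,\"name\":\"demo\"}");
        System.out.println(productCache.get(10001L));

        cacheManager.close();
    }
}
```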


 

Cache data update strategy

  • For cached data with high freshness requirements, write to the database and the redis cache directly (double write) when a change occurs, so that the cache is as fresh as possible.

  • For data with lower freshness requirements, use asynchronous MQ notification when a change occurs: the data production service listens for the MQ message, asynchronously pulls the latest data from the owning service, and updates the tomcat JVM cache and the redis cache; after the nginx local cache expires, nginx pulls the new data from redis and updates its local cache (see the sketch after this list).
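
A hedged sketch of the MQ-driven update path for low-freshness data: listen for change messages on kafka, pull the latest data, then refresh the local JVM cache and redis. The topic name, message format, and the three helper methods are assumptions for illustration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CacheUpdateListener {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "cache-update-service");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("product-change")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    long productId = Long.parseLong(record.value()); // message carries the changed id
                    String latest = loadFromOwningService(productId); // pull the fresh data
                    updateJvmCache(productId, latest);                // tomcat heap cache (ehcache)
                    updateRedisCache(productId, latest);              // redis
                }
            }
        }
    }

    static String loadFromOwningService(long id) { return "{\"id\":" + id + "}"; } // placeholder
    static void updateJvmCache(long id, String v) { /* ehcache put */ }
    static void updateRedisCache(long id, String v) { /* jedis set */ }
}
```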

 

The classic cache + database read/write pattern: cache aside pattern

  1. When reading, read the cache first; if the data is not in the cache, read the database, put the result into the cache, and return the response.

  2. When updating, delete the cache first, and then update the database

    The reason an update deletes the cache rather than updating it is that, for cached data built from complex logic, recomputing the cache on every data change is an unnecessary burden. Simply delete the cache and let the next read rebuild it; this is a lazy-loading strategy. For example, if a field of a table involved in the cache is modified 20 or 100 times in a minute, the cache would be recomputed 20 or 100 times, yet it might only be read once in that minute, so eagerly updating the cache mostly produces cold data. Caches follow the 80/20 rule: 20% of the data accounts for 80% of the traffic. A sketch of the read and update paths follows.
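
A minimal Jedis-based sketch of the cache-aside pattern described above. The ProductDao class and the key format are assumptions for illustration.

```java
import redis.clients.jedis.Jedis;

public class ProductCacheAside {
    private final Jedis jedis = new Jedis("127.0.0.1", 6379);
    private final ProductDao productDao = new ProductDao();

    // Read path: cache first, fall back to the DB, then lazily populate the cache.
    public String getProduct(long id) {
        String key = "product:" + id;
        String cached = jedis.get(key);
        if (cached != null) {
            return cached;
        }
        String fromDb = productDao.findById(id);
        if (fromDb != null) {
            jedis.set(key, fromDb);
        }
        return fromDb;
    }

    // Write path: delete the cache first, then update the DB; the next read rebuilds the cache.
    public void updateProduct(long id, String newValue) {
        jedis.del("product:" + id);
        productDao.update(id, newValue);
    }

    // Hypothetical DAO standing in for the real data access layer.
    static class ProductDao {
        String findById(long id) { return "{\"id\":" + id + "}"; }
        void update(long id, String value) { /* SQL update */ }
    }
}
```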

 

Database and redis cache double write inconsistency problem

  1. The most basic cache inconsistency problem and its solution

    Problem: if you modify the database first and then delete the cache, and the cache deletion fails, the database holds the latest data while the cache still holds the old data, resulting in inconsistency.

    Solution: delete the cache first, then modify the database. If the cache deletion succeeds but the database modification fails, the database still holds the old data and the cache is empty, so there is no inconsistency.

  2. Analysis of more complex data inconsistencies

    Problem: when data changes, the cache is deleted first and then the database is modified. Before the database modification completes, a concurrent read request finds the cache empty, queries the database, reads the old data, and puts it into the cache; the database modification then succeeds, leaving the cache inconsistent with the database.

    Solution: serialize database updates and cache reads asynchronously. When updating data, route the update operation, by the data's unique identifier, to a queue inside the JVM; each queue has one worker thread that takes operations from the queue and executes them one by one. When an update operation is executed, the cache is deleted first and then the database is updated. If a read request arrives before the update completes and finds the cache empty, it sends a cache-rebuild request to the same queue (using the same routing), where it queues up behind the update, and then waits synchronously for the cache rebuild to finish. Queuing multiple cache-rebuild requests for the same data is pointless, so duplicates can be filtered out. Once the preceding update has finished its database write, the cache-rebuild operation runs, reads the latest data from the database, and writes it into the cache. If the waiting read request's polling finds a value in the cache within its waiting window, it returns that value directly (several waiting requests for the same data may be served this way); if it waits longer than a certain timeout, it reads the old value directly from the database. A sketch of the per-key queue idea appears after the caveats listed below.

    There are a few issues to be aware of with this approach:

  1. Long blocking of read requests: since read requests are now slightly asynchronous, special attention must be paid to timeouts; requests that exceed the timeout query the DB directly, and if that is not handled properly it puts pressure on the DB. You therefore need to test the system's peak QPS and adjust the number of machines and the number of queues per machine accordingly, in order to settle on a reasonable request-wait timeout.

  2. Request routing with multi-instance deployment: the service may be deployed as multiple instances, so it must be guaranteed that requests for the same data are routed to the same service instance through the nginx layer.

  3. Skewed load from hot data routing: the cache is only emptied when the product data is updated, which is the only time concurrent reads and writes matter, so if the update frequency is not too high the impact of this problem is small; however, some machines may indeed end up with a higher load.
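
A hedged sketch of the "route by unique identifier to an in-JVM queue" idea: N single-threaded executors, each backed by its own queue, so all cache-delete/DB-update and cache-rebuild operations for the same id run strictly one after another. The cache and DB helpers are placeholders; the waiting read request would poll the cache with a timeout as described above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerKeySerializedExecutor {
    private final ExecutorService[] queues;

    public PerKeySerializedExecutor(int queueCount) {
        queues = new ExecutorService[queueCount];
        for (int i = 0; i < queueCount; i++) {
            queues[i] = Executors.newSingleThreadExecutor(); // one worker thread per queue
        }
    }

    // same id -> same queue -> operations for that id are serialized
    private ExecutorService queueFor(long productId) {
        return queues[(int) ((productId & Long.MAX_VALUE) % queues.length)];
    }

    // Update path: delete the cache, then update the database, serialized per key.
    public void submitUpdate(long productId, String newValue) {
        queueFor(productId).execute(() -> {
            deleteCache(productId);
            updateDatabase(productId, newValue);
        });
    }

    // Read path on a cache miss: enqueue a rebuild behind any pending update for the same key.
    public void submitCacheRebuild(long productId) {
        queueFor(productId).execute(() -> {
            String latest = readFromDatabase(productId);
            writeCache(productId, latest);
        });
    }

    private void deleteCache(long id) { /* jedis.del(...) */ }
    private void updateDatabase(long id, String v) { /* SQL update */ }
    private String readFromDatabase(long id) { return "{\"id\":" + id + "}"; }
    private void writeCache(long id, String v) { /* jedis.set(...) */ }
}
```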

 

Distributed Cache Rebuild Concurrency Conflict Resolution

The cache production service may be deployed on multiple machines. When the cached data in redis and ehcache has expired or does not exist, a request coming from nginx and a request triggered by a kafka message may arrive at the same time, so both end up pulling the data and writing it to redis, which can cause a concurrency conflict. This can be solved with a distributed lock based on redis or zookeeper, so that the passive cache rebuild triggered by a request and the active cache rebuild triggered by the listener do not conflict; by comparing a time field, older data is discarded and only the latest data is written to the cache (see the sketch below).
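
A hedged sketch of avoiding the rebuild conflict with a redis-based lock plus a timestamp comparison: whoever holds the lock only writes its data if it is newer than what is already cached. The key names and the time-field format are assumptions.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class CacheRebuildWithLock {

    public void rebuild(Jedis jedis, long productId, String data, long dataTime) {
        String lockKey = "lock:product:" + productId;
        String cacheKey = "product:" + productId;
        String timeKey = "product:" + productId + ":time";

        // acquire a simple distributed lock (expires automatically to avoid deadlock)
        String ok = jedis.set(lockKey, "1", SetParams.setParams().nx().ex(10));
        if (ok == null) {
            return; // someone else is rebuilding this entry; skip
        }
        try {
            String cachedTime = jedis.get(timeKey);
            // only overwrite the cache if our data is newer than what is already there
            if (cachedTime == null || dataTime > Long.parseLong(cachedTime)) {
                jedis.set(cacheKey, data);
                jedis.set(timeKey, String.valueOf(dataTime));
            }
        } finally {
            jedis.del(lockKey); // release the lock
        }
    }
}
```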

 

Cache cold start and cache warm-up solutions

When the system starts for the first time, a flood of requests arrives while the cache is still empty, which may crash the DB and make the system unusable. The same problem occurs if all of redis's cached data is lost abnormally. Data can therefore be loaded into redis in advance to avoid this cold-start problem. Of course this cannot be the full data set: based on access patterns from a comparable day, the frequently accessed hot data can be counted in real time, and since this hot data set is still fairly large, multiple services need to read from and write to redis in parallel in a distributed way (which is why a zk distributed lock is used).

Access traffic is reported to kafka via nginx+lua, and Storm consumes the data from kafka and counts the number of accesses for each product in real time. The counts are stored in an LRU in-memory data structure (apache commons collections LRUMap), chosen for its high in-memory performance and lack of external dependencies. When each storm task starts, it writes its own id into the same zk node under a zk distributed lock. Each storm task is responsible for its own hot-data statistics: every so often it traverses the map and maintains an updated list of the top 1000 entries, and a background thread periodically synchronizes this top-1000 hot-data list to zk, storing it in a znode corresponding to that storm task (a counting sketch follows).
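
A hedged sketch of the in-memory counting done inside each storm task: a size-bounded LRUMap keeps per-product access counts, and the most-accessed entries can be sorted into a top-N list. The map size and top-N value are placeholders.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.commons.collections4.map.LRUMap;

public class HotDataCounter {
    // bounded map: rarely touched products are evicted automatically
    private final LRUMap<Long, Long> counts = new LRUMap<>(1000);

    public synchronized void record(long productId) {
        counts.merge(productId, 1L, Long::sum);
    }

    // return the ids of the n most frequently accessed products seen so far
    public synchronized List<Long> topN(int n) {
        return counts.entrySet().stream()
                .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```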

The service is deployed as multiple instances. On startup, each instance reads the node data containing the storm task id list maintained above, then tries to acquire the zk distributed lock of the znode corresponding to each taskid, one by one. If it acquires the lock, it takes the lock on that taskid's status and checks the warm-up status; if warm-up has not been done yet, it takes the hot-data list for that taskid, queries the DB, and writes the data into the cache. If it fails to acquire a taskid's distributed lock, it simply moves on and tries the next taskid's lock in the next iteration. In this way multiple service instances coordinate through zk distributed locks and warm the cache up in parallel (a warm-up sketch follows).
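
A hedged sketch of the parallel warm-up using Curator's zk lock recipe: each service instance walks the taskid list, tries the zk lock per taskid, and only the lock holder warms up that task's hot list. The zk paths, the taskid source, and the warm-up helpers are assumptions.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ParallelCacheWarmup {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        List<String> taskIds = loadTaskIdList(zk); // read the storm taskid list node
        for (String taskId : taskIds) {
            InterProcessMutex lock = new InterProcessMutex(zk, "/warmup-lock/" + taskId);
            if (!lock.acquire(1, TimeUnit.SECONDS)) {
                continue; // another instance is warming this task up; try the next taskid
            }
            try {
                if (!alreadyWarmedUp(zk, taskId)) {
                    for (long productId : loadHotListFromZk(zk, taskId)) {
                        writeCacheFromDb(productId); // query DB, write redis/ehcache
                    }
                    markWarmedUp(zk, taskId);
                }
            } finally {
                lock.release();
            }
        }
        zk.close();
    }

    static List<String> loadTaskIdList(CuratorFramework zk) { return List.of(); }       // placeholder
    static boolean alreadyWarmedUp(CuratorFramework zk, String taskId) { return false; } // placeholder
    static List<Long> loadHotListFromZk(CuratorFramework zk, String taskId) { return List.of(); }
    static void writeCacheFromDb(long productId) { /* DB query + cache put */ }
    static void markWarmedUp(CuratorFramework zk, String taskId) { /* set status node */ }
}
```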

 

Solution for cache hotspots making the system unusable

When a huge number of requests for the same piece of data flood in at once, the hash-based routing strategy sends them all to the same application-layer nginx, which may be overwhelmed. If the requests continue, they spill over and affect the other nginx instances, eventually bringing down the whole nginx layer and making the entire system unusable.

A traffic-distribution strategy for hotspot caches based on nginx+lua+storm, with automatic degradation, solves this problem. Data whose access count is greater than n times the average of the bottom 95% can be flagged as a hotspot; storm then sends an HTTP request directly to the traffic-distribution nginx to mark it, and also sends the complete cached data for the hotspot to all application-layer nginx servers, which store it directly in their local caches.

When the traffic-distribution nginx accesses the corresponding data and finds the hotspot flag, it immediately degrades its routing strategy for that data: instead of being hashed to a single application-layer nginx, requests for it are distributed across all application-layer nginx instances. Storm also needs to keep the previously identified hotspot list and compare it with the newly computed one; if a piece of data is no longer hot, storm sends an HTTP request to the traffic-distribution nginx to remove its hotspot flag.

 

Cache Avalanche Solutions

If the redis cluster collapses completely, the cache service's requests to redis all hang and tie up resources; a large number of requests from the cache service then fall through to the source service to query the DB, which collapses under the pressure, so requests to the source service also hang and tie up resources. The cache service ends up spending most of its resources fruitlessly waiting on redis and the source service, until it can no longer provide service itself, and eventually the entire site collapses.

The before-the-fact solution is to build a highly available redis cluster: a master-slave architecture with one master and multiple slaves per shard, so that if a master goes down a slave takes over automatically, ideally deployed across two data centers.

The during-the-fact solution is threefold: deploy a layer of ehcache that can absorb part of the pressure when redis collapses completely; isolate the resources used for redis cluster access so that not all resources end up waiting on it, and deploy a corresponding circuit-breaker strategy for failed redis access together with a degradation strategy for the redis cluster; and apply rate limiting and resource isolation to access to the source service. A hedged sketch of the isolation and circuit-breaker idea follows.
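
A hedged sketch of the resource-isolation plus circuit-breaker plus degradation idea using a Hystrix-style command; the article does not name a specific library, Hystrix is simply one common choice. Redis access runs in an isolated thread pool; if it fails or the circuit opens, the fallback degrades to the local ehcache.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class GetProductFromRedisCommand extends HystrixCommand<String> {
    private final long productId;

    public GetProductFromRedisCommand(long productId) {
        // the group key determines the isolated thread pool used for redis access
        super(HystrixCommandGroupKey.Factory.asKey("RedisCacheGroup"));
        this.productId = productId;
    }

    @Override
    protected String run() {
        return readFromRedis(productId); // may fail or time out when the redis cluster is down
    }

    @Override
    protected String getFallback() {
        return readFromEhcache(productId); // degrade to the in-heap cache
    }

    private String readFromRedis(long id) { /* jedis.get(...) */ return null; }  // placeholder
    private String readFromEhcache(long id) { /* ehcache get */ return null; }   // placeholder
}
```

A call site would wrap redis reads as new GetProductFromRedisCommand(id).execute(); the command runs in the isolated pool and falls back automatically once the circuit is open, which also provides the half-open recovery behaviour mentioned below.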

The after-the-fact solution: if the redis data can be restored from a backup, restore it and restart redis; if the redis data is completely lost or too stale, quickly warm up the cache and then restart redis. Then, thanks to the half-open state of the resource isolation and circuit breaker, once redis is found to be accessible again, all requests are automatically routed back to it.

 

Cache penetration solution

When the data exists in none of the cache levels and the DB query also returns nothing, a large number of such requests go straight to the DB, causing high concurrency on the DB. To solve cache penetration, return a value carrying an empty marker for data that does not exist in the DB and save it to every cache level; because data modifications are listened to asynchronously, when the data is actually written later, the new value is pushed down to the caches (a sketch follows).
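
A hedged sketch of the empty-marker idea for cache penetration: when the DB has no row for the id, store a sentinel value, here with a short TTL, which is an extra assumption, so repeated requests for the missing id stop hitting the DB.

```java
import redis.clients.jedis.Jedis;

public class CachePenetrationGuard {
    private static final String EMPTY_MARKER = "__EMPTY__"; // hypothetical sentinel value

    public String getProduct(Jedis jedis, long id) {
        String key = "product:" + id;
        String cached = jedis.get(key);
        if (cached != null) {
            // the empty marker means "known to be missing", so do not query the DB again
            return EMPTY_MARKER.equals(cached) ? null : cached;
        }
        String fromDb = queryDb(id);
        if (fromDb == null) {
            jedis.setex(key, 60, EMPTY_MARKER); // cache the miss so the DB is not hammered
            return null;
        }
        jedis.set(key, fromDb);
        return fromDb;
    }

    private String queryDb(long id) { return null; } // placeholder DAO call
}
```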

 

Nginx cache expiration causing a surge in redis load

When caching data locally in nginx, set a randomized expiration time, so that entries do not all expire at the same moment and a large number of requests do not hit redis at once.
