37_Analysis and solution design for the cache + database double-write inconsistency problem in high-concurrency scenarios


We can now start developing the business system.

Where to start? Start with the simpler part: the cache for the data with the highest real-time requirements.

For the data cache with high real-time requirements, the choice is the inventory service.

Inventory may be modified, and every modification must also update the cached data; whenever the inventory data in the cache expires or is cleared, the front-end nginx service sends a request to the inventory service to fetch the latest data.

For inventory, the idea is: when writing to the database, update the redis cache directly.

In reality it is not that simple. A real problem is involved here: when the database and the cache are double-written, the data can become inconsistent.

So, using the inventory service with its high real-time requirements as the example, we will explain the database + cache double-write inconsistency problem and its solutions.

Database and cache double-write inconsistency is a very common problem, and it is the first one to solve in a large-scale cache architecture.

Once the whole large-scale cache architecture has been explained, the overall architecture is quite complex and can cope with all sorts of unusual and extreme situations.

There is also the possibility of mistakes; the lecturer is not superman and not omniscient.

Lecturing, like writing a book, can go wrong; there may be aspects of the design that I have not considered.

Some solutions are also only suitable for certain scenarios; in other scenarios you may need to optimize and adjust them before they apply to your own project.

If you have any questions or insights about these solutions, feel free to contact me and discuss them.

If I really have explained something incorrectly, or have not thought something through in some places, I can add a supplement in the video and update it on the website.

Please bear with me.

 


1. The most basic cache inconsistency problem and its solution

Problem: modify the database first, then delete the cache. If deleting the cache fails, the database holds new data while the cache holds old data, and the two are inconsistent.

Solution

Delete the cache first, then modify the database. If the cache deletion succeeds but the database modification fails, the database still holds the old data and the cache is empty, so the data does not become inconsistent.

Because the cache is empty at read time, the old data is read from the database and then written back into the cache.
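As a rough illustration, here is a minimal Java sketch of this write/read pattern. The Cache and InventoryDao interfaces and the key format are placeholders invented for the example, not the project's actual code.

```java
import java.util.Optional;

// Minimal sketch of "delete cache first, then update database" (names are hypothetical).
public class InventoryCachePattern {

    // Hypothetical minimal abstractions for the redis cache and the database access layer.
    interface Cache { void delete(String key); Optional<Long> get(String key); void put(String key, long value); }
    interface InventoryDao { void updateStock(long productId, long stock); Optional<Long> findStock(long productId); }

    private final Cache cache;
    private final InventoryDao dao;

    public InventoryCachePattern(Cache cache, InventoryDao dao) {
        this.cache = cache;
        this.dao = dao;
    }

    // Write path: delete the cached value first, then modify the database.
    public void updateStock(long productId, long newStock) {
        cache.delete("inventory:" + productId);
        dao.updateStock(productId, newStock);
    }

    // Read path: on a cache miss, read from the database and repopulate the cache.
    public Optional<Long> getStock(long productId) {
        String key = "inventory:" + productId;
        Optional<Long> cached = cache.get(key);
        if (cached.isPresent()) {
            return cached;
        }
        Optional<Long> fromDb = dao.findStock(productId);
        fromDb.ifPresent(stock -> cache.put(key, stock));
        return fromDb;
    }
}
```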

2. Analysis of a more complicated data inconsistency

The data changes: the cache is deleted first, and then the database is about to be modified, but the modification has not yet completed.

A read request arrives, reads the cache, finds it empty, queries the database, gets the old data from before the modification, and puts it into the cache.

The data change process then completes the database modification.

Done: the data in the database and the data in the cache are now different.
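To make the interleaving concrete, here is a toy Java demonstration of this race, with in-memory maps standing in for redis and the database and sleeps forcing the bad ordering; it is purely illustrative, not the real services.

```java
import java.util.concurrent.ConcurrentHashMap;

// Toy reproduction of the race: the writer deletes the cache, the reader sneaks in,
// reads the old DB value and caches it, and only then does the writer update the DB.
public class DoubleWriteRaceDemo {
    static final ConcurrentHashMap<String, Long> cache = new ConcurrentHashMap<>(); // stands in for redis
    static final ConcurrentHashMap<String, Long> db = new ConcurrentHashMap<>();    // stands in for the database

    public static void main(String[] args) throws InterruptedException {
        db.put("stock", 100L); // old value

        Thread writer = new Thread(() -> {
            cache.remove("stock");          // step 1: delete the cache
            sleep(100);                     // the database update has not happened yet...
            db.put("stock", 99L);           // step 3: ...now the database is finally updated
        });

        Thread reader = new Thread(() -> {
            sleep(50);                      // arrives between step 1 and step 3
            Long value = cache.get("stock");
            if (value == null) {            // cache miss: fall back to the database
                value = db.get("stock");    // reads the old value (100)
                cache.put("stock", value);  // step 2: old value is written back into the cache
            }
        });

        writer.start();
        reader.start();
        writer.join();
        reader.join();

        // Prints "db=99, cache=100": database and cache are now inconsistent.
        System.out.println("db=" + db.get("stock") + ", cache=" + cache.get("stock"));
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```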

3. Why does this problem occur under hundreds of millions of requests and high concurrency?

This problem can only occur when a piece of data is being read and written concurrently.

If your concurrency is very low, especially the read concurrency, say 10,000 visits per day, then only very rarely will the inconsistency scenario just described appear.

But if there are hundreds of millions of requests per day and tens of thousands of concurrent reads per second, then as long as there are also data update requests every second, the database + cache inconsistency described above can occur.

Once concurrency is high, the problems multiply.

4. Asynchronously serializing database and cache update and read operations

When updating data, route the operation by the data's unique identifier and send it to a JVM-internal queue.

When reading data, if the data is not found in the cache, the "reload data + update cache" operation is routed by the same unique identifier and sent to the same JVM-internal queue.

One queue corresponds to one worker thread.

Each worker thread takes the operations from its queue and executes them serially, one by one.

In this scheme, a data change operation executes first: it deletes the cache and then goes to update the database, but the update has not yet completed.

At this point, if a read request arrives and sees an empty cache, it first sends a cache update request to the queue, where it backs up behind the data change; the read then waits for the cache update to complete.

There is an optimization point here: within one queue, it is pointless to chain multiple cache update requests for the same data, so they can be filtered. If there is already a pending cache update request for this data in the queue, there is no need to enqueue another one; just wait for that earlier update request to complete.

After the worker thread for that queue has finished the previous operation, the database modification, it performs the next operation, the cache update: it reads the latest value from the database and writes it into the cache.

If, while still inside its waiting window, the read request's polling finds that the value is now available, it returns the value directly; if the request waits longer than a certain timeout, it instead reads the current (old) value directly from the database.
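Below is a simplified Java sketch of this queue-based serialization, purely to illustrate the idea; the Cache and InventoryDao interfaces, the queue count, the timeout and all other names are assumptions made for the example, and error handling, shutdown and metrics are omitted.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.*;

// Sketch: per-key memory queues serialize "delete cache + update DB" and "reload + refresh cache".
public class InventoryQueueRouter {

    interface Cache { Optional<Long> get(String key); void put(String key, long value); void delete(String key); }
    interface InventoryDao { void updateStock(long productId, long stock); long findStock(long productId); }

    private static final int QUEUE_COUNT = 20;        // memory queues (and worker threads) per JVM
    private static final long READ_TIMEOUT_MS = 200;  // maximum time a read request may wait

    private final BlockingQueue<Runnable>[] queues;
    private final Map<Long, Boolean> pendingRefresh = new ConcurrentHashMap<>(); // dedup of cache refreshes
    private final Cache cache;
    private final InventoryDao dao;

    @SuppressWarnings("unchecked")
    public InventoryQueueRouter(Cache cache, InventoryDao dao) {
        this.cache = cache;
        this.dao = dao;
        this.queues = new BlockingQueue[QUEUE_COUNT];
        for (int i = 0; i < QUEUE_COUNT; i++) {
            queues[i] = new LinkedBlockingQueue<>();
            BlockingQueue<Runnable> queue = queues[i];
            Thread worker = new Thread(() -> {        // one worker thread per queue, executing serially
                while (true) {
                    try {
                        queue.take().run();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
            worker.setDaemon(true);
            worker.start();
        }
    }

    // Route all operations for one product to a fixed queue, based on its unique identifier.
    private BlockingQueue<Runnable> queueFor(long productId) {
        return queues[(int) (Math.abs(productId) % QUEUE_COUNT)];
    }

    // Write path: delete the cache, then update the database, serialized in the product's queue.
    public void updateStock(long productId, long newStock) {
        queueFor(productId).offer(() -> {
            cache.delete("inventory:" + productId);
            dao.updateStock(productId, newStock);
        });
    }

    // Read path: on a cache miss, enqueue one deduplicated cache refresh, then poll with a timeout.
    public long getStock(long productId) throws InterruptedException {
        String key = "inventory:" + productId;
        Optional<Long> cached = cache.get(key);
        if (cached.isPresent()) {
            return cached.get();
        }
        // Enqueue a refresh only if no refresh for this product is already waiting in the queue.
        if (pendingRefresh.putIfAbsent(productId, Boolean.TRUE) == null) {
            queueFor(productId).offer(() -> {
                cache.put(key, dao.findStock(productId)); // read the latest value from the DB, write to cache
                pendingRefresh.remove(productId);
            });
        }
        // Poll the cache until the refresh lands or the timeout expires.
        long deadline = System.currentTimeMillis() + READ_TIMEOUT_MS;
        while (System.currentTimeMillis() < deadline) {
            Optional<Long> refreshed = cache.get(key);
            if (refreshed.isPresent()) {
                return refreshed.get();
            }
            Thread.sleep(10);
        }
        // Timed out: fall back to reading the current (possibly old) value directly from the database.
        return dao.findStock(productId);
    }
}
```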

5. Issues to watch out for with this solution under high concurrency

(1) Read requests blocked for too long

Because read requests are now lightly asynchronized, we must pay attention to read timeouts: every read request must return within the timeout window.

The biggest risk of this solution is that data may be updated so frequently that a large number of update operations pile up in the queues, a large number of read requests then time out, and finally a large number of requests go straight to the database.

Be sure to run tests that simulate real conditions to see how frequently the data is actually updated.

In addition, since one queue may hold a backlog of update operations for multiple data items, test against your own business profile; you may need to deploy multiple service instances, each taking a share of the data update operations.

If a memory queue actually backs up 100 inventory modification operations, and each one takes 10 ms to complete, then the read request for the last product may wait 10 ms * 100 = 1000 ms = 1 s before it gets its data.

At that point, read requests are blocked for far too long.

You must run stress tests based on how the business system actually operates and simulate the production environment: see how many update operations a memory queue may back up during the busiest period, and therefore how long the read requests for the affected data may hang. If read requests must return within 200 ms, and the calculation shows that even at the busiest time the backlog is at most 10 update operations, i.e. a wait of at most 200 ms, then it is fine.

If a memory queue may back up especially many update operations, then add machines, so that the service instance deployed on each machine handles less data; each memory queue then holds a smaller backlog of update operations.

In fact, based on previous project experience, the write frequency of this kind of data is generally very low, so normally the backlog of updates in the queues should be very small.

For projects aimed at high read concurrency with a read-through cache architecture, write requests are generally very few compared to reads; a write QPS of a few hundred per second is already quite a lot.

500 write operations per second, split into 5 slices, is 100 write operations per 200 ms.

On a single machine with 20 memory queues, each memory queue may back up about 5 write operations; performance tests show each write operation generally completes in about 20 ms.

So a read request for data in any memory queue hangs for at most about 5 * 20 ms = 100 ms, and will definitely return within 200 ms.

If write QPS grows 10x: from the calculation above we know a single machine can handle several hundred write QPS without problems, so expand the machines by 10x as well: 10 machines, 20 queues per machine, 200 queues in total.
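As a back-of-the-envelope check, the small Java snippet below reproduces this estimate; all of the numbers are the illustrative figures from the text, not measurements, and you can bump the machine count to see the effect of scaling out.

```java
// Rough estimate of how long a read may wait behind queued write operations.
public class BacklogEstimate {
    public static void main(String[] args) {
        int writeQps = 500;          // write operations per second across the whole service
        int machines = 1;            // service instances (set to 10 to model the 10x expansion)
        int queuesPerMachine = 20;   // memory queues per instance
        int perWriteMs = 20;         // measured cost of one write operation
        int readTimeoutMs = 200;     // read requests must return within this window

        // Writes that can pile up in one queue within one read-timeout window.
        double writesPerWindow = writeQps * (readTimeoutMs / 1000.0);
        double backlogPerQueue = writesPerWindow / (machines * queuesPerMachine);
        double worstReadWaitMs = backlogPerQueue * perWriteMs;

        // With the figures above: backlog 5.0 ops per queue, worst read wait 100 ms, under the 200 ms limit.
        System.out.printf("backlog per queue: %.1f ops, worst read wait: %.0f ms%n",
                backlogPerQueue, worstReadWaitMs);
    }
}
```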

In most cases this is how it plays out: the vast majority of read requests get their data directly from the cache.

In a small number of cases, a read collides with a data update as described above: the update operation enters the queue first, and a large number of read requests for that data may then arrive in an instant, but thanks to the deduplication optimization, only a single cache update operation follows it in the queue.

Once the data update completes, the cache update operation triggered by the reads also completes, and then all the temporarily waiting read requests can read the data from the cache.

(2) Read request concurrency is too high

Stress testing is also needed here: when the situation above does occur, there is a risk that a sudden flood of read requests will hang on the service with delays of tens of milliseconds; whether the service can withstand it, and how many machines are needed to absorb the worst-case peak, must be measured.

However, not all data is updated at the same moment, so the cache does not become invalid all at once; each time only a small amount of data has its cache cleared, and only the read requests for that data arrive and wait, so the concurrent volume should not be particularly large.

Estimating read and write requests at a 1:99 write-to-read ratio, with a read QPS of 50,000 per second, there may be only about 500 update operations per second.

With 500 write QPS in a given second, you have to measure carefully: those writes may affect 500 data items, and once those 500 items are invalidated in the cache, how many read requests will miss the cache and hang on the service?

Generally speaking, at ratios like 1:1, 1:2, or 1:3, perhaps 1,000 read requests per second will hang on the inventory service, each hanging for at most 200 ms before returning.

So at any given moment, perhaps 200 read requests are hanging on a single machine at the same time.

A single machine hanging 200 read requests is still fine.

But at 1:20, with 500 data items updated per second, the read requests corresponding to those 500 items would be 20 * 500 = 10,000 per second.

If all 10,000 read requests hang on the inventory service, it will be overwhelmed.

(3) Request routing for multi-instance deployment

Since multiple instances of this service may be deployed, it must be ensured that the data update requests and cache update requests for the same item are routed through the nginx server to the same service instance.
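Conceptually, the routing rule is just a stable hash of the product's unique identifier mapped to an instance; in the real deployment this rule lives in the nginx layer, but the hypothetical Java sketch below shows the idea.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

// Illustration only: a stable hash of the product id always picks the same service instance,
// so update requests and cache-refresh requests for one product land on the same JVM.
public class InstanceRouter {

    private final List<String> instances; // e.g. ["inventory-svc-01:8080", "inventory-svc-02:8080"]

    public InstanceRouter(List<String> instances) {
        this.instances = instances;
    }

    public String instanceFor(long productId) {
        CRC32 crc = new CRC32();
        crc.update(Long.toString(productId).getBytes(StandardCharsets.UTF_8));
        return instances.get((int) (crc.getValue() % instances.size()));
    }

    public static void main(String[] args) {
        InstanceRouter router = new InstanceRouter(
                List.of("inventory-svc-01:8080", "inventory-svc-02:8080"));
        // The same product id always maps to the same instance.
        System.out.println(router.instanceFor(10001L));
        System.out.println(router.instanceFor(10001L));
    }
}
```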

(4) Routing of hot products causing request skew

If the read and write requests for a particular product are especially heavy, they all hit the same queue on the same machine, which may put excessive pressure on that machine.

That said, the cache is only cleared when the product's data is actually updated, which is what triggers the concurrent reading and writing in the first place; so if the update frequency is not too high, the impact of this problem is not particularly large.

But it is still possible that some machines will carry a higher load.
