Redis cache avalanche and cache penetration

1. Cache avalanche

Cache avalanche refers to the situation where a large amount of cached data expires at the same moment, so that a burst of concurrent requests all fall through to the database at once. The resulting surge in load can bring the database down, which in turn sets off a chain reaction that crashes the whole system.

[Solution]
From the analysis above, an avalanche can only happen when a large batch of cache entries expires at the same time, so the fix is to spread the expiration times out. There are many ways to do this: give hot data a longer TTL and cold data a shorter one, assign different expiration times to different categories of data, or add a random jitter to each key's expiration time.
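Below is a minimal sketch of the random-jitter idea using the redis-py client; the key names, base TTLs, and jitter range are illustrative values chosen for this example, not recommendations.

```python
import random

import redis

r = redis.Redis(host="localhost", port=6379)

def cache_set_with_jitter(key, value, base_ttl_seconds=3600, jitter_seconds=600):
    """Cache a value with a randomized expiration so that keys written
    together do not all expire at the same moment."""
    ttl = base_ttl_seconds + random.randint(0, jitter_seconds)
    r.setex(key, ttl, value)

# Example: hot data gets a longer base TTL, cold data a shorter one.
cache_set_with_jitter("product:1001", "hot item", base_ttl_seconds=7200)
cache_set_with_jitter("product:9999", "cold item", base_ttl_seconds=600)
```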

2. Cache penetration

Cache penetration refers to queries for data that exists neither in the cache nor in the database. If an attacker deliberately and repeatedly requests such missing data, every request misses the cache and falls through to the database, putting needless pressure on it.

[Solution]

  1. When a database lookup for missing data comes back empty, still update the cache, storing an empty value for that key. The next request for the same missing key can then be answered directly from the cache without touching the database (see the sketch after this list).
    - Cons: obviously, when many different keys are missing, this wastes a great deal of cache memory.
  2. Use a Bloom filter; its principle is explained below.
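Here is a rough sketch of solution 1, caching an empty placeholder for keys that turn out to be missing from the database. The function names, the `db.query_user` lookup, and the short placeholder TTL are hypothetical, used only to illustrate the flow.

```python
import redis

r = redis.Redis()
MISSING = b""          # placeholder stored for keys absent from the database
MISSING_TTL = 60       # keep the placeholder short-lived to limit wasted memory

def get_user(user_id, db):
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        # Either real data or the empty placeholder for a known-missing key.
        return cached or None
    row = db.query_user(user_id)   # hypothetical database lookup
    if row is None:
        # Cache the miss so repeated requests for this key stop hitting the DB.
        r.setex(key, MISSING_TTL, MISSING)
        return None
    r.setex(key, 3600, row)
    return row
```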

3. Bloom filter

First, let's be clear about the problem to solve: how to quickly determine whether a piece of data exists in the database. The natural idea is hashing: store every key in the database in a hash structure and lookups become fast. The problem is that real data sets tend to be large, easily reaching billions of records, and each key may occupy quite a few bytes. Suppose there are 10^9 keys and each key averages 8 bytes; storing just the keys then takes 8 × 10^9 bytes ≈ 8 GB. For an in-memory database like Redis, that much memory consumption is unaffordable. So the question becomes: is there a data structure that supports fast lookups while keeping the memory footprint as small as possible?
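Written out, the estimate above is:

$$10^{9}\ \text{keys} \times 8\ \text{bytes/key} = 8 \times 10^{9}\ \text{bytes} \approx 8\ \text{GB}$$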

3.1 BitMap

Following on from that problem, BitMap is exactly such a data structure: fast to query while keeping the memory footprint as small as possible. The idea is to use a single bit to represent a key: whatever the key is, a hash function maps it to one specific bit. To look a key up, run it through the same hash function, find the corresponding bit in the BitMap, and check whether it is 0 or 1; that tells you whether the key exists. A concrete example follows.
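A minimal BitMap sketch in plain Python, backed by a `bytearray`; it uses an integer key directly as the bit index, which is the simplest form of the mapping described above.

```python
class BitMap:
    """One bit per possible key value; bit i answers "has key i been seen?"."""

    def __init__(self, max_value):
        self.bits = bytearray((max_value // 8) + 1)

    def set(self, key):
        self.bits[key // 8] |= 1 << (key % 8)

    def test(self, key):
        return (self.bits[key // 8] >> (key % 8)) & 1 == 1

bm = BitMap(max_value=100)
bm.set(42)
print(bm.test(42), bm.test(43))   # True False
```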

Scene one: among 2,000,000,000 integers, find how many appear only once, when memory cannot hold all 2 billion of them. (Deduplication over a large data set)

  • First, "memory is insufficient to hold 2 billion integers" immediately suggests a Bit-map. The key question is how to design the underlying Bit-map so it can represent the state of these two billion numbers. That turns out to be simple: a number has only three possible states, absent, present once, or present more than once, so 2 bits per number are enough. Say 00 means absent, 01 means present once, and 11 means present more than once. With 2 bits for each possible 32-bit value we need roughly 2^32 × 2 bits ≈ 1 GB of storage.
  • The next task is to write the two billion numbers into the structure: if a number's status bits are 00, change them to 01, meaning it has appeared once; if they are 01, change them to 11, meaning it has now appeared more than once; if they are already 11, leave them unchanged, they still mean "appeared multiple times".
  • Finally, count how many status pairs equal 01; that is the number of non-repeated integers. The time complexity is O(n). A code sketch of this bookkeeping follows the list.
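Sketched in code, the three-state bookkeeping looks roughly like this; the value range is shrunk so the example stays small, and the encoding (00 absent, 01 once, 11 more than once) follows the description above.

```python
def count_unique(numbers, max_value):
    """Return how many numbers appear exactly once, using 2 bits of state per value."""
    states = bytearray(((max_value + 1) * 2 + 7) // 8)   # 2 bits per possible value

    def get_state(v):
        byte, offset = (v * 2) // 8, (v * 2) % 8
        return (states[byte] >> offset) & 0b11

    def set_state(v, s):
        byte, offset = (v * 2) // 8, (v * 2) % 8
        states[byte] = (states[byte] & ~(0b11 << offset)) | (s << offset)

    for n in numbers:
        s = get_state(n)
        if s == 0b00:
            set_state(n, 0b01)        # first occurrence
        elif s == 0b01:
            set_state(n, 0b11)        # appeared more than once
        # a value already at 0b11 stays 0b11

    return sum(1 for v in range(max_value + 1) if get_state(v) == 0b01)

print(count_unique([1, 5, 5, 7, 2, 2, 2], max_value=10))   # 2 -> {1, 7}
```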

Scene two: sort one billion distinct integers. (Sorting a large data set)

  • For example, suppose I want to sort the four numbers {1, 5, 7, 2} using a single byte. How? A byte has 8 bits, so we can treat each value in the array as a bit index and use that bit (0 or 1) to mark whether the value has appeared, as in the figure below (and in the code sketch after this list):
    (figure: a byte with the bits at positions 1, 2, 5 and 7 set to 1)
    As the figure shows, we use the values of the array as bit indexes; in the end we just traverse the bits and output the positions whose bit is 1, which naturally gives us the array in sorted order.

  • Someone may ask: what if I then add a 13? Very simple: one byte can hold 8 numbers, so two bytes solve the problem.
    (figure: two bytes, with bit 13 set in the second byte)
    You can picture the linear byte array as a two-dimensional matrix of bits. The space we finally need is about 3.6 GB / 32 ≈ 0.1 GB.

  • Note that the complexity of bitmap sorting is not O(N); it depends on the maximum value in the array to be sorted, which in practice matters little. For example, if I open 10 threads to read the byte array, the complexity becomes O(Max / 10).
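The {1, 5, 7, 2} example as runnable code, a small sketch rather than a full implementation; walking the bit positions from low to high yields the values in ascending order.

```python
def bitmap_sort(values, max_value):
    """Sort distinct non-negative integers by setting one bit per value."""
    bits = bytearray((max_value // 8) + 1)
    for v in values:
        bits[v // 8] |= 1 << (v % 8)
    # Traversing the bit positions in order reproduces the values sorted ascending.
    return [i for i in range(max_value + 1) if (bits[i // 8] >> (i % 8)) & 1]

print(bitmap_sort([1, 5, 7, 2], max_value=7))         # [1, 2, 5, 7]
print(bitmap_sort([1, 5, 7, 2, 13], max_value=15))    # adding 13 just needs a second byte
```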

Although BitMap keeps the space as small as it can, for an in-memory database it is still a large cost: a key space of 1G still needs at least 1G of space. So we look for ways to shrink the footprint further.

3.2 Bloom filter

Looking at the problem above, we notice that a Bitmap has only two states per bit, requires a one-to-one correspondence between keys and bits, and has its size dictated by the largest key value. Suppose Redis has 2G of memory; a Bitmap can then represent at most 2G keys. But if we let one key map to several bits instead of exactly one, and stop being tied to the extreme values, the same space can represent far more keys. For example, if every key is represented by k = 3 bits, 2G bits can in principle distinguish $C_{2G}^{3}$ different keys, far more than 2G. This is exactly the idea the Bloom filter adopts: hash each key with several hash functions so that one key corresponds to several bit positions, thereby further reducing the space.

The Bitmap's advantage is that its space complexity does not grow as the number of elements in the set grows; its disadvantage comes from the same place: the space grows linearly with the maximum element of the set.

So now we introduce another famous industrial-strength structure, the Bloom filter. If a Bitmap maps every possible integer value by direct addressing, which is equivalent to using a single hash function, then the Bloom filter introduces k (k > 1) mutually independent hash functions and uses them to decide membership within a given amount of space while keeping the false-positive rate under control. The figure below shows a Bloom filter with k = 3.
(figure: a Bloom filter with k = 3 hash functions mapping each element to three bit positions)
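Below is a small, self-contained sketch of the idea, not a production implementation: k = 3 bit positions are derived for each key by hashing it with three different salts. The filter size m, the salting scheme, and the key names are arbitrary illustrative choices.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k=3):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key):
        # k bit positions per key: hash the key with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means possibly present (false positives happen).
        return all((self.bits[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:1001")
print(bf.might_contain("user:1001"))   # True
print(bf.might_contain("user:9999"))   # almost certainly False
```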

But we also notice a problem: a Bloom filter can produce false positives. Because different keys may hash to overlapping bit positions, a key that is not in the database can still find all of its corresponding bits set to 1 and be reported as present. What the Bloom filter does guarantee is the opposite direction: if any of a key's bits is 0, the key definitely does not exist in the database, so a negative answer is always correct.

Reference Links: https://blog.csdn.net/zdxiq000/article/details/57626464
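The error-rate discussion below presumably refers to the standard Bloom filter approximation, where m is the number of bits, k the number of hash functions, and n the number of elements inserted so far:

$$p \approx \left(1 - e^{-kn/m}\right)^{k}$$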

By accepting a certain error rate, the Bloom filter makes membership judgments over massive data achievable at an acceptable memory cost. As the formula above shows, the error grows as more elements are inserted into the filter (as $n$ increases). But when the bit array size $m$ is large enough, say more than ten times the number of distinct elements that may ever occur, the error probability is acceptable. Compared with a plain Bitmap, this algorithm's space complexity no longer depends on the value range of the elements to be tested, only on the total number of elements, which makes it far more practical as an engineering solution: the former needs a constant amount of memory no matter how many elements the set contains, while the latter needs more memory as the number of elements grows if the same false-positive rate is to be maintained.


Source: blog.csdn.net/JustKian/article/details/104240716