How to quickly determine whether an element exists among 1 billion records

Preface

When Redis is used as a cache, its purpose is to reduce the frequency of database access and relieve pressure on the database. But if the requested data does not exist in Redis, the request still goes straight to the database. If a large number of cached entries expire at the same time, or uncached data is requested in bulk by a malicious attack, the pressure on the database rises sharply. How can this be prevented?

Cache avalanche

Cache avalanche refers to a large number of cached entries in Redis all expiring at the same time. If a large number of requests arrive at that moment, they all hit the database directly, and the database may well be overwhelmed.

A cache avalanche generally involves data that exists in the database but is no longer in the cache: because the cached entries have expired, the requests go straight to the database.

Solution

There are many ways to solve the cache avalanche, and the following are commonly used:

  • Use locking to ensure single-threaded access to the cache, so that there will not be many requests hitting the database at the same time.
  • Do not give every key the same expiration time. Typically, when warming up the cache, add a random offset to each key's expiration time so that large numbers of entries do not expire at the same moment (a sketch follows this list).
  • If memory allows, the cache can be set to never expire.
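
To illustrate the second point, here is a minimal sketch, assuming the Jedis client; the CacheWarmUp class, warmUp method and TTL values are made up for the example. Each key gets a random offset added to its expiration time so the entries do not all expire together:

import redis.clients.jedis.Jedis;

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class CacheWarmUp {

    private static final int BASE_TTL_SECONDS = 30 * 60; // 30-minute base TTL

    // Warm up the cache, giving each key a slightly different expiration time
    public static void warmUp(Jedis jedis, Map<String, String> data) {
        for (Map.Entry<String, String> entry : data.entrySet()) {
            // add a random 0-5 minute offset so keys do not expire at the same moment
            int ttl = BASE_TTL_SECONDS + ThreadLocalRandom.current().nextInt(300);
            jedis.setex(entry.getKey(), ttl, entry.getValue());
        }
    }
}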

Cache breakdown

Cache breakdown is very similar to cache avalanche. The difference is that cache breakdown generally refers to a single cache entry expiring while there are many concurrent requests for that key, which puts pressure on the database.

Solution

The method to solve the cache breakdown is very similar to the method to solve the cache avalanche:

  • Use locking to ensure single-threaded access to the cache. After the first request reaches the database, the result is written back into the cache, and subsequent requests can read it directly from the cache (see the sketch after this list).
  • If memory allows, the cache can be set to never expire.
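
A minimal sketch of the locking idea, assuming the Jedis client; HotKeyLoader and loadFromDatabase are hypothetical names. The first thread that misses rebuilds the cache, the others wait and then read the value it wrote back:

import redis.clients.jedis.Jedis;

public class HotKeyLoader {

    private final Object lock = new Object();

    // Read a hot key; if it has just expired, only one thread rebuilds it from the database
    public String get(Jedis jedis, String key) {
        String value = jedis.get(key);
        if (value != null) {
            return value;
        }
        synchronized (lock) {
            // double-check: another thread may have rebuilt the cache while we waited
            value = jedis.get(key);
            if (value != null) {
                return value;
            }
            value = loadFromDatabase(key);    // hypothetical database lookup
            jedis.setex(key, 30 * 60, value); // write back with a TTL
            return value;
        }
    }

    private String loadFromDatabase(String key) {
        // placeholder for the real database query
        return "value-of-" + key;
    }
}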

Cache penetration

The essential difference between cache penetration and the two phenomena above is that the requested data exists neither in Redis nor in the database. If concurrency is high, a steady stream of such requests reaches the database, putting enormous pressure on it.

Solution

For cache penetration, locking does not help much, because the keys themselves simply do not exist: even if access is limited to a few threads, a steady stream of requests still reaches the database.

To solve the cache penetration problem, the following solutions can generally be used in conjunction:

  • Validate at the interface layer and return directly for obviously illegal keys. For example, if the database uses an auto-increment id, a non-integer or negative id can be rejected immediately; if it uses a 32-character uuid, an id whose length is not 32 can be rejected immediately.
  • Cache the non-existent data as well, storing an empty value or some other agreed invalid value. With this scheme it is best to give such keys a short expiration time, otherwise a large number of non-existent keys stored in Redis will take up a lot of memory. (A combined sketch follows this list.)
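
A minimal sketch combining the two ideas, assuming the Jedis client; UserQueryService, the user: key prefix and loadFromDatabase are made up for the example:

import redis.clients.jedis.Jedis;

public class UserQueryService {

    private static final String EMPTY_PLACEHOLDER = ""; // agreed "not found" marker
    private static final int EMPTY_TTL_SECONDS = 60;    // short TTL for non-existent keys

    public String findById(Jedis jedis, long id) {
        // 1. interface-layer check: an auto-increment id can never be <= 0
        if (id <= 0) {
            return null;
        }
        String key = "user:" + id;
        String cached = jedis.get(key);
        if (cached != null) {
            return EMPTY_PLACEHOLDER.equals(cached) ? null : cached;
        }
        String value = loadFromDatabase(id); // hypothetical database lookup
        // 2. cache the empty result too, but only for a short time
        jedis.setex(key, value == null ? EMPTY_TTL_SECONDS : 30 * 60,
                value == null ? EMPTY_PLACEHOLDER : value);
        return value;
    }

    private String loadFromDatabase(long id) {
        return null; // placeholder: pretend the row does not exist
    }
}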

Bloom Filter

Thinking further about the cache penetration solutions above: if keys that slip past the first check but still do not exist are accessed in huge numbers (say 100 million or 1 billion distinct keys), storing them all in memory is simply not realistic.

So is there a better solution? This is the Bloom filter we are going to introduce next. The Bloom filter can store as much data as possible with as little space as possible.

What is a Bloom filter

The Bloom filter was proposed by Bloom in 1970. It is essentially a very long binary vector (a bitmap) plus a series of random mapping functions (hash functions).

A Bloom filter can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general-purpose algorithms; its disadvantages are a certain false positive rate and the difficulty of deleting elements.

Bitmap

Redis provides a bitmap data structure, and the bitmap, i.e. a bit array, is the key building block of a Bloom filter implementation. Each position in the array has only two states, 0 and 1, and occupies only 1 bit; 0 means no element maps there, 1 means some element does. Shown below is a simple example of a Bloom filter (a key is hashed and then bit-manipulated to derive the position it falls into):

[Figure: keys such as lonely and wolf hashed onto positions of a bit array]
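
For illustration only, a small sketch, assuming the Jedis client and a Redis server on localhost, that hashes a key onto a position of a Redis bitmap with SETBIT/GETBIT; the key name demo:bitmap and the bitmap size are made up for the example:

import redis.clients.jedis.Jedis;

public class BitmapExample {

    private static final long BITMAP_SIZE = 1 << 20; // 2^20 bits for the example

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // hash the key and map it onto a position inside the bit array
            long position = Math.floorMod((long) "wolf".hashCode(), BITMAP_SIZE);

            jedis.setbit("demo:bitmap", position, true);            // mark the position as 1
            boolean exists = jedis.getbit("demo:bitmap", position); // read it back
            System.out.println("bit at " + position + " = " + exists);
        }
    }
}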

Hash collision

As the figure shows, lonely and wolf fall in the same position. The phenomenon where different keys produce the same value after hashing is called a hash collision. Once a hash collision has occurred, the subsequent bit operation will inevitably map them to the same final position.

If too many hash collisions occur, the accuracy of the judgment suffers, so to reduce hash collisions we generally consider the following two factors:

  • Increase the size of the bitmap array (the larger the bitmap array, the more memory it occupies).
  • Increase the number of hash functions (if a key collides with another key after 1 hash function, the probability that they still collide after 2 or more different hash functions naturally decreases).

Both methods involve a trade-off: increasing the number of bits consumes more space, while computing more hash functions consumes more CPU and increases the time of the final calculation. How large the bit array should be and how many hash functions should be computed must be decided for the specific situation.

Two major features of Bloom filters

The figure below shows a Bloom filter after two hash functions have been applied. From it we can easily see that although Redis was never inserted, the two positions obtained for Redis by the two hash functions are both already 1 (one was set by wolf through f2, the other by Nosql through f1). This is exactly a hash collision, and it is the reason a Bloom filter may misjudge.

[Figure: a Bloom filter with two hash functions f1 and f2; both positions computed for Redis are already 1 because of wolf and Nosql]

From this phenomenon we can conclude the two main features of a Bloom filter, seen from the filter's perspective:

  1. If the Bloom filter judges that an element exists, then this element may exist .
  2. If the Bloom filter judges that an element does not exist, then the element must not exist .

Seen from the element's perspective, the two major features can also be stated as:

  1. If the element actually exists, then the Bloom filter must determine that it exists .
  2. If the element does not exist, the Bloom filter may determine that it exists .

PS: Note that after N hash functions, all N positions must be 1 for the Bloom filter to judge the element as present; as long as even one of them is 0, the Bloom filter can judge that the element does not exist.
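
To make the two features concrete, here is a toy sketch of put and mightContain over a plain bit array. This is not Guava's implementation; the salted-hash trick for deriving k positions is just for the example:

import java.util.BitSet;

public class TinyBloomFilter {

    private final BitSet bits;
    private final int size;
    private final int numHashFunctions;

    public TinyBloomFilter(int size, int numHashFunctions) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashFunctions = numHashFunctions;
    }

    // derive k different positions from one key by salting its hash code
    private int position(String key, int i) {
        return Math.floorMod((key + "#" + i).hashCode(), size);
    }

    public void put(String key) {
        for (int i = 0; i < numHashFunctions; i++) {
            bits.set(position(key, i)); // set all k positions to 1
        }
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < numHashFunctions; i++) {
            if (!bits.get(position(key, i))) {
                return false; // any 0 means "definitely not present"
            }
        }
        return true; // all 1s means "possibly present"
    }
}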

fpp

Because hash collisions can never be avoided 100%, a Bloom filter always has a certain false positive rate. The Bloom filter calls this probability the false positive probability, i.e. False Positive Probability, abbreviated as fpp.

In practice, when using a Bloom filter you can define an fpp yourself, and the required number of hash functions and the required bit array size can then be calculated from the Bloom filter's theory. Note that this fpp cannot be defined as 0, because there is no way to guarantee 100% that hash collisions will not occur.
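
For reference, the commonly cited approximation of the false positive probability is p ≈ (1 - e^(-kn/m))^k, where n is the number of inserted elements, m the number of bits and k the number of hash functions. A small sketch of the calculation:

public class FppEstimate {

    // standard approximation of the false positive probability:
    // p ≈ (1 - e^(-k * n / m))^k
    static double estimateFpp(long n, long m, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        // e.g. 1,000,000 elements in a 7,298,440-bit array with 5 hash functions
        // prints roughly 0.03, matching the ~3% observed later in this article
        System.out.println(estimateFpp(1_000_000L, 7_298_440L, 5));
    }
}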

Implementation of Bloom Filter (Guava)

The Guava package provides an implementation of the Bloom filter. Let's experience its use through Guava:

  1. Introduce the pom dependency:
<dependency>
   <groupId>com.google.guava</groupId>
   <artifactId>guava</artifactId>
   <version>29.0-jre</version>
</dependency>
  2. Create a new test class GuavaBloomFilter:
package com.lonely.wolf.note.redis;

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class GuavaBloomFilter {

    private static final int expectedInsertions = 1000000;

    public static void main(String[] args) {
        BloomFilter<String> bloomFilter = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), expectedInsertions);

        List<String> list = new ArrayList<>(expectedInsertions);

        for (int i = 0; i < expectedInsertions; i++) {
            String uuid = UUID.randomUUID().toString();
            bloomFilter.put(uuid);
            list.add(uuid);
        }

        int mightContainNum1 = 0;

        NumberFormat percentFormat = NumberFormat.getPercentInstance();
        percentFormat.setMaximumFractionDigits(2); // maximum number of decimal places

        for (int i = 0; i < 500; i++) {
            String key = list.get(i);
            if (bloomFilter.mightContain(key)) {
                mightContainNum1++;
            }
        }
        System.out.println("[keys that really exist] number of keys the Bloom filter considers present: " + mightContainNum1);
        System.out.println("-----------------------separator---------------------------------");

        int mightContainNum2 = 0;

        for (int i = 0; i < expectedInsertions; i++) {
            String key = UUID.randomUUID().toString();
            if (bloomFilter.mightContain(key)) {
                mightContainNum2++;
            }
        }

        System.out.println("[keys that do not exist] number of keys the Bloom filter considers present: " + mightContainNum2);
        System.out.println("[keys that do not exist] false positive rate of the Bloom filter: " + percentFormat.format((float) mightContainNum2 / expectedInsertions));
    }
}

The result after running is:

[Screenshot: console output of the two test loops]
In the first part of the output, mightContainNum1 is always equal to the for loop count, i.e. a 100% match. In other words, feature 1 is satisfied: if the element actually exists, the Bloom filter will definitely judge that it exists.
In the second part of the output, the false positive rate fpp always stays around 3%, and the more iterations the for loop runs, the closer it gets to 3%. In other words, feature 2 is satisfied: if the element does not exist, the Bloom filter may judge that it exists.

Where does this 3% false positive rate come from? If we step into the create method used to build the Bloom filter, we find that fpp defaults to 0.03:

[Screenshot: Guava's BloomFilter.create overload showing the default fpp of 0.03]
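
If the 3% default is not what you want, Guava also has a create overload that takes the fpp explicitly as a third argument; for example, reusing the expectedInsertions constant and imports from the test class above:

// a minimal variant of the call above with an explicit 1% false positive rate
BloomFilter<String> strictFilter = BloomFilter.create(
        Funnels.stringFunnel(Charsets.UTF_8), expectedInsertions, 0.01);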

How much bit array space and how many hash functions does this default fpp of 3% require? The BloomFilter class provides two methods from which the bit array size and the number of hash functions can be obtained (the standard formulas behind them are sketched after this list):

  • optimalNumOfHashFunctions: Get the number of hash functions
  • optimalNumOfBits: Get the size of the bit array
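
These correspond to the standard Bloom filter sizing formulas. A sketch of the math, written from the usual derivation rather than copied from Guava's source:

public class BloomFilterSizing {

    // m = -n * ln(p) / (ln 2)^2  -- optimal number of bits
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // k = m / n * ln 2  -- optimal number of hash functions (at least 1)
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        // for n = 1,000,000 and p = 0.03 this prints roughly 7298440 bits and 5 hash functions
        long m = optimalNumOfBits(1_000_000L, 0.03);
        int k = optimalNumOfHashFunctions(1_000_000L, m);
        System.out.println(m + " bits, " + k + " hash functions");
    }
}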

Debug into them and take a look:

[Screenshot: debugger showing 7298440 bits and 5 hash functions]

The result is 7298440 bits ≈ 0.87 MB, with 5 rounds of hashing. You can see the footprint is very small: 1,000,000 keys occupy only about 0.87 MB.

PS: there are online calculators that work out the bit array size and the number of hash functions for you.

How to delete from a Bloom filter

A Bloom filter judges whether an element exists by checking whether the corresponding positions are 1. But to delete an element you cannot simply flip its 1s back to 0, because the same positions may also belong to other elements. So how can deletion be supported? The simplest way is to add a counter: each position of the bit array no longer stores just 0 or 1, but the number of elements that currently map to it. This introduces a new problem: where one bit used to be enough to store a 1, storing a count such as 2 needs more bits, so a Bloom filter with counters takes up more space.

Bloom filter with counter

The following is an example of a Bloom filter with a counter:

  1. Import the dependency in the pom file:
<dependency>
    <groupId>com.baqend</groupId>
    <artifactId>bloom-filter</artifactId>
    <version>1.0.7</version>
</dependency>
  2. Create a new Bloom filter with a counter, CountingBloomFilter:
package com.lonelyWolf.redis.bloom;

import orestes.bloomfilter.FilterBuilder;

public class CountingBloomFilter {

    public static void main(String[] args) {
        orestes.bloomfilter.CountingBloomFilter<String> cbf = new FilterBuilder(10000,
                0.01).countingBits(8).buildCountingBloomFilter();

        cbf.add("zhangsan");
        cbf.add("lisi");
        cbf.add("wangwu");
        System.out.println("Does wangwu exist: " + cbf.contains("wangwu")); // true
        cbf.remove("wangwu");
        System.out.println("Does wangwu exist: " + cbf.contains("wangwu")); // false
    }
}

Of the parameters used to construct this Bloom filter, the first two are the expected number of elements and the fpp value; the countingBits parameter that follows is the size each counter occupies. Here 8 bits are passed, i.e. a position can be counted at most 255 times; if it is not passed, the default is 16 bits, which allows 65535 repetitions.

Summary

This article described three kinds of problems encountered when using Redis as a cache: cache avalanche, cache breakdown and cache penetration. It discussed solutions to each problem, and finally introduced the dedicated solution to cache penetration: the Bloom filter. The native Bloom filter does not support deletion, but a counter can be introduced to build a counting Bloom filter that supports deletion; as mentioned at the end, the counting Bloom filter takes up more space.


Origin blog.csdn.net/zwx900102/article/details/114119634