Redis Advanced - Bloom filter



Preface

In Redis Advanced - Redis cache optimization we discussed ways to prevent cache penetration. Caching empty values is one option, but the Bloom filter is a better solution, and we explain it in detail below.


What problems can a Bloom filter solve?

For example: given 5 billion phone numbers, how do you quickly and accurately determine whether 100,000 given numbers are among them?

Option A: query the DB? ----> run lookups against 5 billion phone numbers; how efficient is that?
Option B: load them into memory? ----> at 8 bytes per phone number, 5,000,000,000 * 8 bytes = 40 GB of memory...
Option C: HyperLogLog? ----> the accuracy is a bit too low
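To make Option B's memory figure concrete, here is a small back-of-the-envelope sketch comparing raw in-memory storage with an optimally sized Bloom filter, using the standard sizing formulas m = -n * ln(p) / (ln 2)^2 and k = (m/n) * ln 2. The 1% target error rate is an assumption chosen for illustration:

```java
public class BloomSizeEstimate {
    public static void main(String[] args) {
        long n = 5_000_000_000L;   // 5 billion phone numbers
        double p = 0.01;           // assumed target false-positive rate of 1%

        // Raw storage: 8 bytes per number
        double rawGiB = n * 8.0 / (1L << 30);

        // Optimal Bloom filter size in bits: m = -n * ln(p) / (ln 2)^2
        double mBits = -n * Math.log(p) / (Math.log(2) * Math.log(2));
        double bloomGiB = mBits / 8 / (1L << 30);

        // Optimal number of hash functions: k = (m / n) * ln 2
        double k = mBits / n * Math.log(2);

        System.out.printf("raw: %.1f GiB, bloom: %.1f GiB, k ~ %.1f%n",
                rawGiB, bloomGiB, k);
    }
}
```

For these parameters the Bloom filter needs roughly 5-6 GiB instead of ~37 GiB, at the cost of a 1% false-positive rate.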

There are many similar problems, such as:

  • Spam filtering
  • Spell checking in word-processing software (such as Word)
  • Duplicate URL detection in web crawlers
  • HBase row-key filtering

How the BloomFilter works

Proposed by Burton H. Bloom in 1970, it solves the problem with very little space.

It consists of a long binary vector (think of it as a huge underlying array that holds only 0s and 1s) plus a number of hash functions.

With k hash functions, each insert computes k hash values and sets the corresponding positions to 1. On lookup, the same k hash functions are recomputed: if any resulting position is not 1, the number definitely does not exist in the Bloom filter; only if all positions are 1 may it exist.

Insertion and lookup must use exactly the same hash functions, otherwise the results are meaningless.


Building a Bloom filter

Parameters: m (length of the binary vector), n (number of elements to insert), k (number of hash functions)

Building the filter: run each of the n elements through the procedure above.

Testing membership: run the element through the same procedure again (the k hash functions); if all resulting bits are 1, the element probably exists; otherwise it definitely does not.
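The build-and-test procedure above can be sketched in a few lines of Java. This is an illustrative toy, not production code: the k hash functions are simulated by salting a single hashCode(), and the class name and constants are made up for the example:

```java
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;   // length of the bit array
    private final int k;   // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // i-th "hash function": salt the object's hash with the seed i
    private int hash(Object value, int i) {
        int h = value.hashCode() * 31 + i * 0x9E3779B9;
        return Math.floorMod(h, m);
    }

    // Insert: set the k positions computed by the k hash functions to 1
    public void put(Object value) {
        for (int i = 0; i < k; i++) {
            bits.set(hash(value, i));
        }
    }

    // Query: all k positions 1 -> "might exist"; any 0 -> definitely absent
    public boolean mightContain(Object value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(hash(value, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.put("artisan");
        System.out.println(filter.mightContain("artisan")); // true
        System.out.println(filter.mightContain("missing")); // false (almost certainly)
    }
}
```

Note that `mightContain` can only ever be wrong in one direction: a `false` answer is always definitive.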


The Bloom filter's error rate

From the outset, using a Bloom filter means accepting a certain error rate: false positives are possible. An error occurs precisely when every probed bit happens to already be set.

For example, suppose the k hash computations for some value all land on bits that are 1, when in fact only a different value was ever inserted into the underlying array; the Bloom filter will still tell you the first value exists.

Parameters: m (length of the binary vector), n (number of elements to insert), k (number of hash functions)

Intuitively, two factors matter: the ratio m/n and the number of hash functions k.

Suppose the binary vector m you provision is small, say 1000 (taking Guava as an example: configuring 1000 does not mean only 1000 bits are allocated; Guava computes a much larger underlying array from your expected insertions combined with your configured error rate), while n (the amount of data) is huge, say 100 million, and three hash functions are used.

Now take a piece of data, "artisan":

the first hash function maps it to position 5 of the underlying array,
the second hash function maps it to position 100,
the third hash function maps it to position 1024.

To check whether "artisan" exists, you only need to recompute the three hash functions: if the bits at positions 5, 100 and 1024 are all 1, the element is considered present. If any of them is 0, it definitely is not.

Now suppose another piece of data, "xxxx", happens to also land on positions 5, 100 and 1024 after the three hash computations. The Bloom filter will tell you "xxxx" exists, but in reality those bits were set by "artisan", not "xxxx". This is the inaccuracy you have to accept.


Both m/n and k are inversely related to the error rate

m/n vs. error rate: with m the bit-array length and n the number of elements, the larger the array m is relative to the data n, the larger m/n becomes and the fewer collisions occur, so the error rate drops accordingly.

k vs. error rate: this is also easy to see. If you had only one hash function, the probability of two elements colliding would be much higher; so, up to a point, the larger k is, the lower the error rate.


Estimating the actual error rate

Parameters: m (length of the binary vector), n (number of elements to insert), k (number of hash functions)

  • 1) For one element and one hash function, the probability that any given bit is set to 1 is 1/m, so the probability it stays 0 is 1 - 1/m.

  • 2) With k hash functions, the probability a given bit stays 0 is (1 - 1/m)^k; after inserting n elements, the probability it is still 0 is (1 - 1/m)^(nk).

  • 3) So the probability a given bit is 1 is 1 - (1 - 1/m)^(nk).

  • 4) The probability that a new element is falsely reported as present (all k probed bits are 1) is (1 - (1 - 1/m)^(nk))^k ≈ (1 - e^(-kn/m))^k.
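Plugging numbers into the formula from step 4 shows how the error rate responds to k. The values m = 10,000,000 and n = 1,000,000 (i.e. 10 bits per element) are arbitrary choices for illustration:

```java
public class BloomErrorRate {
    // p ~ (1 - e^(-kn/m))^k : probability that all k probed bits are 1
    static double falsePositiveRate(long m, long n, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 1_000_000;    // elements inserted
        long m = 10_000_000;   // bits, i.e. m/n = 10 bits per element
        for (int k = 1; k <= 8; k++) {
            System.out.printf("k=%d  p=%.4f%n", k, falsePositiveRate(m, n, k));
        }
        // The minimum is near k = (m/n) * ln 2 ~ 6.9, where p drops to about 0.8%
    }
}
```

With m/n fixed, the error rate first falls as k grows and then rises again, which is why k has an optimum rather than "the more the better".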

[Figure: reference table of false-positive rates for commonly used m/n and k values]


Bloom filter (JVM level)

You may first want to review the principles and applications of bitmaps in Special Algorithms - Bitmap.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;


public class GuavaBloomFilterTest {
    // BloomFilter capacity (Guava actually allocates far more space than
    // `capacity` after its internal sizing calculation)
    private static final int capacity = 100000;

    // Build the BloomFilter
    private static final BloomFilter<Integer> bloomFilter =
            BloomFilter.create(Funnels.integerFunnel(), capacity);

    // Simulate initial data
    static {
        for (int i = 0; i < capacity; i++) {
            bloomFilter.put(i);
        }
    }


    public static void main(String[] args) {

        int testData = 999;
        long startTime = System.nanoTime();

        if (bloomFilter.mightContain(testData)) {
            System.out.println(testData + " is (probably) in the Bloom filter");
        }
        System.out.println("Elapsed: " + (System.nanoTime() - startTime) + " ns");

        // Measure the false-positive rate on keys that were never inserted
        double errNums = 0;
        for (int i = capacity + 1000000; i < capacity + 2000000; i++) {
            if (bloomFilter.mightContain(i)) {
                ++errNums;
            }
        }

        System.out.println("False-positive rate: " + (errNums / 1000000));
    }
}

Problems with a local Bloom filter

  • Capacity is limited by local container memory, e.g. the Tomcat JVM
  • Multiple application instances mean multiple Bloom filters, and keeping them synchronized is complex (analogous to session replication)
  • After an application restart, the cached contents must be rebuilt

Against malicious attacks that flood the server with requests for nonexistent keys (cache penetration), you can filter requests through a Bloom filter first: requests for data the Bloom filter says does not exist can generally be screened out, so they are never sent on to the backend.

When the Bloom filter says a value exists, it may in fact not exist; when it says a value does not exist, it definitely does not.


A Bloom filter is essentially a large bit array plus several unbiased hash functions.

"Unbiased" means the hash function spreads the hash values of elements relatively uniformly.

When a key is added to the Bloom filter, each of the several hash functions is applied to it; each resulting hash value is treated as an integer and taken modulo the bit array's length to obtain a position, and each hash function generally yields a different position. Setting all of these positions to 1 completes the add operation.

When the Bloom filter is asked whether a key exists, the same positions are computed as during the add, and the bits at those positions are checked. If even one bit is 0, the key definitely does not exist in the Bloom filter.

If all the bits are 1, this does not mean the key necessarily exists, only that it very probably does, because those bits may have been set by other keys.

If the bit array is sparse, the probability that a positive answer is correct is high; the more crowded the bit array becomes, the lower that probability gets.

This approach suits scenarios where the data hit rate is low, the data set is relatively fixed, and real-time freshness is not required (typically large data sets). The code is more complex to maintain, but the cache takes up very little space.


Pseudo code

Guava ships a Bloom filter implementation; add the dependency:

<dependency>
	<groupId>com.google.guava</groupId>
	<artifactId>guava</artifactId>
	<version>22.0</version>
</dependency>
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;


// Initialize the Bloom filter
// 1000: expected number of insertions, 0.001: desired false-positive rate
BloomFilter<String> bloomFilter = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 1000, 0.001);


// Load all existing keys into the Bloom filter
void init() {
    for (String key : keys) {
        bloomFilter.put(key);
    }
}


String get(String key) {
    // First-level check: does the Bloom filter think the key exists?
    boolean exist = bloomFilter.mightContain(key);
    if (!exist) {
        return "";
    }
    // Fetch from the cache
    String cacheValue = cache.get(key);
    // Cache miss
    if (StringUtils.isBlank(cacheValue)) {
        // Fetch from storage
        String storageValue = storage.get(key);
        cache.set(key, storageValue);
        // If storage returned nothing, set an expiry (300 seconds)
        if (storageValue == null) {
            cache.expire(key, 60 * 5);
        }
        return storageValue;
    } else {
        // Cache hit
        return cacheValue;
    }
}




Bloom filter (distributed)

We analyzed the shortcomings of the local Bloom filter above: it exists only within a single application, synchronizing Bloom filters across multiple applications is difficult, and the cache is lost whenever the application restarts.

In a distributed environment, Redis can be used to build a distributed Bloom filter.

Use the Redisson framework:

https://github.com/redisson/redisson/wiki/6.-distributed-objects#68-bloom-filter

RBloomFilter<SomeObject> bloomFilter = redisson.getBloomFilter("sample");
// initialize bloom filter with 
// expectedInsertions = 55000000
// falseProbability = 0.03
bloomFilter.tryInit(55000000L, 0.03);

bloomFilter.add(new SomeObject("field1Value", "field2Value"));
bloomFilter.add(new SomeObject("field5Value", "field8Value"));

bloomFilter.contains(new SomeObject("field1Value", "field8Value"));
bloomFilter.count();

A Redis-backed Bloom filter addresses the cache-penetration scenario, where a flood of requests for nonexistent keys falls through to the DB and crushes it.

The Bloom filter should therefore sit between the Redis cache and the DB.

A Bloom filter's "exists" answer is uncertain, but its "does not exist" answer is definitive. The pseudo code below exploits this by storing known-missing keys in the filter: when a key is not found in the DB, add it to the Bloom filter; the next time the same key arrives, the filter hits and we return directly instead of querying the DB again.

Pseudo code

public String getByKey(String key) {
    String value = get(key);   // read from Redis (helper not shown)
    if (StringUtils.isEmpty(value)) {
        logger.info("Redis miss for {}", key);
        if (bloomFilter.mightContain(key)) {
            // Key is recorded as absent from the DB
            logger.info("BloomFilter hit for {}", key);
            return value;
        } else {
            if (mapDB.containsKey(key)) {
                logger.info("Updating key {} into Redis", key);
                String valDB = mapDB.get(key);
                set(key, valDB);   // write back to Redis (helper not shown)
                return valDB;
            } else {
                logger.info("Recording missing key {} in the BloomFilter", key);
                bloomFilter.put(key);
                return value;
            }
        }
    } else {
        logger.info("Redis hit for {}", key);
        return value;
    }
}

Disadvantages of the Bloom filter

The Bloom filter achieves its high efficiency in time and space by sacrificing accuracy and convenient deletion:

  • False positives: it may report that an element is in the container when it is not, because the k positions obtained after hashing all happen to be 1. If the Bloom filter stores a blacklist, elements that may have been misjudged can be rescued by maintaining a whitelist.

  • Deletion is hard: an element maps to k positions in the bit array that are set to 1; on deletion you cannot simply set them back to 0, because that may affect the judgment of other elements. A Counting Bloom Filter can be considered instead.
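A Counting Bloom Filter replaces each bit with a small counter, so deletion becomes a decrement instead of a destructive clear. A minimal illustrative sketch (hashing is again simulated by salting hashCode(); the class name and parameters are made up for the example, not production code):

```java
public class CountingBloomFilter {
    private final int[] counters;  // a counter per slot instead of a single bit
    private final int m;
    private final int k;

    public CountingBloomFilter(int m, int k) {
        this.counters = new int[m];
        this.m = m;
        this.k = k;
    }

    // i-th "hash function": salt the object's hash with the seed i
    private int hash(Object value, int i) {
        return Math.floorMod(value.hashCode() * 31 + i * 0x9E3779B9, m);
    }

    // Insert: increment the k counters instead of setting bits
    public void put(Object value) {
        for (int i = 0; i < k; i++) {
            counters[hash(value, i)]++;
        }
    }

    // Deletion is now possible: decrement rather than zeroing out,
    // so slots shared with other elements stay nonzero
    public void remove(Object value) {
        for (int i = 0; i < k; i++) {
            int idx = hash(value, i);
            if (counters[idx] > 0) {
                counters[idx]--;
            }
        }
    }

    public boolean mightContain(Object value) {
        for (int i = 0; i < k; i++) {
            if (counters[hash(value, i)] == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CountingBloomFilter f = new CountingBloomFilter(1024, 3);
        f.put("a");
        f.put("b");
        f.remove("a");
        System.out.println(f.mightContain("a")); // false (unless slots collide)
        System.out.println(f.mightContain("b")); // true
    }
}
```

The trade-off is extra space (a counter, typically 4 bits, per slot instead of 1 bit) and the risk of counter overflow.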



Origin blog.csdn.net/yangshangwei/article/details/105107779