Bloom filter with Redis

Bloom filter

Definition

The Bloom filter was proposed by Burton Howard Bloom in 1970. It is essentially a very long binary vector (a bit array) together with a series of random mapping (hash) functions.

Uses

A Bloom filter can be used to test whether an element is in a set. Typical uses are:

  1. Deduplicating URLs in a web crawler, to avoid crawling the same URL address repeatedly

  2. Anti-spam: determining whether an e-mail address (or, similarly, a message) appears in a blacklist of billions of spam addresses

  3. Preventing cache penetration: put all possible data keys into the Bloom filter, so that when an attacker requests keys that do not exist in the cache, the requests can be rejected quickly instead of bringing the database down

About cache penetration:

To improve query efficiency we usually put a cache such as Redis in front of the database: if a query hits the cache we return the cached value, and if the key is missing or expired we go to the database and then write the result back into the cache. Now consider a scenario where a large number of users request ids that do not exist. Because the cache holds nothing for them, every one of those requests lands on the database, which can easily be overwhelmed. At the same time, since all data is being fetched directly from the persistence layer, the cache hit ratio loses its meaning and the cache itself becomes pointless. This situation is called cache penetration.
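
To make the scenario concrete, here is a minimal sketch of that usual read path, written in the same Jedis style as the full example later in this post; the NaiveCacheRead class name and the queryDb helper are hypothetical stand-ins, purely for illustration.

import redis.clients.jedis.Jedis;

public class NaiveCacheRead {
    private static final Jedis jedis = new Jedis("127.0.0.1", 6379);

    //cache-aside read path: try the cache first, fall back to the database
    public static String query(String id) {
        String cached = jedis.get(id);
        if (cached != null) {
            return cached;            //cache hit
        }
        //cache miss: this runs for EVERY request whose id does not exist,
        //which is exactly how cache penetration hammers the database
        String fromDb = queryDb(id);
        if (fromDb != null) {
            jedis.set(id, fromDb);    //write back so later reads can hit the cache
        }
        return fromDb;
    }

    //hypothetical stand-in for a real database query; returns null for unknown ids
    private static String queryDb(String id) {
        return null;
    }
}

A Bloom filter sits in front of this path: ids that were never inserted fail the filter check and can be rejected before the database is ever touched.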

Advantages

The advantage is that its space efficiency and query time are far better than those of general-purpose algorithms.

Disadvantages

The disadvantages are a certain false-positive rate and the difficulty of deleting elements, but on the whole it is good enough for us to choose it as a tool to improve query performance.

Principle

Internally, a Bloom filter maintains a bit array initialized to all zeros. A key concept here is the false-positive rate: the lower the false-positive rate, the longer the array and the more space it occupies; the higher the false-positive rate, the shorter the array and the less space it occupies.
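
This trade-off can be made precise with the standard Bloom filter sizing formulas, which are the same ones implemented by optimalNumOfBits and optimalNumOfHashFunctions in the code at the end of this post. For n expected insertions and a desired false-positive rate p,

m = -n * ln(p) / (ln 2)^2        (optimal length of the bit array)
k = (m / n) * ln 2               (optimal number of hash functions)

For example, n = 100 and p = 0.01 (the values used in the code below) give m ≈ 959 bits and k ≈ 7 hash functions.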

Because it is a bit array, each position is either 0 or 1. Here we initialize a 16-bit array of all zeros.

To keep the explanation simple, we set the number of hash functions to 3: hash1(), hash2(), and hash3().

The bit array has length arrLength = 16.

For a piece of data, data1, we run it through each of the three hash functions. Take hash1() as an example; the other two work the same way.

hashX(data1) produces a hash value through the hash algorithm's bit operations; that value is then taken modulo arrLength to obtain an array index. Suppose the index is 3;

we then set the bit at that array index to 1.

Similarly, after all three hash functions have been processed, three bits in the array have been set to 1, as in the sketch below:
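
This is a minimal, self-contained version of the toy example. The three "hash functions" here are just placeholders built from String.hashCode() so that the code runs; they are not real independent hash functions.

import java.util.BitSet;

public class ToyBloomFilter {
    private static final int arrLength = 16;               //16-bit array, all zeros at first
    private static final BitSet bits = new BitSet(arrLength);

    //toy stand-ins for hash1(), hash2(), hash3()
    private static int hash1(String data) { return data.hashCode(); }
    private static int hash2(String data) { return data.hashCode() * 31 + 7; }
    private static int hash3(String data) { return data.hashCode() * 131 + 11; }

    //map a hash value to an array index in [0, arrLength)
    private static int index(int hash) { return Math.floorMod(hash, arrLength); }

    //record a piece of data by setting its three bits to 1
    public static void add(String data) {
        bits.set(index(hash1(data)));
        bits.set(index(hash2(data)));
        bits.set(index(hash3(data)));
    }

    //all three bits set -> possibly seen before; any bit still 0 -> definitely never added
    public static boolean mightContain(String data) {
        return bits.get(index(hash1(data)))
                && bits.get(index(hash2(data)))
                && bits.get(index(hash3(data)));
    }

    public static void main(String[] args) {
        add("data1");
        System.out.println(mightContain("data1"));  //true
        System.out.println(mightContain("data2"));  //usually false, but a false positive is possible
    }
}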

In this way, very little space is needed to record this piece of data. Now, when the same data comes in again, it maps to the same three positions because of how the hash functions work, so by checking whether those three bits are all 1 we can tell whether we have seen this data before (or can we?).

So, is it really that simple?

In fact, the Bloom filter has the following characteristic: if all the bits are already set, that does not necessarily mean the data is a duplicate; but if even one bit is not set, the data is definitely not a duplicate.

This is because the same hash value does not necessarily come from the same data; that is easy to understand, right?

Different hash values, however, definitely mean different data. So we now know that when a Bloom filter judges something to be a duplicate, there is a false-positive rate. This is something we need to keep in mind.

Implementation

Implementation 1: the Google Guava framework (readers who want the details can look them up on Baidu)
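
For readers who do want to try it, a minimal sketch with Guava's in-memory BloomFilter (the same library whose Hashing and Funnels utilities the Redis version below also imports) looks roughly like this:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

public class GuavaBloomFilterDemo {
    public static void main(String[] args) {
        //100 expected insertions, 1% desired false-positive rate
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 100, 0.01);

        filter.put("10000001");

        System.out.println(filter.mightContain("10000001")); //true
        System.out.println(filter.mightContain("10000005")); //almost certainly false
    }
}

Guava keeps the bit array in local JVM memory, so it cannot be shared across services; that limitation is the main reason the next implementation moves the bitmap into Redis.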

Implementation 2: with Redis

The code is as follows:

package com.example.demo.test;

import com.google.common.hash.Funnels;
import com.google.common.hash.Hashing;
import lombok.AllArgsConstructor;
import lombok.Data;
import redis.clients.jedis.Jedis;

import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;


/**
 * Redis Bloom filter (rule: if all of the bits are already set, the data is not necessarily a duplicate;
 * if even one bit is not set, the data is definitely not a duplicate)
 * <p>
 * When new data is written, its id is fed into the Bloom filter (hash it and set the bitmap bits) ->
 * When a new request comes in, check its id against the Bloom filter for a possible duplicate ->
 * true: judged a possible duplicate, so look in the cache; if it is not there, the filter misjudged, so go to the database
 * false: definitely not a duplicate, so go straight to the database
 */

public class RedisBloomFilter {
    static final int expectedInsertions = 100;//expected number of insertions
    static final double fpp = 0.01;//desired false-positive rate

    //length of the bit array
    private static long numBits;

    //number of hash functions
    private static int numHashFunctions;

    private static Jedis jedis = new Jedis("127.0.0.1", 6379);
    private static Map<String,Object> map = new HashMap<>();
    static {
        numBits = optimalNumOfBits(expectedInsertions, fpp);
        numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
        //mock data 0 (full objects; this needs serialization knowledge, which is too long to cover here, so try it yourself)
        //map.put("10000001", new Goods("10000001","雕牌洗衣粉",6.25,"洗衣粉" ));
        //map.put("10000002", new Goods("10000002","小米空调",3006,"小米空调" ));
        //map.put("10000003", new Goods("10000003","任天堂switch",1776.99,"任天堂switch" ));
        //map.put("10000004", new Goods("10000004","联想笔记本电脑",6799,"联想笔记本电脑" ));

        //mock data 1 (only the price is cached here)
        map.put("10000001", 6.25);
        map.put("10000002", 3006);
        map.put("10000003", 1776.99);
        map.put("10000004", 6799);
    }

    public static void main(String[] args) {
        
        //simulate loading data into the cache and setting its bits in the Bloom filter bitmap
        map.forEach((k,v)->{
            jedis.set(k, String.valueOf(v));
            long[] indexs = getIndexs(String.valueOf(k));
            for (long index : indexs) {
                jedis.setbit("codebear:bloom", index, true);
            }

        });
        
        //simulate data requested by users
        String userInput1 = "10000001";
        String userInput2 = "10000005";
        String[] arr = {userInput1, userInput2};
        for (int j = 0; j < arr.length; j++) {
            boolean repeated = true;
            long[] indexs = getIndexs(String.valueOf(arr[j]));
            for (long index : indexs) {
                Boolean isContain = jedis.getbit("codebear:bloom", index);
                if (!isContain) {
                    System.out.println(arr[j] + "肯定没有重复!");
                    repeated = false;
                    //fetch the data from the database
                    String retVal = getByDb(arr[j]);
                    System.out.println("Value from the database: " + retVal);
                    break;
                }
            }
            if (repeated) {
                System.out.println(arr[j] + "有重复!");
                //尝试从缓存获取
                String retVal = getByCache(arr[j]);
                if (retVal == null) {
                    //从数据库获取
                    retVal = getByDb(arr[j]);
                    System.out.println("数据库获取到的数据为"+retVal);
                    break;

                }
                System.out.println("缓存获取到的数据为"+retVal);

            }
        }
        
    }


    /**
     * Fetch data from the cache
     */
    public static String getByCache(String key){
        return jedis.get(key);
    }

    /**
     * Fetch data from the database
     */
    public static String getByDb(String key){
        //the database lookup logic is not implemented here
        return "";
    }

    /**
     * Compute the bitmap indexes for a key (double hashing: hash_i = hash1 + i * hash2)
     */
    private static long[] getIndexs(String key) {
        long hash1 = hash(key);
        long hash2 = hash1 >>> 16;
        long[] result = new long[numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++) {
            long combinedHash = hash1 + i * hash2;
            if (combinedHash < 0) {
                combinedHash = ~combinedHash;
            }
            result[i] = combinedHash % numBits;
        }
        return result;
    }

    private static long hash(String key) {
        Charset charset = Charset.forName("UTF-8");
        return Hashing.murmur3_128().hashObject(key, Funnels.stringFunnel(charset)).asLong();
    }

    //compute the number of hash functions: k = m / n * ln(2)
    private static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    //compute the length of the bit array: m = -n * ln(p) / (ln 2)^2
    private static long optimalNumOfBits(long n, double p) {
        if (p == 0) {
            p = Double.MIN_VALUE;
        }
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }
}
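
A brief note on what this does when run against a local Redis at 127.0.0.1:6379: the four mock ids are written to the cache and their bits are set in the codebear:bloom bitmap. For the first request, 10000001, every bit is set, so it is judged a possible duplicate and the value 6.25 is returned from the cache; for 10000005, which was never inserted, at least one bit is almost certainly still 0, so it is reported as definitely not a duplicate and goes straight to getByDb (which is only a stub here).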
