Bloom Filter

1. What is a Bloom filter

I once ran into the following requirement. Stated abstractly: two sets S1 and S2 overlap; which elements fall into each of the three regions A, B, and C?

[Figure: two overlapping sets S1 and S2, with the three regions labeled A, B, and C]

My approach was to build two maps and traverse them. That works when the amount of data is small, but not when it is large. Then I came across the concept of the "Bloom filter".
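The map-based approach can be sketched as follows. This is a minimal illustration, not the original code: the class name SetRegions and its methods are hypothetical, and because the figure is not available, the regions are named by meaning rather than by the letters A, B, and C.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: splits two sets into the three regions of their Venn diagram.
class SetRegions {
    // Elements that appear in `first` but not in `second`.
    static <T> Set<T> onlyIn(Set<T> first, Set<T> second) {
        Set<T> result = new HashSet<>(first);
        result.removeAll(second);
        return result;
    }

    // Elements that appear in both sets.
    static <T> Set<T> inBoth(Set<T> first, Set<T> second) {
        Set<T> result = new HashSet<>(first);
        result.retainAll(second);
        return result;
    }

    public static void main(String[] args) {
        Set<String> s1 = new HashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> s2 = new HashSet<>(Arrays.asList("b", "c", "d"));
        System.out.println("only in S1: " + onlyIn(s1, s2));
        System.out.println("in both:    " + inBoth(s1, s2));
        System.out.println("only in S2: " + onlyIn(s2, s1));
    }
}
```

Both sets have to be held in memory in full, which is why this stops being practical once the data grows large.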

1.1 Definitions

The Bloom filter was proposed by Burton H. Bloom in 1970. It consists of a long binary vector (a bit array) and a series of random mapping functions (hash functions). A Bloom filter is used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general-purpose algorithms; its disadvantages are a certain false positive rate and the difficulty of deleting elements. Bloom filters are widely used wherever membership queries are needed; for example, Oracle's database and Google's Bigtable both use this technique.
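The trade-off between space and false positive rate can be made concrete with the standard sizing approximations. These formulas are not given in the original text. For n expected elements and a target false positive rate p, the usual estimates are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below applies them; the class name BloomSizing and its method names are hypothetical.

```java
// Hypothetical helper, not part of the original article.
class BloomSizing {
    // Bits needed for n elements at target false positive rate p: m = -n * ln(p) / (ln 2)^2
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Number of hash functions that minimizes the false positive rate: k = (m / n) * ln 2
    static int optimalHashes(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Approximate false positive rate after n insertions: (1 - e^(-k*n/m))^k
    static double falsePositiveRate(long m, int k, long n) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 1_000_000;   // expected number of elements (assumed value)
        double p = 0.01;      // target false positive rate (assumed value)
        long m = optimalBits(n, p);
        int k = optimalHashes(m, n);
        System.out.println("m = " + m + " bits, k = " + k);
        System.out.println("estimated rate = " + falsePositiveRate(m, k, n));
    }
}
```

For one million elements at a 1% target, this works out to roughly 9.6 bits per element and 7 hash functions.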

1.2 Features

  1. There are no false negatives: an element that is in the set is always reported as present.
  2. There may be false positives: an element that is not in the set may also be reported as present.
  3. The cost of testing whether an element is in the set is independent of the number of elements in the set; it depends only on the number of hash functions.
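The three properties above can be observed with a small experiment. The sketch below is self-contained and independent of the implementations in section 2; the class name BloomPropertiesDemo, the filter size, the seeds, and the "member-"/"other-" key prefixes are all assumptions chosen for illustration.

```java
import java.util.BitSet;

// Hypothetical demo class, not part of the original article.
class BloomPropertiesDemo {
    static final int BITS = 1 << 16;                      // m: size of the bit vector (assumed)
    static final int[] SEEDS = {31, 131, 1313, 13131};    // one hash function per seed, k = 4 (assumed)
    static final BitSet bitVector = new BitSet(BITS);

    // Seeded polynomial string hash, reduced to an index in [0, BITS).
    static int index(String value, int seed) {
        int h = 0;
        for (int i = 0; i < value.length(); i++) {
            h = seed * h + value.charAt(i);   // int overflow wraps around, which is fine for hashing
        }
        return Math.floorMod(h, BITS);        // floorMod keeps the index non-negative
    }

    static void add(String value) {
        for (int seed : SEEDS) {
            bitVector.set(index(value, seed));
        }
    }

    // At most k probes, no matter how many elements have been added (property 3).
    static boolean mightContain(String value) {
        for (int seed : SEEDS) {
            if (!bitVector.get(index(value, seed))) {
                return false;   // an unset bit proves the value was never added (property 1)
            }
        }
        return true;            // either really present or a false positive (property 2)
    }

    // Counts how many of the keys prefix+0 .. prefix+(n-1) the filter reports as present.
    static int countHits(String prefix, int n) {
        int hits = 0;
        for (int i = 0; i < n; i++) {
            if (mightContain(prefix + i)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            add("member-" + i);
        }
        System.out.println("members found:   " + countHits("member-", 1000) + " of 1000");
        System.out.println("false positives: " + countHits("other-", 1000) + " of 1000");
    }
}
```

The first line always reports all 1000 members, since every bit an added key maps to has been set. The second count depends on the filter size and the quality of the hash functions.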

2. Implementation

2.1 Java implementation

import java.util.BitSet;

public class BloomFilter {

	// Bit array `bits` of length m, initially all 0
	// Hash functions f, k of them
	// Set S of n elements

	// Length of the Bloom filter in bits: 2^25
	private static final int DEFAULT_SIZE = 2 << 24;
	// Choose prime seeds here; they help reduce the error rate
	private static final int[] seeds = { 3, 5, 7, 11, 13, 31, 37, 61 };

	// Bit array
	private static final BitSet bits = new BitSet(DEFAULT_SIZE);
	// Hash functions, one per seed
	private static final SimpleHash[] func = new SimpleHash[seeds.length];

	// Initialize the hash functions once, so add() and contains() work
	// no matter where they are first called from
	static {
		for (int i = 0; i < seeds.length; i++) {
			func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
		}
	}

	public static void addValue(String value) {
		// Hash the string into 8 (or more) integers, then set the bit at each of those positions to 1
		for (SimpleHash f : func) {
			bits.set(f.hash(value), true);
		}
	}

	public static void add(String value) {
		if (value != null) {
			addValue(value);
		}
	}

	public static boolean contains(String value) {
		if (value == null) {
			return false;
		}
		for (SimpleHash f : func) {
			// No need to check every hash: a single unset bit means the string is definitely absent
			if (!bits.get(f.hash(value))) {
				return false;
			}
		}
		return true;
	}

	public static void main(String[] args) {
		System.out.println(DEFAULT_SIZE);
		String value = "www..net";
		add(value);
		System.out.println(contains(value));
	}
}

class SimpleHash {// Roughly the equivalent of a struct in C++
	private final int cap;
	private final int seed;

	public SimpleHash(int cap, int seed) {
		this.cap = cap;
		this.seed = seed;
	}

	public int hash(String value) {// String hash; choosing good hash functions matters a lot
		int result = 0;
		int len = value.length();
		for (int i = 0; i < len; i++) {
			result = seed * result + value.charAt(i);
		}
		// The mask only yields a valid index in [0, cap) when cap is a power of two
		return (cap - 1) & result;
	}
}

2.2 C++ implementation

/*http://www.cnblogs.com/dolphin0520/archive/2012/11/10/2755089.html*/
/* Simple Bloom filter, 2012.11.10 */

#include<iostream>
#include<bitset>
#include<string>
using namespace std;

const unsigned int MAX = 2u << 24;           // number of bits in the filter: 2^25

bitset<MAX> bloomSet;           // skips the step of deriving m from n and p

const int HASH_COUNT = 7;
unsigned int seeds[HASH_COUNT]={3, 7, 11, 13, 31, 37, 61};     // use 7 hash functions



unsigned int getHashValue(const string& str,int n)           // compute the hash value
{
    unsigned int result=0;
    for(size_t i=0;i<str.size();i++)
    {
        // go through unsigned char so non-ASCII bytes cannot make the value negative
        result=seeds[n]*result+(unsigned char)str[i];
        result%=MAX;            // always reduce, so the result is a valid index in [0, MAX)
    }
    return result;
}


bool isInBloomSet(const string& str)                // test whether str is in the Bloom filter
{
    for(int i=0;i<HASH_COUNT;i++)
    {
        unsigned int hash=getHashValue(str,i);
        if(bloomSet[hash]==0)
            return false;
    }
    return true;
}

void addToBloomSet(const string& str)               // add an element to the Bloom filter
{
    for(int i=0;i<HASH_COUNT;i++)
    {
        unsigned int hash=getHashValue(str,i);
        bloomSet.set(hash,1);
    }
}


void initBloomSet()                         // initialize the Bloom filter
{
    addToBloomSet("http://www.baidu.com");
    addToBloomSet("http://www.cnblogs.com");
    addToBloomSet("http://www.google.com");
}


int main(int argc, char *argv[])
{
    int n;
    initBloomSet();
    while(cin>>n)                           // read the number of queries, then that many strings
    {
        string str;
        while(n--)
        {
            cin>>str;
            if(isInBloomSet(str))
                cout<<"yes"<<endl;
            else
                cout<<"no"<<endl;
        }
    }
    return 0;
}

3. Application scenarios

How to generate tens of millions of non-repeating fixed-length strings?

an implementation
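One way to approach this question is to use a Bloom filter as the "already seen" check while generating random candidates. The sketch below is an assumption about how that could look, not the implementation linked above; the class name UniqueStringGenerator, the alphabet, the seeds, and the filter size are all hypothetical. A false positive only causes a fresh candidate to be thrown away, so the output never contains a duplicate.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: emits `count` distinct random strings of a fixed length.
class UniqueStringGenerator {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789";
    private static final int[] SEEDS = {31, 131, 1313, 13131};   // k = 4 hash functions (assumed)

    private final int bitCount;
    private final BitSet seen;
    private final Random random;

    UniqueStringGenerator(int bitCount, long randomSeed) {
        this.bitCount = bitCount;
        this.seen = new BitSet(bitCount);
        this.random = new Random(randomSeed);
    }

    // Seeded polynomial string hash, reduced to an index in [0, bitCount).
    private int index(String value, int seed) {
        int h = 0;
        for (int i = 0; i < value.length(); i++) {
            h = seed * h + value.charAt(i);
        }
        return Math.floorMod(h, bitCount);
    }

    // Records the value and returns true only if it was definitely not seen before.
    private boolean markIfNew(String value) {
        boolean isNew = false;
        for (int seed : SEEDS) {
            int i = index(value, seed);
            if (!seen.get(i)) {
                isNew = true;   // at least one unset bit: the value cannot have been added earlier
            }
            seen.set(i);
        }
        return isNew;
    }

    private String randomString(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // The filter must be sized generously for `count`; a nearly full filter rejects almost everything.
    List<String> generate(int count, int length) {
        List<String> result = new ArrayList<>(count);
        while (result.size() < count) {
            String candidate = randomString(length);
            if (markIfNew(candidate)) {
                result.add(candidate);   // "possibly seen" candidates are simply discarded
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> strings = new UniqueStringGenerator(1 << 24, 42L).generate(100_000, 8);
        System.out.println(strings.size() + " strings, first: " + strings.get(0));
    }
}
```

For tens of millions of strings the bit array would need to be much larger than in this example, but it still takes far less memory than keeping every generated string in a hash set.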

4. References

Most articles about Bloom filters on the Internet are excerpted from the following:

  1. Mathematical Beauty Series 21 - Bloom Filter
  2. Bloom filter - Wikipedia
  3. Bloom filter
  4. Java Implementation of Bloom Filter (Bloom Filter)
  5. Hash and Bloom Filter
