Data Structures and Algorithms - Bloom filter

 

Introduction

When do we need a Bloom filter? Let's look at a few common examples:

  • In word processing software, checking whether an English word is spelled correctly
  • At the FBI, checking whether a suspect's name is already on the list of suspects
  • In a web crawler, checking whether a URL has already been visited
  • In Yahoo, Gmail and other mail services, filtering spam

These examples have one thing in common: how do we determine whether an element is in a collection?

 

Conventional approaches and their limitations

To determine whether an element is in a collection, the usual idea is to store all the elements of the collection and then decide by comparison. Linked lists, trees and hash tables are all data structures built on this idea. But as the collection grows, we need more and more storage space, and retrieval gets slower and slower.

  • Array
  • List
  • Tree, balanced binary tree, Trie
  • Map (red-black tree)
  • Hash table

Combined with common sorting and binary search, the data structures above can quickly and efficiently handle most membership tests. But what if the collection is very large, say 5 million or even 100 million records? This is where the problems of conventional data structures become apparent.

 

Arrays, linked lists, trees and similar data structures store the content of the elements themselves. Once the data set becomes too large, memory consumption grows linearly and eventually hits a bottleneck.

Some readers may ask: isn't a hash table highly efficient? Its query time can reach O(1). But a hash table still consumes a lot of memory. Consider storing 100 million spam email addresses in a hash table. The hash table approach first uses a hash function to map each email address to an 8-byte fingerprint. Because of hash collisions, the storage efficiency of a hash table is typically no more than 50%, so the memory consumption is 8 × 2 × 100,000,000 bytes = 1.6 GB, which an ordinary computer cannot spare. This is where the Bloom filter comes in. Before introducing how the Bloom filter works, we first explain the prerequisite: the hash function.

 

Hash function

A hash function is a function that converts data of any size into data of a fixed size. The converted data is called the hash value or hash code.

One application is the hash table, a data structure that is accessed directly by hash value (key). In other words, it speeds up lookups by mapping the hash value to the position in a table where the record is stored. Here is a schematic of a typical hash function:

 

The hash function maps the original data to a hash code, and in the process the data is compressed. Hash functions and hash tables are the foundation of the Bloom filter.

A hash function has the following two properties:

  • If two hash values (produced by the same function) are different, then the two original inputs are definitely different.
  • The mapping between inputs and outputs is not one-to-one. If two hash values are identical, the two inputs are probably the same, but they may also be different. The latter case is called a "hash collision".
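The two properties can be observed with a small sketch. It assumes SHA-256 truncated to 8 bytes as the fingerprint function and a hypothetical 4-slot table; neither choice comes from the original text. Because five words are placed into four slots, at least one collision is guaranteed by the pigeonhole principle.

```python
import hashlib


def fingerprint(data: str) -> int:
    """Map a string of any length to a fixed-size 8-byte (64-bit) integer."""
    digest = hashlib.sha256(data.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bucket(data: str, table_size: int) -> int:
    """Map the fingerprint to one of table_size slots."""
    return fingerprint(data) % table_size


words = ["when", "how", "where", "too", "there"]
slots = {word: bucket(word, 4) for word in words}

# Record every pair of different words that landed in the same slot.
first_in_slot = {}
collisions = []
for word, slot in slots.items():
    if slot in first_in_slot:
        collisions.append((first_in_slot[slot], word))
    else:
        first_in_slot[slot] = word

print("slots:", slots)
print("collisions:", collisions)
```

Different slots always mean different words, while a shared slot says nothing certain about the inputs, which is exactly the asymmetry a Bloom filter relies on.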

Drawback: as Dr. Wu puts it in "The Beauty of Mathematics", the space efficiency of a hash table is not high enough. If 100 million spam email addresses are stored in a hash table, each address takes an 8-byte fingerprint, and since the storage efficiency of a hash table is generally only 50%, each address actually occupies 16 bytes. 100 million addresses therefore occupy 1.6 GB, and storing several billion addresses would need on the order of a hundred GB of memory. Short of a supercomputer, an ordinary server cannot hold that.
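The arithmetic in this paragraph can be checked with a few lines. The constant and function names are made up for illustration; the 8-byte fingerprint and the 50% storage efficiency are the figures quoted in the text.

```python
FINGERPRINT_BYTES = 8      # bytes stored per email address (from the text)
STORAGE_EFFICIENCY = 0.5   # usable fraction of the hash table (from the text)


def hash_table_bytes(num_addresses: int) -> float:
    """Estimated memory for a hash table holding one fingerprint per address."""
    return num_addresses * FINGERPRINT_BYTES / STORAGE_EFFICIENCY


addresses = 100_000_000
print(f"{addresses:,} addresses need about {hash_table_bytes(addresses) / 1e9:.1f} GB")
```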

This is why we now introduce the Bloom filter.

 

 

Bloom filter

The Bloom filter was proposed by Bloom in 1970. It is essentially a very long binary vector plus a series of random mapping functions. A Bloom filter can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general algorithms. Its disadvantages are a certain false positive rate and the difficulty of deleting elements.

 

1. Principle

The core of a Bloom filter implementation is a very large bit array and several hash functions. Suppose the bit array has length m and there are k hash functions. The figure shows a Bloom filter with k = 3.

 

Taking the figure above as an example, the operation proceeds as follows. Suppose the set contains three elements {x, y, z} and there are 3 hash functions. First, initialize the bit array by setting every bit to 0.

Each element of the set is passed through the 3 hash functions in turn. Each mapping produces a hash value that corresponds to a position in the bit array, and the bit at that position is set to 1. To query whether an element W is in the set, hash W to 3 positions in the bit array in the same way. If any of the 3 positions is not 1, the element is definitely not in the set. Conversely, if all 3 positions are 1, the element may be in the set.

Note: this cannot establish that the element is definitely in the set, because there is some false positive rate. The figure illustrates why: suppose an element maps to the three positions with indices 4, 5 and 6. All three are 1, but they were set by different elements. So even though this element is not in the set, all of its positions can still be 1, which is why false positives occur.
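The behaviour described above can be reproduced with a deliberately tiny filter. This is a minimal sketch, not a production implementation: the 16-bit array, the salted SHA-256 hashing and the probe strings are all assumptions chosen so that false positives are easy to observe.

```python
import hashlib

M = 16   # bit array length, tiny on purpose so that overlaps are likely
K = 3    # number of hash functions


def positions(item):
    """Derive K positions by salting SHA-256 with the hash function index."""
    result = []
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
        result.append(int.from_bytes(digest[:8], "big") % M)
    return result


bits = [0] * M
members = {"x", "y", "z"}
for item in members:
    for p in positions(item):
        bits[p] = 1


def might_contain(item):
    """False means definitely absent, True means possibly present."""
    return all(bits[p] for p in positions(item))


# Probe strings that were never inserted and keep those reported as present.
probes = [f"w{i}" for i in range(200)]
false_positives = [w for w in probes if might_contain(w)]

print("bit array:", "".join(str(b) for b in bits))
print("false positives among", len(probes), "probes:", len(false_positives))
```

Every inserted element is always reported as present. Any probe that is also reported as present got there only because other elements happened to set all of its bits.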

 

So how large is the error of a Bloom filter? Assume every hash function is sufficiently uniform, so that after hashing, every position of the bitmap is equally likely to be hit. Let the bitmap size be m, the number of elements in the original set be n, and the number of hash functions be k:

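  • With 1 hash function, after one element is inserted, the probability that a given position in the bitmap is still 0 is: $1 - 1/m$
  • With k mutually independent hash functions, after one element is inserted, the probability that a given position in the bitmap is still 0 is: $(1 - 1/m)^{k}$
  • Assume all elements of the original set are distinct (the strictest case). After all n elements have been inserted into the Bloom filter, the probability that a given position is still 0 is $(1 - 1/m)^{nk}$, and the probability that it is 1 is $1 - (1 - 1/m)^{nk}$
  • When we test an element for membership, a false positive means that the element was never inserted, yet all k of its flag bits have already been set to 1. The false positive rate ε is approximately:

                   $\varepsilon \approx [1 - (1 - 1/m)^{nk}]^{k}$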
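For large m the expression is usually evaluated through an exponential. The step below is a standard approximation that is not spelled out in the original text. It relies on the limit $(1 - 1/m)^{m} \to e^{-1}$ and uses the same symbols m, n and k as above:

```latex
\left(1-\frac{1}{m}\right)^{nk}
  = \left[\left(1-\frac{1}{m}\right)^{m}\right]^{nk/m}
  \approx e^{-nk/m}
\qquad\Longrightarrow\qquad
\varepsilon \approx \left(1-e^{-nk/m}\right)^{k}
```

In this form it is easy to see that the rate depends only on k and on the ratio n/m, that is, on how many bits are available per stored element.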

 

 

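Algorithm:
1. First we need k hash functions, each of which hashes a key to an integer.
2. At initialization we need an array of m bits, with every bit initialized to 0.
3. When a key is added to the set, compute k hash values with the k hash functions and set the corresponding bits in the array to 1.
4. To test whether a key is in the set, compute k hash values with the k hash functions and look up the corresponding bits in the array. If all of the bits are 1, the key is considered to be in the set.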

 

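2. Adding and querying

Adding an element to a Bloom filter

  • Feed the element to be added to the k hash functions
  • Obtain the k corresponding positions in the bit array
  • Set these k positions to 1

Querying an element in a Bloom filter

  • Feed the element to be queried to the k hash functions
  • Obtain the k corresponding positions in the bit array
  • If any of the k positions is 0, the element is definitely not in the set
  • If all k positions are 1, the element may be in the set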

 

3. Advantages

It tells us that the element either definitely is not in the set or may be in the set.

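Compared with other data structures, a Bloom filter has a huge advantage in both space and time. Its storage space is fixed, and its insert and query times are constant, O(k). In addition, the hash functions are independent of one another, which makes them easy to run in parallel in hardware. A Bloom filter does not need to store the elements themselves, which is an advantage where confidentiality requirements are very strict.

A Bloom filter can represent the universal set, which no other data structure can.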

 

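4. Disadvantages

The disadvantages of a Bloom filter are just as obvious as its advantages. The false positive rate is one of them. As the number of stored elements increases, the false positive rate increases with it. If the number of elements is very small, however, a plain hash table is enough.

A remedy for false positives is to build a small whitelist that stores the items likely to be misjudged.

In addition, elements generally cannot be deleted from a Bloom filter. An obvious idea is to turn the bit array into an integer array, add 1 to the corresponding counters on every insertion, and subtract from the counters on deletion. But deleting elements safely is not that simple. First, we must make sure that the element being deleted really is in the Bloom filter, and the filter alone cannot guarantee this. Counter wraparound also causes problems.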
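The counter idea from the paragraph above can be sketched as follows. This is an illustrative sketch under stated assumptions: the class name, the default sizes and the salted SHA-256 hashing are not from the original text. Python integers do not overflow, so the wraparound problem does not show up here, but it would with fixed-width counters.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter variant that keeps a counter per position instead of a bit."""

    def __init__(self, size=1000, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        result = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            result.append(int.from_bytes(digest[:8], "big") % self.size)
        return result

    def add(self, item):
        for p in self._positions(item):
            self.counters[p] += 1

    def might_contain(self, item):
        return all(self.counters[p] > 0 for p in self._positions(item))

    def remove(self, item):
        # This check only rejects items that are definitely absent. A false
        # positive passes it too, and removing one corrupts the counters,
        # which is the safety problem described in the text.
        if not self.might_contain(item):
            raise KeyError(item)
        for p in self._positions(item):
            self.counters[p] -= 1


cbf = CountingBloomFilter()
cbf.add("when")
cbf.add("how")
cbf.remove("when")
print("how still possibly present:", cbf.might_contain("how"))
```

The price of supporting deletion is memory: every position now holds a counter of several bits rather than a single bit.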

 

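5. Examples

A Bloom filter can quickly and space-efficiently determine whether an element belongs to a set. It can be used to implement a data dictionary or to compute the intersection of sets.

The Google Chrome browser uses a Bloom filter to identify malicious links (it can represent a large data set in a small amount of storage; put simply, each URL is reduced to a few bits instead of being stored in full)

Another example: detecting spam

Suppose we store 100 million email addresses. We first create a vector of 1.6 billion bits, i.e. 200 million bytes, and set all 1.6 billion bits to zero. For each email address X, we use eight different random number generators (F1, F2, …, F8) to produce eight fingerprints (f1, f2, …, f8). Then a random number generator G maps these eight fingerprints to eight natural numbers g1, g2, …, g8 in the range 1 to 1.6 billion. We now set the bits at these eight positions to one. Once all 100 million email addresses have been processed this way, a Bloom filter for these email addresses is complete.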
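The numbers in this example can be plugged into the false positive formula from the Principle section. The sketch below uses only the figures from the text (100 million addresses, 1.6 billion bits, eight positions per address). The variable names are illustrative, and the calculation assumes ideal, uniform hash functions.

```python
import math

n = 100_000_000      # email addresses stored
m = 1_600_000_000    # bits in the vector
k = 8                # positions set per address

filter_bytes = m // 8
bits_per_address = m / n

# Probability that a given bit is still 0 after all insertions: (1 - 1/m)^(n*k).
# log1p keeps the calculation numerically stable for very large m.
p_zero = math.exp(n * k * math.log1p(-1 / m))
epsilon = (1 - p_zero) ** k

print(f"filter size: {filter_bytes / 1e6:.0f} MB, {bits_per_address:.0f} bits per address")
print(f"estimated false positive rate: {epsilon:.6f}")
```

Under these assumptions the filter takes 200 MB, against the 1.6 GB estimated earlier for a hash table, and the estimated false positive rate stays well below 0.1%.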

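Yet another example:

Two files A and B each hold 5 billion URLs, each URL takes 64 bytes, and the memory limit is 4 GB. Find the URLs common to A and B. What if there are three files, or even n files?

Analysis: if a certain error rate is acceptable, we can use a Bloom filter. 4 GB of memory can represent roughly 34 billion bits. Map the URLs of one file onto these 34 billion bits with a Bloom filter, then read the URLs of the other file one by one and check each against the Bloom filter. If it matches, that URL is probably a common URL (note that there will be a certain error rate).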
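The analysis can be sketched at toy scale. Everything below is an assumption made for illustration: short in-memory lists stand in for the two files, the URLs are made up, and the filter size and hash count are arbitrary. The first lines only check the "34 billion bits" figure.

```python
import hashlib

TOTAL_BITS = 4 * 2**30 * 8        # 4 GB of memory expressed in bits
URLS_PER_FILE = 5_000_000_000
print(f"{TOTAL_BITS:,} bits, about {TOTAL_BITS / URLS_PER_FILE:.1f} bits per URL")


class BloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray((size + 7) // 8)   # packed bit array

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def common_urls(urls_a, urls_b, size=4096, num_hashes=3):
    """Build a filter from A, then stream B and keep the URLs that match."""
    bloom = BloomFilter(size, num_hashes)
    for url in urls_a:
        bloom.add(url)
    return [url for url in urls_b if bloom.might_contain(url)]


shared = [f"http://example.com/shared/{i}" for i in range(5)]
file_a = [f"http://example.com/a/{i}" for i in range(50)] + shared
file_b = [f"http://example.com/b/{i}" for i in range(50)] + shared

candidates = common_urls(file_a, file_b)
print("candidate common URLs:", len(candidates))
```

Every truly common URL is always in the result. The result may also contain a few URLs that appear only in B, which is the error rate the analysis warns about.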

 

6. Implementation

import mmh3   # mmh3: a non-cryptographic hash (MurmurHash3), typically used for hash-based lookups
from bitarray import bitarray


class Zarten_BloomFilter:
    """A small Bloom filter backed by a fixed-size bit array."""

    def __init__(self):
        self.capacity = 1000                      # number of bits in the array (m)
        self.bit_array = bitarray(self.capacity)
        self.bit_array.setall(0)                  # start with every bit cleared

    def add(self, element):
        # Set the bit at every position the element hashes to.
        for position in self._handle_position(element):
            self.bit_array[position] = 1

    def is_exist(self, element):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bit_array[position]
                   for position in self._handle_position(element))

    def _handle_position(self, element):
        # Seeds 41..50 give ten different hash functions (k = 10).
        position_list = []
        for i in range(41, 51):
            index = mmh3.hash(element, i) % self.capacity
            position_list.append(index)
        return position_list

if __name__ == '__main__':
    bloom = Zarten_BloomFilter()
    a = ['when', 'how', 'where', 'too', 'there', 'to', 'when']
    for i in a:
        bloom.add(i)

    b = ['when', 'xixi', 'haha']
    for i in b:
        if bloom.is_exist(i):
            print('%s exist' % i)
        else:
            print('%s not exist' % i)

 

 

 



Origin www.cnblogs.com/lisen10/p/10929092.html