Bloom filter: description and application

First, the Bloom filter

1, Wikipedia

  The Bloom filter was proposed by Bloom in 1970.

  It is essentially a long binary vector together with a series of random mapping functions. A Bloom filter can be used to test whether an element is in a set.

  Its advantage is that it does not need to store the keys themselves, so its space efficiency and query time are far better than those of general algorithms. Its disadvantages are a certain false-positive rate and the difficulty of deleting elements.

2, Basic principle

  To determine whether an element is in a collection, the usual approach is to store all the elements of the collection and then decide by comparison.

  Linked lists, trees, and hash tables all follow this line of thinking. But as the number of elements in the collection grows, the required storage grows with it and retrieval gets slower; the retrieval time complexities of these three structures are O(n), O(log n), and O(1) respectively.

  The principle of the Bloom filter is that when an element is added to the collection, K hash functions map it to K positions in a bit array, and those positions are set to 1. When searching, we only need to check whether these positions are all 1 to know (approximately) whether the element is in the collection: if any position is 0, the element is definitely not in the collection; if all of them are 1, the element is probably in the collection (the "probably" is where the error comes from).

3, My own understanding

  Intuitively, the Bloom algorithm is similar to a HashSet (which derives an element's hash address with a hash algorithm and compares hash addresses to decide whether two objects map to the same address): it is used to determine whether an element (key) is in a collection.

  Unlike an ordinary HashSet, the Bloom filter does not store the key's value. For each key it uses only k bits, each storing a flag, to judge whether the key is in the collection.

 

Second, algorithm analysis

1, BloomFilter process

  1. First, k hash functions are needed; each one hashes a key to an integer;

  2. At initialization, an array of n bits is needed, with every bit initialized to 0;

  3. When a key is added to the set, the k hash functions compute k hash values, and the corresponding bits in the array are set to 1;

  4. To determine whether a key is in the set, the k hash functions compute k hash values and the corresponding bits in the array are queried; if all of them are 1, the key is considered to be in the set.
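
  The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used later in this post: the class name BloomFilter, the helper _positions, and the use of salted SHA-256 digests to stand in for k independent hash functions are all assumptions made for the example. The array length is called m here, as in the following sections.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array plus k salted hash functions."""

    def __init__(self, m, k):
        self.m = m                  # number of bits in the array
        self.k = k                  # number of hash functions
        self.bits = bytearray(m)    # step 2: one byte per bit for readability, all 0

    def _positions(self, key):
        # Step 1 (assumption): simulate k hash functions by salting SHA-256
        # with the function index i, then reduce each digest to [0, m).
        data = str(key).encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(b"%d:" % i + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        # Step 3: set the k corresponding bits to 1.
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        # Step 4: the key may be present only if all k bits are 1.
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter(m=1024, k=3)
for key in ("x", "y", "z"):
    bf.add(key)

print("x" in bf)   # True: an added key is always reported as present
print("w" in bf)   # usually False; True here would be a false positive
```

  A real implementation would pack the flags into actual bits and use faster non-cryptographic hashes; the byte-per-bit layout is chosen here only to keep the steps easy to follow.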

2, On hash collisions

  Assume the hash function is a good one. If our bit array has length m and we want to keep the collision rate down to, say, 1%, then the hash table can only hold m / 100 elements. Obviously, this can hardly be called space-efficient. The solution is to use multiple hash functions: if any one of them says the element is not in the set, it is certainly not in the set. If they all say it is, there is still some possibility that they are lying, but intuitively the probability of this is fairly low. This is exactly the BloomFilter process described above.

  A Bloom filter is based on an m-bit vector (b1, ..., bm) whose bits are all initially 0, together with a series of hash functions (h1, ..., hk) whose values fall in the range 1 ~ m.
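
  To put rough numbers on the space argument above, the sketch below compares the two approaches for a 1% error target. It relies on the standard sizing formulas from the usual Bloom filter analysis, m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2, which are not derived in this post; the function name optimal_size and the choice of n are assumptions made for the example.

```python
import math


def optimal_size(n, p):
    """Return (m, k): bits and hash functions for n keys at false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


n = 1_000_000                      # number of keys to store (arbitrary example)
m, k = optimal_size(n, 0.01)

print(m, k)                        # bits and hash functions for the Bloom filter
print(round(m / n, 1))             # bits per key with multiple hash functions
print(100 * n)                     # bits needed by a single hash function (m / 100 keys)
```

  Under these formulas a 1% target costs roughly ten bits per key with about seven hash functions, against one hundred bits per key for the single-hash scheme.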

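3, Algorithm illustration

  Insert {x, y, z} into a Bloom filter, then test whether a value w is in the data set:

  Explanation: m = 18, k = 3. When x is inserted, the three hash functions yield the three values shown by the blue lines, and the corresponding bits of the vector are set to 1. When y and z are inserted, the bits shown by the red and purple lines are likewise set to 1. On lookup, the bits at x's three hash values are all 1, so x is judged to be in the data set; the same holds for y and z. When looking up w, however, one of w's hash values points to a bit that is 0, so w is definitely not in the set. But if that last hash value of w had pointed to a bit that was 1, w would be judged to be in the set even though it may in fact not be, so false positives can occur. Clearly, the more data is inserted, the more bits are 1 and the higher the false-positive probability.

  The Wikipedia entry for Bloom filter has a detailed analysis of the false-positive probability: Probability of false positives. The analysis shows that when k is chosen suitably for the given m and number of inserted elements, the false-positive probability stays fairly small.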
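
  The false-positive probability can be estimated with the usual approximation p ≈ (1 - e^(-kn/m))^k, where n is the number of inserted elements. The sketch below evaluates it for the m = 18, k = 3 example with three inserted elements; the function name false_positive_rate is an assumption made for the example, and the formula is the standard approximation rather than an exact count.

```python
import math


def false_positive_rate(m, k, n):
    """Approximate false-positive probability: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


# The example above: m = 18 bits, k = 3 hash functions, n = 3 inserted keys.
print(round(false_positive_rate(18, 3, 3), 3))   # 0.061

# For fixed m and n the rate first falls, then rises again as k grows,
# because every extra hash function also sets more bits to 1.
for k in (1, 2, 4, 8, 12):
    print(k, round(false_positive_rate(18, k, 3), 3))
```

  For this small example the lowest rate is reached near k = 4, which agrees with the rule of thumb k ≈ (m / n) ln 2.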

 

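Third, applications of BloomFilter

1, Some application scenarios

  Blacklists: for example, an email blacklist filter that checks whether an email address is on the blacklist.

  Sorting (BitSet only).

  Web crawlers: checking whether a URL has already been crawled.

  Fast key-existence checks in K-V systems: a typical example is HBase. Each Region in HBase contains a BloomFilter that is used at query time to quickly determine whether a key exists in that region; if it does not, the query returns immediately and the subsequent lookup is skipped.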

2, Consistency check (ConsistencyCheck)

  Background: database migration (SQL Server to MySQL), with a data consistency check after the migration.

  Design: use a BloomFilter for the ConsistencyCheck.

  Process:

    ① Migrate

    ② Hash the MySQL table rows into the BloomFilter

    ③ Check the SQL Server table rows against the BloomFilter

3, Python code:

import pymysql
import pymssql
import time
from bloompy import ScalableBloomFilter

def timenow():
    # Current local time as a formatted string
    return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())

# Configure the SQL Server (source) connection
def mssql_conn():
    conn = pymssql.connect(
                server="***",
                user="***",
                password="***",
                database="***")
    return conn

# Configure the MySQL (target) connection
def mysql_conn():
    conn = pymysql.connect(
                host="***",
                port=3306,
                user="***",
                password="***",
                database="***")
    return conn

def bloomf(tab, t_sql, s_sql):
    bloom = ScalableBloomFilter(initial_capacity=100, error_rate=0.001, mode=ScalableBloomFilter.LARGE_SET_GROWTH)

    # Step 1: add every row of the target (MySQL) table to the Bloom filter
    conn = mysql_conn()
    cur = conn.cursor()
    print('*** Target table data add to BloomFilter ***\n...')
    try:
        cur.execute(t_sql)
        result = cur.fetchone()
        while result is not None:
            bloom.add(result)
            result = cur.fetchone()
        print('Finished add.\n')
    except Exception as exc:
        # Without a populated filter the comparison below is meaningless
        print("Error: unable to fetch data.", exc)
        return
    finally:
        cur.close()
        conn.close()

    # Step 2: check every row of the source (SQL Server) table against the filter.
    # A row that is not in the filter is definitely missing from the target;
    # a row that is in the filter is only probably present (false positives).
    print(timenow(), '\n*** Compare source to target data ***\n...')
    conn = mssql_conn()
    cur = conn.cursor()
    try:
        cur.execute(s_sql)
        num = 0
        result = cur.fetchone()
        while result is not None:
            if result not in bloom:
                print('{} is not in the bloom filter,not in Target table {}.'.format(result, tab))
                num += 1
            result = cur.fetchone()
        if num == 0:
            print('Result: {} ==> Target table data matches source table data.'.format(tab))
        else:
            print('\nResult: Need to compare output to repair data.')
    except Exception as exc:
        print("Error: unable to fetch data.", exc)
    finally:
        cur.close()
        conn.close()


if __name__ == '__main__':
    tab = '***'
    # Both queries must render each row as the same single string
    t_sql = 'select concat(***, ***, ***, UpdateDate) from ***;'
    s_sql = "select convert(varchar(20),***)+convert(varchar(20),***)+convert(varchar(20),***,20)+convert(varchar(25),UpdateDate,21)+'000' from ***"
    print('#Start:', timenow(), '\n')
    bloomf(tab, t_sql, s_sql)
    print('\n#End:', timenow())

 


Origin www.cnblogs.com/geaozhang/p/11373241.html