I. Bloom filter
1. Wikipedia
The Bloom filter was proposed by Burton Howard Bloom in 1970.
It is essentially a long binary vector combined with a series of random mapping (hash) functions. A Bloom filter can be used to test whether an element is in a set.
Its advantage is that the keys themselves do not need to be stored, so it saves space, and its space efficiency and query time are far better than those of general-purpose algorithms. Its disadvantages are a certain false-positive rate and the difficulty of deleting elements.
2. Principle
To determine whether an element is in a collection, the usual approach is to store every element of the collection and then decide by comparison.
Linked lists, trees, and hash tables all follow this line of thinking. As the collection grows, however, the storage they require keeps growing, and retrieval can slow down as well; their retrieval time complexities are O(n), O(log n), and O(1), respectively.
The principle of the Bloom filter is this: when an element is added to the collection, k hash functions map the element to k positions in a bit array, and those positions are set to 1. When searching, it is enough to check whether these positions are all 1. If any position is 0, the element is definitely not in the collection; if all of them are 1, the element is probably in the collection (this "probably" is where errors come from).
3. My own understanding
Intuitively, the Bloom algorithm resembles a HashSet (which derives an element's address through a hash function and compares hash addresses to tell whether two objects map to the same address) and is used to determine whether an element (key) is in a collection.
Unlike a general HashSet, the Bloom filter algorithm does not store the key's value. For each key it uses only k bits, each storing a flag, to judge whether the key is in the collection.
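To make the "no key stored" point a little more concrete, here is a rough size comparison. It is only a sketch: the number of keys, the key format, and the budget of 10 bits per key are all assumptions chosen for illustration, and the set size is measured with `sys.getsizeof`, which is specific to CPython.

```python
import sys

n = 10_000  # number of keys (arbitrary choice for illustration)
keys = [f"user{i}@example.com" for i in range(n)]

# A HashSet-style structure keeps every key object plus its hash table.
key_set = set(keys)
set_bytes = sys.getsizeof(key_set) + sum(sys.getsizeof(k) for k in key_set)

# A Bloom filter keeps only a bit array; 10 bits per key is an assumed budget.
bits_per_key = 10
bloom_bytes = (n * bits_per_key + 7) // 8

print(f"set of keys : {set_bytes} bytes")
print(f"bloom filter: {bloom_bytes} bytes")
```

The set has to hold the keys themselves, while the filter's size depends only on the number of bits reserved per key.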
II. Algorithm analysis
1. BloomFilter process
1. First, k hash functions are needed; each one maps a key to an integer.
2. At initialization, allocate an array of m bits, with every bit set to 0.
3. When a key is added to the set, compute its k hash values with the k hash functions and set the corresponding bits in the array to 1.
4. To determine whether a key is in the set, compute its k hash values with the k hash functions and look up the corresponding bits in the array. If all of the bits are 1, the key is considered to be in the set; if any bit is 0, it is definitely not.
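The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name, the bit-array length (called `m` here), and the use of SHA-256 salted with an index to stand in for k independent hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array of length m and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # step 2: every bit starts at 0

    def _positions(self, key: str):
        # Step 1: k hash functions, each mapping the key to an integer.
        # Assumption: SHA-256 salted with the index i plays the role of
        # the i-th hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        # Step 3: set the k corresponding bits to 1.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # Step 4: the key can be present only if all k bits are 1.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bf = BloomFilter(m=1024, k=3)
for word in ("x", "y", "z"):
    bf.add(word)

print("x" in bf)  # True: an added key is always reported as present
print("w" in bf)  # very likely False; True here would be a false positive
```

Note that there is no `remove` method: clearing a bit could also erase evidence of another key that shares it, which is why deletion is difficult.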
2. On hash collisions
Assume the hash function is a good one. If the bit array has m positions and we want to keep the collision rate down to, say, 1%, a single-hash table can hold only m / 100 elements. Obviously, that cannot be called space-efficient. The solution is to use multiple hash functions: if any one of them says the element is not in the set, it is certainly not in the set. If they all say it is, there is some chance that they are lying, but intuitively the probability of that is fairly low. This is exactly the BloomFilter process described above.
A Bloom filter is based on an m-bit bit vector (b1, ..., bm) whose bits are all initially 0. In addition, there is a series of hash functions (h1, ..., hk), each with values in the range 1 to m.
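The space argument can be checked with a little arithmetic. The sketch below compares bits per element for a single hash function against a Bloom filter at the same 1% error target. The Bloom filter figure uses the standard approximation m/n = -ln(p) / (ln 2)^2, which assumes an optimal number of hash functions; the function names are made up for this example.

```python
import math


def single_hash_bits_per_element(p: float) -> float:
    # One hash function, one bit per element: with n elements in m bits,
    # a new key hits an occupied bit with probability about n / m,
    # so keeping that at p needs m / n = 1 / p bits per element.
    return 1.0 / p


def bloom_bits_per_element(p: float) -> float:
    # Standard approximation with the optimal number of hash functions:
    # m / n = -ln(p) / (ln 2)^2
    return -math.log(p) / (math.log(2) ** 2)


p = 0.01
single = single_hash_bits_per_element(p)
bloom = bloom_bits_per_element(p)
print(f"single hash : {single:.1f} bits per element")
print(f"bloom filter: {bloom:.2f} bits per element")
print(f"hash functions needed: about {bloom * math.log(2):.1f}")
```

Under these assumptions the single-hash scheme needs about 100 bits per element, while the Bloom filter needs fewer than 10.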
3. Algorithm illustration
Figure: a Bloom filter after inserting {x, y, z}, then queried to determine whether the value w is in the data set.
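Analysis: m = 18, k = 3. When x is inserted, the three hash functions produce the three values shown by the blue lines, and the corresponding bits of the bit vector are set to 1. When y and z are inserted, the bits indicated by the red and purple lines are set to 1 in the same way. On lookup, the three hash values of x all point to bits that are 1, so x is judged to be in the data set; the same holds for y and z. When w is looked up, however, one of its hash values points to a bit that is 0, so w is definitely not in the set. But suppose that last hash value of w had pointed to a bit that is 1: w would then be judged to be in the set even though it may never have been inserted, so false positives can occur. Clearly, the more data is inserted, the more bits are 1 and the higher the probability of a false positive.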
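The walkthrough above can be replayed with explicit bit positions. The positions below are hypothetical values chosen for illustration (the figure's actual hash outputs are not given in the text); only m = 18 and k = 3 come from the example.

```python
M = 18  # bit-vector length from the example

# Hypothetical hash outputs: three 0-based positions per element.
# These specific numbers are assumptions, not values from the figure.
positions = {
    "x": (1, 5, 13),
    "y": (4, 11, 16),
    "z": (3, 5, 11),
    "w": (4, 13, 15),
}

bits = [0] * M
for element in ("x", "y", "z"):
    for pos in positions[element]:
        bits[pos] = 1


def might_contain(pos_triple):
    # An element can be present only if all three of its bits are 1.
    return all(bits[p] == 1 for p in pos_triple)


print(might_contain(positions["x"]))  # True: x was inserted
print(might_contain(positions["w"]))  # False: bit 15 is still 0
print(might_contain((4, 13, 16)))     # True: a false positive
```

The last query stands for a key that was never inserted but whose three positions happen to be covered by bits that x and y set.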
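The Wikipedia entry on Bloom filters has a detailed analysis of the false-positive probability (section "Probability of false positives"). The analysis shows that, for a given number of bits per element m/n (n being the number of inserted elements), a suitably chosen k keeps the false-positive probability quite small. The optimum is k = (m/n) ln 2; pushing k far beyond that fills the bit vector faster and raises the rate again.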
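As a rough numerical check of that analysis, the sketch below evaluates the usual approximation p = (1 - e^(-kn/m))^k for several values of k. The sizes m = 10,000 and n = 1,000 are assumptions picked for illustration.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false-positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


m, n = 10_000, 1_000  # assumed sizes: 10 bits per element
for k in (1, 3, 7, 15, 30):
    print(k, f"{false_positive_rate(m, n, k):.5f}")

best_k = round(m / n * math.log(2))  # the optimum is k = (m/n) * ln 2
print("best k:", best_k)
```

The rate falls as k grows toward the optimum and then rises again, because each insertion sets more bits and the vector fills up.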
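III. Applications of BloomFilter
1. Some application scenarios
Blacklists: for example, an email blacklist filter that checks whether an email address is on the blacklist.
Sorting (BitSet only).
Web crawlers: checking whether a URL has already been crawled.
Fast key-existence checks in K-V systems: a typical example is HBase. Each Region in HBase contains a BloomFilter, which is used at query time to quickly judge whether a key exists in that region. If it does not, the query returns immediately and the subsequent lookup work is saved.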
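The key-value use case can be sketched as a filter placed in front of an expensive lookup. Everything here is hypothetical: `FilteredStore` is not HBase's API, the dictionary stands in for slow storage, and the filter size and hash count are arbitrary. The k positions are derived from one SHA-256 digest by double hashing, a common shortcut that is also an assumption of this example.

```python
import hashlib

M, K = 8192, 4  # assumed filter size (bits) and number of hash functions


def positions(key: str):
    # Double hashing: derive K positions from two 64-bit halves of one digest.
    d = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1  # force an odd step
    return [(h1 + i * h2) % M for i in range(K)]


class FilteredStore:
    """Hypothetical key-value store that consults a Bloom filter first."""

    def __init__(self):
        self.bits = 0          # a Python int used as an M-bit vector
        self.data = {}         # stands in for the expensive on-disk lookup
        self.slow_lookups = 0  # counts how often the slow path is taken

    def put(self, key, value):
        for p in positions(key):
            self.bits |= 1 << p
        self.data[key] = value

    def get(self, key):
        if any(not (self.bits >> p) & 1 for p in positions(key)):
            return None        # definitely absent: skip the slow path
        self.slow_lookups += 1
        return self.data.get(key)


store = FilteredStore()
store.put("row-001", "alice")
print(store.get("row-001"))  # alice
print(store.get("row-999"))  # None
print("slow lookups:", store.slow_lookups)
```

A false positive only costs one unnecessary slow lookup; it never produces a wrong answer, because the real store is still consulted.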
2. Consistency check (ConsistencyCheck)
Background: database migration (SQL Server to MySQL), followed by a data consistency check after the migration.
Design: use a BloomFilter to perform the ConsistencyCheck.
Process:
① Migrate
② Hash the MySQL table rows into a BloomFilter
③ Check the SQL Server table rows against the BloomFilter
3. Python code:
import time

import pymssql
import pymysql
from bloompy import ScalableBloomFilter


def timenow():
    """Return the current local time as 'YYYY-MM-DD HH:MM:SS'."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())


# Configure the SQL Server (source) connection.
def mssql_conn():
    return pymssql.connect(
        server="***",
        user="***",
        password="***",
        database="***")


# Configure the MySQL (target) connection.
def mysql_conn():
    return pymysql.connect(
        host="***",
        port=3306,
        user="***",
        password="***",
        database="***")


def bloomf():
    bloom = ScalableBloomFilter(initial_capacity=100, error_rate=0.001,
                                mode=ScalableBloomFilter.LARGE_SET_GROWTH)

    # Step 1: add every row of the target (MySQL) table to the Bloom filter.
    conn = mysql_conn()
    cur = conn.cursor()
    print('*** Target table data add to BloomFilter ***\n...')
    try:
        cur.execute(t_sql)
        result = cur.fetchone()
        while result is not None:
            bloom.add(result)
            result = cur.fetchone()
        print('Finished add.\n')
    except Exception as exc:
        print("Error: unable to fetch data.", exc)
        return  # an incomplete filter would report spurious mismatches
    finally:
        cur.close()
        conn.close()

    # Step 2: look up every row of the source (SQL Server) table in the filter.
    print(timenow(), '\n*** Compare source to target data ***\n...')
    conn = mssql_conn()
    cur = conn.cursor()
    try:
        cur.execute(s_sql)
        num = 0  # number of source rows not found in the target
        result = cur.fetchone()
        while result is not None:
            if result not in bloom:
                print('{} is not in the bloom filter,not in Target table {}.'.format(result, tab))
                num += 1
            result = cur.fetchone()
        if num == 0:
            print('Result: {} ==> Target table data matches source table data.'.format(tab))
        else:
            print('\nResult: Need to compare output to repair data.')
    except Exception as exc:
        print("Error: unable to fetch data.", exc)
    finally:
        cur.close()
        conn.close()


if __name__ == '__main__':
    # tab, t_sql and s_sql are module-level names read by bloomf().
    tab = '***'
    t_sql = 'select concat(***, ***, ***, UpdateDate) from ***;'
    s_sql = "select convert(varchar(20),***)+convert(varchar(20),***)+convert(varchar(20),***,20)+convert(varchar(25),UpdateDate,21)+'000' from ***"
    print('#Start:', timenow(), '\n')
    bloomf()
    print('\n#End:', timenow())