[C++] Hash applications: Bloom filter

Table of contents

1. Introduction to Bloom filter

2. Bloom filter implementation
    1. Bloom filter
    2. Three hash functions
        (1) BKDR Hash
        (2) AP Hash
        (3) DJB Hash
    3. Set
    4. Test
    5. Delete

3. Complete code snippet

4. Advantages and Disadvantages of Bloom Filter
    1. Advantages
    2. Disadvantages

5. Bloom filter application
    1. Find file intersections
        (1) Approximation algorithm
        (2) Accurate algorithm
    2. Extended Bloom filter

6. Hash Application
    1. Find the IP with the most occurrences
    2. Find the IP of top K
    3. Use Linux system commands to find the IP of top K

1. Introduction to Bloom filter

Bitmaps are easy to use, save space, and are highly efficient. Their drawback is that they can only handle integers.

Suppose you want to check whether a nickname (a string) is already taken, using one bit per nickname. In a hash table, elements that collide at the same position can be chained into a linked list. A bit in a bitmap, however, cannot hold a pointer, so nothing can be chained from it and hash collisions cannot be resolved. Storing the strings themselves in a hash table would work, but it wastes space.

So, for non-integer keys such as strings, can hashing and a bitmap be combined into a data structure that is as space-efficient as a bitmap and can still tell whether a key is present? Yes: the Bloom filter.

A Bloom filter is a compact, clever probabilistic data structure that supports efficient insertion and membership queries. A query answers either "definitely absent" or "probably present". The filter uses multiple hash functions to map each element to several bits of a bitmap, which both speeds up queries and saves space.

Mapping each element to several bits does not eliminate misjudgments (false positives), but it makes them unlikely: a key is misjudged only when every one of its mapped bits has already been set by other elements. For example, if "Huashan" has not been stored yet but all of the bits it maps to are already 1, a query will wrongly report that "Huashan" is present. Because this requires all of those bits to be occupied at once, the probability is relatively low.
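To get a feel for how low that probability is, the widely used approximation p ≈ (1 - e^(-k*n/m))^k can be evaluated directly. The sketch below is not part of the original code: the function name estimateFalsePositiveRate and its parameters (m bits, n inserted elements, k hash functions) are illustrative assumptions, and the formula itself assumes hash functions that behave like independent, uniform hashes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Approximate false-positive probability of a Bloom filter with
// m bits, n inserted elements and k hash functions:
//   p ~= (1 - e^(-k*n/m))^k
// Assumption: the hash functions behave like independent, uniform hashes.
double estimateFalsePositiveRate(std::size_t m, std::size_t n, std::size_t k)
{
    assert(m > 0 && k > 0);
    const double exponent = -static_cast<double>(k) * static_cast<double>(n)
                            / static_cast<double>(m);
    return std::pow(1.0 - std::exp(exponent), static_cast<double>(k));
}
```

For a filter of 100 bits holding four strings with three hash functions (the sizes used in the test code of this article), the estimate comes out at roughly 0.14%. It grows quickly as more elements are inserted into the same number of bits.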

2. Bloom filter implementation

1. Bloom filter

The class needs only one data member, a bitmap:

#define  _CRT_SECURE_NO_WARNINGS  1
#pragma once
#include "BitSet.h"
#include <string>
using namespace std;

template<size_t N, class K = string, class Hash1 = HashBKDR, class Hash2 = HashAP, class Hash3 = HashDJB>
class BloomFilter
{
private:
	delia::BitSet<N> _bitset;
};

2. Three hash functions

To reduce collisions, each key is hashed with three different hash algorithms. The following three are used here:

(1) BKDR Hash 

struct HashBKDR
{
	size_t operator()(const string& s)
	{
		size_t value = 0;
		for (auto e : s)
		{
			value += e;
			value *= 131;
		}

		return value;
	}
};

(2) AP Hash

struct HashAP
{
	size_t operator()(const string& s)
	{
		size_t hash = 0; // `register` dropped: the keyword was removed in C++17
		size_t ch;
		for (size_t i = 0; i < s.size(); i++) // size_t avoids a signed/unsigned comparison
		{
			ch = s[i];
			if ((i & 1) == 0)
			{
				hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
			}
			else
			{
				hash ^= (~(hash << 11) ^ ch ^ (hash >> 5));
			}
		}

		return hash;
	}
};

(3) DJB Hash

struct HashDJB
{
	size_t operator()(const string& s)
	{
		size_t hash = 5381; // `register` dropped: the keyword was removed in C++17
		for (auto e : s)
		{
			hash += (hash << 5) + e;
		}

		return hash;
	}
};

3. Set

Use the three hash functions to compute the three bit positions for the key, and set all three bits to 1:

	void Set(const K& key)
	{
		size_t i1 = Hash1()(key) % N; // equivalent to: Hash1 hf1; size_t i1 = hf1(key) % N;
		size_t i2 = Hash2()(key) % N;
		size_t i3 = Hash3()(key) % N;

		cout << i1 << " " << i2 << " " << i3 << endl;

		_bitset.set(i1);
		_bitset.set(i2);
		_bitset.set(i3);
	}

4. Test

Compute the three bit positions with the three hash functions. As soon as one of the bits is found to be 0, the key is definitely absent, so return false:

	bool Tests(const K& key)
	{
		size_t i1 = Hash1()(key) % N;
		if (_bitset.test(i1) == false)
		{
			return false;
		}

		size_t i2 = Hash2()(key) % N;
		if (_bitset.test(i2) == false)
		{
			return false;
		}

		size_t i3 = Hash3()(key) % N;
		if (_bitset.test(i3) == false)
		{
			return false;
		}

		return true; // may be a false positive, e.g. "Huashan"
	}
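
Because BitSet.h is not shown in this article, the Set and Tests logic above can also be tried out with std::bitset standing in for delia::BitSet. This is only a sketch under that assumption: the names MiniBloomFilter, hashBKDR, hashAP, hashDJB and demoMiniBloomFilter are illustrative, and the hashes read characters as unsigned char, so non-ASCII keys may land on different bits than with the structs above.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// Three string hashes in the spirit of the BKDR, AP and DJB functors above.
inline std::size_t hashBKDR(const std::string& s)
{
    std::size_t value = 0;
    for (unsigned char c : s) { value += c; value *= 131; }
    return value;
}

inline std::size_t hashAP(const std::string& s)
{
    std::size_t hash = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const std::size_t ch = static_cast<unsigned char>(s[i]);
        if ((i & 1) == 0) hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
        else              hash ^= (~(hash << 11) ^ ch ^ (hash >> 5));
    }
    return hash;
}

inline std::size_t hashDJB(const std::string& s)
{
    std::size_t hash = 5381;
    for (unsigned char c : s) { hash += (hash << 5) + c; }
    return hash;
}

// Hypothetical stand-in for BloomFilter<N>: std::bitset replaces delia::BitSet.
template <std::size_t N>
class MiniBloomFilter
{
public:
    // Set the three bits the key maps to.
    void Set(const std::string& key)
    {
        _bits.set(hashBKDR(key) % N);
        _bits.set(hashAP(key) % N);
        _bits.set(hashDJB(key) % N);
    }

    // false: definitely absent; true: probably present (may be a false positive).
    bool Tests(const std::string& key) const
    {
        return _bits.test(hashBKDR(key) % N)
            && _bits.test(hashAP(key) % N)
            && _bits.test(hashDJB(key) % N);
    }

private:
    std::bitset<N> _bits;
};

// Usage sketch: a key that was inserted is always reported as present.
inline void demoMiniBloomFilter()
{
    MiniBloomFilter<1000> bf;
    bf.Set("Bell Tower");
    assert(bf.Tests("Bell Tower"));
}
```

The reverse guarantee does not hold: a key that was never inserted is usually reported as absent, but it can be reported as present when its three bits were all set by other keys.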

5. Delete

A Bloom filter cannot directly support deletion, because deleting one element may affect other elements.

For example, suppose "Bell Tower" and "Huashan" share one or more bit positions. If you delete "Bell Tower" by clearing its bits to 0, "Huashan" is effectively deleted as well, because the bit positions that the hash functions compute for the two elements happen to overlap.

3. Complete code snippet

BloomFilter.h: 

#pragma once
#include "BitSet.h"
#include <iostream> // cout and endl are used in Set
#include <string>
using namespace std;

// BKDR hash
struct HashBKDR
{
	size_t operator()(const string& s)
	{
		size_t value = 0;
		for (auto e : s)
		{
			value += e;
			value *= 131;
		}

		return value;
	}
};

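// AP hash
struct HashAP
{
	size_t operator()(const string& s)
	{
		size_t hash = 0; // `register` dropped: the keyword was removed in C++17
		size_t ch;
		for (size_t i = 0; i < s.size(); i++) // size_t avoids a signed/unsigned comparison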
		{
			ch = s[i];
			if ((i & 1) == 0)
			{
				hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
			}
			else
			{
				hash ^= (~(hash << 11) ^ ch ^ (hash >> 5));
			}
		}

		return hash;
	}
};

// DJB hash
struct HashDJB
{
	size_t operator()(const string& s)
	{
		size_t hash = 5381; // `register` dropped: the keyword was removed in C++17
		for (auto e : s)
		{
			hash += (hash << 5) + e;
		}

		return hash;
	}
};

template<size_t N, class K = string, class Hash1 = HashBKDR, class Hash2 = HashAP, class Hash3 = HashDJB>
class BloomFilter
{
public:
	void Set(const K& key)
	{
		size_t i1 = Hash1()(key) % N; // equivalent to: Hash1 hf1; size_t i1 = hf1(key) % N;
		size_t i2 = Hash2()(key) % N;
		size_t i3 = Hash3()(key) % N;

		cout << i1 << " " << i2 << " " << i3 << endl;

		_bitset.set(i1);
		_bitset.set(i2);
		_bitset.set(i3);
	}

	bool Tests(const K& key)
	{
		size_t i1 = Hash1()(key) % N;
		if (_bitset.test(i1) == false)
		{
			return false;
		}

		size_t i2 = Hash2()(key) % N;
		if (_bitset.test(i2) == false)
		{
			return false;
		}

		size_t i3 = Hash3()(key) % N;
		if (_bitset.test(i3) == false)
		{
			return false;
		}

		return true; // may be a false positive, e.g. "Huashan"
	}
private:
	delia::BitSet<N> _bitset;
};

void TestBloomFilter()
{
	BloomFilter<100> bf;
	bf.Set("大雁塔"); // Big Wild Goose Pagoda
	bf.Set("钟楼");   // Bell Tower
	bf.Set("兵马俑"); // Terracotta Army
	bf.Set("华山");   // Huashan
}

Test.cpp

#define  _CRT_SECURE_NO_WARNINGS  1
#include "BloomFilter.h"

int main()
{
	TestBloomFilter();
	return 0;
}

4. Advantages and Disadvantages of Bloom Filter

1. Advantages

(1) Adding and querying an element both take O(K) time, where K is the number of hash functions (generally small), independent of the amount of data stored.
(2) The hash functions are independent of each other, which makes them easy to evaluate in parallel in hardware.
(3) A Bloom filter does not store the elements themselves, which is a big advantage where confidentiality requirements are strict.
(4) When a certain false positive rate is acceptable, a Bloom filter has a large space advantage over other data structures.
(5) When the data set is very large, a Bloom filter can still represent the whole set in memory, where structures that store the elements themselves cannot.
(6) Bloom filters of the same size that use the same set of hash functions can be combined with bitwise operations: OR yields the union, and AND yields an approximation of the intersection. Set difference cannot be computed reliably this way.

2. Disadvantages 

(1) There is a false positive rate (False Positive): the filter cannot say with certainty that an element is in the set (remedy: keep a separate whitelist of the data that may be misjudged).
(2) The elements themselves cannot be retrieved.
(3) In general, elements cannot be deleted from a Bloom filter.
(4) If deletion is implemented with counters, the counters may wrap around.

5. Bloom filter application

1. Find file intersections 

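Given two files, each containing 10 billion queries, and only 1 GB of memory, how do you find the intersection of the two files? Give an approximate algorithm and an exact algorithm.

(1) Approximation algorithm

Finding the intersection essentially means testing membership. Read the first file and add every query in it to a Bloom filter, then read the second file and test each of its queries against the filter. If the test succeeds, the query is treated as part of the intersection. Because of false positives, the result may include a few queries that are not actually in the first file, which is why this is only an approximation.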

(2) Accurate algorithm

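Assuming that each query is 20 bytes, 10 billion queries take 10 billion * 20 bytes = 200 billion bytes, or about 200 GB per file. That is far more than 1 GB of memory, so hash segmentation is used: apply the same hash function to every query in both files, and write each query to the small file whose index is the hash value modulo the number of pieces. Identical queries always land in small files with the same index, so only small files with equal indices need to be intersected, one pair at a time, in memory.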
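
The hash-segmentation idea can be sketched in memory as follows. This is an illustration under stated assumptions, not the article's implementation: vectors stand in for the two large files, in-memory buckets stand in for the small files on disk, std::hash stands in for whatever hash function is chosen, and the names partitionByHash and exactIntersection are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Split `queries` into `pieces` buckets; equal strings always get the same index.
std::vector<std::vector<std::string>> partitionByHash(
    const std::vector<std::string>& queries, std::size_t pieces)
{
    assert(pieces > 0);
    std::vector<std::vector<std::string>> buckets(pieces);
    for (const std::string& q : queries)
    {
        buckets[std::hash<std::string>{}(q) % pieces].push_back(q);
    }
    return buckets;
}

// Exact intersection: only bucket i of A and bucket i of B can share queries,
// so each pair is intersected on its own with a hash set.
std::vector<std::string> exactIntersection(
    const std::vector<std::string>& fileA,
    const std::vector<std::string>& fileB,
    std::size_t pieces)
{
    const auto bucketsA = partitionByHash(fileA, pieces);
    const auto bucketsB = partitionByHash(fileB, pieces);

    std::vector<std::string> result;
    for (std::size_t i = 0; i < pieces; ++i)
    {
        const std::unordered_set<std::string> seen(bucketsA[i].begin(), bucketsA[i].end());
        std::unordered_set<std::string> emitted; // report each common query once
        for (const std::string& q : bucketsB[i])
        {
            if (seen.count(q) != 0 && emitted.insert(q).second)
            {
                result.push_back(q);
            }
        }
    }
    return result;
}
```

On real data the buckets would be written to disk and only one pair loaded at a time, so the number of pieces has to be large enough for a pair of small files to fit in the available memory.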

 

2. Extended Bloom filter

How can a Bloom filter be extended so that it supports deleting elements?

A Bloom filter does not support deletion, for two reasons. First, a query may be a false positive: the filter may report an element as present when it was never inserted, and clearing its bits would damage other elements. Second, several elements may map to the same bit, so clearing that bit for one element may effectively delete the others.

However, the following method makes a Bloom filter support deletion:

Replace each bit of the Bloom filter with a counter instead of a 0/1 flag. When an element is inserted, every position it maps to is incremented (++). When an element is deleted, after first confirming that the filter reports it as present, every position it maps to is decremented (--). A position that several elements map to therefore stays nonzero until all of them have been deleted.
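
A counting variant along these lines might look like the sketch below. Everything in it is an assumption made for illustration: the class name CountingBloomFilter, the method name Reset, the 8-bit counters, and the hash family (std::hash applied to the key plus a one-character salt) are not from the original code. Counters saturate at their maximum instead of wrapping around, at the cost of never being decremented again once saturated.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Counting Bloom filter sketch: each slot is a small counter instead of one bit.
template <std::size_t N>
class CountingBloomFilter
{
public:
    void Set(const std::string& key)
    {
        for (std::size_t seed = 0; seed < kHashes; ++seed)
        {
            std::uint8_t& c = _counts[index(key, seed)];
            if (c < UINT8_MAX) ++c; // saturate instead of wrapping around
        }
    }

    // false: definitely absent; true: probably present.
    bool Tests(const std::string& key) const
    {
        for (std::size_t seed = 0; seed < kHashes; ++seed)
        {
            if (_counts[index(key, seed)] == 0) return false;
        }
        return true;
    }

    // Delete a key. Only meaningful for keys that were really inserted:
    // deleting a key that is merely a false positive damages other keys.
    void Reset(const std::string& key)
    {
        if (!Tests(key)) return;
        for (std::size_t seed = 0; seed < kHashes; ++seed)
        {
            std::uint8_t& c = _counts[index(key, seed)];
            if (c > 0 && c < UINT8_MAX) --c; // saturated counters stay put
        }
    }

private:
    static constexpr std::size_t kHashes = 3;

    // Illustrative hash family: salt the key with the seed, then use std::hash.
    static std::size_t index(const std::string& key, std::size_t seed)
    {
        return std::hash<std::string>{}(key + static_cast<char>('A' + seed)) % N;
    }

    std::array<std::uint8_t, N> _counts{};
};

// Usage sketch: deleting one key leaves another inserted key intact.
inline void demoCountingBloomFilter()
{
    CountingBloomFilter<1000> bf;
    bf.Set("Bell Tower");
    bf.Set("Huashan");
    bf.Reset("Bell Tower");
    assert(bf.Tests("Huashan"));
}
```

The price is space: with 8-bit counters this sketch uses eight times the memory of a plain bitmap of the same length.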

6. Hash Application

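Given a log file larger than 100 GB that stores IP addresses, design an algorithm to find the IP address that occurs most often. Under the same conditions, how do you find the top K IPs? How can this be done directly with Linux system commands?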

1. Find the IP with the most occurrences

(1) A file larger than 100 GB cannot be loaded into memory, so it has to be split by hashing. Each IP in the log file is converted to an integer by a hash function, and that integer, modulo the number of small files, decides which small file the IP is written to. Identical IPs produce the same integer, so all occurrences of an IP end up in the same small file.

(2) Each small file can then be loaded into memory. For each one, use map<string,int> to count how many times every IP in it occurs, and find the IP with the highest count.

(3) Compare the most frequent IP of each small file. The one with the highest count is the most frequent IP in the whole log, because the count of an IP within its small file is its total count.

In addition: suppose the file is divided into 100 small files. Hashing does not divide it evenly, so some small files may be smaller than 1 GB and some larger. If dozens of them turn out to be larger than 1 GB, consider dividing the file into 200 parts instead of 100, so that each small file is about 512 MB on average.
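
The three steps can be sketched in memory like this. It is a simplified illustration, not the article's code: a vector stands in for the log file, buckets stand in for the small files, std::unordered_map is used in place of map<string,int>, and the function name mostFrequentIp and the parameter pieces are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns the most frequent IP and its count ("" and 0 for an empty log).
// `log` stands in for the big file; `pieces` buckets stand in for the small files.
std::pair<std::string, std::size_t> mostFrequentIp(
    const std::vector<std::string>& log, std::size_t pieces)
{
    assert(pieces > 0);

    // Step 1: hash partitioning. Every occurrence of an IP lands in one bucket.
    std::vector<std::vector<std::string>> buckets(pieces);
    for (const std::string& ip : log)
    {
        buckets[std::hash<std::string>{}(ip) % pieces].push_back(ip);
    }

    // Steps 2 and 3: count inside each bucket and keep the best (ip, count) so far.
    std::pair<std::string, std::size_t> best("", 0);
    for (const auto& bucket : buckets)
    {
        std::unordered_map<std::string, std::size_t> counts;
        for (const std::string& ip : bucket)
        {
            const std::size_t c = ++counts[ip];
            if (c > best.second) best = std::make_pair(ip, c);
        }
    }
    return best;
}
```

Only one bucket's counting table is alive at a time, which is the point of the split: memory use is bounded by the largest small file rather than by the whole log.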

2. Find the IP of top K

(1) A heap cannot be built over the whole 100 GB file in memory, so the file still has to be divided into small files. As before, a hash function divides the 100 GB file into 100 small files.

(2) Load the first small file into memory, count the occurrences of each IP, and build a min-heap of K elements keyed by count. Any IP whose count is larger than the count at the top of the heap replaces the top element. What remains in the heap are the K most frequent IPs of the first small file.

(3) Load the remaining small files into memory one by one. For each small file, count its IPs and compare every count with the top of the heap, replacing the top element whenever the count is larger. When all small files have been processed, the heap holds the K IPs that appear the most.
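
A minimal in-memory sketch of these steps, using std::priority_queue as the min-heap, is shown below. As with the earlier examples, the vector standing in for the log file, the use of std::hash and std::unordered_map, and the names IpCount and topKIps are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using IpCount = std::pair<std::size_t, std::string>; // (count, ip): ordered by count first

// Returns up to k (count, ip) pairs, most frequent first.
std::vector<IpCount> topKIps(const std::vector<std::string>& log,
                             std::size_t pieces, std::size_t k)
{
    assert(pieces > 0);

    // Hash partitioning: all occurrences of an IP land in the same bucket.
    std::vector<std::vector<std::string>> buckets(pieces);
    for (const std::string& ip : log)
    {
        buckets[std::hash<std::string>{}(ip) % pieces].push_back(ip);
    }

    // Min-heap: top() is the smallest of the best counts kept so far.
    std::priority_queue<IpCount, std::vector<IpCount>, std::greater<IpCount>> heap;
    for (const auto& bucket : buckets)
    {
        std::unordered_map<std::string, std::size_t> counts;
        for (const std::string& ip : bucket) ++counts[ip];

        for (const auto& entry : counts)
        {
            const IpCount candidate(entry.second, entry.first);
            if (heap.size() < k)
            {
                heap.push(candidate);
            }
            else if (k > 0 && candidate.first > heap.top().first)
            {
                heap.pop(); // evict the current smallest count
                heap.push(candidate);
            }
        }
    }

    // The heap pops in ascending order; reverse so the most frequent IP comes first.
    std::vector<IpCount> ascending;
    while (!heap.empty())
    {
        ascending.push_back(heap.top());
        heap.pop();
    }
    return std::vector<IpCount>(ascending.rbegin(), ascending.rend());
}
```

The heap never holds more than K entries, so its memory cost is independent of the size of the log. IPs with equal counts at the cut-off are kept or dropped in an unspecified order.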

3. Use Linux system commands to find the IP of top K

If there is the following file IP.log:

192.168.1.5
69.52.220.44
10.152.16.23
192.168.3.10
192.168.1.4
192.168.2.1
192.168.0.9
10.152.16.23
192.163.0.9
192.168.0.9
69.52.220.44
192.168.1.4
192.168.1.5
192.163.0.9
192.168.2.1
192.168.0.1
192.168.0.2
192.168.0.9
9.9.9.9
127.0.0.1
192.168.0.90
192.168.0.89
192.168.0.8
192.168.0.9
192.163.0.9

(1) Sort the lines and write the result to standard output

sort filename

(2) Collapse adjacent identical lines and prefix each with the number of times it occurred

uniq -c

(3) Sort numerically by that count, in descending order

sort -nr

(4) View the first K lines

head -n K

Chaining the four steps with pipes displays the top K IPs with the most occurrences:

sort IP.log | uniq -c | sort -nr | head -n K


Origin blog.csdn.net/gx714433461/article/details/127006102