Detailed explanation of C++: Bloom filter

Advantages and Disadvantages of Bitmaps

Advantages of bitmaps:
1. Bitmaps save space. A bitmap uses a single bit to record whether a value is present, whereas red-black trees and hash tables have to store the value itself plus their own bookkeeping, which requires a lot of extra space.
2. Bitmaps are very efficient. Checking whether a value exists takes only O(1) time, compared with O(log n) for a red-black tree.
Disadvantages of bitmaps:
1. The data range generally needs to be fairly concentrated. If the values are widely scattered, most of the allocated bits go unused and space consumption rises.
2. Bitmaps can only be used directly for integer data and are a poor fit for other types. For example, a given range contains far more representable floating-point values than integers, so storing floating-point numbers in a bitmap is impractical.

Why there are Bloom filters:

If we want to record whether a string has appeared, we can also use a bitmap. A bitmap can only store integers and a string is not an integer, so we first use a hash function to convert the string into an integer and then mark that position. There is still a problem, though: there are infinitely many strings but only a limited number of integers, so different strings will sometimes map to the same position. Suppose, for example, that the hash function converts the strings aceg and bdfh to the same integer. If bdfh was inserted earlier and aceg never was, a lookup of aceg will still report that it exists even though it does not. So when the bit at the mapped position is 0, the data is definitely absent and there is no misjudgment; when the bit is 1, the data may exist, or the result may be a false positive.

To reduce this problem, the Bloom filter was proposed. It simply maps each item through several hash functions instead of one. Collisions still happen, but as long as any one of the positions given by these hash functions is 0, the data is definitely not present.
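The single-hash collision described above can be reproduced with a toy example. The sketch below assumes a deliberately weak hash (the sum of the character codes modulo a 4-bit bitmap), chosen only because it makes aceg and bdfh land on the same bit. The names toy_hash, kBits and reported_present are hypothetical and are not part of the article's code.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

constexpr std::size_t kBits = 4;  // tiny bitmap, chosen only for the demo

// Toy hash (an assumption for illustration, deliberately weak):
// sum of the character codes modulo the bitmap size.
std::size_t toy_hash(const std::string& s) {
    std::size_t sum = 0;
    for (unsigned char ch : s) sum += ch;
    return sum % kBits;
}

// Inserts only `inserted` into a fresh bitmap, then reports what a
// lookup of `queried` says.
bool reported_present(const std::string& inserted, const std::string& queried) {
    std::bitset<kBits> bitmap;
    bitmap.set(toy_hash(inserted));
    return bitmap.test(toy_hash(queried));
}
```

Calling reported_present("bdfh", "aceg") reports that aceg is present although only bdfh was inserted, which is exactly the false positive a single hash function cannot avoid.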

As an example, suppose there are 10 bits and the Bloom filter has three hash functions, so every item we store sets three bits to 1. Say the string abcd is mapped by the Bloom filter to positions 4, 5 and 6. Because the values at positions 4, 5 and 6 are all 0, abcd is not yet present, so we simply set those positions to 1.

Next we store the string bcde, which maps to positions 2, 3 and 4. Because the values at positions 2 and 3 are 0, bcde is not present either, so we set positions 2 and 3 to 1 (position 4 is already 1).

When we then want to store cdef, which maps to positions 3, 4 and 5, none of those three positions is 0, so the filter concludes that cdef already exists and the insertion is skipped. But does the string really exist? Not necessarily: here it was wrongly judged to exist because of hash collisions. The Bloom filter therefore cannot eliminate the problem completely, but it does lower the probability of a misjudgment. That is the Bloom filter: each item is mapped to multiple positions to reduce the false positive rate. More mappings are not always better, however, because every extra mapping lowers efficiency and takes up more space. I hope this makes the principle of the Bloom filter clear.
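The three steps above can be replayed in a few lines. This is a minimal sketch: the positions 4 5 6, 2 3 4 and 3 4 5 are copied from the text rather than computed by real hash functions, and the helper names (Positions, insert, maybe_present, cdef_is_false_positive) are made up for the illustration.

```cpp
#include <array>
#include <bitset>
#include <cstddef>

using Positions = std::array<std::size_t, 3>;

// True only if all three bits are already 1 ("possibly present").
bool maybe_present(const std::bitset<10>& bits, const Positions& p) {
    return bits.test(p[0]) && bits.test(p[1]) && bits.test(p[2]);
}

// Sets the three bits that belong to one item.
void insert(std::bitset<10>& bits, const Positions& p) {
    for (std::size_t pos : p) bits.set(pos);
}

// Replays the walkthrough: insert abcd and bcde, then look up cdef,
// which was never inserted.
bool cdef_is_false_positive() {
    std::bitset<10> bits;
    const Positions abcd{4, 5, 6};
    const Positions bcde{2, 3, 4};
    const Positions cdef{3, 4, 5};
    insert(bits, abcd);  // sets bits 4, 5, 6
    insert(bits, bcde);  // sets bits 2, 3 (bit 4 is already set)
    return maybe_present(bits, cdef);
}
```

cdef_is_false_positive() returns true because bits 3, 4 and 5 were all set by the other two strings.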

Application scenarios of Bloom filter:

1. The Bloom filter suits scenarios that do not require exact answers.
For example, when a user registers a nickname we need to check whether the nickname is already taken, and a rough check is enough. If the filter says the nickname does not exist, it really does not exist. If the filter says it exists, it may in fact not exist, so errors are possible here, but we do not need that much accuracy, so a Bloom filter is a good fit.
2. Improving efficiency.
For example, suppose a client uses a person's ID to look up a record among tens of millions of records. The data is stored on the server's disk, disk access is very slow, and the ID that was entered may be wrong or may not exist at all, so many lookups waste a slow disk access. To improve efficiency we can put a Bloom filter in front of the database. Before searching the disk, it checks whether the ID may be present. If it may be present, we search the disk; if it is definitely absent, we return immediately.
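The second scenario can be sketched as a filter placed in front of a slow store. Everything here is an assumption made for illustration: TinyBloom is a stand-in filter that derives three positions from std::hash with a one-character salt (not the hash functions used later in this article), an in-memory map plays the role of the disk, and disk_reads simply counts how often the slow path is taken.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal stand-in filter: three salted std::hash probes.
class TinyBloom {
public:
    explicit TinyBloom(std::size_t bits) : bits_(bits, false) {}
    void set(const std::string& key) {
        for (std::size_t i = 0; i < 3; ++i) bits_[index(key, i)] = true;
    }
    bool test(const std::string& key) const {
        for (std::size_t i = 0; i < 3; ++i)
            if (!bits_[index(key, i)]) return false;
        return true;
    }
private:
    std::size_t index(const std::string& key, std::size_t salt) const {
        return std::hash<std::string>{}(key + static_cast<char>('A' + salt)) % bits_.size();
    }
    std::vector<bool> bits_;
};

// Hypothetical user store; `disk` stands in for the slow on-disk data.
struct UserStore {
    std::unordered_map<std::string, std::string> disk;
    TinyBloom filter{1024};
    std::size_t disk_reads = 0;

    void add(const std::string& id, const std::string& name) {
        disk[id] = name;
        filter.set(id);
    }
    // Returns nullptr when the id is absent.
    const std::string* find(const std::string& id) {
        if (!filter.test(id)) return nullptr;  // definitely absent: skip the disk
        ++disk_reads;                          // possibly present: pay for the lookup
        auto it = disk.find(id);
        return it == disk.end() ? nullptr : &it->second;
    }
};
```

A false positive in the filter only costs one unnecessary disk read; the final answer still comes from the store, so find never returns a wrong record.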

Bloom filter implementation

First we have to decide how many hash functions the Bloom filter needs. Because the filter uses several hash functions, each item occupies several bits, so we have to allocate more space than before. For example, with a single hash function we might use 100 bits for 100 items; with 3 hash functions, should we allocate 300 bits or 400 bits for the same 100 items? More hash functions help with hash collisions, but they also consume more space, so there is a formula relating the number of hash functions to the amount of space that gives the lowest false positive rate: k = (m / n) * ln 2, where k is the number of hash functions, m is the length of the Bloom filter in bits, and n is the number of inserted elements.

When the number of hash functions is 3, m = 3n / ln 2, which is approximately 4.3 times n. In other words, to keep the false positive rate at its minimum with 3 hash functions, the length of the Bloom filter should be about 4.3 times the number of inserted items. For simplicity we round this to 4 times here.

With this theory in hand we can implement a simulated Bloom filter. First we need three hash functions that convert strings into integers. They are not our focus, so the three functors are simply listed here:

#include <cstddef>
#include <string>

// BKDR hash: multiply the running value by 131 and add each character.
struct BKDRHash
{
	size_t operator()(const std::string& key) const
	{
		size_t hash = 0;
		for (auto ch : key)
		{
			hash *= 131;
			hash += ch;
		}
		return hash;
	}
};

// AP hash: alternates between two mixing steps for even and odd positions.
struct APHash
{
	size_t operator()(const std::string& key) const
	{
		unsigned int hash = 0;
		int i = 0;
		for (auto ch : key)
		{
			if ((i & 1) == 0)
			{
				hash ^= ((hash << 7) ^ (ch) ^ (hash >> 3));
			}
			else
			{
				hash ^= (~((hash << 11) ^ (ch) ^ (hash >> 5)));
			}
			++i;
		}
		return hash;
	}
};

// DJB hash: hash = hash * 33 + ch, starting from 5381.
struct DJBHash
{
	size_t operator()(const std::string& key) const
	{
		unsigned int hash = 5381;
		for (auto ch : key)
		{
			hash += (hash << 5) + ch;
		}
		return hash;
	}
};
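As a quick check of the sizing rule quoted before the hash functors, the formula can be solved for m with k = 3. This is only the arithmetic behind the rounded factor of 4 bits per item, using ln 2 ≈ 0.693.

```latex
k = \frac{m}{n}\,\ln 2
\quad\Longrightarrow\quad
m = \frac{k\,n}{\ln 2}
\quad\Longrightarrow\quad
m\,\Big|_{k=3} = \frac{3\,n}{0.693} \approx 4.33\,n
```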

Next we implement the class itself. It is a class template. The first parameter N is the maximum number of items to be stored, and the second parameter X is the average number of bits used to store one item. Then comes a type parameter for the type of the keys, followed by three type parameters for the hash functions; for convenience these default to the three hash functors above. According to the reasoning above, X defaults to 4. The Bloom filter keeps its bits in a bitmap, so we first need a simple bitset class. The code is as follows:

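#include <cstddef>
#include <vector>

// Simple bitmap of n bits stored in a byte array.
template<size_t n>
class bitset
{
public:
	bitset()
	{
		ch.resize(n / 8 + 1);
	}
	// Set bit x to 1.
	void set(size_t x)
	{
		size_t i = x / 8; // which byte
		size_t j = x % 8; // which bit inside that byte
		ch[i] |= (1 << j);
	}
	// Set bit x to 0.
	void reset(size_t x)
	{
		size_t i = x / 8;
		size_t j = x % 8;
		ch[i] &= ~(1 << j);
	}
	// Return true if bit x is 1.
	bool test(size_t x) const
	{
		size_t i = x / 8;
		size_t j = x % 8;
		return (ch[i] & (1 << j)) != 0;
	}
private:
	std::vector<char> ch;
};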
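The byte and bit arithmetic used by set, reset and test can be looked at in isolation. The sketch below is a hedged illustration with hypothetical helpers (BitPosition, locate, set_then_test); it mirrors the x / 8 and x % 8 split of the class above but is not part of it.

```cpp
#include <cstddef>
#include <vector>

struct BitPosition {
    std::size_t byte;    // index into the byte array (x / 8)
    std::size_t offset;  // bit inside that byte (x % 8)
};

// Splits a bit number into its byte index and bit offset.
BitPosition locate(std::size_t x) { return {x / 8, x % 8}; }

// Sets bit x in a byte array sized for n bits (n / 8 + 1 bytes),
// then reads the same bit back.
bool set_then_test(std::size_t n, std::size_t x) {
    std::vector<unsigned char> bytes(n / 8 + 1, 0);
    BitPosition p = locate(x);
    bytes[p.byte] |= static_cast<unsigned char>(1u << p.offset);
    return ((bytes[p.byte] >> p.offset) & 1u) != 0;
}
```

For example, bit 13 lives in byte 1 at offset 5, since 13 = 1 * 8 + 5.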

Then the general framework of the Bloom filter is as follows:

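template<size_t N,
	size_t X = 4,
	class K = std::string,
	class HashFunc1 = BKDRHash,
	class HashFunc2 = APHash,
	class HashFunc3 = DJBHash>
class BloomFilter
{
public:
	void set(const K& key)
	{
		// filled in below
	}
	bool test(const K& key)
	{
		// filled in below
		return false; // placeholder so the skeleton compiles
	}

private:
	bitset<N * X> _bs; // N items, X bits per item
};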

The set function uses the three hash functions to compute three positions, then calls the set function of _bs to set all three positions to 1. The code is as follows:

void set(const K& key)
{
	// Map the key to three positions inside the N * X bits.
	size_t hash1 = HashFunc1()(key) % (N * X);
	size_t hash2 = HashFunc2()(key) % (N * X);
	size_t hash3 = HashFunc3()(key) % (N * X);
	_bs.set(hash1);
	_bs.set(hash2);
	_bs.set(hash3);
}

As long as any one of the three positions for a key is 0, the key is definitely not present. So we compute the three positions one at a time and check each in turn: as soon as one is 0 we return false, and if all three are 1 we return true. The code is as follows:

bool test(const K& key)
{
	// Any position that is 0 means the key was never inserted.
	size_t hash1 = HashFunc1()(key) % (N * X);
	if (!_bs.test(hash1))
	{
		return false;
	}
	size_t hash2 = HashFunc2()(key) % (N * X);
	if (!_bs.test(hash2))
	{
		return false;
	}
	size_t hash3 = HashFunc3()(key) % (N * X);
	if (!_bs.test(hash3))
	{
		return false;
	}
	// All three positions are 1: the key may be present.
	return true;
}

Our code is now almost complete, and you may wonder why the Bloom filter has no delete function. A Bloom filter does not support reset because several keys can share the same bit: clearing the bits of one key may clear a bit that another key depends on, and that key would then be wrongly reported as absent. Can a variant of the Bloom filter be designed that supports deletion? Yes, by counting. When a key maps to a position, we do not set that position to 1 but increase it by 1, and when deleting we decrease it by 1. This brings another problem, though. Originally one bit per position was enough to indicate presence or absence, with several positions checked together. With counting, one bit is no longer enough to hold the count, so each position needs several bits, which multiplies the space consumption. For this reason the Bloom filter here does not provide a delete function. With that, our Bloom filter is finished. The complete code is as follows:

template<size_t N,
	size_t X = 4,
	class K = std::string,
	class HashFunc1 = BKDRHash,
	class HashFunc2 = APHash,
	class HashFunc3 = DJBHash>
class BloomFilter
{
public:
	// Mark the three positions of key as present.
	void set(const K& key)
	{
		size_t hash1 = HashFunc1()(key) % (N * X);
		size_t hash2 = HashFunc2()(key) % (N * X);
		size_t hash3 = HashFunc3()(key) % (N * X);
		_bs.set(hash1);
		_bs.set(hash2);
		_bs.set(hash3);
	}
	// false: key is definitely absent; true: key may be present.
	bool test(const K& key)
	{
		size_t hash1 = HashFunc1()(key) % (N * X);
		if (!_bs.test(hash1))
		{
			return false;
		}
		size_t hash2 = HashFunc2()(key) % (N * X);
		if (!_bs.test(hash2))
		{
			return false;
		}
		size_t hash3 = HashFunc3()(key) % (N * X);
		if (!_bs.test(hash3))
		{
			return false;
		}
		return true;
	}

private:
	bitset<N * X> _bs; // N items, X bits per item
};
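The counting idea discussed before the complete code can be sketched as follows. This is not the article's implementation, only a minimal illustration under stated assumptions: each position is an 8-bit counter, the three positions come from std::hash with a one-character salt rather than from the BKDR, AP and DJB functors, and the names CountingBloom, insert, erase and indexes are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Sketch of a counting Bloom filter: one 8-bit counter per position
// instead of one bit, which is the extra space the text warns about.
class CountingBloom {
public:
    explicit CountingBloom(std::size_t slots) : counts_(slots, 0) {}

    void insert(const std::string& key) {
        for (std::size_t i : indexes(key))
            if (counts_[i] < UINT8_MAX) ++counts_[i];  // saturate, never wrap
    }
    // Only valid for keys that were really inserted; erasing a key that
    // was never inserted would damage the counters of other keys.
    void erase(const std::string& key) {
        for (std::size_t i : indexes(key))
            if (counts_[i] > 0 && counts_[i] < UINT8_MAX) --counts_[i];
    }
    bool test(const std::string& key) const {
        for (std::size_t i : indexes(key))
            if (counts_[i] == 0) return false;
        return true;
    }
private:
    // Three positions per key, derived from salted std::hash (assumption).
    std::array<std::size_t, 3> indexes(const std::string& key) const {
        std::array<std::size_t, 3> out{};
        for (std::size_t s = 0; s < out.size(); ++s)
            out[s] = std::hash<std::string>{}(key + static_cast<char>('A' + s)) % counts_.size();
        return out;
    }
    std::vector<std::uint8_t> counts_;
};
```

With counters, erasing abcd only lowers the counts it raised, so another key such as bcde that shares a position stays visible. The price is eight times the memory of the one-bit version for the same number of positions.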

Bloom filter testing

Once the Bloom filter is implemented, we can write some code to test it. First, let the Bloom filter store a series of distinct strings. The code is as follows:

void test_bloomfilter2()
{
	const size_t N = 100000;
	BloomFilter<N> bf;
	// Build N distinct strings that share a common prefix.
	std::vector<std::string> v1;
	std::string url = "https://www.cnblogs.com/-clq/archive/2012/05/31/2528153.html";
	for (size_t i = 0; i < N; ++i)
	{
		v1.push_back(url + std::to_string(i));
	}
	// Insert all of them into the filter.
	for (auto& str : v1)
	{
		bf.set(str);
	}
}

Next we create another batch of strings that are similar to the ones above but were never inserted, and test whether the filter reports them as present. Since none of them was inserted, every one that is reported as present is a false positive. We count the false positives in a variable and divide by the total number of test strings. The complete test code is as follows:

#include <iostream>
#include <string>
#include <vector>

void test_bloomfilter2()
{
	const size_t N = 100000;
	BloomFilter<N> bf;
	// Insert N distinct strings that share a common prefix.
	std::vector<std::string> v1;
	std::string url = "https://www.cnblogs.com/-clq/archive/2012/05/31/2528153.html";
	for (size_t i = 0; i < N; ++i)
	{
		v1.push_back(url + std::to_string(i));
	}
	for (auto& str : v1)
	{
		bf.set(str);
	}
	// Similar strings (same prefix, different suffix) that were never inserted.
	std::vector<std::string> v2;
	for (size_t i = 0; i < N; ++i)
	{
		std::string url = "https://www.cnblogs.com/-clq/archive/2012/05/31/2528153.html";
		url += std::to_string(999999 + i);
		v2.push_back(url);
	}
	// Every hit here is a false positive.
	size_t n2 = 0;
	for (auto& str : v2)
	{
		if (bf.test(str))
		{
			++n2;
		}
	}
	std::cout << "False positive rate for similar strings: " << (double)n2 / (double)N << std::endl;
}

Running the code above reports a false positive rate of 0.2812 for similar strings. Using the same idea, we can also test the false positive rate for dissimilar strings. The code is as follows:

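// Continues inside test_bloomfilter2(); std::rand() needs <cstdlib>.
// Dissimilar strings: a different prefix with a randomised suffix.
std::vector<std::string> v3;
for (size_t i = 0; i < N; ++i)
{
	std::string url = "zhihu.com";
	url += std::to_string(i + std::rand());
	v3.push_back(url);
}
// Every hit here is a false positive.
size_t n3 = 0;
for (auto& str : v3)
{
	if (bf.test(str))
	{
		++n3;
	}
}
std::cout << "False positive rate for dissimilar strings: " << (double)n3 / (double)N << std::endl;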

Running the code shows that the false positive rate for dissimilar strings is lower. We can also modify the code above. Originally, inserting one item required 4 bits of space. If we allocate 6 times the space instead, will the false positive rate go down? The test shows that it drops noticeably.

If we then keep this ratio unchanged and add one more hash function, will the false positive rate fall further? The test shows that it does, but this reduction is bought with extra space, so we are trading space for a lower false positive rate. That completes the Bloom filter testing.
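The trend seen in these tests matches the usual textbook approximation for the false positive rate, p ≈ (1 - e^(-k n / m))^k. The helper below is a hedged sketch, not part of the article's code: theoretical_fp_rate is a hypothetical name, and the formula assumes ideal, independent, uniformly distributed hash functions, which real string hashes on very similar inputs do not fully satisfy.

```cpp
#include <cmath>

// Approximate false positive rate for a Bloom filter with
// `bits_per_element` = m / n bits per inserted item and k hash functions,
// under the assumption of ideal independent uniform hashes.
double theoretical_fp_rate(double bits_per_element, int k) {
    return std::pow(1.0 - std::exp(-static_cast<double>(k) / bits_per_element), k);
}
```

Under these assumptions, 4 bits per item with 3 hash functions gives about 0.147, 6 bits with 3 hash functions about 0.061, and 6 bits with 4 hash functions about 0.056. Measured values such as the 0.2812 reported above for similar strings can be higher when the hash functions spread similar inputs poorly.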

Origin blog.csdn.net/qq_68695298/article/details/131464747