[C++] Hash application - Bloom filter


1. Why the Bloom filter was proposed

When we browse Douyin, recommended ads and recommended videos appear constantly. How does the recommendation system avoid showing the same content twice? The server records everything each user has already viewed, and before recommending an item it checks that item against the user's history and drops anything already seen. That raises a harder question: how can this lookup be done quickly? There are three candidate approaches:
The first approach: use a hash table (for example, one built with hash buckets). Lookups are fast, but storing every record wastes too much space.
The second approach: use a bitmap to record what the user has seen. It is compact, but a bitmap can only index integers directly; it cannot handle string keys.
The third approach: combine the hash table idea from the first approach with the bitmap from the second. Hash each string to an integer and record it in a bitmap, which saves space and also supports keys of type string.

2. The concept of Bloom filter

A Bloom filter is a compact, clever probabilistic data structure proposed by Burton Howard Bloom in 1970. It supports efficient insertion and lookup, and it can tell you that something "definitely does not exist" or "may exist". It works by using multiple hash functions to map each item to several positions in a bitmap. This makes lookups fast and saves a great deal of memory.

A Bloom filter is essentially an extension of the bitmap that lowers the false positive rate. When an item is added, the Bloom filter uses several hash functions to map it to several bits instead of one. To check whether an item is present, the same hash functions compute the corresponding bits: if all of them are 1, the item is judged to exist; otherwise it is judged not to exist. Because a false positive now requires every one of those bits to collide, mapping to multiple bits reduces the false positive rate. The QQ nickname example below illustrates this:

Assume each item is mapped to three bits, and Zhang San has already been inserted, so his three bits are set. To check whether Li Si is in the bitmap, we compute Li Si's three bits. Suppose the first two land on the same positions as Zhang San's (a hash collision), but the third lands on the last bit of the bitmap, which is still 0. The filter therefore concludes that the nickname Li Si is not in the bitmap, so the nickname Li Si can be used.

But as more and more nicknames are inserted, hash collisions become more frequent and the false positive rate rises. Eventually a complete false positive occurs: Wang Wu was never inserted into the bitmap, yet the filter reports it as present, because all three of Wang Wu's bits were already set by the nicknames "Zhang San" and "Li Si".
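
The Zhang San, Li Si and Wang Wu scenario can be reproduced with a tiny sketch. It assumes an 8-bit bitmap, and the bit positions are hand-picked stand-ins for the three hash results (they are made up for illustration, not computed by real hash functions). The names `Positions`, `insert`, `maybe_contains` and `nickname_demo` are hypothetical.

```cpp
#include <array>
#include <bitset>
#include <cstddef>

// Hypothetical fixed bit positions standing in for three hash results.
using Positions = std::array<std::size_t, 3>;

void insert(std::bitset<8>& bits, const Positions& p)
{
	for (std::size_t i : p) bits.set(i);
}

bool maybe_contains(const std::bitset<8>& bits, const Positions& p)
{
	for (std::size_t i : p)
		if (!bits.test(i)) return false; // one 0 bit: definitely absent
	return true;                         // all 1: present, or a false positive
}

struct DemoResult
{
	bool li_si_before_insert;   // lookup of Li Si when only Zhang San is stored
	bool wang_wu_after_inserts; // lookup of Wang Wu, which is never inserted
};

DemoResult nickname_demo()
{
	std::bitset<8> bits;
	const Positions zhang_san{1, 3, 5};
	const Positions li_si{1, 3, 7};   // shares bits 1 and 3 with Zhang San
	const Positions wang_wu{1, 5, 7}; // every bit is covered by the other two

	DemoResult r{};
	insert(bits, zhang_san);
	r.li_si_before_insert = maybe_contains(bits, li_si);     // bit 7 is still 0
	insert(bits, li_si);
	r.wang_wu_after_inserts = maybe_contains(bits, wang_wu); // bits 1, 5, 7 all set
	return r;
}
```

Calling `nickname_demo()` gives `false` for Li Si (correctly absent) and `true` for Wang Wu, which is a false positive because Wang Wu was never inserted.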

1. Characteristics of Bloom filter

When a Bloom filter reports that an item is not in the bitmap, the answer is always correct: at least one of the item's bits is 0, which could not happen if the item had been inserted.

When a Bloom filter reports that an item is in the bitmap, the answer may be wrong. Other nicknames may already have set all of the new nickname's bits, even though the new nickname itself was never inserted into the bitmap.

2. How to control the false positive rate

Factor one: the size of the bitmap. With a small bitmap, the bits are quickly all set to 1, and the false positive rate becomes very high. The longer the bitmap, the lower the false positive rate.

Factor two: the number of hash functions. With too many hash functions, every insertion sets many bits, so the bitmap fills with 1s quickly and the false positive rate rises. With too few, each lookup checks only a few bits, so the false positive rate also rises.

The following formulas relate the size of the bitmap and the number of hash functions:

k = (m / n) * ln 2
m = -(n * ln p) / (ln 2)^2

Here k is the number of hash functions, m is the length of the Bloom filter in bits, n is the number of inserted elements, and p is the false positive rate. Taking k = 3 and ln 2 ≈ 0.7, the relation k = (m / n) * ln 2 gives m = k * n / ln 2 ≈ 4.3n, so the Bloom filter should be roughly 4 times as long, in bits, as the number of inserted elements.
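
The sizing rule can be tried out numerically. The sketch below uses the standard Bloom filter formulas, m = -n ln p / (ln 2)^2 and k = (m / n) ln 2, plus the usual estimate p ≈ (1 - e^(-kn/m))^k. The function names `optimal_bits`, `optimal_hashes` and `estimated_rate` are hypothetical helpers, not part of the article's code.

```cpp
#include <cmath>
#include <cstddef>

// Bits needed for n elements at a target false positive rate p:
//   m = -n * ln(p) / (ln 2)^2
std::size_t optimal_bits(std::size_t n, double p)
{
	const double ln2 = std::log(2.0);
	const double m = -static_cast<double>(n) * std::log(p) / (ln2 * ln2);
	return static_cast<std::size_t>(std::ceil(m));
}

// Number of hash functions that minimizes the rate for given m and n:
//   k = (m / n) * ln 2, rounded to the nearest integer and at least 1
std::size_t optimal_hashes(std::size_t m, std::size_t n)
{
	const double k = static_cast<double>(m) / static_cast<double>(n) * std::log(2.0);
	const long rounded = std::lround(k);
	return rounded < 1 ? 1 : static_cast<std::size_t>(rounded);
}

// Approximate false positive rate after inserting n elements:
//   p ~ (1 - e^(-k*n/m))^k
double estimated_rate(std::size_t m, std::size_t n, std::size_t k)
{
	const double exponent = -static_cast<double>(k) * static_cast<double>(n) / static_cast<double>(m);
	return std::pow(1.0 - std::exp(exponent), static_cast<double>(k));
}
```

For example, `estimated_rate(4000, 1000, 3)` corresponds to k = 3 and m = 4n and comes out at roughly 0.15, so that setting trades accuracy for a very small bitmap. Asking for p = 0.01 with `optimal_bits(1000, 0.01)` needs a bit under 10 bits per element and about 7 hash functions.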

3. Implementation of Bloom filter

A Bloom filter can be implemented as a class template, so the keys inserted into it are not limited to strings and can be other types as well. In practice, Bloom filters are mostly used with strings, so the template parameter K defaults to string here.

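	#include <bitset>
	#include <string>

	// Bloom filter: N is the number of bits, K is the key type, and
	// Hash1..Hash3 are functors that map a key to a size_t.
	template<size_t N, class K = std::string, class Hash1 = BKDRHash, class Hash2 = APHash, class Hash3 = DJBHash>
	class bloomfilter
	{
	public:
		//...
	private:
		std::bitset<N> _bs; // the underlying bitmap
	};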

The three functors that convert a string into an integer are defined as follows:

1. Three hash functions for converting strings to integers

	struct BKDRHash
	{
		size_t operator()(const std::string& s) const
		{
			// BKDR: multiply the running value by 31, then add the next character
			size_t value = 0;
			for (auto ch : s)
			{
				value *= 31;
				value += ch;
			}
			return value;
		}
	};

	struct APHash
	{
		size_t operator()(const std::string& s) const
		{
			// AP: alternate between two mixing steps for even and odd positions
			size_t hash = 0;
			for (size_t i = 0; i < s.size(); i++)
			{
				if ((i & 1) == 0)
				{
					hash ^= ((hash << 7) ^ s[i] ^ (hash >> 3));
				}
				else
				{
					hash ^= (~((hash << 11) ^ s[i] ^ (hash >> 5)));
				}
			}
			return hash;
		}
	};

	struct DJBHash
	{
		size_t operator()(const std::string& s) const
		{
			// DJB: hash = hash * 33 + ch, starting from 5381
			size_t hash = 5381;
			for (auto ch : s)
			{
				hash += (hash << 5) + ch;
			}
			return hash;
		}
	};

The hash functions chosen here to convert strings into integers are BKDRHash, APHash and DJBHash, the three with the highest overall scores in testing, with a low probability of hash collisions across a range of scenarios. Each of them rarely collides when used alone, and a false positive requires all three positions to collide at once, which is less likely still.

2. Insertion of Bloom filter

To insert an element, compute its three bit positions with the three hash functions (each hash value is taken modulo N, the size of the bitmap), then set those three bits in the bitmap to 1.

		void Set(const K& key)
		{
			// Compute the three bit positions for key
			size_t i1 = Hash1()(key) % N;
			size_t i2 = Hash2()(key) % N;
			size_t i3 = Hash3()(key) % N;

			// Set those bits to 1
			_bs.set(i1);
			_bs.set(i2);
			_bs.set(i3);
		}

3. Lookup in a Bloom filter

To look up an element, compute its three bit positions with the same three hash functions, then check whether those three bits in the bitmap are set to 1.

If any one of the three bits is not set, the element is definitely not present.
If all three bits are set, return true: the element is probably present, but this may be a false positive.

		bool Test(const K& key) const
		{
			// Check the bits one at a time and return early on the first 0
			size_t i1 = Hash1()(key) % N;
			if (_bs.test(i1) == false)
			{
				return false;
			}

			size_t i2 = Hash2()(key) % N;
			if (_bs.test(i2) == false)
			{
				return false;
			}

			size_t i3 = Hash3()(key) % N;
			if (_bs.test(i3) == false)
			{
				return false;
			}
			// All three bits are set: probably present, but may be a false positive
			return true;
		}
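
The fragments above can be assembled into one compilable unit. This is a minimal sketch of how the pieces might fit together, with `std::` qualification and `const` added. The `usage_sketch` function and the choice of 400 bits are illustrative assumptions, not part of the original code.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

struct BKDRHash
{
	std::size_t operator()(const std::string& s) const
	{
		std::size_t value = 0;
		for (auto ch : s) { value *= 31; value += ch; }
		return value;
	}
};

struct APHash
{
	std::size_t operator()(const std::string& s) const
	{
		std::size_t hash = 0;
		for (std::size_t i = 0; i < s.size(); i++)
		{
			if ((i & 1) == 0)
				hash ^= ((hash << 7) ^ s[i] ^ (hash >> 3));
			else
				hash ^= (~((hash << 11) ^ s[i] ^ (hash >> 5)));
		}
		return hash;
	}
};

struct DJBHash
{
	std::size_t operator()(const std::string& s) const
	{
		std::size_t hash = 5381;
		for (auto ch : s) hash += (hash << 5) + ch;
		return hash;
	}
};

template<std::size_t N, class K = std::string, class Hash1 = BKDRHash, class Hash2 = APHash, class Hash3 = DJBHash>
class bloomfilter
{
public:
	// Set the three bits that key maps to.
	void Set(const K& key)
	{
		_bs.set(Hash1()(key) % N);
		_bs.set(Hash2()(key) % N);
		_bs.set(Hash3()(key) % N);
	}

	// false: definitely absent. true: probably present (may be a false positive).
	bool Test(const K& key) const
	{
		return _bs.test(Hash1()(key) % N)
			&& _bs.test(Hash2()(key) % N)
			&& _bs.test(Hash3()(key) % N);
	}

private:
	std::bitset<N> _bs;
};

// Usage sketch (hypothetical): returns true when every inserted nickname is found.
bool usage_sketch()
{
	bloomfilter<400> bf; // assumed size: about 4 bits per element for 100 nicknames
	bf.Set("Zhang San");
	bf.Set("Li Si");
	return bf.Test("Zhang San") && bf.Test("Li Si");
}
```

An inserted key is always reported as present, so `usage_sketch()` returns true. A lookup on a key that was never inserted usually returns false, but it can return true when its three bits happen to be covered by other keys.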

4. Deletion of Bloom filter

Bloom filters cannot directly support deletion because when one element is deleted, other elements may be affected.

For example, suppose Zhang San and Li Si share two bits. If we delete Li Si by resetting all three of its bits to 0, two of Zhang San's bits are cleared as well, and a later lookup for Zhang San wrongly reports that it does not exist.

One way to support deletion: expand each bit of the Bloom filter into a small counter. When inserting an element, increment the k counters at the hash addresses computed by the k hash functions; when deleting an element, decrement those k counters. Deletion is gained at the cost of several times more storage space.
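
The counter-based approach might look like the sketch below. It is an illustration under assumptions, not the article's implementation: the class name `counting_bloomfilter` is hypothetical, each counter is one byte, and the three positions come from `std::hash` applied to the key with a numeric suffix instead of the BKDR/AP/DJB functors, to keep the example short.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <string>

template <std::size_t N>
class counting_bloomfilter
{
public:
	void Set(const std::string& key)
	{
		for (int i = 0; i < 3; ++i)
		{
			std::uint8_t& c = _counts[Index(key, i)];
			if (c < kMax) ++c; // saturate instead of wrapping around to 0
		}
	}

	// Should only be called for keys that were really inserted; removing a key
	// that is merely a false positive would corrupt other keys' counters.
	void Reset(const std::string& key)
	{
		if (!Test(key)) return; // definitely absent: nothing to undo
		for (int i = 0; i < 3; ++i)
		{
			std::uint8_t& c = _counts[Index(key, i)];
			if (c > 0 && c < kMax) --c; // a saturated counter has lost its true count
		}
	}

	bool Test(const std::string& key) const
	{
		for (int i = 0; i < 3; ++i)
			if (_counts[Index(key, i)] == 0) return false;
		return true;
	}

private:
	static constexpr std::uint8_t kMax = std::numeric_limits<std::uint8_t>::max();

	// Assumption: derive the i-th position by hashing the key plus a suffix.
	static std::size_t Index(const std::string& key, int i)
	{
		return std::hash<std::string>{}(key + std::to_string(i)) % N;
	}

	std::array<std::uint8_t, N> _counts{}; // all counters start at 0
};
```

With counters, deleting one nickname only decrements the shared positions, so another nickname that maps to the same positions is still found afterwards. The cost is 8 bits per position here instead of 1.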

4. Advantages of Bloom filter

  1. Insertion and lookup take O(K) time, where K is the number of hash functions (usually small), regardless of how much data is stored
  2. The hash functions are independent of one another, which makes them easy to compute in parallel in hardware
  3. A Bloom filter does not need to store the elements themselves, which is a major advantage where confidentiality requirements are strict
  4. When some false positives can be tolerated, a Bloom filter uses far less space than other data structures
  5. When the amount of data is very large, a Bloom filter can still represent the entire set, where structures that store the elements themselves may not fit in memory
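  6. Bloom filters with the same size and the same set of hash functions can be combined with bitwise operations: OR gives the union, and AND gives an approximation of the intersection (set difference cannot be computed reliably this way)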
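
The union case from the last item can be sketched with `std::bitset`'s `|=` operator. This is an illustration under assumptions: the class `mergeable_filter` and its `Merge` method are hypothetical names, and the positions come from `std::hash` applied to the key with a numeric suffix instead of the three functors used earlier.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

template <std::size_t N>
class mergeable_filter
{
public:
	void Set(const std::string& key)
	{
		for (int i = 0; i < 3; ++i) _bs.set(Index(key, i));
	}

	bool Test(const std::string& key) const
	{
		for (int i = 0; i < 3; ++i)
			if (!_bs.test(Index(key, i))) return false;
		return true;
	}

	// Union: a bit is set afterwards if it was set in either filter. Both
	// filters must have the same size N and the same hash functions.
	void Merge(const mergeable_filter& other) { _bs |= other._bs; }

private:
	// Assumption: derive the i-th position by hashing the key plus a suffix.
	static std::size_t Index(const std::string& key, int i)
	{
		return std::hash<std::string>{}(key + std::to_string(i)) % N;
	}

	std::bitset<N> _bs;
};
```

After `a.Merge(b)`, filter `a` holds the same bits it would hold if every key of `b` had been inserted into it directly, so all keys from both filters are found.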

5. Disadvantages of Bloom filter

  1. There are false positives: a Bloom filter cannot determine with certainty that an element is in the set (remedy: keep a whitelist of the data that may be misjudged)
  2. The element itself cannot be retrieved
  3. Elements generally cannot be deleted from a Bloom filter
  4. If counters are used to support deletion, the counters may wrap around (overflow)

Origin blog.csdn.net/m0_70088010/article/details/133964277