How do you handle big data processing? A look at bitmaps and Bloom filters (part 2)

Table of contents

Preamble

First, why there is a Bloom filter

Second, what is a Bloom filter

Third, the implementation of the Bloom filter

Fourth, the advantages and disadvantages of Bloom filter

4.1 Advantages of Bloom Filters

4.2 Disadvantages of Bloom filter and its improvement

4.2.1 Analysis of finding misjudgments and their improvement methods

4.2.2 Unable to delete and analysis of improvement methods

Summary


Preamble

This article covers the Bloom filter, a structure for handling large volumes of string data. It walks through the basic use of the Bloom filter, its advantages and disadvantages, its application scenarios, and some application cases.

First, why there is a Bloom filter

In everyday applications, lookups over large sets of strings are common. For example, when you register for a game and enter an account name, the server has to check that name against the database of existing names. If the name is already in the database, it is taken and you have to enter a different one.

The database above may hold hundreds of millions or even billions of records. A hash table holding all of them would not fit in memory, and the bitmap from the previous article can only handle integer types, so we need a new data structure to deal with this: the Bloom filter.

Second, what is a Bloom filter

The Bloom filter is a compact and clever probabilistic data structure proposed by Burton Howard Bloom in 1970. It supports efficient insertion and lookup, and it can tell you that something "definitely does not exist" or "may exist". It uses multiple hash functions to map each piece of data onto a bitmap, which both speeds up queries and saves a great deal of memory.

First of all, mapping strings runs into an unavoidable problem: there are far too many possible strings. If each string is mapped to a single position, collisions are very likely, that is, several different strings end up mapped to the same position.

The Bloom filter's answer is to map each string to several positions using different hash functions, which reduces collisions. Note that it only reduces them: with hash mapping, collisions between strings can never be ruled out entirely, but this approach makes them much rarer.
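To make the idea of "one string, several positions" concrete, here is a minimal sketch. The helper names poly_hash and positions, and the multipliers 31, 131 and 1313, are assumptions chosen for illustration; they are not the hash functions used later in this article.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: a simple polynomial string hash. The multiplier acts
// as a seed, so different multipliers behave like different hash functions.
inline std::size_t poly_hash(const std::string& s, std::size_t mult) {
    std::size_t h = 0;
    for (unsigned char ch : s) {
        h = h * mult + ch;  // unsigned overflow wraps around, which is fine here
    }
    return h;
}

// Map one key to three bit positions in a bitmap of m bits.
inline std::array<std::size_t, 3> positions(const std::string& key, std::size_t m) {
    assert(m > 0);
    return {poly_hash(key, 31) % m, poly_hash(key, 131) % m, poly_hash(key, 1313) % m};
}
```

Calling positions("alice", 64) always returns the same three indices, each below 64. Inserting a key means setting those three bits, and a lookup checks whether all three are set.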

At this point some readers may ask: is it true that the more positions we map to, the less likely a collision becomes?

Not necessarily, because every extra hash function also sets more bits, so the bitmap fills up faster. For the details, see the following article, which covers the mathematical analysis of the Bloom filter.

Explain in detail the principle, usage scenarios and precautions of the Bloom filter (Zhihu)

Third, the implementation of the Bloom filter

For the implementation of the Bloom filter, we wrap the bitmap from the previous article, but the amount of space to allocate is somewhat different from a plain bitmap.

The formula from the Zhihu article (originally shown here as a figure) relates four quantities: k is the number of hash functions, m is the size of the Bloom filter in bits, n is the number of inserted elements, and p is the false positive rate. The relation that matters here is k = (m/n) * ln 2, which gives the best number of hash functions for a given m and n. We use 3 hash functions, so solving for m gives m = 3n / ln 2, that is, at least the following ratio

m ≈ 4.3n

Therefore, when we allocate space, we need at least 4.3*N bits. The code below rounds this up to 5*N bits (the constant _x).
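The arithmetic above can be checked with a few lines of code. This is a sketch under the standard approximation p = (1 - e^(-k*n/m))^k, which I am assuming is the formula the Zhihu article uses; the function names bits_per_element and false_positive_rate are made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Bits per element at which k hash functions are the best choice:
// from k = (m / n) * ln 2 it follows that m / n = k / ln 2.
inline double bits_per_element(int k) {
    assert(k > 0);
    return k / std::log(2.0);
}

// Standard approximation of the false positive rate for a filter of m bits
// holding n elements with k hash functions: p = (1 - e^(-k * n / m))^k.
inline double false_positive_rate(double m, double n, int k) {
    assert(m > 0 && n > 0 && k > 0);
    return std::pow(1.0 - std::exp(-k * n / m), k);
}
```

With k = 3, bits_per_element(3) comes out at about 4.33, which matches the 4.3 ratio. Under this approximation the false positive rate at that ratio is 12.5%, and at 5 bits per element it drops to roughly 9%, so if you need a lower rate you have to spend more bits per element and usually more hash functions as well.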

The specific code is as follows

// BKDR-style string hash. `string` is assumed to be std::string.
struct BKDRHash
{
	size_t operator()(const string& s)
	{
		size_t hash = 0;
		for (auto ch : s)
		{
			hash += ch;
			hash *= 31;
		}

		return hash;
	}
};

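// AP-style string hash: mixes each character differently on even and odd indexes.
struct APHash
{
	size_t operator()(const string& s)
	{
		size_t hash = 0;
		// size_t index avoids a signed/unsigned comparison with s.size()
		for (size_t i = 0; i < s.size(); i++)
		{
			size_t ch = s[i];
			if ((i & 1) == 0)
			{
				hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
			}
			else
			{
				hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
			}
		}
		return hash;
	}
};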


// DJB-style string hash: hash = hash * 33 + ch, starting from 5381.
struct DJBHash
{
	size_t operator()(const string& s)
	{
		size_t hash = 5381;
		for (auto ch : s)
		{
			hash += (hash << 5) + ch;
		}
		return hash;
	}
};




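// Bloom filter for up to N keys, built on the BitSet class from the previous article.
template<size_t N, class K = string, class Hash1 = BKDRHash, class Hash2 = APHash, class Hash3 = DJBHash>
class BloomFilter
{
public:
	void set(const K& key)
	{
		// size_t rather than int, so that a very large N cannot overflow
		size_t len = N * _x;

		// Use the three string-to-size_t hash functions to compute
		// three mapped positions and set each of them
		size_t hash1 = Hash1()(key) % len;
		_bs.set(hash1);

		size_t hash2 = Hash2()(key) % len;
		_bs.set(hash2);

		size_t hash3 = Hash3()(key) % len;
		_bs.set(hash3);
	}

	bool test(const K& key)
	{
		size_t len = N * _x;

		size_t hash1 = Hash1()(key) % len;
		size_t hash2 = Hash2()(key) % len;
		size_t hash3 = Hash3()(key) % len;

		// The key may exist only if all three mapped positions are set;
		// if any of them is 0, the key is definitely absent
		if (_bs.test(hash1)
			&& _bs.test(hash2)
			&& _bs.test(hash3))
		{
			return true;
		}

		return false;
	}

private:
	static const size_t _x = 5;	// bits allocated per key
	BitSet<N * _x> _bs;
};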

The three hash functions above are efficient, widely used string hash functions that we found online; you can also substitute other hash functions.
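The BitSet class comes from the previous article and is not shown here, so the listing above does not compile on its own. If you want something you can paste and run, the following sketch swaps in std::bitset and a stand-in hash functor. PolyHash, SimpleBloomFilter and usage_example are hypothetical names for this example, not part of the code above.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Stand-in hash functor (an assumption, not one of the three hashes above):
// a polynomial hash whose multiplier is a template parameter.
template <std::size_t Mult>
struct PolyHash {
    std::size_t operator()(const std::string& s) const {
        std::size_t h = 0;
        for (unsigned char ch : s) h = h * Mult + ch;
        return h;
    }
};

// Bloom filter for up to N string keys: 5 bits per key, three hash functions.
template <std::size_t N, class Hash1 = PolyHash<31>, class Hash2 = PolyHash<131>,
          class Hash3 = PolyHash<1313>>
class SimpleBloomFilter {
    static_assert(N > 0, "the filter needs room for at least one key");

public:
    void set(const std::string& key) {
        _bits->set(Hash1()(key) % kBits);
        _bits->set(Hash2()(key) % kBits);
        _bits->set(Hash3()(key) % kBits);
    }

    // true means "may exist", false means "definitely does not exist".
    bool test(const std::string& key) const {
        return _bits->test(Hash1()(key) % kBits) &&
               _bits->test(Hash2()(key) % kBits) &&
               _bits->test(Hash3()(key) % kBits);
    }

private:
    static constexpr std::size_t kBits = N * 5;
    // Heap-allocated so that a large N does not overflow the stack.
    std::unique_ptr<std::bitset<kBits>> _bits = std::make_unique<std::bitset<kBits>>();
};

// Usage: inserted keys are always reported as possibly present.
inline void usage_example() {
    SimpleBloomFilter<100> names;
    names.set("alice");
    assert(names.test("alice"));
}
```

The interface mirrors the set and test functions above. A freshly constructed filter answers false for every key, and a key that has been set never answers false.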

Fourth, the advantages and disadvantages of Bloom filter

4.1 Advantages of Bloom Filters

1. Adding and querying an element both take O(K) time, where K is the number of hash functions (usually small), regardless of how much data is stored.
2. The hash functions are independent of one another, which makes them easy to evaluate in parallel in hardware.
3. The Bloom filter does not need to store the elements themselves, which is a big advantage where strict confidentiality is required.
4. It has a large space advantage over structures that store the elements themselves.
5. When the amount of data is very large, a Bloom filter can still represent the complete set in memory, which other data structures may not be able to do.
6. Bloom filters that use the same size and the same set of hash functions can be combined with bitwise operations: OR gives the union, and AND gives an approximation of the intersection. A set difference cannot be computed reliably on a plain bit-based filter.
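Point 6 is easy to try out. The sketch below assumes two filters with the same size and the same hash functions; TinyBloom, unite and intersect are names invented for this example, and the two polynomial hashes are stand-ins.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

constexpr std::size_t kUnionBits = 256;  // hypothetical size for the sketch

struct TinyBloom {
    std::bitset<kUnionBits> bits;

    // Stand-in polynomial hash; the multiplier selects the hash function.
    static std::size_t hash(const std::string& s, std::size_t mult) {
        std::size_t h = 0;
        for (unsigned char ch : s) h = h * mult + ch;
        return h;
    }

    void add(const std::string& key) {
        bits.set(hash(key, 31) % kUnionBits);
        bits.set(hash(key, 131) % kUnionBits);
    }

    bool contains(const std::string& key) const {
        return bits.test(hash(key, 31) % kUnionBits) &&
               bits.test(hash(key, 131) % kUnionBits);
    }
};

// Union: OR the bitmaps. The result is identical to a filter into which
// every key of both sets was inserted.
inline TinyBloom unite(const TinyBloom& a, const TinyBloom& b) {
    TinyBloom r;
    r.bits = a.bits | b.bits;
    return r;
}

// Intersection: AND the bitmaps. Keys present in both sets still test
// positive, but the result can hold extra bits that a filter built from
// the true intersection would not have.
inline TinyBloom intersect(const TinyBloom& a, const TinyBloom& b) {
    TinyBloom r;
    r.bits = a.bits & b.bits;
    return r;
}
```

The union is exact at the bit level, while the intersection only guarantees that there are no false negatives for keys that really are in both sets.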

4.2 Disadvantages of Bloom filter and its improvement

The main disadvantages are: 1. a lookup can produce a false positive, and 2. elements cannot be deleted.

4.2.1 Analysis of finding misjudgments and their improvement methods

First of all, why is there a misjudgment?

During a lookup, if all of the mapped positions are non-zero, the data may exist. If any one of them is 0, the data definitely does not exist. Why do we only say "may exist" when every position is set? Let's take a look at the following figure

As shown in the figure, we never inserted "Eleme", but its mapped positions happen to overlap with positions already set by other strings. In that case the Bloom filter tells us the element exists when it actually does not.
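The same effect can be reproduced in code. The sketch below uses a deliberately weak 16-bit filter so that the collision is guaranteed rather than a matter of luck; WeakBloom and its two hash functions are invented for this demonstration and would be a poor choice in practice.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// A deliberately weak 16-bit filter: both hash functions depend only on the
// byte sum and the length of the key, so anagrams always collide.
struct WeakBloom {
    std::bitset<16> bits;

    static std::size_t byte_sum(const std::string& s) {
        std::size_t sum = 0;
        for (unsigned char ch : s) sum += ch;
        return sum;
    }
    static std::size_t h1(const std::string& s) { return byte_sum(s) % 16; }
    static std::size_t h2(const std::string& s) {
        return (byte_sum(s) * 7 + s.size()) % 16;
    }

    void add(const std::string& key) {
        bits.set(h1(key));
        bits.set(h2(key));
    }
    bool contains(const std::string& key) const {
        return bits.test(h1(key)) && bits.test(h2(key));
    }
};

inline void false_positive_demo() {
    WeakBloom f;
    f.add("ab");                // sets bits 3 and 7
    assert(f.contains("ab"));   // inserted keys are always found
    assert(f.contains("ba"));   // never inserted, yet reported as present
    assert(!f.contains("zz"));  // bit 4 is still 0, so "zz" is definitely absent
}
```

"ba" is a false positive because it maps to exactly the bits that "ab" set. Real hash functions make such overlaps much rarer, but as the filter fills up they can never be excluded.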

So is there any way to improve it?

Do a second check. We query the Bloom filter first, and only if it reports that the data may exist do we go on to search the database. Compared with searching the database for every request, the Bloom filter screens out most of the data that is not there (which is exactly what "filter" means), and only the small remainder needs a database search. This greatly reduces the number of database lookups and improves efficiency.
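Here is a minimal sketch of that two-step lookup, using the account name example from the beginning of the article. NameStore is a hypothetical class, the std::unordered_set is only a stand-in for the real database, and the hash function is again a simple polynomial stand-in.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

class NameStore {
public:
    void add(const std::string& name) {
        _db.insert(name);
        for (int i = 0; i < 3; ++i) _filter.set(slot(name, i));
    }

    // Returns true only if the name is really stored.
    bool is_taken(const std::string& name) {
        for (int i = 0; i < 3; ++i) {
            // Any 0 bit means "definitely absent": skip the database entirely.
            if (!_filter.test(slot(name, i))) return false;
        }
        ++db_lookups;                // "may exist": confirm with the database
        return _db.count(name) > 0;  // a false positive is rejected here
    }

    std::size_t db_lookups = 0;  // how many queries actually reached the database

private:
    static constexpr std::size_t kFilterBits = 1024;

    // Stand-in hash: polynomial hash with a different multiplier per index.
    static std::size_t slot(const std::string& key, int i) {
        static const std::size_t mults[3] = {31, 131, 1313};
        std::size_t h = 0;
        for (unsigned char ch : key) h = h * mults[i] + ch;
        return h % kFilterBits;
    }

    std::bitset<kFilterBits> _filter;
    std::unordered_set<std::string> _db;  // stand-in for the real database
};
```

Names that the filter rejects never touch the database, so db_lookups only grows for names that are present or are false positives. The final answer is always exact because the database has the last word.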

4.2.2 Unable to delete and analysis of improvement methods

A Bloom filter does not support deleting elements directly, because clearing the bits of one element may affect other elements that share those bits. (Note that Bloom filters are mainly meant for membership lookups, not for general create, read, update, and delete operations.)

Of course, if you really must delete, there is a natural way: expand each bit of the Bloom filter into a small counter. When inserting an element, add one to its k counters (the hash addresses computed by the k hash functions); when deleting an element, subtract one from the same k counters. This buys the delete operation at the cost of several times the storage space.

The delete method above is also flawed:

1. It cannot confirm whether the element is really in the Bloom filter, so deleting a false positive damages the counters of other elements.
2. The counters can wrap around (overflow).
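The counter idea, together with one common way to soften the second flaw, can be sketched as follows. CountingBloom is a hypothetical class with 8-bit counters and stand-in hash functions; saturating the counters at 255 is my assumption about how to handle wrap-around, not something the method above prescribes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

class CountingBloom {
public:
    void add(const std::string& key) {
        for (int i = 0; i < 3; ++i) {
            std::uint8_t& c = _counters[slot(key, i)];
            if (c < kMax) ++c;  // saturate instead of wrapping around to 0
        }
    }

    bool contains(const std::string& key) const {
        for (int i = 0; i < 3; ++i) {
            if (_counters[slot(key, i)] == 0) return false;
        }
        return true;
    }

    // Returns false if the key is definitely absent. A false positive still
    // passes this check, which is flaw 1 from the list above.
    bool remove(const std::string& key) {
        if (!contains(key)) return false;
        for (int i = 0; i < 3; ++i) {
            std::uint8_t& c = _counters[slot(key, i)];
            // A saturated counter has lost its true count, so leave it alone.
            if (c > 0 && c < kMax) --c;
        }
        return true;
    }

private:
    static constexpr std::size_t kSlots = 1024;
    static constexpr std::uint8_t kMax = 255;

    // Stand-in hash: polynomial hash with a different multiplier per index.
    static std::size_t slot(const std::string& key, int i) {
        static const std::size_t mults[3] = {31, 131, 1313};
        std::size_t h = 0;
        for (unsigned char ch : key) h = h * mults[i] + ch;
        return h % kSlots;
    }

    std::array<std::uint8_t, kSlots> _counters{};  // all counters start at 0
};
```

Each slot now costs 8 bits instead of 1, which is the storage price mentioned above. Saturation keeps a heavily used counter from wrapping back to 0, at the cost that such a counter can never be decremented again.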

Summary

Bitmaps and Bloom filters are both designed for lookup problems over very large data sets. They are far more space-efficient than hash tables, but each has limits on where it applies. When using them, readers should still analyze the application scenario and choose the right data structure.

Origin blog.csdn.net/zcxmjw/article/details/131004791