C++ hash applications: bitmaps and Bloom filters



The drawback of using a hash table to store user records is that it consumes a large amount of memory; the drawback of using a bitmap is that bitmaps generally only handle integers, while the records are strings or custom types. So if we combine hashing with a bitmap into what is called a Bloom filter, can both problems be solved?

Concept

A Bloom filter is a probabilistic data structure. It supports efficient insertion and queries, and its answer to a membership query is either "definitely not present" or "possibly present". It uses multiple hash functions to map each piece of data onto a bitmap, which both speeds up queries and saves memory.

If only one hash function is used for the mapping, the following can happen: the string "string" is inserted first, and then the string "str" is queried against the bitmap. Because "str" collides with "string", the bitmap reports that "str" already exists.


Using multiple hash functions for the mapping greatly reduces the chance of this situation.

Suppose each inserted string is mapped to two bitmap positions by two hash functions. When the string "str" is queried, it is also mapped through both hash functions; one of its positions may collide with one of the positions of "string", but if the other position is reported as unset, then "str" definitely does not exist. Compared with a single hash function, the misjudgment rate is reduced, though it cannot be eliminated entirely. The more hash functions used, the more positions each item maps to on the bitmap, and the lower the false positive rate.


In this way, the Bloom filter reduces the misjudgment rate by mapping each item to multiple positions.

Practical use

When a Bloom filter reports that a piece of data exists, the answer may be wrong, because every position the data maps to may already have been set by one or more other items. In that case the database must still be queried to confirm.

When a Bloom filter reports that a piece of data does not exist, the answer is definitely correct: if the data had been inserted, all of its mapped bits would be 1 (unoccupied bits are 0), so finding even one 0 bit proves it was never inserted.

Controlling the false positive rate

If the Bloom filter is too short, a larger fraction of its bits will be occupied (set to 1), and the false positive rate will rise. The length of the Bloom filter therefore directly affects the false positive rate: the longer the filter, the lower the rate.

The more hash functions there are, the more bitmap positions each item occupies; if too many positions end up set to 1, the misjudgment rate becomes very high. But if there are too few hash functions, the misjudgment rate is also high.

So how should the length of the Bloom filter and the number of hash functions be chosen? This trade-off directly controls the false positive rate.

Through experiment, the following relationships were obtained:

m = -n·ln(p) / (ln 2)^2

k = (m/n)·ln 2

where n is the number of inserted elements, p is the false positive rate, m is the length of the Bloom filter (in bits), and k is the number of hash functions.

For a rough estimate: if 3 hash functions are used (k = 3) and ln 2 ≈ 0.7, then m = k·n / ln 2 ≈ 4.2n, i.e. the length of the Bloom filter should be about 4.2 times the number of inserted elements.

Implementation

Because the elements inserted into a Bloom filter include strings and other types, including custom ones, it is implemented as a class template; the caller only needs to supply a hash functor that converts the data type into an integer. Since Bloom filters most often handle strings, the type template parameter defaults to string.

The main member of the Bloom filter is a bitmap, so a non-type template parameter N is also provided so the caller can specify the bitmap length.

The four hash algorithms below, which score well in comparative tests, are used to convert strings into integers:

	struct BKDRHash
	{
		size_t operator()(const string& key)
		{
			size_t hash = 0;
			for (auto ch : key)
			{
				hash *= 131;
				hash += ch;
			}
			return hash;
		}
	};

	struct APHash
	{
		size_t operator()(const string& key)
		{
			unsigned int hash = 0;
			int i = 0;
			for (auto ch : key)
			{
				if ((i & 1) == 0)
				{
					hash ^= ((hash << 7) ^ (ch) ^ (hash >> 3));
				}
				else
				{
					hash ^= (~((hash << 11) ^ (ch) ^ (hash >> 5)));
				}
				++i;
			}
			return hash;
		}
	};

	struct DJBHash
	{
		size_t operator()(const string& key)
		{
			unsigned int hash = 5381;
			for (auto ch : key)
			{
				hash += (hash << 5) + ch;
			}
			return hash;
		}
	};

	struct JSHash
	{
		size_t operator()(const string& s)
		{
			size_t hash = 1315423911;
			for (auto ch : s)
			{
				hash ^= ((hash << 5) + ch + (hash >> 2));
			}
			return hash;
		}
	};

Insert and find

To insert an element into the Bloom filter, the data is mapped through the three hash functions and the corresponding bits are set to 1 (using bitset::set from the STL).

To test whether some data is in the Bloom filter, its three mapped positions are computed with the same hash functions and those bits are checked:

If all three bits are 1, return true: the data may exist (a misjudgment is possible).

If any one bit is 0, return false immediately: the data definitely does not exist (non-existence is exact).

	template<size_t N,                // maximum number of stored elements
		size_t X = 6,                 // average number of mapped bits per element
		class K = string,             // data type, defaulting to string
		class HashFunc1 = BKDRHash,
		class HashFunc2 = APHash,
		class HashFunc3 = DJBHash>
		//class HashFunc4 = JSHash>
	class BloomFilter
	{
	public:
		void Set(const K& key)
		{
			size_t hashi1 = HashFunc1()(key) % (N * X);
			size_t hashi2 = HashFunc2()(key) % (N * X);
			size_t hashi3 = HashFunc3()(key) % (N * X);
			_bts.set(hashi1);
			_bts.set(hashi2);
			_bts.set(hashi3);
			//size_t hashi4 = HashFunc4()(key) % (N * X);
		}

		bool Test(const K& key)
		{
			size_t hashi1 = HashFunc1()(key) % (N * X);
			if (!_bts.test(hashi1)) // absence is certain
			{
				return false;
			}
			size_t hashi2 = HashFunc2()(key) % (N * X);
			if (!_bts.test(hashi2)) // absence is certain
			{
				return false;
			}
			size_t hashi3 = HashFunc3()(key) % (N * X);
			if (!_bts.test(hashi3)) // absence is certain
			{
				return false;
			}

			return true; // may be a false positive: all mapped positions collided
		}

	private:
		std::bitset<N * X> _bts; // max element count times average mapped bits per element
	};

Example


In practice, a Bloom filter can only reduce the misjudgment rate, not eliminate it; repeated experiments still turn up conflicting data.


Deleting from a Bloom filter

Bloom filters generally do not support deletion, for the following reasons.

A Bloom filter cannot determine for certain that a piece of data exists (a positive answer may be a misjudgment), so:

  1. If the data to be deleted is only reported present because of a misjudgment, clearing its bits on the bitmap (setting them from 1 to 0) corrupts other data that genuinely maps to those positions.
  2. Even if the data really was inserted, clearing its bits still affects any other data mapped to the same positions.

If deletion must be supported, the safest option is to go to the database (on disk) to confirm whether the data actually exists before deleting, but that confirmation goes through file I/O and is extremely slow. An alternative is to attach a counter to each bit: inserting data increments the counters at its mapped positions, and deleting data decrements them. But this multiplies the memory required by the bitmap many times over, so the cost is large. For these reasons, Bloom filters generally do not support deletion.

Advantages of Bloom filters

  1. The time complexity of insertion and query is O(K) (K being the number of hash functions, generally small), independent of the data size.
  2. The hash functions are independent of one another, which is convenient for hardware parallelism.
  3. A Bloom filter does not store the elements themselves, a major advantage where confidentiality matters.
  4. When some misjudgment can be tolerated, a Bloom filter has a large space advantage over other data structures.
  5. When the data volume is large, a Bloom filter can still represent the full set, which other data structures cannot.
  6. Bloom filters built with the same set of hash functions support intersection, union, and difference operations.

Defects of Bloom filters

  1. There is a misjudgment rate, i.e. false positives: it cannot determine with certainty that an element is in the set (a remedy: maintain a whitelist of data prone to misjudgment).
  2. The elements themselves cannot be retrieved.
  3. Elements generally cannot be removed from a Bloom filter.
  4. If counters are used to support deletion, counter wrap-around becomes a possible problem.

Related big-data problems

Given two files, each containing 10 billion queries, and only 1 GB of memory, how do you find the intersection of the two files? Give an approximate algorithm.

Approach:

First read the queries from one file and insert them all into a Bloom filter.

Then read the queries from the other file and test each one against the Bloom filter. Every query that tests positive belongs to the intersection and is written out to a single file. That file must then be deduplicated, e.g. by loading it into a set or map. Because the filter can misjudge, this is an approximate algorithm.

Exact algorithm:

Assume each query averages 50 bytes; 10 billion queries then total about 500 GB per file. Since only 1 GB of memory is available, each file is split into 1000 sub-files with a hash function hashfunc: each query is used as a key and converted into an integer i = hashfunc(query) % 1000, and the query is appended to the sub-file Ai or Bi with that index. In this way both large files are partitioned into small files.


Because the same hashfunc splits both large files, identical queries in file A and file B produce the same index i and therefore land in sub-files with the same index (different queries may of course still collide on i).

Now it suffices to intersect A0 with B0, A1 with B1, A2 with B2, and so on; the union of these small intersections is the intersection of the two original large files.


In theory each small file averages about 512 MB, so one of the two sub-files with the same index i can be loaded into memory and placed in a set; then each query in the other sub-file is checked against the set, and every query found there belongs to the intersection.

However, because hashing does not split the data evenly, some sub-files may exceed 1 GB.

If at least one of the two sub-files with index i is under 1 GB, load the smaller one into memory as a set and stream the larger one past it to find the intersection.


If both sub-files with index i exceed 1 GB, split the two sub-files again by the same method (using a different hash function, otherwise every query would land in the same bucket again), and then compute the intersection after the re-split.


Origin blog.csdn.net/m0_71841506/article/details/130372032