[C++] Simple implementation of bitmap and Bloom filter

Bitmaps and Bloom Filters

1. Simple implementation of bitmap

1. What is a bitmap

2. Implementation of bitmap

2. Bloom filter

1. What is a Bloom filter?

2. Implementation of Bloom filter

3. Advantages of Bloom filter

4. Disadvantages of Bloom filter


1. Simple implementation of bitmap

1. What is a bitmap

        A bitmap uses individual bits, each either 0 or 1, to record the state of a value. It is generally used to indicate whether a value is present or absent. Because each value costs only one bit, a bitmap is well suited to massive sets of non-duplicate data.
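        As a minimal sketch of the idea, assuming the bits are stored in 32-bit words, the bit for a value x lives in word x / 32 at position x % 32. The helper names below (`word_index`, `bit_offset`, `bit_mask`, `words_needed`, `set_bit`) are illustrative and do not come from the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative helpers (hypothetical names): locate the bit that represents x
// when the bitmap is stored as an array of 32-bit words.
constexpr std::size_t kBitsPerWord = 32;

constexpr std::size_t word_index(std::size_t x) { return x / kBitsPerWord; }
constexpr std::size_t bit_offset(std::size_t x) { return x % kBitsPerWord; }

// Mask with only the bit for x set. The operand is unsigned so that
// shifting by 31 is well defined.
constexpr std::uint32_t bit_mask(std::size_t x)
{
	return std::uint32_t{1} << bit_offset(x);
}

// Number of 32-bit words needed to hold n bits (rounded up).
constexpr std::size_t words_needed(std::size_t n)
{
	return (n + kBitsPerWord - 1) / kBitsPerWord;
}

// Mark x as present in a word array that holds n_bits flags.
inline void set_bit(std::uint32_t* words, std::size_t n_bits, std::size_t x)
{
	assert(x < n_bits); // an out-of-range value would write past the array
	words[word_index(x)] |= bit_mask(x);
}
```

        For example, the value 100 lands in word 3 at bit 4. At one bit per value, one billion flags need about 125 MB, compared with about 4 GB if each value were stored as a 32-bit integer.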

2. Implementation of bitmap

        The implementation below assumes the data are non-negative integers.

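#pragma once

#include <cstddef>
#include <vector>

namespace bit
{
	// Use an unsigned type so the template argument cannot be negative
	template<std::size_t N>
	class bitset
	{
	public:
		bitset()
		{
			// One 32-bit word per 32 values, plus one word for the remainder
			_n.resize(N / 32 + 1);
		}

		// Map the given value into the bitmap (mark it as present)
		void set(std::size_t x)
		{
			std::size_t i = x / 32; // index of the word that holds x
			std::size_t j = x % 32; // bit position inside that word
			_n[i] |= (1u << j);     // 1u: shifting a signed 1 by 31 is undefined
		}

		// Unmap the value (mark it as absent)
		void reset(std::size_t x)
		{
			std::size_t i = x / 32;
			std::size_t j = x % 32;
			_n[i] &= ~(1u << j);
		}

		// Return true if the value is present
		bool test(std::size_t x) const
		{
			std::size_t i = x / 32;
			std::size_t j = x % 32;

			return (_n[i] & (1u << j)) != 0;
		}

	private:
		std::vector<unsigned int> _n;
	};
}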

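        Test cases and expected output:

#include <iostream>
using namespace std;

#include "BitSet.h"

int main()
{
	bit::bitset<1000> bs;
	bs.set(1);
	bs.set(10);
	bs.set(100);

	cout << bs.test(1) << endl;           // prints 1
	cout << bs.test(10) << endl;          // prints 1
	cout << bs.test(100) << endl;         // prints 1
	cout << bs.test(999) << endl<<endl;   // prints 0 (never set)

	bs.set(999);
	bs.reset(10);

	cout << bs.test(1) << endl;           // prints 1
	cout << bs.test(10) << endl;          // prints 0 (reset above)
	cout << bs.test(100) << endl;         // prints 1
	cout << bs.test(999) << endl << endl; // prints 1 (set above)

	// Left disabled: a negative size is a narrowing conversion to size_t,
	// and 0xffffffff bits would allocate about 512 MB of words.
	//bit::bitset<-1> bs1;
	//bit::bitset<0xffffffff> bs2;

	return 0;
}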

2. Bloom filter

1. What is a Bloom filter?

        A Bloom filter is a compact, clever probabilistic data structure proposed by Bloom in 1970. It supports efficient insertion and lookup, and quickly answers whether an element is definitely absent or possibly present. It uses several hash functions to map each element to several positions in a bitmap. Compared with ordinary data structures that store the elements themselves, it saves a great deal of memory.

        The easiest case to understand is tracking the status of strings. For example, in a game, each player must register a name that no other player is using. The company certainly stores every player's data, but with too many players, checking a candidate name against all of it makes the response too slow and hurts the player's experience. This is where a Bloom filter helps: each string (name) is hashed to several positions in the bitmap. If all of those bits are set, the name may already be taken, so it is then searched for in the full data set to confirm. If any of those bits is clear, the name is definitely unused and the result can be returned directly. This greatly reduces the number of expensive searches.
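        The lookup flow described above can be sketched as follows. This is only an illustration under stated assumptions: `NameRegistry` is a hypothetical class, a `std::unordered_set` stands in for the company's player database, and the filter uses a fixed 1024 bits with two hash functions to keep the example short.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical registry: a small Bloom filter sits in front of the
// authoritative name store (the unordered_set stands in for the database).
class NameRegistry
{
public:
	// Returns true if the name was free and is now registered.
	bool Register(const std::string& name)
	{
		if (IsTaken(name))
			return false;
		_store.insert(name);
		_filter.set(Index1(name));
		_filter.set(Index2(name));
		assert(IsTaken(name)); // a Bloom filter never gives a false negative
		return true;
	}

	// Fast path: if any mapped bit is clear, the name was never registered,
	// so the expensive lookup is skipped. Otherwise the filter only says
	// "possibly taken" and the store gives the final answer.
	bool IsTaken(const std::string& name) const
	{
		if (!_filter.test(Index1(name)) || !_filter.test(Index2(name)))
			return false;
		return _store.count(name) != 0;
	}

private:
	static constexpr std::size_t kBits = 1024;

	static std::size_t Index1(const std::string& s)
	{
		return std::hash<std::string>()(s) % kBits;
	}

	// Second hash: BKDR-style multiply-and-add, chosen only for illustration.
	static std::size_t Index2(const std::string& s)
	{
		std::size_t h = 0;
		for (unsigned char ch : s)
			h = h * 131 + ch;
		return h % kBits;
	}

	std::bitset<kBits> _filter;
	std::unordered_set<std::string> _store;
};
```

        Because every "possibly taken" answer is confirmed against the store, a false positive in the filter costs one extra lookup but never produces a wrong result.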

2. Implementation of Bloom filter

        A Bloom filter makes it practical to track a very large number of strings. The string hash functions used here are borrowed from well-known existing algorithms; only three of them (BKDR, AP, and DJB) are used in this article.

// Bloom filter
#include <cstddef>
#include <string>

#include "BitSet.h" // bit::bitset from the previous section

struct BKDRHash
{
	std::size_t operator()(const std::string& s) const
	{
		// BKDR: multiply by a small odd constant, then add each character
		std::size_t value = 0;
		for (auto ch : s)
		{
			value *= 31;
			value += ch;
		}
		return value;
	}
};

struct APHash
{
	std::size_t operator()(const std::string& s) const
	{
		// AP: alternate between two mixing steps on even and odd positions
		std::size_t hash = 0;
		for (std::size_t i = 0; i < s.size(); i++)
		{
			if ((i & 1) == 0)
			{
				hash ^= ((hash << 7) ^ s[i] ^ (hash >> 3));
			}
			else
			{
				hash ^= (~((hash << 11) ^ s[i] ^ (hash >> 5)));
			}
		}
		return hash;
	}
};

struct DJBHash
{
	std::size_t operator()(const std::string& s) const
	{
		// DJB: hash = hash * 33 + ch, starting from 5381
		std::size_t hash = 5381;
		for (auto ch : s)
		{
			hash += (hash << 5) + ch;
		}
		return hash;
	}
};

// Allocating X bits per expected element (X * N bits in total) reduces
// collisions in the filter; the larger X is, the fewer collisions occur.
template<std::size_t N, std::size_t X = 5, class K = std::string,
	class HashFunc1 = BKDRHash,
	class HashFunc2 = APHash,
	class HashFunc3 = DJBHash>
	class BloomFilter
{
public:
	// Set the three bits that the key hashes to
	void Set(const K& key)
	{
		std::size_t len = X * N;
		// HashFunc1()(key): construct the functor, then call it
		std::size_t index1 = HashFunc1()(key) % len;
		std::size_t index2 = HashFunc2()(key) % len;
		std::size_t index3 = HashFunc3()(key) % len;
		_bs.set(index1);
		_bs.set(index2);
		_bs.set(index3);
	}

	// false: the key is definitely absent; true: the key is possibly present
	bool Test(const K& key)
	{
		std::size_t len = X * N;

		std::size_t index1 = HashFunc1()(key) % len;
		if (_bs.test(index1) == false)
			return false;
		std::size_t index2 = HashFunc2()(key) % len;
		if (_bs.test(index2) == false)
			return false;
		std::size_t index3 = HashFunc3()(key) % len;
		if (_bs.test(index3) == false)
			return false;

		return true;
	}

	// Deletion is not supported: clearing a bit may affect other keys
	// that map to the same bit.
	void Reset(const K& key) = delete;

private:
	bit::bitset<X * N> _bs;
};
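        The comment on Reset explains why deletion is unsupported: several keys can share a bit. One known workaround, which is not part of the implementation above, is a counting Bloom filter that replaces each bit with a small counter. The sketch below is a minimal illustration under assumptions: `CountingBloomFilter` and its member names are hypothetical, it uses only two BKDR-style hashes, and its 8-bit counters are not protected against overflow.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Counting Bloom filter sketch (hypothetical class, not the article's
// BloomFilter): each slot is a counter rather than a single bit, so Remove
// can undo an Add without clearing slots that other keys still depend on.
template<std::size_t M>
class CountingBloomFilter
{
public:
	void Add(const std::string& key)
	{
		++_slots[Hash(key, 31) % M];
		++_slots[Hash(key, 131) % M];
	}

	// Only call Remove for keys that were actually added; removing a key
	// that was never inserted would corrupt the counters.
	void Remove(const std::string& key)
	{
		assert(MightContain(key));
		--_slots[Hash(key, 31) % M];
		--_slots[Hash(key, 131) % M];
	}

	// false: definitely absent; true: possibly present
	bool MightContain(const std::string& key) const
	{
		return _slots[Hash(key, 31) % M] != 0
			&& _slots[Hash(key, 131) % M] != 0;
	}

private:
	// BKDR-style hash with a configurable multiplier.
	static std::size_t Hash(const std::string& s, std::size_t multiplier)
	{
		std::size_t h = 0;
		for (unsigned char ch : s)
			h = h * multiplier + ch;
		return h;
	}

	// 8-bit counters, all zero initially; overflow is not handled here.
	std::array<std::uint8_t, M> _slots{};
};
```

        The trade-off is memory: with 8-bit counters this sketch uses eight times the space of a plain bitmap of the same length.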

3. Advantages of Bloom filter

1. Insertion and lookup are extremely fast: O(k), where k is the number of hash functions, regardless of how many elements have been added;

2. Within an acceptable error rate, it saves a large amount of memory;

3. The Bloom filter does not store the elements themselves, which gives it a certain degree of confidentiality.

4. Disadvantages of Bloom filter

1. Elements generally cannot be deleted once they are mapped, because clearing a bit may affect other elements, so each slot is effectively single-use;

2. The elements themselves cannot be retrieved; only their presence status can be queried;

3. Misjudgment is still possible: an element that was never added may be reported as present (a false positive).
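
How likely is such a misjudgment? A standard estimate, assuming the hash functions behave uniformly and independently (an idealization that the three string hashes in this article are not guaranteed to satisfy), gives the false positive rate p for a filter of m bits holding n elements with k hash functions. Plugging in this article's defaults (k = 3 and m = X * N = 5N, filled to n = N elements):

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k}
\qquad\text{with } k = 3,\; m = 5N,\; n = N:\quad
p \approx \left(1 - e^{-3/5}\right)^{3} \approx 0.4512^{3} \approx 0.092
```

Under these assumptions, a filter filled to its design capacity would misreport roughly 9% of absent keys. Raising X to 10 with the same three hash functions lowers the estimate to about 1.7%.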

Origin blog.csdn.net/qq_74641564/article/details/133586425