[Hash] Bitmap/Bloom filter

Bitmap

Preface

Before implementing the bitmap structure, let’s look at a problem:

Given 4 billion unique, unordered unsigned integers and one more unsigned integer, how can we quickly determine whether that integer is among the 4 billion?

Method 1: Traverse 4 billion pieces of data. We will find that the time complexity is O(N), which is too high for a data volume of 4 billion.

Method 2: Sort (O(N*logN)), then binary search (O(logN)). Once the data is sorted, each lookup is faster than in method 1. However, an unsigned integer takes 4 bytes, so 4 billion of them need about 16 GB of memory, and this still does not solve the problem.

Below we introduce bitmaps, which can solve the problem of insufficient memory very well:

Bitmap concept

A bitmap uses individual bits to record a state. It suits scenarios with massive amounts of non-duplicated data and is typically used to determine whether a given value exists.

Bitmap structure

In the problem above, each unsigned integer takes 4 bytes. Can a number be recorded in less space? Yes: a single bit is enough to record whether a number exists. Each bit is either 0 or 1, so bit x is set to 1 if the number x is present and left at 0 otherwise. One byte therefore tracks 8 numbers, and the 4 bytes of a single integer can track 32.
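The mapping from a number to its bit can be sketched as follows. This is only an illustration of the arithmetic, assuming 8 bits per byte; the names `BitPos`, `locate`, and `bitmap_bytes_for_uint32` are made up for this example and do not appear in the implementation below.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Position of a value's bit inside a byte-addressed bitmap (hypothetical helper type).
struct BitPos {
    std::size_t byte;  // index of the byte that holds the bit
    std::size_t bit;   // offset of the bit inside that byte (0..7)
};

// Map a value x to its byte index and bit offset, assuming 8 bits per byte.
BitPos locate(std::uint32_t x) {
    return BitPos{x / 8u, x % 8u};
}

// Bytes needed to give every 32-bit unsigned value its own bit:
// 2^32 bits / 8 = 2^29 bytes = 512 MiB.
std::uint64_t bitmap_bytes_for_uint32() {
    return (std::uint64_t{1} << 32) / 8;
}

// Small self-check of the arithmetic.
void check_bit_positions() {
    assert(locate(20).byte == 2 && locate(20).bit == 4);  // 20 = 2 * 8 + 4
    assert(bitmap_bytes_for_uint32() / (1024 * 1024) == 512);
}
```

So the whole range of 32-bit unsigned integers fits in 512 MiB, compared with about 16 GB for storing the 4 billion values themselves.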

Bitmap implementation

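#include <cstddef>
#include <vector>
using std::size_t;
using std::vector;

// N is the number of bits in the bitmap, i.e. the values 0..N-1 can be recorded
template<size_t N>
class bitset
{
public:
	bitset()
	{
		// N / 8 rounds down, so add 1 byte to cover the remaining bits
		// (and to avoid allocating 0 bytes when N is less than 8)
		_bit.resize(N / 8 + 1, 0);
	}

	// Record x in the bitmap
	void set(size_t X)
	{
		size_t hashi = X / 8; // hashi: index of the byte that holds x's bit
		size_t num = X % 8;   // num: position of x's bit within that byte
		_bit[hashi] |= (1 << num);
	}

	// Clear x's bit, i.e. remove x from the bitmap
	void reset(size_t X)
	{
		size_t hashi = X / 8;
		size_t num = X % 8;
		_bit[hashi] &= ~(1 << num);
	}

	// Check whether x is in the bitmap
	bool test(size_t X) const
	{
		size_t hashi = X / 8;
		size_t num = X % 8;
		return (_bit[hashi] & (1 << num)) != 0;
	}
private:
	// Note: this bitmap stores its bits in a vector of char, 8 bits per element
	vector<char> _bit;
};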

 Test the bitmap through a use case:
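A minimal sketch of such a use case is given here. To keep the block compilable on its own, it declares a compact stand-in for the bitmap under the assumed name `BitMap` (chosen to avoid clashing with `std::bitset`); the values 8, 9, 10, and 20 are arbitrary test inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compact stand-in for the bitmap above; BitMap is an assumed name.
template <std::size_t N>
class BitMap {
public:
    BitMap() : bits_(N / 8 + 1, 0) {}

    // Set the bit for x.
    void set(std::size_t x) {
        bits_[x / 8] |= static_cast<unsigned char>(1u << (x % 8));
    }
    // Clear the bit for x.
    void reset(std::size_t x) {
        bits_[x / 8] &= static_cast<unsigned char>(~(1u << (x % 8)));
    }
    // Report whether the bit for x is set.
    bool test(std::size_t x) const {
        return (bits_[x / 8] & (1u << (x % 8))) != 0;
    }

private:
    std::vector<unsigned char> bits_;
};

// Returns true when set/reset/test behave as expected.
bool bitmap_use_case() {
    BitMap<100> bm;
    bm.set(8);
    bm.set(9);
    bm.set(20);
    bool ok = bm.test(8) && bm.test(9) && bm.test(20) && !bm.test(10);
    bm.reset(9);  // 9 is removed, its neighbour 8 must be unaffected
    return ok && !bm.test(9) && bm.test(8);
}
```

Calling `bitmap_use_case()` from `main` and printing the result is enough to confirm that setting, clearing, and querying neighbouring bits do not interfere with each other.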

Bitmap advantages and disadvantages

Advantages: fast, space saving.

Disadvantages: only integers can be mapped. Types such as floating-point numbers and strings cannot be mapped directly.

Bloom filter

Preface

When we use a novel-reading app, the system pushes content based on our preferences. To avoid recommending novels we have already browsed, the app has to deduplicate against our browsing history and filter out anything already viewed. The question is: when the app wants to push a novel, how can it quickly determine whether that novel is already in the browsing history?

1. Use a hash table to store user records. Disadvantage: it wastes space.

2. Use a bitmap to store user records. Bitmaps generally handle integers, but a novel's content ID is a string, so a bitmap alone cannot handle it.

3. Combine hashing with a bitmap, i.e. a Bloom filter, to deal with this problem.

Bloom filter concept

The Bloom filter is a compact and clever probabilistic data structure proposed by Burton Howard Bloom in 1970. It is characterized by efficient insertion and query, and can tell you that something "definitely does not exist or may exist". It uses multiple hash functions to map a piece of data into a bitmap structure. This approach not only improves query efficiency but also saves a lot of memory.

Bloom filter structure

For example, insert several keywords into the filter: "A Dream of Red Mansions", "Journey to the West", "The Romance of the Gods", and "The Romance of the Three Kingdoms". Suppose we use three hash functions, so each keyword maps to three bit positions.

The underlying structure of the Bloom filter is also a bitmap, but storage and lookup both go through hash functions. In this example there are three hash functions, so each string has three mapped positions in the Bloom filter. Because different strings can share positions, the Bloom filter can only tell us that something definitely does not exist or may exist.

Now suppose we query whether the book "Ordinary World" exists in the Bloom filter. If its hash values are 3, 4, and 6 and bit 6 is still 0, the filter reports that the book does not exist, and that answer is certain. But if its hash values are 3, 4, and 5, a misjudgment occurs: previously stored data has already set all three positions to 1, so the filter reports the book as present even though it was never inserted. The characteristic of the Bloom filter is therefore that "does not exist" is always accurate, while "exists" may be a false positive.
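The scenario can be reproduced with a tiny 8-bit filter. This is a sketch under stated assumptions: the positions {1, 3, 4} and {4, 5, 7} for the two stored titles are invented for illustration, and the helper names are hypothetical. Only the query positions 3, 4, 6 and 3, 4, 5 come from the text.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>

// Three bit positions produced by three hash functions for one title.
using Positions = std::array<std::size_t, 3>;

// Insert: set every mapped position to 1.
void insert_positions(std::bitset<8>& bits, const Positions& pos) {
    for (std::size_t p : pos) bits.set(p);
}

// Query: "may exist" only if every mapped position is 1.
bool may_exist(const std::bitset<8>& bits, const Positions& pos) {
    for (std::size_t p : pos) {
        if (!bits.test(p)) return false;  // one 0 bit proves absence
    }
    return true;
}

// Two stored titles with assumed positions; together they set bits 1, 3, 4, 5, 7.
std::bitset<8> build_example_filter() {
    std::bitset<8> bits;
    insert_positions(bits, Positions{{1, 3, 4}});
    insert_positions(bits, Positions{{4, 5, 7}});
    return bits;
}

void check_false_positive_example() {
    const std::bitset<8> bits = build_example_filter();
    const Positions absent{{3, 4, 6}};   // bit 6 is 0: definitely not stored
    const Positions collide{{3, 4, 5}};  // all bits set by other titles
    assert(!may_exist(bits, absent));
    assert(may_exist(bits, collide));    // false positive: never inserted
}
```

The second query returns true even though no inserted title had positions 3, 4, 5, which is exactly the misjudgment described above.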

Bloom filter implementation

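#include <cstddef>
#include <iostream>
#include <string>
using std::cout;
using std::endl;
using std::size_t;
using std::string;

// BKDR-style string hash: add each character, then multiply by 31
struct BKDRHash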
{
	size_t operator()(const string& s)
	{
		size_t hash = 0;
		for (auto ch : s)
		{
			hash += ch;
			hash *= 31;
		}
		return hash;
	}
};

struct APHash
{
	size_t operator()(const string& s)
	{
		size_t hash = 0;
		for (size_t i = 0; i < s.size(); i++) // size_t avoids a signed/unsigned comparison
		{
			if ((i & 1) == 0)
			{
				hash ^= ((hash << 7) ^ s[i] ^ (hash >> 3));
			}
			else
			{
				hash ^= (~((hash << 11) ^ s[i] ^ (hash >> 5)));
			}
		}
		return hash;
	}
};

struct DJBHash
{
	size_t operator()(const string& s)
	{
		size_t hash = 5381;
		for (auto ch : s)
		{
			hash += (hash << 5) + ch;
		}
		return hash;
	}
};

// Three string hash functions are provided above; you can design other hash functions to suit your needs
template<size_t N,
	class K = string,
	class Hash1 = BKDRHash,
	class Hash2 = APHash,
	class Hash3 = DJBHash>
class BloomFilter
{
public:
	void set(const K& key)
	{
		size_t len = N * _X;
		size_t hashi1 = Hash1()(key) % len;
		_bitset.set(hashi1);

		size_t hashi2 = Hash2()(key) % len;
		_bitset.set(hashi2);

		size_t hashi3 = Hash3()(key) % len;
		_bitset.set(hashi3);

		cout << hashi1 << ' ' << hashi2 << ' ' << hashi3 << endl;
	}

	bool test(const K& key)
	{
		size_t len = N * _X;
		size_t hashi1 = Hash1()(key) % len;
		if (!_bitset.test(hashi1))
		{
			return false;
		}

		size_t hashi2 = Hash2()(key) % len;
		if (!_bitset.test(hashi2))
		{
			return false;
		}

		size_t hashi3 = Hash3()(key) % len;
		if (!_bitset.test(hashi3))
		{
			return false;
		}

		return true; // Note: a "present" result may be a false positive, because other keys may have set the same positions as this key
	}
private:
	static const size_t _X = 3;
	// Allocate _X bits per expected key: a longer bit array keeps the false-positive rate from rising as more data is stored
	bitset<N*_X> _bitset;
};
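To get a feel for how the multiplier `_X` affects misjudgment, the commonly used approximation p ≈ (1 - e^(-k*n/m))^k can be evaluated, where n is the number of stored keys, m the number of bits, and k the number of hash functions. The sketch below assumes the hash functions behave as independent, uniformly distributed functions, which the three string hashes above only approximate; the function name is made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Approximate false-positive rate of a Bloom filter holding n keys in m bits
// with k hash functions, assuming independent and uniform hashes.
double estimated_false_positive_rate(double n, double m, double k) {
    return std::pow(1.0 - std::exp(-k * n / m), k);
}

// Compare the layout used above (3 bits per key) with a roomier one.
void check_estimates() {
    // _X = 3, k = 3: m = 3n, so the estimate is (1 - e^-1)^3, roughly 0.25.
    double tight = estimated_false_positive_rate(1000.0, 3000.0, 3.0);
    // 10 bits per key with the same 3 hashes gives roughly 0.017.
    double roomy = estimated_false_positive_rate(1000.0, 10000.0, 3.0);
    assert(tight > 0.25 && tight < 0.26);
    assert(roomy > 0.017 && roomy < 0.018);
}
```

Under these assumptions, a filter filled to its full capacity of N keys with `_X = 3` would misjudge roughly one absent key in four, so a larger `_X` is worth considering when false positives are costly.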

Bloom filter advantages 


1. Adding and querying an element both take O(K) time, where K is the number of hash functions (generally small), regardless of how much data is stored.
2. The hash functions are independent of each other, which makes hardware parallelism convenient.
3. The Bloom filter does not need to store the elements themselves, which is a big advantage where confidentiality requirements are strict.
4. When a certain false-positive rate is acceptable, the Bloom filter has a large space advantage over other data structures.
5. When the amount of data is very large, a Bloom filter can still represent the entire set, which other data structures may not be able to do.
6. Bloom filters using the same set of hash functions can perform intersection, union, and difference operations.
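Point 6 is easiest to see for union: when two filters have the same size and hash functions, OR-ing their bit arrays gives the filter that inserting both sets would have produced. Intersection via AND is only approximate. The sketch below is a simplified stand-in, not the `BloomFilter` class above: `TinyBloom`, `slot`, and `unite` are hypothetical names, and the three hashes are simulated by salting `std::hash`.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

constexpr std::size_t kBits = 64;             // filter size, arbitrary for the demo
constexpr char kSalts[3] = {'a', 'b', 'c'};   // one salt per simulated hash function

// Hypothetical stand-in for three hash functions: std::hash over key + salt.
std::size_t slot(const std::string& key, char salt) {
    return std::hash<std::string>{}(key + salt) % kBits;
}

struct TinyBloom {
    std::bitset<kBits> bits;

    void set(const std::string& key) {
        for (char s : kSalts) bits.set(slot(key, s));
    }
    bool test(const std::string& key) const {
        for (char s : kSalts) {
            if (!bits.test(slot(key, s))) return false;
        }
        return true;
    }
};

// Union of two filters built with the same size and hashes: bitwise OR.
TinyBloom unite(const TinyBloom& a, const TinyBloom& b) {
    TinyBloom result;
    result.bits = a.bits | b.bits;
    return result;
}
```

After `unite(a, b)`, every key that was set in either `a` or `b` still tests as possibly present, because none of its bits are lost by the OR.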

Bloom filter defects

1. There is a false-positive rate (False Positive): the filter cannot determine with certainty that an element is in the set (remedy: keep a whitelist of data that may be misjudged).
2. The element itself cannot be retrieved.
3. In general, elements cannot be deleted from a Bloom filter.
4. If deletion is implemented with counters, the counters may wrap around.
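Points 3 and 4 can be illustrated with a counting variant, where each position holds a small counter instead of a single bit. This is a minimal sketch, not part of the implementation above: `CountingBloom` is a hypothetical class, the hashes are simulated by salting `std::hash`, and saturating 8-bit counters are one possible way to avoid wraparound (a counter that reaches 255 is pinned and never decremented).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <string>
#include <vector>

class CountingBloom {
public:
    explicit CountingBloom(std::size_t slots) : counts_(slots, 0) {}

    void add(const std::string& key) {
        for (std::size_t i : slots_for(key)) {
            if (counts_[i] < kMax) ++counts_[i];  // saturate instead of wrapping to 0
        }
    }

    void remove(const std::string& key) {
        if (!test(key)) return;  // definitely absent: nothing to undo
        for (std::size_t i : slots_for(key)) {
            // A saturated counter has lost its true count, so it stays pinned.
            if (counts_[i] > 0 && counts_[i] < kMax) --counts_[i];
        }
    }

    bool test(const std::string& key) const {
        for (std::size_t i : slots_for(key)) {
            if (counts_[i] == 0) return false;
        }
        return true;
    }

private:
    static constexpr std::uint8_t kMax = std::numeric_limits<std::uint8_t>::max();

    // Three simulated hash functions: std::hash over key plus a salt character.
    std::array<std::size_t, 3> slots_for(const std::string& key) const {
        std::hash<std::string> h;
        return {h(key + 'a') % counts_.size(),
                h(key + 'b') % counts_.size(),
                h(key + 'c') % counts_.size()};
    }

    std::vector<std::uint8_t> counts_;
};
```

The price is memory (8 bits per position here instead of 1), and removing a key that was never added but happens to test positive can still corrupt the counts, so deletion remains something to use with care.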

Origin blog.csdn.net/m0_74459723/article/details/130948672