Data Structures: Bitmap and Bloom Filter

Contents

1. Bitmap

2. Bloom filter


1. Bitmap

A bitmap uses a single bit to store the state of each item. It suits scenarios with massive amounts of data and no duplicate values, and is typically used to check whether a given value is present.

Underneath, a bitmap is implemented with an array. Each binary bit of each array element indicates whether one value is present: 1 means the value exists, and 0 means it does not.

A bitmap is in effect a direct-addressing hash: the value itself determines the bit position, and that bit records whether the value is present. Each bit has only two states: 1 means present and 0 means absent.

For example, in the figure from the original post, when bit 25 is found to be 1, we can conclude that the value 136 exists.

Main bitmap interfaces

1. Set (set the corresponding bit to 1)

Useful bit operation tricks:

1. Clear the rightmost n bits of x: x & (~0 << n)
2. Get the value of the nth bit of x: (x >> n) & 1
3. Get the weight of the nth bit of x (2^n if the bit is set, otherwise 0): x & (1 << n)
4. Set only the nth bit to 1: x | (1 << n)
5. Set only the nth bit to 0: x & (~(1 << n))
6. Clear x from the most significant bit down to the nth bit (inclusive): x & ((1 << n) - 1)
7. Clear x from the nth bit down to the 0th bit (inclusive): x & (~((1 << (n + 1)) - 1))
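The tricks above can be checked with a few small helpers. This is only a sketch: the function names are made up for illustration, a 32-bit unsigned word is assumed, and n is assumed to be less than 32 (less than 31 for the last helper, since it shifts by n + 1).

```cpp
#include <cassert>
#include <cstdint>

using u32 = std::uint32_t;  // assumed word type for this sketch

// 1. Clear the rightmost n bits.
constexpr u32 clear_low_bits(u32 x, unsigned n) { return x & (~u32{0} << n); }
// 2. Value (0 or 1) of the nth bit.
constexpr u32 get_bit(u32 x, unsigned n) { return (x >> n) & u32{1}; }
// 3. Weight of the nth bit: 2^n if set, otherwise 0.
constexpr u32 bit_weight(u32 x, unsigned n) { return x & (u32{1} << n); }
// 4. Set only the nth bit to 1.
constexpr u32 set_bit(u32 x, unsigned n) { return x | (u32{1} << n); }
// 5. Set only the nth bit to 0.
constexpr u32 clear_bit(u32 x, unsigned n) { return x & ~(u32{1} << n); }
// 6. Clear from the most significant bit down to bit n (keeps bits 0..n-1).
constexpr u32 keep_below(u32 x, unsigned n) { return x & ((u32{1} << n) - 1); }
// 7. Clear from bit n down to bit 0 (requires n < 31 here).
constexpr u32 clear_through(u32 x, unsigned n) { return x & ~((u32{1} << (n + 1)) - 1); }
```

Using unsigned operands throughout avoids the undefined or implementation-defined behavior that shifting a signed 1 into the sign bit can cause.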

Corresponding code for Set:

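void Set(size_t x){
	assert(x < N);

	// i: index of the 32-bit integer that holds the bit for x
	// j: position of that bit within the integer
	size_t i = x / 32;
	size_t j = x % 32;

	// Set bit j of _bits[i] to 1 without affecting the other bits.
	// 1u avoids shifting a signed 1 into the sign bit when j == 31.
	_bits[i] |= (1u << j);
}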

2. Reset (clear the corresponding bit to 0)

Corresponding code:

void Reset(size_t x){
	assert(x < N);

	size_t i = x / 32;
	size_t j = x % 32;

	// Clear bit j of _bits[i] to 0 without affecting the other bits.
	_bits[i] &= (~(1u << j));
}

3. Test (check whether a value exists)

bool Test(size_t x)
{
	assert(x < N);

	size_t i = x / 32;
	size_t j = x % 32;

	// If bit j is 1 the masked result is non-zero, which means true.
	// If bit j is 0 the masked result is 0, which means false.
	return (_bits[i] & (1u << j)) != 0;
}

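Complete code:

#include <cassert>
#include <cstddef>
#include <vector>

template<size_t N>
class BitSet
{
public:
	BitSet()
	{
		// N / 32 + 1 integers are enough to hold N bits.
		_bits.resize(N / 32 + 1, 0);
	}

	// Mark the bit that x maps to as 1.
	void Set(size_t x)
	{
		assert(x < N);

		// i: index of the 32-bit integer that holds the bit for x
		// j: position of that bit within the integer
		size_t i = x / 32;
		size_t j = x % 32;

		// Set bit j of _bits[i] to 1 without affecting the other bits.
		_bits[i] |= (1u << j);
	}

	// Mark the bit that x maps to as 0.
	void Reset(size_t x)
	{
		assert(x < N);

		size_t i = x / 32;
		size_t j = x % 32;

		// Clear bit j of _bits[i] to 0 without affecting the other bits.
		_bits[i] &= (~(1u << j));
	}

	bool Test(size_t x) const
	{
		assert(x < N);

		size_t i = x / 32;
		size_t j = x % 32;

		// If bit j is 1 the masked result is non-zero, which means true.
		// If bit j is 0 the masked result is 0, which means false.
		return (_bits[i] & (1u << j)) != 0;
	}
private:
	// Unsigned storage, so setting bit 31 never touches a sign bit.
	std::vector<unsigned int> _bits;
};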

4. Applications of bitmaps

1. Quickly checking whether a value is in a set
2. Sorting
3. Finding the intersection, union, etc. of two sets
4. Marking disk blocks in an operating system
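As an illustration of the sorting application, here is a minimal sketch that sorts non-negative integers by marking one bit per value and then scanning the bits in order. The function name bitmap_sort and the max_value parameter are assumptions made for this example, and std::vector<bool> stands in for a hand-written bitmap. Duplicate inputs collapse to a single output value, which is why the technique fits data without repetition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sorts values in [0, max_value] by setting one bit per value and then
// walking the bitmap from low to high. Duplicates appear only once.
std::vector<std::size_t> bitmap_sort(const std::vector<std::size_t>& values,
                                     std::size_t max_value) {
    std::vector<bool> present(max_value + 1, false);  // one bit per possible value
    for (std::size_t v : values) {
        assert(v <= max_value);
        present[v] = true;
    }

    std::vector<std::size_t> sorted;
    for (std::size_t v = 0; v <= max_value; ++v) {
        if (present[v]) {
            sorted.push_back(v);
        }
    }
    return sorted;
}
```

The running time is proportional to the number of inputs plus the size of the value range, so this only pays off when the range is not much larger than the data.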

2. Bloom filter

1. Introduction to the Bloom filter

When we read news in a news client, it keeps recommending new content, and each time it deduplicates the recommendations by removing what we have already seen. How does the recommendation system implement this push deduplication? The server records everything each user has viewed. When the system recommends news, it checks the candidates against that user's history and filters out the items already recorded. How can this lookup be made fast?
1. Use a hash table to store user records. Disadvantage: wastes space.
2. Use a bitmap to store user records. Disadvantage: can only handle integers.
3. Combine hashing with a bitmap, that is, a Bloom filter.

2. The concept of the Bloom filter

A Bloom filter is a compact and ingenious probabilistic data structure characterized by efficient insertion and lookup. It can tell you that an element "definitely does not exist" or "may exist". It works by using multiple hash functions to map each piece of data into a bitmap. This not only speeds up queries but also saves a great deal of memory.

Implementation principle:

1. Create a BitSet of m bits and initialize all bits to 0.

Inserting data:

Pass the string through k hash functions to compute k values in the range 0 to m-1, and set the BitSet bits at those k positions to 1.

Lookup process:

1. Pass the data through the k hash functions to compute k values.

2. If all k bits are 1, the data is judged to exist. (This may be a misjudgment, because different values can map to the same positions.)

3. If any one of the bits is 0, the data definitely does not exist.

4. Note that the size of the bit array must be decided in advance.
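Because the bit array has to be sized up front, it can help to estimate m and k from the expected number of elements n and a target false positive rate p. The sketch below uses the commonly cited approximations m = -n ln p / (ln 2)^2 and k = (m / n) ln 2. These formulas, the struct BloomParams, and the function suggest_bloom_params are not from the original article; they are included only as a rough sizing aid.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical result type: suggested number of bits (m) and hash functions (k).
struct BloomParams {
    std::size_t bits;
    std::size_t hashes;
};

// n: expected number of inserted elements, p: target false positive rate (0 < p < 1).
BloomParams suggest_bloom_params(std::size_t n, double p) {
    assert(n > 0 && p > 0.0 && p < 1.0);
    const double ln2 = std::log(2.0);
    // m = -n * ln(p) / (ln 2)^2, rounded up.
    const double m = std::ceil(-static_cast<double>(n) * std::log(p) / (ln2 * ln2));
    // k = (m / n) * ln 2, rounded to the nearest integer and at least 1.
    const double k = std::round(m / static_cast<double>(n) * ln2);
    return {static_cast<std::size_t>(m),
            static_cast<std::size_t>(k < 1.0 ? 1.0 : k)};
}
```

For roughly 1000 elements and a 1% target rate this suggests a little under 10 bits per element and 7 hash functions, so a filter with a fixed set of 3 hash functions would need more bits per element to reach a similar rate.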

Deletion

Bloom filters do not directly support deletion, because deleting one element may affect other elements.
For example, take the "tencent" and "baidu" elements from the original figure, whose bits as computed by the hash functions happen to overlap. If the bits corresponding to "tencent" are simply set to 0, "baidu" is in effect deleted as well.
One way to support deletion is to expand each bit of the Bloom filter into a small counter. When inserting an element, increment the k counters (at the hash addresses computed by the k hash functions) by one; when deleting an element, decrement those k counters by one. Deletion is gained at the cost of several times the storage space.

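Defects of the counting approach:
1. It cannot confirm whether an element is really in the Bloom filter, so deleting an element that was never inserted corrupts the counters.
2. Counters can wrap around (overflow).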
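The counter-based deletion scheme described above can be sketched as follows. This is an illustrative sketch rather than the article's implementation: the class name CountingBloomFilter, the 8-bit counter width, and the way the three positions are derived from two simple string hashes are all assumptions. Counters saturate at 255 instead of wrapping, which sidesteps the wraparound defect at the cost of never decrementing a saturated counter.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Counting Bloom filter sketch: M 8-bit counters, 3 positions per key.
template <std::size_t M>
class CountingBloomFilter {
public:
    void Insert(const std::string& key) {
        for (std::size_t pos : Positions(key)) {
            if (_counters[pos] < 255) ++_counters[pos];  // saturate, never wrap
        }
    }

    // Only call this for keys that were really inserted; otherwise counters
    // belonging to other keys are decremented and false negatives can appear.
    void Remove(const std::string& key) {
        for (std::size_t pos : Positions(key)) {
            // A saturated counter has lost its exact count, so leave it alone.
            if (_counters[pos] > 0 && _counters[pos] < 255) --_counters[pos];
        }
    }

    bool MayContain(const std::string& key) const {
        for (std::size_t pos : Positions(key)) {
            if (_counters[pos] == 0) return false;  // definitely absent
        }
        return true;  // possibly present
    }

private:
    // Derives three positions from a BKDR-style and a DJB-style hash.
    static std::array<std::size_t, 3> Positions(const std::string& key) {
        std::size_t h1 = 0, h2 = 5381;
        for (unsigned char ch : key) {
            h1 = h1 * 131 + ch;
            h2 = h2 * 33 + ch;
        }
        return {h1 % M, h2 % M, (h1 + 2 * h2 + 1) % M};
    }

    std::array<std::uint8_t, M> _counters{};  // all counters start at 0
};
```

With counters, removing "tencent" leaves every counter that "baidu" incremented at a value of at least 1, so "baidu" is still reported as possibly present.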

Corresponding code implementation:

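#include <string>

struct HashBKDR
{
	// Converts a string to an integer, because only an integer can be taken
	// modulo the table size to compute a mapped position.
	// Goal: different strings should produce different integers as far as
	// possible, even strings with the same character sum such as
	// "abcd" / "bcad" or "abbb" / "abca".
	size_t operator()(const std::string& s) const
	{
		// BKDR Hash
		size_t value = 0;
		for (auto ch : s)
		{
			value += ch;
			value *= 131;
		}

		return value;
	}
};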

struct HashAP
{
	// Converts a string to an integer so that it can be mapped to a bit
	// position by taking it modulo the table size.
	size_t operator()(const std::string& s) const
	{
		// AP Hash (the register keyword was dropped: it is removed in C++17)
		size_t hash = 0;
		size_t ch;
		for (size_t i = 0; i < s.size(); i++)
		{
			ch = s[i];
			if ((i & 1) == 0)
			{
				// even index
				hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
			}
			else
			{
				// odd index
				hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
			}
		}
		return hash;
	}
};

struct HashDJB
{
	// Converts a string to an integer so that it can be mapped to a bit
	// position by taking it modulo the table size.
	size_t operator()(const std::string& s) const
	{
		// DJB Hash (the register keyword was dropped: it is removed in C++17)
		size_t hash = 5381;
		for (auto ch : s)
		{
			// Equivalent to hash = hash * 33 + ch
			hash += (hash << 5) + ch;
		}

		return hash;
	}
};

template<size_t N, class K = std::string,
class Hash1 = HashBKDR,
class Hash2 = HashAP,
class Hash3 = HashDJB>
class BloomFilter
{
public:
	void Set(const K& key)
	{
		// Hash1()(key) creates a temporary functor and calls it; equivalent to:
		//   Hash1 hf1;
		//   size_t i1 = hf1(key);
		size_t i1 = Hash1()(key) % N;
		size_t i2 = Hash2()(key) % N;
		size_t i3 = Hash3()(key) % N;

		// Set all three mapped bits.
		_bitset.Set(i1);
		_bitset.Set(i2);
		_bitset.Set(i3);
	}

	bool Test(const K& key)
	{
		// Return early as soon as any mapped bit is 0.
		size_t i1 = Hash1()(key) % N;
		if (_bitset.Test(i1) == false)
		{
			return false;
		}

		size_t i2 = Hash2()(key) % N;
		if (_bitset.Test(i2) == false)
		{
			return false;
		}

		size_t i3 = Hash3()(key) % N;
		if (_bitset.Test(i3) == false)
		{
			return false;
		}

		// All three bits are set, but they may have been set by other keys,
		// so "present" is not reliable (false positives are possible).
		// "Absent" is always accurate.
		return true;
	}

private:
	// The bitmap class defined earlier is named BitSet (capital B).
	BitSet<N> _bitset;
};

Pros and cons of Bloom filters:

Advantages:

1. The time complexity of inserting and querying an element is O(K), where K is the number of hash functions (generally small), independent of the amount of data.
2. The hash functions are independent of one another, which makes hardware parallelism convenient.
3. A Bloom filter does not store the elements themselves, which is a great advantage where confidentiality requirements are strict.
4. When a certain misjudgment rate is tolerable, a Bloom filter has a large space advantage over other data structures.
5. When the amount of data is very large, a Bloom filter can represent the full set where other data structures cannot.
6. Bloom filters that use the same set of hash functions can perform set operations such as intersection and union.
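The set-operation advantage can be illustrated for the union case: if two filters have the same size and hash functions, OR-ing their bit arrays gives a filter that reports every element of either input as possibly present. The sketch below is an assumption-laden illustration. SimpleBloom and Union are made-up names, std::bitset stands in for the BitSet class, and only two simple hashes are used to keep it short.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// Tiny Bloom filter with M bits and two hash functions (illustration only).
template <std::size_t M>
struct SimpleBloom {
    std::bitset<M> bits;

    static std::size_t H1(const std::string& s) {  // BKDR-style
        std::size_t h = 0;
        for (unsigned char ch : s) h = h * 131 + ch;
        return h % M;
    }
    static std::size_t H2(const std::string& s) {  // DJB-style
        std::size_t h = 5381;
        for (unsigned char ch : s) h = h * 33 + ch;
        return h % M;
    }

    void Set(const std::string& key) {
        bits.set(H1(key));
        bits.set(H2(key));
    }
    bool Test(const std::string& key) const {
        return bits.test(H1(key)) && bits.test(H2(key));
    }
};

// Union of two filters with identical size and hashes: bitwise OR of the bits.
template <std::size_t M>
SimpleBloom<M> Union(const SimpleBloom<M>& a, const SimpleBloom<M>& b) {
    SimpleBloom<M> out;
    out.bits = a.bits | b.bits;
    return out;
}
```

The OR result is identical to the filter that would be obtained by inserting both sets into one empty filter, so it introduces no false negatives.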

Disadvantages:

1. There is a false positive rate (False Positive): a Bloom filter cannot determine with certainty that an element is in the set. (Remedy: build a whitelist to store the data that may be misjudged.)
2. The elements themselves cannot be retrieved.
3. In general, elements cannot be deleted from a Bloom filter.
4. If counting is used to support deletion, counters may wrap around.


Origin blog.csdn.net/qq_56999918/article/details/123505371