Bitmap | Bloom filter mock implementation | STL source code analysis series | Hand-written STL

Today I bring you mock implementations of a bitmap and a Bloom filter.


Foreword

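First, let me recommend a few of my columns that are packed with practical content!

Hand-Written Data Structures: https://blog.csdn.net/yu_cblog/category_11490888.html?spm=1001.2014.3001.5482 This column collects many of my notes from studying data structures. Every post was written with great care, so if you are interested, please show some support!
Algorithms column: https://blog.csdn.net/yu_cblog/category_11464817.html

The STL Source Code Analysis column will keep being updated with mock implementations of the various STL containers.
STL Source Code Analysis: https://blog.csdn.net/yu_cblog/category_11983210.html?spm=1001.2014.3001.5482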


Bitmap and Bloom filters

Bitmap

A bitmap is a data structure that stores and processes data with the bit as its smallest unit. It represents a set of binary flags or states, so it can hold a large amount of Boolean information very compactly.

One of the most common uses of bitmaps is to represent the state of a collection or flag. For example, a bitmap can be used to represent a collection of elements, where each element corresponds to a bit in the bitmap. If the value of the bit is 1, it means that the element is in the set; if the value of the bit is 0, it means that the element is not in the set.

In C or C++, a bitmap can be implemented with a single unsigned integer (e.g. unsigned int, unsigned long) when the range of values is small, or with an array of such integers when it is larger.
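As a minimal sketch of the single-integer case, the snippet below keeps a set of values in the range [0, 64) inside one 64-bit unsigned integer. The name SmallBitmap and its members are illustrative only and do not appear in the implementation later in this post.

```cpp
#include <cassert>
#include <cstdint>

// A set of integers in the range [0, 64), one bit per value.
// SmallBitmap is a hypothetical name used only for this illustration.
struct SmallBitmap {
    std::uint64_t bits = 0;

    // Turn bit x on with a bitwise OR.
    void set(unsigned x) {
        assert(x < 64);
        bits |= (std::uint64_t{1} << x);
    }

    // Turn bit x off by AND-ing with the inverted mask.
    void reset(unsigned x) {
        assert(x < 64);
        bits &= ~(std::uint64_t{1} << x);
    }

    // Report whether bit x is on.
    bool test(unsigned x) const {
        assert(x < 64);
        return (bits >> x) & 1u;
    }
};
```

After `SmallBitmap s; s.set(3);`, `s.test(3)` is true and `s.test(4)` is false. The whole set costs 8 bytes regardless of how many of the 64 values it contains.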

Bloom filter

A Bloom filter is a probabilistic data structure for efficiently testing whether an element belongs to a set. It builds on the bitmap, but maps each element to several bits through multiple hash functions. This lets it handle keys such as strings and lowers the chance that two different elements collide on every bit.

A Bloom filter consists of a bit array and several hash functions. Initially, every bit in the array is 0. To insert an element, the element is passed through each hash function to obtain several hash values, and the bit at each corresponding position is set to 1. To check whether an element is in the set, the same hash functions are applied and the corresponding bits are examined. If any bit is 0, the element is definitely not in the set; if all bits are 1, the element may be in the set (with some probability of a false positive).

The main advantage of a Bloom filter is that insertion and lookup are fast: both take O(k) time, where k is the number of hash functions, usually a small constant. It is also space-efficient, because it stores only the bit array and never the elements themselves; memory use is fixed by the size of the bit array.

However, Bloom filters also have limitations. First, they have a false positive rate: a lookup may report that an element is in the set when it was never inserted. The reverse never happens; an inserted element is always found. Second, an inserted element cannot be deleted, because clearing its bits may also clear bits shared with other elements and change their lookup results. Bloom filters therefore suit scenarios that need fast lookups and can tolerate some false positives, such as caching and preventing repeated operations.
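The deletion problem can be illustrated with a toy filter. In this sketch each key is represented directly by the three bit positions its hash functions would produce; the positions, the 16-bit size, and the names TinyBloom, naive_erase and erase_breaks_other_key are all made up for the illustration.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>

// The three bit positions a key hashes to (hypothetical values, chosen by hand).
using Positions = std::array<std::size_t, 3>;

struct TinyBloom {
    std::bitset<16> bits;

    void insert(const Positions& p) {
        for (std::size_t i : p) {
            assert(i < bits.size());
            bits.set(i);
        }
    }

    // False means "definitely absent"; true means "possibly present".
    bool maybe_contains(const Positions& p) const {
        for (std::size_t i : p) {
            if (!bits.test(i)) return false;
        }
        return true;
    }

    // Naive "delete": clear the key's bits. This is the operation to avoid.
    void naive_erase(const Positions& p) {
        for (std::size_t i : p) bits.reset(i);
    }
};

// Returns true when erasing key a makes the filter wrongly report key b as absent.
bool erase_breaks_other_key() {
    const Positions a{1, 5, 9};
    const Positions b{5, 7, 12};  // shares bit 5 with a
    TinyBloom f;
    f.insert(a);
    f.insert(b);
    f.naive_erase(a);             // also clears bit 5, which b still needs
    return !f.maybe_contains(b);  // b was inserted, yet it is no longer found
}
```

Here erase_breaks_other_key() returns true: clearing a's bits produces a false negative for b, which a Bloom filter must never give.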

Whether a Bloom filter is appropriate depends on the application scenario and the characteristics of the data. When designing one, estimate the capacity in advance and size the bit array and the number of hash functions so that the false positive rate stays acceptable.
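For capacity estimation, a commonly used approximation for the false positive rate of a filter with m bits and k hash functions after n distinct insertions is p ≈ (1 - e^(-kn/m))^k, and the estimate is minimized around k = (m/n) ln 2. The sketch below simply evaluates these two formulas; the function names are illustrative, and the approximation assumes the hash functions behave like independent uniform hashes, which real hash functions only approximate.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Approximate false positive rate for m bits, n inserted elements and
// k hash functions: p ≈ (1 - e^(-k*n/m))^k.
double estimated_fpr(std::size_t m, std::size_t n, std::size_t k) {
    assert(m > 0);
    const double exponent = -static_cast<double>(k) * static_cast<double>(n)
                            / static_cast<double>(m);
    return std::pow(1.0 - std::exp(exponent), static_cast<double>(k));
}

// Number of hash functions that minimizes the estimate above: k = (m/n) * ln 2.
double optimal_hash_count(std::size_t m, std::size_t n) {
    assert(n > 0);
    return static_cast<double>(m) / static_cast<double>(n) * std::log(2.0);
}
```

With 5 bits per element and three hash functions, the parameters used by the BloomFilter class in this post, estimated_fpr(500000, 100000, 3) evaluates to about 0.092, and optimal_hash_count(500000, 100000) to about 3.47, so three hash functions is close to the best choice for that ratio. Measured rates depend on the actual hash functions and data.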

BitSet.h

#pragma once

#include<vector>
#include<iostream>
using namespace std;

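// Bitmap characteristics
// 1. Fast and space-saving
// 2. Relatively limited: it can only map integers



// Use char -- one char holds 8 bits
// How to locate a position, e.g. 20:
// 20/8=2 is the index of the char that holds the bit
// 20%8=4 is the position of the bit inside that char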
namespace yfc
{
	template<size_t N>
	class bit_set
	{
	public:
		bit_set()
		{
			_bits.resize(N / 8 + 1, 0);// +1 guarantees there is always enough space
		}
		void set(size_t x)
		{
			// Set bit x to 1
			size_t i = x / 8;
			size_t j = x % 8;
			// How do we turn bit j of _bits[i] into 1?
			// With a bitwise OR!
			_bits[i] |= (1 << j);
		}
		void reset(size_t x)
		{
			// Set bit x to 0
			size_t i = x / 8;
			size_t j = x % 8;
			_bits[i] &= ~(1 << j);
		}
		bool test(size_t x) const
		{
			// Check whether this bit is 0 or 1
			size_t i = x / 8;
			size_t j = x % 8;
			return _bits[i] & (1 << j);
		}
	private:
		vector<char> _bits;
	};
	void test_bit_set1()
	{
		bit_set<100>bs1;
		bs1.set(8);
		bs1.set(9);
		bs1.set(20);

		cout << bs1.test(8) << endl;
		cout << bs1.test(9) << endl;
		cout << bs1.test(20) << endl;

		bs1.reset(8);
		bs1.reset(9);
		bs1.reset(20);

		cout << bs1.test(8) << endl;
		cout << bs1.test(9) << endl;
		cout << bs1.test(20) << endl;
	}
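	void test_bit_set2()
	{
		// One bit for every 32-bit unsigned value: 2^32 bits, about 512 MB
		bit_set<0xffffffff>bs1;
#if 0
		// -1 converts to a size_t with all bits set, but this is a narrowing
		// conversion that conforming compilers such as GCC and Clang reject;
		// on a 64-bit build it would also request far more than 512 MB
		bit_set<-1>bs2;
		// Overflows int inside a constant expression, so it does not compile
		bit_set < 1024 * 1024 * 1024 * 4 - 1> bs3;
#endif
		// Open Task Manager while bs1 is alive -- it shows roughly 512 MB in use
	}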




	// Interview question 2
	// Two bitmaps are enough -- the bits at the same position in both bitmaps together encode the state
	// (00 = not seen, 01 = seen once, 10 = seen more than once)
	template<size_t N>
	class twobitset
	{
	private:
		bit_set<N>_bs1;
		bit_set<N>_bs2;
	public:
		void set(size_t x)
		{
			// Check the current state first
			bool inSet1 = _bs1.test(x);
			bool inSet2 = _bs2.test(x);
			if (inSet1 == false && inSet2 == false)
			{
				// 00->01
				_bs2.set(x);
			}
			else if (inSet1 == false && inSet2 == true)
			{
				// 01->10
				_bs1.set(x);
				_bs2.reset(x);
			}
		}
		void print_once_num()
		{
			for (size_t i = 0; i < N; i++)
			{
				if (_bs1.test(i) == false && _bs2.test(i) == true)
				{
					cout << i << " ";
				}
			}
			cout << endl;
		}
	};
	void test_bit_set3()
	{
		int a[] = { 1,2,3,4,5,6,7,8,9,10,12,10,9,8,6,5,3,2,1 };
		twobitset<100>bs;
		for (auto e : a)
		{
			bs.set(e);
		}
		bs.print_once_num();
	}
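	// Interview question 3
	// Also uses two bitmaps
	// The first maps file 1
	// The second maps file 2
	// Values whose bits are 1 in both bitmaps form the intersection

	// Interview question 4
	// Essentially the same as question 2: the states 00/01/10/11 are enough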
}
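The two-bitmap idea from interview question 3 can be sketched as follows. This is only an illustration under stated assumptions: it uses std::bitset instead of the bit_set class above, represents each "file" as a vector of values, and the function name intersection and the Range parameter are invented here. Because std::bitset stores its bits inside the object, this form is only suitable for small ranges; a range of billions of values would need a heap-backed bitmap such as bit_set.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the values that occur in both inputs, in ascending order and
// without duplicates. Every value must lie in [0, Range).
template <std::size_t Range>
std::vector<std::size_t> intersection(const std::vector<std::size_t>& file1,
                                      const std::vector<std::size_t>& file2) {
    std::bitset<Range> seen1;  // bitmap for file 1
    std::bitset<Range> seen2;  // bitmap for file 2
    for (std::size_t v : file1) { assert(v < Range); seen1.set(v); }
    for (std::size_t v : file2) { assert(v < Range); seen2.set(v); }

    // A bit survives the AND only if it is 1 in both bitmaps.
    const std::bitset<Range> both = seen1 & seen2;

    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < Range; ++i) {
        if (both.test(i)) result.push_back(i);
    }
    return result;
}
```

For example, `intersection<32>({1, 2, 3, 3, 9}, {3, 9, 9, 20})` returns the values 3 and 9. Duplicates inside either input do not matter, because setting an already-set bit changes nothing.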

BloomFilter.h

#pragma once

// Bloom filter

#include"BitSet.h"
#include<algorithm>
#include<cstdlib> // srand, rand
#include<ctime>   // time
#include<string>
#include<vector>
using namespace std;



namespace yfc
{	
	template<class K = string>
	struct HashBKDR
	{
		size_t operator()(const K& key)
		{
			size_t val = 0;
			for (auto ch : key)
			{
				val *= 131;
				val += ch;
			}
			return val;
		}
	};
	template<class K = string>
	struct HashAP
	{
		size_t operator()(const K& key)
		{
			size_t hash = 0;
			for (size_t i = 0; i < key.size(); i++)
			{
				if ((i & 1) == 0)
				{
					hash ^= ((hash << 7) ^ key[i] ^ (hash >> 3));
				}
				else
				{
					hash ^= (~((hash << 11) ^ key[i] ^ (hash >> 5)));
				}
			}
			return hash;
		}
	};
	template<class K = string>
	struct HashDJB
	{
		size_t operator()(const K& key)
		{
			size_t hash = 5381;
			for (auto ch : key)
			{
				hash += (hash << 5) + ch;
			}
			return hash;
		}
	};

	template<size_t N, class K = string,
		class Hash1 = HashBKDR<string>,
		class Hash2 = HashAP<string>,
		class Hash3 = HashDJB<string>>
	class BloomFilter
	{
	private:
		const static size_t _ratio = 5;
		bit_set<_ratio* N> _bits;
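		// If you use std::bitset instead,
		// consider allocating it on the heap with new:
		// std::bitset stores its bits inside the object, so a large one on the stack can overflow the stack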
	public:
		void set(const K& key)
		{
			size_t hash1 = Hash1()(key) % (_ratio * N);
			_bits.set(hash1);
			size_t hash2 = Hash2()(key) % (_ratio * N);
			_bits.set(hash2);
			size_t hash3 = Hash3()(key) % (_ratio * N);
			_bits.set(hash3);
		}
		bool test(const K& key)
		{
			size_t hash1 = Hash1()(key) % (_ratio * N);
			if (!_bits.test(hash1))return false;
			size_t hash2 = Hash2()(key) % (_ratio * N);
			if (!_bits.test(hash2))return false;
			size_t hash3 = Hash3()(key) % (_ratio * N);
			if (!_bits.test(hash3))return false;

			return true;// May be a false positive -- only the "not present" answers above are exact
		}
	};
	void testBloomFilter1()
	{
		BloomFilter<10>bf;
		string arr[] = { "苹果","西瓜","阿里","美团","苹果","字节","西瓜","苹果","香蕉","苹果","腾讯" };
		for (auto& str : arr)
		{
			bf.set(str);
		}
		for (auto& str : arr)
		{
			cout << bf.test(str) << endl;
		}
	}
	// Test that measures the false positive rate
	void TestBloomFilter2()
	{
		srand(time(0));
		const size_t N = 100000;
		BloomFilter<N> bf;
		cout << sizeof(bf) << endl;

		std::vector<std::string> v1;
		std::string url = "https://www.cnblogs.com/-clq/archive/2012/05/31/2528153.html";

		for (size_t i = 0; i < N; ++i)
		{
			v1.push_back(url + std::to_string(1234 + i));
		}

		for (auto& str : v1)
		{
			bf.set(str);
		}

		// Strings similar to the inserted ones
		std::vector<std::string> v2;
		for (size_t i = 0; i < N; ++i)
		{
			std::string url = "http://www.cnblogs.com/-clq/archive/2021/05/31/2528153.html";
			url += std::to_string(rand() + i);
			v2.push_back(url);
		}

		size_t n2 = 0;
		for (auto& str : v2)
		{
			if (bf.test(str))
			{
				++n2;
			}
		}
		cout << "False positive rate for similar strings: " << (double)n2 / (double)N << endl;

		std::vector<std::string> v3;
		for (size_t i = 0; i < N; ++i)
		{
			string url = "zhihu.com";
			url += std::to_string(rand() + i);
			v3.push_back(url);
		}

		size_t n3 = 0;
		for (auto& str : v3)
		{
			if (bf.test(str))
			{
				++n3;
			}
		}
		cout << "False positive rate for dissimilar strings: " << (double)n3 / (double)N << endl;
	}
}

Origin blog.csdn.net/Yu_Cblog/article/details/131622329