[C++ and data structures] Bitmaps and Bloom filters

Table of contents

1. Bitmap

1. The concept of bitmap 

2. Implementation of bitmap 

 ①、Basic structure

②、set 

③、reset:

④、test 

⑤. Question:

⑥. Advantages, disadvantages and applications of bitmaps: 

⑦. Complete code and testing

2. Bloom filter

1. Proposal of Bloom filter

2. Implementation of Bloom filter

①.Basic structure

②. Implementation of three Hash functors

③、 set

④、 test

⑤. Delete

 ⑥、Complete code and testing

⑦、Advantages and disadvantages


1. Bitmap

1. The concept of bitmap 

1. Interview question
Given 4 billion unique, unsorted unsigned integers, how can you quickly determine whether a given unsigned integer is among those 4 billion? [Tencent]
Checking whether a number is present is a classic key-lookup model. The common solutions are:
  • 1. Linear traversal: time complexity O(N)
  • 2. Sort (O(N log N)), then binary search: O(log N)
  • 3. Bitmap

The problem with the first two solutions is that the data set is too large to fit in memory.

Let's first work out how much space 4 billion distinct unsigned integers occupy.
One billion bytes is about 1 GB. Each integer takes 4 bytes, so 4 billion integers need 16 billion bytes, which is about 16 GB. That is far too much to hold in memory.

How does a bitmap solve this?

A bitmap uses each bit to store one state. It suits scenarios with massive amounts of data and no duplicates, and is usually used to decide whether a given value exists. Whether a value is in a given set of integers has exactly two answers, present or absent, so one binary bit is enough to record it: 1 means the value exists, 0 means it does not. (A bitmap is a form of direct addressing.)
For 4 billion values we cannot simply allocate 4 billion positions. Why? Suppose one of the numbers is 4200000000: which position would it map to? The space must cover the whole range of possible values, not just the count of values, so that every value has a position.
So we allocate 2^32 positions, where each position is a single bit (2^32 ≈ 4.29 billion), and map every value directly to its own bit. 4.29 billion bytes would be about 4 GB, but these are bits, so we divide by 8: 2^32 / 8 = 2^29 bytes = 512 MB. The bitmap therefore occupies only about 512 MB, which saves space and is also very efficient (direct addressing has no hash collisions).
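The arithmetic above can be checked at compile time. This is only a sketch; the helper names `bytes_as_ints` and `bytes_as_bitmap` are made up for illustration and are not part of the article's code.

```cpp
#include <cstdint>

// Bytes needed to store `count` numbers as plain 32-bit integers.
constexpr std::uint64_t bytes_as_ints(std::uint64_t count) {
    return count * sizeof(std::uint32_t);
}

// Bytes needed for a bitmap with one bit per possible value in [0, range).
constexpr std::uint64_t bytes_as_bitmap(std::uint64_t range) {
    return (range + 7) / 8;  // round up to whole bytes
}

// 4 billion integers: 16 billion bytes, roughly 16 GB.
static_assert(bytes_as_ints(4000000000ULL) == 16000000000ULL, "16 billion bytes");

// One bit for every 32-bit value: 2^29 bytes, i.e. 512 MB.
static_assert(bytes_as_bitmap(1ULL << 32) == 512ULL * 1024 * 1024, "512 MB");
```

The bitmap is about 32 times smaller than the raw integers, because each value costs one bit instead of 32.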

2. Implementation of bitmap 

 ①、Basic structure

How many bits should the bitmap allocate at initialization? The N you pass in is a number of bits, since storing bits is how a bitmap saves space. The underlying vector holds 32-bit integers, because no C++ type lets you address a single bit.

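class bitset
{
public:
	bitset(size_t N)
	{//N is the number of bits to allocate
		_bits.resize(N / 32 + 1, 0);
		_num = N;
	}

private:
	//unsigned int* _bits;
	std::vector<unsigned int> _bits; //unsigned, so bit 31 is an ordinary bit rather than a sign bit
	size_t _num;  //number of bits the bitmap covers
};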
_bits.resize(N / 32 + 1, 0);

Why N / 32 + 1?

Because vector's resize works in units of whole integers while each bitmap position is a single bit, and one integer has 32 bits, N bits need N / 32 integers. But we still need the + 1. Why? Take N = 100: 100 / 32 = 3, which allocates 3 integers, i.e. 32 * 3 = 96 bits, so bits 96 through 99 would have nowhere to go. To avoid this we add 1, allocating one extra integer. This wastes at most 32 bits (one full integer, when N is already a multiple of 32).
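To make the rounding concrete, here is a small sketch comparing the article's N / 32 + 1 with an exact ceiling division. Both function names are hypothetical; `words_ceil` is an alternative shown only for comparison and is not what the article's class uses.

```cpp
#include <cstddef>

// Number of 32-bit integers allocated by the article's formula.
constexpr std::size_t words_article(std::size_t n_bits) {
    return n_bits / 32 + 1;
}

// Alternative for comparison: exact ceiling division, no spare integer.
constexpr std::size_t words_ceil(std::size_t n_bits) {
    return (n_bits + 31) / 32;
}

// N = 100: both formulas give 4 integers (128 bits).
static_assert(words_article(100) == 4, "100 bits need 4 integers");
static_assert(words_ceil(100) == 4, "100 bits need 4 integers");

// N = 96: the article's formula allocates one spare integer.
static_assert(words_article(96) == 4, "one spare integer");
static_assert(words_ceil(96) == 3, "exact fit");
```

The spare integer in the N / 32 + 1 version is harmless, and it also means index N itself stays in range.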


②、set 

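Function: set the x-th bit to 1, meaning the value x exists.

void set(size_t x)
{//set bit x to 1, marking x as present

	size_t index = x / 32; //which integer the bit lives in
	size_t pos = x % 32;   //which bit inside that integer

	_bits[index] |= (1u << pos); //OR a 1 into bit pos of that integer
	//1u << pos moves the 1 to bit pos: it shifts left by pos places
	//(1u rather than 1, so that pos = 31 does not shift into the sign bit of an int)
	//|= is a bitwise OR: bit pos becomes 1 and all other bits are unchanged
}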

In effect, this is the question of how to set one bit of an integer to 1 without affecting the other 31 bits.

Note that "left shift" does not mean moving bits to the left in memory. Shift operators work on the value, not on its byte layout: a left shift always moves bits toward the high-order end, on little-endian and big-endian machines alike.

The name is a historical leftover in C and can easily mislead people, but the behavior is the same on every platform.


③、reset:

Function: set the x-th bit to 0, meaning the value x does not exist.

void reset(size_t x)
{//set bit x to 0, marking x as absent

	size_t index = x / 32; //which integer the bit lives in
	size_t pos = x % 32;   //which bit inside that integer

	_bits[index] &= ~(1u << pos); //AND a 0 into bit pos of that integer; all other bits are unchanged
}


④、test 

Function: determine whether x is present, that is, whether the bit that x maps to is 1.

//check whether x is present (i.e. whether the bit x maps to is 1)
bool test(size_t x)
{
	size_t index = x / 32; //which integer the bit lives in
	size_t pos = x % 32;   //which bit inside that integer

	return _bits[index] & (1u << pos); //nonzero is true, 0 is false
}
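The three operations all come down to the same index/pos split plus one bitwise operator. The sketch below isolates that on a single 32-bit value; the names `locate`, `set_bit`, `clear_bit` and `test_bit` are illustrative and do not appear in the article's class.

```cpp
#include <cstddef>
#include <cstdint>

// Where bit x lives: which 32-bit integer, and which bit inside it.
struct BitAddress {
    std::size_t index;
    unsigned pos;
};

constexpr BitAddress locate(std::size_t x) {
    return BitAddress{x / 32, static_cast<unsigned>(x % 32)};
}

// Set bit `pos` (0..31); all other bits keep their value.
constexpr std::uint32_t set_bit(std::uint32_t word, unsigned pos) {
    return word | (std::uint32_t{1} << pos);
}

// Clear bit `pos` (0..31); all other bits keep their value.
constexpr std::uint32_t clear_bit(std::uint32_t word, unsigned pos) {
    return word & ~(std::uint32_t{1} << pos);
}

// True if bit `pos` (0..31) is 1.
constexpr bool test_bit(std::uint32_t word, unsigned pos) {
    return (word & (std::uint32_t{1} << pos)) != 0;
}
```

For example, x = 70 lands in integer 70 / 32 = 2 at bit 70 % 32 = 6, and setting bit 1 of the value 5 (binary 101) gives 7 (binary 111).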


⑤. Question:

In practice these 4 billion numbers cannot all sit in memory. They would be stored in a file that we read through, setting one bit per number. Because the bitmap has to cover the whole value range, we need to allocate about 4.29 billion bits. How do we ask for that much space? Two common ways:

①, bitset bs(-1); //the constructor parameter is size_t, so -1 converts to the largest value of that unsigned type. That is 2^32 - 1 only where size_t is 32 bits wide; on a 64-bit build it requests far too many bits

②、bitset bs(0xffffffff); //asks for 2^32 - 1 bits explicitly, whatever the width of size_t


⑥. Advantages, disadvantages and applications of bitmaps: 

Advantages: saves space and is highly efficient

Disadvantages: can only handle integers

Applications:

  • Quickly checking whether a value is in a collection
  • Sorting + deduplication
  • Finding the intersection, union, etc. of two sets
  • Disk block marking in the operating system
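As an illustration of the "sorting + deduplication" application, the sketch below marks every input value in a bitmap and then walks the bits in order. `sort_unique` and its `max_value` parameter are assumptions made for this example; the approach only pays off when the value range is small enough to allocate.

```cpp
#include <cstdint>
#include <vector>

// Returns the distinct values of `input` in ascending order.
// Values greater than max_value are ignored.
std::vector<std::uint32_t> sort_unique(const std::vector<std::uint32_t>& input,
                                       std::uint32_t max_value) {
    // One bit per possible value in [0, max_value].
    std::vector<std::uint32_t> words(max_value / 32 + 1, 0);
    for (std::uint32_t x : input) {
        if (x > max_value) continue;
        words[x / 32] |= std::uint32_t{1} << (x % 32);  // duplicates hit the same bit
    }

    // Walking the bits from low to high yields the values already sorted.
    // A 64-bit counter avoids overflow when max_value is the largest 32-bit value.
    std::vector<std::uint32_t> result;
    for (std::uint64_t x = 0; x <= max_value; ++x) {
        if ((words[x / 32] >> (x % 32)) & 1u) {
            result.push_back(static_cast<std::uint32_t>(x));
        }
    }
    return result;
}
```

Calling `sort_unique({5, 3, 5, 1, 3}, 10)` returns 1, 3, 5. The cost is proportional to the input size plus the value range, not O(N log N).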


⑦. Complete code and testing

#pragma once
#include<cstdio>
#include<iostream>
#include<vector>

namespace mz
{
	class bitset
	{
	public:
		bitset(size_t N)
		{//N is the number of bits to allocate
			_bits.resize(N / 32 + 1, 0);
			_num = N;
		}

		void set(size_t x)
		{//set bit x to 1, marking x as present

			size_t index = x / 32; //which integer the bit lives in
			size_t pos = x % 32;   //which bit inside that integer

			_bits[index] |= (1u << pos); //OR a 1 into bit pos of that integer
			//1u << pos moves the 1 to bit pos: it shifts left by pos places
			//|= is a bitwise OR: bit pos becomes 1 and all other bits are unchanged
		}

		void reset(size_t x)
		{//set bit x to 0, marking x as absent

			size_t index = x / 32; //which integer the bit lives in
			size_t pos = x % 32;   //which bit inside that integer

			_bits[index] &= ~(1u << pos); //AND a 0 into bit pos of that integer
		}

		//check whether x is present (i.e. whether the bit x maps to is 1)
		bool test(size_t x) const
		{
			size_t index = x / 32; //which integer the bit lives in
			size_t pos = x % 32;   //which bit inside that integer

			return _bits[index] & (1u << pos); //nonzero is true, 0 is false
		}

	private:
		//unsigned int* _bits;
		std::vector<unsigned int> _bits; //unsigned, so bit 31 is an ordinary bit rather than a sign bit
		size_t _num;  //number of bits the bitmap covers (fixed at construction)
	};

	void test_bitset()
	{
		bitset bs(100);
		bs.set(99);
		bs.set(98);
		bs.set(97);
		bs.reset(98);

		for (size_t i = 0; i < 100; ++i)
		{
			std::printf("[%zu]:%d\n", i, bs.test(i)); //%zu is the format for size_t
		}
	}

}

With this test, bits 97 and 99 print 1, and every other bit (including 98, which was set and then reset) prints 0.


2. Bloom filter

1. Proposal of Bloom filter

When we read news in a news app, it keeps recommending new content, and each time it filters out items we have already seen. How does the recommendation system implement this deduplication? The server records the full viewing history of each user. When the recommendation system picks news, it checks the candidates against that user's history and drops those already present. How can this lookup be done quickly?
  • 1. Store the user records in a hash table. Disadvantage: wastes space
  • 2. Store the user records in a bitmap. Disadvantage: a bitmap generally only handles integers; if the content ID is a string, it cannot be handled
  • 3. Combine hashing with a bitmap, that is, a Bloom filter
The Bloom filter was proposed by Burton Howard Bloom in 1970. It is a compact and rather clever probabilistic data structure, characterized by efficient insertion and query, that can tell you whether something "definitely does not exist" or "may exist". It uses multiple hash functions to map one piece of data into a bitmap structure. This approach not only improves query efficiency but also saves a lot of memory.


2. Implementation of Bloom filter

①.Basic structure

Bloom filters generally store strings, structures and the like rather than integers, because integers can be stored in a plain bitmap.

The Bloom filter is built on top of a bitmap. Since strings are the most common key type, we make string the default template parameter. The other three Hash template parameters mean that each value is mapped to three positions.

How much space should the constructor allocate?

This has been worked out before. The usual relation is k = (m / n) * ln 2, where k is the number of hash functions, m the number of bits and n the number of elements. With k = 3 this gives m ≈ 4.3n, so 10 elements need about 43 bits. We simply round up and allocate 5 bits per element.
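The sizing rule can be checked numerically. The sketch below uses the standard textbook approximations for Bloom filters, not anything from the article's code, and the function names `bits_for` and `false_positive_rate` are made up for illustration.

```cpp
#include <cmath>
#include <cstddef>

// Bits m for which k hash functions are the optimal choice for n elements:
// k = (m / n) * ln 2  =>  m = k * n / ln 2.
inline double bits_for(std::size_t n, unsigned k) {
    return k * static_cast<double>(n) / std::log(2.0);
}

// Approximate false-positive rate after inserting n elements into m bits
// with k hash functions: (1 - e^(-k*n/m))^k.
inline double false_positive_rate(std::size_t n, std::size_t m, unsigned k) {
    double fill = 1.0 - std::exp(-static_cast<double>(k) * n / m);
    return std::pow(fill, static_cast<double>(k));
}
```

`bits_for(10, 3)` is about 43.3, which matches the 43 bits quoted above. With the 5 bits per element the constructor actually allocates, `false_positive_rate(100, 500, 3)` comes out at roughly 9%, so a noticeable share of "may exist" answers will be wrong at this size.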

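template<class K = string, class Hash1 = HashStr1, class Hash2 = HashStr2, class Hash3 = HashStr3>
class bloomfilter
{
public:
	//Allocating the full range up front would be wasteful, since we may only map a few values
	//So size the bitmap from roughly how many elements you expect to store
	//How many bits are best for a given number of stored values has been worked out; here it is 5 per element
	bloomfilter(size_t num)
		:_bs(5 * num)
		,_N(5 * num)
	{}

private:
	bitset _bs;	//the underlying storage is a bitmap
	size_t _N;	//number of bits in the bitmap
};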

②. Implementation of three Hash functors

Because each value is mapped to three positions, we need three functions that convert a string to an integer. And since string is the most common key type, these three functors serve as the default template parameters.

The following string hash algorithms are known for keeping collisions low. We use one per functor (BKDRHash, RSHash and SDBMHash):

struct HashStr1
{
	size_t operator()(const string& str) const
	{	//uses BKDRHash
		size_t hash = 0;
		for (size_t i = 0; i < str.size(); ++i)
		{
			hash *= 131;
			hash += str[i];
		}

		return hash;
	}
};

struct HashStr2
{
	size_t operator()(const string& str) const
	{	//uses RSHash
		size_t hash = 0;
		size_t magic = 63689; //magic number
		for (size_t i = 0; i < str.size(); ++i)
		{
			hash *= magic;
			hash += str[i];
			magic *= 378551;
		}

		return hash;
	}
};

struct HashStr3
{
	size_t operator()(const string& str) const
	{	//uses SDBMHash
		size_t hash = 0;
		for (size_t i = 0; i < str.size(); ++i)
		{
			hash *= 65599;
			hash += str[i];
		}

		return hash;
	}
};

③、 set

This function marks key as present. As we said, each key is mapped to three positions, so we call each Hash functor to compute a position for the string. We take the result % _N because the bitmap was created with _N bits and a raw hash value is very likely larger than _N; the modulo brings every position into range. A key is then represented by its three mapped positions.

Note: Hash1 is a functor type and Hash1() is a temporary functor object. You could also write Hash1 hs1; hs1(key) % _N; but that is clearly more verbose.

void set(const K& key)
{
	size_t index1 = Hash1()(key) % _N;//call through a temporary object of type Hash1
	size_t index2 = Hash2()(key) % _N;
	size_t index3 = Hash3()(key) % _N;

	_bs.set(index1);//mark position index1 as present
	_bs.set(index2);
	_bs.set(index3);
}

④、 test

Function: determine whether key exists

Because each value uses three mapped positions, we check whether all three positions computed for key are set in the bitmap. Only if all three are set can the key exist. If any one of them is not set, the key definitely does not exist.

bool test(const K& key)
{
	size_t index1 = Hash1()(key) % _N;
	if (_bs.test(index1) == false)
		return false;

	size_t index2 = Hash2()(key) % _N;
	if (_bs.test(index2) == false)
		return false;

	size_t index3 = Hash3()(key) % _N;
	if (_bs.test(index3) == false)
		return false;

	return true;  //even so, the key is not guaranteed to be present: false positives are possible

	//an answer of "present" is not reliable, it may be a false positive
	//an answer of "absent" is reliable
}

⑤. Delete

Bloom filters cannot directly support deletion, because deleting one element may affect other elements that share some of its bits.

One way to support deletion: expand each bit of the Bloom filter into a small counter. Inserting an element increments the k counters at the k addresses computed by its hash functions, and deleting an element decrements those k counters. Deletion is gained at the cost of several times more storage.
Defects:
1. There is no way to confirm whether an element is really in the Bloom filter
2. Counters can wrap around (overflow)

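void reset(const K& key)	
{
	//Would clearing the mapped positions to 0 be enough?
	//No: that could wipe bits shared with other keys, so Bloom filters generally do not support deletion
}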
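The counter-based deletion described above can be sketched as follows. This is an illustration under stated assumptions, not the article's implementation: the class name `CountingBloom`, the 8-bit counters, the saturating behaviour and the three multipliers (131, 65599 and an arbitrary 31) are all choices made for this example, and the three hashes are simplified multiplicative hashes, not the article's three functors.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Multipliers for three simple multiplicative string hashes (illustrative).
const std::size_t kMultipliers[3] = {131, 65599, 31};

// Counting Bloom filter sketch: each position is an 8-bit counter, not a bit.
class CountingBloom {
public:
    explicit CountingBloom(std::size_t slots) : counters_(slots, 0) {}

    void insert(const std::string& key) {
        for (std::size_t mul : kMultipliers) {
            std::uint8_t& c = counters_[slot(key, mul)];
            if (c < 255) ++c;  // saturate at 255 so the counter never wraps to 0
        }
    }

    // Only call this for keys that were really inserted: the filter cannot
    // check that, and erasing a never-inserted key corrupts other entries.
    void erase(const std::string& key) {
        for (std::size_t mul : kMultipliers) {
            std::uint8_t& c = counters_[slot(key, mul)];
            if (c > 0 && c < 255) --c;  // a saturated counter is left untouched
        }
    }

    // false: definitely absent. true: may be present.
    bool test(const std::string& key) const {
        for (std::size_t mul : kMultipliers) {
            if (counters_[slot(key, mul)] == 0) return false;
        }
        return true;
    }

private:
    std::size_t slot(const std::string& key, std::size_t mul) const {
        std::size_t h = 0;
        for (unsigned char ch : key) h = h * mul + ch;
        return h % counters_.size();
    }

    std::vector<std::uint8_t> counters_;
};
```

With 8-bit counters this uses eight times the memory of the one-bit version, which is the storage cost mentioned above. Both listed defects are visible in the code: `erase` has to trust the caller, and the saturation check exists only because a counter could otherwise wrap around.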

 ⑥、Complete code and testing

#pragma once
#include"bitset.h"
#include<string>

using std::string;
using std::cout;
using std::endl;

namespace mz
{
	struct HashStr1
	{
		size_t operator()(const string& str) const
		{	//uses BKDRHash
			size_t hash = 0;
			for (size_t i = 0; i < str.size(); ++i)
			{
				hash *= 131;
				hash += str[i];
			}

			return hash;
		}
	};

	struct HashStr2
	{
		size_t operator()(const string& str) const
		{	//uses RSHash
			size_t hash = 0;
			size_t magic = 63689; //magic number
			for (size_t i = 0; i < str.size(); ++i)
			{
				hash *= magic;
				hash += str[i];
				magic *= 378551;
			}

			return hash;
		}
	};

	struct HashStr3
	{
		size_t operator()(const string& str) const
		{	//uses SDBMHash
			size_t hash = 0;
			for (size_t i = 0; i < str.size(); ++i)
			{
				hash *= 65599;
				hash += str[i];
			}

			return hash;
		}
	};

	template<class K = string, class Hash1 = HashStr1, class Hash2 = HashStr2, class Hash3 = HashStr3>
	class bloomfilter
	{
	public:
		//Allocating the full range up front would be wasteful, since we may only map a few values
		//So size the bitmap from roughly how many elements you expect to store
		//How many bits are best for a given number of stored values has been worked out; here it is 5 per element
		bloomfilter(size_t num)
			:_bs(5 * num)
			,_N(5 * num)
		{}

		void set(const K& key)
		{
			size_t index1 = Hash1()(key) % _N;//call through a temporary object of type Hash1
			size_t index2 = Hash2()(key) % _N;
			size_t index3 = Hash3()(key) % _N;

			_bs.set(index1);
			_bs.set(index2);
			_bs.set(index3);

		}

		bool test(const K& key)
		{
			size_t index1 = Hash1()(key) % _N;
			if (_bs.test(index1) == false)
				return false;

			size_t index2 = Hash2()(key) % _N;
			if (_bs.test(index2) == false)
				return false;

			size_t index3 = Hash3()(key) % _N;
			if (_bs.test(index3) == false)
				return false;

			return true;  //even so, the key is not guaranteed to be present: false positives are possible
		
			//an answer of "present" is not reliable, it may be a false positive
			//an answer of "absent" is reliable
		}

		void reset(const K& key)
		{
			//Would clearing the mapped positions to 0 be enough?
			//No: that could wipe bits shared with other keys, so Bloom filters generally do not support deletion
		}

	private:
		bitset _bs;	//the underlying storage is a bitmap
		size_t _N;	//number of bits in the bitmap
	};

	void test_bloomfilter()
	{
		bloomfilter<string> bf(100); //string can be omitted here (just write <>), because it is the default
		bf.set("abcd");
		bf.set("aadd");
		bf.set("bcad");

		cout << bf.test("abcd") << endl;
		cout << bf.test("aadd") << endl;
		cout << bf.test("bcad") << endl;
		cout << bf.test("cbad") << endl;

	}
}


⑦、Advantages and disadvantages

Bloom filter advantages
  • 1. Adding and querying an element both take O(K) time, where K is the number of hash functions (generally small), independent of the amount of data
  • 2. The hash functions are independent of each other, which makes hardware-parallel computation easy
  • 3. A Bloom filter does not store the elements themselves, which is a big advantage where confidentiality requirements are strict
  • 4. Where some misjudgment is tolerable, a Bloom filter has a huge space advantage over other data structures
  • 5. When the amount of data is large, a Bloom filter can represent the whole set where other data structures cannot
  • 6. Bloom filters that use the same set of hash functions can perform intersection, union and difference operations
Bloom filter defects
  • 1. There is a false-positive rate (False Positive), so it cannot determine exactly whether an element is in the set (remedy: build an additional whitelist that stores data that may be misjudged)
  • 2. The elements themselves cannot be retrieved
  • 3. In general, elements cannot be deleted from a Bloom filter
  • 4. If deletion is implemented with counters, the counters may wrap around
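Advantage 6 above (set operations on filters that share hash functions) can be illustrated for the union case. This is a sketch only: `TinyBloom` and `unite` are hypothetical names, `TinyBloom` is a cut-down stand-in for the article's bloomfilter with two multiplicative hashes over a std::vector<bool>, and the size passed to the constructor is assumed to be greater than zero.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Cut-down Bloom filter: two multiplicative string hashes over a bit vector.
struct TinyBloom {
    std::vector<bool> bits;

    explicit TinyBloom(std::size_t m) : bits(m, false) {}

    static std::size_t hash(const std::string& s, std::size_t mul) {
        std::size_t h = 0;
        for (unsigned char c : s) h = h * mul + c;
        return h;
    }

    void set(const std::string& key) {
        bits[hash(key, 131) % bits.size()] = true;
        bits[hash(key, 65599) % bits.size()] = true;
    }

    // false: definitely absent. true: may be present.
    bool test(const std::string& key) const {
        return bits[hash(key, 131) % bits.size()] &&
               bits[hash(key, 65599) % bits.size()];
    }
};

// Union of two filters with the same size and the same hash functions:
// OR the bit arrays position by position.
TinyBloom unite(const TinyBloom& a, const TinyBloom& b) {
    assert(a.bits.size() == b.bits.size());
    TinyBloom out(a.bits.size());
    for (std::size_t i = 0; i < out.bits.size(); ++i) {
        out.bits[i] = a.bits[i] || b.bits[i];
    }
    return out;
}
```

The result answers "may exist" for every key that was set in either input, exactly as if all keys had been inserted into one filter. It only works because both filters map a key to the same positions, which is why the same size and hash functions are required.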


Origin blog.csdn.net/m0_74044018/article/details/133973537