Bitmap and Bloom filter + hash segmentation idea

1. Bitmap (bitset)

  • A bitmap is a hash table that uses a single bit as its unit of record. Its keys are unsigned integers, and it uses direct addressing (so there are no hash collisions). Its hash mapping function is
    • $f(key) = key$ (whether key exists is recorded by bit number key)
    • A bit value of 1 means the key mapped to that bit exists; a bit value of 0 means it does not
  • The bitmap below is modeled on the STL's bitset: it adapts vector<char> as its underlying container and uses bit operations to record whether each key exists. Its core interface consists of set, reset, and test.
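
For comparison, the standard library's std::bitset offers the same set/reset/test interface. The following is a minimal sketch; the helper name demoStdBitset and the keys chosen are illustrative assumptions, not part of the original article.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical helper: records a few keys in a std::bitset and reports
// how many distinct keys are still present at the end.
std::size_t demoStdBitset()
{
    std::bitset<100> bs;      // room for keys 0..99, every bit starts at 0

    bs.set(3);                // mark key 3 as present
    bs.set(42);               // mark key 42 as present
    bs.set(42);               // setting the same key again changes nothing
    assert(bs.test(3) && bs.test(42));

    bs.reset(3);              // remove key 3 from the set
    assert(!bs.test(3));

    return bs.count();        // number of bits that are still 1
}
```

Calling demoStdBitset() returns 1, because only key 42 remains after key 3 is reset.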

Underlying implementation:

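#include <cstddef>
#include <vector>

// Size is the upper bound on the number of keys to record (a non-type template parameter), so at least Size bits must be allocated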
template<size_t Size>
class bitset
{
public:
	bitset()
	{
		// One byte holds 8 bits; the +1 covers the remainder when Size is not a multiple of 8
		_table.resize((Size / 8) + 1, 0);
	}

	// Set bit number key to 1, meaning key is in the set
	void set(size_t key)
	{
		// Index of the byte in the vector that holds bit number key
		size_t bytes = key / 8;
		// Position of bit number key inside that byte
		size_t bits = key % 8;
		// Set the bit to 1 with a bitwise OR
		_table[bytes] |= (1 << bits);
	}

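	// Set bit number key to 0, meaning key is removed from the set
	void reset(size_t key)
	{
		// Index of the byte in the vector that holds bit number key
		size_t bytes = key / 8;
		// Position of bit number key inside that byte
		size_t bits = key % 8;
		// Clear the bit with a bitwise AND against the inverted mask
		_table[bytes] &= ~(1 << bits);
	}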

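	// Query whether key is in the set
	bool test(size_t key) const
	{
		// Index of the byte in the vector that holds bit number key
		size_t bytes = key / 8;
		// Position of bit number key inside that byte
		size_t bits = key % 8;
		// The key is present if and only if its bit is 1
		return (_table[bytes] & (1 << bits)) != 0;
	}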


private:
	std::vector<char> _table;
};
  • A bitmap can only record whether a key exists in the set, but compared with red-black trees and hash buckets it is highly space- and time-efficient, which makes it well suited to handling massive data:
    • bitset<-1> (-1 converted to an unsigned integer): with a 32-bit size_t, such an object occupies only about 512 MB of memory (2^32 bits / 8), yet it can record every possible 32-bit key value. Some compilers reject the implicit conversion of -1 in a template argument, in which case bitset<0xFFFFFFFF> can be written instead.
    • Practical applications:
      1. Quickly checking whether a given value is in a collection
      2. Sorting and deduplicating data
      3. Finding the intersection, union, etc. of two sets
      4. Marking disk blocks in an operating system
  • Combined with string hash functions, a bitmap can record whether strings are present in a set. However, different strings may map to the same key value. To reduce the probability of such collisions, one string can be mapped into the bitmap several times using several different string hash functions. A bitmap designed this way is called a Bloom filter.
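
The "sorting and deduplicating" application listed above can be sketched as follows. This is only an illustration: the helper name sortUnique and its limit parameter are assumptions introduced here, and std::vector<bool> (typically bit-packed) stands in for the bitmap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the distinct values of `data` in ascending
// order. Assumes every value is strictly less than `limit`.
std::vector<unsigned> sortUnique(const std::vector<unsigned>& data, std::size_t limit)
{
    // One bit per possible key, all initially 0 (absent).
    std::vector<bool> present(limit, false);

    for (unsigned v : data)
    {
        assert(v < limit);    // precondition: the key must fit in the bitmap
        present[v] = true;    // a duplicate just sets the same bit again
    }

    // Scanning the bits in index order yields the keys sorted and deduplicated.
    std::vector<unsigned> result;
    for (std::size_t key = 0; key < limit; ++key)
    {
        if (present[key])
            result.push_back(static_cast<unsigned>(key));
    }
    return result;
}
```

For the input {5, 1, 9, 1, 5, 0} with limit 16, the result is {0, 1, 5, 9}. The cost is proportional to the key range rather than to the number of elements, so the approach pays off when the data is dense relative to that range.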

2. Bloom filter (bloomFilter)

  • The same string is mapped into the same bitmap several times through several different string hash functions, which effectively reduces the probability that a string is misjudged because of a hash collision in the bitmap

Underlying implementation:

  • Implemented by reusing the bitset above:
#include <string>
using std::string;

// String hash function 1 (BKDR)
struct BKDRHash
{
	size_t operator()(const string& s) const
	{
		size_t hash = 0;
		for (auto ch : s)
		{
			hash += ch;
			hash *= 31;
		}

		return hash;
	}
};

// String hash function 2 (AP)
struct APHash
{
	size_t operator()(const string& s) const
	{
		size_t hash = 0;
		for (size_t i = 0; i < s.size(); i++)
		{
			size_t ch = s[i];
			// Even and odd positions are mixed in with different shifts
			if ((i & 1) == 0)
			{
				hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
			}
			else
			{
				hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
			}
		}
		return hash;
	}
};

// String hash function 3 (DJB)
struct DJBHash
{
	size_t operator()(const string& s) const
	{
		size_t hash = 5381;
		for (auto ch : s)
		{
			hash += (hash << 5) + ch;
		}
		return hash;
	}
};



// Size is the maximum number of distinct keys
template<size_t Size,class Key = string,class Hash1 = BKDRHash,class Hash2 = APHash,class Hash3 = DJBHash>
class BloomFilter
{
public:
	void set(const Key& key)
	{
		size_t len = Size * _factor;

		// Map the same key into the bitmap three times
		size_t hash1 = Hash1()(key) % len;
		_bs.set(hash1);

		size_t hash2 = Hash2()(key) % len;
		_bs.set(hash2);

		size_t hash3 = Hash3()(key) % len;
		_bs.set(hash3);
	}

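	bool test(const Key& key)
	{
		size_t len = Size * _factor;

		// The key is reported as present only if all three mapped bits are set;
		// a single 0 bit proves that the key was never inserted
		size_t hash1 = Hash1()(key) % len;
		if (!_bs.test(hash1))
		{
			return false;
		}

		size_t hash2 = Hash2()(key) % len;
		if (!_bs.test(hash2))
		{
			return false;
		}

		size_t hash3 = Hash3()(key) % len;
		if (!_bs.test(hash3))
		{
			return false;
		}

		// All three bits are set: the key is probably present (false positives are possible)
		return true;
	}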
private:
	static const size_t _factor = 6;
	// Each key occupies three bits, so the bitmap is allocated _factor times Size bits
	bitset<Size * _factor> _bs;
};
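
As a rough guide to what _factor = 6 with three hash functions buys, the standard false-positive estimate for a Bloom filter can be applied. It assumes independent, uniformly distributed hash functions, which the three string hashes above only approximate; the symbols n, m, k, and p are introduced here for illustration and do not appear in the original.

```latex
% n = number of inserted keys, m = number of bits, k = number of hash functions
p \approx \left(1 - e^{-kn/m}\right)^{k}

% With k = 3 and m = 6n (that is, _factor = 6):
p \approx \left(1 - e^{-1/2}\right)^{3} \approx 0.393^{3} \approx 0.061
```

Under these assumptions, a filter that already holds Size keys would report roughly 6% of absent keys as present. A larger _factor lowers that rate at the cost of more memory.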
  • Applications of the Bloom filter:
    1. A Bloom filter does not store the elements themselves, which is a great advantage in situations that require strict data confidentiality
    2. In scenarios where a certain rate of false positives can be tolerated, a Bloom filter is more time- and space-efficient than other data structures
    3. When the amount of data is large, a Bloom filter can represent the full data set where other data structures cannot (limited by memory)
    4. Bloom filters of the same size that use the same set of hash functions can be combined bit by bit: OR yields the filter of the union, and AND yields an approximation of the intersection
    • Bloom filters are often used for duplicate-data filtering, such as checking whether a nickname already exists in a game
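
The nickname check mentioned above might look like the sketch below. Everything here is an assumption made for illustration: the class name NicknameRegistry, the use of std::vector<bool> as the bitmap, and the std::unordered_set that stands in for the real database. The three hashes restate the ones from this article in compact form.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Compact restatements of the three string hashes used in this article.
inline std::size_t bkdr(const std::string& s)
{
    std::size_t h = 0;
    for (unsigned char ch : s) { h += ch; h *= 31; }
    return h;
}

inline std::size_t ap(const std::string& s)
{
    std::size_t h = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        std::size_t ch = static_cast<unsigned char>(s[i]);
        if ((i & 1) == 0) h ^= ((h << 7) ^ ch ^ (h >> 3));
        else              h ^= ~((h << 11) ^ ch ^ (h >> 5));
    }
    return h;
}

inline std::size_t djb(const std::string& s)
{
    std::size_t h = 5381;
    for (unsigned char ch : s) h += (h << 5) + ch;
    return h;
}

// Hypothetical registry: the Bloom filter cheaply answers "definitely free";
// only a "maybe taken" answer falls through to the exact (slow) store.
class NicknameRegistry
{
public:
    explicit NicknameRegistry(std::size_t expected)
        : _bits(expected * 6, false)          // 6 bits per expected key, like _factor
    {
        assert(expected > 0);
    }

    // false => certainly not registered; true => possibly registered.
    bool maybeTaken(const std::string& name) const
    {
        return _bits[bkdr(name) % _bits.size()]
            && _bits[ap(name) % _bits.size()]
            && _bits[djb(name) % _bits.size()];
    }

    // Returns false if the nickname was already taken.
    bool registerName(const std::string& name)
    {
        if (maybeTaken(name) && _store.count(name) != 0)
            return false;                     // confirmed duplicate
        _store.insert(name);
        _bits[bkdr(name) % _bits.size()] = true;
        _bits[ap(name) % _bits.size()] = true;
        _bits[djb(name) % _bits.size()] = true;
        return true;
    }

private:
    std::vector<bool> _bits;                  // the Bloom filter's bitmap
    std::unordered_set<std::string> _store;   // stand-in for the real database
};
```

The design point is that a negative answer from the filter is always correct, so most lookups for unused nicknames never reach the exact store. Only positive answers need confirmation, because they may be false positives.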

3. Hash segmentation idea

  • Hash segmentation is an approach to processing massive amounts of data. For example: given 10 billion strings and a computer with only 1 GB of available memory, how can we design an algorithm to find the string that appears most often?
    • First, hash-partition the data set into N sub-files (numbered 0 to N-1). The method is: use a string hash function Hash() to obtain the key value of each string, then assign each string to the sub-file numbered i according to the following mapping:

    • $i = \mathrm{Hash}(key) \bmod N$

    • Because identical strings are always assigned to the same sub-file, each sub-file can be loaded into memory in turn and counted with a map, keeping track of the highest count seen so far. (If a sub-file is still too large, it can be partitioned again in the same way using a different string hash function.)
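
The steps above can be sketched as follows. This is a simplified illustration under stated assumptions: in-memory vectors stand in for the N sub-files on disk, std::hash stands in for Hash(), and the function name mostFrequent is introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: finds the most frequent string by hash partitioning.
// Returns the string together with its number of occurrences.
std::pair<std::string, std::size_t>
mostFrequent(const std::vector<std::string>& data, std::size_t N)
{
    assert(N > 0);
    std::hash<std::string> hasher;   // stand-in for the article's Hash()

    // Step 1: route each string to sub-file i = Hash(key) mod N.
    std::vector<std::vector<std::string>> subFiles(N);
    for (const std::string& s : data)
        subFiles[hasher(s) % N].push_back(s);

    // Step 2: count one sub-file at a time. Identical strings always share
    // a sub-file, so each per-file count is already the global count.
    std::pair<std::string, std::size_t> best("", 0);
    for (const std::vector<std::string>& file : subFiles)
    {
        std::map<std::string, std::size_t> counts;
        for (const std::string& s : file)
            ++counts[s];

        for (const auto& kv : counts)
        {
            if (kv.second > best.second)
                best = std::make_pair(kv.first, kv.second);
        }
    }
    return best;
}
```

In a real setting each sub-file would be written to disk during step 1 and read back one at a time during step 2, so that only a single counting map has to fit in memory.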

  • The same hash segmentation method also applies to the following problem: files A and B each store 10 billion strings, and the computer has only 1 GB of available memory. How can we obtain the intersection of the two files?
    • An efficient solution: hash-partition file A and file B separately, using the same hash function and the same N, so that A is split into sub-files A0 to A(N-1) and B into sub-files B0 to B(N-1).

    • Because identical strings are always assigned to sub-files with the same number, load sub-files Ai and Bi into memory pair by pair and use a set to find their common elements. The union of these per-pair results is the intersection of A and B.
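
A minimal sketch of this pairwise intersection is shown below. As an assumption for illustration, in-memory vectors again stand in for the sub-files and std::hash stands in for the string hash function; the names hashSplit and partitionedIntersection are introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

using Partition = std::vector<std::vector<std::string>>;

// Splits `data` into N in-memory "sub-files" with i = Hash(key) mod N.
Partition hashSplit(const std::vector<std::string>& data, std::size_t N)
{
    assert(N > 0);
    std::hash<std::string> hasher;
    Partition parts(N);
    for (const std::string& s : data)
        parts[hasher(s) % N].push_back(s);
    return parts;
}

// Hypothetical helper: intersection of A and B computed pair by pair.
std::set<std::string> partitionedIntersection(const std::vector<std::string>& a,
                                              const std::vector<std::string>& b,
                                              std::size_t N)
{
    // Both inputs must be split with the same hash function and the same N.
    Partition partsA = hashSplit(a, N);
    Partition partsB = hashSplit(b, N);

    std::set<std::string> result;
    for (std::size_t i = 0; i < N; ++i)
    {
        // Only A_i and B_i need to be in memory at the same time.
        std::set<std::string> seen(partsA[i].begin(), partsA[i].end());
        for (const std::string& s : partsB[i])
        {
            if (seen.count(s) != 0)
                result.insert(s);
        }
    }
    return result;
}
```

A string that occurs in both inputs lands in the same numbered pair, so no cross-pair comparison is ever needed.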

Origin blog.csdn.net/weixin_73470348/article/details/131957833