[Data Structure] Hash Application

Table of contents

1. Bitmap

1. Bitmap concept

2. Bitmap implementation

2.1. Bitmap structure

2.2. Setting a bit to 1

2.3. Setting a bit to 0

2.4. Detecting bits in the bitmap

3. Bitmap examples

3.1. Find integers that appear only once

3.2. Find the intersection of two files

3.3. Find all integers that appear no more than 2 times

2. Bloom filter

1. Motivation for the Bloom filter

2. Bloom filter concept

3. Bloom filter implementation

3.1. Bloom filter insertion

3.2. Bloom filter search

3.3. Bloom filter deletion

4. Bloom filter examples

4.1. Find the intersection of two files storing queries

4.2. Hash splitting


1. Bitmap

1. Bitmap concept

 A bitmap uses each bit to record a state. It suits scenarios with massive amounts of data and no duplicates, and it is typically used to answer whether a given value is present or not.

Advantages of bitmaps:

  • high speed
  • save space

Disadvantages of bitmaps:

  • Only integers can be mapped directly; other types such as floating-point numbers and strings cannot be stored without extra work.

2. Bitmap implementation

2.1. Bitmap structure

The structure of the bitmap class is as follows:

template<size_t N>
class bitset
{
public:
	bitset()
	{
		_bits.resize(N / 8 + 1, 0);
	}

	// set a bit to 1
	void set(size_t x)
	{}

	// set a bit to 0
	void reset(size_t x)
	{}

	// check whether a bit in the bitmap is 1
	bool test(size_t x)
	{}

private:
	vector<char> _bits;
};

2.2. Setting a bit to 1

Implementation code:

	void set(size_t x)
	{
		size_t i = x / 8;      // which char of the array x falls in
		size_t j = x % 8;      // which bit of that char x maps to

		_bits[i] |= (1 << j);  // turn that bit on, leaving the others unchanged
	}

 Division finds which char of the array x maps into (the ith element), and the modulus finds which bit of that char x maps to (the jth bit). A bitwise OR with the mask then sets that bit to 1.

 Note that the mask is built by shifting 1 left by j bits, not right. When we write a number down we intuitively put the most significant bit on the left, but the shift operators are about significance, not position on paper: a left shift moves a bit toward higher significance and a right shift toward lower significance. Since bit j of the byte is the jth lowest bit, the mask for it is obtained by shifting 1 left by j.
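
As a concrete illustration (the specific number is mine, not from the original post): to set x = 10, the code computes i = 10 / 8 = 1 and j = 10 % 8 = 2, so the mask is 1 << 2 = 0b00000100 and the statement _bits[1] |= 0b00000100 turns on bit 2 of the second byte while leaving every other bit untouched.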

2.3. Setting a bit to 0

Implementation code:

	void reset(size_t x)
	{
		size_t i = x / 8;         // which char of the array x falls in
		size_t j = x % 8;         // which bit of that char x maps to

		_bits[i] &= ~(1 << j);    // turn that bit off, leaving the others unchanged
	}

 Division and the modulus locate the target char and bit exactly as in set. ANDing with the bitwise NOT of the mask then clears that bit to 0 while leaving every other bit unchanged.

2.4. Detecting bits in the bitmap

Implementation code:

	bool test(size_t x)
	{
		size_t i = x / 8;              // which char of the array x falls in
		size_t j = x % 8;              // which bit of that char x maps to

		return _bits[i] & (1 << j);    // non-zero (true) exactly when that bit is 1
	}
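
A minimal usage sketch, assuming the bitset<N> template above is complete and <iostream>/<vector> are included with using namespace std (the class name would clash with std::bitset if <bitset> were also pulled in; the values below are my own example):

int main()
{
	bitset<100> bs;   // can map integers in the range [0, 100)
	bs.set(10);
	bs.set(99);

	cout << bs.test(10) << endl;  // 1: bit 10 was set
	cout << bs.test(11) << endl;  // 0: bit 11 was never set

	bs.reset(10);
	cout << bs.test(10) << endl;  // 0: bit 10 was cleared again
	return 0;
}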

3. Bitmap examples

3.1. Find integers that appear only once

Given a large set of integers, find the ones that appear exactly once. Combine two bitmaps so that each value gets a two-bit state: 00 for zero occurrences, 01 for exactly one occurrence, and 10 for two or more occurrences.

template<size_t N>
class twobitset
{
public:
	void set(size_t x)
	{
		// 00->01
		if (_bs1.test(x) == false
			&& _bs2.test(x) == false)
		{
			_bs2.set(x);
		}

		// 01->10
		else if (_bs1.test(x) == false
			&& _bs2.test(x) == true)
		{
			_bs1.set(x);
			_bs2.reset(x);
		}

		// already 10 (two or more occurrences): leave the state unchanged
	}

	void Print()
	{
		for (size_t i = 0; i < N; ++i)
		{
			if (_bs2.test(i))   // state 01: i appeared exactly once
			{
				cout << i << endl;
			}
		}
	}
public:
	bitset<N> _bs1;
	bitset<N> _bs2;
};
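
A small usage sketch of twobitset (the input array is my own example, not from the original post):

int main()
{
	int a[] = { 3, 4, 5, 3, 3, 4, 9 };
	twobitset<100> tbs;
	for (auto e : a)
	{
		tbs.set(e);
	}
	tbs.Print();  // expected to print 5 and 9, the values that appear exactly once
	return 0;
}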


3.2. Find the intersection of two files

 Method 1: Read the values of one file into a bitmap, then read the other file and test each value against that bitmap. A value that tests positive belongs to the intersection; output it and reset its bit to 0 so that duplicates in the second file are not reported again (sketched in code after Method 2).

 Method 2: Create two bitmaps, map the data from file 1 into bitmap 1 and the data from file 2 into bitmap 2, then AND the two bitmaps together bit by bit. The bits that remain set are exactly the intersection.
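
A sketch of Method 1, with the file reading simplified to in-memory vectors and the value range assumed to fit in a bitset<1000> (these details are mine, not from the original post):

// Values from "file 1" go into the bitmap; values from "file 2" that test
// positive are in the intersection. Resetting the bit after reporting a value
// keeps duplicates in file 2 from being reported twice.
vector<int> Intersection(const vector<int>& file1, const vector<int>& file2)
{
	bitset<1000> bs;
	for (auto e : file1)
		bs.set(e);

	vector<int> result;
	for (auto e : file2)
	{
		if (bs.test(e))
		{
			result.push_back(e);
			bs.reset(e);   // so the same value is not reported again
		}
	}
	return result;
}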

3.3. Find all integers that appear no more than 2 times

 Again combine two bitmaps to give each value a two-bit state: 00 for zero occurrences, 01 for one, 10 for two, and 11 for three or more.

The implementation code is similar to that in 3.1.
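
A minimal sketch of the adapted set()/Print() under this four-state scheme (my adaptation of the 3.1 code, not taken verbatim from the original post):

template<size_t N>
class twobitset2
{
public:
	void set(size_t x)
	{
		bool b1 = _bs1.test(x);
		bool b2 = _bs2.test(x);

		if (!b1 && !b2)        // 00 -> 01: first occurrence
			_bs2.set(x);
		else if (!b1 && b2)    // 01 -> 10: second occurrence
		{
			_bs1.set(x);
			_bs2.reset(x);
		}
		else if (b1 && !b2)    // 10 -> 11: third occurrence
			_bs2.set(x);
		// 11 stays 11 for every further occurrence
	}

	void Print()
	{
		for (size_t i = 0; i < N; ++i)
		{
			// states 01 and 10 mean "appears once or twice", i.e. no more than 2 times
			if (_bs1.test(i) ^ _bs2.test(i))
				cout << i << endl;
		}
	}
private:
	bitset<N> _bs1;
	bitset<N> _bs2;
};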

2. Bloom filter

1. Motivation for the Bloom filter

 When we read news in a news app, it keeps recommending new content and, on every recommendation, filters out items we have already seen. How does the recommendation system deduplicate its pushes? The server keeps a history of everything each user has viewed; before recommending, it checks each candidate against that user's history and drops anything already there. The question is how to do that lookup quickly.

  1.  Use a hash table to store user records. Disadvantage: it wastes space.
  2.  Use a bitmap to store user records. Disadvantage: bitmaps generally handle only integers; if the content ID is a string, it cannot be handled directly.
  3.  Combine hashing with a bitmap, i.e. use a Bloom filter.

2. Bloom filter concept

 The Bloom filter is a compact and clever probabilistic data structure proposed by Burton Howard Bloom in 1970. It supports efficient insertion and query and answers one of two things: "definitely not present" or "possibly present". It uses multiple hash functions to map each piece of data to several positions in a bitmap, which both speeds up queries and saves a great deal of memory.

 Mapping a value to a single bit makes misjudgment easy, since different values can collide on that one bit; mapping each value to several bits lowers the misjudgment rate.

Advantages of Bloom filters:

  • Adding and querying an element take O(K) time, where K is the number of hash functions (usually small), independent of the amount of data stored.
  • The hash functions are independent of one another, which makes hardware-level parallel computation convenient.
  • The Bloom filter does not store the elements themselves, a great advantage in situations that require strict confidentiality.
  • When a certain amount of misjudgment can be tolerated, a Bloom filter has a large space advantage over other data structures.
  • When the amount of data is very large, a Bloom filter can still represent the full set, while other data structures cannot.
  • Bloom filters built with the same set of hash functions can be intersected, unioned, and differenced.

 Disadvantages of Bloom filters:

  • There is a misjudgment rate, i.e. false positives: the filter cannot say with certainty that an element is in the set (remedy: keep a whitelist of data that is prone to misjudgment).
  • The element itself cannot be retrieved.
  • Elements cannot be deleted from a standard Bloom filter.
  • If counters are used to support deletion, the counters may wrap around.

3. Bloom filter implementation

3.1. Bloom filter insertion

 Inserting "baidu" runs the three hash functions on the string and sets the three bits they map to. Inserting "tencent" does the same with its own three hash values; some of those positions may already have been set by other keys, which is exactly where misjudgment can later come from.

 Implementation code:

struct BKDRHash
{
	size_t operator()(const string& s)
	{
		size_t hash = 0;
		for (auto ch : s)
		{
			hash += ch;
			hash *= 31;
		}
		return hash;
	}
};

struct DJBHash
{
	size_t operator()(const string& s)
	{
		size_t hash = 5381;
		for (auto ch : s)
		{
			hash += (hash << 5) + ch;
		}
		return hash;
	}
};

struct APHash
{
	size_t operator()(const string& s)
	{
		size_t hash = 0;
		for (size_t i = 0; i < s.size(); i++)
		{
			if ((i & 1) == 0)
			{
				hash ^= ((hash << 7) ^ s[i] ^ (hash >> 3));
			}
			else
			{
				hash ^= (~((hash << 11) ^ s[i] ^ (hash >> 5)));
			}
		}
		return hash;
	}
};

template<size_t N, class K = string, class Hash1 = BKDRHash, class Hash2 = APHash, class Hash3 = DJBHash>
class BloomFilter
{
public:
	void set(const K& key)
	{
		size_t len = N * _X;

		size_t hash1 = Hash1()(key) % len;
		_bs.set(hash1);

		size_t hash2 = Hash2()(key) % len;
		_bs.set(hash2);

		size_t hash3 = Hash3()(key) % len;
		_bs.set(hash3);
	}
private:
	static const size_t _X = 4; // ratio of the Bloom filter's length to the number of data items
	bitset<N*_X> _bs;  // a longer bitmap effectively reduces collisions between different data
};

3.2. Bloom filter search

 The Bloom filter maps each inserted element into the bitmap with several hash functions, so every mapped position of an inserted element must be 1. Searching therefore works as follows: compute each hash value and check the corresponding bit. If any of those bits is 0, the element is definitely not in the filter; if all of them are 1, the element may be in the filter.

Implementation code:

bool test(const K& key)
{
	size_t len = N * _X;

	size_t hash1 = Hash1()(key) % len;
	if (!_bs.test(hash1))
		return false;

	size_t hash2 = Hash2()(key) % len;
	if (!_bs.test(hash2))
		return false;

	size_t hash3 = Hash3()(key) % len;
	if (!_bs.test(hash3))
		return false;

	return true;

	// false positives remain: an absent element may still be judged as present
}
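
A short usage sketch, assuming the test() above has been added as a member function of BloomFilter (the strings are my own, and the comment about "alibaba" describes the usual outcome, not a guarantee):

int main()
{
	BloomFilter<100> bf;
	bf.set("baidu");
	bf.set("tencent");

	cout << bf.test("baidu") << endl;    // 1: an inserted key always tests positive
	cout << bf.test("alibaba") << endl;  // usually 0, but a false positive (1) is possible
	return 0;
}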

 Note that even with three hash functions, misjudgment is still possible: a "does not exist" answer is always correct, but an "exists" answer may still be wrong.

 Therefore, a Bloom filter on its own fits scenarios that can tolerate misjudgment, such as video recommendation. Where misjudgment cannot be tolerated, there is a standard workaround: when the filter says the data exists, go to the database for a second confirmation and return "exists" only if the database agrees; otherwise return "does not exist".

 The number of hash functions determines how many bits each value maps to. More hash functions lower the misjudgment rate, but they also increase the average space consumed per value.
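
 For reference, a standard result that is not stated in the original post: with m bits, n inserted elements, and k hash functions, the false positive rate is approximately (1 - e^(-kn/m))^k, and it is minimized when k ≈ (m/n)·ln 2. This is also why the code above enlarges the bitmap to N*_X bits: a larger m for the same n lowers the misjudgment rate.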

3.3. Bloom filter deletion

Bloom filters cannot directly support deletion, because when one element is deleted, other elements may be affected.

  One way to support deletion: expand each bit of the Bloom filter into a small counter. Inserting an element increments the k counters at the addresses computed by the k hash functions, and deleting it decrements those same k counters. The cost of supporting deletion is several times more storage space (a sketch follows the defect list below).

Defects:

  1. There is still no way to confirm whether an element is actually in the Bloom filter before deleting it.
  2. The counters can wrap around.
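
A minimal sketch of this counting variant (my own illustration, reusing the three hash functors above and assuming <vector>, <array> and <string> are included; the 8-bit counters deliberately mirror the wrap-around defect just listed):

template<size_t N, class K = string,
	class Hash1 = BKDRHash, class Hash2 = APHash, class Hash3 = DJBHash>
class CountingBloomFilter
{
public:
	void set(const K& key)
	{
		for (auto h : Hashes(key))
			++_counts[h];          // an 8-bit counter can wrap around, as noted above
	}

	void reset(const K& key)
	{
		for (auto h : Hashes(key))
		{
			if (_counts[h] > 0)    // guard against decrementing below zero
				--_counts[h];
		}
	}

	bool test(const K& key)
	{
		for (auto h : Hashes(key))
		{
			if (_counts[h] == 0)
				return false;      // definitely absent
		}
		return true;               // possibly present; false positives remain
	}

private:
	array<size_t, 3> Hashes(const K& key)
	{
		return { Hash1()(key) % N, Hash2()(key) % N, Hash3()(key) % N };
	}

	vector<unsigned char> _counts = vector<unsigned char>(N, 0);
};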

4. Bloom filter examples

4.1. Find the intersection of two files storing queries

Use hash splitting to divide each large file into many small files, so that each small file Ai from file A only needs to be intersected with the corresponding small file Bi from file B (identical queries always hash into buckets with the same index).

 Because this split is not even, some bucket may receive far more queries than others, so a particular Ai or Bi may end up too large. That happens in only two situations:

  1. A single small file contains a huge number of repeated (identical) queries.
  2. A single small file contains a huge number of distinct queries.

To tell the two cases apart, read the queries of the oversized small file one by one and insert them into an unordered_set/set:

  1. If the whole small file can be read and inserted into the set without incident, it is the first case: the duplicates collapse into a small number of set entries.
  2. If an exception is thrown during insertion, it is the second case: switch to a different hash function, split that file again, and then compute the intersection.

Note: inserting a key that is already in the set simply returns false; only when memory is exhausted does the insertion throw a bad_alloc exception, and every other insertion succeeds.
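
A sketch of the hash-splitting step itself (the file names, the 100-way split, and the choice of BKDRHash are my assumptions; <fstream>, <string> and <vector> are assumed to be included):

// Split one large query file into 100 smaller files. Because the bucket index
// depends only on the query's hash, identical queries from file A and file B
// always land in small files with the same index, so only Ai and Bi need to
// be intersected.
void HashSplit(const string& bigfile, const string& outprefix)
{
	const size_t kBuckets = 100;
	ifstream in(bigfile);
	vector<ofstream> outs(kBuckets);
	for (size_t i = 0; i < kBuckets; ++i)
		outs[i].open(outprefix + to_string(i) + ".txt");

	string query;
	while (getline(in, query))
	{
		size_t idx = BKDRHash()(query) % kBuckets;  // same query -> same bucket index
		outs[idx] << query << '\n';
	}
}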

4.2. Hash splitting

 Given a log file larger than 100 GB that stores IP addresses, design an algorithm to find the IP address that occurs most often.

Again use hash splitting to divide the log into many small files, so that every occurrence of a given IP lands in the same small file.

 Process each small file in turn, using an unordered_map or map to count the occurrences of each IP.

  1.  If no exception is thrown while counting, the statistics for that small file are complete: record the IP with the highest count, clear the map, and move on to the next small file.
  2.  If an exception is thrown while counting, that single file is too large because too many distinct IPs landed in it: switch to a different hash function and split it again.

 To find the K most frequent IPs rather than just the single most frequent one, keep a min-heap of size K and push each small file's counts into it as they are produced, turning the task into a top-K problem.
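
A sketch of the min-heap/top-K step (the function name, its unordered_map parameter, and the lambda comparator are my assumptions; <queue>, <unordered_map>, <string>, <vector> and <utility> are assumed to be included):

// Keep the k IPs with the highest counts in a min-heap of size k:
// the heap top is always the smallest count among the current candidates.
vector<pair<string, size_t>> TopK(const unordered_map<string, size_t>& countMap, size_t k)
{
	auto cmp = [](const pair<string, size_t>& a, const pair<string, size_t>& b)
	{
		return a.second > b.second;   // order by count, smallest on top
	};
	priority_queue<pair<string, size_t>, vector<pair<string, size_t>>, decltype(cmp)> minHeap(cmp);

	for (const auto& kv : countMap)
	{
		if (minHeap.size() < k)
		{
			minHeap.push({ kv.first, kv.second });
		}
		else if (kv.second > minHeap.top().second)
		{
			minHeap.pop();            // evict the current smallest count
			minHeap.push({ kv.first, kv.second });
		}
	}

	vector<pair<string, size_t>> result;
	while (!minHeap.empty())
	{
		result.push_back(minHeap.top());
		minHeap.pop();
	}
	return result;                    // counts come out in ascending order
}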


Origin blog.csdn.net/weixin_74078718/article/details/131026316