[C++] Bitmap application | Bloom filter

1. Bitmap application

Topic one

Given 4 billion distinct, unsorted unsigned integers, how can you quickly determine whether a given unsigned integer is among them?


The usual approaches:
1. Sort, then binary search
2. Put the numbers into a hash table or red-black tree


1 billion bytes is approximately 1GB.
4 billion integers at 4 bytes each take approximately 16GB.
With either of the two approaches above, there is not enough memory.


Borrow the idea of direct addressing from hashing: map each integer to a fixed position and record only whether it is present, without storing the value itself.
If each presence mark takes at least one char, that is 4 billion bytes, or about 4GB, which is still too large.
Since presence is a yes/no fact, a single 0/1 is enough to represent it.


Use one bit to mark whether each integer is present. This is the bitmap.
It needs 4 billion bits. Since 1 billion bytes is approximately 1GB, 4 billion bits is approximately 500MB.
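The size estimates above can be checked with a few compile-time constants. This is only an arithmetic sketch; the constant names are made up for illustration, and the last one assumes you want a bit for every possible 32-bit value rather than exactly 4 billion.

```cpp
#include <cstdint>

// Illustrative arithmetic only; none of these names come from the article.
constexpr std::uint64_t kCount        = 4000000000ULL;   // 4 billion values
constexpr std::uint64_t kBytesAsInts  = kCount * 4;      // 4 bytes per 32-bit integer
constexpr std::uint64_t kBytesAsChars = kCount * 1;      // one char per value
constexpr std::uint64_t kBytesAsBits  = (kCount + 7) / 8; // one bit per value

// Covering every possible 32-bit value needs 2^32 bits.
constexpr std::uint64_t kBytesFullRange = (1ULL << 32) / 8;

static_assert(kBytesAsInts    == 16000000000ULL, "about 16GB");
static_assert(kBytesAsChars   ==  4000000000ULL, "about 4GB");
static_assert(kBytesAsBits    ==   500000000ULL, "about 500MB");
static_assert(kBytesFullRange ==   536870912ULL, "512 * 1024 * 1024 bytes");
```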

Code

The bitset class stores an array of char and controls individual bits by operating on the char that holds them.


set

set sets the bit that x maps to to 1.

Subscripts start from 0, so bits 0-7 belong to char 0, bits 8-15 belong to char 1, and so on.
First compute which char x falls in (i = x / 8), then which bit inside that char it corresponds to (j = x % 8).

j is the position of the target bit, which we want to set to 1.
<< shifts from the low bits toward the high bits.
1 << j is a value whose bit j is 1 and whose other bits are all 0.

OR-ing the char with 1 << j makes bit j equal to 1 whether it was 1 or 0 before, and leaves the other bits unchanged.
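The set step can be sketched as a small free function. `set_bit` and the use of `unsigned char` are assumptions made for this illustration; the article's class keeps the same logic in a member function.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring the set step described above.
inline void set_bit(std::vector<unsigned char>& bits, std::size_t x)
{
    std::size_t i = x / 8;  // which char holds the bit
    std::size_t j = x % 8;  // which bit inside that char
    bits[i] |= static_cast<unsigned char>(1u << j);  // force bit j to 1
}
```

For x = 10 this gives i = 1 and j = 2, so the mask is 0b00000100 and only char 1 changes.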

rset

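rset sets the bit that x maps to to 0.


j is the position of the target bit, which we want to set to 0.
~(1 << j) is a value whose bit j is 0 and whose other bits are all 1, so AND-ing the char with it makes bit j equal to 0 whether it was 1 or 0 before, and leaves the other bits unchanged.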
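A matching sketch for the reset step, again with a hypothetical helper name (`clear_bit`). The point to notice is the inverted mask: AND-ing with a plain 0 would wipe the whole char, not just bit j.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring the rset step described above.
inline void clear_bit(std::vector<unsigned char>& bits, std::size_t x)
{
    std::size_t i = x / 8;  // which char holds the bit
    std::size_t j = x % 8;  // which bit inside that char
    // ~(1u << j) has a 0 only at bit j, so every other bit is preserved.
    bits[i] &= static_cast<unsigned char>(~(1u << j));
}
```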

test

test checks whether x is present.




j is the position of the target bit. AND the char with 1 << j: bits at other positions may also be 1, but the mask clears them, so only bit j survives.
If the result is not 0, x is present.
If the result is 0, x is not present.
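The test step, sketched the same way. `test_bit` is a hypothetical name; the comparison against 0 is written out to make the "non-zero means present" rule explicit.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring the test step described above.
inline bool test_bit(const std::vector<unsigned char>& bits, std::size_t x)
{
    std::size_t i = x / 8;  // which char holds the bit
    std::size_t j = x % 8;  // which bit inside that char
    // The mask keeps only bit j; other set bits in the char do not matter.
    return (bits[i] & (1u << j)) != 0;
}
```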

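Full code

#include <cstddef>
#include <vector>

template<size_t N>
class bitset
{
public:
	bitset()
	{
		// N bits need N / 8 chars, plus one more for any remainder
		_bits.resize((N / 8) + 1, 0);
	}
	void set(size_t x)
	{
		size_t i = x / 8; // which char the bit is in
		size_t j = x % 8; // which bit inside that char
		_bits[i] |= (1 << j);
	}
	void rset(size_t x)
	{
		size_t i = x / 8; // which char the bit is in
		size_t j = x % 8; // which bit inside that char
		_bits[i] &= ~(1 << j);
	}
	bool test(size_t x) const // check whether x is present
	{
		size_t i = x / 8; // which char the bit is in
		size_t j = x % 8; // which bit inside that char
		return (_bits[i] & (1 << j)) != 0;
	}
private:
	std::vector<char> _bits;
};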

Topic two

Given 10 billion integers, design an algorithm to find the ones that occur only once.


Use 2 bits to represent the state of each value:
00 means 0 times, 01 means 1 time, 10 means more than 1 time.


Reuse the code of topic one.



The class from topic one is bitset, so use it to define two bitmaps, _bs1 and _bs2. Each value then has one bit in each bitmap, and the pair of bits is read as its state:
if the number of occurrences is 0 (00), adding one changes the state to 01;
if the number of occurrences is 1 (01), adding one changes the state to 10;
if the number of occurrences is already more than 1 (10), the state stays unchanged.
Finally, a print function in the class prints the numbers that appear exactly once.
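The article's own code for this topic is not shown, so here is a minimal sketch of the two-bitmap idea. The class name `TwoBitCounter`, its members, and the use of `std::vector<bool>` in place of the article's bitset are all assumptions made to keep the example self-contained; which bitmap holds the high bit is also a choice made here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: two bit arrays give each value a 2-bit state.
// (hi, lo): 00 = never seen, 01 = seen once, 10 = seen more than once.
class TwoBitCounter
{
public:
    explicit TwoBitCounter(std::size_t range)
        : _hi(range, false), _lo(range, false) {}

    void add(std::size_t x)
    {
        if (!_hi[x] && !_lo[x])
        {
            _lo[x] = true;                  // 00 -> 01
        }
        else if (!_hi[x] && _lo[x])
        {
            _hi[x] = true;                  // 01 -> 10
            _lo[x] = false;
        }
        // 10 stays 10: the exact count beyond one is not needed
    }

    bool seen_once(std::size_t x) const { return !_hi[x] && _lo[x]; }

    // Collect every value whose state is 01.
    std::vector<std::size_t> once() const
    {
        std::vector<std::size_t> result;
        for (std::size_t x = 0; x < _lo.size(); ++x)
        {
            if (seen_once(x)) result.push_back(x);
        }
        return result;
    }

private:
    std::vector<bool> _hi;
    std::vector<bool> _lo;
};
```

Feeding it 3, 5, 3, 7, 5, 5, 9 leaves only 7 and 9 in state 01.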

Summary of advantages and disadvantages of bitmap

Advantages:

fast, and saves space

Disadvantages:
only integers can be mapped; strings and floating-point numbers cannot be mapped directly


The Bloom filter was therefore proposed to solve, to a certain extent, the problem that strings cannot be stored.

2. Bloom filter

Background

Disadvantage of hash table storage: it wastes space.

Disadvantage of bitmap storage: a bitmap can generally only handle integers; it cannot handle strings directly.
Combining hashing with a bitmap gives the Bloom filter.

Concept

A Bloom filter uses multiple hash functions to map one piece of data to several positions in a bitmap structure.
This makes lookups fast and saves a lot of space.


When two different strings are mapped to the same position, that is a hash collision.
The Bloom filter aims to reduce the chance that a collision leads to a wrong answer.
If one value is mapped to a single position, misjudgment is easy. If one value is mapped to multiple positions, the misjudgment rate goes down.


Use several different hash algorithms so that one value is mapped to several different positions.
For example: each value is mapped to 2 positions.

Implementation

The template takes Hash1, Hash2 and Hash3, each of which converts the K type to an integer.
Hash1, Hash2 and Hash3 serve as three different mapping methods.

Hash1 Hash2 Hash3

BKDRHash is used when a string needs to be converted into an integer key. Instead of only adding up the characters, it also multiplies the running hash by 31 after each character, so the order of the characters affects the result.
BKDRHash is passed as the default for Hash1.

For a detailed explanation, see: Hash


APHash is passed as the default for Hash2.


DJBHash is passed as the default for Hash3.


The APHash algorithm and the DJBHash algorithm are based on mathematical derivation.
For a detailed explanation of the APHash and DJBHash algorithms, see: Hash Algorithm


The value of N

N is the maximum number of keys that will be inserted.


k is the number of hash functions, m is the length of the Bloom filter in bits, and n is the number of inserted elements.
The relation used here is k = (m / n) * ln 2, where ln 2 is approximately 0.69.

When k is 3, 3 = (m / n) * 0.69, so m is approximately 4.3n.
Rounding down, m is approximately 4n:
the length of the Bloom filter is about 4 times the number of inserted elements.
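The sizing rule can be checked numerically. The sketch below assumes the usual relation k = (m / n) * ln 2 for choosing the number of hash functions. The second function uses the standard textbook approximation of the false-positive rate, which the article does not state; it is included only to show roughly what rounding m down to 4n costs. Both function names are made up for this example.

```cpp
#include <cmath>

// Bits per inserted element for which k hash functions are the best choice:
// k = (m / n) * ln 2, so m / n = k / ln 2.
inline double bits_per_element(unsigned k)
{
    return static_cast<double>(k) / std::log(2.0);
}

// Standard approximation of the false-positive rate: (1 - e^(-k * n / m))^k.
// Assumption: not taken from the article.
inline double false_positive_rate(unsigned k, double bits_per_elem)
{
    double kd = static_cast<double>(k);
    return std::pow(1.0 - std::exp(-kd / bits_per_elem), kd);
}
```

With k = 3, `bits_per_element` comes out a little above 4.3, and with 4 bits per element the approximate false-positive rate is a bit under 15%.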


set

_bs is the bitmap structure from topic one.
Calling the different operator() implementations of Hash1, Hash2 and Hash3 converts the incoming string into three different integers, and the bitmap sets the bit at each of the three mapped positions.


test

The key is reported as present only when all three positions given by Hash1, Hash2 and Hash3 are set. If any one of the positions is not set, the key is not present.


Even if two strings consist of the same characters (so the sums of their ASCII values are equal) but in a different order, the positions given by Hash1, Hash2 and Hash3 are generally different.
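To see why character order matters, compare a plain character sum with a hash shaped like the article's BKDRHash (add the character, then multiply by 31). Both function names below are hypothetical, and the strings "test" and "etst" are the ones used in the article's own test function.

```cpp
#include <cstddef>
#include <string>

// Order-insensitive: just adds up the characters.
inline std::size_t sum_hash(const std::string& s)
{
    std::size_t hash = 0;
    for (unsigned char c : s) hash += c;
    return hash;
}

// Same shape as the article's BKDRHash functor: add, then multiply by 31.
inline std::size_t bkdr_hash(const std::string& s)
{
    std::size_t hash = 0;
    for (unsigned char c : s)
    {
        hash += c;
        hash *= 31;
    }
    return hash;
}
```

`sum_hash` returns the same value for "test" and "etst", while `bkdr_hash` returns different values, because each multiplication by 31 gives earlier characters a larger weight.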

Is the result of test accurate?

"Absent" is accurate. Absent means that some mapped position is 0, and if the key had been inserted, none of its mapped positions could be 0.


"Present" is not accurate.

Suppose a string ts was never inserted. If other strings happen to set every position that ts maps to, test will mistake ts for present. This is a misjudgment (false positive).
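A misjudgment is easiest to follow with a deliberately tiny example. The toy filter below works on integers and uses two intentionally weak hash functions (the last digit and the tens digit) over 10 bits. None of this comes from the article; it is only there so the collision can be traced by hand.

```cpp
// Toy filter: 10 bits, two weak "hash functions" chosen for illustration.
class ToyFilter
{
public:
    void set(unsigned x)
    {
        _bits[x % 10] = true;         // hash 1: last digit
        _bits[(x / 10) % 10] = true;  // hash 2: tens digit
    }

    bool test(unsigned x) const
    {
        return _bits[x % 10] && _bits[(x / 10) % 10];
    }

private:
    bool _bits[10] = {};
};
```

After `set(12)` (bits 2 and 1) and `set(35)` (bits 5 and 3), `test(15)` checks bits 5 and 1, finds both set, and returns true even though 15 was never inserted. `test(99)` checks bit 9, finds it clear, and returns false, which is always a correct answer.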


Usage scenarios and features

Scenarios that can tolerate misjudgment,
such as quickly checking whether a nickname has already been used.
A misjudgment may report an unused nickname as already taken, but this has no real impact.


Normally, a mobile phone number cannot be checked with the Bloom filter alone. A misjudgment would report a number that has never been registered as an existing user.


But the Bloom filter can still be used in this case, in front of the database.
If the filter says the data is absent, return false directly.
If the filter says the data is present, it may be a misjudgment, so look it up in the database: if it is found, report that the data exists; if not, return false.
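The lookup flow above can be sketched as follows. `user_exists` is a hypothetical function; `MightContain` stands for any callable that behaves like the filter's test, and the `std::unordered_set` is only a stand-in for the real database.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical lookup flow: ask the filter first, confirm with the database.
template <class MightContain>
bool user_exists(const std::string& key,
                 MightContain might_contain,
                 const std::unordered_set<std::string>& database)
{
    if (!might_contain(key))
    {
        return false;  // "absent" is always accurate, no database access needed
    }
    // "present" may be a misjudgment, so confirm with the database
    return database.count(key) > 0;
}
```

The filter only saves work on the "absent" path; every "present" answer still costs one database query, but the final result is never a misjudgment.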


Features of the Bloom filter
Advantages: fast, saves memory
Disadvantages: misjudgment is possible (a "present" result may be wrong)
Full code

#include <iostream>
#include <string>
#include <vector>
using namespace std;


template<size_t N>
class bitset
{
public:
	bitset()
	{
		// N bits need N / 8 chars, plus one more for any remainder
		_bits.resize((N / 8) + 1, 0);
	}
	void set(size_t x)
	{
		size_t i = x / 8; // which char the bit is in
		size_t j = x % 8; // which bit inside that char
		_bits[i] |= (1 << j);
	}
	void rset(size_t x)
	{
		size_t i = x / 8; // which char the bit is in
		size_t j = x % 8; // which bit inside that char
		_bits[i] &= ~(1 << j);
	}
	bool test(size_t x) const // check whether x is present
	{
		size_t i = x / 8; // which char the bit is in
		size_t j = x % 8; // which bit inside that char
		return (_bits[i] & (1 << j)) != 0;
	}
private:
	vector<char> _bits;
};

void test_bitset()
{
	bitset<100> v;
	v.set(10);
	cout << v.test(10) << endl; // prints 1: bit 10 was set
	cout << v.test(15) << endl; // prints 0: bit 15 was never set
}

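// Functors that convert a string into an integer
struct BKDRHash
{
	size_t operator()(const string& s)
	{
		size_t hash = 0;
		for (auto e : s)
		{
			hash += e;
			hash *= 31; // multiplying makes the result depend on character order
		}
		return hash;
	}
};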

struct APHash
{
	size_t operator()(const string& s)
	{
		size_t hash = 0;
		for (size_t i = 0; i < s.size(); i++)
		{
			size_t ch = s[i];
			if ((i & 1) == 0)
			{
				// characters at even indexes
				hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
			}
			else
			{
				// characters at odd indexes
				hash ^= (~((hash) << 11) ^ ch ^ (hash >> 5));
			}
		}
		return hash;
	}
};

struct DJBHash
{
	size_t operator()(const string& s)
	{
		size_t hash = 5381;
		for (auto e : s)
		{
			hash += (hash << 5) + e; // same as hash = hash * 33 + e
		}
		return hash;
	}
};


template<size_t N,
	class K = string,
	class Hash1 = BKDRHash,
	class Hash2 = APHash,
	class Hash3 = DJBHash>
class BloomFilter // Bloom filter
{
public:
	void set(const K& key)
	{
		size_t len = N * _X; // total length in bits
		// convert the key into an integer, then reduce it modulo len
		size_t hash1 = Hash1()(key) % len;
		_bs.set(hash1);

		size_t hash2 = Hash2()(key) % len;
		_bs.set(hash2);

		size_t hash3 = Hash3()(key) % len;
		_bs.set(hash3);
	}

	// check whether key is present
	bool test(const K& key)
	{
		size_t len = N * _X; // total length in bits

		// present only if all three positions are set;
		// if any one position is not set, the key is absent
		size_t hash1 = Hash1()(key) % len;
		if (!_bs.test(hash1))
		{
			return false;
		}

		size_t hash2 = Hash2()(key) % len;
		if (!_bs.test(hash2))
		{
			return false;
		}

		size_t hash3 = Hash3()(key) % len;
		if (!_bs.test(hash3))
		{
			return false;
		}
		return true;
	}
private:
	static const size_t _X = 4; // bits per inserted key (m is about 4n)
	bitset<N * _X> _bs;
};

// Bloom filters are generally used for strings,
// so the key type defaults to string
void test_BloomFilter()
{
	BloomFilter<100> v;
	v.set("sort");
	v.set("left");
	v.set("right");
	v.set("hello world");
	v.set("test");
	v.set("etst");

	cout << v.test("sort") << endl; // prints 1: an inserted key is always reported present
}


Origin blog.csdn.net/qq_62939852/article/details/131015367