1. Bitmap application
Topic one
Given 4 billion distinct, unsorted unsigned integers and one more unsigned integer, how can we quickly determine whether that integer is among the 4 billion?
Usual approaches:
1. Sorting + binary search
2. Put the values into a hash table or a red-black tree
1 billion bytes is approximately 1 GB, so 4 billion 4-byte integers take approximately 16 GB.
Both approaches above have to hold all of the values in memory, so there is not enough memory for either.
A hash map with direct addressing only has to record whether each integer is present: each value maps to its own slot, and the slot stores a mark rather than the value itself.
If each mark is a char, that is at least 4 billion bytes, or 4 GB, which is still too large.
Presence or absence does not need a whole byte, because 0/1 is enough to represent it.
Using one bit to mark each integer gives a bitmap.
It needs about 4 billion bits. Since 1 billion bytes is approximately 1 GB, 4 billion bits (0.5 billion bytes) is approximately 500 MB.
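The estimates above can be checked with a little arithmetic. The following is a minimal sketch, not part of the original code: the names bytes_for_ints, bytes_for_bitmap, kValues and kMiB are made up for this example, and it assumes 4-byte integers and sizes the bitmap for every possible 32-bit value (2^32, slightly more than 4 billion).

```cpp
#include <cstdint>

// Bytes needed to store `count` 32-bit integers directly.
constexpr std::uint64_t bytes_for_ints(std::uint64_t count)
{
    return count * sizeof(std::uint32_t);
}

// Bytes needed for a bitmap with one bit per possible value.
constexpr std::uint64_t bytes_for_bitmap(std::uint64_t values)
{
    return (values + 7) / 8; // round up to whole bytes
}

constexpr std::uint64_t kValues = 1ULL << 32; // every possible 32-bit unsigned integer
constexpr std::uint64_t kMiB = 1ULL << 20;

static_assert(bytes_for_ints(kValues) / kMiB == 16384, "16 GiB when each value is stored as an int");
static_assert(bytes_for_bitmap(kValues) / kMiB == 512, "512 MiB when each value is a single bit");
```

Both static_assert lines hold at compile time: about 16 GiB for the raw integers versus 512 MiB for the bitmap.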
Code
The bitset class stores its bits in an array of char and controls each bit by operating on the char that contains it.
set
set sets the bit that x maps to to 1.
Subscripts start from 0, so bits 0-7 belong to char 0, bits 8-15 belong to char 1, and so on.
i = x / 8 is the index of the char that holds the bit, and j = x % 8 is the position of the bit inside that char.
<< shifts from the low bit toward the high bit, so 1 << j has a 1 at position j and 0 at every other position.
OR-ing the char with 1 << j makes position j equal to 1 whether it was 1 or 0 before, and leaves the other positions unchanged.
rset
rset sets the bit that x maps to to 0.
j is the position of the bit to clear. ~(1 << j) has a 0 at position j and 1 at every other position,
so AND-ing the char with it makes position j equal to 0 whether it was 1 or 0 before, and leaves the other positions unchanged.
test
test reports whether x is present.
j is the position of the bit to check. AND-ing the char with 1 << j zeroes every other position, so bits that were set for other values in the same char do not affect the result.
If the result is not 0, the bit is set and x is present.
If the result is 0, x is not present.
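The three operations above can be traced by hand on a single value. The sketch below is illustrative only: the free functions set_bit, reset_bit and test_bit are hypothetical names, and it uses unsigned char instead of the article's char so that the byte values are easy to read.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Set the bit for x: OR with a mask that has only bit j set.
inline void set_bit(std::vector<unsigned char>& bits, std::size_t x)
{
    assert(x / 8 < bits.size());
    bits[x / 8] |= static_cast<unsigned char>(1u << (x % 8));
}

// Clear the bit for x: AND with a mask that has every bit set except bit j.
inline void reset_bit(std::vector<unsigned char>& bits, std::size_t x)
{
    assert(x / 8 < bits.size());
    bits[x / 8] &= static_cast<unsigned char>(~(1u << (x % 8)));
}

// Report whether the bit for x is set: AND with the single-bit mask, compare with 0.
inline bool test_bit(const std::vector<unsigned char>& bits, std::size_t x)
{
    assert(x / 8 < bits.size());
    return (bits[x / 8] & (1u << (x % 8))) != 0;
}
```

For x = 10, i = 10 / 8 = 1 and j = 10 % 8 = 2, so the mask is 1 << 2 = 0b00000100. set_bit changes byte 1 from 0x00 to 0x04, and reset_bit changes it back to 0x00.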
Specific code
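#include <cstddef>
#include <vector>

// A bitmap with room for at least N bits
template<size_t N>
class bitset
{
public:
    bitset()
    {
        // One char holds 8 bits; +1 covers the remainder when N is not a multiple of 8
        _bits.resize((N / 8) + 1, 0);
    }
    // Set the bit that x maps to to 1
    void set(size_t x)
    {
        size_t i = x / 8; // index of the char that holds the bit
        size_t j = x % 8; // position of the bit inside that char
        _bits[i] |= (1 << j);
    }
    // Set the bit that x maps to to 0
    void rset(size_t x)
    {
        size_t i = x / 8; // index of the char that holds the bit
        size_t j = x % 8; // position of the bit inside that char
        _bits[i] &= ~(1 << j);
    }
    // Check whether x is present
    bool test(size_t x) const
    {
        size_t i = x / 8; // index of the char that holds the bit
        size_t j = x % 8; // position of the bit inside that char
        return (_bits[i] & (1 << j)) != 0;
    }
private:
    std::vector<char> _bits;
};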
Topic two
Given 10 billion integers, design an algorithm to find the integers that occur only once.
Use 2 bits to represent the state of each value:
00 means 0 times, 01 means 1 time, 10 means more than 1 time
This can be built on the code of topic 1.
The class of topic 1 is bitset, so define two bitsets, _bs1 and _bs2. For a value x, the bit in _bs1 and the bit in _bs2 together form its 2-bit state.
When x is inserted, check whether the two bits are 1 or 0:
if the state is 00 (0 occurrences), +1 makes it 01;
if the state is 01 (1 occurrence), +1 makes it 10;
if the state is 10 (more than 1 occurrence), it remains unchanged.
Finally, a print function in the class prints the values that appear exactly once, that is, those whose state is 01.
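The state transitions described above can be sketched as follows. This is an illustration under stated assumptions rather than the article's own class: the name TwoBitCounter and its member functions are hypothetical, std::vector<bool> stands in for the bitset of topic 1 so that the example is self-contained, and the result is returned as a vector instead of being printed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tracks, for each value in [0, range), whether it was seen 0 times (00),
// exactly once (01), or more than once (10), using two bit arrays.
class TwoBitCounter
{
public:
    explicit TwoBitCounter(std::size_t range)
        : _bs1(range, false), _bs2(range, false)
    {
    }

    // Advance the state of x: 00 -> 01 -> 10, and 10 stays 10.
    void add(std::size_t x)
    {
        assert(x < _bs1.size());
        if (!_bs1[x] && !_bs2[x])
        {
            _bs2[x] = true;  // 00 -> 01
        }
        else if (!_bs1[x] && _bs2[x])
        {
            _bs1[x] = true;  // 01 -> 10
            _bs2[x] = false;
        }
        // 10: already seen more than once, nothing to change
    }

    // True when x has been added exactly once (state 01).
    bool seen_once(std::size_t x) const
    {
        assert(x < _bs1.size());
        return !_bs1[x] && _bs2[x];
    }

    // Collect every value whose state is 01, in increasing order.
    std::vector<std::size_t> values_seen_once() const
    {
        std::vector<std::size_t> out;
        for (std::size_t x = 0; x < _bs1.size(); ++x)
        {
            if (seen_once(x))
            {
                out.push_back(x);
            }
        }
        return out;
    }

private:
    std::vector<bool> _bs1; // high bit of the 2-bit state
    std::vector<bool> _bs2; // low bit of the 2-bit state
};
```

Adding 3, 5, 3, 7, 5, 5, 9 leaves 3 and 5 in state 10 and 7 and 9 in state 01, so values_seen_once() returns 7 and 9.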
Summary of advantages and disadvantages of bitmap
Advantages: fast, and saves space
Disadvantages:
Only integers can be mapped; strings and floating-point numbers cannot be mapped directly.
The Bloom filter was therefore proposed; to a certain extent, it solves the problem that strings cannot be stored.
2. Bloom filter
Background
Disadvantage of using a hash table: it wastes space.
Disadvantage of using a bitmap: a bitmap can generally only handle integers; it cannot handle strings.
A Bloom filter combines hashing with a bitmap.
Concept
A Bloom filter uses multiple hash functions to map each piece of data into a bitmap structure, which makes lookups fast and saves a lot of space.
If two strings are mapped to the same location, that is a hash collision, and the Bloom filter aims to reduce the chance that a collision causes a wrong answer.
If each value is mapped to only one location, misjudgments are easy. Mapping each value to multiple locations reduces the misjudgment rate.
Several different hash algorithms are used so that the same value is mapped to different locations.
For example: each value is mapped to 2 locations
Implementation
The template takes Hash1, Hash2 and Hash3 as parameters. Each one converts the K type into an integer, and together they serve as three different mapping methods.
BKDRHash is the default for Hash1. The BKDRHash algorithm has been used before: to turn a string into an integer, it folds every character of the string into a running hash value, which becomes the key.
For a detailed explanation, see: Hash
APHash is the default for Hash2.
DJBHash is the default for Hash3.
The APHash algorithm and the DJBHash algorithm are based on mathematical derivation.
For a detailed explanation of the APHash algorithm and the DJBHash algorithm, see: Hash Algorithm
The value of N
N is the maximum number of keys that will be inserted.
Let k be the number of hash functions, m the length of the Bloom filter in bits, and n the number of inserted elements. The misjudgment rate is lowest when k = (m / n) * ln 2, and ln 2 is about 0.69.
When k is 3: 3 = (m / n) * 0.69, so m ≈ 4.3n.
m is therefore approximately 4n: the length of the Bloom filter is about 4 times the number of inserted elements.
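The relation above can be turned into a small helper that computes the filter length for other values of n and k. This is a sketch only; bloom_bits is a hypothetical function that is not part of the article's BloomFilter class, and it uses the exact value of ln 2 rather than the rounded 0.69.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Number of bits m for n keys and k hash functions, solved from k = (m / n) * ln 2.
inline std::size_t bloom_bits(std::size_t n, std::size_t k)
{
    assert(k > 0);
    const double ln2 = std::log(2.0); // about 0.693
    const double m = static_cast<double>(k) * static_cast<double>(n) / ln2;
    return static_cast<std::size_t>(std::ceil(m));
}
```

For n = 100 and k = 3 this gives 433 bits, close to the 4 * 100 = 400 bits that the article's _X = 4 allocates.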
set
_bs is the bitmap structure from topic 1.
Calling operator() of Hash1, Hash2 and Hash3 converts the incoming string into three different integers. Each one is taken modulo the filter length, and the bitmap sets the bit at each of the three mapped positions.
test
The key is reported as present only when all three positions given by Hash1, Hash2 and Hash3 are set; if any one of them is not set, the key is not present.
Even if two strings consist of the same characters (the same ASCII values) in a different order, such as "test" and "etst", the positions given by Hash1, Hash2 and Hash3 generally differ, because these hash functions depend on character order.
Is the result of test accurate?
"Not present" is accurate. It means at least one mapped position is 0, and if the key had been inserted, none of its mapped positions could be 0.
"Present" is not accurate.
A key may never have been inserted, yet other strings may happen to map to all of the positions checked for it. The key is then mistaken for present, which is a misjudgment.
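The asymmetry described above can be reproduced by hand with a deliberately weak filter. The sketch below is not the article's BloomFilter: TinyBloom is a hypothetical class with two made-up hash functions (sum of character codes and string length) that ignore character order on purpose, so that a false positive is easy to trigger.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A tiny Bloom filter with two weak hash functions, for demonstration only.
class TinyBloom
{
public:
    explicit TinyBloom(std::size_t bits) : _bits(bits, false)
    {
        assert(bits > 0);
    }

    void set(const std::string& key)
    {
        _bits[sum_hash(key)] = true;
        _bits[length_hash(key)] = true;
    }

    // false: the key was definitely never inserted.
    // true:  the key was probably inserted, but this may be a false positive.
    bool test(const std::string& key) const
    {
        return _bits[sum_hash(key)] && _bits[length_hash(key)];
    }

private:
    // Sum of character codes: ignores character order on purpose.
    std::size_t sum_hash(const std::string& key) const
    {
        std::size_t h = 0;
        for (unsigned char c : key)
        {
            h += c;
        }
        return h % _bits.size();
    }

    // String length: ignores the characters themselves.
    std::size_t length_hash(const std::string& key) const
    {
        return key.size() % _bits.size();
    }

    std::vector<bool> _bits;
};
```

With 64 bits, inserting "test" sets bit 448 % 64 = 0 and bit 4. test("etst") then returns true even though "etst" was never inserted, because it has the same character sum and length. test("hello") returns false, since its sum 532 maps to bit 20, which is still 0. Order-sensitive hashes such as BKDRHash make the first kind of collision much rarer, but they cannot rule it out.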
Usage scenarios and features
Bloom filters suit scenarios that can tolerate misjudgment,
such as quickly determining whether a nickname has been used.
A misjudgment may report an unused nickname as already taken, but that has no real impact.
Normally, a mobile phone number should not be checked with a Bloom filter alone: a misjudgment would report that the user exists even though the number has never been registered.
But a Bloom filter can still be used as a first step.
If the filter says the data is not there, return false directly.
If the filter says the data is there, it may be a misjudgment, so search the database: if the data is found, return that it exists; if not, return false.
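The two-step check above can be sketched as a single function. Everything here is an assumption made for illustration: user_exists is a hypothetical name, the filter is passed in as any callable with Bloom-filter semantics, and a std::unordered_set stands in for the real database.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Returns true only when the key really exists.
// might_contain(key) == false means "definitely absent";
// might_contain(key) == true means "possibly present".
template <class Filter>
bool user_exists(const std::string& key,
                 const Filter& might_contain,
                 const std::unordered_set<std::string>& database)
{
    assert(!key.empty()); // looking up an empty key is treated as a caller error
    if (!might_contain(key))
    {
        return false; // the filter never misses an inserted key
    }
    return database.count(key) > 0; // confirm, to rule out a misjudgment
}
```

Only keys that pass the filter reach the database, so most lookups for unregistered keys are answered without touching it, and a misjudgment in the filter costs one extra database query instead of a wrong answer.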
Features of Bloom filter
Advantages: fast, save memory
Disadvantages: misjudgment is possible when a key is reported as present
Specific code
#include <iostream>
#include <string>
#include <vector>
using namespace std;
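// A bitmap with room for at least N bits.
// Note: the name can clash with std::bitset if <bitset> is also included.
template<size_t N>
class bitset
{
public:
    bitset()
    {
        // One char holds 8 bits; +1 covers the remainder when N is not a multiple of 8
        _bits.resize((N / 8) + 1, 0);
    }
    // Set the bit that x maps to to 1
    void set(size_t x)
    {
        size_t i = x / 8; // index of the char that holds the bit
        size_t j = x % 8; // position of the bit inside that char
        _bits[i] |= (1 << j);
    }
    // Set the bit that x maps to to 0
    void rset(size_t x)
    {
        size_t i = x / 8; // index of the char that holds the bit
        size_t j = x % 8; // position of the bit inside that char
        _bits[i] &= ~(1 << j);
    }
    // Check whether x is present
    bool test(size_t x) const
    {
        size_t i = x / 8; // index of the char that holds the bit
        size_t j = x % 8; // position of the bit inside that char
        return (_bits[i] & (1 << j)) != 0;
    }
private:
    vector<char> _bits;
};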
void test_bitset()
{
    bitset<100> v;
    v.set(10);
    cout << v.test(10) << endl; // prints 1
    cout << v.test(15) << endl; // prints 0
}
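// Functors (function objects) that hash a string to an integer
struct BKDRHash
{
    size_t operator()(const string& s)
    {
        size_t hash = 0;
        for (auto e : s)
        {
            // Fold each character into the running hash, then multiply by 31
            hash += e;
            hash *= 31;
        }
        return hash;
    }
};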
struct APHash
{
    size_t operator()(const string& s)
    {
        size_t hash = 0;
        // size_t index avoids a signed/unsigned comparison with s.size()
        for (size_t i = 0; i < s.size(); i++)
        {
            size_t ch = s[i];
            if ((i & 1) == 0)
            {
                // Even positions
                hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
            }
            else
            {
                // Odd positions
                hash ^= (~((hash) << 11) ^ ch ^ (hash >> 5));
            }
        }
        return hash;
    }
};
struct DJBHash
{
    size_t operator()(const string& s)
    {
        size_t hash = 5381;
        for (auto e : s)
        {
            hash += (hash << 5) + e;
        }
        return hash;
    }
};
template<size_t N,
    class K = string,
    class Hash1 = BKDRHash,
    class Hash2 = APHash,
    class Hash3 = DJBHash>
class BloomFilter // Bloom filter
{
public:
    void set(const K& key)
    {
        size_t len = N * _X; // total length in bits
        // Convert the key to an integer and take it modulo the length
        size_t hash1 = Hash1()(key) % len;
        _bs.set(hash1);
        size_t hash2 = Hash2()(key) % len;
        _bs.set(hash2);
        size_t hash3 = Hash3()(key) % len;
        _bs.set(hash3);
    }
    // Check whether the key is present
    bool test(const K& key)
    {
        size_t len = N * _X; // total length in bits
        // Present only if all three positions are set; absent if any one is not set
        size_t hash1 = Hash1()(key) % len;
        if (!_bs.test(hash1))
        {
            return false;
        }
        size_t hash2 = Hash2()(key) % len;
        if (!_bs.test(hash2))
        {
            return false;
        }
        size_t hash3 = Hash3()(key) % len;
        if (!_bs.test(hash3))
        {
            return false;
        }
        return true;
    }
private:
    static const size_t _X = 4; // bits per inserted key (an integer multiple of N)
    bitset<N * _X> _bs;
};
// Bloom filters are generally used for strings,
// so the key type defaults to string
void test_BloomFilter()
{
    BloomFilter<100> v;
    v.set("sort");
    v.set("left");
    v.set("right");
    v.set("hello world");
    v.set("test");
    v.set("etst");
    cout << v.test("sort") << endl;   // prints 1: an inserted key is always found
    cout << v.test("etst") << endl;   // prints 1
    cout << v.test("absent") << endl; // usually 0, but a misjudgment is possible
}