Learning C++ - A discussion of bitmaps and Bloom filters

1. Bitmap

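A bitmap uses individual bits to record a state. It suits very large data sets in which each item has only a few possible states, and it is typically used to test whether a given value is present.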

Topic:
Given 4 billion distinct unsigned integers in no particular order, and one more unsigned integer, how do you quickly determine whether that integer is among the 4 billion? [Tencent]

Thinking:
To answer this, first work out how much memory 4 billion distinct unsigned integers actually occupy, because a typical computer cannot load an arbitrarily large data set into memory.

An unsigned integer takes 4 bytes, so 4 billion of them take 16 billion bytes, roughly 16 GB of memory. An ordinary computer can hardly hold that much, so we consider two options: splitting the data into batches, or using a bitmap.

Method:
① Segmentation
Process the 4 billion numbers in batches. This does reach the goal, but its drawback is that it takes too much time.

② Bitmap (BitMap)
Before describing this method, let me first introduce what a bitmap is.

Bitmap (BitMap): a bitmap is an array in which every bit stands for one value. A bit of 0 means the value is absent; a bit of 1 means it is present.
To store a value, we do the following:

1. Determine which element (32-bit word) of the array holds the value's bit.
2. Determine which bit inside that word corresponds to the value.
3. Set that bit to 1.
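The first two steps are just a division and a remainder. A minimal sketch, assuming 32-bit words; the names `BitPos` and `Locate` are hypothetical helpers and not part of the article's code:

```cpp
#include <cstddef>

// Hypothetical helper type: where a value's bit lives in an array of 32-bit words.
struct BitPos {
    std::size_t word; // index of the 32-bit word in the array
    std::size_t bit;  // bit offset inside that word, 0..31
};

// Steps 1 and 2: the quotient picks the word, the remainder picks the bit.
constexpr BitPos Locate(std::size_t x) {
    return BitPos{x / 32, x % 32};
}
```

For the value 70, for example, this yields word 2 and bit 6, because 70 = 2 * 32 + 6. Step 3 then sets that bit with a bitwise OR.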

Bitmap implementation code:

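#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>
using namespace std;

class BitMap
{
public:
    // range: number of bits (distinct values) the bitmap must cover
    BitMap(size_t range)
        : _bitCount(range)
    {
        // One 32-bit word holds 32 states, so range / 32 words are needed;
        // the extra word covers the remainder when range is not a multiple of 32.
        _bits.resize(range / 32 + 1);
    }
    void Set(size_t x)
    {
        size_t index = x / 32; // which word holds the bit
        size_t temp = x % 32;  // which bit inside that word
        _bits[index] |= (1u << temp); // unsigned mask, so bit 31 is safe to shift into
    }
    void Reset(size_t x)
    {
        size_t index = x / 32;
        size_t temp = x % 32;
        _bits[index] &= ~(1u << temp); // invert the mask to clear the bit
    }
    bool Test(size_t x) const
    {
        size_t index = x / 32;
        size_t temp = x % 32;
        return (_bits[index] & (1u << temp)) != 0;
    }

    // Number of bits the bitmap covers (used by the Bloom filter below)
    size_t Size() const
    {
        return _bitCount;
    }

    // Number of bits set to 1 in the bitmap
    size_t Count() const
    {
        // pCount[n] is the number of 1 bits in the 4-bit value n
        const char* pCount = "\0\1\1\2\1\2\2\3\1\2\2\3\2\3\3\4";
        size_t count = 0;
        for (size_t i = 0; i < _bits.size(); ++i)
        {
            uint32_t value = _bits[i];
            for (size_t j = 0; j < sizeof(_bits[0]); ++j)
            {
                unsigned char c = value & 0xff; // lowest byte
                count += pCount[c & 0x0f];      // low nibble
                c >>= 4;
                count += pCount[c & 0x0f];      // high nibble
                value >>= 8;
            }
        }
        return count;
    }

private:
    vector<uint32_t> _bits; // unsigned words avoid signed-shift problems
    size_t _bitCount;
};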
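For the interview question, the bitmap has to cover every possible 32-bit unsigned value, that is 2^32 bits. The arithmetic can be sketched as follows; `RawBytes` and `BitmapBytes` are hypothetical helper names used only for this illustration:

```cpp
#include <cstdint>

// Bytes needed to store n unsigned 32-bit integers directly (4 bytes each).
constexpr std::uint64_t RawBytes(std::uint64_t n) {
    return n * 4;
}

// Bytes needed for a bitmap with one bit per possible value in [0, range).
constexpr std::uint64_t BitmapBytes(std::uint64_t range) {
    return (range + 7) / 8; // round up to whole bytes
}

// 4 billion raw integers: 16,000,000,000 bytes.
// Bitmap over all 2^32 values: 536,870,912 bytes, i.e. 512 MiB.
static_assert(RawBytes(4000000000ULL) == 16000000000ULL, "about 16 GB");
static_assert(BitmapBytes(1ULL << 32) == 512ULL * 1024 * 1024, "512 MiB");
```

So the bitmap fits in about 512 MiB regardless of how many of the values are actually present, compared with roughly 16 GB for the raw integers.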

2. Bloom filter

1. Why the Bloom filter was proposed
When we read news in a news client, it keeps recommending new content, and each round of recommendations has to leave out the items we have already seen. So how does the client's recommendation system deduplicate what it pushes? The server records everything each user has viewed, and when the recommendation system picks news for a user it filters out the items already in that user's history. How can this lookup be done quickly?

  1. Store the user's history in a hash table. Drawback: wastes space.
  2. Store the user's history in a bitmap. Drawback: cannot handle hash collisions.
  3. Combine hashing with a bitmap, i.e. a Bloom filter.

2. What is a Bloom filter
A Bloom filter is essentially a data structure, more precisely a rather ingenious probabilistic data structure. It offers efficient insertion and lookup and can tell you that something "definitely does not exist or may exist". Compared with traditional structures such as List, Set, and Map, it is more efficient and takes less space; the drawback is that its answers are probabilistic, not exact.

3. Bloom filter data structure
A Bloom filter is a bit vector (bit array) in which every bit starts out as 0.
To map a value into the Bloom filter, we run it through several different hash functions to get several hash values, and set the bit at each resulting position to 1. For example, if three different hash functions map the value "baidu" to 1, 4, and 7, then bits 1, 4, and 7 are set to 1.
Now we store another value, "tencent". If the hash functions return 3, 4, and 8, then bits 3, 4, and 8 are set to 1 as well.
Note that position 4 is returned for both values, so the second write simply overwrites a bit that is already 1. Now suppose we query whether "dianping" exists and the hash functions return 1, 5, and 8. Bit 5 is 0, which means no stored value was ever mapped to that position, so we can say with certainty that "dianping" does not exist. When we query "baidu", the hash functions necessarily return 1, 4, and 7, and we find all three bits set to 1. Can we then say that "baidu" exists? No. We can only say that "baidu" may exist.

Why is this? The answer is simple: as more and more values are added, more and more bits are set to 1. A value such as "taobao" may never have been stored, but if the three positions its hash functions return happen to have been set to 1 by other values, the program will still conclude that "taobao" exists.
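The walk-through above can be replayed with a 10-bit `std::bitset`. This is only a sketch: the positions for "baidu", "tencent", and "dianping" are the ones quoted in the text, the positions 3, 7, 8 for "taobao" are an assumption made up for illustration, and the hash functions themselves are skipped by passing positions in directly.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>

using Positions = std::array<std::size_t, 3>; // the three hash results for one value

// Store a value: set each of its three bit positions to 1.
void Add(std::bitset<10>& bits, const Positions& p) {
    for (std::size_t i : p) {
        assert(i < bits.size()); // positions must fall inside the bit array
        bits.set(i);
    }
}

// Query a value: false means "definitely absent", true means "may exist".
bool MayContain(const std::bitset<10>& bits, const Positions& p) {
    for (std::size_t i : p)
        if (!bits.test(i)) return false; // one zero bit is enough to rule it out
    return true;
}
```

After adding {1, 4, 7} for "baidu" and {3, 4, 8} for "tencent", a query with {1, 5, 8} for "dianping" returns false because bit 5 is 0, while a query with the assumed {3, 7, 8} for "taobao" returns true even though "taobao" was never added. That is a false positive.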

4. Next, combining the bitmap with the idea of hashing, we implement a Bloom filter:

// Assume the element type is K and each element uses 5 hash functions
template<class K, class KToInt1 = KeyToInt1,
 class KToInt2 = KeyToInt2,
 class KToInt3 = KeyToInt3,
 class KToInt4 = KeyToInt4,
 class KToInt5 = KeyToInt5>
class BloomFilter
{
public:
 BloomFilter(size_t size) // expected number of elements in the Bloom filter
 : _bmp(10*size) // 10 bits per expected element
 , _size(0)
 {}

private:
 BitMap _bmp;
 size_t _size; // number of elements actually inserted
};
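The default template arguments KeyToInt1 through KeyToInt5 used above are not defined anywhere in this article. One possible set of stand-ins for string keys is sketched below. Everything in it is an assumption: the functor `PolyStrHash`, its multipliers, and the helper `IndexFor` are hypothetical, and in a real program the aliases would have to be declared before the class template.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical hash functor: polynomial string hash with a fixed multiplier.
template <std::size_t Multiplier>
struct PolyStrHash {
    std::size_t operator()(const std::string& key) const {
        std::size_t h = 0;
        for (unsigned char ch : key)
            h = h * Multiplier + ch; // unsigned overflow wraps, which is fine for hashing
        return h;
    }
};

// Stand-ins for the undefined defaults; the multipliers are arbitrary choices.
using KeyToInt1 = PolyStrHash<31>;
using KeyToInt2 = PolyStrHash<131>;
using KeyToInt3 = PolyStrHash<1313>;
using KeyToInt4 = PolyStrHash<13131>;
using KeyToInt5 = PolyStrHash<131313>;

// Bit position for one hash functor: hash the key, then reduce modulo the bit count.
template <class Hash>
std::size_t IndexFor(const std::string& key, std::size_t bitCount) {
    assert(bitCount > 0); // modulo by zero is undefined
    return Hash()(key) % bitCount;
}
```

Simple polynomial hashes like these are easy to read but not especially well distributed; they are used here only to make the template compile in principle.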

Bloom filter insert

bool Insert(const K& key)
{
 // Map the key to five bit positions, one per hash function
 size_t bitCount = _bmp.Size();
 size_t index1 = KToInt1()(key)%bitCount;
 size_t index2 = KToInt2()(key)%bitCount;
 size_t index3 = KToInt3()(key)%bitCount;
 size_t index4 = KToInt4()(key)%bitCount;
 size_t index5 = KToInt5()(key)%bitCount;
 // Set all five bits to 1
 _bmp.Set(index1);
 _bmp.Set(index2);
 _bmp.Set(index3);
 _bmp.Set(index4);
 _bmp.Set(index5);
 _size++;
 return true;
}

Bloom filter lookup

The idea of a Bloom filter is to map an element into a bitmap with several hash functions, so once an element has been inserted the bits at those positions are certainly 1. A lookup therefore works as follows: compute each hash value and check whether the bit stored at the corresponding position is 0. If any of them is 0, the element is definitely not in the filter; otherwise it may be in the filter.

Note: if the Bloom filter reports that an element is absent, it is certainly absent. If it reports that the element is present, the element is only possibly present, because the hash values of different elements can collide and produce false positives.

bool IsInBloomFilter(const K& key)
{
 size_t bitCount = _bmp.Size();
 size_t index1 = KToInt1()(key)%bitCount;
 if(!_bmp.Test(index1))
 return false;
 size_t index2 = KToInt2()(key)%bitCount;
 if(!_bmp.Test(index2))
 return false;
 size_t index3 = KToInt3()(key)%bitCount;
 if(!_bmp.Test(index3))
 return false;
 size_t index4 = KToInt4()(key)%bitCount;
 if(!_bmp.Test(index4))
 return false;
 size_t index5 = KToInt5()(key)%bitCount;
 if(!_bmp.Test(index5))
 return false;
 return true; // may be present
}

Does a Bloom filter support deletion?

So far we know that a Bloom filter supports the add and isExist operations. Can it also support delete? The answer is no. In the example above, bit 4 is shared by two values. If you delete "tencent" by setting its bits to 0, the next query for "baidu" returns false immediately, even though you never deleted "baidu".
How can this be solved? The answer is counting deletion. It requires storing a counter at each position instead of a single bit, which increases memory use. Inserting a value adds one to the counter in each slot it hashes to, deleting subtracts one, and a lookup checks whether every such counter is greater than 0.
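The counting scheme just described can be sketched as follows. This is an illustration under stated assumptions, not the article's implementation: it uses string keys, three polynomial hashes with arbitrarily chosen multipliers, and 8-bit counters; the class name `CountingBloomFilter` and its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Arbitrary multipliers, one per hash function (an assumption of this sketch).
const std::size_t kMultipliers[] = {31, 131, 1313};

class CountingBloomFilter {
public:
    explicit CountingBloomFilter(std::size_t slots) : _counts(slots, 0) {
        assert(slots > 0);
    }

    // Add one to the counter in every slot the key hashes to.
    void Insert(const std::string& key) {
        for (std::size_t m : kMultipliers) {
            std::uint8_t& c = _counts[Index(key, m)];
            if (c < 255) ++c; // saturate instead of wrapping around
        }
    }

    // true: may be present; false: definitely absent.
    bool MayContain(const std::string& key) const {
        for (std::size_t m : kMultipliers)
            if (_counts[Index(key, m)] == 0) return false;
        return true;
    }

    // Subtract one from every slot. Only valid for keys that were inserted;
    // removing a key that was never added corrupts other keys' counters.
    void Remove(const std::string& key) {
        if (!MayContain(key)) return;
        for (std::size_t m : kMultipliers) --_counts[Index(key, m)];
    }

private:
    std::size_t Index(const std::string& key, std::size_t multiplier) const {
        std::size_t h = 0;
        for (unsigned char ch : key) h = h * multiplier + ch;
        return h % _counts.size();
    }

    std::vector<std::uint8_t> _counts; // one counter per slot instead of one bit
};
```

With counters, removing "tencent" only lowers the shared slot's count by one, so a later query for "baidu" still succeeds. The cost is 8 bits per slot in this sketch instead of 1.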

How to choose the number of hash functions and the Bloom filter length

Obviously, if the Bloom filter is too short, all of its bits soon become 1, every query returns "may exist", and the filter no longer filters anything. The length directly affects the false positive rate: the longer the Bloom filter, the lower the false positive rate.
The number of hash functions is also a trade-off. The more hash functions there are, the faster bits are set to 1 and the slower each operation becomes; with too few, the false positive rate goes up.
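This trade-off can be made concrete with the standard approximations for a Bloom filter with m bits, n inserted elements, and k hash functions: the false positive rate is about (1 - e^(-kn/m))^k, it is minimized at k = (m/n) ln 2, and reaching a target rate p with that k takes about m = -n ln p / (ln 2)^2 bits. The function names below are hypothetical, and the formulas assume idealized, independent hash functions:

```cpp
#include <cassert>
#include <cmath>

// Approximate false positive rate for m bits, n inserted elements, k hash functions.
double FalsePositiveRate(double m, double n, double k) {
    assert(m > 0 && n > 0 && k > 0);
    return std::pow(1.0 - std::exp(-k * n / m), k);
}

// Number of hash functions that minimizes the rate above: k = (m / n) * ln 2.
double OptimalHashCount(double m, double n) {
    assert(m > 0 && n > 0);
    return (m / n) * std::log(2.0);
}

// Bits needed for n elements at target rate p with the optimal k:
// m = -n * ln p / (ln 2)^2.
double RequiredBits(double n, double p) {
    assert(n > 0 && p > 0 && p < 1);
    const double ln2 = std::log(2.0);
    return -n * std::log(p) / (ln2 * ln2);
}
```

With the configuration used earlier in this article (10 bits per expected element and 5 hash functions), the approximation gives a false positive rate of roughly 0.94%. For 10 bits per element the optimum is about 6.93 hash functions, so 7 in practice.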

Best practices

A common use of a Bloom filter is to cut down on disk or network I/O requests: once it reports that a value definitely does not exist, the expensive follow-up query can be skipped.

In addition, since the Bloom filter is there to speed up existence checks, a slow hash function is a poor choice. MurmurHash and FNV are recommended.

The following article is well written and can serve as a reference: https://www.jianshu.com/p/2104d11ee0a2

Origin blog.csdn.net/tonglin12138/article/details/93382025