Bloom Filter: Principle and Implementation (1)

Primer

The book "The Beauty of Mathematics" gives a classic description of the Bloom filter:

In everyday life, and when designing computer software, we often need to determine whether an element is in a collection. For example:

  • In a word processor, checking whether an English word is spelled correctly (that is, whether it is in a known dictionary);
  • At the FBI, checking whether a suspect's name is already on the list of suspects;
  • In a web crawler, checking whether a site has already been visited;
  • Spam filtering in Yahoo, Gmail and other mail services, and so on.

The scenarios above share one problem: how to check whether an item exists in a very large collection of data.

The usual approaches are the following:

  • Array
  • Linked list
  • Tree, balanced binary tree, Trie
  • map (red-black tree)
  • Hash table

Combined with a suitable search algorithm, these data structures handle a moderate amount of data well, but problems appear once the collection becomes very large. What if there are five million records, or even 100 million? This is where the weakness of conventional data structures shows. Arrays, linked lists, trees and the like store the elements themselves, so memory consumption grows linearly with the amount of data and eventually becomes the bottleneck.

A hash table can answer queries in O(1), but its memory consumption is still high. Consider storing 100 million spam email addresses in a hash table. A hash function first maps each email address to an 8-byte fingerprint. Because of hash collisions, the storage efficiency of a hash table is typically below 50%, so the memory needed is 8 * 2 * 100 million bytes = 1.6 GB. Storing billions of email addresses this way may therefore require hundreds of GB of memory, which an ordinary server cannot hold unless it is a supercomputer. This is where the Bloom filter (Bloom Filter) comes in.
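The hash table estimate above is simple arithmetic, and it can be checked in a few lines. The sketch below is only a back-of-the-envelope calculation: the function names `hashTableBytes` and `bitVectorBytes` are made up for illustration, and the load factor of 0.5 is an assumption taken from the "below 50%" storage efficiency. For comparison it also sizes a bit vector at 16 bits per address, which is the figure used for the Bloom filter in the next section.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes needed to keep `count` fingerprints of
// `fingerprintBytes` each in a table that is only `loadFactor` full.
std::uint64_t hashTableBytes(std::uint64_t count,
                             std::uint64_t fingerprintBytes,
                             double loadFactor) {
    assert(loadFactor > 0.0 && loadFactor <= 1.0);
    return static_cast<std::uint64_t>(count * fingerprintBytes / loadFactor);
}

// Hypothetical helper: bytes needed for a bit vector that spends
// `bitsPerItem` bits on every stored item.
std::uint64_t bitVectorBytes(std::uint64_t count, std::uint64_t bitsPerItem) {
    return count * bitsPerItem / 8;
}
```

Under these assumptions, `hashTableBytes(100000000, 8, 0.5)` gives 1,600,000,000 bytes (1.6 GB), while `bitVectorBytes(100000000, 16)` gives 200,000,000 bytes, one eighth of the hash table figure.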

Bloom filter

The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a very long binary vector together with a series of random mapping functions. We use the example above to illustrate how it works.

Suppose we want to store 100 million email addresses. We first create a vector of 1.6 billion binary digits (bits), that is, 200 million bytes, and clear all 1.6 billion bits to 0. For each email address X, eight different random number generators (F1, F2, ..., F8) produce eight fingerprints (f1, f2, ..., f8). A random number generator G then maps these eight fingerprints to eight natural numbers g1, g2, ..., g8 in the range 1 to 1.6 billion (in effect, these are eight hash functions). The bits at these eight positions are all set to 1. Once all 100 million email addresses have been processed this way, the Bloom filter for these addresses is built. (The original post shows a figure of this process here.)

Why do 100 million email addresses call for 1.6 billion bits and eight hash functions? A well-known blog post, https://blog.csdn.net/jiaomeng/article/details/1495500 , proves the relevant result: to keep the error rate low, it is best to leave half of the bit array empty. In other words, at most 800 million of the 1.6 billion bits should be set to 1, which keeps the false positive rate very low.
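The choice of 1.6 billion bits and eight hash functions can be sanity-checked with the standard approximation for a Bloom filter's false positive rate, p ≈ (1 - e^(-kn/m))^k, where n is the number of stored items, m the number of bits and k the number of hash functions. This is the usual textbook estimate, not something derived in this post, and the function names below are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Expected fraction of bits set to 1 after inserting n items
// with k hash functions into a vector of m bits.
double fillRatio(double n, double m, double k) {
    assert(m > 0.0);
    return 1.0 - std::exp(-k * n / m);
}

// Standard approximation of the false positive rate: a lookup is a
// false positive when all k probed bits happen to be set.
double falsePositiveRate(double n, double m, double k) {
    return std::pow(fillRatio(n, m, k), k);
}
```

With n = 100 million, m = 1.6 billion and k = 8, `fillRatio` comes out at about 0.39, so fewer than half of the bits are set, and `falsePositiveRate` is roughly 6 in 10,000 under this approximation.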

Now let's see how to use the Bloom filter to check whether a suspicious email address Y is in the blacklist. We use the same eight random number generators (F1, F2, ..., F8) to produce eight fingerprints s1, s2, ..., s8 for this address, and then map these fingerprints to eight bits of the Bloom filter, t1, t2, ..., t8. If Y is in the blacklist, the bits at t1, t2, ..., t8 must obviously all be 1. In this way, any email address that is in the blacklist is found without fail: the Bloom filter never misses a blacklisted address.

It does have one shortcoming, however. There is a very small chance that an address not in the blacklist is judged to be in it, because the eight bits corresponding to a good address may all happen to have been set to 1 by other addresses. Fortunately, this is unlikely. We call this probability the false positive rate. In the example above, the false positive rate is well below one in a thousand (the standard estimate (1 - e^(-kn/m))^k gives roughly 6 in 10,000 for n = 100 million addresses, m = 1.6 billion bits and k = 8 hash functions). The benefit of the Bloom filter is that it is fast and memory-efficient, at the cost of a certain false positive rate. A common remedy is to build a small whitelist that stores the email addresses that may be misjudged.
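The whitelist remedy amounts to one extra lookup in front of the filter. The block below is a toy sketch, not this post's implementation: the names `TinyBloom` and `isBlacklisted`, and the way bit positions are derived by salting `std::hash` with the index of the hash function, are all assumptions chosen to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Toy Bloom filter (hypothetical, for illustration only).
class TinyBloom {
public:
    TinyBloom(std::size_t bits, std::size_t hashes)
        : bits_(bits, false), hashes_(hashes) {
        assert(bits > 0 && hashes > 0);
    }

    void add(const std::string& key) {
        for (std::size_t i = 0; i < hashes_; ++i) bits_[position(key, i)] = true;
    }

    // false means "definitely absent"; true means "probably present".
    bool mightContain(const std::string& key) const {
        for (std::size_t i = 0; i < hashes_; ++i) {
            if (!bits_[position(key, i)]) return false;
        }
        return true;
    }

private:
    // Assumed scheme: salt the key with the hash index to get k positions.
    std::size_t position(const std::string& key, std::size_t i) const {
        return std::hash<std::string>{}(key + '#' + std::to_string(i)) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t hashes_;
};

// An address counts as blacklisted only if the filter reports it
// and it is not a known false positive recorded in the whitelist.
bool isBlacklisted(const TinyBloom& blacklist,
                   const std::unordered_set<std::string>& whitelist,
                   const std::string& address) {
    if (whitelist.count(address) > 0) return false;
    return blacklist.mightContain(address);
}
```

The whitelist is consulted first because it is exact: an address found there is known to be good, whatever the filter says about it.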

A simple implementation

#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

class Bitmap {
public:
    Bitmap(size_t size) : size_(size) {
        // Each element holds 32 bits; allocate one extra element so that
        // the array covers the whole interval [0, size).
        bitVec_.resize((size_ >> 5) + 1);
    }
    void bitmapSet(int val) {
        int index = val >> 5;   // equivalent to dividing by 32
        int offset = val % 32;
        bitVec_[index] |= (1u << offset);   // unsigned literal: bit 31 is safe to set
    }
    bool bitmapGet(int val) {
        int index = val >> 5;
        int offset = val % 32;
        return (bitVec_[index] & (1u << offset)) != 0;
    }
private:
    size_t size_;
    vector<unsigned int> bitVec_;
};

class BloomFilter {
private:
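    struct SimpleHash {
        SimpleHash() : cap_(1), seed_(0) {}
        SimpleHash(size_t cap, size_t seed)
        : cap_(cap), seed_(seed) {}
        // Polynomial string hash, reduced to a bit position in [0, cap_).
        size_t hash(const std::string& s) const {
            size_t result = 0;   // unsigned, so overflow wraps around
            for (unsigned char c : s) {
                result = result * seed_ + c;
            }
            // cap_ is not a power of two, so take the remainder
            // instead of masking with (cap_ - 1).
            return result % cap_;
        }
    private:
        size_t cap_;
        size_t seed_;
    };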

    enum { defaultSize = 100000000 * 16 };  // 1.6 billion bits

public:
    BloomFilter() {
        bitmap_ = new Bitmap(defaultSize);
        hashs_.reserve(seeds_.size());
        for (size_t i = 0; i < seeds_.size(); ++i) {
            SimpleHash* hash = new SimpleHash(defaultSize, seeds_[i]);
            hashs_.push_back(hash);
        }
    }
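    ~BloomFilter() {
        delete bitmap_;
        for (auto h : hashs_) {
            delete h;
        }
    }
    // The filter owns raw pointers, so forbid copying to avoid double deletes.
    BloomFilter(const BloomFilter&) = delete;
    BloomFilter& operator=(const BloomFilter&) = delete;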
    void add(const string& s) {
        for (auto h : hashs_) {
            bitmap_->bitmapSet(h->hash(s));
        }
    }
    bool contain(const std::string& s) {
        // Every hashed bit must be set; stop at the first bit that is 0.
        for (auto h : hashs_) {
            if (!bitmap_->bitmapGet(h->hash(s))) {
                return false;
            }
        }
        return true;
    }

private:
    std::vector<int> seeds_ = { 7, 11, 13, 31, 37, 61, 73, 97 };  // fixed seeds, not yet randomly generated
    std::vector<SimpleHash*> hashs_;
    Bitmap* bitmap_;
};

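void bloomFilterTest() {
    std::string email = "[email protected]";
    BloomFilter bf;
    bf.add(email);
    bool ret1 = bf.contain(email);        // true: the address was added
    bool ret2 = bf.contain("even.com");   // expected false, barring a false positive
    std::cout << std::boolalpha << ret1 << " " << ret2 << std::endl;
}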

int main() {
    bloomFilterTest();
    return 0;
}

 

 

 

 

 


Origin www.cnblogs.com/evenleee/p/12002882.html