Application of Bloom Filter

1. What is a Bloom filter? 

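  A fast, compact way to test set membership
  Bloom filter: a very space-efficient probabilistic data structure used to determine whether an element is in a set (similar to a HashSet, except that the answer is either "definitely not present" or "possibly present").
  Its core is a long binary vector (bit array) and a series of hash functions.

  The length of the bit array and the number of hash functions are derived from the expected number of elements and the acceptable false positive rate.

  Hash functions: SHA1, SHA256, MD5, and so on. Fast non-cryptographic hashes are also common in practice (Guava uses a Murmur3-based strategy).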
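To make the "bit array plus several hash functions" idea concrete, here is a minimal sketch using only the JDK. It is not how Guava implements its filter: the class name SimpleBloomFilter, the FNV-1a hash, and the second seed value are assumptions chosen for illustration, and the k positions are derived from two hashes (double hashing) rather than k independent functions.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Illustrative Bloom filter: a bit array plus k bit positions per item. */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Sets the k bit positions derived from the item. */
    void put(String item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(item, i));
        }
    }

    /** false means "definitely absent"; true means "possibly present". */
    boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: position_i = (h1 + i * h2) mod numBits.
    private int position(String item, int i) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);     // standard FNV-1a offset basis
        int h2 = fnv1a(data, 0x01000193) | 1; // arbitrary second seed (assumption), forced odd
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // 32-bit FNV-1a with a configurable starting value; not cryptographic.
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(10_000, 7);
        filter.put("spam@example.com");
        // An inserted item is always reported as possibly present.
        System.out.println(filter.mightContain("spam@example.com"));
    }
}
```

Because put() only ever sets bits, an inserted item can never be reported as absent; an item that was never inserted may still be reported as present if all of its positions happen to be set by other items.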

 

2. A classic application scenario

  A public email provider such as Yahoo, Hotmail, or Gmail always needs to filter out mail sent by spammers.
  Spammers keep registering new addresses, and there are at least 5 billion spam-sending addresses in the world.
  How can we quickly determine whether an email address is a spam address? Store them all and look each one up?

  If an average address takes 18 bytes, how much space do 5 billion addresses need?
  18 bytes x 5 billion = 90 billion bytes, roughly 90 GB
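For comparison, the sketch below estimates how large a Bloom filter for the same 5 billion addresses would be. It uses the standard sizing formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2; the 0.1% false positive rate is an assumption borrowed from the demo later in this article, and the class and method names are illustrative.

```java
/** Sizing arithmetic for a Bloom filter (standard formulas, illustrative names). */
class BloomSizing {

    /** Optimal number of bits: m = -n * ln(p) / (ln 2)^2. */
    static long optimalNumBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Optimal number of hash functions: k = (m / n) * ln 2, at least 1. */
    static int optimalNumHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 5_000_000_000L;  // spam addresses, from the text
        double p = 0.001;         // assumed target false positive rate
        long rawBytes = 18L * n;  // storing every address verbatim
        long m = optimalNumBits(n, p);
        int k = optimalNumHashes(n, m);
        System.out.printf("raw storage:  %.1f GB%n", rawBytes / 1e9);
        System.out.printf("bloom filter: %.1f GB, %d hash functions%n", m / 8 / 1e9, k);
    }
}
```

Under these assumptions the filter needs about 14.4 bits per address, which comes to roughly 9 GB with 10 hash functions, about a tenth of the raw storage.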

 

3. Strengths and Weaknesses

  Advantages:
    It stores only bits, not the elements themselves, which is an advantage where confidentiality requirements are very strict;
    High space efficiency;
    Insertion and query both take O(k) time, where k is the number of hash functions, regardless of how many elements are stored

  Disadvantages:
    There is a false positive rate, which increases as the number of stored elements increases;
    In general, elements cannot be deleted from the Bloom filter;
    The process of determining the length of the array and the number of hash functions is complicated;
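The first disadvantage can be quantified with the usual approximation p ≈ (1 - e^(-kn/m))^k, where n is the number of stored elements, m the number of bits, and k the number of hash functions. The sketch below evaluates it for a filter sized for 1 million elements at a 0.1% target; the specific m and k values are assumptions derived from the standard sizing formulas, not taken from any library.

```java
/** Estimates how the false positive rate grows as a fixed-size filter fills up. */
class FalsePositiveEstimate {

    /** Approximate false positive rate: (1 - e^(-k*n/m))^k. */
    static double expectedFpp(long n, long m, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long m = 14_377_588L; // bits, sized for 1,000,000 elements at p = 0.001 (assumption)
        int k = 10;           // hash functions for that sizing (assumption)
        long[] counts = {500_000L, 1_000_000L, 2_000_000L, 4_000_000L};
        for (long n : counts) {
            System.out.printf("n = %,d -> estimated fpp = %.6f%n", n, expectedFpp(n, m, k));
        }
    }
}
```

At the design capacity the estimate sits near the 0.1% target, and it climbs steeply once the filter holds more elements than it was sized for, which is why the expected element count should be estimated generously up front.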

 

4. Application scenarios 

  •  Google's distributed database Bigtable and Apache HBase use Bloom filters to detect rows or columns that do not exist, which reduces the number of disk I/O operations spent on lookups
  •  Document storage systems also use Bloom filters to detect previously stored data
  •  Google Chrome uses a Bloom filter to speed up Safe Browsing
  •  Spam address filtering
  •  Crawler URL address deduplication
  •  Solve the problem of cache penetration

 

5. Bloom filter in practice

  Use Google Guava to easily implement a Bloom filter
  Source code analysis: bitArray, numHashFunctions, funnel, Strategy, put(),
  Demo instance
    Scenario description: 1 million strings are put into a Bloom filter, and 10,000 test strings (mostly randomly generated) are checked to see whether they exist among the 1 million
    Purpose: understand the simple use of a Bloom filter;
    understand the influence of the false positive rate on the number of hash functions and the length of the bit array;
    use a Bloom filter to solve the problem of cache penetration

  

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import org.junit.Test;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

public class BloomFilterTest {

    private static final int insertions = 1000000; // 1 million

    @Test
    public void bfTest() {
        // Create a Bloom filter for strings, sized for 1 million expected insertions
        // with a target false positive rate of 0.1%
        BloomFilter<String> bf = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), insertions, 0.001);
        // A set holding the same strings, used as the ground truth
        Set<String> sets = new HashSet<>(insertions);
        // A list holding the same strings, used to pick known members by index
        List<String> lists = new ArrayList<>(insertions);

        // Put 1 million random, unique strings into all three containers
        for (int i = 0; i < insertions; i++) {
            String uuid = UUID.randomUUID().toString();
            bf.put(uuid);
            sets.add(uuid);
            lists.add(uuid);
        }

        int wrong = 0; // false positives: the filter says "maybe" but the string was never added
        int right = 0; // true positives: the filter says "maybe" and the string really exists
        for (int i = 0; i < 10000; i++) {
            // Every 100th test string is a known member; the rest are fresh random UUIDs
            String test = i % 100 == 0 ? lists.get(i / 100) : UUID.randomUUID().toString();
            if (bf.mightContain(test)) {
                if (sets.contains(test)) {
                    right++;
                } else {
                    wrong++;
                }
            }
        }

        // right is always 100, because a Bloom filter has no false negatives
        System.out.println("=================right=====================" + right);
        // wrong varies per run; with 9,900 absent strings at 0.1% it should be around 10
        System.out.println("=================wrong=====================" + wrong);
    }

}

 

6. Solve cache penetration

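import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import javax.annotation.PostConstruct; // jakarta.annotation.PostConstruct on newer stacks

private BloomFilter<String> bf;

@PostConstruct // initialization method, runs once after the bean is constructed
private void init() {
    // ucodes: the collection of all valid unique codes, loaded beforehand
    // Initialize the Bloom filter with 20% headroom over the current number of codes
    bf = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), (long) (ucodes.size() * 1.2));
    for (String str : ucodes) {
        bf.put(str);
    }
}

// Note: keep the Bloom filter data in a single service, separate from the business code,
// and use multithreading.

// Query path: a code the filter has never seen definitely does not exist,
// so return immediately without touching the cache or the database.
if (!bf.mightContain(usercode)) {
    return null;
}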

 

This time the Bloom filter is applied to a concrete scenario: optimizing an associated (join) query

Optimization background: querying an order requires associating it with the early-warning order data to determine whether the order has triggered an early warning. Because every order query also queries the early-warning table once, this is inefficient.

You can first store a copy of the alerted orders in the Bloom filter, and then consult it when querying an order.

Why this scenario fits: most orders are normal, so there is no need to query the early-warning table every time.

First check the Bloom filter for the order. If it is not there, the order is definitely normal and can be returned directly. If it is there, query the early-warning table to confirm. A small false positive rate is acceptable because it only costs an occasional extra query.
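The lookup flow described above can be sketched as follows. Everything here is hypothetical: AlertLookup, findAlert, and the order IDs are invented names, the early-warning table is modeled as an in-memory Map, and the filter is passed in as a Predicate so that a real Bloom filter's mightContain method could be plugged in.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Guards an expensive early-warning lookup with a membership filter. */
class AlertLookup {
    private final Predicate<String> mightBeAlerted; // stand-in for bloomFilter::mightContain
    private final Map<String, String> alertTable;   // stand-in for the early-warning table
    int tableQueries = 0;                           // counts how often the table is hit

    AlertLookup(Predicate<String> mightBeAlerted, Map<String, String> alertTable) {
        this.mightBeAlerted = mightBeAlerted;
        this.alertTable = alertTable;
    }

    /** Returns the alert reason, or null if the order is normal. */
    String findAlert(String orderId) {
        if (!mightBeAlerted.test(orderId)) {
            return null; // definitely not alerted: skip the table query
        }
        tableQueries++;
        // A null result here means the filter gave a false positive.
        return alertTable.get(orderId);
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("order-42", "late delivery");
        // For the demo, an exact key check plays the role of the filter.
        AlertLookup lookup = new AlertLookup(table::containsKey, table);
        System.out.println(lookup.findAlert("order-42"));
        System.out.println(lookup.findAlert("order-7"));
        System.out.println("table queries: " + lookup.tableQueries);
    }
}
```

The filter only decides whether the table query can be skipped; the table remains the source of truth, so a false positive costs one extra query and never produces a wrong answer.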

 

 

 

 

 

 

 

 

  
