1. What is a Bloom filter?
A fast, compact way to test set membership
Bloom filter: a very space-efficient probabilistic data structure used to determine whether an element is in a set (similar in purpose to a HashSet). It answers either "definitely not in the set" or "possibly in the set".
Its core is a long binary vector (bit array) and a series of hash functions.
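The length of the array and the number of hash functions are derived from the expected number of elements and the acceptable false positive rate.
Hash functions: SHA1, SHA256, MD5, and so on. In practice, faster non-cryptographic hashes such as MurmurHash are also common.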
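To make the "bit vector plus hash functions" idea concrete, here is a minimal sketch in Java. It is an illustration only, not how any particular library implements it: the class name MiniBloomFilter, the FNV-1a hash, and the double-hashing trick used to derive k positions from two hashes are all assumptions chosen to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** A minimal Bloom filter sketch: one bit vector plus k derived hash functions. */
public class MiniBloomFilter {
    private final BitSet bits;   // the binary vector
    private final int numBits;   // m: length of the vector
    private final int numHashes; // k: number of hash functions

    public MiniBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Sets the k bit positions derived from the element. */
    public void put(String element) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(element, i));
        }
    }

    /** Returns false if the element was definitely never added, true if it may have been. */
    public boolean mightContain(String element) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(element, i))) {
                return false;
            }
        }
        return true;
    }

    // Derives the i-th position as h1 + i * h2 (double hashing). This is an assumed
    // simplification that avoids implementing k independent hash functions.
    private int index(String element, int i) {
        byte[] data = element.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // different start value; forced odd so the step is never 0
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // 32-bit FNV-1a with a caller-supplied starting value.
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return hash;
    }

    public static void main(String[] args) {
        MiniBloomFilter filter = new MiniBloomFilter(1 << 16, 5);
        filter.put("spam@example.com");
        System.out.println(filter.mightContain("spam@example.com")); // an added element is always reported
        System.out.println(filter.mightContain("friend@example.com")); // usually false, occasionally a false positive
    }
}
```

Note that mightContain can never return false for an element that was added, because put set exactly the bits that mightContain checks.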
2. A classic application scenario
Public email providers such as Yahoo, Hotmail, and Gmail always need to filter out mail from spammers. Spammers keep registering new addresses, and there are at least 5 billion spam-sending addresses in the world.
How can a provider quickly decide whether an email address is a spam address? Store every address and look it up?
If an average address takes 18 bytes, how much space do 5 billion addresses need?
18 bytes x 5 billion = 90 billion bytes, about 90 GB
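The arithmetic can be checked with a few lines of Java. The 18 bytes and 5 billion addresses come from the text; the budget of 10 bits per address for the Bloom filter is an assumption for illustration (with a well-chosen number of hash functions it corresponds to a false positive rate of roughly 1%), and the class name is hypothetical.

```java
public class SpamStorageEstimate {
    static final long ADDRESSES = 5_000_000_000L; // from the text
    static final int BYTES_PER_ADDRESS = 18;      // from the text
    static final int BITS_PER_ADDRESS = 10;       // assumed Bloom filter budget, not from the source

    /** Bytes needed to store every address verbatim. */
    static long rawBytes() {
        return ADDRESSES * BYTES_PER_ADDRESS;
    }

    /** Bytes needed for a Bloom filter with the assumed bits-per-address budget. */
    static long filterBytes() {
        return ADDRESSES * BITS_PER_ADDRESS / 8;
    }

    public static void main(String[] args) {
        System.out.println("raw addresses (bytes): " + rawBytes());
        System.out.println("bloom filter (bytes):  " + filterBytes());
    }
}
```

Storing the addresses themselves takes about 90 GB, before any HashSet overhead, while the filter under this assumption takes about 6.25 GB.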
3. Strengths and Weaknesses
Advantages:
It does not store the elements themselves, which is an advantage where confidentiality requirements are very strict;
High space efficiency;
Insertion and query take O(k) time, where k is the number of hash functions, independent of how many elements are stored
Disadvantages:
There is a false positive rate, which increases as the number of stored elements increases;
In general, elements cannot be deleted from a Bloom filter;
Choosing the length of the array and the number of hash functions is complicated.
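The first disadvantage can be made concrete with the standard approximation for the false positive rate, p = (1 - e^(-k*n/m))^k, where m is the number of bits, k the number of hash functions, and n the number of stored elements. The sketch below keeps m and k fixed and lets n grow; the values of m and k are assumptions picked for illustration.

```java
public class FalsePositiveGrowth {
    /** Standard approximation p = (1 - e^(-k*n/m))^k. */
    static double falsePositiveRate(long m, int k, long n) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long m = 10_000_000L; // bits in the filter (assumed)
        int k = 7;            // number of hash functions (assumed)
        // Doubling the number of stored elements while the filter stays the same size.
        for (long n = 250_000; n <= 4_000_000; n *= 2) {
            System.out.println("n=" + n + " p=" + falsePositiveRate(m, k, n));
        }
    }
}
```

With these values the rate is under 1% at one million elements and climbs steeply once the filter holds more than it was sized for, which is why the expected number of elements has to be estimated up front.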
4. Application scenarios
- Google's distributed database Bigtable and Apache HBase use Bloom filters to detect rows or columns that do not exist, reducing the number of disk IOs spent on lookups
- Document storage systems use Bloom filters to detect data that has already been stored
- Google Chrome uses a Bloom filter to speed up Safe Browsing
- Spam address filtering
- Crawler URL address deduplication
- Solve the problem of cache penetration
5. Bloom filter in practice
Google Guava makes it easy to use a Bloom filter.
Source code analysis: bitArray, numHashFunction, funnel, Strategy, put(),
Demo
Scenario: put 1,000,000 strings into a Bloom filter, then test 10,000 strings, most of them randomly generated, to see whether they are among the 1,000,000.
Purpose: understand the basic use of a Bloom filter;
understand how the false positive rate influences the number of hash functions and the length of the bit array;
use a Bloom filter to solve the cache penetration problem.
import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import org.junit.Test; // JUnit 4 is assumed; the original only shows @Test

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

public class BloomFilterTest {
    private static final int insertions = 1000000; // 1,000,000 expected insertions

    @Test
    public void bfTest() {
        // Bloom filter for strings: 1,000,000 expected insertions (should be greater than 0),
        // target false positive rate 0.001
        BloomFilter<String> bf = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), insertions, 0.001);
        // Set holding the same strings, used as the ground truth
        Set<String> sets = new HashSet<>(insertions);
        // List holding the same strings, used to pick known members by index
        List<String> lists = new ArrayList<>(insertions);

        // Initialization: put 1,000,000 random, unique strings into all three containers
        for (int i = 0; i < insertions; i++) {
            String uuid = UUID.randomUUID().toString();
            bf.put(uuid);
            sets.add(uuid);
            lists.add(uuid);
        }

        int wrong = 0; // times the filter said "might contain" for a string that is not in the set
        int right = 0; // times the filter said "might contain" for a string that is in the set
        for (int i = 0; i < 10000; i++) {
            // Every 100th test string is a known member; the rest are fresh random strings
            String test = i % 100 == 0 ? lists.get(i / 100) : UUID.randomUUID().toString();
            if (bf.mightContain(test)) {
                if (sets.contains(test)) {
                    right++;
                } else {
                    wrong++;
                }
            }
        }
        System.out.println("=================right=====================" + right); // 100
        System.out.println("=================wrong=====================" + wrong);
    }
}
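The demo fixes the false positive rate at 0.001. To see how that choice drives the bit array length and the number of hash functions, the sketch below applies the standard textbook sizing formulas, m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2, to the same 1,000,000 insertions. The class and method names are hypothetical, and a library's internal rounding may differ slightly from this sketch.

```java
public class BloomSizing {
    /** Bits needed for n elements at false positive rate p: m = -n * ln(p) / (ln 2)^2. */
    static long optimalNumBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Hash functions for n elements in m bits: k = (m / n) * ln 2, at least 1. */
    static int optimalNumHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000L; // same number of insertions as the demo
        for (double p : new double[] {0.03, 0.01, 0.001, 0.0001}) {
            long m = optimalNumBits(n, p);
            int k = optimalNumHashes(n, m);
            System.out.println("p=" + p + " bits=" + m + " bytes=" + (m / 8) + " hashes=" + k);
        }
    }
}
```

Lowering the acceptable false positive rate costs both memory and hashing work: at p = 0.001 the filter needs about 14.4 million bits (roughly 1.8 MB) and 10 hash functions, compared with about 7.3 million bits and 5 hash functions at p = 0.03.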
6. Solving cache penetration
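// Assumed imports: Guava's BloomFilter, Funnels and Charsets, plus javax.annotation.PostConstruct.
private BloomFilter<String> bf;

@PostConstruct // initialization method, runs once after the bean is constructed
private void init() {
    // ucodes: the collection of all valid unique codes, loaded elsewhere (not shown in the original).
    // Initialize the Bloom filter with 20% headroom over the current number of codes.
    bf = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), (int) (ucodes.size() * 1.2));
    for (String str : ucodes) {
        bf.put(str);
    }
}

// Note: put the Bloom filter data into a single service, separate from the business code; use multithreading.

// In the query path: a code the filter has never seen definitely does not exist,
// so return early instead of going on to the cache and the database.
if (!bf.mightContain(usercode)) {
    return null;
}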
This time the Bloom filter is applied to a different scenario: optimizing an associated query.
Background: querying an order requires associating it with the early-warning order data, that is, determining whether the order has an early warning. Because every order query also queries the early-warning table once, this is inefficient.
A copy of the alerted orders can be stored in a Bloom filter first and then consulted when querying an order.
Why this scenario fits: most orders are normal, so there is no need to do the association every time.
First check the Bloom filter for the order. If it is not there, the order is normal and can be returned directly. If it is there, query the early-warning table. A false positive only costs one unnecessary query, so a certain error rate is acceptable.
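The lookup flow can be sketched as follows. This is an illustration under assumptions, not the original implementation: the class AlertLookup is hypothetical, the Predicate stands in for a Bloom filter's mightContain, and the Map stands in for the early-warning table.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

public class AlertLookup {
    private final Predicate<String> mightBeAlerted; // stands in for bf::mightContain
    private final Map<String, String> alertTable;   // stands in for the early-warning table
    int tableQueries = 0;                           // counts how often the table is actually hit

    AlertLookup(Predicate<String> mightBeAlerted, Map<String, String> alertTable) {
        this.mightBeAlerted = mightBeAlerted;
        this.alertTable = alertTable;
    }

    /** Returns the alert for an order, or empty if the order is normal. */
    Optional<String> findAlert(String orderId) {
        if (!mightBeAlerted.test(orderId)) {
            return Optional.empty(); // definitely not alerted: skip the table entirely
        }
        tableQueries++;
        // The filter may be wrong (false positive), so the table has the final say.
        return Optional.ofNullable(alertTable.get(orderId));
    }
}
```

Because most orders are normal, most calls return at the first check and never reach the table. A false positive falls through to the table, finds nothing, and is still reported as normal.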