A classic big-tech interview question: how do you quickly determine whether a URL is in a set of 2 billion URLs?

Problem

Problem description: A website keeps a blacklist of 2 billion URLs. How should this blacklist be stored? Given an input URL, how can you quickly determine whether it is in the blacklist? The check must also fit within a given memory budget (for example, 500 MB).

Analysis

Perhaps the first thing many people think of is a HashSet: it is backed by a HashMap, so lookups are O(1) on average. That meets the speed goal, but what about space? Suppose each URL string is reduced to an int hash value, and an int occupies 4 bytes. Storing only the hash values of the 2 billion URLs would theoretically need:

2 billion * 4 / 1024 / 1024 / 1024 ≈ 7.45 GB

That is far beyond the memory limit, and a real HashSet would need even more, since it also stores the URL strings themselves plus per-entry object overhead.

This is where the Bloom filter, the subject of this article, comes in.

What is a Bloom filter

The Bloom filter was proposed by Burton Howard Bloom in 1970. It consists of a very long bit vector and a series of hash (random mapping) functions, and it is used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general-purpose structures; its disadvantages are a certain false positive rate and the difficulty of deleting elements.

If that description sounds abstract, the principle is easiest to understand through the example above.

Suppose the hash algorithm maps each URL to an int, with negative values folded into the non-negative range. The maximum possible hash value is then:

Integer.MAX_VALUE=2147483647

This means that the hash of any URL falls between 0 and 2147483647.

You can then define a bit array of length 2147483647, with one bit for every possible hash value. Storing this bit array takes only about:

2147483647 / 8 / 1024 / 1024 ≈ 256 MB
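
The two estimates so far (about 7.45 GB for raw 4-byte hash values, about 256 MB for one bit per possible hash value) can be reproduced with a few lines of Java. This is only a back-of-the-envelope sketch; the class and method names are made up for illustration.

```java
// Back-of-the-envelope memory estimates; all names here are illustrative.
class MemoryEstimate {
    static final double MIB = 1024.0 * 1024.0;
    static final double GIB = MIB * 1024.0;

    // Memory needed to store one 4-byte int hash per URL, in GiB.
    static double hashValuesGiB(long urlCount) {
        return urlCount * 4L / GIB;
    }

    // Memory needed for an array with the given number of bits, in MiB.
    static double bitArrayMiB(long bitCount) {
        return bitCount / 8.0 / MIB;
    }

    public static void main(String[] args) {
        System.out.println("hash values (GiB): " + hashValuesGiB(2_000_000_000L));
        System.out.println("bit array (MiB):   " + bitArrayMiB(Integer.MAX_VALUE));
    }
}
```

Only the second figure fits inside the 500 MB budget from the problem statement.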

For example, if the hash of a URL (X) is 2, the bit at the second position of the bit array is set to 1, and the array becomes: 000…00000010.

In the same way, all 2 billion URLs are hashed into the bit array.

If the second bit of the array is 1, then URL (X) may exist in the set. Why only "may"? Because a different URL could have been mapped to the same position by a hash collision, which would be a false positive.

But if the second bit of the array is 0, then URL (X) definitely does not exist in the set.
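
The single-hash scheme described above can be sketched with java.util.BitSet. This is a minimal illustration, not a production design: the class name SingleHashBitmap is hypothetical, String.hashCode() stands in for "the hash algorithm", and the size is left configurable so the example does not have to allocate the full 256 MB.

```java
import java.util.BitSet;

// Single-hash bit array sketch; names and hash choice are assumptions.
class SingleHashBitmap {
    private final BitSet bits;
    private final int size;

    SingleHashBitmap(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Map the URL to one position. floorMod keeps the index non-negative
    // even when String.hashCode() returns a negative value.
    private int index(String url) {
        return Math.floorMod(url.hashCode(), size);
    }

    void add(String url) {
        bits.set(index(url));
    }

    // false: the URL was definitely never added.
    // true: the URL was added, or another URL collided with it.
    boolean mightContain(String url) {
        return bits.get(index(url));
    }
}
```

The strings "Aa" and "BB" have the same hashCode in Java, so after add("Aa") a call to mightContain("BB") also returns true. That is exactly the kind of false positive the text describes.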

Multiple hashes

To reduce the probability of false positives caused by hash collisions, URL (X) can be hashed N times with N different hash algorithms, producing N hash values that each map to a position in the bit array. If any of these positions is not 1, URL (X) definitely does not exist in the set; if all of them are 1, it probably does.
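
A minimal sketch of the multi-hash idea follows. The class name SimpleBloomFilter, the four seeds, and the seeded polynomial string hash are all assumptions chosen for brevity; a real implementation would use stronger hash functions and derive the number of hashes from the expected load.

```java
import java.util.BitSet;

// Multi-hash Bloom filter sketch; seeds and hash function are illustrative.
class SimpleBloomFilter {
    // Each seed defines one "different hash algorithm" (a polynomial hash).
    private static final int[] SEEDS = {31, 131, 1313, 13131};

    private final BitSet bits;
    private final int size;

    SimpleBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Seeded polynomial hash, reduced to a non-negative index in [0, size).
    private int index(String value, int seed) {
        int h = 0;
        for (int i = 0; i < value.length(); i++) {
            h = h * seed + value.charAt(i);
        }
        return Math.floorMod(h, size);
    }

    // Set one bit per hash function.
    void put(String value) {
        for (int seed : SEEDS) {
            bits.set(index(value, seed));
        }
    }

    // Any unset bit proves absence; all bits set means "possibly present".
    boolean mightContain(String value) {
        for (int seed : SEEDS) {
            if (!bits.get(index(value, seed))) {
                return false;
            }
        }
        return true;
    }
}
```

With a size of 1 << 20, putting "Aa" no longer makes mightContain("BB") return true: the two strings collide under seed 31 but land on different positions under the other seeds.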

Guava's BloomFilter

The Guava library provides a concrete implementation of the Bloom filter, BloomFilter, so developers no longer have to write their own.

Create BloomFilter

BloomFilter provides several overloaded static create methods to create instances:

public static <T> BloomFilter<T> create(Funnel<? super T> funnel, int expectedInsertions, double fpp);
public static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions, double fpp);
public static <T> BloomFilter<T> create(Funnel<? super T> funnel, int expectedInsertions);
public static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions);

All of these eventually delegate to:

static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy);

Parameter meaning:

  • funnel: specifies the type of data stored in the Bloom filter; the built-in funnels include IntegerFunnel, LongFunnel, and StringCharsetFunnel.
  • expectedInsertions: the number of elements expected to be inserted.
  • fpp: the desired false positive probability; the default is 0.03.

The size of the bit array in BloomFilter is determined by the expectedInsertions and fpp parameters. See the method:

static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
        p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}
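
To get a feel for the numbers, the formula above can be evaluated for the 2 billion URL case. The sketch below copies optimalNumOfBits and adds the standard companion formula for the number of hash functions, k = (m / n) * ln 2; the class name BloomSizing is hypothetical and the code is independent of Guava.

```java
// Standard Bloom filter sizing formulas; the class name is illustrative.
class BloomSizing {
    // m = -n * ln(p) / (ln 2)^2, the same formula as optimalNumOfBits above.
    static long optimalNumOfBits(long n, double p) {
        if (p == 0) {
            p = Double.MIN_VALUE;
        }
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // k = (m / n) * ln 2, rounded, with at least one hash function.
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 2_000_000_000L;
        long m = optimalNumOfBits(n, 0.03);
        int k = optimalNumOfHashFunctions(n, m);
        System.out.println("bits=" + m + ", MiB=" + (m / 8 / 1024 / 1024) + ", hashes=" + k);
    }
}
```

For n = 2 billion and p = 0.03 the formula asks for roughly 7.3 bits per element, so the memory budget and the acceptable false positive rate have to be traded off against each other.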

The actual bit array is maintained in the BitArray class. Elements are then added with the put method and looked up with the mightContain method.
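
A short usage sketch of put and mightContain for the URL blacklist follows. It assumes Guava is on the classpath; the wrapper class UrlBlacklist and its method names are hypothetical, while BloomFilter.create, Funnels.stringFunnel, put, and mightContain are the Guava calls discussed above.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// Thin wrapper around Guava's BloomFilter; the wrapper itself is illustrative.
class UrlBlacklist {
    private final BloomFilter<String> filter;

    UrlBlacklist(long expectedInsertions, double fpp) {
        // stringFunnel tells the filter how to turn a String into bytes to hash.
        this.filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), expectedInsertions, fpp);
    }

    // Add a URL to the blacklist.
    void block(String url) {
        filter.put(url);
    }

    // false: definitely not blacklisted. true: probably blacklisted.
    boolean mightBeBlocked(String url) {
        return filter.mightContain(url);
    }
}
```

A caller would create it with something like new UrlBlacklist(1_000_000, 0.01), call block for every blacklisted URL, and treat a true result from mightBeBlocked as "probably blacklisted" rather than as a certainty.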

Algorithm characteristics

1. Because a lookup only computes hashes and checks bits, time efficiency is very high. Space efficiency is also a major advantage.
2. False positives are possible, so it should only be used in scenarios that can tolerate them.
3. Because bits shared through hash collisions cannot be told apart, deletion is not well supported.

Usage scenarios

The main use of a Bloom filter is to quickly determine whether an element is in a set. Common usage scenarios include:

1. Blacklists: anti-spam, checking whether an email address appears in a spam list with billions of entries (likewise for spam text messages).
2. URL deduplication: web crawlers deduplicate URLs to avoid crawling the same address twice.
3. Spell checking of words.
4. Key verification in a key-value cache system (cache penetration): put every key that may exist into the Bloom filter, so that when attackers request keys that do not exist, the system returns quickly instead of letting the requests pass through the cache and bring down the DB.
5. ID verification: for example, an order system checks whether an order ID exists and returns immediately if it does not.
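
The cache penetration and ID verification scenarios share one pattern: ask the filter first, and only touch the cache or database when the answer is "possibly present". The sketch below shows that control flow; every name in it is hypothetical, and the filter is passed in as a plain Predicate (for example bloomFilter::mightContain) so the example depends only on the JDK.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Cache-penetration guard sketch; all names are hypothetical.
class GuardedLookup {
    private final Predicate<String> mightExist;        // e.g. bloomFilter::mightContain
    private final Function<String, String> database;   // slow lookup, null when absent
    private final Map<String, String> cache = new HashMap<>();

    GuardedLookup(Predicate<String> mightExist, Function<String, String> database) {
        this.mightExist = mightExist;
        this.database = database;
    }

    String get(String key) {
        // Definitely absent: return at once, touching neither cache nor DB.
        if (!mightExist.test(key)) {
            return null;
        }
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        // Possibly present: fall back to the database and cache any hit.
        String value = database.apply(key);
        if (value != null) {
            cache.put(key, value);
        }
        return value;
    }
}
```

A false positive from the filter only costs one unnecessary database lookup, which is why this scenario tolerates the filter's error rate.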

References

https://www.jianshu.com/p/4d31af4c08fb



Origin blog.csdn.net/universsky2015/article/details/105531347