Tencent Two Sides: There are 4 billion QQ numbers, and the memory is limited to 1G. How to deduplicate? I was dumbfounded by the question!

4 billion QQ numbers, limited to 1G memory, how to deduplicate?

4 billion unsigned ints, if stored directly in memory, need:

4*4000000000 /1024/1024/1024 = 14.9G, considering that there are some repeated words, the space of 1G is basically not enough.

To achieve this function, you can use a bitmap.

If you use a bitmap, a number only needs to occupy 1 bit, so 4 billion numbers are:

4000000000 * 1 /8 /1024/1024 = 476M

Compared with the previous 14.9G, it greatly saves a lot of space.

For example, if I want to put my QQ number "907607222" in the Bitmap, I need to find the position of 907607222, and then set it to 1.

In this way, after putting 4 billion numbers into the Bitmap, all positions of 1 indicate existence, and those not of 1 indicate non-existence. The same QQ number only needs to be set to 1 once. Then, in the end, all are The number of 1 can be traversed out.

What are BitMaps? What is the use?

Bitmap (BitMap), the basic idea is to use a bit to mark an element. Bit is the smallest unit in a computer, that is, what we often call 0 and 1 in a computer. This is represented by a bit.

The so-called bitmap is actually a bit array, that is, each position is a bit, and the value in it can be 0 or 1

Like the bitmap above, it can be used to represent 1, 4, 6:

If we don’t use a bitmap, if we want to record the three integers of 1, 4, and 6, we need to use three unsigned ints. It is known that each unsigned int occupies 4 bytes, so it is a byte, a 3*4 = 12word A section has 8 bits, so it is 12*8 = 96a bit.

Therefore, the biggest benefit of bitmaps is to save space.

Bitmaps have many uses, especially in scenarios such as deduplication and sorting. The famous Bloom filter is implemented based on bitmaps.

But the bitmap also has certain limitations, that is, it can only represent 0 and 1, and cannot store other numbers. So he is only suitable for this kind of scene that can express true or false.

Recommend an open source and free Spring Boot practical project:

https://github.com/javastacks/spring-boot-best-practice

What is a Bloom filter and what is its implementation principle?

A Bloom filter is a data structure used to quickly retrieve whether an element is likely to exist in a set (bit array).

Its basic principle is to use multiple hash functions to map an element into multiple bits, and then set these bits to 1. When querying for an element, if these bits are all set to 1, the element is considered likely to exist in the collection, otherwise it definitely does not exist

Therefore, the Bloom filter can accurately determine whether an element must not exist, but because of the existence of hash collisions, it cannot determine that an element must exist. It can only be judged that it may exist.

Therefore, there is a possibility of misjudgment in the Bloom filter, that is, when a non-existing Hero element passes through hash1, hash2, and hash3, it happens to conflict with the hash results of other values. Then it will be misjudged to exist, but in fact he does not exist.

To reduce the probability of this misjudgment, the main method is to reduce the probability of hash collision and introduce more hash algorithms.

The following is the working process of Bloom filter:

1. Initialize the Bloom filter

When initializing the Bloom filter, you need to specify the size of the set and the false positive rate. The Bloom filter contains a bit array and multiple hash functions, and each hash function generates an index value.

2. Add elements to the Bloom filter

To add an element to the Bloom filter, it is first necessary to pass the element through multiple hash functions to generate multiple index values, and then set the bits corresponding to these index values to 1. If these index values have already been set to 1, there is no need to set them again.

3. Whether the query element exists in the Bloom filter

To query whether an element exists in the Bloom filter, it is necessary to pass the element through multiple hash functions to generate multiple index values, and determine whether the bits corresponding to these index values are all set to 1. If these bits are all set to 1, the element is considered likely to be present in the set, otherwise it definitely does not.

The main advantage of the Bloom filter is that it can quickly determine whether an element belongs to a certain set, and can achieve high efficiency in space and time. However, it also has some disadvantages, such as:

Bloom filter has a certain misjudgment rate when judging whether an element exists. ,
Bloom filters remove elements more difficult, because removing an element requires setting its corresponding bits to 0, but these bits may be shared by other elements.

Application Scenario

The Bloom filter is widely used because of its high efficiency. The typical scenarios are as follows:

1. Web crawler: Crawler programs can use Bloom filters to filter out web pages that have already been crawled to avoid repeated crawling and waste of resources.

2. Cache system: The cache system can use Bloom filters to judge whether a query may exist in the cache, thereby reducing the number of query caches and improving query efficiency. Bloom filters are also often used to solve the problem of cache penetration.

3. Distributed system: In a distributed system, a Bloom filter can be used to determine whether an element exists in the distributed cache, avoiding queries on all nodes and reducing network load.

4. Spam filtering: Bloom filter can be used to judge whether an email address is in the spam list, so as to filter out spam.

5. Blacklist filtering: Bloom filter can be used to determine whether an IP address or mobile phone number is in the blacklist, thereby preventing malicious requests.

how to use

Bloom filters can be implemented using third-party libraries in Java. Common ones include Google Guava library, Apache Commons library, and Redis.

Such as Guava:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
public class BloomFilterExample {
    public static void main(String[] args) {
        // 创建布隆过滤器，预计插入100个元素，误判率为0.01
        BloomFilter<String> bloomFilter = BloomFilter.create(Funnels.stringFunnel(), 100, 0.01);
        // 插入元素
        bloomFilter.put("Lynn");
        bloomFilter.put("666");
        bloomFilter.put("八股文");
        // 判断元素是否存在
        System.out.println(bloomFilter.mightContain("Lynn")); // true
        System.out.println(bloomFilter.mightContain("张三"));  // false
    }
}

Apache Commons:

import org.apache.commons.lang3.StringUtils;
import org.apache.commons.collections4.BloomFilter;
import org.apache.commons.collections4.functors.HashFunctionIdentity;
public class BloomFilterExample {
    public static void main(String[] args) {
        // 创建布隆过滤器，预计插入100个元素，误判率为0.01
        BloomFilter<String> bloomFilter = new BloomFilter<>(HashFunctionIdentity.hashFunction(StringUtils::hashCode), 100, 0.01);
        // 插入元素
        bloomFilter.put("Lynn");
        bloomFilter.put("666");
        bloomFilter.put("八股文");
        // 判断元素是否存在
        System.out.println(bloomFilter.mightContain("Lynn")); // true
        System.out.println(bloomFilter.mightContain("张三"));  // false
    }
}

Redis can be used through the Bloom module, and Redisson can be used to:

Config config = new Config();
config.useSingleServer().setAddress("redis://127.0.0.1:6379");
RedissonClient redisson = Redisson.create(config);
RBloomFilter<String> bloomFilter = redisson.getBloomFilter("myfilter");
bloomFilter.tryInit(100, 0.01);
bloomFilter.add("Lynn");
bloomFilter.add("666");
bloomFilter.add("八股文");
System.out.println(bloomFilter.contains("Lynn"));
System.out.println(bloomFilter.contains("张三"));
redisson.shutdown();

First create a RedissonClient object, and then obtain an RBloomFilter object through this object, use the tryInit method to initialize the Bloom filter, specify the maximum number of elements that can be added as 100, and the false positive rate is 0.01.

Then, use the add method to add the elements "狗小HA", "666" and "八边文" to the Bloom filter, and use the contains method to check whether the element exists in the Bloom filter.

Or Jedis can also:

Jedis jedis = new Jedis("localhost");
jedis.bfCreate("myfilter", 100, 0.01);
jedis.bfAdd("myfilter", "Lynn");
jedis.bfAdd("myfilter", "666");
jedis.bfAdd("myfilter", "八股文");
System.out.println(jedis.bfExists("myfilter", "Lynn"));
System.out.println(jedis.bfExists("myfilter", "张三"));
jedis.close();

Copyright statement: This article is an original article of CSDN blogger "code.song", following the CC 4.0 BY-SA copyright agreement, please attach the original source link and this statement for reprinting. Original link: https://blog.csdn.net/songmulin/article/details/130814507

Recent hot article recommendation:

1. 1,000+ Java interview questions and answers (2022 latest version)

2. Brilliant! Java coroutines are coming. . .

3. Spring Boot 2.x tutorial, too comprehensive!

4. Don't fill the screen with explosions and explosions, try the decorator mode, this is the elegant way! !

5. The latest release of "Java Development Manual (Songshan Edition)", download quickly!

Feel good, don't forget to like + forward!