Comprehensive analysis of high-level core knowledge in Java: data structures (Bloom filter: principle, usage scenarios, implementation, and the Bloom filter in Redis)

Preface

I first ran into the Bloom filter in two scenarios: massive data processing and cache penetration. I consulted some materials to understand it, but many of them did not meet my needs, so I decided to write my own summary of the Bloom filter. I hope that through this article more people will understand the Bloom filter and actually use it!

We will introduce the Bloom filter from the following aspects:

  1. What is a Bloom filter?
  2. How the Bloom filter works.
  3. Bloom filter usage scenarios.
  4. Implementing a Bloom filter manually in Java.
  5. Using the Bloom filter that comes with Google's open source Guava.
  6. The Bloom filter in Redis.

1. What is a Bloom filter?

First, we need to understand the concept of bloom filters.

The Bloom filter was proposed by Burton Howard Bloom in 1970. We can think of it as a data structure composed of two parts: a binary vector (or bit array) and a series of random mapping functions (hash functions). Compared with commonly used data structures such as List, Map, and Set, it takes up less space and is more efficient, but the drawback is that its answers are probabilistic rather than exact. In theory, the more elements are added to the filter, the higher the probability of false positives. In addition, elements stored in a Bloom filter are not easy to delete.

Each element in the bit array occupies only 1 bit and can only be 0 or 1. A bit array of 1,000,000 elements therefore takes only 1,000,000 bits / 8 = 125,000 bytes = 125,000 / 1024 KB ≈ 122 KB.
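
The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and java.util.BitSet is used as one convenient bit array, not as the only option.

```java
import java.util.BitSet;

public class BitArrayMemory {
    // Bytes needed for a plain bit array of the given length (8 bits per byte, rounded up).
    static long bytesFor(long bitCount) {
        return (bitCount + 7) / 8;
    }

    public static void main(String[] args) {
        long bytes = bytesFor(1_000_000);
        System.out.println(bytes);            // 125000
        System.out.println(bytes / 1024);     // 122 (whole KB, rounded down)

        // BitSet allocates storage in 64-bit words; size() reports the bits actually reserved.
        BitSet bits = new BitSet(1_000_000);
        System.out.println(bits.size() / 8);  // 125000
    }
}
```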

Summary: Bloom proposed a data structure for checking whether an element is in a given large set. It is space-efficient and fast, but it has a certain false positive rate and elements are hard to delete. In theory, the more elements are added to the set, the higher the probability of false positives.

2. Introduction to the principle of Bloom filter

When an element is added to the Bloom filter, the following operations are performed:

  1. Apply each hash function of the Bloom filter to the element to get a hash value (several hash functions give several hash values).
  2. For each hash value obtained, set the bit at the corresponding index of the bit array to 1.

When we need to determine whether an element exists in the Bloom filter, the following operations are performed:

  1. Apply the same hash functions to the given element again.
  2. Check the bit at each resulting index of the bit array. If all of them are 1, the element is probably in the Bloom filter; if any of them is not 1, the element is definitely not in the Bloom filter.

A simple example:

When a string is to be added to the Bloom filter, it is first passed through multiple hash functions to produce different hash values, and the bits at the corresponding indexes of the bit array are set to 1 (when the bit array is initialized, all positions are 0). When the same string is stored a second time, the corresponding positions are already 1, so it is easy to tell that the value already exists (which makes deduplication very convenient).
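
The add and lookup steps can be traced on a toy scale. The following is a minimal sketch, assuming a 16-bit array and three seed-based hash functions; the seeds, the sample string, and the class name are made up for illustration and are far too small for real use.

```java
import java.util.BitSet;

public class TinyBloomDemo {
    static final int SIZE = 16;              // toy bit array length
    static final int[] SEEDS = {7, 11, 13};  // hypothetical seeds, one per hash function

    // Derive an index in [0, SIZE) from the element and one seed.
    static int position(String s, int seed) {
        return Math.floorMod(s.hashCode() * seed, SIZE);
    }

    // Adding: set the bit at every index produced by the hash functions.
    static void add(BitSet bits, String s) {
        for (int seed : SEEDS) {
            bits.set(position(s, seed));
        }
    }

    // Lookup: a single 0 bit proves absence; all 1 bits only mean "probably present".
    static boolean mightContain(BitSet bits, String s) {
        for (int seed : SEEDS) {
            if (!bits.get(position(s, seed))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(SIZE);
        System.out.println(mightContain(bits, "hello")); // false: every bit is still 0
        add(bits, "hello");
        System.out.println(bits);                        // the indexes that are now 1
        System.out.println(mightContain(bits, "hello")); // true
    }
}
```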

Different strings may hash to the same positions. To reduce such collisions, we can increase the size of the bit array or adjust the hash functions.
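
How much a larger bit array helps can be estimated with the standard textbook approximation p ≈ (1 - e^(-k*n/m))^k, where m is the number of bits, n the number of inserted elements, and k the number of hash functions. The sketch below assumes ideal, independent hash functions, so real filters will deviate somewhat; the class name and sample numbers are illustrative.

```java
public class BloomMath {
    // Approximate false positive probability for m bits, n inserted elements, k hash functions.
    static double falsePositiveRate(long m, long n, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        // 100,000 elements, 3 hash functions, 1,000,000 bits.
        System.out.println(falsePositiveRate(1_000_000, 100_000, 3));
        // Same load with the bit array doubled: the estimate drops noticeably.
        System.out.println(falsePositiveRate(2_000_000, 100_000, 3));
    }
}
```

With these sample numbers the estimate is a little under 2% for the smaller array and clearly lower for the doubled one.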

In summary: when the Bloom filter says that an element exists, there is a small probability that it is wrong. When the Bloom filter says that an element does not exist, the element is definitely not there.

3. Bloom filter usage scenarios

  1. Determining whether a given item exists: for example, checking whether a number is in a very large set of numbers (say, more than 500 million), preventing cache penetration (checking whether the requested data is valid so that requests for nonexistent keys do not bypass the cache and hit the database directly), spam filtering for mailboxes, blacklist functions, and so on.
  2. Deduplication: for example, when crawling URLs, skipping URLs that have already been crawled.
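
The cache penetration scenario can be sketched as follows. Everything here is a hypothetical stand-in: the filter is represented by a plain Predicate, and the cache and database are in-memory maps. In a real system the predicate would be backed by a Bloom filter preloaded with all valid keys.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

public class CachePenetrationGuard {
    private final Predicate<String> mightContain;      // stand-in for a Bloom filter lookup
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> database;        // hypothetical backing store
    int databaseHits = 0;                              // counts how often the database is queried

    CachePenetrationGuard(Predicate<String> mightContain, Map<String, String> database) {
        this.mightContain = mightContain;
        this.database = database;
    }

    Optional<String> get(String key) {
        // The filter says "definitely absent": answer without touching cache or database.
        if (!mightContain.test(key)) {
            return Optional.empty();
        }
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        // Possible hit (or a false positive): fall through to the database.
        databaseHits++;
        String value = database.get(key);
        if (value != null) {
            cache.put(key, value);
        }
        return Optional.ofNullable(value);
    }

    public static void main(String[] args) {
        Map<String, String> db = new HashMap<>();
        db.put("user:1", "Alice");
        // An exact key check plays the role of the filter in this demo.
        CachePenetrationGuard guard = new CachePenetrationGuard(db::containsKey, db);
        System.out.println(guard.get("user:999").isPresent()); // false
        System.out.println(guard.databaseHits);                // 0
        System.out.println(guard.get("user:1").get());         // Alice
        System.out.println(guard.databaseHits);                // 1
    }
}
```

Because a Bloom filter never reports a stored key as absent, rejecting early is safe; a false positive only costs one extra database lookup.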

4. Manually implementing a Bloom filter in Java

We have already talked about the principle of the Bloom filter. After knowing the principle of the Bloom filter, you can implement one manually.

To implement one manually, you need:

  1. A bit array of appropriate size to save the data
  2. Several different hash functions
  3. Implementation of the method of adding elements to the bit array (bloom filter)
  4. The implementation of the method to determine whether a given element exists in the bit array (bloom filter).

Here is an implementation that I think is pretty good (improved from existing code on the Internet; it works with objects of any type):

import java.util.BitSet;

public class MyBloomFilter {

    /**
     * Size of the bit array (2 << 24 = 33,554,432 bits, a power of two).
     */
    private static final int DEFAULT_SIZE = 2 << 24;
    /**
     * Seeds used to create 6 different hash functions.
     */
    private static final int[] SEEDS = new int[]{3, 13, 46, 71, 91, 134};
    /**
     * The bit array. Each element can only be 0 or 1.
     */
    private BitSet bits = new BitSet(DEFAULT_SIZE);
    /**
     * Array of objects that each wrap one hash function.
     */
    private SimpleHash[] func = new SimpleHash[SEEDS.length];

    /**
     * Initializes the hash function objects; each one uses a different seed.
     */
    public MyBloomFilter() {
        for (int i = 0; i < SEEDS.length; i++) {
            func[i] = new SimpleHash(DEFAULT_SIZE, SEEDS[i]);
        }
    }

    /**
     * Adds an element to the bit array.
     */
    public void add(Object value) {
        for (SimpleHash f : func) {
            bits.set(f.hash(value), true);
        }
    }

    /**
     * Checks whether the given element may exist in the bit array.
     */
    public boolean contains(Object value) {
        boolean ret = true;
        for (SimpleHash f : func) {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }

    /**
     * Static nested class used for the hash operation.
     */
    public static class SimpleHash {
        private int cap;
        private int seed;

        public SimpleHash(int cap, int seed) {
            this.cap = cap;
            this.seed = seed;
        }

        /**
         * Computes the hash value as an index in [0, cap).
         */
        public int hash(Object value) {
            if (value == null) {
                return 0;
            }
            int h = value.hashCode();
            // Spread the high bits, mix in the seed, then mask with (cap - 1).
            // Masking last keeps the index non-negative and inside the bit array;
            // it relies on cap being a power of two.
            return (cap - 1) & (seed * (h ^ (h >>> 16)));
        }
    }
}

Test with strings:

String value1 = "https://javaguide.cn/"; 
String value2 = "https://github.com/Snailclimb"; 
MyBloomFilter filter = new MyBloomFilter(); 
System.out.println(filter.contains(value1)); 
System.out.println(filter.contains(value2)); 
filter.add(value1); 
filter.add(value2); 
System.out.println(filter.contains(value1)); 
System.out.println(filter.contains(value2));

Output:

false 
false 
true 
true

Test with integers:

Integer value1 = 13423; 
Integer value2 = 22131; 
MyBloomFilter filter = new MyBloomFilter(); 
System.out.println(filter.contains(value1)); 
System.out.println(filter.contains(value2)); 
filter.add(value1); 
filter.add(value2); 
System.out.println(filter.contains(value1)); 
System.out.println(filter.contains(value2));

Output:

false 
false 
true 
true

5. Using the Bloom filter that comes with Google's open source Guava

I implemented one by hand mainly to understand how the Bloom filter works. The Bloom filter implementation in Guava is mature and widely used, so in real projects there is no need to implement one manually.

First, we need to introduce Guava's dependencies in the project:

<dependency> 
<groupId>com.google.guava</groupId> 
<artifactId>guava</artifactId> 
<version>28.0-jre</version> 
</dependency>

The actual use is as follows:

The example below creates a Bloom filter sized for up to 1500 integers with a tolerated false positive probability of 0.01 (1%):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// Create the Bloom filter object
BloomFilter<Integer> filter = BloomFilter.create(
        Funnels.integerFunnel(),
        1500,
        0.01);
// Check whether the given elements may exist
System.out.println(filter.mightContain(1));
System.out.println(filter.mightContain(2));
// Add the elements to the Bloom filter
filter.put(1);
filter.put(2);
System.out.println(filter.mightContain(1));
System.out.println(filter.mightContain(2));

In this example, when mightContain() returns true, the element is probably in the filter: with the sizing above, about 1% of lookups for elements that were never added are expected to return true. When it returns false, we can be 100% sure that the element is not in the filter.
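
To get a feel for what the two parameters 1500 and 0.01 imply, the textbook sizing formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2 can be evaluated directly. This is a sketch of the standard math only; it is an assumption, not something confirmed here, that a given library sizes its filter in exactly this way. The class and method names are illustrative.

```java
public class BloomSizing {
    // Bits needed for n expected elements at target false positive probability p.
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Number of hash functions that minimizes the false positive rate for m bits and n elements.
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1500;
        double p = 0.01;
        long m = optimalBits(n, p);
        System.out.println(m);                   // 14378 bits, under 2 KB
        System.out.println(optimalHashes(n, m)); // 7
        System.out.println((double) m / n);      // bits per element, a little under 10
    }
}
```

A smaller p increases the bits per element and the number of hash functions, which is why a stricter error rate costs more memory and more work per operation.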

The Bloom filter implementation provided by Guava is very good (if you want to know more about it, you can read its source code), but it has one major limitation: it can only be used on a single machine (in addition, capacity expansion is not easy). Most Internet services today are distributed, however. To solve this problem, we need the Bloom filter in Redis.

6. Bloom filter in Redis

1. Introduction

Since v4.0, Redis has supported modules (plug-ins), which allow external code to extend its functionality. The Bloom filter is available as one such module.

In addition, the official website recommends a RedisBloom as the Redis Bloom filter Module, address: https://github.com/RedisBloom/RedisBloom. Others include:

  • redis-lua-scaling-bloom-filter (lua script implementation): https://github.com/erikdubbelboer/redis-lua-scaling-bloom-filter
  • pyreBloom (Fast Redis Bloom filter in Python): https://github.com/seomoz/pyreBloom

RedisBloom provides client support for multiple languages, including Python, Java, JavaScript, and PHP.

2. Install using Docker

The specific operations are as follows:

~ docker run -p 6379:6379 --name redis-redisbloom redislabs/rebloom:latest
~ docker exec -it redis-redisbloom bash
root@21396d02c252:/data# redis-cli
127.0.0.1:6379>

3. List of commonly used commands

Note: key: the name of the Bloom filter, item: the added element.

  1. BF.ADD: Adds an element to the Bloom filter, creating the filter if it does not exist yet. Format: BF.ADD {key} {item}.
  2. BF.MADD: Adds one or more elements to the Bloom filter, creating the filter if it does not exist yet. It works like BF.ADD, except that it accepts multiple inputs and returns multiple values. Format: BF.MADD {key} {item} [item ...].
  3. BF.EXISTS: Determines whether an element exists in the Bloom filter. Format: BF.EXISTS {key} {item}.
  4. BF.MEXISTS: Determines whether one or more elements exist in the Bloom filter. Format: BF.MEXISTS {key} {item} [item ...].

In addition, the BF.RESERVE command needs to be introduced separately.

The format of this command is as follows:

BF.RESERVE {key} {error_rate} {capacity} [EXPANSION expansion]

The following briefly introduces the specific meaning of each parameter:

  1. key: the name of the Bloom filter
  2. error_rate: The expected probability of false alarms. This should be a decimal value between 0 and 1. For example, for the expected false alarm rate of 0.1% (1 in 1000), error_rate should be set to 0.001. The closer the number is to zero, the greater the memory consumption of each item, and the higher the CPU usage of each operation.
  3. capacity: The capacity of the filter. When the number of elements actually stored exceeds this value, performance will begin to decline. The actual degradation will depend on how far the limit is exceeded. As the number of filter elements increases exponentially, performance will decrease linearly.

Optional parameters:

  • expansion: If a new sub-filter is created, its size will be the size of the current filter multiplied by expansion. The default expansion value is 2, which means that each subsequent sub-filter will be twice the size of the previous one.
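
The growth rule described for expansion is simple geometric growth, and can be written out as below. This sketch only illustrates the arithmetic stated above; the initial capacity of 100 and the class name are made up, and it does not model how RedisBloom allocates memory internally.

```java
import java.util.Arrays;

public class ExpansionSketch {
    // Capacities of the first `count` sub-filters, each `expansion` times the previous one.
    static long[] subFilterCapacities(long initialCapacity, int expansion, int count) {
        long[] capacities = new long[count];
        long current = initialCapacity;
        for (int i = 0; i < count; i++) {
            capacities[i] = current;
            current *= expansion;
        }
        return capacities;
    }

    public static void main(String[] args) {
        // Hypothetical filter reserved with capacity 100 and the default expansion of 2.
        System.out.println(Arrays.toString(subFilterCapacities(100, 2, 4))); // [100, 200, 400, 800]
    }
}
```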

4. Actual use

127.0.0.1:6379> BF.ADD myFilter java
(integer) 1
127.0.0.1:6379> BF.ADD myFilter javaguide
(integer) 1
127.0.0.1:6379> BF.EXISTS myFilter java
(integer) 1
127.0.0.1:6379> BF.EXISTS myFilter javaguide
(integer) 1
127.0.0.1:6379> BF.EXISTS myFilter github
(integer) 0

Origin blog.csdn.net/Java_Caiyo/article/details/111473439