How do you determine whether an element exists in a data set of one hundred million items?

Bloom filter concept
The Bloom filter, proposed by Bloom in 1970, is an algorithm designed to check whether an element exists in a set.

You might ask: isn't checking whether an element is in a collection already a built-in feature of collections?

Indeed, with a small number of elements this is no problem. With a massive number of elements, such as millions or even billions, each of which may itself be large, both the space a collection takes up and its query efficiency become a concern.

The Bloom filter solves this problem neatly. It consists of a long binary vector and a series of hash functions. It does not store the actual content of an element; it only marks in the binary vector that the element exists, and the hash functions are used to locate the element's positions.

2. Usage scenarios
The core role of a Bloom filter is to determine whether an element exists, which makes it very useful in today's massive-data scenarios.

For example:

2.1 Preventing database penetration
Bigtable, HBase, Cassandra, and other big-data storage systems also use Bloom filters.

Disk I/O for queries is costly. If a large share of queries ask for data that does not exist, database performance suffers badly.

A Bloom filter can determine in advance that the data does not exist, avoiding unnecessary disk operations.

2.2 Preventing cache penetration
A query normally checks the cache first; if the data is not there, it queries the DB and places the result in the cache.

This is the normal flow and causes no problems.

But suppose there are malicious requests that keep querying data that does not exist, such as the details of user abc when abc does not exist.

Following the normal flow, every such request reads the DB, which puts heavy pressure on the database.

Here a Bloom filter helps: when user abc is requested, first check whether the user exists; if not, return immediately and avoid the database query.
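The flow described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a production design: `DATABASE`, `CACHE`, `get_user`, and `StandInFilter` are hypothetical names, and the stand-in class only mimics the contract of a Bloom filter (it never misses an added key, but may report an absent key as present). The key `ghost` simulates such a false positive, to show that the result stays correct and only costs one wasted DB read.

```python
class StandInFilter:
    """Stand-in for a Bloom filter: never misses an added key, but may
    wrongly report some absent keys as present (false positives)."""

    def __init__(self, added, false_positives=()):
        self._added = set(added)
        self._false_positives = set(false_positives)

    def might_contain(self, key):
        return key in self._added or key in self._false_positives


DATABASE = {"alice": {"name": "alice"}}   # hypothetical user table
CACHE = {}
stats = {"db_reads": 0}

# "ghost" simulates a key the filter wrongly reports as present.
user_filter = StandInFilter(DATABASE.keys(), false_positives={"ghost"})


def get_user(user_id):
    """Return the user's record, or None if the user does not exist."""
    if not user_filter.might_contain(user_id):
        return None                  # definitely absent: no cache or DB access
    if user_id in CACHE:
        return CACHE[user_id]
    stats["db_reads"] += 1           # cache miss: query the DB
    record = DATABASE.get(user_id)
    if record is not None:
        CACHE[user_id] = record
    return record


print(get_user("alice"))   # {'name': 'alice'} (read from the DB, then cached)
print(get_user("alice"))   # {'name': 'alice'} (served from the cache)
print(get_user("abc"))     # None (rejected by the filter, DB untouched)
print(get_user("ghost"))   # None (false positive: one wasted DB read)
print(stats["db_reads"])   # 2
```

Repeated requests for `abc` never reach the database, which is the point of the pre-check.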

2.3 Crawler URL deduplication
Avoid crawling the same URL more than once.

Anti-spam
Determine whether a message is spam by checking it against a list of billions of spam entries.

3. The principle
Let us understand the principle through an example.

Suppose we have a binary array of length 8 whose initial values are all 0 (0 means absent).

Now add the element 张三 (Zhang San): a hash function maps it to a position in the binary array, and the value at that position is set to 1:

hash1(张三) % 8 = 4

Now we need to determine whether 李四 (Li Si) exists: compute its position in the same way and read the value at that position.

If the value is 0, 李四 is not present.

This is the basic principle.
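The single-hash scheme above can be written out as a short sketch. The article does not say which function `hash1` is, so MD5 is used here purely as an illustrative assumption; the position it computes for 张三 will generally differ from the 4 in the example.

```python
import hashlib

SIZE = 8
bits = [0] * SIZE          # all 0: nothing has been added yet


def hash1(element):
    """Map an element to a position in the array (MD5 is an illustrative choice)."""
    digest = hashlib.md5(element.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % SIZE


def add(element):
    """Mark the element's position as occupied."""
    bits[hash1(element)] = 1


def might_exist(element):
    """True if the element's position holds 1, False if it holds 0."""
    return bits[hash1(element)] == 1


print(might_exist("张三"))   # False: the array is still all zeros
add("张三")
print(might_exist("张三"))   # True: its position now holds 1
print(bits)                  # exactly one entry is 1
```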

We all know that hash collisions are common, so locating an element with a single hash function is unreliable.

For example, suppose 张三 and 王五 (Wang Wu) both hash to position 4:

hash1(张三) % 8 = 4

hash1(王五) % 8 = 4

张三 has been added and 王五 has not, but because the value at [4] is 1, the check reports that 王五 exists. This is a misjudgment (a false positive).
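A collision of this kind is easy to reproduce. The sketch below again assumes an MD5-based `hash1` (not the article's actual function) and tries made-up names `user0`, `user1`, ... until one lands on the same position as 张三; with only 8 positions this happens almost immediately.

```python
import hashlib
import itertools

SIZE = 8


def hash1(element):
    """Illustrative hash: MD5 digest reduced modulo the array length."""
    digest = hashlib.md5(element.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % SIZE


bits = [0] * SIZE
bits[hash1("张三")] = 1            # only 张三 is ever added

# Try made-up names until one maps to the same position as 张三.
candidates = (f"user{i}" for i in itertools.count())
false_positive = next(c for c in candidates if hash1(c) == hash1("张三"))

# The name was never added, yet the single-hash check says it exists.
print(false_positive, "was never added, but its bit is", bits[hash1(false_positive)])
```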

To mitigate hash collisions, a Bloom filter usually positions each element with multiple hash functions.

Because each element is run through several different hash algorithms, the probability that two different elements produce the same results for all of them is very low.

If any of the computed positions holds 0, the element is definitely not present.

Conversely, if all of them hold 1, the element is still not certain to be present, because the bits may have been set by other elements through hash collisions.
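Putting the two rules together gives a small multi-hash filter. This is a minimal sketch, assuming the different hash functions are derived by salting SHA-256 with a seed; the class name `BloomFilter`, the array size of 64, and the choice of 3 hash functions are illustrative and not taken from the article.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; stores no element content."""

    def __init__(self, size=64, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, element):
        # Derive num_hashes different hash functions by salting one SHA-256.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{element}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, element):
        for pos in self._positions(element):
            self.bits[pos] = 1

    def might_contain(self, element):
        # Any 0 means definitely absent; all 1s means only "possibly present".
        return all(self.bits[pos] == 1 for pos in self._positions(element))


bf = BloomFilter()
bf.add("张三")
print(bf.might_contain("张三"))   # True: every position it maps to was set
print(bf.might_contain("王五"))   # almost always False; True would be a false positive
```

An added element is never reported as absent, because `add` sets exactly the positions that `might_contain` later checks.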

Bloom filter implementation in Redis
Redis 4.0 introduced the module system, which allows extension modules to be developed; RedisBloom is the Bloom filter extension module.

Practice
Start a Redis environment with RedisBloom:

docker run -p 6379:6379 --name redis-redisbloom redislabs/rebloom:latest

Enter the container and start the redis client:

# Enter the container

docker exec -it redis-redisbloom bash

# Log in to the redis client

redis-cli

Add an element:

127.0.0.1:6379> BF.ADD newFilter foo

(integer) 1

Check whether an element exists:

127.0.0.1:6379> BF.EXISTS newFilter foo

(integer) 1

127.0.0.1:6379> BF.EXISTS newFilter foo2

(integer) 0
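The same two commands can also be issued from application code. The sketch below assumes the third-party redis-py package and the container started above listening on 127.0.0.1:6379; the helpers `bf_add`, `bf_exists`, and `demo` are hypothetical names, and they only forward the `BF.ADD` and `BF.EXISTS` commands shown in the session through redis-py's generic `execute_command`.

```python
def bf_add(client, key, item):
    """Send BF.ADD; True means the item was newly added to the filter."""
    return bool(client.execute_command("BF.ADD", key, item))


def bf_exists(client, key, item):
    """Send BF.EXISTS; True means possibly present, False means definitely absent."""
    return bool(client.execute_command("BF.EXISTS", key, item))


def demo():
    """Repeat the redis-cli session above (requires a running RedisBloom server)."""
    import redis  # third-party package: pip install redis

    client = redis.Redis(host="127.0.0.1", port=6379)
    print(bf_add(client, "newFilter", "foo"))
    print(bf_exists(client, "newFilter", "foo"))
    print(bf_exists(client, "newFilter", "foo2"))
```

Calling `demo()` against the container should mirror the redis-cli results, with the integer replies 1 and 0 converted to `True` and `False`.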


Origin blog.51cto.com/14528283/2453674