Three ways to deduplicate: HashSet, Redis, and Bloom filter (BloomFilter)

Three ways to deduplicate


There are three common ways to implement deduplication. How do they differ?

HashSet

This approach relies on the fact that a Java HashSet never stores duplicate elements. The advantage is that it is easy to understand and easy to use.

Disadvantages: large memory footprint, lower performance.
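As a minimal sketch of the HashSet approach (the class name HashSetDedup and the example URLs are made up for illustration), deduplication can rely on Set.add returning false when the element is already present:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not part of any crawler framework.
class HashSetDedup {
    // Returns the URLs in first-seen order with duplicates dropped.
    static List<String> dedupe(List<String> urls) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String url : urls) {
            // Set.add returns false when the element is already in the set.
            if (seen.add(url)) {
                unique.add(url);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://a.example/1", "http://a.example/2", "http://a.example/1");
        System.out.println(dedupe(urls)); // [http://a.example/1, http://a.example/2]
    }
}
```

Every URL seen so far stays in memory as a full string, which is where the large memory footprint comes from.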

Redis deduplication

This approach uses a Redis set for deduplication. The advantage is speed (Redis itself is fast), and because deduplication does not consume the crawler server's resources, a larger volume of crawled data can be handled.

Disadvantages: a Redis server has to be set up, which increases development and operating costs.

Bloom filter (BloomFilter)

A Bloom filter can also be used for deduplication. The advantage is that it uses far less memory than a HashSet, and it is also suitable for deduplicating large amounts of data.

Disadvantages: false positives are possible. Data that is not a duplicate may be judged to be a duplicate, but data that is a duplicate will always be judged to be a duplicate.



Bloom filter (BloomFilter)

The Bloom filter was proposed by Burton Howard Bloom in 1970. It is a space-efficient probabilistic data structure used to determine whether an element is in a set. It is often used in blacklist and whitelist methods for spam filtering, in the URL deduplication module of web crawlers, and in similar places.

A hash table can also be used to determine whether an element is in a set, but a Bloom filter needs only 1/8 to 1/4 of the space of a hash table to solve the same problem.

Elements can be inserted into a Bloom filter, but existing elements cannot be deleted. The more elements it holds, the higher the false positive rate, but false negatives are not possible.

Principle:

A Bloom filter needs a bit array (similar to a bitmap) and k mapping functions (similar to the hash functions of a hash table). In the initial state, all bits of the bit array of length m are set to 0.

 

For a set of n elements S = {S1, S2, ..., Sn}, the k mapping functions {f1, f2, ..., fk} map each element Sj (1 <= j <= n) of S to k values {g1, g2, ..., gk}, and the corresponding bits array[g1], array[g2], ..., array[gk] of the bit array are then set to 1.

 

To check whether an item is in S, apply the mapping functions {f1, f2, ..., fk} to the item to obtain k values {g1, g2, ..., gk}, then check whether array[g1], array[g2], ..., array[gk] are all 1. If they are all 1, the item is considered to be in S; otherwise the item is not in S.
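The insert and lookup steps above can be sketched as follows. This is an illustration only, not a production implementation: the class name SimpleBloomFilter is made up, and deriving the k mapping functions from two base hashes (String.hashCode and FNV-1a) is an assumption of this sketch rather than something the description above prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical minimal Bloom filter for strings.
class SimpleBloomFilter {
    private final BitSet bits; // bit array of length m, initially all 0
    private final int m;       // number of bits
    private final int k;       // number of mapping functions

    SimpleBloomFilter(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // FNV-1a over the UTF-8 bytes, used as a second base hash (assumption).
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // The i-th mapping function f_i: combines the two base hashes
    // and reduces the result to an index g_i in [0, m).
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = fnv1a(item);
        return Math.floorMod(h1 + i * h2, m);
    }

    // Insert: set array[g1], ..., array[gk] to 1.
    void add(String item) {
        for (int i = 0; i < k; i++) {
            bits.set(index(item, i));
        }
    }

    // Lookup: false means definitely not in the set,
    // true means probably in the set (false positives are possible).
    boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would call mightContain(url) before fetching and add(url) afterwards. There is no remove method, because clearing a bit could also erase the trace of other elements that share it.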

Bloom filters can produce false positives, because the bits set by several other elements of the set may happen to cover g1, g2, ..., gk for an item that was never inserted. In that case a false positive results, but the probability is very small.
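How small the probability is can be estimated with the standard approximation below. It assumes that the k mapping functions spread elements uniformly and independently over the m bits, which real hash functions only approximate; m, n and k are the same symbols used above.

```latex
% Probability that a given bit is still 0 after inserting n elements:
\left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}

% False positive probability: all k probed bits happen to be 1
p \approx \left(1 - e^{-kn/m}\right)^{k}

% Worked example: m/n = 10 bits per element, k = 7
p \approx \left(1 - e^{-0.7}\right)^{7} \approx 0.5034^{7} \approx 0.0082
```

Under these assumptions, about 10 bits per element with 7 mapping functions gives a false positive rate of roughly 0.8%, and the rate rises as n grows for a fixed m.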

Origin blog.csdn.net/qq_39368007/article/details/105048889