Crawler data deduplication - Bloom filter

Crawler data deduplication:

  • Use MD5 to generate a fingerprint of each page and compare fingerprints to determine whether the page has changed
  • Store the data in MongoDB with a compound index on the keywords (suitable for small data volumes)
  • Hash the keywords into a fingerprint and check whether it is already in a Redis set; use the result to decide whether the request object is enqueued or filtered out (million-level data)
  • Bloom filter for large-scale deduplication (hundred-million-level data)
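
The fingerprint approach in the list above can be sketched as follows. This is a minimal illustration, not the author's code: the `SeenSet` class is a hypothetical in-memory stand-in for the Redis set, and the URLs are made-up examples.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return the MD5 hex digest of the text as a fixed-size fingerprint."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class SeenSet:
    """Hypothetical in-memory stand-in for a Redis set of fingerprints."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, key: str) -> bool:
        """Record the key; return True if it was new, False if it is a duplicate."""
        fp = fingerprint(key)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


seen = SeenSet()
urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",  # duplicate, should be filtered out
]
# Only requests whose fingerprint has not been seen are enqueued.
to_crawl = [u for u in urls if seen.add_if_new(u)]
print(to_crawl)
```

With a real Redis deployment the set would live on the server rather than in process memory, which is what lets several crawler processes share it.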

Bloom filter:

 Implementation:

  • First, from the expected error rate p and the expected number of samples n, calculate the required bit array length m (rounded up to an integer)

    m = -n * ln(p) / (ln2)**2

  • Then calculate the number of hash functions k (also rounded to an integer)

    k = ln2 * m / n

  • From m, n and k, calculate the true error rate. Rounding m up lowers the error rate, but rounding k to an integer moves it away from the optimum, so the true error rate is close to the expected error rate and not necessarily below it

    p = (1 - e ** (-n * k / m)) ** k
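
The three formulas above can be checked with a short calculation. This is a sketch under stated assumptions: the function name `bloom_parameters` and the sample inputs (one million items, 1% target error rate) are illustrative, and both m and k are rounded up, as the text describes.

```python
import math


def bloom_parameters(n: int, p: float):
    """Return (m, k, actual_p) for n expected items and target error rate p."""
    # Bit array length, rounded up to a whole number of bits.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    # Number of hash functions, rounded up (at least one).
    k = max(1, math.ceil(math.log(2) * m / n))
    # Error rate that results from the integer values of m and k.
    actual_p = (1 - math.exp(-n * k / m)) ** k
    return m, k, actual_p


m, k, actual_p = bloom_parameters(n=1_000_000, p=0.01)
print(f"m={m} bits, k={k} hash functions, actual p={actual_p:.6f}")
```

For these inputs the filter needs a little under 10 bits per item, and the resulting error rate lands very close to the 1% target.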

  • Pass the data (key) through k hash functions to obtain k hash values
  • Take each hash value modulo m to obtain the corresponding index in the bit array. When m is a power of two, the modulo can be computed with a bit mask

    index = HashCode(key) & (m-1)

  1. To store, set the bits at the k index positions to 1
  2. To query, check whether the bits at the k index positions are all 1. If they are all 1, the key probably exists (false positives are possible); if any bit is 0, the key has definitely not been stored
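
Putting the steps above together, a Bloom filter might look like the sketch below. Several choices here are assumptions rather than the author's implementation: the k hash functions are simulated by salting MD5 with the function number, the index uses `% m` because the computed m is generally not a power of two, and the class and method names are illustrative.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, n: int, p: float):
        # Size the filter from the expected item count n and error rate p.
        self.m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
        self.k = max(1, math.ceil(math.log(2) * self.m / n))
        self.bits = bytearray((self.m + 7) // 8)  # m bits, all initially 0

    def _indexes(self, key: str):
        # Assumption: salt MD5 with i to stand in for k independent hash functions.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, key: str) -> None:
        # Store: set the bit at every index position to 1.
        for idx in self._indexes(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, key: str) -> bool:
        # Query: the key may be present only if every bit is 1.
        return all(
            self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(key)
        )


bf = BloomFilter(n=1000, p=0.01)
for i in range(1000):
    bf.add(f"http://example.com/page/{i}")

print("http://example.com/page/1" in bf)  # added earlier, so always True
print("http://example.com/never-added" in bf)  # usually False; True would be a false positive
```

A stored key is never reported as missing, since its bits are only ever set and never cleared. The cost is that an unseen key can occasionally be reported as present.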

 Principle: hashmap

  • A hashmap maps a key to a value and returns the result in O(1) time complexity
  • The default hashmap length is 16 (as in Java's HashMap), and each expansion doubles it, so the length is always a power of two
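
The power-of-two length matters because it lets the slot index be computed with a bit mask instead of a modulo, which is the form used in index=HashCode(key)&(m-1). The sketch below illustrates the equivalence; `bucket_index` is a hypothetical helper, not part of any library.

```python
def bucket_index(hash_code: int, length: int) -> int:
    """Map a non-negative hash code to a slot; length must be a power of two."""
    if length <= 0 or (length & (length - 1)) != 0:
        raise ValueError("length must be a power of two")
    return hash_code & (length - 1)


length = 16
for h in (5, 21, 1023, 123456789):
    # With a power-of-two length, masking and modulo give the same slot.
    print(h, bucket_index(h, length), h % length)

# With a length that is not a power of two the shortcut breaks:
# 7 & (10 - 1) is 1, while 7 % 10 is 7.
print(7 & (10 - 1), 7 % 10)
```

This is why a bit array whose length is not a power of two needs a real modulo to compute the index.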

 Hash function properties:

  • Hash collision: different inputs can produce the same hash output, because the input domain is unlimited while the output domain is limited
  • Uniformity: the outputs are distributed evenly across the whole output domain
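
Both properties can be observed with a small experiment. This is an illustrative sketch: the `bucket` helper, the choice of MD5, and the key names are assumptions made for the demonstration.

```python
import hashlib
from collections import Counter


def bucket(key: str, buckets: int = 16) -> int:
    """Hash the key with MD5 and reduce it to one of `buckets` outputs."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % buckets


# 16000 distinct inputs but only 16 possible outputs: collisions are unavoidable.
counts = Counter(bucket(f"key-{i}") for i in range(16000))

# A well-behaved hash spreads the inputs roughly evenly, about 1000 per output.
for slot, count in sorted(counts.items()):
    print(slot, count)
```

Shrinking the output domain makes collisions more frequent, which is the trade-off a Bloom filter manages through its choice of m and k.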

 

Origin www.cnblogs.com/zwp-627/p/11299283.html