On the Bloom Filter

Let's start with a classic interview question:

Given two files A and B, each storing 5 billion URLs with each URL occupying 64 bytes, and a memory limit of 4 GB, find the URLs common to files A and B.

The essence of the problem is determining whether an element belongs to a set. A hash table answers membership queries in O(1) time, but at the price of space. At this scale, even a hash table with 100% space utilization needs at least 5 billion × 64 bytes ≈ 320 GB, so 4 GB is nowhere near enough.

We might next think of a bitmap: hash each URL and set the bit at the corresponding position. 4 GB is about 32 billion bits, which seems feasible. However, bitmaps suit de-duplicating sets whose values are dense and evenly distributed, because a bitmap's space grows linearly with the largest value in the set. A hash function with a very low collision rate needs a large output range; if hash values range up to 2^64, the bitmap would take about 2.3 billion GB. A 4 GB bitmap caps the maximum value at about 32 billion, and designing a hash function that maps 5 billion URLs into a range of only 32 billion with a very low collision rate is hard.

The usual way to solve this problem within 4 GB of memory is to cut the big files into small files and de-duplicate between corresponding pairs of small files from A and B.

Cut both A and B by hash(URL) % k, using the hash value to distribute the URLs into k files each: A1, A2, ..., Ak and B1, B2, ..., Bk (the hash values can be saved to avoid recomputation). Any URL common to A and B is guaranteed to land in a corresponding pair of small files An and Bn. Read An into a hash table, then traverse Bn and check each URL for membership.
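As a sketch of this split-and-intersect approach (the helper names are made up for illustration; Python's built-in `hash` for strings is salted per process, which is fine as long as both files are split in the same run):

```python
def split_by_hash(src_path, prefix, k):
    # Distribute URLs (one per line) into k partition files by hash(url) % k.
    outs = [open("%s_%d" % (prefix, i), "w") for i in range(k)]
    with open(src_path) as f:
        for line in f:
            url = line.strip()
            if url:
                outs[hash(url) % k].write(url + "\n")
    for out in outs:
        out.close()

def common_urls(a_part, b_part):
    # Load one partition of A into a hash set, then stream the matching partition of B.
    with open(a_part) as f:
        seen = {line.strip() for line in f if line.strip()}
    with open(b_part) as f:
        return {line.strip() for line in f if line.strip() in seen}
```

Because a common URL always hashes to the same partition index, intersecting the k pairs (A1, B1) ... (Ak, Bk) finds every common URL while keeping only one small partition in memory at a time.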

Another solution is to use a Bloom filter.

About Bloom Filters

The Bloom filter was proposed by Burton Howard Bloom in 1970. It can be used to test whether an element is a member of a set.

A Bloom filter is essentially an extension of the bitmap that uses several different hash functions. It consists of a long binary vector (a bitmap) and a series of random mapping functions.

First build an m-bit bitmap. To add an element, compute its k hash values with the k hash functions, map them to k positions in the bitmap, and set those k bits to 1. The figure below shows a Bloom filter with k = 3:

To query, we simply check whether those k bits are all 1: if any of them is 0, the element is definitely not in the set; if all of them are 1, the element is probably in the set.
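A toy version of this add/query procedure (SimpleBloom and its salted SHA-1 hashing are illustrative choices, not pybloom's actual scheme):

```python
import hashlib

class SimpleBloom:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)  # m-bit bitmap

    def _positions(self, item):
        # Derive k bit positions by salting one hash function k ways.
        for i in range(self.k):
            digest = hashlib.sha1(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All k bits set -> probably present; any bit clear -> definitely absent.
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))
```

Note that `__contains__` can only answer "definitely not" or "probably yes", which is exactly the trade-off discussed below.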

Bloom filters are clearly efficient in both time and space, but they have drawbacks:

  • False positives. A Bloom filter can be 100% certain that an element is not in the set, but it cannot be 100% certain that an element is in the set: when all k bits are 1, they may have been set by other elements.
  • Deletion is hard. Inserting an element sets k bits of the bitmap to 1; when removing it, we cannot simply reset them all to 0, because that could affect the membership tests of other elements.

Implementing a Bloom Filter

To implement a Bloom filter, we estimate the number of elements to store, n, and the desired error rate, P; from those we compute the bitmap size m and the number of hash functions k, and then choose the hash functions.

Formula for the bitmap size m:

    m = -n * ln(P) / (ln 2)^2

Formula for the number of hash functions k:

    k = (m / n) * ln 2 = log2(1 / P)
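These formulas can be sanity-checked numerically (bloom_params is a throwaway helper for this post, not part of pybloom):

```python
import math

def bloom_params(n, p):
    # m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2 (approximately log2(1 / p))
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k

m, k = bloom_params(100000, 0.001)
# roughly 14.4 bits per element and 10 hash functions for a 0.1% error rate
```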

A Bloom filter is already implemented in the Python package pybloom.

Installation

pip install pybloom

A quick look at the implementation:

class BloomFilter(object):
    FILE_FMT = b'<dQQQQ'

    def __init__(self, capacity, error_rate=0.001):
        """Implements a space-efficient probabilistic data structure
        capacity
            this BloomFilter must be able to store at least *capacity* elements
            while maintaining no more than *error_rate* chance of false
            positives
        error_rate
            the error_rate of the filter returning false positives. This
            determines the filters capacity. Inserting more than capacity
            elements greatly increases the chance of false positives.
        >>> b = BloomFilter(capacity=100000, error_rate=0.001)
        >>> b.add("test")
        False
        >>> "test" in b
        True
        """
        if not (0 < error_rate < 1):
            raise ValueError("Error_Rate must be between 0 and 1.")
        if not capacity > 0:
            raise ValueError("Capacity must be > 0")
        # given M = num_bits, k = num_slices, P = error_rate, n = capacity
        #       k = log2(1/P)
        # solving for m = bits_per_slice
        # n ~= M * ((ln(2) ** 2) / abs(ln(P)))
        # n ~= (k * m) * ((ln(2) ** 2) / abs(ln(P)))
        # m ~= n * abs(ln(P)) / (k * (ln(2) ** 2))
        num_slices = int(math.ceil(math.log(1.0 / error_rate, 2)))
        bits_per_slice = int(math.ceil(
            (capacity * abs(math.log(error_rate))) /
            (num_slices * (math.log(2) ** 2))))
        self._setup(error_rate, num_slices, bits_per_slice, capacity, 0)
        self.bitarray = bitarray.bitarray(self.num_bits, endian='little')
        self.bitarray.setall(False)

    def _setup(self, error_rate, num_slices, bits_per_slice, capacity, count):
        self.error_rate = error_rate
        self.num_slices = num_slices
        self.bits_per_slice = bits_per_slice
        self.capacity = capacity
        self.num_bits = num_slices * bits_per_slice
        self.count = count
        self.make_hashes = make_hashfuncs(self.num_slices, self.bits_per_slice)

    def __contains__(self, key):
        """Tests a key's membership in this bloom filter.
        >>> b = BloomFilter(capacity=100)
        >>> b.add("hello")
        False
        >>> "hello" in b
        True
        """
        bits_per_slice = self.bits_per_slice
        bitarray = self.bitarray
        hashes = self.make_hashes(key)
        offset = 0
        for k in hashes:
            if not bitarray[offset + k]:
                return False
            offset += bits_per_slice
        return True

These are essentially the same formulas as above.

The algorithm divides the bitmap into k segments (num_slices in the code, i.e. the number of hash functions k), each of length bits_per_slice; each hash function only sets bits within its own segment:

        for k in hashes:
            if not skip_check and found_all_bits and not bitarray[offset + k]:
                found_all_bits = False
            self.bitarray[offset + k] = True
            offset += bits_per_slice
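The slicing can be illustrated with a simplified stand-in for make_hashfuncs (pybloom's real version packs several position values out of each digest; this version just salts one hash per slice):

```python
import hashlib

def sliced_positions(key, num_slices, bits_per_slice):
    # One bit position per slice: hash i is confined to slice i's segment.
    positions = []
    for i in range(num_slices):
        h = int(hashlib.sha1(("%d|%s" % (i, key)).encode()).hexdigest(), 16)
        positions.append(i * bits_per_slice + h % bits_per_slice)
    return positions
```

Because each hash function owns a disjoint segment, the k probes for one key can never collide with each other, which simplifies the false-positive analysis.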

With a desired error rate of 0.001, the ratio of m to n is about 14:

>>> import math
>>> abs(math.log(0.001))/(math.log(2)**2)
14.37758756605116

With a desired error rate of 0.05, the ratio of m to n is about 6:

>>> import math
>>> abs(math.log(0.05))/(math.log(2)**2)
6.235224229572683

For the problem above, m is at most 32 billion and n is 5 billion, so the false positive rate is about 0.046, which is acceptable:

>>> math.e**-((320/50.0)*(math.log(2)**2))
0.04619428041606246

Applications

Bloom filters are generally used to test whether an element exists in a very large data set:

1.  Cache penetration:

Cache penetration refers to querying data that does not exist in the database. Normally a request first queries the cache; if the key is absent or expired, it queries the database and puts the result into the cache. If every request queries a key that does not exist in the database, the cache never has the data, so every request goes to the database, which can put serious pressure on it.

One solution is to have the cache layer store a null value for the missing key and return it directly. The drawback is that too many cached nulls take up extra space; this can be mitigated by giving the cached nulls a short expiration time.

Another solution is a Bloom filter: when a key is queried, the request first goes through the Bloom filter. If the filter says the key may exist, the database is queried; if it says the key does not exist, the request is discarded directly.
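A sketch of that query path (get_user and the dict-based cache/db interfaces are hypothetical; a plain set stands in for the Bloom filter here, since both support the `in` test):

```python
def get_user(key, bloom, cache, db):
    if key not in bloom:       # definitely not in the database: drop the request
        return None
    if key in cache:           # cache hit
        return cache[key]
    value = db.get(key)        # filter says "maybe": fall through to the database
    if value is not None:
        cache[key] = value     # populate the cache for next time
    return value
```

Requests for keys the filter has never seen are rejected before touching either the cache or the database.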

2.  Web crawlers:

In web crawlers, Bloom filters serve as the URL de-duplication strategy.

3.  Spam email address filtering

Because spammers can keep registering new addresses, the set of spam email addresses is huge. Storing billions of email addresses in a hash table could require hundreds of GB of memory, while a Bloom filter solves the problem with only 1/8 to 1/4 of the hash table's size. A Bloom filter never misses a suspicious address on the blacklist. As for false positives, a common remedy is to maintain a small whitelist storing the innocent email addresses that might be misjudged.

4.  Google's BigTable

Google's BigTable also uses Bloom filters, to reduce disk I/O for rows or columns that do not exist.

5. Summary Cache

Summary Cache is a protocol for sharing cache summaries among proxy servers. Bloom filters can be used to build a Summary Cache: each cached page is uniquely identified by its URL, so a proxy's cache contents can be represented as a list of URLs, and that list can in turn be represented by a Bloom filter.

Extensions

To support deleting elements, you can use a Counting Bloom Filter. It extends each bit of the standard Bloom filter's bitmap into a small counter (Counter); inserting an element increments its k counters by one, and deleting it decrements them by one:

The cost is several times more storage space.
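A sketch of the idea (CountingBloom and its salted SHA-1 hashing are illustrative; real implementations typically pack small fixed-width counters rather than using a Python list of ints):

```python
import hashlib

class CountingBloom:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m  # one small counter per bitmap position

    def _positions(self, item):
        # Derive k counter positions by salting one hash function k ways.
        return [int(hashlib.sha1(("%d:%s" % (i, item)).encode()).hexdigest(), 16) % self.m
                for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        if item in self:  # only decrement when all k counters are nonzero
            for pos in self._positions(item):
                self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))
```

Deleting one element only decrements counters, so positions shared with other elements stay nonzero and their membership tests are unaffected.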


Origin www.cnblogs.com/linxiyue/p/11295463.html