Integrating the Bloom Filter Algorithm to Optimize Scrapy Crawler Deduplication

First, a review of the Scrapy-Redis deduplication mechanism. Scrapy-Redis stores the fingerprint of each Request in a Redis set. Each fingerprint is 40 characters long; for example, 27adcc2e8979cdee0c9cecbbe8bf8ff51edefb61 is a fingerprint, and each character is a hexadecimal digit.

Let's calculate the storage space this consumes. Each hexadecimal digit occupies 4 bits, and one fingerprint consists of 40 hexadecimal digits, so a fingerprint occupies 20 bytes. 10,000 fingerprints then occupy about 200 KB, and 100 million fingerprints occupy about 2 GB. When crawls reach the scale of hundreds of millions of pages, the memory occupied by Redis becomes very large, and this is only the fingerprint storage: Redis also stores the crawl queue, which raises memory usage further, to say nothing of multiple Scrapy projects crawling at the same time. At the 100-million scale, the set-based deduplication provided by Scrapy-Redis can no longer meet our requirements, so we need a more memory-efficient deduplication algorithm: the Bloom Filter.
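
The arithmetic is easy to verify; here is a quick sketch (the figures ignore Redis's per-member overhead, so real usage is even higher):

# A quick check of the storage arithmetic above (pure Python, illustrative only)
FINGERPRINT_HEX_DIGITS = 40                          # one fingerprint = 40 hex digits
bytes_per_fp = FINGERPRINT_HEX_DIGITS * 4 // 8       # 4 bits per hex digit = 20 bytes
print(bytes_per_fp)                                  # 20
print(10_000 * bytes_per_fp / 1024)                  # ~195 KB, roughly 200 KB
print(100_000_000 * bytes_per_fp / 1024 ** 3)        # ~1.86 GB, roughly 2 GB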

1. Understanding Bloom Filter

The Bloom filter was proposed by Burton Howard Bloom in 1970. It can be used to detect whether an element is in a set, and its space efficiency is very high: using it can greatly reduce storage. A Bloom Filter uses a bit array to represent the set to be checked, and it can quickly determine, via a probabilistic algorithm, whether an element exists in that set. With this algorithm we can achieve deduplication.

In this section we will look at the basic algorithm of the Bloom Filter and how to hook it up to Scrapy-Redis.

2. Bloom Filter algorithm

A Bloom Filter uses a bit array to support its membership checks. In the initial state, we declare a bit array of m bits, all of which are set to 0.

Now suppose we have a set to be checked, S = {x1, x2, …, xn}, and we want to test whether some element x is already in S. In the Bloom Filter algorithm, we first use k independent, random hash functions to map each element x1, x2, …, xn of S into the bit array of length m: each hash result is taken as a position index, and the bit of the array at that index is set to 1. For example, take k = 3, i.e. three hash functions. If x1 maps through the three hash functions to positions 1, 4, and 8, and x2 maps to positions 4, 6, and 10, then the five bits 1, 4, 6, 8, and 10 of the bit array are set to 1.

If a new element x arrives and we want to determine whether x belongs to S, we apply the same k hash functions to x. If the bit array positions for all k results are 1, then x is judged to belong to S; if any position is not 1, then x definitely does not belong to S.

For example, if the new element x maps through the three hash functions to positions 4, 6, and 8, and all three bits are 1, then x is judged to belong to S. If the results are 4, 6, and 7, and the bit at position 7 is 0, then x does not belong to S.
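
To make the procedure concrete before the Redis-backed version later in this section, here is a minimal in-memory sketch; the seed values and the toy hash are illustrative choices, not part of the original algorithm description:

m = 16                      # length of the bit array
bits = [0] * m

def make_hash(seed):
    # Toy hash function: mixes each character with the seed, then maps into [0, m)
    def h(value):
        ret = 0
        for ch in value:
            ret = ret * seed + ord(ch)
        return ret % m
    return h

hashes = [make_hash(seed) for seed in (31, 37, 41)]   # k = 3 hash functions

def insert(value):
    for h in hashes:
        bits[h(value)] = 1          # set every mapped position to 1

def exists(value):
    return all(bits[h(value)] for h in hashes)   # all k positions must be 1

insert('x1')
insert('x2')
print(exists('x1'))   # True: every mapped bit is 1
print(exists('x3'))   # usually False, but can be a false positive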

Note that m, n, and k here should satisfy m > kn; that is, the length m of the bit array should exceed the product of the number of set elements n and the number of hash functions k.

This method of checking is very efficient, but it comes at a price: it may misjudge an element that does not belong to the set as belonging to it. Let's estimate the error rate. When all the elements of S = {x1, x2, …, xn} have been mapped into the m-bit array by the k hash functions, the probability that a given bit of the array is still 0 is:

(1 - 1/m)^(kn)

The hash functions are random, so the probability that any single hash invocation selects this particular bit is 1/m, and 1 - 1/m is the probability that one invocation does not select it. Mapping all of S into the m-bit array requires kn hash operations in total, so the final probability that the bit is never selected is (1 - 1/m) raised to the power kn.

If an element x that does not belong to S is misjudged as being in S, then all k of its hashed positions in the bit array must be 1. The probability of this misjudgment (a false positive) is therefore:

f = (1 - (1 - 1/m)^(kn))^k

Using the standard approximation for large m:

(1 - 1/m)^(kn) ≈ e^(-kn/m)

With this, the false positive probability can be rewritten as:

f ≈ (1 - e^(-kn/m))^k

Given m and n, the value of k that minimizes f can be derived:

k = (m/n) · ln 2, at which point f reaches its minimum (1/2)^k ≈ 0.6185^(m/n)

Tabulating f for different parameters (the first column being m/n, the second the optimal k, and subsequent columns the false positive rate for various integer k) shows two trends: for a fixed k, the false positive probability decreases as m/n grows; and for a fixed m/n, the closer k is to the optimal value, the smaller the probability. For example, at m/n = 8 with k = 6, f is about 0.0216. In general the false positive probability is extremely small, and under the condition of tolerating it, the large savings in storage space and the fast lookups are well worth it.
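
The table's figures can be recomputed directly from the formula above; a short sketch:

import math

def false_positive_rate(m_over_n, k):
    # f ≈ (1 - e^(-k / (m/n)))^k, from the approximation derived above
    return (1 - math.exp(-k / m_over_n)) ** k

for m_over_n in (4, 8, 16, 32):
    k_opt = m_over_n * math.log(2)        # optimal k = (m/n) * ln 2
    k = round(k_opt)
    print(m_over_n, round(k_opt, 2), false_positive_rate(m_over_n, k))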

Next, we apply the Bloom Filter algorithm to the deduplication process of a Scrapy-Redis distributed crawler, solving the problem of Redis running out of memory.

3. Integrating with Scrapy-Redis

When implementing the Bloom Filter, we must first make sure not to break the structure that Scrapy-Redis uses for distributed crawling. We will modify the Scrapy-Redis source code, replacing its deduplication class. The Bloom Filter implementation also needs a bit array to work with; since the current architecture already depends on Redis, it is natural to let Redis maintain the bit array as well.

First we implement a basic hash algorithm, which maps a value to a position in an m-bit array. The code is as follows:

class HashMap(object):
    def __init__(self, m, seed):
        self.m = m          # number of bits in the bit array
        self.seed = seed    # seed that distinguishes this hash function

    def hash(self, value):
        """
        Hash algorithm
        :param value: string to hash
        :return: hash value, an index in [0, m)
        """
        ret = 0
        for ch in value:
            # Fold each character's code point into the running sum, mixed with the seed
            ret += self.seed * ret + ord(ch)
        # m is a power of two, so (m - 1) & ret is equivalent to ret % m
        return (self.m - 1) & ret

Here we have created a HashMap class. Its constructor takes two values: m, the number of bits in the bit array, and seed, the hash seed. Different hash functions should use different seeds, which helps ensure their results do not collide.

In the hash() method, value is the content to be processed. We traverse each character of value, obtain its code point with ord(), and fold it into an iterative sum mixed with seed, finally obtaining a hash value determined only by value and seed. We then AND this value with m - 1 to obtain its position in the m-bit array; since m is a power of two, this is equivalent to taking the value modulo m. Thus we have a hash function determined by the string and the seed: when m is fixed, the same seed yields the same hash function, and the same value will always map to the same position. So to construct several different hash functions, we only need to vary the seed. That completes a simple hash function.
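
A quick illustration of these properties (the printed indices are whatever this particular hash happens to produce and carry no special meaning):

m = 1 << 5                  # a small 32-bit array for demonstration
h1 = HashMap(m, 5)
h2 = HashMap(m, 6)
print(h1.hash('Hello'))     # same seed and same value -> same index every time
print(h1.hash('Hello'))
print(h2.hash('Hello'))     # a different seed generally maps to a different index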

Next we implement the Bloom Filter itself. It needs k hash functions, all sharing the same m value but using different seed values. The structure is as follows:

BLOOMFILTER_HASH_NUMBER = 6
BLOOMFILTER_BIT = 30

class BloomFilter(object):
    def __init__(self, server, key, bit=BLOOMFILTER_BIT, hash_number=BLOOMFILTER_HASH_NUMBER):
        """
        Initialize BloomFilter
        :param server: Redis Server
        :param key: BloomFilter Key
        :param bit: m = 2 ^ bit
        :param hash_number: the number of hash function
        """
        # default to 1 << 30 = 1,073,741,824 = 2^30 bits = 128 MB; max capacity 2^30 / hash_number = 178,956,970 fingerprints
        self.m = 1 << bit
        self.seeds = range(hash_number)
        self.maps = [HashMap(self.m, seed) for seed in self.seeds]
        self.server = server
        self.key = key

Since we need to deduplicate on the order of hundreds of millions of items, the n in the algorithm above is over 100 million, and the number of hash functions k is around 10 at most (the code above defaults to 6). Given m > kn, m must be on the order of a billion. Because this value is large, it is implemented with a shift operation: we pass in the number of bits, defined as 30, and the shift 1 << 30 equals 2 to the 30th power, i.e. 1073741824, which is exactly the billion magnitude. Because this is a bit array, it occupies only 2^30 bits = 128 MB, whereas we calculated at the beginning that Scrapy-Redis set deduplication occupies about 2 GB at the same scale. The Bloom Filter's space efficiency is clearly far higher.
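
These size figures check out numerically:

m = 1 << 30
print(m)                    # 1073741824 bits, i.e. 2^30
print(m / 8 / 1024 ** 2)    # 128.0 MB when stored as a bit array
print(m // 6)               # 178956970, the capacity that m > kn allows with 6 hash functions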

Then we pass in the number of hash functions and use it to generate that many different seed values. Different seeds define different hash functions, so we can construct a list of hash functions: we traverse the seeds, construct a HashMap object for each, and save these objects in the maps variable for later use.

In addition, server is the Redis connection object, and key is the name under which this m-bit array is stored in Redis.

Next, we implement two key methods: exists(), which determines whether an element is already present, and insert(), which adds an element to the set. The implementation is as follows:

def exists(self, value):
    """
    Check whether value is (probably) in the set
    :param value:
    :return: truthy if every mapped bit is 1
    """
    if not value:
        return False
    exist = 1
    for hash_map in self.maps:
        offset = hash_map.hash(value)
        # AND together every mapped bit; a single 0 makes the result 0
        exist = exist & self.server.getbit(self.key, offset)
    return exist

def insert(self, value):
    """
    Add value to the Bloom Filter
    :param value:
    :return:
    """
    for hash_map in self.maps:
        offset = hash_map.hash(value)
        # Set every mapped position of the bit array to 1
        self.server.setbit(self.key, offset, 1)

Look at insert() first. The Bloom Filter algorithm runs each element placed in the set through every hash function to obtain its mapped positions in the m-bit array, then sets those positions to 1. In this code we traverse the initialized hash functions, call each one's hash() method to compute the mapped position offset, and use the Redis setbit() command to set that bit to 1.

In exists() we implement the duplicate check; the parameter value is the element to be tested. We first define a variable exist, then traverse all the hash functions, hash value with each one to obtain a mapped position, fetch that bit with getbit(), and AND the results together in the loop. Only if every getbit() call returns 1 does exist end up as 1 (truthy), meaning value belongs to the set. If even one getbit() returns 0, i.e. some corresponding bit of the m-bit array is 0, then exist ends up as 0 (falsy), meaning value does not belong to the set.
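
One practical note: as written, exists() and insert() issue one Redis command per hash function, so each check costs k network round trips. If that becomes a bottleneck, redis-py's pipeline can batch the commands into a single round trip; a hedged sketch of an alternative exists(), not part of the original code:

def exists(self, value):
    if not value:
        return False
    pipe = self.server.pipeline()
    for hash_map in self.maps:
        pipe.getbit(self.key, hash_map.hash(value))   # queue all k GETBIT commands
    return all(pipe.execute())                        # one round trip, then AND the bits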

That completes the Bloom Filter implementation. We can test it with an example, the code being as follows:

from redis import StrictRedis

conn = StrictRedis(host='localhost', port=6379, password='foobared')
bf = BloomFilter(conn, 'testbf', 5, 6)
bf.insert('Hello')
bf.insert('World')
result = bf.exists('Hello')
print(bool(result))
result = bf.exists('Python')
print(bool(result))

Here we first define a Redis connection object and pass it to the Bloom Filter. To avoid excessive memory usage, the bit count passed in is relatively small: 5, i.e. a bit array of 2^5 = 32 bits, with the number of hash functions set to 6.

We call insert() to add the two strings Hello and World, then check whether Hello and Python are present, and finally print the results, which are as follows:

True
False

Clearly, the results are correct. We have successfully implemented the Bloom Filter algorithm on top of Redis.

Next, we modify the Scrapy-Redis source code, replacing its dupefilter logic with Bloom Filter logic. The main change is to the request_seen() method of the RFPDupeFilter class, implemented as follows:

def request_seen(self, request):
    fp = self.request_fingerprint(request)
    if self.bf.exists(fp):
        return True
    self.bf.insert(fp)
    return False

We use request_fingerprint() to obtain the Request's fingerprint and call the Bloom Filter's exists() method to check whether the fingerprint already exists. If it does, the Request is a duplicate and we return True; otherwise we call insert() to add the fingerprint and return False. With this, the Bloom Filter has successfully replaced Scrapy-Redis's set-based deduplication.

For the initialization of the Bloom Filter, we can modify the __init__() method as follows:

def __init__(self, server, key, debug, bit, hash_number):
    self.server = server
    self.key = key
    self.debug = debug
    self.bit = bit
    self.hash_number = hash_number
    self.logdupes = True
    self.bf = BloomFilter(server, self.key, bit, hash_number)

The bit and hash_number parameters need to be passed in through the from_settings() method, modified as follows:

@classmethod
def from_settings(cls, settings):
    server = get_redis_from_settings(settings)
    key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
    debug = settings.getbool('DUPEFILTER_DEBUG', DUPEFILTER_DEBUG)
    bit = settings.getint('BLOOMFILTER_BIT', BLOOMFILTER_BIT)
    hash_number = settings.getint('BLOOMFILTER_HASH_NUMBER', BLOOMFILTER_HASH_NUMBER)    
    return cls(server, key=key, debug=debug, bit=bit, hash_number=hash_number)

The constants DUPEFILTER_DEBUG, BLOOMFILTER_BIT, and BLOOMFILTER_HASH_NUMBER are all defined uniformly in defaults.py; the Bloom Filter defaults are as follows:

BLOOMFILTER_HASH_NUMBER = 6
BLOOMFILTER_BIT = 30

With that, we have successfully integrated the Bloom Filter with Scrapy-Redis.

4. Code in this section

The code address of this section is: https://github.com/Python3WebSpider/ScrapyRedisBloomFilter.

5. Usage

For ease of use, the code in this section has been packaged as a Python package and published to PyPI at https://pypi.python.org/pypi/scrapy-redis-bloomfilter, so you can use ScrapyRedisBloomFilter directly without implementing it yourself.

We can install it directly with pip; the command is as follows:

pip3 install scrapy-redis-bloomfilter -i https://pypi.douban.com/simple

Its usage is basically the same as Scrapy-Redis; here are a few key configuration options:

# Deduplication class: replace DUPEFILTER_CLASS to use the Bloom Filter
DUPEFILTER_CLASS = "scrapy_redis_bloomfilter.dupefilter.RFPDupeFilter"
# Number of hash functions, default 6, can be modified
BLOOMFILTER_HASH_NUMBER = 6
# bit parameter of the Bloom Filter, default 30; occupies 128 MB and deduplicates at the 100-million level
BLOOMFILTER_BIT = 30
  • DUPEFILTER_CLASS is the deduplication class. To use the Bloom Filter, change DUPEFILTER_CLASS to the deduplication class provided by this package.
  • BLOOMFILTER_HASH_NUMBER is the number of hash functions the Bloom Filter uses. The default is 6, and it can be adjusted to the deduplication scale.
  • BLOOMFILTER_BIT is the bit parameter of the BloomFilter class introduced above; it determines the number of bits in the bit array. With BLOOMFILTER_BIT set to 30, the bit array has 2^30 bits, occupies 128 MB of Redis storage, and supports deduplication at roughly the 100-million level, i.e. a crawl of about 100 million pages. For crawls of 1 billion, 2 billion, or even 10 billion pages, be sure to increase this parameter accordingly, as the sketch below illustrates.
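
To choose bit for a target crawl size, recall the constraint m > kn from earlier. A small helper, purely illustrative and not part of the package:

import math

def required_bit(n, hash_number=6):
    # Smallest exponent such that m = 2**bit is strictly greater than hash_number * n
    return math.floor(math.log2(hash_number * n)) + 1

print(required_bit(100_000_000))      # 30 -> 2^30 bits = 128 MB, the package default
print(required_bit(1_000_000_000))    # 33 -> 2^33 bits = 1 GB of Redis memory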

6. Test

The source code includes a test project, placed in the tests folder, which uses ScrapyRedisBloomFilter for deduplication. The Spider is implemented as follows:

from scrapy import Request, Spider

class TestSpider(Spider):
    name = 'test'
    base_url = 'https://www.baidu.com/s?wd='

    def start_requests(self):
        for i in range(10):
            url = self.base_url + str(i)            
            yield Request(url, callback=self.parse)
        
        # Here contains 10 duplicated Requests    
        for i in range(100): 
            url = self.base_url + str(i)            
            yield Request(url, callback=self.parse)    

    def parse(self, response):
        self.logger.debug('Response of ' + response.url)

The start_requests() method first loops 10 times, constructing URLs with parameters 0-9, and then loops 100 more times, constructing URLs with parameters 0-99. The second batch therefore repeats 10 of the earlier Requests. We run the project to test:

scrapy crawl test

The final output is as follows:

{'bloomfilter/filtered': 10, 
'downloader/request_bytes': 34021, 
'downloader/request_count': 100, 
'downloader/request_method_count/GET': 100, 
'downloader/response_bytes': 72943, 
'downloader/response_count': 100, 
'downloader/response_status_count/200': 100, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 8, 11, 9, 34, 30, 419597), 
'log_count/DEBUG': 202, 
'log_count/INFO': 7, 
'memusage/max': 54153216, 
'memusage/startup': 54153216, 
'response_received_count': 100, 
'scheduler/dequeued/redis': 100, 
'scheduler/enqueued/redis': 100, 
'start_time': datetime.datetime(2017, 8, 11, 9, 34, 26, 495018)}

Note the first line of the final statistics:

'bloomfilter/filtered': 10,

This is the count recorded after Bloom Filter filtering. The number filtered is 10, meaning it successfully identified the 10 duplicate Requests; the test passes.

7. Conclusion

The above covers the principle of the Bloom Filter and its integration with Scrapy-Redis. Using a Bloom Filter can greatly reduce Redis memory usage; this solution is recommended when the volume of data is large.
