The code for this handling lives in dupefilter.py, which defines the logic for detecting duplicate URLs.

When Scrapy starts, if duplicate filtering is configured to persist to a file (requests.seen), that file is opened in append mode and the fingerprints saved by previous runs are loaded into an in-memory set. When a new URL arrives, its fingerprint is computed and looked up in that set of already-crawled requests. If it is not there, it is added (and, if file persistence is enabled, also written to the file); if it is already there, the caller is told that this URL has been crawled before.

For details, see the class RFPDupeFilter(BaseDupeFilter).
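To make that behaviour concrete, here is a minimal sketch of the same idea (not the actual Scrapy source; the class name SimpleDupeFilter and the fp argument, which stands for a precomputed request fingerprint, are invented for illustration):

# Minimal sketch of the dupefilter idea: an in-memory set of fingerprints,
# optionally backed by a requests.seen file that is reloaded on startup.
import os

class SimpleDupeFilter:
    def __init__(self, path=None):
        self.fingerprints = set()
        self.file = None
        if path:
            # Open in append mode and reload fingerprints from previous runs.
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(line.rstrip() for line in self.file)

    def request_seen(self, fp):
        # In Scrapy proper, fp would come from request_fingerprint(request).
        if fp in self.fingerprints:
            return True   # already crawled: tell the caller to drop it
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + '\n')
        return False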

So how does Scrapy use the methods of this class? When are they invoked, and what does the flow look like?

This can be traced back to the Scheduler class defined in scrapy.core.scheduler.

Now let's look at what the Scheduler class has to do with filtering duplicate URLs.

The Scheduler stores pending requests in a memory queue and a disk queue, so it has a method for enqueueing them. Before a request is enqueued it must be checked for duplication; if it is a duplicate, it is not enqueued:

if not request.dont_filter and self.df.request_seen(request):

Two conditions control this. The first is the request's dont_filter flag: if it is True the request is not filtered; if it is False it should be filtered.
The second is request_seen(), the built-in default filter method defined in RFPDupeFilter, which checks whether the request has already been seen.

Only when filtering is wanted and the request has already been seen is the URL dropped as a duplicate.

So the flow is clear: when the scheduler's enqueue_request() is called, it first checks the request's filtering switch. If filtering is wanted, it then checks whether the request has already been seen; if that condition holds, the method returns immediately, and only when it does not hold does the storage step, that is, actually entering the queue, take place.
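As a rough sketch of that flow (the helper names _dqpush and _mqpush are placeholders for the scheduler's disk-queue and memory-queue pushes; the real scrapy.core.scheduler.Scheduler also does stats bookkeeping and logging):

class SketchScheduler:
    def enqueue_request(self, request):
        # Drop the request only when filtering is wanted AND it was seen before.
        if not request.dont_filter and self.df.request_seen(request):
            return False
        # Otherwise store it: prefer the disk queue, fall back to the memory queue.
        if not self._dqpush(request):
            self._mqpush(request)
        return True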


Now let's look at how Scrapy decides that two URLs are duplicates.

The key function is request_fingerprint() (scrapy.utils.request.request_fingerprint()); it is the core implementation for judging whether a request is a repeat:

def request_fingerprint(request, include_headers=None):
    if include_headers:
        include_headers = tuple([h.lower() for h in sorted(include_headers)])
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()
        fp.update(request.method)
        fp.update(canonicalize_url(request.url))
        fp.update(request.body or '')
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]

By default the hash covers the request method, the canonicalized URL, and the request body; HTTP headers are optional.

Unlike a naive comparison, the fingerprint calculation does not simply check whether the URLs are identical; the result is a hexadecimal hash string.
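A quick way to see this in action (assuming a Scrapy version that still exposes scrapy.utils.request.request_fingerprint; newer releases have moved to fingerprinter classes):

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

# The same resource with the query parameters in a different order:
# canonicalize_url normalizes the query string, so the fingerprints match.
r1 = Request("http://example.com/page?a=1&b=2")
r2 = Request("http://example.com/page?b=2&a=1")
print(request_fingerprint(r1) == request_fingerprint(r2))   # True

# A POST to the same URL hashes differently, because the method and body
# are part of the fingerprint.
r3 = Request("http://example.com/page?a=1&b=2", method="POST", body="x=1")
print(request_fingerprint(r1) == request_fingerprint(r3))   # False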

A question naturally arises here. If computing the fingerprint is not simply comparing URLs, what exactly is the request object? By the time request_fingerprint() is called, what has the request been through, and has its URL already been downloaded or not? If it has already been downloaded, there would be a repeated-download problem and deduplication would be of little value; if it has not been downloaded, how are the method, headers, and body known?