Lecture 47: Significantly Speeding Up Crawling with Distributed Crawlers

In the previous lessons we learned how to use the Scrapy crawler framework. However, those crawlers all run on a single host, so crawling efficiency is relatively limited. If multiple hosts crawl collaboratively, efficiency rises in proportion to the number of hosts, and that is the advantage of distributed crawlers.

Next, let's look at the basic principles of distributed crawlers and how Scrapy can be used to implement them.

We have already implemented a basic Scrapy crawler. Although the crawler is asynchronous and issues concurrent requests, it still runs on a single host, so its crawling efficiency remains limited. A distributed crawler combines multiple hosts to complete one crawling task together, which greatly improves crawling efficiency.

1. Distributed crawler architecture

Before looking at the distributed crawler architecture, let's first review the architecture of Scrapy, as shown in the figure below.
(Figure: Scrapy architecture)
A stand-alone Scrapy crawler has a local crawling queue, implemented with the deque data structure from Python's collections module. When a new Request is generated, it is placed into the queue, and the Scheduler then schedules it. The scheduled Request is handed to the Downloader, which performs the crawl. This simple scheduling architecture is shown in the figure below.
(Figure: single-machine scheduling architecture)
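As a quick aside, here is a minimal sketch of the FIFO and LIFO behavior such an in-memory queue provides, using only the standard library's collections.deque; it is illustrative, not Scrapy's actual scheduler code:

from collections import deque

queue = deque()

# Enqueue three "requests" (plain strings stand in for Request objects)
queue.append('request-1')
queue.append('request-2')
queue.append('request-3')

print(queue.popleft())  # 'request-1': popping from the left gives FIFO order
print(queue.pop())      # 'request-3': popping from the right gives LIFO (stack) order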
What if two Schedulers fetch Requests from the same queue at the same time, each with its own Downloader? Assuming bandwidth is sufficient, crawling proceeds normally, and the pressure of accessing the queue is ignored, what happens to crawling efficiency? That's right, it doubles.

Following this idea, we can scale out to multiple Schedulers and multiple Downloaders, but there must always be exactly one crawling queue, the so-called shared crawling queue. Only then can we guarantee that once one Scheduler takes a Request from the queue, no other Scheduler will schedule the same Request again, while multiple Schedulers crawl in parallel. This is the basic prototype of a distributed crawler, and its simple scheduling architecture is shown in the figure below.
(Figure: distributed scheduling architecture with a shared queue)
What we need to do is run crawler tasks on multiple hosts simultaneously for collaborative crawling, and the precondition for collaborative crawling is a shared crawling queue. Each host then no longer maintains its own crawling queue, but fetches Requests from the shared queue. Each host still has its own Scheduler and Downloader, so scheduling and downloading are done locally. Ignoring the performance cost of accessing the queue, crawling efficiency scales up roughly in proportion to the number of hosts.

2. Maintain the crawl queue

So how do we maintain this queue? The first thing to consider is performance: which database offers high access efficiency? The natural choice is Redis, which stores data in memory and supports a variety of data structures, such as the List, Set, and Sorted Set, with very simple access operations. We therefore use Redis to maintain the crawling queue.

Each of these data structures has its own merits for our purpose; a brief analysis follows:

  • The List structure has lpush, lpop, rpush, and rpop operations, so we can use it to implement either a first-in first-out (FIFO) crawling queue or a last-in first-out (LIFO) stack-like crawling queue.

  • The elements of a Set are unordered and unique, so we can easily implement a randomly ordered, duplicate-free crawling queue.

  • A Sorted Set associates a score with each element, and Scrapy's Request has priority control, so we can use a Sorted Set to implement a priority-scheduled queue.

We should choose flexibly among these queue types according to the needs of the specific crawler; a small sketch of the Redis operations involved follows.
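Here is a minimal sketch, assuming a locally running Redis server and the redis-py client (version 3 or later for the zadd mapping syntax); the key names myspider:requests and myspider:priority_requests are made up for illustration and do not reflect Scrapy's own integration:

import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# FIFO queue based on a List: push on the right, pop from the left
r.rpush('myspider:requests', 'https://example.com/page/1')
r.rpush('myspider:requests', 'https://example.com/page/2')
print(r.lpop('myspider:requests'))  # b'https://example.com/page/1'

# LIFO (stack) behavior: push and pop from the same end
r.rpush('myspider:requests', 'https://example.com/page/3')
print(r.rpop('myspider:requests'))  # b'https://example.com/page/3'

# Priority queue based on a Sorted Set: here, lower score means higher priority
r.zadd('myspider:priority_requests', {'https://example.com/important': 0,
                                      'https://example.com/normal': 10})
first = r.zrange('myspider:priority_requests', 0, 0)[0]  # element with the lowest score
r.zrem('myspider:priority_requests', first)
print(first)  # b'https://example.com/important'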

3. How to deduplicate

Scrapy has built-in automatic deduplication, implemented with a Python set. This set records the fingerprint of each Request, which is essentially a hash of the Request. Let's look at the relevant Scrapy source code:

import hashlib
import weakref

# These helpers are part of Scrapy's own code base:
# to_bytes lives in scrapy.utils.python, canonicalize_url in w3lib.url.
from w3lib.url import canonicalize_url
from scrapy.utils.python import to_bytes

# Per-Request cache of computed fingerprints (as in scrapy/utils/request.py)
_fingerprint_cache = weakref.WeakKeyDictionary()

def request_fingerprint(request, include_headers=None):
    if include_headers:
        include_headers = tuple(to_bytes(h.lower())
                                for h in sorted(include_headers))
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url)))
        fp.update(request.body or b'')
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]

request_fingerprint is the method that computes a Request's fingerprint; internally it uses hashlib's sha1. The fields fed into the hash are the Request's Method, URL, Body, and (optionally) Headers; any difference in these fields produces a different result. The computed result is a hash string, and that string is the fingerprint. Every Request has a unique fingerprint. Since a fingerprint is a string, checking whether two strings are equal is much simpler than checking whether two Request objects are equal, so the fingerprint serves as the basis for deciding whether a Request is a duplicate.
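A minimal usage sketch, assuming a Scrapy version that still exposes request_fingerprint in scrapy.utils.request (newer releases replace it with a fingerprinter class):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# Two Requests for the same resource, with query parameters in a different order
r1 = Request('https://example.com/page?a=1&b=2')
r2 = Request('https://example.com/page?b=2&a=1')

print(request_fingerprint(r1))                             # a 40-character sha1 hex digest
print(request_fingerprint(r1) == request_fingerprint(r2))  # expected True: canonicalize_url
                                                           # normalizes the query string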

So how do we determine whether a Request is a duplicate? Scrapy implements it as follows:

class RFPDupeFilter:
    # Simplified excerpt of Scrapy's RFPDupeFilter (see scrapy/dupefilters.py);
    # file persistence and debug logging are omitted here.

    def __init__(self):
        self.fingerprints = set()

    def request_fingerprint(self, request):
        # Delegates to the module-level request_fingerprint shown above
        return request_fingerprint(request)

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
The deduplication class RFPDupeFilter has a request_seen method that takes a request parameter; its job is to detect whether that Request object has been seen before. It calls request_fingerprint to obtain the Request's fingerprint and checks whether the fingerprint exists in the fingerprints attribute, which is a set and therefore contains no duplicates. If the fingerprint is already present, the method returns True, indicating the Request is a duplicate; otherwise the fingerprint is added to the set. The next time the same Request arrives, its fingerprint is identical and already in the set, so the Request is immediately judged to be a duplicate. This is how deduplication is achieved.

In short, Scrapy's deduplication relies on the uniqueness of set elements to deduplicate Requests, as the short demo below illustrates.
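A minimal demo sketch, assuming a Scrapy version that exposes RFPDupeFilter in scrapy.dupefilters:

from scrapy import Request
from scrapy.dupefilters import RFPDupeFilter

df = RFPDupeFilter()
req = Request('https://example.com/page')

print(bool(df.request_seen(req)))  # False: first time, the fingerprint is recorded
print(bool(df.request_seen(req)))  # True: same fingerprint, judged a duplicate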

For a distributed crawler, we certainly cannot let each crawler use its own set for deduplication, because then every host would maintain a separate set that is not shared. If multiple hosts generate the same Request, each host can only deduplicate locally, and deduplication across hosts becomes impossible.

To deduplicate across multiple hosts, this fingerprint set must also be shared. Redis happens to provide a Set storage structure, so we can use a Redis set as the shared fingerprint set. After any host generates a new Request, it compares the Request's fingerprint against this set: if the fingerprint already exists, the Request is a duplicate; otherwise, the fingerprint is added to the set. With the same principle but a shared storage structure, we achieve distributed Request deduplication.
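A minimal sketch of this idea, assuming a running Redis server and the redis-py client; the key name myspider:dupefilter and the add_if_new helper are made up for illustration, and the fingerprint here hashes only the URL rather than reusing Scrapy's request_fingerprint:

import hashlib
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

def fingerprint(url):
    # Stand-in for Scrapy's request_fingerprint: hash only the URL here
    return hashlib.sha1(url.encode('utf-8')).hexdigest()

def add_if_new(url):
    # SADD returns 1 if the member was newly added, 0 if it was already present,
    # so the membership check and the insert happen in one atomic Redis operation
    return r.sadd('myspider:dupefilter', fingerprint(url)) == 1

print(add_if_new('https://example.com/page'))  # True: first host to see this Request
print(add_if_new('https://example.com/page'))  # False: duplicate, skip it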

4. Prevent interruption

In Scrapy, the runtime Request queue is kept in memory. When the crawler is interrupted, the queue's memory is released and the queue is destroyed, so running the crawler again starts a completely new crawl.

To resume crawling after an interruption, we can persist the Requests in the queue and load them on the next run to restore the previous crawl's queue. In Scrapy, we specify a storage path for the crawl state with the JOBDIR setting, using a command like the following:

scrapy crawl spider -s JOBDIR=crawls/spider

For more detailed usage, please refer to the official document, the link is: https://doc.scrapy.org/en/latest/topics/jobs.html.
With JOBDIR, Scrapy saves the crawling queue to local disk, and a second run can read it back and restore the queue. Do we still need to worry about this in a distributed architecture? No, because the crawling queue itself lives in the database. If a crawler is interrupted, the Requests in the database remain, and the next time it starts it simply continues from where it was interrupted.

Therefore, when the Redis queue is empty, the crawler starts a fresh crawl; when the Redis queue is not empty, the crawler continues from where it was interrupted last time.
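A tiny sketch of that check, again assuming redis-py and the illustrative key name myspider:requests for a List-based shared queue:

import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

pending = r.llen('myspider:requests')  # number of Requests left in the shared queue
if pending == 0:
    print('Queue is empty: start a fresh crawl from the start URLs')
else:
    print(f'{pending} requests left over: resume the interrupted crawl')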

5. Architecture implementation

Next we need to implement this architecture in code. First we need a shared crawling queue and a shared deduplication mechanism. In addition, we need to rewrite the Scheduler so that it fetches Requests from the shared crawling queue.

Fortunately, this logic and architecture have already been implemented and released as a Python package called Scrapy-Redis.
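As a preview, here is a minimal configuration sketch based on the Scrapy-Redis documentation (setting names as documented by the package; verify them against the version you install):

# settings.py (excerpt)

# Use Scrapy-Redis's scheduler, which stores the request queue in Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Use Scrapy-Redis's dupefilter, which stores request fingerprints in a Redis set
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the Redis queue and fingerprint set after the crawl finishes,
# so an interrupted crawl can be resumed
SCHEDULER_PERSIST = True

# Connection to the shared Redis server
REDIS_URL = "redis://localhost:6379"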

In the next section, we will look at the source code implementation of Scrapy-Redis and its detailed working principle.
