https://blog.csdn.net/u012150179/article/details/38091411

I. Analysis of how scrapy-redis implements distributed crawling

The so-called scrapy-redis is really just scrapy + redis, with the redis-py client used for the redis operations. The role redis plays here, and what scrapy-redis is for, are covered in the README.rst translation in my fork of the repository (link: https://github.com/younghz/scrapy-redis ).
In a previous article I analyzed, with the help of two related articles, the core of distributing a crawl with redis. It boils down to this: all urls (requests) obtained by all crawlers are pushed into one shared redis queue, and every crawler fetches its requests (urls) from that single queue.
scrapy-redis has not been updated for a long time; how to make it compatible with newer versions of scrapy is explained in another blog post (link: http://blog.csdn.net/u012150179/article/details/38087661 ). I may later rewrite scrapy-redis against the newer scrapy interface.

II. Distributed crawling implementation

1. Analysis of the built-in example in scrapy-redis

How to use the built-in example is explained in the library's README, but on first contact many questions come up when running its spiders. For example: where exactly is the distribution embodied, and by what means is it achieved? Second, it is hard to spot any trace of distribution in the running results; it feels as if the two spiders are each crawling their own things.
For the first question, I have already given an explanation in the settings.py annotations of my scrapy-redis translation. The second question is what the self-written example in section 2 below is designed to answer.

2. Clearly verifying that scrapy-redis achieves distribution: idea and implementation

(1) Idea

Implement two crawlers: crawler A is defined to crawl all links under the Business keyword of dmoz.org (set via start_urls), and crawler B crawls all links under Games. Run both at the same time and observe the urls each one crawls: does each stick to urls in its own scope, or do the two sets intersect? Since the crawl ranges the two define do not overlap, a conclusion can be drawn from the observed crawling behaviour.

(2) Implementation

The code is in the github repo ( https://github.com/younghz/scrapy-redis/ ). For easy observation, DEPTH_LIMIT is set to 1. A minimal sketch of the setup follows below.
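The essence of the setup is two spiders that differ only in start_urls, plus settings that point both at the same redis-backed scheduler. The sketch below is written under those assumptions; the class name, item fields, and category urls are illustrative, not the repo's actual code, while the settings option names are as documented by scrapy-redis.

    # Sketch of crawler A (crawler B is identical except for its name
    # and start_urls pointing at the Games category).
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class DmozBusinessSpider(CrawlSpider):
        name = 'dmoz_business'                                 # crawler A
        allowed_domains = ['dmoz.org']
        start_urls = ['http://www.dmoz.org/Business/']
        rules = [Rule(LinkExtractor(), callback='parse_page', follow=True)]

        def parse_page(self, response):
            yield {'spider': self.name, 'url': response.url}   # who crawled what

    # settings.py -- the part that makes both crawlers share one queue:
    SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
    DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
    DEPTH_LIMIT = 1                    # keep the experiment small
    REDIS_HOST = 'localhost'           # assumption: a local redis server
    REDIS_PORT = 6379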

(3) Phenomenon and Analysis

Phenomenon: the two crawlers can be observed crawling the links under a single keyword at the same time (which keyword comes first depends on the start_urls of whichever crawler was started first); only after that keyword is finished do they move on to the links under the other keyword.
Analysis: crawling a single keyword simultaneously shows that the two crawlers are being scheduled together, which is precisely the distribution of the crawl. The crawl also defaults to breadth-first order. It proceeds in the following steps:

i) First, crawler A is started (the same holds for B). The crawler engine takes the requests for the links in spider A's start_urls and delivers them to the scheduler; the engine then asks the scheduler for urls to crawl and hands them to the downloader for download; the downloaded responses are given back to the spider; the spider extracts links according to its defined rules and passes the new requests, via the engine, on to the scheduler (see the scrapy architecture articles in this series for this whole flow). The request (url) queue inside the scheduler is implemented as a redis queue: requests (urls) are pushed onto the queue and popped off when requests are made.
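Stripped of scrapy specifics, the push/pop just described looks like the sketch below, using redis-py directly. scrapy-redis actually serializes whole Request objects; the plain url strings and the key name here are illustrative only.

    import redis

    server = redis.StrictRedis(host='localhost', port=6379)

    def push_request(url):
        # spider output -> shared queue (every crawler pushes to the same key)
        server.lpush('dmoz:requests', url)

    def pop_request():
        # scheduler side: take the oldest entry and hand it to a downloader
        return server.rpop('dmoz:requests')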

 

ii) Then B is started. B's start_urls are likewise handed to the scheduler first (note: it is the same scheduler as A's, since both point at the same redis queue). When B's engine requests urls to crawl, the urls the scheduler dispatches to B for download are still the urls from A that have not yet been downloaded (the default scheduling serves earlier-returned urls first, i.e. breadth-first). In other words, A and B download A's unfinished links at the same time, and once those are done they download B's required links at the same time.

iii) Question: how is the scheduling behaviour in ii) above implemented?
By default scrapy-redis uses the SpiderPriorityQueue method, which is neither FIFO nor LIFO and is implemented on a redis sorted set.
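Concretely, "implemented on a sorted set" works roughly as in the sketch below, modeled on scrapy-redis's SpiderPriorityQueue (simplified: the real class serializes Request objects, and the key name is illustrative). Higher-priority requests get lower scores, so reading rank 0 always returns the most urgent entry.

    import redis

    server = redis.StrictRedis(host='localhost', port=6379)
    KEY = 'dmoz:requests'   # illustrative key name

    def push(data, priority=0):
        # negate the priority: higher priority -> lower score -> head of the set
        server.zadd(KEY, {data: -priority})

    def pop():
        # read the head and remove it atomically in one MULTI/EXEC transaction
        pipe = server.pipeline()
        pipe.zrange(KEY, 0, 0)
        pipe.zremrangebyrank(KEY, 0, 0)
        results, _count = pipe.execute()
        return results[0] if results else None

Note that when scores are equal, redis orders sorted-set members lexicographically by their bytes, which is why the observed order is neither strict FIFO nor LIFO.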

3. Detailed analysis and points to note

Every time a fresh crawl is started, the data left in redis from the previous run should be cleared, otherwise it will affect the crawling behaviour.
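One way to do the clearing, assuming the default scrapy-redis key convention of <spidername>:requests and <spidername>:dupefilter (the spider name here is illustrative):

    import redis

    server = redis.StrictRedis(host='localhost', port=6379)

    # remove the leftover queue, dupefilter and item keys from the last run
    server.delete('dmoz:requests', 'dmoz:dupefilter', 'dmoz:items')

    # or, if nothing else lives in this redis db, simply flush it:
    # server.flushdb()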

4. Other

The difference between request and url:
the former is built from the latter via the make_requests_from_url function, and this step is done by the spider. The spider then returns (return / yield) the request to the scrapy engine, which delivers it to the scheduler. The url itself is either defined in the spider or extracted by the spider.
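In code, the conversion looks like the sketch below (spider name and selector are illustrative); constructing a scrapy.Request directly is equivalent to the stock make_requests_from_url helper of that era.

    import scrapy

    class ExampleSpider(scrapy.Spider):        # hypothetical spider
        name = 'example'
        start_urls = ['http://www.dmoz.org/Business/']

        def parse(self, response):
            for href in response.css('a::attr(href)').getall():
                # url (a plain string) -> Request object, yielded to the
                # engine, which delivers it to the (redis-backed) scheduler
                yield scrapy.Request(response.urljoin(href), callback=self.parse)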
spider vs. crawler:

A spider is not the same thing as a crawler. The crawler contains the spider: the overall scrapy architecture is the crawler, while the spider's role is to provide the start_urls, parse the downloaded responses to obtain the desired content, extract further urls, and so on.
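The containment relationship is visible in scrapy's own API: a CrawlerProcess (the crawler side, wrapping engine, scheduler and downloader) is handed a Spider class to run. A minimal, self-contained sketch:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class TinySpider(scrapy.Spider):     # hypothetical minimal spider
        name = 'tiny'
        start_urls = ['http://www.dmoz.org/Business/']

        def parse(self, response):
            self.logger.info('crawled %s', response.url)

    process = CrawlerProcess(settings={'DEPTH_LIMIT': 1})
    process.crawl(TinySpider)   # the crawler is given a spider to run
    process.start()             # the spider only supplies urls and parsing logic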
