The principle of the scrapy-redis distributed crawler and how it differs from Scrapy

Scrapy is a general-purpose crawler framework, but it does not support distributed crawling on its own.
Scrapy-redis exists to make distributed crawling with Scrapy easier: it provides a set of Redis-based components (components only, not a full framework).

Scrapy's task scheduling is based on the local file system, so a crawl can only be executed on a single machine.

scrapy-redis puts the pending requests and the scraped items into Redis queues, so multiple servers can **run the crawl and the item processing at the same time**, greatly improving the efficiency of data crawling and processing.
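To make the shared-queue idea concrete, here is a minimal sketch (not part of scrapy-redis itself) using the redis-py client: any machine that can reach the same Redis server can push pending URLs and pop them for crawling. The host name and key names are placeholders.

```python
import redis

# Connect to the shared Redis server (host/port are placeholders for your own setup).
r = redis.Redis(host="redis.example.com", port=6379, db=0)

# Any machine can enqueue work...
r.lpush("demo:requests", "https://example.com/page/1")
r.lpush("demo:requests", "https://example.com/page/2")

# ...and any machine can dequeue it. BRPOP blocks until an item is available,
# so several crawler processes can safely share one queue.
_key, url = r.brpop("demo:requests")
print("this worker will crawl:", url.decode())
```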

The distributed setup adopts a master-slave structure: one Master server and multiple Slave servers. The Master manages the Redis database and distributes the download tasks; the Slaves run Scrapy crawlers to fetch web pages and parse and extract data. The parsed data is finally stored in the same MongoDB database.
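One common way for every Slave to write into the same MongoDB is an ordinary Scrapy item pipeline built on pymongo, so parsed data reaches the shared database directly from each machine. The sketch below only illustrates that idea; the connection URI, database and collection names are placeholders.

```python
import pymongo

class MongoPipeline:
    """Store parsed items in a shared MongoDB database (a sketch; names are placeholders)."""

    def open_spider(self, spider):
        # All Slave machines point at the same MongoDB instance.
        self.client = pymongo.MongoClient("mongodb://mongo.example.com:27017")
        self.collection = self.client["crawler"]["items"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item
```

Such a pipeline would simply be enabled in ITEM_PIPELINES in each Slave's settings.py.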

How it works
After you build a project with Scrapy, you cannot have two computers crawl it together.
Why not?
Because your Scrapy process runs in your computer's memory and its scheduler runs in your memory, while my scheduler runs in my memory.
The solution is to put the **scheduler** in a shared place, Redis, so that everyone uses the same Redis.

The resulting setup:
My Windows machine: is the server side (where Redis and the scheduler live) and also a client
Everyone else's Windows machines: are all clients

How is it implemented?
With components built on top of Scrapy: the scrapy-redis components (a configuration sketch follows this list)
(1) The scheduler is moved into Redis
(2) A pipeline is provided that saves items into Redis
(3) Two spider classes are rewritten: Spider (RedisSpider) and CrawlSpider (RedisCrawlSpider)
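Enabling these components is mostly a matter of configuration. The sketch below shows the typical scrapy-redis settings for the three pieces above; the Redis URL is a placeholder, and the exact option names should be checked against the version of scrapy-redis you install.

```python
# settings.py (sketch)

# Use the scrapy-redis scheduler and de-duplication filter instead of Scrapy's defaults.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the request queue and dupefilter set in Redis between runs.
SCHEDULER_PERSIST = True

# Pipeline that serializes scraped items into a Redis list.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Location of the shared Redis server (placeholder).
REDIS_URL = "redis://redis.example.com:6379/0"
```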

pip install scrapy-redis
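Once installed, a minimal slave-side spider can inherit from RedisSpider so that its start URLs come from a Redis key instead of a hard-coded list. The spider name, redis_key and parsing logic below are placeholders for illustration only.

```python
from scrapy_redis.spiders import RedisSpider


class DemoSpider(RedisSpider):
    """Reads start URLs from the Redis list 'demo:start_urls' (all names are placeholders)."""
    name = "demo"
    redis_key = "demo:start_urls"

    def parse(self, response):
        # Extract whatever fields the project needs; this just yields the page title.
        yield {"url": response.url, "title": response.css("title::text").get()}
```

Crawling starts once a URL is pushed to that key, e.g. `lpush demo:start_urls https://example.com` in redis-cli; every machine running this spider then pulls requests from the same Redis-backed scheduler.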
The scrapy-redis scheduler cleverly implements Duplication Filter de-duplication by using the uniqueness property of a Redis set (the DupeFilter set stores the fingerprints of requests already seen).
For each new request generated by a spider, the request's fingerprint is checked against the DupeFilter set in Redis; only non-duplicate requests are pushed into the Redis request queue.
The scheduler then pops a request from the Redis request queue according to priority and hands it to the spider for processing.
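The de-duplication idea can be illustrated with plain redis-py: hash the request into a fingerprint and SADD it into a set, which returns 1 only the first time. This is a simplified sketch; the real RFPDupeFilter uses Scrapy's request-fingerprint function, and the host and key names here are placeholders.

```python
import hashlib
import redis

r = redis.Redis(host="redis.example.com", port=6379, db=0)

def is_new_request(method: str, url: str) -> bool:
    # A simplified fingerprint: hash of method + URL.
    fp = hashlib.sha1(f"{method} {url}".encode()).hexdigest()
    # SADD returns 1 if the fingerprint was not in the set yet, 0 if it already was.
    return r.sadd("demo:dupefilter", fp) == 1

print(is_new_request("GET", "https://example.com/"))  # True the first time
print(is_new_request("GET", "https://example.com/"))  # False afterwards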
Items crawled by the spider are sent to the scrapy-redis item pipeline, which stores them in the Redis items queue. Items can then easily be popped from this queue, making it possible to run a whole cluster of item-processing workers.
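With RedisPipeline enabled, items end up as JSON strings in a Redis list (by default named "&lt;spider&gt;:items"), so a separate worker process on any machine can pop them and persist them, for example into the shared MongoDB. The sketch below assumes that default key name; hosts and names are placeholders.

```python
import json
import pymongo
import redis

r = redis.Redis(host="redis.example.com", port=6379, db=0)
collection = pymongo.MongoClient("mongodb://mongo.example.com:27017")["crawler"]["items"]

while True:
    # BLPOP blocks until an item is available; scrapy-redis stores items as JSON strings.
    _key, raw = r.blpop("demo:items")
    collection.insert_one(json.loads(raw))
```

Running several copies of this worker on different machines is what the text above calls an items-processing cluster.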
Summary
scrapy-redis cleverly uses Redis to implement the request queue and the items queue, uses a Redis set for request de-duplication, and extends Scrapy from a single machine to many machines, enabling a much larger crawler cluster.
Reference article
https://piaosanlang.gitbooks.io/spiders/07day/section7.3.html
