07-01 Distributed crawlers
One, introduction
Scrapy's native Scheduler maintains a local task queue (storing Request objects together with their callback information, etc.) plus a local dedup queue (storing the URLs already visited).
The key to distributed crawling is therefore to run a shared queue, such as Redis, on a dedicated host,
and then rewrite Scrapy's Scheduler so that the new Scheduler stores and retrieves Requests through the shared queue and filters out duplicate Requests. In summary, a distributed crawler comes down to three points:
# 1, a shared queue
# 2, a rewritten Scheduler, so that both dedup and task storage/retrieval go through the shared queue
# 3, custom dedup rules for the Scheduler (using Redis's set type)
scrapy-redis provides the core functionality of all three components.
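The dedup idea in point 3 can be sketched as follows. This is a minimal illustration, not scrapy-redis's actual implementation: a local Python set stands in for the shared Redis set (in Redis, `SADD` returns 0 when the member already exists, giving the same "seen before" signal to every crawler node), and the fingerprint here is a simplified stand-in for Scrapy's request fingerprint, which also canonicalizes the URL and hashes the body.

```python
import hashlib


def fingerprint(method: str, url: str) -> str:
    """Stable fingerprint for a request: sha1 of method + url.

    Simplified stand-in for Scrapy's request fingerprint.
    """
    return hashlib.sha1(f"{method} {url}".encode()).hexdigest()


class SetDupeFilter:
    """Dedup against a set of fingerprints.

    In a distributed setup this set would live in Redis (a SET shared
    by all crawler processes); here a local set shows the principle.
    """

    def __init__(self):
        self.seen = set()  # in Redis: one shared SET key for all nodes

    def request_seen(self, method: str, url: str) -> bool:
        fp = fingerprint(method, url)
        if fp in self.seen:
            return True  # duplicate: some node already scheduled it
        self.seen.add(fp)
        return False  # first time: schedule it
```

The first request for a URL passes; every later request with the same fingerprint is dropped, no matter which node produced it.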
# Installation:
pip3 install scrapy-redis
# Source:
D:\python3.6\Lib\site-packages\scrapy_redis
Two, scrapy-redis components
1, using only scrapy-redis's dedup function
2, using scrapy-redis's dedup + scheduler for distributed crawling
3, persistence
4, fetching the start URLs from Redis
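The four usage modes above are enabled through Scrapy settings. The fragment below is a sketch of a `settings.py` using scrapy-redis's documented setting names; the host and port values are placeholders for your own Redis instance.

```python
# settings.py -- scrapy-redis configuration sketch

# Dedup via a Redis set instead of Scrapy's local dupefilter (mode 1)
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Schedule requests through a shared Redis queue (mode 2)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Keep the queue and dedup set in Redis between runs, so crawls
# can be paused and resumed (mode 3: persistence)
SCHEDULER_PERSIST = True

# Where the shared queue lives (placeholder values)
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
```

For mode 4, the spider subclasses `scrapy_redis.spiders.RedisSpider` and sets a `redis_key` attribute naming the Redis list to pop start URLs from, instead of hard-coding `start_urls`.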