Distributed crawlers 07-01

 

07-01 Distributed crawlers

 

 

One, introduction

Scrapy's native Scheduler maintains a local task queue (which stores Request objects along with their callbacks, etc.) plus a local dedup queue (which stores the URLs already visited).

So the key to distributed crawling is to run a shared queue, such as Redis, on a dedicated host, and then rewrite Scrapy's Scheduler so that the new Scheduler fetches Requests from the shared queue and filters out duplicate Requests. In short, a distributed crawler comes down to three points:

# 1. A shared queue
# 2. A rewritten Scheduler that sends both dedup fingerprints and tasks to the shared queue
# 3. Custom dedup rules for the Scheduler (using Redis's set type)

scrapy-redis implements the core functionality of these three components.

# Installation:
pip3 install scrapy-redis

# Source:
D:\python3.6\Lib\site-packages\scrapy_redis
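The dedup point above can be sketched in plain Python. This is a simplified illustration, not Scrapy's exact fingerprint algorithm (which also canonicalizes the URL), and a Python set stands in for the Redis set that scrapy-redis would share across machines:

```python
import hashlib

def request_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Simplified stand-in for Scrapy's request fingerprint:
    a SHA1 hash over the method, URL, and body."""
    h = hashlib.sha1()
    h.update(method.encode())
    h.update(url.encode())
    h.update(body)
    return h.hexdigest()

# A plain set stands in for the Redis set scrapy-redis uses
# (via SADD, which returns 0 for an already-seen member).
seen = set()

def request_seen(method: str, url: str) -> bool:
    """Return True if this request was already scheduled, else record it."""
    fp = request_fingerprint(method, url)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```

Because every worker checks the same set before scheduling, a URL is crawled at most once across the whole cluster.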

Two, scrapy-redis components

1. Using only the scrapy-redis dedup feature

 

Source code analysis + using the shared dedup queue
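A minimal settings sketch for this mode: only the dedup filter is swapped for the Redis-backed one, while Scrapy's default scheduler and local queues stay in place. The Redis host and port values are assumptions for a local setup:

```python
# settings.py -- use only the scrapy-redis dedup filter;
# fingerprints go into a shared Redis set, everything else stays local
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Location of the shared Redis instance (assumed local defaults)
REDIS_HOST = "localhost"
REDIS_PORT = 6379
```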

2. Using scrapy-redis dedup + scheduler for distributed crawling

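For full distributed crawling, the scheduler is also replaced so that the request queue itself lives in Redis and all workers pull from it. A minimal settings sketch, with the Redis connection values again assumed:

```python
# settings.py -- scrapy-redis scheduler + dedup for distributed crawling
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Location of the shared Redis instance (assumed local defaults)
REDIS_HOST = "localhost"
REDIS_PORT = 6379
```

Every machine running the same spider with these settings shares one request queue and one fingerprint set, so they cooperate on a single crawl.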

3. Persistence

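By default the Redis keys are flushed when a crawl finishes. scrapy-redis exposes a setting to keep the queue and fingerprint set across runs, so an interrupted crawl can resume where it left off:

```python
# settings.py -- keep the Redis request queue and dedup set across runs,
# instead of flushing them when the spider closes
SCHEDULER_PERSIST = True
```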

4. Fetching start URLs from Redis

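scrapy-redis provides a `RedisSpider` base class whose start URLs come from a Redis list (named by the spider's `redis_key`, commonly something like `"myspider:start_urls"`) rather than a hard-coded `start_urls` attribute; an operator seeds the list with `lpush` and idle workers pop URLs from it. The queueing behavior can be sketched with a `deque` standing in for the Redis list (all names here are illustrative):

```python
from collections import deque

# A deque stands in for the Redis list that redis_key points at;
# lpush/rpop mimic the Redis commands of the same name.
redis_list = deque()

def lpush(url: str) -> None:
    """Seed a start URL, as `redis-cli lpush myspider:start_urls <url>` would."""
    redis_list.appendleft(url)

def rpop():
    """Pop the oldest URL, as an idle worker would; None when empty."""
    return redis_list.pop() if redis_list else None

# Operator seeds the queue...
lpush("http://example.com/page1")
lpush("http://example.com/page2")

# ...and each URL is handed to exactly one worker.
while (url := rpop()) is not None:
    print("crawling", url)
```

Because a pop removes the URL from the list, two workers can never receive the same start URL.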

 

 


 


Origin www.cnblogs.com/cherish937426/p/11955312.html