scrapy distributed principle

Scrapy single architecture

In this machine maintains a crawling queue, Scheduler performs scheduling.
Here Insert Picture Description
What are the key multiple hosts collaboration is? Shared crawling queue.
Here Insert Picture Description

Distributed architecture reptiles

Here Insert Picture Description
Here Insert Picture Description

Queue What maintenance?

Redis, non-relational databases, Key-Value form of storage, flexible structure.

A data structure stored in the system memory, processing speed, and good performance.

Providing more storage queue structures, collections, etc., to facilitate maintenance queue.

How come heavy?

Redis providing a set of data structures, each fingerprint in Request Redis set.
Before joining Request to the Request Queue first verify that the Request has joined the collection of fingerprints.

If exists, the queue is not added to the Request, and if not, then added to the queue and the fingerprint Request to the collection.

How to prevent interruption?

When each slave Scrapy start will first determine the current Redis Request queue is empty.
If not empty, the next queue acquired from performing a Request crawling.
If empty, crawling is resumed, the first adding unit performs the crawl orientation Request queue.

Implementation of this framework?

Scrapy-Redis library implements the architecture described above, rewriting Scrapy scheduler, queues assembly. It can be easily achieved using the Scrapy distributed architecture.

https://github.com/rolando/scrapy-redis

Published 291 original articles · won praise 104 · views 410 000 +

Guess you like

Origin blog.csdn.net/Enjolras_fuu/article/details/104288001