Scrapy single architecture
In this machine maintains a crawling queue, Scheduler performs scheduling.
What are the key multiple hosts collaboration is? Shared crawling queue.
Distributed architecture reptiles
Queue What maintenance?
Redis, non-relational databases, Key-Value form of storage, flexible structure.
A data structure stored in the system memory, processing speed, and good performance.
Providing more storage queue structures, collections, etc., to facilitate maintenance queue.
How come heavy?
Redis providing a set of data structures, each fingerprint in Request Redis set.
Before joining Request to the Request Queue first verify that the Request has joined the collection of fingerprints.
If exists, the queue is not added to the Request, and if not, then added to the queue and the fingerprint Request to the collection.
How to prevent interruption?
When each slave Scrapy start will first determine the current Redis Request queue is empty.
If not empty, the next queue acquired from performing a Request crawling.
If empty, crawling is resumed, the first adding unit performs the crawl orientation Request queue.
Implementation of this framework?
Scrapy-Redis library implements the architecture described above, rewriting Scrapy scheduler, queues assembly. It can be easily achieved using the Scrapy distributed architecture.
https://github.com/rolando/scrapy-redis