Scrapy's built-in Scheduler maintains, on the local machine, the task queue (Request objects, which carry the URL, callback function, and other information) plus the dedup queue (the set of already-visited URLs).
The key to making this distributed is to move those queues onto a shared store running on a dedicated host, such as a Redis server. The Scheduler is then rewritten so that it pushes new Requests into the shared queue, fetches Requests from it, and filters out duplicate Requests against the shared dedup records. This requires:
1. A shared queue.
2. A rewritten Scheduler, so that both dedup checks and task enqueue/dequeue go through the shared queue.
3. A dedup rule tailored for the Scheduler (implemented with the Redis set type).
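The three pieces above can be sketched in plain Python. This is a toy illustration, not scrapy_redis's actual API: an in-memory deque and set stand in for the Redis task queue and dedup set, and the class and method names are made up for the example.

```python
from collections import deque
import hashlib

class SharedScheduler:
    """Toy scheduler: a shared task queue plus a dedup set.
    In scrapy_redis both live in Redis; here they are in-memory stand-ins."""

    def __init__(self):
        self.queue = deque()  # stands in for the Redis task queue
        self.seen = set()     # stands in for the Redis dedup set

    def fingerprint(self, url):
        # Simplified request fingerprint: a hash of the URL
        return hashlib.sha1(url.encode()).hexdigest()

    def enqueue_request(self, url):
        fp = self.fingerprint(url)
        if fp in self.seen:   # duplicate request: drop it
            return False
        self.seen.add(fp)
        self.queue.append(url)
        return True

    def next_request(self):
        # Workers on any machine would pull their next task from here
        return self.queue.popleft() if self.queue else None

sched = SharedScheduler()
sched.enqueue_request('http://example.com/a')
sched.enqueue_request('http://example.com/a')  # duplicate, ignored
sched.enqueue_request('http://example.com/b')
print(len(sched.queue))  # 2
```

Because the queue and the dedup set would live in Redis rather than in process memory, any number of crawler processes on any number of machines can share them.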
# Dedup via the shared queue in scrapy_redis

# 1. Redis connection settings in settings.py
REDIS_HOST = 'localhost'                        # host name
REDIS_PORT = 6379                               # port number
REDIS_URL = 'redis://user:pass@hostname:9001'   # connection URL, takes priority over the two settings above
REDIS_PARAMS = {}                               # extra Redis connection parameters
REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'  # specify a custom Redis client class
REDIS_ENCODING = 'utf-8'                        # Redis encoding

# 2. Let scrapy use scrapy_redis's dedup filter,
#    which is implemented with a Redis set
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# 3. Key of the Redis set that stores the fingerprint strings
#    of Requests already seen
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'
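What goes into that Redis set is a fingerprint string per request. Below is a simplified version for illustration: Scrapy's real request fingerprint also canonicalizes the URL (sorting query parameters, and so on) before hashing, which this sketch skips.

```python
import hashlib

def request_fingerprint(method, url, body=b''):
    """Simplified request fingerprint: a SHA1 digest over the request
    method, URL, and body. Scrapy's real implementation additionally
    canonicalizes the URL before hashing."""
    h = hashlib.sha1()
    h.update(method.encode())
    h.update(url.encode())
    h.update(body)
    return h.hexdigest()

fp1 = request_fingerprint('GET', 'http://example.com/page?id=1')
fp2 = request_fingerprint('GET', 'http://example.com/page?id=1')
fp3 = request_fingerprint('GET', 'http://example.com/page?id=2')
print(fp1 == fp2)  # True: identical requests produce the same fingerprint
print(fp1 == fp3)  # False: a different URL produces a different fingerprint
```

In Redis, the filter adds each fingerprint to the set under DUPEFILTER_KEY with SADD; adding a member that already exists returns 0, which signals a duplicate request.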
# Distributed scheduling + dedup with scrapy_redis
# settings.py configuration

SCHEDULER = 'scrapy_redis.scheduler.Scheduler'

# The scheduler pickles each non-duplicate task and puts it into the shared
# task queue. The default is the priority queue; the available classes are
# PriorityQueue (sorted set), FifoQueue (list), and LifoQueue (list).
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Serializer for Request objects stored in Redis; pickle by default
SCHEDULER_SERIALIZER = 'scrapy_redis.picklecompat'

# Redis key under which the serialized scheduler tasks are stored
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'

# Whether to keep the task queue and dedup records when the scheduler closes:
# True = keep, False = clear
SCHEDULER_PERSIST = True

# Whether to clear the task queue and dedup records before starting:
# True = clear, False = keep
SCHEDULER_FLUSH_ON_START = False

# How many seconds to wait when fetching from an empty queue before giving up.
# Returning immediately instead would cause a busy loop of empty polls and
# send CPU usage soaring.
SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Redis key under which the dedup records are stored
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'

# Class implementing the dedup rule: the request_fingerprint(request) string
# is checked against the dedup set before a task enters the queue
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
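A rough sketch of what SCHEDULER_SERIALIZER and SCHEDULER_QUEUE_CLASS do together: each request is pickled and stored with a priority score. Here a heapq stands in for the Redis sorted set, and the class is illustrative, not scrapy_redis's real queue implementation.

```python
import heapq
import pickle

class TinyPriorityQueue:
    """Stand-in for a priority task queue: requests are pickled
    (cf. SCHEDULER_SERIALIZER) and ordered by priority, using heapq
    here instead of a Redis sorted set."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker so heapq never compares payloads

    def push(self, request_dict, priority=0):
        data = pickle.dumps(request_dict)  # serialize the request
        # heapq pops the smallest tuple first; negate the priority
        # so that higher-priority requests come out first
        heapq.heappush(self._heap, (-priority, self._count, data))
        self._count += 1

    def pop(self):
        if not self._heap:
            return None
        _, _, data = heapq.heappop(self._heap)
        return pickle.loads(data)  # deserialize back into a request

q = TinyPriorityQueue()
q.push({'url': 'http://example.com/low'}, priority=0)
q.push({'url': 'http://example.com/high'}, priority=10)
print(q.pop()['url'])  # the higher-priority request comes out first
```

In Redis, the same idea is expressed with ZADD (score = priority) and a pop of the highest-scoring member, which is why PriorityQueue is backed by a sorted set while FifoQueue and LifoQueue only need a list.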
Data persistence
# When parse() extracts the desired data from the target site into Item
# objects, the engine hands them to the pipelines for persistence / saving
# to the specified database. scrapy_redis provides a pipeline component
# that stores the Items in Redis for us.

# Redis key and serializer used when persisting Items
REDIS_ITEMS_KEY = '%(spider)s:items'
REDIS_ITEMS_SERIALIZER = 'json.dumps'
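REDIS_ITEMS_SERIALIZER = 'json.dumps' means each item is turned into a JSON string before being pushed under the REDIS_ITEMS_KEY list. A minimal illustration with a plain dict standing in for a scraped Item:

```python
import json

# A scraped item as the pipeline would see it (a plain dict here)
item = {'title': 'Example page', 'url': 'http://example.com', 'price': 9.99}

# What the redis pipeline would store under the '%(spider)s:items' key
serialized = json.dumps(item)

# Any downstream consumer can read the JSON back out of Redis
restored = json.loads(serialized)
print(restored == item)  # True: the round trip preserves the item
```

The pipeline itself is enabled in the usual Scrapy way, by adding scrapy_redis's pipeline class ('scrapy_redis.pipelines.RedisPipeline') to ITEM_PIPELINES in settings.py.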
# Fetching start URLs from Redis
# Normally a scrapy spider crawls the target site starting from its start
# URLs and finishes once the crawl is complete. If the target site is updated
# afterwards and we want the new content, the whole scrapy project has to be
# restarted, which quickly becomes tedious. scrapy_redis offers a way for the
# project to fetch its start URLs from Redis instead: if none are available,
# scrapy simply waits rather than exiting, so all we need is a small script
# that periodically pushes start URLs into the Redis queue.

# Redis key the spider reads its start URLs from
REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Whether start URLs are fetched from a set or a list:
# True = set (fetched with self.server.spop),
# False = list (fetched with self.server.lpop)
REDIS_START_URLS_AS_SET = False
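The REDIS_START_URLS_AS_SET switch boils down to spop versus lpop. The practical difference can be shown without a Redis server; here a Python list and set stand in for the Redis list and set:

```python
# REDIS_START_URLS_AS_SET = False: a Redis list, fed with LPUSH/RPUSH,
# drained with lpop. Order is preserved and duplicates are kept.
start_urls_list = []
start_urls_list.append('http://example.com/1')
start_urls_list.append('http://example.com/1')  # lists keep duplicates
start_urls_list.append('http://example.com/2')

# REDIS_START_URLS_AS_SET = True: a Redis set, fed with SADD,
# drained with spop. Duplicates are collapsed; pop order is arbitrary.
start_urls_set = set()
start_urls_set.add('http://example.com/1')
start_urls_set.add('http://example.com/1')      # sets drop duplicates
start_urls_set.add('http://example.com/2')

first = start_urls_list.pop(0)  # like lpop: first URL pushed comes out first
print(first, len(start_urls_list))  # http://example.com/1, 2 left
print(len(start_urls_set))          # 2: the duplicate URL was collapsed
```

So the list mode suits ordered feeding where repeats are allowed, while the set mode gives free dedup of start URLs at the cost of ordering.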