Scrapy -- distributed crawlers

In the original Scrapy, the Scheduler maintains the current machine's task queue (which stores Request objects along with callback information, etc.) plus the current machine's deduplication queue (which stores the URLs already visited).

 

The key to making this distributed is to find a dedicated host to run a shared queue on, such as Redis, and then rewrite Scrapy's Scheduler so that the new Scheduler fetches Requests from the shared queue and deduplicates Requests against it.

  1. A shared queue

  2. Rewrite the Scheduler so that both deduplication and task scheduling go through the shared queue

  3. Customize a deduplication rule for the Scheduler (using Redis's set type); a rough sketch of the shared-queue idea follows this list
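
A minimal sketch of the shared-queue idea (not scrapy_redis's actual implementation), assuming a local Redis on the default port and the redis-py client; the key names and URL are illustrative:

import redis

r = redis.Redis(host='localhost', port=6379)

def push_task(url):
    # SADD returns 1 only for new members, so only unseen URLs get enqueued
    if r.sadd('myspider:dupefilter', url):
        r.lpush('myspider:requests', url)

def pop_task():
    # every crawler process, on any machine, pops from the same shared list
    return r.rpop('myspider:requests')

push_task('https://example.com/page/1')
print(pop_task())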

 

 

# Shared Redis-based deduplication queue in Scrapy

# 1. Configure the Redis connection in settings.py
REDIS_HOST = 'localhost'                        # host name
REDIS_PORT = 6379                               # port
REDIS_URL = 'redis://user:pass@hostname:9001'   # connection URL, takes precedence over the two settings above
REDIS_PARAMS = {}                               # extra Redis connection parameters
REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'   # specify the Redis client class to use
REDIS_ENCODING = 'utf-8'                        # Redis encoding

# 2. Let Scrapy use the shared queue for deduplication
# This uses the dedup filter provided by scrapy_redis, which is implemented with a Redis set
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# 3. Specify the Redis key of the set that stores the fingerprints of non-duplicate requests
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'
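
Under the hood the dedup filter is essentially SADD on a Redis set of request fingerprints. A rough sketch of that idea (not scrapy_redis's exact code), assuming redis-py and an older Scrapy where scrapy.utils.request.request_fingerprint is still available; the key name is illustrative:

import redis
from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

r = redis.Redis(host='localhost', port=6379)

def seen_before(request, key='dupefilter:test'):
    fp = request_fingerprint(request)   # stable hash of method, URL, body, ...
    added = r.sadd(key, fp)             # 1 if the fingerprint is new, 0 if already present
    return added == 0

req = Request('https://example.com/item/1')
print(seen_before(req))   # False the first time
print(seen_before(req))   # True the second time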

 

 

# Scrapy_redis: distributed scheduling + deduplication

# Configure in settings.py

SCHEDULER = 'scrapy_redis.scheduler.Scheduler'

# The scheduler pickles each non-duplicate task and puts it into the shared task queue.
# The default is the priority queue; the options are PriorityQueue (sorted set), FifoQueue (list) and LifoQueue (list).
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Serializer used to store Request objects in Redis; the default is pickle-based
SCHEDULER_SERIALIZER = 'scrapy_redis.picklecompat'

# Redis key under which the scheduler stores the serialized request tasks
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'

# Whether to keep the queue and dedup records when the scheduler is closed: True = keep, False = clear
SCHEDULER_PERSIST = True

# Whether to clear the queue and dedup records before starting: True = clear, False = keep
SCHEDULER_FLUSH_ON_START = False

# How long to block waiting when the scheduler fetches data and the queue is empty (i.e. no data was retrieved).
# Returning immediately instead would cause too many empty polling cycles and make CPU usage soar.
SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Redis key under which the dedup records are stored
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'

# Class implementing the dedup rule; the request_fingerprint(request) string of each task is put into the dedup set
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
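
Once this scheduler is swapped in, the pending requests and dedup records live in Redis, so they can be inspected from any machine. A small sketch with redis-py, assuming the default key formats above and a spider named myspider (illustrative name):

import redis

r = redis.Redis(host='localhost', port=6379)

# PriorityQueue keeps the pickled requests in a sorted set, so ZCARD counts pending tasks
print('pending requests:', r.zcard('myspider:requests'))

# the dedup fingerprints live in a plain set under the dupefilter key
print('seen fingerprints:', r.scard('myspider:dupefilter'))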



 

Data persistence

# When parsing extracts the data we want from the target site into Item objects, the engine hands them to the pipeline for persistence / saving to the specified database. scrapy_redis provides a pipeline component that can store items in Redis for us.

# Redis key and serialization function used when persisting items
REDIS_ITEMS_KEY = '%(spider)s:items'
REDIS_ITEMS_SERIALIZER = 'json.dumps'
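
The pipeline itself is enabled like any other Scrapy pipeline; the priority value 300 below is just an example. Because items are pushed onto a Redis list as serialized strings, another process can consume them later (the spider name is illustrative):

# settings.py
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# consumer sketch: read one stored item back out of Redis
import json
import redis

r = redis.Redis(host='localhost', port=6379)
raw = r.lpop('myspider:items')   # key follows REDIS_ITEMS_KEY
if raw:
    print(json.loads(raw))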

 

 

 

# Get the start URLs from Redis

When a Scrapy crawler finishes crawling the target site and the site later updates its content, collecting it again would require restarting the Scrapy project, which quickly becomes troublesome. scrapy_redis lets the Scrapy project fetch its start URLs from Redis; if there are none, Scrapy waits for a while instead of exiting right away, so all we need is a simple script that periodically pushes start URLs into the Redis queue.

# Set the Redis key from which the start URLs are fetched
REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Whether start URLs are fetched from a set or a list: True = set, False = list
REDIS_START_URLS_AS_SET = False   # if True, self.server.spop is used to fetch the start URL; if False, self.server.lpop
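
A minimal sketch putting this together, assuming scrapy_redis is installed and the settings above are in place; the spider and key names are illustrative. The spider takes its start URLs from Redis, and a tiny companion script pushes a new start URL into the same key whenever a fresh crawl is wanted:

# myspider.py -- run with: scrapy crawl myspider
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'
    redis_key = 'myspider:start_urls'   # matches REDIS_START_URLS_KEY above

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}

# push_start_url.py -- run whenever the site should be crawled again
import redis

r = redis.Redis(host='localhost', port=6379)
r.lpush('myspider:start_urls', 'https://example.com')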

 

 

 

 

 

 

 

 
