The first step is to install scrapy-redis. Writing a distributed spider is in fact not very different from writing a regular Scrapy spider; the differences are as follows.
- Settings in `settings.py`:
```python
# Use the scrapy-redis deduplication component instead of Scrapy's default
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scrapy-redis scheduler component instead of Scrapy's default
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Allow pausing; the request records in redis are not lost
SCHEDULER_PERSIST = True
# Enable exactly one of the following three; the first is usually the best choice
# Default scrapy-redis request queue (ordered by priority)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# Queue form: requests are processed first in, first out (FIFO)
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
# Stack form: requests are processed last in, first out (LIFO)
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
```
```python
# scrapy_redis.pipelines.RedisPipeline stores scraped items in redis and must be enabled
ITEM_PIPELINES = {
    'dongguan.pipelines.DongguanPipeline': 500,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
# Host and port of the redis server
REDIS_HOST = '192.168.99.1'
REDIS_PORT = 6379
```
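With `RedisPipeline` enabled, scraped items are serialized and pushed onto a redis list; by default the list is keyed `<spider_name>:items`, so for this project it would be `dongguanquestion:items` (the host/port below are the ones from the settings above). A quick way to inspect the results from any machine:

```shell
# Count the items collected so far (assumes the default <spider_name>:items key)
redis-cli -h 192.168.99.1 -p 6379 llen dongguanquestion:items
# Look at the most recently stored item
redis-cli -h 192.168.99.1 -p 6379 lrange dongguanquestion:items 0 0
```

Because every worker writes to the same redis server, this list aggregates the output of the whole cluster in one place.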
- Changes in the spider file:
```python
# from scrapy.spiders import CrawlSpider, Rule
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisCrawlSpider


# class DongguanquestionSpider(CrawlSpider):
class DongguanquestionSpider(RedisCrawlSpider):
    name = 'dongguanquestion'
    # The spider reads its start URLs from this redis key instead of start_urls
    redis_key = 'DongguanquestionSpider:start_urls'
    # allowed_domains = ['wz.sun0769.com']
    # start_urls = ['http://wz.sun0769.com/index.php/question/report?page=0']
    pagelinks = LinkExtractor(allow=r'page=\d+')
    questionlinks = LinkExtractor(allow=r'/question/\d+/\d+\.shtml')
    rules = (
        Rule(pagelinks),
        Rule(questionlinks, callback='parse_item'),
    )

    def __init__(self, *args, **kwargs):
        # Dynamically build the allowed-domains list from the `domain` argument
        domain = kwargs.pop('domain', '')
        self.allowed_domains = filter(None, domain.split(','))
        super(DongguanquestionSpider, self).__init__(*args, **kwargs)
```
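The dynamic-domain logic in `__init__` simply splits the comma-separated `domain` argument and drops empty entries. A standalone sketch of that parsing (using a list comprehension, which behaves the same on Python 2 and 3, unlike the `filter` object returned on Python 3):

```python
def parse_domains(domain=''):
    """Split a comma-separated domain string, dropping empty entries."""
    return [d for d in domain.split(',') if d]


print(parse_domains('wz.sun0769.com'))  # ['wz.sun0769.com']
print(parse_domains('a.com,b.com'))     # ['a.com', 'b.com']
print(parse_domains(''))                # []
```

This lets each worker be launched with a different (or empty) allowed-domains list without editing the spider source.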
- Run the spider:

```
python2 -m scrapy runspider dongguanquestion.py
```

On Ubuntu, run it with sudo:

```
sudo python2 -m scrapy runspider dongguanquestion.py
```
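After launching, the spider blocks until a start URL appears under its `redis_key`. You can seed it from any machine with `redis-cli`; the key and URL below are the ones defined in the spider above, and the host/port come from `settings.py`:

```shell
# Push the start URL onto the key the spider is watching
redis-cli -h 192.168.99.1 -p 6379 lpush DongguanquestionSpider:start_urls \
    "http://wz.sun0769.com/index.php/question/report?page=0"
```

Since `__init__` pops a `domain` keyword argument, you can also restrict crawling when launching a worker, e.g. `scrapy runspider dongguanquestion.py -a domain=wz.sun0769.com`.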