Some notes on building a distributed Python crawler with scrapy-redis

The first thing to do is to install scrapy-redis. In fact, writing a distributed spider is not very different from writing a plain Scrapy one. The differences are as follows:

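Installation is a single pip command (shown here for the Python 2 setup used in the run commands below):

        pip install scrapy-redis
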
  • Settings in settings.py:
# Use the deduplication component in Scrapy-redis instead of Scrapy's default deduplication
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler component in Scrapy-redis instead of Scrapy's default scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Allow pause, redis request records are not lost
SCHEDULER_PERSIST = True
# Enable exactly one of the following three queue options; the first is usually the best choice
# Priority queue (the default): requests are scheduled in priority order
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# Queue form: requests are processed first in, first out (FIFO)
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
# Stack form: requests are processed last in, first out (LIFO)
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
# scrapy_redis.pipelines.RedisPipeline stores scraped items in the Redis database and must be enabled
# (a minimal sketch of the custom DongguanPipeline follows this settings block)
ITEM_PIPELINES = {
   'dongguan.pipelines.DongguanPipeline': 500,
   'scrapy_redis.pipelines.RedisPipeline': 400,
}
# The host and port of the Redis server
REDIS_HOST = '192.168.99.1'
REDIS_PORT = 6379
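
The settings above also register a custom pipeline, dongguan.pipelines.DongguanPipeline, whose code is not part of these notes. A minimal sketch of what it might look like (the JSON-file behavior here is an assumption for illustration, not the project's actual pipeline):

# pipelines.py -- hypothetical sketch of DongguanPipeline
import json

class DongguanPipeline(object):
    def open_spider(self, spider):
        self.f = open('dongguan.json', 'a')

    def process_item(self, item, spider):
        # With the priorities above (lower runs first), RedisPipeline (400)
        # has already pushed this item to Redis before this pipeline (500) runs.
        self.f.write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()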
  • Changes in the spider file:
# from scrapy.spiders import CrawlSpider, Rule
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisCrawlSpider
# class DongguanquestionSpider(CrawlSpider):
class DongguanquestionSpider(RedisCrawlSpider):
    name = 'dongguanquestion'
    redis_key = 'DongguanquestionSpider:start_urls'
    # allowed_domains = ['wz.sun0769.com']
    # start_urls = ['http://wz.sun0769.com/index.php/question/report?page=0']
    pagelinks = LinkExtractor(allow=r'page=\d+')
    questionlinks = LinkExtractor(allow=r'/question/\d+/\d+\.shtml')
    rules = (
        Rule(pagelinks),
        Rule(questionlinks, callback='parse_item'),
    )
    # Dynamically build the allowed domains list from a spider argument
    def __init__(self, *args, **kwargs):
        # Pass the domain on the command line with: -a domain=wz.sun0769.com
        domain = kwargs.pop('domain', '')
        # list() keeps this working on Python 3, where filter() is lazy
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(DongguanquestionSpider, self).__init__(*args, **kwargs)
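
A RedisCrawlSpider sits idle until a start URL is pushed to its redis_key, so the crawl has to be seeded once. A minimal sketch using the redis-py client, reusing the host, port, key, and start URL shown above:

# seed_start_url.py -- seeds the queue that the idle spiders are polling
import redis

r = redis.StrictRedis(host='192.168.99.1', port=6379)
# Whichever spider instance pops this URL first starts crawling;
# the scheduler then distributes the extracted requests via Redis.
r.lpush('DongguanquestionSpider:start_urls',
        'http://wz.sun0769.com/index.php/question/report?page=0')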
  • Run the spider (on each machine participating in the crawl):

        python2 -m scrapy runspider dongguanquestion.py

        On Ubuntu, prefix with sudo if needed:

        sudo python2 -m scrapy runspider dongguanquestion.py
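
        Since the spider's __init__ builds allowed_domains from a domain
        argument, pass the domain with Scrapy's -a option when starting
        each instance, e.g.:

        python2 -m scrapy runspider dongguanquestion.py -a domain=wz.sun0769.com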
