Using scrapy-splash on CentOS

Prerequisites

  • Complete a simple Scrapy project first
  • Install Docker
    • On Windows, download and run the installer package
    • On macOS, download and run the installer package (installing via brew was tried, but the install and startup process proved very complicated, so the standalone installer was used instead)
    • On CentOS 7, run:

      yum install docker

  • On RHEL, run:

    yum install --setopt=obsoletes=0 docker-ce-17.03.2.ce-1.el7.centos.x86_64 docker-ce-selinux-17.03.2.ce-1.el7.centos.noarch
    
  • Install scrapy-splash:

    pip install scrapy-splash
    
  • Start the Docker service
    • CentOS 7:

      service docker start

    • On Windows, simply open the application

    • On macOS, simply open the application
  • Pull the Splash image

    docker pull scrapinghub/splash

  • Run the image

    docker run -p 8050:8050 scrapinghub/splash
    
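With the container up, Splash exposes an HTTP API whose render.html endpoint returns the JavaScript-rendered page. As a minimal sketch (assuming Splash listens on localhost:8050, as started above; `render_url` is an illustrative helper, not part of any library), the request URL can be built with the standard library alone:

```python
from urllib.parse import urlencode

# Address of the local Splash instance started above (an assumption).
SPLASH_URL = 'http://localhost:8050'

def render_url(url, wait=0.5):
    """Build a request URL for Splash's render.html endpoint."""
    query = urlencode({'url': url, 'wait': wait})
    return f'{SPLASH_URL}/render.html?{query}'

print(render_url('http://example.com'))
# http://localhost:8050/render.html?url=http%3A%2F%2Fexample.com&wait=0.5
```

Fetching that URL (with curl or urllib.request, for instance) should return the rendered HTML if the container from the previous step is running.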
  • Configure the Splash service (all of the following goes in settings.py):
    • Add the Splash server address:

      SPLASH_URL = 'http://localhost:8050'

    • Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:

      DOWNLOADER_MIDDLEWARES = {
          'scrapy_splash.SplashCookiesMiddleware': 723,
          'scrapy_splash.SplashMiddleware': 725,
          'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
      }
      
    • Enable SplashDeduplicateArgsMiddleware:
      SPIDER_MIDDLEWARES = {
          'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
      }
      
    • Set a custom DUPEFILTER_CLASS:
      DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
      
    • Set a custom cache storage backend:
      HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
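
The custom dupefilter matters because two Splash requests can target the same endpoint while rendering different pages or using different arguments, so the rendering arguments must be part of the request fingerprint. As a simplified illustration of that idea (not scrapy-splash's actual implementation; `splash_fingerprint` is a hypothetical helper), the arguments can be mixed into the hash:

```python
import hashlib
import json

def splash_fingerprint(url, splash_args):
    """Illustrative fingerprint: hash the target URL together with the
    Splash arguments, so requests with different render settings are
    not deduplicated against each other."""
    payload = json.dumps({'url': url, 'args': splash_args}, sort_keys=True)
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

a = splash_fingerprint('http://example.com', {'wait': 0.5})
b = splash_fingerprint('http://example.com', {'wait': 2.0})
print(a != b)  # different args -> different fingerprints
```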
      
  • Example

    import json, scrapy
    from scrapy_splash import SplashRequest

    class MySpider(scrapy.Spider):
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ["http://example.com", "http://example.com/foo"]

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse, args={'wait': 0.5})

        def parse(self, response):
            # ...
            pass

Reposted from blog.csdn.net/zhao_5352269/article/details/83303075