Preparation
- Complete a basic Scrapy project first
- Install Docker
  - On Windows, download the installer package and install it
  - On macOS, download the installer package and install it (installing via brew was tried, but the install and startup process was very complicated, so the installer package was used instead)
  - On CentOS 7, run:
    yum install docker
  - On RHEL, run:
    yum install --setopt=obsoletes=0 docker-ce-17.03.2.ce-1.el7.centos.x86_64 docker-ce-selinux-17.03.2.ce-1.el7.centos.noarch
- Install scrapy-splash:
  pip install scrapy-splash
- Start the Docker service
  - On CentOS 7:
    service docker start
  - On Windows, just open the application
  - On macOS, just open the application
- Pull the image:
  docker pull scrapinghub/splash
- Run the image:
  docker run -p 8050:8050 scrapinghub/splash
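Once the container is running, Splash exposes a plain HTTP API on port 8050, so you can sanity-check it before wiring up Scrapy. A minimal sketch (the host/port mirror the `docker run -p 8050:8050` mapping above; `render.html` is Splash's standard rendering endpoint, and the helper name `render_url` is just for illustration):

```python
from urllib.parse import urlencode
from urllib.request import urlopen  # only needed for the live check below

# Matches the port published by `docker run -p 8050:8050 scrapinghub/splash`
SPLASH_URL = "http://localhost:8050"

def render_url(url, wait=0.5):
    """Build a Splash render.html endpoint URL for a target page."""
    query = urlencode({"url": url, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"

# With the container running, this would fetch the JavaScript-rendered HTML:
#   html = urlopen(render_url("http://example.com")).read()
print(render_url("http://example.com"))
```

This is the same API that scrapy-splash calls under the hood once the settings below are in place.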
- Configure the Splash service (all of the following goes in settings.py):
  - Add the Splash server address:
    SPLASH_URL = 'http://localhost:8050'
  - Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
  - Enable SplashDeduplicateArgsMiddleware:
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
  - Set a custom DUPEFILTER_CLASS:
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
  - Set a custom cache storage backend:
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
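For reference, the settings above combine into a single settings.py fragment (values exactly as listed; the localhost:8050 address assumes the container started earlier):

```python
# settings.py — scrapy-splash configuration, combining the steps above

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```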
- Example:
    import scrapy
    from scrapy_splash import SplashRequest

    class MySpider(scrapy.Spider):
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ["http://example.com", "http://example.com/foo"]

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse, args={'wait': 0.5})

        def parse(self, response):
            # ...
            pass