Copyright notice: come one, come all — if you have money, show some support; if you don't, show some support anyway. https://blog.csdn.net/wait_for_eva/article/details/81698541
Splash service
Pull the image
docker pull scrapinghub/splash
List containers
docker ps -a
Get the container ID
docker inspect -f '{{.Id}}' docker_name
Remove
docker rm docker_id
Run (note: port mapping uses -p, not -f; -d runs it detached)
docker run -d -p 8050:8050 scrapinghub/splash
Stop
docker stop docker_id
Force kill
docker kill docker_id
Splash configuration
Service address
SPLASH_URL = 'http://127.0.0.1:8050'
Middlewares
DOWNLOADER_MIDDLEWARES = {
# Splash cookies middleware
'scrapy_splash.SplashCookiesMiddleware': 5,
# Splash middleware
'scrapy_splash.SplashMiddleware': 10,
}
SPIDER_MIDDLEWARES = {
# deduplicates requests by their Splash arguments
'scrapy_splash.SplashDeduplicateArgsMiddleware': 8,
}
Class overrides
# Splash-aware request deduplication
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Splash-aware HTTP cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Code usage
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def start_requests(self):
        for url in self.start_urls:
            # use SplashRequest instead of the plain scrapy.Request
            yield SplashRequest(url, self.parse)

    def parse(self, response):
        print(response.text)
Requests must use the dedicated SplashRequest class rather than the plain scrapy.Request; otherwise the page is fetched without going through Splash rendering.
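Under the hood, SplashRequest forwards the target URL (plus any rendering arguments) to Splash's render.html HTTP endpoint. A minimal sketch of the equivalent raw payload, assuming the local Splash address configured above; the target URL and wait value here are illustrative:

```python
import json
from urllib.parse import urljoin

SPLASH_URL = 'http://127.0.0.1:8050'

# Splash's render.html endpoint accepts a JSON POST body:
# 'url' is the page to render, 'wait' pauses before taking the snapshot.
endpoint = urljoin(SPLASH_URL, 'render.html')
payload = json.dumps({'url': 'http://www.baidu.com/', 'wait': 0.5})

print(endpoint)  # http://127.0.0.1:8050/render.html
```

In the spider itself the same rendering arguments are passed directly, e.g. SplashRequest(url, self.parse, args={'wait': 0.5}).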