scrapy-splash


Splash service

Pull

docker pull scrapinghub/splash

List containers

docker ps -a

Container ID

docker inspect -f '{{.Id}}' docker_name

Remove

docker rm docker_id

Start

docker run -p 8050:8050 scrapinghub/splash
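
Once the container is running, a quick way to verify that Splash is reachable is to hit its render.html HTTP endpoint. A minimal sketch in Python, assuming the requests library is installed and the service is on the default port (the target URL is just an example):

import requests

# Ask Splash to render a page and return the final HTML
resp = requests.get('http://127.0.0.1:8050/render.html',
                    params={'url': 'http://www.baidu.com/', 'wait': 2})
print(resp.status_code)  # 200 means Splash rendered the page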

Stop

docker stop docker_id

Force kill

docker kill docker_id

Splash configuration

Service address

SPLASH_URL = 'http://127.0.0.1:8050'

Middleware

DOWNLOADER_MIDDLEWARES = {
    # cookie handling for Splash requests
    'scrapy_splash.SplashCookiesMiddleware': 5,
    # Splash middleware: routes requests through the Splash server
    'scrapy_splash.SplashMiddleware': 10,
}
SPIDER_MIDDLEWARES = {
    # deduplicates repeated Splash arguments
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 8,
}
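
The priority values above (5, 8, 10) follow the original post. For reference, the upstream scrapy-splash README suggests placing the downloader middlewares just before Scrapy's built-in ones and bumping HttpCompressionMiddleware, roughly:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    # the README also raises HttpCompressionMiddleware's priority
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}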

Class overrides

Splash requests go to the Splash endpoint rather than directly to the target URL, so Scrapy's default duplicate filter and HTTP cache would key on the wrong URL; scrapy-splash ships Splash-aware replacements:

# Splash-aware duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Splash-aware HTTP cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Usage in code

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def start_requests(self):
        for url in self.start_urls:
            # use SplashRequest in place of a plain scrapy.Request
            yield SplashRequest(url, self.parse)

    def parse(self, response):
        print(response.text)

Requests must use the dedicated SplashRequest class instead of a plain scrapy.Request; requests without Splash metadata are fetched directly and skip rendering.
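
SplashRequest also accepts rendering options that a plain Request has no equivalent for. A minimal sketch, e.g. inside start_requests (the endpoint and wait time here are illustrative, not from the original post):

yield SplashRequest(
    url,
    self.parse,
    endpoint='render.html',  # Splash HTTP endpoint used for rendering
    args={'wait': 0.5},      # give the page's JavaScript 0.5s to run
)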
