python splash scrapy

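1. Introduction

Splash is a rendering engine with its own HTTP API. You can call the Splash service's HTTP endpoints directly, or use a wrapper package such as python-splash for convenience.

1.1. Calling Splash from Python

Start by calling the HTTP API directly.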

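import re
from urllib.parse import quote

import requests

# Lua script executed by Splash: fetch the page and return its body as a string
lua = '''
function main(splash, args)
  local treat = require("treat")
  local response = splash:http_get("https://www.shou.edu.cn/")
  return treat.as_string(response.body)
end
'''

# The script travels in the query string, so it has to be URL-encoded
url = 'http://splash:8050/execute?lua_source=' + quote(lua)
response = requests.get(url, auth=('admin', 'admin'))

# Look for an IPv4 address in the returned body; the page may not contain one
match = re.search(r'(\d+\.\d+\.\d+\.\d+)', response.text)
print(match.group(1) if match else None)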

Note the quote(lua) call: the Lua script is passed in the query string, so it must be URL-encoded.
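To see why the encoding step matters, here is a minimal sketch using only the standard library. The Lua snippet is a made-up example; the point is that newlines, spaces and ampersands do not survive in a raw query string, while the quoted form decodes back to the original script.

```python
from urllib.parse import quote, unquote

# A tiny, made-up Lua script containing newlines, spaces and an ampersand
lua = 'function main(splash, args)\n  return "a b&c"\nend'

# Percent-encode everything that would break a query string
encoded = quote(lua)
print(encoded)

# Decoding gives back the original script unchanged
decoded = unquote(encoded)
```

An alternative that avoids hand-building the URL is to pass the script through the params argument of requests.get, which performs the encoding itself.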

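This uses the execute endpoint of Splash's HTTP API.

It is one of the simpler endpoints; for the others, see the documentation: http://splash.readthedocs.io/en/stable/api.html#render-html.

Example: curl 'http://localhost:8050/render.html?url=https://www.baidu.com'

The url parameter is the address of the target page.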
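The same render.html call can be assembled in Python. The sketch below only builds the URL; build_render_url is a hypothetical helper name, and wait is an optional render.html parameter added here for illustration.

```python
from urllib.parse import urlencode


def build_render_url(splash_base, target_url, **params):
    """Return a render.html URL for target_url with encoded query parameters."""
    query = urlencode({"url": target_url, **params})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


render_url = build_render_url("http://localhost:8050", "https://www.baidu.com", wait=0.5)
print(render_url)
```

Passing render_url to requests.get should then return the rendered HTML, assuming a Splash instance is listening on localhost:8050.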

2. scrapy+splash

2.1. Installation

pip install scrapy-splash

2.2. Usage

scrapy_splash defines a SplashRequest class. To send a request through Splash, use scrapy_splash.SplashRequest in place of scrapy.Request.

Commonly used constructor arguments:

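url---the URL to crawl

headers---request headers

cookies---cookies to send

args---arguments passed to Splash, such as wait, timeout, images, js_source

cache_args---names of arguments that Splash should cache, useful when an argument is sent repeatedly or is large

endpoint---the Splash service endpoint

splash_url---the Splash server address, None by default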
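As a rough sketch of how these arguments fit together, the helper below collects them into keyword arguments for SplashRequest. The function names, the example values and the choice of the execute endpoint are assumptions made for illustration; scrapy_splash is imported lazily so the argument dictionary can be inspected without the package installed.

```python
def splash_request_kwargs(url, lua_script):
    """Collect keyword arguments for SplashRequest (hypothetical helper)."""
    return {
        "url": url,
        "endpoint": "execute",
        # Values in args are forwarded to Splash
        "args": {"lua_source": lua_script, "wait": 0.5, "timeout": 30},
        # Ask Splash to cache the script instead of receiving it every time
        "cache_args": ["lua_source"],
    }


def make_request(url, lua_script, callback):
    # Imported here so the helper above works without scrapy-splash installed
    from scrapy_splash import SplashRequest
    return SplashRequest(callback=callback, **splash_request_kwargs(url, lua_script))


kwargs = splash_request_kwargs(
    "https://example.com",
    "function main(splash, args) return splash:html() end",
)
```

Inside a spider, start_requests could then yield make_request(url, script, self.parse).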

The spider code itself barely changes.

spider

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest


class JdSpider(scrapy.Spider):
    name = "jd"

    def start_requests(self):
        # With the 'run' endpoint the script is the body of main(splash, args)
        splash_args = {"lua_source": """
            --splash.response_body_enabled = true
            splash.private_mode_enabled = false
            splash:set_user_agent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36")
            assert(splash:go("https://item.jd.com/5089239.html"))
            splash:wait(3)
            return {html = splash:html()}
        """}
        yield SplashRequest("https://item.jd.com/5089239.html", endpoint='run', args=splash_args, callback=self.onSave)

    def onSave(self, response):
        # The returned 'html' key becomes the response body, so selectors work
        value = response.xpath('//span[@class="p-price"]//text()').extract()
        print(value)

Open jdproject/settings.py and change the following:

# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,  # without this entry, no data comes back
}

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

SPLASH_URL = "http://192.168.99.100:8050/"  # address of the Splash instance running in your own Docker
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

3. Official documentation

Source: https://pypi.org/project/scrapy-splash/

Add the Splash server address to settings.py of your Scrapy project like this:

SPLASH_URL = 'http://192.168.59.103:8050'

Enable the Splash middleware by adding it to DOWNLOADER_MIDDLEWARES in your settings.py file and changing HttpCompressionMiddleware priority:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

Order 723 is just before HttpProxyMiddleware (750) in default scrapy settings.
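The effect of these numbers can be checked with a few lines of plain Python. This is only an illustration of the ordering: the HttpProxyMiddleware entry is added with the default value of 750 quoted above, and sorting by priority gives the order in which the middlewares see an outgoing request.

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    # Scrapy default, listed here only to show where it falls in the order
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Lower numbers are closer to the engine, so they handle outgoing requests first
order = [
    path.rsplit('.', 1)[-1]
    for path, _ in sorted(DOWNLOADER_MIDDLEWARES.items(), key=lambda item: item[1])
]
print(order)
```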

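Notes:

The core job of the middleware is to rewrite each request so that it targets the Splash server (SPLASH_URL plus the endpoint), carrying the original URL along as an argument. In other words, the request goes to Splash, and Splash returns the rendered result. This happens in the SplashMiddleware class.
Pay attention to the middleware priority values.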
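The rewrite described in this note can be pictured with a simplified sketch. It is not the actual SplashMiddleware code; to_splash_call is a hypothetical function that only shows the idea of turning a target URL plus arguments into a call against a Splash endpoint.

```python
import json


def to_splash_call(splash_url, endpoint, target_url, args):
    """Return the Splash endpoint URL and a JSON body carrying the original URL."""
    payload = dict(args, url=target_url)
    endpoint_url = splash_url.rstrip('/') + '/' + endpoint
    return endpoint_url, json.dumps(payload, sort_keys=True)


splash_endpoint, body = to_splash_call(
    "http://192.168.59.103:8050",
    "render.html",
    "https://item.jd.com/5089239.html",
    {"wait": 3},
)
print(splash_endpoint)
print(body)
```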

Enable SplashDeduplicateArgsMiddleware by adding it to SPIDER_MIDDLEWARES in your settings.py:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

This middleware is needed to support cache_args feature; it allows to save disk space by not storing duplicate Splash arguments multiple times in a disk request queue. If Splash 2.1+ is used the middleware also allows to save network traffic by not sending these duplicate arguments to Splash server multiple times.
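The space saving can be illustrated conceptually. The sketch below is not scrapy-splash's implementation; dedupe_args and the store dictionary are hypothetical names showing how a large argument such as lua_source could be kept once and referenced by a short fingerprint in every queued request.

```python
import hashlib
import json


def dedupe_args(args, cache_keys, store):
    """Replace selected argument values with fingerprints, keeping one copy in store."""
    slim = dict(args)
    for key in cache_keys:
        if key in slim:
            digest = hashlib.sha1(json.dumps(slim[key]).encode("utf-8")).hexdigest()
            store.setdefault(digest, slim[key])  # stored only the first time
            slim[key] = digest
    return slim


store = {}
script = "function main(splash, args) return splash:html() end"
first = dedupe_args({"lua_source": script, "wait": 1}, ["lua_source"], store)
second = dedupe_args({"lua_source": script, "wait": 2}, ["lua_source"], store)
print(len(store))
```

Both requests end up carrying the same short fingerprint, while the script text itself is held in one place.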

Set a custom DUPEFILTER_CLASS:

Setting a custom dupefilter seems a little odd: in theory, if Splash is treated as just a plugin, a separate filter should not be needed.

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

If you use Scrapy HTTP cache then a custom cache storage backend is required. scrapy-splash provides a subclass of scrapy.contrib.httpcache.FilesystemCacheStorage:

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

If you use other cache storage then it is necessary to subclass it and replace all scrapy.utils.request.request_fingerprint calls with scrapy_splash.splash_request_fingerprint.

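4. Summary

Think of Splash as a proxy plus a browser.

Cookies: Splash can manage them, but it is better to manage them in Scrapy, which already has the machinery for it. This also keeps Splash's role simple and the two components decoupled.

Proxies: configure them through the Splash API.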
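As a small sketch of the proxy point, the Splash HTTP API accepts a proxy argument, which can be passed along in the args of a request. The helper name and the proxy address below are placeholders, not values from a real setup.

```python
def splash_args_with_proxy(proxy_url, wait=1.0):
    """Build Splash arguments that route the page load through a proxy."""
    return {"wait": wait, "proxy": proxy_url}


# Placeholder proxy address; substitute your own
proxy_args = splash_args_with_proxy("http://user:pass@proxy.example.com:3128")
print(proxy_args)
```

The resulting dictionary could be passed as args to SplashRequest, or its entries appended as query parameters to a render.html call.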