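I have been working on a crawler project recently and ran into pages whose HTML is generated dynamically by JavaScript. Scrapy only downloads the raw response and does not execute JavaScript, so it cannot see that content. Few sites today are built purely from static HTML, which makes JS-generated pages a hurdle you cannot avoid, so it is worth looking at how this is usually handled. For how Scrapy scrapes page content in general, see the earlier Scrapy introduction article.
A common approach is to run Splash as a service: the spider sends its requests through Splash, which renders the page and returns the HTML produced by JavaScript. In this setup Splash behaves somewhat like a proxy.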
Installation
Install Splash through Docker. I won't cover installing Docker and Docker-compose here; see the earlier Docker article
Pull the Splash image
docker pull scrapinghub/splash
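The Splash image is fairly large. If you install it inside a virtual machine, you may run out of disk space and have to enlarge the disk, and when doing so it is common for Ubuntu not to recognize the new space. See the article on expanding an Ubuntu disk in a virtual machine
Start the Splash service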
docker run -p 8050:8050 scrapinghub/splash
I prefer to put this into a docker-compose file, which is more convenient to reuse later
version: '3'
services:
  splash:
    restart: always
    image: scrapinghub/splash
    container_name: splash
    ports:
      - 8050:8050
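Open http://192.168.25.145:8050/ in a browser (replace the IP with the address of your own Docker host). If you see the Splash page below, Splash was installed successfully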
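The browser check can also be scripted. The following is a minimal sketch, assuming Splash's `render.html` HTTP endpoint with its `url` and `wait` parameters; the helper names `build_render_url` and `fetch_rendered_html` are made up for illustration, and the base URL must be replaced with your own Splash address.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_render_url(splash_base, target_url, wait=3.0):
    """Build a URL for Splash's render.html endpoint (hypothetical helper)."""
    query = urlencode({"url": target_url, "wait": wait})
    return splash_base.rstrip("/") + "/render.html?" + query


def fetch_rendered_html(splash_base, target_url, wait=3.0, timeout=30):
    """Ask Splash to render target_url and return the resulting HTML as text."""
    render_url = build_render_url(splash_base, target_url, wait)
    with urlopen(render_url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch_rendered_html("http://192.168.25.145:8050/", "https://item.jd.com/5089239.html")` should return the rendered page if the service is reachable; a connection error usually means the container is not running or the port is not published.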
The scrapy_splash package also has to be installed on the Python side
pip3 install scrapy_splash
Scrapy spider setup
Create a new project for learning purposes
scrapy startproject jdproject
Open jdproject/spiders/jd.py (create it if it does not exist yet) and set its contents to:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest


class JdSpider(scrapy.Spider):
    name = "jd"

    def start_requests(self):
        # With the 'run' endpoint, Splash wraps this script in main(splash, args).
        splash_args = {"lua_source": """
            --splash.response_body_enabled = true
            splash.private_mode_enabled = false
            splash:set_user_agent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36")
            assert(splash:go("https://item.jd.com/5089239.html"))
            splash:wait(3)
            return {html = splash:html()}
            """}
        yield SplashRequest("https://item.jd.com/5089239.html", endpoint='run', args=splash_args, callback=self.onSave)

    def onSave(self, response):
        # Text nodes inside the price <span>, read from the rendered HTML.
        value = response.xpath('//span[@class="p-price"]//text()').extract()
        print(value)
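To see roughly what goes over the wire, the same Lua script can be posted to Splash's `run` endpoint directly. This is only a sketch using the standard library, not what scrapy_splash does internally: it assumes the `run` endpoint accepts a JSON body whose extra keys are exposed to the script as `args`, and `build_run_request` is a hypothetical helper. Unlike the spider above, the URL and wait time are passed as arguments instead of being hardcoded in the script.

```python
import json
from urllib.request import Request

# Body of main(splash, args); the 'run' endpoint adds the function wrapper.
LUA_BODY = """
splash.private_mode_enabled = false
assert(splash:go(args.url))
splash:wait(args.wait)
return {html = splash:html()}
"""


def build_run_request(splash_base, target_url, wait=3):
    """Build a POST request for Splash's run endpoint (hypothetical helper)."""
    payload = {"lua_source": LUA_BODY, "url": target_url, "wait": wait}
    return Request(
        splash_base.rstrip("/") + "/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` and decoding the JSON response should give a dictionary with an `html` key, matching the table returned by the script.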
Open jdproject/settings.py and change:
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,  # without this entry no data comes back
}
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
SPLASH_URL = "http://192.168.99.100:8050/"  # address of the Splash instance in your own Docker setup
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
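Since the Splash address differs from machine to machine, one option is to read it from the environment instead of hardcoding it in settings.py. A minimal sketch, assuming an environment variable named `SPLASH_URL` (my own convention, not something Scrapy or scrapy_splash defines) and a hypothetical helper `resolve_splash_url`:

```python
import os


def resolve_splash_url(default="http://localhost:8050/"):
    """Return the Splash base URL from the environment, or a default."""
    # An unset or empty variable falls back to the default address.
    url = (os.environ.get("SPLASH_URL") or default).strip()
    # Normalize to a trailing slash so the value looks the same everywhere.
    return url if url.endswith("/") else url + "/"


SPLASH_URL = resolve_splash_url()
```

With this in settings.py, running `SPLASH_URL=http://192.168.99.100:8050 scrapy crawl jd` would point the spider at that instance without editing the file.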
Testing
For the test I used https://item.jd.com/5089239.html
and the goal is to get the product price
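The XPath in onSave returns a list of text nodes rather than a single number, so some cleanup is usually needed before the price can be used. The helper below is a rough sketch of one way to do that; `parse_price` is a hypothetical name, and the sample input in the usage note is illustrative rather than actual output from the page.

```python
import re
from decimal import Decimal


def parse_price(text_nodes):
    """Join extracted text nodes and return the first number as a Decimal."""
    joined = "".join(text_nodes).replace(",", "")
    match = re.search(r"\d+(?:\.\d+)?", joined)
    # None signals that no price was found, e.g. the page did not finish rendering.
    return Decimal(match.group()) if match else None
```

For example, `parse_price(["￥", "2999.00"])` would give `Decimal("2999.00")`, while an empty list gives `None`, which is a hint to increase the wait time in the Lua script.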
Run the spider