Scrapy: Splash

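I have recently been working on a crawler project and ran into HTML that is generated dynamically by JavaScript. Scrapy on its own downloads only the raw response and does not execute JavaScript, so it never sees that content. Few sites today are written as purely static HTML, which makes scraping JavaScript-rendered pages an obstacle that cannot be avoided, so it is worth looking at how this is usually handled. For how Scrapy scrapes page content in general, see the earlier Scrapy getting-started article.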

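A common approach is to run Splash as a service: the crawler sends its requests to Splash, which loads the page, executes the JavaScript, and returns the rendered content. In this setup Splash acts somewhat like a proxy.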
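To make the proxy analogy concrete, here is a minimal sketch of how a client could ask Splash for a rendered page through its render.html HTTP endpoint. The base URL http://localhost:8050 and the helper name splash_render_url are assumptions for illustration. Only the request URL is built here; actually fetching it requires a running Splash instance.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, wait=3, splash_base="http://localhost:8050"):
    """Build a Splash render.html URL for target_url.

    splash_base is an assumed address; point it at your own Splash service.
    wait is the number of seconds Splash lets the page's JavaScript run.
    """
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


if __name__ == "__main__":
    # The crawler would fetch this URL instead of the target page itself.
    print(splash_render_url("https://item.jd.com/5089239.html"))
```

The crawler talks only to Splash, and Splash talks to the target site, which is why it feels like a proxy.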

Installation

Install Splash with Docker. Installing Docker and docker-compose is not covered here; see the earlier Docker article.

Pull the Splash image

docker pull scrapinghub/splash

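The Splash image is fairly large. If you install it inside a virtual machine, the VM may run out of disk space and need to be expanded. After expanding, Ubuntu often fails to recognize the new space; see the article on expanding an Ubuntu virtual machine.

Start the Splash service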

docker run -p 8050:8050 scrapinghub/splash

I prefer to put this in a docker-compose file, which is more convenient to reuse later

version: '3'
services:
  splash:
    restart: always
    image: scrapinghub/splash
    container_name: splash
    ports:
      - 8050:8050

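Open http://192.168.25.145:8050/ in a browser (replace the IP with the address of your own Docker host). If the Splash page loads, Splash is installed correctly.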
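The same check can be scripted. The sketch below assumes Splash's _ping endpoint, which as far as I know returns a small JSON document with a "status" field; the base URL and both helper names are hypothetical, so adjust them to your setup.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def ping_reports_ok(body):
    """Return True if a _ping response body is JSON with status == "ok"."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not JSON at all, or JSON that is not an object.
        return False


def splash_is_up(base_url="http://localhost:8050", timeout=5):
    """Return True if a Splash instance answers at base_url (assumed address)."""
    try:
        with urlopen(f"{base_url.rstrip('/')}/_ping", timeout=timeout) as resp:
            return ping_reports_ok(resp.read())
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error, and so on.
        return False


if __name__ == "__main__":
    print("Splash reachable:", splash_is_up())
```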

The scrapy_splash package also needs to be installed on the Python side

pip3 install scrapy_splash

Scrapy spider setup

Create a new project for learning purposes

scrapy startproject jdproject

Open (or create) jdproject/spiders/jd.py and set its content to:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest


class JdSpider(scrapy.Spider):
    name = "jd"

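    def start_requests(self):
        # With the 'run' endpoint, Splash wraps this Lua source in
        # function main(splash, args) ... end, so only the body is written here.
        splash_args = {"lua_source": """
                    --splash.response_body_enabled = true
                    splash.private_mode_enabled = false
                    splash:set_user_agent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36")
                    assert(splash:go("https://item.jd.com/5089239.html"))
                    -- give the page's JavaScript 3 seconds to run
                    splash:wait(3)
                    return {html = splash:html()}
                    """}
        yield SplashRequest("https://item.jd.com/5089239.html", endpoint='run', args=splash_args, callback=self.onSave)

    def onSave(self, response):
        # //text() collects every text node under the price span,
        # so the result is a list of strings rather than a single value.
        value = response.xpath('//span[@class="p-price"]//text()').extract()
        print(value)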
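To see what the XPath in the callback collects, here is a small standalone sketch using only the standard library. The SAMPLE markup is a made-up, simplified stand-in for the rendered page (the real JD markup may differ), and price_texts is a hypothetical helper. ElementTree supports only a subset of XPath, so the loop over itertext() plays the role of //text().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; the real page structure is an assumption.
SAMPLE = (
    '<div>'
    '<span class="p-price"><span>¥</span><span class="price">99.00</span></span>'
    '</div>'
)


def price_texts(markup):
    """Return all non-empty text nodes under span.p-price, in document order."""
    root = ET.fromstring(markup)
    texts = []
    for span in root.findall(".//span[@class='p-price']"):
        texts.extend(t.strip() for t in span.itertext() if t.strip())
    return texts


if __name__ == "__main__":
    parts = price_texts(SAMPLE)
    print(parts)           # the individual text nodes
    print("".join(parts))  # joined into a single price string
```

Because the currency symbol and the number can sit in separate child elements, the spider's callback prints a list, and joining the pieces gives one price string.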

Open jdproject/settings.py and change:

# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,  # without this setting, no data comes back
}

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'

SPLASH_URL = "http://192.168.99.100:8050/"  # address of the Splash service running in your own Docker
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
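The numbers in DOWNLOADER_MIDDLEWARES are priorities. Scrapy calls process_request in ascending order and process_response in descending order. The sketch below only sorts the three entries from the settings above to make that order visible; it ignores Scrapy's built-in default middlewares, and the helper names are hypothetical.

```python
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}


def request_order(middlewares):
    """Class names in the order an outgoing request passes through them."""
    ordered = sorted(middlewares.items(), key=lambda item: item[1])
    return [path.rsplit(".", 1)[-1] for path, _priority in ordered]


def response_order(middlewares):
    """Class names in the order an incoming response passes through them."""
    return list(reversed(request_order(middlewares)))


if __name__ == "__main__":
    print("request: ", " -> ".join(request_order(DOWNLOADER_MIDDLEWARES)))
    print("response:", " -> ".join(response_order(DOWNLOADER_MIDDLEWARES)))
```

With HttpCompressionMiddleware at 810, a response is decompressed before SplashMiddleware handles it, which is presumably why leaving that entry out yields no usable data.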

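Testing

I use https://item.jd.com/5089239.html as the test page; the goal is to extract the product price.

Run the spider from the project directory with scrapy crawl jd (jd is the name set in the spider class).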

Reposted from blog.csdn.net/u011414629/article/details/103101915