[Web scraping] Crawling JD.com's dynamically loaded product listings with Scrapy and Selenium

[Original post] https://www.cnblogs.com/cnkai/p/7570116.html

In an earlier hands-on post we already crawled product data from JD.com, but that version had a flaw you may not have noticed. This post explains the flaw in detail and shows how to fix it.

When we type a keyword into JD's search page, the page comes back in two stages. The server first returns a static page containing roughly 30 products ("roughly" because a few of them may be sponsored listings). Then, as we scroll down, JD's backend uses Ajax to load another 30 products. What looks like a single page of 60 items is actually delivered in two batches, and the second batch of 30 only loads once you have scrolled far enough down the page.
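
A quick way to confirm this, as a rough sketch that reuses the same product-list XPath the spider uses later, is to fetch the search page without a browser and count the items (the keyword and User-Agent below are just placeholders):

import requests
from lxml import etree

# Fetch the search page the way Scrapy would: plain HTTP, no JavaScript.
url = 'https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8'
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text

# Count the <li> nodes in the product list; expect roughly 30, not 60.
items = etree.HTML(html).xpath('//ul[@class="gl-warp clearfix"]/li')
print(len(items))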

If you then click "page 2" at the bottom of the results and look closely at the new URL, you will see that it actually reports page 3. Below are the URL of the initial first page and the URL after clicking "page 2":

# URL of the first page
https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=1&s=1&click=0

# URL after clicking "page 2"
https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&cid2=653&cid3=655&page=3&s=59&click=0

The page parameter is 1 in one URL and 3 in the other. From this we can tell that the second half of each JD results page is loaded via Ajax rather than by requesting a URL directly. To capture the other 30 items we therefore need two things: JavaScript rendering, and a scroll to the bottom of the page to trigger the Ajax request.
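
Judging from the two URLs above, visible page n seems to map to page = 2n - 1, with the even-numbered half being what the Ajax call fetches on scroll. A tiny sketch of that assumption:

def jd_page_param(visible_page):
    # Assumption inferred from the URLs above: the static half of visible page n
    # uses page = 2*n - 1; the other half (2*n) arrives via Ajax.
    return 2 * visible_page - 1

print(jd_page_param(1))  # 1, matches the first-page URL
print(jd_page_param(2))  # 3, matches the URL seen after clicking "page 2"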

Now that we understand the process, the next step is to solve the problem. Since Scrapy by itself only fetches static content, we need another tool working alongside Scrapy to crawl the complete page.

Our approach is to use Selenium together with Firefox. Concretely, Selenium is plugged in as a Scrapy downloader middleware (Downloader Middleware) that executes a JavaScript snippet to scroll to the bottom of the page, letting the browser handle the JavaScript rendering for us.

Let's walk through it.

scrapy startproject Jingdong
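
This creates the usual Scrapy project skeleton; the spider file shown next lives under Jingdong/spiders/:

Jingdong/
    scrapy.cfg
    Jingdong/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py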

spider.py

# -*- coding: utf-8 -*-
"""
Created on Tue Jul 24 17:17:24 2018

@author: Administrator
"""

from scrapy import Spider,Request
from selenium import webdriver

class JingdongSpider(Spider):
    name = 'jingdong'

    def __init__(self):
        # Passing the driver path as the first positional argument raised
        # NotADirectoryError: [WinError 267] The directory name is invalid:
        # self.browser = webdriver.Firefox(r'E:\software\python\geckodriver-v0.21.0-win64\geckodriver.exe')
        # so the path has to be passed via executable_path (raw string keeps the backslashes literal):
        self.browser = webdriver.Firefox(executable_path=r'E:\software\python\geckodriver-v0.21.0-win64\geckodriver.exe')
        self.browser.set_page_load_timeout(30)

    def closed(self,spider):
        print("spider closed")
        self.browser.close()

    def start_requests(self):
        # For this demo we only crawl a single results page; the URL had no format
        # placeholder, so the .format()/range() loop it was originally wrapped in was a no-op.
        start_urls = ['https://search.jd.com/Search?keyword=%E6%95%B0%E7%A0%81%E7%9B%B8%E6%9C%BA&enc=utf-8&wq=%E6%95%B0%E7%A0%81%E7%9B%B8%E6%9C%BA&pvid=86343bd54f6a4ed3baa79001538b52e7']
        for url in start_urls:
            yield Request(url=url, callback=self.parse)


    def parse(self, response):
        selector = response.xpath('//ul[@class="gl-warp clearfix"]/li')
        print(len(selector))
        print('---------------------------------------------------')

Defining the webdriver on the spider has the advantage that the browser is not opened and closed for every single request.
The closed() method runs automatically when the spider finishes and shuts the browser down.
Since this is only a demonstration, we test with a single page and check whether the final count is about 60 items. Around 60 means Selenium did its job; only around 30 means it did not.
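
One detail worth knowing: browser.close() only closes the current window, while quit() also terminates the geckodriver process. If stray driver processes are left behind after a run, a stricter version of the hook would be:

    def closed(self, spider):
        print("spider closed")
        self.browser.quit()  # quit() ends the driver process, not just the window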

middlewares.py

The file below contains the main logic. In it we execute a JavaScript snippet to scroll to the bottom of the page, triggering the Ajax request, and then wait briefly for the Ajax content to load. The Scrapy documentation describes downloader middlewares in some detail: when a downloader middleware's process_request returns a Response object, the remaining downloader middlewares (and the download itself) are skipped, and that Response is used directly.

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
import time

class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        if spider.name == 'jingdong':
            try:
                spider.browser.get(request.url)
                # Scroll to the bottom of the page to trigger the Ajax request
                # that loads the remaining ~30 products.
                spider.browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            except TimeoutException:
                print('page load timed out')
                spider.browser.execute_script('window.stop()')
            # Give the Ajax response a moment to arrive and render.
            time.sleep(2)
            # Returning an HtmlResponse short-circuits the rest of the download;
            # Scrapy hands the rendered page straight back to the spider.
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                                encoding="utf-8", request=request)

class JingdongSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class JingdongDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
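
The fixed time.sleep(2) in SeleniumMiddleware is the simplest way to wait for the Ajax batch, but it wastes time when the items arrive quickly and may be too short on a slow connection. An explicit wait on the item count is sturdier; a sketch, assuming the same product-list XPath and a 10-second ceiling, that could replace the sleep in process_request:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_lazy_items(browser, min_count=40, timeout=10):
    # Block until the product list holds more than the initial ~30 <li> nodes,
    # i.e. until the Ajax-loaded half of the page has rendered.
    WebDriverWait(browser, timeout).until(
        lambda b: len(b.find_elements(By.XPATH,
                                      '//ul[@class="gl-warp clearfix"]/li')) >= min_count
    )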

After that, all we need to do is register this downloader middleware in settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for Jingdong project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Jingdong'

SPIDER_MODULES = ['Jingdong.spiders']
NEWSPIDER_MODULE = 'Jingdong.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Jingdong (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Jingdong.middlewares.JingdongSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'Jingdong.middlewares.SeleniumMiddleware': 543,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'Jingdong.pipelines.JingdongPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

To put our spider to work, go to the project's top-level directory and run:

scrapy crawl jingdong

Console output:

2018-07-24 18:13:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://search.jd.com/Search?keyword=%E6%95%B0%E7%A0%81%E7%9B%B8%E6%9C%BA&enc=utf-8&wq=%E6%95%B0%E7%A0%81%E7%9B%B8%E6%9C%BA&pvid=86343bd54f6a4ed3baa79001538b52e7> (referer: None)
60

Mission accomplished: the page yields 60 items, so the Selenium middleware works.

Reposted from blog.csdn.net/sinat_40431164/article/details/81188374