[Crawler] Using Scrapy + Selenium to Crawl Content from Dynamically Loaded Pages

In the previous article, we used Python Scrapy to crawl all the text on a static web page: https://blog.csdn.net/sinat_40431164/article/details/81102476

There is a problem, however: when we change the target URL to http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2, the crawled content is missing the "车型论坛" (model forums) and "主题论坛" (topic forums) sections.

Sometimes, when we naively download an HTML page with urllib or Scrapy, we find that the elements we want to extract are not in the HTML we downloaded, even though they appear to be right there in the browser.

This means the elements we want are generated dynamically by JavaScript in response to our actions. For example, when you keep scrolling through Qzone or Weibo comments, the page gets longer and longer and the content keeps growing; that is the love-it-or-hate-it dynamic loading at work. There are currently two ways to crawl dynamic pages:

  1. Analyze the page's network requests (a minimal sketch of this approach follows the list)
  2. Use Selenium to simulate browser behavior
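
Approach 1 means finding the XHR/JSON request the page issues in the background (via the browser's developer tools, Network panel) and calling that endpoint directly. A minimal sketch using the requests library; the endpoint URL, parameters, and response fields below are hypothetical placeholders, not the real Haval forum API:

import requests

# Hypothetical XHR endpoint discovered in the browser's Network panel;
# the real endpoint and parameters have to be found per site.
api_url = 'http://example.com/api/article_list'
params = {'page': 1, 'mobile': 2}

resp = requests.get(api_url, params=params, timeout=10)
resp.raise_for_status()
data = resp.json()  # dynamically loaded pages often return JSON
for item in data.get('items', []):
    print(item.get('title'))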

Below, we will look at how to use Selenium to simulate browser behavior.

Creating a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

scrapy startproject URLCrawler
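
This creates a URLCrawler directory with Scrapy's standard project skeleton, roughly:

URLCrawler/
    scrapy.cfg            # deploy configuration file
    URLCrawler/           # the project's Python module
        __init__.py
        items.py
        middlewares.py    # we will edit this below
        pipelines.py
        settings.py       # we will edit this below
        spiders/
            __init__.py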

Our first Spider

This is the code for our first Spider. Save it in a file named my_spider.py under the URLCrawler/spiders directory in your project:

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 18 17:55:45 2018

@author: Administrator
"""

from scrapy import Spider, Request
from selenium import webdriver

class MySpider(Spider):
    name = "my_spider"

    def __init__(self):
        # Use a raw string for the Windows path so backslashes are not
        # treated as escape sequences.
        self.browser = webdriver.Firefox(executable_path=r'E:\software\python\geckodriver-v0.21.0-win64\geckodriver.exe')
        self.browser.set_page_load_timeout(30)

    def closed(self, spider):
        # Called when the spider finishes; shut the browser down with it.
        print("spider closed")
        self.browser.quit()

    def start_requests(self):
        # Only one page here; add a '{}' placeholder and .format() if pagination is needed.
        start_urls = ['http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2']
        for url in start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        # Save the rendered page to an HTML file named after the domain.
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)
        print('---------------------------------------------------')
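
If you would rather not have a visible browser window pop up, Firefox can usually be run headless. A minimal sketch, assuming a Selenium version that supports FirefoxOptions (the geckodriver path is the same as above):

from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('-headless')  # run Firefox without a visible window
browser = webdriver.Firefox(
    options=options,
    executable_path=r'E:\software\python\geckodriver-v0.21.0-win64\geckodriver.exe')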

middlewares.py

Add the following to it:

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
import time

class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        if spider.name == 'my_spider':
            try:
                # Let the real browser load the page, then scroll to the
                # bottom to trigger any lazy-loaded content.
                spider.browser.get(request.url)
                spider.browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            except TimeoutException as e:
                print('page load timed out')
                spider.browser.execute_script('window.stop()')
            time.sleep(2)  # crude wait for the JavaScript to finish rendering
            # Hand the rendered HTML back to Scrapy instead of downloading the URL again.
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                                encoding="utf-8", request=request)
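
The fixed time.sleep(2) is simple but fragile. A sketch of an alternative using Selenium's explicit waits, meant to replace the sleep inside process_request; the CSS selector below is a hypothetical placeholder for whatever element signals that the dynamic sections have rendered:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for a (hypothetical) marker element to appear
# instead of always sleeping for a fixed interval.
WebDriverWait(spider.browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.forum-list'))
)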

settings.py

Add the following:

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'URLCrawler.middlewares.SeleniumMiddleware': 543,
}
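
Since every request is rendered by a single shared browser instance, it may also be worth keeping Scrapy from issuing requests in parallel. A hedged suggestion for settings.py:

# One Selenium browser cannot render several pages at once.
CONCURRENT_REQUESTS = 1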

How to run our spider

To put our spider to work, go to the project's top-level directory and run:

scrapy crawl my_spider

You will find that the downloaded page now has the same content as the page viewed in a browser!

If you only need the text content, change the parse method in the spider to:

    def parse(self, response):
        # Extract every non-empty text node from the rendered page.
        # Swap in the commented-out lines to skip <script> and <style> text.
        #textlist_no_scripts = response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
        textlist_with_scripts = response.selector.xpath('//text()[normalize-space(.)]').extract()
        #with open('filename_no_scripts', 'w', encoding='utf-8') as f:
        with open('filename_with_scripts', 'w', encoding='utf-8') as f:
            for text in textlist_with_scripts:
                f.write(text.strip() + '\n')
        print('---------------------------------------------------')

The End.

Reposted from blog.csdn.net/sinat_40431164/article/details/81200207