Using the Scrapy framework with Selenium to obtain dynamically loaded data [Case study]

Original link: http://106.13.73.98/__/143/

Introduction


When using the Scrapy framework to crawl certain sites, the page data is often loaded dynamically. If Scrapy sends the request to the URL directly, it will never receive the dynamically loaded data. Observation shows, however, that when the same URL is opened in a browser, the corresponding dynamic data does get loaded. So, to obtain dynamically loaded data in Scrapy, we must use Selenium to drive a browser, let the browser request the URL, and take the dynamically loaded data from the browser.

Case study


  • Requirement: crawl the news data of the Domestic, International, Military, and Aviation sections of NetEase News.

  • Requirement analysis: after clicking a section's hyperlink and entering its page, we find that the news data shown on that page is loaded dynamically. Therefore we need Selenium to drive a browser in order to obtain the dynamically loaded data.

The workflow for using Selenium in Scrapy (a minimal skeleton of the four steps is sketched after the list):


  1. Override the spider's constructor and instantiate a Selenium browser object in it (the browser is instantiated only once).
  2. Override the spider's closed method and close the browser object inside it. This method is called when the spider finishes.
  3. Override the process_response method of the downloader (or spider) middleware to intercept the response object and replace the page data stored in it.
  4. Enable the downloader middleware in the settings file.
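
Below is a minimal skeleton of these four steps, condensed from the full code later in this post. The spider name, middleware class name, and driver path here are placeholders, not the project's actual names:

import scrapy
from selenium import webdriver
from scrapy.http import HtmlResponse


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://news.163.com/']

    def __init__(self):
        # Step 1: instantiate the browser once, in the spider's constructor
        self.bro = webdriver.Chrome(executable_path=r'/path/to/chromedriver')

    def closed(self, spider):
        # Step 2: close the browser when the spider finishes
        self.bro.quit()


class ExampleDownloaderMiddleware(object):
    def process_response(self, request, response, spider):
        # Step 3: let the browser load the page, then replace the response body
        spider.bro.get(request.url)
        return HtmlResponse(url=spider.bro.current_url, body=spider.bro.page_source,
                            encoding='utf-8', request=request)


# Step 4: enable the middleware in settings.py, for example:
# DOWNLOADER_MIDDLEWARES = {'Test.middlewares.ExampleDownloaderMiddleware': 543}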

Code


Spider file:

# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from aip import AipNlp  # pip install baidu-aip
from Test.items import TestItem


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://news.163.com/']  # NetEase homepage

    # Holds the URLs of the four section pages; it is used in the downloader middleware
    plate_page = []

    # Baidu AI, used to extract article keywords and category; see the docs: https://ai.baidu.com/docs#/NLP-Python-SDK/cf2f8fbe
    __APP_ID = '15225447'
    __API_KEY = 's5m43BMMEGGPaFGxeX3SsY7m'
    __SECRET_KEY = 'Lca9FEGpWNZW6yd8WWAHAyCyLovmi6rb'
    client = AipNlp(__APP_ID, __API_KEY, __SECRET_KEY)


    def __init__(self):
        # Instantiate a Chrome browser object; it will be used in the downloader middleware
        self.bro = webdriver.Chrome(executable_path=r'V:\Folder\software\谷歌浏览器V69-71版本的驱动\chromedriver.exe')
        # executable_path: path to your Chrome driver


    # Override the parent-class method; used to close the browser
    def closed(self, spider):
        """Executed when the spider program finishes. Note: it runs only once."""
        self.bro.quit()
        print('Crawl finished')


    def parse(self, response):
        # Get the links of all sections
        all_li_list = response.xpath('//div[@class="ns_area list"]/ul/li')
        # Extract the links of the target sections (3 - Domestic, 4 - International, 6 - Military, 7 - Aviation)
        sign_li_list = [all_li_list[i] for i in [3, 4, 6, 7]]

        # Parse the four extracted sections and visit their pages
        for li in sign_li_list:
            url = li.xpath('./a/@href').extract_first()
            self.plate_page.append(url)
            # Visit each section page
            yield scrapy.Request(url, callback=self.parse_plate_page)
            # callback: the callback function, i.e. the parsing method


    # Parses the four section pages
    def parse_plate_page(self, response):
        # Note: the data on this page is loaded dynamically; it is fetched in the downloader middleware
        div_list = response.xpath('//div[@class="ndi_main"]/div')  # Get the <div> of every article

        # Extract the basic information of each article
        for div in div_list:
            if not div.xpath('./a/img/@alt'): continue  # Skip articles with a different tag layout
            item = TestItem()
            item['title'] = div.xpath('./a/img/@alt').extract_first()  # Article title
            item['img_url'] = div.xpath('./a/img/@src').extract_first()  # Link of the article's title image
            detail_url = div.xpath('./a/@href').extract_first()  # URL of the article's detail page
            # Visit the detail page of each article
            yield scrapy.Request(detail_url, callback=self.parse_detail_page, meta={'item': item})
            # meta={'item': item}: passes the current item object to the parsing method


    # Parses every article's detail page
    def parse_detail_page(self, response):
        # First take out the item object passed in via meta
        item = response.meta['item']

        # Get the full article content, join it, and save it
        content = response.xpath('//div[@id="endText"]//text()').extract()
        item['content'] = ''.join(content).strip(' \n\t')

        # Call the Baidu AI API to extract the article's keywords and category.
        # There were encoding problems in practice; replacing \xa0 with an empty string solves them
        args = {'title': item['title'].replace(u'\xa0', u''), 'content': item['content'].replace(u'\xa0', u'')}
        # Extract the article keywords
        keys = self.client.keyword(**args)
        item['keys'] = ' '.join([dct.get('tag') for dct in keys.get('items')])  # Save the article keywords
        # Extract the article category
        kinds = self.client.topic(**args)
        item['kind'] = kinds.get('item')['lv1_tag_list'][0]['tag']  # Save the article category

        # Hand the prepared item object to the pipeline; all that remains is saving the data
        yield item 
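
Note: the executable_path argument used in __init__ above works with Selenium 3. On Selenium 4+, executable_path is deprecated and later removed; there the browser would be created with a Service object instead, and headless mode can be turned on so no browser window pops up while crawling. A minimal sketch under those assumptions (the driver path is again a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')           # run Chrome without opening a window
service = Service(r'/path/to/chromedriver')  # placeholder driver path
bro = webdriver.Chrome(service=service, options=options)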

Item definition file (items.py):
import scrapy


class TestItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()    # Article title
    img_url = scrapy.Field()  # Link of the article's title image
    content = scrapy.Field()  # Article content
    keys = scrapy.Field()     # Article keywords
    kind = scrapy.Field()     # Article category

Middleware file (middlewares.py):

from time import sleep
from scrapy.http import HtmlResponse  # Used to build the new response object


# Downloader middleware
class TestDownloaderMiddleware(object):

    def process_response(self, request, response, spider):
        """
        This method intercepts responses
        :param request: the request corresponding to the current response
        :param response: the response
        :param spider: the spider object
        :return:
        """

        # Intercept the responses of the four section pages here to obtain the dynamically loaded content

        # Requests that are not for the four section pages are passed through directly:
        if request.url not in spider.plate_page:
            return response


        # Anything that reaches this point is a response for one of the four sections; tamper with the response object below

        # Get the browser object created in the spider class
        bro = spider.bro
        # Send a GET request to the section page
        bro.get(url=request.url)
        sleep(1)

        # Scroll to the bottom of the page twice (to load more dynamic data)
        js = 'window.scrollTo(0, document.body.scrollHeight);'
        bro.execute_script(js)
        sleep(0.5)
        bro.execute_script(js)
        sleep(0.5)

        # Get the page source, which now contains the dynamically loaded data we need
        page_text = bro.page_source
        # Build a new response object holding the dynamically loaded data, and return it
        return HtmlResponse(url=bro.current_url, body=page_text, encoding='utf-8', request=request)
        # bro.current_url: the URL of the request
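
The fixed sleep() calls above are the simplest way to wait for the dynamic content, but they waste time on fast connections and may still be too short on slow ones. A sketch of a more robust alternative using Selenium's explicit waits, assuming bro is the browser object from the middleware above; the div.ndi_main selector is taken from the XPath used in parse_plate_page and is an assumption about the page structure:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds for the article list container to appear,
# instead of sleeping for a fixed amount of time
WebDriverWait(bro, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.ndi_main'))
)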

Pipeline file (pipelines.py):

"""去吧,创建你的数据表:
create table test01(
  id int primary key auto_increment,  -- 自增id
  title varchar(128),  -- 标题
  img_url varchar(128),  -- 标题图片对应的链接
  keyword varchar(64),  -- 关键字
  kind varchar(32),  -- 文章类型
  content text  -- 文章内容
);
"""

import pymysql


class TestPipeline(object):

    # Override the parent-class method; establishes the MySQL connection and creates a cursor
    def open_spider(self, spider):
        """Executed when the spider starts. Note: it runs only once."""
        self.conn = pymysql.Connect(
            host='localhost',
            port=3306,
            user='zyk',
            password='user@zyk',
            db='test',  # the database to use
            charset='utf8'  # encoding of the data
        )  # Establish the MySQL connection

        # Create a cursor
        self.cursor = self.conn.cursor()


    def process_item(self, item, spider):
        # Prepare the SQL statement first
        sql = 'insert into test01(title, img_url, keyword, kind, content) values(%s, %s, %s, %s, %s)'

        # Execute the transaction
        try:
            self.cursor.execute(sql, (item['title'], item['img_url'], item['keys'], item['kind'], item['content']))  # Write the data
            self.conn.commit()  # Commit
            print(item['title'], 'saved')
        except Exception as e:
            self.conn.rollback()  # Roll back
            print(e)

        return item


    # Override the parent-class method; closes the MySQL connection
    def close_spider(self, spider):
        """Executed when the spider finishes. Note: it runs only once."""
        self.cursor.close()  # Close the cursor
        self.conn.close()  # Close the connection

Settings file (settings.py):

# Spoof the request identity (User-Agent)
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36'

# Whether to obey the robots.txt protocol
ROBOTSTXT_OBEY = False

# Number of concurrent requests
CONCURRENT_REQUESTS = 100

# Disable cookies to improve crawling efficiency
COOKIES_ENABLED = False

# Raise the log level to lower CPU usage and improve crawling efficiency
LOG_LEVEL = 'ERROR'

# Disable retries (of failed URLs) to improve crawling efficiency
RETRY_ENABLED = False

# Enable the pipeline
ITEM_PIPELINES = {
    'Test.pipelines.TestPipeline': 300,
}

# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'Test.middlewares.TestDownloaderMiddleware': 543,
}
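
With everything in place, the crawl is started from the project root with scrapy crawl test, where test is the name attribute defined in the spider class; each parsed article ends up as a row in the test01 MySQL table created above.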

Origin www.cnblogs.com/gqy02/p/11308965.html