Crawling Ajax, JSON, and XML Pages with scrapy_selenium: A Douban Movie Example

Yiniu Cloud Proxy

Introduction

When developing web crawlers, we often encounter dynamically loaded pages whose data is not embedded directly in the HTML but is fetched asynchronously as Ajax, JSON, XML, and similar formats. Such pages are hard for a traditional scrapy crawler to parse directly. So how can we use scrapy_selenium to crawl them? This article introduces the basic principles and usage of scrapy_selenium and walks through a practical case.

Overview

scrapy_selenium is a library that combines scrapy and selenium. It lets us drive a real browser from within scrapy and thus crawl dynamically rendered pages. Its main features are:

  • It provides a SeleniumRequest class, which we yield in place of a normal scrapy Request whenever a page needs to be rendered by a browser.
  • It provides a SeleniumMiddleware class, a downloader middleware that fetches each SeleniumRequest with selenium and hands the rendered page back to scrapy as the response.
  • Spiders remain ordinary scrapy.Spider subclasses; the live browser driver for each rendered page is exposed through response.request.meta['driver'], so parse callbacks can keep interacting with the page (see the sketch below).
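Beyond url, callback, and meta, SeleniumRequest also accepts browser-oriented arguments such as wait_time, wait_until, screenshot, and script, per the scrapy-selenium README. A minimal sketch, assuming a hypothetical page with an element of id "data":

from scrapy import Spider
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

class WaitSpider(Spider):
    name = 'wait_spider'

    def start_requests(self):
        # wait_time/wait_until make the middleware block until the condition
        # holds; script runs arbitrary JavaScript once the page has loaded.
        yield SeleniumRequest(
            url='https://example.com',  # hypothetical target
            callback=self.parse,
            wait_time=10,
            wait_until=EC.presence_of_element_located((By.ID, 'data')),
            script='window.scrollTo(0, document.body.scrollHeight);',
        )

    def parse(self, response):
        # The middleware returns the rendered HTML, so ordinary scrapy
        # selectors also work on the response.
        self.logger.info('rendered %d bytes', len(response.text))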

Usage

To crawl pages served as Ajax, JSON, XML, and similar formats with scrapy_selenium, we follow these steps:

  • Install the scrapy_selenium library. It can be installed with pip:
pip install scrapy-selenium
  • Configure scrapy_selenium. Add the following to the settings.py file:
# Path to the selenium driver
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/chromedriver'
# Options for the selenium driver
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # run in headless mode
# Enable the selenium middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
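Because a SeleniumRequest is fetched by the browser rather than by scrapy's downloader, a proxy can also be set at the browser level via Chrome's own command-line flag. A hedged sketch (the endpoint below is a placeholder, and note that --proxy-server does not accept inline user:pass credentials):

# Hypothetical variant: route the headless browser itself through a proxy.
SELENIUM_DRIVER_ARGUMENTS = [
    '--headless',
    '--proxy-server=http://proxy.example.com:3111',  # placeholder endpoint
]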
  • Write the spider. Spiders stay regular scrapy.Spider subclasses; we override the start_requests method to yield SeleniumRequest objects and parse the rendered responses in the callbacks, as follows:
from scrapy import Spider
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By

class MySpider(Spider):
    name = 'my_spider'

    def start_requests(self):
        # Send a selenium request, specifying the callback and the meta data
        yield SeleniumRequest(
            url='https://example.com',          # target URL
            callback=self.parse,                # callback function
            meta={'proxy': self.get_proxy()}    # meta data carrying the proxy info
        )

    def parse(self, response):
        # Handle the rendered response: extract data or follow links.
        # The live browser driver is exposed via response.request.meta['driver'].
        driver = response.request.meta['driver']
        data = driver.find_element(By.XPATH, '//div[@id="data"]')  # locate the data element by XPath
        print(data.text)  # print the element's text

    def get_proxy(self):
        # Configure the Yiniu Cloud crawler-enhanced proxy.
        # Returns the proxy as a string in the form 'http://user:pass@host:port'.
        proxyHost = "www.16yun.cn"
        proxyPort = "3111"
        proxyUser = "16YUN"
        proxyPass = "16IP"
        return f'http://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}'
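With the settings and spider in place, the crawl is started the usual scrapy way:

scrapy crawl my_spider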

Case Study

To demonstrate how scrapy_selenium handles pages served as Ajax, JSON, XML, and similar formats, we take Douban Movie as an example and crawl its movie list and detail pages. The movie list on Douban Movie is loaded asynchronously via Ajax, and the detail-page data comes back as JSON. Our goal is to crawl each movie's name, rating, summary, and poster image, and save them locally.

  • First, we create a scrapy project and install the scrapy_selenium library:
scrapy startproject douban
cd douban
pip install scrapy-selenium
  • Then, we configure scrapy_selenium by modifying the settings.py file as follows:
# Path to the selenium driver
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/chromedriver'
# Options for the selenium driver
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # run in headless mode
# Enable the selenium middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
# Enable the image pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 300
}
# Directory where downloaded images are stored
IMAGES_STORE = 'images'
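The spider below imports DoubanItem, which this article does not otherwise show. A minimal douban/items.py consistent with the fields the spider fills (the field names are inferred from the spider code) could look like this:

import scrapy

class DoubanItem(scrapy.Item):
    name = scrapy.Field()        # movie title
    url = scrapy.Field()         # detail-page URL
    rating = scrapy.Field()      # Douban rating
    summary = scrapy.Field()     # plot summary
    image_urls = scrapy.Field()  # poster URLs read by ImagesPipeline
    images = scrapy.Field()      # download results written back by ImagesPipeline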
  • Next, we write the spider itself, creating the douban/spiders/douban.py file as follows:
from scrapy import Spider
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from douban.items import DoubanItem

class DoubanSpider(Spider):
    name = 'douban'

    def start_requests(self):
        # Send a selenium request, specifying the callback and the meta data
        yield SeleniumRequest(
            url='https://movie.douban.com/',    # target URL
            callback=self.parse,                # callback function
            meta={'proxy': self.get_proxy()}    # meta data carrying the proxy info
        )

    def parse(self, response):
        # Handle the rendered response: extract data and follow links.
        # The live browser driver is exposed via response.request.meta['driver'].
        driver = response.request.meta['driver']
        movies = driver.find_elements(By.XPATH, '//div[@class="list"]/a')  # locate the list of movie elements
        for movie in movies:  # iterate over each movie element
            item = DoubanItem()  # create a DoubanItem to hold the data
            item['name'] = movie.get_attribute('title')  # movie title attribute
            item['url'] = movie.get_attribute('href')    # detail-page link attribute
            yield SeleniumRequest(  # request the detail page, passing the item along in the meta data
                url=item['url'],
                callback=self.parse_detail,
                meta={'item': item, 'proxy': self.get_proxy()}
            )

    def parse_detail(self, response):
        # Handle the rendered detail page.
        driver = response.request.meta['driver']
        item = response.meta['item']  # retrieve the item from the meta data
        data = driver.find_element(By.XPATH, '//div[@id="info"]')  # locate the info element
        item['rating'] = data.find_element(By.XPATH, './/strong').text  # rating text
        item['summary'] = data.find_element(By.XPATH, './/span[@property="v:summary"]').text  # summary text
        item['image_urls'] = [data.find_element(By.XPATH, './/img[@rel="v:image"]').get_attribute('src')]  # poster image URL
        yield item  # hand the item to the pipelines

    def get_proxy(self):
        # Configure the Yiniu Cloud crawler-enhanced proxy.
        # Returns the proxy as a string in the form 'http://user:pass@host:port'.
        proxyHost = "www.16yun.cn"
        proxyPort = "3111"
        proxyUser = "16YUN"
        proxyPass = "16IP"
        return f'http://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}'
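Run the spider from the project root:

scrapy crawl douban

For each item whose image_urls field is populated, ImagesPipeline downloads the poster into the IMAGES_STORE directory, by default under a full/ subfolder with a filename derived from a SHA1 hash of the image URL.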

Conclusion

Through the above introduction and case study, we can see that scrapy_selenium is a powerful and flexible tool that lets us crawl pages served as Ajax, JSON, XML, and similar formats without writing complex JavaScript or reaching for other tools. It also composes with scrapy's other components and features, such as the image pipeline, proxy middleware, and data storage, improving both the efficiency and the quality of a crawler.
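As one example of the data-storage side, scrapy's built-in feed exports can write the scraped items to a file with no extra code; the -O flag (available since scrapy 2.1) overwrites the output file on each run:

scrapy crawl douban -O movies.json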
