[Python_Scrapy Study Notes (14)] Downloading files with the file pipeline of the Scrapy framework (multi-level page crawling with the Scrapy framework)

Preface

This article introduces how to download files with the file pipeline of the Scrapy framework (and, along the way, how to crawl multi-level pages with Scrapy). As a demonstration it crawls the PPT templates of the 1PPT website (www.1ppt.com); since reaching the download links requires walking through several levels of pages, the example also serves as a detailed explanation of how to crawl multi-level pages with the Scrapy framework.

Main text

1. How file downloading works in the Scrapy framework

The Scrapy framework provides the file pipeline class FilesPipeline for downloading files. To use it, import it with from scrapy.pipelines.files import FilesPipeline and override the get_media_requests() method; if you have requirements for the saved file names, also override the file_path() method. The directory the files are saved to is configured through the FILES_STORE setting in settings.py.

2. Steps for downloading files with the Scrapy framework

  1. Spider file: yield the item that carries the file download link to the pipeline.

  2. Pipeline file: import and inherit Scrapy's FilesPipeline class, then override the get_media_requests() and file_path() methods:

    from scrapy.pipelines.files import FilesPipeline

    class XxxPipeline(FilesPipeline):
        def get_media_requests(self, item, info):
            # hand the file download link to the scheduler so it gets enqueued
            ...

        def file_path(self, request, response=None, info=None, *, item=None):
            # build the file name here and return it
            return filename
    
  3. settings.py: in the global configuration file, set the save directory through FILES_STORE = "path" and enable the pipeline, as shown in the sketch below.
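
    A minimal settings.py sketch for the two settings involved (the dotted pipeline path is a placeholder and has to match your own project and pipeline class names):

    FILES_STORE = "./files/"                              # directory where downloaded files are stored
    ITEM_PIPELINES = {"xxx.pipelines.XxxPipeline": 300}   # enable the custom file pipeline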

3. Case: downloading files with the Scrapy framework

  1. Case requirements: crawl the PPT templates under each column category of the 1PPT website (www.1ppt.com) and save them locally.

  2. First-level page analysis:
    First-level page URL: http://www.1ppt.com/xiazai/
    Extract data:
    li_list = "//div[@class='col_nav clearfix']/ul/li" (the first <li> is not needed and is skipped)
    1.1> Column category name: ./a/text()
    1.2> Column category link: ./a/@href (a relative path that has to be spliced with https://www.1ppt.com)

  3. Second-level page analysis: enter a column category page
    Extract data: li_list = "//div/dl/dd/ul[@class='tplist']/li"
    2.1> PPT name: ./h2/a/text()
    2.2> Detail-page link: ./h2/a/@href (a relative path that has to be spliced with https://www.1ppt.com)

  4. Third-level page analysis: enter the PPT detail page
    Extract data:
    3.1> Link to the download page: //ul[@class='downurllist']/li/a/@href (a relative path that has to be spliced with https://www.1ppt.com)

  5. Fourth-level page analysis: enter the PPT download page
    Extract data: //ul[@class='downloadlist']/li[@class='c1']/a/@href
    4.1> The actual PPT download link --> handed over to the project pipeline file for processing (the expressions above can be verified with the Scrapy shell sketch below)
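    Before writing the spider, the XPath expressions and URL splicing above can be sanity-checked interactively; a quick sketch using the Scrapy shell (commands only, output omitted):

    # scrapy shell "http://www.1ppt.com/xiazai/"
    li_list = response.xpath("//div[@class='col_nav clearfix']/ul/li")
    li_list[1].xpath("./a/text()").get()        # one column category name
    href = li_list[1].xpath("./a/@href").get()  # relative category link
    response.urljoin(href)                      # an alternative to manually splicing with the site root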

  6. Create the Scrapy project and write the items.py file.
    Be clear about what needs to be carried through to the pipeline: the column category name, the specific PPT file name, and the PPT download link.

    import scrapy
    
    
    class PptItem(scrapy.Item):
        # What needs to be defined here? What does the pipeline file need?
        # the column category name, the specific PPT file name, and the PPT download link
        parent_name = scrapy.Field()
        ppt_name = scrapy.Field()
        download_url = scrapy.Field()
    
    
  7. Write the spider file:

    import scrapy
    from ..items import PptItem
    import json
    
    class PptSpider(scrapy.Spider):
        name = "ppt"
        allowed_domains = ["www.1ppt.com"]
        start_urls = ["http://www.1ppt.com/xiazai/"]
    
        def parse(self, response):
            """
            Parse function for the first-level page: extract the category names and links.
            :param response:
            :return:
            """
            li_list = response.xpath("//div[@class='col_nav clearfix']/ul/li")
            for li in li_list[1:]:  # start from the second element (the first <li> is not needed)
                item = PptItem()  # create the item object
                item["parent_name"] = li.xpath("./a/text()").get()
                class_href = "https://www.1ppt.com" + li.xpath("./a/@href").get()
                # hand class_href to the scheduler to be enqueued
                yield scrapy.Request(url=class_href, meta={"meta1": item}, callback=self.parse_second_page)
    
        def parse_second_page(self, response):
            """
            Parse function for the second-level page: extract the PPT name and the detail-page link.
            :return:
            """
            # receive the meta object passed from the previous parse function
            meta1 = response.meta["meta1"]
            # parse and extract the data
            li_list = response.xpath("//div/dl/dd/ul[@class='tplist']/li")
            for li in li_list:
                item = PptItem()  # create the item object
                item["parent_name"] = meta1["parent_name"]
                item["ppt_name"] = li.xpath("./h2/a/text()").get()
                ppt_info_url = "https://www.1ppt.com" + li.xpath("./h2/a/@href").get()
                # hand ppt_info_url to the scheduler to be enqueued
                yield scrapy.Request(url=ppt_info_url, meta={"meta2": item}, callback=self.parse_third_page)
    
        def parse_third_page(self, response):
            """
            Parse function for the third-level page: extract the link that leads to the download page.
            :param response:
            :return:
            """
            meta2 = response.meta["meta2"]
            enter_download_page = "https://www.1ppt.com" + response.xpath("//ul[@class='downurllist']/li/a/@href").get()
            # hand it to the scheduler directly
            yield scrapy.Request(url=enter_download_page, meta={"item": meta2}, callback=self.parse_fourth_page)
    
        def parse_fourth_page(self, response):
            """
            Parse function for the fourth-level page: extract the actual PPT download link.
            :param response:
            :return:
            """
            item = response.meta["item"]
            item["download_url"] = response.xpath("//ul[@class='downloadlist']/li[@class='c1']/a/@href").get()
            # one complete item has been extracted; hand it to the pipeline for processing
            yield item
    
    

    Note how the meta parameter carries the item from one level to the next until the complete record is yielded on the fourth-level page; a cb_kwargs alternative is sketched below.
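
    As an aside, Scrapy 1.7+ also offers cb_kwargs for passing data between callbacks, which avoids packing the item into meta; a minimal sketch of the same hand-off (not part of the original code):

    # in parse(): pass the partially-built item as a callback keyword argument
    yield scrapy.Request(
        url=class_href,
        callback=self.parse_second_page,
        cb_kwargs={"item": item},
    )

    # the item then arrives directly as an argument of the next callback
    def parse_second_page(self, response, item):
        ...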

  8. In the pipeline file, import and inherit Scrapy's FilesPipeline class, and override the get_media_requests() and file_path() methods:

    import os
    import scrapy
    from scrapy.pipelines.files import FilesPipeline
    
    
    class PptPipeline(FilesPipeline):
        def get_media_requests(self, item, info):
            """
            Override get_media_requests() and hand the file download link to the scheduler so it gets enqueued.
            :param item:
            :param info:
            :return:
            """
            yield scrapy.Request(url=item["download_url"], meta={"item": item})
    
        def file_path(self, request, response=None, info=None, *, item=None):
            """
            Override file_path() to adjust the save path and file name.
            :param request:
            :param response:
            :param info:
            :param item:
            :return:
            """
            item = request.meta["item"]
            # filename example: 工作总结PPT/xxxxxxxxxxppt.zip (category folder / PPT name + original extension)
            filename = '{}/{}{}'.format(
                item["parent_name"],
                item["ppt_name"],
                os.path.splitext(item["download_url"])[1]
            )
            return filename
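
    One caveat about file_path(): parent_name and ppt_name come straight from the page, so they may contain characters that are awkward in file names. A small sanitizing helper could be applied to both parts before building the path; a minimal sketch (the helper name is illustrative, not part of the original code):

    import re

    def safe_name(name: str) -> str:
        # replace characters that are invalid or awkward in file names with underscores
        return re.sub(r'[\\/:*?"<>|]', "_", name.strip())

    # usage inside file_path():
    # filename = "{}/{}{}".format(safe_name(item["parent_name"]),
    #                             safe_name(item["ppt_name"]),
    #                             os.path.splitext(item["download_url"])[1])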
    
    
  9. In the global configuration file, specify the location where the file is saved through FILES_STORE="path"

    FILES_STORE = "./pptfiles/"
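
    The custom pipeline also has to be enabled in settings.py, otherwise PptPipeline is never invoked. Assuming the Scrapy project module is named ppt (adjust the dotted path to the real project name):

    ITEM_PIPELINES = {"ppt.pipelines.PptPipeline": 300}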
    
  10. Create a run.py file to run the crawler:

    from scrapy import cmdline
    cmdline.execute("scrapy crawl ppt".split())
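
    (Equivalently, the spider can be started from the project root with the command scrapy crawl ppt; run.py simply wraps that command.)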
    
  11. Running result: the downloaded PPT archives are saved under ./pptfiles/, in one sub-folder per column category. (Result screenshots omitted.)

Origin blog.csdn.net/sallyyellow/article/details/130206148