[Python_Scrapy study notes (13)] Scraping images with the Scrapy framework's image pipeline


Preface

This article shows how to scrape images with the Scrapy framework's built-in image pipeline, using the 360 Images site as the example.

Main text

1. The principle of scraping images with the Scrapy framework

Scrapy provides the image pipeline class ImagesPipeline for downloading page images. To use it, import it, subclass it, and override the get_media_requests() method; if you need control over the saved file names, also override the file_path() method. The IMAGES_STORE setting in settings.py specifies where the downloaded files are saved.

from scrapy.pipelines.images import ImagesPipeline
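
Note: ImagesPipeline depends on the Pillow library for image processing, so install it first (pip install Pillow); Scrapy will report an error when the pipeline is enabled without it.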

2. Steps for scraping images with the Scrapy framework

  1. Spider file: extract the image links and yield them directly to the pipeline for processing;

  2. Pipeline file: import and subclass Scrapy's ImagesPipeline class, and override the get_media_requests() and file_path() methods;

    from scrapy.pipelines.images import ImagesPipeline

    class XxxPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            pass

        def file_path(self, request, response=None, info=None, *, item=None):
            # build and return the file name
            return filename
    
  3. settings.py: in the global configuration file, set IMAGES_STORE = "path" to specify where the downloaded files are saved.

3. An image-scraping example with the Scrapy framework

  1. Case requirement: scrape images from the "beauty" channel of 360 Images and save them locally as ./images/xxx.jpg

  2. URL address: https://image.so.com/?src=tab_web

  3. URL template of the data API to crawl: https://image.so.com/zjl?sn={}&ch=beauty
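    Here sn appears to act as a record offset: the spider below requests sn=30, 60, 90, 120, 150, pulling one batch of image records per request.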

  4. Analyze the network requests with the browser's F12 developer tools.

  5. Inspect the network response to locate the JSON data containing the required fields.
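    Based on the fields the spider reads below, the JSON response looks roughly like this (a sketch; the values are placeholders):

    # Abridged shape of one API response; only the fields used by the spider are shown
    response_data = {
        "list": [
            {"qhimg_url": "https://p0.qhimg.com/xxx.jpg", "title": "xxx"},
        ],
    }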

  6. Create a Scrapy project and write the items.py file:

    import scrapy
    
    
    class SoItem(scrapy.Item):
        # image URL
        image_url = scrapy.Field()
        # image title
        image_title = scrapy.Field()
    
  7. Write the spider file:

    import scrapy
    import json
    from ..items import SoItem
    
    
    class SoSpider(scrapy.Spider):
        name = "so"
        allowed_domains = ["image.so.com"]
        # start_urls = ["http://image.so.com/"]
        url = 'https://image.so.com/zjl?sn={}&ch=beauty'
    
        def start_requests(self):
            """
            生成所有要抓取的url地址,一次性交给调度器入队列
            :return:
            """
            for sn in range(30, 151, 30):
                page_url = self.url.format(sn)
                yield scrapy.Request(url=page_url, callback=self.parse)
    
        def parse(self, response):
            """
            提取图片的链接
            :param response:
            :return:
            """
            html = json.loads(response.text)
            for one_image_list in html["list"]:
                item = SoItem()
                item["image_url"] = one_image_list["qhimg_url"]
                item["image_title"] = one_image_list["title"]
                # Once the image link is extracted, hand the item straight to the pipeline
                yield item
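
    Since start_requests() is overridden, the default start_urls list is not needed (hence it is commented out above); every generated request is handed straight to the scheduler's queue.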
    
    
  8. In the pipeline file, import and subclass Scrapy's ImagesPipeline, and override the get_media_requests() and file_path() methods:

    import scrapy
    from scrapy.pipelines.images import ImagesPipeline
    
    
    class SoPipeline(ImagesPipeline):
        # Override get_media_requests(): yield a Request for each image URL
        # so the scheduler can enqueue it
        def get_media_requests(self, item, info):
            yield scrapy.Request(url=item["image_url"],
                                 meta={"title": item["image_title"]})
    
        # Override file_path() to control the saved file path and name
        def file_path(self, request, response=None, info=None, *, item=None):
            image_title = request.meta['title']
            filename = image_title + '.jpg'  # build the image file name
            return filename
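
    One caveat: a raw title can contain characters such as / that are invalid in file names, and duplicate titles overwrite each other. A hedged variant of file_path() that sanitizes the name (an illustrative sketch, not part of the original project):

    import re

    def file_path(self, request, response=None, info=None, *, item=None):
        # Replace characters that are invalid on common filesystems
        safe_title = re.sub(r'[\\/:*?"<>|]', '_', request.meta['title'])
        return safe_title + '.jpg'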
    
    
  9. In the global configuration file, set IMAGES_STORE to specify where the downloaded images are saved:

    # Image save path
    # Note: with the default file_path(), images would go into a 'full'
    # subfolder under this directory; since file_path() is overridden above,
    # they are saved directly under it
    IMAGES_STORE = './images/'
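
    In addition, the custom pipeline must be registered in ITEM_PIPELINES, otherwise it will never run. A minimal sketch, assuming the project package is named so (adjust the module path to your own project):

    # Register the image pipeline (the 'so.pipelines' path is an assumption;
    # use your own project's module name)
    ITEM_PIPELINES = {
        "so.pipelines.SoPipeline": 300,
    }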
    
  10. Create a run.py file to run the crawler:

    from scrapy import cmdline
    
    cmdline.execute("scrapy crawl so".split())
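
    Equivalently, run scrapy crawl so from the project root. When the crawl finishes, the downloaded .jpg files appear under the IMAGES_STORE directory.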
    
    
