Scrapy ImagesPipeline: downloading images

1. What is ImagesPipeline?

      ImagesPipeline is a pipeline class that ships with Scrapy and is used to process images, i.e. to download the images an item refers to onto the local disk while crawling.

2. Advantages of ImagesPipeline:

  1. Converts all downloaded images to a common format (JPG) and mode (RGB)
  2. Avoids re-downloading images that were downloaded recently
  3. Thumbnail generation
  4. Filtering out images that are too small (width/height check; see the settings sketch right after this list)
  5. Asynchronous download
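
These behaviors are switched on through project settings; a minimal sketch with illustrative values that are not taken from the original project:

# settings.py — illustrative values, not from the original article
IMAGES_EXPIRES = 90                  # skip URLs already downloaded within the last 90 days
IMAGES_THUMBS = {                    # generate thumbnails next to each full-size image
    "small": (50, 50),
    "big": (270, 270),
}
IMAGES_MIN_WIDTH = 110               # silently drop images smaller than 110x110 px
IMAGES_MIN_HEIGHT = 110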

3. ImagesPipeline workflow

  • In the spider, crawl an item and put the image URLs into its image_urls field
  • The item returned from the spider is passed to the item pipeline
  • When the item reaches the ImagesPipeline, the Scrapy scheduler and downloader are used to schedule and download the URLs listed in image_urls; the item is held at this pipeline stage until the downloads finish
  • After an image is downloaded successfully, information such as the download path, the original URL, and a checksum is written into the images field (a sketch of this field follows the list)
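
For reference, a successful run leaves one dict per downloaded file in images, roughly as below (illustrative values; the exact keys depend on the Scrapy version):

# illustrative shape of item['images'] after the pipeline finishes
item['images'] = [
    {
        'url': 'http://example.com/original-image.jpg',               # source URL
        'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',  # relative to IMAGES_STORE
        'checksum': '2b00042f7481c7b056c4b410d28f33cf',               # MD5 of the downloaded body
        'status': 'downloaded',
    },
]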

4. Use ImagesPipeline to download a full page of "beautiful women" wallpapers from netbian.com

    (1) Web page analysis

 (screenshots omitted: the listing page links to detail pages, and each detail page carries the full-size image and its title)

(2) Create the project

    scrapy startproject Uis
    cd Uis
    scrapy genspider -t crawl ai_img xx.com

This creates the Uis project and a CrawlSpider-template spider named ai_img (the xx.com placeholder domain is replaced with the real one inside the spider).

 (3) Modify the settings.py file

        USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
        ROBOTSTXT_OBEY = False
        DOWNLOAD_DELAY = 1

(4) Write the spider file ai_img.py

   First check the ImagesPipeline source for the item field names it expects by default.
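
A quick way to confirm those defaults (assumes a Scrapy version that exposes these class attributes, which recent releases do):

# check the ImagesPipeline defaults at runtime
from scrapy.pipelines.images import ImagesPipeline

print(ImagesPipeline.DEFAULT_IMAGES_URLS_FIELD)    # -> "image_urls", the field URLs are read from
print(ImagesPipeline.DEFAULT_IMAGES_RESULT_FIELD)  # -> "images", the field results are written to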

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import UisItem


class AiImgSpider(CrawlSpider):
    name = "ai_img"
    allowed_domains = ["netbian.com"]
    start_urls = ["http://www.netbian.com/mei/index.htm"]  # start URL (the listing page)

    rules = (
        Rule(LinkExtractor(allow=r"^http://www.netbian.com/desk/(.*?)$"),  # detail page URLs
             callback="parse_item", follow=False),)

    def parse_item(self, response):
        # create the item object
        item = UisItem()

        # image URL (a single string) that will be handed to the pipeline
        url_ = response.xpath('//div[@class="pic"]//p/a/img/@src').get()

        # image title
        title_ = response.xpath('//h1/text()').get()

        # note: image_urls must be a list
        item['image_urls'] = [url_]
        item['title_'] = title_

        return item

(5) Write the items.py file

class UisItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # image_urls is the default field name that ImagesPipeline looks for (see its source)
    image_urls = scrapy.Field()
    title_ = scrapy.Field()

(6) Write the pipeline in pipelines.py

    First look at the ImagesPipeline source file, in particular:

   1) file_path(), which decides the default folder the images are saved into

  2) get_media_requests(), which is where the Item object can be accessed
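
A rough paraphrase of those two methods from the Scrapy source (not an exact copy; details and signatures vary slightly between versions):

# paraphrased from the ImagesPipeline source — for reference only
def file_path(self, request, response=None, info=None, *, item=None):
    # 1) default save location: IMAGES_STORE/full/<sha1 of the image url>.jpg
    image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
    return f"full/{image_guid}.jpg"

def get_media_requests(self, item, info):
    # 2) the Item object is available here; one Request is built per URL in image_urls
    urls = ItemAdapter(item).get(self.images_urls_field, [])
    return [Request(u) for u in urls]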

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
import hashlib

from itemadapter import ItemAdapter
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes

# inherit from ImagesPipeline
class UisPipeline(ImagesPipeline):

    # Override option 1: only change the default save path
    # def file_path(self, request, response=None, info=None, *, item=None):
    #     image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
    #     # change the default folder
    #     return f"desk/{image_guid}.jpg"

    # Override option 2: change both the folder and the file name
    def file_path(self, request, response=None, info=None, *, item=None):
        # get the item object passed along in request.meta
        item_ = request.meta.get('db')
        # build the file name from the image title
        image_guid = item_['title_'].replace(' ', '').replace(',', '')
        print(image_guid)
        # change the default folder
        return f"my/{image_guid}.jpg"

    # Override get_media_requests so the item (and its title) travels with each request
    def get_media_requests(self, item, info):
        urls = ItemAdapter(item).get(self.images_urls_field, [])
        # pass the item object via request.meta
        return [Request(u, meta={'db': item}) for u in urls]
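
Not part of the original article, but a common companion override that could be added to UisPipeline is item_completed(), the hook where the images field described in section 3 gets filled in; a minimal sketch:

    # optional extension (not in the original article): check the download results
    def item_completed(self, results, item, info):
        # keep the parent behaviour, which writes the successful results into item['images']
        item = super().item_completed(results, item, info)
        if not any(ok for ok, _ in results):
            print(f"no image downloaded for {item.get('title_')}")
        return item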

 (7) Enable the image pipeline in settings.py

ITEM_PIPELINES = {
    # ordinary pipeline
   # "Uis.pipelines.UisPipeline": 300,
   # "scrapy.pipelines.images.ImagesPipeline": 301,  # enable the built-in image pipeline
    "Uis.pipelines.UisPipeline": 302,  # enable the custom image pipeline
}

# root folder for the downloaded images
IMAGES_STORE = './'

 (8) Run: scrapy crawl ai_img

 (9) Results

With IMAGES_STORE = './' and the file_path() override above, each image is saved as ./my/<title>.jpg (result screenshots omitted).


Origin blog.csdn.net/hlx20080808/article/details/131062334