1. What is ImagesPipeline?
ImagesPipeline is scrapy's own class, used to process images (download images to the local when crawling).
2. Advantages of ImagesPipeline:
- Convert downloaded images into common jpg and rgb formats
- Avoid repeated downloads
- Thumbnail generation
- Image size filter
- Asynchronous download
3. ImagesPipeline workflow
- Crawl an item and put the urls of the image into the image_urls field
- The item returned from the spider is passed to the item pipeline
- When the item is passed to the imagepipeline, the scrapy scheduler and downloader will be called to complete the scheduling and downloading of the urls in image_urls.
- After the image download is completed successfully, information such as the image download path, URL, and checksum will be filled in the images field.
4. Use ImagesPipeline to download full-page pictures of beautiful women
(1) Web page analysis
Details page information
(2) Create project
scrapy start project Uis cd Uis scrapy genspider -t crawl ai_img xx.com
(3) Modify the setting.py file
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36" ROBOTSTXT_OBEY = False DOWNLOAD_DELAY = 1
(4) Write the spider file ai_img.py
First view the ImagesPipeline source file
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import *
class AiImgSpider(CrawlSpider):
name = "ai_img"
allowed_domains = ["netbian.com"]
start_urls = ["http://www.netbian.com/mei/index.htm"] # 起始url
rules = (
Rule(LinkExtractor(allow=r"^http://www.netbian.com/desk/(.*?)$"), #详情页的路径
callback="parse_item", follow=False),)
def parse_item(self, response):
#创建item对象
item =UisItem()
# 图片url ->保存到管道中 是字符串类型
url_=response.xpath('//div[@class="pic"]//p/a/img/@src').get()
#图片名称
title_=response.xpath('//h1/text()').get()
# 注意:必须是列表的形式
item['image_urls']=[url_]
item['title_']=title_
return item
(5) Write item.py file
class UisItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
#默认字段image_urls,查看源码
image_urls=scrapy.Field()
title_=scrapy.Field()
pass
(6) Write pipelines pipelines.py
First view the ImagesPipeline source file
1) Default saved folder
2) Get the Item object
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
import hashlib
from itemadapter import ItemAdapter
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes
#继承ImagesPipeline
class UisPipeline(ImagesPipeline):
# 重写1:直接修改默认路径
# def file_path(self, request, response=None, info=None, *, item=None):
# image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
# # 修改默认文件夹路径
# return f"desk/{image_guid}.jpg"
#重写2:需要修改文件夹和文件名
def file_path(self, request, response=None, info=None, *, item=None):
#获取item对象
item_=request.meta.get('db')
#获取图片名称
image_guid = item_['title_'].replace(' ','').replace(',','')
print(image_guid)
# 修改默认文件夹路径
return f"my/{image_guid}.jpg"
# 重写-item对象的图片名称数据
def get_media_requests(self, item, info):
urls = ItemAdapter(item).get(self.images_urls_field, [])
# 传递item对象
return [Request(u, meta={'db':item}) for u in urls]
(7) Set settings.py and open the image pipeline
ITEM_PIPELINES = {
#普通管道
# "Uis.pipelines.UisPipeline": 300,
# "scrapy.pipelines.images.ImagesPipeline": 301, #图片的管道开启
"Uis.pipelines.UisPipeline": 302, #自定义图片的管道开启
}
# 保存下载图片的路径
IMAGES_STORE='./'
(8) Run: scrapy crawl ia_img
(9) Results display