Scraping Web Page Images with the Scrapy Framework: A Detailed Walkthrough

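Preface: this article uses the Scrapy framework to scrape images from a web page and store them persistently. Scrapy's image storage requires the Pillow library, so install it first.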

Installation: pip install Pillow
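
Before running the spider, it can help to confirm that the installation worked. The following is a minimal sketch; the helper name pillow_available is made up for this illustration, and it only checks that the PIL package (which Pillow installs) can be found:

```python
import importlib.util


def pillow_available() -> bool:
    """Return True if the PIL package (installed by Pillow) can be located."""
    # find_spec returns None when the package is not installed.
    return importlib.util.find_spec("PIL") is not None


if __name__ == "__main__":
    print("Pillow installed:", pillow_available())
```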

Target URL: https://sc.chinaz.com/tupian/huaxuetupian.html

Spider source code:

import scrapy
from imgsPro.items import ImgsproItem
import time

class ImgsSpider(scrapy.Spider):
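    # Spider name; this is the name used to run the spider (scrapy crawl imgs)
    name = 'imgs'
    # Domains the spider is allowed to request; leaving this commented out is recommended
    # allowed_domains = ['sc.chinaz.com']
    # Start URL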
    start_urls = ['https://sc.chinaz.com/tupian/huaxuetupian.html']

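    # Scrapy's default callback for parsing responses
    def parse(self, response):
        # Use XPath to select the nodes holding each image page's URL and name
        tree = response.xpath('//div[@id="container"]/div')
        # Extract the image page URLs (href)
        img_page_url = tree.xpath('./div/a/@href').extract()
        # Extract the image names (alt)
        imgalt = tree.xpath('./div/a/@alt').extract()
        # Loop over the pairs and request each image page
        for page, alt in zip(img_page_url, imgalt):
            # Complete the URL (many scraped URLs are not absolute)
            page = response.urljoin(page)
            # Instantiate the item object (an item behaves like an empty dict)
            item = ImgsproItem()
            # Store the scraped value in the item, dict-style
            item['alt'] = alt
            # Request each image page separately, passing the item along via meta.
            # Note: time.sleep blocks Scrapy's event loop; the DOWNLOAD_DELAY setting is the usual alternative
            time.sleep(0.6)
            yield scrapy.Request(url=page, callback=self.imgs_parse, meta={'item': item})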
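    # Custom callback for parsing each image page
    def imgs_parse(self, response):
        # Retrieve the item (it was created in parse, so it has to be received here through meta)
        item = response.meta['item']
        time.sleep(1.7)
        # Use XPath to extract the image URL from the image page (.extract_first() returns the first match)
        img_url = response.xpath('//div[@class="imga"]/a/@href').extract_first()
        if img_url is None:
            return  # no image link found on this page
        img_url = response.urljoin(img_url)  # complete the URL
        item['img_url'] = img_url  # store the URL in the item so it can be handed to the pipeline
        yield item  # submit the image name and image URL to the pipeline for download and storage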
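
The spider has to complete the scraped links because they are not always absolute. As a rough illustration of how such links resolve against the page URL, here is a small standalone sketch using only the standard library. The helper name absolutize and the example.htm paths are made up for this illustration and are not taken from the target site:

```python
from urllib.parse import urljoin

# The page the links were scraped from (the article's start URL).
PAGE_URL = "https://sc.chinaz.com/tupian/huaxuetupian.html"


def absolutize(base: str, href: str) -> str:
    """Resolve a scraped href against the URL of the page it came from."""
    return urljoin(base, href)


if __name__ == "__main__":
    # Protocol-relative, root-relative and already-absolute links
    # all resolve to the same absolute URL.
    for href in (
        "//sc.chinaz.com/tupian/example.htm",
        "/tupian/example.htm",
        "https://sc.chinaz.com/tupian/example.htm",
    ):
        print(absolutize(PAGE_URL, href))
```

Unlike prefixing 'https:' by hand, this approach does not break when a link already carries a scheme.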

Item source code:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ImgsproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Fields that the spider stores in the item (alt is the image name, img_url is the image URL)
    alt = scrapy.Field()
    img_url = scrapy.Field()

Pipeline source code:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
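# Everything below has to be written by hand
# Import the ImagesPipeline class (it handles image data)
from scrapy.pipelines.images import ImagesPipeline
import time
import scrapy  # required for scrapy.Request
# Define an image pipeline class
class ImgsproPipeline(ImagesPipeline):
    # This print runs once, when the class body is executed at import time
    print('Initializing the image pipeline object')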

    # Build the download request for the image URL:
    def get_media_requests(self, item, info):
        # Issue the request manually
        time.sleep(1.26)
        # Request the image URL stored in the item (the item behaves like a dict)
        yield scrapy.Request(url=item['img_url'])

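    # Define the image file name and path:
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use the last segment of the URL as the stored file name
        imgName = request.url.split('/')[-1]
        print(f'Downloading: {imgName}')
        # Return the file name; the image is written under the IMAGES_STORE directory
        return imgName
    def item_completed(self, results, item, info):
        # Return the item so the next pipeline can process it
        return item
    # Custom __del__ method (intended to run at the very end)
    def __del__(self):
        print('All downloads finished!')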
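
The pipeline above names each file after the last segment of its URL, while the alt text collected by the spider goes unused. As one possible variation, the sketch below builds a file name from the alt text and falls back to the URL segment. The helper image_file_name, the character filter, and the '.jpg' fallback extension are all assumptions made for this illustration, not part of the original pipeline:

```python
import posixpath
from urllib.parse import urlparse


def image_file_name(img_url: str, alt: str = "") -> str:
    """Build a file name from the image URL, preferring `alt` as the stem."""
    # Last path segment of the URL; urlparse drops any query string.
    base = posixpath.basename(urlparse(img_url).path)
    stem, ext = posixpath.splitext(base)
    # Keep only letters, digits, spaces, hyphens and underscores from alt.
    safe_alt = "".join(c for c in alt if c.isalnum() or c in "-_ ").strip()
    # Assumed fallback: use '.jpg' when the URL carries no extension.
    return f"{safe_alt or stem}{ext or '.jpg'}"


if __name__ == "__main__":
    print(image_file_name("https://example.com/a/b/photo_01.jpg?x=1", "Snow scene"))
    print(image_file_name("https://example.com/a/b/photo_01.jpg"))
```

Inside file_path, a helper like this could be called with request.url and item['alt'], provided the Scrapy version in use passes the item to file_path.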

Settings in the settings file:

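[Add: IMAGES_STORE = 'image storage path'. This setting has to be added manually, and the path is the storage directory only, without the image file name.]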
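
For reference, a settings.py fragment along these lines might look as follows. This is a sketch under stated assumptions: the './imgs' directory and the priority value 300 are placeholders, the pipeline path assumes the class lives in imgsPro/pipelines.py (the project name is taken from the spider's import), and DOWNLOAD_DELAY is shown as a possible replacement for the time.sleep calls rather than something the original project sets:

```python
# settings.py (fragment)

# Directory where downloaded images are stored (assumed path; no file name here).
IMAGES_STORE = "./imgs"

# Enable the custom image pipeline; the module path assumes imgsPro/pipelines.py.
ITEM_PIPELINES = {
    "imgsPro.pipelines.ImgsproPipeline": 300,
}

# Optional: let Scrapy space out requests (in seconds) instead of calling time.sleep.
DOWNLOAD_DELAY = 1
```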

That covers everything needed to scrape images with Scrapy!
