使用scrapy实现爬虫实例——图片爬取

继前三章中Spider爬取数据，Item收集数据后交给Pipeline对数据进行处理，本章在前三章的基础上继续实现对图片的爬取。

一、Spider爬取数据

spider文件夹中booksSpider.py代码：

from scrapy import Request
from scrapy.spiders import Spider
from booksScrapy.items import BooksscrapyItem

class booksSpider(Spider):
    name = 'books'  # 给爬虫起一个名字
    # 初始请求，用于获取起始的url
    def start_requests(self):
        url = 'http://books.toscrape.com/catalogue/category/books_1/index.html'
        yield Request(url)
    # 解析数据的函数
    def parse(self, response):
        item = BooksscrapyItem()  # 定义Item类的对象，用于保存一条数据
        li_selector = response.xpath('//ol[@class="row"]/li')
        for one_selector in li_selector:
            # 获取书名
            name = one_selector.xpath('article[@class="product_pod"]/h3/a/@title').extract()[0]
            # 价格
            price = one_selector.xpath('article[@class="product_pod"]/div[@class="product_price"]/p[1]/text()').extract()[0]
            #图片的url
            url = price = one_selector.xpath('article[@class="product_pod"]/div[@class="image_container"]/a/img/@src').extract()[0]
            #src = "../../../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"
            #使用  url.split("..")[-1] 来获取   media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg 这部分
            url = url.split("..")[-1]
            url = "http://books.toscrape.com"+ url
            item["name"] = name
            item["price"] = price
            item["img_url"] = url
            yield item
        # 下一页
        next_url = response.xpath('//li[@class="next"]/a/@href').extract()
        if next_url:
            next_url = "http://books.toscrape.com/catalogue/category/books_1/" + next_url[0]
            yield Request(next_url)

注意：如何获取图片的url是一个新学习的知识点，详情见上述代码

二、Item封装数据

items.py代码（增加了保存图片的字段）：

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BooksscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # 书名
    name = scrapy.Field()
    # 价格
    price = scrapy.Field()
    # 图片的url
    img_url = scrapy.Field()

三、Pipeline处理数据

pipeline.py代码：
新增了一个管道来下载并保存图片

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class BooksscrapyPipeline(object):
    def process_item(self, item, spider):
        #对数据进行处理—写入到txt文件中
        with open("mybooks.txt","a",encoding="utf-8") as f:
            oneStr = item["name"]+";"+item["price"]+";"+item["img_url"]+"\n" #txt中不会自动换行
            f.write(oneStr)
        return item

from scrapy.pipelines.images import ImagesPipeline #用于下载图片的管道
from scrapy import Request
from scrapy.exceptions import DropItem #异常库
import logging   #python中专门有一个logging来打印消息
logger = logging.getLogger("SaveImagePipeline")
# 图片管道，继承于  ImagesPipeline
class SaveImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):#用于下载图片的请求
        yield Request(url = item["img_url"])

    def item_completed(self, results, item, info):#判断是否正确下载
        if not results[0][0]:
            raise DropItem("下载失败")
        #打印日志
        logger.debug("下载图片成功")
        return item

操作setting.py，释放功能：

MAGES_STORE="./img"  #在当前文件夹下生成img文件

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'booksScrapy.pipelines.BooksscrapyPipeline': 300,
   'booksScrapy.pipelines.SaveImagePipeline': 400,
}

执行结果为：
系统自动在img文件夹下生成了一个full文件夹来保存图片
在这里插入图片描述

补充：对获取的图片进行操作，这里进行缩略和过滤操作在setting中设置，同时在pipeline中将图片名称改掉
pipeline.py代码为：

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class BooksscrapyPipeline(object):
    def process_item(self, item, spider):
        #对数据进行处理—写入到txt文件中
        with open("mybooks.txt","a",encoding="utf-8") as f:
            oneStr = item["name"]+";"+item["price"]+";"+item["img_url"]+"\n" #txt中不会自动换行
            f.write(oneStr)
        return item

from scrapy.pipelines.images import ImagesPipeline #用于下载图片的管道
from scrapy import Request
from scrapy.exceptions import DropItem #异常库
import logging   #python中专门有一个logging来打印消息
logger = logging.getLogger("SaveImagePipeline")
# 图片管道，继承于  ImagesPipeline
class SaveImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):#用于下载图片的请求
        yield Request(url = item["img_url"])

    def item_completed(self, results, item, info):#判断是否正确下载
        if not results[0][0]:
            raise DropItem("下载失败")
        #打印日志
        logger.debug("下载图片成功")
        return item

    #用于修改图片名称
    def file_path(self, request, response=None, info=None):
        #返回图片名称
        return request.url.split("/")[-1]

settings.py代码（只截取了本项目需要的部分，其他部分原封不动）：

IMAGES_STORE = "./img"  #在当前文件夹下生成img文件
#生成缩略图
IMAGES_THUMBS ={
   'small':(100,100),
   "big":(260,260)
}
#过滤掉图片过小(5×5)的图片
IMAGES_MIN_WIDTH:5
IMAGES_MIN_HEIGHT:5



# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'booksScrapy.pipelines.BooksscrapyPipeline': 300,
   'booksScrapy.pipelines.SaveImagePipeline': 400,
}

执行结果为：
在这里插入图片描述

四、总结

项目流程：
①spider中获取图片url
②item中定义url字段
③pipeline中新建管道（类）用于保存图片
（1）pipeline中定义方法用于请求图片的下载
（2）pipeline中定义方法判断是否下载成功
④setting中添加管道的执行以及图片的操作保存

文章在前三章基础上主要多了一个爬取网页图片并对图片进行操作，最后将图片保存的演练。

施施吖

发布了27 篇原创文章 · 获赞 7 · 访问量 2125

私信关注

Python爬虫——使用Scrapy实现图片的爬取（四）

使用scrapy实现爬虫实例——图片爬取

一、Spider爬取数据

二、Item封装数据

三、Pipeline处理数据

四、总结

猜你喜欢