[Images and Files in Scrapy] Scrapy's built-in image download pipeline

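Background: the official documentation at https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/images.html?highlight=image (this is the Chinese translation for Scrapy 0.24; the scrapy.pipelines.images module path used below is the one from Scrapy 1.0 onward).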

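Three basic steps: 1. In items.py, define the image_urls and images fields (images is the pipeline's default result field).

         2. In settings.py, define ITEM_PIPELINES and IMAGES_STORE.

         3. In the spider, fill image_urls with a list of image URLs and yield the item.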

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/valid/dir'  # directory where downloaded files are stored, e.g. IMAGES_STORE = 'data/斗鱼主播图片/'

A working example for reference: https://www.cnblogs.com/pythonClub/p/9856490.html

My own hand-written example

items.py:

import scrapy


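class Imagedemo1Item(scrapy.Item):
    # Input: the list of image URLs the pipeline should download.
    image_urls = scrapy.Field()
    # Output: filled in by ImagesPipeline with the download results.
    # The pipeline's default result field is named 'images', not 'image'.
    images = scrapy.Field()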
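
To make the role of the two fields concrete: after the pipeline has run, the images field holds a list of dicts with the keys url, path and checksum, where path is relative to IMAGES_STORE. The sketch below uses a plain dict in place of a real item, and every value in it is a made-up placeholder; downloaded_paths is a hypothetical helper of mine, not part of Scrapy.

```python
import os


def downloaded_paths(item, images_store):
    """Return the on-disk path of every image recorded in item['images']."""
    # Each entry's 'path' is relative to the IMAGES_STORE directory.
    return [os.path.join(images_store, entry["path"])
            for entry in item.get("images", [])]


# Placeholder item: the URL, file name and checksum are illustrative only.
example_item = {
    "image_urls": ["https://example.com/a.jpg"],
    "images": [
        {
            "url": "https://example.com/a.jpg",
            "path": "full/0123abcd.jpg",
            "checksum": "placeholder-md5",
        },
    ],
}

print(downloaded_paths(example_item, "data/images"))
```

An item whose images field is missing or empty simply yields an empty list, which is what you would see if the download failed.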

The spider .py file:

import scrapy
from imagedemo1.items import Imagedemo1Item

class DemoSpider(scrapy.Spider):
    name = 'demo'
    # Match the host that actually serves the image in start_urls.
    allowed_domains = ['bdstatic.com']
    start_urls = ['https://gss3.bdstatic.com/7Po3dSag_xI4khGkpoWK1HF6hhy/baike/s%3D220/sign=a9a4a01c8a94a4c20e23e0293ef41bac/b64543a98226cffc613bfea1b4014a90f603ea94.jpg']

    def parse(self, response):
        item = Imagedemo1Item()
        image_url = "https://gss3.bdstatic.com/7Po3dSag_xI4khGkpoWK1HF6hhy/baike/s%3D220/sign=a9a4a01c8a94a4c20e23e0293ef41bac/b64543a98226cffc613bfea1b4014a90f603ea94.jpg"
        # image_urls must be a list of URLs, even when there is only one image.
        item['image_urls'] = [image_url]
        yield item

settings.py

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'data/斗鱼主播图片/'
# Name of the field, defined in XxxItem, that holds the image links (here image_urls).
IMAGES_URLS_FIELD = 'image_urls'
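
With these settings it helps to know where the file will end up. As far as I understand the default behaviour, ImagesPipeline saves each image under a full/ subdirectory of IMAGES_STORE and names it after the SHA-1 hash of its URL, with a .jpg extension. The sketch below only imitates that naming with the standard library; expected_image_path is a hypothetical helper, the URL is a placeholder, and other Scrapy versions or a custom file_path override may name files differently.

```python
import hashlib
import os


def expected_image_path(images_store, image_url):
    """Guess where the default image naming would store image_url."""
    # The file name is the SHA-1 hex digest of the URL (assumed default).
    image_guid = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    return os.path.join(images_store, "full", image_guid + ".jpg")


# Placeholder URL, combined with the IMAGES_STORE value from settings.py.
print(expected_image_path("data/斗鱼主播图片/", "https://example.com/a.jpg"))
```

If nothing shows up in that directory after a crawl, check the log for pipeline errors before suspecting the path.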

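A frequent source of problems is in the spider file: the built-in pipeline iterates over the image_urls field, so the value must be a list of URLs. Even for a single image you need item['image_urls'] = [image_url]; a bare string would be iterated character by character.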
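
The difference is easy to see in plain Python. The snippet below is a small sketch, not Scrapy code: as_url_list is a hypothetical helper that wraps a single URL string in a list and leaves other iterables alone, and the URL is a placeholder.

```python
def as_url_list(value):
    """Return value as a list of URLs, wrapping a single string in a list."""
    if isinstance(value, str):
        return [value]
    return list(value)


single = "https://example.com/a.jpg"

# Iterating a bare string yields single characters, not URLs.
print(list(single)[:3])      # → ['h', 't', 't']

# Wrapped in a list, the pipeline would see exactly one URL.
print(as_url_list(single))   # → ['https://example.com/a.jpg']
```

In the spider this would be used as item['image_urls'] = as_url_list(image_url), although writing the list literal directly is just as good.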

For a custom pipeline, see: https://blog.csdn.net/cnmnui/article/details/99850055

Reposted from blog.csdn.net/fan13938409755/article/details/104724698