The Scrapy Framework Series of Python Crawlers (21) - Overriding the Media Pipeline Class to Customize Saved Image Names and Crawl Multiple Pages

Override a few methods of the framework's built-in media pipeline class to customize the names of saved images:

  1. In the spider file, collect the list of image URLs and yield the item;
  2. The item must define the specially named field: image_urls=scrapy.Field();
  3. Set the IMAGES_STORE storage path in settings.py; if the path does not exist, Scrapy will create it for us;
  4. To use the default pipeline, enable it in settings.py: 'scrapy.pipelines.images.ImagesPipeline': 60,
    while a custom pipeline must inherit from ImagesPipeline and be enabled in settings.py instead (a minimal sketch of the stock setup follows this list);
  5. Per the official documentation, the following methods can be overridden:
    get_media_requests
    item_completed
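
Before customizing anything, the stock pipeline already covers steps 2-4 on its own. A minimal sketch of that default setup (the setting keys and field names are the real Scrapy ones; the paths and item class name are placeholders):

# settings.py -- enable the built-in pipeline and choose a storage root
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 60}
IMAGES_STORE = '/path/to/valid/dir'   # created automatically if it does not exist

# items.py -- the pipeline looks for these exact field names by default
import scrapy

class MyImageItem(scrapy.Item):
    image_urls = scrapy.Field()   # list of image URLs to download
    images = scrapy.Field()       # filled with the download results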

1. The spider file:

# -*- coding: utf-8 -*-
import scrapy

import re
from ..items import BaiduimgPipeItem

class BdimgSpider(scrapy.Spider):
    name = 'bdimgpipe'
    allowed_domains = ['image.baidu.com']
    start_urls = ['https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&fm=index&pos=history&word=%E7%8C%AB%E5%92%AA']

    def parse(self, response):
        # pull every thumbnail URL out of the raw response body
        text=response.text
        image_urls=re.findall('"thumbURL":"(.*?)"',text)
        item=BaiduimgPipeItem()
        item["image_urls"]=image_urls
        yield item
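
To see what that regular expression captures, here is a tiny standalone check (the sample string is a made-up fragment of the response body, not real Baidu data):

import re

# hypothetical fragment of the search response body
sample = '{"thumbURL":"https://img0.example.com/cat1.jpg"},{"thumbURL":"https://img0.example.com/cat2.jpg"}'
print(re.findall('"thumbURL":"(.*?)"', sample))
# ['https://img0.example.com/cat1.jpg', 'https://img0.example.com/cat2.jpg']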

2. Define the specially named field in items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BaiduimgPipeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls=scrapy.Field()
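
The field name image_urls is only "special" because it is the pipeline's default. If you prefer another name, the real IMAGES_URLS_FIELD setting (listed again in the extension section at the end) points the pipeline at it; a sketch with a hypothetical field name pic_urls:

# settings.py
IMAGES_URLS_FIELD = 'pic_urls'   # tell the pipeline which item field holds the URLs

# items.py
import scrapy

class BaiduimgPipeItem(scrapy.Item):
    pic_urls = scrapy.Field()    # now plays the role of image_urls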

3. Enable the custom pipeline in settings.py and set the storage path:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'baiduimg.pipelines.BaiduimgPipeline': 300,
   'baiduimg.pipelines.BdImagePipeline': 40,
   # 'scrapy.pipelines.images.ImagesPipeline': 60,
}

# IMAGES_STORE =r'C:\my\pycharm_work\爬虫\eight_class\baiduimg\baiduimg\dir0'
IMAGES_STORE ='C:/my/pycharm_work/爬虫/eight_class_ImagesPipeline/baiduimg/baiduimg/dir3'

4. Write pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.http import Request
import os

from scrapy.pipelines.images import ImagesPipeline        # import the media pipeline class we are extending
from .settings import IMAGES_STORE

class BdImagePipeline(ImagesPipeline):
    image_num = 0
    print("the spider's media pipeline class")
    def get_media_requests(self, item, info):  # available for overriding
        # turn each image URL into a Request and send it to the engine
        '''
        req_list=[]
        for x in item.get(self.images_urls_field, []):      # this line is equivalent to item["image_urls"], i.e. the list of image URLs
            req_list.append(Request(x))
        return req_list
        '''
        return [Request(x) for x in item.get(self.images_urls_field, [])]

    def item_completed(self, results, item, info):
        images_path = [x["path"] for ok,x in results if ok]

        for image_path in images_path:  # os.rename implements the custom image name: the first argument is the original path, the second is the custom path
            os.rename(IMAGES_STORE+"/"+image_path,IMAGES_STORE+"/full/"+str(self.image_num)+".jpg")      # IMAGES_STORE+"/"+image_path is the original absolute path of the saved image; the second argument is the new custom absolute path (kept under IMAGES_STORE here as well)
            self.image_num+=1
        return item     # return the item so any later pipelines still receive it


'''
An overridable method from the source code:
    def item_completed(self, results, item, info):      # this method can also be overridden
        if isinstance(item, dict) or self.images_result_field in item.fields:
            item[self.images_result_field] = [x for ok, x in results if ok]
        return item

The results argument in detail:
url - the URL the file was downloaded from. This is the URL of the request returned by get_media_requests().

path - the path (relative to FILES_STORE) where the file was stored

checksum - the MD5 hash of the image contents

A typical value of the results argument:
[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
]

The item_completed() above is exactly what processes this results, so reading the source:
[x for ok, x in results if ok]
this list comprehension collects the whole dict of every successful result, stores it on the item, and returns it!

Following the same idea, the list comprehension below extracts the storage path of every image from results:
images_path = [x["path"] for ok, x in results if ok]

'''
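
As an aside, instead of renaming files after the fact with os.rename, a cleaner route (not what this article's code does) is to override the pipeline's file_path() method, so each image is written under the custom name in the first place. A minimal sketch; the signature matches recent Scrapy versions:

from scrapy.pipelines.images import ImagesPipeline

class RenameOnSavePipeline(ImagesPipeline):
    image_num = 0   # counter for sequential names (hypothetical helper)

    def file_path(self, request, response=None, info=None, *, item=None):
        # the returned path is relative to IMAGES_STORE; Scrapy writes the file there directly
        path = f'full/{RenameOnSavePipeline.image_num}.jpg'
        RenameOnSavePipeline.image_num += 1
        return path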


5. Inspecting the storage directory confirms the custom naming works perfectly.


The workflow is as follows:

(The description below is for the files pipeline class, but it applies equally to the images pipeline class: ImagesPipeline inherits from FilesPipeline, and both inherit from the media pipeline class~)

  1. In the spider, you return an item and put the desired URLs into its file_urls field.
  2. The item is returned from the spider and enters the item pipeline.
  3. When the item reaches the files pipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (meaning the scheduler and downloader middlewares are reused), but with higher priority, so they are processed before other pages are crawled. The item remains "locked" at that pipeline stage until the file downloads complete (or fail for some reason).
  4. When the files have downloaded, another field (files) is populated with the results. This field contains a list of dicts with information about the downloaded files, such as the stored path, the original scraped URL (taken from the file_urls field), and the file checksum. The files in the files field keep the same order as the original file_urls field. If a file fails to download, an error is logged and that file does not appear in the files field. (A minimal sketch of this setup follows the list.)
Modify the spider file to crawl multiple pages:

  • Note: set a download delay in settings.py first, otherwise you risk being banned! A settings sketch follows, and then the updated spider.
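
A conservative example (both are real Scrapy settings; the values are just a suggestion):

# settings.py
DOWNLOAD_DELAY = 1                  # wait 1 second between requests
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay (0.5x-1.5x) so the pacing looks less robotic

The updated spider file:
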
# -*- coding: utf-8 -*-
import scrapy

import re
from ..items import BaiduimgPipeItem

class BdimgSpider(scrapy.Spider):
    name = 'bdimgpipe'
    allowed_domains = ['image.baidu.com']
    start_urls = ['https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&fm=index&pos=history&word=%E7%8C%AB%E5%92%AA']
    page_url="https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E7%8C%AB%E5%92%AA&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E7%8C%AB%E5%92%AA&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn={}&rn=30&gsm=1e&1588088573059="
    pagenum=1

    def parse(self, response):
        text=response.text
        image_urls=re.findall('"thumbURL":"(.*?)"',text)
        item=BaiduimgPipeItem()
        item["image_urls"]=image_urls
        yield item

        url=self.page_url.format(self.pagenum*30)
        self.pagenum+=1
        if self.pagenum == 3:			# stop after as many pages as you like -- set the cutoff here!!!
            return
        yield scrapy.Request(url, callback=self.parse)
'''
Open the original page with F12 and keep loading more images; find the corresponding request URLs and observe the pattern: the pn parameter grows by 30 with each page loaded!!!
https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E7%8C%AB%E5%92%AA&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E7%8C%AB%E5%92%AA&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=30&rn=30&gsm=1e&1588088573059=
https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E7%8C%AB%E5%92%AA&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=%E7%8C%AB%E5%92%AA&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=60&rn=30&gsm=3c&1588088573138=
'''

The effect is very nice.

Extension: some media pipeline settings:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}   # enable the images pipeline
FILES_STORE = '/path/to/valid/dir'		# storage location for the files pipeline
IMAGES_STORE = '/path/to/valid/dir'		# storage location for the images pipeline
FILES_URLS_FIELD = 'field_name_for_your_files_urls'    # custom file URLs field
FILES_RESULT_FIELD = 'field_name_for_your_processed_files'   # custom results field
IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'  # custom image URLs field
IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'  # custom results field
FILES_EXPIRES = 90    # file expiration time, default 90 days
IMAGES_EXPIRES = 90    # image expiration time, default 90 days
IMAGES_THUMBS = {'small': (50, 50), 'big': (270, 270)}  # thumbnail sizes   # !!!just add this setting to settings.py and rerun the framework to generate thumbnails!!! Very convenient and commonly used!!!
IMAGES_MIN_HEIGHT = 110   # filter out images below this minimum height
IMAGES_MIN_WIDTH = 110   # filter out images below this minimum width
MEDIA_ALLOW_REDIRECTS = True    # whether to allow redirects
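
For reference, with IMAGES_THUMBS set as above, Scrapy saves each image three times, using the documented <IMAGES_STORE>/thumbs/<size_name>/<image ID>.jpg layout:

# <IMAGES_STORE>/full/<sha1 of url>.jpg           the original download
# <IMAGES_STORE>/thumbs/small/<sha1 of url>.jpg   the 50x50 thumbnail
# <IMAGES_STORE>/thumbs/big/<sha1 of url>.jpg     the 270x270 thumbnail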
