Python Crawler Scrapy Framework Series (19) - Hands-On: Downloading Cat Images [Media Pipeline]

1. Introduction:

Let's start with a small case: using Scrapy to crawl cat images from Baidu Images.

  • Target URL: https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&fm=index&pos=history&word=%E7%8C%AB%E5%92%AA

1.1 Storing images locally without a pipeline:

① Create the Scrapy project and crawler file:

'''
Create the project and crawler file:
1. scrapy startproject baiduimgs
2. cd baiduimgs
3. scrapy genspider bdimgs www
'''
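For reference, the generated project should have the standard Scrapy layout, roughly like this (the spider filename matches the name passed to genspider):

baiduimgs/
├── scrapy.cfg
└── baiduimgs/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── bdimgs.py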

② Write the crawler file:

# -*- coding: utf-8 -*-
import os
import re

import scrapy


class BdimgSpider(scrapy.Spider):
    name = 'bdimgs'
    allowed_domains = ['image.baidu.com']
    start_urls = ['https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&fm=index&pos=history&word=%E7%8C%AB%E5%92%AA']
    num = 0  # counter used to number the saved image files

    def parse(self, response):
        # The thumbnail URLs are embedded in the page as "thumbURL":"..." fragments
        img_urls = re.findall('"thumbURL":"(.*?)"', response.text)
        for img_url in img_urls:
            yield scrapy.Request(img_url, dont_filter=True, callback=self.get_img)

    def get_img(self, response):
        # response.body holds the raw image bytes
        if not os.path.exists("dir"):
            os.mkdir("dir")
        filename = "dir/%s.jpg" % self.num
        self.num += 1
        with open(filename, "wb") as f:
            f.write(response.body)

Notice:

  • Disable the robots protocol in settings.py (ROBOTSTXT_OBEY = False);
  • Add a User-Agent header!!! (See the sketch below.)
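A minimal settings.py sketch covering both points; the User-Agent string here is just an example, any common browser UA works:

# settings.py
# Do not obey robots.txt, or the requests will be blocked
ROBOTSTXT_OBEY = False

# Example browser User-Agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'

With those set, run the spider from the project root:

scrapy crawl bdimgs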

③ Effect:

(Screenshot omitted: the downloaded images appear in the dir/ folder.)

1.2 Storing locally using a pipeline:

① Write the crawler file:

# -*- coding: utf-8 -*-
import re

import scrapy

from ..items import BaiduimgsItem  # the item class that defines the fields


class BdimgSpider(scrapy.Spider):
    name = 'bdimgs'
    allowed_domains = ['image.baidu.com']
    start_urls = ['https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&fm=index&pos=history&word=%E7%8C%AB%E5%92%AA']

    def parse(self, response):
        img_urls = re.findall('"thumbURL":"(.*?)"', response.text)
        for img_url in img_urls:
            yield scrapy.Request(img_url, dont_filter=True, callback=self.get_img)

    def get_img(self, response):
        # Hand the raw image bytes to the pipeline via an item
        item = BaiduimgsItem()
        item["img_data"] = response.body
        yield item

② Create corresponding fields in the items.py file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BaiduimgsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_data = scrapy.Field()  # raw image bytes passed from the spider

③ Write the pipeline file pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import os


class BaiduimgsPipeline(object):
    num = 0  # counter used to number the saved files

    def process_item(self, item, spider):
        if not os.path.exists("dir_pipe"):
            os.mkdir("dir_pipe")
        filename = "dir_pipe/%s.jpg" % self.num
        self.num += 1
        with open(filename, "wb") as f:
            f.write(item["img_data"])
        return item

Note: the pipeline must be enabled in settings.py!!! A reference snippet follows below.
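For reference, enabling it looks like this (the value 300 is the pipeline's priority; lower numbers run earlier):

# settings.py
ITEM_PIPELINES = {
   'baiduimgs.pipelines.BaiduimgsPipeline': 300,
}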

④ Effect:

(Screenshot omitted: the images are saved into the dir_pipe/ folder.)

Analysis of the crawler files under the two storage approaches:

  • Both contain a get_img() callback. The previous article showed that a callback is required, but look closely at these two crawler files and you will see that the callback does very little here: our target is the image data itself, and no further extraction is needed. The callback is therefore redundant. So: is there a way to simplify it?!

2. Enter the media pipeline class. It is used as follows:

2.1 Change the crawler file to:

# -*- coding: utf-8 -*-
import re

import scrapy

from ..items import BaiduimgsPipeItem


class BdimgSpider(scrapy.Spider):
    name = 'bdimgs'
    allowed_domains = ['image.baidu.com']
    start_urls = ['https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&fm=index&pos=history&word=%E7%8C%AB%E5%92%AA']

    def parse(self, response):
        image_urls = re.findall('"thumbURL":"(.*?)"', response.text)
        # Note: the value assigned to the field here is the list of image URLs!!!
        item = BaiduimgsPipeItem()
        item["image_urls"] = image_urls
        yield item

2.2 Write the items.py file:

  • (Note: when using the media pipeline class, the field name must be image_urls, because that is the default field name the pipeline looks for in its source code!!!)
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class BaiduimgsPipeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()  # input field read by the ImagesPipeline
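Optionally, you can also declare an images field: when it is present on the item, the ImagesPipeline fills it with the download results (url, path and checksum of each image) after the downloads finish. A sketch:

class BaiduimgsPipeItem(scrapy.Item):
    image_urls = scrapy.Field()  # input: list of URLs read by the pipeline
    images = scrapy.Field()      # output: download results written by the pipeline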

2.3 With the media pipeline class there is nothing to write in the pipelines.py file; everything is configured directly in settings.py:

  • (Important: on the surface no pipeline of ours is in use, since our pipelines.py does nothing, but in fact, because we used the expected field name, the built-in media pipeline class is quietly doing the work!!!)
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'baiduimgs.pipelines.BaiduimgsPipeline': 300,
   'scrapy.pipelines.images.ImagesPipeline': 300,   # Note: this built-in pipeline must be enabled!
}
# Note: you must also specify the storage path for the media pipeline!
IMAGES_STORE = r'E:\Py_Spider_High\spiderpro\scrapy_1\baiduimgs\dir0'
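By default the ImagesPipeline stores each image under IMAGES_STORE/full/, named with the SHA1 hash of its URL. If you prefer numbered filenames like the hand-written pipelines above, you can subclass it and override file_path(); a minimal sketch (the class name and counter are my own choices, not part of the original article):

# pipelines.py
from scrapy.pipelines.images import ImagesPipeline


class NumberedImagesPipeline(ImagesPipeline):
    num = 0  # simple counter used for the filenames

    def file_path(self, request, response=None, info=None, *, item=None):
        # Save as 0.jpg, 1.jpg, ... directly under IMAGES_STORE
        path = "%s.jpg" % NumberedImagesPipeline.num
        NumberedImagesPipeline.num += 1
        return path

Register it in ITEM_PIPELINES as 'baiduimgs.pipelines.NumberedImagesPipeline' in place of the built-in class.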

2.4 Effect:

(Screenshot omitted: the images are saved under the dir0 folder set by IMAGES_STORE.)

  • Note: this article uses Scrapy 2.7, and the steps above will not run there as-is: Scrapy emits a WARNING because the ImagesPipeline requires the Pillow package, so install it first (pip install Pillow).

Source: blog.csdn.net/qq_44907926/article/details/130222972