Scrapy Notes 2 (CrawlSpider crawls and stores images)

Preface

This one was a pure grind.

Example

Process and technical point analysis

  1. Create a new CHAHUA project with the China Illustration Network (chahua.org) as the target website, chahua as the spider name, and a start.py file as the entry point for running it (see the sketch after this list)
  2. settings.py (set ROBOTSTXT_OBEY = False, add request headers, enable the pipeline, set IMAGES_STORE)
  3. chahua.py
  4. pipelines.py
  5. items.py
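
A common way to write the start.py entry file (a minimal sketch, assuming the spider name chahua from the list above):

from scrapy import cmdline

# run the chahua spider from the IDE instead of typing the command in a terminal
cmdline.execute("scrapy crawl chahua".split())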

Key theory

1. Rule and LinkExtractor are mostly used for whole-site crawling

Rule defines the rules for extracting and handling links.
follow is a Boolean that specifies whether links extracted from a response by this rule should themselves be followed up. If callback is None, follow defaults to True; otherwise it defaults to False.
When follow is True, the spider takes the URLs matching the rule out of each response it obtains and crawls them in turn; if a response crawled this way again contains matching URLs, those are crawled as well, looping until no URL matching the rule remains.
When follow is False, the spider only takes matching URLs out of the responses of start_urls and requests those.
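
A compact illustration of the two settings side by side (a sketch with hypothetical URL patterns, not code from this project):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    # follow=True: matching links found on every fetched response are crawled again
    Rule(LinkExtractor(allow=r"/category/\d+"), follow=True),
    # follow=False plus a callback: matching links are requested once and handed to parse_detail
    Rule(LinkExtractor(allow=r"/detail/\d+"), follow=False, callback="parse_detail"),
)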
2. LinkExtractor used alone

A LinkExtractor can also be used on its own to extract complete, absolute URLs from a response.
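
For example, a minimal sketch inside an ordinary Spider (the spider and the allow pattern here are made up for illustration):

import scrapy
from scrapy.linkextractors import LinkExtractor


class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://www.chahua.org/']

    def parse(self, response):
        # extract_links returns Link objects carrying the complete absolute URLs
        extractor = LinkExtractor(allow=r"/drawn/detail\.php")
        for link in extractor.extract_links(response):
            print(link.url)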

Code example

chahua.py

1. Import

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

2. Rule definition

start_urls = ['http://chahua.org/']
rules = (
    # Rule(LinkExtractor(allow=r"http://www.chahua.org/"), follow=False),
    Rule(LinkExtractor(allow=r"http://www.chahua.org/drawn/detail.php?id=554887&hid=3"), follow=False, callback="parse_detail"),
)

I encountered a problem here. I also wrote a rule for the specific detail page with a callback parse function, but it would not return the desired result and reported an error instead (error screenshot omitted). I could not solve it for a long time: the CrawlSpider simply could not find the content the rule specified. So I had to turn around and create a new ZCOOL project with ZCOOL (zcool.com.cn) as the target website and zcool as the spider name, and use that for the next step of learning image downloading, rather than staying stuck on the illustration site alone.
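
In hindsight, one likely cause (an assumption, since the original error output is not shown): allow takes a regular expression, so the literal ? and . in the detail URL are regex metacharacters, and the unescaped pattern above never matches the real detail page. An escaped version might look like this (the \d+ generalization of the id is also an assumption):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# hypothetical fix: escape the regex metacharacters in the allow pattern
Rule(LinkExtractor(allow=r"http://www\.chahua\.org/drawn/detail\.php\?id=\d+&hid=\d+"),
     follow=False, callback="parse_detail")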
3. zcool.py

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import ImagedownloadItem


class ZcoolSpider(CrawlSpider):
    name = 'zcool'
    allowed_domains = ['zcool.com.cn']
    # start_urls = ['http://zcool.com.cn/']
    start_urls = ['https://www.zcool.com.cn/discover/0!0!0!0!0!!!!2!0!1']

    rules = (
        # pagination URLs: keep following them
        Rule(LinkExtractor(allow=r'.+0!0!0!0!0!!!!2!0!\d+'), follow=True),
        # detail-page URLs: parse them, do not follow further
        Rule(LinkExtractor(allow=r".+/work/.+html"), follow=False, callback="parse_detail"),
    )

    def parse_detail(self, response):
        # all image URLs inside the work display box
        image_urls = response.xpath("//div[@class='work-show-box']//img/@src").getall()
        # the title may be split across several text nodes; join and strip it
        title_list = response.xpath("//div[@class='details-contitle-box']/h2/text()").getall()
        title = "".join(title_list).strip()
        print(title)
        item = ImagedownloadItem(title=title, image_urls=image_urls)
        yield item

The item's saved fields must include image_urls and images (these exact names are what ImagesPipeline expects). Write items.py first, then import it in the spider with from ..items import ImagedownloadItem.

import scrapy


class ImagedownloadItem(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

Note that the class name here is ImagedownloadItem, matching the import in the spider.
4. After finishing zcool.py, move on to the next hurdle: ITEM_PIPELINES

Enable the required pipeline in settings.py so the data gets saved.

ITEM_PIPELINES = {
    # 'imagedownload.pipelines.ImagedownloadPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

scrapy.pipelines.images.ImagesPipeline is Scrapy's built-in pipeline dedicated to downloading images.
Then set the image download path. The next problem is IMAGES_STORE: getting the path of this project's images folder through the os module. First find the path of settings.py itself, then find the imagedownload folder path, and finally find the images path.
os.path.dirname(__file__) gives the directory containing the current file.
os.path.dirname(os.path.dirname(__file__)) gives the parent of that directory.
Then splice it with 'images' using os.path.join:
os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
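
Written out in settings.py it looks like this (a minimal sketch showing only this setting):

import os

# save downloaded images into an images/ folder at the project root
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')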

pipelines.py can stay in its default form, since the built-in ImagesPipeline does the actual downloading:

class ImagedownloadPipeline(object):
    def process_item(self, item, spider):
        # pass-through; the built-in ImagesPipeline handles the downloading
        return item

Final Results

(Result screenshots omitted: the sample images ended up downloaded into the images folder.)

Questions and summary

1. Remember to add the domain to allowed_domains
2. Install the Pillow library (ImagesPipeline depends on it)
3. If the "Overridden settings" output is followed by an error, it is definitely the settings file or the spider that is wrong; wrong is wrong, and the blame is on me o(╥﹏╥)o
4. Tossing with this image download took about five or six days. I wasted a lot of time, did not achieve the effect, and did not understand the pipeline. After rewriting the function I can finally see the result: the knowledge clicked and the sample images downloaded successfully. What I still have not mastered is overriding the method that defines how the downloaded files are named (a sketch of that kind of override follows below) o(╥﹏╥)o
This problem needs to be solved before we go on to study using Selenium with Scrapy.
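
For reference, a minimal sketch of that kind of naming override, assuming Scrapy 2.4 or newer (where file_path receives the item); the class name and folder scheme are hypothetical, not code from this post:

import os

from scrapy.pipelines.images import ImagesPipeline


class TitleFolderImagesPipeline(ImagesPipeline):
    # save each downloaded image in a folder named after the item's title
    def file_path(self, request, response=None, info=None, *, item=None):
        filename = os.path.basename(request.url)      # original file name from the URL
        return os.path.join(item['title'], filename)  # group images by the title field

To use it, register this class in ITEM_PIPELINES in place of the built-in scrapy.pipelines.images.ImagesPipeline.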


Origin blog.csdn.net/qq_51598376/article/details/113761415