Scrapy framework for Python

Goal: AI Design Fundamentals – Gathering Data

As an AI algorithm engineer, whenever a new requirement comes in there are obviously thousands of methods to choose from, but often none of the data. I always end up scrambling to find data for a feature, or discovering that the existing dataset is far from ideal and a poor match for the task. This article covers how to quickly download data resources (text articles, images, videos).

  1. Gather the data yourself instead of having to ask others for it.
  2. Become familiar with the web request and parsing libraries: urllib, requests, Beautiful Soup.
  3. Focus on learning the Scrapy framework and learn to use this tool flexibly.

Learning Content:

Using the Scrapy framework feels much like working with the Django framework.
This section will briefly introduce the installation, commands and implementation process of Scrapy.

Install

Scrapy is written in Python. When installing inside a virtual environment, a single command automatically resolves and installs the dependent libraries.
Installation: conda install -c conda-forge scrapy
If you install it in another way, match and install the dependent libraries as needed.
Dependent libraries:

  • lxml, an efficient XML and HTML parser
  • parsel, an HTML/XML data extraction library written on top of lxml
  • w3lib, a multi-purpose helper for dealing with URLs and web page encodings
  • twisted, an asynchronous networking framework
  • cryptography and pyOpenSSL, to deal with various network-level security needs
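
In practice, pip install Scrapy inside the virtual environment will normally pull in these dependencies automatically as well (assuming a standard pip setup).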

Basic commands

The Scrapy tool provides multiple commands for various purposes, each accepting a different set of parameters and options.
scrapy <command> -h: view help on how to use a specific command
scrapy -h: view overall usage help for the framework

Global commands:
startproject (create a new project); genspider; settings; runspider; shell; fetch; view; version
Project-only commands:
crawl (run a spider); check; list; edit; parse; bench
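For instance, the genspider command scaffolds a new spider inside an existing project. Running scrapy genspider example example.com produces a skeleton roughly like the following (a sketch of the default template, which varies slightly between Scrapy versions):

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/']

        def parse(self, response):
            pass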

Getting-started example

You can also refer to the introductory tutorial: https://www.osgeo.cn/scrapy/intro/tutorial.html

  1. scrapy startproject <project_name> [project_dir]
    Creates a new Scrapy project named project_name under the directory project_dir. If project_dir is not specified, it defaults to project_name.

    For example: in Anaconda3, switch to the corresponding virtual environment with activate web, then run scrapy startproject spider_one to create a new project.
    The directory contains the following content:
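    (The listing below is the standard layout generated by scrapy startproject; it may differ slightly between Scrapy versions.)

    spider_one/
        scrapy.cfg            # deploy configuration file
        spider_one/           # the project's Python module
            __init__.py
            items.py          # item definitions
            middlewares.py    # spider and downloader middlewares
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/          # folder where the spiders live
                __init__.py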
    To create a spider, define a new class under the spiders package that inherits from scrapy.Spider; its name attribute is the name used to invoke it.
    The .py files under the spiders folder hold the spiders; the xx files are two different spiders I built myself. (Note: the number of spiders is determined not by the file names but by the classes that inherit from scrapy.Spider.)
    At the conda command line, cd spider_one, and then switch to the IDE.
    Open the project with PyCharm and set the project interpreter to the virtual environment where Scrapy is currently installed.
    Example from the Scrapy manual: crawling a famous-quotes website (in the spider_one project created above with scrapy startproject spider_one).
    A website address specially used to illustrate the crawler framework: http://quotes.toscrape.com/
    The code is very simple. Two points deserve emphasis: first, the two selector options, CSS and XPath (illustrated just below); second, the four methods used to follow pagination links and rebuild the URLs.
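    As a quick illustration of the two selector options, both of the following lines extract the quote text from one div.quote selector (a sketch based on the page's markup; the spider below uses the CSS form):

    quote.css('span.text::text').get()
    quote.xpath('./span[@class="text"]/text()').get()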

    Source code: quotes_spider.py under the spiders folder:

    # !/usr/bin/env python
    # -*- coding: utf-8 -*-
    # @Time    : 2022/3/23 15:41
    # @Author  : Haiyan Tan
    # @Email   : [email protected]
    # @File    : quotes_spider.py

    # Begin to show your code!
    import scrapy  # demonstrates how Scrapy follows links via the request/callback mechanism


    class QuotesSpider(scrapy.Spider):
        # Identifies the spider. It must be unique within a project, i.e. different
        # spiders cannot share the same name. Run it with: scrapy crawl quotes
        name = "quotes"

        # Return a list of requests (or write a generator) to start crawling from
        def start_requests(self):
            urls = [
                'http://quotes.toscrape.com/page/1/',
                # 'http://quotes.toscrape.com/page/2/',
            ]
            for url in urls:  # the spider starts from these requests; subsequent requests are generated from them
                # A Response object is instantiated and passed as argument to the
                # callback associated with the request (the parse method).
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response, **kwargs):  # parse() is Scrapy's default callback method
            page = response.url.split('/')[-2]
            # filename = f'quote-{page}.html'
            # with open(filename, 'wb') as f:
            #     f.write(response.body)
            # self.log(f'Saved file {filename}')
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').get(),
                    'author': quote.css('small.author::text').get(),
                    'tags': quote.css('div.tags a.tag::text').getall(),
                }

            next_page = response.css('li.next a::attr(href)').get()
            if next_page is not None:
                # Recursively follow the link to the next page: 4 methods. Method 1:
                # next_page = response.urljoin(next_page)
                # yield scrapy.Request(next_page, callback=self.parse)

                # Method 2: unlike scrapy.Request, response.follow supports relative
                # URLs directly, no need to call urljoin.
                yield response.follow(next_page, callback=self.parse)  # shortcut for creating a Request object

                # Method 3: create multiple requests from an iterable
                # anchors = response.css('ul.pager a')
                # yield from response.follow_all(anchors, callback=self.parse)

                # Method 4: equivalent to method 3
                # yield from response.follow_all(css='ul.pager a', callback=self.parse)

    Four ways to recursively follow the link to the next page:
    Method 1:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

    Method 2: unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin.
    yield response.follow(next_page, callback=self.parse)  # shortcut for creating a Request object

    Method 3: create multiple requests from an iterable.
    anchors = response.css('ul.pager a')
    yield from response.follow_all(anchors, callback=self.parse)

    Method 4: equivalent to Method 3.
    yield from response.follow_all(css='ul.pager a', callback=self.parse)

    Modifications to settings.py:

    BOT_NAME = 'spider_one'
    
    SPIDER_MODULES = ['spider_one.spiders']
    NEWSPIDER_MODULE = 'spider_one.spiders'
    
    LOG_LEVEL = 'ERROR'
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
    
  2. Run the spider: scrapy crawl <spider>
    scrapy crawl quotes  # run the spider whose name is "quotes"
    or: scrapy runspider quotes_spider.py -o quotes.jl
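    With the -o quotes.jl option, each line of the output file is one JSON object holding the fields yielded by the spider, roughly of this shape (illustrative values only):

    {"text": "...", "author": "...", "tags": ["...", "..."]}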

For more details, please refer to the Scrapy manual.

Advanced use: Multi-page image capture

Scraping images adds an item pipeline and an image storage mechanism on top of the text-scraping workflow.
Target website homepage: https://699pic.com/
Search URL for the keyword "dog": https://699pic.com/tupian/photo-gou.html

  1. First define fields for the image name, image URLs, etc. in items.py.
    items.py:

    import scrapy
    
    class MyImagesItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        image_urls = scrapy.Field()
        images = scrapy.Field()
        image_path = scrapy.Field()
    
  2. Add or adjust the configuration in settings.py.
    Set the path where images are saved, the maximum retention period of 90 days, and the minimum image size. Register the pipeline.

    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    # Image pipeline
    ITEM_PIPELINES = {
        'spider_one.pipelines.ImgPipeline': 300,
        # 'scrapy.pipelines.images.ImagesPipeline': 1,
    }
    # Image save path, maximum retention period of 90 days, and minimum image dimensions
    IMAGES_STORE = r'E:\\Datasets\\obj_detect_data\\scrapy_images_0325\\'
    IMAGES_EXPIRES = 90
    IMAGES_MIN_HEIGHT = 100
    IMAGES_MIN_WIDTH = 100
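    Note that the image pipeline relies on the Pillow library for image processing (thumbnails and size checks), so make sure Pillow is installed in the same environment.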
    
  3. Define the pipeline in pipelines.py.

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


    # useful for handling different item types with a single interface
    import scrapy
    from itemadapter import ItemAdapter
    from scrapy.pipelines.images import ImagesPipeline
    from scrapy.exceptions import DropItem


    class ImgPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # Issue one download request for each URL in the item's image_urls field
            return [scrapy.Request(x) for x in item.get(self.images_urls_field, [])]

        def item_completed(self, results, item, info):
            # Keep the metadata of the successfully downloaded images in the item's images field
            if isinstance(item, dict) or self.images_result_field in item.fields:
                item[self.images_result_field] = [x for ok, x in results if ok]

            return item
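    If you also want to control how the downloaded files are named, one option is to override file_path as well. A minimal sketch, assuming Scrapy >= 2.4 and a hypothetical dogs/ subfolder:

    import hashlib

    from scrapy.pipelines.images import ImagesPipeline
    from scrapy.utils.python import to_bytes


    class NamedImgPipeline(ImagesPipeline):
        def file_path(self, request, response=None, info=None, *, item=None):
            # The default implementation saves files as full/<sha1-of-url>.jpg;
            # this variant puts them in a dogs/ subfolder of IMAGES_STORE instead.
            name = hashlib.sha1(to_bytes(request.url)).hexdigest()
            return f'dogs/{name}.jpg'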
    
  4. Finally, define the image spider img_spider.py under the spiders folder.

    # !/usr/bin/env python
    # -*- coding: utf-8 -*-
    # @Time    : 2022/3/24 16:29
    # @Author  : Haiyan Tan
    # @Email   : [email protected] 
    # @File    : img_spider.py
    
    # Begin to show your code!
    import scrapy
    from ..items import *
    
    
    class ImgSpider(scrapy.Spider):
        name = 'imgspider'
        allowed_domains = ['699pic.com', ]
        start_urls = [
            'https://699pic.com/tupian/photo-gou.html',
        ]

        def parse(self, response, **kwargs):
            items = MyImagesItem()
            # Collect the lazy-loaded image URLs and turn them into absolute URLs
            # items['image_urls'] = [response.urljoin(response.xpath('//a/img/@data-original').get())]
            items['image_urls'] = response.xpath('//a/img/@data-original').getall()
            for i, urls in enumerate(items['image_urls']):
                items['image_urls'][i] = response.urljoin(urls)
            yield items
            # Follow every pagination link and parse those pages the same way
            yield from response.follow_all(xpath='//div[@class="pager-linkPage"]/a/@href', callback=self.parse)
            # next_page = response.xpath('//div[@class="pager-linkPage"]/a/@href').get()
    
  5. Run the project: scrapy crawl --nolog imgspider
    Number of images downloaded: 24527


Debug: scrapy ERROR: Error processing {'image_urls':

Reason: Scrapy tries to process your string as if it were a list of image URLs.
Fix: make the field a list, as in test_item['image_urls'] = [image_urls].
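In other words, image_urls must be a list of URL strings, not a single string. Using the field names from the image spider above, the difference looks like this (illustrative snippet):

    # Wrong: .get() returns one string, which the pipeline then iterates character by character
    items['image_urls'] = response.xpath('//a/img/@data-original').get()
    # Right: wrap the single URL in a list (and make it absolute)
    items['image_urls'] = [response.urljoin(response.xpath('//a/img/@data-original').get())]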

Debug: ValueError: Missing scheme in request url:

Scrapy deduplicates request URLs (via RFPDupeFilter); adding dont_filter tells it that a given URL should not take part in deduplication.
There are two ways to keep a request from being filtered:

  • Add the URL's domain to allowed_domains (the offsite filter).
  • Set the dont_filter parameter to True in the scrapy.Request() call.
    The point is that Scrapy may filter out perfectly valid URLs for various reasons, and adding these settings is how we avoid losing them.
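For reference, the second option looks like this inside a spider (a minimal illustration):

    yield scrapy.Request(url, callback=self.parse, dont_filter=True)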


Origin: blog.csdn.net/beauthy/article/details/124006459