Overview of scrapy crawler framework [basic use]

Scrapy framework

Architecture

  1. Engine: processes the data flow of the whole system and triggers events.
  2. Item: the data structure (a class) that defines the crawled result.
  3. Scheduler: maintains the queue of requests.
  4. Downloader: downloads pages, i.e. executes the requests.
  5. Spiders: define the crawling logic and the page parsing rules.
  6. Item Pipeline: processes the resulting data, e.g. cleaning and storing it.
  7. Downloader Middlewares: hooks between the engine and the downloader.
  8. Spider Middlewares: hooks between the engine and the spiders.

data flow

  1. The subproject (spider) is invoked from the command line.
  2. Engine finds the corresponding Spider and gets the first url.
  3. The Spider sends the first url to the scheduler (via the engine).
  4. The scheduler returns the url in the queue to the engine, and the engine sends the url to the Downloader to request the page through the middleware.
  5. After the downloader completes the request, it sends the response result to the engine, and the engine forwards it to the spider.
  6. The Spider processes the response, extracting items or new requests, and returns them to the engine.
  7. The engine sends items to the Item Pipeline and new requests to the scheduler.

Project structure

scrapy startproject tutorial

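The command generates roughly the following layout:

tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/
        __init__.py
        items.py          # Item definitions
        middlewares.py    # downloader / spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py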

scrapy genspider quotes quotes.toscrape.com

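The generated spiders/quotes.py is roughly the following skeleton (details vary slightly between Scrapy versions):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass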

project

item


Used to store data objects, similar to dictionary usage (item['title']='title'). Define the required fields.
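
A matching Item definition for the fields used by the spider below might look like this (a sketch):

import scrapy


class ArticleItem(scrapy.Item):
    # fields are accessed like dictionary keys, e.g. item['title'] = 'title'
    title = scrapy.Field()
    link = scrapy.Field()
    content = scrapy.Field()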

spider


  1. name: the name of the Spider subproject (used to start it with scrapy crawl name; must be unique).
  2. allowed_domains: Domain names that this subproject is allowed to crawl.
  3. start_urls: A list of Spider's initial startup urls.
  4. parse(): after each url in start_urls has been requested, the returned response is passed to this method for processing, which then returns items or new requests. For example:
def parse(self, response):
    article_list = response.css(".common-list")
    article_list = article_list.xpath(".//ul/li")
    for article in article_list:
        item = ArticleItem()
        item['link'] = article.xpath(".//a/@href").extract_first()
        item['title'] = article.xpath(".//a/@title").extract_first()
        item['content'] = article.xpath(".//a/text()").re_first('(.*)')
        # yield item  # return the item object and hand it to the pipelines
        # yield scrapy.Request(url=item['link'], meta={'item': item}, callback=self.parse_detail)
        # Return a new request; its response is handled by the custom parse_detail method.
        # If data needs to be passed along, pass it via meta and retrieve it there.
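
For the commented-out request above, the corresponding parse_detail callback might look like this (a sketch; the detail-page XPath is an assumption):

def parse_detail(self, response):
    # retrieve the item that was passed along via meta
    item = response.meta['item']
    # placeholder XPath for whatever the detail page actually contains
    item['content'] = response.xpath('//div[@class="content"]//text()').getall()
    yield item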

pipeline

After the data has been extracted and assembled into items, it can be saved to a file simply by adding command-line options when starting the subproject, for example:

  • scrapy crawl quotes -o quotes.json # save as a JSON file
  • scrapy crawl quotes -o quotes.csv # save as csv (xml, pickle and marshal are also supported)
  • scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/to/quotes/quotes.csv # output can also be saved to a remote server

In addition, for more complex requirements you can write an Item Pipeline:

  1. Pipeline usage

    1. A pipeline is written as a Pipeline class that defines a process_item() method. Once the class is enabled, this method is called for every item and must return an item object or raise a DropItem exception (from scrapy.exceptions import DropItem).
    2. In the example below (see the sketch after this list), two pipeline classes are defined:
      • The first class, ArticlePipeline, cleans the data in the item object; for example, it keeps only the first 50 characters of the title.
      • The second class, MongoPipeline, stores item objects in a MongoDB database. In its from_crawler class method, the global configuration from settings.py is read through the crawler parameter; open_spider and close_spider are called when the spider is opened or closed, respectively.
  2. Pipeline call order

    1. After the spider returns an item, the process_item() method of each enabled Pipeline class is called in priority order.
    2. For a Pipeline class to be called, it must be added to ITEM_PIPELINES in settings.py (see the settings example at the end of this article).
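
A minimal sketch of the two classes described above, assuming the fields used earlier (title, link, content) and hypothetical settings keys MONGO_URI / MONGO_DB:

import pymongo
from scrapy.exceptions import DropItem


class ArticlePipeline:
    def process_item(self, item, spider):
        if item.get('title'):
            item['title'] = item['title'][:50]  # keep only the first 50 characters
            return item
        raise DropItem('Missing title')


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the global configuration from settings.py via the crawler object
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # 'articles' is an assumed collection name
        self.db['articles'].insert_one(dict(item))
        return item

Both classes still need to be registered in ITEM_PIPELINES (see the settings example at the end of this article).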

usage

Usage of spider

Defines the actions for crawling a website and how to parse the crawled pages. A minimal example follows this list.
  1. name: the name of the crawler, a string that identifies the Spider. Scrapy uses it to locate and initialize the spider, so it must be unique. A common convention is to name it after the site's domain.
  2. allowed_domains: the domains that are allowed to be crawled. This is optional; links outside these domains will not be followed.
  3. start_urls: the starting urls, effectively the initial contents of the scheduler.
  4. custom_settings: a class attribute with spider-specific configuration that overrides the global settings.
  5. crawler: set by from_crawler(); gives access to the crawler and its configuration.
  6. settings: a Settings object used to read the global settings.
  7. start_requests(): produces the initial requests. Must return an iterable of Request objects.
  8. parse(): the default callback, used when a Response has no explicit callback.
  9. closed(): called when the spider is closed; used to release resources, etc.
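
A minimal sketch illustrating these attributes and methods (the spider name, URL and settings values are placeholders):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']
    custom_settings = {'DOWNLOAD_DELAY': 1}  # overrides the global settings for this spider only

    def start_requests(self):
        # must return/yield an iterable of Request objects
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # default callback when a Request has no explicit callback
        self.logger.info('Loaded %s with USER_AGENT=%s',
                         response.url, self.settings.get('USER_AGENT'))

    def closed(self, reason):
        # called when the spider closes; release resources here
        self.logger.info('Spider closed: %s', reason)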

Usage of Downloader Middleware

Requests sent from the scheduler/engine to the downloader, and responses sent from the downloader back to the spider, both pass through the downloader middleware.

Description

Downloader middleware can be used to modify the User-Agent, handle redirection, set proxies, retry on failure, set cookies, and so on.
  1. Each middleware contains one or more methods; there are three core methods:
  • process_request(request, spider) # called before the engine sends the request to the downloader # returns Request / Response / None, or raises an exception
  • process_response(request, response, spider) # called after the downloader has executed the request, before the response reaches the spider # returns Response / Request, or raises an exception
  • process_exception(request, exception, spider) # called when the downloader or process_request raises an exception # returns Request / Response / None
  2. Scrapy ships with middleware such as retry-on-failure and automatic redirection. Each middleware is executed in priority order: for process_request(), the smaller the priority number, the earlier it is called; for process_response(), the larger the priority number, the earlier it is called.
Return values
  1. process_request() returns None, a Response object, a Request object, or raises an IgnoreRequest exception:
  • None: the remaining middlewares continue to execute their process_request() methods in priority order (the request is merely modified and continues on its way).
  • Response object: the lower-priority process_request() methods are skipped and the process_response() methods are executed instead (equivalent to having already produced the response).
  • Request object: the lower-priority middlewares stop executing and the new Request is handed back to the scheduler (equivalent to building a new request).
  • IgnoreRequest exception: the process_exception() methods of the lower-priority middlewares are executed in turn; if the exception is still unhandled, the request's errback() is called; if it is still unhandled there, it is ignored.
  2. process_response() returns a Response object, a Request object, or raises an IgnoreRequest exception:
  • Response object: the lower-priority middlewares continue to execute their process_response() methods.
  • Request object: process_response() of the lower-priority middlewares is not called, and the new request is sent to the scheduler.
  • IgnoreRequest exception: the request's errback() is called.
  3. process_exception() returns None, a Response object, or a Request object:
  • None: the process_exception() methods of the lower-priority middlewares continue to be called.
  • Response object: the process_response() methods of the lower-priority middlewares are called.
  • Request object: the lower-priority middlewares are no longer called, and the new request is sent to the scheduler.

Usage


As an example, suppose we define a middleware that:

  • in the process_request method, sets a randomly chosen request header (User-Agent);
  • in the process_response method, replaces a 302 status code in the response with 200;
  • and is then registered in settings.py.

A sketch of such a middleware follows.
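
A minimal sketch (the class name, the User-Agent list and the settings priority are assumptions, not from the original post):

import random


class RandomUserAgentMiddleware:
    # a couple of example User-Agent strings; replace with your own pool
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    ]

    def process_request(self, request, spider):
        # randomly set the request header before it reaches the downloader
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None  # continue with the remaining middlewares and the downloader

    def process_response(self, request, response, spider):
        # replace a 302 status code with 200 before the response reaches the spider
        if response.status == 302:
            return response.replace(status=200)
        return response

It would then be enabled in settings.py via DOWNLOADER_MIDDLEWARES, e.g. 'tutorial.middlewares.RandomUserAgentMiddleware': 543 (the module path and priority value are assumptions).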

Scrapy's built-in downloader middleware

    DOWNLOADER_MIDDLEWARES_BASE = {
        # Engine side
        'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
        'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
        'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
        'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
        'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
        'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
        'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
        'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
        # Downloader side
    }

Usage of Spider Middleware

Responses on their way from the downloader to the Spider, and the items and requests generated by the Spider, are both processed by the spider middleware.

Description

Core methods

Each middleware defines one or more of the following four methods:

  1. process_spider_input(response, spider)
    1. Parameters: response - the response body to be processed, spider - the corresponding spider
    2. Return value: None or throw an exception
      • When it returns None, continue to call low-priority middleware.
      • When an exception is thrown, the lower-priority middlewares are not executed; the request's errback() is called instead, and the output of the errback is fed back into the middleware chain through process_spider_output().
  2. process_spider_output(response,result,spider)
    1. Parameters: response - the response body to be processed, result - an iterable object containing the request or item object, which is the return value of the Spider.
    2. Return value: an iterable object that returns the request or item
  3. process_spider_exception(response,exception,spider)
    1. Parameters: response - the response being processed, exception - the exception that was thrown, spider - the spider that raised it
    2. Return value: None or (an iterable object containing the request or item)
      • When it returns None, the lower-priority middlewares continue to be called to handle the exception.
      • When it returns an iterable, the lower-priority process_spider_output() methods are called.
  4. process_start_requests(start_requests, spider)
    1. Parameters: start_requests - an iterable containing the requests, spider - the spider object
    2. Return value: an iterable object containing the request.
Usage

The built-in OffsiteMiddleware is a good example: it implements process_spider_output() to filter out requests whose domains are not covered by allowed_domains.
    # Excerpt from scrapy/spidermiddlewares/offsite.py; the spider_opened() and
    # should_follow() helpers referenced below are omitted from this excerpt.
    import logging

    from scrapy import signals
    from scrapy.http import Request
    from scrapy.utils.httpobj import urlparse_cached

    logger = logging.getLogger(__name__)


    class OffsiteMiddleware:

        def __init__(self, stats):
            self.stats = stats
    
        @classmethod
        def from_crawler(cls, crawler):
            o = cls(crawler.stats)
            crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
            return o
    
        def process_spider_output(self, response, result, spider):
            for x in result:
                if isinstance(x, Request):
                    if x.dont_filter or self.should_follow(x, spider):
                        yield x
                    else:
                        domain = urlparse_cached(x).hostname
                        if domain and domain not in self.domains_seen:
                            self.domains_seen.add(domain)
                            logger.debug(
                                "Filtered offsite request to %(domain)r: %(request)s",
                                {'domain': domain, 'request': x}, extra={'spider': spider})
                            self.stats.inc_value('offsite/domains', spider=spider)
                        self.stats.inc_value('offsite/filtered', spider=spider)
                else:
                    yield x

Scrapy's built-in Spider Middleware

    SPIDER_MIDDLEWARES_BASE = {
        # Engine side
        'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
        'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
        'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
        'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
        # Spider side
    } 

Usage of Item Pipeline

Main functions of the Item Pipeline:

  • Clean HTML data
  • Validate crawled data, check crawled fields
  • Check for duplicates and discard duplicates
  • Save the crawling results to the database

Core methods

In addition to the required process_item(), there are several other practical methods (a simple de-duplication pipeline using process_item() is sketched after the list below):

  1. open_spider(spider)
  2. close_spider(spider)
  3. from_crawler(cls,crawler)
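
As a small illustration of process_item() together with DropItem (an assumed example, not from the original post), a pipeline that discards duplicates based on the link field:

from scrapy.exceptions import DropItem


class DuplicatesPipeline:
    def __init__(self):
        self.seen_links = set()

    def process_item(self, item, spider):
        # assumes the item has a 'link' field; drop items already seen
        if item['link'] in self.seen_links:
            raise DropItem(f"Duplicate item found: {item['link']}")
        self.seen_links.add(item['link'])
        return item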

Scrapy provides Pipelines that handle downloads, including file downloads and image downloads. Files and images are downloaded in the same way as pages are crawled, so the download process is asynchronous and very efficient.

Built-in image download Pipeline
    class ImagesPipeline(FilesPipeline):
        """Abstract pipeline that implement the image thumbnail generation logic
    
        """
    
        MEDIA_NAME = 'image'
    
        # Uppercase attributes kept for backward compatibility with code that subclasses
        # ImagesPipeline. They may be overridden by settings.
        MIN_WIDTH = 0
        MIN_HEIGHT = 0
        EXPIRES = 90
        THUMBS = {}
        DEFAULT_IMAGES_URLS_FIELD = 'image_urls'
        DEFAULT_IMAGES_RESULT_FIELD = 'images'
        def __init__(self, store_uri, download_func=None, settings=None):
          ......
            if not hasattr(self, "IMAGES_RESULT_FIELD"):
                self.IMAGES_RESULT_FIELD = self.DEFAULT_IMAGES_RESULT_FIELD
            if not hasattr(self, "IMAGES_URLS_FIELD"):
                self.IMAGES_URLS_FIELD = self.DEFAULT_IMAGES_URLS_FIELD
    
            self.images_urls_field = settings.get(
                resolve('IMAGES_URLS_FIELD'),
                self.IMAGES_URLS_FIELD
            )
        def get_media_requests(self, item, info):
            urls = ItemAdapter(item).get(self.images_urls_field, [])
            return [Request(u) for u in urls]

As you can see, this Pipeline reads the image_urls field of the item by default and expects it to be a list. In get_media_requests() it iterates over image_urls and generates Request objects, which are handed to the scheduler.
The ImagesPipeline class can be overridden as needed:

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class ImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        yield Request(item['image_url'])

Here, get_media_requests() is overridden to take the image_url field from the item directly and issue a request for it.
Similarly:

class ImagesPipeline(ImagesPipeline):
    ... ...
    def file_path(self, request, response=None, info=None, *, item=None):
        # use the last segment of the image URL as the file name
        url = request.url
        file_name = url.split("/")[-1]
        return file_name

    def item_completed(self, results, item, info):
        # results is a list of (success, info_or_failure) tuples
        images_paths = [x['path'] for ok, x in results if ok]
        if not images_paths:
            item['image_url'] = None
        return item

file_path() is overridden to use part of the image's URL as the file name. item_completed() is overridden so that, after the image requests finish, the item's image_url field is set to None if the download failed.
Finally, register the pipelines in settings.py:

ITEM_PIPELINES = {
   'tutorial.pipelines.ImagesPipeline': 301,
   'tutorial.pipelines.ArticlePipeline': 301,
   'tutorial.pipelines.MongoPipeline': 400,
}

Cookies pool integration

Proxy pool integration

crawl template

View the available templates: scrapy genspider -l (basic, crawl, csvfeed, xmlfeed)

Create a spider from the crawl template:
scrapy genspider -t crawl china tech.china.com

rules: the crawling-rules attribute of a CrawlSpider. It is a list (or tuple) of one or more Rule objects, each of which defines an action for crawling the website.

Define Rules

Different parsing callbacks should be implemented for different kinds of pages. For example:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# NewsItem and ChinaLoader come from the project's item/loader definitions
# (a possible definition is sketched after this block).


class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['http://tech.china.com/articles']

    rules = (
        # Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        # extract article detail pages and parse them with parse_item
        Rule(LinkExtractor(allow=r'article\/.*\.html',
                           restrict_xpaths='//div[@id="left_side"]//div[@class="com_item"]'),
             callback='parse_item'),
        # follow the "下一页" ("next page") pagination links
        Rule(LinkExtractor(restrict_xpaths='//div[@id="pageStyle"]//a[contains(text(),"下一页")]'))
    )

    def parse_item(self, response):
        loader = ChinaLoader(item=NewsItem(), response=response)
        loader.add_xpath('title', '//h1[@id="chan_newsTtile"]/text()')
        loader.add_value('content', '新闻内容')  # placeholder value ("news content")
        yield loader.load_item()
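
The NewsItem and ChinaLoader used above are not shown in the original post; a possible definition, assuming an ItemLoader with TakeFirst() so each field stores a single value:

import scrapy
from itemloaders.processors import Join, TakeFirst
from scrapy.loader import ItemLoader


class NewsItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()


class ChinaLoader(ItemLoader):
    # take the first extracted value for each field by default
    default_output_processor = TakeFirst()
    # join all extracted fragments of the content field into one string
    content_out = Join()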

gerapy framework

Common commands

(You need to start the scrapyd service first, which requires installing the scrapyd library: pip install scrapyd.)

  • pip install gerapy
  • gerapy init
  • gerapy migrate
  • gerapy createsuperuser
    • create user
  • gerapy runserver

Configure the host


Add item


Package the project


Create a task

