[Scrapy Framework] Version 2.4.0 Source Code: Spider Middleware in Detail

Series index: [Scrapy Framework] Version 2.4.0 Source Code: All Configuration Directory Index

Introduction

Spider middleware is a framework of hooks into Scrapy's spider processing mechanism. It lets you plug in custom functionality to process the responses that are sent to spiders, and the requests and items that spiders generate.

Activate spider middleware

To activate a spider middleware component, add it to the SPIDER_MIDDLEWARES setting, a dict whose keys are middleware class paths and whose values are middleware orders. Setting a value to None disables a middleware defined in SPIDER_MIDDLEWARES_BASE, as in the example below.

SPIDER_MIDDLEWARES = {
    'myproject.middlewares.CustomSpiderMiddleware': 543,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}

Default middleware settings (SPIDER_MIDDLEWARES_BASE)

SPIDER_MIDDLEWARES_BASE = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}

Custom Spider middleware

class scrapy.spidermiddlewares.SpiderMiddleware

process_spider_input(self, response, spider)

This method is called for each response that passes through the spider middleware on its way into the spider. process_spider_input() should return None or raise an exception.

  1. If it returns None, Scrapy continues processing the response, executing all other middlewares, until the response is finally handed to the spider for processing.
  2. If it raises an exception, Scrapy will not call the process_spider_input() of any other spider middleware; it will call the request's errback if there is one, otherwise it will start the process_spider_exception() chain.

Parameters:

  • response (Response object) - the response being processed
  • spider (Spider object) - the spider this response is intended for
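
A minimal sketch of a middleware that implements only process_spider_input(); the class name LogResponseMiddleware is hypothetical:

import logging

logger = logging.getLogger(__name__)

class LogResponseMiddleware:
    def process_spider_input(self, response, spider):
        # Log the response, then return None so Scrapy keeps processing it
        # and eventually hands it to the spider callback.
        logger.debug('Response %s entering spider %s', response.url, spider.name)
        return None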

process_spider_output(self, response, result, spider)

This method is called with the results returned from the spider, after the spider has processed the response.
process_spider_output() must return an iterable of Request, dict or Item objects.

Parameters:

  • response (Response object) - the response that generated this output from the spider
  • result (an iterable of Request, dict or Item objects) - the result returned by the spider
  • spider (Spider object) - the spider whose result is being processed
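
For illustration, a minimal sketch (hypothetical class and field names) that drops scraped dict items with short titles while passing everything else through unchanged:

class FilterShortTitlesMiddleware:
    def process_spider_output(self, response, result, spider):
        for element in result:
            # Drop dict items whose (assumed) 'title' field is too short;
            # everything else, including Request objects, passes through.
            if isinstance(element, dict) and len(element.get('title', '')) < 3:
                continue
            yield element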

process_spider_exception(self, response, exception, spider)

This method is called when a spider, or the process_spider_output() method of a previous spider middleware, raises an exception. process_spider_exception() should return either None or an iterable of Request, dict or Item objects.

  1. If it returns None, Scrapy continues handling this exception, executing process_spider_exception() in the following middleware components, until no middleware components are left and the exception reaches the engine (where it is logged and discarded).
  2. If it returns an iterable, the process_spider_output() pipeline resumes, starting from the next spider middleware, and no other process_spider_exception() will be called.

Parameters:

  • response (Response object) - the response being processed when the exception was raised
  • exception (Exception object) - the exception raised
  • spider (Spider object) - the spider that raised the exception
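
A minimal sketch, assuming it is acceptable to swallow spider errors after logging them (class name hypothetical):

import logging

logger = logging.getLogger(__name__)

class SwallowParseErrorsMiddleware:
    def process_spider_exception(self, response, exception, spider):
        logger.warning('Spider %s raised %r while processing %s',
                       spider.name, exception, response.url)
        # Returning an iterable (here, empty) ends the exception chain and
        # resumes the process_spider_output() pipeline from the next middleware.
        return []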

process_start_requests(self, start_requests, spider)

When the spider's start_requests() runs, the process_start_requests() method of the spider middleware is called: it receives an iterable of Request objects (in the start_requests parameter) and must return another iterable of Request objects.

Parameters:

  • start_requests (an iterable of Request objects) - the start requests
  • spider (Spider object) - the spider the start requests belong to
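
A minimal sketch (hypothetical class name and meta key) that tags every start request in its meta:

class TagStartRequestsMiddleware:
    def process_start_requests(self, start_requests, spider):
        for request in start_requests:
            # Only Request objects may be yielded here; no response exists yet.
            request.meta['is_start_request'] = True
            yield request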

from_crawler(cls, crawler)

This class method is the usual entry point for creating the middleware; it provides access to the crawler's settings and signals.
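
A minimal sketch of the pattern, assuming a made-up custom setting MYMW_MAXSIZE:

from scrapy import signals

class SettingsAwareMiddleware:
    def __init__(self, maxsize):
        self.maxsize = maxsize

    @classmethod
    def from_crawler(cls, crawler):
        # Read a (hypothetical) custom setting and connect a signal handler.
        mw = cls(maxsize=crawler.settings.getint('MYMW_MAXSIZE', 100))
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        return mw

    def spider_opened(self, spider):
        spider.logger.info('Middleware enabled with maxsize=%d', self.maxsize)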

Summary of spider middleware

  1. When the spider starts via start_requests(), process_start_requests() of the spider middleware is called.

  2. When a response arrives successfully and is about to enter the spider callback (e.g. parse), process_spider_input() is called.

  3. When the spider yields scrapy.Request() or yields an item, process_spider_output() is called.

  4. When an exception is raised in the spider, process_spider_exception() is called.

Built-in Spider middleware parameter reference

DepthMiddleware (depth middleware)

Used to track the depth of each request within the site to be crawled.

class scrapy.spidermiddlewares.depth.DepthMiddleware

Depth is tracked in the request meta (start requests have depth 0):

request.meta['depth'] = 0

Settings:

  • DEPTH_LIMIT - the maximum depth allowed to crawl for any site. If zero, no limit is imposed.
  • DEPTH_STATS_VERBOSE - whether to collect the number of requests for each depth.
  • DEPTH_PRIORITY - whether to adjust the priority of a request based on its depth.
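
For example, a settings.py fragment (values chosen purely for illustration) that caps the crawl depth and favors shallow pages:

# settings.py
DEPTH_LIMIT = 3             # ignore requests more than 3 links deep
DEPTH_STATS_VERBOSE = True  # record request counts per depth in the crawl stats
DEPTH_PRIORITY = 1          # positive values deprioritize deeper requests (breadth-first tendency)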

HttpErrorMiddleware (error response middleware)

Filters out unsuccessful (error) HTTP responses so that spiders don't have to deal with them.

class scrapy.spidermiddlewares.httperror.HttpErrorMiddleware

Settings in spider.py

class MySpider(CrawlSpider):
    handle_httpstatus_list = [404]

Settings:

HTTPERROR_ALLOWED_CODES = []  # pass responses whose non-200 status codes appear in this list
HTTPERROR_ALLOW_ALL = False   # pass all responses regardless of status code (default: False)
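
For example, a sketch of a spider (names hypothetical) that receives 404 responses instead of having the middleware filter them out:

import scrapy

class NotFoundAwareSpider(scrapy.Spider):
    name = 'notfound_aware'
    start_urls = ['http://example.com/']
    handle_httpstatus_list = [404]  # let 404 responses reach parse()

    def parse(self, response):
        if response.status == 404:
            self.logger.info('Missing page: %s', response.url)
            return
        # ... normal parsing of successful responses ...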

OffsiteMiddleware (filter scope middleware)

Filters out requests for URLs outside the domains covered by the spider.

class scrapy.spidermiddlewares.offsite.OffsiteMiddleware

Settings in spider.py

allowed_domains = ['xxxx.com']  # only crawl content under this domain (domain names, not URLs)

RefererMiddleware (request source middleware)

Populates the Referer request header, based on the URL of the response that generated the request.

class scrapy.spidermiddlewares.referer.RefererMiddleware

Settings in settings.py

REFERER_ENABLED = True  # whether to enable the referer middleware
REFERRER_POLICY = "scrapy-default"  # which referrer policy to apply

REFERRER_POLICY options and the policy class each one uses:

  • "scrapy-default" (default) - scrapy.spidermiddlewares.referer.DefaultReferrerPolicy
  • "no-referrer" - scrapy.spidermiddlewares.referer.NoReferrerPolicy
  • "no-referrer-when-downgrade" - scrapy.spidermiddlewares.referer.NoReferrerWhenDowngradePolicy
  • "same-origin" - scrapy.spidermiddlewares.referer.SameOriginPolicy
  • "origin" - scrapy.spidermiddlewares.referer.OriginPolicy
  • "strict-origin" - scrapy.spidermiddlewares.referer.StrictOriginPolicy
  • "origin-when-cross-origin" - scrapy.spidermiddlewares.referer.OriginWhenCrossOriginPolicy
  • "strict-origin-when-cross-origin" - scrapy.spidermiddlewares.referer.StrictOriginWhenCrossOriginPolicy
  • "unsafe-url" - scrapy.spidermiddlewares.referer.UnsafeUrlPolicy

UrlLengthMiddleware (URL length middleware)

Filter out requests whose URL length exceeds URLLENGTH_LIMIT.

class scrapy.spidermiddlewares.urllength.UrlLengthMiddleware

Settings in settings.py

URLLENGTH_LIMIT = 200  # the maximum URL length allowed for crawled URLs
