[Scrapy Framework] "Version 2.4.0 Source Code" Downloader Middleware (Downloader Middleware) Detailed

Introduction

The downloader middleware is a framework of hooks into Scrapy's request/response processing: a lightweight, low-level system for globally altering the requests Scrapy sends and the responses it receives.

Its main uses (a minimal sketch combining them follows this list):

  1. Modify a request before Scrapy sends it to the website, e.g. changing the proxy IP or headers.
  2. Process a response before it is handed back to the engine, e.g. re-issuing the request if the response failed, or repairing the failed response before returning it.
  3. Ignore (drop) certain requests or responses.
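
As a quick illustration of these uses, a minimal sketch (the class name is hypothetical and the proxy addresses are placeholders):

import random

class RandomProxyMiddleware:

    PROXIES = [
        'http://proxy1.example.com:8080',  # placeholder
        'http://proxy2.example.com:8080',  # placeholder
    ]

    def process_request(self, request, spider):
        # Use 1: modify the request before it reaches the downloader.
        request.meta['proxy'] = random.choice(self.PROXIES)
        return None  # let the remaining middlewares run

    def process_response(self, request, response, spider):
        # Use 2: on a failed response, re-queue the request instead of
        # passing the failure on to the engine.
        if response.status in (500, 502, 503):
            return request.replace(dont_filter=True)
        return response

Like any middleware, this class only takes effect once it is activated in DOWNLOADER_MIDDLEWARES, as described next.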

Activate downloader middleware

To activate a downloader middleware component, add it to the DOWNLOADER_MIDDLEWARES setting: a dict whose keys are the middleware class paths and whose values are the middleware orders. Setting a value to None disables a built-in middleware.

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disabled
}

Default middleware settings

DOWNLOADER_MIDDLEWARES_BASE = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}

Custom downloader middleware

class scrapy.downloadermiddlewares.DownloaderMiddleware

process_request(request, spider)

process_request() is called for each request that passes through the downloader middleware, in the order the middlewares are configured in settings.py. If it returns None, Scrapy continues with the process_request() methods of the subsequent middlewares. If it returns a Response or a Request, the remaining process_request() methods are skipped: a Response is handed back toward the engine as if it had been downloaded, and a Request is sent back to the scheduler.

Parameters:

  • request (Request object) – the request being processed
  • spider (Spider object) – the spider this request is intended for
def process_request(self, request, spider):
    # Called for each request that goes through the downloader middleware.
    # Must either:
    # - return None: continue processing this request; the remaining
    #   middlewares' process_request() methods run until a download
    #   handler executes the request and produces a response
    # - return a Response object: the chain stops here; no further
    #   process_request() is called and the response is sent back
    #   through the engine to the spider
    # - return a Request object: the chain stops here; the new request
    #   goes back to the scheduler (usually to swap in a fresh request)
    # - raise IgnoreRequest: the process_exception() methods of the
    #   installed downloader middlewares are called; if none of them
    #   handles the exception, the request is dropped silently without
    #   even an error log entry
    return None
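
As a concrete example, a sketch of a process_request() that rotates the User-Agent header before each download (the class name is hypothetical and the agent strings are shortened placeholders):

import random

class RandomUserAgentMiddleware:

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (X11; Linux x86_64)',
    ]

    def process_request(self, request, spider):
        # Overwrite the User-Agent header, then return None so the
        # request continues through the remaining middlewares.
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None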

process_response(request, response, spider)

process_response() is called with the response returned from the downloader, before that response is passed on to the engine; these methods run in the reverse of the configuration order. It must return a Response object (the same one or a new one), return a Request object, or raise IgnoreRequest.

Parameters:

  • request (Request object) – the request corresponding to the response
  • response (Response object) – the response being processed
  • spider (Spider object) – the spider corresponding to the response
def process_response(self, request, response, spider):
    # Called with the response returned from the downloader, i.e. when
    # the downloader has finished the HTTP request and hands the
    # response back to the engine.
    # Must either:
    # - return a Response object: processing continues and the other
    #   middlewares also handle the response, until it reaches the
    #   engine and then the spider
    # - return a Request object: the chain stops and the request is
    #   returned through the engine to the scheduler
    # - raise IgnoreRequest: the request is ignored and not logged
    return response
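
For example, a sketch (hypothetical class name) that re-queues a request when the site answers with a block page; returning a Request here stops the process_response() chain:

class RetryOnBlockMiddleware:

    def process_response(self, request, response, spider):
        if response.status == 403:
            # dont_filter=True lets the retry pass the duplicate filter.
            spider.logger.info('403 received, retrying %s', request.url)
            return request.replace(dont_filter=True)
        return response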

process_exception(request, exception, spider)

process_exception() is called when a download handler or a process_request() method (from another downloader middleware) raises an exception; these methods run in the reverse of the configuration order. It must return None, a Response object, or a Request object. Returning None passes the exception on to the process_exception() methods of the following middlewares; returning a Response starts the process_response() chain of the installed middlewares; returning a Request sends that request back to be rescheduled.

Parameters:

  • request (Request object) – the request that raised the exception
  • exception (Exception object) – the exception thrown
  • spider (Spider object) – the spider corresponding to the request
def process_exception(self, request, exception, spider):
    # Called when a download handler or a process_request()
    # (from another downloader middleware) raises an exception.
    # Must either:
    # - return None: continue processing this exception
    # - return a Response object: stops the process_exception() chain
    # - return a Request object: stops the process_exception() chain
    pass
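
For example, a sketch (hypothetical class name) that reschedules a request after a download timeout; any other exception is passed on by returning None:

from twisted.internet.error import TimeoutError

class TimeoutRetryMiddleware:

    def process_exception(self, request, exception, spider):
        if isinstance(exception, TimeoutError):
            # Returning a Request stops the process_exception() chain
            # and sends the request back to the scheduler.
            return request.replace(dont_filter=True)
        return None  # let the other middlewares handle the exception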

from_crawler(cls, crawler)

A class method Scrapy calls to create the middleware instance from a Crawler object, giving the middleware access to the crawler's settings and signals.

Parameters:

  • crawler (Crawler object) – the crawler that uses this middleware
# requires: from scrapy import signals
@classmethod
def from_crawler(cls, crawler):
    # Scrapy uses this method to create the middleware instance.
    s = cls()
    crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
    return s
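
Because the crawler carries the project settings, from_crawler() is also the natural place to read configuration. A sketch assuming a hypothetical custom PROXY_LIST setting:

@classmethod
def from_crawler(cls, crawler):
    # PROXY_LIST is a hypothetical custom setting defined in settings.py.
    s = cls(proxy_list=crawler.settings.getlist('PROXY_LIST'))
    crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
    return s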

Built-in downloader middleware parameter reference

CookiesMiddleware (Cookies middleware)

Used to deal with sites that require the use of cookies.

class scrapy.downloadermiddlewares.cookies.CookiesMiddleware

Supports passing the session along with the cookiejar meta key:

def parse(self, response):
    ...
    return scrapy.Request("http://www.example.com/otherpage",
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse_other_page)
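
The cookiejar meta key also makes it possible to keep several independent cookie sessions in one spider, one session per key value:

def start_requests(self):
    urls = ['http://www.example.com/session']
    for i, url in enumerate(urls):
        # Each distinct cookiejar value maintains its own cookie session.
        yield scrapy.Request(url, meta={'cookiejar': i}, callback=self.parse)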

Configuration parameters in settings

COOKIES_ENABLED = True  # whether to enable the cookies middleware
COOKIES_DEBUG = True    # if enabled, Scrapy logs all cookies sent and received

DefaultHeadersMiddleware (default request header middleware)

Sets the specified default request headers on all requests; customized, randomized request headers are recommended instead.

class scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware

Configuration parameters in settings

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

DownloadTimeoutMiddleware (download timeout middleware)

Control the timeout settings for fetching data.

class scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware

Configuration parameters in settings

DOWNLOAD_TIMEOUT = 1  # download timeout in seconds (Scrapy's default is 180)

Configuration parameters in spider

download_timeout = 1  # spider attribute; overrides DOWNLOAD_TIMEOUT for this spider
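
The timeout can also be set for a single request through the download_timeout meta key:

def start_requests(self):
    # Per-request override of DOWNLOAD_TIMEOUT, in seconds
    yield scrapy.Request('http://www.example.com',
        meta={'download_timeout': 1}, callback=self.parse)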

HttpAuthMiddleware (HTTP authentication middleware)

Authenticates all requests generated by certain spiders, using HTTP Basic access authentication with the credentials taken from the spider's http_user and http_pass attributes.

class scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware

from scrapy.spiders import CrawlSpider

class SomeIntranetSiteSpider(CrawlSpider):
    # Credentials picked up by HttpAuthMiddleware for HTTP Basic auth
    http_user = 'root'
    http_pass = 'password'
    ...

HttpCacheMiddleware (Cache Middleware)

Provides low-level caching for all HTTP requests and responses. It must be combined with a cache storage backend and a caching policy.

class scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware

Configuration parameters in settings

HTTPCACHE_ENABLED = True  # whether to enable the HTTP cache
HTTPCACHE_EXPIRATION_SECS = 0  # expiration time for cached requests, in seconds
HTTPCACHE_DIR = 'httpcache'  # directory for the (low-level) HTTP cache; if empty, the cache is disabled
HTTPCACHE_IGNORE_HTTP_CODES = []  # do not cache responses with these HTTP codes
HTTPCACHE_IGNORE_MISSING = True  # requests not found in the cache are ignored (default: False)
HTTPCACHE_IGNORE_SCHEMES = ['file']  # do not cache responses with these URI schemes
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_DBM_MODULE = 'dbm'  # database module used by the DBM storage backend (DBM-specific)
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'  # class implementing the cache policy
HTTPCACHE_GZIP = True  # compress all cached data with gzip (default: False)
HTTPCACHE_ALWAYS_STORE = True  # cache pages unconditionally (default: False)
HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = []  # Cache-Control directives in responses to ignore

Caching policies:

  1. DummyPolicy (default): every request and its response are stored; when an identical request is seen again, the stored response is returned without transferring anything from the internet.
     class scrapy.extensions.httpcache.DummyPolicy
  2. RFC2616Policy: an RFC 2616 compliant cache policy, aimed at production and continuous runs, to save bandwidth and speed up crawls.
     class scrapy.extensions.httpcache.RFC2616Policy

HTTP cache storage backends:

  1. FilesystemCacheStorage (default): a file-system storage backend for the HTTP cache middleware.
     class scrapy.extensions.httpcache.FilesystemCacheStorage

     Configuration parameters in settings

     HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
  2. DbmCacheStorage: a DBM storage backend for the HTTP cache middleware; the database module is chosen with the HTTPCACHE_DBM_MODULE setting.

     Configuration parameters in settings

     HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'
  3. Custom storage backend: create a Python class that implements the cache storage interface.
     class scrapy.extensions.httpcache.CacheStorage
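
A hedged sketch of such a class, following the open/close/retrieve/store method names used by the built-in backends (the in-memory dict keyed by URL is purely illustrative):

class InMemoryCacheStorage:

    def __init__(self, settings):
        self.cache = {}

    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        pass

    def retrieve_response(self, spider, request):
        # Return a cached Response object, or None on a cache miss.
        return self.cache.get(request.url)

    def store_response(self, spider, request, response):
        self.cache[request.url] = response

To use it, point HTTPCACHE_STORAGE at the class path, e.g. 'myproject.cache.InMemoryCacheStorage'.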

HttpCompressionMiddleware (compression middleware)

Allows to send/receive compressed (gzip, deflate) traffic from the website. If brotlipy or zstandard are installed respectively, this middleware also supports decoding of brotli-compressed responses and zstd-compressed responses.

class scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware

Configuration parameters in settings

COMPRESSION_ENABLED = True  # whether to enable the compression middleware

HttpProxyMiddleware (proxy middleware)

Sets the HTTP proxy for a request through the proxy value of the request's meta attribute.

class scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware

Like the Python standard library module urllib.request, it honors the proxy environment variables:

http_proxy = 'http://xxxx:xxxx'
https_proxy = 'http://xxxx:xxxx'
no_proxy

Configuration parameters in settings

HTTPPROXY_ENABLED = True  # whether to enable HttpProxyMiddleware
HTTPPROXY_AUTH_ENCODING = "latin-1"  # default encoding for proxy authentication in HttpProxyMiddleware
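
The proxy can also be set per request through the proxy meta key (the address below is a placeholder):

def start_requests(self):
    yield scrapy.Request('http://www.example.com',
        meta={'proxy': 'http://127.0.0.1:8888'})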

RedirectMiddleware (redirect middleware)

Handle the redirection of the request according to the response status.

class scrapy.downloadermiddlewares.redirect.RedirectMiddleware

Configuration parameters in settings

REDIRECT_ENABLED = True  # whether to enable the redirect middleware
REDIRECT_MAX_TIMES = 20  # maximum number of redirects a single request will follow
HTTPERROR_ALLOWED_CODES = [302]  # pass responses with these non-200 codes through to the spider
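
Redirects can also be disabled for a single request with the dont_redirect meta key, e.g. to inspect the 302 response itself:

def start_requests(self):
    # Receive the 302 response instead of following the redirect
    yield scrapy.Request('http://www.example.com',
        meta={'dont_redirect': True, 'handle_httpstatus_list': [302]})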

MetaRefreshMiddleware (Meta refresh middleware)

Handles redirection of requests based on the meta-refresh HTML tag.

class scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware

Configuration parameters in settings

METAREFRESH_ENABLED = True  # whether to enable the Meta Refresh middleware
METAREFRESH_IGNORE_TAGS = []  # meta tag ignore list
METAREFRESH_MAXDELAY = 100  # maximum meta-refresh delay (in seconds) to follow the redirection

RetryMiddleware (retry middleware)

Retries failed requests. Failed pages are collected during the crawl and rescheduled at the end, once the spider has finished crawling all regular (non-failed) pages.

class scrapy.downloadermiddlewares.retry.RetryMiddleware

Configuration parameters in settings

RETRY_ENABLED = True  # whether to enable the retry middleware
RETRY_TIMES = 2  # maximum number of retries, in addition to the first download
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]  # HTTP response codes to retry
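Retries can also be tuned per request through meta keys:

def start_requests(self):
    # Skip retries for this request entirely
    yield scrapy.Request('http://www.example.com/a', meta={'dont_retry': True})
    # Or raise the per-request retry ceiling above RETRY_TIMES
    yield scrapy.Request('http://www.example.com/b', meta={'max_retry_times': 5})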

RobotsTxtMiddleware (RobotsTxt middleware)

Filter out requests prohibited by robots.txt exclusion criteria.

class scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware

Whether requests honor robots.txt is controlled by the ROBOTSTXT_OBEY setting; in practice, fully complying with it often leaves little to crawl.

DownloaderStats (download statistics)

Used to store statistics about all requests, responses and exceptions that pass through it.

class scrapy.downloadermiddlewares.stats.DownloaderStats

Configuration parameters in settings

DOWNLOADER_STATS = True  # whether to enable downloader statistics collection

UserAgentMiddleware (user agent middleware)

Lets a spider override the default user agent: if the spider defines a user_agent attribute, it is used for all of that spider's requests.

class scrapy.downloadermiddlewares.useragent.UserAgentMiddleware
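
For the override to take effect, define a user_agent attribute on the spider (the UA string below is a placeholder):

class MySpider(scrapy.Spider):
    name = 'myspider'
    # Used as the User-Agent header for every request this spider makes
    user_agent = 'Mozilla/5.0 (compatible; MyBot/1.0)'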

AjaxCrawlMiddleware (AjaxCrawl Middleware)

Middleware that finds "AJAX crawlable" page variants based on the meta-fragment HTML tag.

class scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware

Configuration parameters in settings

AJAXCRAWL_ENABLED = True  # whether to enable AjaxCrawlMiddleware
