Using Downloader Middleware in the Scrapy Framework

Downloader Middleware, the download middleware, is a processing layer that sits between Scrapy's Requests and Responses. Let's first take a look at its place in the architecture, as shown in the figure below.

The Scheduler takes a Request out of the queue and sends it to the Downloader to perform the download; this step passes through the Downloader Middleware. Likewise, when the Downloader finishes downloading the Request and returns the Response to the Spider, the Response passes through the Downloader Middleware again.

That is to say, Downloader Middleware comes into play at two points in the overall architecture:

  • Before the Scheduler dispatches a queued Request to the Downloader; that is, we can modify the Request before the download is executed.

  • Before the Response generated by the download is sent to the Spider; that is, we can modify the Response before the Spider parses it.

Downloader Middleware is very powerful: features such as modifying the User-Agent, handling redirects, setting proxies, retrying on failure, and setting cookies all rely on it. Let's take a look at the detailed usage of Downloader Middleware.

1. Usage Instructions

Note that Scrapy already provides many Downloader Middleware, such as the middleware responsible for failure retries, automatic redirects, and other functionality; these are defined by the DOWNLOADER_MIDDLEWARES_BASE variable.

The contents of the DOWNLOADER_MIDDLEWARES_BASE variable are as follows:

{
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}

This is a dictionary whose keys are the names of the Downloader Middleware built into Scrapy and whose values are the calling priorities. The priority is a number: the smaller the number, the closer the middleware is to the Scrapy engine; the larger the number, the closer it is to the Downloader. When a Request is processed, middleware with smaller numbers have their process_request() methods called first; when a Response comes back, the order is reversed.

To add your own Downloader Middleware to a project, do not modify the DOWNLOADER_MIDDLEWARES_BASE variable directly. Scrapy provides another setting, DOWNLOADER_MIDDLEWARES; we modify this variable to add our own Downloader Middleware and to disable middleware defined in DOWNLOADER_MIDDLEWARES_BASE.
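As a minimal sketch (the CustomMiddleware path is a hypothetical placeholder), a DOWNLOADER_MIDDLEWARES setting that adds a custom middleware and disables a built-in one by mapping it to None might look like this:

DOWNLOADER_MIDDLEWARES = {
    # hypothetical custom middleware, with calling priority 543
    'scrapydownloadertest.middlewares.CustomMiddleware': 543,
    # None disables this built-in entry from DOWNLOADER_MIDDLEWARES_BASE
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

Let's now take a look at the usage of Downloader Middleware in detail.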

2. Core Methods

Scrapy's built-in Downloader Middleware provides Scrapy with basic functionality, but in real projects we often need to define our own. Don't worry: the process is very simple, and we only need to implement a few methods.

Each Downloader Middleware is defined as a class that implements one or more methods. The core methods are as follows:

  • process_request(request, spider)

  • process_response(request, response, spider)

  • process_exception(request, exception, spider)

We only need to implement at least one of these methods to define a Downloader Middleware. Let's look at the detailed usage of all three.

1. process_request(request, spider)

Before the Scrapy engine dispatches a Request to the Downloader, the process_request() method is called; that is, after the Request leaves the scheduling queue and before the Downloader executes the download, we can use process_request() to process the Request. The method's return value must be one of None, a Response object, or a Request object, or it must raise an IgnoreRequest exception.

The parameters of the process_request() method are as follows:

  • request: the Request object, i.e., the Request being processed.

  • spider: the Spider object, i.e., the Spider corresponding to this Request.

Different return types have different effects. The following summarizes the possible return situations; a minimal sketch follows the list.

  • When None is returned, Scrapy continues processing the Request: the process_request() methods of the other Downloader Middleware are executed in turn, until the Downloader executes the Request and obtains the Response. This is simply the process of modifying the Request: each Downloader Middleware modifies the Request in priority order, and the result is finally sent to the Downloader for execution.

  • When a Response object is returned, the process_request() and process_exception() methods of the lower-priority Downloader Middleware are no longer called; instead, the process_response() method of each Downloader Middleware is called in turn. Once those calls complete, the Response object is sent directly to the Spider for processing.

  • When a Request object is returned, the process_request() methods of the lower-priority Downloader Middleware stop executing. The returned Request is put back into the scheduling queue; it is effectively a brand-new Request waiting to be scheduled. Once the Scheduler dispatches it, the process_request() methods of all Downloader Middleware are executed again in order.

  • If an IgnoreRequest exception is raised, the process_exception() methods of all Downloader Middleware are executed in turn. If no method handles the exception, the Request's errback() method is called. If the exception is still not handled, it is ignored.
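As the promised sketch (the ProxyMiddleware name and the proxy address are hypothetical placeholders, not part of this section's project), here is a minimal process_request() implementation that modifies the Request and returns None so Scrapy continues down the middleware chain:

class ProxyMiddleware:
    """A hypothetical middleware that attaches a proxy to every Request."""

    def process_request(self, request, spider):
        # Placeholder address; request.meta['proxy'] is honored by Scrapy's
        # built-in HttpProxyMiddleware.
        request.meta['proxy'] = 'http://127.0.0.1:8888'
        return None  # continue to the remaining middleware and the Downloader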

2. process_response(request, response, spider)

After the Downloader executes the Request, it obtains the corresponding Response, and the Scrapy engine sends the Response to the Spider for parsing. Before it is sent, we can use the process_response() method to process the Response. The method's return value must be one of a Request object or a Response object, or it must raise an IgnoreRequest exception.

The parameters of the process_response() method are as follows:

  • request: the Request object, i.e., the Request corresponding to this Response.

  • response: the Response object, i.e., the Response being processed.

  • spider: the Spider object, i.e., the Spider corresponding to this Response.

The following summarizes the possible return situations; a short sketch follows the list.

  • When a Request object is returned, the process_response() methods of the lower-priority Downloader Middleware are no longer called. The returned Request is put back into the scheduling queue to await scheduling, which makes it effectively a brand-new Request; it will then pass through every middleware's process_request() method in order.

  • When a Response object is returned, the process_response() methods of the lower-priority Downloader Middleware continue to be called, and they continue to process that Response object.

  • If an IgnoreRequest exception is raised, the Request's errback() method is called. If the exception is still not handled, it is ignored.
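As the promised sketch (the RetryOnBlockMiddleware name and the 403 rule are hypothetical; real projects typically rely on the built-in RetryMiddleware), here is what the Request-returning branch might look like:

class RetryOnBlockMiddleware:
    """Hypothetical middleware: reschedule Requests that come back with 403."""

    def process_response(self, request, response, spider):
        if response.status == 403:
            spider.logger.debug('Got 403 for %s, rescheduling', request.url)
            # replace() yields a copy of the Request; dont_filter=True keeps
            # the duplicate filter from silently dropping the retry
            return request.replace(dont_filter=True)
        return response  # otherwise pass the Response down the chain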

3. process_exception(request, exception, spider)

When the Downloader or a process_request() method throws an exception, for example an IgnoreRequest exception, the process_exception() method is called. The method's return value must be one of None, a Response object, or a Request object.

The parameters of the process_exception() method are as follows:

  • request: the Request object, i.e., the Request that raised the exception.

  • exception: the Exception object, i.e., the exception that was thrown.

  • spider: the Spider object, i.e., the Spider corresponding to the Request.

The following summarizes the possible return values; a brief sketch follows the list.

  • When None is returned, the process_exception() methods of the lower-priority Downloader Middleware continue to be called in turn, until all of them have been executed.

  • When a Response object is returned, the process_exception() methods of the lower-priority Downloader Middleware are no longer called; instead, the process_response() method of each Downloader Middleware is called in turn.

  • When a Request object is returned, the process_exception() methods of the lower-priority Downloader Middleware are no longer called, and the returned Request is put back into the scheduling queue to await scheduling, which makes it effectively a brand-new Request; it will then pass through every middleware's process_request() method in order.
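As the promised sketch (the TimeoutRetryMiddleware name is hypothetical, and Scrapy's built-in RetryMiddleware already covers this case in practice), here is how process_exception() might turn a download timeout into a retry:

from twisted.internet.error import TimeoutError

class TimeoutRetryMiddleware:
    """Hypothetical middleware: retry a Request whose download timed out."""

    def process_exception(self, request, exception, spider):
        if isinstance(exception, TimeoutError):
            spider.logger.debug('Timeout for %s, retrying', request.url)
            return request.replace(dont_filter=True)  # reschedule the Request
        return None  # let lower-priority middleware (or the errback) handle it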

The above is the detailed usage logic of these three methods. Before using them, make sure you clearly understand how their return values are handled; when customizing a Downloader Middleware, pay close attention to each method's return type.

Let's use a practical example to deepen our understanding of how Downloader Middleware is used.

3. Project Walkthrough

Create a new project with the following command:

scrapy startproject scrapydownloadertest

This creates a new Scrapy project named scrapydownloadertest. Enter the project directory and create a new Spider with the following command:

scrapy genspider httpbin httpbin.org

This creates a new Spider named httpbin, whose source code is as follows:

import scrapy
class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/']

    def parse(self, response):
        pass

Next, we modify start_urls to ['http://httpbin.org/get']. Then we add a line of log output to the parse() method, printing the text attribute of response, so that we can see the Request information sent by Scrapy.

Modify the Spider content as follows:

import scrapy

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        self.logger.debug(response.text)

Next, run the Spider by executing the following command:

scrapy crawl httpbin

The Scrapy output contains the Request information sent by Scrapy, as follows:

{
  "args": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip,deflate,br", 
    "Accept-Language": "en", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "Scrapy/1.4.0 (+http://scrapy.org)"
  }, 
  "origin": "60.207.237.85", 
  "url": "http://httpbin.org/get"
}

Let's observe the headers. The User-Agent of the Request sent by Scrapy is Scrapy/1.4.0 (+http://scrapy.org), which is set by Scrapy's built-in `UserAgentMiddleware`. The source code of `UserAgentMiddleware` is as follows:

from scrapy import signals

class UserAgentMiddleware(object):
    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)

In the from_crawler() method, the middleware first tries to read USER_AGENT from the settings and passes it to the __init__() method as the user_agent parameter; if no argument is given, user_agent defaults to the string 'Scrapy'. Our new project does not set USER_AGENT, so the user_agent variable here is 'Scrapy'. Then, in the process_request() method, the middleware sets the User-Agent key of the request's headers attribute, thereby setting the User-Agent. So the User-Agent is set by this Downloader Middleware's process_request() method.

There are two ways to modify the User-Agent of a request: one is to modify the USER_AGENT variable in the settings; the other is to modify it through a Downloader Middleware's process_request() method.

The first method is very simple: we only need to add one line defining USER_AGENT in settings.py:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'

This method is generally recommended. But if you want more flexibility, such as a random User-Agent, you need Downloader Middleware. So next we use Downloader Middleware to implement a random User-Agent.

Add a RandomUserAgentMiddleware class to middlewares.py as follows:

import random

class RandomUserAgentMiddleware():
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2',
            'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1'
        ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)

In the class's __init__() method, we first define three different User-Agents and store them in a list. Then we implement the process_request() method. It has a request parameter, and we can modify request directly: here we set the User-Agent key of the request's headers attribute to a User-Agent chosen at random from the list. With that, a Downloader Middleware is written.
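As a variation (a sketch, assuming a hypothetical USER_AGENT_LIST setting), the list could instead be read from the settings via from_crawler(), mirroring the built-in UserAgentMiddleware shown earlier:

import random

class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a hypothetical custom setting in settings.py
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)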

However, to make it take effect, we need to enable this Downloader Middleware. In settings.py, uncomment DOWNLOADER_MIDDLEWARES and set it to the following:

DOWNLOADER_MIDDLEWARES = {
   'scrapydownloadertest.middlewares.RandomUserAgentMiddleware': 543,
}

Next, we re-run the Spider, and we can see that the User-Agent has been successfully modified to a random User-Agent defined in the list:

{
  "args": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip,deflate,br", 
    "Accept-Language": "en", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)"
  }, 
  "origin": "60.207.237.85", 
  "url": "http://httpbin.org/get"
}

By implementing a Downloader Middleware and its process_request() method, we have successfully set a random User-Agent.

In addition, Downloader Middleware has the process_response() method. After the Downloader downloads a Request, it obtains the Response, and the Scrapy engine sends the Response back to the Spider for processing. But before the Response is sent to the Spider, we can also use the process_response() method to process it. For example, to modify the status code of the Response, add the following code to RandomUserAgentMiddleware:

def process_response(self, request, response, spider):
    response.status = 201
    return response

Here we change the status attribute of response to 201 and then return response; the modified Response is then sent to the Spider.

We then output the modified status code in the Spider by adding the following statement to the parse() method:

self.logger.debug('Status Code: ' + str(response.status))

After re-running, the console outputs the following:

[httpbin] DEBUG: Status Code: 201

It can be found that the status code of Response has been successfully modified.

Therefore, if you want to post-process a Response, you can use the process_response() method.

There is also the process_exception() method, which is used to handle exceptions. We can implement this method when we need exception handling. However, this method is used relatively rarely, so it is not demonstrated in this project; see the sketch in the previous section for its general shape.

4. Code for This Section

The source code of this section is: https://github.com/Python3WebSpider/ScrapyDownloaderTest.

5. Conclusion

This section explained the basic usage of Downloader Middleware. This component is very important: it is at the core of exception handling and of countering anti-scraping measures. Later, we will use this component in practice to handle proxies, cookies, and more.


