Lecture 44: Using Powerful Middleware

In the Scrapy architecture we can see a component called Middleware. Scrapy has two kinds of middleware: Spider Middleware and Downloader Middleware. In this lesson, let's introduce them separately.

Usage of Spider Middleware

Spider Middleware is a hook framework that intervenes in Scrapy's Spider processing mechanism.

After the Downloader generates a Response, the Response is sent to the Spider; before it reaches the Spider, it first passes through the Spider Middleware for processing. Likewise, after the Spider generates Items and Requests, those Items and Requests also pass through the Spider Middleware.

Spider Middleware serves the following three functions.

  • It can process a Response generated by the Downloader before that Response is sent to the Spider.

  • It can process a Request generated by the Spider before that Request is sent to the Scheduler.

  • It can process an Item generated by the Spider before that Item is sent to the Item Pipeline.

Instructions for use

It should be noted that Scrapy already provides many Spider Middlewares, which are defined in the SPIDER_MIDDLEWARES_BASE variable.

The contents of the SPIDER_MIDDLEWARES_BASE variable are as follows:

{
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}

Like Downloader Middleware, a custom Spider Middleware is first added to the SPIDER_MIDDLEWARES setting, which Scrapy merges with the middleware defined in SPIDER_MIDDLEWARES_BASE. The merged middleware are then sorted by their numeric priority into an ordered list: the first middleware is the one closest to the engine, and the last is the one closest to the Spider.
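
For example, here is a minimal sketch of registering a custom Spider Middleware in settings.py (the myproject.middlewares.MySpiderMiddleware path is hypothetical); setting a value to None disables a built-in middleware:

SPIDER_MIDDLEWARES = {
    # hypothetical custom middleware registered with priority 543
    'myproject.middlewares.MySpiderMiddleware': 543,
    # setting the value to None disables a built-in middleware
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}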

Core methods

Scrapy's built-in Spider Middleware provides basic functions for Scrapy. If we want to extend its functionality, we only need to implement certain methods.

Each Spider Middleware defines one or more of the following methods, and the core methods are as follows.

  • process_spider_input(response, spider)

  • process_spider_output(response, result, spider)

  • process_spider_exception(response, exception, spider)

  • process_start_requests(start_requests, spider)

We only need to implement one of these methods to define a Spider Middleware. Let's take a look at the detailed usage of these four methods.

process_spider_input(response, spider)

When a Response passes through the Spider Middleware, this method is called to process the Response.

There are two method parameters:

  • response, the Response object, i.e. the Response being processed;

  • spider, the Spider object, i.e. the Spider corresponding to the Response.

process_spider_input() should return None or throw an exception.

  • If it returns None, Scrapy will continue to process the Response and call all other Spider Middleware until the Spider processes the Response.

  • If it throws an exception, Scrapy will not call the process_spider_input() method of any other Spider Middleware, but will instead call the Request's errback() method. The errback output is fed back through the middleware chain in the other direction, to be processed by the process_spider_output() methods, or by process_spider_exception() if it throws an exception.
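
As a minimal sketch, a Spider Middleware that only inspects incoming Responses might look like this (the class name LogResponseMiddleware is hypothetical, and it would still need to be registered in SPIDER_MIDDLEWARES):

class LogResponseMiddleware:
    def process_spider_input(self, response, spider):
        # inspect every Response on its way to the Spider
        spider.logger.debug('%s received %s (status %s)',
                            spider.name, response.url, response.status)
        # returning None lets the remaining middleware and the Spider run
        return None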

process_spider_output(response, result, spider)

This method is called when the Spider has processed the Response and returned its result.

There are three method parameters:

  • response, the Response object, the Response that generated the output;

  • result, an iterable object containing Request or Item objects, that is, the result returned by Spider;

  • spider, the Spider object, that is, the Spider corresponding to the result.

process_spider_output() must return an iterable object containing Request or Item objects.
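
For illustration, a hypothetical middleware might pass Requests through untouched while leaving a hook for modifying or dropping Items:

import scrapy

class FilterOutputMiddleware:
    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, scrapy.Request):
                # Requests pass through to the Scheduler unchanged
                yield obj
            else:
                # Items (or dicts) could be modified or dropped here
                yield obj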

process_spider_exception(response, exception, spider)

This method is called when the process_spider_input() method of Spider or Spider Middleware throws an exception.

There are three method parameters:

  • response, the Response object, i.e. the Response being processed when the exception was thrown;

  • exception, the Exception object, i.e. the exception thrown;

  • spider, the Spider object, i.e. the Spider that threw the exception.

process_spider_exception() must return either None or an iterable containing Request or Item objects.

  • If it returns None, Scrapy continues processing the exception, calling process_spider_exception() in the other Spider Middleware until all of them have been called.

  • If it returns an iterable object, the process_spider_output() methods of the subsequent Spider Middleware are called, and no other process_spider_exception() is called.
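
As a minimal sketch, assuming we simply want to log and swallow parsing errors (the class name is hypothetical):

class SwallowExceptionMiddleware:
    def process_spider_exception(self, response, exception, spider):
        spider.logger.warning('Error parsing %s: %s', response.url, exception)
        # returning an (even empty) iterable stops further
        # process_spider_exception() calls for this exception
        return []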

process_start_requests(start_requests, spider)

This method is called with the Requests the Spider starts with (the start Requests) as its argument. Its execution is similar to process_spider_output(), except that it has no associated Response and must return only Requests.

There are two method parameters:

  • start_requests, which is an iterable object containing Requests, namely Start Requests;

  • spider, the Spider object, that is, the spider to which Start Requests belong.

It must return another iterable object containing Request objects.
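
For instance, a hypothetical middleware could tag every start Request before it is scheduled:

class TagStartRequestsMiddleware:
    def process_start_requests(self, start_requests, spider):
        for request in start_requests:
            # mark each start Request so later components can recognize it
            request.meta['is_start_request'] = True
            yield request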

Usage of Downloader Middleware

Downloader Middleware is the downloading middleware; it is a processing layer that sits between Scrapy's Requests and Responses.

When the Scheduler takes a Request from the queue and sends it to the Downloader for download, the Request passes through the Downloader Middleware. Likewise, when the Downloader returns the Response to the Spider after the Request has been downloaded, the Response passes through the Downloader Middleware again.

In other words, Downloader Middleware operates at the following two positions in the overall architecture.

  • Before a Request dispatched from the Scheduler's queue is sent to the Downloader, i.e. we can modify the Request before it is downloaded.

  • Before the Response generated by the download is sent to the Spider, i.e. we can modify the Response before the Spider parses it.

Downloader Middleware is very powerful: functions such as modifying the User-Agent, handling redirects, setting proxies, retrying failures, and setting Cookies all rely on it. Let's take a look at its detailed usage.

Instructions for use

It should be noted that Scrapy already provides many Downloader Middlewares, such as those responsible for retrying failures, automatic redirects, and other functions; they are defined in the DOWNLOADER_MIDDLEWARES_BASE variable.

The contents of the DOWNLOADER_MIDDLEWARES_BASE variable are as follows:

{
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}

This is a dictionary whose keys are the names of the Downloader Middlewares built into Scrapy and whose values are their calling priorities. The priority is a number: the smaller it is, the closer the middleware is to the Scrapy engine; the larger it is, the closer to the Downloader. Each Downloader Middleware can define process_request() and process_response() methods to process the Request and Response respectively. For process_request(), middleware with smaller priority numbers are called first; for process_response(), middleware with larger priority numbers are called first.

If you want to add your own Downloader Middleware to a project, do not modify the DOWNLOADER_MIDDLEWARES_BASE variable directly. Scrapy provides another setting, DOWNLOADER_MIDDLEWARES; by modifying it we can add our own Downloader Middleware and disable middleware defined in DOWNLOADER_MIDDLEWARES_BASE. Let's look at the use of Downloader Middleware concretely.
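
For example, here is a sketch of settings.py that adds a hypothetical custom middleware (myproject.middlewares.MyDownloaderMiddleware) and disables the built-in UserAgentMiddleware:

DOWNLOADER_MIDDLEWARES = {
    # hypothetical custom middleware registered with priority 543
    'myproject.middlewares.MyDownloaderMiddleware': 543,
    # setting the value to None disables a built-in middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}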

Core methods

Scrapy's built-in Downloader Middleware provides basic functionality, but in actual projects we often need to define our own Downloader Middleware. Don't worry, this process is very simple; we only need to implement a few methods.

Each Downloader Middleware defines one or more of the following methods; the core methods are as follows.

  • process_request(request, spider)

  • process_response(request, response, spider)

  • process_exception(request, exception, spider)

We only need to implement at least one of these methods to define a Downloader Middleware. Let's take a look at the detailed usage of these three methods.

process_request(request, spider)

Before the Scrapy engine dispatches a Request to the Downloader, the process_request() method is called; that is, before the Request is sent from the queue for download, we can process it with process_request(). The method must return None, a Response object, or a Request object, or throw an IgnoreRequest exception.

The process_request() method has the following two parameters.

  • request, the Request object, i.e. the Request being processed;

  • spider, the Spider object, i.e. the Spider corresponding to the Request.

Different return values have different effects; they are summarized below, followed by a minimal sketch.

  • When None is returned, Scrapy continues processing the Request, executing the process_request() methods of the other Downloader Middleware until the Downloader executes the Request and obtains a Response. This is essentially the process of modifying the Request: each Downloader Middleware modifies it in turn according to the set priority order, and finally the Request is pushed to the Downloader for execution.

  • When a Response object is returned, the process_request() and process_exception() methods of the lower-priority Downloader Middleware are no longer called; instead, each Downloader Middleware's process_response() method is called in turn, after which the Response object is sent directly to the Spider for processing.

  • When a Request object is returned, the process_request() methods of the lower-priority Downloader Middleware stop executing, and the returned Request is put back into the scheduling queue; it is in effect a brand-new Request waiting to be scheduled. Once the Scheduler dispatches it, all the Downloader Middlewares' process_request() methods are executed again in order.

  • If an IgnoreRequest exception is thrown, the process_exception() methods of all Downloader Middleware are executed in turn. If no method handles the exception, the Request's errback() method is called back; if the exception is still not handled, it is ignored.
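
To make these rules concrete, here is a minimal sketch of a process_request() implementation that routes every Request through a proxy (the class name and proxy address are placeholders):

class ProxyMiddleware:
    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads request.meta['proxy']
        request.meta['proxy'] = 'http://127.0.0.1:8888'
        # returning None lets the other middleware and the Downloader proceed
        return None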

process_response(request, response, spider)

After the Downloader executes the Request, it obtains the corresponding Response. The Scrapy engine sends the Response to the Spider for parsing; before it is sent, we can process the Response with the process_response() method. The method must return a Request object or a Response object, or throw an IgnoreRequest exception.

The process_response() method has the following three parameters.

  • request is the Request object, which is the Request corresponding to this Response.

  • response is the Response object, which is the Response being processed.

  • spider is the Spider object, which is the Spider corresponding to this Response.

The following summarizes the different return situations:

  • When a Request object is returned, the process_response() methods of the lower-priority Downloader Middleware are not called. The returned Request is placed back in the scheduling queue to be dispatched, which makes it in effect a brand-new Request; once scheduled, it is processed in turn by the process_request() methods.

  • When a Response object is returned, the process_response() methods of the lower-priority Downloader Middleware continue to be called to process the Response.

  • If an IgnoreRequest exception is thrown, the Request's errback() method is called back. If the exception is still not handled, it is ignored.

process_exception(request, exception, spider)

When the Downloader or the process_request() method throws an exception, for example an IgnoreRequest exception, the process_exception() method is called. The method must return None, a Response object, or a Request object.

There are three parameters for the process_exception() method as follows.

  • request, the Request object, that is, the Request that generated the exception.

  • exception, the Exception object, which is the exception thrown.

  • spider, the Spider object, i.e. the Spider corresponding to the Request.

The different return values are summarized below, followed by a minimal sketch.

  • When None is returned, the process_exception() methods of the lower-priority Downloader Middleware are called in turn until all of them have been called.

  • When a Response object is returned, the process_exception() methods of the lower-priority Downloader Middleware are no longer called; instead, each Downloader Middleware's process_response() method is called in turn.

  • When a Request object is returned, the process_exception() methods of the lower-priority Downloader Middleware are no longer called. The returned Request is placed back in the scheduling queue to be dispatched as a brand-new Request, which is then processed in turn by the process_request() methods.
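
As a minimal sketch (the class name is hypothetical, and Scrapy's built-in RetryMiddleware already covers this case), a process_exception() implementation might reschedule timed-out Requests:

from twisted.internet.error import TimeoutError

class TimeoutRetryMiddleware:
    def process_exception(self, request, exception, spider):
        if isinstance(exception, TimeoutError):
            spider.logger.warning('Timeout on %s, rescheduling', request.url)
            # returning a Request puts a fresh copy back into the queue
            return request.copy()
        # returning None lets lower-priority middleware handle the exception
        return None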

The above covers the detailed logic of these three methods. Before using them, make sure you clearly understand how their return values are handled; when customizing a Downloader Middleware, pay close attention to the return type of each method.

Let's use a practical case to deepen our understanding of the usage of Downloader Middleware.

Hands-on project

Create a new project, the command is as follows:

scrapy startproject scrapydownloadertest

This creates a new Scrapy project named scrapydownloadertest. Enter the project directory and create a new Spider with the following command:

scrapy genspider httpbin httpbin.org

This creates a Spider named httpbin; its source code is as follows:

import scrapy
class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/']

    def parse(self, response):
        pass

Next we modify start_urls to ['http://httpbin.org/get']. Then we add a line of log output in the parse() method to print the text attribute of the Response, so that we can see the Request information Scrapy sent.
Modify Spider content as follows:

import scrapy

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        self.logger.debug(response.text)

Next, run this Spider with the following command:

scrapy crawl httpbin

Scrapy's output contains the Request information that was sent, as follows:

{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate,br",
    "Accept-Language": "en",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "Scrapy/1.4.0 (+http://scrapy.org)"
  },
  "origin": "60.207.237.85",
  "url": "http://httpbin.org/get"
}

Let's observe the Headers. The User-Agent of the Request sent by Scrapy is Scrapy/1.4.0 (+http://scrapy.org), which is actually set by Scrapy's built-in UserAgentMiddleware. The source code of UserAgentMiddleware is as follows:

from scrapy import signals

class UserAgentMiddleware(object):
    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)

In the from_crawler() method, it first tries to obtain USER_AGENT from the settings and then passes it to the __init__() method as the user_agent parameter for initialization. If no USER_AGENT is passed, it defaults to the string 'Scrapy'. Our newly created project does not set USER_AGENT, so the user_agent variable here is 'Scrapy'. Then, in the process_request() method, this variable is set as the User-Agent header of the request; this is how the User-Agent gets set. In short, the User-Agent is set by the process_request() method of this Downloader Middleware.

There are two ways to modify the User-Agent of requests: one is to modify the USER_AGENT variable in the settings; the other is to modify it in the process_request() method of a Downloader Middleware.

The first method is very simple; we only need to add a USER_AGENT definition in settings.py:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'

This method is generally recommended. But if you want more flexibility, such as a random User-Agent, you need a Downloader Middleware. So next we'll use Downloader Middleware to implement a random User-Agent.

Add a RandomUserAgentMiddleware class in middlewares.py, as shown below:

import random

class RandomUserAgentMiddleware:
    def __init__(self):
        # candidate User-Agent strings to choose from
        self.user_agents = [
            'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2',
            'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1',
        ]

    def process_request(self, request, spider):
        # pick a random User-Agent for every outgoing Request
        request.headers['User-Agent'] = random.choice(self.user_agents)

We first define three different User-Agents as a list in the class's __init__() method. Then we implement the process_request() method, which has a request parameter whose attributes we can modify directly: here we set the User-Agent in the request's headers to one randomly chosen from the list. With that, a Downloader Middleware is written.

However, for it to take effect, we need to enable this Downloader Middleware. In settings.py, uncomment DOWNLOADER_MIDDLEWARES and set it as follows:

DOWNLOADER_MIDDLEWARES = {
    'scrapydownloadertest.middlewares.RandomUserAgentMiddleware': 543,
}

Next, when we re-run the Spider, we can see that the User-Agent has been successfully changed to one of the random User-Agents defined in the list:

{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate,br",
    "Accept-Language": "en",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)"
  },
  "origin": "60.207.237.85",
  "url": "http://httpbin.org/get"
}

We have successfully set a random User-Agent by implementing a Downloader Middleware with the process_request() method.

In addition, Downloader Middleware also has a process_response() method. After the Downloader downloads the Request, it obtains a Response, which the Scrapy engine sends back to the Spider for processing. Before the Response is sent to the Spider, we can also process it with the process_response() method. For example, to modify the status code of the Response, add the following code to RandomUserAgentMiddleware:

def process_response(self, request, response, spider):
    # overwrite the status code before the Response reaches the Spider
    response.status = 201
    return response

We change the status attribute of the Response to 201 and return it; the modified Response is then sent to the Spider.

We then output the modified status code in the Spider by adding the following statement to the parse() method:

self.logger.debug('Status Code: ' + str(response.status))

After re-running, the console outputs the following:

[httpbin] DEBUG: Status Code: 201

It can be seen that the status code of the Response was successfully modified. So if you want to process Responses, use the process_response() method.

There is also a process_exception() method for handling exceptions. If we need exception handling, we can implement this method. However, it is used relatively infrequently, so it is not demonstrated in this project.

Code for this section

The source code of this section is:
https://github.com/Python3WebSpider/ScrapyDownloaderTest

Conclusion

This lesson explained the basic usage of Spider Middleware and Downloader Middleware. With them, we can flexibly control the processing logic of a crawler, so they are essential to master.
