From Getting Started with Scrapy to Giving Up 04: Downloader Middleware, Making Crawlers More Robust

Foreword

MiddleWare is, as the name suggests, middleware: it mainly processes requests (for example, adding a proxy IP or request headers) and responses.

This article introduces what the downloader middleware is, how to use the built-in middleware, and how to write your own.

Middleware Categories

Let's start with the familiar architecture diagram.

Scrapy Architecture

As the figure shows, middleware falls into two main categories:

  1. Downloader MiddleWare: Downloader middleware
  2. Spider MiddleWare: Spider middleware

This article focuses on the downloader middleware. First, look at the official definition:

The downloader middleware is a framework of hooks into Scrapy's request/response processing. It is a light, low-level system for globally altering Scrapy's requests and responses.

Purpose

As the architecture diagram shows, the downloader middleware sits between the engine and the downloader. When the engine sends a not-yet-downloaded request to the downloader, it passes through the downloader middleware. Here the request can be dressed up, for example by modifying the request headers (setting the UA, cookies, etc.) or adding a proxy IP.

Likewise, when the downloader sends the website's response back to the engine, it also passes through the downloader middleware, where we can process the response content.

Built-in downloader middleware

Scrapy has many built-in downloader middleware for developers to use. When we start a Scrapy crawler, Scrapy enables these middleware for us automatically. As shown in the picture:

built-in middleware

The picture shows the log output printed to the console when a Scrapy program starts. We can see that Scrapy has enabled quite a few downloader middleware and Spider middleware for us.

Now let's take a look at how these built-in middleware work.

RetryMiddleware

In fact, these built-in middleware work together with the configuration in settings. Take RetryMiddleware as an example. Its main function: when a request fails, the retry policy can be enabled and the number of retries controlled through the RETRY_ENABLED and RETRY_TIMES settings. That's all there is to it!
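For reference, a minimal sketch of those settings (the values shown are illustrative, not recommendations):

# settings.py
RETRY_ENABLED = True      # turn the retry policy on (it is on by default)
RETRY_TIMES = 3           # retry each failed request up to 3 times (the default is 2)
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]  # response codes treated as failures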

Then the next question: with so many middleware, where can you find which settings belong to which middleware?

Here I have two methods:

  1. Check the official documentation (there is a link in the previous article)
  2. Read the source code comments; each middleware has a corresponding .py file under the scrapy package

RetryMiddleware

The comments there are written clearly, and the settings the code reads are obvious at a glance.
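If you want to jump to the source yourself, a quick sketch (assuming a standard pip install of Scrapy):

# locate retry.py inside the installed scrapy package and read its docstring
from scrapy.downloadermiddlewares import retry

print(retry.__file__)   # filesystem path of the middleware's source file
print(retry.__doc__)    # the module docstring describing its settings and behavior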

Custom Middleware

Sometimes the built-in middleware cannot meet our needs, so we have to rely on ourselves and write custom middleware. All custom middleware are defined in middlewares.py.

Opening middlewares.py, we find that a downloader middleware and a Spider middleware have already been generated for us.

First, look at the auto-generated downloader middleware template:

Downloader middleware

You can see that there are five main methods:

  1. from_crawler : class method used to initialize the middleware
  2. process_request : called for every request that passes through the downloader middleware, corresponding to step 4 in the architecture diagram
  3. process_response : processes the response content returned by the downloader, corresponding to step 7 in the architecture diagram
  4. process_exception : called when the downloader or process_request raises an exception
  5. spider_opened : a built-in signal callback method; you can safely ignore it here

Here we mainly focus on methods 2 and 3, and take a quick look at method 4 along the way.

process_request()

This method has two parameters:

  1. request: the request initiated by the spider, to be processed
  2. spider: the spider corresponding to the request; this object will be covered in detail in a later article
def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

The main point here is to read the comments, which tell us one thing: this method must return a value.

  1. None : the return value you will use most of the time. It means the request continues on to the next middleware for processing.
  2. Request : stop calling the remaining process_request methods and put the request back into the queue for rescheduling.
  3. Response : no other process_request methods are called; the response is returned directly, and the process_response chain is executed.

The remaining option is to raise IgnoreRequest. In practice the return value is almost always None; it is enough for now to know the others exist, and you can explore them yourself if interested.
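To make the return values concrete, here is a hypothetical middleware (the 'ads.example.com' check and 'use_mirror' meta key are made up for illustration):

from scrapy.exceptions import IgnoreRequest

class RequestFilterMiddleware:

    def process_request(self, request, spider):
        # raise IgnoreRequest: drop the request entirely
        if 'ads.example.com' in request.url:
            raise IgnoreRequest(f'blocked: {request.url}')
        # return a Request: stop the process_request chain and reschedule it
        if request.meta.get('use_mirror'):
            return request.replace(url=request.url.replace('example.com', 'mirror.example.com'))
        # return None: the common case, let the request continue on
        return None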

process_response()

This method has three parameters:

  1. request: the request corresponding to the response
  2. response: the response to be processed
  3. spider: the spider corresponding to the response
def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

Again, read the comments. There are two main return values:

  1. Response : the response content returned by the downloader continues to be processed by the process_response of each remaining middleware before reaching the spider.
  2. Request : stop calling the remaining process_response methods; the response never reaches the spider, and the request is put back into the queue for rescheduling.

It can also raise IgnoreRequest. The thing to remember here: in almost all cases, just return the response.
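As an illustration, a minimal sketch of when returning the request makes sense (hypothetical middleware; it assumes a 403 status means we were blocked):

class BanCheckMiddleware:

    def process_response(self, request, response, spider):
        if response.status == 403:
            spider.logger.warning('Blocked on %s, rescheduling', request.url)
            # returning the request stops the process_response chain and re-queues it
            return request.replace(dont_filter=True)
        # the normal case: pass the response along
        return response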

process_exception()

def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

This method is called when a download handler, or the process_request method of another middleware, raises an exception. It has three possible return values with the same meanings as above; in most cases, just return None.
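Still, for completeness, a small sketch (hypothetical; the proxy address is a placeholder) of returning a request here to retry through a different proxy:

from twisted.internet.error import ConnectionRefusedError, TimeoutError

class ProxyFailoverMiddleware:

    def process_exception(self, request, exception, spider):
        if isinstance(exception, (ConnectionRefusedError, TimeoutError)):
            request.meta['proxy'] = 'http://127.0.0.1:8888'  # placeholder, use a real proxy
            return request   # reschedule the request through the new proxy
        return None          # let the remaining middleware handle the exception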

Enabling and Disabling Middleware

A custom middleware sometimes duplicates the function of a built-in one, and the two may step on each other. In that case we can disable the built-in middleware in the configuration.

I personally prefer to write a custom User-Agent middleware, but Scrapy's built-in UserAgentMiddleware conflicts with it: if the built-in middleware runs later (i.e., has a larger priority number), its UA will override the custom one. Therefore, we need to turn the built-in UA middleware off.

The DOWNLOADER_MIDDLEWARES setting configures downloader middleware. Each key is a middleware's import path and each value is its execution priority: the smaller the number, the earlier it executes. A value of None disables the middleware.

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # disable the default UserAgent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # enable the custom middleware
    'ScrapyDemo.middlewares.VideospiderDownloaderMiddleware': 543,
}

In this way, the built-in UA middleware is disabled.

Call Priority

We also need to be clear that middleware calls are chained: a request passes through each middleware in turn according to its priority, and the response does the same on the way back.

flow chart

As mentioned above, each middleware has an execution priority, and the smaller the number, the earlier it executes. Suppose middleware 1 has priority 200 and middleware 2 has priority 300.

When the spider issues a request, the request is first processed by process_request of middleware 1, then by the same method of middleware 2. After passing through this method in every middleware, it finally reaches the downloader, which requests the website and returns the response content.

process_response is handled in reverse order: the response first reaches this method in middleware 2, then in middleware 1, and is finally returned to the spider for the developer to handle.
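To see the order for yourself, a small sketch with two do-nothing middleware (hypothetical names and project path) that only log when they are called:

class MiddlewareOne:

    def process_request(self, request, spider):
        spider.logger.info('M1 process_request')    # runs first (priority 200)

    def process_response(self, request, response, spider):
        spider.logger.info('M1 process_response')   # runs last on the way back
        return response

class MiddlewareTwo:

    def process_request(self, request, spider):
        spider.logger.info('M2 process_request')    # runs second (priority 300)

    def process_response(self, request, response, spider):
        spider.logger.info('M2 process_response')   # runs first on the way back
        return response

# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.MiddlewareOne': 200,
#     'myproject.middlewares.MiddlewareTwo': 300,
# }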

Practice

Here we write a custom downloader middleware that adds a User-Agent.

Custom middleware

Define a middleware in middlewares.py:

class CustomUserAgentMiddleWare(object):

    def process_request(self, request, spider):
        # overwrite the UA of every outgoing request
        request.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
        return None

    def process_response(self, request, response, spider):
        # print the UA that was actually sent, to verify the middleware worked
        print(request.headers['User-Agent'])
        return response

Enable the middleware

To keep things easy to see, instead of modifying the global configuration in settings.py, we use the spider's local custom_settings.

import scrapy

class DouLuoDaLuSpider(scrapy.Spider):
    name = 'DouLuoDaLu'
    allowed_domains = ['v.qq.com']
    start_urls = ['https://v.qq.com/detail/m/m441e3rjq9kwpsc.html']

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            # disable the default UserAgent middleware
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            # enable the custom middleware
            'ScrapyDemo.middlewares.CustomUserAgentMiddleWare': 400
        }
    }


    def parse(self, response):
        pass

Here the default UA middleware is disabled first, and then the custom UA middleware is enabled. I put a breakpoint on the last line and run in Debug mode to check whether the UA is set successfully.

Test Results

Start the program in Debug mode; for the first run, the custom UA middleware is commented out and therefore disabled.

disabled

As shown in the figure, the request's UA is Scrapy's default. Now remove the comment to enable the custom UA middleware and run the test again.

enable

As shown in the figure, the request's UA has become the one I set in the middleware.

Setting a Proxy IP

The proxy IP is likewise set in the process_request method.

The code is as follows:

request.meta["proxy"] = 'http://ip:port'
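Wrapped in a middleware, it might look like this (a sketch; 'http://127.0.0.1:8888' is a placeholder you must replace with a real proxy):

class ProxyMiddleware:

    def process_request(self, request, spider):
        # placeholder address; substitute a working proxy, ideally picked from a pool
        request.meta['proxy'] = 'http://127.0.0.1:8888'
        return None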

Epilogue

The main job of the downloader middleware is to dress up requests. Personally, I use custom downloader middleware to set the UA dynamically and to detect and replace proxy IPs in real time. For other scenarios, the built-in downloader middleware is basically sufficient.

Of course, you can develop Scrapy crawlers without ever learning about downloader middleware, but downloader middleware will make your crawler more robust.

I originally wanted to cover the downloader middleware and the Spider middleware in one article, but the topics are too scattered to lay out cleanly and too easy to mix up, so the Spider middleware is left for the next article. Looking forward to the next encounter.



A post-95 young programmer, writing about day-to-day practice from a beginner's perspective, step by step from 0 to 1, detailed and earnest. Articles are published on the public account [ Getting Started to Give Up Road ]; your follow is appreciated.

Thank you for every follow.
