Python Notes: Using Downloader Middleware in the Scrapy Crawler Framework

Downloader Middleware Functionality

Downloader Middleware is very powerful. Among other things, it lets you:

  • Modify the User-Agent
  • Handle redirects
  • Set proxies
  • Retry failed requests
  • Set Cookies and other request options

Downloader Middleware acts at the following two places in the overall Scrapy architecture:

  • After the Scheduler schedules a Request out of its queue and before it is sent to the Downloader; that is, we can modify a Request before it is downloaded.
  • After the Downloader generates a Response and before it is sent to the Spider; that is, we can modify a Response before the Spider parses it.

Scrapy's Built-in Downloader Middleware

  • Scrapy already provides many built-in Downloader Middleware components, for example middleware responsible for retrying failures and for automatic redirects.
  • They are defined in the DOWNLOADER_MIDDLEWARES_BASE variable.
  • Note: the configuration below is a global default. Do not change it here; if you need to modify it, do so in the project's own settings!
# Default configuration in python3.6/site-packages/scrapy/settings/default_settings.py

DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}
  • It is a dictionary whose values are priorities: the smaller the number, the closer the middleware is to the Engine, so its process_request() is called earlier (and its process_response() later).
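
If you want to disable one of these built-in components in a project, set its value to None in the project's DOWNLOADER_MIDDLEWARES setting; this overrides the corresponding base entry. For example, to switch off the built-in UserAgentMiddleware:

# In the project's settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}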

Custom Downloader Middleware

We can add our own custom Downloader Middleware through the project's DOWNLOADER_MIDDLEWARES setting. A Downloader Middleware has three core methods:

1) process_request(request, spider)

  • This method is called for every request that passes through the download middleware. It must either return one of three values — None, a Response object, or a Request object — or raise IgnoreRequest. Each return value has a different effect (a minimal sketch follows this list).

  • None: Scrapy continues processing the request, executing the corresponding methods of the other middleware until the appropriate download handler is called and the request is performed (and its response downloaded).

  • Response object: Scrapy will not call any other process_request() or process_exception() method, nor the corresponding download function; it returns that response instead. The process_response() methods of the installed middleware are still called for every returned response.

  • Request object: Scrapy stops calling the remaining process_request() methods and reschedules the returned request. Once the newly returned request is performed, the middleware chain is invoked as usual on the downloaded response.

  • Raising an IgnoreRequest exception: the process_exception() methods of the installed download middleware are called. If none of them handles the exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged.
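
As a minimal illustrative sketch (not from the original project; the class name and the User-Agent pool are my own), here is a middleware whose process_request() assigns a random User-Agent and then returns None so that processing continues normally:

import random

class RandomUserAgentMiddleware(object):
    # Hypothetical pool of User-Agent strings; substitute your own.
    USER_AGENTS = [
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
    ]

    def process_request(self, request, spider):
        # Overwrite the User-Agent header before the request is downloaded.
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        # Returning None lets the remaining middleware and the download proceed.
        return None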

2) process_response(request, response, spider)

  • process_response() must return one of three things: a Response object, a Request object, or raise an IgnoreRequest exception.

  • If it returns a Response (which may be the same response that was passed in, or a new object), that response is processed by the process_response() methods of the other middleware in the chain.

  • If it returns a Request object, the middleware chain halts and the returned request is rescheduled for download. This is handled the same way as a request returned from process_request().

  • If it raises an IgnoreRequest exception, the request's errback (Request.errback) is called.

  • If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

  • As a simple example, we continue with the project above and add the following method to its middleware:

    def process_response(self, request, response, spider):
        # Force the status code of every response to 201 (for demonstration).
        response.status = 201
        return response
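
With this in place, every response the spider receives reports status 201; you can confirm it by printing response.status inside the spider's parse() method.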
    

3) process_exception(request, exception, spider)

  • When a download handler or a process_request() method (of a download middleware) raises an exception (including an IgnoreRequest exception), Scrapy calls process_exception().

  • process_exception() should return one of three values: None, a Response object, or a Request object.

  • If it returns None, Scrapy continues processing the exception, calling the process_exception() methods of the other installed middleware until all of them have been called; after that, the default exception handling kicks in.

  • If it returns a Response object, the process_response() methods of the installed middleware chain are called, and Scrapy will not call any other middleware's process_exception() method.

  • If it returns a Request object, the returned request is rescheduled for download (a sketch follows this list).
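
As an illustrative sketch (the proxy address and the proxy_retried meta key are hypothetical, not from the original), here is a process_exception() that retries a failed request once through a local proxy and otherwise falls back to the default handling:

    def process_exception(self, request, exception, spider):
        # Log which request failed and why.
        spider.logger.info('Download failed for %s: %s', request.url, exception)
        if not request.meta.get('proxy_retried'):
            # Retry once through a hypothetical local proxy.
            request.meta['proxy'] = 'http://127.0.0.1:8888'
            request.meta['proxy_retried'] = True
            # dont_filter=True so the duplicate filter does not drop the retry.
            return request.replace(dont_filter=True)
        # Returning None lets other middleware and the default handling proceed.
        return None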

Project Practice

1) Create a new project named douban with the following command:

$ scrapy startproject douban

2) Create a Spider named dbbook with the following commands:

$ cd douban
$ scrapy genspider dbbook book.douban.com

Write the spider code:

# -*- coding: utf-8 -*-
import scrapy

class DbbookSpider(scrapy.Spider):
    name = 'dbbook'
    allowed_domains = ['book.douban.com']
    start_urls = ['https://book.douban.com/top250?start=0']

    def parse(self, response):
        # print("status:")
        pass

3) Run the spider and deal with the error:

$ scrapy crawl dbbook
# The crawl returns a 403 error (the server denied access).

Analysis: by default, the Scrapy framework sends its own User-Agent value, e.g. Scrapy/1.5.0 (+http://scrapy.org), which the server rejects.

Solution: set USER_AGENT or DEFAULT_REQUEST_HEADERS in the settings.py configuration file:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'

# or
# ...
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Language': 'en',
   'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
}
# ...

There is another solution:

In the middlewares.py file, find the DoubanDownloaderMiddleware class and implement its process_request method as follows:

def process_request(self, request, spider):
    # Print the request header information.
    print(request.headers)
    # Disguise the request as a browser by overriding the User-Agent.
    request.headers['user-agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
    return None

4) Enable the Downloader Middleware

In the project's settings.py configuration file, set DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.DoubanDownloaderMiddleware': 543,
}
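
With the middleware enabled, run the crawler again:

$ scrapy crawl dbbook

Assuming the site accepts the browser User-Agent (not shown in the original output), the 403 no longer occurs. A quick way to verify is to print the status inside the spider's parse() method:

    def parse(self, response):
        # Should now report 200 instead of a 403.
        print("status:", response.status)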


Origin blog.csdn.net/Tyro_java/article/details/103933163