Downloader Middleware functionality
Downloader Middleware is very powerful; with it you can:
- Modify the User-Agent
- Handle redirects
- Set proxies
- Retry failed requests
- Manage cookies and other settings
Downloader Middleware sits at two points in the overall architecture:
- Before a Request scheduled out of the Scheduler's queue is sent to the Downloader; that is, we can modify the Request before it is downloaded.
- After the downloaded Response is generated and before it is sent to the Spider; that is, we can modify the Response before the Spider parses it.
Scrapy's built-in Downloader Middleware
- Scrapy already ships with many Downloader Middlewares, such as the middlewares responsible for retrying failures and following redirects.
- They are defined in the DOWNLOADER_MIDDLEWARES_BASE variable.
- Note: the configuration below is global; do not change it directly. If you want to modify it, override it in the project's configuration!
# Default settings in python3.6/site-packages/scrapy/settings/default_settings.py
DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}
- The format is a dictionary whose values are priorities: the smaller the number, the closer the middleware is to the Engine and the earlier its process_request() is called.
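As an example of overriding this global configuration from a project, mapping a built-in middleware's class path to None in the project's DOWNLOADER_MIDDLEWARES setting disables the corresponding DOWNLOADER_MIDDLEWARES_BASE entry. The middleware chosen here is just an illustration:

```python
# In the project's settings.py: assigning None disables the
# corresponding entry inherited from DOWNLOADER_MIDDLEWARES_BASE.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```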
Custom Downloader Middleware
We can add our own custom Downloader Middleware by setting the DOWNLOADER_MIDDLEWARES variable in the project. A Downloader Middleware has three core methods:
1) process_request(request, spider)
This method is called for each request that passes through the downloader middleware. It must return one of the following: None, a Response object, a Request object, or raise an IgnoreRequest exception. The effects of these return values differ:
- None: Scrapy continues processing the request, executing the corresponding methods of the other middlewares until the appropriate download handler is called and the request is performed (its response downloaded).
- Response object: Scrapy does not call any other process_request() or process_exception() method, nor the corresponding download function; it returns that response. The process_response() methods of the installed middlewares are called for every returned response.
- Request object: Scrapy stops calling process_request() methods and reschedules the returned request. Once the newly returned request is performed, the middleware chain is called on the downloaded response as usual.
- raise an IgnoreRequest exception: the process_exception() methods of the installed downloader middlewares are called. If none of them handles the exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged.
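As a sketch of the None branch above, here is a minimal middleware whose process_request() sets a random User-Agent and then returns None so the request continues down the chain. The class name and USER_AGENTS list are illustrative assumptions, not part of Scrapy:

```python
import random

# Illustrative pool of browser User-Agent strings.
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
]

class RandomUserAgentMiddleware:
    """Sketch: rotate the User-Agent before the Downloader sees the request."""

    def process_request(self, request, spider):
        # Modify the outgoing request's headers.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        # Returning None lets Scrapy continue processing the request.
        return None
```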
2) process_response(request, response, spider)
process_response() must return one of three things: a Response object, a Request object, or raise an IgnoreRequest exception.
- If it returns a Response (which may be the same response passed in, or a new object), that response is passed on to the process_response() methods of the other middlewares in the chain.
- If it returns a Request object, the middleware chain stops and the returned request is rescheduled for download. This is handled like the request returned from process_request().
- If it raises an IgnoreRequest exception, the request's errback (Request.errback) is called.
- If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
Here is a simple example for the project above; we add the following code to the middleware:
def process_response(self, request, response, spider):
    response.status = 201
    return response
3) process_exception(request, exception, spider)
- Scrapy calls process_exception() when a download handler or a process_request() (of a downloader middleware) raises an exception (including an IgnoreRequest exception).
- process_exception() must return one of three things: None, a Response object, or a Request object.
- If it returns None, Scrapy continues handling the exception, calling the process_exception() methods of the other installed middlewares until all of them have been called, at which point the default exception handling kicks in.
- If it returns a Response object, the process_response() methods of the installed middleware chain are called; Scrapy does not call the process_exception() method of any other middleware.
- If it returns a Request object, the returned request is rescheduled for download; the process_exception() methods of the other middlewares are not called.
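A minimal sketch of the Request branch: on a timeout-style exception, the middleware reschedules the same request a limited number of times, and otherwise returns None so the remaining process_exception() methods run. The class name, retry limit, meta key, and the use of the built-in TimeoutError are assumptions for illustration; Scrapy's own RetryMiddleware handles more exception types:

```python
class RetryOnTimeoutMiddleware:
    """Sketch: reschedule a request after a timeout, up to a limit."""

    MAX_RETRIES = 2  # assumed limit, not a Scrapy default

    def process_exception(self, request, exception, spider):
        if isinstance(exception, TimeoutError):
            retries = request.meta.get('retry_times', 0)
            if retries < self.MAX_RETRIES:
                request.meta['retry_times'] = retries + 1
                # Returning the request reschedules it for download.
                return request
        # Returning None lets the other middlewares' process_exception()
        # methods run, and finally the default exception handling.
        return None
```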
Project Practice
1) Create a new project named douban; the command is as follows:
$ scrapy startproject douban
2) Create a Spider class named dbbook; the commands are as follows:
$ cd douban
$ scrapy genspider dbbook book.douban.com
Write the spider code:
# -*- coding: utf-8 -*-
import scrapy

class DbbookSpider(scrapy.Spider):
    name = 'dbbook'
    allowed_domains = ['book.douban.com']
    start_urls = ['https://book.douban.com/top250?start=0']

    def parse(self, response):
        # print("status:")
        pass
3) Run the spider and fix the error:
$ scrapy crawl dbbook
# The result is a 403 error (the server denied access).
Analysis: by default, the Scrapy framework sends requests with the User-Agent value Scrapy/1.5.0 (http://scrapy.org).
Solution: in the settings.py configuration file, set USER_AGENT or DEFAULT_REQUEST_HEADERS:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
# or
# ...
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
}
# ...
There is another solution: in the middlewares.py file, find the DoubanDownloaderMiddleware class and implement its process_request method as follows:
def process_request(self, request, spider):
    # Print the request headers
    print(request.headers)
    # Spoof a browser User-Agent
    request.headers['user-agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
    return None
4) Enable the Downloader Middleware
In the project's settings.py configuration file, set DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
'douban.middlewares.DoubanDownloaderMiddleware': 543,
}