Scrapy: Downloader Middleware

Downloader middleware is a framework of hooks into Scrapy's request/response processing: a lightweight, low-level system for globally altering the requests and responses that Scrapy handles.

In the main processing flow, downloader middleware sits between the scheduler dispatching a request to the web and the downloaded response being returned to the spider. In other words, it hooks into Scrapy's request/response cycle, which is where we can modify requests and responses globally.

 

class scrapy.downloadermiddlewares.DownloaderMiddleware

process_request(request,spider)

This method is called for each request that passes through the downloader middleware. It must produce one of the following: return None, return a Response object, return a Request object, or raise IgnoreRequest. Each outcome has a different effect.

None: Scrapy continues processing the request; the process_request() methods of the other middlewares execute in turn until the appropriate download handler is called and the request is performed (its response downloaded).

Response object: Scrapy does not call the process_request() or process_exception() methods of any lower-priority middleware, nor the download handler; instead, the process_response() method of every installed downloader middleware is invoked in turn on the returned Response, which is then passed to the spider.

Request object: Scrapy stops calling the process_request() methods of lower-priority middlewares and puts the returned Request back on the scheduler queue; it is effectively a new Request. When that new Request is eventually executed, the corresponding process_request() methods are called again in order.

raise IgnoreRequest: the process_exception() methods of the installed downloader middlewares are called. If none of them handles the exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged.
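The return-value semantics above can be sketched with a toy engine in plain Python. This is not Scrapy's actual code: Request, Response, download(), and CacheMiddleware are stand-ins, and the spider argument is omitted from the simplified process_request() signature.

```python
class Request:
    def __init__(self, url):
        self.url = url

class Response:
    def __init__(self, url, status=200):
        self.url = url
        self.status = status

def download(request):
    # Stand-in for Scrapy's real download handler.
    return Response(request.url)

def run_request(request, middlewares):
    for mw in middlewares:
        result = mw.process_request(request)
        if isinstance(result, Response):
            # Short-circuit: remaining hooks and the download handler are
            # skipped; the response goes on to the process_response() chain.
            return result
        if isinstance(result, Request):
            # Treated as a brand-new request: the whole chain runs again.
            return run_request(result, middlewares)
        # None: fall through to the next middleware.
    return download(request)

class CacheMiddleware:
    """Returns a cached Response for known URLs, otherwise None."""
    cache = {'http://example.com/cached': Response('http://example.com/cached', 304)}

    def process_request(self, request):
        return self.cache.get(request.url)

resp = run_request(Request('http://example.com/cached'), [CacheMiddleware()])
# resp.status is 304: the cached response short-circuited the download
```

A request for any other URL falls through the middleware (which returns None) and reaches the download stand-in instead.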

process_response(request, response, spider)

process_response() must produce one of three results: return a Response object, return a Request object, or raise IgnoreRequest.

Response object: the process_response() methods of lower-priority downloader middlewares continue to be called, each further processing the Response object.

Request object: the process_response() methods of lower-priority downloader middlewares are no longer called; the returned Request is put back on the scheduler queue to wait for scheduling, which makes it equivalent to a new Request. It will then be processed by the process_request() methods in turn.

raise IgnoreRequest: the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
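A common use of the Request return path is retrying on a bad status code. The sketch below mimics the process_response() chain with plain-Python stand-ins (Request, Response, and RetryOn503 are illustrative, not Scrapy classes):

```python
class Request:
    def __init__(self, url):
        self.url = url

class Response:
    def __init__(self, url, status=200):
        self.url = url
        self.status = status

def run_response_chain(request, response, middlewares):
    # Walk process_response hooks; a returned Request stops the chain.
    for mw in middlewares:
        result = mw.process_response(request, response)
        if isinstance(result, Request):
            return result  # goes back to the scheduler as a new request
        response = result  # a Response keeps flowing down the chain
    return response

class RetryOn503:
    """Re-issues the request when the server answered 503."""
    def process_response(self, request, response):
        if response.status == 503:
            return Request(request.url)
        return response

out = run_response_chain(Request('http://example.com'),
                         Response('http://example.com', 503),
                         [RetryOn503()])
# out is a Request: the 503 response was swapped for a retry
```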

process_exception(request, exception, spider)

When a download handler or a process_request() method (of a downloader middleware) raises an exception (including IgnoreRequest), Scrapy calls process_exception().

process_exception() should return one of three things: None, a Response object, or a Request object.

None: the process_exception() methods of lower-priority downloader middlewares continue to be invoked in turn until every one of them has run.

Response object: the process_exception() methods of lower-priority downloader middlewares are no longer called; instead, the process_response() method of each downloader middleware is called in turn.

Request object: the process_exception() methods of lower-priority downloader middlewares are no longer called; the returned Request is put back on the scheduler queue to wait for scheduling, which makes it equivalent to a new Request. Afterwards it is processed by the process_request() methods in turn.
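The exception chain can be sketched the same way. Here a hypothetical TimeoutFallback middleware converts a timeout into a stub Response instead of letting the failure propagate (all names are plain-Python stand-ins, not Scrapy's):

```python
class Request:
    def __init__(self, url):
        self.url = url

class Response:
    def __init__(self, url, status=200):
        self.url = url
        self.status = status

def handle_exception(request, exception, middlewares):
    # Walk process_exception hooks in priority order, as the engine does.
    for mw in middlewares:
        result = mw.process_exception(request, exception)
        if result is not None:
            # A Response goes on to the process_response() chain; a Request
            # is rescheduled. Lower-priority hooks are skipped either way.
            return result
        # None: let the next middleware try.
    raise exception  # no middleware handled it

class TimeoutFallback:
    """Turns a timeout into a stub 504 Response instead of a failure."""
    def process_exception(self, request, exception):
        if isinstance(exception, TimeoutError):
            return Response(request.url, status=504)
        return None

fallback = handle_exception(Request('http://example.com'),
                            TimeoutError('timed out'),
                            [TimeoutFallback()])
# fallback.status is 504: the exception was absorbed into a stub response
```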

Examples:

1. Use process_request(request, spider) to attach a proxy to the request before it is downloaded:

import logging

class ProxyMiddleware(object):
    logger = logging.getLogger(__name__)

    def process_request(self, request, spider):
        self.logger.debug("Using Proxy")
        request.meta['proxy'] = 'http://127.0.0.1:9743'
        return None
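For any custom middleware to take effect, it must be enabled in the project's settings.py through DOWNLOADER_MIDDLEWARES. The module path and class name below are placeholders; adjust them to wherever your middleware actually lives:

```python
# settings.py -- 'myproject.middlewares.ProxyMiddleware' is a placeholder path.
DOWNLOADER_MIDDLEWARES = {
    # The number is the order: lower values run closer to the engine,
    # higher values closer to the downloader.
    'myproject.middlewares.ProxyMiddleware': 543,
}
```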

2. Use process_response(request, response, spider) to modify the status code:

import logging

class ResponseMiddleware(object):
    logger = logging.getLogger(__name__)

    def process_response(self, request, response, spider):
        self.logger.debug("modify response.status")
        response.status = 201
        return response

  Then print the status code in the spider:

from scrapy import Spider

class HttpbinSpider(Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        print('status code:', response.status)

 

3. Use process_exception(request, exception, spider) to attach a proxy and re-issue the request after an error:

First, request Google. The request cannot succeed; it times out after 10 s and is then retried:

import scrapy
from scrapy import Spider

class GoogleSpider(Spider):
    name = 'google'
    allowed_domains = ['www.google.com']
    start_urls = ['http://www.google.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, meta={'download_timeout': 10},
                                 callback=self.parse)

    def parse(self, response):
        print(response.text)

  If you do not want retries, you can disable the built-in retry middleware (RETRY_ENABLED = False); with it disabled, the crawl stops with an error after the exception.

If the retry middleware is disabled but you still want to recover from the error, give the proxy middleware a process_exception() method as below: when an exception occurs, it attaches the proxy to the request and returns the request, which causes the request to Google to be issued again, this time through the proxy.

import logging

class ProxyMiddleware(object):
    logger = logging.getLogger(__name__)

    def process_exception(self, request, exception, spider):
        self.logger.info("GET Exception")
        request.meta['proxy'] = 'http://127.0.0.1:9743'
        return request

 

4. Set cookies

Basic usage:

 

# cookies is a dict, e.g. {'name': 'value'}
def process_request(self, request, spider):
    request.cookies = cookies

 


Origin www.cnblogs.com/lanston1/p/11894856.html