How to catch and handle various exceptions in scrapy

Foreword
    When using Scrapy for large crawling tasks (crawls that run for days), no matter how good the host's network is, you will always find at the end that "item_scraped_count" in the Scrapy log does not equal the number of pre-seeded seeds: some of the seeds always fail to be crawled. The failures fall into the two types shown in the figure below (the figure is the log printed when the Scrapy crawl finishes):

Common exceptions in Scrapy include, but are not limited to: download errors (the blue area in the figure) and HTTP status codes such as 403/500 (the orange area).

Whatever the kind of exception, we can write our own middleware by referring to how Scrapy's built-in retry middleware is written.

Main text

    Open the project in an IDE and type the following line in any file of the Scrapy project:

from scrapy.downloadermiddlewares.retry import RetryMiddleware

Then hold down the Ctrl key and left-click RetryMiddleware to jump to the file where the middleware is defined. You can also find it by browsing the installation directly; the path is site-packages/scrapy/downloadermiddlewares/retry.py, class RetryMiddleware.
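If you are not using an IDE, you can also ask Python itself where the file lives, for example:

import scrapy.downloadermiddlewares.retry as retry_module
print(retry_module.__file__)  # prints the absolute path of retry.py in this environment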

The source code of the middleware is as follows:

import logging

from twisted.internet import defer
from twisted.internet.error import TimeoutError, DNSLookupError, \
    ConnectionRefusedError, ConnectionDone, ConnectError, \
    ConnectionLost, TCPTimedOutError
from twisted.web.client import ResponseFailed

from scrapy.exceptions import NotConfigured
from scrapy.utils.response import response_status_message
from scrapy.core.downloader.handlers.http11 import TunnelError
from scrapy.utils.python import global_object_name

logger = logging.getLogger(__name__)


class RetryMiddleware(object):
 
    # IOError is raised by the HttpCompression middleware when trying to
    # decompress an empty response
    EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError,
                           ConnectionRefusedError, ConnectionDone, ConnectError,
                           ConnectionLost, TCPTimedOutError, ResponseFailed,
                           IOError, TunnelError)
 
    def __init__(self, settings):
        if not settings.getbool('RETRY_ENABLED'):
            raise NotConfigured
        self.max_retry_times = settings.getint('RETRY_TIMES')
        self.retry_http_codes = set(int(x) for x in settings.getlist('RETRY_HTTP_CODES'))
        self.priority_adjust = settings.getint('RETRY_PRIORITY_ADJUST')
 
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)
 
    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response
 
    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            return self._retry(request, exception, spider)
 
    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1
 
        retry_times = self.max_retry_times
 
        if 'max_retry_times' in request.meta:
            retry_times = request.meta['max_retry_times']
 
        stats = spider.crawler.stats
        if retries <= retry_times:
            logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            retryreq.priority = request.priority + self.priority_adjust
 
            if isinstance(reason, Exception):
                reason = global_object_name(reason.__class__)
 
            stats.inc_value('retry/count')
            stats.inc_value('retry/reason_count/%s' % reason)
            return retryreq
        else:
            stats.inc_value('retry/max_reached')
            logger.debug("Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})

Looking at the source code, we can see that responses which do come back with an HTTP status code are handled by the process_response method. The logic is fairly simple: it checks whether response.status is in the self.retry_http_codes set. Tracing that back, the set is built from a list defined in default_settings.py as follows:

RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]

So the middleware first checks whether the HTTP code is in this set: if it is, the retry logic is entered; if it is not, the response is returned directly. This is how responses that return an HTTP code, just an abnormal one, are handled.
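If the built-in retry behaviour is almost enough and you only want to tune it, these same settings can be overridden in settings.py. A minimal sketch (the concrete values are arbitrary examples, not recommendations):

RETRY_ENABLED = True
RETRY_TIMES = 5  # retry each failed request up to 5 times (the default is 2)
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 403]  # e.g. also retry 403
RETRY_PRIORITY_ADJUST = -1  # retried requests get slightly lower priority (the default)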

However, the other kind of exception is handled differently. Strictly speaking, what we have just covered are errors reported through the HTTP response itself; the other kind occurs when the request fails outright (for example a timeout or DNS error), which raises a real exception in the code, as shown below (if left unhandled):

You can create a Scrapy project and put an invalid URL in start_urls to simulate this kind of exception. Conveniently, RetryMiddleware also provides a method for handling exactly this type of exception: process_exception.
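For example, a throwaway spider like the one below, whose only URL points at an unresolvable host, will raise a DNSLookupError during download (the spider name and domain are made up for illustration):

import scrapy

class BadUrlSpider(scrapy.Spider):
    # minimal spider that requests an unresolvable host to provoke a DNSLookupError
    name = 'bad_url'
    start_urls = ['http://this-domain-should-not-resolve.invalid/']

    def parse(self, response):
        # never reached: the request fails at the download stage
        self.logger.info('Got response from %s', response.url)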

Looking at the source code, the rough processing logic is: a tuple containing all the exception types is defined first, and the incoming exception is checked against it. If it is in the tuple (leaving the dont_retry flag aside), the retry logic is entered; if it is not, the exception is simply ignored.

OK, now that you understand how Scrapy catches exceptions, the general approach should be clear. A practical middleware template for exception handling is given below:

from twisted.internet import defer
from twisted.internet.error import TimeoutError, DNSLookupError, \
    ConnectionRefusedError, ConnectionDone, ConnectError, \
    ConnectionLost, TCPTimedOutError
from scrapy.http import HtmlResponse
from twisted.web.client import ResponseFailed
from scrapy.core.downloader.handlers.http11 import TunnelError
 
class ProcessAllExceptionMiddleware(object):
    ALL_EXCEPTIONS = (defer.TimeoutError, TimeoutError, DNSLookupError,
                      ConnectionRefusedError, ConnectionDone, ConnectError,
                      ConnectionLost, TCPTimedOutError, ResponseFailed,
                      IOError, TunnelError)

    def process_response(self, request, response, spider):
        # catch responses whose status code is 40x/50x
        if str(response.status).startswith('4') or str(response.status).startswith('5'):
            # wrap a bare response and return it; the spider handles it by checking url == ''
            response = HtmlResponse(url='')
            return response
        # leave other status codes untouched
        return response

    def process_exception(self, request, exception, spider):
        # catch almost every exception
        if isinstance(exception, self.ALL_EXCEPTIONS):
            # print the exception type in the log
            print('Got exception: %s' % (exception))
            # wrap an arbitrary response and return it to the spider
            response = HtmlResponse(url='exception')
            return response
        # print exceptions that were not caught
        print('not contained exception: %s' % exception)


Spider parsing code example:

import scrapy
# assuming the standard Scrapy project layout, with TESTItem defined in the project's items.py
from TESTSpider.items import TESTItem

class TESTSpider(scrapy.Spider):
    name = 'TEST'
    allowed_domains = ['TTTTT.com']
    start_urls = ['http://www.TTTTT.com/hypernym/?q=']
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'TESTSpider.middlewares.ProcessAllExceptionMiddleware': 120,
        },
        'DOWNLOAD_DELAY': 1,  # download delay in seconds
        'AUTOTHROTTLE_ENABLED': True,  # enable AutoThrottle
        'AUTOTHROTTLE_DEBUG': True,  # enable AutoThrottle debug output
        'AUTOTHROTTLE_MAX_DELAY': 10,  # maximum download delay
        'DOWNLOAD_TIMEOUT': 15,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 4  # limit concurrent requests to this domain
    }

    def parse(self, response):
        if not response.url:  # url == '': a 40x/50x response was replaced by the middleware
            print('500')
            yield TESTItem(key=response.meta['key'], _str=500, alias='')
        elif 'exception' in response.url:  # the placeholder response built for an exception
            print('exception')
            yield TESTItem(key=response.meta['key'], _str='EXCEPTION', alias='')
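Note that parse() reads response.meta['key'], so each request has to carry that key in its meta. One way to do this, sketched as an extra method inside the same spider (the seed list here is a made-up placeholder), is:

    def start_requests(self):
        # hypothetical seeds; in the real task these would be the pre-seeded query terms
        for key in ['apple', 'banana']:
            url = 'http://www.TTTTT.com/hypernym/?q=' + key
            yield scrapy.Request(url, callback=self.parse, meta={'key': key})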

Note: the order value of this middleware must not be too large. A larger value places it closer to the downloader (see the order of the default downloader middlewares in the Scrapy documentation), so it would get to process responses before RetryMiddleware does. But this middleware is meant as a fallback: when a 500 response enters the middleware chain, it should be handled by the retry middleware first, not by ours. Our middleware cannot retry, so if it saw the 500 first it would give up on the request straight away and return a fabricated response, which is unreasonable. Only requests that still fail after the retries are exhausted should reach our middleware, and at that point you can handle them however you like, for example retrying once more or returning a reconstructed response, as sketched below.
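As a sketch, assuming RetryMiddleware keeps its default order of 550 from DOWNLOADER_MIDDLEWARES_BASE, the registration in settings.py might look like this (the module path is the one used in the spider above):

# RetryMiddleware stays enabled at its default order of 550; giving our fallback
# middleware a smaller number keeps it closer to the engine, so it only sees
# responses and exceptions after RetryMiddleware has finished with them.
DOWNLOADER_MIDDLEWARES = {
    'TESTSpider.middlewares.ProcessAllExceptionMiddleware': 120,
}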

Let's verify how it works (testing with an invalid URL). The following figure shows the result with the middleware not enabled:

Then enable the middleware to see the effect:

Origin: blog.csdn.net/chaishen10000/article/details/103452164