Rewriting Scrapy's RetryMiddleware

While crawling you will inevitably run into errors such as timeouts or 404 responses. On top of that, if you work with a proxy IP pool, not every proxy is stable, so failed proxies need some handling, such as removing them from the pool, and requests that fail because of an unstable proxy need to be re-issued. This is where rewriting RetryMiddleware comes in, so we can add the behavior we want.
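For context, a proxy is normally attached to a Scrapy request through its meta dictionary under the 'proxy' key, which is also what the retry middleware later reads back. A minimal sketch (the proxy address is a placeholder; a real one would come from your proxy pool):

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://example.com']

    def start_requests(self):
        for url in self.start_urls:
            # 'proxy' in meta is what Scrapy's HttpProxyMiddleware uses and what
            # request.meta.get('proxy') returns in the middleware further below
            yield scrapy.Request(url, meta={'proxy': 'http://127.0.0.1:8000'})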

Understanding the RetryMiddleware source code
Rewriting RetryMiddleware

Part of the RetryMiddleware source code:

from twisted.internet import defer
from twisted.internet.error import (
    TimeoutError, DNSLookupError, ConnectionRefusedError, ConnectionDone,
    ConnectError, ConnectionLost, TCPTimedOutError)
from twisted.web.client import ResponseFailed
from scrapy.exceptions import NotConfigured
from scrapy.utils.response import response_status_message
from scrapy.core.downloader.handlers.http11 import TunnelError


class RetryMiddleware(object):

    # Retry when one of the following exceptions occurs
    EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError,
                           ConnectionRefusedError, ConnectionDone, ConnectError,
                           ConnectionLost, TCPTimedOutError, ResponseFailed,
                           IOError, TunnelError)

    def __init__(self, settings):
        '''
        Several options from settings.py are involved here:
        RETRY_ENABLED: enables this middleware, defaults to True
        RETRY_TIMES: number of retries, defaults to 2
        RETRY_HTTP_CODES: list of response status codes that trigger a retry,
                          defaults to [500, 503, 504, 400, 408]
        RETRY_PRIORITY_ADJUST: priority of the retried request relative to the
                               original request, defaults to -1
        '''
        if not settings.getbool('RETRY_ENABLED'):
            raise NotConfigured
        self.max_retry_times = settings.getint('RETRY_TIMES')
        self.retry_http_codes = set(int(x) for x in settings.getlist('RETRY_HTTP_CODES'))
        self.priority_adjust = settings.getint('RETRY_PRIORITY_ADJUST')

    def process_response(self, request, response, spider):
        # A dont_retry flag can be added to the request's meta to skip retrying
        if request.meta.get('dont_retry', False):
            return response

        # If the status code is in the retry list, call _retry to retry the request
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # Hook in your own logic here, e.g. dropping a dead proxy, logging, etc.
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        # Retry when one of the exceptions in EXCEPTIONS_TO_RETRY occurs
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            # Hook in your own logic here, e.g. dropping a dead proxy, logging, etc.
            return self._retry(request, exception, spider)
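For reference, these options live in the project's settings.py. A minimal example using the default values described in the docstring above (the values are illustrative, adjust them to your needs):

# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 2
RETRY_HTTP_CODES = [500, 503, 504, 400, 408]
RETRY_PRIORITY_ADJUST = -1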

The places where you can add your own logic are marked in the comments above; filling in your own code there covers most requirements. Details as follows:

import logging
import random
import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class MyRetryMiddleware(RetryMiddleware):
    logger = logging.getLogger(__name__)

    def delete_proxy(self, proxy):
        if proxy:
            # delete the proxy from the proxy pool; the actual call depends on
            # how your pool is implemented (e.g. an HTTP delete API)
            pass

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # drop the proxy that produced the bad response
            self.delete_proxy(request.meta.get('proxy', False))
            time.sleep(random.randint(3, 5))
            self.logger.warning('Unexpected response status, retrying...')
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            # drop the proxy that caused the connection error
            self.delete_proxy(request.meta.get('proxy', False))
            time.sleep(random.randint(3, 5))
            self.logger.warning('Connection error, retrying...')
            return self._retry(request, exception, spider)
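To make Scrapy use this class instead of the built-in one, register it in DOWNLOADER_MIDDLEWARES and disable the default RetryMiddleware. A sketch, assuming the class lives in middlewares.py inside a project package called myproject (adjust the module path to your project):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # disable the built-in retry middleware
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    # 'myproject.middlewares' is an assumed module path; 550 is the slot the
    # built-in retry middleware normally occupies
    'myproject.middlewares.MyRetryMiddleware': 550,
}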

The _retry method does roughly the following (a sketch follows the list):
1. It increments retry_times in request.meta by 1.
2. It compares retry_times with max_retry_times; if the former is less than or equal to the latter, it makes a copy of the original request with request.copy(), updates retry_times on the copy, and sets dont_filter to True so the duplicate URL is not filtered out.
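A simplified sketch of that logic; the real Scrapy implementation differs in details (for instance it also logs when a request is finally given up):

    def _retry(self, request, reason, spider):
        # step 1: increment the retry counter
        retries = request.meta.get('retry_times', 0) + 1
        # step 2: retry only while the counter stays within max_retry_times
        if retries <= self.max_retry_times:
            retryreq = request.copy()               # copy the original request
            retryreq.meta['retry_times'] = retries  # update the retry counter
            retryreq.dont_filter = True             # don't let the dupefilter drop it
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        # returning None gives up on the request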

Source: https://blog.csdn.net/qq_33854211/article/details/78535963
https://www.v2ex.com/t/532912

Origin: blog.csdn.net/xc_zhou/article/details/91482972