Scrapy: AttributeError: 'generator' object has no attribute 'meta' / 'dont_filter'

Problem description: the first time I used a Scrapy downloader middleware, I overrode the process_exception method so that failed requests would be retried over and over, but it raised the following error:

2018-12-26 20:50:57 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method RefererMiddleware.request_scheduled of <scrapy.spidermiddlewares.referer.RefererMiddleware object at 0x00000000050EC080>>
Traceback (most recent call last):
  File "e:\anaconda3\lib\site-packages\scrapy\utils\signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "e:\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "e:\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 343, in request_scheduled
    redirected_urls = request.meta.get('redirect_urls', [])
AttributeError: 'generator' object has no attribute 'meta'
Unhandled Error
Traceback (most recent call last):
  File "e:\anaconda3\lib\site-packages\scrapy\commands\crawl.py", line 58, in run
    self.crawler_process.start()
  File "e:\anaconda3\lib\site-packages\scrapy\crawler.py", line 291, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "e:\anaconda3\lib\site-packages\twisted\internet\base.py", line 1267, in run
    self.mainLoop()
  File "e:\anaconda3\lib\site-packages\twisted\internet\base.py", line 1276, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "e:\anaconda3\lib\site-packages\twisted\internet\base.py", line 902, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "e:\anaconda3\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "e:\anaconda3\lib\site-packages\scrapy\core\engine.py", line 135, in _next_request
    self.crawl(request, spider)
  File "e:\anaconda3\lib\site-packages\scrapy\core\engine.py", line 210, in crawl
    self.schedule(request, spider)
  File "e:\anaconda3\lib\site-packages\scrapy\core\engine.py", line 216, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "e:\anaconda3\lib\site-packages\scrapy\core\scheduler.py", line 54, in enqueue_request
    if not request.dont_filter and self.df.request_seen(request):
builtins.AttributeError: 'generator' object has no attribute 'dont_filter'

The code is as follows:

middlewares.py

import logging


class ProxyMiddleware(object):
    logger = logging.getLogger(__name__)

    # Scrapy calls this hook as process_exception(request, exception, spider);
    # the third argument is the exception that was raised, not a response.
    # Returning the request re-schedules it, i.e. retries it.
    def process_exception(self, request, exception, spider):
        self.logger.debug("Get GoogleMiddleware")
        return request
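
As an aside, returning the request unconditionally from process_exception retries it forever. A common pattern is to cap retries with a counter in request.meta; the following is a minimal sketch, where the middleware name, the max_retries cap, and the retry_times meta key are illustrative choices of mine, not part of the original post:

import logging


class CappedRetryMiddleware(object):
    logger = logging.getLogger(__name__)
    max_retries = 3  # illustrative cap, not from the original code

    def process_exception(self, request, exception, spider):
        retries = request.meta.get('retry_times', 0)
        if retries >= self.max_retries:
            return None  # give up and let Scrapy's default error handling run
        self.logger.debug("Retrying %s (attempt %d): %r",
                          request.url, retries + 1, exception)
        # dont_filter=True lets the retried copy bypass the duplicate filter.
        new_request = request.replace(dont_filter=True)
        new_request.meta['retry_times'] = retries + 1
        return new_request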

google.py

# -*- coding: utf-8 -*-
import scrapy


class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['www.google.com']
    start_urls = ['http://www.google.com/']

    # BUG: yield turns this method into a generator function, so Scrapy
    # receives a generator object here instead of a single Request.
    def make_requests_from_url(self, url):
        yield scrapy.Request(url=url, meta={'download_timeout': 10}, callback=self.parse)

    def parse(self, response):
        print(response.text)

settings.py

DOWNLOADER_MIDDLEWARES = {
    'Httpbinorg.middlewares.ProxyMiddleware': 543,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}

Solution: I had not found the root cause at the time; a Baidu search gave no direct answer, but it did turn up a workaround that, to my surprise, worked.

Change the yield in make_requests_from_url in the spider file to return.

Why this works: make_requests_from_url is supposed to return a single Request object. Writing yield in its body turns it into a generator function, so calling it returns a generator object; that generator, rather than a Request, is what gets handed to the scheduler, which fails as soon as it touches request.dont_filter (just as RefererMiddleware fails on request.meta).
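
For context, in Scrapy 1.x start_requests simply yields whatever make_requests_from_url returns; simplified from scrapy/spiders/__init__.py (the real method is longer), it behaves roughly like this:

def start_requests(self):
    for url in self.start_urls:
        # If make_requests_from_url is a generator function, this yields a
        # generator object instead of a Request; the scheduler then fails
        # the moment it touches request.dont_filter or request.meta.
        yield self.make_requests_from_url(url)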

The fixed code:

# -*- coding: utf-8 -*-
import scrapy


class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['www.google.com']
    start_urls = ['http://www.google.com/']

    # Return a single Request object, which is what Scrapy expects from this method.
    def make_requests_from_url(self, url):
        return scrapy.Request(url=url, meta={'download_timeout': 10}, callback=self.parse)

    def parse(self, response):
        print(response.text)
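
Note that make_requests_from_url has been deprecated since Scrapy 1.4, so a more future-proof fix is to override start_requests directly; yield is correct there because that method is expected to return an iterable of requests. A minimal sketch of the same spider written that way:

# -*- coding: utf-8 -*-
import scrapy


class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['www.google.com']
    start_urls = ['http://www.google.com/']

    def start_requests(self):
        # start_requests may be a generator: Scrapy iterates over its result.
        for url in self.start_urls:
            yield scrapy.Request(url=url, meta={'download_timeout': 10},
                                 callback=self.parse)

    def parse(self, response):
        print(response.text)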

Run it again, and the program finishes normally.

Reposted from blog.csdn.net/jss19940414/article/details/85267727