Integrate Selenium into the Scrapy framework

A first thought is to drive Selenium from the process_request method of a downloader middleware, as in the following code.

  middleware.py

from selenium import webdriver
from scrapy.http import HtmlResponse

class TestMiddleware(object):
    def __init__(self):
        # one browser instance is created when the middleware is instantiated
        self.driver = webdriver.Chrome()
        super().__init__()

    def process_request(self, request, spider):
        # render the requested page in the browser and hand the HTML back to Scrapy
        self.driver.get(request.url)
        return HtmlResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')

  But there is a problem with this: the browser that the middleware opens is never closed, because nothing calls driver.quit() when the crawl finishes.
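  One way to fix this inside the middleware itself (a sketch, not part of the original article's approach) is to connect the spider_closed signal in the middleware's from_crawler hook and quit the driver there:

from scrapy import signals
from selenium import webdriver

class TestMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome()

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # quit the browser once the spider has finished
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        self.driver.quit()

    # process_request stays the same as above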

Second, consider putting the driver in the spider instead.

  The benefits are the following:

    1. Not every spider needs Selenium to download its pages, so each spider can decide for itself whether to create a driver.

    2. When multiple spiders are running, each spider opening its own Selenium instance is equivalent to one browser process per spider, which can then be shut down with that spider.

  It looks like this.

  At present, the officially recommended way to bind a signal is through the crawler, via the from_crawler class method; the spider below uses the older dispatcher.connect style, and a from_crawler sketch follows it.

  spider.py

import scrapy
from scrapy import signals
from selenium import webdriver
from pydispatch import dispatcher

class YunqiSpider(scrapy.Spider):
    name = 'yunqi'

    def __init__(self):
        self.driver = webdriver.Chrome()
        super().__init__()
        # quit the browser when the spider closes
        dispatcher.connect(self.close_spider, signal=signals.spider_closed)

    def close_spider(self, spider):
        self.driver.quit()
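
  A sketch of the from_crawler variant mentioned above (the spider name and handler are illustrative):

import scrapy
from scrapy import signals
from selenium import webdriver

class YunqiSpider(scrapy.Spider):
    name = 'yunqi'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # bind the handler through the crawler's signal manager
        crawler.signals.connect(spider.close_spider, signal=signals.spider_closed)
        return spider

    def close_spider(self, spider):
        self.driver.quit()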

  middleware.py

from scrapy.http import HtmlResponse

class TestMiddleware(object):

    def process_request(self, request, spider):
        # use the driver owned by the spider, so the middleware holds no browser of its own
        spider.driver.get(request.url)
        return HtmlResponse(url=spider.driver.current_url, body=spider.driver.page_source, encoding='utf-8')
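
  Either way, the downloader middleware still has to be enabled in settings.py; a minimal sketch, assuming a project named yunqi with the class in yunqi/middlewares.py (the path is illustrative):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'yunqi.middlewares.TestMiddleware': 543,
}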

 
