Integrating Selenium into Scrapy

After the crawler starts, open a single Chrome browser and reuse it for all subsequent requests.

1. Create a browser object in the spider:

bro = webdriver.Chrome(executable_path='/Users/liuqingzheng/Desktop/crawl/cnblogs_crawl/cnblogs_crawl/chromedriver')

2. Define a downloader middleware class in middlewares.py:

import time

from selenium.common.exceptions import TimeoutException
from scrapy.http import HtmlResponse  # return the JS-rendered source directly; the request never reaches the downloader

class JSPageMiddleware(object):
    # fetch dynamic pages through Chrome
    def process_request(self, request, spider):
        if spider.name == "JobBole":
            try:
                spider.browser.get(request.url)
            except TimeoutException:
                print('Page load timed out after 30 seconds; stop loading this page')
                spider.browser.execute_script('window.stop()')
            time.sleep(3)
            print("Visiting: {0}".format(request.url))

            # the body defaults to unicode, so set the encoding explicitly
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source, encoding="utf-8", request=request)
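For the middleware to run at all, it must also be enabled in settings.py. A minimal sketch; the module path `cnblogs_crawl.middlewares` is an assumption based on the project layout visible in the chromedriver path above, so adjust it to your own project's package name:

```python
# settings.py -- enable the custom downloader middleware
# (the priority 543 is Scrapy's conventional slot for user middlewares;
# any value that orders it correctly against the built-ins works)
DOWNLOADER_MIDDLEWARES = {
    'cnblogs_crawl.middlewares.JSPageMiddleware': 543,
}
```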

Code in the spider, where the browser the downloader middleware uses is created:

import scrapy
from scrapy import signals
from pydispatch import dispatcher
from selenium import webdriver

class JobboleSpider(scrapy.Spider):
    name = "JobBole"
    allowed_domains = ["jobbole.com"]
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def __init__(self):
        # keep Chrome on the spider itself, so a new window is not opened for every URL
        self.browser = webdriver.Chrome(executable_path='E:/chromedriver.exe')
        self.browser.set_page_load_timeout(30)
        super(JobboleSpider, self).__init__()
        dispatcher.connect(self.spider_close, signals.spider_closed)

    def spider_close(self, spider):
        # close Chrome when the spider exits
        print("spider closed")
        self.browser.quit()
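The overall lifecycle can be sketched without Scrapy or Selenium installed. The classes below are hypothetical stand-ins that only mirror the control flow described above: one browser owned by the spider for its whole life, every matching request routed through that browser, and the browser quit exactly once when the spider closes.

```python
class FakeBrowser:
    """Stand-in for a Selenium WebDriver (hypothetical, for illustration only)."""
    def __init__(self):
        self.visited = []
        self.closed = False
    def get(self, url):
        # record the visit and pretend the page was rendered by the browser
        self.visited.append(url)
        self.current_url = url
        self.page_source = "<html>rendered {0}</html>".format(url)
    def quit(self):
        self.closed = True

class FakeSpider:
    """Stand-in for JobboleSpider: one browser per spider, not per URL."""
    name = "JobBole"
    def __init__(self):
        self.browser = FakeBrowser()
    def spider_close(self):
        # mirrors the spider_closed signal handler
        self.browser.quit()

def process_request(request_url, spider):
    """Mirrors JSPageMiddleware.process_request: route the request
    through the spider's browser and short-circuit the downloader."""
    if spider.name == "JobBole":
        spider.browser.get(request_url)
        return spider.browser.page_source
    return None  # fall through to the normal downloader

spider = FakeSpider()
html = process_request("http://blog.jobbole.com/all-posts/", spider)
spider.spider_close()
```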

These two places are the main changes needed to integrate Selenium into Scrapy.

Note that embedding Chrome through Selenium this way is synchronous and blocking, so crawl efficiency suffers.


Origin www.cnblogs.com/kai-/p/12687160.html