After the crawler starts, it opens one Chrome browser and reuses that same browser for all subsequent page loads.
1. Create a bro (Chrome driver) object in the spider:
bro = webdriver.Chrome(executable_path='/Users/liuqingzheng/Desktop/crawl/cnblogs_crawl/cnblogs_crawl/chromedriver')
Define a class in middlewares.py:
from selenium.common.exceptions import TimeoutException
from scrapy.http import HtmlResponse


class JSPageMiddleware(object):
    """Hand the JS-rendered page source back to Scrapy instead of
    letting the downloader fetch the URL itself."""

    def process_request(self, request, spider):
        # Request dynamic pages through Chrome
        if spider.name == "JobBole":
            try:
                spider.browser.get(request.url)
            except TimeoutException:
                print('Timed out after 30 seconds; stop loading this page')
                spider.browser.execute_script('window.stop()')
            import time
            time.sleep(3)
            print("Visiting: {0}".format(request.url))
            # Returning an HtmlResponse skips the downloader entirely;
            # the default encoding is unicode, so set utf-8 explicitly
            return HtmlResponse(url=spider.browser.current_url,
                                body=spider.browser.page_source,
                                encoding="utf-8",
                                request=request)
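The middleware only takes effect once it is registered in settings.py. A minimal sketch, assuming the Scrapy project module is named cnblogs_crawl (adjust the dotted path and priority to your own project):

```python
# settings.py -- enable the Selenium middleware
# (the module name cnblogs_crawl is an assumption; use your project's path)
DOWNLOADER_MIDDLEWARES = {
    'cnblogs_crawl.middlewares.JSPageMiddleware': 543,
}
```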
Code in the spider (it creates the browser that the download middleware uses):
import scrapy
from scrapy import signals
from selenium import webdriver
# Older Scrapy versions: from scrapy.xlib.pydispatch import dispatcher
from pydispatch import dispatcher


class JobboleSpider(scrapy.Spider):
    name = "JobBole"
    allowed_domains = ["jobbole.com"]
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def __init__(self):
        # Keep Chrome on the spider so a new window does not open for every URL
        self.browser = webdriver.Chrome(executable_path='E:/chromedriver.exe')
        self.browser.set_page_load_timeout(30)
        super(JobboleSpider, self).__init__()
        dispatcher.connect(self.spider_close, signals.spider_closed)

    def spider_close(self, spider):
        # Close Chrome when the spider exits
        print("spider closed")
        self.browser.quit()
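The dispatcher.connect(self.spider_close, signals.spider_closed) call is plain publish/subscribe: a registry of callbacks keyed by signal name, fired when the event occurs. A stdlib-only sketch of that mechanism (the Dispatcher class and the "spider_closed" string here are illustrative stand-ins, not Scrapy's actual implementation):

```python
from collections import defaultdict

class Dispatcher:
    """Minimal publish/subscribe registry, illustrating what
    dispatcher.connect(...) does inside Scrapy."""
    def __init__(self):
        self._receivers = defaultdict(list)

    def connect(self, receiver, signal):
        # Register a callback for a named signal
        self._receivers[signal].append(receiver)

    def send(self, signal, **kwargs):
        # Fire every callback registered for this signal
        return [receiver(**kwargs) for receiver in self._receivers[signal]]

dispatcher = Dispatcher()
closed = []

def spider_close(spider):
    # Stand-in for self.browser.quit() in the real spider
    closed.append(spider)

dispatcher.connect(spider_close, "spider_closed")
dispatcher.send("spider_closed", spider="JobBole")
print(closed)  # ['JobBole']
```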
The spider's __init__ and the downloader middleware are the two main changes needed to integrate Selenium into Scrapy.
Embedding Chrome via Selenium in Scrapy this way is synchronous: each request blocks on the browser, so crawling efficiency is lower than with Scrapy's normal asynchronous downloader.
Origin www.cnblogs.com/kai-/p/12687160.html