Scrapy can crawl static pages, but more and more website content is rendered dynamically with JavaScript. To crawl that data, the JavaScript-rendered pages must be processed as well. A simple approach is to integrate a JavaScript-capable tool; here the author chooses Selenium.
For installing Scrapy itself, see the author's other articles. The environment used here is Windows 7, 64-bit.
Installing Selenium in a Python environment is straightforward: a single command fetches the latest version automatically. The author installed Selenium 3.0.2; see https://pypi.python.org/pypi/selenium/3.0.2 for details:
pip install selenium
After installing Selenium, you also need to install a driver for each browser before Selenium can be used normally. The author takes the mainstream browsers IE, Firefox, and Chrome as examples; for the driver version matching each browser, see http://docs.seleniumhq.org/download/ :
1. IE browser IEDriverServer
The selenium official website gives the IEDriverServer download link for win7 64-bit: http://selenium-release.storage.googleapis.com/2.53/IEDriverServer_x64_2.53.1.zip
After downloading and unzipping, you can use:
from selenium import webdriver

# Use a raw string so the backslashes are not treated as escape sequences
iedriver = r"D:\scrapy\selenium\driver\IEDriverServer.exe"
driver = webdriver.Ie(iedriver)
2. Chrome browser chromedriver
The download link is: http://chromedriver.storage.googleapis.com/2.25/chromedriver_win32.zip
After downloading and unzipping, you can use:
from selenium import webdriver

chromedriver = r"D:\scrapy\selenium\driver\chromedriver.exe"
driver = webdriver.Chrome(chromedriver)
3. Firefox browser geckodriver
The download link is: https://github.com/mozilla/geckodriver/releases
After downloading and unzipping, you can use:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

firefoxdriver = r"D:\scrapy\selenium\driver\geckodriver.exe"
binary = FirefoxBinary(r"C:\Program Files (x86)\Mozilla Firefox\Firefox.exe")
driver = webdriver.Firefox(executable_path=firefoxdriver, firefox_binary=binary)
Note: the Firefox browser path must be specified explicitly. If it is not, the following error occurs:
WebDriverException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
----------------------------------------
That completes the Selenium environment setup. Now let's integrate Selenium with Scrapy.
Generally speaking, there are two ways to integrate Selenium into the Scrapy framework:
1. Create a downloader middleware and call Selenium inside it for dynamic loading. However, this approach is inflexible: every request is loaded through Selenium indiscriminately, which slows Scrapy down considerably, so it is not recommended.
2. Call Selenium from within the Scrapy spider, only for the pages that actually need dynamic loading. This approach is more flexible and also allows site-specific handling. It is the method described below:
from scrapy.selector import Selector
# Add the Selenium dependency declarations
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

……

    # Initialize the Selenium driver
    def __init__(self):
        CrawlSpider.__init__(self)
        # Firefox
        firefoxdriver = r"D:\scrapy\selenium\driver\geckodriver.exe"
        binary = FirefoxBinary(r"C:\Program Files (x86)\Mozilla Firefox\Firefox.exe")
        self.driver = webdriver.Firefox(executable_path=firefoxdriver, firefox_binary=binary)
        # Set a page-load time limit
        self.driver.set_page_load_timeout(10)
        self.driver.maximize_window()

    def __del__(self):
        self.driver.close()

……

    # Page-specific processing
    def parse_item(self, response):
        print(response.url)
        try:
            self.driver.get(response.url)
        except TimeoutException:
            # print 'time out after 10 seconds when loading page'
            # When loading exceeds the time limit, stop it via JavaScript
            # so the subsequent actions can still run
            self.driver.execute_script('window.stop()')
        ……
        # You can process the loaded page with Scrapy's Selector, which is less
        # intrusive; you can also perform advanced interactions through Selenium,
        # or use Selenium's own page handling directly
        sel = Selector(text=self.driver.page_source)
In this way, JavaScript-rendered data can be loaded through Selenium while still using Scrapy's page-processing machinery.