Integrating selenium with scrapy to crawl dynamic web pages in a win7 environment

scrapy can crawl static pages on its own, but more and more website data is rendered dynamically with js. To crawl that data, the js-rendered pages have to be processed somehow; a simple approach is to integrate a js processing tool, and the author chooses selenium here.

For the installation of scrapy itself, see the author's other articles. The environment used here is win7 64-bit.

Installing selenium in a python environment is straightforward: a single pip command fetches the latest version automatically. The author installed selenium 3.0.2; see https://pypi.python.org/pypi/selenium/3.0.2 for details:

pip install selenium
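
To confirm the installation, the installed version can be printed from the command line (3.0.2 in the author's case):

python -c "import selenium; print(selenium.__version__)"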

After installing selenium, a driver for each browser must also be installed before selenium can be used. The author takes the mainstream browsers IE, Firefox, and chrome as examples; for the driver matching each browser version, see http://docs.seleniumhq.org/download/:

1. IE browser IEDriverServer

The selenium official website gives the IEDriverServer download link for win7 64-bit: http://selenium-release.storage.googleapis.com/2.53/IEDriverServer_x64_2.53.1.zip

After downloading and unzipping, you can use:

iedriver = r"D:\scrapy\selenium\driver\IEDriverServer.exe"  # raw string so the backslashes are not treated as escape sequences
driver = webdriver.Ie(iedriver)

2. chrome browser chromedriver

The download link is: http://chromedriver.storage.googleapis.com/2.25/chromedriver_win32.zip

After downloading and unzipping, you can use:

chromedriver = r"D:\scrapy\selenium\driver\chromedriver.exe"  # raw string, as above
driver = webdriver.Chrome(chromedriver)
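
Optionally, startup flags can be passed to chrome through ChromeOptions. A brief sketch (the flag shown is only an example):

options = webdriver.ChromeOptions()
options.add_argument('--disable-extensions')  # example flag; add whatever flags the crawl needs
driver = webdriver.Chrome(chromedriver, chrome_options=options)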

 3. Firefox browser geckodriver


The download link is: https://github.com/mozilla/geckodriver/releases

After downloading and unzipping, you can use:

firefoxdriver = r"D:\scrapy\selenium\driver\geckodriver.exe"
binary = FirefoxBinary(r"C:\Program Files (x86)\Mozilla Firefox\Firefox.exe")
driver = webdriver.Firefox(executable_path=firefoxdriver, firefox_binary=binary)

Note: the Firefox binary path must be specified explicitly; if it is not, the following error occurs:

WebDriverException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
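
Whichever driver is used, a quick smoke test confirms that selenium can drive the browser (a sketch; example.com is just a placeholder URL):

driver.get("http://www.example.com")
print(driver.title)  # e.g. "Example Domain"
driver.quit()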

----------------Dividing line----------------

That completes the selenium environment setup. Next, let's integrate selenium with scrapy.

Generally speaking, there are two ways to integrate selenium into the scrapy framework:

1. Create a downloader middleware and call selenium inside it to load pages dynamically. This approach is inflexible: every request is routed through selenium indiscriminately, which slows scrapy down considerably, so it is not recommended. A minimal sketch follows this list for reference.

2. Call selenium from within the scrapy spider itself, only for the pages that actually need dynamic loading. This is more flexible and allows site-specific handling; it is the approach the rest of this article describes.
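
For reference, here is a minimal sketch of the middleware approach (method 1). The class name and module path are hypothetical, and the sketch deliberately shows the weakness: every request, without exception, is rendered in the browser:

# middlewares.py -- hypothetical sketch of a selenium downloader middleware
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware(object):
    def __init__(self):
        # Same geckodriver path as above; adjust to your environment
        self.driver = webdriver.Firefox(executable_path=r"D:\scrapy\selenium\driver\geckodriver.exe")

    def process_request(self, request, spider):
        # Render every request in the browser -- this is what makes the approach slow
        self.driver.get(request.url)
        # Returning an HtmlResponse short-circuits scrapy's own downloader
        return HtmlResponse(self.driver.current_url, body=self.driver.page_source,
                            encoding='utf-8', request=request)

It would be enabled in settings.py ('myproject' is a placeholder):

DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SeleniumMiddleware': 543}

With that caveat noted, here is the recommended spider-level integration (method 2). First the imports: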

from scrapy.selector import Selector
# selenium dependencies
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

……


# Initialize the selenium driver when the spider starts
    def __init__(self):
        CrawlSpider.__init__(self)
        # firefox
        firefoxdriver = r"D:\scrapy\selenium\driver\geckodriver.exe"
        binary = FirefoxBinary(r"C:\Program Files (x86)\Mozilla Firefox\Firefox.exe")
        self.driver = webdriver.Firefox(executable_path=firefoxdriver, firefox_binary=binary)

        # Limit the page load time so slow pages do not block the spider
        self.driver.set_page_load_timeout(10)
        self.driver.maximize_window()

    def __del__(self):
        # quit() shuts down the browser and the driver process; close() would only close the window
        self.driver.quit()

……

# Page-specific processing
    def parse_item(self, response):
        print(response.url)
        try:
            self.driver.get(response.url)
        except TimeoutException:
            # Loading exceeded the 10-second limit; stop it via javascript
            # so the steps below can still run against whatever has loaded
            self.driver.execute_script('window.stop()')
        ……
        # The loaded page can be processed with scrapy's Selector, which is less intrusive;
        # selenium's own page handling (clicks, waits, etc.) is also available for advanced operations
        sel = Selector(text=self.driver.page_source)
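
From here, the rendered page can be queried exactly like a normal scrapy response. For example (the XPath is hypothetical; it depends on the target site):

# hypothetical extraction -- adjust the XPath to the target site
links = sel.xpath('//a/@href').extract()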

In this way, js-rendered dynamic data is loaded through selenium while scrapy's usual page-processing methods can still be used.

