Crawling data from dynamic web pages
What is AJAX:
AJAX stands for Asynchronous JavaScript And XML. By exchanging small amounts of data with the server in the background, Ajax lets a page update asynchronously. This means that parts of a page can be updated without reloading the entire page. A traditional web page (one that does not use Ajax) must be reloaded in full whenever its content needs to change. The name comes from the fact that the data was originally transmitted in XML syntax; in practice, almost all such exchanges now use JSON. Data loaded through AJAX is rendered into the browser by JS, so it does not appear under Right-click -> View page source. There you can only see the HTML that was loaded for that URL.
Obtaining AJAX data:
- Analyze the interface that the AJAX call uses directly, then request that interface from code.
- Use Selenium + chromedriver to simulate browser behavior and obtain the data.
Approach | Advantages | Disadvantages |
---|---|---|
Analyzing the interface | The data can be requested directly, with no page parsing needed. Less code, high performance. | Analyzing the interface can be complex, especially when it is obfuscated with JS, which requires some JS skill. Easily detected as a crawler. |
Selenium | Directly simulates browser behavior: whatever the browser can request, Selenium can request too. The crawler is more stable. | More code. Low performance. |
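As a rough illustration of the first approach, here is a minimal sketch that requests a JSON interface directly using only the standard library. The function name `fetch_ajax_json`, the `opener` parameter, and the User-Agent value are assumptions made for this example; the real URL and parameters have to be read from the browser's network panel for the site in question.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_ajax_json(url, params=None, headers=None, opener=urlopen):
    """Request an AJAX interface directly and return the decoded JSON.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; by default it is urllib's urlopen.
    """
    if params:
        # Append the query string that the page's own JS would send
        url = f"{url}?{urlencode(params)}"
    # Many servers reject requests without a browser-like User-Agent
    request = Request(url, headers=headers or {"User-Agent": "Mozilla/5.0"})
    with opener(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would look like `fetch_ajax_json("https://example.com/api/list", {"page": 2})`, where the URL is a placeholder. No HTML parsing is needed because the response is already structured data.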
Obtaining dynamic data with Selenium + chromedriver:
Selenium is the equivalent of a robot. It can simulate human actions in a browser and automate operations such as clicking, filling in data, and deleting cookies. chromedriver is a driver for the Chrome browser; Selenium uses it to control the browser. Each browser has its own driver. The browsers and their corresponding drivers are listed below:
- Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads
- Firefox: https://github.com/mozilla/geckodriver/releases
- Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
- Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Installing Selenium and chromedriver:
- Installing Selenium: Selenium has bindings for many languages, including Java, Ruby, and Python. Here we install the Python version: pip install selenium
- Installing chromedriver: once downloaded, place it in a directory whose path contains only English characters and that does not require special permissions.
Getting Started:
The following simple example fetches the Baidu home page to show how to get started quickly with Selenium and chromedriver:

    from selenium import webdriver

    # absolute path of chromedriver
    driver_path = r'D:\ProgramApp\chromedriver\chromedriver.exe'

    # initialize a driver and specify the chromedriver path
    # (executable_path is the Selenium 3 style; newer Selenium 4 releases take a Service object instead)
    driver = webdriver.Chrome(executable_path=driver_path)

    # request a web page
    driver.get("https://www.baidu.com/")

    # get the page source through page_source
    print(driver.page_source)
Common Selenium operations:
For more tutorials, see: http://selenium-python.readthedocs.io/installation.html#introduction
Close:
- driver.close(): closes the current page.
- driver.quit(): exits the entire browser.
Locating elements:
Each find_element_by_* helper has an equivalent find_element(By.…) form, shown as a pair below. By is imported with from selenium.webdriver.common.by import By. The find_element_by_* helpers were removed in Selenium 4.3, so on current releases only the find_element(By.…) form works.
- find_element_by_id: finds an element by its id. The two calls are equivalent:

      submitTag = driver.find_element_by_id('su')
      submitTag1 = driver.find_element(By.ID, 'su')

- find_element_by_class_name: finds an element by its class name. The two calls are equivalent:

      submitTag = driver.find_element_by_class_name('su')
      submitTag1 = driver.find_element(By.CLASS_NAME, 'su')

- find_element_by_name: finds an element by the value of its name attribute. The two calls are equivalent:

      submitTag = driver.find_element_by_name('email')
      submitTag1 = driver.find_element(By.NAME, 'email')

- find_element_by_tag_name: finds an element by its tag name. The two calls are equivalent:

      submitTag = driver.find_element_by_tag_name('div')
      submitTag1 = driver.find_element(By.TAG_NAME, 'div')

- find_element_by_xpath: finds an element using XPath syntax. The two calls are equivalent:

      submitTag = driver.find_element_by_xpath('//div')
      submitTag1 = driver.find_element(By.XPATH, '//div')

- find_element_by_css_selector: finds an element using a CSS selector. Note that the argument is a CSS selector such as 'div', not an XPath expression such as '//div'. The two calls are equivalent:

      submitTag = driver.find_element_by_css_selector('div')
      submitTag1 = driver.find_element(By.CSS_SELECTOR, 'div')

Note that find_element returns the first element that satisfies the condition, while find_elements returns all elements that satisfy it.
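One practical consequence of the difference: find_element raises NoSuchElementException when nothing matches, whereas find_elements returns an empty list. The sketch below uses that to look up an optional element without exception handling. The helper name `find_or_none` is made up for this example, and `driver` is assumed to be any object with Selenium's `find_elements(by, value)` method.

```python
def find_or_none(driver, by, value):
    """Return the first matching element, or None if nothing matches.

    Relies on find_elements returning an empty list (rather than
    raising) when the page contains no matching element.
    """
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None
```

With a real driver this would be called as `find_or_none(driver, By.ID, 'kw')`.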
Operating form elements:
- Input boxes: there are two steps. Step 1: find the element. Step 2: use send_keys(value) to fill in the data. Sample code:

      inputTag = driver.find_element_by_id('kw')
      inputTag.send_keys('python')

  The clear method clears the contents of the input box. Sample code:

      inputTag.clear()

- Checkboxes: a checkbox is selected by clicking it with the mouse. So to select a checkbox, first find the element and then call click on it. Sample code:

      rememberTag = driver.find_element_by_name("rememberMe")
      rememberTag.click()

- Select: a select element cannot be operated with a single click, because you have to click it and then pick an option. For this, Selenium provides a dedicated class, selenium.webdriver.support.ui.Select. Pass the located element to this class to create an object, then use that object to make the selection. Sample code:

      from selenium.webdriver.support.ui import Select

      # locate the select tag, then wrap it in a Select object
      selectTag = Select(driver.find_element_by_name("jumpMenu"))
      # select by index
      selectTag.select_by_index(1)
      # select by value
      selectTag.select_by_value("http://www.95yueba.com")
      # select by visible text
      selectTag.select_by_visible_text("95 show the client")
      # deselect all options (only valid for multi-select elements)
      selectTag.deselect_all()

- Buttons: a button can be operated in many ways, such as click, right-click, and double-click. The most commonly used one is click: simply call the click function. Sample code:

      inputTag = driver.find_element_by_id('su')
      inputTag.click()
Action chains:
Sometimes an operation on a page involves many steps. In that case you can use the mouse action chain class ActionChains. For example, suppose you want to move the mouse onto an element and then click it. Sample code:

    from selenium.webdriver.common.action_chains import ActionChains

    inputTag = driver.find_element_by_id('kw')
    submitTag = driver.find_element_by_id('su')

    actions = ActionChains(driver)
    actions.move_to_element(inputTag)
    actions.send_keys_to_element(inputTag, 'python')
    actions.move_to_element(submitTag)
    actions.click(submitTag)
    # nothing happens until perform() runs the queued actions
    actions.perform()

There are more mouse-related operations:
- click_and_hold(element): presses the mouse button without releasing it.
- context_click(element): right-click.
- double_click(element): double-click.
For more, see: http://selenium-python.readthedocs.io/api.html
Operating cookies:
- Get all cookies:

      for cookie in driver.get_cookies():
          print(cookie)

- Get a cookie by its key:

      value = driver.get_cookie(key)

- Delete all cookies:

      driver.delete_all_cookies()

- Delete one cookie:

      driver.delete_cookie(key)
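A common use of these calls is carrying a session over between runs. The following is a minimal sketch, assuming `driver` is a Selenium driver (or any object with `get_cookies()` and `add_cookie(cookie)`); the function names `save_cookies` and `load_cookies` and the JSON file layout are choices made for this example. `add_cookie` is not covered above but is part of the same driver API.

```python
import json


def save_cookies(driver, path):
    """Write all cookies of the current session to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(driver.get_cookies(), f)


def load_cookies(driver, path):
    """Add every cookie stored in the JSON file to the driver.

    Returns the number of cookies that were added.
    """
    with open(path, encoding="utf-8") as f:
        cookies = json.load(f)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

With a real browser, the driver must already be on a page of the cookie's domain before `add_cookie` is called, so `load_cookies` is normally preceded by a `driver.get(...)` to that site.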
Page waits:
More and more web pages use Ajax, so a program cannot know in advance when an element has finished loading. If the page takes long enough that a DOM element has not appeared yet, but your code tries to use that element right away, an exception is thrown (NoSuchElementException in the Python bindings). To solve this, Selenium provides two kinds of waits: implicit waits and explicit waits.
- Implicit wait: call driver.implicitly_wait. When an element is not immediately available, the driver keeps looking for it for up to the given time (10 seconds in this example) before giving up. Sample code:

      driver = webdriver.Chrome(executable_path=driver_path)
      driver.implicitly_wait(10)
      # request the web page
      driver.get("https://www.douban.com/")

- Explicit wait: an explicit wait proceeds to the next operation only once a given condition is met. You can also specify a maximum waiting time; if it is exceeded, an exception is thrown. Explicit waits combine the expected conditions in selenium.webdriver.support.expected_conditions with selenium.webdriver.support.ui.WebDriverWait. Sample code:

      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.support import expected_conditions as EC

      driver = webdriver.Firefox()
      driver.get("http://somedomain/url_that_delays_loading")
      try:
          # wait up to 10 seconds for the element to appear in the DOM
          element = WebDriverWait(driver, 10).until(
              EC.presence_of_element_located((By.ID, "myDynamicElement"))
          )
      finally:
          driver.quit()

- Some other wait conditions:
  - presence_of_element_located: an element has been loaded.
  - presence_of_all_elements_located: all elements on the page that match the condition have been loaded.
  - element_to_be_clickable: an element can be clicked.
For more conditions, see: http://selenium-python.readthedocs.io/waits.html
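To make the idea behind an explicit wait concrete, here is a simplified, Selenium-free sketch of the polling loop: call a condition repeatedly until it returns something truthy or the time limit runs out. This is an illustration of the concept only, not Selenium's actual implementation; the name `wait_until` and the use of the built-in `TimeoutError` are assumptions of this example. In real code, WebDriverWait does this job.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value, then return it.

    Raises TimeoutError if the condition is still falsy once `timeout`
    seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Sleep briefly so the loop does not spin at full speed
        time.sleep(poll_interval)
```

The condition is just a callable, for example `lambda: driver.find_elements(By.ID, "myDynamicElement")`, which is falsy (an empty list) until the element exists.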
Switching pages:
Sometimes a window has many tabs open, and you need to switch between them. Selenium provides switch_to_window for this (written driver.switch_to.window in newer releases, where the old name was removed). The handle of the page to switch to can be found in driver.window_handles. Sample code:

    # open a new page
    self.driver.execute_script("window.open('" + url + "')")
    # switch to this new page
    self.driver.switch_to.window(self.driver.window_handles[1])
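Indexing window_handles with a fixed number assumes exactly one other tab is open. A slightly more defensive sketch compares the handles before and after opening the page and switches to whichever one is new. The helper name `open_in_new_tab` is made up for this example; `driver` is assumed to expose `window_handles`, `execute_script(script, *args)`, and `switch_to.window(handle)` as Selenium drivers do.

```python
def open_in_new_tab(driver, url):
    """Open `url` in a new tab, switch to it, and return its handle."""
    before = set(driver.window_handles)
    # Pass the URL as a script argument instead of concatenating it
    # into the JS source, which avoids quoting problems
    driver.execute_script("window.open(arguments[0]);", url)
    new_handles = [h for h in driver.window_handles if h not in before]
    if not new_handles:
        raise RuntimeError("no new window was opened")
    driver.switch_to.window(new_handles[0])
    return new_handles[0]
```

Keeping the returned handle, together with the original one, makes it easy to switch back later with `driver.switch_to.window(...)`.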
Setting a proxy IP:
If you crawl certain pages too frequently, the server may detect that you are a crawler and block your IP address. In that case you can switch to a proxy IP. Each browser sets a proxy differently. Here is how to do it in Chrome:

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # route all browser traffic through the proxy
    options.add_argument("--proxy-server=http://110.73.2.248:8123")
    driver_path = r"D:\ProgramApp\chromedriver\chromedriver.exe"
    driver = webdriver.Chrome(executable_path=driver_path, chrome_options=options)

    # httpbin echoes the IP address the request came from
    driver.get('http://httpbin.org/ip')
The WebElement class:
from selenium.webdriver.remote.webelement import WebElement
Every element you obtain is an instance of this class.
Some commonly used members:
- get_attribute: returns the value of an attribute of this tag.
- save_screenshot: saves a screenshot of the current page. This method is available only on the driver. (WebElement has its own screenshot method, which captures just that element.) Note that the driver is not a WebElement: its class derives from selenium.webdriver.remote.webdriver.WebDriver.
For details, read the relevant source code.
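As a small illustration of get_attribute, the sketch below collects one attribute from every matching element, for instance the href of every link on a page. The helper name `attribute_values` is hypothetical; `driver` is assumed to provide `find_elements(by, value)`, and each returned element to provide `get_attribute(name)`, as WebElement does.

```python
def attribute_values(driver, by, value, attribute):
    """Return `attribute` for every matching element that has it."""
    values = []
    for element in driver.find_elements(by, value):
        attr = element.get_attribute(attribute)
        # get_attribute returns None when the attribute is absent
        if attr is not None:
            values.append(attr)
    return values
```

With a real driver, `attribute_values(driver, By.TAG_NAME, 'a', 'href')` would return the link targets found on the current page.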