Advanced Crawlers, Chapter 4: Crawling Dynamic Web Data

Crawling dynamic web page data

What is AJAX:

AJAX stands for Asynchronous JavaScript And XML. By exchanging small amounts of data with the server in the background, Ajax lets a page update asynchronously. This means that parts of a page can be updated without reloading the whole page. A traditional web page (one that does not use Ajax) must reload the entire page to update its content. The name AJAX comes from the XML syntax that was originally used for data transfer; in practice the data exchanged today is almost always JSON. Data loaded through Ajax is rendered into the browser by JS, so it still does not appear under right-click -> View page source; there you only see the HTML that was loaded for the URL itself.

Obtaining ajax data:

  1. Analyze the interface that the ajax call uses, then request that interface directly in code.
  2. Use Selenium + chromedriver to simulate browser behavior and obtain the data.
Method | Advantages | Disadvantages
Analyze the interface | The data can be requested directly, with no parsing needed. Less code, high performance. | Analyzing the interface is more complex, especially when it is obfuscated with JS, and requires some JS skill. The crawler is easier to detect.
Selenium | Simulates browser behavior directly; whatever the browser can request, Selenium can request too. The crawler is more stable. | More code. Low performance.
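
The first approach can be sketched with the standard library alone. This is only an illustration under stated assumptions: the URL `https://example.com/api/items`, the `page` parameter, and the `data` / `title` fields are all hypothetical stand-ins for whatever interface you find in the browser's network panel.

```python
import json
import urllib.parse
import urllib.request


def build_url(base_url, params):
    """Append query parameters to the interface URL."""
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_titles(body):
    """Pull the 'title' of each record out of a JSON response body."""
    payload = json.loads(body)
    return [item["title"] for item in payload.get("data", [])]


def fetch_titles(base_url, params):
    """Request the interface directly and parse the JSON it returns."""
    request = urllib.request.Request(
        build_url(base_url, params),
        # Some interfaces reject requests without a browser-like User-Agent.
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_titles(response.read().decode("utf-8"))


# Hypothetical usage (this endpoint is made up for illustration):
# titles = fetch_titles("https://example.com/api/items", {"page": 1})
```

Keeping the parsing step in its own function makes it possible to check it against a saved response body before sending any real requests.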

Obtaining dynamic data with Selenium + chromedriver:

Selenium works like a robot. It can simulate what a person does in a browser and automate actions such as clicking, filling in data, and deleting cookies. chromedriver is a driver program for the Chrome browser; Selenium uses it to drive the browser. Each browser has its own driver. The browsers and their corresponding drivers are listed below:

  1. Chrome:https://sites.google.com/a/chromium.org/chromedriver/downloads
  2. Firefox:https://github.com/mozilla/geckodriver/releases
  3. Edge:https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
  4. Safari:https://webkit.org/blog/6900/webdriver-support-in-safari-10/

Installing Selenium and chromedriver:

  1. Installing Selenium: Selenium has bindings for many languages, including Java, Ruby, and Python. We install the Python version.
     pip install selenium
    
  2. Installing chromedriver: once downloaded, place it in a directory whose path contains only English characters and that does not require special permissions.

Getting Started:

The following simple example fetches the Baidu home page to show how to get started with Selenium and chromedriver:

from selenium import webdriver

# absolute path of chromedriver
driver_path = r'D:\ProgramApp\chromedriver\chromedriver.exe'

# initialize a driver and specify the chromedriver path
# (Selenium 3 style; Selenium 4 passes the path through a Service object instead)
driver = webdriver.Chrome(executable_path=driver_path)
# request a web page
driver.get("https://www.baidu.com/")
# get the page source code via page_source
print(driver.page_source)

 

Common Selenium operations:

For more tutorials, see: http://selenium-python.readthedocs.io/installation.html#introduction

Close:

  1. driver.close(): Closes the current page.
  2. driver.quit(): Exits the entire browser.
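
One way to make sure quit() is always called, even when the crawl raises an exception, is a small context manager. The sketch below is an illustration only; `managed_driver` is a hypothetical helper name, not part of Selenium.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Yield a driver and always call quit(), even if the body raises."""
    driver = create_driver()
    try:
        yield driver
    finally:
        # quit() exits the entire browser;
        # close() would only close the current page.
        driver.quit()


# Usage, assuming selenium and chromedriver are installed:
# from selenium import webdriver
# with managed_driver(webdriver.Chrome) as driver:
#     driver.get("https://www.baidu.com/")
#     print(driver.title)
```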

Locating elements:

  1. find_element_by_id: Finds an element by its id. The two calls below are equivalent (By is imported with from selenium.webdriver.common.by import By):
     submitTag = driver.find_element_by_id('su')
     submitTag1 = driver.find_element(By.ID,'su')

     

  2. find_element_by_class_name: Finds an element by its class name. The two calls below are equivalent:
     submitTag = driver.find_element_by_class_name('su')
     submitTag1 = driver.find_element(By.CLASS_NAME,'su')

     

  3. find_element_by_name: Finds an element by the value of its name attribute. The two calls below are equivalent:
     submitTag = driver.find_element_by_name('email')
     submitTag1 = driver.find_element(By.NAME,'email')

     

  4. find_element_by_tag_name: Finds an element by its tag name. The two calls below are equivalent:
     submitTag = driver.find_element_by_tag_name('div')
     submitTag1 = driver.find_element(By.TAG_NAME,'div')

     

  5. find_element_by_xpath: Finds an element with an XPath expression. The two calls below are equivalent:
     submitTag = driver.find_element_by_xpath('//div')
     submitTag1 = driver.find_element(By.XPATH,'//div')

     

  6. find_element_by_css_selector: Finds an element with a CSS selector. The two calls below are equivalent:

     submitTag = driver.find_element_by_css_selector('div')
     submitTag1 = driver.find_element(By.CSS_SELECTOR,'div')

     

    Note that find_element returns the first element that satisfies the condition, while find_elements returns all elements that satisfy it.
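
The difference between the two calls matters when an element may be missing: find_element raises NoSuchElementException if nothing matches, while find_elements returns an empty list. A minimal sketch of a helper built on that behavior follows; `first_or_none` is a hypothetical name, and the helper works with any object that offers find_elements(by, value).

```python
def first_or_none(driver, by, value):
    """Return the first matching element, or None when nothing matches."""
    # find_elements never raises for "no match"; it returns an empty list.
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None


# Usage, assuming an already created `driver`:
# from selenium.webdriver.common.by import By
# button = first_or_none(driver, By.ID, "su")
# if button is None:
#     print("search button not found")
```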

Operating on form elements:

  1. Operating an input box: this takes two steps. Step 1: find the element. Step 2: use send_keys(value) to fill in the data. Sample code is as follows:

     inputTag = driver.find_element_by_id('kw')
     inputTag.send_keys('python')

     

    Use the clear method to clear the contents of the input box. Sample code is as follows:

     inputTag.clear()

     

  2. Operating a checkbox: a checkbox is selected on the page by clicking it with the mouse. So to select a checkbox, first find the element, then call its click method. Sample code is as follows:

     rememberTag = driver.find_element_by_name("rememberMe")
     rememberTag.click()

     

  3. Choosing from a select: a select element cannot be operated by clicking it directly, because you also need to click the option to choose. For this, Selenium provides a dedicated class, selenium.webdriver.support.ui.Select. Pass the located element to this class to create an object, then use that object to make selections. Sample code is as follows:

     from selenium.webdriver.support.ui import Select
     # locate the select tag, then create a Select object from it
     selectTag = Select(driver.find_element_by_name("jumpMenu"))
     # select by index
     selectTag.select_by_index(1)
     # select by value
     selectTag.select_by_value("http://www.95yueba.com")
     # select by visible text
     selectTag.select_by_visible_text("95 show the client")
     # deselect all options (only valid for multi-select elements)
     selectTag.deselect_all()

     

  4. Operating a button: a button can be operated in several ways, such as a click, a right click, or a double click. The most common is the plain click, which is done by calling the click method directly. Sample code is as follows:

     inputTag = driver.find_element_by_id('su')
     inputTag.click()
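
The form operations above can be wrapped in two small helpers. This is a sketch only: `fill_input` and `set_checkbox` are hypothetical names, and the code relies on the element methods clear, send_keys, is_selected, and click.

```python
def fill_input(element, text):
    """Replace the content of an input box with `text`."""
    # send_keys appends to whatever is already in the box,
    # so clear the old value first.
    element.clear()
    element.send_keys(text)


def set_checkbox(element, checked):
    """Click a checkbox only when its state differs from `checked`."""
    if element.is_selected() != checked:
        element.click()


# Usage, assuming an already created `driver`:
# from selenium.webdriver.common.by import By
# fill_input(driver.find_element(By.ID, "kw"), "python")
# set_checkbox(driver.find_element(By.NAME, "rememberMe"), True)
```

Checking is_selected before clicking avoids accidentally unticking a checkbox that was already selected.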

     

Action chains:

Sometimes an operation on a page involves many steps. In that case you can use the mouse action chain class ActionChains. For example, suppose you want to move the mouse over an element and then click it. Sample code is as follows:

from selenium.webdriver.common.action_chains import ActionChains

inputTag = driver.find_element_by_id('kw')
submitTag = driver.find_element_by_id('su')

actions = ActionChains(driver)
actions.move_to_element(inputTag)
actions.send_keys_to_element(inputTag,'python')
actions.move_to_element(submitTag)
actions.click(submitTag)
actions.perform()

 

ActionChains also supports other mouse-related operations, such as click_and_hold, context_click (right click), and double_click.

Operating on cookies:

  1. Get all cookies:
     for cookie in driver.get_cookies():
         print(cookie)

     

  2. Get a cookie by its key (the result is a dict describing the cookie, or None if no such cookie exists):
    value = driver.get_cookie(key)

     

  3. Delete all cookies:
    driver.delete_all_cookies()

     

  4. Delete a single cookie:
    driver.delete_cookie(key)
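
A common follow-up is to reuse the browser's cookies in plain HTTP requests. get_cookies() returns a list of dicts that each contain at least a name and a value; the sketch below converts such a list. The helper names and the sample cookie values are made up for illustration.

```python
def cookies_to_dict(cookies):
    """Map cookie names to values from the list that get_cookies() returns."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def cookie_header(cookies):
    """Build a Cookie request header value from the same list."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# Sample data shaped like driver.get_cookies() output (values are made up).
sample = [
    {"name": "sessionid", "value": "abc123", "domain": "example.com"},
    {"name": "lang", "value": "en", "domain": "example.com"},
]
print(cookies_to_dict(sample))
print(cookie_header(sample))
```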

     

Page waits:

More and more web pages use Ajax, so a program cannot know when an element has finished loading. If the page takes too long to load and a DOM element has not appeared yet when your code tries to use it, an exception is thrown (NoSuchElementException in the Python bindings). To solve this problem, Selenium provides two kinds of waits: implicit waits and explicit waits.

  1. Implicit wait: call driver.implicitly_wait. When an element is not yet available, Selenium then waits up to 10 seconds for it before giving up. Sample code is as follows:

    driver = webdriver.Chrome(executable_path=driver_path)
    driver.implicitly_wait(10)
    # request the web page
    driver.get("https://www.douban.com/")

     

  2. Explicit wait: an explicit wait performs the element lookup only once a certain condition is met. You also specify a maximum wait time; if it is exceeded, an exception (TimeoutException) is thrown. Explicit waits combine the expected conditions in selenium.webdriver.support.expected_conditions with selenium.webdriver.support.ui.WebDriverWait. Sample code is as follows:

     from selenium import webdriver
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.support import expected_conditions as EC
    
     driver = webdriver.Firefox()
     driver.get("http://somedomain/url_that_delays_loading")
     try:
         element = WebDriverWait(driver, 10).until(
             EC.presence_of_element_located((By.ID, "myDynamicElement"))
         )
     finally:
         driver.quit()

     

  3. Some other wait conditions:

    • presence_of_element_located: an element has been loaded.
    • presence_of_all_elements_located: all elements on the page that match the condition have been loaded.
    • element_to_be_clickable: an element can be clicked.

      For more conditions, see: http://selenium-python.readthedocs.io/waits.html
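
To make the idea behind an explicit wait concrete, here is a simplified polling loop. It is an illustration of the technique, not Selenium's implementation: the real WebDriverWait also passes the driver to the condition and ignores NoSuchElementException by default. `wait_until` is a hypothetical name, and the 0.5 second poll interval mirrors WebDriverWait's default.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Usage, assuming an already created `driver`:
# from selenium.webdriver.common.by import By
# elements = wait_until(
#     lambda: driver.find_elements(By.ID, "myDynamicElement"), timeout=10
# )
```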

Switching pages:

Sometimes a window contains many tabs, and you need to switch between them. Selenium provides switch_to_window for this (newer Selenium versions use driver.switch_to.window instead). The page to switch to can be found in driver.window_handles. Sample code is as follows:

# open a new page
self.driver.execute_script("window.open('" + url + "')")
# switch to this new page
self.driver.switch_to_window(self.driver.window_handles[1])
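
Indexing window_handles with a fixed number assumes that exactly one other page is open. A slightly more defensive sketch compares the handles before and after opening the page; `find_new_handle` is a hypothetical helper name.

```python
def find_new_handle(handles_before, handles_after):
    """Return the window handle that appears only in `handles_after`."""
    new_handles = [h for h in handles_after if h not in handles_before]
    if len(new_handles) != 1:
        raise ValueError(f"expected one new window, found {len(new_handles)}")
    return new_handles[0]


# Usage, assuming an already created `driver` and a `url` to open:
# before = driver.window_handles
# driver.execute_script("window.open(arguments[0])", url)
# driver.switch_to.window(find_new_handle(before, driver.window_handles))
```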

 

Setting a proxy IP:

If you crawl some pages too frequently, the server may detect that you are a crawler and ban your IP address. In that case you can switch to a proxy IP. Each browser implements this differently; Chrome is used as the example here:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://110.73.2.248:8123")
driver_path = r"D:\ProgramApp\chromedriver\chromedriver.exe"
driver = webdriver.Chrome(executable_path=driver_path,chrome_options=options)

driver.get('http://httpbin.org/ip')
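
When several proxies are available, the --proxy-server argument can be rotated between browser sessions. The sketch below only builds the argument strings; `proxy_arguments` is a hypothetical helper, and the second address in the usage comment is a placeholder.

```python
from itertools import cycle


def proxy_arguments(proxies):
    """Yield --proxy-server arguments, cycling through `proxies` forever."""
    for proxy in cycle(proxies):
        yield f"--proxy-server={proxy}"


# Usage, assuming selenium is installed (the second proxy is a placeholder):
# from selenium import webdriver
# arguments = proxy_arguments(["http://110.73.2.248:8123", "http://127.0.0.1:8080"])
# options = webdriver.ChromeOptions()
# options.add_argument(next(arguments))
```

A proxy only takes effect when the browser starts, so each rotation means creating a new driver with fresh options.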

 

WebElement elements:

Every element you retrieve is an instance of the WebElement class (from selenium.webdriver.remote.webelement import WebElement).
It has some commonly used attributes and methods:

  1. get_attribute: gets the value of an attribute of this tag.
  2. screenshot: saves a screenshot. Called on an element, screenshot(filename) captures that element; to capture the current page, call save_screenshot(filename) on driver.
    Note that the driver object is a WebDriver, which is a separate class from WebElement.
    Please read the relevant source code for details.
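
As a small illustration of get_attribute, the sketch below collects one attribute from a list of elements. `collect_attribute` is a hypothetical helper name; it relies on get_attribute returning None when the attribute is absent.

```python
def collect_attribute(elements, name):
    """Return the `name` attribute of each element that has one."""
    values = []
    for element in elements:
        value = element.get_attribute(name)
        # get_attribute returns None when the attribute is absent.
        if value is not None:
            values.append(value)
    return values


# Usage, assuming an already created `driver`:
# from selenium.webdriver.common.by import By
# links = driver.find_elements(By.TAG_NAME, "a")
# print(collect_attribute(links, "href"))
```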

Origin www.cnblogs.com/lcy0302/p/10990631.html