[Python crawler series tutorial 26-100] Learn how to fetch AJAX data with Selenium, so dynamic web pages are no longer scary

What is AJAX

AJAX stands for Asynchronous JavaScript And XML. By exchanging small amounts of data with the server in the background, AJAX lets a web page update asynchronously. This means that parts of the page can be updated without reloading the entire page.

A traditional web page (one that does not use AJAX) must reload the entire page whenever its content needs to change. The name AJAX includes XML because XML was the original format for the transferred data; today the data is almost always JSON. Data loaded through AJAX is rendered into the page by JavaScript, but it does not show up under right-click -> View page source. That view contains only the HTML originally returned for the URL.

Ways to get ajax data

1. Directly analyze the interface called by ajax. Then request this interface through code.
2. Use Selenium+chromedriver to simulate browser behavior to obtain data.

| Method | Advantages | Disadvantages |
| --- | --- | --- |
| Analyzing the interface | The data can be requested directly, with no extra parsing work. Little code and high performance. | Analyzing the interface can be complicated, especially when it is obfuscated with JavaScript, so you need some JavaScript background. Easy to be spotted as a crawler. |
| Selenium | Directly simulates the behavior of the browser. Anything the browser can request, Selenium can request too. The crawler is more stable. | A lot of code. Low performance. |
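
The first approach, requesting the AJAX interface directly, can be sketched roughly as follows. This is only an illustration: the local FakeAjaxHandler server, the /api/list path and the payload are made up so the example runs without touching a real site. Against a real site you would take the URL from the browser's network panel.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


class FakeAjaxHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site's AJAX endpoint; returns a small JSON payload."""

    def do_GET(self):
        payload = {"items": [{"title": "python"}, {"title": "selenium"}]}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_ajax_json(url):
    """Request an AJAX interface directly and decode its JSON body."""
    # Some sites check this header, which libraries such as jQuery add to AJAX calls.
    request = Request(url, headers={"X-Requested-With": "XMLHttpRequest"})
    # An empty ProxyHandler ignores system proxy settings so the local demo server is reachable.
    opener = build_opener(ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def demo():
    server = HTTPServer(("127.0.0.1", 0), FakeAjaxHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/api/list" % server.server_address[1]
        return fetch_ajax_json(url)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for item in demo()["items"]:
        print(item["title"])
```

Because the response is already JSON, there is no HTML to parse, which is why this approach needs so little code.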

Selenium+chromedriver to obtain dynamic data

Selenium works like a robot. It can simulate human actions in the browser and automate them, such as clicking, filling in data, and deleting cookies.
chromedriver is the driver program that Selenium uses to control the Chrome browser.

Of course, there are different drivers for different browsers. The different browsers and their corresponding drivers are listed below:

  • Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads
  • Firefox: https://github.com/mozilla/geckodriver/releases
  • Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
  • Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/

Install Selenium and chromedriver:

1. Install Selenium: Selenium has bindings for many languages, such as Java, Ruby, and Python. Install the Python version:

   `pip install selenium`

2. Install chromedriver: download the version that matches your installed Chrome, then place it in a directory whose path contains only English (ASCII) characters and that does not require special permissions.

I am using version 87.0.4280.88.

Chromedriver download address: https://chromedriver.storage.googleapis.com/index.html

Quick start

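Now let's use a simple example, fetching the Baidu homepage, to get started quickly with Selenium and chromedriver:

from selenium import webdriver

# Absolute path to chromedriver
driver_path = r'D:\ProgramApp\chromedriver\chromedriver.exe'

# Create a driver and tell it where chromedriver is.
# Note: executable_path is the Selenium 3 style; newer Selenium 4 releases
# expect webdriver.Chrome(service=Service(driver_path)) instead, with
# Service imported from selenium.webdriver.chrome.service.
driver = webdriver.Chrome(executable_path=driver_path)
# Request the page
driver.get("https://www.baidu.com/")
# Get the page source through page_source
print(driver.page_source)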

Close the page:

driver.close(): Close the current page.
driver.quit(): Quit the entire browser.
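
If the script crashes before driver.quit() runs, the browser is left open. One way to guard against that is a small context manager, sketched below. The names managed_driver and FakeDriver are made up for this illustration, and FakeDriver only exists so the sketch runs without a real browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always quit it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # quit the whole browser, even if an error occurred


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, used only by this demo."""

    def __init__(self):
        self.quit_called = False

    def get(self, url):
        raise RuntimeError("simulated failure while loading " + url)

    def quit(self):
        self.quit_called = True


def demo():
    fake = FakeDriver()
    try:
        with managed_driver(lambda: fake) as driver:
            driver.get("https://www.baidu.com/")
    except RuntimeError:
        pass  # the page load failed, but quit() must still have run
    return fake.quit_called


if __name__ == "__main__":
    print(demo())
```

With a real browser you would pass something like lambda: webdriver.Chrome(executable_path=driver_path) as the factory.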

The following examples are based on the Baidu homepage.

Locating elements:

The find_element_by_* shortcuts below belong to Selenium 3 and were removed in Selenium 4. The find_element(By.XXX, ...) form works in both versions and requires from selenium.webdriver.common.by import By.

  1. find_element_by_id: Find an element by its id. The following two calls are equivalent:
    submitTag = driver.find_element_by_id('su')
    submitTag1 = driver.find_element(By.ID,'su')

  2. find_element_by_class_name: Find an element by its class name. The following two calls are equivalent:
    submitTag = driver.find_element_by_class_name('su')
    submitTag1 = driver.find_element(By.CLASS_NAME,'su')

  3. find_element_by_name: Find an element by the value of its name attribute. The following two calls are equivalent:
    submitTag = driver.find_element_by_name('email')
    submitTag1 = driver.find_element(By.NAME,'email')

  4. find_element_by_tag_name: Find an element by its tag name. The following two calls are equivalent:
    submitTag = driver.find_element_by_tag_name('div')
    submitTag1 = driver.find_element(By.TAG_NAME,'div')

  5. find_element_by_xpath: Find an element with XPath syntax. The following two calls are equivalent:
    submitTag = driver.find_element_by_xpath('//div')
    submitTag1 = driver.find_element(By.XPATH,'//div')

  6. find_element_by_css_selector: Find an element with a CSS selector (note that '//div' is XPath, not CSS). The following two calls are equivalent:
    submitTag = driver.find_element_by_css_selector('div')
    submitTag1 = driver.find_element(By.CSS_SELECTOR,'div')

Note that find_element returns the first element that matches the condition, while find_elements returns a list of all elements that match.

Manipulating form elements:

  1. Input boxes: this takes two steps. Step 1: find the element. Step 2: fill in the data with send_keys(value). The sample code is as follows:
    from selenium.webdriver.common.by import By

    inputTag = driver.find_element(By.ID, 'kw')
    inputTag.send_keys('python')

    Use the clear method to clear the contents of the input box. The sample code is as follows:
    inputTag.clear()

  2. Checkboxes: on a web page you tick a checkbox by clicking it with the mouse. So in code, find the checkbox element first and then call click on it. The sample code is as follows:
    rememberTag = driver.find_element(By.NAME, "rememberMe")
    rememberTag.click()

  3. Select elements: a select element cannot be operated with a single click, because after clicking it you still have to choose an option. For this, Selenium provides the class selenium.webdriver.support.ui.Select. Pass the located element to this class to create a Select object, then use that object to choose options. The sample code is as follows:
    from selenium.webdriver.support.ui import Select
    # Locate the element, then wrap it in a Select object
    selectTag = Select(driver.find_element(By.NAME, "jumpMenu"))
    # Select by index
    selectTag.select_by_index(1)
    # Select by value
    selectTag.select_by_value("http://www.95yueba.com")
    # Select by visible text
    selectTag.select_by_visible_text("95秀客户端")
    # Deselect all options (only allowed on a multiple-choice select)
    selectTag.deselect_all()

  4. Buttons: there are many ways to operate a button, such as single click, right click, and double click. The most common one is a plain click, which only needs a call to click. The sample code is as follows:
    inputTag = driver.find_element(By.ID, 'su')
    inputTag.click()

Action chains:

Sometimes an operation on the page involves several steps. In that case you can use the ActionChains class to queue the steps and run them together. For example, to move the mouse to the input box, type a keyword, then move to the submit button and click it:

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

inputTag = driver.find_element(By.ID, 'kw')
submitTag = driver.find_element(By.ID, 'su')

actions = ActionChains(driver)
actions.move_to_element(inputTag)
actions.send_keys_to_element(inputTag,'python')
actions.move_to_element(submitTag)
actions.click(submitTag)
# The queued actions only run when perform() is called
actions.perform()

There are more mouse-related operations:

  • click_and_hold(element): Click but do not release the mouse.
  • context_click(element): Right click.
  • double_click(element): Double click.

For more methods, please refer to: http://selenium-python.readthedocs.io/api.html

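Cookie operations:

1. Get all cookies:

for cookie in driver.get_cookies():
    print(cookie)

2. Get one cookie by its name. get_cookie returns the whole cookie as a dict (or None if there is no such cookie), so read the value from its 'value' key:

cookie = driver.get_cookie(key)
value = cookie['value'] if cookie else None

3. Delete all cookies:

driver.delete_all_cookies()

4. Delete a cookie:

driver.delete_cookie(key)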
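
driver.get_cookies() returns a list of dicts, each with at least a 'name' and a 'value' key. A common follow-up step is to turn that list into a simple mapping or a Cookie header for another HTTP client. The sketch below shows one way to do it. The helper names and the sample cookies are made up for illustration.

```python
def cookies_to_dict(cookies):
    """Map cookie names to values from a get_cookies()-style list of dicts."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def cookies_to_header(cookies):
    """Build the value of a Cookie request header from the same list."""
    return "; ".join("%s=%s" % (c["name"], c["value"]) for c in cookies)


# Made-up sample data in the shape that driver.get_cookies() returns.
sample_cookies = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "theme", "value": "dark", "domain": ".example.com", "path": "/"},
]

if __name__ == "__main__":
    print(cookies_to_dict(sample_cookies))
    print(cookies_to_header(sample_cookies))
```

With a live browser you would pass driver.get_cookies() instead of sample_cookies.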

Page waiting:

More and more web pages use AJAX, so the program cannot know in advance when an element has finished loading. If the page takes longer to load than expected and a DOM element has not appeared yet, code that tries to locate it right away fails with a NoSuchElementException. To solve this problem, Selenium provides two ways to wait: implicit waits and explicit waits.

Implicit wait

Implicit wait: call driver.implicitly_wait. After that, whenever an element is not available yet, the driver keeps looking for up to 10 seconds before raising an exception. The sample code is as follows:

driver = webdriver.Chrome(executable_path=driver_path)
driver.implicitly_wait(10)
# Request the page
driver.get("https://www.douban.com/")
Explicit wait

Explicit wait: an explicit wait fetches the element only after a given condition is met. You also specify a maximum wait time, and a TimeoutException is thrown if it is exceeded. Explicit waits combine the expected conditions in selenium.webdriver.support.expected_conditions with selenium.webdriver.support.ui.WebDriverWait. The sample code is as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    # Wait up to 10 seconds for the element to appear in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()

Some other wait conditions:

presence_of_element_located: an element is present in the DOM.
presence_of_all_elements_located: the elements matching the locator are present in the DOM; the wait returns all of them as a list.
element_to_be_clickable: an element is visible and enabled, so it can be clicked.
For more conditions, please refer to: http://selenium-python.readthedocs.io/waits.html
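
To make the idea of an explicit wait more concrete, here is a simplified polling loop in plain Python. It is not Selenium's own code, only a sketch of the same idea: call a condition repeatedly until it returns something truthy or the time limit is reached. The names wait_until and WaitTimeout are made up for this illustration.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition is still not met after the time limit."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.2f seconds" % timeout)
        time.sleep(poll_interval)


def demo():
    # Simulate an element that only "appears" on the third check.
    attempts = {"count": 0}

    def element_loaded():
        attempts["count"] += 1
        return "element" if attempts["count"] >= 3 else None

    found = wait_until(element_loaded, timeout=2.0, poll_interval=0.01)
    return found, attempts["count"]


if __name__ == "__main__":
    print(demo())
```

The difference from a fixed time.sleep(10) is that the loop returns as soon as the condition is met, so a fast page does not cost the full waiting time.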

Switching pages:

Sometimes the browser has several tabs or windows open, and you need to switch between them. Selenium provides driver.switch_to.window for this (Selenium 3 also accepts the deprecated switch_to_window). The handle of the page to switch to can be found in driver.window_handles. The sample code is as follows:

# Open a new page (using a JavaScript statement)
self.driver.execute_script("window.open('"+url+"')")
# Switch to the new page
self.driver.switch_to.window(self.driver.window_handles[1])


# Page switching

from selenium import webdriver

driver_path = r"G:\ProgramApp\chromedriver\chromedriver.exe"
driver = webdriver.Chrome(executable_path=driver_path)
driver.get('https://www.baidu.com/')

# Open a new page with a JavaScript statement
driver.execute_script("window.open('https://www.douban.com/')")
print(driver.window_handles)

# The browser window now shows the new page, but the driver has not switched yet.
# To work with the new page in code (for example, to scrape it),
# use driver.switch_to.window to switch to that window.
# Pick the window you want from driver.window_handles, a list of window handles
# that normally follows the order in which the pages were opened.
driver.switch_to.window(driver.window_handles[1])
print(driver.current_url)
print(driver.page_source)
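
Indexing window_handles with [1] relies on the order of the list. A slightly more defensive variant, sketched below, remembers the handles before opening the page and then switches to whichever handle is new. The helper open_and_switch and the FakeDriver classes are made up for this illustration; the fake driver only exists so the sketch runs without a browser.

```python
def open_and_switch(driver, url):
    """Open url in a new window and switch the driver to it."""
    before = set(driver.window_handles)
    # Extra arguments to execute_script are available in JavaScript as arguments[0], ...
    driver.execute_script("window.open(arguments[0])", url)
    new_handles = [h for h in driver.window_handles if h not in before]
    if not new_handles:
        raise RuntimeError("no new window was opened")
    driver.switch_to.window(new_handles[0])
    return new_handles[0]


class FakeSwitchTo:
    """Hypothetical stand-in for driver.switch_to."""

    def __init__(self, driver):
        self._driver = driver

    def window(self, handle):
        self._driver.current_window_handle = handle


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, used only by this demo."""

    def __init__(self):
        self.window_handles = ["win-1"]
        self.current_window_handle = "win-1"
        self.switch_to = FakeSwitchTo(self)

    def execute_script(self, script, *args):
        # Pretend the script opened one more window.
        self.window_handles.append("win-%d" % (len(self.window_handles) + 1))


if __name__ == "__main__":
    fake = FakeDriver()
    print(open_and_switch(fake, "https://www.douban.com/"))
    print(fake.current_window_handle)
```

With a real driver the call would simply be open_and_switch(driver, 'https://www.douban.com/').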

Setting a proxy IP:

If you crawl some web pages too frequently, the server may notice that you are a crawler and block your IP address. In that case you can switch to a proxy IP. Each browser configures a proxy in its own way. Here is an example for the Chrome browser:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://110.73.2.248:8123")
driver_path = r"D:\ProgramApp\chromedriver\chromedriver.exe"
# options= replaces the deprecated chrome_options= keyword
driver = webdriver.Chrome(executable_path=driver_path, options=options)

# httpbin echoes the IP address it sees, which should now be the proxy's
driver.get('http://httpbin.org/ip')

WebElement:

selenium.webdriver.remote.webelement.WebElement is the class of every element you locate.
Some commonly used members:

get_attribute: Returns the value of an attribute of this tag.
screenshot: Saves a screenshot of this element to a file. To capture the whole page, use save_screenshot on the driver instead.
The driver object is not a WebElement itself, but it offers the same find_element and find_elements methods.
For more, please read the relevant source code.

For more tutorials, please refer to: http://selenium-python.readthedocs.io/installation.html#introduction

Origin blog.csdn.net/weixin_54707168/article/details/114818998