Python - Selenium (crawling dynamically rendered pages)

  • Previously we covered Ajax. Ajax is one form of dynamic page rendering with JavaScript; by analyzing the Ajax requests directly, the data can still be crawled with requests or urllib.

 

  • But Ajax is not the only kind of dynamic rendering with JavaScript. Moreover, on pages such as Taobao, even though the data is fetched through Ajax, the Ajax interfaces carry many encrypted parameters whose pattern is hard to work out, so crawling by analyzing Ajax directly is difficult.

 

  • To solve these problems, you can simulate a running browser. What you see in the browser is exactly the source you crawl: whatever is visible can be crawled. This way we no longer need to care which algorithms the page's JavaScript uses to render it, or what parameters the page's backend Ajax interfaces take.

 

  • Python offers many libraries for simulating a running browser, such as Selenium, Splash, PyV8, and Ghost.

Selenium:

Selenium is an automated testing tool that drives a browser to perform specific actions such as clicking and scrolling down. It can also return the source of the page the browser has currently rendered, so whatever is visible can be crawled.

 

1. Preparation:

Install the Chrome browser;

Install ChromeDriver;

Install Selenium.
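
Before going further, it can help to confirm that the three prerequisites are in place. The following is a minimal sketch, not an official check: the helper name check_setup is made up for illustration, and the Chrome binary names it tries are assumptions that differ by platform. On macOS and Windows, Chrome is usually not on PATH, so "not found" there does not necessarily mean it is missing.

```python
import importlib.util
import shutil


def check_setup():
    """Report which of the three prerequisites can be found (hypothetical helper)."""
    # Assumed binary names; adjust for your platform.
    chrome_names = ("google-chrome", "chrome", "chromium", "chromium-browser")
    return {
        "selenium": importlib.util.find_spec("selenium") is not None,
        "chromedriver": shutil.which("chromedriver") is not None,
        "chrome": any(shutil.which(name) for name in chrome_names),
    }


for name, found in check_setup().items():
    print(f"{name}: {'found' if found else 'not found'}")
```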

 

2. Basic use

 

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
    # find_element(By.ID, ...) works where find_element_by_id() has been removed (Selenium 4)
    input = browser.find_element(By.ID, 'kw')
    input.send_keys('Python')
    input.send_keys(Keys.ENTER)
    wait = WebDriverWait(browser, 10)
    wait.until(EC.presence_of_element_located((By.ID, 'content_left')))
    print(browser.current_url)
    print(browser.get_cookies())
    print(browser.page_source)  # page source
finally:
    browser.close()
    

 

 

Let's look at Selenium's usage in detail.

 

3. Declaring the browser object

Selenium supports many browsers, including Chrome, Firefox, and Edge, as well as mobile browsers on Android and BlackBerry. It also supports PhantomJS, a headless browser with no graphical interface.

 

Initialization:

from selenium import webdriver

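# Each line creates a browser object for a different browser; keep only the one you need.
browser = webdriver.Chrome()
browser = webdriver.Firefox()
browser = webdriver.Edge()
browser = webdriver.PhantomJS()  # older Selenium releases only; removed in Selenium 4
browser = webdriver.Safari()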

 

4. Visiting a page

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
print(browser.page_source)  # print the page source
browser.close()

 

5. Finding nodes

Selenium can drive the browser to perform various operations, such as filling in forms and simulating clicks. For example, to type text into an input box we first need to know where that input box is. Selenium provides a range of methods for finding nodes.

  • A single node

First, view the page source.

Methods for getting a single node:

      • find_element_by_id
      • find_element_by_name
      • find_element_by_xpath
      • find_element_by_link_text
      • find_element_by_partial_link_text
      • find_element_by_tag_name
      • find_element_by_class_name
      • find_element_by_css_selector

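In addition, Selenium provides a generic method:

      •  find_element(): takes two arguments, the lookup strategy and the value. For example:

 find_element(By.ID, id) is equivalent to find_element_by_id(id)

 

Methods for getting multiple nodes: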

    • find_elements_by_id
    • find_elements_by_name
    • find_elements_by_xpath
    • find_elements_by_link_text
    • find_elements_by_partial_link_text
    • find_elements_by_tag_name
    • find_elements_by_class_name
    • find_elements_by_css_selector

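In addition, Selenium provides a generic method:

      •  find_elements(): takes two arguments, the lookup strategy and the value. For example:

find_elements(By.ID, id) is equivalent to find_elements_by_id(id)

 

Note: the single-node and multi-node methods differ only in the singular or plural form of the name (element vs. elements). The single-node methods return the first match, while the multi-node methods return a list of all matches.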
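
The find_element_by_* helpers were deprecated in Selenium 4 and removed in later 4.x releases, so the generic form is the one to rely on. The sketch below shows how each helper corresponds to a locator strategy. The strings are the values of the By constants (By.ID is "id", By.CSS_SELECTOR is "css selector", and so on), written out literally so the mapping is visible; the find helper itself is hypothetical and only wraps the two generic methods.

```python
# Locator strategy strings; these match the values of the constants on
# selenium.webdriver.common.by.By (By.ID == "id", By.CSS_SELECTOR == "css selector", ...).
STRATEGIES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def find(driver, how, value, many=False):
    """Hypothetical helper: find(driver, "id", "q") acts like find_element_by_id("q")."""
    strategy = STRATEGIES[how]
    if many:
        return driver.find_elements(strategy, value)  # list of matches, possibly empty
    return driver.find_element(strategy, value)  # first match; raises if none is found
```

With a real driver, find(browser, 'css_selector', '.btn-search') makes the same call as browser.find_element(By.CSS_SELECTOR, '.btn-search').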

 

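6. Node interaction

Selenium can drive the browser to perform operations, which really means having the browser simulate certain actions. Common methods are:

  • send_keys(): type text;
  • clear(): clear the text;
  • click(): click a button.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
# Find the search box by its id (may fail if Taobao's page source has changed)
input = browser.find_element(By.ID, 'q')
input.send_keys('iPhone')  # type iPhone into the search box
time.sleep(1)
input.clear()  # clear the search box
input.send_keys('iPad')
# Find the "Search" button (may fail if Taobao's page source has changed)
button = browser.find_element(By.CLASS_NAME, 'btn-search')
button.click()

See the official documentation: http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.remote.webelement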

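7. Action chains

The examples above act on a specific node. Other operations, such as dragging with the mouse or pressing keys, have no specific target node. These are performed in a different way: with action chains.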

 

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
browser.get(url)
browser.switch_to.frame('iframeResult')  # the demo nodes live inside this iframe
source = browser.find_element(By.CSS_SELECTOR, '#draggable')
target = browser.find_element(By.CSS_SELECTOR, '#droppable')
actions = ActionChains(browser)
actions.drag_and_drop(source, target)  # queue the drag-and-drop action
actions.perform()  # run the queued actions

Running this drags the #draggable node onto the #droppable node.

See the official documentation: http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.common.action_chains

 

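8. Executing JavaScript

Selenium's API does not cover every operation, for example scrolling the page down. In such cases Selenium can run JavaScript directly through the execute_script() method.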

 

from selenium import webdriver

browser = webdriver.Chrome()
url = 'https://www.zhihu.com/explore'
browser.get(url)
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
browser.execute_script('alert("To Bottom")')

With this method, almost anything the API does not provide can be done by executing JavaScript.
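
A common use of execute_script() is scrolling a lazily loading page until nothing new appears. The following is a minimal sketch under stated assumptions: the helper name scroll_to_bottom, the pause length, and the round limit are all made up for illustration, and it assumes the page grows document.body.scrollHeight as content loads. A script that starts with return hands its value back to Python.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops growing; return the final height."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return last_height
```

It would be called as scroll_to_bottom(browser) after browser.get(url); the fixed pause is a simplification, and an explicit wait may be more reliable on slow pages.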

 

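9. Getting node information

In the earlier code, the page_source attribute returns the page's source, which can then be passed to a parsing library to extract information.

Selenium can also select nodes itself. The result is a WebElement, from which node information (text, attributes, and so on) can be read directly.

 

  • Getting attributes

get_attribute()

 

  • Getting the text

input.text

Every WebElement node has a text attribute.

 

  • Getting the id, location, tag name, and size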
  • input.id
  • input.location
  • input.tag_name
  • input.size
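
These properties can be gathered in one place. A minimal sketch, assuming element is a WebElement returned by one of the find methods; the describe helper and its default attribute names are made up for illustration.

```python
def describe(element, attributes=("class", "href")):
    """Collect the commonly used properties of a WebElement into a dict."""
    return {
        "id": element.id,              # Selenium's internal reference, not the HTML id attribute
        "tag_name": element.tag_name,
        "text": element.text,
        "location": element.location,  # position on the page, a dict with x and y
        "size": element.size,          # a dict with height and width
        # get_attribute() returns None when the attribute is absent
        "attributes": {name: element.get_attribute(name) for name in attributes},
    }
```

For example, describe(input) on the search box from the earlier examples would return its tag name, text, position, and size together.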

 

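10. Switching frames

A page can contain iframe nodes, that is, child frames. An iframe is effectively a page within the page, structured just like an ordinary page. After Selenium opens a page it works in the parent frame by default, and from there it cannot reach nodes inside a child frame. To do so, switch frames with the switch_to.frame() method.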
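
Because it is easy to forget to switch back, one option is to wrap the switch in a context manager. This is a sketch, not part of Selenium: the in_frame helper is hypothetical, and it relies only on switch_to.frame() and switch_to.default_content(), which returns to the top-level page.

```python
from contextlib import contextmanager


@contextmanager
def in_frame(driver, frame_reference):
    """Hypothetical helper: work inside a child frame, then return to the top-level page."""
    driver.switch_to.frame(frame_reference)  # a name, id, index, or frame WebElement
    try:
        yield driver
    finally:
        driver.switch_to.default_content()  # always go back to the top-level document


# Usage with a real driver (not run here):
# with in_frame(browser, 'iframeResult'):
#     box = browser.find_element('css selector', '#draggable')
```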

 

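11. Waiting

In Selenium, get() returns once the page framework has loaded. If you read page_source at that point, it may not be the fully loaded page: if the page makes additional Ajax requests, their results may not yet appear in the source. So we need to wait for a while to make sure the nodes have loaded.

There are two ways to wait: implicit waits and explicit waits.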

 

  • Implicit wait

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.implicitly_wait(10)  # wait up to 10 seconds when looking up a node
url = 'https://www.zhihu.com/explore'
browser.get(url)
input = browser.find_element(By.CLASS_NAME, 'zu-top-add-question')
print(input)

 

 

  • Explicit wait

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
wait = WebDriverWait(browser, 10)  # wait at most 10 seconds
input = wait.until(EC.presence_of_element_located((By.ID, 'q')))
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-search')))
print(input, button)

See the official documentation: http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.support.expected_conditions
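
Conceptually, an explicit wait is a polling loop: check a condition, and if it does not hold yet, sleep briefly and try again until a deadline passes. The sketch below is a simplified stand-in written for illustration, not Selenium's implementation. The name wait_until is made up, the condition takes no arguments (Selenium passes the driver to it), and it raises the built-in TimeoutError rather than Selenium's TimeoutException.

```python
import time


def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Simplified stand-in for WebDriverWait(...).until(...), for illustration only."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:  # a truthy value means the condition is met
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_frequency)  # wait a little before checking again
```

This is why an explicit wait returns as soon as the node appears instead of always sleeping for the full timeout.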

 

12. Back and forward

Browsers have back and forward navigation, and Selenium can do the same.

  • back(): go back
  • forward(): go forward

import time

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
browser.get('https://www.python.org')
browser.get('https://www.baidu.com')

browser.back()  # back to python.org
time.sleep(1)
browser.forward()  # forward to baidu.com again
browser.close()

 

13. Cookies

Selenium makes it easy to work with cookies: getting, adding, and deleting them.

 

from selenium import webdriver

browser = webdriver.Chrome()
url = 'https://www.zhihu.com/explore'
browser.get(url)  # cookies have already been set once loading completes
print(browser.get_cookies())  # get all cookies
# Add a cookie; note the singular add_cookie versus the plural get_cookies
browser.add_cookie({'name': 'name', 'domain': 'www.zhihu.com', 'value': 'germey'})
print(browser.get_cookies())  # get the cookies again
browser.delete_all_cookies()  # delete all cookies
print(browser.get_cookies())  # now empty
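
Since get_cookies() returns a list of plain dicts, the cookies can be stored and reused in a later session. A minimal sketch, assuming a JSON file is an acceptable store; the helper names save_cookies and load_cookies and the file path are made up for illustration. The driver should already have opened a page on the matching domain before cookies are added.

```python
import json


def save_cookies(driver, path):
    """Write the current session's cookies to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(driver.get_cookies(), f)


def load_cookies(driver, path):
    """Add cookies from a JSON file to the driver; return how many were added."""
    with open(path, encoding="utf-8") as f:
        cookies = json.load(f)
    for cookie in cookies:
        driver.add_cookie(cookie)  # one call per cookie dict
    return len(cookies)
```

Typical use would be save_cookies(browser, 'cookies.json') at the end of one run, then browser.get(url) followed by load_cookies(browser, 'cookies.json') in the next.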

 

 

 

14. Tab management

Visiting pages opens tabs one after another. In Selenium we can operate on those tabs.
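
One way to work with tabs is through window handles: driver.window_handles lists the open tabs, and switch_to.window() selects the one that later commands apply to. The following is a sketch under assumptions, with a made-up helper name open_in_new_tab; it opens the tab by running window.open() as JavaScript and finds the new handle by comparing the handle list before and after.

```python
def open_in_new_tab(driver, url):
    """Hypothetical helper: open url in a new tab and return that tab's window handle."""
    before = set(driver.window_handles)
    driver.execute_script("window.open()")  # ask the browser for a new tab
    # The new tab is whichever handle was not present before.
    new_handle = next(h for h in driver.window_handles if h not in before)
    driver.switch_to.window(new_handle)  # later commands now target the new tab
    driver.get(url)
    return new_handle
```

To return to the first tab afterwards, switch back with browser.switch_to.window(browser.window_handles[0]).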

 

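15. Exception handling

Wrap Selenium calls in try/except to catch errors such as TimeoutException and NoSuchElementException (both in selenium.common.exceptions).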

 


Origin www.cnblogs.com/bltstop/p/11664209.html