Python Crawler 4.4 - selenium Advanced Usage Tutorial

Overview

This series of documents is a simple tutorial for learning Python crawler techniques. I write it to explain and consolidate my own knowledge; if it happens to be useful to you, so much the better.
Python version is 3.7.4

The previous article covered the basics of using selenium; this one covers some of its more advanced usage.

Headless Chrome

The example code so far pops up a browser window when it runs. That is sometimes inconvenient, and we often want to crawl data without any window appearing.

Headless Chrome is a mode of the Chrome browser that has no graphical interface. It lets you use every feature Chrome supports without opening a browser window, so you can run your scripts from the command line. Crawlers used to rely on PhantomJS for this, but PhantomJS development has been suspended, so Headless Chrome can be used instead.

Sample code is as follows:

# Import the required libraries
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Create the Chrome options
chrome_options = Options()
# Enable headless mode
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# Path to chromedriver
path = r'E:\Python_Code\s1\chromedriver_win32\chromedriver.exe'
browser = webdriver.Chrome(executable_path=path, options=chrome_options)
# Open Baidu
url = 'http://www.baidu.com/'
browser.get(url)

time.sleep(3)
# Save a screenshot of the page
browser.save_screenshot('baidu.png')

browser.quit()

Setting request headers

from selenium import webdriver
# Create the browser options
options = webdriver.ChromeOptions()
# Set the language to Chinese
options.add_argument('lang=zh_CN.UTF-8')
# Replace the User-Agent header
options.add_argument('user-agent="Mozilla/5.0 (iPod; U; CPU iPhone OS 2_1 like Mac OS X; ja-jp) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5F137 Safari/525.20"')
# Pass the options with the options= keyword (chrome_options= is deprecated)
browser = webdriver.Chrome(options=options)
url = "https://httpbin.org/get?show_env=1"
browser.get(url)
browser.quit()

Setting a proxy IP

Sometimes, when you crawl certain pages frequently, the server detects that you are a crawler and bans your IP address. Switching to a proxy IP works around this. Different browsers implement proxy switching differently; here the Chrome browser is used as an example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--proxy-server=http://123.56.74.13:8080')

# Path to chromedriver
path = r'E:\Python_Code\s1\chromedriver_win32\chromedriver.exe'
# Instantiate Chrome
driver = webdriver.Chrome(executable_path=path, options=options)
driver.get('https://httpbin.org/ip')
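
To rotate between several proxies instead of hard-coding one, the --proxy-server argument can be generated for each new session. The following is a minimal sketch, not a definitive implementation: the helper names proxy_argument, proxy_cycle and make_driver are made up for illustration, the second proxy address is a placeholder, and make_driver assumes selenium is installed and chromedriver is on the PATH.

```python
import itertools


def proxy_argument(proxy):
    """Format a 'host:port' string as a Chrome --proxy-server argument."""
    return '--proxy-server=http://' + proxy


def proxy_cycle(proxies):
    """Yield proxy arguments round-robin, forever."""
    for proxy in itertools.cycle(proxies):
        yield proxy_argument(proxy)


def make_driver(argument):
    """Start Chrome with one proxy argument (needs selenium and chromedriver)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument(argument)
    return webdriver.Chrome(options=options)


# Placeholder addresses; replace them with proxies you are allowed to use.
PROXIES = ['123.56.74.13:8080', '10.0.0.2:3128']
arguments = proxy_cycle(PROXIES)
first = next(arguments)
second = next(arguments)
third = next(arguments)
print(first, second, third)
```

Each new session would then be started with make_driver(next(arguments)), so consecutive sessions leave through different proxies.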

Common startup options

Startup parameter             Effect
--user-agent=""               Set the User-Agent request header
--window-size=width,height    Set the browser window size
--headless                    Run without a user interface
--start-maximized             Start with the window maximized
--incognito                   Incognito (private) mode
--disable-javascript          Disable JavaScript
--disable-infobars            Hide the "Chrome is being controlled by automated test software" notice

More parameters: https://peter.sh/experiments/chromium-command-line-switches/
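
When several of these switches are combined, it can help to build the argument list in one place. This is a rough sketch under stated assumptions: build_chrome_arguments and make_chrome are hypothetical helpers, not part of selenium, and make_chrome assumes selenium is installed and chromedriver is on the PATH.

```python
def build_chrome_arguments(headless=False, window_size=None,
                           user_agent=None, incognito=False):
    """Return a list of Chrome command-line switches for the given settings."""
    arguments = []
    if headless:
        arguments.append('--headless')
    if window_size is not None:
        width, height = window_size
        arguments.append('--window-size={},{}'.format(width, height))
    if user_agent:
        arguments.append('--user-agent=' + user_agent)
    if incognito:
        arguments.append('--incognito')
    return arguments


def make_chrome(arguments):
    """Create a Chrome driver from the switches (needs selenium, chromedriver)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in arguments:
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


arguments = build_chrome_arguments(headless=True, window_size=(1366, 768),
                                   incognito=True)
print(arguments)
```

Keeping the switch list as plain data makes it easy to print or log exactly what the browser was started with.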

Working with cookies

  1. Get all cookies:
    for cookie in driver.get_cookies():
        print(cookie)

  2. Get a cookie by its name (key):
    cookie = driver.get_cookie('BD_HOME')
    print(cookie)

  3. Delete all cookies:
    driver.delete_all_cookies()

  4. Delete a single cookie:
    driver.delete_cookie('BD_HOME')


Sample code is as follows:

# Import the required libraries
import time

from selenium import webdriver

# Path to chromedriver
path = r'E:\Python_Code\s1\chromedriver_win32\chromedriver.exe'
# Instantiate Chrome
# For other browsers, instantiate the corresponding class, e.g. webdriver.Firefox() for Firefox
driver = webdriver.Chrome(path)
# Request the page
driver.get('https://www.baidu.com/')
time.sleep(2)

# Get all cookies
for cookie in driver.get_cookies():
    print(cookie)

# Get a cookie by its name (key)
# cookie = driver.get_cookie('BD_HOME')
# print(cookie)

# Delete all cookies
# driver.delete_all_cookies()

# Delete a single cookie
# driver.delete_cookie('BD_HOME')
driver.close()
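
driver.get_cookies() returns a list of dictionaries, one per cookie. A common follow-up is to turn that list into a name-to-value lookup, or to serialize it so it can be reused later. The sketch below works on sample data shaped like that return value; the cookie values are made up and the helper names cookies_to_dict and dump_cookies are hypothetical.

```python
import json


def cookies_to_dict(cookies):
    """Map cookie name -> value from a list shaped like driver.get_cookies()."""
    return {cookie['name']: cookie['value'] for cookie in cookies}


def dump_cookies(cookies):
    """Serialize the cookie list to a JSON string, e.g. for saving to a file."""
    return json.dumps(cookies, ensure_ascii=False)


# Sample data in the shape returned by driver.get_cookies(); values are made up.
sample = [
    {'name': 'BD_HOME', 'value': '1', 'domain': 'www.baidu.com', 'path': '/'},
    {'name': 'foo', 'value': 'bar', 'domain': '.baidu.com', 'path': '/'},
]
lookup = cookies_to_dict(sample)
restored = json.loads(dump_cookies(sample))
print(lookup)
```

With a live session, cookies_to_dict(driver.get_cookies()) gives the same kind of lookup for the current page.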

Setting cookies in selenium

Use the add_cookie(cookie_dict) method to add a cookie to the current session. cookie_dict is a dictionary that must contain the keys name and value; the optional keys are path, domain, secure and expiry. For example:

driver.add_cookie({'name': 'foo', 'value': 'bar'})
driver.add_cookie({'name': 'foo', 'value': 'bar', 'path': '/'})
driver.add_cookie({'name': 'foo', 'value': 'bar', 'path': '/', 'secure': True})

A complete example follows:

from selenium import webdriver
browser = webdriver.Chrome()

url = "https://www.baidu.com/"
browser.get(url)
# JavaScript that opens a new window
newwindow='window.open("https://www.baidu.com");'
# Delete the existing cookies
browser.delete_all_cookies()
# Add a cookie so the next page is opened with it
browser.add_cookie({'name':'ABC','value':'DEF'})
# Open a new window through JavaScript
browser.execute_script(newwindow)
input("Press Enter after checking the result")
browser.quit()
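
Because add_cookie expects specific keys, it can be convenient to build the dictionary through a small helper that rejects misspelled keys before they reach the browser. This is only a sketch: make_cookie is a hypothetical helper, and it deliberately accepts just the optional keys listed above (path, domain, secure, expiry).

```python
OPTIONAL_KEYS = {'path', 'domain', 'secure', 'expiry'}


def make_cookie(name, value, **optional):
    """Build a cookie dict for add_cookie(), rejecting unknown keys."""
    unknown = set(optional) - OPTIONAL_KEYS
    if unknown:
        raise ValueError('unsupported cookie keys: ' + ', '.join(sorted(unknown)))
    cookie = {'name': name, 'value': value}
    cookie.update(optional)
    return cookie


cookie = make_cookie('foo', 'bar', path='/', secure=True)
print(cookie)
# With a live session: driver.add_cookie(cookie)
```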

Action chains

Sometimes an operation on a page involves many steps. In that case you can use the mouse action chain class ActionChains to carry them out. For example, to move the mouse to an element, type into it, and then click a button, the sample code is as follows:

import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

# Path to chromedriver
path = r'E:\Python_Code\s1\chromedriver_win32\chromedriver.exe'
# Instantiate Chrome
# For other browsers, instantiate the corresponding class, e.g. webdriver.Firefox() for Firefox
driver = webdriver.Chrome(path)
# Request the page
driver.get('https://www.baidu.com/')
time.sleep(2)
# Find the elements by id
input_kw = driver.find_element_by_id('kw')
submit_btn = driver.find_element_by_id('su')

# Create the action chain
action = ActionChains(driver)
action.move_to_element(input_kw)
action.send_keys_to_element(input_kw, 'python')
action.move_to_element(submit_btn)
action.click(submit_btn)
# Perform the queued actions
action.perform()

time.sleep(5)

driver.close()

Common action chain methods (ActionChains class methods)

  • click(on_element=None): Left-click the given element; if none is passed, click at the current mouse position.
  • context_click(on_element=None): Right-click.
  • double_click(on_element=None): Double-click.
  • click_and_hold(on_element=None): Press the mouse button down without releasing it.
  • drag_and_drop(source, target): Press and hold on the source element, move to the target element, then release.
  • drag_and_drop_by_offset(source, xoffset, yoffset): Press and hold on the source element, move by (xoffset, yoffset) relative to the source element's position, then release.
  • send_keys(*keys_to_send): Send keys to the element that currently has focus.
  • send_keys_to_element(element, *keys_to_send): Send keys to the specified element.
  • reset_actions(): Clear the actions that have already been stored.
  • For more, see: http://selenium-python.readthedocs.io/api.html
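
These methods only queue actions; nothing happens in the browser until perform() is called. The toy class below models that queue-then-perform pattern in plain Python so it can be run without a browser. It is an illustration only, not Selenium's implementation; RecordingChain and its log attribute are hypothetical names, and the strings 'kw' and 'su' stand in for real elements.

```python
class RecordingChain:
    """Toy stand-in for ActionChains: queue steps, run them on perform()."""

    def __init__(self):
        self._queued = []   # steps waiting for perform()
        self.log = []       # steps that have actually been "executed"

    def move_to_element(self, element):
        self._queued.append(('move_to', element))
        return self         # returning self allows method chaining

    def send_keys_to_element(self, element, text):
        self._queued.append(('send_keys', element, text))
        return self

    def click(self, element=None):
        self._queued.append(('click', element))
        return self

    def reset_actions(self):
        self._queued.clear()
        return self

    def perform(self):
        # Execute the queued steps in the order they were added.
        self.log.extend(self._queued)


chain = RecordingChain()
chain.move_to_element('kw').send_keys_to_element('kw', 'python').click('su')
before_perform = list(chain.log)
chain.perform()
print(chain.log)
```

The real ActionChains methods also return the chain itself, so the same steps can be written as one chained expression ending in .perform().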

Waiting for pages

More and more web pages now use Ajax, so a program cannot be sure when a given element has finished loading. If the page takes too long to load and a DOM element has not yet appeared when your code uses it, an exception is thrown (in Python, typically NoSuchElementException). To solve this problem, Selenium provides two kinds of waits: implicit waits and explicit waits.

1. Implicit wait

An implicit wait means that when webdriver runs a find_element_* lookup and the element is not found, it keeps polling for a configured amount of time before giving up.

Call driver.implicitly_wait(10). From then on, if an element is not yet available, the driver waits up to 10 seconds for it. Sample code:

driver = webdriver.Chrome(path)
# Set the implicit wait
driver.implicitly_wait(10)
# Request the page
driver.get('https://www.baidu.com/')

2. Explicit wait

An explicit wait fetches the element only once a specified condition is satisfied. You can also specify a maximum waiting time; if it is exceeded, an exception is thrown. Explicit waits use the expected conditions in selenium.webdriver.support.expected_conditions together with selenium.webdriver.support.ui.WebDriverWait.

Sample code is as follows:

# Import the required libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Path to chromedriver
path = r'E:\Python_Code\s1\chromedriver_win32\chromedriver.exe'
# Instantiate Chrome
# For other browsers, instantiate the corresponding class, e.g. webdriver.Firefox() for Firefox
driver = webdriver.Chrome(path)

# Request the page
driver.get('https://www.baidu.com/')

# Set up the explicit wait
try:
    element = WebDriverWait(driver, 10).until(
        # Takes a single argument, so the locator must be passed as a tuple
        EC.presence_of_element_located((By.ID,'kw'))
    )
    print(element)
finally:
    driver.close()

In the example above, we no longer look up the element with find_element_by_*; we use WebDriverWait instead.

The try block means: wait up to 10 seconds for the element to appear before throwing an exception. During those 10 seconds, WebDriverWait by default runs the condition passed to until every 500 ms. That condition, EC.presence_of_element_located, checks whether the element has been loaded, locating it by By.ID.

In other words, within 10 seconds the element's existence is checked every 0.5 seconds by default, and once it exists it is assigned to the variable element. If the element still does not exist after 10 seconds, a timeout exception (TimeoutException) is thrown.
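
The polling behaviour described here can be sketched in plain Python. This is a simplified model, not Selenium's code: wait_until is a hypothetical function, it raises the built-in TimeoutError rather than selenium's TimeoutException, and the "element" is simulated by a counter so the example runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(poll_interval)


# Simulate an element that "appears" on the third check.
attempts = []


def element_present():
    attempts.append(1)
    return 'element' if len(attempts) >= 3 else False


found = wait_until(element_present, timeout=2.0, poll_interval=0.01)
print(found, len(attempts))
```

The defaults mirror the 10 second limit and 0.5 second polling interval from the text; the example call uses much smaller values only so that it finishes quickly.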

Other expected_conditions methods

  1. title_is: Checks whether the page title equals the expected string; returns a Boolean value.
    • WebDriverWait(driver,10).until(EC.title_is(u"百度一下,你就知道"))
  2. title_contains: Checks whether the page title contains the expected string; returns a Boolean value.
    • WebDriverWait(driver,10).until(EC.title_contains(u"百度一下"))
  3. presence_of_element_located: Checks whether the element has been added to the DOM tree. This does not mean the element is visible. Returns the WebElement if it is located.
    • WebDriverWait(driver,10).until(EC.presence_of_element_located((By.ID,'some')))
  4. visibility_of_element_located: Checks whether the element is in the DOM and visible. Generally used when the element may be obscured by other elements.
    • WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.ID,'some')))
  5. visibility_of: Checks whether an already located element is visible; if so, returns that element.
    • WebDriverWait(driver,10).until(EC.visibility_of(driver.find_element(by=By.ID,value='some')))
  6. presence_of_all_elements_located: Checks whether at least one matching element is present in the DOM tree; if so, returns the list of located elements.
    • WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'some')))
  7. visibility_of_any_elements_located: Checks whether at least one matching element is visible on the page; if so, returns the list of located elements.
    • WebDriverWait(driver,10).until(EC.visibility_of_any_elements_located((By.CSS_SELECTOR,'some')))
  8. text_to_be_present_in_element: Checks whether the text of the specified element contains the expected string; returns a Boolean value.
    • WebDriverWait(driver,10).until(EC.text_to_be_present_in_element((By.XPATH,"some"),u'设置'))
  9. text_to_be_present_in_element_value: Checks whether the value attribute of the specified element contains the expected string; returns a Boolean value.
    • WebDriverWait(driver,10).until(EC.text_to_be_present_in_element_value((By.CSS_SELECTOR,'some'),u'百度一下'))
  10. invisibility_of_element_located: Checks whether the element is absent from the DOM or invisible. Returns False if it is visible, and returns the element if it is not visible.
    • WebDriverWait(driver,10).until(EC.invisibility_of_element_located((By.CSS_SELECTOR,'some')))
  11. element_to_be_clickable: Checks whether the element is visible and enabled (clickable).
    • WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"some"))).click()
  12. element_to_be_selected: Checks whether an element is selected; generally used with drop-down lists.
    • WebDriverWait(driver,10).until(EC.element_to_be_selected(driver.find_element(By.XPATH,"some")))
  13. For more, see: http://selenium-python.readthedocs.io/waits.html
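
When none of the built-in conditions fits, until() also accepts any callable that takes the driver and returns a truthy value once the condition holds. The sketch below defines one such condition; element_count_at_least is a hypothetical name, and FakeDriver is a stand-in used only so the example runs without a browser. The locator uses the string 'css selector', which is the value of By.CSS_SELECTOR.

```python
class element_count_at_least:
    """Condition: at least `count` elements match `locator`.

    Returns the list of matching elements, or False if there are too few.
    """

    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        return elements if len(elements) >= self.count else False


class FakeDriver:
    """Stand-in for a real driver so the example runs without a browser."""

    def __init__(self, elements):
        self._elements = elements

    def find_elements(self, by, value):
        return list(self._elements)


condition = element_count_at_least(('css selector', 'li'), 2)
too_few = condition(FakeDriver(['li-1']))
enough = condition(FakeDriver(['li-1', 'li-2', 'li-3']))
print(too_few, enough)
# With a real session: WebDriverWait(driver, 10).until(condition)
```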

Switching pages

Sometimes a window has many tabs, and you need to switch between them. selenium provides switch_to.window for this (the older switch_to_window form is deprecated); the handle of the page to switch to comes from driver.window_handles. Sample code is as follows:

# Import the required libraries
from selenium import webdriver

# Path to chromedriver
path = r'E:\Python_Code\s1\chromedriver_win32\chromedriver.exe'
# Instantiate Chrome
# For other browsers, instantiate the corresponding class, e.g. webdriver.Firefox() for Firefox
driver = webdriver.Chrome(path)
# Request the page
driver.get('https://www.baidu.com/')

driver.execute_script('window.open("http://www.douban.com/")')
print(driver.window_handles)
driver.switch_to.window(driver.window_handles[1])
print(driver.current_url)

# Although the browser window shows the new page, the driver has not switched to it yet.
# To work with the new page in code, for example to crawl it,
# use driver.switch_to.window() to switch to the desired window.
# Pick the window by its index in driver.window_handles.
# driver.window_handles is a list that holds the window handles.
# It stores the handles in the order in which the pages were opened.
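
Instead of relying on a fixed index, the newly opened window can be found by comparing the handle list before and after the window.open call. This is a small sketch: new_window_handles is a hypothetical helper, and the handle strings are made up so the example runs without a browser.

```python
def new_window_handles(before, after):
    """Return handles present in `after` but not in `before`, keeping order."""
    known = set(before)
    return [handle for handle in after if handle not in known]


# Handle strings are made up; real ones come from driver.window_handles.
before = ['CDwindow-A']
after = ['CDwindow-A', 'CDwindow-B']
opened = new_window_handles(before, after)
print(opened)
# With a live session:
#   before = driver.window_handles
#   driver.execute_script('window.open("http://www.douban.com/")')
#   driver.switch_to.window(new_window_handles(before, driver.window_handles)[0])
```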


Origin blog.csdn.net/Zhihua_W/article/details/102890725