Python Crawler 4.4 - selenium Advanced Usage Tutorial
Overview
This series is a simple tutorial on Python crawler techniques. I write it to explain and consolidate my own knowledge; if it happens to be useful to you as well, so much the better.
The Python version used is 3.7.4.
The previous article covered the basics of using selenium; this one covers some of its more advanced usage.
Headless Chrome
The earlier example code pops up a browser window when it runs, which is sometimes inconvenient; often we want to crawl data without any window appearing.
Headless Chrome is Chrome running without a graphical interface. It supports all of Chrome's features without opening a browser window, so scripts can run from the command line. Crawlers used to rely on PhantomJS for this, but PhantomJS development has been suspended, and Headless Chrome can now be used instead.
Sample code is as follows:
# Import the required libraries
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Customize the browser options
chrome_options = Options()
# Run in headless mode
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# Path to the chromedriver executable
path = r'E:\Python_Code\s1\chromedriver_win32\chromedriver.exe'
browser = webdriver.Chrome(executable_path=path, options=chrome_options)
# Open Baidu
url = 'http://www.baidu.com/'
browser.get(url)
time.sleep(3)
# Save a screenshot of the page
browser.save_screenshot('baidu.png')
browser.quit()
Setting request headers
from selenium import webdriver
# Create the browser options
options = webdriver.ChromeOptions()
# Set the language to Chinese
options.add_argument('--lang=zh_CN.UTF-8')
# Replace the User-Agent header
options.add_argument('--user-agent=Mozilla/5.0 (iPod; U; CPU iPhone OS 2_1 like Mac OS X; ja-jp) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5F137 Safari/525.20')
browser = webdriver.Chrome(options=options)
# httpbin echoes the request headers it received
url = "https://httpbin.org/get?show_env=1"
browser.get(url)
browser.quit()
Setting a proxy IP
If you crawl some pages too frequently, the server will notice that you are a crawler and ban your IP address. Switching to a proxy IP solves this problem. Each browser sets a proxy differently; here Chrome is used as the example:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
# The switch and its value are joined with "="
options.add_argument('--proxy-server=http://123.56.74.13:8080')
# Path to the chromedriver executable
path = r'E:\Python_Code\s1\chromedriver_win32\chromedriver.exe'
# Instantiate Chrome
driver = webdriver.Chrome(executable_path=path, options=options)
# httpbin reports the IP address the request came from
driver.get('https://httpbin.org/ip')
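To confirm that the proxy is actually in effect, the JSON served by https://httpbin.org/ip can be parsed and compared with the proxy address. The following is only a minimal sketch: extract_origin and uses_proxy are hypothetical helper names, and the sample string stands in for the page body that would be read from the browser.

```python
import json


def extract_origin(body_text):
    """Return the 'origin' field from the JSON body served by httpbin.org/ip."""
    data = json.loads(body_text)
    return data["origin"]


def uses_proxy(body_text, proxy_host):
    """True if one of the addresses reported by httpbin equals the proxy host."""
    # The origin field may hold several comma-separated addresses; check each one
    origins = [part.strip() for part in extract_origin(body_text).split(",")]
    return proxy_host in origins


# Sample body in the shape httpbin.org/ip returns (address taken from the proxy example)
sample = '{"origin": "123.56.74.13"}'
print(extract_origin(sample))
print(uses_proxy(sample, "123.56.74.13"))
```

If uses_proxy returns False for the real page body, the request most likely did not go through the proxy.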
Common startup options
Startup parameter | Effect |
---|---|
--user-agent="" | Sets the User-Agent request header |
--window-size=width,height | Sets the browser window size |
--headless | Runs without a graphical interface |
--start-maximized | Starts the browser maximized |
--incognito | Incognito mode |
--disable-javascript | Disables JavaScript |
--disable-infobars | Hides the notice that the browser is being controlled by automated software |
More parameters: https://peter.sh/experiments/chromium-command-line-switches/
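The switches in the table can be assembled programmatically before they are handed to options.add_argument one by one. A minimal sketch, assuming a hypothetical helper named build_chrome_args (it is not part of selenium); it only builds the strings and does not start a browser:

```python
def build_chrome_args(headless=False, window_size=None, user_agent=None,
                      proxy=None, incognito=False):
    """Build a list of Chrome command-line switches from keyword settings."""
    args = []
    if headless:
        args.append("--headless")
    if window_size is not None:
        width, height = window_size
        args.append("--window-size={},{}".format(width, height))
    if user_agent:
        args.append("--user-agent={}".format(user_agent))
    if proxy:
        args.append("--proxy-server={}".format(proxy))
    if incognito:
        args.append("--incognito")
    return args


args = build_chrome_args(headless=True, window_size=(1366, 768),
                         proxy="http://123.56.74.13:8080")
print(args)
# Each string would then be passed to options.add_argument(...)
```

Keeping the switches in one function makes it easier to turn settings on and off per run without editing several add_argument lines.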
Working with cookies
- Get all cookies:
for cookie in driver.get_cookies():
    print(cookie)
- Get a cookie by its key:
cookie = driver.get_cookie('BD_HOME')
print(cookie)
- Delete all cookies:
driver.delete_all_cookies()
- Delete a single cookie:
driver.delete_cookie('BD_HOME')
Sample code is as follows:
# Import the required libraries
import time
from selenium import webdriver
# Path to the chromedriver executable
path = r'E:\Python_Code\s1\chromedriver_win32\chromedriver.exe'
# Instantiate Chrome
# For another browser, instantiate the corresponding class, e.g. webdriver.Firefox() for Firefox
driver = webdriver.Chrome(path)
# Open the Baidu home page
driver.get('https://www.baidu.com/')
time.sleep(2)
# Get all cookies
for cookie in driver.get_cookies():
    print(cookie)
# Get a cookie by its key
# cookie = driver.get_cookie('BD_HOME')
# print(cookie)
# Delete all cookies
# driver.delete_all_cookies()
# Delete a single cookie
# driver.delete_cookie('BD_HOME')
# Close the browser and end the session
driver.quit()
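get_cookies() returns a list of dictionaries, each with at least a name and a value key. When only the name/value pairs matter, it can help to flatten that list into a single dictionary. A sketch, assuming a hypothetical helper named cookies_as_dict; the sample list imitates the shape of the return value and its contents are made up:

```python
def cookies_as_dict(cookie_list):
    """Map cookie name -> value from a list shaped like driver.get_cookies()."""
    return {cookie["name"]: cookie["value"] for cookie in cookie_list}


# Made-up data in the shape returned by driver.get_cookies()
sample_cookies = [
    {"name": "BD_HOME", "value": "1", "domain": "www.baidu.com", "path": "/"},
    {"name": "foo", "value": "bar", "domain": ".baidu.com", "path": "/"},
]
print(cookies_as_dict(sample_cookies))
```

With a live session the call would be cookies_as_dict(driver.get_cookies()).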
Setting cookies in selenium
The add_cookie(cookie_dict) method adds a cookie to the current session. cookie_dict is a dictionary that must contain the keys name and value; the optional keys are path, domain, secure, and expiry. For example:
driver.add_cookie({'name': 'foo', 'value': 'bar'})
driver.add_cookie({'name': 'foo', 'value': 'bar', 'path': '/'})
driver.add_cookie({'name': 'foo', 'value': 'bar', 'path': '/', 'secure': True})
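The key rules for cookie_dict can be checked before the dictionary reaches add_cookie. The following is a sketch only: make_cookie is a hypothetical helper, and it accepts just the optional keys named in this section.

```python
OPTIONAL_KEYS = {"path", "domain", "secure", "expiry"}


def make_cookie(name, value, **optional):
    """Build a cookie dictionary, rejecting keys outside the set listed above."""
    unknown = set(optional) - OPTIONAL_KEYS
    if unknown:
        raise ValueError("unsupported cookie keys: {}".format(sorted(unknown)))
    cookie = {"name": name, "value": value}
    cookie.update(optional)
    return cookie


print(make_cookie("foo", "bar"))
print(make_cookie("foo", "bar", path="/", secure=True))
```

The result would then be passed on as driver.add_cookie(make_cookie('foo', 'bar', path='/')).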
A complete example is as follows:
from selenium import webdriver
browser = webdriver.Chrome()
url = "https://www.baidu.com/"
browser.get(url)
# JavaScript that opens a new window
newwindow = 'window.open("https://www.baidu.com");'
# Delete the original cookies
browser.delete_all_cookies()
# Add a cookie so that the new window is opened with it
browser.add_cookie({'name': 'ABC', 'value': 'DEF'})
# Open the new window via JavaScript
browser.execute_script(newwindow)
input("Check the result, then press Enter to exit")
browser.quit()
Action chains
Sometimes a page requires many operations in sequence; in that case you can use the mouse action chain class ActionChains. For example, to move the mouse to an input box, type into it, and then move to a button and click it, the sample code is as follows:
import time
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
# Path to the chromedriver executable
path = r'E:\Python_Code\s1\chromedriver_win32\chromedriver.exe'
# Instantiate Chrome
# For another browser, instantiate the corresponding class, e.g. webdriver.Firefox() for Firefox
driver = webdriver.Chrome(path)
# Open the Baidu home page
driver.get('https://www.baidu.com/')
time.sleep(2)
# Locate the elements by id
input_kw = driver.find_element_by_id('kw')
submit_btn = driver.find_element_by_id('su')
# Instantiate the action chain and queue the steps
action = ActionChains(driver)
action.move_to_element(input_kw)
action.send_keys_to_element(input_kw, 'python')
action.move_to_element(submit_btn)
action.click(submit_btn)
# Execute the queued steps
action.perform()
time.sleep(5)
# Close the browser and end the session
driver.quit()
Common action chain methods (ActionChains class methods)
- click(on_element=None): Left-clicks the given element; if none is passed, clicks at the current mouse position.
- context_click(on_element=None): Right-clicks.
- double_click(on_element=None): Double-clicks.
- click_and_hold(on_element=None): Presses the left mouse button without releasing it.
- drag_and_drop(source, target): Presses the mouse button on the source element, moves to the target element, and releases.
- drag_and_drop_by_offset(source, xoffset, yoffset): Presses the mouse button on the source element, moves by xoffset and yoffset relative to the source element, and releases.
- send_keys(*keys_to_send): Sends keys to the element that currently has focus.
- send_keys_to_element(element, *keys_to_send): Sends keys to the specified element.
- reset_actions(): Clears the actions already stored.
- For more, see: https://selenium-python.readthedocs.io/api.html
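ActionChains does not act immediately: each method call only queues a step, and the queued steps run in order when perform() is called. The toy class below imitates that queue-then-perform pattern in plain Python. MiniChain and its method names are hypothetical and exist only for illustration; this is not how selenium implements the class internally.

```python
class MiniChain:
    """Toy model of a deferred action queue (not selenium's implementation)."""

    def __init__(self):
        self._actions = []   # queued steps, in call order
        self.log = []        # steps that have actually been executed

    def move_to(self, target):
        self._actions.append(lambda: self.log.append("move to " + target))
        return self          # returning self allows method chaining

    def click(self, target):
        self._actions.append(lambda: self.log.append("click " + target))
        return self

    def reset_actions(self):
        self._actions = []
        return self

    def perform(self):
        for action in self._actions:
            action()


chain = MiniChain().move_to("kw").click("su")
print(chain.log)   # still empty: nothing runs before perform()
chain.perform()
print(chain.log)
```

The same idea explains why forgetting the final perform() call in a real script leaves the browser untouched.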
Page waits
More and more web pages use Ajax, so a program cannot be sure when a given element has finished loading. If the page takes too long to load and a DOM element has not appeared yet when your code tries to use it, an exception is thrown (NoSuchElementException in the Python bindings). To solve this problem, selenium provides two kinds of waits: implicit waits and explicit waits.
1. Implicit wait
An implicit wait means that when webdriver performs a find_element_* lookup and the element is not found, it keeps polling for a set period before giving up.
Call driver.implicitly_wait(10) to enable it: if the element is not yet available, the lookup waits up to 10 seconds. Sample code:
driver = webdriver.Chrome(path)
# Set the implicit wait
driver.implicitly_wait(10)
# Request the page
driver.get('https://www.baidu.com/')
2. Explicit wait
An explicit wait performs the operation only after a given condition on the element is satisfied. You can also specify a maximum waiting time; if it is exceeded, an exception is thrown. Explicit waits are implemented by combining the expected conditions in selenium.webdriver.support.expected_conditions with selenium.webdriver.support.ui.WebDriverWait.
Sample code is as follows:
# Import the required libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# Path to the chromedriver executable
path = r'E:\Python_Code\s1\chromedriver_win32\chromedriver.exe'
# Instantiate Chrome
# For another browser, instantiate the corresponding class, e.g. webdriver.Firefox() for Firefox
driver = webdriver.Chrome(path)
# Request the page
driver.get('https://www.baidu.com/')
# Set the explicit wait
try:
    element = WebDriverWait(driver, 10).until(
        # The condition takes a single argument, so the locator is wrapped in a tuple
        EC.presence_of_element_located((By.ID, 'kw'))
    )
    print(element)
finally:
    # Close the browser and end the session
    driver.quit()
In the example above, the element is no longer looked up with find_element_by_*; WebDriverWait is used instead.
The try block means: wait up to 10 seconds for the element to appear before an exception is thrown. During those 10 seconds, WebDriverWait calls the condition passed to until every 500 ms by default. Here the condition is EC.presence_of_element_located, which checks whether the element has been loaded, locating it by By.ID.
In other words, within 10 seconds the element's existence is checked every 0.5 seconds by default; once it exists, it is assigned to the variable element. If the element still does not exist after 10 seconds, a timeout exception (TimeoutException) is thrown.
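The polling behaviour described above can be imitated in a few lines of plain Python. This is a simplified sketch and not selenium's own code: wait_until is a hypothetical name, it raises the built-in TimeoutError rather than selenium's TimeoutException, and the demo condition is artificial.

```python
import time


def wait_until(condition, timeout=10, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} seconds".format(timeout))
        time.sleep(poll_interval)


# Demo: a condition that becomes truthy on the third call
calls = {"count": 0}


def ready_on_third_call():
    calls["count"] += 1
    return "element" if calls["count"] >= 3 else False


print(wait_until(ready_on_third_call, timeout=2, poll_interval=0.01))
```

As with until, the truthy value returned by the condition is handed back to the caller, which is why the located element can be assigned directly to a variable.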
Other conditions in expected_conditions
- title_is: Checks whether the page title equals the expected string; returns a Boolean.
WebDriverWait(driver,10).until(EC.title_is("百度一下,你就知道"))
- title_contains: Checks whether the page title contains the expected string; returns a Boolean.
WebDriverWait(driver,10).until(EC.title_contains("百度一下"))
- presence_of_element_located: Checks whether the element has been loaded into the DOM tree; this does not mean it is visible. Returns the WebElement once it is located.
WebDriverWait(driver,10).until(EC.presence_of_element_located((By.ID,'some')))
- visibility_of_element_located: Checks whether the element has been loaded into the DOM and is visible; typically used when the target may be hidden.
WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.ID,'some')))
- visibility_of: Checks whether an already located element is visible; if so, returns that element.
WebDriverWait(driver,10).until(EC.visibility_of(driver.find_element(by=By.ID,value='some')))
- presence_of_all_elements_located: Checks whether at least one matching element is present in the DOM tree; returns the list of located elements.
WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'some')))
- visibility_of_any_elements_located: Checks whether at least one matching element is visible on the page; returns the list of located elements.
WebDriverWait(driver,10).until(EC.visibility_of_any_elements_located((By.CSS_SELECTOR,'some')))
- text_to_be_present_in_element: Checks whether the text of the specified element contains the expected string; returns a Boolean.
WebDriverWait(driver,10).until(EC.text_to_be_present_in_element((By.XPATH,"some"),'设置'))
- text_to_be_present_in_element_value: Checks whether the value attribute of the specified element contains the expected string; returns a Boolean.
WebDriverWait(driver,10).until(EC.text_to_be_present_in_element_value((By.CSS_SELECTOR,'some'),'百度一下'))
- invisibility_of_element_located: Checks whether the element is absent from the DOM or invisible. Returns False if it is visible; if it is present but not visible, returns the element.
WebDriverWait(driver,10).until(EC.invisibility_of_element_located((By.CSS_SELECTOR,'some')))
- element_to_be_clickable: Checks whether the element is visible and enabled (clickable).
WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"some"))).click()
- element_to_be_selected: Checks whether an element is selected; generally used with drop-down lists.
WebDriverWait(driver,10).until(EC.element_to_be_selected(driver.find_element(By.XPATH,"some")))
- For more, see: https://selenium-python.readthedocs.io/waits.html
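until accepts any callable that takes the driver and returns a truthy value when the wait should end, so a custom condition can be written when none of the built-in ones fits. A sketch, assuming a hypothetical condition named document_ready that asks the page for document.readyState; FakeDriver is a stand-in used only so the snippet runs without a browser:

```python
def document_ready(driver):
    """Custom wait condition: True once the page reports it has finished loading."""
    return driver.execute_script("return document.readyState") == "complete"


class FakeDriver:
    """Stand-in for a webdriver; reports 'loading' once, then 'complete'."""

    def __init__(self):
        self._states = ["loading", "complete"]

    def execute_script(self, script):
        # Hand out the queued states one by one, then keep repeating the last one
        return self._states.pop(0) if len(self._states) > 1 else self._states[0]


fake = FakeDriver()
print(document_ready(fake))
print(document_ready(fake))
# With a real driver: WebDriverWait(driver, 10).until(document_ready)
```

Because the condition is an ordinary function, it can be tested with a stand-in object like this before it is used against a live page.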
Switching pages
Sometimes a browser has several tabs or windows open and you need to switch between them. selenium provides driver.switch_to.window for this; the handle of the page to switch to is taken from driver.window_handles. Sample code is as follows:
# Import the required libraries
from selenium import webdriver
# Path to the chromedriver executable
path = r'E:\Python_Code\s1\chromedriver_win32\chromedriver.exe'
# Instantiate Chrome
# For another browser, instantiate the corresponding class, e.g. webdriver.Firefox() for Firefox
driver = webdriver.Chrome(path)
# Open the Baidu home page
driver.get('https://www.baidu.com/')
# Open a second page in a new window via JavaScript
driver.execute_script('window.open("http://www.douban.com/")')
print(driver.window_handles)
# The browser now shows the new page, but the driver has not switched to it yet.
# To work with the new page in code (for example, to crawl it),
# switch to it with driver.switch_to.window().
# driver.window_handles is a list of window handles,
# stored in the order in which the pages were opened;
# pick the window you want from it by index.
driver.switch_to.window(driver.window_handles[1])
print(driver.current_url)
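Indexing window_handles relies on the handles being listed in the order the pages were opened. An alternative that does not depend on the order is to record the handles before opening the new window and then pick the one that was not there before. A sketch, with find_new_handle as a hypothetical helper and made-up handle strings:

```python
def find_new_handle(handles_before, handles_after):
    """Return the single handle present after opening a window but not before."""
    new_handles = [h for h in handles_after if h not in handles_before]
    if len(new_handles) != 1:
        raise ValueError(
            "expected exactly one new window, found {}".format(len(new_handles))
        )
    return new_handles[0]


# Made-up handle strings standing in for driver.window_handles
before = ["CDwindow-AAA"]
after = ["CDwindow-AAA", "CDwindow-BBB"]
print(find_new_handle(before, after))
```

With a live session, before would be a copy of driver.window_handles taken ahead of the window.open call, and the result would be passed to driver.switch_to.window.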
Other posts in this series
- Python Crawler 1.1 - urllib Basic Usage Tutorial
- Python Crawler 1.2 - urllib Advanced Usage Tutorial
- Python Crawler 1.3 - requests Basic Usage Tutorial
- Python Crawler 1.4 - requests Advanced Usage Tutorial
- Python Crawler 2.1 - BeautifulSoup Usage Tutorial
- Python Crawler 2.2 - xpath Usage Tutorial
- Python Crawler 3.1 - json Usage Tutorial
- Python Crawler 3.2 - csv Usage Tutorial
- Python Crawler 3.3 - txt Usage Tutorial
- Python Crawler 4.1 - threading (Multi-threading) Usage Tutorial
- Python Crawler 4.2 - ajax (Dynamic Web Crawler) Usage Tutorial
- Python Crawler 4.3 - selenium Basic Usage Tutorial