Selenium automation tool (2)

Foreword

This section continues from the previous one and is relatively short. For crawlers, obtaining node information is the part you must know well; the rest depends on the situation. In most project code it is best to include exception handling and detection bypassing, which is the more robust way of writing it.

1. Obtain node information

Get attribute

We can use the get_attribute() method to get the attributes of a node, but the premise is to select the node first, as follows:

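from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
url = 'https://pic.netbian.com/4kmeinv/index.html'
browser.get(url)
# Select every <img> node in the picture list
src = browser.find_elements(By.XPATH, '//ul[@class="clearfix"]/li/a/img')
for i in src:
    # Read the src attribute of each selected node
    img_url = i.get_attribute('src')
    print(img_url)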

Call get_attribute() with the name of the attribute you want, and it returns that attribute's value.
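
When the loop above is reused across several pages, it can help to wrap it in a small function. The following is a minimal sketch of a hypothetical helper, collect_attribute (not part of Selenium). It assumes each element exposes get_attribute() the way Selenium's WebElement does, and it skips elements for which get_attribute() returns None because the attribute is missing.

```python
def collect_attribute(elements, name):
    """Return the values of attribute `name` for all elements that have it.

    `elements` is assumed to be an iterable of objects with a
    get_attribute(name) method, such as the list returned by find_elements().
    """
    values = []
    for element in elements:
        value = element.get_attribute(name)
        # get_attribute() gives None when the attribute is absent; skip those.
        if value is not None:
            values.append(value)
    return values
```

Passing the result of find_elements() from the example above together with the name 'src' would return the image URLs as a list instead of printing them.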

2. Delayed waiting

In Selenium, the get() method returns once the page frame has loaded. If you read page_source at that point, it may not be the page as the browser shows it when fully loaded: if the page issues additional Ajax requests, their results may not yet appear in the page source. It is therefore necessary to wait for a while to make sure the node has been loaded.

Usage

Specify the node to look for, and then specify a maximum wait time. If the node is loaded within the specified time, the searched node will be returned; if the node is still not loaded within the specified time, a timeout exception will be thrown. Examples are as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://www.baidu.com/')
# Wait at most 10 seconds for each condition
wait = WebDriverWait(browser, 10)
# Search box: wait until the node is present in the DOM
input_box = wait.until(EC.presence_of_element_located((By.ID, 'kw')))
# Search button: wait until the node is clickable
button = wait.until(EC.element_to_be_clickable((By.ID, 'su')))
print(input_box, button)

The effect is that if the node with the ID kw (that is, the search box) is loaded within 10 seconds, the node is returned; if it has still not loaded after 10 seconds, an exception is thrown.

For the button, the waiting condition is changed to element_to_be_clickable, that is, the node must be clickable. The button is looked up by the ID su; if it becomes clickable within 10 seconds, it counts as loaded and the button node is returned; if it is still not clickable after 10 seconds, an exception is thrown.

For the introduction of waiting condition parameters and usage, please refer to the official document: http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.support.expected_conditions
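
Conceptually, this kind of wait is a polling loop: the condition is checked again and again until it succeeds or the time limit runs out. The following is a simplified sketch of that idea in plain Python, not Selenium's actual implementation. The name wait_until and its parameters are illustrative assumptions; the real WebDriverWait passes the driver to the condition and raises TimeoutException rather than the built-in TimeoutError.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # The condition succeeded: hand back whatever it produced.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        # Not ready yet: pause briefly before checking again.
        time.sleep(poll_interval)
```

For instance, wait_until(lambda: some_check(), timeout=10) would keep calling the hypothetical some_check() roughly every half second until it returns something truthy.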

3. Tab management

When browsing the web, pages are often opened in separate tabs.

In Selenium, we can also operate on tabs. Examples are as follows:

import time
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.baidu.com')
browser.execute_script('window.open()')
print(browser.window_handles)
browser.switch_to.window(browser.window_handles[1])

browser.get('https://www.baidu.com')
time.sleep(1)
browser.switch_to.window(browser.window_handles[0])
browser.get('https://pic.netbian.com')

The console output is as follows:

['CDwindow-4f58e3a7-7167-4587-bedf-9cd8c867f435', 'CDwindow-6e05f076-6d77-453a-a36c-32baacc447df']

The code first visits Baidu, then calls execute_script() with the JavaScript statement window.open() to open a new tab. To switch to that tab, it reads the window_handles property, which returns a list of handles for all currently open tabs. Switching is done by calling switch_to.window() with the handle of the target tab. Here the second handle is passed in, so the browser jumps to the second tab and opens a page there; switch_to.window() is then called again to return to the first tab, where other operations can be performed.
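
Picking the new tab by its position in window_handles works in a simple script, but it depends on the order of the list. A minimal sketch of an alternative follows: it compares the handles before and after opening the tab and switches to the one that is new. The helper name open_in_new_tab is hypothetical, and driver is assumed to be a Selenium WebDriver such as the browser object above.

```python
def open_in_new_tab(driver, url):
    """Open `url` in a new tab, switch to it, and return the new tab's handle."""
    handles_before = set(driver.window_handles)
    driver.execute_script('window.open()')
    # The new tab is the only handle that was not present before.
    new_handles = [h for h in driver.window_handles if h not in handles_before]
    if not new_handles:
        raise RuntimeError('no new tab was opened')
    driver.switch_to.window(new_handles[0])
    driver.get(url)
    return new_handles[0]
```

The returned handle can be kept and passed to switch_to.window() later to come back to that tab.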

4. Exception handling

When using Selenium, it is inevitable that some exceptions will occur, such as timeouts or nodes that cannot be found. Once such an error occurs, the program stops running. We can use try/except statements to catch these exceptions.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException

browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
except TimeoutException:
    print('Time Out')
try:
    # Look up a node that does not exist on the page
    browser.find_element(By.ID, 'hello')
except NoSuchElementException:
    print('No Element')
finally:
    # Close the browser window whether or not the lookup succeeded
    browser.close()

Here we use try/except to catch the different exceptions. For example, the call to find_element() that looks up the node is wrapped so that NoSuchElementException is caught; once such an error occurs, it is handled and the program is not interrupted.

The console output is as follows:

No Element

For more exception classes, you can refer to the official documentation: http://selenium-python.readthedocs.io/api.html#module-selenium.common.exceptions .

5. Bypass detection

Many websites use anti-crawling measures that detect automated browsers and return wrong information to them. In that case you need to bypass the detection. The code below visits a detection page once without any processing and once with processing; readers can run it and compare the two results.

from selenium import webdriver

# No processing
browser = webdriver.Chrome()
browser.get('https://bot.sannysoft.com/')

# With masking: disable the AutomationControlled Blink feature
options = webdriver.ChromeOptions()
options.add_argument('--disable-blink-features=AutomationControlled')
browsers = webdriver.Chrome(options=options)
browsers.get('https://bot.sannysoft.com/')
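
One signal that detection pages commonly check is the navigator.webdriver property, which a browser reports as true when it is under automation. To compare the two sessions from code rather than by eye, a minimal sketch could read that property through execute_script(). The helper name is_flagged_as_automated is hypothetical, driver is assumed to be a Selenium WebDriver, and the check covers only this one signal; a site may rely on others.

```python
def is_flagged_as_automated(driver):
    """Return True if the page sees navigator.webdriver set to true."""
    # navigator.webdriver may be true, false, or undefined (None in Python).
    return bool(driver.execute_script('return navigator.webdriver'))
```

Calling it on the unprocessed session and on the session started with the option above is expected to give different results if the masking took effect.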