Selenium module application

Introduction

Selenium was originally an automated testing tool. In crawlers it is mainly used to solve the problem that requests cannot execute JavaScript. Selenium drives a real browser and simulates user operations such as navigating, typing, clicking and scrolling, so you get the page as it looks after rendering. It supports multiple browsers.

Environment setup

Download and install selenium: pip install selenium

Download the browser driver: http://chromedriver.storage.googleapis.com/index.html

Check which driver version matches your browser version: http://blog.csdn.net/huilan_same/article/details/51896672
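The driver has to match the installed browser. As a rough sketch of that check, assuming the common case for recent Chrome releases where the driver's major version number equals the browser's major version number, the comparison could look like this. The function names are hypothetical helpers, not part of Selenium, and the version strings are made-up samples.

```python
def major_version(version: str) -> int:
    """Return the leading number of a dotted version string such as '114.0.5735.90'."""
    return int(version.strip().split(".")[0])


def versions_compatible(browser_version: str, driver_version: str) -> bool:
    """Assumption: driver and browser are compatible when their major versions match."""
    return major_version(browser_version) == major_version(driver_version)


# Sample version strings for illustration only
print(versions_compatible("114.0.5735.198", "114.0.5735.90"))  # True
print(versions_compatible("114.0.5735.198", "112.0.5615.49"))  # False
```

For older browser releases the mapping is not this simple, which is why the table linked above is still worth checking.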

Simple usage / demo

from selenium import webdriver
from time import sleep

# Pass the path of your browser driver; the r'' prefix keeps backslashes from being treated as escape characters
driver = webdriver.Chrome(r'driver path')
# Open the Baidu home page with get()
driver.get("http://www.baidu.com")
# Locate the "Settings" option on the page and click it
# (the link text must match the text shown on the live page exactly)
driver.find_elements_by_link_text('set')[0].click()
sleep(2)
# In the settings menu, find the "Search Settings" option and open it
driver.find_elements_by_link_text('Search Settings')[0].click()
sleep(2)
# Select 50 results per page
m = driver.find_element_by_id('nr')
sleep(2)
# './/option[3]' is relative to the <select id="nr"> element found above
m.find_element_by_xpath('.//option[3]').click()
sleep(2)
# Click "Save Settings"
driver.find_elements_by_class_name("prefpanelgo")[0].click()
sleep(2)
# Handle the pop-up alert: accept() confirms, dismiss() cancels
driver.switch_to.alert.accept()
sleep(2)
# Find the Baidu input box and type the keyword "beauty"
driver.find_element_by_id("kw").send_keys('beauty')
sleep(2)
# Click the search button
driver.find_element_by_id('su').click()
sleep(2)
# Find the "beauty_Baidu Pictures" link in the results and open it
driver.find_elements_by_link_text('beauty_Baidu Pictures')[0].click()
sleep(3)
# Close the browser
driver.quit()
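The script above pauses for a fixed time after every step. A minimal sketch of an alternative, assuming you would rather poll until something is ready than guess how long to sleep, is shown here. wait_until is a hypothetical helper written for this example, not a Selenium function.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Possible use with the driver from the script above (not executed here):
# links = wait_until(lambda: driver.find_elements_by_link_text('Search Settings'))
# links[0].click()
```

Because find_elements_by_link_text returns an empty list when nothing matches, the lambda is falsy until the link appears, and the helper returns as soon as it does.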

Case: Obtaining data for dynamic page loading

from selenium import webdriver
from lxml import etree

driver = webdriver.Chrome(executable_path='./chromedriver.exe')
driver.get('http://125.35.6.84:81/xk/')
dir_txt = driver.page_source

tree = etree.HTML(dir_txt)
title_list = tree.xpath('//*[@id="gzlist"]/li')
for i in title_list:
    title = i.xpath('./dl/@title')[0]
    print(title)

driver.quit()

Method description

Browser creation

Selenium supports many browsers, such as Chrome, Firefox and Edge, as well as mobile browsers such as Android and BlackBerry. PhantomJS, a headless browser (one without a user interface), is also supported.

from selenium import webdriver
browser = webdriver.Chrome()
browser = webdriver.Firefox()
browser = webdriver.Edge()
browser = webdriver.PhantomJS()
browser = webdriver.Safari()

Element positioning

webdriver provides a series of methods for locating elements. The most commonly used are:

find_element_by_id()
find_element_by_name()
find_element_by_class_name()
find_element_by_tag_name()
find_element_by_link_text()
find_element_by_partial_link_text()
find_element_by_xpath()
find_element_by_css_selector()

Note

1. find_element_by_xxx returns the first element that matches; find_elements_by_xxx returns all matching elements.

2. Whether an element is located by ID, CSS selector or XPath, the returned result is the same.

3. Selenium also provides a general method, find_element(), which takes two parameters: the lookup strategy (By) and the value. find_element_by_id() is simply a convenience version of it: find_element_by_id(id) is equivalent to find_element(By.ID, id), and both return the same result.
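The equivalence in note 3 can be sketched as a small lookup table. The strings are the values behind Selenium's By constants (for example By.ID is "id" and By.LINK_TEXT is "link text"); find and find_all are hypothetical wrappers written for this example.

```python
# Values of the locator strategies, matching the constants on
# selenium.webdriver.common.by.By (By.ID == "id", By.XPATH == "xpath", ...).
STRATEGIES = {
    "id": "id",
    "name": "name",
    "class_name": "class name",
    "tag_name": "tag name",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "xpath": "xpath",
    "css_selector": "css selector",
}


def find(driver, how, value):
    """Hypothetical helper: find(driver, "id", "kw") mirrors find_element_by_id("kw")."""
    return driver.find_element(STRATEGIES[how], value)


def find_all(driver, how, value):
    """Hypothetical helper: the plural form, mirroring find_elements_by_xxx."""
    return driver.find_elements(STRATEGIES[how], value)
```

With a real driver, find(driver, "css_selector", ".btn-search") would go through the same general find_element() call that the convenience methods use.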

Node interaction

Selenium can drive the browser to perform actions on elements. The most common are send_keys() for entering text, clear() for clearing text, and click() for clicking a button. Examples are as follows:

from selenium import webdriver
import time
browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input = browser.find_element_by_id('q')
input.send_keys('MAC')
time.sleep(1)
input.clear()
input.send_keys('IPhone')
button = browser.find_element_by_class_name('btn-search')
button.click()
browser.quit()

Execute JavaScript

Selenium does not provide an API for some operations, such as scrolling the page down. In those cases you can run JavaScript directly with the execute_script() method. The code is as follows:

from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://www.jd.com/')
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
browser.execute_script('alert("123")')
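On pages that load more content as you scroll, one scroll is often not enough. A minimal sketch, assuming the page grows its document height when new content arrives, repeats the scroll until the height stops changing. scroll_to_bottom is a hypothetical helper; it only relies on execute_script().

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the document height stops growing.

    Returns the number of scrolls performed (at most max_rounds).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds
```

The pause and max_rounds defaults are arbitrary; slower pages may need a longer pause.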

Get page source data

The page_source attribute gives you the source code of the rendered page. You can then use a parsing tool (regular expressions, Beautiful Soup, pyquery, etc.) to extract the information.
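As a small illustration of the regular-expression route, the sketch below pulls the title out of a page source string. The sample HTML is made up; with a real browser you would pass browser.page_source instead. extract_title is a hypothetical helper.

```python
import re


def extract_title(page_source):
    """Return the text of the first <title> element, or None if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", page_source,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# Made-up sample standing in for browser.page_source
sample = "<html><head><title> Example page </title></head><body></body></html>"
print(extract_title(sample))  # Example page
```

Regular expressions are fine for a single, simple field like this; for nested structures a real parser is usually less fragile.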

Forward and backward

# Simulate the browser's back and forward buttons
import time
from selenium import webdriver
browser=webdriver.Chrome()
browser.get('https://www.baidu.com')
browser.get('https://www.taobao.com')
browser.get('http://www.sina.com.cn/')
browser.back()
time.sleep(10)
browser.forward()
browser.close()

Case: Browser automation

from selenium import webdriver
import time

driver = webdriver.Chrome('./chromedriver.exe')
driver.get('https://www.taobao.com')

# Locate the element
class_dir = driver.find_element_by_id('q')   # locate by id

# Interact with the element
class_dir.send_keys('umbrella')

# Execute JavaScript
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
time.sleep(2)

# Click the Search button
# but = driver.find_element_by_css_selector('.btn-search')   # locate by class
but = driver.find_element_by_xpath('//*[@id="J_TSearchForm"]/div[1]/button')     # locate by XPath
but.click()      # click event


driver.get('https://www.baidu.com')
time.sleep(2)
# Go back one page
driver.back()
# Go forward one page
driver.forward()

time.sleep(3)
driver.close()   # close the browser

iframe processing & action chain

In the examples above, each interaction was performed on a specific node: for an input box we called its methods for entering and clearing text, and for a button we called its click method. Other operations, such as dragging the mouse or pressing keys, are not tied to a single node. These are performed in another way, through an action chain.

Handling iframes in Selenium

If the element you want to locate is inside an iframe, you must first switch into it with switch_to.frame(id).

Action chain (drag): from selenium.webdriver import ActionChains

1. Instantiate an action chain object: action = ActionChains(bro)

2. click_and_hold(div): press and hold the left mouse button on the element

3. move_by_offset(x, y): move the mouse by the given offset (the drag distance)

4. perform(): execute the actions queued in the chain

5. release(): release the held mouse button

For example, to realize the drag operation of a node, drag a node from one place to another, it can be achieved as follows:

# Action chain
from selenium import webdriver
from selenium.webdriver import ActionChains
from time import sleep

driver = webdriver.Chrome('./chromedriver.exe')
driver.get('https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')

# If the element is inside an iframe, switch into the iframe before locating it
driver.switch_to.frame('iframeResult')       # switch the locating scope to the iframe
div = driver.find_element_by_id('draggable')

# Create the action chain
action = ActionChains(driver)

# Click and hold the specified element
action.click_and_hold(div)

for i in range(5):
    # perform() executes the queued actions immediately
    action.move_by_offset(17, 0).perform()
    sleep(0.3)

# Release the mouse button; perform() is needed for the release to take effect
action.release().perform()
driver.close()
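The example moves the element 17 pixels five times. A minimal sketch for splitting any total drag distance into several small moves, so the values need not be hard-coded, is given below. split_offset is a hypothetical helper and does not touch Selenium itself.

```python
def split_offset(total, steps):
    """Split a pixel distance into `steps` integer moves that sum exactly to `total`."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    base, remainder = divmod(total, steps)
    # The first `remainder` moves are one pixel longer so that nothing is lost
    return [base + 1 if i < remainder else base for i in range(steps)]


print(split_offset(85, 5))  # [17, 17, 17, 17, 17]

# Possible use inside the drag loop above (not executed here):
# for dx in split_offset(85, 5):
#     action.move_by_offset(dx, 0).perform()
```

Keeping the moves as integers matters because move_by_offset() is called with whole-pixel offsets in the example.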

Case: Implementing simulated QQ space login

from selenium import webdriver
from selenium.webdriver import ActionChains
from time import sleep

bor = webdriver.Chrome('./chromedriver.exe')
bor.get('https://qzone.qq.com')

# Switch the locating scope to the login iframe
bor.switch_to.frame('login_frame')

zm = bor.find_element_by_id('switcher_plogin')
zm.click()

user = bor.find_element_by_id('u')
pawd = bor.find_element_by_id('p')
user.send_keys('111222333')
sleep(1)
pawd.send_keys('123456')
sleep(1)
but = bor.find_element_by_id('login_button')
but.click()
sleep(3)
bor.close()

Cookie processing

With Selenium, you can also easily manipulate cookies, such as obtaining, adding, and deleting cookies. Examples are as follows:

from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://www.zhihu.com/explore')
print(browser.get_cookies())
browser.add_cookie({'name': 'name', 'domain': 'www.zhihu.com', 'value': 'germey'})
print(browser.get_cookies())
browser.delete_all_cookies()
print(browser.get_cookies())
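get_cookies() returns a list of dictionaries, each with at least a 'name' and a 'value' key. A common next step in a crawler is to reuse those cookies elsewhere, for example in a plain HTTP request. The sketch below converts the list into a dictionary and into a Cookie header string. Both helpers and the sample data are made up for illustration.

```python
def cookies_to_dict(cookies):
    """Turn Selenium's list of cookie dicts into a simple {name: value} mapping."""
    return {c["name"]: c["value"] for c in cookies}


def cookie_header(cookies):
    """Build the value of an HTTP Cookie header from Selenium's cookie list."""
    return "; ".join("%s=%s" % (c["name"], c["value"]) for c in cookies)


# Made-up sample in the shape returned by browser.get_cookies()
sample_cookies = [
    {"name": "sessionid", "value": "abc", "domain": "www.example.com"},
    {"name": "lang", "value": "en", "domain": "www.example.com"},
]
print(cookies_to_dict(sample_cookies))
print(cookie_header(sample_cookies))
```

The dictionary form suits libraries that accept a cookies mapping; the header form suits code that sets request headers by hand.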

Exception handling

from selenium import webdriver
from selenium.common.exceptions import TimeoutException,NoSuchElementException,NoSuchFrameException
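# Create the browser outside try so that `browser` always exists in finally
browser = webdriver.Chrome()
try:
    browser.get('http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
    # The frame name is deliberately wrong, so NoSuchFrameException is raised
    browser.switch_to.frame('iframssseResult')
except TimeoutException as e:
    print(e)
except NoSuchFrameException as e:
    print(e)
finally:
    browser.close()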
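The finally clause above exists to make sure the browser is closed even when an exception is raised. One way to package that pattern, sketched here under the assumption that you want the whole browser process shut down with quit(), is a context manager. managed_browser is a hypothetical helper, not part of Selenium.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a browser with factory() and always call quit() on it afterwards."""
    browser = factory()
    try:
        yield browser
    finally:
        browser.quit()


# Possible use (not executed here):
# from selenium import webdriver
# with managed_browser(webdriver.Chrome) as browser:
#     browser.get('http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
```

Exceptions raised inside the with block still propagate to the caller, so the except clauses from the example can wrap the with statement unchanged.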

Headless browser & evading detection

Headless Chrome

Since PhantomJS is no longer updated or maintained, it is recommended to use headless Chrome instead, which is Google Chrome running without a visible interface.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

driver = webdriver.Chrome(options=chrome_options)

Selenium evades detection and recognition

Many large websites now detect Selenium. For example, when you visit Taobao and similar sites with a normal browser, the value of window.navigator.webdriver is undefined, but when the browser is driven by Selenium the value is true. How can this be worked around?

It only takes a startup parameter for ChromeDriver. Before starting ChromeDriver, set Chrome's experimental option excludeSwitches to ['enable-automation']. The complete code is as follows:

from selenium.webdriver import Chrome, ChromeOptions

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])

driver = Chrome(options=option)
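To see what a page's scripts would observe, you can read the flag back through execute_script(). The sketch below is a hypothetical helper for that check. It makes no claim about what a given Chrome version returns, only that a JavaScript undefined or null arrives in Python as None.

```python
def webdriver_flag(driver):
    """Return window.navigator.webdriver as the page sees it (None if undefined)."""
    return driver.execute_script("return window.navigator.webdriver")


def looks_automated(driver):
    """True when the page can tell, from this flag alone, that it is being automated."""
    return bool(webdriver_flag(driver))


# Possible use with the driver created above (not executed here):
# driver.get('https://www.taobao.com')
# print(webdriver_flag(driver))
```

Running this check before and after changing the startup options is a quick way to confirm whether the setting had any effect in your environment.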

Code demo:

from selenium import webdriver
from time import sleep

# Run without a visible interface (headless)
from selenium.webdriver.chrome.options import Options


# Create an options object that makes Chrome run in headless mode
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

# Evade detection; set on the same options object so both settings are applied
chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])

bor = webdriver.Chrome('./chromedriver.exe', options=chrome_options)
bor.get('https://i.qq.com')
bor_txt = bor.page_source
print(bor_txt)
sleep(3)
bor.close()
