Selenium's basic operations, such as page navigation and element location, cover about 80% of the work in automated crawling tasks. If you only need a simple automated crawler, you can skip what follows. If you want your crawler to run more efficiently and want to learn some Selenium implementation techniques, reading on is strongly recommended: this part shows a different way to write automated crawler code and can make your implementation more elegant.
The knowledge points involved in this advanced topic are:
- Lazily loaded DOM elements: learn how to handle elements that load asynchronously on the page.
- Action chains (ActionChains): learn how to chain multiple related operations.
- WebDriver architecture and internal implementation: understand how WebDriver is implemented and some of the commonly used interfaces it provides.
Delayed page loading
With the rapid development of the Internet and of front-end technologies such as HTML5, CSS3, Ajax (Asynchronous JavaScript And XML), React, and Vue, websites have become more complex and varied. For dynamic pages that load content asynchronously, Ajax has long played a central role: it makes pages more responsive and gives users a smoother experience.
When a user sends a request to a server from a browser, the browser parses and renders the returned information and finally presents a visible HTML page. When we use Selenium for automated crawling, we only need to wait for the target element we want to locate; there is no need to wait until the entire page has loaded. This shortens the effective page load time and improves the performance of the crawler. For example, when simulating a login with Selenium, we only need to wait until the login-related elements have loaded before logging in; page elements unrelated to login can be ignored.
Because each element of a page takes a different time to load, locating elements becomes harder. Wait operations solve this problem. If Selenium tries to operate on an element that is not yet in the DOM, it throws NoSuchElementException; if the element is in the DOM but not yet visible, it throws ElementNotVisibleException. Waiting for the element avoids these exceptions and makes the code more robust.
Note: When using Selenium for crawling tasks, avoid unhandled exceptions as far as possible. If Selenium throws an unhandled exception and the only way to continue is to restart Selenium, the crawler becomes both more complex and slower. We should therefore write defensive code that anticipates these exceptions.
Selenium provides two kinds of wait operations: explicit waits and implicit waits.
Explicit Waits
An explicit wait pauses until a given condition is met before proceeding to the next step. Using WebDriverWait together with ExpectedCondition, we can implement an explicit wait that makes the code wait only as long as loading actually requires. A simple analogy: when I am waiting for the bus to go to the office, if the bus arrives at the stop within 10 minutes, I take it; otherwise, I take a taxi.
Note: Using time.sleep() is also a form of explicit waiting. However, time.sleep() always waits for exactly the given time, and the time a page needs to load an element usually differs from the time we specify. This leads to two outcomes: if the time is too short, Selenium throws an exception; if it is too long, the program waits needlessly and performance drops. The best solution is to let the program decide how long to wait, so that it neither waits too long nor fails. That said, time.sleep() is useful for debugging during development.
The code is implemented as follows:
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

CHROMEDRIVER_PATH = './chromedriver'  # path to the chromedriver executable
TIMEOUT = 5  # seconds


def main():
    # Selenium 3 style; Selenium 4 passes the driver path through a Service object instead.
    driver = webdriver.Chrome(
        executable_path=CHROMEDRIVER_PATH
    )
    start = time.perf_counter()
    driver.get('https://www.baidu.com')
    try:
        # Block until the element with id "su" is in the DOM, or raise TimeoutException.
        element = WebDriverWait(driver, TIMEOUT).until(
            EC.presence_of_element_located((By.ID, 'su'))
        )
        print(element.get_attribute('value'))
        print('waiting time: {:d}s'.format(TIMEOUT))
        print('loading time: {:.2f}s'.format(time.perf_counter() - start))
    finally:
        driver.quit()


if __name__ == '__main__':
    main()

# Result:
# 百度一下
# waiting time: 5s
# loading time: 1.96s
Code analysis: The code sets the timeout to 5 seconds. If the presence_of_element_located condition locates the element within 5 seconds, the wait returns the element immediately; otherwise, a TimeoutException is thrown.
The WebDriverWait class is implemented in selenium.webdriver.support.wait. Its signature is: class selenium.webdriver.support.wait.WebDriverWait(driver, timeout, poll_frequency=0.5, ignored_exceptions=None)
where:
- The parameter driver is an already created WebDriver instance.
- The parameter timeout is the timeout, in seconds.
- The parameter poll_frequency is the sleep interval between calls; the default value is 0.5 s.
- The parameter ignored_exceptions is a tuple of exception classes to ignore during the calls; by default only (NoSuchElementException,) is ignored.
The WebDriverWait class provides two methods, until and until_not:
- until(method, message=""): calls method repeatedly at intervals of poll_frequency until its return value evaluates to true or the timeout expires; on timeout, a TimeoutException is thrown.
- until_not(method, message=""): analogous to until, but the calls end when the return value evaluates to false, and that value is returned.
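The polling behaviour of until can be modelled in a few lines of plain Python. This is a simplified sketch, not Selenium's source: wait_until and WaitTimeout are hypothetical names standing in for WebDriverWait.until and TimeoutException, and the driver argument is simply passed through to the condition.

```python
import time


class WaitTimeout(Exception):
    """Hypothetical stand-in for Selenium's TimeoutException."""


def wait_until(driver, method, timeout, poll_frequency=0.5,
               ignored_exceptions=(), message=""):
    """Call method(driver) repeatedly until it returns a truthy value.

    Raises WaitTimeout once `timeout` seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            value = method(driver)
            if value:
                return value  # condition met: hand the result back immediately
        except ignored_exceptions:
            pass  # an ignored exception counts as "not ready yet"
        if time.monotonic() >= deadline:
            raise WaitTimeout(message)
        time.sleep(poll_frequency)
```

Because the loop returns as soon as the condition holds, the caller waits only as long as the page actually needs, which is the advantage over a fixed time.sleep().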
For expected conditions, Selenium provides a set of commonly used conditions. The selenium.webdriver.support.expected_conditions module provides the following 26:
- title_is: Is the title of the page equal to the given title
- title_contains: Whether the title of the page contains the given title
- presence_of_element_located: Whether a given element exists in the DOM of the page
- url_contains: Does the URL of the current page contain the given string
- url_matches: Whether the URL of the current page meets the expected pattern
- url_to_be: Is the URL of the current page equal to the given URL
- url_changes: Whether the URL of the current page differs from the given URL; the opposite of url_to_be
- visibility_of_element_located: Determine whether a given element exists in the DOM of the page and is visible.
- visibility_of: Whether the given element is visible
- presence_of_all_elements_located: At least one element exists in the page
- visibility_of_any_elements_located: At least one element on the page is visible
- visibility_of_all_elements_located: All elements are on the page and are visible
- text_to_be_present_in_element: Whether the given text is in the selected element
- text_to_be_present_in_element_value: Whether the given text is in the value attribute of the given element
- frame_to_be_available_and_switch_to_it: Whether the given frame can be switched
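- invisibility_of_element_located: The given element is either invisible or not present in the DOM of the page
- invisibility_of_element: The given element (a locator or a WebElement) is either invisible or not present in the DOM of the page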
- element_to_be_clickable: The given element is visible and operable, and can perform click operations
- staleness_of: wait until the elements are no longer attached to the DOM of the page
- element_to_be_selected: The given element is selected
- element_located_to_be_selected: The located element is selected
- element_selection_state_to_be: The state of whether a given element is selected
- element_located_selection_state_to_be: Whether the positioned element is selected
- number_of_windows_to_be: Whether the number of windows equals the given value
- new_window_is_opened: Whether the new window is open
- alert_is_present: whether the warning window exists
In addition to the general conditions above, you can also define custom conditions. Combining the WebDriverWait class with the general conditions in the expected_conditions module gives an efficient way to implement explicit waits.
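As an illustration of a custom condition, here is a minimal sketch. A condition is just a callable that receives the driver and returns a truthy value once it is satisfied. The class name element_has_css_class and the CSS class names used with it are chosen for illustration; the only driver and element methods assumed are find_element and get_attribute.

```python
class element_has_css_class:
    """Custom expected condition: the located element carries a given CSS class.

    Any callable that takes the driver and returns a truthy or falsy value
    can be passed to WebDriverWait(...).until().
    """

    def __init__(self, locator, css_class):
        self.locator = locator      # e.g. ("id", "su"), the same shape as (By.ID, "su")
        self.css_class = css_class

    def __call__(self, driver):
        element = driver.find_element(*self.locator)
        classes = (element.get_attribute("class") or "").split()
        # Return the element itself when the condition holds, so until() can return it.
        return element if self.css_class in classes else False
```

With a real driver this would be used as WebDriverWait(driver, 10).until(element_has_css_class((By.ID, 'su'), 'some-class')), where 'some-class' is a placeholder.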
Implicit Waits
When we need to operate on more than one element that is not immediately available, an implicit wait tells WebDriver to poll the DOM for a specified amount of time. The benefit of this approach is not always obvious. The recommended approach: before writing code, carefully analyze the DOM objects to be operated on, and decide between explicit and implicit waits according to how many objects there are.
Code:
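from selenium import webdriver
from selenium.webdriver.common.by import By

CHROMEDRIVER_PATH = './chromedriver'  # path to the chromedriver executable


def main():
    # Selenium 3 style; Selenium 4 passes the driver path through a Service object instead.
    driver = webdriver.Chrome(
        executable_path=CHROMEDRIVER_PATH
    )
    try:
        # Every element lookup in this session now polls the DOM for up to 10 seconds.
        driver.implicitly_wait(10)
        driver.get('https://www.baidu.com')
        dynamic_element = driver.find_element(By.ID, 'su')
        print(dynamic_element.get_attribute('value'))
    finally:
        driver.quit()


if __name__ == '__main__':
    main()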
Code analysis: The signature of the function is implicitly_wait(self, time_to_wait). It sets a timeout and implicitly waits for the desired DOM element to be found or for a command to complete. This method only needs to be called once per session. The concept of a session is introduced in Section 3.
Action Chains
Action Chains are used to perform simple interactions, such as mouse movement, mouse clicks, and keyboard input. They are very useful for simulating more complex continuous operations, such as solving a slider CAPTCHA, which involves mouse clicks, mouse hovering, and dragging.
The series of methods called on the ActionChains object are similar to a series of continuous operations of the user, and these behaviors are stored in a queue. When perform() is called, these actions are sequentially dequeued and executed.
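The queue-then-perform idea can be illustrated with a toy model in plain Python. This is only a sketch of the pattern, not Selenium's implementation: ToyActionChain and its log argument are hypothetical, and the "actions" merely record what they would do.

```python
class ToyActionChain:
    """Toy model of a queued action chain; not Selenium's implementation."""

    def __init__(self, log):
        self._log = log          # where "executed" actions are recorded
        self._actions = []       # queue of pending actions

    def click(self, target):
        self._actions.append(lambda: self._log.append(("click", target)))
        return self              # returning self is what makes chaining possible

    def send_keys(self, text):
        self._actions.append(lambda: self._log.append(("send_keys", text)))
        return self

    def perform(self):
        # Run the stored actions in the order they were queued.
        for action in self._actions:
            action()

    def reset_actions(self):
        # Drop everything that has been queued so far.
        self._actions = []
```

Nothing happens while the chain is being built; the recorded actions run, in order, only when perform() is called.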
The methods provided by the ActionChains class are as follows:
- perform(): Perform all stored actions
- reset_actions(): Clear the stored actions
- click(): Perform a click operation
- click_and_hold(): Press and hold the left mouse button on an element
- context_click(): Right-click on the element
- double_click()
- drag_and_drop
- drag_and_drop_by_offset
- key_down
- key_up
- move_by_offset
- move_to_element
- move_to_element_with_offset
- pause
- release
- send_keys
- send_keys_to_element
The ActionChains class implements the __enter__ and __exit__ methods, so an ActionChains object can be used as a context manager.
WebDriver commonly used API
In this section, we will introduce the commonly used APIs of WebDriver, RemoteDriver and ChromeDriver. For the related operations of other browser drivers, you can view the Python version of the selenium documentation for learning.
Selenium architecture and core components
In Selenium's client/server model, the browser driver acts as the server and the program calling the WebDriver API acts as the client. When the client and the server communicate, they must follow an agreed protocol to exchange information.
All WebDrivers in Selenium that communicate with the browser implement a common protocol, the JSON Wire Protocol, which defines a RESTful web service over HTTP with JSON as the medium of information exchange. The protocol assumes that the client is implemented in an object-oriented style. In this protocol, requests and responses are referred to as commands and responses.
In the JSON Wire Protocol, there are some basic terms and concepts:
- Client: the machine using the WebDriver API; the client and server are usually on the same host
- Server: the browser driver that implements the wire protocol, such as FirefoxDriver or iPhoneDriver
- Session: the server maintains one session per browser. A command sent to a session acts directly on the corresponding browser, performs the operation for that command, and returns a specific JSON response message.
- WebElement: the object in the WebDriver API that represents a DOM element on the page
- WebElement JSON Object: the JSON representation of a WebElement transmitted over the wire
- Commands: WebDriver command messages should conform to the HTTP/1.1 request specification. In the wire protocol, all commands accept application/json;charset=UTF-8 content. In the WebDriver service, each command maps to an HTTP method on a specific path.
- Responses: responses should be sent as HTTP/1.1 response messages according to the specification.
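To make the mapping from a command to an HTTP method and path concrete, here is a minimal sketch. The method/path pairs follow the JSON Wire Protocol, while the dictionary keys, the {session_id} placeholder style, and the build_request helper are this sketch's own and are not part of Selenium.

```python
import json

# A few commands from the JSON Wire Protocol and the HTTP method/path each maps to.
# The dictionary keys and the {session_id} placeholder style are this sketch's own.
COMMANDS = {
    "new_session": ("POST", "/session"),
    "get": ("POST", "/session/{session_id}/url"),
    "get_title": ("GET", "/session/{session_id}/title"),
    "find_element": ("POST", "/session/{session_id}/element"),
    "quit": ("DELETE", "/session/{session_id}"),
}


def build_request(command, session_id=None, params=None):
    """Return (method, path, headers, body) for one command."""
    method, template = COMMANDS[command]
    if "{session_id}" in template and session_id is None:
        raise ValueError("command %r needs a session id" % command)
    path = template.format(session_id=session_id)
    headers = {"Accept": "application/json"}
    body = None
    if method == "POST":
        # POST commands carry their parameters as a JSON document.
        headers["Content-Type"] = "application/json;charset=UTF-8"
        body = json.dumps(params or {})
    return method, path, headers, body
```

For example, build_request("find_element", session_id="abc123", params={"using": "id", "value": "su"}) describes the request a client would send to locate the element with id su; the session id abc123 is a placeholder.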
The above are some concepts that I think are more important, which help us to understand the code implementation of selenium. For the specific implementation of Json Wire Protocol, please refer to link 2.
RemoteWebDriver
RemoteWebDriver is the base class of every browser WebDriver. By studying the implementation of RemoteWebDriver, we can better understand the WebDrivers of the individual browsers.
The implementation class of the RemoteWebDriver object is selenium.webdriver.remote.webdriver.WebDriver. It conforms to the JSON Wire Protocol and gives users a variety of easy-to-use interfaces for controlling the browser.
With RemoteWebDriver as the base class, Selenium implements a separate driver for each browser. The following session lists all browser drivers provided by Selenium:
In [1]: from selenium import webdriver
In [2]: webdriver.remote.webdriver.WebDriver.__subclasses__()
Out[2]:
[selenium.webdriver.firefox.webdriver.WebDriver,
selenium.webdriver.chrome.webdriver.WebDriver,
selenium.webdriver.ie.webdriver.WebDriver,
selenium.webdriver.edge.webdriver.WebDriver,
selenium.webdriver.safari.webdriver.WebDriver,
selenium.webdriver.blackberry.webdriver.WebDriver,
selenium.webdriver.phantomjs.webdriver.WebDriver,
selenium.webdriver.android.webdriver.WebDriver,
selenium.webdriver.webkitgtk.webdriver.WebDriver]
In [3]: webdriver.Remote
Out[3]: selenium.webdriver.remote.webdriver.WebDriver
In [4]: webdriver.Chrome.__base__
Out[4]: selenium.webdriver.remote.webdriver.WebDriver
If you need to customize WebDriver, you can refer to the implementation of browser drivers such as Chrome.
ChromeDriver
ChromeDriver is built on chromedriver, the driver for the Chrome browser. It follows the JSON Wire Protocol and implements the interface for Python developers. The ChromeDriver object is implemented by
class selenium.webdriver.chrome.webdriver.WebDriver(executable_path='chromedriver', port=0, options=None, service_args=None, desired_capabilities=None, service_log_path=None, chrome_options=None)
This class creates a ChromeDriver object and uses it to control the browser. Its base class is RemoteWebDriver. Where:
- executable_path: the path to the chromedriver executable; by default, it is looked up in $PATH;
- port: the port on which the service runs;
- chrome_options: if given, options = chrome_options;
- desired_capabilities: a dictionary object with non-browser-specific capabilities;
- service_log_path: the path where the log information generated by the driver is stored;
- keep_alive: whether to configure ChromeRemoteConnection to use HTTP keep-alive.
The DesiredCapabilities class provides the default desired capabilities supported by Selenium.
The ChromeOptions object is implemented by class selenium.webdriver.chrome.options.Options and is used to configure Chrome extensions and the headless state.
In automated crawlers, the browser is usually run in headless mode to improve the performance of the program.