When crawling web pages, we often find that the data we want cannot be obtained just by parsing the HTML source, because it is loaded asynchronously through AJAX or rendered by JavaScript.
Selenium is a browser automation tool, originally built for automated testing, that supports multiple browsers. In a crawler, we can use it to drive a real browser through the page, so that JavaScript-rendered content becomes available.
1. Usage examples
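A minimal end-to-end sketch, assuming Selenium 4 and a locally installed Chrome. The helper name search is hypothetical, and the example relies on the python.org search field being named q, which may change. Selenium is imported inside demo() so the helper itself can be defined without a browser present.

```python
def search(driver, url, locator, query):
    """Open `url`, type `query` into the element at `locator`, submit the form."""
    driver.get(url)
    box = driver.find_element(*locator)  # locator is a (By.<strategy>, value) pair
    box.clear()                          # drop any pre-filled text
    box.send_keys(query)
    box.submit()                         # submit the form the element belongs to
    return driver.title


def demo():
    # Deferred imports: only needed when a real browser is started.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()          # assumes Chrome is installed locally
    try:
        # Assumption: the search input on python.org has name="q".
        print(search(driver, "https://www.python.org", (By.NAME, "q"), "selenium"))
    finally:
        driver.quit()                    # always release the browser process
```

Calling demo() launches the browser; search() can be reused with any driver and locator.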
2. Detailed introduction
2.1 Declare the browser object
That is, tell the program which browser to drive.
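Each supported browser has its own driver class in the webdriver module. The sketch below wraps that choice in a hypothetical helper, make_driver; it assumes Selenium 4 and that the chosen browser is installed on the machine.

```python
SUPPORTED_BROWSERS = ("chrome", "firefox", "edge", "safari")


def make_driver(browser="chrome"):
    """Start the named browser and return its WebDriver (hypothetical helper)."""
    name = browser.lower()
    if name not in SUPPORTED_BROWSERS:
        raise ValueError(f"unsupported browser: {browser!r}")

    # Deferred import: Selenium is only needed once a browser is actually started.
    from selenium import webdriver

    # webdriver.Chrome, webdriver.Firefox, webdriver.Edge and webdriver.Safari
    # are the driver classes; instantiating one launches that browser.
    driver_class = getattr(webdriver, name.capitalize())
    return driver_class()
```

A typical call would be driver = make_driver("firefox"), followed by driver.quit() once the crawl is finished.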
2.2 Access page
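Pages are opened with the driver's get() method. A minimal sketch, with open_page as a hypothetical helper name; it works with any driver object created beforehand.

```python
def open_page(driver, url):
    """Navigate to `url` and return a few facts about the loaded page."""
    driver.get(url)                  # returns once the main document has loaded
    return {
        "url": driver.current_url,   # may differ from `url` after redirects
        "title": driver.title,
        "html": driver.page_source,  # page source as reported by the browser
    }
```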
2.3 Find elements
After a page has loaded, we may need to perform some operations on it, such as finding the search box, entering keywords, and pressing Enter. To do this, we first need to locate the element with Selenium.
2.3.1 Single element
Selenium has two ways to find elements. The first is to call a method named after the lookup strategy, such as selecting by CSS selector or by XPath. These methods were deprecated in Selenium 4 and removed in Selenium 4.3, so they only work on older versions.
The strategy-specific methods are:
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
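The second way is to call find_element() directly. Its first argument is the lookup strategy (a By constant such as By.ID or By.CSS_SELECTOR) and its second is the value to search for. This is the form to use with current Selenium releases.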
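The relationship between the two styles can be sketched as a lookup table. The strings below are the values of the By constants in selenium.webdriver.common.by (for example, By.CSS_SELECTOR is "css selector"); the table and the helper find_by are hypothetical and exist only for illustration.

```python
# Suffix of the old find_element_by_<suffix> method -> strategy string.
# Each string equals the matching By constant, e.g. By.ID == "id".
STRATEGIES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def find_by(driver, method, value):
    """Equivalent of the old driver.find_element_by_<method>(value)."""
    # In ordinary code this is written driver.find_element(By.ID, value), etc.
    return driver.find_element(STRATEGIES[method], value)
```

With a live driver, find_by(driver, "name", "q") performs the same lookup as driver.find_element(By.NAME, "q").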
2.3.2 Multiple elements
Finding multiple elements works the same way as finding a single one: use find_elements() instead of find_element() (in the older API, add an s to the method name, as in find_elements_by_css_selector). The result is a list of elements, which is empty if nothing matches.
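A minimal sketch of collecting several elements at once. The helper name element_texts is hypothetical, and the locator is a (By constant, value) pair supplied by the caller.

```python
def element_texts(driver, locator):
    """Return the visible text of every element matching `locator`."""
    elements = driver.find_elements(*locator)  # a list; empty if nothing matches
    return [element.text for element in elements]
```

With By imported from selenium.webdriver.common.by, a call might look like element_texts(driver, (By.CSS_SELECTOR, "ul li")), where the selector is only a placeholder.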
2.4 Element interaction
Element interaction means first obtaining an element and then calling an interaction method on it. For example, send_keys() types text into a search box, clear() empties it, and click() presses a button.
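The sequence described above can be sketched as follows. fill_and_click is a hypothetical helper, and both locators are (By constant, value) pairs chosen by the caller for the page at hand.

```python
def fill_and_click(driver, field_locator, text, button_locator):
    """Type `text` into one element, then click another."""
    field = driver.find_element(*field_locator)
    field.clear()                                 # remove any pre-filled value
    field.send_keys(text)                         # type into the input
    driver.find_element(*button_locator).click()  # press the search button
```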
2.5 Interactive actions
Interactive actions are queued on an action chain and executed in order when perform() is called; this requires ActionChains.
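A minimal sketch of an action chain, using drag and drop as the action. The helper name is hypothetical and the two locators are placeholders for whatever draggable and droppable elements the page contains. ActionChains is imported inside the function so that the definition itself does not require a browser.

```python
def drag_and_drop(driver, source_locator, target_locator):
    """Drag the element at `source_locator` onto the one at `target_locator`."""
    from selenium.webdriver import ActionChains  # deferred import

    source = driver.find_element(*source_locator)
    target = driver.find_element(*target_locator)

    actions = ActionChains(driver)
    actions.drag_and_drop(source, target)  # queued only, nothing happens yet
    actions.perform()                      # runs the queued actions in order
```

Other actions such as move_to_element() or click_and_hold() can be queued on the same chain before perform() is called.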
2.6 Execute JavaScript
For example, dragging the page down (scrolling) can be done by running JavaScript with execute_script().
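A minimal sketch of running JavaScript through execute_script(), here scrolling the current page to the bottom. The helper name scroll_to_bottom is hypothetical.

```python
def scroll_to_bottom(driver):
    """Scroll the current page to the bottom and return the new vertical offset."""
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # A `return` statement in the script sends its value back to Python.
    return driver.execute_script("return window.pageYOffset;")
```

Scrolling this way is often enough to trigger content that a page loads lazily as the user moves down.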
2.7 Get element information
Once an element has been located, you may also need to read its attributes and text.
2.7.1 Get attributes
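Attributes are read with get_attribute(), while the text and tag name are exposed as properties. A minimal sketch; describe is a hypothetical helper and the attribute names are examples only.

```python
def describe(element):
    """Collect a few commonly needed facts about a located element."""
    return {
        "class": element.get_attribute("class"),
        "href": element.get_attribute("href"),
        "text": element.text,      # visible text of the element
        "tag": element.tag_name,   # e.g. "a" or "div"
    }
```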
2.8 Frame
While the driver is focused on the parent frame, elements inside a child frame cannot be found, so you need to switch into the child frame and search again. Likewise, elements of the parent frame cannot be found from inside the child frame.
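Switching in and out of a frame can be sketched as follows. text_inside_frame is a hypothetical helper, and both locators are placeholders for the frame element and the element inside it.

```python
def text_inside_frame(driver, frame_locator, inner_locator):
    """Read the text of an element that lives inside a child frame."""
    frame = driver.find_element(*frame_locator)
    driver.switch_to.frame(frame)        # lookups now happen inside the frame
    try:
        return driver.find_element(*inner_locator).text
    finally:
        driver.switch_to.parent_frame()  # restore focus to the parent frame
```

switch_to.frame() also accepts a frame name or index instead of an element.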
2.9 Waiting
A page may keep loading content through AJAX after the initial request. get() returns once the main document has loaded and does not wait for AJAX requests to finish, so you need to wait for that content to appear before proceeding.
2.9.1 Implicit wait
With an implicit wait, if WebDriver does not find the requested element immediately, it keeps retrying until the element appears or the configured time runs out; if the element is still missing, a NoSuchElementException is raised. The default wait time is 0.
An implicit wait applies to every element lookup, not to one particular element.
Note that an implicit wait stays in effect for the whole lifetime of the driver, so it only needs to be set once.
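A minimal sketch of setting an implicit wait before a lookup. The helper name and the 10 second default are illustrative choices, not recommendations from the original text.

```python
def find_with_implicit_wait(driver, locator, seconds=10):
    """Set the implicit wait, then look up one element."""
    driver.implicitly_wait(seconds)       # applies to every later lookup on this driver
    return driver.find_element(*locator)  # retried for up to `seconds` before failing
```

In a real crawler implicitly_wait() would normally be called once, right after the driver is created.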
2.9.2 Explicit wait
An explicit wait consists of a wait condition and a maximum wait time.
The condition is checked first. If it holds, the wait returns immediately; if not, it is checked repeatedly until it holds or the maximum wait time has passed, at which point a TimeoutException is raised.
An explicit wait targets one specific element or condition.
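A minimal sketch of an explicit wait built from WebDriverWait and the presence_of_element_located condition. wait_for is a hypothetical helper, the timeout default is arbitrary, and the imports are deferred so the function can be defined without a browser.

```python
def wait_for(driver, locator, timeout=10):
    """Block until the element at `locator` is present, then return it."""
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # The condition is polled until it succeeds; if `timeout` seconds pass
    # first, until() raises selenium.common.exceptions.TimeoutException.
    condition = EC.presence_of_element_located(locator)
    return WebDriverWait(driver, timeout).until(condition)
```

Other conditions in expected_conditions, such as element_to_be_clickable or title_contains, can be passed to until() in the same way.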
2.10 Browser forward/backward
back() returns to the previous page in the browser history, and forward() goes to the next page.
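A minimal sketch of moving through the history. back_and_forward is a hypothetical helper and the two URLs are supplied by the caller.

```python
def back_and_forward(driver, first_url, second_url):
    """Visit two pages, go back, then go forward again."""
    driver.get(first_url)
    driver.get(second_url)
    driver.back()                      # history: back to first_url
    after_back = driver.current_url
    driver.forward()                   # history: forward to second_url
    after_forward = driver.current_url
    return after_back, after_forward
```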
2.11 Operating Cookies
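Cookies are handled with add_cookie(), get_cookies() and delete_all_cookies(). A minimal sketch; cookie_roundtrip is a hypothetical helper and the cookie name and value are supplied by the caller.

```python
def cookie_roundtrip(driver, name, value):
    """Add one cookie, read all cookies back, then clear them.

    The driver must already be on a page of the cookie's domain,
    otherwise the browser rejects add_cookie().
    """
    driver.add_cookie({"name": name, "value": value})
    after_add = {c["name"]: c["value"] for c in driver.get_cookies()}
    driver.delete_all_cookies()
    after_delete = driver.get_cookies()  # a list of dicts, now empty
    return after_add, after_delete
```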
2.12 Tab management
Tab management refers to the browser's tabs. Sometimes we need to open a new tab or close one, and Selenium can do this.
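One way to do this, sketched below, is to open a tab with JavaScript and then move between tabs through their window handles. Both helper names are hypothetical. Selenium 4 also offers driver.switch_to.new_window("tab") as an alternative to the window.open call.

```python
def open_new_tab(driver, url):
    """Open `url` in a new tab; return (original_handle, new_handle)."""
    original = driver.current_window_handle
    known = set(driver.window_handles)
    driver.execute_script("window.open('about:blank');")
    new_tab = next(h for h in driver.window_handles if h not in known)
    driver.switch_to.window(new_tab)   # focus does not move automatically
    driver.get(url)
    return original, new_tab


def close_current_tab(driver, return_to):
    """Close the focused tab and switch back to the handle `return_to`."""
    driver.close()                     # closes only the focused tab
    driver.switch_to.window(return_to)
```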