Python crawling: using Selenium to handle dynamic web pages

For static web pages, you can easily get the page source with a library such as requests and then extract the information you want. Dynamic web pages are more complicated: their source code is often just a skeleton, and the actual content is rendered by JavaScript. In that case we can use Selenium to drive a browser directly and crawl the rendered page.

Selenium is an automated testing tool that can drive a browser through a series of operations and retrieve the source code of the page as currently rendered, which makes it very effective for crawling dynamic pages. Let's walk through its basic usage.

1. Installation

1. selenium

It is recommended to use pip to install directly:

pip install selenium

2. ChromeDriver

Selenium needs to be used together with a browser driver. Taking Chrome as an example, you first need to download the driver matching your browser version; the correspondence between versions can be checked in the chromedriver and Chrome version mapping table. For Chrome 70 and above, you can also download the driver for your exact browser version directly from ChromeDriverMirror.


After the download is complete, the executable needs to be reachable through an environment variable. On Windows, you can drop chromedriver.exe directly into the Scripts directory of your Python environment, or add its location to the system PATH. On Linux, move the executable into a directory that is already on the PATH: sudo mv chromedriver /usr/bin

After configuration, run chromedriver directly from the command line; if it starts up and prints its version information, the environment is configured correctly.
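You can also check the configuration from Python itself using only the standard library. A small sketch (the executable name `chromedriver` is assumed; on Windows, `shutil.which` also finds `chromedriver.exe` via PATHEXT):

```python
import shutil

def chromedriver_on_path():
    """Return the full path of the chromedriver executable if it is on PATH, else None."""
    return shutil.which("chromedriver")

path = chromedriver_on_path()
print(path if path else "chromedriver not found on PATH")
```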


2. Usage

1. Declare the browser object

from selenium import webdriver
chrome = webdriver.Chrome()

Initialized this way, a browser window will pop up when you operate the browser and stay open until closed. Usually we want it to work silently in the background without showing any interface, so we also need to enable Chrome's headless mode (older versions may not support it):

from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome = webdriver.Chrome(options=chrome_options)	# older versions: chrome_options=chrome_options

2. Visit the page

To visit a page, call the get method with the URL:

chrome.get('https://www.baidu.com')
print(chrome.page_source)	# print the source of the current page
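The page_source string can be handed to any HTML parser. As a dependency-free illustration, here is a sketch using only the standard library's html.parser (in practice you would more likely use BeautifulSoup; the sample HTML below is made up):

```python
from html.parser import HTMLParser

class IdFinder(HTMLParser):
    """Record the tag name and attributes of the first element whose id matches."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.found is None and attrs.get("id") == self.target_id:
            self.found = (tag, attrs)

# In real use, page_source would come from chrome.page_source
page_source = '<html><body><input id="kw" class="s_ipt" name="wd"></body></html>'
finder = IdFinder("kw")
finder.feed(page_source)
print(finder.found)  # ('input', {'id': 'kw', 'class': 's_ipt', 'name': 'wd'})
```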

3. Find nodes

After opening a page with the get method, you can read its source via the page_source attribute and parse it with a library such as BeautifulSoup. However, Selenium already provides a full set of methods for locating and operating on nodes, so an extra parsing library is usually unnecessary.
Nodes with common attributes such as id and class can be located with methods like find_element_by_id. The methods for obtaining a single node include:

find_element_by_tag_name()
find_element_by_id()
find_element_by_name()
find_element_by_class_name()

In addition, there is a general method, find_element(by, value), which can flexibly locate nodes by any supported attribute. The methods above return a single node; if several nodes match, only the first one is returned. To find all matching nodes, change element to elements in the method name (for example find_elements_by_class_name), which returns a list of nodes.
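The difference between the singular and plural forms can be illustrated with a stand-in object that mimics the driver's interface (FakeDriver and its sample data are invented for illustration; with a real driver you would call, for example, chrome.find_element(By.ID, 'kw')):

```python
class FakeDriver:
    """A stand-in with the same find_element/find_elements interface as a real driver."""

    def __init__(self, nodes):
        self._nodes = nodes  # list of (by, value, node) triples

    def find_elements(self, by, value):
        # plural form: every matching node, possibly an empty list
        return [node for b, v, node in self._nodes if (b, v) == (by, value)]

    def find_element(self, by, value):
        # singular form: only the first match (Selenium raises NoSuchElementException if none)
        matches = self.find_elements(by, value)
        if not matches:
            raise LookupError("no such element")
        return matches[0]

driver = FakeDriver([
    ("id", "kw", "input#kw"),
    ("class name", "result", "div.result-1"),
    ("class name", "result", "div.result-2"),
])
print(driver.find_element("id", "kw"))               # input#kw
print(driver.find_elements("class name", "result"))  # ['div.result-1', 'div.result-2']
```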

4. Get node information

After locating a node, you can read one of its attributes with the get_attribute('attribute name') method, and get the node's text content through its text property:

node = chrome.find_element_by_id('kw')
print(node.get_attribute('class'))
print(node.text)

5. Node interaction

Selenium provides a series of operation methods for nodes, such as filling in the input box:

input = chrome.find_element_by_id('kw')
input.send_keys('python')	# type text into the box
input.clear()	# clear the box
input.send_keys('zzu')

Click the designated button:

button = chrome.find_element_by_id('su')
button.click()
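Combining the two interactions above, a small helper for the Baidu example might look like this (a sketch: search_baidu is an invented name, and the ids 'kw' and 'su' are Baidu's search box and search button as used above):

```python
def search_baidu(driver, query):
    """Clear Baidu's search box (id 'kw'), type a query, and click the search button (id 'su')."""
    box = driver.find_element_by_id('kw')   # newer Selenium: driver.find_element(By.ID, 'kw')
    box.clear()
    box.send_keys(query)
    driver.find_element_by_id('su').click()
```

With a real driver, you would call search_baidu(chrome, 'python') after chrome.get('https://www.baidu.com').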

6. Delayed waiting

When a page is opened, network problems or a large amount of dynamic content may mean the page is not fully loaded yet, so operating on it immediately can raise errors. It is therefore best to wait a moment after opening a page before interacting with it. The simplest way to wait is with the time library:

import time
time.sleep(3)

Obviously this approach is inflexible: most pages load quickly, and a fixed delay wastes a lot of time. A better way is to use the WebDriverWait class. Initialize it with the browser object and a maximum waiting time, then call its until method with the condition to wait for. If the condition is met within the maximum waiting time, the wait ends immediately; otherwise an exception is thrown. For example:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

# end the wait as soon as a node with id content_left appears within 5 seconds,
# otherwise raise an exception
WebDriverWait(chrome, 5).until(EC.presence_of_element_located((By.ID, 'content_left')))

The complete list of wait conditions can be found in the official documentation.
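Under the hood, WebDriverWait is essentially a polling loop. The pattern can be sketched in plain Python (wait_until is an invented helper, not part of Selenium's API):

```python
import time

def wait_until(condition, timeout=5.0, poll=0.1):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll)

# a condition that is already true returns immediately
print(wait_until(lambda: "ready", timeout=1.0))  # ready
```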

7. Execute JavaScript code

Selenium provides a method for running JavaScript code directly: execute_script. Actions Selenium does not provide can be implemented by executing JavaScript, such as scrolling to the bottom of the page to load more content:

# scroll to the bottom of the page
chrome.execute_script('window.scrollTo(0, document.body.scrollHeight)')
chrome.execute_script('alert("to bottom")')
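A common use of execute_script is loading an infinite-scroll page: keep scrolling until the document height stops growing. A sketch of the pattern (scroll_to_bottom is an invented helper; on a real page you would also wait briefly between scrolls so new content can load):

```python
def scroll_to_bottom(driver, max_rounds=10):
    """Scroll down repeatedly until document.body.scrollHeight stops increasing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        # on a real page: sleep or use an explicit wait here before re-measuring
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return last_height
```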


Origin blog.csdn.net/zzh2910/article/details/89186550