Python crawler selection, part 11 (selenium advanced summary)
- One. Introduction to selenium
- Two. Selenium extracts data
- Three. Other ways to use selenium
One. Introduction to selenium
1. Selenium in action
Selenium is a web automation testing tool, originally developed for automated website testing. It can drive a browser directly and supports all mainstream browsers (including headless browsers such as PhantomJS). Selenium receives instructions and has the browser load pages automatically, fetch the required data, and even take screenshots of the page. We can use selenium to easily rebuild the crawlers we wrote before. Let's first take a look at selenium in action.
1.1 Running effect with the Chrome browser
After downloading chromedriver and installing the selenium module, execute the following code and observe what happens:
from selenium import webdriver
# If the driver has not been added to the PATH environment variable, pass its absolute path to the executable_path parameter
# driver = webdriver.Chrome(executable_path='/home/worker/Desktop/driver/chromedriver')
# If the driver is on PATH, executable_path does not need to be set
driver = webdriver.Chrome()
# Send a request to a url
driver.get("http://www.itcast.cn/")
# Save the page as an image (screenshots reportedly do not work in Chrome 69 and above)
# driver.save_screenshot("itcast.png")
print(driver.title) # print the page title
# Quit the simulated browser
driver.quit() # Always quit! Otherwise leftover processes remain.
1.2 Running effect with the headless PhantomJS browser
PhantomJS is a WebKit-based "headless" browser that loads a website into memory and executes the JavaScript on the page. Download link: http://phantomjs.org/download.html
Note that PhantomJS development has been suspended and newer selenium releases no longer include webdriver.PhantomJS, so the code below requires an older selenium version.
from selenium import webdriver
# Specify the absolute path of the driver
driver = webdriver.PhantomJS(executable_path='/home/worker/Desktop/driver/phantomjs')
# driver = webdriver.Chrome(executable_path='/home/worker/Desktop/driver/chromedriver')
# Send a request to a url
driver.get("http://www.itcast.cn/")
# Save the page as an image
driver.save_screenshot("itcast.png")
# Quit the simulated browser
driver.quit() # Always quit! Otherwise leftover processes remain.
1.3 Observing the result
- Python code can automatically launch the Chrome browser or the headless PhantomJS browser and drive it to visit a website
1.4 When to use headed and headless browsers
- During development we usually need to watch what happens while the code runs, so we normally use a headed browser
- When the project is finished and deployed, the platform usually runs a server edition of the operating system without a graphical interface, so a headless browser is needed there
2. The role and working principle of selenium
Selenium wraps the browser's native API in a more object-oriented Selenium WebDriver API, which can directly manipulate the elements of a page and even the browser itself (screenshots, window size, startup, shutdown, plug-in installation, certificate configuration, and so on).
- A webdriver is essentially a web server that exposes a web API wrapping the browser's various functions
- Different browsers use different webdrivers
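To make the "webdriver is a web server" idea concrete, here is a minimal sketch of the kind of HTTP requests a client library sends to the driver. It assumes the driver speaks the W3C WebDriver protocol; the helper names and the session id are made up for illustration, and nothing is actually sent over the network.

```python
import json


def navigate_command(session_id, url):
    """Build the request that asks the driver to open `url` (W3C "Navigate To")."""
    return ("POST", f"/session/{session_id}/url", json.dumps({"url": url}))


def find_element_command(session_id, css_selector):
    """Build the request that asks the driver to locate one element (W3C "Find Element")."""
    body = {"using": "css selector", "value": css_selector}
    return ("POST", f"/session/{session_id}/element", json.dumps(body))


# Hypothetical session id; a real one is returned by the driver when a session starts.
session_id = "hypothetical-session-id"
method, path, body = navigate_command(session_id, "http://www.itcast.cn/")
print(method, path, body)
```

Calls such as driver.get(...) are roughly a wrapper that sends a request like this to the locally running chromedriver process and parses the JSON reply.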
3. Installation and simple use of selenium
Let's take Google Chrome's chromedriver as an example
3.1 Install the selenium module in the python virtual environment
pip/pip3 install selenium
3.2 Download the webdriver that matches the browser version
Take the Google Chrome browser as an example:
- Check the version of Google Chrome
- Visit https://npm.taobao.org/mirrors/chromedriver and open the download page for the different chromedriver versions
- Click notes.txt to open the version description page
- Look up which chromedriver version matches your Chrome version
- Download the correct version of chromedriver for your operating system
- Decompress the archive to obtain the chromedriver executable that the python code can call
  - On windows it is chromedriver.exe
  - On linux and macos it is chromedriver
- chromedriver environment configuration
  - On windows, add the directory containing chromedriver.exe to the PATH environment variable
  - On linux/mac, add the directory containing chromedriver to the system PATH environment variable
4. Simple use of selenium
Next, we will simulate a Baidu search in code:
import time
from selenium import webdriver
# Instantiate the driver object by specifying the chromedriver path; chromedriver is in the current directory
# driver = webdriver.Chrome(executable_path='./chromedriver')
# chromedriver has already been added to PATH
driver = webdriver.Chrome()
# Have the browser visit the url
driver.get("https://www.baidu.com/")
# Type 'python' into the Baidu search box
driver.find_element_by_id('kw').send_keys('python')
# Click the 'Baidu Search' button
driver.find_element_by_id('su').click()
time.sleep(6)
# Quit the browser
driver.quit()
- webdriver.Chrome(executable_path='./chromedriver')
  The executable_path parameter specifies the path of the downloaded chromedriver file
- driver.find_element_by_id('kw').send_keys('python')
  Locates the tag whose id attribute value is 'kw' and enters the string 'python' into it
- driver.find_element_by_id('su').click()
  Locates the tag whose id attribute value is 'su' and clicks it
  - The click function triggers the tag's js click event
Two. Selenium extracts data
- Knowledge points:
  - Understand the common attributes and methods of the driver object
  - Master how the driver object locates tag elements and returns element objects
  - Master how to extract text and attribute values from element objects
1. Common attributes and methods of the driver object
After the driver object has been instantiated, the following attributes and methods are commonly used:
- driver.page_source: the source code of the page as rendered by the browser in the current tab
- driver.current_url: the url of the current tab
- driver.close(): close the current tab; if it is the only tab, the whole browser closes
- driver.quit(): close the browser
- driver.forward(): go forward one page
- driver.back(): go back one page
- driver.save_screenshot(img_name): take a screenshot of the page
2. Locating tag elements with the driver object
Selenium offers many ways to locate a tag and return the element object:
find_element_by_id (returns one element)
find_element(s)_by_class_name (gets a list of elements by class name)
find_element(s)_by_name (returns a list of element objects matched by the value of the tag's name attribute)
find_element(s)_by_xpath (returns a list containing the matched elements)
find_element(s)_by_link_text (gets a list of elements by their full link text)
find_element(s)_by_partial_link_text (gets a list of elements by text contained in the link)
find_element(s)_by_tag_name (gets a list of elements by tag name)
find_element(s)_by_css_selector (gets a list of elements by css selector)
- Note:
  - The difference between find_element and find_elements:
    - With the s, a list is returned; without the s, the first matching element object is returned
    - find_element raises an exception when nothing matches; find_elements returns an empty list
  - The difference between by_link_text and by_partial_link_text: the former matches the whole link text, the latter matches links that contain the given text
  - How to use the functions above:
    driver.find_element_by_id('id_str')
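Selenium 4 deprecated the find_element_by_* helpers and later removed them in favour of driver.find_element(by, value). As a rough sketch of how the names listed above correspond to locator strategies, the mapping below uses the strategy strings defined on selenium.webdriver.common.by.By; the helper to_locator and the dictionary name are hypothetical and only illustrate the translation.

```python
# Strategy strings as defined on selenium.webdriver.common.by.By
# (for example By.ID == "id", By.CLASS_NAME == "class name").
LEGACY_TO_STRATEGY = {
    "id": "id",
    "class_name": "class name",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "css_selector": "css selector",
}


def to_locator(legacy_method, value):
    """Translate a legacy method name plus its argument into a (by, value) pair."""
    suffix = legacy_method.split("_by_", 1)[1]  # "find_element_by_id" -> "id"
    return (LEGACY_TO_STRATEGY[suffix], value)


print(to_locator("find_element_by_id", "kw"))
print(to_locator("find_elements_by_css_selector", "div.result a"))
```

With a newer selenium, the pair can be passed straight through, for example driver.find_element(*to_locator("find_element_by_id", "kw")); the find_elements variant takes the same two arguments and returns a list.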
3. Extracting text content and attribute values from element objects
find_element only returns the element, not the data inside it. To operate on the element or get its data, use the following methods:
- Click on an element
  element.click()
  - Clicks the located element object
- Enter data into an input box
  element.send_keys(data)
  - Enters data into the located element object
- Get text
  element.text
  - The text content is obtained through the text property of the located element object
- Get an attribute value
  element.get_attribute("attribute_name")
  - The attribute value is obtained by calling get_attribute on the located element object and passing in the attribute name
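The text property and get_attribute are typically used together when collecting links. Below is a minimal sketch; extract_links is a hypothetical helper, and FakeElement is only a stand-in that mimics the two members used here so the example runs without a browser. With a real driver, the list would come from one of the find_elements functions.

```python
def extract_links(elements):
    """Collect the visible text and the href attribute of each element object."""
    return [{"text": el.text, "href": el.get_attribute("href")} for el in elements]


class FakeElement:
    """Stand-in for a selenium element object (illustration only)."""

    def __init__(self, text, attributes):
        self.text = text
        self._attributes = attributes

    def get_attribute(self, name):
        # Like selenium, return None when the attribute is missing.
        return self._attributes.get(name)


# Made-up sample data.
elements = [FakeElement("example link", {"href": "https://example.com/"})]
print(extract_links(elements))
```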
Three. Other ways to use selenium
- Knowledge points:
  - Master switching tabs with selenium
  - Master switching iframes with selenium
  - Master obtaining cookies with selenium
  - Master manual page waiting
  - Master having selenium make the browser execute js code
  - Master enabling headless mode in selenium
  - Understand using a proxy ip with selenium
  - Understand replacing the user-agent in selenium
1. Switching tabs in selenium
When selenium has the browser open multiple tabs, how do we switch between them? Two steps are needed:
- Get the window handles of all tabs
- Use a window handle to switch to the tab it points to
  - A window handle here is an identifier that points to a tab object
  - Handles are not covered further in this section; please read up on them on your own
- Concrete method:
# 1. Get the list of handles of all currently open tabs
current_windows = driver.window_handles
# 2. Switch by indexing into the list of tab handles
driver.switch_to.window(current_windows[0])
- Reference code example:
import time
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.baidu.com/")
time.sleep(1)
driver.find_element_by_id('kw').send_keys('python')
time.sleep(1)
driver.find_element_by_id('su').click()
time.sleep(1)
# Open a new tab by executing js
js = 'window.open("https://www.sogou.com");'
driver.execute_script(js)
time.sleep(1)
# 1. Get all current windows
windows = driver.window_handles
time.sleep(2)
# 2. Switch by window index
driver.switch_to.window(windows[0])
time.sleep(2)
driver.switch_to.window(windows[1])
time.sleep(6)
driver.quit()
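Indexing into window_handles relies on the order of the list, which the WebDriver specification does not guarantee. One possible alternative, sketched below, is to remember the handles before opening a tab and then switch to whichever handle is new. switch_to_new_tab is a hypothetical helper, and FakeDriver is a stand-in that only mimics window_handles and switch_to.window so the example runs without a browser.

```python
from types import SimpleNamespace


def switch_to_new_tab(driver, handles_before):
    """Switch to the tab whose handle was not present in handles_before."""
    new_handles = [h for h in driver.window_handles if h not in handles_before]
    if not new_handles:
        raise RuntimeError("no new tab was opened")
    driver.switch_to.window(new_handles[0])
    return new_handles[0]


class FakeDriver:
    """Stand-in for a selenium driver (illustration only)."""

    def __init__(self, handles):
        self.window_handles = list(handles)
        self.current = self.window_handles[0]
        self.switch_to = SimpleNamespace(window=self._select)

    def _select(self, handle):
        self.current = handle


driver = FakeDriver(["tab-A"])
handles_before = list(driver.window_handles)
driver.window_handles.append("tab-B")  # simulates window.open(...) adding a tab
print(switch_to_new_tab(driver, handles_before))
```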
2. Switching into a frame with switch_to
An iframe is a commonly used html technique in which one page is nested inside another. Selenium cannot access the content of a frame by default; the solution is driver.switch_to.frame(frame_element). Next, we will learn this through a simulated login to QQ Mail.
- Reference code:
import time
from selenium import webdriver
driver = webdriver.Chrome()
url = 'https://mail.qq.com/cgi-bin/loginpage'
driver.get(url)
time.sleep(2)
login_frame = driver.find_element_by_id('login_frame') # locate the frame element by id
driver.switch_to.frame(login_frame) # switch into the frame
driver.find_element_by_xpath('//*[@id="u"]').send_keys('your QQ mail address')
time.sleep(2)
driver.find_element_by_xpath('//*[@id="p"]').send_keys('your mail password')
time.sleep(2)
driver.find_element_by_xpath('//*[@id="login_button"]').click()
time.sleep(2)
# To operate on elements outside the frame, switch back out first
windows = driver.window_handles
driver.switch_to.window(windows[0])
content = driver.find_element_by_class_name('login_pictures_title').text
print(content)
driver.quit()
- To sum up:
  - Switch into the page nested in the located frame tag:
    driver.switch_to.frame(frame or iframe element object located with a find_element_by function)
  - Switch out of the frame tag by switching tabs:
    windows = driver.window_handles
    driver.switch_to.window(windows[0])
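Besides switching windows, selenium also provides driver.switch_to.default_content() for returning to the top-level page. A small sketch of wrapping "enter the frame, then always leave it" in a context manager follows; inside_frame is a hypothetical helper, and the Fake classes are stand-ins that only record calls so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_element):
    """Switch into a frame and switch back to the top-level page afterwards."""
    driver.switch_to.frame(frame_element)
    try:
        yield driver
    finally:
        # Runs even if the code inside the with-block raises.
        driver.switch_to.default_content()


class FakeSwitchTo:
    """Stand-in that records which switch_to methods were called."""

    def __init__(self):
        self.calls = []

    def frame(self, element):
        self.calls.append(("frame", element))

    def default_content(self):
        self.calls.append(("default_content",))


class FakeDriver:
    def __init__(self):
        self.switch_to = FakeSwitchTo()


driver = FakeDriver()
with inside_frame(driver, "login_frame"):
    pass  # locate and fill in the elements inside the frame here
print(driver.switch_to.calls)
```

With a real driver, the second argument would be the frame element located beforehand, such as the login_frame object in the reference code.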
3. Handling cookies with selenium
Selenium can help us handle the cookies of a page, for example getting and deleting them. This section covers both.
3.1 Get cookies
driver.get_cookies()
Returns a list containing the complete cookie information: not only name and value, but also other fields such as domain. So to use the obtained cookies with the requests module, convert them to a dictionary with name and value as key-value pairs.
# Get all cookie information of the current tab
print(driver.get_cookies())
# Convert the cookies into a dictionary
cookies_dict = {cookie['name']: cookie['value'] for cookie in driver.get_cookies()}
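The conversion can be tried without a browser. The sketch below runs it on a made-up list shaped like the return value of driver.get_cookies(); the cookie names, values and domain are invented sample data, and both helper functions are hypothetical.

```python
def cookies_to_dict(cookie_list):
    """Keep only name and value, the form the requests module expects."""
    return {cookie["name"]: cookie["value"] for cookie in cookie_list}


def cookies_to_header(cookie_list):
    """Alternative form: a single Cookie header string."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookie_list)


# Made-up sample in the shape returned by driver.get_cookies().
sample_cookies = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "lang", "value": "en", "domain": ".example.com", "path": "/"},
]

print(cookies_to_dict(sample_cookies))
print(cookies_to_header(sample_cookies))
```

The dictionary can then be passed to requests, for example requests.get(url, cookies=cookies_dict).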
3.2 Delete cookies
# Delete a single cookie
driver.delete_cookie("CookieName")
# Delete all cookies
driver.delete_all_cookies()
4. Executing js code in the browser with selenium
Selenium can have the browser execute js code that we specify. Run the following code to see the effect:
import time
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.itcast.cn/")
time.sleep(1)
js = 'window.scrollTo(0,document.body.scrollHeight)' # the js statement
driver.execute_script(js) # the method that executes js
time.sleep(5)
driver.quit()
- The method for executing js:
driver.execute_script(js)
5. Page waiting
While a page is loading, it takes time for the web server's response to arrive. During this time a tag element may not be loaded yet and is therefore not visible. How do we deal with this?
- Classification of page waiting
- Forced wait
- Explicit wait
- Implicit wait
- Implementing page waiting manually
5.1 Classification of page waiting
First, let's look at how selenium page waits are classified:
- Forced wait
- Implicit wait
- Explicit wait
5.2 Forced wait (understand)
- This is simply time.sleep()
- The drawback is that it is not smart: if the time is set too short, the element has not finished loading; if it is set too long, time is wasted
5.3 Implicit wait
- An implicit wait applies to element location. It sets a time limit within which selenium keeps checking whether the element can be located; as soon as it is located, execution proceeds to the next step
- If the element cannot be located within the set time, find_element raises a NoSuchElementException
- Sample code:
from selenium import webdriver
driver = webdriver.Chrome()
driver.implicitly_wait(10) # implicit wait, at most 10 seconds
driver.get('https://www.baidu.com')
driver.find_element_by_xpath('//*[@id="kw"]') # for example, the Baidu search box used earlier
5.4 Explicit wait (understand)
- Check at a fixed interval whether the wait condition has been met; if so, stop waiting and continue with the subsequent code
- If not, keep waiting until the specified time is exceeded, then raise a timeout exception
- Sample code:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://www.baidu.com')
# Explicit wait
WebDriverWait(driver, 20, 0.5).until(
    EC.presence_of_element_located((By.LINK_TEXT, '好123'))
)
# The argument 20 means wait at most 20 seconds
# The argument 0.5 means check every 0.5 seconds whether the specified tag exists
# EC.presence_of_element_located((By.LINK_TEXT, '好123')) locates the tag by its link text
# Every 0.5 seconds, check whether the tag located by link text exists; if it does, continue; if it still does not when the 20-second limit is reached, raise an exception
print(driver.find_element_by_link_text('好123').get_attribute('href'))
driver.quit()
5.5 Implementing page waiting manually
Having looked at implicit, explicit and forced waits, we find that no single method solves every page-waiting problem, for example the scenario where "the page must be scrolled to trigger asynchronous ajax loading". Let's take the Taobao homepage as an example and implement page waiting manually.
- Principle:
  - Manually combine the ideas of forced waiting and explicit waiting
  - Check repeatedly, either indefinitely or a limited number of times, whether a given element has been loaded (whether it exists)
- The implementation code is as follows:
import time
from selenium import webdriver
driver = webdriver.Chrome('/home/worker/Desktop/driver/chromedriver')
driver.get('https://www.taobao.com/')
time.sleep(1)
# i = 0
# while True:
for i in range(10):
    i += 1
    try:
        time.sleep(3)
        element = driver.find_element_by_xpath('//div[@class="shop-inner"]/h3[1]/a')
        print(element.get_attribute('href'))
        break
    except Exception:
        # The element is not there yet: scroll further down to trigger more loading
        js = 'window.scrollTo(0, {})'.format(i*500) # the js statement
        driver.execute_script(js) # the method that executes js
driver.quit()
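The same idea can be pulled into a reusable polling helper. The sketch below is one possible shape, not part of selenium: wait_until, its parameters and the on_retry hook are all hypothetical names. The demo uses a fake condition that succeeds on the third call so it runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5, on_retry=None):
    """Call condition() until it returns a truthy value or timeout seconds pass.

    on_retry(attempt) is called after each failed check, for example to scroll.
    """
    deadline = time.monotonic() + timeout
    attempt = 0
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            pass  # treat errors such as "element not found" as "not ready yet"
        attempt += 1
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        if on_retry is not None:
            on_retry(attempt)
        time.sleep(poll)


# Demo with a fake condition that fails twice and then succeeds.
calls = {"n": 0}


def ready_on_third_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise LookupError("element not loaded yet")
    return "loaded"


scrolls = []
print(wait_until(ready_on_third_call, timeout=5, poll=0.01, on_retry=scrolls.append))
print(scrolls)
```

In the Taobao example, condition would wrap the find_element_by_xpath call, and on_retry would execute the window.scrollTo statement with attempt * 500 as the offset.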
6. Enabling headless mode in selenium
Most servers have no graphical interface. Selenium can also control Google Chrome in a mode without an interface, known as headless mode. In this section we learn how to enable it.
- How to enable headless mode
  - Instantiate a configuration object
    options = webdriver.ChromeOptions()
  - Add the argument that enables headless mode to the configuration object
    options.add_argument("--headless")
  - Add the argument that disables the gpu to the configuration object
    options.add_argument("--disable-gpu")
  - Instantiate the driver object with the configuration object
    driver = webdriver.Chrome(chrome_options=options)
- Note: headless mode is only available in Chrome 59+ on macos and 57+ on Linux!
- The reference code is as follows:
from selenium import webdriver
options = webdriver.ChromeOptions() # create a configuration object
options.add_argument("--headless") # enable headless mode
options.add_argument("--disable-gpu") # disable the gpu
# options.set_headless() # another way to enable headless mode
driver = webdriver.Chrome(chrome_options=options) # instantiate the driver object with the configuration
driver.get('http://www.itcast.cn')
print(driver.title)
driver.quit()
7. Using a proxy ip with selenium
A browser controlled by selenium can also use a proxy ip.
- How to use a proxy ip
  - Instantiate a configuration object
    options = webdriver.ChromeOptions()
  - Add the argument that sets the proxy ip to the configuration object
    options.add_argument('--proxy-server=http://202.20.16.82:9527')
  - Instantiate the driver object with the configuration object
    driver = webdriver.Chrome('./chromedriver', chrome_options=options)
- The reference code is as follows:
from selenium import webdriver
options = webdriver.ChromeOptions() # create a configuration object
options.add_argument('--proxy-server=http://202.20.16.82:9527') # use a proxy ip
driver = webdriver.Chrome(chrome_options=options) # instantiate the driver object with the configuration
driver.get('http://www.itcast.cn')
print(driver.title)
driver.quit()
8. Replacing the user-agent in selenium
When selenium controls Google Chrome, the User-Agent defaults to that of Google Chrome. In this section we learn how to use a different User-Agent.
- How to replace the user-agent
  - Instantiate a configuration object
    options = webdriver.ChromeOptions()
  - Add the argument that replaces the UA to the configuration object
    options.add_argument('--user-agent=Mozilla/5.0 HAHA')
  - Instantiate the driver object with the configuration object
    driver = webdriver.Chrome('./chromedriver', chrome_options=options)
- The reference code is as follows:
from selenium import webdriver
options = webdriver.ChromeOptions() # create a configuration object
options.add_argument('--user-agent=Mozilla/5.0 HAHA') # replace the User-Agent
driver = webdriver.Chrome('./chromedriver', chrome_options=options)
driver.get('http://www.itcast.cn')
print(driver.title)
driver.quit()
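Headless mode, the proxy and the user-agent are all set through command-line arguments on the same configuration object, so they can be combined. The sketch below only assembles the argument strings and does not start a browser; build_chrome_arguments is a hypothetical helper, and the proxy address and UA string are the sample values used above.

```python
def build_chrome_arguments(headless=False, proxy=None, user_agent=None):
    """Return the Chrome command-line arguments for the requested settings."""
    args = []
    if headless:
        args += ["--headless", "--disable-gpu"]
    if proxy:
        args.append(f"--proxy-server={proxy}")
    if user_agent:
        args.append(f"--user-agent={user_agent}")
    return args


arguments = build_chrome_arguments(
    headless=True,
    proxy="http://202.20.16.82:9527",
    user_agent="Mozilla/5.0 HAHA",
)
print(arguments)
```

Each string would then be added to the configuration object with options.add_argument(arg) before the driver is instantiated.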