Python Crawler Series, Part 11: Advanced selenium Summary (cookie handling, IP proxies, User-Agent replacement)

One. Introduction to selenium

1. Selenium in action

Selenium is a web automation tool that was originally developed for automated website testing. It drives a real browser directly and supports all mainstream browsers, including headless (interfaceless) browsers such as PhantomJS. Selenium accepts instructions and makes the browser load pages, extract the required data, and even take screenshots automatically, so it can easily do the work of the crawlers we wrote earlier. Let's first see selenium in action.

1.1 Running with the Chrome browser

After downloading chromedriver and installing the selenium module, run the following code and watch what happens:

from selenium import webdriver

# If the driver is not on the PATH environment variable, pass its absolute path via executable_path
# driver = webdriver.Chrome(executable_path='/home/worker/Desktop/driver/chromedriver')

# If the driver is on the PATH, executable_path is not needed
driver = webdriver.Chrome()

# Send a request to a url
driver.get("http://www.itcast.cn/")

# Save the page as an image; the screenshot feature reportedly does not work in Chrome 69 and above
# driver.save_screenshot("itcast.png")

print(driver.title) # Print the page title

# Quit the simulated browser
driver.quit() # Always quit! Otherwise leftover processes remain.

1.2 Running with the PhantomJS headless browser

PhantomJS is a Webkit-based "headless" browser that loads the website into memory and executes JavaScript on the page. Download link: http://phantomjs.org/download.html

from selenium import webdriver

# Specify the absolute path of the driver
driver = webdriver.PhantomJS(executable_path='/home/worker/Desktop/driver/phantomjs')
# driver = webdriver.Chrome(executable_path='/home/worker/Desktop/driver/chromedriver')

# Send a request to a url
driver.get("http://www.itcast.cn/")

# Save the page as an image
driver.save_screenshot("itcast.png")

# Quit the simulated browser
driver.quit() # Always quit! Otherwise leftover processes remain.

1.3 What we observed

  • Python code can automatically drive Google Chrome or the PhantomJS headless browser and make it visit websites on its own

1.4 Usage scenarios of headless and headed browsers

  • During development we usually need to watch what happens while the code runs, so we usually use a headed browser
  • Once the project is finished and deployed, the platform usually runs a server edition of the operating system, which has no graphical interface, so a headless browser is needed for the code to run properly

2. The role and working principle of selenium

Selenium wraps the browser's native API in the more object-oriented Selenium WebDriver API, which operates directly on the elements of a page and even on the browser itself (screenshots, window size, startup, shutdown, plug-in installation, certificate configuration, and so on).

  • A webdriver is essentially a web server that exposes a web API wrapping the browser's functions
  • Different browsers use different webdrivers
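To make the "web server" idea more concrete, here is a rough sketch of what two webdriver commands look like as plain HTTP requests. It only builds the request descriptions and sends nothing. The endpoint paths follow the W3C WebDriver protocol; the port 9515 (chromedriver's default), the helper names and the session id are assumptions made for illustration.

```python
import json

# Assumption: chromedriver is listening on its default port 9515 on this machine.
BASE_URL = "http://127.0.0.1:9515"


def navigate_command(session_id, url):
    """Describe the HTTP request that makes the browser open `url`."""
    return {
        "method": "POST",
        "url": f"{BASE_URL}/session/{session_id}/url",
        "body": json.dumps({"url": url}),
    }


def title_command(session_id):
    """Describe the HTTP request that reads the title of the current page."""
    return {
        "method": "GET",
        "url": f"{BASE_URL}/session/{session_id}/title",
        "body": None,
    }


# "abc123" is a made-up session id; a real one is returned by POST /session.
print(navigate_command("abc123", "http://www.itcast.cn/"))
print(title_command("abc123"))
```

Calls such as driver.get(...) and driver.title are, roughly speaking, thin wrappers that send requests of this kind to the webdriver process.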

3. Installation and simple use of selenium

Let's take Google Chrome's chromedriver as an example

3.1 Install the selenium module in the python virtual environment

pip/pip3 install selenium

3.2 Download the webdriver that matches your browser version

Take Google Chrome as an example

  1. Check the version of Google Chrome

  2. Visit https://npm.taobao.org/mirrors/chromedriver and open the download page that lists the different chromedriver versions

  3. Click notes.txt to open the version notes page

  4. Look up which chromedriver version matches your Chrome version

  5. Download the correct chromedriver build for your operating system

  6. Unpack the archive to get the Chrome webdriver executable that the python code can call

    • on windows it is chromedriver.exe

    • on linux and macos it is chromedriver

  7. Configure the environment for chromedriver

    • On windows, add the directory that contains chromedriver.exe to the PATH environment variable
    • On linux/mac, add the directory that contains chromedriver to the system PATH environment variable

4. Simple use of selenium

Next, we will simulate Baidu search through code

import time
from selenium import webdriver

# Instantiate the driver by giving the path to chromedriver; here chromedriver is in the current directory
# driver = webdriver.Chrome(executable_path='./chromedriver')
# chromedriver is already on the PATH
driver = webdriver.Chrome()

# Make the browser visit the url
driver.get("https://www.baidu.com/")

# Type 'python' into the Baidu search box
driver.find_element_by_id('kw').send_keys('python')
# Click the 'Baidu search' button
driver.find_element_by_id('su').click()

time.sleep(6)
# Quit the browser
driver.quit()
  • webdriver.Chrome(executable_path='./chromedriver'): the executable_path parameter specifies the path of the downloaded chromedriver file
  • driver.find_element_by_id('kw').send_keys('python'): locates the tag whose id attribute is 'kw' and types the string 'python' into it
  • driver.find_element_by_id('su').click(): locates the tag whose id attribute is 'su' and clicks it
    • click() triggers the tag's js click event

Two. Extracting data with selenium

  • Knowledge points:

    • Understand the common attributes and methods of the driver object
    • Master how the driver object locates tag elements and returns element objects
    • Master how to extract text and attribute values from element objects

1. Common attributes and methods of the driver object

In the process of using selenium, after instantiating the driver object, the driver object has some commonly used attributes and methods

  1. driver.page_source: the source of the current tab's page after the browser has rendered it
  2. driver.current_url: the url of the current tab
  3. driver.close(): close the current tab; if it is the only tab, the whole browser closes
  4. driver.quit(): close the browser
  5. driver.forward(): go forward one page
  6. driver.back(): go back one page
  7. driver.save_screenshot(img_name): take a screenshot of the page

2. Locating tag elements with the driver object

selenium offers many ways to locate a tag and return an element object

find_element_by_id 						(returns one element, located by its id)
find_element(s)_by_class_name 			(locates elements by class name)
find_element(s)_by_name 				(locates elements by the value of the name attribute)
find_element(s)_by_xpath 				(locates elements by an xpath expression)
find_element(s)_by_link_text 			(locates elements by their full link text)
find_element(s)_by_partial_link_text 	(locates elements whose link text contains the given text)
find_element(s)_by_tag_name 			(locates elements by tag name)
find_element(s)_by_css_selector 		(locates elements by a css selector)
  • note:
    • The difference between find_element and find_elements:
      • With the s, a list is returned; without the s, the first matching element object is returned
      • find_element raises an exception when nothing matches, while find_elements returns an empty list
    • The difference between by_link_text and by_partial_link_text: the former matches the full link text, the latter matches links that contain the given text
    • How to use the functions above
      • driver.find_element_by_id('id_str')
3. Extracting text content and attribute values from element objects

find_element only returns the element, not the data inside it. To get the data or interact with the element, use the following:

  • Click an element: element.click()

    • Clicks the located element object
  • Type into an input box: element.send_keys(data)

    • Sends data to the located element object
  • Get the text: element.text

    • The text property of the located element object returns its text content
  • Get an attribute value: element.get_attribute("attribute name")

    • The get_attribute function of the located element object takes an attribute name and returns that attribute's value
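As a small illustration of text and get_attribute working together, the sketch below turns a list of located a tags into (text, href) pairs. The function name extract_links is made up for this example, and it assumes the list comes from a find_elements call.

```python
def extract_links(elements):
    """Return (text, href) pairs for a list of located <a> element objects."""
    results = []
    for element in elements:
        text = element.text.strip()           # visible text of the tag
        href = element.get_attribute("href")  # None when the attribute is missing
        if href:
            results.append((text, href))
    return results


# Usage with a real driver (not executed here):
# for text, href in extract_links(driver.find_elements_by_tag_name("a")):
#     print(text, href)
```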

Three. Other ways to use selenium

  • Knowledge points:

    • Master switching tabs with selenium
    • Master switching iframes with selenium
    • Master getting cookies with selenium
    • Master implementing page waiting by hand
    • Master making the browser execute js code with selenium
    • Master turning on headless (interfaceless) mode in selenium
    • Understand how to use a proxy ip with selenium
    • Understand how to replace the user-agent with selenium

1. Switching tabs with selenium

When selenium has opened several tabs in the browser, how do we switch between them? It takes two steps:

  • Get the window handles of all tabs

  • Use a window handle to switch to the tab it points to

  • Concrete method

    # 1. Get the list of handles of all currently open tabs
    current_windows = driver.window_handles
    
    # 2. Switch by indexing into the list of tab handles
    driver.switch_to.window(current_windows[0])
    
  • Reference code example:

    import time
    from selenium import webdriver
    
    driver = webdriver.Chrome()
    driver.get("https://www.baidu.com/")
    
    time.sleep(1)
    driver.find_element_by_id('kw').send_keys('python')
    time.sleep(1)
    driver.find_element_by_id('su').click()
    time.sleep(1)
    
    # Open a new tab by executing js
    js = 'window.open("https://www.sogou.com");'
    driver.execute_script(js)
    time.sleep(1)
    
    # 1. Get all current windows
    windows = driver.window_handles
    
    time.sleep(2)
    # 2. Switch by window index
    driver.switch_to.window(windows[0])
    driver.switch_to.window(windows[0])
    time.sleep(2)
    driver.switch_to.window(windows[1])
    
    time.sleep(6)
    driver.quit()
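Switching by index works when we know how many tabs are open. As an alternative, here is a sketch that finds the newly opened tab by comparing the handle list before and after opening it. The helper name switch_to_new_tab is an assumption of this example, not part of selenium.

```python
def switch_to_new_tab(driver, handles_before):
    """Switch to the tab whose handle is not in `handles_before` and return that handle."""
    new_handles = [h for h in driver.window_handles if h not in handles_before]
    if not new_handles:
        raise RuntimeError("no new tab was opened")
    driver.switch_to.window(new_handles[0])
    return new_handles[0]


# Usage with a real driver (not executed here):
# before = driver.window_handles
# driver.execute_script('window.open("https://www.sogou.com");')
# switch_to_new_tab(driver, before)
```

This avoids relying on the position of the new tab in driver.window_handles.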
    

2. Switching into frames with switch_to

An iframe is a common html technique in which one page is nested inside another. By default selenium cannot access the content inside a frame; the solution is driver.switch_to.frame(frame_element). We will learn this through a simulated login to QQ Mail.

  • Reference Code:

    import time
    from selenium import webdriver
    
    driver = webdriver.Chrome()
    
    url = 'https://mail.qq.com/cgi-bin/loginpage'
    driver.get(url)
    time.sleep(2)
    
    login_frame = driver.find_element_by_id('login_frame') # Locate the frame element by id
    driver.switch_to.frame(login_frame) # Switch into that frame
    
    driver.find_element_by_xpath('//*[@id="u"]').send_keys('your QQ email address')
    time.sleep(2)
    
    driver.find_element_by_xpath('//*[@id="p"]').send_keys('your email password')
    time.sleep(2)
    
    driver.find_element_by_xpath('//*[@id="login_button"]').click()
    time.sleep(2)
    
    """To operate on elements outside the frame, switch back out first"""
    windows = driver.window_handles
    driver.switch_to.window(windows[0])
    
    content = driver.find_element_by_class_name('login_pictures_title').text
    print(content)
    
    driver.quit()
    
  • Summary:

    • Switch into the page nested in the located frame tag

      • driver.switch_to.frame(frame or iframe element object located with a find_element_by function)
    • Switch back out of the frame by switching tabs

      • windows = driver.window_handles
        driver.switch_to.window(windows[0])
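Besides switching tabs, selenium also provides driver.switch_to.default_content(), which returns to the top-level page directly. The sketch below wraps both steps in a context manager so that leaving the frame is never forgotten. The name inside_frame is made up for this example.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_element):
    """Hypothetical helper: work inside a frame, then always return to the top-level page."""
    driver.switch_to.frame(frame_element)
    try:
        yield driver
    finally:
        # Runs even if the code inside the with block raises an exception
        driver.switch_to.default_content()


# Usage with a real driver (not executed here):
# login_frame = driver.find_element_by_id('login_frame')
# with inside_frame(driver, login_frame):
#     driver.find_element_by_xpath('//*[@id="u"]').send_keys('your QQ email address')
```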
        

3. Handling cookies with selenium

Selenium can handle the page's cookies for us, for example getting and deleting them. This section covers that.

3.1 Get cookies

driver.get_cookies() returns a list that contains the complete cookie information: not only name and value, but also other fields such as domain. So to use the cookies with the requests module, convert the list into a dictionary with each cookie's name as the key and its value as the value.

# Get all cookie information for the current tab
print(driver.get_cookies())
# Convert the cookies into a dictionary
cookies_dict = {cookie['name']: cookie['value'] for cookie in driver.get_cookies()}
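To show the conversion without a running browser, here is a minimal sketch that works on a made-up list in the same shape that driver.get_cookies() returns. The function names and the sample values are assumptions for illustration only.

```python
def cookies_to_dict(cookie_list):
    """Keep only the name/value pairs from the list returned by driver.get_cookies()."""
    return {cookie["name"]: cookie["value"] for cookie in cookie_list}


def cookie_header(cookie_dict):
    """Join the name/value pairs into the value of an HTTP Cookie header."""
    return "; ".join(f"{name}={value}" for name, value in cookie_dict.items())


# Made-up sample in the shape that get_cookies() returns
sample = [
    {"name": "sessionid", "value": "abc", "domain": ".example.com", "path": "/"},
    {"name": "lang", "value": "en", "domain": ".example.com", "path": "/"},
]
cookies = cookies_to_dict(sample)
print(cookies)
print(cookie_header(cookies))
```

The dictionary can then be passed to the requests module, for example requests.get(url, cookies=cookies).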

3.2 Delete cookies

# Delete a single cookie
driver.delete_cookie("CookieName")

# Delete all cookies
driver.delete_all_cookies()

4. Executing js code in the browser with selenium

Selenium can make the browser execute js code that we specify. Run the following code to see the effect:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.itcast.cn/")
time.sleep(1)

js = 'window.scrollTo(0,document.body.scrollHeight)' # the js statement
driver.execute_script(js) # execute the js

time.sleep(5)
driver.quit()
  • The method for executing js: driver.execute_script(js)

5. Page waiting

While a page loads, we have to wait for the web server's response. During that time a tag element may not have loaded yet, so it cannot be found. How do we handle this?

  1. Classification of page waiting
  2. Introduction to forced waiting
  3. Introduction to implicit waiting
  4. Introduction to explicit waiting
  5. Implementing page waiting by hand

5.1 Classification of page waiting

First, let's look at how selenium page waiting is classified

  1. Forced wait
  2. Implicit wait
  3. Explicit wait

5.2 Forced waiting (understand)

  • It is simply time.sleep()
  • Its disadvantage is that it is not smart: if the time is set too short, the element has not loaded yet; if it is set too long, time is wasted

5.3 Implicit wait

  • An implicit wait applies to element location. It sets a maximum time during which selenium keeps trying to locate the element; as soon as the element is found, execution moves on to the next step

  • If the element is still not located when the time runs out, a NoSuchElementException is raised

  • Sample code

    from selenium import webdriver
    
    driver = webdriver.Chrome()  
    
    driver.implicitly_wait(10) # implicit wait, at most 10 seconds
    
    driver.get('https://www.baidu.com')
    
    driver.find_element_by_xpath('//*[@id="kw"]') # retried for up to 10 seconds before failing
    
    driver.quit()
    
    

5.4 Explicit wait (understand)

  • The waiting condition is checked at a fixed interval (0.5 seconds by default); as soon as it is met, the wait ends and the following code runs

  • If it is not met, the wait continues until the specified time is exceeded, and then a timeout exception (TimeoutException) is raised

  • Sample code

    from selenium import webdriver  
    from selenium.webdriver.support.wait import WebDriverWait  
    from selenium.webdriver.support import expected_conditions as EC  
    from selenium.webdriver.common.by import By 
    
    driver = webdriver.Chrome()
    
    driver.get('https://www.baidu.com')
    
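    # Explicit wait
    WebDriverWait(driver, 20, 0.5).until(
        EC.presence_of_element_located((By.LINK_TEXT, '好123')))  
    # The argument 20 means wait at most 20 seconds
    # The argument 0.5 means check every 0.5 seconds whether the tag exists
    # EC.presence_of_element_located((By.LINK_TEXT, '好123')) locates the tag by its link text
    # Every 0.5 seconds, check whether a tag with that link text exists; if it does, continue; if it still does not exist when the 20-second limit is reached, raise an exception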
    
    print(driver.find_element_by_link_text('好123').get_attribute('href'))
    driver.quit() 
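To see what an explicit wait does internally, here is a simplified sketch of the polling loop in plain Python. It is not selenium's actual implementation: the real WebDriverWait raises selenium's own TimeoutException and can ignore selected exceptions while polling. The function name wait_until is made up for this example.

```python
import time


def wait_until(condition, timeout=20, poll=0.5):
    """Call `condition` every `poll` seconds until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # condition met: stop waiting
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Usage with a real driver (not executed here):
# links = wait_until(lambda: driver.find_elements_by_link_text('好123'))
```

The usage line relies on find_elements returning an empty list, which is falsy, until the tag appears.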
    

5.5 Implementing page waiting by hand

Having looked at forced, implicit, and explicit waiting, we find that no single method solves every page-waiting problem, for example when the page must be scrolled to trigger asynchronous ajax loading. So let's take the Taobao homepage as an example and implement page waiting by hand

  • Principle:
    • Combine the ideas of forced waiting and explicit waiting in our own code
    • Check repeatedly, either without limit or a limited number of times, whether a certain element has loaded (whether it exists)
  • The implementation code is as follows:
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome('/home/worker/Desktop/driver/chromedriver')

driver.get('https://www.taobao.com/')
time.sleep(1)

# Try at most 10 times (use `while True` with a counter to retry without limit)
for i in range(1, 11):
    try:
        time.sleep(3)
        element = driver.find_element_by_xpath('//div[@class="shop-inner"]/h3[1]/a')
        print(element.get_attribute('href'))
        break
    except NoSuchElementException:
        # Not loaded yet: scroll further down to trigger more ajax loading
        js = 'window.scrollTo(0, {})'.format(i*500) # the js statement
        driver.execute_script(js) # execute the js
driver.quit()
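The same idea can be packaged as a reusable function. The sketch below is one possible variation, not the only way to do it: it uses find_elements_by_xpath, which returns an empty list instead of raising, and passes the scroll position to the js through arguments[0]. The function name and its default values are assumptions of this example.

```python
import time


def scroll_until_found(driver, xpath, max_rounds=10, step=500, pause=3):
    """Scroll down step by step until `xpath` matches; return the element or None."""
    for attempt in range(max_rounds + 1):
        matches = driver.find_elements_by_xpath(xpath)  # empty list if nothing matches
        if matches:
            return matches[0]
        if attempt < max_rounds:
            # Scroll further down, then give the ajax request time to finish
            driver.execute_script("window.scrollTo(0, arguments[0]);", (attempt + 1) * step)
            time.sleep(pause)
    return None


# Usage with a real driver (not executed here):
# element = scroll_until_found(driver, '//div[@class="shop-inner"]/h3[1]/a')
# if element is not None:
#     print(element.get_attribute('href'))
```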

6. Turning on headless mode in selenium

Most servers have no graphical interface. Selenium can also control Google Chrome in interfaceless mode, also called headless mode. This section shows how to turn it on.

  • How to turn on headless mode
    • Instantiate a configuration object
      • options = webdriver.ChromeOptions()
    • Add the switch that turns on headless mode to the configuration object
      • options.add_argument("--headless")
    • Add the switch that disables the gpu to the configuration object
      • options.add_argument("--disable-gpu")
    • Instantiate the driver object with the configuration object
      • driver = webdriver.Chrome(chrome_options=options)
  • Note: headless mode is only available in Chrome 59+ on macos and Chrome 57+ on Linux!
  • The reference code is as follows:
from selenium import webdriver

options = webdriver.ChromeOptions() # create a configuration object
options.add_argument("--headless") # turn on headless mode
options.add_argument("--disable-gpu") # disable the gpu

# options.set_headless() # another way to turn on headless mode
driver = webdriver.Chrome(chrome_options=options) # instantiate a driver object with the configuration

driver.get('http://www.itcast.cn')
print(driver.title)
driver.quit()

7. Using a proxy ip with selenium

A browser controlled by selenium can also use a proxy ip!

  • How to use a proxy ip

    • Instantiate a configuration object
      • options = webdriver.ChromeOptions()
    • Add the switch for the proxy ip to the configuration object
      • options.add_argument('--proxy-server=http://202.20.16.82:9527')
    • Instantiate the driver object with the configuration object
      • driver = webdriver.Chrome('./chromedriver', chrome_options=options)
  • The reference code is as follows:

    from selenium import webdriver
    
    options = webdriver.ChromeOptions() # create a configuration object
    options.add_argument('--proxy-server=http://202.20.16.82:9527') # use a proxy ip
    
    driver = webdriver.Chrome(chrome_options=options) # instantiate a driver object with the configuration
    
    driver.get('http://www.itcast.cn')
    print(driver.title)
    driver.quit()
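A crawler often rotates through several proxies instead of using a single one. Here is a minimal sketch of picking one at random and building the switch for it. The pool contents and the helper name are made up for illustration; the second address is a documentation-only placeholder and must be replaced with a proxy that actually works.

```python
import random

# Hypothetical pool; the first entry is the address used in the text above.
PROXY_POOL = [
    "http://202.20.16.82:9527",
    "http://203.0.113.10:8080",  # placeholder address, replace with a real proxy
]


def pick_proxy_argument(pool, rng=random):
    """Return a --proxy-server switch for one randomly chosen proxy."""
    return "--proxy-server={}".format(rng.choice(pool))


# Usage with a real driver (not executed here):
# options = webdriver.ChromeOptions()
# options.add_argument(pick_proxy_argument(PROXY_POOL))
```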
    

8. Replacing the user-agent in selenium

When selenium controls Google Chrome, the User-Agent defaults to Chrome's own. This section shows how to use a different User-Agent.

  • How to replace the user-agent

    • Instantiate a configuration object
      • options = webdriver.ChromeOptions()
    • Add the switch that replaces the UA to the configuration object
      • options.add_argument('--user-agent=Mozilla/5.0 HAHA')
    • Instantiate the driver object with the configuration object
      • driver = webdriver.Chrome('./chromedriver', chrome_options=options)
  • The reference code is as follows:

    from selenium import webdriver
    
    options = webdriver.ChromeOptions() # create a configuration object
    options.add_argument('--user-agent=Mozilla/5.0 HAHA') # replace the User-Agent
    
    driver = webdriver.Chrome('./chromedriver', chrome_options=options)
    
    driver.get('http://www.itcast.cn')
    print(driver.title)
    driver.quit()
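Headless mode, the proxy ip and the User-Agent are all set through add_argument, so they can be combined on one configuration object. The sketch below collects the switches from sections 6 to 8 in a single list. The function name build_chrome_arguments is an assumption of this example.

```python
def build_chrome_arguments(headless=False, proxy=None, user_agent=None):
    """Hypothetical helper: collect the command-line switches shown in sections 6 to 8."""
    arguments = []
    if headless:
        arguments += ["--headless", "--disable-gpu"]
    if proxy:
        arguments.append("--proxy-server=" + proxy)
    if user_agent:
        arguments.append("--user-agent=" + user_agent)
    return arguments


args = build_chrome_arguments(headless=True, user_agent="Mozilla/5.0 HAHA")
print(args)

# Usage with a real driver (not executed here):
# options = webdriver.ChromeOptions()
# for argument in args:
#     options.add_argument(argument)
# driver = webdriver.Chrome(chrome_options=options)
```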
    


Origin blog.csdn.net/weixin_38640052/article/details/108302512