Python + Selenium + ChromeDriver: driving the Chrome browser to scrape web page content and automatically fill in forms (working code)

Summary

Last updated: 2020.08.20 (experimental part still to be updated)
Article type: practical application (not a theory explainer)
This article describes how to use the selenium library with the Chrome browser to automatically locate page elements, scrape data, and fill in forms, saving a great deal of manual work. To make selenium easier to use and to simplify error handling, the library is also wrapped in a thin layer of its own, so that readers can start programming quickly once they understand the basics of selenium.
1. Knowledge points covered: (1) installing the selenium library, (2) the ways selenium locates elements, (3) a simple wrapper around the selenium library.
2. Structure of this article: each knowledge point is introduced briefly, then followed by a complete code block that can be copied and run directly, so readers can debug each snippet themselves.
3. Method: the Baidu homepage is used as the target web page, Google Chrome as the experimental platform, and Python + selenium to drive the browser. PS: further web-scraping experiments will be added later.
4. Experiments: (1) scraping A-share market data in the background (updated September 9, 2020), (2) reading the latest emails in a QQ mailbox in the background (to be updated), with more experiments to come (all with detailed code and comments; the corresponding links are at the end of this article).

Tip: the main content of this article follows; the cases below are for reference.


One, install the selenium library and related files

The selenium library is an automation tool used in Python for crawling websites. Supported browsers include mainstream GUI browsers such as Chrome, Firefox, and Safari, and it works on multiple operating systems such as Windows, Linux, iOS, and Android.

1. Install the selenium library

(1) Open the Windows 10 Start menu, type cmd, right-click the result, and run it as administrator.
(2) If the "add to PATH" option was checked when installing Python, you can enter the following command directly (if it was not, it is recommended to uninstall Python, reinstall it, and check the add-to-PATH option; weigh this suggestion carefully, because previously installed libraries will have to be reinstalled):

pip install selenium

(3) Wait for the download to finish. If it reports a timeout, try the following command to switch to a mirror source and reinstall:

pip install selenium -i https://pypi.douban.com/simple
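
Optionally, you can confirm that the installation worked by printing the installed version (a minimal check; the code in this article uses the Selenium 3 find_element_by_* API):

import selenium
##Print the installed selenium version##
print(selenium.__version__)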

2. Download Google Chrome related files

The browser used in this article is Google Chrome, so only the crawling method of Google Chrome is introduced. The methods of other browsers are similar.

(1) Download and install Google Chrome.
(2) After installing Google Chrome, check its version number: click the three dots in the upper right corner and select Help > About Google Chrome to view the version number.
The version used in this article is 84.0.4147.125.
(3) Download the Google Chrome driver (ChromeDriver).
Open the driver download page and find the driver version matching your browser version. For 84.0.4147.125, look for the driver whose version begins with 84.0, open it, and download the build for your operating system. Then unzip it into the folder where you plan to write the project.
Do not place it in the Python directory or the browser installation directory; if you do, all sorts of bugs will appear when the project is moved to another computer, because your machine has the corresponding libraries and driver installed while the target machine may not.


Two, selenium quick start

1. Eight ways to locate elements

(1) id
(2) name
(3) xpath
(4) link text
(5) partial link text
(6) tag name
(7) class name
(8) css selector

2. id method

(1) Locating an element by id in selenium: find_element_by_id
Taking the Baidu homepage as an example, the page source around Baidu's input box looks like this:

<html>
  <head>
  <body link="#0000cc">
    <a id="result_logo" href="/" onmousedown="return c({
     
     'fm':'tab','tab':'logo'})">
    <form id="form" class="fm" name="f" action="/s">
      <span class="soutu-btn"></span>
        <input id="kw" class="s_ipt" name="wd" value="" maxlength="255" 
        .......
        <input type="submit" value="百度一下" id="su" class="btn self-btn bg s_btn">

Here, <input id="kw" class="s_ipt" name="wd" value="" maxlength="255"> is the HTML of the search input box, and <input type="submit" value="百度一下" id="su" class="btn self-btn bg s_btn"> is the HTML of the 百度一下 (search) button.
The code and comments for locating these elements by id are as follows:

import os
import sys
import time
from selenium import webdriver
##This way of getting the working folder path works both when run as a .py file and after packaging as an .exe##
当前工作主路径 = os.path.dirname(os.path.realpath(sys.argv[0]))
##Configure the Chrome driver path##
谷歌驱动器驱动 = 当前工作主路径+"/"+"chromedriver.exe"
##Initialize the selenium driver##
浏览器驱动 = webdriver.Chrome(executable_path=谷歌驱动器驱动)
##Open the link; if you switch to another URL, note that it must start with http://##
浏览器驱动.get("http://www.baidu.com")
##Locate the search box by id and clear its contents##
浏览器驱动.find_element_by_id("kw").clear()
##Locate the search box by id and type the query##
浏览器驱动.find_element_by_id("kw").send_keys("python+selenium库 实现爬虫抓取网页数据内容并自动填表的解决方法并附已交付甲方实际稳定运行的代码")
##Locate the 百度一下 button by id and click it to search##
浏览器驱动.find_element_by_id("su").click()
##Wait for 5 seconds##
time.sleep(5)
###Close and quit the browser###
浏览器驱动.quit() 

3. name method

Still taking the Baidu input box as an example, locating an element by name in selenium: find_element_by_name
The code and comments are as follows:

import os
import sys
import time
from selenium import webdriver
##This way of getting the working folder path works both when run as a .py file and after packaging as an .exe##
当前工作主路径 = os.path.dirname(os.path.realpath(sys.argv[0]))
##Configure the Chrome driver path##
谷歌驱动器驱动 = 当前工作主路径+"/"+"chromedriver.exe"
##Initialize the selenium driver##
浏览器驱动 = webdriver.Chrome(executable_path=谷歌驱动器驱动)
##Open the link; if you switch to another URL, note that it must start with http://##
浏览器驱动.get("http://www.baidu.com")
##Locate the search box by name and clear its contents##
浏览器驱动.find_element_by_name("wd").clear()
##Locate the search box by name and type the query##
浏览器驱动.find_element_by_name("wd").send_keys("python+selenium库 实现爬虫抓取网页数据内容并自动填表的解决方法并附已交付甲方实际稳定运行的代码")
##Locate the 百度一下 button by id and click it to search##
浏览器驱动.find_element_by_id("su").click()
##Wait for 5 seconds##
time.sleep(5)
###Close and quit the browser###
浏览器驱动.quit()   

4. xpath method

Still taking the Baidu input box as an example, locating an element by xpath in selenium: find_element_by_xpath. Note that this method is not well suited to table elements whose position on the page changes, because an xpath points to fixed rows and columns and cannot follow changes in their content.
First, open Baidu in Chrome, right-click in a blank area of the page, and choose 检查 (Inspect) to enter the browser's developer mode. Then right-click on the Baidu input box and choose 检查 (Inspect) again; the automatically highlighted part of the code panel on the right becomes the code of the Baidu input box. Finally, right-click the highlighted code segment, choose Copy, then Copy XPath. The xpath of the Baidu input box is //*[@id="kw"], and the xpath of the 百度一下 button is //*[@id="su"].
The Baidu search code that uses these values, with comments, is below.

import os
import sys
import time
from selenium import webdriver
##This way of getting the working folder path works both when run as a .py file and after packaging as an .exe##
当前工作主路径 = os.path.dirname(os.path.realpath(sys.argv[0]))
##Configure the Chrome driver path##
谷歌驱动器驱动 = 当前工作主路径+"/"+"chromedriver.exe"
##Initialize the selenium driver##
浏览器驱动 = webdriver.Chrome(executable_path=谷歌驱动器驱动)
##Open the link; if you switch to another URL, note that it must start with http://##
浏览器驱动.get("http://www.baidu.com")
##Locate the search box by xpath and clear it; note the mix of single and double quotes##
浏览器驱动.find_element_by_xpath('//*[@id="kw"]').clear()
##Locate the search box by xpath and type the query; note the mix of single and double quotes##
浏览器驱动.find_element_by_xpath('//*[@id="kw"]').send_keys("python+selenium库 实现爬虫抓取网页数据内容并自动填表的解决方法并附已交付甲方实际稳定运行的代码")
##Locate the 百度一下 button by xpath and click it to search; note the mix of single and double quotes##
浏览器驱动.find_element_by_xpath('//*[@id="su"]').click()
##Wait for 5 seconds##
time.sleep(5)
###Close and quit the browser###
浏览器驱动.quit() 

5. Link text and partial link text methods

Taking the 新闻 (News) link on the Baidu homepage as an example:
locating an element by link text: find_element_by_link_text
locating an element by partial link text: find_element_by_partial_link_text
The difference between the two is that the link text method matches the element's full text, while the partial link text method matches only part of it, similar to a fuzzy search. Because many elements can share similar text, the partial link text method is not recommended.
The page code of the news section of the Baidu homepage is as follows:

 <div id="u1">
   <a href="http://news.baidu.com" name="tj_trnews" class="mnav">新闻</a>
   <a href="http://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a>
   <a href="http://map.baidu.com" name="tj_trmap" class="mnav">地图</a>
   <a href="http://v.baidu.com" name="tj_trvideo" class="mnav">视频</a>

Using the link text method to click through to the news page, the code and comments are as follows:

import os
import sys
import time
from selenium import webdriver
##This way of getting the working folder path works both when run as a .py file and after packaging as an .exe##
当前工作主路径 = os.path.dirname(os.path.realpath(sys.argv[0]))
##Configure the Chrome driver path##
谷歌驱动器驱动 = 当前工作主路径+"/"+"chromedriver.exe"
##Initialize the selenium driver##
浏览器驱动 = webdriver.Chrome(executable_path=谷歌驱动器驱动)
##Open the link; if you switch to another URL, note that it must start with http://##
浏览器驱动.get("http://www.baidu.com")
##Locate the 新闻 link by link text and click it##
浏览器驱动.find_element_by_link_text('新闻').click()
##Locate the link by partial link text and click it; when actually running, use only one of the two##
##浏览器驱动.find_element_by_partial_link_text('新').click()
##Wait for 5 seconds##
time.sleep(5)
###Close and quit the browser###
浏览器驱动.quit() 

6. Tag name and class name methods

These two methods are not recommended: tag name and class name values repeat so often that it is hard to locate a unique element with them, so this article does not study them in depth. A minimal illustrative sketch is shown below.
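
For completeness, here is a minimal illustrative sketch of both locators against the Baidu page shown above (assumptions: the input box still has tag input and class s_ipt as in the page source above, and the Selenium 3 find_element_by_* API used throughout this article is available):

import os
import sys
import time
from selenium import webdriver
当前工作主路径 = os.path.dirname(os.path.realpath(sys.argv[0]))
浏览器驱动 = webdriver.Chrome(executable_path=当前工作主路径+"/"+"chromedriver.exe")
浏览器驱动.get("http://www.baidu.com")
##By tag name: returns the first <input> on the page, which is not necessarily the search box##
第一个输入元素 = 浏览器驱动.find_element_by_tag_name("input")
##By class name: the search box carries class "s_ipt" in the page source shown above##
浏览器驱动.find_element_by_class_name("s_ipt").send_keys("selenium")
time.sleep(5)
浏览器驱动.quit()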

7. css selector method

Still taking the Baidu input box as an example, locating an element by css selector in selenium: find_element_by_css_selector. As shown above, <input id="kw" class="s_ipt" name="wd" value="" maxlength="255"> is the HTML of the input box and <input type="submit" value="百度一下" id="su" class="btn self-btn bg s_btn"> is the HTML of the 百度一下 button. The code and comments for locating these elements with css selectors are as follows:

import os
import sys
import time
from selenium import webdriver
##This way of getting the working folder path works both when run as a .py file and after packaging as an .exe##
当前工作主路径 = os.path.dirname(os.path.realpath(sys.argv[0]))
##Configure the Chrome driver path##
谷歌驱动器驱动 = 当前工作主路径+"/"+"chromedriver.exe"
##Initialize the selenium driver##
浏览器驱动 = webdriver.Chrome(executable_path=谷歌驱动器驱动)
##Open the link; if you switch to another URL, note that it must start with http://##
浏览器驱动.get("http://www.baidu.com")
##Locate the search box by css selector and clear it; note the escaped double quotes##
浏览器驱动.find_element_by_css_selector("input[id=\"kw\"]").clear()
##Locate the search box by css selector and type the query; note the escaped double quotes##
浏览器驱动.find_element_by_css_selector("input[id=\"kw\"]").send_keys("python+selenium库 实现爬虫抓取网页数据内容并自动填表的解决方法并附已交付甲方实际稳定运行的代码")
##Locate the 百度一下 button by css selector and click it to search; note the escaped double quotes##
浏览器驱动.find_element_by_css_selector("input[type=\"submit\"]").click()
##Wait for 5 seconds##
time.sleep(5)
###Close and quit the browser###
浏览器驱动.quit() 

Three, my simple wrapper

Most of the elements located on a website are either type-value (text input) elements or click elements. For convenience, the selenium library is wrapped a little further. Part of the code is below (it will be updated; the concrete usage can be seen in the project practice code and its comments):
(1) Type-value (input) elements

def 网站元素值更改(self,元素名称,更改数值,方法):
        ##Clear an input element and type a new value, retrying until the element can be operated on##
        元素出现 = False
        while(元素出现 == False):
            try:
                if 方法 == "xpath":
                    self.浏览器驱动.find_element_by_xpath(元素名称).clear()
                    self.浏览器驱动.find_element_by_xpath(元素名称).send_keys(更改数值)
                if 方法 == "name":
                    self.浏览器驱动.find_element_by_name(元素名称).clear()
                    self.浏览器驱动.find_element_by_name(元素名称).send_keys(更改数值)
                元素出现 = True
            except:
                ##Element not ready yet; retry after 0.1 s##
                元素出现 = False
            time.sleep(0.1)

(2) Click elements

def Y网站元素值点击(self,元素名称,方法):
        ##Click an element, retrying until it can be operated on; 循环次数 counts the retries##
        元素出现 = False
        循环次数 = 0
        while(元素出现 == False):
            try:
                if 方法 == "xpath":
                    self.浏览器驱动.find_element_by_xpath(元素名称).click()
                if 方法 == "text":
                    self.浏览器驱动.find_element_by_link_text(元素名称).click()
                元素出现 = True
            except: 
                ##Element not ready yet; retry after 0.1 s##
                元素出现 = False
            循环次数 = 循环次数+1
            time.sleep(0.1)
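
To show how these two methods might be used together, here is a minimal sketch that places them in a hypothetical wrapper class (the class name 浏览器助手, the __init__, and the example calls at the bottom are my own assumptions for illustration; the two methods themselves are the ones above, and the Selenium 3 API used throughout this article is assumed):

import os
import sys
import time
from selenium import webdriver

class 浏览器助手:
    ##Hypothetical wrapper class: holds the webdriver and the two methods shown above##
    def __init__(self, 驱动路径):
        self.浏览器驱动 = webdriver.Chrome(executable_path=驱动路径)

    def 网站元素值更改(self, 元素名称, 更改数值, 方法):
        元素出现 = False
        while 元素出现 == False:
            try:
                if 方法 == "xpath":
                    self.浏览器驱动.find_element_by_xpath(元素名称).clear()
                    self.浏览器驱动.find_element_by_xpath(元素名称).send_keys(更改数值)
                if 方法 == "name":
                    self.浏览器驱动.find_element_by_name(元素名称).clear()
                    self.浏览器驱动.find_element_by_name(元素名称).send_keys(更改数值)
                元素出现 = True
            except:
                元素出现 = False
            time.sleep(0.1)

    def Y网站元素值点击(self, 元素名称, 方法):
        元素出现 = False
        while 元素出现 == False:
            try:
                if 方法 == "xpath":
                    self.浏览器驱动.find_element_by_xpath(元素名称).click()
                if 方法 == "text":
                    self.浏览器驱动.find_element_by_link_text(元素名称).click()
                元素出现 = True
            except:
                元素出现 = False
            time.sleep(0.1)

##Example usage: search Baidu through the wrapper##
当前工作主路径 = os.path.dirname(os.path.realpath(sys.argv[0]))
助手 = 浏览器助手(当前工作主路径 + "/" + "chromedriver.exe")
助手.浏览器驱动.get("http://www.baidu.com")
##Type into the search box by name, then click 百度一下 by xpath##
助手.网站元素值更改("wd", "selenium", "name")
助手.Y网站元素值点击('//*[@id="su"]', "xpath")
time.sleep(5)
助手.浏览器驱动.quit()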

Four, experiments

1. Scraping A-share market data in the background — portal (updated)

2. Automatically reading the latest mail in a QQ mailbox — portal (to be updated)


Source: blog.csdn.net/baidu_37611158/article/details/108083986