Web of Science crawler in practice (simulating the browser)

I learned this from another blogger's crawler write-up. Until now I had only written crawlers that parse static pages or simply construct URLs by hand. From this project I came into contact with the following:
- xpath
- selenium WebDriver
- etree
Only the parts used in this article are introduced here.
xpath
An element-lookup method; with it you can locate almost any element on a page. XPath is short for XML Path Language. Since a well-formed HTML document can be treated as XML, we can use XPath syntax to locate page elements.
The XPath query used in this project:
titleList = tree.xpath("//a[@class='smallV110']/value/text()")  # article titles
'''
// searches the whole document for this tag (an absolute path);
@ filters by attribute value.
'''
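To see the query in isolation, here is a minimal sketch. The markup below only mimics the structure the query targets (an `<a class="smallV110">` wrapping a `<value>` tag); the real WOS result page is far more complex.

```python
from lxml import etree

# Toy fragment imitating the result-page structure this post scrapes.
html = """
<div>
  <a class="smallV110" href="#"><value>Paper One</value></a>
  <a class="smallV110" href="#"><value>Paper Two</value></a>
  <a class="other" href="#"><value>Not a result</value></a>
</div>
"""

tree = etree.HTML(html)
# // searches the whole document; [@class='smallV110'] filters by attribute;
# /value/text() descends into the child tag and takes its text.
titles = tree.xpath("//a[@class='smallV110']/value/text()")
print(titles)  # ['Paper One', 'Paper Two']
```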
webdriver
WebDriver is the set of APIs in the Selenium project for driving browsers; in Python it is provided by the third-party selenium package. All automation boils down to simulating mouse and keyboard actions on page elements: clicking, typing, hovering, and so on.
driver = webdriver.Chrome()
url = 'http://apps.webofknowledge.com/UA_ClearGeneralSearch.do?action=clear&product=UA&search_mode=GeneralSearch&SID=5DWLAqTxJHRNqCCxms5'
driver.get(url)
driver.find_element_by_id("clearIcon1").click()  # clear any cached text in the search box
driver.find_element_by_id("value(input1)").send_keys(keyword)  # type the keyword into the search box
driver.find_element_by_xpath("//span[@class='searchButton']/button").click()  # click the search button
newurl = driver.current_url  # URL of the results page
driver.close()
etree
etree stores a document as a tree of nodes. When I first met it I could not tell it apart from BeautifulSoup; both build a tree, but their focus differs. etree can be understood as a lightweight data structure, while BeautifulSoup is a complete system for analyzing pages.
Here is the full code:
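A small sketch of what "etree as a data structure" means: you parse a fragment and walk the node tree directly, where BeautifulSoup would wrap the same navigation in a higher-level API.

```python
from lxml import etree

# etree.HTML wraps a fragment in <html><body> automatically.
tree = etree.HTML("<div><p>hello</p><p>world</p></div>")

body = tree.find("body")          # relative search among child nodes
div = body[0]                     # children are indexable like a list
print(div.tag)                    # div
print([p.text for p in div])      # ['hello', 'world']
print(div[0].getparent().tag)     # each node knows its parent
```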
import requests
from lxml import etree
from selenium import webdriver

def geturl(keyword):
    driver = webdriver.Chrome()
    url = 'http://apps.webofknowledge.com/UA_ClearGeneralSearch.do?action=clear&product=UA&search_mode=GeneralSearch&SID=5DWLAqTxJHRNqCCxms5'
    driver.get(url)
    driver.find_element_by_id("clearIcon1").click()  # clear any cached text in the search box
    driver.find_element_by_id("value(input1)").send_keys(keyword)  # type the keyword into the search box
    driver.find_element_by_xpath("//span[@class='searchButton']/button").click()  # click the search button
    newurl = driver.current_url  # URL of the results page
    driver.close()
    return newurl  # return the results-page URL

def getHTMLText(url):
    try:
        kv = {'user-agent': 'Mozilla/5.0'}
        r = requests.get(url, headers=kv, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception:
        return ""

def parsePage(html):
    try:
        tree = etree.HTML(html)
        titleList = tree.xpath("//a[@class='smallV110']/value/text()")  # article titles
        print(titleList)
    except Exception:
        return ""

def main():
    keyword = "big data"  # the keyword to search for
    url = geturl(keyword)  # get the results-page URL
    print(url)
    html = getHTMLText(url)
    parsePage(html)

main()
A few pitfalls:
- Download the chromedriver.exe that matches your Chrome version and add it to your PATH; alternatively, just put chromedriver.exe in the project directory.
- WOS appears to have gone through a maintenance update that changed a lot of the markup, so some of the original blogger's code no longer works.
- driver.find_elements_by_id() and driver.find_element_by_id() are two different functions and must not be mixed up: the former returns a list of matches, the latter a single element.
- Selenium WebDriver raises several common errors; it is worth sorting them out so you can tell "element not found" apart from "wrong element found".
- find_element_by_class_name does not accept compound class names (names containing spaces) as an argument.
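The compound-class-name restriction can be sidestepped with an XPath contains() predicate. The sketch below demonstrates it with lxml; the same expression can be passed to driver.find_element_by_xpath().

```python
from lxml import etree

# Elements often carry several classes at once; a class-name lookup
# rejects "smallV110 snowplow-full-record" as an argument.
html = '<a class="smallV110 snowplow-full-record"><value>Title</value></a>'
tree = etree.HTML(html)

# contains(@class, ...) matches one class inside a compound class attribute.
titles = tree.xpath("//a[contains(@class, 'smallV110')]/value/text()")
print(titles)  # ['Title']
```

Note that contains() is a substring test, so it would also match a class like smallV1100; for exact matching, check the full token instead.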
The above only scrapes the titleList; other fields can be obtained the same way. This approach is slow, though, so next I will try the POST method.
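A rough sketch of what the POST approach could look like. Only the payload construction is shown; the field names below are taken from the search-box id and the URL parameters seen earlier in this post, but the complete set of form fields and the form's action URL must be read from the actual WOS page (for example via the browser's network tab).

```python
import requests

def build_search_payload(keyword, sid):
    # Field names inferred from earlier in this post; treat them as a
    # starting point, not the definitive form layout.
    return {
        "value(input1)": keyword,       # the search-box id used above
        "SID": sid,                     # session id, as seen in the URL above
        "search_mode": "GeneralSearch",
    }

payload = build_search_payload("big data", "5DWLAqTxJHRNqCCxms5")
# session = requests.Session()
# r = session.post(SEARCH_URL, data=payload)  # SEARCH_URL: the form's action URL
print(payload["value(input1)"])  # big data
```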
References:
- Selenium automation script error summary
- Selenium error prompts