Web of Science crawler [simulated browser]

I learned this from another blogger's post,
"Web of Science crawler in practice (simulating a browser)".
Before this, I had only written crawlers that parse static pages or simply construct URLs.
Working through this exercise introduced me to the following:

  1. xpath
  2. selenium WebDriver
  3. etree

Only the parts used in this article are covered here.

xpath

XPath is a way of looking up elements; with it you can locate almost any element on a page. XPath is short for XML Path Language. Because an HTML document has the same kind of tree structure as an XML document, XPath syntax can also be used to locate page elements.

titleList = tree.xpath("//a[@class='smallV110']/value/text()")  # document titles
'''
// selects matching tags anywhere in the document; @ filters on an attribute
'''
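To make the expression concrete, here is a minimal sketch that runs the same kind of query against a tiny hand-written fragment. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. ElementTree supports only a subset of XPath (there is no text() step, and the path starts with .// rather than //), and the sample markup is invented for illustration, not copied from a real WOS page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup that mimics the structure the XPath above expects.
SAMPLE = """
<div>
  <a class="smallV110" href="/record/1"><value>Big data analytics</value></a>
  <a class="other" href="/help"><value>Help</value></a>
  <a class="smallV110" href="/record/2"><value>Data mining at scale</value></a>
</div>
"""

root = ET.fromstring(SAMPLE)

# ".//a" looks at every <a> below the root, "[@class='smallV110']" filters on
# the class attribute, and "/value" steps into the child <value> element.
nodes = root.findall(".//a[@class='smallV110']/value")

# ElementTree has no text() step, so the text is read from each node instead.
title_list = [node.text for node in nodes]
print(title_list)
```

The link with class "other" is skipped because the attribute filter does not match it.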

webdriver

webdriver is the set of APIs in the Selenium project designed to drive browsers; in Python it is a third-party library for web automation. Automation here simply means simulating mouse and keyboard actions on page elements: clicking, typing, hovering, and so on.

    # Note: find_element_by_* is the Selenium 3 API; Selenium 4.3+ removed it
    # in favor of driver.find_element(By.ID, ...).
    driver = webdriver.Chrome()
    url = 'http://apps.webofknowledge.com/UA_ClearGeneralSearch.do?action=clear&product=UA&search_mode=GeneralSearch&SID=5DWLAqTxJHRNqCCxms5'
    driver.get(url)
    driver.find_element_by_id("clearIcon1").click()  # click to clear the cached text in the input box
    driver.find_element_by_id("value(input1)").send_keys(keyword)  # simulate typing keyword into the input box
    driver.find_element_by_xpath("//span[@class='searchButton']/button").click()  # simulate clicking the search button
    newurl = driver.current_url  # URL of the new page
    driver.close()

etree

etree represents a document as a tree of nodes. When I first came across it, I could not tell it apart from the BeautifulSoup library, since both expose a tree structure. In fact they have different focuses: etree can be understood as a simple data structure, while BeautifulSoup is a complete system for analyzing pages.
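A small sketch of what "a tree of nodes" means in practice. lxml's etree largely follows the ElementTree API, so the standard library's xml.etree.ElementTree is used here as a stand-in to keep the example free of dependencies. The markup and the describe helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<html><body>"
    "<h1 id='top'>Results</h1>"
    "<ul><li>first</li><li>second</li></ul>"
    "</body></html>"
)


def describe(node, depth=0):
    """Return one (depth, tag, text) tuple per node, parents before children."""
    rows = [(depth, node.tag, (node.text or "").strip())]
    for child in node:  # iterating an element yields its direct children
        rows.extend(describe(child, depth + 1))
    return rows


rows = describe(root)
for depth, tag, text in rows:
    print("  " * depth + tag, text)

# Attributes of a node are exposed as a plain dict.
print(root.find("body/h1").attrib)
```

Each node carries only a tag, attributes, text, and children, which is why etree feels like a plain data structure.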

Here is the full code:

import requests
from lxml import etree
from selenium import webdriver

def geturl(keyword):
    driver = webdriver.Chrome()
    url = 'http://apps.webofknowledge.com/UA_ClearGeneralSearch.do?action=clear&product=UA&search_mode=GeneralSearch&SID=5DWLAqTxJHRNqCCxms5'
    driver.get(url)
    driver.find_element_by_id("clearIcon1").click()  # click to clear the cached text in the input box
    driver.find_element_by_id("value(input1)").send_keys(keyword)  # simulate typing keyword into the input box
    driver.find_element_by_xpath("//span[@class='searchButton']/button").click()  # simulate clicking the search button
    newurl = driver.current_url  # URL of the new page
    driver.close()
    return newurl  # return the URL of the new page

def getHTMLText(url):
    try:
        kv = {'user-agent': 'Mozilla/5.0'}
        r = requests.get(url, headers=kv, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # network error or non-2xx status: return an empty page
        return ""


def parsePage(html):
    try:
        tree = etree.HTML(html)
        titleList = tree.xpath("//a[@class='smallV110']/value/text()")  # document titles
        print(titleList)
        return titleList
    except Exception:
        # e.g. empty html because the download failed
        return []


def main():
    keyword = "big data"  # keyword to search for
    url = geturl(keyword)  # get the URL of the results page
    print(url)
    html = getHTMLText(url)
    parsePage(html)

if __name__ == "__main__":
    main()

A few pitfalls:

  1. Download the chromedriver.exe that matches your Chrome version and add it to the PATH environment variable; alternatively, put chromedriver.exe directly in the project directory.
  2. WOS appears to have undergone maintenance since the original post, and much of the page code has changed, so some of the original blogger's code no longer works.
  3. driver.find_elements_by_id() and driver.find_element_by_id() are two different functions and cannot be used interchangeably: the former returns a list of all matching elements, the latter only the first match.
  4. Selenium WebDriver reports several common errors; it is worth sorting them out to distinguish the case where a node cannot be found from the case where the wrong node was found.
  5. Locating by class name does not accept a compound class name (several classes separated by spaces) as the parameter.
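For pitfall 5, a common workaround is a CSS selector that chains the classes together. The helper below is a hypothetical illustration: the name compound_class_to_css is not part of Selenium, and the class names in the example are invented. It only builds the selector string, which could then be passed to a CSS-selector lookup.

```python
def compound_class_to_css(class_attr, tag=""):
    """Turn a class attribute such as "btn primary" into "tag.btn.primary"."""
    classes = class_attr.split()
    if not classes:
        raise ValueError("class_attr must contain at least one class name")
    return tag + "".join("." + name for name in classes)


# Invented class names, for illustration only.
selector = compound_class_to_css("searchButton primary", tag="span")
print(selector)

# With a live driver, this would be used roughly as follows
# (Selenium 3 style, as in the rest of this article):
#   driver.find_element_by_css_selector(selector)
```

The selector matches elements that carry all of the listed classes, regardless of the order in which they appear in the class attribute.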

The code above only crawls titleList; other fields can be obtained in the same way. This approach is too slow, though, so a POST request is worth trying instead.

