Crawler: Ajax Request Processing Technology Summary (Part 2) - Simulating Browser Behavior

Quote of the day:

Never lie to someone who trusts you. Never trust someone who lies to you.


Foreword:

In the last article, we introduced one way to crawl dynamic web pages: reverse engineering.

The drawback of this method is that it requires a certain understanding of JavaScript and Ajax, and when a page's JS code is messy and hard to analyze, the process above can cost a great deal of time and energy.

If you don't have strict requirements on crawler execution efficiency and don't want to spend too much time understanding JavaScript code logic and hunting for Ajax request URLs, you can try the following idea:

  • Simulate browser behavior: use a browser rendering engine to execute the JavaScript on the target page, then parse the rendered HTML.

Introduction to Selenium:

Selenium is a tool for web application testing. Selenium tests run directly in the browser, just as a real user would, and it supports almost all major browsers on the market.

I originally intended to use the combination Selenium + PhantomJS, but Chrome and Firefox have both since launched headless browser modes, and I personally prefer Chrome, so this article uses Selenium + Chrome.

Next, we will introduce the use of Selenium in combination with a specific webpage.


Example:

As before, we take Sina Reading_Book Digest as our example:

Since Selenium is fairly simple to use, let's look at the sample code directly; it contains enough comments to follow:

# coding=utf8

import time
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import NoSuchElementException


# Parse one item from the list page
def getItem():

    # Process the article detail page; fill in the specifics as needed

    global location
    # Increment location for every article parsed
    location += 1


def getList():
    global location
    global driver
    # Parse the page content
    divs = driver.find_elements_by_class_name("item")

    # Build an action chain
    actions = ActionChains(driver)

    for i in range(location, len(divs)):

        div = divs[i]
        # Title
        title = div.find_element_by_tag_name("a").text
        # Link element (click target)
        url = div.find_element_by_tag_name("a")
        # url = div.find_element_by_tag_name('a').get_attribute("href")

        # Open the article detail page
        actions.click(url)
        actions.perform()
        actions.reset_actions()

        # Switch the driver to the new window, then process the detail page
        driver.switch_to.window(driver.window_handles[1])
        # Call the detail-page handler here to extract the required fields
        getItem()

        driver.close()

        driver.switch_to.window(driver.window_handles[0])
        print(driver.title)


# Check whether the "more excerpts" button exists; if so, return the element
def loadMore():
    global driver
    elem = None
    try:
        elem = driver.find_element_by_id("subShowContent1_loadMore")
    except NoSuchElementException:
        pass

    # If it does not exist or is not visible, return None
    if elem is None or not elem.is_displayed():
        return None
    else:
        return elem


# After every two clicks of "more excerpts", a "next page" button appears;
# check whether "next page" exists and, if so, return the element
def pagebox_next():
    global driver
    next_page = None
    try:
        next_page = driver.find_element_by_class_name("pagebox_next")
    except NoSuchElementException:
        pass

    # If it does not exist or is not visible, return None
    if next_page is None or not next_page.is_displayed():
        return None
    else:
        return next_page


# Check whether there is any new content to load
def haveNext():
    global location
    global driver
    more = loadMore()
    if more:  # the "more excerpts" button exists
        return more
    else:
        # Whether or not there is a "next page", reset location to 0
        location = 0
        more = pagebox_next()
        return more


# Clicking "more excerpts" appends new content without clearing the old,
# so keep a variable recording the current position;
# reset it to 0 after each "next page" click
location = 0

# Choose the browser for the driver

# To create a headless Chrome browser:
# opt = webdriver.ChromeOptions()
# opt.set_headless()
# driver = webdriver.Chrome(options=opt)

# Create a visible Chrome browser
driver = webdriver.Chrome()

# Set the browser's implicit wait time
driver.implicitly_wait(30)

# Open the page with the get() method
driver.get("http://book.sina.com.cn/excerpt/")

# Parse the article list
getList()

# Initialize an action chain
actions_more = ActionChains(driver)

# Check whether there is more content
more_page = haveNext()

while more_page:
    # Add the click action and run it
    actions_more.click(more_page)
    actions_more.perform()
    actions_more.reset_actions()

    # After each page turn or refresh, wait for the page to finish loading.
    # A forced sleep is used here; it is slow and not recommended.
    # A better approach is Selenium's explicit wait, WebDriverWait,
    # whose until() and until_not() methods wait flexibly on a condition.
    time.sleep(3)

    # Parse the newly loaded content, starting from location
    getList()

    # Is there a "more excerpts" button or a next page?
    more_page = haveNext()

# Close the browser
driver.close()

  • If you want to actually run the code above, make sure that (on Windows 10):

    • A recent version of the Chrome browser is installed.

    • The Chrome browser driver (chromedriver.exe) is installed and its directory is added to the PATH environment variable.

    • Selenium is installed.
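
A minimal smoke test for the environment (assuming chromedriver.exe is already on PATH; the URL is simply the page used in this article):

from selenium import webdriver

# This line fails if chromedriver cannot be found on PATH
driver = webdriver.Chrome()
driver.get("http://book.sina.com.cn/excerpt/")
print(driver.title)
driver.quit()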

Brief explanation:

The process of using Selenium is roughly as follows:

  • Create a connection to the browser
from selenium import webdriver

driver = webdriver.Chrome()

# To create a headless browser instead:
# opt = webdriver.ChromeOptions()
# opt.set_headless()
# driver = webdriver.Chrome(options=opt)
  • Set wait time:
driver.implicitly_wait(30)
- This is Selenium's implicit wait: we set a 30-second timeout, so if an element we look for has not appeared, Selenium waits at most 30 seconds and then throws an exception. It is recommended to combine this with explicit waits.
- Explicit wait: WebDriverWait, together with its until() and until_not() methods, can wait flexibly according to a judgment condition (see the sketch after this list).
  • Call the get() method to load the web page:
driver.get("http://book.sina.com.cn/excerpt/")
  • Get the required elements from the web page:
driver.find_elements_by_class_name("item")
  • Create an action chain:
actions = ActionChains(driver)
  • Add an action to the action chain:
actions.click(url)
  • Action chain execution:
actions.perform()
  • Call the close() method to close the browser:
driver.close()
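
As referenced above, here is a minimal explicit-wait sketch. The element id is the "more excerpts" button from the sample code; the 10-second timeout is an illustrative choice, not a recommendation:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Poll (every 0.5 seconds by default) until the button is present in the DOM,
# for at most 10 seconds; throws TimeoutException if the condition never holds
more = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "subShowContent1_loadMore"))
)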

The sample code above includes the logic for page turning and for switching between the list page and the article detail page, so it may look complicated, but the execution follows roughly the same flow as the steps above.

In addition, the sample code is not a complete crawler: after jumping to the article detail page, no data is actually extracted. If needed, readers can add a concrete implementation to the getItem() function in the sample code, and then add data storage and other features. For instance:
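
A hedged sketch of what getItem() might look like. The class names "main-title" and "article" are hypothetical selectors that must be checked against the actual detail-page structure before use:

# Parse one article detail page (replaces the stub in the sample code)
def getItem():
    global location
    # Hypothetical selectors; inspect the real page before relying on them
    title = driver.find_element_by_class_name("main-title").text
    content = driver.find_element_by_class_name("article").text
    print(title, len(content))
    # Increment location for every article parsed
    location += 1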


Analysis and comparison:

When using Selenium, pay special attention to:

  • Refreshing or switching pages is unavoidable with Selenium, so pay attention to the page's response time. Selenium does not wait for the page response to complete before continuing; it executes the next statement immediately, since page loading and script execution are separate processes. To handle this, you can set implicit waits and explicit waits.

    • Implicit wait: a maximum waiting time is set. If the page loads within that time, the next step runs; otherwise Selenium keeps waiting until the time expires and then throws an exception. Note that an implicit wait applies for the entire lifetime of the driver, so it only needs to be set once.

    • Explicit wait: WebDriverWait, together with its until() and until_not() methods, waits flexibly according to a judgment condition. In essence, the program checks the condition every x seconds: if it holds, the next step executes; otherwise it keeps waiting until the configured maximum time is exceeded and then throws a TimeoutException.

    • If both an implicit wait and an explicit wait are set, the explicit wait governs inside WebDriverWait calls, while the implicit wait governs all other operations. Note that the effective maximum waiting time is the larger of the two.
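
A small sketch of that interplay; the timings are illustrative assumptions:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.implicitly_wait(5)  # implicit wait: applies to every element lookup on this driver

# Explicit wait: governs only this WebDriverWait call; with both set,
# the effective ceiling here is the larger of the two values, max(5, 10) = 10 seconds
WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "pagebox_next"))
)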

Comparison with "reverse engineering":

  • The former (reverse engineering) runs faster and has less overhead. In practice most web pages can be reverse engineered, but some are complex enough that reversing them takes great effort.

  • The latter (simulating browser behavior) is more intuitive and easier to understand. A browser rendering engine saves us the time of working out how the site's backend behaves, but rendering a page adds overhead, making it slower than simply downloading the HTML. Also, this approach usually requires polling the page to check whether the desired HTML element has appeared yet, which is brittle and often fails on slow networks.

Which method to use depends on the specific situation in the crawler activity:

  • If the site is easy to reverse and you have high requirements on speed and resource usage, use the former;

  • If the site is hard to reverse and there are no strict engineering requirements, use the latter.

Personally, I think the browser-simulation approach should be avoided as much as possible, because the browser environment consumes a lot of memory and CPU. It can serve as a short-term solution where long-term performance and reliability don't matter; as a long-term solution, I would make every effort to reverse engineer the website instead.


Finally:

This article introduced Selenium, a tool for web application testing, and the idea of obtaining data without reverse engineering a page's request process.

If readers want to learn more about Selenium, I recommend the Zhihu column [a small project per week], in which six articles introduce Selenium's commonly used classes, methods, and APIs, along with translations of parts of the official documentation; they are quite well written.

If there are any deficiencies or errors in this article, please point them out!
