Performance comparison of crawler parsers: re, xpath, and bs4

Original article: https://sitoi.cn/posts/23470.html

Approach

Test page: http://baijiahao.baidu.com/s?id=1644707202199076031

Each method extracts the same data from the same page 500 times, and the total times are then compared.

Test case

# -*- coding: utf-8 -*-
import re
import time

import scrapy
from bs4 import BeautifulSoup


class NewsSpider(scrapy.Spider):
    name = 'news'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baijiahao.baidu.com/s?id=1644707202199076031']

    def parse(self, response):
        re_time_list = []
        xpath_time_list = []
        lxml_time_list = []
        bs4_lxml_time_list = []
        html5lib_time_list = []
        bs4_html5lib_time_list = []
        for _ in range(500):
            # re
            re_start_time = time.time()
            news_title = re.findall(pattern="<title>(.*?)</title>", string=response.text)[0]
            news_content = "".join(re.findall(pattern='<span class="bjh-p">(.*?)</span>', string=response.text))
            re_time_list.append(time.time() - re_start_time)
            # xpath
            xpath_start_time = time.time()
            news_title = response.xpath("//div[@class='article-title']/h2/text()").extract_first()
            news_content = response.xpath('string(//*[@id="article"])').extract_first()
            xpath_time_list.append(time.time() - xpath_start_time)
            # bs4 + html5lib, extraction only (BeautifulSoup construction is not timed)
            soup = BeautifulSoup(response.text, "html5lib")
            html5lib_start_time = time.time()
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            html5lib_time_list.append(time.time() - html5lib_start_time)
            # bs4 + html5lib, including BeautifulSoup construction
            bs4_html5lib_start_time = time.time()
            soup = BeautifulSoup(response.text, "html5lib")
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            bs4_html5lib_time_list.append(time.time() - bs4_html5lib_start_time)

            # bs4 + lxml, extraction only (BeautifulSoup construction is not timed)
            soup = BeautifulSoup(response.text, "lxml")
            lxml_start_time = time.time()
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            lxml_time_list.append(time.time() - lxml_start_time)

            # bs4 + lxml, including BeautifulSoup construction
            bs4_lxml_start_time = time.time()
            soup = BeautifulSoup(response.text, "lxml")
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            bs4_lxml_time_list.append(time.time() - bs4_lxml_start_time)
        re_result = sum(re_time_list)
        xpath_result = sum(xpath_time_list)
        lxml_result = sum(lxml_time_list)
        html5lib_result = sum(html5lib_time_list)
        bs4_lxml_result = sum(bs4_lxml_time_list)
        bs4_html5lib_result = sum(bs4_html5lib_time_list)

        print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n")
        print(f"re time: {re_result}")
        print(f"xpath time: {xpath_result}")
        print(f"lxml extraction-only time: {lxml_result}")
        print(f"html5lib extraction-only time: {html5lib_result}")
        print(f"bs4_lxml conversion + extraction time: {bs4_lxml_result}")
        print(f"bs4_html5lib conversion + extraction time: {bs4_html5lib_result}")
        print("\n>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n")
        print(f"xpath/re :{xpath_result / re_result}")
        print(f"lxml/re :{lxml_result / re_result}")
        print(f"html5lib/re :{html5lib_result / re_result}")
        print(f"bs4_lxml/re :{bs4_lxml_result / re_result}")
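        print(f"bs4_html5lib/re :{bs4_html5lib_result / re_result}")
        # These two ratios appear in the recorded output below
        print(f"lxml/xpath :{lxml_result / xpath_result}")
        print(f"html5lib/xpath :{html5lib_result / xpath_result}")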
        print("\n>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
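The repeat-and-sum timing pattern used in the spider can be factored into a small helper. The following is only a sketch under stated assumptions: `SAMPLE_HTML`, `total_time`, and `extract_with_re` are hypothetical names that do not appear in the original test, the markup is a made-up stand-in for `response.text`, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic, higher-resolution clock. The figures reported below were measured with `time.time`.

```python
import re
import time

# Hypothetical sample markup standing in for response.text.
SAMPLE_HTML = (
    "<html><head><title>Sample title</title></head>"
    '<body><span class="bjh-p">one</span>'
    '<span class="bjh-p">two</span></body></html>'
)


def total_time(func, repeat=500):
    """Return the summed wall-clock time (seconds) of `repeat` calls to func."""
    total = 0.0
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        total += time.perf_counter() - start
    return total


def extract_with_re():
    """Same two regex extractions as the `# re` branch of the spider."""
    title = re.findall("<title>(.*?)</title>", SAMPLE_HTML)[0]
    content = "".join(re.findall('<span class="bjh-p">(.*?)</span>', SAMPLE_HTML))
    return title, content


if __name__ == "__main__":
    print(f"re time: {total_time(extract_with_re)}")
```

Each of the other extraction methods could be wrapped in a function of the same shape and passed to `total_time`, which keeps the measured region identical across methods.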

Test Results:

First run

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

re time: 0.018010616302490234
xpath time: 0.19927382469177246
lxml extraction-only time: 0.3410227298736572
html5lib extraction-only time: 0.3842911720275879
bs4_lxml conversion + extraction time: 1.6482152938842773
bs4_html5lib conversion + extraction time: 6.744122505187988

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

xpath/re :11.064242408196765
lxml/re :18.934539726245003
html5lib/re :21.336925154218847
bs4_lxml/re :91.51354213550078
bs4_html5lib/re :374.4526223822509
lxml/xpath :1.7113272673976896
html5lib/xpath :1.9284578525152096

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Second run

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

re time: 0.023047208786010742
xpath time: 0.18992280960083008
lxml extraction-only time: 0.3522317409515381
html5lib extraction-only time: 0.418229341506958
bs4_lxml conversion + extraction time: 1.710503101348877
bs4_html5lib conversion + extraction time: 7.1153998374938965

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

xpath/re :8.24059917034769
lxml/re :15.28305419636484
html5lib/re :18.14663742538819
bs4_lxml/re :74.21736476770769
bs4_html5lib/re :308.7315216154427
lxml/xpath :1.8546047296364272
html5lib/xpath :2.2021016979791463

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Third run

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

re time: 0.014002561569213867
xpath time: 0.18992352485656738
lxml extraction-only time: 0.3783881664276123
html5lib extraction-only time: 0.39995455741882324
bs4_lxml conversion + extraction time: 1.751767873764038
bs4_html5lib conversion + extraction time: 7.1871068477630615

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

xpath/re :13.563484360899695
lxml/re :27.022781835827757
html5lib/re :28.56295653062267
bs4_lxml/re :125.10338662716453
bs4_html5lib/re :513.2708620660298
lxml/xpath :1.9923185751389976
html5lib/xpath :2.1058716013241323

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Result analysis:

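Ratios computed from the times averaged over the three runs. Each cell is the column method's time divided by the row method's time. Here lxml and html5lib denote bs4 extraction alone with the respective parser, while lxml(bs4) and html5lib(bs4) also include constructing the BeautifulSoup object.

                re    xpath   lxml    html5lib   lxml(bs4)   html5lib(bs4)
re              1     10.52   19.46   21.84      92.82       382.25
xpath                 1       1.85    2.08       8.82        36.34
lxml                          1       1.12       4.77        19.64
html5lib                              1          4.25        17.50
lxml(bs4)                                        1           4.12
html5lib(bs4)                                                1
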
  • xpath/re :10.52
  • lxml/re :19.46
  • html5lib/re :21.84
  • bs4_lxml/re :92.82
  • bs4_html5lib/re :382.25
  • lxml/xpath :1.85
  • html5lib/xpath :2.08
  • bs4_lxml/xpath :8.82
  • bs4_html5lib/xpath :36.34
  • html5lib/lxml :1.12
  • bs4_lxml/lxml :4.77
  • bs4_html5lib/lxml :19.64
  • bs4_lxml/html5lib :4.25
  • bs4_html5lib/html5lib :17.50
  • bs4_html5lib/bs4_lxml :4.12
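
The ratios listed above can be reproduced from the raw totals of the three runs. The sketch below is one way to do the arithmetic. The names `RUNS`, `ratio`, and `ratio_table` are hypothetical and not part of the original test code. The numbers are copied from the recorded output, and each ratio is the summed time of one method over the summed time of another, which is equivalent to dividing the three-run averages.

```python
# Per-run totals in seconds, copied from the three recorded runs.
RUNS = {
    "re": [0.018010616302490234, 0.023047208786010742, 0.014002561569213867],
    "xpath": [0.19927382469177246, 0.18992280960083008, 0.18992352485656738],
    "lxml": [0.3410227298736572, 0.3522317409515381, 0.3783881664276123],
    "html5lib": [0.3842911720275879, 0.418229341506958, 0.39995455741882324],
    "bs4_lxml": [1.6482152938842773, 1.710503101348877, 1.751767873764038],
    "bs4_html5lib": [6.744122505187988, 7.1153998374938965, 7.1871068477630615],
}


def ratio(slower, faster):
    """Summed time of `slower` divided by summed time of `faster`."""
    return sum(RUNS[slower]) / sum(RUNS[faster])


def ratio_table():
    """Map (faster, slower) pairs to their ratio, rounded to two decimals."""
    names = list(RUNS)
    return {
        (a, b): round(ratio(b, a), 2)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


if __name__ == "__main__":
    for (faster, slower), value in ratio_table().items():
        print(f"{slower}/{faster}: {value}")
```

Note that averaging the three per-run ratios instead would give slightly different values (for example about 10.96 for xpath/re), because the re time varies noticeably between runs.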

Comparison of the three extraction approaches

               re          xpath              bs4
Installation   built-in    third-party        third-party
Syntax         regex       path matching      object-oriented
Ease of use    difficult   fairly difficult   simple
Performance    highest     moderate           lowest

Conclusion

re > xpath > bs4

  • re is about 10 times faster than xpath

    Although re far outperforms xpath and bs4, it is much harder to use than either, and the resulting code is also considerably harder to maintain.

  • xpath is about 1.8 times faster than bs4

    This figure compares extraction alone. In practice bs4 must also convert the document into a BeautifulSoup object; with that step included, xpath is about 8.8 times faster than bs4 with lxml and about 36 times faster than bs4 with html5lib. When the volume of data and the nesting depth are large, xpath is in practice far more efficient than bs4.
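
The maintenance cost attributed to re above can be illustrated with a small sketch. It rests on several assumptions: the two markup snippets are made up, the function names are hypothetical, and the standard library's `xml.etree.ElementTree` (which supports only a limited XPath subset and requires well-formed input) stands in for Scrapy's full XPath selectors. The regex is the one used for `news_content` in the test code.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippets; the second adds one attribute to the span.
ORIGINAL = '<div id="article"><span class="bjh-p">first</span></div>'
CHANGED = '<div id="article"><span data-x="1" class="bjh-p">first</span></div>'

PATTERN = re.compile(r'<span class="bjh-p">(.*?)</span>')


def extract_with_re(html):
    """Match the literal opening tag, as the regex in the test code does."""
    return PATTERN.findall(html)


def extract_with_path(html):
    """Select spans by class attribute with a path expression."""
    root = ET.fromstring(html)
    return [span.text for span in root.findall(".//span[@class='bjh-p']")]


if __name__ == "__main__":
    print(extract_with_re(ORIGINAL), extract_with_re(CHANGED))
    print(extract_with_path(ORIGINAL), extract_with_path(CHANGED))
```

The regex matches the literal text of the opening tag, so an unrelated attribute added by the site makes it return nothing, while the path expression keeps matching because it tests the attribute value rather than the tag's exact spelling.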

Overall, xpath combined with distributed crawling via scrapy-redis already meets performance requirements well, so xpath is the recommended choice for getting started.

Origin www.cnblogs.com/sitoi/p/11819580.html