03 Parsing library: BeautifulSoup (bs4)

I. Introduction

Beautiful Soup is a Python library for extracting data from HTML and XML files. Working with the parser of your choice, it provides idiomatic ways of navigating, searching, and modifying the document, and it can save you hours or even days of work. You may be looking at the documentation for Beautiful Soup 3. Beautiful Soup 3 is no longer developed, and the official site recommends Beautiful Soup 4 for current projects, which means porting existing code to BS4.

# Install Beautiful Soup
pip install beautifulsoup4

# Install a parser
Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, you can install lxml with one of the following commands:

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

Another option is html5lib, a pure-Python parser that parses a page the same way a web browser does. You can install html5lib with one of the following commands:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib

The following table lists the main parsers with their advantages and disadvantages. The official documentation recommends lxml because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library in those versions is not robust enough.

Parser: Python standard library
Usage: BeautifulSoup(markup, "html.parser")
Advantages:
  • Built into Python's standard library
  • Moderate speed
  • Tolerant of malformed documents
Disadvantages:
  • Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2

Parser: lxml HTML parser
Usage: BeautifulSoup(markup, "lxml")
Advantages:
  • Fast
  • Tolerant of malformed documents
Disadvantages:
  • Requires a C library to be installed

Parser: lxml XML parser
Usage: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")
Advantages:
  • Fast
  • The only parser that supports XML
Disadvantages:
  • Requires a C library to be installed

Parser: html5lib
Usage: BeautifulSoup(markup, "html5lib")
Advantages:
  • The best fault tolerance
  • Parses the document the way a browser does
  • Produces a document in HTML5 format
  • Pure Python, with no dependency on an external C extension
Disadvantages:
  • Slow
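
The choice between the parsers listed above can be made at run time. The following is a minimal sketch, assuming you want to try parsers in order of preference and fall back to the built-in one. The helper name make_soup and its preferred parameter are hypothetical and are not part of Beautiful Soup. FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser's library is missing, so try the next one
    raise RuntimeError("none of the requested parsers is available")


# The same malformed snippet is repaired differently by each parser,
# so the printed tree depends on which parser was picked.
soup, used = make_soup("<a></p>")
print(used, soup)
```

Because each parser repairs broken markup in its own way, it is safer to name the parser explicitly in code that other people will run.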

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

II. Basic usage

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Basic usage and fault tolerance: "fault tolerance" means that the module can recognize and cope with HTML that is incomplete. Parsing the markup above with BeautifulSoup yields a BeautifulSoup object, which can be printed as a conventionally indented structure.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')  # the parser tolerates the missing closing tags
res = soup.prettify()  # apply indentation for a structured display
print(res)
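
To make the fault tolerance concrete, here is a small sketch with markup that is cut off in the middle. It uses "html.parser" rather than the "lxml" parser used above, only so that it runs without an extra install. The variable name broken_html is made up for the illustration.

```python
from bs4 import BeautifulSoup

# Deliberately incomplete markup: <b>, <p>, <body> and <html> are never closed.
broken_html = "<html><body><p class='title'><b>The Dormouse's story"

# "html.parser" ships with Python; "lxml" must be installed separately.
soup = BeautifulSoup(broken_html, "html.parser")

print(soup.b.string)    # text of the unclosed <b> tag
print(soup.p["class"])  # class is multi-valued, so a list comes back
print(soup.prettify())  # indented output with the missing end tags added
```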

III. Traversing the document tree

# Traversing the document tree means selecting directly by tag name. It is fast, but if several tags share the same name, only the first one is returned.
# 1. Usage
# 2. Getting a tag's name
# 3. Getting a tag's attributes
# 4. Getting a tag's content
# 5. Nested selection
# 6. Child nodes and descendant nodes
# 7. Parent node and ancestor nodes
# 8. Sibling nodes

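
The original code for this section is not reproduced here, so the following is only a sketch of the eight points, written against a shortened version of the sample document. The comment numbers match the list above. "html.parser" is used so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_a = soup.a                          # 1. only the first <a> is returned
print(first_a.name)                       # 2. the tag's name
print(first_a.attrs, first_a["href"])     # 3. all attributes, or a single one
print(first_a.string, soup.p.get_text())  # 4. the tag's content
print(soup.head.title.string)             # 5. nested selection

children = list(soup.p.children)          # 6. direct children only
descendants = list(soup.p.descendants)    # 6. children, grandchildren, and so on

print(first_a.parent.name)                       # 7. the parent
ancestors = [t.name for t in first_a.parents]    # 7. every ancestor up to the document

next_a = first_a.find_next_sibling("a")   # 8. the next sibling that is an <a> tag
print(next_a["id"])
```

Note that .next_sibling on its own often returns the whitespace or punctuation between tags, which is why the sketch asks for the next sibling tag by name.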

IV. Searching the document tree

1. The five kinds of filter

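
The five filters are a string, a regular expression, a list, True, and a function. The sketch below passes each of them to find_all on a shortened sample document. The function name has_class_but_no_id is made up for the example.

```python
import re

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. String: an exact tag name.
by_string = [t.name for t in soup.find_all("b")]

# 2. Regular expression: tag names that the pattern matches.
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]

# 3. List: any of the listed tag names.
by_list = [t.name for t in soup.find_all(["a", "b"])]

# 4. True: every tag, but no text nodes.
by_true = [t.name for t in soup.find_all(True)]


# 5. Function: receives a tag and returns True to keep it.
def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


by_function = [t.name for t in soup.find_all(has_class_but_no_id)]

print(by_string, by_regex, by_list, by_true, by_function)
```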

2. find_all(name, attrs, recursive, text, **kwargs)

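
A sketch of how each parameter in the signature narrows the search. Two details go beyond the heading: newer Beautiful Soup releases call the text parameter string, which is the spelling used here, and limit is a further find_all parameter that caps the number of results.

```python
import re

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# name: filter on the tag name.
all_links = soup.find_all("a")

# attrs: a dict of attribute filters.
by_attrs = soup.find_all("a", attrs={"id": "link2"})

# **kwargs: keyword filters; class is a reserved word, hence class_.
by_kwargs = soup.find_all(class_="sister", href=re.compile("lacie|tillie"))

# string (text in older releases): match on the text nodes themselves.
by_string = soup.find_all(string="Elsie")

# limit: stop after this many matches.
first_two = soup.find_all("a", limit=2)

# recursive=False: only direct children are examined, so <title>,
# which sits inside <head>, is not found from <html>.
direct_titles = soup.html.find_all("title", recursive=False)

print(len(all_links), by_attrs, by_kwargs, by_string, first_two, direct_titles)
```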

3. find(name, attrs, recursive, text, **kwargs)

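
find takes the same filters as find_all but returns only the first match, or None when nothing matches. A short sketch on a fragment of the sample document:

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a", class_="sister")                 # first match only
same = soup.find_all("a", class_="sister", limit=1)[0]  # the equivalent find_all call
missing = soup.find("table")                            # no match gives None, not []
nested = soup.find("p").find("a", id="link2")           # calls can be chained

print(first["id"], first is same, missing, nested.string)
```

Because find can return None, chaining calls on a tag that may be absent raises AttributeError, so check the result first when the markup is not under your control.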

4. Other methods

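
The source does not say which other methods it covered. As an assumption, the sketch below shows the relatives of find and find_all that search upward, sideways, and forward from a tag instead of downward.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<p class="story">
<a id="link1">Elsie</a>
<a id="link2">Lacie</a>
<a id="link3">Tillie</a>
</p>
<p class="end">...</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")
middle = soup.find("a", id="link2")

# Upward: the nearest matching ancestor, or all ancestors.
parent_p = middle.find_parent("p")
ancestor_names = [t.name for t in middle.find_parents()]

# Sideways: siblings after and before the tag.
later = [t["id"] for t in middle.find_next_siblings("a")]
earlier = [t["id"] for t in middle.find_previous_siblings("a")]

# Forward in document order, regardless of nesting.
next_p = middle.find_next("p")

print(parent_p["class"], ancestor_names, later, earlier, next_p["class"])
```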

5. CSS selectors

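
CSS selectors are passed to select, which returns a list, or to select_one, which returns the first match. The selectors below are illustrative choices for a shortened sample document, not the ones from the original post.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("a.sister")             # tag name plus class
by_id = soup.select("#link1")                  # id
direct_children = soup.select("p.story > a")   # direct child combinator
by_attr = soup.select('a[href$="tillie"]')     # attribute value ends with "tillie"
title_text = soup.select_one("title").get_text()  # first match only

print(len(by_class), by_id, len(direct_children), by_attr, title_text)
```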

V. Modifying the document tree

Link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id40
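
The linked documentation covers modification in full. As a brief sketch of the most common operations, assuming a one-line fragment invented for the illustration:

```python
from bs4 import BeautifulSoup

markup = '<p class="story"><a href="http://example.com/elsie" id="link1">Elsie</a><i>draft</i></p>'
soup = BeautifulSoup(markup, "html.parser")

link = soup.a
link.name = "span"              # rename the tag
link["id"] = "first"            # change an attribute
del link["href"]                # delete an attribute
link.string = "Elsie (edited)"  # replace the tag's text

new_tag = soup.new_tag("b")     # create a new tag and append it to <p>
new_tag.string = "new"
soup.p.append(new_tag)

soup.i.decompose()              # remove <i> and its contents from the tree

print(soup)
```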

VI. Summary

# Summary:
# 1. lxml is the recommended parser
# 2. Three kinds of selector were covered: tag selectors, find and find_all, and CSS selectors
    1. Tag selectors filter weakly, but they are fast
    2. find and find_all are recommended for matching a single result or multiple results
    3. If you are very familiar with CSS selectors, select is recommended
# 3. Remember the common ways of getting attributes (attrs) and text (get_text())

Origin www.cnblogs.com/cherish937426/p/11955178.html