BeautifulSoup quick documentation

Installation

On Debian and Ubuntu, Beautiful Soup can be installed through the system package manager:

apt-get install python-bs4 # python2
apt-get install python3-bs4 # python3

Alternatively, install it with one of Python's own package managers:

pip install beautifulsoup4
easy_install beautifulsoup4

Install a parser

By default, Beautiful Soup uses the HTML parser from the Python standard library. For faster parsing, or for browser-like handling of markup, you can install a third-party parser and select it explicitly.

Parser	Example	Advantages	Disadvantages
Python's html.parser	BeautifulSoup(markup, "html.parser")	No external dependency; decent speed; lenient on recent Python versions	Slower than lxml; less lenient than html5lib
lxml's HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient	External C dependency
lxml's XML parser	BeautifulSoup(markup, "lxml-xml")	Very fast; supports XML	External C dependency
html5lib	BeautifulSoup(markup, "html5lib")	Parses pages the same way an HTML5 web browser does	Very slow; external Python dependency
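
As a minimal sketch of choosing a parser at runtime, the snippet below tries a preferred parser and falls back to the standard-library one when it is not installed. The helper name `make_soup` is hypothetical and not part of Beautiful Soup; `FeatureNotFound` is the exception Beautiful Soup raises when no installed parser matches the request.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Naming the parser explicitly also keeps results consistent across machines, since different parsers can build different trees from the same invalid markup.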

Make the soup

The BeautifulSoup constructor accepts either an open file-like object or a string:

from bs4 import BeautifulSoup

# from an open file
with open("index.html") as f:
    soup = BeautifulSoup(f, "html.parser")
# from a string
soup = BeautifulSoup("<html>data</html>", "html.parser")
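
The two input forms can be compared side by side in a small sketch. Here `io.StringIO` stands in for an open file so that the example needs no file on disk; that substitution is an assumption made for illustration.

```python
import io

from bs4 import BeautifulSoup

html = "<html><body><p>data</p></body></html>"

# From a string.
soup_from_string = BeautifulSoup(html, "html.parser")

# From a file-like object; io.StringIO plays the role of an open file here.
soup_from_file = BeautifulSoup(io.StringIO(html), "html.parser")

print(soup_from_string.p.string)
print(soup_from_file.p.string)
```

Both calls should produce the same tree, because Beautiful Soup simply reads the file-like object before parsing.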

Kinds of objects

Beautiful Soup parses an HTML document into a complex tree of Python objects. In practice, you only need to deal with four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

A Tag object corresponds to a tag in the HTML document.

from bs4.element import Tag

soup = BeautifulSoup("<a id=\"c\">Hello</a>", "html.parser")
tag = soup.a
type(tag) is Tag # True

# name
tag.name # "a"
# change the name; the generated markup changes with it
tag.name = "c" # <c id="c">Hello</c>

# attributes
tag["id"] # "c"
tag.attrs # {"id": "c"}
# modify the tag's attributes
tag["id"] = "ddd" # <c id="ddd">Hello</c>
tag["id2"] = "ccc" # <c id="ddd" id2="ccc">Hello</c>
del tag["id2"] # <c id="ddd">Hello</c>
tag.get("id") # "ddd"
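
A short sketch of how the access styles differ when an attribute is missing; the `href` and `title` attributes are made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="c" href="/home">Hello</a>', "html.parser")
tag = soup.a

# Indexing raises KeyError for a missing attribute...
try:
    tag["title"]
    missing = False
except KeyError:
    missing = True

# ...while get() returns None, or a default you supply.
title = tag.get("title")
fallback = tag.get("title", "untitled")

# has_attr() tests for presence without reading the value.
print(missing, title, fallback, tag.has_attr("href"))
```

Using `get()` is usually the more convenient choice when scraping pages where an attribute may or may not be present.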

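# multi-valued attributes
"""
Beautiful Soup parses attributes that can hold several values into a list.
This is enabled by default; for the class attribute, for example:
<a class="cls1 cls2"></a> => tag_a["class"] => ["cls1", "cls2"]
The multi_valued_attributes option of the BeautifulSoup constructor controls this behavior.
"""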
css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]

no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser', multi_valued_attributes=None)
no_list_soup.p['class']
# 'body strikeout'
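
The following sketch contrasts a multi-valued attribute with a single-valued one. It assumes the standard-library parser; the sample markup is made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

# class is defined as multi-valued in HTML, so it is split on whitespace.
classes = p["class"]
# id is single-valued, so the whitespace is kept and a plain string comes back.
element_id = p["id"]

# get_attribute_list() always returns a list, whichever kind the attribute is.
id_list = p.get_attribute_list("id")

# Assigning a list to a multi-valued attribute joins it back on output.
p["class"] = ["body", "bold"]
print(classes, element_id, id_list)
print(p)
```

`get_attribute_list()` is handy when the calling code should not care whether a given attribute happens to be multi-valued.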

NavigableString

A NavigableString represents the text contained in a tag.

tag.string # str
type(tag.string) # bs4.element.NavigableString

# convert to a plain Unicode string (use unicode() on Python 2)
unicode_string = str(tag.string)

# a NavigableString cannot be edited in place,
# but replace_with() swaps it for another string
tag.string.replace_with("hello") # <tag>str</tag> -> <tag>hello</tag>

If you want to use the text outside of Beautiful Soup, convert the NavigableString to a plain Unicode string first. Otherwise the string keeps a reference to the entire parse tree, which wastes memory.
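
A small sketch of that conversion, assuming Python 3, where `str()` plays the role that `unicode()` has on Python 2:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>bold text</b>", "html.parser")
nav = soup.b.string

# A NavigableString is a str subclass that also knows its place in the tree.
print(isinstance(nav, str), nav.parent.name)

# str() yields a plain string with no link back to the parse tree.
plain = str(nav)
print(type(plain) is str, hasattr(plain, "parent"))
```

Once only plain strings are kept, the parse tree can be garbage-collected as soon as the soup object itself goes out of scope.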

BeautifulSoup

The BeautifulSoup object represents the parsed document as a whole. In most cases it can be treated as a Tag and supports most Tag methods; since it corresponds to no real tag, its name attribute has the special value "[document]".
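
A brief sketch of what this means in practice; the sample markup is made up for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<html><body><p>one</p><p>two</p></body></html>", "html.parser")

# The BeautifulSoup object is itself a Tag subclass...
print(isinstance(soup, Tag))
# ...with a placeholder name and no attributes, since it matches no real tag.
print(soup.name, soup.attrs)
# Tag methods such as find_all() work on it directly.
print([str(p.string) for p in soup.find_all("p")])
```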

Comments and other special strings

Comment

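A Comment is a special type of NavigableString, and what you see depends on how you access it: .string returns the bare comment text, while prettify() renders it with its <!-- --> delimiters.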

soup = BeautifulSoup("<b><!--Hey?--></b>")
comment = soup.b.string # => bs4.element.Comment
comment # Hey?
soup.b.prettify() # =>
"""
<b>
 <!--Hey?-->
</b>
"""

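Because a Comment is its own class, comments can be picked out of a tree by type. The sketch below, with made-up markup, collects every comment and then removes them; it is one possible approach rather than the only one.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = "<div><!-- build 42 --><p>Visible<!-- hidden note --></p></div>"
soup = BeautifulSoup(html, "html.parser")

# Collect every comment by filtering strings on their type.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print([c.strip() for c in comments])

# extract() removes each comment from the tree.
for c in comments:
    c.extract()
print(soup)
```
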
Stylesheet, Script, and TemplateString

Beautiful Soup represents the strings embedded in style, script, and template tags with dedicated NavigableString subclasses:

<style></style> <!-- bs4.element.Stylesheet -->
<script></script> <!-- bs4.element.Script -->
<template></template> <!-- bs4.element.TemplateString -->

Note that these classes are available only in Beautiful Soup 4.9.0 and later, and that the html5lib parser does not use them.
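
A minimal sketch for checking which class each string ends up as. It assumes Beautiful Soup 4.9.0 or later and the standard-library parser; the sample markup is made up for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Script, Stylesheet, TemplateString

html = (
    "<style>p { color: red; }</style>"
    "<script>var x = 1;</script>"
    "<template>placeholder</template>"
    "<p>body text</p>"
)
soup = BeautifulSoup(html, "html.parser")

# Print the class used for the string inside each tag.
for name in ("style", "script", "template", "p"):
    s = soup.find(name).string
    print(name, type(s).__name__)
```

Having separate classes makes it easy to tell embedded code and styles apart from the human-visible text of a page.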

CDATA

Beautiful Soup also defines classes for constructs found in XML documents, such as CData for CDATA sections.

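from bs4 import BeautifulSoup, CData

soup = BeautifulSoup("<xml>text</xml>", "html.parser")
cdata = CData("A CDATA block")
# replace the tag's string, not the tag itself
soup.xml.string.replace_with(cdata)
soup.xml.prettify() # =>
"""
<xml>
 <![CDATA[A CDATA block]]>
</xml>
"""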

Navigating the tree

Official document

(To be continued)