Parsing HTML in Python with Beautiful Soup

Summary

Beautiful Soup is a Python library that extracts data from HTML and XML files. It parses the markup into Python objects so that the data can be processed conveniently from Python code.

Test environment

Test environment used for the code in this document.

Using Beautiful Soup

The core functions of Beautiful Soup are finding and modifying HTML tags.

Basic concepts: object types

Beautiful Soup converts a complex HTML document into a tree structure in which each node is a Python object. It defines four types of objects: Tag, NavigableString, BeautifulSoup, and Comment.

Object type	Description
BeautifulSoup	The entire document
Tag	An HTML tag
NavigableString	The text contained in a tag
Comment	A special type of NavigableString, used when the text inside a tag is a comment
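
As a minimal sketch of how the four object types appear in practice, the snippet below parses a small fragment and prints the type of each node. The markup is made up for illustration, and the standard-library "html.parser" is used only so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Illustrative markup: one tag holding a text node and a comment
soup = BeautifulSoup("<p>Hello<!-- a note --></p>", "html.parser")

tag = soup.p               # Tag
text = tag.contents[0]     # NavigableString
comment = tag.contents[1]  # Comment, a subclass of NavigableString

for node in (soup, tag, text, comment):
    print(type(node).__name__)
```

Because Comment is a subclass of NavigableString, code that needs to tell the two apart should test for Comment first.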

Installation and import

# Beautiful Soup
pip install beautifulsoup4

# Parsers
pip install lxml
pip install html5lib

# Initialization
from bs4 import BeautifulSoup

# Method 1: parse an open file
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'lxml')

# Method 2: parse a string
resp = "<html>data</html>"
soup = BeautifulSoup(resp, 'lxml')

# soup is a BeautifulSoup object
print(type(soup))

Searching and filtering tags

Basic methods

There are two basic search methods, find_all() and find(). find_all() returns a list of all tags that match the given filters, while find() returns only the first match, or None if nothing matches.

import re

soup = BeautifulSoup(resp, 'lxml')

# Return the first Tag named "a"
soup.find("a")

# Return a list of all matching tags
soup.find_all("a")

# find_all() can be abbreviated by calling the object directly
soup("a")

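# Find all tags whose names start with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# Find all tags whose names are in the list
soup.find_all(["a", "p"])

# Find tags named "p" whose class attribute is "title"
soup.find_all("p", "title")

# Find tags whose id attribute is "link2"
soup.find_all(id="link2")

# Find tags that have an id attribute
soup.find_all(id=True)

# Filter on several attributes at once
soup.find_all(href=re.compile("elsie"), id='link1')

# Filter on attributes whose names cannot be used as keyword arguments
soup.find_all(attrs={"data-foo": "value"})

# Find the first text string that contains "sisters"
soup.find(string=re.compile("sisters"))

# Limit the number of results
soup.find_all("a", limit=2)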

# Use a custom matching function
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)

# Apply a custom matching function to a single attribute
def not_lacie(href):
    return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)

# find_all() searches all descendants of the current tag. To search only
# its direct children, pass recursive=False.
soup.find_all("title", recursive=False)
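
The filters above can be tried end to end with a sketch like the following. It assumes a shortened version of the "three sisters" sample document and uses "html.parser" so that it runs without lxml; the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.find("a")                 # first match only
all_links = soup.find_all("a")              # list of every match
b_names = [t.name for t in soup.find_all(re.compile("^b"))]
by_id = soup.find_all(id="link2")
limited = soup.find_all("a", limit=2)
missing = soup.find("table")                # no match: find() returns None
sisters_text = soup.find(string=re.compile("sisters"))

def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

classes = [t["class"] for t in soup.find_all(has_class_but_no_id)]

print(first_link["id"], len(all_links), b_names, classes)
```

Note that find() returns None when nothing matches, so its result should be checked before attributes are accessed on it.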

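Additional search methods

Method	Description
find_parents()	All matching ancestor nodes
find_parent()	The nearest matching ancestor node
find_next_siblings()	All matching siblings after the current node
find_next_sibling()	The first matching sibling after the current node
find_previous_siblings()	All matching siblings before the current node
find_previous_sibling()	The first matching sibling before the current node
find_all_next()	All matching elements after the current node
find_next()	The first matching element after the current node
find_all_previous()	All matching elements before the current node
find_previous()	The first matching element before the current node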
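
A short sketch of these methods, assuming a small made-up fragment with three sibling links inside a p inside a div. Each method accepts the same filters as find_all().

```python
from bs4 import BeautifulSoup

# Illustrative fragment; ids are used only to make the results easy to read
html = ('<div id="box"><p>'
        '<a id="link1">one</a><a id="link2">two</a><a id="link3">three</a>'
        '</p></div>')
soup = BeautifulSoup(html, "html.parser")
link1 = soup.find(id="link1")
link2 = soup.find(id="link2")

parent_div = link2.find_parent("div")
ancestors = [t.name for t in link2.find_parents(["p", "div"])]
next_sib = link2.find_next_sibling("a")
prev_sib = link2.find_previous_sibling("a")
later_sibs = [a["id"] for a in link1.find_next_siblings("a")]
next_a = link2.find_next("a")
earlier_as = [a["id"] for a in link2.find_all_previous("a")]

print(parent_div["id"], ancestors, next_sib["id"], prev_sib["id"])
```

The sibling methods stay on the same level of the tree, while find_next() and find_previous() follow document order and can therefore cross into other branches.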

CSS selectors

Beautiful Soup supports most CSS selectors (http://www.w3.org/TR/CSS2/selector.html). Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS syntax.

html_doc = """
<html>
<head>
  <title>The Dormouse's story</title>
</head>
<body>
  <p class="title"><b>The Dormouse's story</b></p>

  <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
  </p>

  <p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# All a tags
soup.select("a")

# Search level by level (descendants)
soup.select("body a")
soup.select("html head title")

# Direct children of a tag
soup.select("head > title")
soup.select("p > #link1")

# All siblings that follow the matched tag
soup.select("#link1 ~ .sister")

# The first sibling immediately after the matched tag
soup.select("#link1 + .sister")

# Find by class name
soup.select(".sister")
soup.select("[class~=sister]")

# Find by id
soup.select("#link1")
soup.select("a#link1")

# Find by several ids
soup.select("#link1,#link2")

# Find by the presence of an attribute
soup.select('a[href]')

# Find by attribute value
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

# Return at most one match, as a list
soup.select(".sister", limit=1)

# Return only the first match, as a single Tag (or None)
soup.select_one(".sister")
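
To make the selector results concrete, here is a small sketch that runs a few of them against a trimmed copy of the sample paragraph. The helper ids() is hypothetical and exists only to print readable results.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

def ids(tags):
    """Hypothetical helper: collect the id attribute of each tag."""
    return [t["id"] for t in tags]

children = ids(soup.select("p > a"))               # direct children
following = ids(soup.select("#link1 ~ .sister"))   # all later siblings
adjacent = ids(soup.select("#link1 + .sister"))    # next sibling only
by_suffix = ids(soup.select('a[href$="tillie"]'))  # attribute ends with
grouped = ids(soup.select("#link1, #link3"))       # selector list
first = soup.select_one(".sister")                 # single Tag
nothing = soup.select_one("table")                 # None when no match

print(children, following, adjacent, by_suffix, grouped, first["id"])
```

select() always returns a list, even for a single match, whereas select_one() returns the Tag itself or None.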

Tag object methods

Tag attributes

soup = BeautifulSoup('<p class="body strikeout" id="1">Extremely bold</p><p class="body strikeout" id="2">Extremely bold2</p>', 'lxml')
# Get all p tag objects
tags = soup.find_all("p")
# Get the first p tag object
tag = soup.p
# Type of the tag object
type(tag)
# Tag name
tag.name
# Tag attributes
tag.attrs
# Value of the class attribute
tag['class']
# Text inside the tag, as a NavigableString
tag.string

# Iterate over all strings inside the tag
for string in tag.strings:
    print(repr(string))

# Iterate over all strings inside the tag, stripping whitespace and skipping blank strings
for string in tag.stripped_strings:
    print(repr(string))

# Get all text in the tag, including text in descendant tags, as a single Unicode string
tag.get_text()
## Join the pieces with "|"
tag.get_text("|")
## Join the pieces with "|" and strip whitespace from each piece
tag.get_text("|", strip=True)
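
The difference between .string, .strings and get_text() is easiest to see on a tag with nested content. The sketch below assumes a slightly modified paragraph (a nested b tag was added for illustration).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="body strikeout" id="1">Extremely <b>bold</b> text</p>',
    "html.parser",
)
tag = soup.p

name = tag.name
classes = tag["class"]      # class is multi-valued, so a list is returned
tag_id = tag["id"]          # single-valued attributes are plain strings
href = tag.get("href")      # get() returns None for a missing attribute
direct = tag.string         # None here: the tag has more than one child
pieces = list(tag.strings)
stripped = list(tag.stripped_strings)
joined = tag.get_text("|", strip=True)

print(name, classes, tag_id, href, direct, pieces, stripped, joined)
```

Indexing with tag['href'] would raise KeyError for a missing attribute, which is why get() is used for the optional one.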

Getting child nodes

tag.contents  # list of direct children
tag.children  # iterator over direct children
for child in tag.children:
    print(child)

tag.descendants  # recursive iterator over all descendants
for child in tag.descendants:
    print(child)
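
A brief sketch of the difference between direct children and descendants, using a made-up div with two paragraphs:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

# Direct children only: the two p tags
child_names = [child.name for child in div.children]

# Descendants walk the whole subtree, tags and strings alike
descendant_tags = [d.name for d in div.descendants if isinstance(d, Tag)]
descendant_text = [str(d) for d in div.descendants if isinstance(d, NavigableString)]

print(len(div.contents), child_names, descendant_tags, descendant_text)
```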

Getting parent nodes

tag.parent  # the direct parent of the tag
tag.parents  # iterator over all ancestors of the tag

# The last ancestor is the BeautifulSoup object, whose name is "[document]"
for parent in tag.parents:
    print(parent.name)
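
For instance, with a small illustrative document, walking .parents from a link ends at the BeautifulSoup object itself:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>link</a></p></body></html>", "html.parser")
link = soup.a

direct_parent = link.parent.name
ancestor_names = [parent.name for parent in link.parents]

print(direct_parent)   # p
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```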

Getting sibling nodes

# The next sibling
tag.next_sibling

# All siblings after the current tag
tag.next_siblings
for sibling in tag.next_siblings:
    print(repr(sibling))

# The previous sibling
tag.previous_sibling

# All siblings before the current tag
tag.previous_siblings
for sibling in tag.previous_siblings:
    print(repr(sibling))
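
One detail worth illustrating is that siblings include text nodes, so whitespace between tags shows up as a sibling. The two fragments below are assumptions chosen to show both cases.

```python
from bs4 import BeautifulSoup

# No whitespace between the tags: siblings are the neighbouring tags
soup = BeautifulSoup("<p><a>one</a><b>two</b><i>three</i></p>", "html.parser")
after_b = soup.b.next_sibling.name
before_b = soup.b.previous_sibling.name
after_a = [s.name for s in soup.a.next_siblings]
after_last = soup.i.next_sibling          # None: no more siblings

# A newline between the tags becomes a string sibling
spaced = BeautifulSoup("<p><a>one</a>\n<b>two</b></p>", "html.parser")
raw_sibling = spaced.a.next_sibling       # the "\n" string
tag_sibling = spaced.a.find_next_sibling("b")

print(after_b, before_b, after_a, after_last, repr(raw_sibling), tag_sibling)
```

When only tags are of interest, find_next_sibling() with a tag name skips the intervening strings.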

Traversing elements

Beautiful Soup treats every tag and string as an "element". Elements are ordered from top to bottom as they appear in the HTML, and the traversal attributes below step through them one at a time.

# The element after the current tag
tag.next_element

# All elements after the current tag
for element in tag.next_elements:
    print(repr(element))

# The element before the current tag
tag.previous_element
# All elements before the current tag
for element in tag.previous_elements:
    print(repr(element))
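
The following sketch contrasts .next_element with .next_sibling on a small made-up fragment: the next element of a tag is whatever was parsed immediately after its start tag, which is usually its own first child.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>one</a><b>two</b></p>", "html.parser")
a_tag = soup.a

sibling = a_tag.next_sibling      # the b tag, on the same level
element = a_tag.next_element      # the string "one", inside the a tag
following = [str(e) for e in a_tag.next_elements]
before_b = soup.b.previous_element

print(sibling.name)   # b
print(repr(element))  # 'one'
print(following)      # ['one', '<b>two</b>', 'two']
```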

Modifying tag names and attributes

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b

tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1

tag.string = "New link text."
print(tag)

Modifying tag content (NavigableString)

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
tag.string = "New link text."

Adding content to a tag (NavigableString)

from bs4 import NavigableString

soup = BeautifulSoup("<a>Foo</a>", 'lxml')
tag = soup.a
tag.append("Bar")
tag.contents

# Or

new_string = NavigableString("Bar")
tag.append(new_string)
print(tag)

Adding comments (Comment)

A Comment is a special kind of NavigableString, so it can also be added with the append() method.

from bs4 import Comment
soup = BeautifulSoup("<a>Foo</a>", 'lxml')
tag = soup.a
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)
print(tag)
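
Putting the string and comment cases together, a minimal sketch might look like this (the appended text is illustrative):

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<a>Foo</a>", "html.parser")
tag = soup.a

tag.append("Bar")                   # a plain str is wrapped in a NavigableString
tag.append(NavigableString("Baz"))  # an explicit NavigableString
tag.append(soup.new_string("Nice to see you.", Comment))

rendered = str(tag)
child_types = [type(child).__name__ for child in tag.contents]

print(rendered)     # <a>FooBarBaz<!--Nice to see you.--></a>
print(child_types)
```

Each append() adds a separate child node, so the tag ends up with three strings and one comment rather than one merged string.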

Adding tags (Tag)

There are two ways to add a tag: append it inside a specified tag (the append method), or insert it at a specified position (the insert, insert_before, and insert_after methods).

The append method


soup = BeautifulSoup("<b></b>", 'lxml')
tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Link text."
tag.append(new_tag)
print(tag)

* The insert method inserts an object (Tag or NavigableString) at a specified position in the current tag's list of child nodes.
```python
html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html, 'lxml')
tag = soup.a
tag.contents
tag.insert(1, "but did not endorse ")
tag.contents
```

The insert_before() and insert_after() methods add an element as a sibling immediately before or after the current tag.


html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html, 'lxml')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.insert_before(tag)
soup.b
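
As a sketch of how the three positional methods combine, the example below inserts a string into the middle of a link and then adds siblings on both sides of it. The span tag and the "(archived)" text are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.a

# insert(): position 1 is between the first string and the i tag
a_tag.insert(1, "but did not endorse ")

# insert_before() / insert_after(): new siblings of the a tag
note = soup.new_tag("span")
note.string = "Note:"
a_tag.insert_before(note)
a_tag.insert_after(" (archived)")

rendered = str(soup)
print(rendered)
```

Positions passed to insert() index into .contents, so strings count as children just like tags.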

* wrap() and unwrap() wrap a specified element in a tag or remove a tag's wrapper. wrap() returns the new wrapper, and unwrap() returns the tag that was removed.

```python
# Add a wrapper
soup = BeautifulSoup("<p>I wish I was bold.</p>", 'lxml')
soup.p.string.wrap(soup.new_tag("b"))
# Output: <b>I wish I was bold.</b>

soup.p.wrap(soup.new_tag("div"))
# Output: <div><p><b>I wish I was bold.</b></p></div>

# Remove a wrapper
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'lxml')
a_tag = soup.a

a_tag.i.unwrap()
a_tag
# Output: <a href="http://example.com/">I linked to example.com</a>
```

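Removing tags

html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'

# Each example re-parses html so that it starts from the original document

# Remove all child nodes of the current tag
soup = BeautifulSoup(html, 'lxml')
soup.b.clear()

# Remove the current tag and all of its children from soup, and return the removed tag
soup = BeautifulSoup(html, 'lxml')
b_tag = soup.b.extract()
b_tag
soup

# Remove the current tag and all of its children from soup and destroy them; nothing is returned
soup = BeautifulSoup(html, 'lxml')
soup.b.decompose()

# Replace the current tag with the specified element
soup = BeautifulSoup(html, 'lxml')
tag = soup.i
new_tag = soup.new_tag("p")
new_tag.string = "Don't"
tag.replace_with(new_tag)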
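
The effect of each removal method can be checked with a sketch like the one below. It uses "html.parser" so the output contains no added html or body wrappers; the em tag and its text are illustrative.

```python
from bs4 import BeautifulSoup

html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'

# extract(): the removed tag is returned and can still be used
soup = BeautifulSoup(html, "html.parser")
removed = soup.i.extract()
after_extract = str(soup)

# clear(): the tag stays, its children are gone
soup.a.clear()
after_clear = str(soup)

# decompose(): the tag and its contents are destroyed
soup.a.decompose()
after_decompose = str(soup)

# replace_with(): swap one element for another; the old one is returned
soup2 = BeautifulSoup(html, "html.parser")
new_tag = soup2.new_tag("em")
new_tag.string = "example.net"
old = soup2.i.replace_with(new_tag)
after_replace = str(soup2.a)

print(after_extract, after_clear, after_decompose, after_replace, sep="\n")
```

extract() is the choice when the removed subtree is needed elsewhere, while decompose() suits content that should simply be discarded.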

Other methods

Output

# Pretty-printed output
tag.prettify()
# Pretty-printed output encoded as latin-1 (returns bytes)
tag.prettify("latin-1")

When Beautiful Soup parses a document, it converts it to Unicode, and HTML entities in the document are converted to the corresponding Unicode characters. If the document is then converted back to a string, those characters are output as Unicode (UTF-8 when encoded to bytes), so the original HTML entities are not restored.
Beautiful Soup's "Unicode, Dammit" class can also convert Microsoft smart quotes into HTML or XML entities.
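
A small sketch of the entity behaviour described above, assuming a made-up paragraph that contains the entities &ldquo;, &rdquo; and &amp;:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>&ldquo;Hello&rdquo; &amp; goodbye</p>", "html.parser")

text = soup.p.string                  # entities become Unicode characters
as_str = str(soup.p)                  # only &, < and > are escaped on output
as_utf8 = soup.p.encode("utf-8")      # bytes, curly quotes encoded as UTF-8
as_latin1 = soup.p.encode("latin-1")  # unencodable characters become numeric references

print(repr(text))
print(as_str)
print(as_utf8)
print(as_latin1)
```

The curly quotes come back as literal characters rather than as &ldquo; and &rdquo;, and they only turn into numeric character references when the target encoding cannot represent them.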

Document encoding

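When Beautiful Soup parses a document, it converts it to Unicode. It uses an automatic encoding detection sub-library (called "Unicode, Dammit" in the official documentation) to identify the document's encoding and convert it to Unicode.

# Example input as bytes (Hebrew text encoded as ISO-8859-8);
# original_encoding is None when the input is already a str
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, 'lxml')
soup.original_encoding

# The document encoding can also be specified manually
soup = BeautifulSoup(markup, 'lxml', from_encoding="iso-8859-8")
soup.original_encoding

# To make automatic encoding detection more efficient, some encodings can be excluded in advance
soup = BeautifulSoup(markup, 'lxml', exclude_encodings=["ISO-8859-7"])

Regardless of the encoding of the input document, Beautiful Soup outputs documents encoded as UTF-8 by default.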
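
The round trip from an input encoding to the default UTF-8 output can be sketched as follows. The input bytes are an assumption made for illustration: the word "café" encoded as Latin-1.

```python
from bs4 import BeautifulSoup

# Illustrative input: Latin-1 bytes with no encoding declared in the markup
markup = "<p>caf\u00e9</p>".encode("latin-1")

# Tell Beautiful Soup the input encoding instead of relying on detection
soup = BeautifulSoup(markup, "html.parser", from_encoding="latin-1")
print(soup.original_encoding)

text = soup.p.string                     # a Unicode string
utf8_bytes = soup.p.encode()             # UTF-8 is the default output encoding
latin1_bytes = soup.p.encode("latin-1")  # an explicit output encoding

print(text, utf8_bytes, latin1_bytes)
```

Passing from_encoding only makes sense for bytes input; a str is already Unicode, so there is nothing to detect.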

Document parsers

Beautiful Soup currently supports the "lxml", "html5lib", and "html.parser" parsers.

soup=BeautifulSoup("<a><b /></a>", "lxml")
soup
# Output: <html><body><a><b></b></a></body></html>
soup=BeautifulSoup("<a></p>", "lxml")
soup
# Output: <html><body><a></a></body></html>
soup=BeautifulSoup("<a></p>", "html5lib")
soup
# Output: <html><head></head><body><a><p></p></a></body></html>
soup=BeautifulSoup("<a></p>", "html.parser")
soup
# Output: <a></a>
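
Since lxml and html5lib are optional installs, a comparison script may want to skip parsers that are missing. The sketch below is one possible way to do that, assuming bs4's FeatureNotFound exception is raised for an unavailable parser; the results dictionary is a name chosen for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<a></p>"  # deliberately invalid: a stray closing tag
results = {}

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        results[parser] = None  # this parser is not installed

for parser, output in results.items():
    print(f"{parser}: {output}")
```

Because each parser repairs invalid markup differently, naming the parser explicitly keeps results consistent across machines.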

Reference Documents

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh