I. BeautifulSoup Introduction and Installation
1. Introduction
In simple terms, BeautifulSoup is a Python parsing library whose main purpose is to parse HTML data from web pages.
The official description is as follows:
Beautiful Soup provides a few simple, Python-style functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and extracts the data the user needs; because it is simple, a complete application does not require much code.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not specify one, in which case Beautiful Soup cannot detect the encoding automatically and you just need to state the original encoding.
Beautiful Soup has become an excellent Python tool on a par with lxml and html5lib, giving users great flexibility in choosing among parsing strategies and trading speed for robustness.
2. Install
It can be installed directly with pip:
pip install beautifulsoup4
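Note that although the package is installed as beautifulsoup4, it is imported under the name bs4. A quick sanity check after installing:

```python
# The package name on PyPI is "beautifulsoup4", but the import name is "bs4".
import bs4
from bs4 import BeautifulSoup

# Print the installed version; the exact value depends on your release.
print(bs4.__version__)
```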
II. How to Use BeautifulSoup
1. Notes
BeautifulSoup requires a parser to be specified when used:
- html.parser - built into Python, but its fault tolerance is not high; some parts of poorly formed pages may be lost
- lxml - fast parsing; requires additional installation
- xml - part of the lxml library; supports XML documents
- html5lib - best fault tolerance, but slightly slower
lxml and html5lib require additional installation, which can also be done with pip (lxml is recommended).
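The choice of parser matters most on malformed markup, because each parser repairs broken HTML differently. A small sketch using the built-in html.parser (no extra install needed); the lxml line is commented out in case lxml is not installed:

```python
# The same malformed fragment can be repaired differently by each parser.
from bs4 import BeautifulSoup

broken = "<p>unclosed paragraph<b>bold"

# Python's built-in parser closes the dangling tags at the end of input
print(BeautifulSoup(broken, "html.parser"))
# lxml (if installed) additionally wraps the fragment in <html><body> tags:
# print(BeautifulSoup(broken, "lxml"))
```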
2. Usage
Take the following HTML document fragment as an example:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="beautiful title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Initialize the object, specifying lxml as the parser:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
2.1 Getting Tag Information
Completing the HTML code and getting tag information:
print(soup.prettify()) # pretty-print the completed HTML code
print(soup.p) # get the entire first p tag
print(soup.p.name) # get the name of the p tag
# Get tag attributes
print(soup.p.attrs) # get all attributes of the p tag, returned as a dict
print(soup.p['class']) # get the class attribute of the p tag, returned as a list
# Get tag text
print(soup.p.string) # get the text of the p tag; returns None if the tag contains multiple children with text
print(soup.p.strings) # get all text inside the p tag, returned as a generator
print(soup.p.text) # get all text inside the p tag, concatenated into a single string
print(soup.stripped_strings) # all text with surrounding whitespace stripped, returned as a generator
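The difference between .string and .text trips up many newcomers, so it is worth a quick demonstration. A minimal sketch using the built-in html.parser (so no extra install is needed):

```python
from bs4 import BeautifulSoup

html_doc = '<p class="story">Once upon a time <a href="#">Elsie</a> lived.</p>'
soup = BeautifulSoup(html_doc, "html.parser")

# .string is None because the p tag holds more than one child with text
print(soup.p.string)
# .text concatenates all descendant text into one string
print(soup.p.text)
# .strings yields each text fragment separately
print(list(soup.p.strings))
```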
2.2 Getting Element Nodes
Getting a given element's parent/ancestor nodes, child/descendant nodes, and sibling nodes:
# Get parent/ancestor nodes
print(soup.p.parent) # get the direct parent of the p tag
print(soup.p.parents) # get the ancestors of the p tag, returned as a generator
# Get child/descendant nodes
print(soup.p.contents) # get the direct children of the p tag, returned as a list
print(soup.p.children) # get the direct children of the p tag, returned as a generator
print(soup.p.descendants) # get the descendants of the p tag, returned as a generator
# Get sibling nodes
print(soup.a.previous_sibling) # get the previous sibling of the a tag
print(soup.a.previous_siblings) # get all siblings before the a tag, returned as a generator
print(soup.a.next_sibling) # get the next sibling of the a tag
print(soup.a.next_siblings) # get all siblings after the a tag, returned as a generator
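One common surprise with sibling navigation: the node right after a tag is often a whitespace or punctuation text node, not the next tag. A small sketch of this gotcha, using html.parser to stay self-contained:

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">names were
<a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>;</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# next_sibling is the text node ',\n', not the second <a> tag
print(repr(soup.a.next_sibling))
# find_next_sibling() skips text nodes and returns the next matching tag
print(soup.a.find_next_sibling("a"))
```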
2.3 Use selector
Not all information can be easily obtained through a structured, usually find () and find_all () method to find:
- Find () - returns a match result to the first
- find_all () - Returns a list of all matching results
Since find () and find_all () is almost the same in the use, so this list only find_all () to use
import re

print(soup.find_all(text=re.compile('Lacie'), limit=2)) # use a regex to find all text strings containing 'Lacie' (limit: cap the number of matches)
print(soup.find_all('a', text='Lacie')) # get all a tags whose text equals 'Lacie' (exact text match)
print(soup.find_all('a', id='link2')) # get all a tags whose id equals 'link2'
print(soup.find_all('a', class_='sister')) # get all a tags whose class equals 'sister'
print(soup.find_all('a', class_='sister', id='link2')) # combine multiple search conditions
print(soup.find_all(name='a')) # get all a tags
print(soup.find_all(attrs={'class': 'sister'})) # get all tags whose class attribute is 'sister'
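Since find() returns a single tag (or None) rather than a list, it can be chained directly into attribute access; guard against None when a match may not exist. A short self-contained sketch:

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first match directly, so attributes can be chained
print(soup.find("a", class_="sister")["href"])
# find() returns None when nothing matches -- check before chaining
missing = soup.find("a", id="link9")
print(missing)
# find_all() also accepts a function that is called on each tag
print(soup.find_all(lambda tag: tag.name == "a" and tag.get("id") == "link2"))
```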
2.4 Using CSS Selectors
If you are familiar with CSS selectors, BeautifulSoup also provides the corresponding select() method:
- `.` represents a class
- `#` represents an id
print(soup.select('p')) # get all p tags, returned as a list
print(soup.select('p a')) # get all a tags inside p tags, returned as a list
print(soup.select('p.story')) # get all p tags whose class is 'story', returned as a list
print(soup.select('.story')) # get all elements whose class is 'story', returned as a list
print(soup.select('.beautiful.title')) # get all elements with both classes 'beautiful' and 'title', returned as a list
print(soup.select('#link1')) # get all elements whose id is 'link1', returned as a list
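CSS selectors combine nicely with attribute access for extraction. A sketch pulling the text and href of every sister link out of a fragment like the document above:

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# select() returns a list of tags; pull text and href out of each one
links = [(a.text, a["href"]) for a in soup.select("a.sister")]
print(links)
# select_one() returns only the first match (analogous to find())
print(soup.select_one("#link1").text)
```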