Article directory
- 1. Introduction to BeautifulSoup
- 2. Installation
- 3. The principle of bs4 data parsing
- 4. Commonly used methods and properties of bs4
- 5. contents, children and descendants
- 6. parent and parents
- 7. next_sibling and previous_sibling
- 8. next_element and previous_element
- 9. find() and find_all()
- 10. select() and select_one()
- 11. A practical example
- 12. CSS selectors
- 13. Usage summary
1. Introduction to BeautifulSoup
BeautifulSoup is a Python library for extracting data from HTML and XML files. It works with parsers such as lxml and html5lib, giving users the flexibility to choose between different parsing strategies or to trade flexibility for speed.
BeautifulSoup Official Documentation: BeautifulSoup
Study notes about the use of BeautifulSoup: Rakuten Notes
2. Installation
pip install beautifulsoup4  # install the BeautifulSoup package
pip install lxml            # install the lxml parser
How to use each parser, and a comparison of their advantages and disadvantages:
# Python standard library
BeautifulSoup(html, 'html.parser')
# Advantages: built-in standard library, moderate speed, tolerant of malformed documents
# Disadvantages: poor fault tolerance before Python 3.2
# lxml HTML parser
BeautifulSoup(html, 'lxml')
# Advantages: fast, tolerant of malformed documents
# Disadvantages: requires the C library to be installed
# lxml XML parser
BeautifulSoup(html, 'xml')
# Advantages: fast, the only parser that supports XML
# Disadvantages: requires the C library to be installed
# html5lib
BeautifulSoup(html, 'html5lib')
# Advantages: the most fault tolerant, generates valid HTML5
# Disadvantages: runs slowly
Parser summary
| Parser | Usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | BeautifulSoup(html, "html.parser") | Built-in standard library, moderate speed, tolerant of malformed documents | Poor fault tolerance before Python 3.2 |
| lxml HTML parser | BeautifulSoup(html, "lxml") | Fast, tolerant of malformed documents | Requires the C library |
| lxml XML parser | BeautifulSoup(html, "xml") | Fast, the only parser that supports XML | Requires the C library |
| html5lib | BeautifulSoup(html, "html5lib") | The most fault tolerant, generates valid HTML5 | Runs slowly |
3. The principle of bs4 data parsing
- Instantiate a BeautifulSoup object and load the page source into it.
- Locate tags and extract data by calling the object's related properties and methods.
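These two steps can be sketched as follows (the snippet and URL are made up for illustration):

```python
from bs4 import BeautifulSoup

# Step 1: instantiate a BeautifulSoup object and load the page source into it
html = '<div id="container"><a href="https://example.com">link text</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# Step 2: locate a tag and extract data through properties and methods
tag = soup.find('a')
print(tag['href'])   # https://example.com
print(tag.text)      # link text
```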
4. Commonly used methods and properties of bs4
1. BeautifulSoup construction
1.1 Build from String
from bs4 import BeautifulSoup
html = """
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<div id="container">
<span class="title">
<h3>Python爬虫网页解析神器BeautifulSoup详细讲解</h3>
</span>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
# Pretty-print the contents of the soup object
print(soup.prettify())
prettify() formats and prints the contents of the document; it will be used frequently from here on.
1.2 Load from file
from bs4 import BeautifulSoup
with open(r"D:\index.html") as fp:
soup = BeautifulSoup(fp, "lxml")
print(soup.prettify())
2. Four objects of BeautifulSoup
Beautiful Soup converts complex HTML documents into a complex tree structure, each node is a Python object, and all objects can be summarized into 4 types: Tag, NavigableString, BeautifulSoup, Comment
2.1 Tag object
A Tag object also exposes string, strings, and stripped_strings.
If a node contains only text, you can access that text directly through string, for example:
from bs4 import BeautifulSoup
html = """
<title>The Kevin's story house</title>
<span>这里是王菜鸟的Python系列文章</span>
<p class="link"><a href="https://token.blog.csdn.net/">王菜鸟的博客</a></p>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)
print(soup.span.text)
print(soup.a['href'])
# Output
The Kevin's story house
这里是王菜鸟的Python系列文章
https://token.blog.csdn.net/
The attribute access above returns the first matching tag in the document. A Tag has two important attributes: name and attrs.
print(soup.p.attrs)        # all attributes of the p tag, as a dictionary
print(soup.p['class'])     # get a single attribute
print(soup.p.get('class')) # same as above, via get()
# Output
{'class': ['link']}
['link']
['link']
Make changes to these attributes and content:
soup.p['class'] = "newClass"
print(soup)
# Output
<title>The Kevin's story house</title>
<span>这里是王菜鸟的Python系列文章</span>
<p class="newClass">
<a href="https://token.blog.csdn.net/">王菜鸟的博客</a>
</p>
Additionally, an attribute can be removed:
del soup.p['class']
print(soup)
# Output
<title>The Kevin's story house</title>
<span>这里是王菜鸟的Python系列文章</span>
<p>
<a href="https://token.blog.csdn.net/">王菜鸟的博客</a>
</p>
tag.attrs is a dictionary, so an attribute can be read with tag.get('id') or tag.get('class'); if the attribute does not exist, get() returns None. Subscript access (tag['id']) raises a KeyError instead.
You can also use get_text() to extract the text content:
# Get all of the text content
soup.get_text()
# Join the text of different nodes with |
soup.get_text("|")
# Also strip whitespace from each piece of text
soup.get_text("|", strip=True)
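A runnable illustration of the three calls above, on a tiny made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>Hello</b> <i>world</i></p>', 'html.parser')

print(soup.get_text())                 # Hello world
print(soup.get_text('|'))              # every text node joined with |
print(soup.get_text('|', strip=True))  # Hello|world
```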
2.2 NavigableString object
If you want to get the text inside a tag, you can use .string:
print(soup.a.string)
print(type(soup.a.string))
# Output
王菜鸟的博客
<class 'bs4.element.NavigableString'>
2.3 BeautifulSoup object
The BeautifulSoup object represents the entire document. Most of the time it can be treated as a Tag; it is a special Tag, and its name and attributes can be inspected:
print(soup.name)
print(type(soup.name))
print(soup.attrs)
# Output
[document]
<class 'str'>
{}
2.4 Comment object
The Comment object is a special type of NavigableString; when a comment is printed, the comment markers are not included.
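For example (a minimal sketch; the comment text is made up):

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup('<p><!-- hidden note --></p>', 'html.parser')
content = soup.p.string

print(type(content))   # <class 'bs4.element.Comment'>
print(content)         # the <!-- --> markers are stripped from the output
```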
5. contents, children and descendants
contents, children and descendants all give the child nodes of a node, but
- contents is a list
- children is a generator
Note: contents and children contain only direct children; descendants is also a generator, but it yields all descendants of the node.
Examples of child nodes:
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
print(type(soup.p.contents))
# Output
['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']
<class 'list'>
Examples of descendant nodes:
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
print(i, child)
6. parent and parents
- parent: the direct parent node
- parents: all ancestors, recursively
Parent node example:
from bs4 import BeautifulSoup
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.span.parent)
Example of recursive parent nodes:
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))
7. next_sibling and previous_sibling
- next_sibling: the next sibling node
- previous_sibling: the previous sibling node
Sibling node example:
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))
8. next_element and previous_element
- next_element: the next node in parse order
- previous_element: the previous node in parse order
The difference between next_element and next_sibling is:
- next_sibling starts parsing from the end tag of the current tag
- next_element starts parsing from the start tag of the current tag
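A small sketch of the difference, on a made-up two-tag fragment:

```python
from bs4 import BeautifulSoup

b = BeautifulSoup('<p><b>one</b><i>two</i></p>', 'html.parser').b

# next_sibling skips over b's own contents to the node after its end tag
print(b.next_sibling)   # <i>two</i>
# next_element continues in parse order from b's start tag, into b itself
print(b.next_element)   # one
```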
9. find() and find_all()
9.1 Related methods
- find_parent: find the parent node
- find_parents: find all ancestors, recursively
- find_next_siblings: find all following sibling nodes
- find_next_sibling: find the first following sibling that matches
- find_all_next: find all following nodes that match
- find_next: find the first following node that matches
- find_all_previous: find all preceding nodes that match
- find_previous: find the first preceding node that matches
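A few of these methods in action, on a made-up two-paragraph fragment:

```python
from bs4 import BeautifulSoup

html = '<div><p id="a">first</p><p id="b">second</p></div>'
soup = BeautifulSoup(html, 'html.parser')
first = soup.find('p', id='a')

print(first.find_next_sibling('p')['id'])   # b
print(first.find_parent('div').name)        # div

second = soup.find('p', id='b')
print(second.find_previous_sibling('p')['id'])  # a
```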
9.2 Tag name
# Find all p nodes
soup.find_all('p')
# Find title nodes, without recursing into grandchildren
soup.find_all("title", recursive=False)
# Find p nodes and span nodes
soup.find_all(["p", "span"])
# Find the first a node; equivalent to the find() below
soup.find_all("a", limit=1)
soup.find('a')
9.3 Attributes
# Find nodes whose id is id1
soup.find_all(id='id1')
# Find nodes whose name attribute is tim; because the name keyword is
# reserved for the tag name, the attribute must be passed through attrs
soup.find_all(attrs={"name": "tim"})
# Find p nodes whose class is clazz
soup.find_all("p", "clazz")
soup.find_all("p", class_="clazz")
# Find p nodes that have both classes, body and strikeout
soup.find_all("p", class_="body strikeout")
9.4 Regular expressions
import re
# Find nodes whose class starts with p
soup.find_all(class_=re.compile("^p"))
9.5 Functions
# Find nodes that have a class attribute but no id attribute
def hasClassNoId(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(hasClassNoId)
9.6 Text
# Find strings whose text is exactly "Elsie"
soup.find_all(string="Elsie")
# Find strings whose text matches a regular expression
soup.find_all(string=re.compile("sisters"))
10. select() and select_one()
select() returns all elements that match the selector, while select_one() returns only the first match.
The focus of select() is the selector. CSS selectors include id selectors and class selectors: a tag name takes no prefix, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax is used here to filter elements.
10.1 Select by tag
Selecting by tag is straightforward: list the tag names according to their hierarchy, separated by spaces.
# Select the title node
soup.select("title")
# Select all a nodes under the body node
soup.select("body a")
# Select the title node under head under html
soup.select("html head title")
10.2 id and class selectors
The id and class selectors are also relatively simple. The class selector starts with . and the id selector starts with #.
# Select nodes whose class is article
soup.select(".article")
# Select the a node whose id is id1
soup.select("a#id1")
# Select the node whose id is id1
soup.select("#id1")
# Select the nodes whose id is id1 or id2
soup.select("#id1,#id2")
10.3 Attribute selectors
# Select a nodes that have an href attribute
soup.select('a[href]')
# Select a nodes whose href is exactly http://mycollege.vip/tim
soup.select('a[href="http://mycollege.vip/tim"]')
# Select a nodes whose href starts with http://mycollege.vip/
soup.select('a[href^="http://mycollege.vip/"]')
# Select a nodes whose href ends with png
soup.select('a[href$="png"]')
# Select a nodes whose href contains china
soup.select('a[href*="china"]')
# Select a nodes whose href contains china as a whitespace-separated word
soup.select("a[href~=china]")
10.4 Other selectors
# p nodes whose parent is a div node
soup.select("div > p")
# p nodes immediately preceded by a div sibling
soup.select("div + p")
# ul nodes preceded by a p sibling (p and ul share the same parent)
soup.select("p~ul")
# The third p node under its parent
soup.select("p:nth-of-type(3)")
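A runnable sketch combining several of the selector types above (the snippet and URL are made up):

```python
from bs4 import BeautifulSoup

html = '''
<div id="main">
  <p class="intro">hello</p>
  <ul><li><a href="https://example.com/a.png">pic</a></li></ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.select_one('#main > p.intro').text)   # hello (id, class, child)
print(soup.select('a[href$="png"]')[0]['href'])  # attribute ends-with match
print(soup.select('p + ul')[0].name)             # ul (adjacent sibling)
```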
11. A practical example
The following case exercises the usage of find(), find_all(), select(), and select_one().
from bs4 import BeautifulSoup
text = '''
<li class="subject-item">
<div class="pic">
<a class="nbg" href="https://mycollege.vip/subject/25862578/">
<img class="" src="https://mycollege.vip/s27264181.jpg" width="90">
</a>
</div>
<div class="info">
<h2 class=""><a href="https://mycollege.vip/subject/25862578/" title="解忧杂货店">解忧杂货店</a></h2>
<div class="pub">[日] 东野圭吾 / 李盈春 / 南海出版公司 / 2014-5 / 39.50元</div>
<div class="star clearfix">
<span class="allstar45"></span>
<span class="rating_nums">8.5</span>
<span class="pl">
(537322人评价)
</span>
</div>
<p>现代人内心流失的东西,这家杂货店能帮你找回——僻静的街道旁有一家杂货店,只要写下烦恼投进卷帘门的投信口,
第二天就会在店后的牛奶箱里得到回答。因男友身患绝... </p>
</div>
</li>
'''
soup = BeautifulSoup(text, 'lxml')
print(soup.select_one("a.nbg").get("href"))
print(soup.find("img").get("src"))
title = soup.select_one("h2 a")
print(title.get("href"))
print(title.get("title"))
print(soup.find("div", class_="pub").string)
print(soup.find("span", class_="rating_nums").string)
print(soup.find("span", class_="pl").string.strip())
print(soup.find("p").string)
12. CSS selectors
12.1 Common selectors
12.2 Location selector
12.3 Other selectors
13. Usage summary
- It is recommended to use the lxml parser, falling back to html.parser when necessary
- Tag attribute selection (e.g. soup.p) is fast, but weak at filtering
- Use find() and find_all() to match a single result or multiple results
- If you are familiar with CSS selectors, use select() and select_one()
- Memorize the common ways to get attribute values and text