Python crawler learning 30
6. Use of Beautiful Soup
We have already covered the lxml library; today we will learn how to use the Beautiful Soup library.
6-1 Introduction to Beautiful Soup
Beautiful Soup is an HTML/XML parsing library for Python that makes it easy to extract data from web pages.
It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8.
So we usually don't need to worry about encodings at all; only when the document doesn't declare an encoding do we need to state the original encoding ourselves.
Like lxml, Beautiful Soup is a Python library for parsing web pages.
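As a small sketch of this encoding handling (using the standard library's html.parser here so it runs without any extra install):

```python
from bs4 import BeautifulSoup

# UTF-8 encoded bytes; Beautiful Soup decodes them to Unicode for us
data = '<p>你好</p>'.encode('utf-8')
soup = BeautifulSoup(data, 'html.parser')
print(soup.p.string)  # 你好
```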
6-2 Parser
Beautiful soup needs to rely on the parser when parsing. In addition to supporting the HTML parser of the python standard library, it also supports some third-party parsers (such as lxml).
Install before using:
pip3 install beautifulsoup4
The download may time out; retry a few times, or simply use a mirror.
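For example, with pip's `-i` option you can point at a mirror index; the Tsinghua mirror below is just one common choice:

```shell
# Install beautifulsoup4 from a PyPI mirror (one possible mirror choice)
pip3 install beautifulsoup4 -i https://pypi.tuna.tsinghua.edu.cn/simple
```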
After the installation completes, call the lxml parser through bs4:
# Invoking the parser
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello python</p>', 'lxml')
print(soup.p.string)
Running result:
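For comparison, here is the same call with the standard library's html.parser, which needs no extra installation (at the cost of some speed and fault tolerance compared to lxml):

```python
from bs4 import BeautifulSoup

# Same example, parsed with Python's built-in parser instead of lxml
soup = BeautifulSoup('<p>Hello python</p>', 'html.parser')
print(soup.p.string)  # Hello python
```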
6-3 Basic use
from bs4 import BeautifulSoup
html = """
<title>一段html文本</title>
<div class="nav">
<ul>
<li><a href="https://www.qbiqu.com/">首页</a></li>
<li><a href="/modules/article/bookcase.php">我的书架</a></li>
<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>
<li><a href="/xiuzhenxiaoshuo/">修真小说</a></li>
<li><a href="/dushixiaoshuo/">都市小说</a></li>
<li><a href="/chuanyuexiaoshuo/">穿越小说</a></li>
<li><a href="/wangyouxiaoshuo/">网游小说</a></li>
<li><a href="/kehuanxiaoshuo/">科幻小说</a></li>
<li><a href="/paihangbang/">排行榜单</a></li>
<li><a href="/wanben/1_1">完本小说</a></li>
<li><a href="/xiaoshuodaquan/">全部小说</a></li>
<li><script type="text/javascript">yuedu();</script></li>
</ul>
</div>
<div id="banner" style="display:none"></div>
<div class="dahengfu"><script type="text/javascript">list1();</script></div>
"""
# Invoke the lxml parser
soup = BeautifulSoup(html, 'lxml')
# Pretty-print the HTML text
print(soup.prettify())
# Get the string inside the title node
print(soup.title.string)
Running result:
In the output HTML text, you can see that the html, head, and other missing nodes are completed automatically.
And the string we get is:
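Note that .string only returns text when a node has a single text child. A small sketch of the difference between .string and get_text() (html.parser is used here so the snippet runs without lxml):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><b>Hello</b> python</div>', 'html.parser')
# .string is None because div has more than one child node
print(soup.div.string)      # None
# get_text() concatenates all text descendants
print(soup.div.get_text())  # Hello python
```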
6-4 Node Selector
Select nodes with the node selector:
# Node selector
from bs4 import BeautifulSoup
html = """
<title>一段html文本</title>
<div class="nav">
<ul>
<li><a href="https://www.qbiqu.com/">首页</a></li>
<li><a href="/modules/article/bookcase.php">我的书架</a></li>
<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>
<li><a href="/xiuzhenxiaoshuo/">修真小说</a></li>
<li><a href="/dushixiaoshuo/">都市小说</a></li>
<li><a href="/chuanyuexiaoshuo/">穿越小说</a></li>
<li><a href="/wangyouxiaoshuo/">网游小说</a></li>
<li><a href="/kehuanxiaoshuo/">科幻小说</a></li>
<li><a href="/paihangbang/">排行榜单</a></li>
<li><a href="/wanben/1_1">完本小说</a></li>
<li><a href="/xiaoshuodaquan/">全部小说</a></li>
<li><script type="text/javascript">yuedu();</script></li>
</ul>
</div>
<div id="banner" style="display:none"></div>
<div class="dahengfu"><script type="text/javascript">list1();</script></div>
"""
soup = BeautifulSoup(html, 'lxml')
# Select the title node
print(soup.title)
# Type of the title node
# The returned result is of class bs4.element.Tag
print(type(soup.title))
# Get the string inside the title node
print(soup.title.string)
# Get the head node
print(soup.head)
# Get the p node; if there is no match, the result is None
print(soup.p)
Running result:
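Building on the node selector, attributes of a selected node can be read with dictionary-style indexing. A minimal sketch using the first link from the example page (html.parser here, so it runs without lxml):

```python
from bs4 import BeautifulSoup

html = '<ul><li><a href="https://www.qbiqu.com/">首页</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')
# The node selector returns only the first matching node
print(soup.a['href'])  # https://www.qbiqu.com/
print(soup.a.string)   # 首页
```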
That's all for today, to be continued...