Python crawler learning 30

6. Use of Beautiful Soup

We covered the lxml library earlier; today we will learn how to use the Beautiful Soup library.

6-1 Introduction to Beautiful Soup

Beautiful Soup is an HTML and XML parsing library for Python. We can use it to easily extract data from web pages.

It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8.

As a result, we usually don't need to think about encodings. Only when the document doesn't declare an encoding and Beautiful Soup can't detect it do we need to specify the original encoding.
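The encoding behavior described above can be sketched as follows. This is a minimal illustration, not part of the original article: the GBK-encoded sample string is made up, and the built-in 'html.parser' is used only so the snippet has no extra dependency. The `from_encoding` argument and the `original_encoding` attribute are part of the Beautiful Soup API.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: bytes in GBK with no encoding declared in the markup
raw = "<p>你好，Python</p>".encode("gbk")

# Tell Beautiful Soup the original encoding explicitly
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

# Inside the tree, all text is Unicode (a Python str)
text = soup.p.string
print(text)

# The encoding Beautiful Soup used to decode the input
print(soup.original_encoding)

# Output as UTF-8 bytes
utf8_bytes = soup.encode("utf-8")
print(utf8_bytes)
```

If `from_encoding` is omitted, Beautiful Soup tries to guess the encoding, which may or may not succeed on short inputs like this one.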

Like lxml, Beautiful Soup is a Python library for parsing web pages.

6-2 Parser

Beautiful Soup relies on a parser to do the actual parsing. In addition to the HTML parser in the Python standard library (html.parser), it also supports some third-party parsers (such as lxml).
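As a rough sketch of how the parser choice might be handled in practice, the helper below tries a preferred parser and falls back to the standard library's 'html.parser' if the preferred one is not installed. The function name `make_soup` is hypothetical; `FeatureNotFound` is the exception Beautiful Soup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the built-in one
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Note that different parsers can build slightly different trees from the same broken markup, so a silent fallback like this is a convenience, not a guarantee of identical results.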

Install Beautiful Soup before using it:

pip3 install beautifulsoup4

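The download may time out. If it does, retry a few times or switch to a mirror.

Once the installation is complete, import BeautifulSoup from the bs4 package and pass 'lxml' as the second argument to select the lxml parser (lxml itself must also be installed, e.g. with pip3 install lxml):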

# Calling the parser
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello python</p>', 'lxml')
print(soup.p.string)

Output:

Hello python

6-3 Basic use

from bs4 import BeautifulSoup

html = """
<title>一段html文本</title>
<div class="nav">
			<ul>
				<li><a href="https://www.qbiqu.com/">首页</a></li>
                <li><a href="/modules/article/bookcase.php">我的书架</a></li>
				<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>
				<li><a href="/xiuzhenxiaoshuo/">修真小说</a></li>
				<li><a href="/dushixiaoshuo/">都市小说</a></li>
				<li><a href="/chuanyuexiaoshuo/">穿越小说</a></li>
				<li><a href="/wangyouxiaoshuo/">网游小说</a></li>
				<li><a href="/kehuanxiaoshuo/">科幻小说</a></li>
				<li><a href="/paihangbang/">排行榜单</a></li>
				<li><a href="/wanben/1_1">完本小说</a></li>
				<li><a href="/xiaoshuodaquan/">全部小说</a></li>
                <li><script type="text/javascript">yuedu();</script></li>
			</ul>
		</div>
        <div id="banner" style="display:none"></div>
		<div class="dahengfu"><script type="text/javascript">list1();</script></div>
"""
# Use the lxml parser
soup = BeautifulSoup(html, 'lxml')
# Pretty-print the parsed HTML with standard indentation
print(soup.prettify())
# Get the string inside the title node
print(soup.title.string)

Output:

In the prettified HTML, you can see that html, head and other missing nodes are completed automatically.

The string we get from the title node:

一段html文本
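How much gets completed depends on the parser. The sketch below is an illustration with a made-up fragment, and it deliberately swaps in the standard library's 'html.parser' instead of lxml: it closes the unclosed p tag but does not add html or head nodes, unlike the lxml result described above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with an unclosed <p> tag
fragment = "<title>Demo</title><p>unclosed paragraph"

# Note: 'html.parser' here, not 'lxml' as in the article's example
soup = BeautifulSoup(fragment, "html.parser")

# prettify() returns a str with one node per line and indentation
pretty = soup.prettify()
print(pretty)

# html.parser does not wrap the fragment in <html>/<head>/<body>
has_html_node = soup.html is not None
print(has_html_node)
```

Running the same fragment with 'lxml' is a quick way to compare how the two parsers repair incomplete markup.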

6-4 Node Selector

With the node selector, a node is selected by using its tag name as an attribute of the soup object:

# Node selector

from bs4 import BeautifulSoup

html = """
<title>一段html文本</title>
<div class="nav">
			<ul>
				<li><a href="https://www.qbiqu.com/">首页</a></li>
                <li><a href="/modules/article/bookcase.php">我的书架</a></li>
				<li><a href="/xuanhuanxiaoshuo/">玄幻小说</a></li>
				<li><a href="/xiuzhenxiaoshuo/">修真小说</a></li>
				<li><a href="/dushixiaoshuo/">都市小说</a></li>
				<li><a href="/chuanyuexiaoshuo/">穿越小说</a></li>
				<li><a href="/wangyouxiaoshuo/">网游小说</a></li>
				<li><a href="/kehuanxiaoshuo/">科幻小说</a></li>
				<li><a href="/paihangbang/">排行榜单</a></li>
				<li><a href="/wanben/1_1">完本小说</a></li>
				<li><a href="/xiaoshuodaquan/">全部小说</a></li>
                <li><script type="text/javascript">yuedu();</script></li>
			</ul>
		</div>
        <div id="banner" style="display:none"></div>
		<div class="dahengfu"><script type="text/javascript">list1();</script></div>
"""
soup = BeautifulSoup(html, 'lxml')
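# Select the title node
print(soup.title)
# Type of the title node
# The result is the bs4.element.Tag class
print(type(soup.title))
# Get the string inside the title node
print(soup.title.string)
# Get the head node
print(soup.head)
# Get the p node; there is none in the document, so the result is None
print(soup.p)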

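Output:

The five print calls show, in order: the title node itself, its type (bs4.element.Tag), the string inside it, the head node that the parser completed automatically, and None for the missing p node.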
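Because a missing node comes back as None, chaining further attribute access onto it would raise an AttributeError. A minimal sketch of node selection with a None check is shown below; the shortened HTML is made up for illustration, and 'html.parser' is used only to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the navigation markup
html = (
    '<div class="nav"><ul>'
    '<li><a href="/a/">A</a></li>'
    '<li><a href="/b/">B</a></li>'
    '</ul></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Selecting by tag name returns only the FIRST matching node
first_link = soup.a
print(first_link.name)       # tag name
print(first_link["href"])    # attribute access by key
print(soup.div["class"])     # class is multi-valued, so this is a list

# Selectors can be chained to walk down the tree
print(soup.div.ul.li.a.string)

# Guard against missing nodes before reading from them
missing = soup.p
text = missing.string if missing is not None else ""
print(repr(text))
```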

That's all for today. To be continued...

Origin blog.csdn.net/szshiquan/article/details/124180016