Author: IT小样
The previous post gave a few simple examples of basic BeautifulSoup usage. This post covers how BeautifulSoup works and how to use it in more detail.

How BeautifulSoup works

1. Getting the document object

To get a BeautifulSoup object, pass a string or an open file handle to the BeautifulSoup() constructor. For example:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"), "html.parser")  # from a file handle
soup = BeautifulSoup("hello", "html.parser")  # from a string

The document is first converted to Unicode and then handed to a parser.
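
As a small illustration of the conversion step, the sketch below feeds UTF-8 bytes to BeautifulSoup and checks that the parsed text comes back as an ordinary Unicode str. The sample markup, the html.parser backend, and the from_encoding="utf-8" hint are assumptions made for this example, not something the text above prescribes.

```python
from bs4 import BeautifulSoup

# Sample input for illustration only: markup supplied as raw UTF-8 bytes.
raw = "<p>café</p>".encode("utf-8")

# from_encoding tells BeautifulSoup which encoding to try first when
# converting the bytes to Unicode; without it the encoding is guessed.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

text = soup.p.string
print(text)                   # the decoded text of the <p> tag
print(isinstance(text, str))  # the parsed text is a Unicode string
print(soup.original_encoding) # the encoding BeautifulSoup settled on
```

If the input is already a str, no decoding is needed and original_encoding is None.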

2. Parsing the document

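HTML is a hierarchical language: <html><head><title>…</title></head><body>…</body></html>. BeautifulSoup therefore parses a document into a tree. Every node in the tree is a Python object, and there are four kinds of object: Tag, NavigableString, BeautifulSoup, and Comment.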
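
To make the four object types concrete, here is a minimal sketch that parses a tiny made-up document (the markup is an assumption chosen so that each type appears once) and prints the class of each node.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup: one tag containing a text node and a comment.
soup = BeautifulSoup("<p>hello<!-- note --></p>", "html.parser")

p = soup.p                  # a Tag
text, comment = p.contents  # a NavigableString and a Comment

for node in (soup, p, text, comment):
    print(type(node).__name__)
# BeautifulSoup
# Tag
# NavigableString
# Comment
```

Comment is a subclass of NavigableString, and BeautifulSoup is a subclass of Tag, which is why the whole document can be searched the same way as a single tag.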

2.1. Tag objects

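A Tag corresponds to a tag in the original document. For example, in the document <p class="tester">hello Tester</p>, the tag is p. The two most important properties of a Tag are its name and its attributes.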

2.1.1. The name attribute

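A Tag's name is available through its .name attribute. Changing a Tag's name changes the markup that BeautifulSoup generates for the whole document, as in this example (using the document above):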

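from bs4 import BeautifulSoup
soup = BeautifulSoup('<p class="tester">hello Tester</p>', "html.parser")
tag = soup.p  # the first <p> tag in the document
print(tag.name)
tag.name = "changeTag"  # rename the tag in place
print(tag.name)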

This code prints, in order:
p
changeTag
The whole tag is now: <changeTag class="tester">hello Tester</changeTag>

2.1.2. Attributes

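A Tag can have any number of attributes; the tag in the example above has a class attribute with the value "tester". Attributes are stored in a dictionary and are read with the same syntax as a dictionary. The whole dictionary is also available through .attrs, in the same way the name is available through .name. Because class is a multi-valued attribute in HTML, its value comes back as a list. Both approaches, in an interactive session:
>>> tag['class']
['tester']
>>> tag.attrs
{'class': ['tester']}
A tag's attributes can be modified, and they can also be deleted:
>>> del tag['class']
>>> tag
<changeTag>hello Tester</changeTag>
>>> print(tag.get('class'))
None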
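
The session above shows reading and deleting; the following sketch also adds and overwrites attributes. It starts again from the original <p> tag, and the attribute names and values (id, "greeting", "reviewer") are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="tester">hello Tester</p>', "html.parser")
tag = soup.p

tag["id"] = "greeting"     # add a new attribute (hypothetical value)
tag["class"] = "reviewer"  # overwrite an existing attribute
print(tag)

del tag["id"]              # remove the attribute again
print(tag.get("id"))       # .get() returns None instead of raising KeyError
print(tag.has_attr("id"))  # explicit membership check
```

Using tag.get() rather than tag[...] is the safer choice when an attribute may be missing, since indexing a missing attribute raises KeyError just as it would on a dictionary.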

2.1.3. Multi-valued attributes (HTML)

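Attributes that HTML defines as multi-valued, such as class, are returned as a list. Attributes that are not multi-valued, such as id, are returned as a single string, even if the value contains spaces.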
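
A short sketch of the difference, using made-up markup. The last part relies on the multi_valued_attributes=None constructor argument, which as far as I know is available in Beautiful Soup 4.8.0 and later; treat that part as an assumption if you are on an older version.

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout" id="my id">text</p>'  # sample markup

soup = BeautifulSoup(markup, "html.parser")
classes = soup.p["class"]
ident = soup.p["id"]
print(classes)  # ['body', 'strikeout']  (class is multi-valued)
print(ident)    # my id                  (id is not, so it stays one string)

# Turning off multi-valued handling keeps every attribute as a plain string.
flat = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
flat_classes = flat.p["class"]
print(flat_classes)  # body strikeout
```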

2.2. NavigableString

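BeautifulSoup uses the NavigableString class to represent the text inside a tag. Continuing with the same example:
>>> tag.string
'hello Tester'
To change the string inside a tag, use the replace_with() method. It returns the string that was replaced, so inspect the tag itself to see the result:
>>> tag.string.replace_with('welcome to BeautifulSoup')
'hello Tester'
>>> tag
<changeTag>welcome to BeautifulSoup</changeTag>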
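
The same steps as a standalone script, as a minimal sketch. It starts from the original <p> tag rather than the renamed one, and the variable names are my own. It also shows that a NavigableString is a str subclass and can be converted to a plain str when you want to keep the text without a reference to the parse tree.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup('<p class="tester">hello Tester</p>', "html.parser")
tag = soup.p

text = tag.string
print(isinstance(text, NavigableString), isinstance(text, str))  # True True

plain = str(text)  # plain Python string, detached from the tree

# replace_with() swaps the node in the tree and returns the old node.
old = text.replace_with("welcome to BeautifulSoup")
print(old)         # hello Tester
print(tag.string)  # welcome to BeautifulSoup
```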


Learning Web Scraping with Python (4): Tag and NavigableString in BeautifulSoup in Detail