Parsing and Extracting HTML/XML Data with BeautifulSoup4

1. Introduction to BeautifulSoup4

Like lxml, Beautiful Soup is an HTML/XML parser; its main job is likewise to parse and extract data from HTML/XML documents.

lxml traverses the document only locally, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and builds the entire DOM tree, so its time and memory overhead are much larger and its performance is lower than lxml's.

Beautiful Soup is easy to use for parsing HTML and has a very friendly API. It supports CSS selectors (http://www.w3school.com.cn/cssref/css_selectors.asp), the HTML parser in the Python standard library, and lxml's XML parser.

The table below gives a direct comparison:

Tool            Speed     Ease of use   Installation
Regex           Fastest   Hard          None (built-in)
BeautifulSoup   Slow      Easiest       Simple
lxml            Fast      Simple        Moderate

Installation is straightforward; just run the appropriate command:

    Python 3 on Linux: sudo pip3 install beautifulsoup4

    Python 2 with pip: sudo pip install beautifulsoup4

    Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

2. The Four Object Types

Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all objects fall into four types:

- Tag: simply put, one of the tags in the HTML
- NavigableString: the text inside a tag
- BeautifulSoup: an object representing the content of the whole document
- Comment: a special type of NavigableString whose output does not include the comment markers
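A minimal sketch showing all four types at once (the HTML string here is made up for illustration, and the built-in html.parser is used so the snippet runs even without lxml installed):

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

html = '<p class="title"><b>text</b><!-- a comment --></p>'
soup = BeautifulSoup(html, "html.parser")

print(type(soup))                    # <class 'bs4.BeautifulSoup'>
print(type(soup.p))                  # <class 'bs4.element.Tag'>
print(type(soup.p.b.string))         # <class 'bs4.element.NavigableString'>
comment = soup.p.b.next_sibling      # the node right after <b> is the comment
print(type(comment))                 # <class 'bs4.element.Comment'>
print(comment)                       # printed without the <!-- --> markers
```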


Since NavigableString is the one used most often in practice, only NavigableString is covered in detail here:

# Import BeautifulSoup4
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object
soup = BeautifulSoup(html, "lxml")  # omitting "lxml" often raises a warning or error
print(soup.p.string)        # The Dormouse's story
print(type(soup.p.string))  # <class 'bs4.element.NavigableString'>



3. Traversing the Document Tree

3.1. Direct children: the .contents and .children attributes

The .contents and .children attributes include only a tag's direct children.

3.1.1 .contents

A tag's .contents attribute returns the tag's children as a list:

print(soup.head.contents) #[<title>The Dormouse's story</title>]

Since the result is a list, an individual element can be accessed by index:

print(soup.head.contents[0])  # <title>The Dormouse's story</title>

# Create the Beautiful Soup object
sp = BeautifulSoup(html, "lxml")
for a in sp.find_all(name="a"):
    content = a.contents
    print(content)


3.1.2 .children

The .children attribute returns an iterator rather than a list, but all direct children can still be visited by looping over it.

Printing .children shows that it is a list iterator object:

print(soup.head.children)  # <list_iterator object at 0x7f71457f5710>
for child in soup.body.children:
    print(child)



# Create the Beautiful Soup object
sp = BeautifulSoup(html, "lxml")

for item in sp.find_all(name="p"):
    # Iterate over every p tag and print each one's direct children
    for ch in item.children:
        print(ch)



3.2. All descendants: the .descendants attribute

While .contents and .children include only direct children, the .descendants attribute recursively iterates over all of a tag's descendants. As with .children, it must be iterated to retrieve the contents.

for child in soup.descendants:
    print(child)

Output:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...


Example 2:

# Create the Beautiful Soup object
sp = BeautifulSoup(html, "lxml")
list_p = sp.find_all(name="p")
for p in list_p:
    print(p.name, ": all descendants ==============")
    for child in p.descendants:
        print(child)


3.3. Node content: the .string attribute

If a tag has only one NavigableString child, calling .string on the tag returns that child. If a tag has exactly one child tag, .string also works, and the result is the same as calling .string on that unique child.

In plain terms: if a tag contains no nested tags, .string returns the text inside it; if it contains exactly one nested tag, .string returns the text of the innermost one. For example:

print(soup.head.string)   # The Dormouse's story
print(soup.title.string)  # The Dormouse's story
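Conversely, if a tag contains more than one child, .string returns None. A small sketch of that edge case (the HTML string is made up for illustration, and html.parser is used so lxml is not required):

```python
from bs4 import BeautifulSoup

html = '<p>Hello <b>world</b></p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.p.string)      # None -- <p> has two children: "Hello " and <b>
print(soup.p.b.string)    # world -- <b> has a single NavigableString child
print(soup.p.get_text())  # Hello world -- joins all descendant strings
```

When a tag holds mixed content, .get_text() (or iterating .strings) is the usual way to pull out all the text at once.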

4. Searching the Document Tree: find_all

Signature: find_all(name, attrs, recursive, text, **kwargs)

4.1 The name parameter

The name parameter finds all tags with the given name; string objects (text nodes) are automatically ignored.

A. Passing a string

The simplest filter is a string. When a string is passed to a search method, Beautiful Soup matches tag names against it exactly. The following finds all <b> tags in the document:

print(soup.find_all('b'))
# [<b>The Dormouse's story</b>]
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

B. Passing a regular expression

If a regular expression is passed, Beautiful Soup matches tag names with the pattern's match() method. The following finds all tags whose names start with b, which means both <body> and <b> are found:

soup = BeautifulSoup(html, "lxml")
import re

# Match every tag whose name starts with b: the body tag and the b tag
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)  # body, b

C. Passing a list

If a list is passed, Beautiful Soup returns everything that matches any element of the list. The following finds all <a> and <b> tags in the document:

print(soup.find_all(["a", "b"]))
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

4.2 keyword arguments

Find the tag whose id is link2:

print(soup.find_all(id='link2'))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
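Any attribute can be searched this way, not just id. Because class is a reserved word in Python, Beautiful Soup accepts class_ instead; an attrs dict works as well. A sketch using a made-up snippet (html.parser so lxml is not required):

```python
from bs4 import BeautifulSoup

html = ('<a class="sister" id="link1">Elsie</a>'
        '<a class="sister" id="link2">Lacie</a>')
soup = BeautifulSoup(html, "html.parser")

# class_ avoids the clash with Python's class keyword
print(soup.find_all(class_="sister"))        # both <a> tags
# equivalently, pass an attrs dict
print(soup.find_all(attrs={"id": "link2"}))  # just the second tag
```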

4.3 The text parameter

The text parameter searches the document's string content. Like name, it accepts a string, a regular expression, or a list.

import re
print(soup.find_all(text="Elsie"))
# [] -- "Elsie" appears only inside an HTML comment, so no plain string matches

print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))
# ['Lacie', 'Tillie']

# Find strings containing "Dormouse"
print(soup.find_all(text=re.compile("Dormouse")))
# ["The Dormouse's story", "The Dormouse's story"]

4.4 href

Get all the links:

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the Beautiful Soup object
soup = BeautifulSoup(html, "lxml")

links = soup.find_all(href=re.compile(r'http://example.com/'))
# Print every matching link's href
for link in links:
    print(link["href"])
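The CSS-selector support mentioned at the start is exposed through the select() method, which complements find_all. A short sketch using a simplified version of the sample document (html.parser is used so lxml is not required):

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" id="link1"
   href="http://example.com/elsie">Elsie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("p.title"))    # tag + class selector
print(soup.select("#link1"))     # id selector
print(soup.select('a[href^="http://example.com"]'))  # attribute prefix selector
```

select() always returns a list of Tag objects, so the same .string / .get_text() accessors apply to each result.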



Reposted from blog.csdn.net/weixin_42255200/article/details/80946179