Parsing Libraries: Beautiful Soup (Part 1)

Original content; if you repost, please credit the author's link: Blessy_Zhu https://blog.csdn.net/weixin_42555080
Environment used for the code in this post:
Platform: Windows
Python version: Python 3.x
IDE: PyCharm

1 Overview

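Regular expressions can extract information from a page (see the related post: Python Tips: Regular Expressions and the re Library (Part 1)), but for beginners in web scraping they are tedious to write, and a single mistake in the pattern makes the whole match fail. So it is worth looking for other ways to extract information from web pages. Anyone familiar with HTML knows that nodes carry attributes such as id and class, and that the nodes form a tree. We can therefore locate nodes with XPath or CSS selectors and then call the appropriate methods to extract their text or attributes. The main parsing libraries available in Python are lxml, Beautiful Soup, and pyquery, plus, of course, re.
How do these four compare? If you have a solid front-end background, pyquery is the most convenient. Beautiful Soup is strongly recommended for scraping beginners. re is fast, but writing regular expressions is laborious. lxml is also relatively fast and is a good choice. The rest of this post covers Beautiful Soup in detail.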
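
To make the trade-off concrete, here is a minimal sketch that extracts the same title text once with re and once with Beautiful Soup. The HTML snippet and the variable names are made up for illustration, and html.parser is used only so that the sketch runs without installing lxml.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this comparison.
html = '<html><head><title>Demo page</title></head><body><p id="intro">Hi</p></body></html>'

# Regex approach: the pattern has to mirror the markup exactly.
match = re.search(r"<title>(.*?)</title>", html, re.S)
title_by_regex = match.group(1) if match else None

# Beautiful Soup approach: navigate the parsed tree by node name.
soup = BeautifulSoup(html, "html.parser")
title_by_soup = soup.title.string

print(title_by_regex)
print(title_by_soup)
```

Both approaches return the same text here; the difference shows up when the markup changes slightly, because the regex pattern has to be rewritten while the tree navigation usually does not.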

2 Introduction to Beautiful Soup

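Here is the official description:

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to scrape. Because it is simple, it does not take much code to write a complete application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not have to think about encodings unless the document does not specify one, in which case you only need to state the original encoding.
Beautiful Soup has become a Python parser as capable as lxml and html5lib, giving users the flexibility to choose between different parsing strategies or raw speed.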
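
The encoding behavior described above can be sketched as follows. The GBK-encoded input is a made-up example of a document that does not declare its encoding; from_encoding and original_encoding are part of the Beautiful Soup API, while the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK bytes with no charset declaration in the markup.
raw = "<p>你好</p>".encode("gbk")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                # Unicode string stored in the tree
utf8_bytes = soup.encode("utf-8")   # serialized document as UTF-8 bytes

print(text)
print(soup.original_encoding)       # the encoding used to decode the input
print(utf8_bytes)
```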

In practice, Beautiful Soup relies on a parser to do the parsing. The supported parsers are listed in Table 1:

Table 1: Parsers supported by Beautiful Soup

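Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires a C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Best tolerance of malformed documents; parses the document the way a browser does; produces HTML5-format documents	Slow; requires an external Python dependency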

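Besides the HTML parser in the Python standard library, Beautiful Soup supports several third-party parsers. In practice lxml is the one most commonly used. To use it, pass 'lxml' as the second argument when initializing BeautifulSoup: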

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>HTML to be parsed</p>', 'lxml')  # replace the first argument with your own HTML
print(soup.p.string)
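
Since lxml is a third-party package, a script may want to fall back to the built-in parser when it is not installed. The sketch below is one possible way to do that; make_soup is a hypothetical helper name, not part of Beautiful Soup, while FeatureNotFound is the exception Beautiful Soup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, else with html.parser (hypothetical helper)."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser from the standard library.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Keep in mind that different parsers can build slightly different trees from malformed markup, so the fallback is only safe when the code does not depend on parser-specific repairs.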

3 Basic Usage of Beautiful Soup

3.1 The prettify() method and the string attribute

html ="""
<html><head><title> The Dormouse's story</title></head>
<body>
<p class= "title" name= "dromouse"><b>The Dormouse's story</b></p>
<p class= "story">once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3" >Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
print(soup.title.string)

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
 The Dormouse's story

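We start with a hand-written piece of HTML that is deliberately malformed: neither the body nor the html node is closed. We pass it as the first argument to BeautifulSoup, with the lxml parser specified as the second argument. This completes the initialization of the BeautifulSoup object, which we assign to the variable soup. From there we can call the methods and attributes of soup to parse the HTML.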

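prettify(): this method outputs the parsed string with standard indentation. Note that the output contains the closing body and html tags; in other words, BeautifulSoup automatically repairs non-standard HTML. The repair is not done by prettify(); it already happened when the BeautifulSoup object was initialized.
soup.title.string: this outputs the text content of the title node. soup.title selects the title node in the HTML, and its string attribute returns the text inside it, so a few simple attribute accesses are enough to extract text.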
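
The claim that the repair happens at initialization, not in prettify(), can be checked with a small sketch. The truncated markup is a made-up example, and html.parser is used here only to avoid the lxml dependency (lxml may repair the same input slightly differently).

```python
from bs4 import BeautifulSoup

# Hypothetical malformed input: p, body and html are never closed.
broken = "<html><body><p>Hello"

soup = BeautifulSoup(broken, "html.parser")

# Serialize the tree without calling prettify().
repaired = str(soup)
print(repaired)

# prettify() only changes the layout of the already repaired tree.
print(soup.prettify())
```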

3.2 Node selectors

3.2.1 Selecting elements

......
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

Output:

<title>
   The Dormouse's story
  </title>
<class 'bs4.element.Tag'>

   The Dormouse's story
  
<head>
<title>
   The Dormouse's story
  </title>
</head>
<p class="title" name="dromouse">
<b>
    The Dormouse's story
   </b>
</p>

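First we print the result of selecting the title node; the output is the title node together with its text content.
Next we print its type, which is bs4.element.Tag, an important data structure in BeautifulSoup. Every result returned by this kind of selection is a Tag.
A Tag has several attributes, such as string, which returns the text content of the node, so the next line of output is exactly that text.
Then we select the head node; again the result is the node plus everything inside it.
Finally we select the p node. This case is different: the result is only the first p node, and the later p nodes are not selected. In other words, when several nodes match, this style of selection returns only the first match and ignores the rest.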
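
The "first match only" behavior can be seen in a short sketch. The three-paragraph snippet is invented for illustration. find_all() is a Beautiful Soup method that is not covered in this part; it is used here only to count how many p nodes exist.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three p nodes.
html = '<p class="title">first</p><p class="story">second</p><p class="story">third</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.p               # attribute-style selection: first match only
all_p = soup.find_all("p")   # every p node in the document

print(first)
print(len(all_p))
```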

3.2.2 Extracting information

......
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)
print(soup.p.attrs)
print(soup.p.attrs['name'])
print(soup.p['name'])
print(soup.p['class'])

Output:

title
{'class': ['title'], 'name': 'dromouse'}
dromouse
dromouse
['title']

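First, extracting information: the name attribute gives the name of the title node. Note that this name is the tag name written inside the angle brackets; for example, for a <p> node the name attribute is p.
Next, getting attributes: a node may have several attributes, such as id and class. After selecting a node, call attrs to get all of them. soup.p.attrs returns all attributes of the p node; because there are several p nodes, the first one is used. As the output shows, attrs returns a dictionary that maps each attribute name of the selected node to its value.
To get the name attribute, look up the key in that dictionary with square brackets and the attribute name: attrs['name'].
Finally, there is a shorter form: skip attrs and put the square brackets directly after the node, as in p['name']. Note that some results are strings and others are lists of strings. The name attribute has a single value, so the result is one string. A node can have several classes, so class is returned as a list. In real code, check the type before using the value.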
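
One way to handle the string-or-list distinction is to normalize every attribute value to a list. The sketch below assumes a made-up p node; attr_as_list is a hypothetical helper, not a Beautiful Soup function. It relies on Tag.get(), which returns None for a missing attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical node with a multi-valued class and a single-valued name.
soup = BeautifulSoup('<p class="title main" name="dromouse">text</p>', "html.parser")
p = soup.p


def attr_as_list(tag, key):
    """Return the attribute value as a list of strings (hypothetical helper)."""
    value = tag.get(key)          # None when the attribute is missing
    if value is None:
        return []
    if isinstance(value, list):   # multi-valued attributes such as class
        return value
    return [value]                # single-valued attributes come back as str


classes = attr_as_list(p, "class")
names = attr_as_list(p, "name")
ids = attr_as_list(p, "id")

print(classes, names, ids)
```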

3.2.3 Related selection

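Sometimes the target node cannot be selected in one step. In that case, first select some node and then use it as a base to select its children, parent, siblings, and so on.
(1) Child and descendant nodes
1) To get the direct children, use the contents attribute, as in this example: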

html ="""
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="story">
   0nce upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)

Output:

['\n   0nce upon a time there were three little sisters;and their names were\n   ', <a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>, '\n   ,\n   ', <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>, '\n   and\n   ', <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>, '\n   ;\nand they lived at the bottom of a well.\n  ']

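As you can see, the result is a list. The p node contains both text and nodes, and all of them are returned together in one list.
Note that every element of the list is a direct child of the p node. For example, the first a node contains a nested node of its own (the comment <!-- Elsie --> in this sample), which is a grandchild of p, but the result does not list that node separately. So the contents attribute returns a list of direct children only.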
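
To see what kinds of objects end up in contents, the following sketch prints the type of each direct child. The compact markup (with a span nested inside the first a node) is made up for illustration and is parsed with html.parser to keep the sketch dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the span is a grandchild of p, not a direct child.
html = '<p>Once <a id="link1"><span>Elsie</span></a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Text becomes NavigableString objects, elements become Tag objects.
kinds = [type(node).__name__ for node in soup.p.contents]
print(kinds)

# The nested span does not appear among the direct children of p.
span = soup.span
span_is_direct_child = any(node is span for node in soup.p.contents)
print(span_is_direct_child)
```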
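2) You can also use the children attribute. It returns an iterator rather than a list (a list_iterator in the output below), so we loop over it with for to print the contents.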

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for child in enumerate(soup.p.children):
    print(child)

Output:

<list_iterator object at 0x0000000002E66EF0>
(0, '\n   0nce upon a time there were three little sisters;and their names were\n   ')
(1, <a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>)
(2, '\n   ,\n   ')
(3, <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>)
(4, '\n   and\n   ')
(5, <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>)
(6, '\n   ;\nand they lived at the bottom of a well.\n  ')
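
Because children mixes text and element nodes, a common follow-up step is to keep only the Tag children. This is a minimal sketch on made-up markup; it also shows that the iterator can be consumed only once.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with three links separated by text.
html = '<p>Once <a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

children = soup.p.children

# Keep only element nodes and read their id attribute.
link_ids = [child["id"] for child in children if isinstance(child, Tag)]
print(link_ids)

# The iterator is now exhausted; access soup.p.children again to restart.
leftover = list(children)
print(leftover)
```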

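3) To get all descendants, use the descendants attribute. It also returns a generator. Iterating over it shows that the output now includes the nodes nested inside the a nodes (for example the comment Elsie and the text Lacie): descendants recursively walks all children and yields every descendant.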

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for child in enumerate(soup.p.descendants):
    print(child)

Output:

(0, '\n   0nce upon a time there were three little sisters;and their names were\n   ')
(1, <a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>)
(2, '\n')
(3, ' Elsie ')
(4, '\n')
(5, '\n   ,\n   ')
(6, <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>)
(7, '\n    Lacie\n   ')
(8, '\n   and\n   ')
(9, <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>)
(10, '\n    Tillie\n   ')
(11, '\n   ;\nand they lived at the bottom of a well.\n  ')
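
The difference between children and descendants can be summarized by counting both on the same node. The markup below is invented for this sketch (it nests a span inside the first a node) and is parsed with html.parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: p has 4 direct children and 7 descendants.
html = '<p>Once <a id="link1"><span>Elsie</span></a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

child_count = len(list(soup.p.children))
descendant_count = len(list(soup.p.descendants))

# Element names among the descendants, in document order.
descendant_tags = [node.name for node in soup.p.descendants if isinstance(node, Tag)]

print(child_count, descendant_count)
print(descendant_tags)
```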

(2) Parent and ancestor nodes

......
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)
for child in enumerate(soup.a.parents):
    print(child)

Output:

<p class="story">
   0nce upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>
</p>
..........

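From the HTML we know that the parent of the first a node is the p node, so the first output is the p node with all of its content. The second statement gets the ancestors: parents returns a generator, and enumerating it outputs every ancestor node in turn, each with its content.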
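
Printing whole ancestor nodes produces long output, so a compact way to inspect the chain is to print only the node names. This sketch uses a small made-up document; the last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Hypothetical, fully closed document.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# parent is the direct parent; parents walks up to the document root.
parent_name = soup.a.parent.name
ancestor_names = [ancestor.name for ancestor in soup.a.parents]

print(parent_name)
print(ancestor_names)
```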
(3) Sibling nodes
To get nodes at the same level, use the next_sibling, next_siblings, previous_sibling, and previous_siblings attributes.

html ="""
<html>
 <body>
  <p class="story">
   0nce upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
   </a>
        Hello
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
and they lived at the bottom of a well.
  </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print("Next sibling",soup.a.next_sibling)
print("Previous sibling",soup.a.previous_sibling)
print("Next siblings",list(enumerate(soup.a.next_siblings)))
print("Previous siblings",list(enumerate(soup.a.previous_siblings)))

Output:

Next sibling 
        Hello
   
Previous sibling 
   0nce upon a time there were three little sisters;and their names were
   
Next siblings [(0, '\n        Hello\n   '), (1, <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>), (2, '\n   and\n   '), (3, <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>), (4, '\nand they lived at the bottom of a well.\n  ')]
Previous siblings [(0, '\n   0nce upon a time there were three little sisters;and their names were\n   ')]
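
As the output shows, the next sibling of an a node is often a text node rather than the next a node. A minimal sketch of skipping text siblings follows; next_tag_sibling is a hypothetical helper written for this example, and the markup is made up.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: text nodes sit between the a nodes.
html = '<p><a id="link1">Elsie</a> Hello <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")


def next_tag_sibling(node):
    """Return the first following sibling that is a Tag, or None (hypothetical helper)."""
    for sibling in node.next_siblings:
        if isinstance(sibling, Tag):
            return sibling
    return None


first = soup.a
text_sibling = first.next_sibling      # the text node right after the first link
tag_sibling = next_tag_sibling(first)  # the next element node

print(repr(text_sibling))
print(tag_sibling)
```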

4 Summary

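This post introduced the Beautiful Soup library: its parsers and basic usage, with a focus on node selectors (selecting elements, extracting information, and related selection). References: 崔庆才,《Python3 网络爬虫开发实战》(main reference); 夏敏捷,《Python程序设计-从基础到开发》; [Norway] Magnus Lie Hetland,《Python基础教程第3版 Python编程从入门到实践》. My thanks to these authors. That is all for this post. Corrections and criticism are welcome, and please feel free to comment and discuss.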