Class 14 - 2 Parsing Libraries -- Beautiful Soup

Beautiful Soup is a Python library for parsing HTML and XML. It parses web pages by relying on their structure and attributes.

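Parsers
- Beautiful Soup relies on a parser to do its work. Besides the HTML parser in the Python standard library, it supports several third-party parsers (such as lxml).
- Among these, the lxml parser can parse both HTML and XML, is fast, and is tolerant of malformed markup, so it is the recommended choice.
- To use lxml, pass 'lxml' as the second argument when initializing Beautiful Soup. Example: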
```
from bs4 import BeautifulSoup
Soup = BeautifulSoup('<p>Hello</p>','lxml')
print(Soup.p.string)
```
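If lxml may not be installed everywhere the code runs, the parser choice can be wrapped in a small helper. The following is only a sketch: the helper name make_soup is hypothetical, and it assumes that Beautiful Soup raises FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use html.parser.

    make_soup is a hypothetical helper name used only for this sketch.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # The standard-library parser needs no extra installation.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)  # Hello
```

The rest of the code does not need to know which parser was picked, although the two parsers can repair badly broken markup in slightly different ways.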

Basic usage

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class='title' name='dromouse'><b>The Dormouse's story</b></p>
<p class='story'>Once upon a time there were three little sisters; and their names were
<a href='http://example.com/elsie' class='sister' id='link1'><!--Elsie--></a>,
<a href='http://example.com/lacie' class='sister' id='link2'>Lacie</a> and
<a href='http://example.com/tillie' class='sister' id='link3'>Tillie</a>;
and they lived at the bottom of a well.</p>
<p class='story'>...</p>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.prettify())
print(soup.title.string)

Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!--Elsie-->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

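First we declare the variable html, which holds an HTML string. Note that it is not a complete HTML document, because the body and html nodes are never closed. We pass it as the first argument to BeautifulSoup; the second argument is the parser type (lxml here). This completes the initialization of the BeautifulSoup object, which we assign to the variable soup.
Next, we call methods and attributes of soup to parse the HTML.
First, the prettify() method outputs the parsed string with standard indentation. Note that the output contains the closing body and html tags: Beautiful Soup automatically corrects non-standard HTML. This correction is not done by prettify(); it already happens when the BeautifulSoup object is initialized.
Then we call soup.title.string, which outputs the text content of the title node. soup.title selects the title node in the HTML, and its string attribute returns the text inside it, so text can be extracted by chaining a few attribute accesses.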
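To check that the markup is repaired at initialization rather than by prettify(), the soup can be serialized with str() instead. This is a minimal sketch with a made-up, deliberately truncated document; it uses the standard-library html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Deliberately incomplete markup: <p>, <body> and <html> are never closed.
broken = "<html><head><title>Demo</title></head><body><p>Hi"
soup = BeautifulSoup(broken, "html.parser")

# Serialize the tree without calling prettify().
repaired = str(soup)
print(repaired)

# The title text is available right after initialization as well.
print(soup.title.string)  # Demo
```

The serialized string already ends with the closing p, body and html tags, even though prettify() was never called.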

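Node selectors
- Accessing a node by its name selects that node element, and its string attribute then returns the text inside the node. This approach works well when the structure around a single node is very clear.
- Selecting elements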
  - ```
  html = '''
  --snip--
  '''
  from bs4 import BeautifulSoup
  soup=BeautifulSoup(html,'lxml')
  print(soup.title)
  print(type(soup.title))
  print(soup.title.string)
  print(soup.head)
  print(soup.p)
```
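  Output:
  <title>The Dormouse's story</title>
  <class 'bs4.element.Tag'>
  The Dormouse's story
  <head><title>The Dormouse's story</title></head>
  <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
  - The first line is the result of selecting the title node: the node itself plus the text inside it.
  - The second line is its type, bs4.element.Tag, an important data structure in Beautiful Soup. Everything returned by a node selector is of this Tag type. A Tag has several attributes; its string attribute returns the text content of the node, which is the third line of output. Selecting the head node likewise returns the node plus everything inside it.
  - Finally we select the p node. This case is special: the result is the first p node only, and the later p nodes are not selected. When several nodes match, this way of selecting returns only the first match and ignores the rest.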
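- Extracting information
  1. Getting the name
    - The name attribute returns the name of a node. Select the title node, then read its name attribute:
      print(soup.title.name)
      Output: title
  2. Getting attributes
    - A node may have several attributes, such as id and class. After selecting the node element, attrs returns all of them:
      print(soup.p.attrs)
      print(soup.p.attrs['name'])
      Output:
      {'class': ['title'], 'name': 'dromouse'}
      dromouse
      attrs returns a dictionary that maps every attribute of the selected node to its value. Getting the name attribute is therefore an ordinary dictionary lookup: attrs['name'].
    - You can also skip attrs and put square brackets with the attribute name directly after the node element. Example:
      print(soup.p['name'])
      print(soup.p['class'])
      Output:
      dromouse
      ['title']
      Note: some results are strings and others are lists of strings. The name attribute holds a single value, so a single string is returned. A node element can have several classes, so class is returned as a list.
  3. Getting the content
    - The string attribute returns the text contained in a node element, for example the text of the first p node:
      print(soup.p.string)
      Output: The Dormouse's story
      Note: the p node selected here is the first p node, so the text is also that of the first p node.
- Nested selection
  - In the examples above every result is of type bs4.element.Tag, so you can keep selecting from it. For example, after getting the head node element, you can go on to select the title node element inside it: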
```
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.head.title.string)
print(type(soup.head.title))
# Output:
# The Dormouse's story
# <class 'bs4.element.Tag'>
```
    The first line of output is the text of the title node, selected by accessing head and then title. The second line shows that its type is still bs4.element.Tag. Selecting from a Tag always yields another Tag, which is what makes nested selection possible.
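The extraction steps above (name, attributes, content, nested selection) can be tried together in one short script. This is a sketch on a shortened, made-up variant of the sample HTML; the second class value main and the use of html.parser are assumptions made for illustration, and Tag.get() is shown as a dictionary-style lookup that returns None for a missing attribute.

```python
from bs4 import BeautifulSoup

html = """
<p class="title main" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time <a href="http://example.com/elsie" id="link1">Elsie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                  # only the first <p> is selected
print(first_p.name)               # p
print(first_p["name"])            # dromouse (the HTML attribute called "name")
print(first_p["class"])           # ['title', 'main'] (class is multi-valued)
print(first_p.get("id"))          # None, because the attribute is missing
print(first_p.b.string)           # The Dormouse's story (nested selection)

# string is None when a node has more than one child.
second_p = soup.find_all("p")[1]
print(second_p.string)            # None
```

Note the difference between first_p.name, which is the tag name, and first_p["name"], which is the value of the HTML attribute named name.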
- Related-node selection

Child and descendant nodes

After selecting a node element, use the contents attribute to get its direct children. Example:

html = '''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class='story'>
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.p.contents)
Output:

['\n    Once upon a time there were three little sisters; and their names were\n    ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\nand\n', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\nand they lived at the bottom of a well.\n']

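The result is a list. The p node contains both text and nodes, and they are all returned together in one list.

Note: every element of the list is a direct child of the p node. The first a node contains a span node, which is a grandchild of p, but the result does not list the span node separately. So contents returns only the list of direct children.

The children attribute gives the corresponding result: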

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.p.children)
# enumerate() pairs each item of an iterable (such as a list, tuple or string)
# with its index; it is typically used in a for loop.
for i,child in enumerate(soup.p.children):
    print(i,child)
Output:

<list_iterator object at 0x000002192389F630>
0 
    Once upon a time there were three little sisters; and their names were
    
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 
and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
and they lived at the bottom of a well.

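Here the children attribute is used for the selection. It returns an iterator rather than a list (a list_iterator in the output above), so a for loop is used to print the items.

To get all descendant nodes, use the descendants attribute: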

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)
Output:

<generator object descendants at 0x0000022B542F9410>
0 
    Once upon a time there were three little sisters; and their names were
    
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 
and

10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
and they lived at the bottom of a well.

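The result is a generator. Iterating over it shows that the output now includes the span node: descendants recursively visits all child nodes and yields every descendant.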
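The difference between contents, children and descendants is easier to see on a smaller document. The sketch below uses a made-up one-line HTML string and html.parser (both assumptions for illustration) and keeps only Tag objects when listing names, since text nodes are returned as well.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>Intro <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# contents: a list of the direct children (text nodes and tags).
direct = p.contents
print(len(direct))  # 4 -> 'Intro ', <a>, ' and ', <a>

# children: the same direct children, but as an iterator.
direct_tags = [child.name for child in p.children if isinstance(child, Tag)]
print(direct_tags)  # ['a', 'a']

# descendants: every node below p, in document order.
all_tags = [node.name for node in p.descendants if isinstance(node, Tag)]
print(all_tags)     # ['a', 'span', 'a']
```

The span node shows up only in the descendants result, because it is a grandchild of p rather than a direct child.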

Parent and ancestor nodes

To get the parent of a node element, use the parent attribute:

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.a.parent)
Output:

<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>

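Here we select the parent of the first a node. Its parent is the p node, so the output is the p node and everything inside it.

Note: this outputs only the direct parent of the a node and does not look further outward for ancestors. To get all ancestor nodes, use the parents attribute: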

html = '''
<html>
<body>
<p class='story'>
    <a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
</p>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.a.parents)
print(list(enumerate(soup.a.parents)))
Output:

<generator object parents at 0x0000011701B69410>
[(0, <p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body>), (2, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>), (3, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>)]

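The result is a generator. Here it is converted into a list of index and content pairs, and the elements of the list are the ancestors of the a node. The last entry is the BeautifulSoup object itself, which is why the html markup appears twice.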
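Printing whole ancestor nodes is verbose, so listing only their names makes the chain easier to read. A minimal sketch, assuming a compact made-up document and html.parser:

```python
from bs4 import BeautifulSoup

html = (
    '<html><body><p class="story">'
    '<a id="link1"><span>Elsie</span></a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# parent: only the direct parent.
print(soup.a.parent.name)  # p

# parents: every ancestor, from the closest one outward.
ancestor_names = [ancestor.name for ancestor in soup.a.parents]
print(ancestor_names)      # ['p', 'body', 'html', '[document]']
```

The final name, [document], belongs to the BeautifulSoup object that wraps the whole tree.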

Sibling nodes

html = '''
<html>
<body>
<p class="story">
         Once upon a time there were three sisters;and their names were    
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
         Hello
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Sibling', list(enumerate(soup.a.next_siblings)))
print('Prev Sibling', list(enumerate(soup.a.previous_siblings)))
Output:

Next Sibling 
         Hello

Prev Sibling 
         Once upon a time there were three sisters;and their names were    

Next Sibling [(0, '\n         Hello\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '\n    and\n'), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n    and they lived at the bottom of a well.\n')]
Prev Sibling [(0, '\n         Once upon a time there were three sisters;and their names were    \n')]

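Four attributes are used here. next_sibling and previous_sibling return the next and the previous sibling of the node, while next_siblings and previous_siblings return generators over all following and all preceding siblings respectively.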
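Because text between tags counts as a sibling too, the next sibling of a tag is often a string rather than another tag. The sketch below (made-up HTML, html.parser assumed) shows two ways to skip the text nodes: filtering with isinstance, or calling find_next_sibling() with a tag name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p>Once <a id="link1">Elsie</a> Hello '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate next sibling is the text between the two <a> tags.
print(repr(first.next_sibling))  # ' Hello '

# Keep only the siblings that are tags.
following_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
print(following_ids)             # ['link2', 'link3']

# find_next_sibling() searches the following siblings for a matching tag.
print(first.find_next_sibling("a")["id"])  # link2
```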

Extracting information

html = '''
<html>
<body>
<p class="story">
         Once upon a time there were three sisters;and their names were    
<a href="http://example.com/elsie" class="sister" id="link1">BOb</a><a href="http://example.com/lacie"
class="sister" id="link2">Lacie</a>
</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling:')
print(type(soup.a.previous_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print('Parent:')
print(type(soup.a.parents))
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])
Output:

Next Sibling:
<class 'bs4.element.NavigableString'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
Parent:
<class 'generator'>
<p class="story">
         Once upon a time there were three sisters;and their names were    
<a class="sister" href="http://example.com/elsie" id="link1">BOb</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
['story']

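If the result is a single node, you can read attributes such as string and attrs on it directly to get its text and attributes. If the result is a generator of several nodes, convert it to a list, pick an element, and then read string, attrs and so on from that node.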

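Method selectors

The selection methods described so far work through attributes. They are very fast, but they become cumbersome and inflexible for more complex selections.

find_all()

find_all() queries all elements that match the given conditions: pass it attributes or text and it returns the matching elements. Its API is: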
```
find_all(name, attrs, recursive, text, limit, **kwargs)
```
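Before going through the individual arguments, here is a small sketch of how name, recursive and limit behave. The HTML is made up for illustration and html.parser is used so that the snippet runs without lxml; passing True as the name matches every tag.

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
  <ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <p>Baz</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.div

# name: all <li> nodes anywhere in the document.
all_li = soup.find_all("li")
print(len(all_li))  # 2

# limit: stop after the first match.
limited = soup.find_all("li", limit=1)
print(len(limited))  # 1

# recursive=False: look at direct children only; <li> is a grandchild here.
direct_li = outer.find_all("li", recursive=False)
print(direct_li)  # []

# True matches every tag, so this lists the direct child tags of <div>.
direct_names = [tag.name for tag in outer.find_all(True, recursive=False)]
print(direct_names)  # ['ul', 'p']

# A list of names matches any of them.
either = [tag.name for tag in soup.find_all(["ul", "p"])]
print(either)  # ['ul', 'p']
```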

name

You can query elements by node name. Example:

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li cass= "element">Bar</li>
<li cass= "element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class= "element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))
Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li cass="element">Bar</li>
<li cass="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

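We call find_all() with the name argument set to ul, that is, we query all ul nodes. The result is a list of length 2, and each element is still of type bs4.element.Tag.

Because they are all Tag objects, nested queries still work. Using the same text, after finding all ul nodes we go on to query the li nodes inside each one: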

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
Output:
[<li class="element">Foo</li>, <li cass="element">Bar</li>, <li cass="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

The result is a list, and each element of the list is still of type Tag.

Next, we can iterate over each li and get its text:

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
Output:

[<li class="element">Foo</li>, <li cass="element">Bar</li>, <li cass="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

attrs

Besides querying by node name, we can also query by attributes. Example:

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li cass= "element">Bar</li>
<li cass= "element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class= "element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))
Output:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li cass="element">Bar</li>
<li cass="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li cass="element">Bar</li>
<li cass="element">Jay</li>
</ul>]

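The query takes the attrs argument, which is a dictionary. To find the node whose id is list-1, pass attrs={'id': 'list-1'} as the condition. The result is a list containing all nodes whose id is list-1. In this example exactly one element matches, so the result is a list of length 1.

For common attributes such as id and class, attrs is not needed. To find the node whose id is list-1, pass id as a keyword argument directly. The same query written this way: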

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li cass= "element">Bar</li>
<li cass= "element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class= "element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_ ='element'))
Output:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li cass="element">Bar</li>
<li cass="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

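Passing id='list-1' directly finds the node element whose id is list-1. Because class is a keyword in Python, an underscore is appended, giving class_='element'. The result is still a list of Tag objects. Only three li nodes are returned here because two li nodes in the sample HTML spell the attribute cass instead of class.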
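One detail worth checking is how class_ behaves when a node has several classes. The sketch below uses a shortened, made-up version of the sample HTML with html.parser; it compares the attrs form with the keyword form and shows that class_ matches a node if any one of its classes equals the given value.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# The attrs dictionary and the keyword argument select the same node.
by_attrs = [ul["id"] for ul in soup.find_all(attrs={"id": "list-2"})]
by_kwarg = [ul["id"] for ul in soup.find_all(id="list-2")]
print(by_attrs, by_kwarg)  # ['list-2'] ['list-2']

# class_ matches when any single class of the node equals the value.
with_list = [ul["id"] for ul in soup.find_all(class_="list")]
print(with_list)           # ['list-1', 'list-2']

# A node name and an attribute filter can be combined.
small = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]
print(small)               # ['list-2']
```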

text
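- The text argument matches the text of nodes. It accepts either a string or a regular expression object (newer Beautiful Soup releases name this argument string and keep text as an older alias). Example: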
```
import re
html = '''
<div class="panel">
<div class="panel-body">
<a>Hello,this is a link</a>
<a>Hello,this is a link,too</a>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.find_all(text=re.compile('link')))
```
Output:
```
['Hello,this is a link', 'Hello,this is a link,too']
```
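  There are two a nodes here, each containing text. Passing the text argument to find_all() as a regular expression object returns a list of all node texts that match the regular expression.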
- find()
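find() takes the same kinds of arguments as find_all(), but it returns only the first matching element instead of a list, and None when nothing matches. A minimal sketch on made-up HTML, using html.parser and the string argument name (an assumption that a reasonably recent Beautiful Soup 4 release is installed):

```python
import re

from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<a>Hello, this is a link</a><a>Hello, this is a link, too</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Only the first matching node is returned, as a single Tag.
first_li = soup.find("li")
print(first_li.string)   # Foo

# No match gives None rather than an empty list.
missing = soup.find("table")
print(missing)           # None

# Attribute filters work as they do in find_all().
print(soup.find(id="list-1").name)  # ul

# Combined with a node name, a text filter returns the tag, not the string.
link = soup.find("a", string=re.compile("too"))
print(link.string)       # Hello, this is a link, too
```

Since find() can return None, it is worth checking the result before reading attributes from it.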

CSS selectors

To use CSS selectors, call the select() method and pass in the CSS selector. Example:

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li cass= "element">Bar</li>
<li cass= "element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class= "element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
Output:

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li cass="element">Bar</li>, <li cass="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>

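CSS selectors are used here, and each result is a list of the nodes that match the selector. For example, select('ul li') selects all li nodes under all ul nodes, and the result is a list of all those li nodes. The elements of the list are still of type Tag.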
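A few more selector forms can be sketched on a smaller document. The HTML below is made up for illustration and parsed with html.parser; select_one() is used as the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
<ul class="list" id="list-1"><li class="element">Foo</li><li>Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Baz</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: every <li> under any <ul>.
all_items = [li.string for li in soup.select("ul li")]
print(all_items)      # ['Foo', 'Bar', 'Baz']

# An id selector combined with a class selector.
first_list = [li.string for li in soup.select("#list-1 .element")]
print(first_list)     # ['Foo']

# Child selector plus class: direct <li class="element"> children of <ul>.
marked = [li.string for li in soup.select("ul > li.element")]
print(marked)         # ['Foo', 'Baz']

# select_one() returns the first match as a single Tag.
print(soup.select_one("ul.list-small")["id"])  # list-2

# A selector that matches nothing gives an empty list.
print(soup.select(".missing"))  # []
```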

Nested selection

select() also supports nested selection. For example, first select all ul nodes, then iterate over each ul node and select its li nodes:

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
Output:

[<li class="element">Foo</li>, <li cass="element">Bar</li>, <li cass="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

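This prints, for each ul node, the list of all li nodes under it.

Getting attributes
- The nodes are Tag objects, so attributes can be read the same way as before. Using the same HTML text, get the id attribute of each ul node: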
```
from bs4 import BeautifulSoup
soup=BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
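```
  Output (one value per line): list-1, list-1, list-2, list-2
  Both approaches work: indexing the node directly with square brackets and the attribute name, or reading the value through the attrs attribute.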

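Getting text

Besides the string attribute, text can also be obtained with the get_text() method. For nodes that contain only text, as here, the two give the same result. Example: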

from bs4 import BeautifulSoup
soup=BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print('Get Text:',li.get_text())
    print('string:',li.string)
Output:

Get Text: Foo
string: Foo
Get Text: Bar
string: Bar
Get Text: Jay
string: Jay
Get Text: Foo
string: Foo
Get Text: Bar
string: Bar
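The two give the same result above only because each li node contains a single piece of text. The sketch below, on a made-up node with a nested tag and parsed with html.parser, shows where they differ: string becomes None, while get_text() joins all the text inside the node.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<li>Foo <b>Bar</b></li>", "html.parser")
li = soup.li

# The node has two children (a text node and <b>), so string is None.
print(li.string)      # None

# get_text() concatenates every piece of text below the node.
print(li.get_text())  # Foo Bar

# separator and strip control how the pieces are joined.
print(li.get_text(separator="|", strip=True))  # Foo|Bar
```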

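Summary:

Use the lxml parser; fall back to html.parser when necessary.
Selecting nodes through attributes offers limited filtering, but it is fast.
Use find() or find_all() to query a single match or multiple matches.
For CSS selectors, use the select() method.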
