[Python Web Scraping Study Notes] Beautiful Soup Usage

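Introduction

Beautiful Soup's main job is extracting data from web pages. It offers a handful of simple, Pythonic functions for navigating, searching, and modifying the parse tree. It works as a toolbox: it parses a document and hands you the data you want to extract, so a complete application takes very little code.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You normally do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only have to state the original encoding.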
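As a small sketch of "stating the original encoding": the from_encoding argument tells Beautiful Soup how to decode raw bytes. The input bytes here are made up for illustration, and html.parser is used only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 bytes with no charset declaration.
raw = "<p>caf\u00e9</p>".encode("latin-1")

# from_encoding states the original encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string               # already a Unicode string
encoded = soup.p.encode("utf-8")   # UTF-8 bytes on the way out

print(text)
print(encoded)
```

After parsing, everything inside the soup is Unicode; bytes only reappear when you encode the output.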

Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, so you can try different parsing strategies or trade flexibility for speed.

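Installation

In Python 3 the module is named bs4. The package on PyPI is called beautifulsoup4 (pip install beautifulsoup4).

Using Beautiful Soup

Creating a BeautifulSoup object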

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
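# Create the BeautifulSoup object. Naming the parser avoids the "no parser specified" warning.
# The outputs in these notes appear to come from lxml; use 'html.parser' if lxml is not installed (output may differ slightly).
soup = BeautifulSoup(html, 'lxml')
# Print the content of the soup object, pretty-printed
print(soup.prettify())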

The four kinds of objects

Beautiful Soup turns a complex HTML document into a tree. Every node is a Python object, and all of them fall into one of four kinds:

Tag
NavigableString
BeautifulSoup
Comment

(1) Tag: a tag in the HTML

print(soup.title)
print(soup.head)
print(soup.a)
print(soup.p)

<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

# Check the object's type
print(type(soup.a))

<class 'bs4.element.Tag'>

Attributes

name: the soup object itself is a special case whose name is [document]. For every other tag, name is the tag's own name.

print(soup.name)
print(soup.head.name)

[document]
head

attrs: attributes can be read, modified, and deleted.

# Read attributes
print(soup.p.attrs)
print(soup.p['class'])
print(soup.p.get('class'))
# Modify an attribute
soup.p['class'] = 'newClass'
print(soup.p)
# Delete an attribute
del soup.p['class']
print(soup.p)

{'class': ['title'], 'name': 'dromouse'}
['title']
['title']
<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>
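A short sketch of the difference between the two ways of reading an attribute shown above: indexing raises KeyError when the attribute is missing, while get() returns None or a default. The markup is a made-up one-liner and html.parser is assumed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

classes = p["class"]               # class is multi-valued, so this is a list
missing = p.get("id")              # missing attribute: None, no exception
fallback = p.get("id", "no-id")    # or a default of your choice
has_name = p.has_attr("name")

try:
    p["id"]                        # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(classes, missing, fallback, has_name, raised)
```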

(2) NavigableString: a navigable string

# Get the text inside a tag
print(soup.p.string)
print(type(soup.p.string))

The Dormouse's story
<class 'bs4.element.NavigableString'>

(3) BeautifulSoup: this object represents the entire content of a document.

print(type(soup.name))
print(soup.name)
print(soup.attrs)

<class 'str'>
[document]
{}  # empty dict

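(4) Comment: a special kind of NavigableString. It holds the text of a comment with the comment markers already stripped, so printing it looks like ordinary text.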

print(soup.a)
print(soup.a.string)
print(type(soup.a.string))

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>
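Because a comment prints like normal text, it is worth checking the type before using .string. A minimal sketch, with a hypothetical helper named visible_text and html.parser assumed:

```python
from bs4 import BeautifulSoup, Comment

html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

def visible_text(tag):
    """Return tag.string as str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

results = {a["id"]: visible_text(a) for a in soup.find_all("a")}
print(results)
```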

Navigating the tree

(1) Direct children

tag.contents returns a tag's children as a list, so you can also pick out a single child by index.

print(soup.head.contents)
print(soup.head.contents[0])

[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>

tag.children does not return a list but an iterator over the direct children. Loop over it to visit every child.

print(soup.head.children)
for child in soup.body.children:
    print(child)

<list_iterator object at 0x000001FCC99A4668>


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>
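The blank lines in the output above are whitespace text nodes. If only the tag children matter, one option is to filter by type. A sketch on a made-up snippet, assuming html.parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

all_children = list(soup.body.children)   # newlines and tags mixed together
tag_children = [c for c in soup.body.children if isinstance(c, Tag)]
texts = [t.get_text() for t in tag_children]

print(len(all_children), texts)
```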

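(2) All descendants

tag.descendants iterates recursively over all of a tag's descendants (children, grandchildren, and so on). Like .children, it has to be looped over to get at the content.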

for child in soup.descendants:
    print(child)

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>

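Every node is printed, starting with html, then head, working inward one layer at a time.

(3) Node content

tag.string: if a tag has exactly one child and that child is a NavigableString, .string returns it. If the only child is another tag, .string returns that child's .string, so it still works.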

print(soup.head.string)
print(soup.title.string)

The Dormouse's story
The Dormouse's story

If a tag has more than one child, it is unclear which child .string should refer to, so .string is None.

print(soup.html.string)

None
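When .string is None because of several children, get_text() is the usual way to collect all the text. A small sketch on a made-up paragraph, assuming html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")
p = soup.p

single = p.b.string     # <b> has one string child
ambiguous = p.string    # <p> has three children, so this is None
full = p.get_text()     # concatenates every string under <p>

print(single, ambiguous, full)
```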

(4) Multiple strings

tag.strings yields every string inside the tag. Loop over it to get them.

for string in soup.strings:
    print(repr(string))

"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'

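tag.stripped_strings: the strings may include a lot of whitespace and blank lines. .stripped_strings skips whitespace-only strings and strips leading and trailing whitespace from the rest.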

for string in soup.stripped_strings:
    print(repr(string))

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'
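A common follow-up is to join the stripped pieces into one line. The sketch below does that by hand and with get_text(separator, strip=True), which should give the same result. The markup is invented and html.parser is assumed.

```python
from bs4 import BeautifulSoup

html = "<div>\n  <h1> Title </h1>\n  <p> First line </p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

pieces = list(soup.div.stripped_strings)   # whitespace-only strings are dropped
joined = " ".join(pieces)
same = soup.div.get_text(" ", strip=True)  # separator plus strip in one call

print(pieces)
print(joined)
```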

(5) Parent

tag.parent gives a node's parent.

p = soup.p
print(p.parent.name)
content = soup.head.title.string
print(content.parent.name)

body
title

(6) All ancestors

tag.parents iterates over all of an element's ancestors, from the nearest one outward.

content = soup.head.title.string
for parent in content.parents:
    print(parent.name)

title
head
html
[document]
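One practical use of .parents is building a readable path to a node. A sketch on a made-up document, assuming html.parser:

```python
from bs4 import BeautifulSoup

html = "<html><body><div><p><b>x</b></p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents runs from the nearest ancestor up to the BeautifulSoup object.
names = [parent.name for parent in soup.b.parents]
path = " > ".join(reversed(names))

print(path)
```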

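(7) Siblings

tag.next_sibling: the node's next sibling. Returns None if there is none.

tag.previous_sibling: the node's previous sibling. Returns None if there is none.

Note: in real documents, tag.next_sibling and tag.previous_sibling are usually strings or whitespace, because whitespace and line breaks between tags also count as nodes. The result may therefore be just a blank or a newline.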

print(repr(soup.p.next_sibling))
print(repr(soup.p.previous_sibling))
print(soup.p.next_sibling.next_sibling)

'\n'  # whitespace: the newline between the two <p> tags
'\n'  # also whitespace: the newline after <body>. The attribute is previous_sibling; a misspelling such as prev_sibling is treated as a lookup for a child tag of that name and gives None.
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
# The sibling after that newline is the next visible tag.

(8) All siblings

tag.next_siblings and tag.previous_siblings iterate over the siblings after or before the current node.

for sibling in soup.a.next_siblings:
    print(repr(sibling))

',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'


(9) Next and previous elements

tag.next_element and tag.previous_element: the node parsed immediately after or before this one, regardless of nesting level.

print(soup.head.next_element)

<title>The Dormouse's story</title>

(10) All next and previous elements

tag.next_elements and tag.previous_elements: iterators that walk forward or backward through the document in parse order.

for element in soup.a.next_elements:
    print(repr(element))

' Elsie '
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
'Lacie'
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'
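The difference between "sibling" and "element" is easy to mix up, so here is a small side-by-side sketch on a made-up snippet (html.parser assumed): the next sibling stays on the same level, while the next element goes into the tag first.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>one</b></p><p>two</p></div>", "html.parser")
first = soup.p

sibling = first.next_sibling    # same level: the second <p>
element = first.next_element    # parse order: the <b> inside the first <p>
walk = [str(e) for e in first.next_elements]

print(sibling, element)
print(walk)
```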

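Searching the tree

(1) find_all(name, attrs, recursive, text, limit, **kwargs): searches all descendants of the current tag and returns those that match the filter.

name argument: finds all tags whose name is name. Text nodes are ignored automatically.

Passing a string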

print(soup.find_all('b'))
print(soup.find_all('a'))

[<b>The Dormouse's story</b>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

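Passing a regular expression: Beautiful Soup matches tag names using the pattern's search() method, so the pattern can match anywhere in the name unless it is anchored with ^.

import re
for tag in soup.find_all(re.compile('^b')):  # tags whose name starts with b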
    print(tag.name)

body
b
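To see why the ^ anchor matters, compare an unanchored pattern with an anchored one. A sketch on a tiny made-up document, assuming html.parser:

```python
import re
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored: any tag name containing "t" matches.
unanchored = [t.name for t in soup.find_all(re.compile("t"))]
# Anchored: only tag names starting with "b" match.
anchored = [t.name for t in soup.find_all(re.compile("^b"))]

print(unanchored)
print(anchored)
```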

Passing a list: Beautiful Soup returns everything that matches any item in the list.

print(soup.find_all(['a','b'])) 

[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Passing True: True matches every tag, but none of the text strings.

for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
a
a
a
p

Passing a function: if no other filter fits, define a function that takes one element as its only argument. It should return True if the element matches and False otherwise.

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(has_class_but_no_id))

[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

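Keyword arguments: any argument whose name is not one of find_all()'s own parameters is treated as a filter on the tag attribute of that name. For example, passing id makes Beautiful Soup filter on each tag's id attribute.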

import re
print(soup.find_all(id = 'link2'))
print(soup.find_all(href = re.compile('elsie')))
print(soup.find_all(href = re.compile('elsie'),id = 'link1'))
print(soup.find_all('a',class_ = 'sister'))  # class is a reserved word in Python, so filter on it with class_

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
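Some attribute names cannot be written as keyword arguments, for example data-* attributes (not valid Python identifiers) or an attribute literally called name (it collides with find_all()'s own name parameter). The attrs dictionary covers those cases. A sketch on made-up markup, assuming html.parser:

```python
from bs4 import BeautifulSoup

html = '<div data-foo="value">foo!</div><input name="email"/>'
soup = BeautifulSoup(html, "html.parser")

# data-foo="value" as a keyword argument would be a SyntaxError, so use attrs.
by_data = soup.find_all(attrs={"data-foo": "value"})
# name= is taken by the tag-name filter, so the attribute goes through attrs too.
by_name = soup.find_all(attrs={"name": "email"})

print(by_data)
print(by_name)
```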

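text argument: searches string content instead of tags. Like name, it accepts a string, a regular expression, a list, or True. Since Beautiful Soup 4.4.0 the same argument is also available under the name string. In the output below, 'Elsie' matches nothing because that text only exists inside a comment, with surrounding spaces.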

print(soup.find_all(text='Elsie'))
print(soup.find_all(text=['Tillie','Elsie','Lacie']))
print(soup.find_all(text=re.compile('Dormouse')))

[]
['Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
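A sketch of the same searches written with string=, plus the combination of a tag name with string=, which returns the matching tags instead of the strings. The markup is invented and html.parser is assumed.

```python
import re
from bs4 import BeautifulSoup

html = '<p><a id="x">Lacie</a> and <a id="y">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

strings = list(soup.find_all(string="Lacie"))            # matching strings
by_regex = list(soup.find_all(string=re.compile("ill")))
tags = soup.find_all("a", string="Tillie")               # tags whose .string matches

print(strings, by_regex, tags)
```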

limit argument: find_all() returns every search result, which can be slow on a large tree. If you do not need all results, use limit to cap how many are returned. It works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and returns.

print(soup.find_all('a',limit=2))

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

recursive argument: when you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the direct children, pass recursive=False.

print(soup.html.find_all('title'))
print(soup.html.find_all('title',recursive = False))
print(soup.html.find_all('head',recursive = False))

[<title>The Dormouse's story</title>]
[]
[<head><title>The Dormouse's story</title></head>]

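(2) find(name, attrs, recursive, text, **kwargs): the difference from find_all() is that find_all() returns a list of every match (an empty list if there is none), while find() returns the first match directly (None if there is none).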
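Since find() can return None, chained calls on its result are a common source of AttributeError. A small sketch of the difference and of a simple guard, on made-up markup with html.parser assumed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li></ul>", "html.parser")

first = soup.find("li")               # the first match itself
same = soup.find_all("li", limit=1)   # a one-element list holding the same tag

nothing = soup.find("table")          # no match: None
empty = soup.find_all("table")        # no match: empty list

# Guard before using the result of find().
table_text = nothing.get_text() if nothing is not None else ""

print(first, same, nothing, list(empty), repr(table_text))
```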

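(3) find_parents() and find_parent(): search the ancestors of the current node, using the same filters as an ordinary tag search. find_parents() returns every matching ancestor; find_parent() returns the nearest one.

(4) find_next_siblings() and find_next_sibling(): the first returns all matching siblings after the current node; the second returns only the first matching one.

(5) find_previous_siblings() and find_previous_sibling(): the first returns all matching siblings before the current node; the second returns only the first matching one.

(6) find_all_next() and find_next(): iterate over the tags and strings after the current tag via .next_elements.

(7) find_all_previous() and find_previous(): iterate over the tags and strings before the current node via .previous_elements.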
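A rough sketch of a few of these methods side by side. Compared with the raw .next_sibling attribute, the find_* versions skip over whitespace and non-matching nodes. The markup and ids are made up, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p id="p1">one</p>\n<span>gap</span>\n<p id="p2">two</p></div>'
soup = BeautifulSoup(html, "html.parser")
p1 = soup.find(id="p1")

raw_next = p1.next_sibling                    # just the newline
next_p = p1.find_next_sibling("p")            # skips the newline and the <span>
prev_p = next_p.find_previous_sibling("p")    # back to p1
box = p1.find_parent("div")                   # nearest matching ancestor
after = p1.find_next("span")                  # first <span> later in parse order

print(repr(raw_next), next_p["id"], prev_p["id"], box["id"], after.get_text())
```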

CSS selectors

Call soup.select() with a CSS selector; it returns a list.

(1) Select by tag name

print(soup.select('title'))
print(soup.select('a'))
print(soup.select('b'))

[<title>The Dormouse's story</title>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<b>The Dormouse's story</b>]

(2) Select by class: prefix the class name with a dot (.).

print(soup.select('.sister'))

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Select by id: prefix the id with #.

print(soup.select('#link1'))

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(4) Combined selectors

print(soup.select('p #link1'))
# Direct child selector
print(soup.select('head > title'))

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
[<title>The Dormouse's story</title>]

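(5) Select by attribute: put the attribute in square brackets. When the attribute belongs to the same node as the tag, there must be no space between them, otherwise nothing will match.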

print(soup.select('a[class="sister"]'))
print(soup.select('a[href="http://example.com/elsie"]'))
print(soup.select('p a[href="http://example.com/elsie"]'))

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(6) You can also loop over the results and call get_text() on each one to get its text.

print(type(soup.select('title')))
print(soup.select('title')[0].get_text())
for title in soup.select('title'):
    print(title.get_text())

<class 'list'>
The Dormouse's story
The Dormouse's story
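When only the first match is needed, select_one() avoids indexing into the list with [0]; it returns None if nothing matches. The sketch below also uses the standard CSS $= attribute operator (ends with). The markup is a cut-down version of the example document, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> '
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("p.story > a.sister")   # first match only
missing = soup.select_one("a#link9")            # no match: None
hrefs = [a["href"] for a in soup.select('a[href$="lacie"]')]

print(first["id"], missing, hrefs)
```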