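Python Web Scraping: Beautiful Soup Study Notes 01

Reference post: https://cuiqingcai.com/1319.html

I have recently been learning web scraping. Matching pages with hand-written regular expressions is not very pleasant, whereas Beautiful Soup is friendly to use and parses web pages well. These are my study notes; working through the basics first should make writing the actual crawler code much easier.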

1. Introduction to BS

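Beautiful Soup is a Python library whose main job is extracting data from web pages. It sits on top of parsers such as lxml and html5lib, so users can flexibly choose between different parsing strategies or raw speed. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it automatically; in that case you only need to state the original encoding.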
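
As a small illustration of the encoding point above, the sketch below parses bytes that carry no charset declaration and states the source encoding explicitly through the from_encoding argument. The sample markup and the choice of GBK are assumptions made for this example only.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK-encoded bytes with no charset declaration.
raw = "<p>中文</p>".encode("gbk")

# State the original encoding instead of letting Beautiful Soup guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string          # already decoded to Unicode
utf8_bytes = soup.encode()    # serialized output defaults to UTF-8

print(text)
print(utf8_bytes)
```

When the input is already a Python str, no encoding argument is needed.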

Installation:

pip install beautifulsoup4, or conda install beautifulsoup4 (assuming Anaconda is installed)

Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all of them fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

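Beautiful Soup normalizes the HTML when it parses it; for example, the <body> and <html> tags left open in html_doc above are closed in the parsed tree. [Figure: the normalized HTML output]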
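
One way to inspect the normalized markup is to convert the soup back to text. The sketch below uses a deliberately broken fragment (a made-up example, not the document above) and shows both the compact form and the indented form produced by prettify().

```python
from bs4 import BeautifulSoup

# Hypothetical fragment in which every tag is left open.
broken = "<html><body><p>Hello<b>world"
soup = BeautifulSoup(broken, "html.parser")

normalized = str(soup)      # compact form, missing end tags added
pretty = soup.prettify()    # same tree, indented one node per line

print(normalized)
print(pretty)
```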

2. The four BS objects

All BS objects fall into four types:
1. Tag
2. NavigableString
3. BeautifulSoup
4. Comment
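
A quick way to see the four types side by side is to pull one object of each kind out of a small document. The markup below is a made-up example for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup('<p class="title"><b>Hi</b><!-- note --></p>', "html.parser")

tag = soup.b                    # a Tag
text = soup.b.string            # a NavigableString
comment = soup.p.contents[1]    # a Comment

for obj in (soup, tag, text, comment):
    print(type(obj).__name__)
```

Note that Comment is a subclass of NavigableString and BeautifulSoup is a subclass of Tag, so isinstance checks should test the more specific type first.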

2.1 Tag

Put simply, a Tag corresponds to a tag in the HTML, for example

<title>The Dormouse's story</title>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

The title and a tags above, together with their contents, are each a Tag.

print (soup.title)
Output: <title>The Dormouse's story</title>

print (soup.title.name)
Output: title

print (soup.title.string)
Output: The Dormouse's story

print (soup.title.parent.name)
Output: head

print (soup.p)
Output: <p class="title"><b>The Dormouse's story</b></p>

print (soup.p.get('class'))
print (soup.p['class'])
Output: ['title'] (a list, printed once for each statement)

print (soup.find_all('a'))
Output: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print (soup.find(id="link3"))
Output: <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

A tag's attributes are available as a dictionary:

print (soup.p.attrs)
Output: {'class': ['title']}
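
The difference between tag['name'] and tag.get('name') matters when an attribute may be absent. A minimal sketch, using a made-up link tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister big">Elsie</a>',
    "html.parser",
)
link = soup.a

href = link["href"]         # single-valued attribute: a str
classes = link["class"]     # class is multi-valued: a list
missing = link.get("id")    # absent attribute: get() returns None

try:
    link["id"]              # subscript access raises KeyError instead
    raised = False
except KeyError:
    raised = True

print(href, classes, missing, raised)
```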

2.2 NavigableString

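The examples above retrieve whole tags, but often we need the text inside a tag. For that we can use .string, get_text(), and so on.

For this tag, the two approaches below give the same result: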

print (soup.a.string)

print (soup.a.get_text())
Elsie
Elsie

Let's look at the type of the .string result:

print (type(soup.a.string))
Output: <class 'bs4.element.NavigableString'>
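
The two approaches stop agreeing once a tag has more than one child. The sketch below, on a made-up paragraph, shows that .string gives None in that case while get_text() concatenates all nested text and returns a plain str.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p>Hello <b>big</b> world</p>", "html.parser")

single = soup.b.string        # <b> has exactly one child string
multi = soup.p.string         # <p> has several children, so this is None
joined = soup.p.get_text()    # all nested text, concatenated

print(repr(single), repr(multi), repr(joined))
```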

2.3 BeautifulSoup

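A BeautifulSoup object represents the entire content of a document. Most of the time it can be treated as a Tag object; it is a special Tag, and we can read its type, name, and attributes: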

print (type(soup.name))
print (soup.name)
<class 'str'>
[document]

2.4 Comment

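A Comment object is a special kind of NavigableString. When printed, its content does not include the comment delimiters, so if it is not handled carefully it can cause unexpected trouble in text processing. The output below (and in the rest of these notes) comes from a version of html_doc in which the text of the first link is wrapped in a comment: <!-- Elsie -->.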

print (soup.a)
print (soup.a.string)
print (type(soup.a.string))
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie (not the same as the source text: the comment delimiters are gone)
<class 'bs4.element.Comment'>
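
One way to avoid mistaking a comment for real text is to check the type of .string before using it. The helper name visible_string below is hypothetical, introduced only for this sketch.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

def visible_string(tag):
    """Return the tag's single string, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

soup = BeautifulSoup("<a><!-- Elsie --></a><a>Lacie</a>", "html.parser")
results = [visible_string(a) for a in soup.find_all("a")]
print(results)
```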

3. Traversing the document tree

3.1 contents

A tag's .contents attribute returns the tag's child nodes as a list.

print (soup.head.contents )
[<title>The Dormouse's story</title>]

print (soup.head.contents[0])
<title>The Dormouse's story</title>

3.2 children

Unlike .contents, .children does not return a list, but we can iterate over it to get all child nodes. Printing .children shows that it is a list iterator object.

print (soup.head.children)
<list_iterator object at 0x000000000D293588>

for child in  soup.body.children:
    print (child)


<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>
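
The blank lines in the output above are whitespace strings that sit between the tags. If only the child tags are wanted, one option is to filter on the node type, as in this sketch on a made-up body:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<body>\n<p>one</p>\n<p>two</p>\n</body>", "html.parser")

all_children = list(soup.body.children)    # tags and whitespace strings
tag_children = [c for c in soup.body.children if isinstance(c, Tag)]

print(len(all_children))
print([t.get_text() for t in tag_children])
```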

3.3 All descendants (.descendants)

for child in soup.descendants:
    print (child)
Output (first few nodes only):
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story
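
To make the contrast with .children concrete, the sketch below counts both on a small made-up fragment: .children stops at direct children, while .descendants walks every nested node, strings included.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>x</b></p></div>", "html.parser")

children = list(soup.div.children)          # only the <p> tag
descendants = list(soup.div.descendants)    # <p>, <b>, and the string 'x'

print(len(children), len(descendants))
```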

3.4 Node content (.string)

print (soup.head.string)
print (soup.title.string)
The Dormouse's story
The Dormouse's story

print (soup.html.string)

None

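3.5 Getting multiple strings (.strings)

.strings yields the strings of all descendant nodes, so there is no need to walk the tree by hand; just iterate over it.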

for string in soup.strings:
    print(repr(string))
'\n'
"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'

3.6 Stripping whitespace (.stripped_strings)

for string in soup.stripped_strings:
    print(repr(string))
"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'
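
As the output shows, .stripped_strings drops whitespace-only strings and trims each remaining string at both ends, but leaves line breaks inside a string alone. A common use is joining the pieces into one line of text; the sketch below does this on a made-up fragment.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> Hello </p>\n<p> world </p>", "html.parser")

raw = list(soup.strings)                # includes padding and the newline
clean = list(soup.stripped_strings)     # trimmed, whitespace-only items dropped
line = " ".join(clean)

print(raw)
print(clean)
print(line)
```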

3.7 Parent node (.parent)

content = soup.head.title.string
print (content.parent.name)
title

3.8 All ancestors (.parents)

An element's .parents attribute walks upward through all of its ancestor nodes.

for parent in  content.parents:
    print (parent.name)
title
head
html
[document]
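
One possible use of .parents is building a path from the document root down to a node. A minimal sketch on a made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>x</b></p></body></html>", "html.parser")
node = soup.b.string

# .parents yields the nearest ancestor first, so reverse it for a root-first path.
names = [parent.name for parent in node.parents]
path = "/".join(reversed(names))

print(names)
print(path)
```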

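3.9 Sibling nodes (.next_sibling and .previous_sibling)

Siblings are nodes on the same level as the current node. .next_sibling returns the node's next sibling and .previous_sibling returns the one before it; if there is no such node, the result is None.

Note: in real documents, a tag's .next_sibling and .previous_sibling are often strings or whitespace, because whitespace and line breaks also count as nodes. The result may therefore be blank space or a line break.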

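print (soup.p.next_sibling)
# prints only whitespace: this sibling is a line-break string
print (soup.p.previous_sibling)
# also whitespace: the line break right after <body>
# (the attribute is .previous_sibling; soup.p.prev_sibling is treated as a
#  search for a <prev_sibling> tag and therefore returns None)
print (soup.p.next_sibling.next_sibling)
# the second <p> tag: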
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
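
To skip over the whitespace strings, one option is find_next_sibling() (and its counterpart find_previous_sibling()), which only consider tags matching the given filter. The markup below is a made-up example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p id="a">1</p>\n<p id="b">2</p></div>', "html.parser")
first = soup.p

raw_next = first.next_sibling               # the '\n' string between the tags
tag_next = first.find_next_sibling("p")     # the next <p> tag
back = tag_next.find_previous_sibling("p")  # and back to the first one

print(repr(raw_next), tag_next["id"], back["id"])
```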

3.10 All siblings (.next_siblings and .previous_siblings)

The .next_siblings and .previous_siblings attributes iterate over the siblings of the current node.

for sibling in soup.a.next_siblings:
    print(repr(sibling))
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'

4. Searching the document tree

4.1 find_all( name , attrs , recursive , text , limit , **kwargs )

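I. The name argument

The name argument finds all tags whose name is name; string nodes are ignored automatically.

1 Passing a string

Pass a string to find_all and Beautiful Soup looks for tags whose name matches that string exactly. For example: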

soup.find_all('a')
Out[11]: 
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

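2 Passing a regular expression

If you pass a regular expression, Beautiful Soup filters tag names with the pattern's search() method, so anchor the pattern with ^ if you need a match at the start of the name. For example: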

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
Output:
body
b
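
The effect of the ^ anchor is easier to see next to an unanchored pattern. In this sketch, on a made-up document, the pattern "t" matches any tag name that contains the letter t anywhere.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>T</title></head><body><b>x</b></body></html>",
    "html.parser",
)

starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b)
print(contains_t)
```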

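3 Passing a list

If you pass a list, Beautiful Soup returns everything that matches any item in the list. The code below finds all <a> tags and all <b> tags in the document:

soup.find_all(["a", "b"])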
Out[15]: 
[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

4 Passing True
True matches any value. The code below finds every tag in the document but returns no string nodes:

for tag in soup.find_all(True):
    print(tag.name)
    
html
head
title
body
p
b
p
a
a
a
p

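5 Passing a function

If no other filter fits, you can define a function that takes a single element as its argument. It should return True if the element matches and False otherwise. The function below returns True for an element that has a class attribute but no id attribute: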

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')


has_class_but_no_id
Out[19]: <function __main__.has_class_but_no_id>

soup.find_all(has_class_but_no_id)
Out[20]: 
[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

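II. The text argument

The text argument searches the string content of the document. Like name, it accepts a string, a regular expression, a list, or True. (Since Beautiful Soup 4.4.0 this argument is also available under the name string.) In the session below, "Elsie" is not found because in this document it appears only inside the comment <!-- Elsie -->, not as an exact string.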

soup.find_all('a')
Out[22]: 
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(text="Elsie")
Out[23]: []
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
Out[24]: ['Lacie', 'Tillie']

import re
soup.find_all(text=re.compile("Dormouse"))
Out[25]: ["The Dormouse's story", "The Dormouse's story"]
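
A string filter can also be combined with a tag name, in which case the result is the matching tags instead of the strings themselves. The sketch below uses the string spelling of the argument, which assumes Beautiful Soup 4.4.0 or later, and a made-up pair of links.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a id="link2">Lacie</a><a id="link3">Tillie</a>',
    "html.parser",
)

strings = soup.find_all(string=re.compile("ill"))    # matching strings
tags = soup.find_all("a", string="Lacie")            # <a> tags whose .string matches

print(strings)
print(tags)
```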

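III. The limit argument

find_all() returns all matching results, which can be slow on a large document tree. If you do not need every result, use the limit argument to cap the number returned. It works like the LIMIT keyword in SQL: once the number of results reaches limit, the search stops and returns. Three tags in the document match the search below, but only two are returned because of the limit: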

soup.find_all("a", limit=2)
Out[27]: 
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

IV. The recursive argument

soup.html.find_all("title")
Out[29]: [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
Out[30]: []
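
The session above returns an empty list with recursive=False because <title> is a grandchild of <html>, not a direct child: by default find_all() searches all descendants, while recursive=False restricts the search to direct children. A minimal sketch on a made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")

deep = soup.html.find_all("title")                       # searches all descendants
shallow = soup.html.find_all("title", recursive=False)   # direct children only
direct = soup.head.find_all("title", recursive=False)    # <title> is a child of <head>

print(deep, shallow, direct)
```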

V. Keyword arguments

soup.find_all(id='link2')
Out[31]: [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all(href=re.compile("elsie"))
Out[32]: [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

soup.find_all(href=re.compile("elsie"), id='link1')
Out[33]: [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
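
Two attribute names cannot be written directly as keyword arguments: class is a reserved word in Python, and names such as data-id are not valid identifiers. Beautiful Soup offers class_ for the first case and the attrs dictionary for the second. The markup below is a made-up example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div class="a b" data-id="7">x</div><div class="b">y</div>',
    "html.parser",
)

by_class_a = soup.find_all(class_="a")             # matches any one of the classes
by_class_b = soup.find_all("div", class_="b")
by_data = soup.find_all(attrs={"data-id": "7"})    # attribute name with a hyphen

print(len(by_class_a), len(by_class_b), len(by_data))
```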

4.2 find( name , attrs , recursive , text , **kwargs )

soup.find('a')
Out[37]: <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
soup.find_all('a')
Out[38]: 
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
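
As the session shows, find() returns the first match itself, where find_all() returns a list. The two also differ when nothing matches: find() gives None and find_all() gives an empty list, so the result of find() should be checked before use. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="link1">Elsie</a><a id="link2">Lacie</a>', "html.parser")

first = soup.find("a")                   # a single Tag
limited = soup.find_all("a", limit=1)    # a list holding the same Tag
no_tag = soup.find("table")              # None when nothing matches
no_list = soup.find_all("table")         # empty list when nothing matches

# Guard against None before reading attributes.
table_id = no_tag["id"] if no_tag is not None else None

print(first, limited, no_tag, no_list, table_id)
```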