Python Web Scraping with the BeautifulSoup Library (Part 4): Searching the Document Tree

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'lxml')

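I. Filters

1. Strings

The simplest filter is a string: pass a string to a search method and it matches the tags with exactly that name.

soup.find_all('b') # find all <b> tags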

[<b>The Dormouse's story</b>]

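2. Regular expressions

If you pass a compiled regular expression, Beautiful Soup matches it against tag names using its search() method.

import re
# find all tags whose names start with "b"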
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

body
b

# find all tags whose names contain "t"
for tag in soup.find_all(re.compile("t")):
    print(tag.name)

html
title

3. Lists

Passing a list matches any tag whose name equals one of the items in the list.

soup.find_all(["a","b"])

[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

4. True: matches every tag, but none of the text strings

for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
a
a
a
p

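5. Functions

If none of the other filters fits, you can define your own function. It must take a single element as its only argument and return True if the element matches, or False otherwise.

# tags that have a class attribute but no id attribute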
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

from bs4 import NavigableString
# tags that are surrounded by strings
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)

p
a
a
a
p

II. find_all(name, attrs, recursive, text, limit, **kwargs)

1. The name argument

The name argument restricts the search to tags with the given name; text strings are ignored automatically.

soup.find_all("title")

[<title>The Dormouse's story</title>]

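2. Keyword arguments (kwargs)

If a keyword argument is not one of find_all()'s built-in parameters, it is treated as a filter on the tag attribute of that name.

soup.find_all(id='link2') # tags whose id is 'link2'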

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all(href=re.compile("elsie")) # tags whose href attribute contains "elsie"

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.find_all(id=True) # every tag that has an id attribute, whatever its value

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(href=re.compile("elsie"), id='link1') # filter on several attributes at once

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

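Some attribute names, such as the HTML5 data-* attributes, cannot be used as keyword arguments because they are not valid Python identifiers. In that case, pass the attributes as a dictionary through the attrs argument.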

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>','lxml')
data_soup.find_all(attrs={"data-foo": "value"})

[<div data-foo="value">foo!</div>]
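An attribute that is literally called name needs the same treatment, because find_all() reserves the name argument for the tag name. A minimal sketch of the difference, assuming a made-up one-tag document (name_soup is a hypothetical variable) and the built-in html.parser so that it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document whose tag carries a "name" attribute
name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')

# name= is interpreted as the tag name, so this looks for <email> tags
by_keyword = name_soup.find_all(name="email")

# attrs= is interpreted as attribute filters, so this finds the <input>
by_attrs = name_soup.find_all(attrs={"name": "email"})

print(len(by_keyword), len(by_attrs))
```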

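3. The text argument

The text argument searches the strings in the document instead of the tags. Like name, it accepts a string, a list, or a regular expression. Since Beautiful Soup 4.4.0 this argument is called string; text is the older name and is still accepted.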

print(soup.find_all(text="Elsie"))
print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))
print(soup.find_all(text=re.compile("Dormouse")))

['Elsie']
['Elsie', 'Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
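The string filter can also be combined with a tag name, in which case find_all() returns the tags whose own string matches rather than the strings themselves. The sketch below assumes Beautiful Soup 4.4.0 or later (where the argument is spelled string) and uses a shortened, made-up copy of the sample document (link_soup) with the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample document (illustrative only)
link_soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>', 'html.parser')

# Without a tag name: the matching strings are returned
strings_only = link_soup.find_all(string="Elsie")

# With a tag name: the <a> tags whose string matches are returned
tags_with_string = link_soup.find_all("a", string="Elsie")

print(strings_only)
print(tags_with_string)
```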

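4. The limit argument

On a large document find_all() can be slow. The limit argument caps the number of results: the search stops as soon as that many matches have been found.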

soup.find_all('a',limit=2)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

5. The recursive argument

By default find_all() examines all descendants of the current tag. With recursive=False it only considers the tag's direct children.

print(soup.html.find_all("title"))
print(soup.html.find_all("title", recursive=False))

[<title>The Dormouse's story</title>]
[]

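6. Searching by CSS class

Searching for tags by CSS class name is very useful, but class is a reserved word in Python, so using class as a keyword argument is a syntax error. As of Beautiful Soup 4.1.2 you can search for tags with a given CSS class using the class_ argument.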

soup.find_all("a", class_="sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The class_ argument accepts the same kinds of filter as the other arguments: a string, a regular expression, a function, or True.

soup.find_all(class_=re.compile("itl"))

[<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The class attribute of a tag is multi-valued. When you search by CSS class, a tag matches if any one of its classes matches.

css_soup = BeautifulSoup('<p class="body strikeout"></p>','lxml')
print(css_soup.find_all("p", class_="strikeout"))
print(css_soup.find_all("p", class_="body"))

[<p class="body strikeout"></p>]
[<p class="body strikeout"></p>]

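A class_ search can also match the exact string value of the class attribute, but then the class names must appear in exactly the same order.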

css_soup.find_all("p", class_="body strikeout")

[<p class="body strikeout"></p>]
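To see why the order matters, the following sketch reverses the two class names. It rebuilds the same one-tag document under a different, made-up variable name (order_soup) with html.parser, and it swaps in a CSS selector through select() as one order-independent alternative:

```python
from bs4 import BeautifulSoup

order_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')

# The attribute value is "body strikeout", so the reversed string does not match
reversed_order = order_soup.find_all("p", class_="strikeout body")

# A CSS selector requires both classes but does not care about their order
both_classes = order_soup.select("p.strikeout.body")

print(len(reversed_order), len(both_classes))
```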

7. tag(): calling a tag like a function is a shortcut for find_all()

print(soup.find_all("a"))
print(soup("a"))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

III. find(): returns the first match instead of all matches

soup.find('title')

<title>The Dormouse's story</title>
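The two methods also differ when nothing matches: find() returns None, while find_all() returns an empty list. A small sketch, assuming a made-up two-tag document (title_soup) and html.parser:

```python
from bs4 import BeautifulSoup

title_soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    'html.parser')

# No <table> in the document: find() gives None, find_all() gives []
missing_one = title_soup.find("table")
missing_all = title_soup.find_all("table")

# find() behaves like find_all(limit=1) followed by taking the first item
first_title = title_soup.find("title")
same_title = title_soup.find_all("title", limit=1)[0]

print(missing_one, missing_all, first_title is same_title)
```

Because of the None result, it is worth checking the return value of find() before accessing attributes on it.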

IV. Other search functions

1. find_parent() and find_parents()

find_all() and find() search the descendants of the current node, whereas find_parent() and find_parents() search upward through its ancestors.
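A short sketch of the upward search, assuming a small made-up document (family_soup) parsed with html.parser; the tag names in the filter list are chosen only for illustration:

```python
from bs4 import BeautifulSoup

family_soup = BeautifulSoup(
    '<html><body><div id="outer"><p class="story">'
    '<a id="link1">Elsie</a></p></div></body></html>',
    'html.parser')
link = family_soup.find("a")

# Nearest ancestor that is a <p>
nearest_p = link.find_parent("p")

# All ancestors that are <p> or <div>, closest first
ancestor_names = [tag.name for tag in link.find_parents(["p", "div"])]

print(nearest_p["class"], ancestor_names)
```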

2. find_next_sibling() and find_next_siblings()

These search the siblings that come after the current node.
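For example, starting from the first link in a made-up paragraph (sibling_soup, parsed with html.parser), the sketch below collects the links that follow it at the same level:

```python
from bs4 import BeautifulSoup

sibling_soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>',
    'html.parser')
first_link = sibling_soup.find("a", id="link1")

# First following sibling that is an <a>; the text in between is skipped
next_link = first_link.find_next_sibling("a")

# Every following sibling that is an <a>, in document order
later_ids = [a["id"] for a in first_link.find_next_siblings("a")]

print(next_link["id"], later_ids)
```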

3. find_previous_sibling() and find_previous_siblings()

These search the siblings that come before the current node.
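The backward direction works the same way. This sketch assumes the same kind of made-up paragraph (prev_soup, parsed with html.parser) and starts from the last link:

```python
from bs4 import BeautifulSoup

prev_soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>',
    'html.parser')
last_link = prev_soup.find("a", id="link3")

# Closest preceding sibling that is an <a>
previous_link = last_link.find_previous_sibling("a")

# Every preceding <a> sibling, closest first (reverse document order)
earlier_ids = [a["id"] for a in last_link.find_previous_siblings("a")]

print(previous_link["id"], earlier_ids)
```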

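4. find_all_next() and find_next()

These search everything that comes after the current node in the document, in the order it was parsed (depth-first).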
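Unlike the sibling methods, this search is not limited to one level of the tree, so it also descends into the starting tag's own children. A minimal sketch on a made-up fragment (next_soup, parsed with html.parser):

```python
from bs4 import BeautifulSoup

next_soup = BeautifulSoup(
    '<div><p id="p1">one <b>bold</b></p><p id="p2">two</p></div>',
    'html.parser')
first_p = next_soup.find("p", id="p1")

# First <p> that appears after the start of first_p
following_p = first_p.find_next("p")

# All later <b> and <p> tags; the <b> nested inside first_p is included
following_names = [tag.name for tag in first_p.find_all_next(["b", "p"])]

print(following_p["id"], following_names)
```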

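5. find_all_previous() and find_previous()

These search everything that comes before the current node in the document, walking backward through the order it was parsed (depth-first).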
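Because the walk goes backward through parse order, it can return a tag that contains the starting node, since a parent is always parsed before its children. The sketch below assumes a made-up fragment (back_soup, parsed with html.parser):

```python
from bs4 import BeautifulSoup

back_soup = BeautifulSoup(
    '<div><p id="p1">one <b>bold</b></p><p id="p2">two</p></div>',
    'html.parser')
second_p = back_soup.find("p", id="p2")

# Closest <p> that was parsed before second_p
earlier_p = second_p.find_previous("p")

# All earlier <b>, <p> and <div> tags, closest first; the enclosing <div>
# is included because it was opened before second_p
earlier_names = [
    tag.name for tag in second_p.find_all_previous(["b", "p", "div"])]

print(earlier_p["id"], earlier_names)
```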

V. CSS selectors

BeautifulSoup supports most CSS selectors: pass the selector as a string to select(). For the full syntax, see a tutorial on CSS selectors.

print(soup.select("title"))
print(soup.select("html head title"))

[<title>The Dormouse's story</title>]
[<title>The Dormouse's story</title>]
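A few more selector forms, sketched on a shortened, made-up copy of the sample document (sel_soup, parsed with html.parser). The selectors shown (child combinator, id, attribute suffix) are standard CSS. select_one() returns only the first match:

```python
from bs4 import BeautifulSoup

sel_soup = BeautifulSoup(
    '<html><head><title>The Dormouse\'s story</title></head><body>'
    '<p class="story">'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
    '</p></body></html>',
    'html.parser')

# <a> tags that are direct children of <p class="story">
child_ids = [a["id"] for a in sel_soup.select("p.story > a")]

# Tag with id="link2"
by_id = sel_soup.select("#link2")

# <a> tags whose href ends with "lacie"
by_suffix = sel_soup.select('a[href$="lacie"]')

# First tag matching the selector, instead of a list
first_sister = sel_soup.select_one("a.sister")

print(child_ids, by_id[0].string, by_suffix[0]["id"], first_sister["id"])
```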