Crawler - The BeautifulSoup Parsing Module

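I. Creating a BeautifulSoup Object

1. Introduction to BeautifulSoup and installation

Beautiful Soup provides Pythonic functions for navigating, searching, and modifying the parse tree. It parses a document and hands you the data you want to scrape.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You only need to think about encodings when the document does not declare one and Beautiful Soup cannot detect it; in that case, stating the original encoding is enough.

Install BeautifulSoup: pip install beautifulsoup4

2. Installing parsers for beautifulsoup4

Install lxml: pip install lxml

Install html5lib: pip install html5lib

3. Creating a Beautiful Soup object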

from bs4 import BeautifulSoup

import re

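soup = BeautifulSoup(html, 'html.parser') # html is a string or an open file handle; if no parser is named, bs4 picks the best installed one and emits a warning

soup = BeautifulSoup(open('index.html'), 'html.parser') # build the object from a local HTML file

soup = BeautifulSoup(open('index.html', encoding='utf-8'), 'html5lib') # specify the file encoding and the parser

soup = BeautifulSoup(respHtml, 'html.parser', from_encoding='GB2312') # respHtml is bytes that are not UTF-8, so name the encoding (bs4 spells the argument from_encoding)

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8') # from_encoding only takes effect when html_doc is bytes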
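The calls above rely on variables (html, respHtml, html_doc) that are defined elsewhere. As a minimal, self-contained sketch, the snippet below builds one object from a str and one from GB2312-encoded bytes. The markup and the variable names html_text and gb_bytes are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from the original article.
html_text = "<html><head><title>Demo</title></head><body><p>hello</p></body></html>"

# Parsing a str: naming the parser avoids the "no parser specified" warning.
soup_from_str = BeautifulSoup(html_text, "html.parser")

# Parsing bytes in a non-UTF-8 encoding: from_encoding tells bs4 how to decode them.
gb_bytes = "<p>中文</p>".encode("gb2312")
soup_from_bytes = BeautifulSoup(gb_bytes, "html.parser", from_encoding="gb2312")

print(soup_from_str.title.string)   # Demo
print(soup_from_bytes.p.string)     # 中文
```

Naming the parser explicitly also keeps results the same across machines that have different parsers installed.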

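II. Selecting Nodes with BeautifulSoup

The four object types: Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All objects fall into four kinds: Tag, NavigableString (text content), BeautifulSoup, and Comment.

1. Selecting elements (Tag)

Writing soup followed by a tag name gets you that tag, but only the first matching tag in the whole document. To get every matching tag, use find_all (find, like attribute access, returns only the first match).

print(soup.prettify()) # print the content of the soup object, pretty-printed

soup.head # the first head tag: <head><title>The Dormouse's story</title></head>

soup.head.title # the first title tag inside head: <title>The Dormouse's story</title>

print(soup.head.title.text) # text content of title: The Dormouse's story

print(soup.head.title.get_text()) # .text is the same as get_text(): The Dormouse's story

print(soup.title) # the first title tag

print(soup.title.string) # the content of the first title tag

print(soup.p) # the first p tag

print(soup.a) # the first a tag

print(soup.p.contents) # all direct children of the first p tag, as a list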
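To make the "first match only" behaviour concrete, here is a small runnable sketch. The html_doc string is a shortened version of the "Dormouse" sample and is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample document (assumed for illustration).
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p                 # attribute access returns only the first <p>
all_p = soup.find_all("p")       # find_all returns every <p>

print(soup.title.string)         # The Dormouse's story
print(first_p.contents)          # [<b>The Dormouse's story</b>]
print(len(all_p))                # 2
print(soup.a["id"])              # link1
```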

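2. Extracting information

(1) Getting a tag's name

A Tag has two important attributes: name and attrs.

soup.name # [document]

# The soup object itself is a special case: its name is [document]. For every tag inside it, name is the tag's own name.

soup.head.name # head

soup.head.title.name # title

soup.head.div.name # div (only if head contains a div; otherwise soup.head.div is None and this raises AttributeError)

(2) Getting a tag's attributes

Reading attributes

print(soup.p.attrs) # all attributes of the p tag, as a dict: {'class': ['title'], 'name': 'dromouse'}

print(soup.p['class']) # ['title']

print(soup.p.get('class')) # get() also reads an attribute: ['title']

print(soup.p['name']) # dromouse (only multi-valued attributes such as class come back as lists)

print(soup.a['href']) # href attribute of the first a tag

Modifying attributes

soup.p['class'] = "newClass" # set the p tag's class to newClass; the attribute is added if it is missing

soup.a['href'] = 'http://www.baidu.com/'

soup.a['name'] = '百度' # add a name attribute ("Baidu") to the first a tag

Deleting attributes

del soup.p['class']

del soup.a['class'] # delete the class attribute of the first a tag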
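The read, modify, and delete operations above can be tried together on a one-line document. This is a sketch with made-up markup; the output comments show what html.parser produces for it.

```python
from bs4 import BeautifulSoup

# One-line document assumed for illustration.
soup = BeautifulSoup(
    '<p class="title" name="dromouse">'
    '<a href="http://example.com/elsie" class="sister">Elsie</a></p>',
    "html.parser",
)
p = soup.p

print(p.attrs)        # {'class': ['title'], 'name': 'dromouse'}
print(p["class"])     # ['title']  (class is multi-valued, so it is a list)
print(p["name"])      # dromouse
print(p.get("id"))    # None  (get() does not raise for a missing attribute)

p["class"] = "newClass"                   # modify an existing attribute
soup.a["href"] = "http://www.baidu.com/"  # modify another tag's attribute
del p["name"]                             # delete an attribute

print(p.attrs)        # {'class': 'newClass'}
```

Using get() rather than square brackets avoids a KeyError when an attribute may be absent.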

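(3) Extracting text content

Use .string to get the text inside a tag, for example:

print(soup.p.string) # The Dormouse's story

print(type(soup.p.string)) # <class 'bs4.element.NavigableString'>

The value is a NavigableString, that is, a string you can navigate from.

print(soup.head.title.text) # text content of title: The Dormouse's story

print(soup.head.title.get_text()) # .text is the same as get_text(): The Dormouse's story

(4) Inspecting the BeautifulSoup object

The BeautifulSoup object represents the whole document and can be treated as a special Tag. Below we look at its type, name, and attributes.

print(type(soup.name))

# <class 'str'> (Python 2 printed <type 'unicode'>)

print(soup.name)

# [document]

print(soup.attrs)

# {} (an empty dict)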

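(5) Comment: extracting comment content

In the following example, the a node contains a comment:

print(soup.a)

print(soup.a.string)

print(type(soup.a.string))

The output is:

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

 Elsie 

<class 'bs4.element.Comment'>

.string returns the comment text without the comment markers, so it is easy to mistake for ordinary text. The code below therefore checks for the Comment type first and only then does anything else with the value, such as printing it (it needs import bs4):

if type(soup.a.string) == bs4.element.Comment:

    print(soup.a.string)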
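A runnable version of that check is sketched below. The helper name visible_text is hypothetical, and isinstance is used in place of the type comparison, which is an equivalent test here.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Two links: one whose content is a comment, one with ordinary text.
soup = BeautifulSoup(
    '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>',
    "html.parser",
)

def visible_text(tag):
    """Return tag.string unless it is a comment (hypothetical helper)."""
    value = tag.string
    if isinstance(value, Comment):
        return None
    return value

for link in soup.find_all("a"):
    print(link["id"], type(link.string).__name__, visible_text(link))
```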

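3. Selecting related nodes

(1) Direct children

Method 1: soup.head.title # access a child directly by tag name.

Method 2: Tag.contents returns all direct children as a list.

print(soup.head.contents)

print(soup.head.contents[0])

Method 3: Tag.children returns an iterator; loop over it to get every direct child.

for child in soup.body.children:

    print(child)

(2) All descendants

The .descendants attribute iterates recursively over all of a tag's descendants (it is a generator), while .children covers only the tag's direct children.

for child in soup.descendants:

    print(child)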
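The difference between .contents, .children, and .descendants is easiest to see on a tiny tree. A minimal sketch, using made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>one <b>bold</b></p><p>two</p></div>", "html.parser")
div = soup.div

# Direct children only: the two <p> tags.
child_names = [child.name for child in div.children]

# Every node below <div>, in document order, strings included.
all_nodes = list(div.descendants)
tag_names = [node.name for node in all_nodes if isinstance(node, Tag)]

print(div.contents)      # [<p>one <b>bold</b></p>, <p>two</p>]
print(child_names)       # ['p', 'p']
print(tag_names)         # ['p', 'b', 'p']
print(len(all_nodes))    # 6 (three tags and three strings)
```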

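(3) Text content of a node

Tag.string: returns the tag's text. If the tag has exactly one NavigableString child, that string is returned; if it has several children, the result is None.

.strings yields every string inside the tag; you have to iterate over it, as in the example further down.

.stripped_strings does the same but strips surrounding whitespace and skips strings that are only whitespace, such as blank lines, spaces, and line breaks.

Returning a single string

Note: if a tag contains no other tags, .string returns its text. If a tag contains exactly one child tag, .string returns the text of that innermost tag. If it contains several children, .string returns None.

print(soup.head.string)

# The Dormouse's story

print(soup.title.string)

# The Dormouse's story

When a tag contains several children, .string is None:

print(soup.html.string)

# None

Iterating over several strings

for string in soup.strings:

    print(repr(string))

for string in soup.stripped_strings:

    print(repr(string))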
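The three cases for .string, and the difference between .strings and .stripped_strings, can be checked with a small sketch. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p> first </p>\n<p><b>second</b></p></div>", "html.parser")
first_p, second_p = soup.find_all("p")

print(repr(first_p.string))    # ' first '  (only child is a string)
print(repr(second_p.string))   # 'second'   (only child is a tag, so its string is used)
print(soup.div.string)         # None       (several children)

raw = list(soup.div.strings)               # whitespace is kept
clean = list(soup.div.stripped_strings)    # whitespace-only strings are skipped

print(raw)     # [' first ', '\n', 'second']
print(clean)   # ['first', 'second']
```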

(4) Parent nodes

Tag.parent: returns the parent node.

An element's .parents attribute iterates upward through all of its ancestors.

Getting the parent

content = soup.head.title.string

print(content.parent.name) # title

Iterating over all ancestors

content = soup.head.title.string

for parent in content.parents:

    print(parent.name)
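As a quick check of what the ancestor chain looks like, the sketch below collects the names from .parents for the title string of a minimal, made-up document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)
content = soup.head.title.string

parent_name = content.parent.name
ancestor_names = [parent.name for parent in content.parents]

print(parent_name)      # title
print(ancestor_names)   # ['title', 'head', 'html', '[document]']
```

The chain ends at the BeautifulSoup object itself, whose name is [document].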

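(5) Previous and next siblings

Use .next_sibling and .previous_sibling to get the adjacent siblings (nextSibling and previousSibling are the older BeautifulSoup 3 spellings).

The .next_siblings and .previous_siblings attributes iterate over all of the current node's following or preceding siblings.

Note: in real documents, a tag's .next_sibling and .previous_sibling are often strings or whitespace, because whitespace and line breaks between tags count as nodes too.

print(soup.p.next_sibling) # whitespace in this document

print(soup.p.previous_sibling) # None when there is no previous sibling

print(soup.p.next_sibling.next_sibling)

.next is the legacy alias for .next_element: it applies to a single element and steps to whatever was parsed immediately after it. For example, if

soup.contents[1] is u'HTML'

soup.contents[2] is u'\n'

then soup.contents[1].next is soup.contents[2].

head = body.previous_sibling # head and body are on the same level; head is body's previous sibling

p1 = body.contents[0] # p1 and p2 are both children of body; contents[0] gives p1

p2 = p1.next_sibling # p2 is on the same level as p1 and is its next sibling; body.contents[1] returns it too

Iterating over all following siblings

for sibling in soup.a.next_siblings:

    print(repr(sibling))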
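The sketch below shows why .next_sibling often returns a string rather than a tag, and one way to keep only the tag siblings. The markup is a simplified assumption.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>',
    "html.parser",
)
first = soup.a

print(repr(first.next_sibling))    # ', '  (the text between the links is a node too)
print(first.previous_sibling)      # None  (link1 is the first child of <p>)

# Keep only the siblings that are tags.
later_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
print(later_ids)                   # ['link2', 'link3']
```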

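(6) Previous and next elements

The .next_element and .previous_element attributes are not limited to siblings: they move through all nodes in parse order, regardless of level.

For example, take the head node:

<head><title>The Dormouse's story</title></head>

Its next element is title, because levels in the tree are ignored.

print(soup.head.next_element)

# <title>The Dormouse's story</title>

The .next_elements and .previous_elements iterators move forward or backward through the parsed content, as if the document were being parsed again.

# last_a_tag is assumed to be a Tag found earlier, e.g. soup.find("a", id="link3")

for element in last_a_tag.next_elements:

    print(repr(element))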
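To contrast .next_element with .next_sibling, here is a minimal sketch on a made-up document. Tags are listed by name and strings by value.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>text</p></body></html>",
    "html.parser",
)
head = soup.head

print(head.next_element.name)   # title  (the first thing parsed after <head> opens)
print(head.next_sibling.name)   # body   (the next node on the same level)

# Walk everything parsed after <head>, in parse order.
walk = [e.name if isinstance(e, Tag) else str(e) for e in head.next_elements]
print(walk)   # ['title', "The Dormouse's story", 'body', 'p', 'text']
```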

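III. Searching with BeautifulSoup

1. The find_all function

Syntax: find_all(name, attrs, recursive, string, limit, **kwargs) (string was called text in older releases)

find_all() searches all descendants of the current tag, checks each against the filters, and returns the matches as a list.

(1) The name parameter

The name parameter finds every tag whose name matches; text strings are ignored.

tag.find_all('title') # every <title> tag under tag

tag.find_all('title', class_='sister') # every <title> tag whose class is sister

tag.find_all('title', 'sister') # same as above: a second positional string is treated as a CSS class

A. Passing a string

soup.find_all('b') # all b tags

print(soup.find_all('a')) # all a tags

B. Passing a regular expression

soup.find_all(re.compile("^b")) # all tags whose name starts with b

C. Passing a list

soup.find_all(["a", "b"]) # all <a> tags and all <b> tags in the document

D. Passing True

True matches anything. The code below finds every tag but returns no string nodes.

for tag in soup.find_all(True): print(tag.name)

E. Passing a function

The function below checks the current element and returns True if it has a class attribute but no id attribute:

def has_class_but_no_id(tag):

    return tag.has_attr('class') and not tag.has_attr('id')

Passing this function to find_all() returns all the <p> tags:

soup.find_all(has_class_but_no_id)

# [<p class="title"><b>The Dormouse's story</b></p>,

# <p class="story">Once upon a time there were...</p>,

# <p class="story">...</p>]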
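The five kinds of name filter can be compared side by side. This is a sketch on a small made-up document; only the tag names of the results are printed, to keep the output short.

```python
import re
from bs4 import BeautifulSoup

# Small document assumed for illustration.
html_doc = (
    '<body><p class="title"><b>T</b></p>'
    '<a class="sister" id="link1" href="#">Elsie</a>'
    '<p class="story">...</p></body>'
)
soup = BeautifulSoup(html_doc, "html.parser")

def has_class_but_no_id(tag):
    # True for tags that carry a class attribute but no id attribute.
    return tag.has_attr("class") and not tag.has_attr("id")

def names(results):
    return [tag.name for tag in results]

print(names(soup.find_all("b")))                  # ['b']
print(names(soup.find_all(re.compile("^b"))))     # ['body', 'b']
print(names(soup.find_all(["a", "b"])))           # ['b', 'a']
print(names(soup.find_all(True)))                 # ['body', 'p', 'b', 'a', 'p']
print(names(soup.find_all(has_class_but_no_id)))  # ['p', 'p']
```

Results always come back in document order, which is why the list filter returns b before a.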

(2) Keyword arguments

soup.find_all(id='link2') # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass an href argument, Beautiful Soup filters on each tag's href attribute:

soup.find_all(href=re.compile("elsie"))

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Several keyword arguments can be combined to filter on several attributes at once:

soup.find_all(href=re.compile("elsie"), id='link1')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

To filter by class, note that class is a reserved word in Python, so append an underscore:

soup.find_all("a", class_="sister")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')

data_soup.find_all(data-foo="value")

# SyntaxError: data-foo is not a valid keyword name

You can still search for such attributes by passing a dict as the attrs parameter of find_all():

data_soup.find_all(attrs={"data-foo": "value"})

# [<div data-foo="value">foo!</div>]
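The keyword filters above can be exercised in one short script. The markup is made up for illustration; the a tag carries two classes to show that class_ matches any one of them.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div data-foo="value">foo!</div>'
    '<a class="sister big" id="link1" href="http://example.com/elsie">Elsie</a>',
    "html.parser",
)

by_id = soup.find_all(id="link1")
by_two_attrs = soup.find_all(href=re.compile("elsie"), id="link1")
by_class = soup.find_all("a", class_="sister")            # matches one of several classes
by_data_attr = soup.find_all(attrs={"data-foo": "value"})  # names that are not valid keywords

print(by_id[0]["href"])          # http://example.com/elsie
print(len(by_two_attrs))         # 1
print(by_class[0]["class"])      # ['sister', 'big']
print(by_data_attr[0].string)    # foo!
```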

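(3) The text parameter: returns matching strings (newer bs4 releases name this argument string; text is the older name)

soup.find_all(text="Elsie") # ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"]) # ['Elsie', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse")) # ["The Dormouse's story", "The Dormouse's story"]

(4) The limit parameter

The limit parameter caps the number of results: the search stops as soon as limit results have been found.

In the call below, three tags in the document match, but only two are returned because of the limit.

soup.find_all("a", limit=2)

(5) The recursive parameter

find_all() examines all descendants of the current tag. To search only the tag's direct children, pass recursive=False.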
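A short sketch of limit and recursive on a made-up document: the title tag is a grandchild of html, so it is found only when the search is recursive.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>T</title></head>"
    "<body><a>1</a><a>2</a><a>3</a></body></html>",
    "html.parser",
)

first_two = soup.find_all("a", limit=2)                        # stop after two matches
deep = soup.html.find_all("title")                             # searches all descendants
shallow = soup.html.find_all("title", recursive=False)         # direct children only
direct_children = soup.html.find_all(True, recursive=False)    # head and body

print([a.string for a in first_two])          # ['1', '2']
print(len(deep), len(shallow))                # 1 0
print([t.name for t in direct_children])      # ['head', 'body']
```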

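2. The find function

Syntax: find(name=None, attrs={}, recursive=True, string=None, **kwargs) (string was called text in older releases)

find returns the first tag that matches the filters.

(1) Searching by tag

find('head') # search directly for the tag named head

find(['head', 'body']) # pass a list to search for several tag names at once

find({'head':True, 'body':True}) # dict form from the BeautifulSoup 3 era; the list form above is the documented bs4 way

find(re.compile('^p')) # tags whose name matches a regular expression, here names starting with p

find(lambda tag: len(tag.name) == 1) # tags for which the function returns True, here tags with a one-letter name

find(True) # matches any tag (find returns the first one)

(2) Searching by attributes

site.find('a', class_='district').get_text() # the first a tag with class='district'

find(id='xxx') # the first tag whose id attribute is xxx

soup.find(href=re.compile("elsie")) # the first tag whose href contains "elsie"

find(attrs={'id': re.compile('xxx'), 'align': 'xxx'}) # id matches the regular expression and align is xxx

find(attrs={'id': True, 'align': None}) # has an id attribute but no align attribute

(3) Searching by text

In BeautifulSoup 3, a text search made the other criteria (tag, attrs) ineffective. In bs4, combining a tag name with a text filter returns tags whose .string matches. The accepted values are the same as for tag searches.

site1.find(text=re.compile("天润")) # returns the matching string, e.g. 天润城第十二街区

(4) Nested searches

site1.find('div',class_='positionInfo').find('a',class_='bizcircle')

The recursive and limit parameters

recursive=False searches only the direct children; otherwise the whole subtree is searched. The default is True.

For find_all (older spelling: findAll) and other methods that return a list, limit caps the number of results,

e.g. find_all('p', limit=2) returns the first two tags found.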

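3. Other search methods

(1) find_parents() and find_parent()

find_all() and find() only search downward through the current node's children, grandchildren, and so on. find_parents() and find_parent() search upward through the current node's ancestors, using the same filters as an ordinary tag search.

(2) find_next_siblings() and find_next_sibling()

These two methods iterate over the siblings that come after the current tag, using .next_siblings. find_next_siblings() returns all following siblings that match; find_next_sibling() returns only the first one.

(3) find_previous_siblings() and find_previous_sibling()

These two methods iterate over the siblings that come before the current tag, using .previous_siblings. find_previous_siblings() returns all preceding siblings that match; find_previous_sibling() returns the first one.

(4) find_all_next() and find_next()

These two methods iterate over the tags and strings that come after the current tag, using .next_elements. find_all_next() returns all matching nodes; find_next() returns the first one.

(5) find_all_previous() and find_previous()

These two methods iterate over the tags and strings that come before the current node, using .previous_elements. find_all_previous() returns all matching nodes; find_previous() returns the first one.

Note: all of these methods take the same arguments as find_all() and work on the same principle, so they are not described further here.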
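A compact sketch of the directional search methods, on made-up markup with ids so that each result is easy to identify:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="box"><p id="p1"><a id="link1">Elsie</a><a id="link2">Lacie</a></p>'
    '<p id="p2">end</p></div>',
    "html.parser",
)
link1 = soup.find(id="link1")
link2 = soup.find(id="link2")

print(link2.find_parent("div")["id"])                    # box    (upward)
print(link1.find_next_sibling("a")["id"])                # link2  (same level, forward)
print(link2.find_previous_sibling("a")["id"])            # link1  (same level, backward)
print([p["id"] for p in link1.find_all_next("p")])       # ['p2'] (parse order, forward)
print(link2.find_previous("p")["id"])                    # p1     (parse order, backward)
```

Note that find_all_next("p") does not return p1: the enclosing tag was opened before link1, so it lies behind it in parse order.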

IV. CSS Selectors

1. Introduction to CSS selectors

When writing CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #.

We can filter elements the same way with soup.select(), which returns a list of tags.

Example document:

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"></a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

</body></html>

(1) Selecting by tag name

print(soup.select('title'))

# [<title>The Dormouse's story</title>]

print(soup.select('a'))

# [<a class="sister" href="http://example.com/elsie" id="link1"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('b'))

# [<b>The Dormouse's story</b>]

(2) Selecting by class name

print(soup.select('.sister'))

(3) Selecting by id

print(soup.select('#link1'))

# [<a class="sister" href="http://example.com/elsie" id="link1"></a>]

(4) Combined selectors

Combined selectors follow the same rules as in a CSS file, where tag names, class names, and ids are combined. For example, to find the element with id link1 inside a p tag, separate the two parts with a space:

print(soup.select('p #link1'))

# [<a class="sister" href="http://example.com/elsie" id="link1"></a>]

Selecting direct children

print(soup.select("head > title"))

# [<title>The Dormouse's story</title>]

(5) Selecting by attribute

A selector can also include an attribute test, written in square brackets. The attribute and the tag name belong to the same node, so there must be no space between them; with a space the selector will not match.

print(soup.select('a[class="sister"]'))

print(soup.select('a[href="http://example.com/elsie"]'))

# [<a class="sister" href="http://example.com/elsie" id="link1"></a>]

Attribute tests can be combined with the selectors above in the same way: parts that refer to different nodes are separated by a space, parts that refer to the same node are not.

print(soup.select('p a[href="http://example.com/elsie"]'))

# [<a class="sister" href="http://example.com/elsie" id="link1"></a>]

Every select call above returns a list, so you can loop over the result and call get_text() on each item to get its content.

soup = BeautifulSoup(html, 'lxml')

print(type(soup.select('title')))

print(soup.select('title')[0].get_text())

for title in soup.select('title'):

    print(title.get_text())

select() is thus another way to search that achieves much the same as find_all. For comparison:

print(soup.find_all("a", class_="sister"))

print(soup.select("p.title"))

# Search by attribute

print(soup.find_all("a", attrs={"class": "sister"}))

# Search by text

print(soup.find_all(text="Elsie"))

print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))

# Limit the number of results

print(soup.find_all("a", limit=2))

2. CSS selector examples

(1) Selecting tags

soup.select("title")

# [<title>The Dormouse's story</title>]

soup.select("p:nth-of-type(3)") # nth-of-type(3) selects the third p tag

# [<p class="story">...</p>]

(2) Selecting tags nested inside other tags, that is, by ancestry

soup.select("body a") # a tags anywhere inside body

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title") # title tags inside html -> head

# [<title>The Dormouse's story</title>]

(3) Selecting tags that are direct children of other tags

soup.select("head > title")

# [<title>The Dormouse's story</title>]

soup.select("p > a")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)") # nth-of-type(2) selects the second a tag

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")

# []

(4) Selecting the siblings of an element

soup.select("#link1 ~ .sister") # all later siblings of the element with id link1 that have class sister

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("#link1 + .sister") # only the sibling immediately after the element with id link1, if it has class sister

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

(5) Selecting tags by CSS class

soup.select(".sister") # all tags with class sister

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]") # same result as the previous call

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(6) Selecting tags by id:

soup.select("#link1") # the tag with this id

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2") # unlike the bare id above, this also requires the tag name, which makes the search more specific

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

(7) Passing a comma-separated group of selectors to select

A tag is returned if it matches any one selector in the group.

soup.select("#link1,#link2") # tags whose id is link1 or link2

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

(8) Selecting tags by whether an attribute exists

soup.select('a[href]') # a tags that have an href attribute

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(9) Selecting tags by the value of an attribute

soup.select('a[href="http://example.com/elsie"]')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')

# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some explanation:

soup.select('a[href^="http://example.com/"]') finds tags whose href value starts with "http://example.com/".

soup.select('a[href$="tillie"]') finds tags whose href value ends with tillie.

soup.select('a[href*=".com/el"]') finds tags whose href value contains the string ".com/el", so only href="http://example.com/elsie" matches.

(10) Getting only the first tag that matches

soup.select_one(".sister") # only the first tag that matches the selector

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
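The selectors in this section can be checked in one script. This is a sketch on a shortened sample document in which link1 contains plain text; the helper ids is hypothetical and returns just the id of each match.

```python
from bs4 import BeautifulSoup

# Shortened sample document (assumed for illustration).
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

def ids(selector):
    """Return the id of every tag matched by selector (hypothetical helper)."""
    return [tag.get("id") for tag in soup.select(selector)]

print(ids("p > a"))                    # ['link1', 'link2', 'link3']
print(ids("body > a"))                 # []  (the links are not direct children of body)
print(ids("p > a:nth-of-type(2)"))     # ['link2']
print(ids("#link1 ~ .sister"))         # ['link2', 'link3']
print(ids("#link1 + .sister"))         # ['link2']
print(ids("#link1,#link2"))            # ['link1', 'link2']
print(ids('a[href$="tillie"]'))        # ['link3']
print(ids('a[href*=".com/el"]'))       # ['link1']
print(soup.select_one(".sister")["id"])       # link1
print(soup.select_one("p.title").get_text())  # The Dormouse's story
```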

V. Case Study

1. Searching with find_all()

from bs4 import BeautifulSoup

# Create the BeautifulSoup object from a saved JD.com search-results page

with open(r'E:\python\python爬虫\BeautifulSoup案例\京东训练案例\v4ink - 商品搜索 - 京东.html', encoding="utf-8") as file:

    str1 = file.read()

soup = BeautifulSoup(str1, 'html.parser')

goods_info = soup.find_all("li", class_="gl-item")

for goods in goods_info:

    # Price

    prices = goods.find_all("div", class_="p-price")

    price = prices[0]

    print("Price:", price.i.text)

    # Title and link

    titles = goods.find_all("div", class_="p-name p-name-type-2")

    title = titles[0]

    print("Title:", title.a.em.text) # print the title

    print("Link:", title.a['href']) # print the link

    # Number of reviews

    reviews = goods.find_all("div", class_="p-commit")

    review = reviews[0]

    print("Reviews:", review.a.text)

    print("\n")

2. Locating with CSS selectors

# Extract the information (product name and price)

goods_info = soup.select(".gl-item")

for info in goods_info:

    title = info.select(".p-name.p-name-type-2 a")[0].text.strip()

    price = info.select(".p-price")[0].text.strip()

    print(str(title).replace('\n', ' ') + '|' + str(price).replace('\n', ' '))
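The case above depends on a locally saved page. As a self-contained sketch of the same extraction, the snippet below runs against hypothetical markup: the class names (gl-item, p-price, p-name p-name-type-2, p-commit) come from the case, but the nesting, product data, and URLs are made up, and a real page may be structured differently. It uses select_one and skips items that lack a required field instead of indexing with [0].

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class names from the case above, everything else invented.
html_doc = """
<ul>
<li class="gl-item">
  <div class="p-price"><strong><em>¥</em><i>59.00</i></strong></div>
  <div class="p-name p-name-type-2"><a href="//item.example.com/1.html"><em>Toner A</em></a></div>
  <div class="p-commit"><strong><a>200+</a></strong></div>
</li>
<li class="gl-item">
  <div class="p-price"><strong><em>¥</em><i>89.00</i></strong></div>
  <div class="p-name p-name-type-2"><a href="//item.example.com/2.html"><em>Toner B</em></a></div>
</li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

rows = []
for item in soup.select(".gl-item"):
    link = item.select_one(".p-name.p-name-type-2 a")
    price = item.select_one(".p-price i")
    reviews = item.select_one(".p-commit a")
    if link is None or price is None:
        continue  # skip items that lack a title link or a price
    rows.append({
        "title": link.get_text(strip=True),
        "href": link.get("href"),
        "price": price.get_text(strip=True),
        "reviews": reviews.get_text(strip=True) if reviews else None,
    })

for row in rows:
    print(row["title"], "|", row["price"], "|", row["reviews"])
```

Collecting dicts rather than printing inside the loop makes it straightforward to write the rows to CSV or JSON afterwards.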