Parsing and Extracting HTML/XML Data with BeautifulSoup4

1. Introduction to BeautifulSoup4

Like lxml, Beautiful Soup is an HTML/XML parser; its main job is likewise to parse and extract data from HTML/XML documents.

lxml traverses the document only locally, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and builds the entire DOM tree, so its time and memory overhead are much larger and its performance is lower than lxml's.

Beautiful Soup is easy to use for parsing HTML and has a very friendly API. It supports CSS selectors (http://www.w3school.com.cn/cssref/css_selectors.asp), the HTML parser in the Python standard library, and lxml's XML parser.

The table below gives a direct comparison:

Tool            Speed     Ease of use   Installation
Regex           Fastest   Hard          None (built-in)
BeautifulSoup   Slow      Easiest       Simple
lxml            Fast      Simple        Moderate

Installation is straightforward; just run the appropriate command:

    Python 3 on Linux: sudo pip3 install beautifulsoup4

    Python 2 with pip: sudo pip install beautifulsoup4

    Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

2. The Four Object Types

Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all objects fall into four types:

- Tag: simply put, one of the tags in the HTML
- NavigableString: the text inside a tag
- BeautifulSoup: an object representing the content of the whole document
- Comment: a special type of NavigableString whose output does not include the comment markers
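A minimal sketch showing all four types at once (the HTML string here is made up for illustration, and the built-in html.parser is used so the snippet runs even without lxml installed):

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

html = '<p class="title"><b>text</b><!-- a comment --></p>'
soup = BeautifulSoup(html, "html.parser")

print(type(soup))                    # <class 'bs4.BeautifulSoup'>
print(type(soup.p))                  # <class 'bs4.element.Tag'>
print(type(soup.p.b.string))         # <class 'bs4.element.NavigableString'>
comment = soup.p.b.next_sibling      # the node right after <b> is the comment
print(type(comment))                 # <class 'bs4.element.Comment'>
print(comment)                       # printed without the <!-- --> markers
```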


Since NavigableString is the one used most often in practice, only NavigableString is covered in detail here:

# Import BeautifulSoup4
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object
soup = BeautifulSoup(html, "lxml")  # omitting "lxml" often raises a warning or error
print(soup.p.string)        # The Dormouse's story
print(type(soup.p.string))  # <class 'bs4.element.NavigableString'>



3. Traversing the Document Tree

3.1. Direct children: the .contents and .children attributes

The .contents and .children attributes include only a tag's direct children.

3.1.1 .contents

A tag's .contents attribute returns the tag's children as a list:

print(soup.head.contents) #[<title>The Dormouse's story</title>]

Since the result is a list, an individual element can be accessed by index:

print(soup.head.contents[0])  # <title>The Dormouse's story</title>

# Create the Beautiful Soup object
sp = BeautifulSoup(html, "lxml")
for a in sp.find_all(name="a"):
    content = a.contents
    print(content)


3.1.2 .children

The .children attribute returns an iterator rather than a list, but all direct children can still be visited by looping over it.

Printing .children shows that it is a list iterator object:

print(soup.head.children)  # <list_iterator object at 0x7f71457f5710>
for child in soup.body.children:
    print(child)



# Create the Beautiful Soup object
sp = BeautifulSoup(html, "lxml")

for item in sp.find_all(name="p"):
    # Iterate over every p tag and print each one's direct children
    for ch in item.children:
        print(ch)



3.2. All descendants: the .descendants attribute

While .contents and .children include only direct children, the .descendants attribute recursively iterates over all of a tag's descendants. As with .children, it must be iterated to retrieve the contents.

for child in soup.descendants:
    print(child)

Output:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...


Example 2:

# Create the Beautiful Soup object
sp = BeautifulSoup(html, "lxml")
list_p = sp.find_all(name="p")
for p in list_p:
    print(p.name, ": all descendants ==============")
    for child in p.descendants:
        print(child)


3.3. Node content: the .string attribute

If a tag has only one NavigableString child, calling .string on the tag returns that child. If a tag has exactly one child tag, .string also works, and the result is the same as calling .string on that unique child.

In plain terms: if a tag contains no nested tags, .string returns the text inside it; if it contains exactly one nested tag, .string returns the text of the innermost one. For example:

print(soup.head.string)   # The Dormouse's story
print(soup.title.string)  # The Dormouse's story
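Conversely, if a tag contains more than one child, .string returns None. A small sketch of that edge case (the HTML string is made up for illustration, and html.parser is used so lxml is not required):

```python
from bs4 import BeautifulSoup

html = '<p>Hello <b>world</b></p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.p.string)      # None -- <p> has two children: "Hello " and <b>
print(soup.p.b.string)    # world -- <b> has a single NavigableString child
print(soup.p.get_text())  # Hello world -- joins all descendant strings
```

When a tag holds mixed content, .get_text() (or iterating .strings) is the usual way to pull out all the text at once.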

4. Searching the Document Tree: find_all

Signature: find_all(name, attrs, recursive, text, **kwargs)

4.1 The name parameter

The name parameter finds all tags with the given name; string objects (text nodes) are automatically ignored.

A. Passing a string

The simplest filter is a string. When a string is passed to a search method, Beautiful Soup matches tag names against it exactly. The following finds all <b> tags in the document:

print(soup.find_all('b'))
# [<b>The Dormouse's story</b>]
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

B. Passing a regular expression

If a regular expression is passed, Beautiful Soup matches tag names with the pattern's match() method. The following finds all tags whose names start with b, which means both <body> and <b> are found:

soup = BeautifulSoup(html, "lxml")
import re

# Match every tag whose name starts with b: the body tag and the b tag
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)  # body, b

C. Passing a list

If a list is passed, Beautiful Soup returns everything that matches any element of the list. The following finds all <a> and <b> tags in the document:

print(soup.find_all(["a", "b"]))
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

4.2 keyword arguments

Find the tag whose id is link2:

print(soup.find_all(id='link2'))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
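Any attribute can be searched this way, not just id. Because class is a reserved word in Python, Beautiful Soup accepts class_ instead; an attrs dict works as well. A sketch using a made-up snippet (html.parser so lxml is not required):

```python
from bs4 import BeautifulSoup

html = ('<a class="sister" id="link1">Elsie</a>'
        '<a class="sister" id="link2">Lacie</a>')
soup = BeautifulSoup(html, "html.parser")

# class_ avoids the clash with Python's class keyword
print(soup.find_all(class_="sister"))        # both <a> tags
# equivalently, pass an attrs dict
print(soup.find_all(attrs={"id": "link2"}))  # just the second tag
```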

4.3 The text parameter

The text parameter searches the document's string content. Like name, it accepts a string, a regular expression, or a list.

import re
print(soup.find_all(text="Elsie"))
# [] -- "Elsie" appears only inside an HTML comment, so no plain string matches

print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))
# ['Lacie', 'Tillie']

# Find strings containing "Dormouse"
print(soup.find_all(text=re.compile("Dormouse")))
# ["The Dormouse's story", "The Dormouse's story"]

4.4 href

Get all the links:

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the Beautiful Soup object
soup = BeautifulSoup(html, "lxml")

links = soup.find_all(href=re.compile(r'http://example.com/'))
# Print every matching link's href
for link in links:
    print(link["href"])
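The CSS-selector support mentioned at the start is exposed through the select() method, which complements find_all. A short sketch using a simplified version of the sample document (html.parser is used so lxml is not required):

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" id="link1"
   href="http://example.com/elsie">Elsie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("p.title"))    # tag + class selector
print(soup.select("#link1"))     # id selector
print(soup.select('a[href^="http://example.com"]'))  # attribute prefix selector
```

select() always returns a list of Tag objects, so the same .string / .get_text() accessors apply to each result.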



Reposted from blog.csdn.net/weixin_42255200/article/details/80946179