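A Summary of BeautifulSoup Usage

I. Introduction

BeautifulSoup is a Python library. It takes an HTML or XML string or file and returns a BeautifulSoup object, which then provides many methods for parsing and navigating the content.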

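II. Installation

1. Install with pip

pip install beautifulsoup4
# Install parsers for BeautifulSoup
pip install lxml
pip install html5lib

2. Install with apt-get

sudo apt-get install python3-bs4
# Install parsers for BeautifulSoup
sudo apt-get install python3-lxml
sudo apt-get install python3-html5lib

lxml is the recommended parser because it is faster than the alternatives (Python's built-in html.parser and html5lib).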
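
Since lxml is an optional dependency, a script may want to fall back to the standard-library parser when it is missing. The following is a minimal sketch of that idea, not part of the original write-up; the helper name make_soup is hypothetical, and it relies on BeautifulSoup raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper used only for illustration.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the built-in one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)
```

Different parsers can build slightly different trees from malformed markup, so it is usually safer to pick one parser and name it explicitly.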

III. Common methods

The examples below parse the following string:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

1. Wrap the string in a BeautifulSoup object

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
# Print the document with standard indentation
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

2. Use name to get a tag's name

print(soup.a)
print(soup.a.name)

Output:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
a

Note that accessing a tag as soup.[tag] returns only the first tag named tag. To get every match, or to filter by conditions, use the find_all() method.
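
To make the difference concrete, here is a small sketch (an illustration only, using a shortened two-link document and the built-in html.parser so it runs without lxml). It shows that soup.a matches what find("a") returns, that a missing tag yields None, and that find_all("a") returns every match.

```python
from bs4 import BeautifulSoup

# A shortened sample document, assumed for illustration.
html = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a            # attribute access: first <a> only
same = soup.find("a")     # find() also returns the first match
missing = soup.img        # no <img> in the document

all_ids = [a["id"] for a in soup.find_all("a")]

print(first["id"], same["id"])
print(missing)
print(all_ids)
```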

3. Use string to get a tag's content

A tag's text content is available through its string attribute.

print(soup.title.string)

Output:

The Dormouse's story

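Note that string only works when the tag contains a single child node. If the tag has several children, string returns None because it is ambiguous which child's content to return.

print(soup.body.string)

Output: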

None

Use strings instead to iterate over every text node:

strings = soup.body.strings
for string in strings:
    print(string)

Output:



The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...

The output contains many extra blank lines and spaces. Use stripped_strings to remove them:

strings = soup.body.stripped_strings
for string in strings:
    print(string)

Output:

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
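
When the goal is a single string rather than a sequence of fragments, get_text() is a related option. The sketch below is an addition for illustration (the one-paragraph sample document is assumed, and html.parser is used so it runs without lxml); it compares string, get_text(), and joining stripped_strings.

```python
from bs4 import BeautifulSoup

# A one-paragraph sample document, assumed for illustration.
html = "<p>Once upon a time: <a>Elsie</a>, <a>Lacie</a> and <a>Tillie</a>.</p>"
soup = BeautifulSoup(html, "html.parser")

# Several child nodes, so .string cannot pick one.
single = soup.p.string

# get_text() concatenates every text node under the tag.
text = soup.p.get_text()

# Joining stripped_strings gives whitespace-trimmed pieces;
# get_text(separator, strip=True) is the equivalent shortcut.
joined = " ".join(soup.p.stripped_strings)
shortcut = soup.p.get_text(" ", strip=True)

print(single)
print(text)
print(joined)
```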

4. Get a tag's attribute values

# Get the class attribute of the first <p> tag
soup.p["class"]

Output:

['title']

The result is a list because class can hold multiple values.

# Get the href attribute of the first <a> tag
soup.a["href"]

Output:

'http://example.com/elsie'
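
The following sketch, added for illustration, shows the list behaviour with a tag that really has two classes, and two related ways to read attributes: get(), which returns None for a missing attribute instead of raising KeyError, and attrs, which exposes all attributes as a dict. The sample tag is assumed, and html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# A sample tag with two class values, assumed for illustration.
html = '<p class="body strikeout" id="intro">text</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

classes = p["class"]      # multi-valued attribute: a list
ident = p["id"]           # single-valued attribute: a string
title = p.get("title")    # missing attribute: None, no KeyError
all_attrs = p.attrs       # every attribute as a dict

print(classes)
print(ident)
print(title)
print(all_attrs)
```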

5. Change a tag's attribute values

# Change the class attribute of the first <p> tag
soup.p["class"] = "new-class"
print(soup.p["class"])

# Change the href attribute of the first <a> tag
soup.a["href"] = "www.google.com"
print(soup.a["href"])

print(soup.prettify())

Output:

new-class
www.google.com
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="new-class">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="www.google.com" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
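
Because a tag behaves like a dictionary of attributes, attributes can also be added or deleted. The sketch below is an illustrative addition with an assumed single-link document, using html.parser so it runs without lxml. It assigns a list to class, adds a new attribute, and removes one with del.

```python
from bs4 import BeautifulSoup

# A single-link sample document, assumed for illustration.
html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

a["class"] = ["sister", "favorite"]  # a list is written out space-separated
a["target"] = "_blank"               # add a new attribute
del a["id"]                          # remove an existing attribute

print(a)
```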

6. The find_all method

6.1 Return all matching tags

# Return every <a> tag in the document; the result is a list-like ResultSet
links = soup.find_all("a")
print(links)

Output:

[<a class="sister" href="www.google.com" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
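
find_all() accepts more than a single tag name. As a small sketch (an addition for illustration, with an assumed sample document and html.parser so it runs without lxml), a list of names matches any of them, and limit caps the number of results.

```python
from bs4 import BeautifulSoup

# A short sample document, assumed for illustration.
html = "<p><b>bold</b> <a>one</a> <a>two</a> <a>three</a></p>"
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any of them, in document order.
names = [tag.name for tag in soup.find_all(["a", "b"])]

# limit stops the search after the given number of matches.
first_two = [tag.string for tag in soup.find_all("a", limit=2)]

print(names)
print(first_two)
```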

6.2 Return tags by attribute

# Return every <a> tag with the class sister; the result is a list
# class is a Python keyword, so use class_ instead
links = soup.find_all("a", class_="sister")
print(links)
print('-' * 20)
# Equivalent to the call above
links = soup.find_all("a", attrs={"class": "sister"})
print(links)
print('-' * 20)
# Return every <a> tag whose id is link2; the result is a list
links = soup.find_all("a", id="link2")
print(links)

Output:

[<a class="sister" href="www.google.com" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
--------------------
[<a class="sister" href="www.google.com" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
--------------------
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
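
The same attribute filters can be written as CSS selectors with select() and select_one(). This is an alternative not covered in the original text, sketched here on an assumed sample document; it assumes a Beautiful Soup 4 installation that includes the soupsieve package, which pip normally installs alongside it.

```python
from bs4 import BeautifulSoup

# A sample document, assumed for illustration.
html = (
    '<p class="story">'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a id="link3">Tillie</a>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

sisters = soup.select("a.sister")        # like find_all("a", class_="sister")
lacie = soup.select_one("#link2")        # first match for id link2, or None
children = soup.select("p.story > a")    # direct <a> children of p.story

print([a["id"] for a in sisters])
print(lacie.string)
print(len(children))
```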

6.3 Get the href attribute of every <a> tag

links = soup.find_all("a")
for a in links:
    print(a["href"])

Output:

www.google.com
http://example.com/lacie
http://example.com/tillie
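
The loop above raises KeyError if any <a> tag lacks an href. A slightly more defensive sketch, added for illustration, passes href=True so that only tags carrying the attribute are returned, and resolves relative links with urljoin from the standard library. The sample document and the base URL are assumptions.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# A sample document with one anchor that has no href, assumed for illustration.
html = (
    '<a href="/elsie">Elsie</a>'
    '<a name="anchor">no link</a>'
    '<a href="http://example.com/tillie">Tillie</a>'
)
base = "http://example.com/"  # hypothetical page URL
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute.
urls = [urljoin(base, a["href"]) for a in soup.find_all("a", href=True)]

print(urls)
```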

IV. References

1. https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html