Python Web Scraping: Beautiful Soup

If bs4 is not installed yet, install it with pip: pip install beautifulsoup4.

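The test.html file is shown below. Note that it is not well-formed HTML, because the body tag (and the html tag) is never closed; this comes up again later: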

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

Here is a simple example of using Beautiful Soup:

from bs4 import BeautifulSoup

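# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
# Pretty-print test.html with indentation; the unclosed body tag is closed automatically.
print(soup.prettify())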
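
To see the tag completion on its own, without needing test.html on disk, here is a minimal sketch that parses a string. The `broken_html` fragment is made up for illustration, and `html.parser` is assumed as the parser; other parsers may repair the markup differently.

```python
from bs4 import BeautifulSoup

# A made-up fragment: neither <body> nor <html> is closed.
broken_html = "<html><head><title>Demo</title></head><body><p>Hello</p>"

# 'html.parser' ships with Python, so no extra install is needed.
soup = BeautifulSoup(broken_html, "html.parser")

# Serializing the tree writes a closing tag for every tag that was opened.
repaired = str(soup)
print(repaired)
```

The serialized result ends with the closing tags that were missing from the input.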

# Print the type of the title tag, its name, and the whole title tag
print(type(soup.title))
print(soup.title.name)
print(soup.title)

# Output:
#<class 'bs4.element.Tag'>
#title
#<title>The Dormouse's story</title>

# Print only the text inside the title tag (its type first, then the text)
print(type(soup.title.string))
print(soup.title.string)

# Output:
#<class 'bs4.element.NavigableString'>
#The Dormouse's story

# Print the string inside the first a tag (its type first, then the string)
print(type(soup.a.string))
print(soup.a.string)

# Output:
#<class 'bs4.element.Comment'>   (the content of the first a tag is a comment)
# Elsie   (printing soup.a.string alone cannot tell a comment from real text; check its type as well)
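
One way to combine the string with its type, as suggested above, is an `isinstance` check against `Comment`. This is only a sketch: the helper name `visible_string` and the shortened HTML are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Shortened version of the links in test.html (illustrative).
html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

def visible_string(tag):
    """Return tag.string as plain text, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

results = [visible_string(a) for a in soup.find_all("a")]
print(results)
```

The first link contains only a comment, so it yields None; the second yields its real text.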

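# Print the name of each direct child of the body tag.
# Text nodes (such as the newlines between tags) have no name, so they print None.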
for item in soup.body.contents:
    print(item.name)
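
If only the child tags are wanted, the text nodes can be filtered out with an `isinstance` check against `Tag`. A minimal sketch, using a shortened body made up for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened body; the newlines between the tags become text nodes.
html = "<body>\n<p class='title'>A</p>\n<p class='story'>B</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# Keep only the children that are real tags, skipping whitespace strings.
tag_names = [item.name for item in soup.body.contents if isinstance(item, Tag)]
print(tag_names)
```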

# Query with CSS selectors
print(soup.select('.sister')) # the leading dot selects by class
print(soup.select('#link1'))  # the hash sign selects by id
print(soup.select('head > title')) # select by parent/child relationship
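
`select` always returns a list, even for an id selector that matches a single element. When only the first match is needed, `select_one` may be more convenient. The sketch below uses a shortened document and a non-existent id `link9`, both made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened version of test.html (illustrative).
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

matches = soup.select("a.sister")        # list of every matching tag
first = soup.select_one("a.sister")      # first matching tag only
missing = soup.select_one("#link9")      # None when nothing matches
child_links = soup.select("p.story > a") # a tags directly inside p.story

print(len(matches), first["id"], missing, len(child_links))
```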

# Find the a tags; the result is a list containing every a tag found
a_s = soup.select('a')
for a in a_s:
    print(a)

# Output:
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
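
When scraping, the link target and link text are usually more useful than the whole tag. A possible sketch, assuming the same kind of markup as test.html (shortened here to two links for illustration):

```python
from bs4 import BeautifulSoup

# Two of the links from test.html (shortened, illustrative).
html = (
    '<p class="story">'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and '
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.select("a"):
    href = a.get("href")            # returns None instead of raising if href is absent
    text = a.get_text(strip=True)   # text content with surrounding whitespace removed
    links.append((text, href))

print(links)
```

Using `a.get("href")` rather than `a["href"]` avoids a KeyError on a tags that have no href attribute.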
