Web Scraping (8): bs4, Part 1

1. Introduction to bs4

Beautiful Soup is a library for extracting data from HTML or XML documents.
It needs to be installed first. It is best to run pip install lxml before pip install bs4, otherwise you may run into errors.
With bs4 there is no syntax to memorize; you simply call its methods, which makes it more convenient than regular expressions or XPath.
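
A quick sanity check after installing (a minimal sketch; it only confirms that both packages import cleanly):

# minimal sanity check: both packages import without errors
import bs4
import lxml.etree
print('bs4', bs4.__version__)       # Beautiful Soup version string
print('lxml imported successfully')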

2. Getting Started with bs4

Let's use a snippet of an HTML document to demonstrate how to use bs4.

from bs4 import BeautifulSoup   # first import the BeautifulSoup class, the most commonly used class in bs4
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

To extract what we need from the document above with Beautiful Soup, we first have to parse it into a bs4 object.

from bs4 import BeautifulSoup   # first import the BeautifulSoup class, the most commonly used class in bs4
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, features='lxml')  # we pass in two things: the document above and features='lxml', the parser used to parse it
print(soup)  # printing it shows the BeautifulSoup object

Result:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>


If you want a more clearly structured printout, you can do this:

print(soup.prettify())

This gives a cleaner tree structure, which makes it easier to see how the tags relate to each other.

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

If we now want to print the title tag, we can do this:

print(soup.title)

We get:

<title>The Dormouse's story</title>

To get the tag name and the string inside the tag:

print(soup.title.name)
print(soup.title.string)

Result:

title
The Dormouse's story

If we want the p tag:

print(soup.p)

Notice that this only returns the first of the three p tags:

<p class="title"><b>The Dormouse's story</b></p>

To find all of them, use the find_all method:

res = soup.find_all('p')
print(res,len(res))

This returns a list whose elements are all of the p tags. The length is 3, so every p tag was found.

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>] 3

Notice that each a tag has an href attribute containing a URL. How do we get those? Like this:

links = soup.find_all('a')
for link in links:
    print(link.get('href'))

And we get:

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

Those are the basic bs4 operations; as you can see, they are simple and convenient.

3. Types of bs4 Objects

Tag: a tag
NavigableString: a navigable string
BeautifulSoup: the soup object
Comment: a comment

Let's get to know them through some code.

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
print(type(soup.title))
print(type(soup.p))
print(type(soup.a))

The result shows that all three of the above are Tag objects:

<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>

We can do the following with a Tag object:

print(soup.p.name)
print(soup.p.attrs)
print(soup.p.string)

Result:

p
{'class': ['title']}
The Dormouse's story

In particular, check the type of the string:

print(type(soup.p.string))

which gives:

<class 'bs4.element.NavigableString'>

This is the NavigableString type; it behaves like an ordinary string and supports the same operations, such as concatenation.
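
For example (a minimal sketch, reusing the soup built from html_doc above):

title_text = soup.p.string     # a NavigableString
print(title_text + '!')        # concatenates like a normal string
print(str(title_text))         # convert to a plain str when only the text is needed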

print(type(soup))

This shows that soup itself is a BeautifulSoup object:

<class 'bs4.BeautifulSoup'>

Now let's look at the Comment type, which is less commonly used. Let's write a quick comment:

html = '<a><!--新年快乐!!--></a>'
soup = BeautifulSoup(html,'lxml')
print(soup.a.string)

Print it first to see what we get:

新年快乐!!

The comment text was printed. Now check its type:

html = '<a><!--新年快乐!!--></a>'
soup = BeautifulSoup(html,'lxml')
print(type(soup.a.string))

It is a Comment object:

<class 'bs4.element.Comment'>

Good. Through these operations we have now met all four object types.

4. Traversing the Document Tree

Let's first get familiar with the commonly used parsers:

Python standard library: BeautifulSoup(markup, "html.parser"). Pros: built into Python, moderate speed, good tolerance of malformed documents. Cons: poor tolerance in Python versions before 2.7.3.
lxml HTML parser: BeautifulSoup(markup, "lxml"). Pros: fast, good tolerance of malformed documents. Cons: requires the C library to be installed.
lxml XML parser: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml"). Pros: fast, the only parser that supports XML. Cons: requires the C library to be installed.
html5lib: BeautifulSoup(markup, "html5lib"). Pros: the best error tolerance, no external C dependency, parses documents the way a browser does and produces HTML5-style output. Cons: slow.

The lxml parser is recommended because it is more efficient. Of course, you can switch parsers based on your specific needs.
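
For illustration, a small sketch that parses the same broken snippet with different parsers; only the parser name changes (the snippet is made up, and html5lib needs a separate pip install if you want to try it):

from bs4 import BeautifulSoup

snippet = '<p>hello<p>world'                    # deliberately unclosed tags
print(BeautifulSoup(snippet, 'html.parser'))    # Python's built-in parser
print(BeautifulSoup(snippet, 'lxml'))           # lxml HTML parser (recommended here)
# print(BeautifulSoup(snippet, 'html5lib'))     # html5lib, if installed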

4.1 contents, children, descendants

contents: returns a list of all child nodes
children: returns an iterator over the child nodes
descendants: returns a generator that walks through all descendants

Here is the code:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
print(soup.contents,type(soup.contents))
print('-*'*60)
print(soup.children,type(soup.children))
print('-*'*60)
print(soup.descendants,type(soup.descendants))
print('-*'*60)

The result shows that the first is a list, the second an iterator, and the third a generator:

[<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>] <class 'list'>
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
<list_iterator object at 0x000001C65451CA60> <class 'list_iterator'>
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
<generator object Tag.descendants at 0x000001C653EB1580> <class 'generator'>
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

Here is an example to make this more concrete:

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
html_Tag = soup.html
print(html_Tag.contents)

As the result shows, we get a list of all the child nodes, with a newline character included. We can then use ordinary list operations to pick out the elements, as the sketch after the output shows.

[<head><title>The Dormouse's story</title></head>, '\n', <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>]
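
Since contents is an ordinary list, normal indexing and filtering apply (a small sketch continuing from the code above; the filter simply drops the newline text nodes):

children = html_Tag.contents
print(len(children))                                          # direct children, newline included
print(children[0])                                            # the <head> tag
tags_only = [c for c in children if not isinstance(c, str)]   # NavigableString subclasses str
print(tags_only)                                              # just the <head> and <body> tags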

Since the second one is an iterator, we can loop over it to pull out the elements:

soup = BeautifulSoup(html_doc,'lxml')
html_Tag = soup.html
for i in html_Tag.children:
    print(i)
    print('*'*100)

The result shows two child tags; because there is a newline character between them, an empty slot gets printed as well:

<head><title>The Dormouse's story</title></head>
****************************************************************************************************

# a newline character was printed here
****************************************************************************************************
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
****************************************************************************************************

The third is a generator, which we can also loop over:

soup = BeautifulSoup(html_doc,'lxml')
html_Tag = soup.html
for i in html_Tag.descendants:
    print(i)
    print('*'*100)

The result looks like this: every child node, and each of their children in turn, gets visited, like peeling an onion:

<head><title>The Dormouse's story</title></head>
****************************************************************************************************
<title>The Dormouse's story</title>
****************************************************************************************************
The Dormouse's story
****************************************************************************************************


****************************************************************************************************
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
****************************************************************************************************


****************************************************************************************************
<p class="title"><b>The Dormouse's story</b></p>
****************************************************************************************************
<b>The Dormouse's story</b>
****************************************************************************************************
The Dormouse's story
****************************************************************************************************


****************************************************************************************************
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
****************************************************************************************************
Once upon a time there were three little sisters; and their names were

****************************************************************************************************
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
****************************************************************************************************
Elsie
****************************************************************************************************
,

****************************************************************************************************
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
****************************************************************************************************
Lacie
****************************************************************************************************
 and

****************************************************************************************************
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
****************************************************************************************************
Tillie
****************************************************************************************************
;
and they lived at the bottom of a well.
****************************************************************************************************


****************************************************************************************************
<p class="story">...</p>
****************************************************************************************************
...
****************************************************************************************************


****************************************************************************************************

The blank spots all come from newline characters.

4.2 string, strings, stripped_strings

string: gets the content inside a tag
strings: returns a generator, used to get the content of multiple tags
stripped_strings: same as strings, except that extra whitespace is stripped from the content

For example, to get the content inside the title tag:

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
title_Tag = soup.title
print(title_Tag.string)

The result is the content of the title tag:

The Dormouse's story

Next, let's look at this:

soup = BeautifulSoup(html_doc,'lxml')
html_Tag = soup.html
s = html_Tag.strings
for i in s:     # since s is a generator, we iterate over it to get its contents
    print(i)

Result:

The Dormouse's story




The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...


All of the string content is retrieved.
Now one more:

soup = BeautifulSoup(html_doc,'lxml')
# title_Tag = soup.title
# print(title_Tag.string)
html_Tag = soup.html
s = html_Tag.stripped_strings
for i in s:
    print(i)
    

This time the extra whitespace is gone:

The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...

This makes it easy to pull out clean, tidy content.
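
For example, a one-line sketch that joins everything into a single tidy string:

clean_text = ' '.join(soup.stripped_strings)   # stripped_strings works on the whole soup as well
print(clean_text)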

4.3 parent, parents

parent: gets the direct parent node
parents: gets all of the ancestor nodes

Example:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
title_Tag = soup.title
print(title_Tag.parent)

Result:

<head><title>The Dormouse's story</title></head>

Now for parents; since the result is a generator, we iterate:

for i in title_Tag.parents:
    print(i)

All of the ancestor nodes come out:

<head><title>The Dormouse's story</title></head>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

4.4 Sibling Nodes

next_sibling: the next sibling node
previous_sibling: the previous sibling node
next_siblings: all following sibling nodes
previous_siblings: all preceding sibling nodes

Example:

from bs4 import BeautifulSoup
html_tex = '<a><b>bbb</b><c>ccc</c><d>dddd</d></a>'
soup2 = BeautifulSoup(html_tex,'lxml')
b_Tag = soup2.b
d_Tag = soup2.d
print(b_Tag.next_sibling)
print(d_Tag.previous_sibling)
print(b_Tag.next_siblings)
print(d_Tag.previous_siblings)

Result; the last two are generators:

<c>ccc</c>
<c>ccc</c>
<generator object PageElement.next_siblings at 0x0000021679FA1510>
<generator object PageElement.previous_siblings at 0x0000021679FA1510>

Let's iterate over the last two:

for i in b_Tag.next_siblings:
    print(i)
print('-*'*20)
for j in d_Tag.previous_siblings:
    print(j)

The two results are separated by the divider line:

<c>ccc</c>
<d>dddd</d>
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
<c>ccc</c>
<b>bbb</b>

5. Key Topic: find and find_all

We can compile a regular expression with re.compile and pass it to find or find_all, so the search works like a regex filter. find returns only the first match, while find_all returns every match as a list.
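
Here is a minimal, self-contained sketch of that regex filter (the tiny demo snippet and the page_*.html links are made up purely for illustration):

import re
from bs4 import BeautifulSoup

demo = '<p><a href="page_1.html">one</a><a href="page_2.html">two</a><a href="about.html">about</a></p>'
demo_soup = BeautifulSoup(demo, 'lxml')

# href=re.compile(...) filters by attribute value instead of tag name
print(demo_soup.find('a', href=re.compile(r'^page_')))       # only the first match
print(demo_soup.find_all('a', href=re.compile(r'^page_')))   # all matches, returned as a list
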
The sample HTML document used for the rest of this section:

html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
    <tbody>
        <tr class="h">
            <td class="l" width="374">职位名称</td>
            <td>职位类别</td>
            <td>人数</td>
            <td>地点</td>
            <td>发布时间</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐业务运维工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高级研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高级图像算法研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高级AI开发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>4</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高级业务运维工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
    </tbody>
</table>
"""

5.1 Get all the tr tags

from bs4 import BeautifulSoup
# html = 'the sample document shown above, omitted here'
soup = BeautifulSoup(html,'lxml')
trs = soup.find_all('tr')
print(trs)

This prints all of the tr tags, returned as a list:

[<tr class="h">
<td class="l" width="374">职位名称</td>
<td>职位类别</td>
<td>人数</td>
<td>地点</td>
<td>发布时间</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=33824&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">22989-金融云区块链高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>, <tr class="odd">
<td class="l square"><a href="position_detail.php?id=29938&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">22989-金融云高级后台开发</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=31236&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>, <tr class="odd">
<td class="l square"><a href="position_detail.php?id=31235&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">SNG16-腾讯音乐业务运维工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=34531&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">TEG03-高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>, <tr class="odd">
<td class="l square"><a href="position_detail.php?id=34532&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">TEG03-高级图像算法研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=31648&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">TEG11-高级AI开发工程师(深圳)</a></td>
<td>技术类</td>
<td>4</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>, <tr class="odd">
<td class="l square"><a href="position_detail.php?id=32218&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=32217&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>, <tr class="odd">
<td class="l square"><a class="test" href="position_detail.php?id=34511&amp;keywords=python&amp;tid=87&amp;lid=2218" id="test" target="_blank">SNG11-高级业务运维工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>]

5.2 Get the second tr tag

We use ordinary list indexing to pick out the target:

tr_2 = soup.find_all('tr')[1]
print(tr_2)

This prints the second tr tag:

<tr class="even">
<td class="l square"><a href="position_detail.php?id=33824&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">22989-金融云区块链高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>

5.3 Get all tr tags with class="even"

The setup code is the same as before; to save space only the key lines are shown.

trs = soup.find_all('tr',class_="even")
print(trs)

This prints all of the tr tags with class="even":

[<tr class="even">
<td class="l square"><a href="position_detail.php?id=33824&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">22989-金融云区块链高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=31236&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=34531&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">TEG03-高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=31648&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">TEG11-高级AI开发工程师(深圳)</a></td>
<td>技术类</td>
<td>4</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>, <tr class="even">
<td class="l square"><a href="position_detail.php?id=32217&amp;keywords=python&amp;tid=87&amp;lid=2218" target="_blank">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>]

Note: class is a reserved keyword in Python, so we write class_ instead of class here.
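
As an aside (a small sketch, same soup as above), you can pass an attrs dictionary instead, which sidesteps the class_ spelling:

trs = soup.find_all('tr', attrs={'class': 'even'})   # same result as class_="even"
print(len(trs))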

5.4 Extract all a tags with id="test" and class="test"

lst = soup.find_all('a',class_="test",id="test")
print(lst)

This extracts exactly the target we wanted:

[<a class="test" href="position_detail.php?id=34511&amp;keywords=python&amp;tid=87&amp;lid=2218" id="test" target="_blank">SNG11-高级业务运维工程师(深圳)</a>]

5.5 Get the href attribute of all a tags

a_lst = soup.find_all('a')
for a in a_lst:
    href = a.get('href')
    print(href)

This extracts all of the href values:

position_detail.php?id=33824&keywords=python&tid=87&lid=2218
position_detail.php?id=29938&keywords=python&tid=87&lid=2218
position_detail.php?id=31236&keywords=python&tid=87&lid=2218
position_detail.php?id=31235&keywords=python&tid=87&lid=2218
position_detail.php?id=34531&keywords=python&tid=87&lid=2218
position_detail.php?id=34532&keywords=python&tid=87&lid=2218
position_detail.php?id=31648&keywords=python&tid=87&lid=2218
position_detail.php?id=32218&keywords=python&tid=87&lid=2218
position_detail.php?id=32217&keywords=python&tid=87&lid=2218
position_detail.php?id=34511&keywords=python&tid=87&lid=2218

This technique will be used later to collect URLs from web pages.
Besides the form above, it can also be written like this:

a_lst = soup.find_all('a')
for a in a_lst:
    href = a['href']
    print(href)

The result is exactly the same, so it is not listed again.
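
The two forms only differ when an attribute might be missing: get() returns None, while square brackets raise a KeyError. A minimal sketch (the data-x attribute is made up and does not exist in this document):

first_a = soup.find('a')
print(first_a.get('target'))    # existing attribute: both forms return '_blank'
print(first_a.get('data-x'))    # missing attribute: get() quietly returns None
# print(first_a['data-x'])      # missing attribute: indexing would raise KeyError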

5.6 Get the text of all the job listings

Looking at the document, every tr tag except the first one contains a job title. So we can write:

trs = soup.find_all('tr')[1:]
for tr in trs:
    position = tr.find_all('td')[0].string
    print(position)

As a result, we extract all of the job title text:

22989-金融云区块链高级研发工程师(深圳)
22989-金融云高级后台开发
SNG16-腾讯音乐运营开发工程师(深圳)
SNG16-腾讯音乐业务运维工程师(深圳)
TEG03-高级研发工程师(深圳)
TEG03-高级图像算法研发工程师(深圳)
TEG11-高级AI开发工程师(深圳)
15851-后台开发工程师
15851-后台开发工程师
SNG11-高级业务运维工程师(深圳)

Almost all of the later examples use the find() and find_all() methods.
Later we will cover the select() method, which requires a little CSS selector knowledge beforehand. Here is a link:
https://www.w3school.com.cn/cssref/css_selectors.asp
It is a good reference for the relevant background.
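
As a quick preview (a minimal sketch; select() is covered properly in a later post), the class query from section 5.3 can also be written as a CSS selector:

even_trs = soup.select('tr.even')   # CSS selector: tr tags whose class is "even"
print(len(even_trs))                # should match the find_all result above
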
That is everything for this section.

Source: blog.csdn.net/m0_46738467/article/details/112464099