Learning the Beautiful Soup Library

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/Yk_0311/article/details/82351259

Beautiful Soup is a Python library for parsing HTML and XML, and can be used to extract data from web pages.

Import

from bs4 import BeautifulSoup

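Parsers

Beautiful Soup does not parse markup by itself; it relies on an underlying parser.
Beautiful Soup supports the following parsers (the quoted string is the name passed as the second argument to BeautifulSoup):

'html.parser': Python's built-in HTML parser
'lxml': lxml's HTML parser (requires installing lxml)
'xml': lxml's XML parser (requires installing lxml)
'html5lib': the html5lib parser (requires installing html5lib)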
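Since the optional parsers may not be installed, here is a minimal sketch of picking the first one that is available. The helper name make_soup and the inline markup are made up for illustration; FeatureNotFound is the exception bs4 raises when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("none of the requested parsers is installed")


# Inline markup (an assumption, so the sketch needs no network access)
html = "<html><head><title>demo</title></head><body><p>hi</p></body></html>"
soup, used = make_soup(html)
print(used, soup.title.string)
```

Listing 'html.parser' last means the sketch still works on a plain Python installation.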

Basic elements of the BeautifulSoup class


Extracting information

1. Tag

import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' selects the parser
# print(soup.prettify())  # pretty-print the parsed document
print(soup.title)
tag = soup.a
print(tag)

# Output
<title>This is a python demo page</title>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Any tag that exists in the HTML can be accessed with soup.<tag>.
If the document contains several tags with the same name, soup.<tag> returns only the first one.
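A small sketch of the "first match only" behavior, using an inline snippet modeled on the demo page (the snippet itself is an assumption, used so that no network access is needed):

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a                   # only the first <a> in the document
all_links = soup.find_all("a")   # every <a>, in document order
missing = soup.table             # there is no <table>, so this is None

print(first["id"])
print([a["id"] for a in all_links])
print(missing)
```

Because a missing tag gives None instead of raising an error, it is worth checking the result before chaining further attribute access onto it.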

2. Tag attrs (attributes)

import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' selects the parser
# print(soup.prettify())  # pretty-print the parsed document
tag = soup.a
print(tag)
print(tag.attrs)  # all attributes of this tag, as a dict
print(tag.attrs['href'])  # the value of the href attribute

# Output
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
http://www.icourse163.org/course/BIT-268001
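Attributes can also be read directly from the tag. The sketch below uses an inline copy of the first link; the extra class name "external" is added only for illustration.

```python
from bs4 import BeautifulSoup

# "external" is a made-up second class, added to show multi-valued attributes
html = '<a class="py1 external" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

href = tag["href"]           # shorthand for tag.attrs["href"]; raises KeyError if absent
title = tag.get("title")     # .get() returns None when the attribute is absent
classes = tag["class"]       # class is multi-valued, so it comes back as a list
has_id = tag.has_attr("id")  # True when the attribute exists

print(href, title, classes, has_id)
```

Using .get() is the safer choice when a page may not carry the attribute on every tag.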

3. Tag's NavigableString (the non-attribute string inside a tag)

import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' selects the parser
# print(soup.prettify())  # pretty-print the parsed document
print(soup.a)
print(soup.a.string)
print(soup.p)
print(soup.p.string)
print(type(soup.a.string))

# Output
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
<p class="title"><b>The demo python introduces several python courses.</b></p>
The demo python introduces several python courses.
<class 'bs4.element.NavigableString'>
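One thing to watch out for: .string only works when a tag has a single child. A rough sketch with a made-up snippet shows the difference, and get_text() as the usual alternative:

```python
from bs4 import BeautifulSoup

# A made-up snippet: the <p> has two children (a string and a <b> tag)
html = "<p>Hello <b>world</b></p>"
soup = BeautifulSoup(html, "html.parser")

p_string = soup.p.string   # None, because <p> has more than one child
b_string = soup.b.string   # the single string inside <b>
p_text = soup.p.get_text() # all text under <p>, joined together

print(p_string, b_string, p_text)
```

This is also why soup.p.string worked on the demo page: that p contains only one b tag, so .string looks through it.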

4. Tag's Comment (the comment part of the string inside a tag)
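A minimal sketch of how a comment shows up, using a made-up snippet. Printing a Comment looks just like printing a normal string, so checking the type is the reliable way to tell them apart.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Made-up markup: one tag holds a comment, the other holds ordinary text
html = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(html, "html.parser")

b_string = soup.b.string
p_string = soup.p.string

print(b_string, type(b_string))
print(p_string, type(p_string))

b_is_comment = isinstance(b_string, Comment)  # True
p_is_comment = isinstance(p_string, Comment)  # False
```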

HTML traversal methods based on the bs4 library

1. Downward traversal

The downward traversal attributes are .contents, .children, and .descendants.

######contents
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' selects the parser
# print(soup.prettify())  # pretty-print the parsed document
print(soup.body)
print(soup.body.contents)

# Output
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body>
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']

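As you can see, the result is a list, and the body node contains both text and tag nodes.
Note that every element of the list is a direct child of body. The b node inside the first p node, for example, is a descendant rather than a direct child, so it does not appear in the list on its own.

######children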
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' selects the parser
# print(soup.prettify())  # pretty-print the parsed document
print(soup.body.children)
for i, item in enumerate(soup.body.children):  # .children is an iterator
    print(i, item)
# enumerate() pairs each item of an iterable (such as a list, tuple, or string) with its index, yielding (index, item) tuples

# Output
<list_iterator object at 0x0000011B94A0A0B8>
0 

1 <p class="title"><b>The demo python introduces several python courses.</b></p>
2 

3 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
4 
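To compare .contents and .children side by side, here is a small sketch on a made-up body with the same shape as the demo page (newline, p, newline, p, newline):

```python
from bs4 import BeautifulSoup

html = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

count = len(body.contents)        # .contents is a list, so len() and indexing work
second = body.contents[1]         # the first <p>; index 0 is the newline string
same = list(body.children) == body.contents  # .children walks the same nodes lazily

print(count, second, same)
```

The newline strings between tags count as children too, which is why the numbered output above has empty-looking entries.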

######descendants

Get all descendant nodes.

import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' selects the parser
# print(soup.prettify())  # pretty-print the parsed document
print(soup.body.descendants)
for i, item in enumerate(soup.body.descendants):  # .descendants is a generator
    print(i, item)

# Output
<generator object descendants at 0x000001B909020C50>
0 

1 <p class="title"><b>The demo python introduces several python courses.</b></p>
2 <b>The demo python introduces several python courses.</b>
3 The demo python introduces several python courses.
4 

5 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
6 Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

7 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
8 Basic Python
9  and 
10 <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
11 Advanced Python
12 .
13 

This lists every descendant node, text strings included.
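The difference between direct children and descendants is easier to see when only the tag names are kept. A sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup: <b> is nested inside the first <p>
html = "<body><p><b>bold</b> text</p><p>plain</p></body>"
soup = BeautifulSoup(html, "html.parser")

# Direct children of <body> that are tags
child_tags = [n.name for n in soup.body.children if isinstance(n, Tag)]
# Every tag below <body>, in document order
descendant_tags = [n.name for n in soup.body.descendants if isinstance(n, Tag)]

print(child_tags)
print(descendant_tags)
```

The nested b tag only shows up in the second list.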

2. Upward traversal

The upward traversal attributes are .parent and .parents.

######parent
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' selects the parser
print(soup.a.parent.prettify())
# this is the parent of the first a node

# Output
<p class="course">
 Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
  Basic Python
 </a>
 and
 <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
  Advanced Python
 </a>
 .
</p>

######parents
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # 'html.parser' selects the parser
for i, item in enumerate(soup.a.parents):
    print(i, item)

# Output
0 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
1 <body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body>
2 <html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
3 <html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
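Items 2 and 3 in the output above look identical because the last parent is the BeautifulSoup object itself, which prints the same as the html tag. Printing names instead makes this visible; the snippet below is a made-up minimal page.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upward from the first <a> and record each ancestor's name
names = [parent.name for parent in soup.a.parents]
top = soup.parent  # the BeautifulSoup object has no parent

print(names)
print(top)
```

The name "[document]" belongs to the BeautifulSoup object, and its own .parent is None, which is where the upward walk stops.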

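3. Sibling (parallel) traversal

The sibling traversal attributes are .next_sibling, .previous_sibling, .next_siblings, and .previous_siblings.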
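A rough sketch of sibling traversal on an inline copy of the second paragraph's links (the snippet is an assumption, shortened from the demo page). Siblings are nodes that share the same parent, and they can be strings as well as tags.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

nxt = first.next_sibling            # the string ' and ', not the second <a>
second = nxt.next_sibling           # the second <a>
prev = first.previous_sibling       # None, since the first <a> has nothing before it
kinds = [type(s).__name__ for s in first.next_siblings]  # all following siblings

print(repr(nxt), second["id"], prev, kinds)
```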

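Those are the HTML traversal methods provided by the bs4 library.
As the output shows, some of the items returned are not tags (for example, the newline strings between tags). How can they be skipped?

import bs4

# soup is the BeautifulSoup object built in the examples above
for tr in soup.body.children:
    if isinstance(tr, bs4.element.Tag):  # check whether the node is a tag
        print(tr.name)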

HTML content search methods based on the bs4 library

<>.find_all(name, attrs, recursive, string, **kwargs)

import requests
from bs4 import BeautifulSoup
import re

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
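# print(soup.prettify())
'''
<>.find_all(name, attrs, recursive, string, **kwargs)
name: the tag name (or list of names) to search for
attrs: the attribute values to search for; a specific attribute can be named
recursive: whether to search all descendants; defaults to True
string: the text to search for in the string area between <> and </>
'''
# name
# search for a tags
print(1, soup.find_all('a'))  # returns a list
# search for a and b tags
print(2, soup.find_all(['a', 'b']))  # matches either tag name
# name=True
print(3, soup.find_all(True))  # name=True matches every tag
for tag in soup.find_all(True):
    print(tag.name)

# attrs
# 4.1, 4.2 and 4.3 are equivalent
print(4.1, soup.find_all('p', attrs={'class': 'course'}))  # p tags whose class is course
print(4.2, soup.find_all('p', class_='course'))  # class is a Python keyword, hence the trailing underscore
print(4.3, soup.find_all('p', 'course'))
# 5.1 and 5.2 use equivalent syntax, but attribute values are matched exactly:
# id='link1' matches one tag, while id='link' matches none
print(5.1, soup.find_all(attrs={"id": 'link1'}))  # tags whose id is link1
print(5.2, soup.find_all(id='link'))
# message1 = soup.find_all("tr", attrs={"bgcolor": "#ffffff"})

# recursive
print(6, soup.find_all('a'))
print(7, soup.find_all('a', recursive=False))  # only direct children of soup are searched, so nothing matches

# string
print(8, soup.find_all(string='Basic Python'))
print(9, soup.find_all(string=re.compile("python")))  # regular expression (case-sensitive); returns the strings that contain "python"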

# Output
1 [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
2 [<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
3 [<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>, <head><title>This is a python demo page</title></head>, <title>This is a python demo page</title>, <body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body>, <p class="title"><b>The demo python introduces several python courses.</b></p>, <b>The demo python introduces several python courses.</b>, <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
html
head
title
body
p
b
p
a
a
4.1 [<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
4.2 [<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
4.3 [<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
5.1 [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
5.2 []
6 [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
7 []
8 ['Basic Python']
9 ['This is a python demo page', 'The demo python introduces several python courses.']
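Result 5.2 is empty because attribute values are matched exactly. A sketch of partial matching with a regular expression, plus the limit argument, on an inline copy of the two links (the snippet is an assumption so the code runs offline):

```python
import re

from bs4 import BeautifulSoup

html = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(id="link")                # no tag has id exactly equal to "link"
partial = soup.find_all(id=re.compile("link"))  # ids that contain "link"
limited = soup.find_all("a", limit=1)           # stop after the first match

print(exact)
print([t["id"] for t in partial])
print([t["id"] for t in limited])
```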

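Other functions

Besides find_all(), bs4 provides related search methods: find(), find_parent(), find_parents(), find_next_sibling(), find_next_siblings(), find_previous_sibling(), and find_previous_siblings().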
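A short sketch of a few of these related methods, again on a made-up snippet modeled on the demo page. They take the same kind of arguments as find_all(), but search in a different direction or return a single result.

```python
from bs4 import BeautifulSoup

html = '<body><p class="course"><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p></body>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                     # first match only, not a list
nothing = soup.find("table")               # None when nothing matches
parent_p = first.find_parent("p")          # nearest ancestor named p
next_a = first.find_next_sibling("a")      # next sibling tag named a, skipping the string

print(first["id"], nothing, parent_p["class"], next_a["id"])
```

find() returning None (instead of an empty list) is the main difference to keep in mind when switching from find_all().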
