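0. Three forms of information markup
Marked-up information can be used for communication, storage, or display. The structure of the markup is as valuable as the information itself.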
(1)XML(eXtensible Markup Language)
(2)JSON(JavaScript Object Notation)
JSON can be used directly as part of a JavaScript program.
Typed key-value pairs, key:value. For example, "name":["浙江大学","北京大学"] maps one key to several values.
(3)YAML(YAML Ain't Markup Language)
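Untyped key-value pairs, key:value
A leading minus sign (-) marks parallel items, # starts a comment, and values are untyped (written without double quotes).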
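As a rough illustration of the difference between these forms, the sketch below parses the same multi-valued "name" field from JSON and from XML using only the Python standard library. The XML layout (a `university` element with repeated `name` children) and the `count`/`code` keys are assumptions made up for this example. YAML is left out because parsing it requires a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# The same information in two markup forms (the XML layout is hypothetical).
json_text = '{"name": ["浙江大学", "北京大学"]}'
xml_text = "<university><name>浙江大学</name><name>北京大学</name></university>"

# JSON: one key maps to a list of values.
json_names = json.loads(json_text)["name"]

# XML: repeated child elements carry the multiple values.
root = ET.fromstring(xml_text)
xml_names = [element.text for element in root.findall("name")]

# JSON values are typed: 1 and "1" are parsed as different Python types.
typed = json.loads('{"count": 1, "code": "1"}')

print(json_names)
print(xml_names)
print(type(typed["count"]).__name__, type(typed["code"]).__name__)
```

Both parsers recover the same two values; only the JSON parser also reports a type for each value.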
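1. Comparison of the three forms of information markup
2. General methods of information extraction:
(1) Method 1: fully parse the markup, then extract the key information. This requires a markup parser, e.g. traversing the tag tree with the bs4 library.
Advantage: the information is parsed accurately
Disadvantage: the extraction process is cumbersome and slow
(2) Method 2: ignore the markup and search for the key information directly, by calling a text search function on the text of the information.
Advantage: the extraction process is simple and fast
Disadvantage: the accuracy of the result depends on the content of the information
Combined method: combine markup parsing with text search to extract the key information. This requires both a markup parser and a text search function.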
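The trade-off between the two methods can be sketched as follows. To stay self-contained, this example swaps in the standard library's `html.parser.HTMLParser` as the markup parser instead of bs4, and uses a regular expression as the text search. The HTML fragment, the `LinkCollector` class, and the example.com URLs are all invented for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment: one real link, plus text that merely looks like one.
html = (
    '<p class="course">See <a href="http://example.com/a" id="link1">A</a> '
    'and the literal text href="http://example.com/not-a-link" in prose.</p>'
)


class LinkCollector(HTMLParser):
    """Method 1: parse the markup and read href only from real <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(html)
parsed_links = collector.links

# Method 2: ignore the markup and search the raw text.
searched_links = re.findall(r'href="([^"]*)"', html)

print(parsed_links)
print(searched_links)
```

The parser returns only the link that belongs to an `<a>` tag, while the plain text search also picks up the look-alike string in the paragraph text. This is the sense in which the accuracy of Method 2 depends on the content.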
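The following extracts all URL links from an HTML page.
Approach: 1) find all <a> tags
2) parse each <a> tag and extract the link given by its href attribute
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> for link in soup.find_all('a'):
        print(link.get('href'))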
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
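The session above relies on a `demo` variable holding the page's HTML, which is not defined in this section. A minimal self-contained sketch, assuming `demo` is simply the page text printed later in these notes and that the bs4 package is installed, might look like this. The `links` list is a name introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: an inlined copy of the demo page, instead of a downloaded one.
demo = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(demo, "html.parser")

# Step 1: find every <a> tag. Step 2: read its href attribute.
# Tags without an href are skipped, since get() returns None for them.
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]

for url in links:
    print(url)
```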
3. HTML content search methods based on the bs4 library
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):
        print(tag.name)
html
head
title
body
p
b
p
a
a
>>> import re
>>> for tag in soup.find_all(re.compile('b')):
        print(tag.name)
body
b
>>> soup.find_all('p','courser')
[]
>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')
[]
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]
>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string="Basic Python")
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile("Python"))
['Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n', 'Basic Python', 'Advanced Python']
Short forms:
Extended methods: