Web Crawler (4): Information Extraction

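0. Three forms of information markup

Marked-up information can be used for communication, storage, or display. The structure of the markup is as valuable as the information itself.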

(1) XML (eXtensible Markup Language)

(2) JSON (JavaScript Object Notation)

JSON can be used directly as part of a JavaScript program.

Typed key-value pairs, key:value. For example, "name":["浙江大学","北京大学"] maps one key to several values.
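To make "typed" concrete, here is a minimal sketch using Python's standard json module. The "name" entry is the example above; the "count" key is a hypothetical addition, included only to show a numeric type next to the string list.

```python
import json

# "name" comes from the example above; "count" is a hypothetical extra key.
text = '{"name": ["浙江大学", "北京大学"], "count": 2}'

data = json.loads(text)

# Each value keeps its type after parsing: a list of strings and an integer.
print(type(data["name"]).__name__)
print(type(data["count"]).__name__)

# Serializing again keeps the structure; ensure_ascii=False leaves the
# Chinese characters readable instead of escaping them.
round_trip = json.dumps(data, ensure_ascii=False)
print(round_trip)
```

Because the types are part of the notation, a program can use the parsed values without any further conversion.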

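(3) YAML (YAML Ain't Markup Language)

Untyped key-value pairs, key:value

A leading hyphen (-) marks each item in a group of parallel items, # starts a comment, and values are untyped (written without double quotes).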
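The syntax rules above can be illustrated with a small hand-written formatter. This is only a sketch for a flat dictionary, not a YAML library. The function name to_yaml_like and the "country" key are hypothetical, while the "name" list reuses the JSON example.

```python
def to_yaml_like(record):
    """Format a flat dict as YAML-style text (illustration only)."""
    lines = ["# university record"]          # '#' starts a comment
    for key, value in record.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            for item in value:
                lines.append(f"- {item}")    # '-' marks each parallel item
        else:
            lines.append(f"{key}: {value}")  # plain value, no double quotes
    return "\n".join(lines)


# "country" is a hypothetical key added to show a single plain value.
record = {"name": ["浙江大学", "北京大学"], "country": "China"}
yaml_text = to_yaml_like(record)
print(yaml_text)
```

For real data, a YAML library is the better choice, since it handles nesting, quoting, and escaping.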

1. Comparison of the three markup forms

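2. General methods for extracting marked-up information:

(1) Method 1: fully parse the markup, then extract the key information. This requires a markup parser, e.g. traversing the tag tree with the bs4 library.

Advantage: the information is parsed accurately

Disadvantage: the extraction process is tedious and slow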
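A minimal sketch of Method 1 follows. To keep it free of third-party packages it uses the standard library's html.parser rather than bs4, so it is event-based instead of a tag-tree traversal. The class name LinkCollector is hypothetical, and the HTML fragment is adapted from the demo page used later in this post.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Parse the markup and record the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


html = (
    '<p class="course">'
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.'
    '</p>'
)

parser = LinkCollector()
parser.feed(html)
links = parser.links
print(links)
```

The parser only reports an href when it really belongs to an <a> tag, which is where the accuracy comes from, at the cost of processing the entire document.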

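(2) Method 2: ignore the markup and search for the key information directly, by calling a search function on the text.

Advantage: the extraction process is simple and relatively fast

Disadvantage: the accuracy of the result depends on the content of the information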
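A minimal sketch of Method 2, using re.findall as the search function. The URL pattern is a deliberately simple assumption, and the example.com sentence is hypothetical text added to show how the result depends on the content.

```python
import re

html = (
    '<p class="course">'
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.'
    '</p>'
)

# Assumed pattern: a URL runs until whitespace, a quote, or an angle bracket.
url_pattern = re.compile(r'https?://[^\s"\'<>]+')

urls = url_pattern.findall(html)
print(urls)

# The search knows nothing about tags, so a URL that appears in ordinary
# text (not in an href) is returned as well.
noisy_html = html + '<p>Docs moved from http://example.com/old</p>'
noisy_urls = url_pattern.findall(noisy_html)
print(noisy_urls)
```

The second result contains three URLs although the page has only two links, which is the accuracy trade-off described above.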

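Combined method: combine markup parsing with text search to extract the key information. This requires both a markup parser and a text search function.

The following example extracts all URL links from an HTML page.

Approach: 1) Find all <a> tags

           2) Parse each <a> tag and extract the link given in its href attribute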

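>>> from bs4 import BeautifulSoup
>>> # demo is assumed to hold the HTML source of the demo page (its parsed form is printed by ">>> soup" in section 3)
>>> soup = BeautifulSoup(demo, "html.parser")
>>> for link in soup.find_all('a'):
	print(link.get('href'))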

http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001

3. Searching HTML content with the bs4 library

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):
	print(tag.name)

html
head
title
body
p
b
p
a
a
>>> import re
>>> for tag in soup.find_all(re.compile('b')):
	print(tag.name)

body
b
>>> soup.find_all('p','courser')
[]
>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')
[]
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]
>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string="Basic Python")
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile("Python"))
['Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n', 'Basic Python', 'Advanced Python']

Shorthand forms:
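In bs4, calling a BeautifulSoup object or a Tag like a function is equivalent to calling its find_all method. The sketch below assumes this is the shorthand meant here and uses a fragment of the demo page.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="course">'
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# soup(...) is shorthand for soup.find_all(...)
long_form = soup.find_all('a')
short_form = soup('a')
print(short_form == long_form)

# The same shorthand works on any tag: <tag>(...) is <tag>.find_all(...)
in_paragraph = soup.p('a')
print([a['id'] for a in in_paragraph])
```

The shorthand saves typing in an interactive session, while find_all is easier to read in longer scripts.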

Extension methods:
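bs4 provides a family of methods that take the same arguments as find_all but search in a different direction or return a single result: find, find_parent and find_parents, find_next_sibling and find_next_siblings, find_previous_sibling and find_previous_siblings. The sketch below assumes these are the extensions meant here and tries several of them on a fragment of the demo page.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="course">'
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# find: like find_all, but returns only the first match (or None).
first = soup.find('a')
second = soup.find(id='link2')

# find_parent / find_parents: search the ancestors of a tag.
parent = first.find_parent('p')
parents = first.find_parents('p')

# find_next_sibling(s) / find_previous_sibling(s): search among the
# nodes that share the same parent, after or before the current tag.
next_link = first.find_next_sibling('a')
later_links = first.find_next_siblings('a')
previous_link = second.find_previous_sibling('a')

print(first['id'], parent['class'], next_link['id'], previous_link['id'])
```

The singular forms return one tag or None, and the plural forms return a list, mirroring the relationship between find and find_all.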

4. Unit summary


Reposted from blog.csdn.net/qq_35812205/article/details/104270202