Web Scraping Basics, Part 2

Parsing Web Pages

1. The BeautifulSoup parsing library

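1. BeautifulSoup is a class that turns raw, hard-to-read HTML into a readable form, much like the page source shown in a browser. It also lets us locate tags and walk the whole HTML tree, so we can reach the information held in every node.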

2. Installation: at the command prompt, run pip install beautifulsoup4

3. Test

import requests
from bs4 import BeautifulSoup

url = 'http://python123.io/ws/demo.html'
r = requests.get(url)
print("Raw HTML as fetched\n", r.text)
demo = r.text
soup = BeautifulSoup(demo, "html.parser")  # turn demo into a "soup" that BeautifulSoup can work with; the second argument selects the standard-library HTML parser
print("After parsing\n", soup.prettify())  # prettify() re-indents the markup with one tag per line; commonly used

4. BeautifulSoup converts every incoming document to Unicode and, by default, writes its output as UTF-8.
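A minimal sketch of this behavior, assuming a GBK-encoded byte string as input. The sample markup and the explicit from_encoding hint are illustrative and do not come from the demo page. The parsed text comes back as a Python str, and encode() emits UTF-8 bytes by default.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a small fragment stored as GBK bytes.
raw = "<p>中文</p>".encode("gbk")

# from_encoding tells BeautifulSoup how to decode the bytes;
# without it, the library tries to detect the encoding itself.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string        # decoded to Unicode (a str subclass)
utf8_bytes = soup.encode()  # serialized back out, UTF-8 by default

print(type(text).__name__, text)
print(utf8_bytes.decode("utf-8"))
```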

Basic elements of the bs4 library: Tag, Name, Attributes, NavigableString, Comment
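The five elements can be inspected on a small fragment. This is a sketch on an inline string (the fragment and its comment text are made up for illustration), so it runs without a network request.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Illustrative fragment, loosely modeled on the demo page.
html = '<p class="title"><b>The demo python</b><!-- a comment --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                # Tag: the first <p> element
name = tag.name             # Name: the tag's name
attrs = tag.attrs           # Attributes: a dict; class is multi-valued, so it is a list
string = tag.b.string       # NavigableString: the text inside <b>
comment = tag.contents[1]   # Comment: a special kind of NavigableString

print(isinstance(tag, Tag), name, attrs)
print(type(string).__name__, string)
print(type(comment).__name__, comment)
```

Because Comment is a subclass of NavigableString, checking the type is the reliable way to tell a comment apart from ordinary text.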

Traversal in the bs4 library:

Downward: .contents  .children  .descendants

Upward: .parent  .parents

Sideways (siblings): .next_sibling  .previous_sibling  .next_siblings  .previous_siblings
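The sibling attributes are easiest to understand on a concrete fragment. The sketch below uses an inline string (a shortened, made-up version of the demo page's second paragraph) and also contrasts .children with .descendants.

```python
from bs4 import BeautifulSoup

# Illustrative fragment: one <p> holding text and two <a> tags.
html = ('<p class="course">Learn <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
first = soup.a  # the first <a> tag

# Downward: children are one level deep, descendants cover every level.
children = list(soup.p.children)
descendants = list(soup.p.descendants)

# Sideways: siblings share the same parent and may be plain strings.
before = first.previous_sibling
after = first.next_sibling
second = after.next_sibling
rest = list(first.next_siblings)

print(len(children), len(descendants))
print(repr(before), repr(after), second["id"])
```

Note that a sibling is not necessarily a tag: here the text " and " sits between the two links, so it is the first link's next sibling.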

5. A look at a typical HTML layout:

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

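Now look at its abstract model, which is a tree: html is the root with two children, head and body; head contains title; body contains two p tags, the first holding a b tag and the second holding two a tags.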


6. Organizing information

Code for the traversal methods:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> soup = BeautifulSoup(r.text,"html.parser")
>>> soup.head # the head tag of soup
<head><title>This is a python demo page</title></head>
>>> soup.head.contents # all direct children of the head tag
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
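>>> # The list above holds all direct children of body: tags and the strings that sit alongside them. It contains children only, no grandchildren.
>>> # .parents and .descendants, by contrast, span every level: all ancestors and all nested nodes, respectively.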
>>> for parent in soup.a.parents:
	if parent is None:
		print(parent)
	else:
		print(parent.name)

		
p
body
html
[document]
Three ways to mark up information (marked-up information is easier to understand)

XML (HTML), JSON, YAML

HTML, for example, is itself a form of marked-up information.
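To make the comparison concrete, here is a sketch that writes one record in all three notations. The record (a course name and id taken from the demo page) and its field names are assumptions chosen for illustration. JSON and XML are parsed with the standard library; the YAML form is shown as text only, since parsing it would need a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in three markup styles.
as_json = '{"course": {"name": "Basic Python", "id": "link1"}}'
as_xml = '<course id="link1"><name>Basic Python</name></course>'
as_yaml = "course:\n  name: Basic Python\n  id: link1\n"  # text only, not parsed here

# JSON uses typed key/value pairs; XML uses nested tags and attributes.
name_from_json = json.loads(as_json)["course"]["name"]
root = ET.fromstring(as_xml)
name_from_xml = root.find("name").text
id_from_xml = root.get("id")

print(name_from_json, name_from_xml, id_from_xml)
print(as_yaml)
```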


Extracting information:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> soup = BeautifulSoup(r.text,"html.parser")
>>> # Extract information: print the href of every <a> tag
>>> for link in soup.find_all('a'):
	print(link.get('href'))
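The same extraction can be tried offline, along with a few other common filters. This is a sketch on an inline copy of the demo page's second paragraph; filtering by id and by class_ goes beyond what the original example shows.

```python
from bs4 import BeautifulSoup

# Inline copy of the demo page's course paragraph, so no request is needed.
html = (
    '<p class="course">Python courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# All links: find_all returns a list of matching tags.
hrefs = [a.get("href") for a in soup.find_all("a")]

# Filter by attribute: id is passed as a keyword, class needs the class_ spelling.
by_id = soup.find_all(id="link1")
by_class = soup.find_all("a", class_="py2")

# find() returns only the first match, or None when nothing matches.
first = soup.find("a")
missing = soup.find("table")

print(hrefs)
print(by_id[0].string, by_class[0].string, first["id"], missing)
```

Using link.get('href') rather than link['href'] is a safe choice here, because get() returns None for a tag that has no href instead of raising KeyError.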



Reposted from blog.csdn.net/tommy1295/article/details/80687435