Parsing Web Pages
1. The BeautifulSoup parsing library
1. BeautifulSoup is a class that turns hard-to-read HTML into a friendly form, much like the source view in a browser. Beyond that, it lets us find tags and walk the whole HTML tree structure, so we can reach the information in every node.
2. Installation, at the command prompt: pip install beautifulsoup4
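To confirm the install worked, you can import the package and print its version. Note the package installs as beautifulsoup4 but imports as bs4; the exact version number will depend on your install.

```python
# The pip package is "beautifulsoup4", but the importable module is "bs4"
import bs4

print(bs4.__version__)   # version string, e.g. "4.12.3" (varies by install)
print(bs4.BeautifulSoup) # the main class we will use
```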
3. A quick test:
import requests
from bs4 import BeautifulSoup

url = 'http://python123.io/ws/demo.html'
r = requests.get(url)
print("Raw HTML as fetched\n", r.text)
demo = r.text
soup = BeautifulSoup(demo, "html.parser")  # turn demo into a "soup" that BeautifulSoup understands; the second argument names the standard parser
print("After parsing\n", soup.prettify())  # prettify() adds line breaks and per-tag indentation; very commonly used
4. BeautifulSoup converts any input to Unicode internally and encodes its output as UTF-8:
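A small sketch of this behavior: BeautifulSoup accepts raw bytes in any encoding and hands back ordinary Unicode strings. Here the input is deliberately GBK-encoded Chinese, with the encoding passed explicitly via the from_encoding parameter (this example assumes beautifulsoup4 is installed).

```python
from bs4 import BeautifulSoup

# Raw bytes in GBK, an encoding common on Chinese-language sites
raw = "<html><body><p>你好</p></body></html>".encode("gbk")

# from_encoding tells bs4 how to decode the bytes; the parsed text
# then comes back as an ordinary Unicode str, whatever the input was
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")
print(soup.p.string)           # 你好
print(soup.original_encoding)  # the encoding bs4 used to decode the input
```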
Basic elements of the bs4 library: Tag, Name, Attributes, NavigableString, Comment
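The five basic elements can be shown in one tiny document (the markup below is a made-up snippet for illustration, assuming beautifulsoup4 is installed):

```python
from bs4 import BeautifulSoup, Comment

# A minimal document containing all five basic bs4 elements
demo = '<p class="title"><b>Hello</b><!-- a comment --></p>'
soup = BeautifulSoup(demo, "html.parser")

tag = soup.p                   # Tag: a pair of angle-bracket labels and their contents
print(tag.name)                # Name: the tag's name -> p
print(tag.attrs)               # Attributes: a dict -> {'class': ['title']}
print(tag.b.string)            # NavigableString: the text inside a tag -> Hello
comment = tag.contents[1]      # Comment: a special kind of NavigableString
print(isinstance(comment, Comment), comment)
```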
Traversal attributes of the bs4 library:
Downward: .contents .children .descendants
Upward: .parent .parents
Sideways (siblings): .next_sibling .previous_sibling .next_siblings .previous_siblings
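Sibling traversal is easy to get wrong because the strings between tags count as siblings too. A sketch on a made-up two-link paragraph (assuming beautifulsoup4 is installed):

```python
from bs4 import BeautifulSoup

# Two <a> tags in one <p>; the text " and " between them is also a sibling node
demo = '<p><a id="link1">Basic</a> and <a id="link2">Advanced</a></p>'
soup = BeautifulSoup(demo, "html.parser")

first = soup.a
print(repr(first.next_sibling))         # ' and ' -- a string, not a tag
print(first.next_sibling.next_sibling)  # <a id="link2">Advanced</a>

# .next_siblings / .previous_siblings iterate over all following/preceding siblings
for sib in first.next_siblings:
    print(repr(sib))
```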
5. A typical HTML document:
<html>
<head>
<title>This is a python demo page</title>
</head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.
</p>
</body>
</html>
Now look at its abstract model, a tree of nested tags:
6. Organizing the information
Code demonstrating the traversal methods:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> soup = BeautifulSoup(r.text, "html.parser")
>>> soup.head  # the soup's head tag
<head><title>This is a python demo page</title></head>
>>> soup.head.contents  # all child nodes under the head tag
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> # The list above holds all of body's child nodes: tags plus the strings at the same level.
>>> # .contents gives direct children only, never grandchildren.
>>> # By contrast, .parents and .descendants cover every level above/below the node.
>>> for parent in soup.a.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
...
p
body
html
[document]
Three ways of marking up information (marked-up information is easier to understand):
XML (HTML), JSON, and YAML
HTML itself, for example, is one kind of marked-up information.
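The same record can be written in each of the three formats. Only JSON has a standard-library codec in Python, so the XML and YAML forms below are shown as plain string literals; the course name and URL are taken from the demo page above.

```python
import json

# One course record expressed three ways
xml_form = '<course name="Basic Python" url="http://www.icourse163.org/course/BIT-268001"/>'
yaml_form = "course:\n  name: Basic Python\n  url: http://www.icourse163.org/course/BIT-268001"

record = {"course": {"name": "Basic Python",
                     "url": "http://www.icourse163.org/course/BIT-268001"}}
json_form = json.dumps(record, indent=2)  # serialize with the stdlib json module
print(json_form)
```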
Methods of information extraction:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> soup = BeautifulSoup(r.text, "html.parser")
>>> # Extract every link: find all <a> tags and read their href attributes
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
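find_all accepts more than a tag name: it can filter by attributes, by CSS class, or by regular expressions. A sketch using the same structure as the demo page, embedded as a local string so it runs without network access (assuming beautifulsoup4 is installed):

```python
import re
from bs4 import BeautifulSoup

# Same <a> structure as the demo page, kept local so no request is needed
demo = ('<p class="course">Courses: '
        '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
        '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(demo, "html.parser")

print([a.get("href") for a in soup.find_all("a")])  # all link URLs
print(soup.find_all("a", id="link2"))               # filter by attribute value
print(soup.find_all(class_=re.compile("^py")))      # class names starting with "py"
print(soup.find_all(string=re.compile("Python")))   # strings matching a regex
```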