Python Web Scraping Study Notes (BeautifulSoup4: Upward, Downward, and Sibling Traversal of the Tag Tree)

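BeautifulSoup4: the beautifulsoup library parses, traverses, and maintains the "tag tree" of a document. Install it the same way as the requests library (pip install beautifulsoup4).

Usage:

from bs4 import BeautifulSoup

soup = BeautifulSoup('data', 'html.parser')

# Test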

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML

soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")  # alternatively, parse a local HTML file
print(soup.prettify())  # output the parse tree as a formatted Unicode string, with each XML/HTML tag on its own line
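
The same steps can be tried without any network access. The following is a minimal sketch that parses an inline string instead of a downloaded page; the markup in demo is a made-up sample, not the content of the demo page used above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
demo = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(demo, "html.parser")

# prettify() returns a string with one tag per line, indented by depth.
pretty = soup.prettify()
print(pretty)

# The parsed tree can be queried right away.
print(soup.title.string)
```

Working from an inline string like this makes it easy to experiment with the parser before pointing the code at a real page.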

Available parsers:

bs4's HTML parser: BeautifulSoup(mk, 'html.parser') (requires only bs4)

lxml's HTML parser: BeautifulSoup(mk, 'lxml') (requires lxml)

lxml's XML parser: BeautifulSoup(mk, 'xml') (requires lxml)

html5lib's parser: BeautifulSoup(mk, 'html5lib') (requires html5lib)
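
Since lxml and html5lib are optional installs, one possible approach is to try a preferred parser and fall back to the built-in one. The sketch below assumes this fallback policy; make_soup is a hypothetical helper written for illustration, not part of bs4. It relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper for illustration only.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

Different parsers can build slightly different trees from malformed markup, so it is usually safer to name one parser explicitly once the choice is settled.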

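Basic elements:

Tag: a tag, delimited by <> and </>

Name: the tag's name, i.e. the text inside <>; accessed as <tag>.name

Attributes: the tag's attributes, stored as a dictionary; accessed as <tag>.attrs

NavigableString: the text between a tag's opening and closing markers; accessed as <tag>.string

Comment: the comment portion of a string inside a tag, a special kind of NavigableString; it can be swapped for another node with comment.replace_with(cdata)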
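
To make the element types concrete, here is a small sketch on an inline string. The markup and the example.com link are invented for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical sample markup, used only for illustration.
html = '<p class="title"><a href="http://example.com/py" id="link1">Basic Python</a></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.a
print(isinstance(tag, Tag))      # True: soup.a is a Tag object
print(tag.name)                  # a
print(tag.attrs)                 # attributes as a dict
print(tag.attrs["href"])         # http://example.com/py
print(soup.p.attrs["class"])     # ['title'], class is multi-valued so it is a list
print(isinstance(tag.string, NavigableString))  # True
print(tag.string)                # Basic Python
```

Note that class comes back as a list, because an HTML element may carry several class names.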

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# print(soup.title)
tag = soup.a  # the <a> tag; attribute access returns only the first matching tag
aptag = soup.a.parent.name  # name of the first <a> tag's parent tag
ta = tag.attrs  # the tag's attributes, stored as a dictionary
tac = tag.attrs['class']  # value of the class attribute
tah = tag.attrs['href']  # the link target in href
tat = type(tag.attrs)  # type of the attributes object (dict)
tta = type(tag)  # type of the tag (bs4.element.Tag)
tcont = tag.string  # the string between <a> and </a>
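newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")  # newsoup.b.string is a Comment with the text "This is a comment"; newsoup.p.string is an ordinary string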
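
Because .string returns comments and ordinary text alike, it can help to check the type before using the value. The following is a rough sketch of that check; the two-tag markup is an assumed sample chosen so that one tag holds a comment and the other holds plain text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Assumed sample markup: <b> holds a comment, <p> holds ordinary text.
newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

b_string = newsoup.b.string
p_string = newsoup.p.string

# .string strips the <!-- --> markers, so only the type tells the two apart.
print(b_string, isinstance(b_string, Comment))
print(p_string, isinstance(p_string, Comment))
```

Without the isinstance check, text taken from a comment would be indistinguishable from visible page text.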

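Downward traversal of the tag tree:

.contents: a list of child nodes; all of <tag>'s children are stored in a list

.children: an iterator over child nodes; like .contents, but meant for looping (for) over the children

.descendants: an iterator over all descendant nodes (children, grandchildren, and so on), for use in a for loop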
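
The difference between the three attributes is easiest to see on a tiny tree. This is a minimal sketch using made-up markup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, used only for illustration.
html = "<body><p>One <b>bold</b></p><p>Two</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents is a real list, so len() and indexing work.
contents = body.contents
print(len(contents))

# .children iterates over the same direct children.
child_names = [child.name for child in body.children]
print(child_names)

# .descendants walks the whole subtree, strings included.
descendant_count = len(list(body.descendants))
descendant_tags = [d.name for d in body.descendants if isinstance(d, Tag)]
print(descendant_count, descendant_tags)
```

Here body has two children but six descendants, because the text nodes and the nested b tag are counted as well.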

Upward traversal of the tag tree:

.parent: the node's parent tag

.parents: an iterator over the node's ancestor tags, for looping over them
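
A short sketch of walking upward from a link, again on invented markup. It also shows where the chain ends: the last ancestor is the BeautifulSoup object, whose name is "[document]" and whose own parent is None.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Direct parent of the first <a> tag.
print(soup.a.parent.name)

# All ancestors, from nearest to the document root.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# The top-level <html> tag hangs off the BeautifulSoup object itself.
print(soup.html.parent is soup)
print(soup.parent)
```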

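Sibling traversal of the tag tree

Note: sibling traversal only happens among nodes that share the same parent. A sibling is not necessarily a tag; it can also be a NavigableString.

.next_sibling: the next sibling node in HTML text order

.previous_sibling: the previous sibling node in HTML text order

.next_siblings: an iterator (for) over all following sibling nodes in HTML text order

.previous_siblings: an iterator (for) over all preceding sibling nodes in HTML text order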
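
Sibling traversal often surprises because text between tags counts as a sibling too. The sketch below, on made-up markup, shows this and one possible way to keep only the tag siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, used only for illustration.
html = "<p>Intro <a id='link1'>Basic</a> and <a id='link2'>Advanced</a>.</p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate neighbours of the first <a> are strings, not tags.
print(repr(first.next_sibling))
print(repr(first.previous_sibling))

# Two steps forward reaches the second <a>; two steps back runs off the start.
second = first.next_sibling.next_sibling
print(second["id"])
print(first.previous_sibling.previous_sibling)

# Filter the iterator to keep only Tag siblings.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
print([t["id"] for t in tag_siblings])
```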

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# Downward traversal
sh = soup.head  # the <head> tag
shc = soup.head.contents  # list of <head>'s child nodes
sbc = soup.body.contents  # list of <body>'s child nodes
sn = len(sbc)  # number of <body>'s child nodes (.contents is a list)
# Loop over the children of <body>
for child in soup.body.children:
    print(child)

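# Upward traversal
stp = soup.title.parent  # parent tag of <title>
shp = soup.html.parent  # <html> is the top-level tag; its parent is the BeautifulSoup object itself
sop = soup.parent  # the BeautifulSoup object has no parent (None)
# Loop over the ancestors of the first <a> tag
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)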

# Sibling traversal
sans = soup.a.next_sibling  # node right after the first <a> tag
sanbs = soup.a.next_sibling.next_sibling
saps = soup.a.previous_sibling  # node right before the first <a> tag
sapspa = soup.a.previous_sibling.previous_sibling  # None
# Loop over preceding and following siblings
for sibling in soup.a.previous_siblings:
    print(sibling)
for sibling in soup.a.next_siblings:
    print(sibling)

二叉叔
