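I. Introduction to XPath parsing
- Purpose: XPath is a simple, convenient way to extract content from a page.
- Installation: install the lxml module before use by running the following command in a local terminal:
pip install lxml
- Import:
from lxml import etree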
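As a quick check that the installation and import work, the sketch below parses a small HTML string and runs one XPath query. The HTML snippet and the variable names are made up for illustration.

```python
from lxml import etree

# A tiny, made-up HTML snippet used only for this illustration
snippet = "<html><body><p>hello</p><p>xpath</p></body></html>"

# etree.HTML() parses a string and returns the root <html> element
root = etree.HTML(snippet)

# xpath() returns a list of matches; here, the text of every <p>
texts = root.xpath("//p/text()")
print(texts)  # ['hello', 'xpath']
```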
II. XPath syntax
1. Page source used for testing
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>Title</title>
</head>
<body>
    <div>
        <p>一个很厉害的人</p>
        <ol>
            <li id="10086">周大强</li>
            <li id="10010">周芷若</li>
            <li class="joy">周杰伦</li>
            <li class="jolin">蔡依林</li>
            <ol>
                <li>阿信</li>
                <li>信</li>
                <li>信不信</li>
            </ol>
        </ol>
    </div>
    <hr />
    <ul>
        <li><a href="http://www.baidu.com">百度</a></li>
        <li><a href="http://www.google.com">谷歌</a></li>
        <li><a href="http://www.sogou.com">搜狗</a></li>
    </ul>
    <ol>
        <li><a href="feiji">飞机</a></li>
        <li><a href="dapao">大炮</a></li>
        <li><a href="huoche">火车</a></li>
    </ol>
    <div class="job">李嘉诚</div>
    <div class="common">胡辣汤</div>
</body>
</html>
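2. Getting started with XPath parsing
from lxml import etree

# Read the test page shown above; "with" closes the file automatically
with open("xpath测试.html", mode='r', encoding='utf-8') as f:
    page_source = f.read()

'''
What if the editor offers no code completion for a variable?
Use type() to find out its data type,
then add a `# type: <type>` comment where the variable is assigned.
'''
hm = etree.HTML(page_source)  # type: etree._Element

html = hm.xpath("/html")  # a leading "/" starts from the root node
print(html)
body = hm.xpath("/html/body")  # each further "/" steps down to a direct child
print(body)
p = hm.xpath("/html/body/div/p/text()")  # text() extracts the text
print(p)  # the result is a list
print(p[0])
print("".join(p))
p = hm.xpath("//p/text()")  # "//" matches at any depth
print(p)
li = hm.xpath("//div/ol/li/text()")  # text directly inside each li
print(li)
li = hm.xpath("//div/ol//text()")  # every piece of text below div/ol
print(li)
# Join the pieces and strip spaces and newlines
print("".join(li).replace(" ", "").replace("\n", ""))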
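The last two queries above differ only in `/text()` versus `//text()`. A minimal sketch of that difference, using a made-up fragment rather than the test page:

```python
from lxml import etree

# Made-up fragment: a <div> with direct text and text nested in a child <span>
snippet = "<div>outer <span>inner</span> tail</div>"
root = etree.HTML(snippet)

# /text() returns only the text nodes that are direct children of <div>
direct = root.xpath("//div/text()")
print(direct)  # ['outer ', ' tail']

# //text() returns every text node below <div>, in document order
everything = root.xpath("//div//text()")
print(everything)  # ['outer ', 'inner', ' tail']

# Join the pieces, then collapse runs of whitespace into single spaces
clean = " ".join("".join(everything).split())
print(clean)  # outer inner tail
```

Removing every space with `replace(" ", "")` works for the Chinese test page, but it would glue English words together. Collapsing whitespace with `split()` and `join()` is one alternative for text where spaces matter.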
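3. Advanced XPath parsing
from lxml import etree

with open("xpath测试.html", mode='r', encoding='utf-8') as f:
    page_source = f.read()
hm = etree.HTML(page_source)

# [n] selects by position; counting starts at 1
li = hm.xpath("//ol/li[2]/text()")
print(li)
li = hm.xpath("//ol/ol/li[2]/text()")
print(li)
# [@attr='value'] filters by attribute value
li = hm.xpath("//li[@id='10086']/text()")
print(li)
# * matches any tag name
li = hm.xpath("//*[@class='joy']/text()")
print(li)
# [@class] keeps every element that has a class attribute, whatever its value
li = hm.xpath("//*[@class]/text()")
print(li)
# xpath() can be called again on each result; "./" is relative to that element
li_list = hm.xpath("//ol/ol/li")
for li in li_list:
    print(li.xpath("./text()"))
li_list = hm.xpath("//ul/li")
for li in li_list:
    print(li.xpath("./a/text()"))  # link text
    print(li.xpath("./a/@href"))   # @href extracts the attribute value
# last() selects the final li
li = hm.xpath("//body/ol/li[last()]/a/text()")
print(li)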
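A positional predicate such as `li[2]` is evaluated relative to each parent, so one query can return an item from several lists at once. The sketch below illustrates this with a made-up fragment; wrapping the path in parentheses applies the index to the combined result instead.

```python
from lxml import etree

# Made-up fragment with two sibling lists of different lengths
snippet = """
<div>
  <ul><li>a1</li><li>a2</li><li>a3</li></ul>
  <ul><li>b1</li><li>b2</li></ul>
</div>
"""
root = etree.HTML(snippet)

# li[2] is evaluated per parent: the second <li> of every <ul>
per_parent = root.xpath("//ul/li[2]/text()")
print(per_parent)  # ['a2', 'b2']

# Parentheses apply the index to the whole result set, in document order
overall = root.xpath("(//ul/li)[4]/text()")
print(overall)  # ['b1']

# last() is also per parent: the final <li> of each <ul>
last_items = root.xpath("//ul/li[last()]/text()")
print(last_items)  # ['a3', 'b2']
```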
4. XPath in practice (China box office)
'''
Goal: collect the data listed on the China box office website
'''
import csv

import requests
from lxml import etree

url = 'http://www.boxofficecn.com/boxoffice2022'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}
# Download the page and parse it into an element tree
resp = requests.get(url, headers=headers)
main_page = etree.HTML(resp.text)
# One <tr> element per table row
tr_list = main_page.xpath('//table/tbody/tr')
'''
# Earlier version, kept for reference: write the rows to a plain text file
with open('2022move.text', 'w') as f:
    for tr in tr_list[1:-1]:
        # print(tr==None)
        num = tr.xpath('./td[1]//text()')[0]  # number
        year = tr.xpath('./td[2]//text()')    # year
        name = tr.xpath('./td[3]//text()')    # film title
        # Check whether the data is empty
        if name:
            name = ''.join(name)
        else:
            name = 'N/A'
        money = tr.xpath('./td[4]//text()')   # box office
        if not year:
            year = 'N/A'
        else:
            year = ''.join(year)
        if not money:
            money = 'N/A'
        else:
            money = ''.join(money)
        # print(num, year, name, money)
        f.write(f'{num}|{year}|{name}|{money}')
        f.write('\n')
'''
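# Write the rows to a CSV file. newline='' prevents blank lines between rows
# on Windows; the explicit encoding avoids relying on the platform default.
with open('2022move.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # Skip the first and last rows (presumably the table header and footer)
    for tr in tr_list[1:-1]:
        num = tr.xpath('./td[1]//text()')    # number
        year = tr.xpath('./td[2]//text()')   # year
        name = tr.xpath('./td[3]//text()')   # film title
        money = tr.xpath('./td[4]//text()')  # box office
        # Each result is a list: join it, or use a placeholder when it is empty
        num = ''.join(num) if num else 'N/A'
        year = ''.join(year) if year else 'N/A'
        name = ''.join(name) if name else 'N/A'
        money = ''.join(money) if money else 'N/A'
        writer.writerow([num, year, name, money])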
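The row-extraction logic above can be tried without a network connection. The following is a minimal sketch under stated assumptions: the table markup is made up and the real site's markup may differ, `cell_text` is a hypothetical helper that is not part of lxml, and the output goes to an in-memory buffer instead of a file.

```python
import csv
import io

from lxml import etree

# Made-up table standing in for the downloaded page
sample = """
<table>
  <tr><td>1</td><td>2022</td><td><a>Film A</a></td><td>100</td></tr>
  <tr><td>2</td><td></td><td>Film <b>B</b></td><td>250</td></tr>
</table>
"""


def cell_text(tr, index, default="N/A"):
    """Hypothetical helper: joined text of the index-th <td>, or default if empty."""
    parts = tr.xpath(f"./td[{index}]//text()")
    return "".join(parts).strip() or default


page = etree.HTML(sample)
buffer = io.StringIO()  # in-memory file, so the sketch writes nothing to disk
writer = csv.writer(buffer)
for tr in page.xpath("//table//tr"):
    writer.writerow([cell_text(tr, i) for i in range(1, 5)])

print(buffer.getvalue())
```

Note that this sketch queries `//table//tr` rather than `//table/tbody/tr`. Browsers insert a `<tbody>` when they display a table, but lxml does not add one, so the `tbody` step only matches if the HTML sent by the server actually contains that tag.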
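III. Summary of XPath parsing
- For the node-selecting expressions used above, xpath() returns a list no matter how many items match, even when nothing matches.
- text() -> extracts the text content under a tag
- [@attr='value'] -> selects the tags whose attribute has the given value
- @attr -> extracts the value of that attribute
- Indexes in XPath start at 1, not 0.
- // searches anywhere in the page: it skips the tags in between and matches the qualifying tags at any depth.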
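The points above can be checked with a short sketch. The HTML string is made up, and `first` is a hypothetical convenience helper, not an lxml function.

```python
from lxml import etree

root = etree.HTML("<ul><li id='x'>one</li><li>two</li></ul>")

# No match still gives a list (an empty one), so guard before indexing
missing = root.xpath("//li[@id='nope']/text()")
print(missing)  # []


def first(results, default=None):
    """Hypothetical helper: first item of an xpath() result list, or default."""
    return results[0] if results else default


print(first(missing, "N/A"))                   # N/A
print(first(root.xpath("//li[1]/text()")))     # one (indexes start at 1)
print(first(root.xpath("//li[@id='x']/@id")))  # x

# Expressions that compute a value do not return a list: count() yields a float
print(root.xpath("count(//li)"))  # 2.0
```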