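Recently a classmate next door has been learning an important step in web scraping: XPath syntax. Unfortunately, the scripts behind the online testers (for example https://www.bejson.com/testtools/xpath/) may have a bug: fetching an attribute with "/@xx" fails. So I wrote a small Python 3 script to practice with.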
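For reference, here is a minimal sketch of what "/@xx" is expected to return when evaluated with lxml. The sample markup and the variable names are made up for illustration and do not come from the online tester.

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
SAMPLE = (
    '<html><body>'
    '<a href="/first" class="nav">One</a>'
    '<a href="/second">Two</a>'
    '</body></html>'
)

root = etree.HTML(SAMPLE)

# "/@href" selects the attribute values themselves; lxml returns them as strings.
hrefs = root.xpath('//a/@href')
print(hrefs)

# Without the "/@..." step, the same path returns element objects instead.
links = root.xpath('//a')
print([(a.tag, dict(a.attrib)) for a in links])
```

So an attribute step yields plain string values, while an element step yields objects whose `tag` and `attrib` can be inspected. The script below prints these two kinds of results differently.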
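Two small problems remain: first, the script cannot use etree's parse function to read the file directly; second, double quotes cannot be used in the input string. I do not expect to update it any further.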
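#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""Interactive XPath tester: load an XML/HTML file, then evaluate XPath expressions on it."""
import sys

from lxml import etree

if __name__ == '__main__':
    # Take the filename from the command line, or prompt for it.
    if len(sys.argv) == 2:
        filename = sys.argv[1]
    else:
        filename = input("Please input a filename which includes xml/html content: ")
    try:
        # Read bytes so that lxml can detect the document encoding itself.
        with open(filename, 'rb') as file:
            data = file.read()
        html = etree.HTML(data)
        # html = etree.parse(filename)
    except (OSError, ValueError, etree.LxmlError):
        print("The file does not exist, is not readable, or could not be parsed!")
        sys.exit(1)
    while True:
        xpath = input("XPath: ").strip()
        # A lone "*" ends the session.
        if xpath == '*':
            print("Bye.")
            break
        try:
            html_data = html.xpath(xpath)
            # Expressions such as count(//a) return a single value, not a list.
            if not isinstance(html_data, list):
                html_data = [html_data]
            for item in html_data:
                if isinstance(item, str):
                    # Text and attribute results
                    print(item)
                elif etree.iselement(item):
                    print(item.tag, item.attrib)
                else:
                    print(item, type(item))
        except Exception as err:
            print("An error occurred!")
            print(err)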
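
On the first open issue noted above: the post does not say why etree.parse failed. One possible explanation, assuming the input files were HTML rather than well-formed XML, is that parse uses a strict XML parser by default. The sketch below passes an `etree.HTMLParser()` instead. The sample markup and the temporary file are made up for illustration.

```python
import os
import tempfile

from lxml import etree

# Hypothetical sample: HTML that is not well-formed XML (unclosed <br> and <p>).
SAMPLE = '<html><body><p class="intro">Hello<br>world<p>Second</body></html>'

# Write the sample to a temporary file so that parse() has something to read.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(SAMPLE)
    path = tmp.name

try:
    # The default parser is a strict XML parser and rejects this markup.
    try:
        etree.parse(path)
        strict_ok = True
    except etree.XMLSyntaxError:
        strict_ok = False

    # With an HTMLParser, parse() tolerates real-world HTML.
    tree = etree.parse(path, etree.HTMLParser())
    classes = tree.xpath('//p/@class')
finally:
    os.remove(path)

print(strict_ok, classes)
```

If this assumption matches the original failure, the commented-out line in the script could be replaced by `etree.parse(filename, etree.HTMLParser())`. Note that parse returns an element tree rather than a root element, though its `xpath` method is called the same way.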