[Python Crawler Journey, Day 4]: XPath basics && parsing data with lxml + XPath && scraping Douban movies

1. XPath: XPath is a language for finding information in XML and HTML documents; it can traverse the elements and attributes of an XML or HTML document.
Chrome extension: XPath Helper
Firefox extension: Try XPath

XPath syntax:
Predicates:
A predicate is used to find a specific node, or a node that contains a specific value.
Predicates are written inside square brackets.
How to use XPath: use "//" to select elements anywhere in the page, then write the tag name, then add a predicate to narrow the selection. For example: //div[@class="abc"]
Notes:
1. "/" selects direct children, while "//" selects all descendants.
2. The contains() function: use it when an attribute holds several values. For example: //div[contains(@class, "abject")]
3. Predicate indexes start at 1, not 0.
For a more detailed introduction to XPath, see: https://www.w3school.com.cn/xpath/xpath_syntax.asp
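
To make these rules concrete, here is a minimal sketch using lxml (introduced in the next section) against a small HTML snippet made up purely for this illustration:

from lxml import etree

# a tiny made-up snippet, only for demonstrating the predicate rules above
text = """
<body>
  <div class="abc">first</div>
  <div class="abc extra">second</div>
  <div class="other">third</div>
</body>
"""
html = etree.HTML(text)

# exact match on the class attribute
print(html.xpath('//div[@class="abc"]/text()'))             # ['first']
# contains(): matches even when the attribute holds several values
print(html.xpath('//div[contains(@class, "abc")]/text()'))  # ['first', 'second']
# predicate indexes start at 1, not 0
print(html.xpath('//div[1]/text()'))                        # ['first']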

2. Parsing HTML with the lxml library:
1. To parse an HTML string, use lxml.etree.HTML. Sample code:

from lxml import etree

html_element = etree.HTML(text)  # text is an HTML string, e.g. response.text from an earlier request
print(etree.tostring(html_element, encoding='utf-8').decode('utf-8'))

2. To parse an HTML file, use lxml.etree.parse. Sample code:

html_element = etree.parse("lagou.html")
print(etree.tostring(html_element, encoding='utf-8').decode('utf-8'))

This function uses an XML parser by default, so it can raise errors on non-standard HTML. In that case, create an HTML parser yourself:

parser=etree.HTMLParser(encoding='utf-8')
html_element = etree.parse("lagou.html",parser=parser)
print(etree.tostring(html_element, encoding='utf-8').decode('utf-8'))
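
To see the difference, here is a minimal sketch, assuming a hypothetical file broken.html with missing closing tags: the strict default XML parser fails, while the HTML parser tolerates the malformed markup.

from lxml import etree

# broken.html is a hypothetical, badly-formed file used only for illustration
try:
    etree.parse("broken.html")  # default strict XML parser
except etree.XMLSyntaxError as e:
    print("XML parser failed:", e)

parser = etree.HTMLParser(encoding='utf-8')  # lenient HTML parser
html_element = etree.parse("broken.html", parser=parser)
print(etree.tostring(html_element, encoding='utf-8').decode('utf-8'))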

Once the HTML has been parsed with the lxml library, you can extract data with XPath.
Notes on combining lxml with XPath:
1. To run an XPath expression, call element.xpath(). Sample code:

from lxml import etree
parser=etree.HTMLParser(encoding='utf-8')
html = etree.parse("tengxun.html",parser=parser)
div=html.xpath("//div[2]")[0]
print(etree.tostring(div, encoding='utf-8').decode('utf-8'))
print(div)

# the xpath() method returns a list
2. Getting the value of a tag's attribute. Example:

href=html.xpath("//a/@href")
# get the value of the href attribute of <a> tags

3. Getting text, via the text() function in XPath. Example:
(To select a descendant element relative to a given element, add "." before the "/".)

address=tr.xpath("./td[4]/text()")[0]
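
A self-contained sketch of this relative-path idea, using a made-up table so the tr and td in the line above have something concrete to point at:

from lxml import etree

# a small made-up table, only for illustration
text = """
<table>
  <tr><td>1</td><td>Python</td><td>crawler</td><td>Beijing</td></tr>
  <tr><td>2</td><td>Java</td><td>backend</td><td>Shanghai</td></tr>
</table>
"""
html = etree.HTML(text)
for tr in html.xpath("//tr"):
    # the leading "." makes the expression relative to the current tr element
    address = tr.xpath("./td[4]/text()")[0]
    print(address)  # prints Beijing, then Shanghai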

Example: scraping information about upcoming movies from Douban:

from lxml import etree
import requests
url="https://movie.douban.com/" 
response=requests.get(url,headers=headers)#解析器
text=response.text
html=etree.HTML(text)
ul=html.xpath("//ul[@class='ui-slide-content']")[0]
#print(etree.tostring(ul,encoding='utf-8').decode('utf-8'))
lis=ul.xpath("./li")
movies=[]
for li in lis:
    #print(etree.tostring(li,encoding='utf-8').decode('utf-8'))
    title=li.xpath("@data-title")[0]
    year=li.xpath("@data-release")[0]
    director=li.xpath("@data-director")[0]
    actors=li.xpath("@data-actors")[0]
    picture=li.xpath(".//img/@src")
    movie={"title":title,
           "year":year,
           "director":director,
           "actors":actors,
           "picture":picture}
    movies.append(movie)
print(movies)

The output is as follows:
C:\python38\python.exe "C:/python38/new project/mydi/day4.py"
[{'title': '六人-泰坦尼克上的中国幸存者 The Six', 'year': '2020', 'director': '罗飞', 'actors': '施万克', 'picture': ['https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2581067467.jpg']}, {'title': '通往春天的列车', 'year': '2019', 'director': '李骥', 'actors': '任素汐 / 李岷城 / 陈宇星', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2574813382.jpg']}, {'title': '不欺不遇', 'year': '2020', 'director': '金光利', 'actors': '肖旭 / 张学恒 / 张维威', 'picture': ['https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2572668239.jpg']}, {'title': '合法伴侣', 'year': '2019', 'director': '黄雷', 'actors': '李治廷 / 张榕容 / 白客', 'picture': ['https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2581586285.jpg']}, {'title': '月半爱丽丝', 'year': '2020', 'director': '张林子', 'actors': '关晓彤 / 黄景瑜 / 官鸿', 'picture': ['https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2580211645.jpg']}, {'title': '金禅降魔', 'year': '2020', 'director': '彭发', 'actors': '释小龙 / 胡军 / 姚星彤', 'picture': ['https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2564190636.jpg']}, {'title': '大红包', 'year': '2020', 'director': '李克龙', 'actors': '包贝尔 / 李成敏 / 贾冰', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2581346773.jpg']}, {'title': '撼山瑶', 'year': '2020', 'director': '马雍', 'actors': '曾格格 / 姜永波 / 辛祚宇', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2568504230.jpg']}, {'title': '五彩缤纷', 'year': '2020', 'director': '胡安', 'actors': '朱珠 / 艾米·欧文 / 李雅男', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2580890062.jpg']}, {'title': '奇妙王国之魔法奇缘', 'year': '2020', 'director': '陈设', 'actors': '卢瑶 / 张洋 / 陈新玥', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2577837112.jpg']}]

Process finished with exit code 0

This scrape would have been much easier with some front-end knowledge.
Summary:
After four days of study, I can already write lightweight crawlers. I will keep posting my study notes here; fellow beginners are welcome to keep working at it together with me.



Reposted from blog.csdn.net/dinnersize/article/details/104320276