[Python Crawler Journey, Day 4]: XPath basics && parsing data with lxml + XPath && scraping Douban movies

1. XPath: XPath is a language for finding information in XML and HTML documents; it can traverse the elements and attributes of an XML or HTML document.
Chrome extension: XPath Helper
Firefox extension: Try XPath

XPath syntax:
Predicates:
A predicate is used to find a specific node, or a node that contains a specific value.
Predicates are written inside square brackets.
How to use XPath: use "//" to select elements anywhere in the page, then write the tag name, then add a predicate to narrow the selection. For example: //div[@class="abc"]
Notes:
1. "/" selects direct children, while "//" selects all descendants.
2. The contains() function: use it when an attribute holds several values. For example: //div[contains(@class, "abject")]
3. Predicate indexes start at 1, not 0.
For a more detailed introduction to XPath, see: https://www.w3school.com.cn/xpath/xpath_syntax.asp
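
To make these rules concrete, here is a minimal sketch using lxml (introduced in the next section) against a small HTML snippet made up purely for this illustration:

from lxml import etree

# a tiny made-up snippet, only for demonstrating the predicate rules above
text = """
<body>
  <div class="abc">first</div>
  <div class="abc extra">second</div>
  <div class="other">third</div>
</body>
"""
html = etree.HTML(text)

# exact match on the class attribute
print(html.xpath('//div[@class="abc"]/text()'))             # ['first']
# contains(): matches even when the attribute holds several values
print(html.xpath('//div[contains(@class, "abc")]/text()'))  # ['first', 'second']
# predicate indexes start at 1, not 0
print(html.xpath('//div[1]/text()'))                        # ['first']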

2. Parsing HTML with the lxml library:
1. To parse an HTML string, use lxml.etree.HTML. Sample code:

from lxml import etree

html_element = etree.HTML(text)  # text is an HTML string, e.g. response.text from an earlier request
print(etree.tostring(html_element, encoding='utf-8').decode('utf-8'))

2. To parse an HTML file, use lxml.etree.parse. Sample code:

html_element = etree.parse("lagou.html")
print(etree.tostring(html_element, encoding='utf-8').decode('utf-8'))

This function uses an XML parser by default, so it can raise errors on non-standard HTML. In that case, create an HTML parser yourself:

parser=etree.HTMLParser(encoding='utf-8')
html_element = etree.parse("lagou.html",parser=parser)
print(etree.tostring(html_element, encoding='utf-8').decode('utf-8'))
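
To see the difference, here is a minimal sketch, assuming a hypothetical file broken.html with missing closing tags: the strict default XML parser fails, while the HTML parser tolerates the malformed markup.

from lxml import etree

# broken.html is a hypothetical, badly-formed file used only for illustration
try:
    etree.parse("broken.html")  # default strict XML parser
except etree.XMLSyntaxError as e:
    print("XML parser failed:", e)

parser = etree.HTMLParser(encoding='utf-8')  # lenient HTML parser
html_element = etree.parse("broken.html", parser=parser)
print(etree.tostring(html_element, encoding='utf-8').decode('utf-8'))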

Once the HTML has been parsed with the lxml library, you can extract data with XPath.
Notes on combining lxml with XPath:
1. To run an XPath expression, call element.xpath(). Sample code:

from lxml import etree
parser=etree.HTMLParser(encoding='utf-8')
html = etree.parse("tengxun.html",parser=parser)
div=html.xpath("//div[2]")[0]
print(etree.tostring(div, encoding='utf-8').decode('utf-8'))
print(div)

# the xpath() method returns a list
2. Getting the value of a tag's attribute. Example:

href=html.xpath("//a/@href")
# get the value of the href attribute of <a> tags

3. Getting text, via the text() function in XPath. Example:
(To select a descendant element relative to a given element, add "." before the "/".)

address=tr.xpath("./td[4]/text()")[0]
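
A self-contained sketch of this relative-path idea, using a made-up table so the tr and td in the line above have something concrete to point at:

from lxml import etree

# a small made-up table, only for illustration
text = """
<table>
  <tr><td>1</td><td>Python</td><td>crawler</td><td>Beijing</td></tr>
  <tr><td>2</td><td>Java</td><td>backend</td><td>Shanghai</td></tr>
</table>
"""
html = etree.HTML(text)
for tr in html.xpath("//tr"):
    # the leading "." makes the expression relative to the current tr element
    address = tr.xpath("./td[4]/text()")[0]
    print(address)  # prints Beijing, then Shanghai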

Example: scraping information about upcoming movies from Douban:

from lxml import etree
import requests
url="https://movie.douban.com/" 
response=requests.get(url,headers=headers)#解析器
text=response.text
html=etree.HTML(text)
ul=html.xpath("//ul[@class='ui-slide-content']")[0]
#print(etree.tostring(ul,encoding='utf-8').decode('utf-8'))
lis=ul.xpath("./li")
movies=[]
for li in lis:
    #print(etree.tostring(li,encoding='utf-8').decode('utf-8'))
    title=li.xpath("@data-title")[0]
    year=li.xpath("@data-release")[0]
    director=li.xpath("@data-director")[0]
    actors=li.xpath("@data-actors")[0]
    picture=li.xpath(".//img/@src")
    movie={"title":title,
           "year":year,
           "director":director,
           "actors":actors,
           "picture":picture}
    movies.append(movie)
print(movies)

The output is as follows:
C:\python38\python.exe "C:/python38/new project/mydi/day4.py"
[{'title': '六人-泰坦尼克上的中国幸存者 The Six', 'year': '2020', 'director': '罗飞', 'actors': '施万克', 'picture': ['https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2581067467.jpg']}, {'title': '通往春天的列车', 'year': '2019', 'director': '李骥', 'actors': '任素汐 / 李岷城 / 陈宇星', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2574813382.jpg']}, {'title': '不欺不遇', 'year': '2020', 'director': '金光利', 'actors': '肖旭 / 张学恒 / 张维威', 'picture': ['https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2572668239.jpg']}, {'title': '合法伴侣', 'year': '2019', 'director': '黄雷', 'actors': '李治廷 / 张榕容 / 白客', 'picture': ['https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2581586285.jpg']}, {'title': '月半爱丽丝', 'year': '2020', 'director': '张林子', 'actors': '关晓彤 / 黄景瑜 / 官鸿', 'picture': ['https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2580211645.jpg']}, {'title': '金禅降魔', 'year': '2020', 'director': '彭发', 'actors': '释小龙 / 胡军 / 姚星彤', 'picture': ['https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2564190636.jpg']}, {'title': '大红包', 'year': '2020', 'director': '李克龙', 'actors': '包贝尔 / 李成敏 / 贾冰', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2581346773.jpg']}, {'title': '撼山瑶', 'year': '2020', 'director': '马雍', 'actors': '曾格格 / 姜永波 / 辛祚宇', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2568504230.jpg']}, {'title': '五彩缤纷', 'year': '2020', 'director': '胡安', 'actors': '朱珠 / 艾米·欧文 / 李雅男', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2580890062.jpg']}, {'title': '奇妙王国之魔法奇缘', 'year': '2020', 'director': '陈设', 'actors': '卢瑶 / 张洋 / 陈新玥', 'picture': ['https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2577837112.jpg']}]

Process finished with exit code 0

This scrape would have been much easier with some front-end knowledge.
Summary:
After four days of study, I can already write lightweight crawlers. I will keep posting my study notes here; fellow beginners are welcome to keep working at it together with me.



Reposted from blog.csdn.net/dinnersize/article/details/104320276