1. Relevant knowledge

1.1 Using etree

1.1.1 Coding process

Load the HTML text into an etree object.

Call the etree object's xpath() method to locate the tags.

Process the located tags as needed (what you have at this point are the objects returned by xpath()).
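The three steps above can be sketched as follows. This is a minimal sketch: the HTML string and the class name "box" are made up for illustration and stand in for a real page.

```python
from lxml import etree

# Hypothetical HTML used only for illustration
html_text = '<html><body><div class="box"><p>first</p><p>second</p></div></body></html>'

# Step 1: load the HTML text into an etree object
tree = etree.HTML(html_text)

# Step 2: call xpath() to locate tags; the result is a list of Element objects
p_list = tree.xpath('//div[@class="box"]/p')

# Step 3: work with the returned objects, here by reading their text
texts = [p.text for p in p_list]
print(texts)
```

Each item in p_list is itself an element, so xpath() can be called on it again with a relative expression.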

1.1.2 Environment installation

pip install lxml

1.1.3 Instantiate the etree object

# First, import the module
from lxml import etree

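Load HTML from a local file

filePath = 'path to your HTML file'
# etree.parse() uses an XML parser by default, so the file must be well-formed;
# pass etree.HTMLParser() as the second argument for ordinary HTML
tree = etree.parse(filePath)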
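One thing worth knowing about etree.parse() is that it defaults to an XML parser, so HTML that is not well-formed XML is rejected. The sketch below illustrates this with a made-up snippet read through StringIO instead of a real file; passing etree.HTMLParser() is one way around the problem, not the only one.

```python
from io import StringIO
from lxml import etree

# Hypothetical HTML with an unclosed <br>: valid HTML, but not well-formed XML
html_text = '<html><body><p>one<br>two</p></body></html>'

# The default XML parser rejects this input
try:
    etree.parse(StringIO(html_text))
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# Passing an HTMLParser makes etree.parse() tolerant of ordinary HTML
tree = etree.parse(StringIO(html_text), etree.HTMLParser())
texts = tree.xpath('//p/text()')
print(xml_ok, texts)
```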

Load HTML from response data

import requests

url = 'some URL'
response = requests.get(url)
tree = etree.HTML(response.text)
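Unlike etree.parse(), etree.HTML() takes a string and returns the root element directly, and it wraps incomplete fragments in html and body tags. A small sketch, using a made-up fragment in place of real response text:

```python
from lxml import etree

# Hypothetical fragment standing in for the text of a response
fragment = '<p>hello</p>'

root = etree.HTML(fragment)
print(root.tag)                           # tag of the root element
print(root.xpath('/html/body/p/text()'))  # the fragment now sits under html/body
```

Because of this wrapping, absolute paths that start at /html work even when the downloaded text is only part of a page.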

1.2 XPath syntax

For XPath related knowledge, see: https://www.runoob.com/xpath/xpath-syntax.html

xpath('XPath expression'):
- /: a leading slash means positioning starts from the root node; a slash in the middle means one level
- //: a // in the middle means any number of levels; a leading // means positioning starts from any position
- attribute positioning: //tag[@attrName='attrValue']
- index positioning: //tag[@attrName='attrValue']/p[3] ==> the index starts from 1
- take the text:
  - /text(): get only the text directly inside the tag
  - //text(): get all the text under the tag, including text in nested tags
- get the attribute value:
  - /@attrName
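Each rule in the list above can be tried on a small document. The HTML below is invented for this sketch (the class name "poets" and the example.com link are assumptions), and the variable names only label which rule is being exercised.

```python
from lxml import etree

# Hypothetical HTML used only to exercise each rule
html_text = '''
<html><body>
  <div class="poets">
    <p>Li Bai</p>
    <p>Du Fu</p>
    <p>Wang Wei <b>(Tang)</b></p>
    <a href="http://example.com/more">more</a>
  </div>
</body></html>
'''
tree = etree.HTML(html_text)

by_root = tree.xpath('/html/body/div/p')                    # / : from the root, one level per step
anywhere = tree.xpath('//p')                                # leading // : from any position
by_attr = tree.xpath('//div[@class="poets"]')               # attribute positioning
second = tree.xpath('//div[@class="poets"]/p[2]/text()')    # index positioning, starting from 1
direct = tree.xpath('//div[@class="poets"]/p[3]/text()')    # /text(): direct text only
nested = tree.xpath('//div[@class="poets"]/p[3]//text()')   # //text(): text of nested tags too
href = tree.xpath('//div[@class="poets"]/a/@href')          # attribute value

print(second, direct, nested, href)
```

The difference between direct and nested is the one that most often causes surprises: /text() skips the text inside the b tag, while //text() includes it.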

1.3 Simple example of XPath

HTML file content:

# Contents of test.html
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>测试bs4</title>
</head>
<body>
    <div>
        <p>百里守约</p>
    </div>
    <div class="song">
        <p>李清照</p>
        <p>王安石</p>
        <p>苏轼</p>
        <p>柳宗元</p>
        <a href="http://www.song.com/" title="赵匡胤" target="_self">
            <span>this is span</span>
        宋朝是最强大的王朝，不是军队的强大，而是经济很强大，国民都很有钱</a>
        <a href="" class="du">总为浮云能蔽日,长安不见使人愁</a>
        <img src="http://www.baidu.com/meinv.jpg" alt=""/>
    </div>
    <div class="tang">
        <ul>
            <li><a href="http://ww.baidu.com" title= "ging">清明时节雨纷纷,路上行人欲断魂,借问酒家何处有，牧童遥指杏花村</a></li>
            <li><a href="http://ww.163.com" title="gin">秦时明月汉时关，万里长征人未还,但使龙城飞将在，不教胡马度阴山</a></li>
            <li><a href="http://ww.126.com" alt= "qi">岐王宅里寻常见，崔九堂前几度闻，正是江南好风景,落花时节又逢君</a></li>
            <li><a href="http://www.sina.com" class="du">杜甫</a></li>
            <li><a href="http://www.dudu.com" class="du">杜牧</a></li>
            <li><b>杜小月</b></li>
            <li><i>度蜜月</i></li>
            <li><a href="http://www.haha.com" id="feng">凤凰台上凤凰游，凤去台空江自流，吴宫花草埋幽径,晋代衣冠成古丘</a></li>
        </ul>
    </div>
</body>
</html>

Test code:

from lxml import etree

if __name__ == '__main__':
    # Instantiate an etree object and load the source to be parsed into it
    tree = etree.parse('test.html')

    print(tree.xpath('/html/head/title'))  # walk down from the root node: html > head > title
    print(tree.xpath('/html//title'))  # same result; here // spans multiple levels
    print(tree.xpath('//div'))  # a leading // locates div from any position
    print(tree.xpath('//div[@class="song"]'))  # locate by the class attribute
    print(tree.xpath('//div[@class="song"]/p[3]'))  # locate by index; the index starts from 1
    print(tree.xpath('//div[@class="tang"]/ul/li[5]/a/text()')[0])  # 杜牧
    print(tree.xpath('//li[7]/i/text()'))  # returns the list ['度蜜月']; append [0] to get the value itself
    print(tree.xpath('//li[7]//text()'))  # same as above
    print(tree.xpath('//div[@class="song"]/img/@src')[0])  # read an attribute value
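Since xpath() returns a list for these expressions, indexing with [0] raises IndexError when nothing matches. One way to guard against that is a small helper; first() is a hypothetical name introduced here, not part of lxml, and the HTML is made up for the sketch.

```python
from lxml import etree

def first(node, expr, default=None):
    """Return the first XPath result for expr, or default when nothing matches."""
    results = node.xpath(expr)
    return results[0] if results else default

# Hypothetical HTML used only for illustration
tree = etree.HTML('<ul><li><i>honeymoon</i></li><li><b>bold</b></li></ul>')

print(first(tree, '//li[1]/i/text()'))              # a match exists
print(first(tree, '//li[3]/i/text()', default=''))  # no third li, so the default is returned
```

The helper assumes the expression selects nodes, text, or attributes; expressions such as count(...) return a number rather than a list and would need separate handling.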


2. Case study: crawling Douban movies and their ratings

2.1 Web page analysis

2.1.1 Background introduction

Website to crawl: https://movie.douban.com/

2.1.2 Page Analysis

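Get the li tags: '//*[@id="screening"]/div[2]/ul/li' returns a list of li tags

Get the movie name: '@data-title' ==> reads the attribute value directly

[Or] './ul/li[@class="title"]/a/text()' ==> reads the tag text

Get the URL of the detail page: './ul/li[@class="poster"]/a/@href'

After entering the detail page, the steps are similar to those above.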
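The expressions above can be tried offline against markup that imitates the structure they describe. The HTML below is a made-up stand-in (titles, ratings, and example.com links are assumptions), and the real page may differ; the point is how './' expressions are evaluated relative to each li element.

```python
from lxml import etree

# Hypothetical markup imitating the structure described above; the real page may differ
html_text = '''
<div id="screening">
  <div class="hd"></div>
  <div class="bd">
    <ul>
      <li data-title="Movie A" data-rate="8.1">
        <ul>
          <li class="poster"><a href="https://example.com/a"></a></li>
          <li class="title"><a href="https://example.com/a">Movie A</a></li>
        </ul>
      </li>
      <li data-title="Movie B" data-rate="7.4">
        <ul>
          <li class="poster"><a href="https://example.com/b"></a></li>
          <li class="title"><a href="https://example.com/b">Movie B</a></li>
        </ul>
      </li>
    </ul>
  </div>
</div>
'''
tree = etree.HTML(html_text)
li_list = tree.xpath('//*[@id="screening"]/div[2]/ul/li')

movies = []
for li in li_list:
    movies.append({
        'title_attr': li.xpath('@data-title')[0],                          # attribute of the li itself
        'title_text': li.xpath('./ul/li[@class="title"]/a/text()')[0],     # text, relative to the li
        'detail_url': li.xpath('./ul/li[@class="poster"]/a/@href')[0],     # link, relative to the li
    })
print(movies)
```

Note that the outer expression only matches the two top-level li elements, because the nested li elements sit one ul deeper.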

2.2 Code

import requests
from lxml import etree

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
url = "https://movie.douban.com/"
res = requests.get(url, headers=header)
# Load the response data into an etree object
tree = etree.HTML(res.text)
li_list = tree.xpath('//*[@id="screening"]/div[2]/ul/li')

for li in li_list:
    name = li.xpath('@data-title')[0]  # read attribute values
    rate = li.xpath('@data-rate')[0]
    print('Movie name:', name)
    print('Movie rating:', rate)
    # Open the detail page
    detail_url = li.xpath('./ul/li[@class="poster"]/a/@href')[0]
    detail_res = requests.get(detail_url, headers=header)
    new_tree = etree.HTML(detail_res.text)
    new_span_list = new_tree.xpath('//div[@id="info"]/span')
    detail_info = dict()
    for span in new_span_list:
        # Only spans that wrap a label span and a value span are handled;
        # others are skipped instead of raising IndexError
        keys = span.xpath('./span[1]/text()')
        values = span.xpath('./span[2]//text()')
        if not keys or not values:
            continue
        detail_info[keys[0]] = values[0]
        print("%s:%s" % (keys[0], values[0]))

    break  # only process the first movie
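The detail-page step can also be tested without network access by moving the parsing into a function. In this sketch, parse_info() is a hypothetical helper and the sample fragment is an assumption about the shape of the markup, not a copy of the real page. Unlike the loop above, it joins all text pieces of the value instead of taking only the first.

```python
from lxml import etree

def parse_info(html_text):
    """Collect label/value pairs from spans that wrap two child spans."""
    tree = etree.HTML(html_text)
    info = {}
    for span in tree.xpath('//div[@id="info"]/span'):
        labels = span.xpath('./span[1]/text()')
        values = span.xpath('./span[2]//text()')
        # Skip spans that do not contain both a label span and a value span
        if labels and values:
            info[labels[0]] = ''.join(values).strip()
    return info

# Hypothetical detail-page fragment; the real markup is an assumption
sample = '''
<div id="info">
  <span><span class="pl">Director</span>: <span class="attrs"><a href="#">Jane Doe</a></span></span>
  <span class="pl">Genre:</span> <span>Drama</span>
</div>
'''
print(parse_info(sample))
```

Fields laid out like the Genre line here, as sibling spans rather than nested ones, are skipped by this approach and would need a different expression.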


Simple review of python crawlers 1 [using etree for XPath analysis]