Crawler 09 - XPath parsing

1. Understand XPath

XPath is a language for locating content in XML documents.
HTML is close enough to XML in structure that XPath can be used to parse web pages as well.

<book>
    <id>1</id>
    <name>野花满地香</name>
    <price>1.23</price>
    <author>
        <nick>周大强</nick>
        <nick>周诺宁</nick>
    </author>
</book>

In XML, these tags are called nodes. In the example above, <book> is the parent node of <id>, <name>, <price>, and <author>; conversely, <id>, <name>, <price>, and <author> are child nodes of <book>.

<id>, <name>, <price>, and <author> are sibling nodes.

In short: whichever node wraps another is the parent.

# To find the value of price, start from the document root:
/book/price
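This path can be checked with a few lines of Python (using the lxml module that the next section installs; the XML here is a trimmed version of the example above):

```python
from lxml import etree

xml = """
<book>
    <id>1</id>
    <name>野花满地香</name>
    <price>1.23</price>
</book>
"""
tree = etree.XML(xml)
# /book/price walks from the root element down to price;
# text() pulls out its text content
print(tree.xpath("/book/price/text()"))  # ['1.23']
```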

2. Introduction to xpath

2.1 Install the lxml module

The lxml module provides the functions needed to parse documents with XPath.

pip install lxml

2.2 Some simple cases (grammar rules)

# XPath is a language for searching content in XML documents;
# lxml can parse HTML the same way
from lxml import etree

# some XML data to parse (note: a comment placed after the opening
# triple quote would become part of the string and break parsing)
xml = """
    <book>
        <id>1</id>
        <name>野花满地香</name>
        <price>1.23</price>
        <author>
            <nick id="10086">周大强</nick>
            <nick id="10010">周诺宁</nick>
            <nick class="joy">周杰伦</nick>
            <nick class="jolin">蔡依林</nick>
            <div>
                <nick>热热热热热热</nick>
            </div>
            <span>
                <nick>热热热热热热1</nick>
                <div>
                    <nick>热热热热热热3</nick>
                </div>
            </span>
        </author>
        <partner>
            <nick id="ppc">胖胖陈</nick>
            <nick id="ppbc">胖胖不陈</nick>
        </partner>
    </book>
"""
tree = etree.XML(xml)
# 1. Get the value of name
result1 = tree.xpath("/book/name/text()")  # text() extracts the text content
print(result1)  # >>> ['野花满地香']

# 2. Get the nick values directly under author
result2 = tree.xpath("/book/author/nick/text()")
print(result2)  # >>> ['周大强', '周诺宁', '周杰伦', '蔡依林']
# the nicks inside div and span sit one level deeper, so this path misses them

# 3. Get the nick value inside author's div
result3 = tree.xpath("/book/author/div/nick/text()")
print(result3)  # >>> ['热热热热热热']

# 4. Get every nick under author
result4 = tree.xpath("/book/author//nick/text()")  # // matches all descendants of a node
print(result4)  # >>> ['周大强', '周诺宁', '周杰伦', '蔡依林', '热热热热热热', '热热热热热热1', '热热热热热热3']

# 5. Get the nicks exactly two levels below author (热热热热热热, 热热热热热热1)
result5 = tree.xpath("/book/author/*/nick/text()")  # * is a wildcard matching any single node at that level
print(result5)  # >>> ['热热热热热热', '热热热热热热1']

# 6. Get every nick under book
result6 = tree.xpath("/book//nick/text()")
print(result6)  # >>> ['周大强', '周诺宁', '周杰伦', '蔡依林', '热热热热热热', '热热热热热热1', '热热热热热热3', '胖胖陈', '胖胖不陈']
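The sample XML above also carries id and class attributes that none of these queries touch; the attribute-filter syntax covered in the next section works here too. A small sketch (same data, trimmed):

```python
from lxml import etree

xml = """
<book>
    <author>
        <nick id="10086">周大强</nick>
        <nick id="10010">周诺宁</nick>
        <nick class="joy">周杰伦</nick>
    </author>
</book>
"""
tree = etree.XML(xml)
# [@id="10086"] keeps only nodes whose id attribute matches
print(tree.xpath('/book/author/nick[@id="10086"]/text()'))  # ['周大强']
# a bare @id selects the attribute values themselves
print(tree.xpath('/book/author/nick/@id'))  # ['10086', '10010']
```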

2.3 Some in-depth cases (grammar rules)

2.3.1 First, create an HTML file (b.html) to practice on

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <ul>
        <li><a href="http://www.baidu.com">百度</a></li>
        <li><a href="http://www.google.com">谷歌</a></li>
        <li><a href="http://www.sogou.com">搜狗</a></li>
    </ul>
    <ol>
        <li><a href="feiji">飞机</a></li>
        <li><a href="dapao">大炮</a></li>
        <li><a href="huojian">火箭</a></li>
    </ol>
    <div class="job">李嘉诚</div>
    <div class="common">胡辣汤</div>
</body>
</html>

2.3.2 Case practice code + analysis 

from lxml import etree
# etree.parse() defaults to a strict XML parser, which chokes on ordinary
# HTML, so pass an HTMLParser explicitly
tree = etree.parse("b.html", etree.HTMLParser())  # parse() loads a file

# 1. Get 百度, 谷歌, 搜狗 (Baidu, Google, Sogou)
result1 = tree.xpath('/html/body/ul/li/a/text()')
print(result1) #   >>> ['百度', '谷歌', '搜狗']

# 2. Pick a single value by index -> one of 百度/谷歌/搜狗
# XPath indices start at 1, not 0
result2 = tree.xpath('/html/body/ul/li[1]/a/text()')  # [n] indexes into the matches
result3 = tree.xpath('/html/body/ul/li[2]/a/text()')
result4 = tree.xpath('/html/body/ul/li[3]/a/text()')
print(result2) #   >>> ['百度']
print(result3) #   >>> ['谷歌']
print(result4) #   >>> ['搜狗']

# 3. Find an element by attribute value -> the a whose href is "dapao"
result5 = tree.xpath('/html/body/ol/li/a[@href="dapao"]/text()')  # [@attr="value"] filters by attribute
result6 = tree.xpath('/html/body/ol/li/a[@href="huojian"]/text()')
print(result5) #   >>> ['大炮']
print(result6) #   >>> ['火箭']

# 4. Iterate over elements
result7 = tree.xpath('/html/body/ol/li')
for li in result7:
    # print(li)  # result7 holds the three li nodes
    # 1. extract the text from each li
    # li is no longer the document root, so prefix './' to anchor at the current node
    result8 = li.xpath('./a/text()')  # relative lookup inside this li
    print(result8)

    # 2. get the attribute's value -> @href selects the attribute itself
    result9 = li.xpath('./a/@href')  # [@attr="v"] filters; a bare @attr extracts the value
    print(result9)
'''
    ['飞机']
    ['feiji']
    ['大炮']
    ['dapao']
    ['火箭']
    ['huojian']
'''
# 5. Get every href under ul
result10 = tree.xpath('/html/body/ul/li/a/@href')
print(result10)
#   >>> ['http://www.baidu.com', 'http://www.google.com', 'http://www.sogou.com']

2.3.3 Some tips

Open the HTML file in a browser and right-click → Inspect. When the page is large and hard to read, click the content you want on the page, and the inspector will highlight the corresponding node.

Then right-click that node in the inspector; the Copy submenu offers "Copy XPath". Copy it:

/html/body/div[1]

Paste the copied XPath into the code to fetch the data:

# 6. Fetch data via the XPath copied from the browser
result10 = tree.xpath('/html/body/div[1]/text()')
print(result10) #   >>> ['李嘉诚']
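One caveat: a browser-copied positional path like /html/body/div[1] breaks as soon as the page layout shifts. The attribute filter from case 3 gives a sturdier alternative; a sketch with the HTML inlined so it runs standalone:

```python
from lxml import etree

html_doc = """
<html><body>
    <div class="job">李嘉诚</div>
    <div class="common">胡辣汤</div>
</body></html>
"""
tree = etree.HTML(html_doc)
# positional path, as copied from the browser
print(tree.xpath('/html/body/div[1]/text()'))    # ['李嘉诚']
# attribute-based path survives reordering of sibling divs
print(tree.xpath('//div[@class="job"]/text()'))  # ['李嘉诚']
```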

3. XPath in practice: scraping Zhubajie.com

Target: the Ningbo Zhubajie.com listing page for art/design services (the search URL appears in the code below).

Goal: crawl each vendor's name, price, introduction, and address.

3.1 First, check that the information is present in the page source: search the source for a few known words. It is there, so we can parse the response directly.

3.2 Then, as in the cases above, grab the content layer by layer.

Inspecting the source shows that all the vendors sit inside one container, and each div below it holds one vendor's information. You can use the copy-XPath tip above, or walk down from the root node.

The copied XPath needs one small change: the trailing div[1] must become div, because [1] selects only the first vendor, and we want all of them.

With careful inspection we can pull every field for each vendor. The code below prints only one vendor; to print them all, comment out the break.

import requests
from lxml import etree

url = 'https://ningbo.zbj.com/search/f/?kw=%E7%BE%8E%E5%B7%A5'
response = requests.get(url=url)
# print(response.text)

# parse
html = etree.HTML(response.text)
# locate
# xpath copied from the browser -> /html/body/div[6]/div/div/div[2]/div[5]/div[1]
divs = html.xpath('/html/body/div[6]/div/div/div[2]/div[5]/div[1]/div')  # all vendors

# iterate: each div is one vendor on the page
for div in divs:
    name = div.xpath('./div/div/a[1]/div[1]/p/text()')               # shop name
    addr = div.xpath('./div/div/a[1]/div[1]/div/span/text()')        # address
    money = div.xpath('./div/div/a[2]/div[2]/div[1]/span[1]/text()') # price
    title = div.xpath('./div/div/a[2]/div[2]/div[2]/p/text()')       # tags
    print(name)
    print(addr)
    print(money)
    print(title)
    break  # only print one vendor, for easier inspection

Output:

The result works, but it is not clean enough yet (stray newlines in the name, a leading ¥ on the price), so it needs some polishing.

3.3 Improvement 

# iterate: each div is one vendor on the page
for div in divs:
    name = div.xpath('./div/div/a[1]/div[1]/p/text()')[1].strip('\n')                   # shop name
    addr = ''.join(div.xpath('./div/div/a[1]/div[1]/div/span/text()'))                  # address
    money = ''.join(div.xpath('./div/div/a[2]/div[2]/div[1]/span[1]/text()')).strip('¥')  # price
    title = ''.join(div.xpath('./div/div/a[2]/div[2]/div[2]/p/text()'))                 # tags
    print(name)
    print(addr)
    print(money)
    print(title)
    break  # only print one vendor, for easier inspection
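The cleanup calls above can be checked on sample strings shaped like the site's raw output (hypothetical values; the real ones come from the page):

```python
# raw xpath results are lists of text nodes, often with stray
# whitespace or a currency mark attached
name_parts = ['\n', '示例店铺\n']
money_parts = ['¥', '500']

name = name_parts[1].strip('\n')         # take the second text node, drop newlines
money = ''.join(money_parts).strip('¥')  # merge the nodes, drop the ¥ sign
print(name)   # 示例店铺
print(money)  # 500
```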

 


Origin blog.csdn.net/m0_48936146/article/details/124639539