XPath Parsing in Python Crawlers: A Hands-On Guide

XPath is a language for locating information in XML documents. Although it was originally designed for querying XML, it works just as well on HTML documents.
That is why XPath parsing is such a common, efficient, and convenient way to extract information in Python crawlers.

Environment installation

To use XPath, first install the lxml library:

pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple
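
The command above uses the Tsinghua PyPI mirror; a plain pip install lxml works just as well. As a quick sanity check (a minimal sketch, assuming the install succeeded), you can print the version tuples that lxml.etree exposes:

from lxml import etree

# LXML_VERSION and LIBXML_VERSION are version tuples exposed by lxml.etree
print(etree.LXML_VERSION)    # e.g. (4, 9, 1, 0)
print(etree.LIBXML_VERSION)  # version of the bundled libxml2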

Basic use

Instantiate an etree object and load the page source code to be parsed into it. There are two ways:
1. Load the source code from a local HTML document into the etree object
etree.parse('filePath', etree.HTMLParser()) # filePath is the path to the file
Example:

from lxml import etree # import the package
html = etree.parse('./test.html', etree.HTMLParser()) # ./test.html is the path to the local HTML file
html.xpath('XPath expression')

2. Load source code obtained from the Internet into the etree object
etree.HTML(page_data) # page_data is the source code fetched from the page
Example:

from lxml import etree # import the package
html = etree.HTML(page_data) # page_data is the source code fetched from the page
html.xpath('XPath expression')
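
Putting the second approach together with the requests library (a minimal sketch; example.com is just a stand-in URL):

import requests
from lxml import etree

resp = requests.get('https://example.com')  # fetch the page
html = etree.HTML(resp.text)                # parse the source into an etree object
titles = html.xpath('//title/text()')       # extract the <title> text
print(titles)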

The most important step in parsing data with XPath is writing the XPath expression. The common expressions are introduced below.

Common XPath expressions

expression                  meaning
nodename                    Selects all child nodes with the given name
/                           Selects from the root node; each / represents one level of the hierarchy
//                          Selects descendant nodes of the current node, at any depth
.                           Selects the current node
@                           Selects an attribute
text()                      Gets the text content
*                           Wildcard; matches any element node
nodename[@attrib='value']   Selects elements whose given attribute has the given value; for example, div[@class='cell'] selects all div elements whose class attribute is cell

The expressions above are explained in detail with examples below.
First, here is the HTML code used for testing:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>测试</title>
</head>
<body>
    <div class="big">
        <ul>
            <li><a href="https://www.baidu.com/">百度</a></li>
            <li><a href="https://weibo.com/">微博</a></li>
            <li><a href="https://www.tmall.com/">天猫</a></li>
            <p>test1</p>
        </ul>
        <div>
            <a id="aa" href="https://www.iqiyi.com/">爱奇艺</a>
            <a id="bb" href="https://v.qq.com/">腾讯视频</a>
            <p>test2</p>
        </div>
    </div>
</body>
</html>

For convenience and clarity, we will read and parse this HTML file locally.

Fetching nodes

Usage of / and //

Let's first walk through the process of parsing a web page with XPath.
Code:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result1 = html.xpath('/html/body/div/ul/li/a') # / expresses the hierarchy; the first / is the root node
print(result1)

Output:

[<Element a at 0x28696f20ac0>, <Element a at 0x28696f20b00>, <Element a at 0x28696f20b40>]

As you can see, the result contains 3 nodes: exactly the first 3 a nodes from top to bottom. Notice that / is used to express the hierarchy, which is simply one layer wrapped inside another. For example, in the test HTML the html node wraps the body node, so to get the body node we can write /html/body. By extension, we can reach any node we want.
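
For instance (a quick sketch against the same test file), selecting the body node:

result_body = html.xpath('/html/body')  # walk the hierarchy level by level from the root
print(result_body)  # [<Element body at 0x...>]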

Since we can take out the first 3 a nodes, how do we take out all of them?
This is where // comes in.
Code:

result2 = html.xpath('/html/body/div//a')
print(result2)

Output:

[<Element a at 0x222bf9afc80>, <Element a at 0x222bf9afcc0>, <Element a at 0x222bf9afd00>, <Element a at 0x222bf9afd40>, <Element a at 0x222bf9afd80>]

It is not hard to see that the div node with class="big" encloses all the a nodes; in other words, all a nodes are descendants of that div. So /html/body/div//a selects every a node.
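
In fact, since // searches from anywhere in the document, a bare //a matches the same 5 nodes (a minimal sketch):

result_all = html.xpath('//a')  # every a node in the document, regardless of depth
print(len(result_all))          # 5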

Usage of the wildcard *

For example, to get the two p nodes in the test HTML (test1 and test2), we can use the wildcard * to extract them.

Code:

'''
The expression for the first p node is: '/html/body/div/ul/p'
The expression for the second p node is: '/html/body/div/div/p'
The only difference between the two expressions is the node in front of p (its parent): one is ul, the other is div.
So we just replace that parent node with the wildcard *, which matches any element node.
'''
# html.xpath('/html/body/div/ul/p')
# html.xpath('/html/body/div/div/p')
result = html.xpath('/html/body/div/*/p')
print(result)

Output:

[<Element p at 0x1e2c335f880>, <Element p at 0x1e2c335f8c0>]

Of course, you could also fetch these two p nodes with //; the point here is just to demonstrate the usage of *.
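
For comparison, the // route (same result, less typing):

result_p = html.xpath('//p')  # both p nodes, found at any depth
print(result_p)               # [<Element p at 0x...>, <Element p at 0x...>]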

Index positioning

For example, to get the first a node, we can locate it with an index.
Code:

result3 = html.xpath('/html/body/div/ul/li[1]/a') # li[1] is the first li node; note that indexing starts at [1]
print(result3)

Output:

[<Element a at 0x28705120ac0>]

Be careful here: indexing starts at [1], not [0]!
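
XPath also has positional functions worth knowing; standard XPath, shown here against the test file:

result_last = html.xpath('/html/body/div/ul/li[last()]/a/text()')             # the last li -> ['天猫']
result_first_two = html.xpath('/html/body/div/ul/li[position()<3]/a/text()')  # the first two li -> ['百度', '微博']
print(result_last, result_first_two)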

Attribute positioning

For example, to get the a node with id="aa", we can use attribute positioning: nodename[@attrib='value'].

Code:

result4 = html.xpath('//a[@id="aa"]')
print(result4)

Output:

[<Element a at 0x2718236fc00>]
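
Standard XPath also provides functions such as contains() for fuzzy attribute matching (a small sketch against the test file):

result = html.xpath('//a[contains(@href, "qq")]/@href')  # a nodes whose href contains the substring "qq"
print(result)  # ['https://v.qq.com/']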

Fetching text

In XPath, text() extracts the text content from the page.
Code:

result5 = html.xpath('/html/body/div/ul/li[1]/a/text()') # text() gets the text
print(result5)

Output:

['百度']

You can see that the result is a list. If we want the string itself, we can do this:

result6 = html.xpath('/html/body/div/ul/li[1]/a/text()')[0] # take the first element from the list
print(result6)

Output:

百度
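
Note the difference between /text() (text directly under a node) and //text() (all text under the node and its descendants); a quick sketch:

direct = html.xpath('/html/body/div/ul/text()')     # only whitespace sits directly inside ul
all_text = html.xpath('/html/body/div/ul//text()')  # includes the link texts and 'test1', plus whitespace
print([t.strip() for t in all_text if t.strip()])   # ['百度', '微博', '天猫', 'test1']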

Fetching attributes

To get an attribute value, such as the href of the a node with id="bb", we can use the @attribute expression.

Code:

result7 = html.xpath('//a[@id="bb"]/@href')
print(result7)

Output:

['https://v.qq.com/']
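
The same syntax scales up; for example, grabbing every link in the test file at once:

all_links = html.xpath('//a/@href')  # the href attribute of every a node, in document order
print(all_links)
# ['https://www.baidu.com/', 'https://weibo.com/', 'https://www.tmall.com/',
#  'https://www.iqiyi.com/', 'https://v.qq.com/']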

Tips for writing XPath expressions

When writing crawlers we often run into fairly complex page code, which makes XPath expressions hard to write by hand. Here is a lazy trick: let the browser copy the expression for you.

Learn to be lazy

Specific steps: press F12 in the browser to open the developer tools -> select the desired content with the arrow tool in the upper-left corner -> right-click the corresponding code -> Copy -> Copy XPath.

The copied result:
//*[@id="js_top_news"]/div[2]/h2/a

In this way we can easily get the XPath expression we need.
Of course, we should not rely on copying alone; we still need to genuinely understand the grammar of XPath expressions. And since XPath parsing has its limits in Python crawlers, there are cases where we cannot use it at all.

Limitations of XPath parsing

If the page's data is loaded dynamically via Ajax, we cannot extract it from the page source with XPath expressions.
A simple way to check: right-click the page -> View Page Source -> Ctrl+F to search for the information you want -> if the search comes up empty, XPath on the page source will not find it either.
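
The same check can be done in code (a minimal sketch; the URL and keyword are placeholders for whatever you are after):

import requests

resp = requests.get('https://example.com')    # placeholder URL
keyword = 'text you saw in the browser'       # placeholder keyword
if keyword in resp.text:
    print('Found in the raw source: XPath parsing should work')
else:
    print('Not in the raw source: the data is probably loaded via Ajax/JS')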

A pitfall guide for writing XPath expressions

Sometimes, whether we copy an XPath expression or write one by hand, the extraction returns an empty list even though repeated checks show the expression is "right". What is going on?
Most likely, the XPath expression was written against the code displayed in the developer tools instead of against the page source. The developer tools show the live DOM (for example, after JavaScript has loaded extra data), while the page source we fetch may not match it.
A classic example: browsers insert a tbody element into tables when building the DOM, so a copied expression like //table/tbody/tr can return an empty list against raw source that contains no tbody; deleting /tbody from the expression fixes it.

In general, we can draft XPath expressions from the code shown in the developer tools, but we must cross-check against the page source, and the page source prevails!
Say the important thing three times: the page source prevails! The page source prevails! The page source prevails!

Crawler in action

Now for the hands-on part. Our goal is to crawl the meme images and their caption text from the DouTu site, using each caption as the image's file name.

# import the necessary libraries
import requests
from lxml import etree
import time
import re
import os

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Referer": "https://www.doutub.com/"  # anti-hotlink check: tells the server where this request came from; see Figures 1 and 2 below
}
num = int(input("How many pages do you want to crawl: "))
if not os.path.exists("images"):
    os.mkdir("images")  # create the images folder if it does not exist
for n in range(num):  # loop over the requested pages
    url = f'https://www.doutub.com/img_lists/new/{n + 1}'
    # the page number is appended to the URL, e.g.
    # https://www.doutub.com/img_lists/new/1 (page 1), https://www.doutub.com/img_lists/new/2 (page 2), etc.
    resp = requests.get(url, headers=headers)
    html = etree.HTML(resp.text)
    divs = html.xpath("//div[@class='cell']")[0:50]
    # divs is a list; the slice drops the useless 51st div (see Figure 3)

    for div in divs:
        imgSrc = div.xpath("./a/img/@data-src")[0]
        word = div.xpath("./a/span/text()")[0].strip()
        name = re.sub(r'[\\/:*?"<>|]', '', word)  # strip characters that are invalid in file names (see Figure 4)
        img_type = imgSrc.split(".")[-1]  # some images are jpg, some are gif; take the extension from the URL
        # download the image
        img_resp = requests.get(imgSrc, headers=headers)
        with open("images/" + name + "." + img_type, mode="wb") as f:
            f.write(img_resp.content)
        print(name + "." + img_type, "downloaded")
        time.sleep(0.3)  # sleep 0.3 s so frequent requests don't get the IP banned
    print(f"\nPage {n + 1} downloaded!\n")
print("All downloads finished!")

Figure 1: the anti-hotlinking problem (screenshot omitted)

Figure 2: solving the anti-hotlinking problem (screenshot omitted)

Figure 3: (screenshot omitted)

Figure 4: (screenshot omitted)

Final effect: the images saved under images/ with their captions as file names (screenshot omitted)

You're done!

Originally published at blog.csdn.net/weixin_58667126/article/details/126105955