Python: Introduction to XPath and Examples


Foreword

Many experienced developers on CSDN have already published good XPath tutorials. I have only just started learning web crawling and am still shaky on these basic but important concepts, so I wrote this article to reinforce my own understanding. It is a brief introduction to XPath and covers only basic usage.

1. Introduction to XPath

XPath (XML Path Language) is a language for addressing parts of an XML document.
XPath treats the document as a tree of nodes and lets you find nodes in that tree. Many developers fondly describe XPath as a small query language.

2. XPath syntax rules

XPath uses path expressions to select nodes in an XML (or HTML) document, which is how we locate elements. Let's start with the syntax rules below.

Syntax rules

Expression    Effect
nodename      Selects the child elements of the current node that are named nodename
/             Selects from the root node; between two steps it means one level down
//            Selects matching nodes at any depth, wherever they are in the document
.             Selects the current node
..            Selects the parent of the current node (like the parent directory)
@             Selects attributes
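
To make the rules above concrete, here is a minimal sketch that applies each expression to a small document. The markup and the example.com URLs are invented for illustration; only lxml's etree.HTML and xpath calls are used.

```python
from lxml import etree

# Hypothetical sample markup, invented for this illustration only.
html = """
<html>
  <body>
    <div id="main">
      <a href="https://example.com/a">first</a>
      <p><a href="https://example.com/b">second</a></p>
    </div>
  </body>
</html>
"""
root = etree.HTML(html)  # parse the string and return the <html> element

# "/" walks one level at a time from the root: only the <a> directly under <div>.
direct_links = [a.text for a in root.xpath('/html/body/div/a')]

# "//" matches at any depth: both <a> elements.
all_links = [a.text for a in root.xpath('//a')]

# "@" selects attributes instead of elements.
hrefs = root.xpath('//a/@href')

# "." is the current node and ".." is its parent.
div = root.xpath('//div')[0]
current_tag = div.xpath('.')[0].tag
parent_tag = div.xpath('..')[0].tag

# A bare node name selects children of the current node with that name.
child_links = [a.text for a in div.xpath('a')]

print(direct_links)              # ['first']
print(all_links)                 # ['first', 'second']
print(hrefs)                     # ['https://example.com/a', 'https://example.com/b']
print(current_tag, parent_tag)   # div body
print(child_links)               # ['first']
```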

Tag positioning

Expression        Effect
/html/body/div    Starts from the root node; each / between two tags means one level
/html//div        The // between two tags spans any number of levels (any div found anywhere under html)
//div             Searches the whole document, i.e. finds all div tags
./div             Searches for div tags directly under the current tag
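
The difference between these four forms is easiest to see on nested tags. The sketch below uses an invented document with one nested div and one sibling div; the id values and the ids helper are hypothetical and exist only to make the results readable.

```python
from lxml import etree

# Hypothetical markup: one nested <div> and one sibling <div>.
root = etree.HTML("""
<html><body>
  <div id="outer"><div id="inner">x</div></div>
  <div id="second">y</div>
</body></html>
""")


def ids(nodes):
    """Collect the id attribute of each element for easy comparison."""
    return [node.get('id') for node in nodes]


# One level per "/": only the divs that are direct children of <body>.
by_levels = ids(root.xpath('/html/body/div'))

# "//" between tags: every div at any depth under <html>.
any_depth = ids(root.xpath('/html//div'))

# Leading "//": every div in the document.
everywhere = ids(root.xpath('//div'))

# "./div": div children of the current node, here the outer div.
outer = root.xpath('//div[@id="outer"]')[0]
relative = ids(outer.xpath('./div'))

print(by_levels)   # ['outer', 'second']
print(any_depth)   # ['outer', 'inner', 'second']
print(everywhere)  # ['outer', 'inner', 'second']
print(relative)    # ['inner']
```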

Attribute targeting

Need: locate the div tag whose href attribute has the value 'www.baidu.com' (href is the attribute name, 'www.baidu.com' is the attribute value).
Format: tag[@attribute-name='attribute-value']
Example: /html/body/div[@href='www.baidu.com']
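
The @ sign in the predicate matters. As a rough illustration, the sketch below runs the predicate with and without @ on an invented snippet (the a tags and their href values are assumptions, not taken from the practice page).

```python
from lxml import etree

# Hypothetical markup with two links and one <a> that has no href.
root = etree.HTML("""
<div class="links">
  <a href="www.baidu.com">Baidu</a>
  <a href="www.google.com">Google</a>
  <a>no link</a>
</div>
""")

# [@href='...'] keeps only elements whose href attribute equals the value.
exact = root.xpath("//a[@href='www.baidu.com']/text()")

# [@href] keeps every element that has an href attribute at all.
has_href = root.xpath("//a[@href]/text()")

# Without "@", href is read as a child element name, so nothing matches.
without_at = root.xpath("//a[href='www.baidu.com']")

print(exact)       # ['Baidu']
print(has_href)    # ['Baidu', 'Google']
print(without_at)  # []
```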

Index positioning

Need: locate the second li tag under ul.
Format: //ul/li[2]
The index starts at 1, not 0.
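
One detail worth seeing in code: the index is counted among the li tags of each ul separately, so on a page with several lists //ul/li[2] can return more than one element. A minimal sketch on invented markup:

```python
from lxml import etree

# Hypothetical markup with two separate lists.
root = etree.HTML("""
<ul><li>a</li><li>b</li><li>c</li></ul>
<ul><li>d</li><li>e</li></ul>
""")

# XPath indexes start at 1: li[1] is the first item of each list.
first_items = root.xpath('//ul/li[1]/text()')

# li[2] is the second item of EACH ul, so two lists give two results.
second_items = root.xpath('//ul/li[2]/text()')

# Parentheses index the combined result instead: the second li overall.
second_overall = root.xpath('(//ul/li)[2]/text()')

# last() selects the final item of each list.
last_items = root.xpath('//ul/li[last()]/text()')

print(first_items)     # ['a', 'd']
print(second_items)    # ['b', 'e']
print(second_overall)  # ['b']
print(last_items)      # ['c', 'e']
```

Note that the Python list returned by xpath() is still zero-based, which is why the practice code below combines li[2] in the expression with [0] on the result.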

Getting text content

Expression    Effect
/text()       Gets only the text directly inside the tag (not text inside nested tags)
//text()      Gets all the text inside the tag, including nested tags, as separate pieces
string()      Gets all the text inside the tag, joined into a single string
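
The three forms differ most when a tag contains nested tags. The following sketch compares them on a one-line invented snippet:

```python
from lxml import etree

# Hypothetical markup: text both directly in <div> and inside a nested <b>.
root = etree.HTML('<div>Hello <b>bold</b> world</div>')

# /text(): only the text nodes that are direct children of <div>.
direct_text = root.xpath('//div/text()')

# //text(): every text node under <div>, nested ones included, as a list.
all_text = root.xpath('//div//text()')

# string(): the same text concatenated into one string (not a list).
joined = root.xpath('string(//div)')

print(direct_text)  # ['Hello ', ' world']
print(all_text)     # ['Hello ', 'bold', ' world']
print(joined)       # Hello bold world
```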

Getting the XPath of an element on a web page is easy: find the tag in the browser's developer tools, right-click it, and copy its XPath.

3. Practicing the syntax rules

Next, let's practice on a local HTML file to deepen our understanding. The page has a fairly simple structure, which makes it a good place to start.
Task: be able to locate any element on the page.

Preparation

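# Import the required package
from lxml import etree
# Load the local HTML source into an element tree.
# HTMLParser tolerates markup that is not well-formed XML.
tree = etree.parse('test.html', etree.HTMLParser())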
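
The contents of test.html are not reproduced in the text. To try the steps below without it, here is a sketch of a stand-in page that is merely consistent with the five tasks. Every tag, URL, and the "placeholder" div are assumptions, and the file name test_sample.html is hypothetical; the real page may differ.

```python
import tempfile
from pathlib import Path

from lxml import etree

# Hypothetical page contents: the real test.html is not shown in the text,
# so every tag, URL, and placeholder below is an assumption.
SAMPLE = """<html>
<body>
  <div>placeholder</div>
  <div>Henan</div>
  <ul>
    <li><a href="https://www.baidu.com">Baidu</a></li>
    <li><a href="https://www.google.com">Google</a></li>
    <li><a href="https://www.sogou.com">Sogou</a></li>
  </ul>
  <ol>
    <li><a href="beijing.html">Beijing</a></li>
    <li><a href="shanghai.html">Shanghai</a></li>
    <li><a href="tianjin.html">Tianjin</a></li>
  </ol>
</body>
</html>
"""

# Write the sample to a temporary folder and parse it like a local file.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / 'test_sample.html'
    path.write_text(SAMPLE, encoding='utf-8')
    tree = etree.parse(str(path), etree.HTMLParser())

names = tree.xpath('/html/body/ul/li/a/text()')
province = tree.xpath('/html/body/div[2]/text()')
city_links = tree.xpath('//ol/li/a/@href')

print(names)       # ['Baidu', 'Google', 'Sogou']
print(province)    # ['Henan']
print(city_links)  # ['beijing.html', 'shanghai.html', 'tianjin.html']
```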

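1. Get the Baidu, Google, and Sogou text content

# Call the xpath method with a tag-positioning expression
# xpath() returns a list of strings; ' '.join combines them into one space-separated string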
text = ' '.join(tree.xpath('/html/body/ul/li/a/text()'))
print(text)

2. Get only Google

text1 = tree.xpath("//ul/li[2]/a/text()")[0]
print(text1)
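
Indexing the result with [0] raises IndexError when the expression matches nothing. One possible guard is a small helper like the one sketched below; first_or_default is a hypothetical name, not part of lxml, and it assumes the expression returns a list (not a string() call).

```python
from lxml import etree


def first_or_default(node, expression, default=None):
    """Return the first XPath result, or `default` when nothing matches.

    Hypothetical helper; assumes the expression yields a list of results.
    """
    results = node.xpath(expression)
    return results[0] if results else default


# Hypothetical markup with a single list item.
root = etree.HTML('<ul><li><a href="https://www.baidu.com">Baidu</a></li></ul>')

found = first_or_default(root, '//ul/li[1]/a/text()')
missing = first_or_default(root, '//ul/li[2]/a/text()', default='')

print(found)          # Baidu
print(repr(missing))  # ''
```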

3. Get the href attribute values of Beijing, Shanghai, and Tianjin

text2 = ' '.join(tree.xpath("//ol/li/a/@href"))
print(text2)

4. Get the Henan text

# Get the Henan text
text3 = tree.xpath("/html/body/div[2]/text()")[0]
print(text3)

5. Get Google's href attribute value

text4 = tree.xpath("//ul/li[2]/a/@href")[0]
print(text4)


At this point we can locate any tag on the page, so the task is complete.


Origin blog.csdn.net/xiaobai729/article/details/124079260