Foreword
Many experienced developers on CSDN have already written excellent posts about XPath. Because I have only just started learning web crawlers and don't yet know these basic but important topics well, I wrote this article to reinforce my own understanding. It is only a brief introduction to XPath and its basic usage.
1. Introduction to XPath
XPath (XML Path Language) is a language for locating parts of an XML document.
XPath is based on the tree structure of XML and provides the ability to find nodes in that tree. Many developers affectionately call XPath a "small query language".
2. XPath grammar rules
XPath uses path expressions to select nodes in an XML document, which is how elements are located. Let's first go through the following grammar rules.
Grammar rules
Expression | Effect |
---|---|
nodename | Select all child nodes of the current node that are named nodename |
/ | Select from the root node; between two steps it descends exactly one level |
// | Select matching nodes anywhere below the current position, regardless of depth |
. | Select the current node |
.. | Select the parent of the current node (like the parent directory in a file path) |
@ | Select attributes |
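The expressions in the table can be tried out with lxml. The following is a minimal sketch; the sample markup and the variable names are made up for illustration and are not part of the original article.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration
html = """
<html>
  <body>
    <div id="main">
      <p>first</p>
      <p>second</p>
    </div>
  </body>
</html>
"""
root = etree.HTML(html)  # parses the string and returns the <html> element

# '/' starts at the root; each further '/' steps down one level
divs = root.xpath('/html/body/div')
# A bare node name selects child nodes with that name
children = divs[0].xpath('p')
# '//' selects matching nodes at any depth
paragraphs = root.xpath('//p')
# '.' is the current node, '..' is its parent
first_p = paragraphs[0]
same_node = first_p.xpath('.')[0]
parent = first_p.xpath('..')[0]
# '@' selects an attribute value
ids = root.xpath('//div/@id')

print(len(divs), len(children), len(paragraphs), parent.tag, ids)
```

Note that `xpath()` always returns a list for node selections, even when only one node matches, which is why the sketch indexes with `[0]`.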
Tag positioning
Expression | Effect |
---|---|
/html/body/div | Search from the root node; each / between two tags represents exactly one level |
/html//div | The // between two tags spans any number of levels (that is, find every div anywhere under html) |
//div | Search from any node, that is, find all div tags in the document |
./div | Search for div tags that are direct children of the current tag |
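To see how these four paths differ, here is a small sketch run against a made-up page with one nested div. The markup and class names are assumptions chosen only to make the difference visible.

```python
from lxml import etree

# Hypothetical page: one div nested inside another, plus a sibling div
html = """
<html><body>
  <div class="outer">
    <div class="inner">nested</div>
  </div>
  <div class="second">sibling</div>
</body></html>
"""
root = etree.HTML(html)

# Only divs that are direct children of body
direct = root.xpath('/html/body/div')
# Every div at any depth under html
under_html = root.xpath('/html//div')
# Every div in the document
anywhere = root.xpath('//div')
# Relative search: divs directly under the "outer" div
outer = direct[0]
relative = outer.xpath('./div')

print([d.get('class') for d in direct])    # ['outer', 'second']
print(len(under_html), len(anywhere))      # 3 3
print([d.get('class') for d in relative])  # ['inner']
```

A relative path such as `./div` is handy once you already hold an element, because the search is restricted to that element's children instead of the whole document.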
Attribute positioning
Need | Format |
---|---|
Locate a tag by one of its attributes (general form) | tag[@attribute_name='attribute_value'] |
Locate the div whose href attribute (the attribute name) equals 'www.baidu.com' (the attribute value) | /html/body/div[@href='www.baidu.com'] |
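As a quick check of the attribute predicate, the sketch below filters div tags by their href attribute. The markup is invented for illustration and simply mirrors the example in the table.

```python
from lxml import etree

# Hypothetical markup mirroring the table's example
html = """
<html><body>
  <div href="www.baidu.com">Baidu block</div>
  <div href="www.google.com">Google block</div>
  <div>No attribute</div>
</body></html>
"""
root = etree.HTML(html)

# The '@' inside the predicate is required; without it, 'href' would be
# read as a child element name and nothing would match
matched = root.xpath("/html/body/div[@href='www.baidu.com']")
# A predicate with only the attribute name matches any div that has it
with_attr = root.xpath("/html/body/div[@href]")

print(matched[0].text, len(with_attr))  # Baidu block 2
```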
Index positioning
Need | Format |
---|---|
Locate the second li tag under ul | //ul/li[2] |
Index of the first element | 1 (XPath indexes start at 1, not 0) |
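Because XPath counts from 1 while Python lists count from 0, the two are easy to mix up. The following sketch, on a made-up three-item list, shows both side by side.

```python
from lxml import etree

# Hypothetical list, used only for illustration
html = "<html><body><ul><li>Baidu</li><li>Google</li><li>Sogou</li></ul></body></html>"
root = etree.HTML(html)

# XPath index: [2] is the second li
second = root.xpath('//ul/li[2]/text()')
# last() is a standard XPath function for the final position
last = root.xpath('//ul/li[last()]/text()')
# Python index: [0] is the first item of the returned list
first_via_python = root.xpath('//ul/li/text()')[0]

print(second, last, first_via_python)  # ['Google'] ['Sogou'] Baidu
```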
Getting text content
Method | Effect |
---|---|
/text() | Get only the text directly inside the tag |
//text() | Get all text inside the tag, including text in nested tags, as a list of separate text nodes |
string() | Get all text inside the tag, including text in nested tags, concatenated into a single string |
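The difference between the three methods shows up as soon as a tag contains nested tags. Here is a minimal sketch on an invented div that holds a bold element.

```python
from lxml import etree

# Hypothetical markup: a div with text around a nested <b> tag
html = "<html><body><div>Hello <b>bold</b> world</div></body></html>"
root = etree.HTML(html)

# Only the text nodes directly inside div (the <b> content is skipped)
direct = root.xpath('//div/text()')
# All text nodes under div, nested ones included, as a list
all_text = root.xpath('//div//text()')
# All text under div joined into one string
joined = root.xpath('string(//div)')

print(direct)    # ['Hello ', ' world']
print(all_text)  # ['Hello ', 'bold', ' world']
print(joined)    # Hello bold world
```

When the argument matches several nodes, `string()` only uses the first one, so `//text()` is the safer choice for collecting text from many tags.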
Getting the XPath of an element on a web page is actually very easy: find the tag in the browser's developer tools, right-click it, and copy its XPath.
3. Practicing the grammar rules
Next, let's practice on a local file to deepen our understanding. The page has a relatively simple structure, so we can use it to learn the basics first.
Task requirement: be able to locate any element on the page.
Preparation
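# Import the required package
from lxml import etree
# Load the local source file into an element tree
# (HTMLParser tolerates HTML that is not well-formed XML)
tree = etree.parse('test.html', etree.HTMLParser())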
1. Get the Baidu, Google, and Sogou text content
# Call the xpath method to locate the tags
# ' '.join concatenates the list of matched strings into one space-separated string
text = ' '.join(tree.xpath('/html/body/ul/li/a/text()'))
print(text)
2. Get only the Google text
text1 = tree.xpath("//ul/li[2]/a/text()")[0]
print(text1)
3. Get the href attribute values of Beijing, Shanghai, and Tianjin
text2 = ' '.join(tree.xpath("//ol/li/a/@href"))
print(text2)
4. Get the Henan text
# Get the Henan text
text3 = tree.xpath("/html/body/div[2]/text()")[0]
print(text3)
5. Get the href attribute value of Google
text4 = tree.xpath("//ul/li[2]/a/@href")[0]
print(text4)
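The contents of test.html are not reproduced in this article, so the five steps cannot be run as-is. The sketch below uses a stand-in page that I made up to fit the queries above (the href values and the first div are pure assumptions) and parses it from a string with etree.HTML instead of reading a file.

```python
from lxml import etree

# Hypothetical stand-in for test.html; the real file may differ
html = """
<html>
  <body>
    <ul>
      <li><a href="https://www.baidu.com">Baidu</a></li>
      <li><a href="https://www.google.com">Google</a></li>
      <li><a href="https://www.sogou.com">Sogou</a></li>
    </ul>
    <ol>
      <li><a href="beijing">Beijing</a></li>
      <li><a href="shanghai">Shanghai</a></li>
      <li><a href="tianjin">Tianjin</a></li>
    </ol>
    <div>Hebei</div>
    <div>Henan</div>
  </body>
</html>
"""
tree = etree.HTML(html)

# 1. Text of every link in the unordered list
text = ' '.join(tree.xpath('/html/body/ul/li/a/text()'))
# 2. Text of the second link only
text1 = tree.xpath("//ul/li[2]/a/text()")[0]
# 3. href values of the links in the ordered list
text2 = ' '.join(tree.xpath("//ol/li/a/@href"))
# 4. Text of the second div under body
text3 = tree.xpath("/html/body/div[2]/text()")[0]
# 5. href value of the second link
text4 = tree.xpath("//ul/li[2]/a/@href")[0]

for value in (text, text1, text2, text3, text4):
    print(value)
```

Indexing with `[0]` raises IndexError when nothing matches, so on a real page it may be worth checking that the returned list is not empty first.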
At this point we can locate any tag we like, which completes the task.