Python crawler learning 29

5. Using XPath (Part 4)

5-11 Multi-attribute matching

Earlier we learned how to match a node whose attribute has multiple values. How do we match a node on several attributes at once?

This is done with the and operator.

For example, suppose we slightly modify the li node in the HTML so that it has both a class attribute and a name attribute.

Now we want to extract the text of the a node under the li node that matches on both the class and name attributes:

from lxml import etree

html = etree.parse('./python.html', etree.HTMLParser())
# Use the and operator to join the two attribute conditions
result = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')
print(result)

Output: the text of the matching a node has been extracted.

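Since the contents of python.html are not shown, the following is a minimal self-contained sketch of the same idea. The HTML string, its attribute values, and the variable names are hypothetical, chosen only for illustration; etree.HTML parses a string instead of a file.

```python
from lxml import etree

# Hypothetical HTML (an assumption, not the original python.html):
# only the first li satisfies both conditions.
SAMPLE_HTML = """
<div>
  <ul>
    <li class="li li-first" name="item"><a href="link1.html">first item</a></li>
    <li class="li" name="other"><a href="link2.html">second item</a></li>
    <li class="other" name="item"><a href="link3.html">third item</a></li>
  </ul>
</div>
"""

html = etree.HTML(SAMPLE_HTML)

# Both conditions must hold: class contains "li" AND name equals "item".
both = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(both)        # ['first item']

# For comparison, each condition on its own matches more nodes.
class_only = html.xpath('//li[contains(@class, "li")]/a/text()')
print(class_only)  # ['first item', 'second item']

name_only = html.xpath('//li[@name="item"]/a/text()')
print(name_only)   # ['first item', 'third item']
```

Note that contains() is a substring test, so it would also match a class value such as "list"; that is acceptable for a sketch but worth keeping in mind on real pages.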

Operators in XPath

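Besides and, XPath 1.0 provides operators such as or, =, !=, <, <=, >, >=, +, -, *, div, mod, and | (union). The sketch below tries a few of them with lxml. The HTML string and variable names are hypothetical and only serve as an illustration.

```python
from lxml import etree

# Hypothetical HTML with four li nodes under one ul (an assumption).
SAMPLE_HTML = """
<ul>
  <li class="item-0" name="a"><a href="link1.html">first</a></li>
  <li class="item-1" name="b"><a href="link2.html">second</a></li>
  <li class="item-0" name="c"><a href="link3.html">third</a></li>
  <li class="item-1" name="d"><a href="link4.html">fourth</a></li>
</ul>
"""

html = etree.HTML(SAMPLE_HTML)

# or: either condition may hold
either = html.xpath('//li[@name="a" or @name="d"]/a/text()')
print(either)     # ['first', 'fourth']

# !=: class attribute is not "item-0"
not_item0 = html.xpath('//li[@class!="item-0"]/a/text()')
print(not_item0)  # ['second', 'fourth']

# mod: li nodes at odd positions
odd = html.xpath('//li[position() mod 2 = 1]/a/text()')
print(odd)        # ['first', 'third']

# |: union of two node sets, returned in document order
union = html.xpath('//li[1]/a/text() | //li[last()]/a/text()')
print(union)      # ['first', 'fourth']

# +: arithmetic on a number; XPath numbers come back as Python floats
total = html.xpath('count(//li) + 1')
print(total)      # 5.0
```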

5-12 Select in order

When we selected nodes earlier, a query often returned many matching results. To pick out just one or a few of them, we can filter by position inside the square brackets.

For example, suppose the document still contains three li nodes.

# Select in order

from lxml import etree

html = etree.parse('./python.html', etree.HTMLParser())
# Select the first li. Note that XPath positions start at 1, unlike Python indexes, which start at 0
# (positions are counted among the li children of each parent)
result0 = html.xpath('//li[1]/a/text()')
print(result0)
# Select the last li
result1 = html.xpath('//li[last()]/a/text()')
print(result1)
# Select the li nodes whose position is <= 2 (the first two)
result2 = html.xpath('//li[position()<=2]/a/text()')
print(result2)
# Select the second-to-last li
result3 = html.xpath('//li[last()-1]/a/text()')
print(result3)
# Select all li nodes, for comparison
result = html.xpath('//li/a/text()')
print(result)

The output depends on the contents of python.html: each call prints a list of the matching text strings.

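To make the positional predicates concrete, here is a self-contained sketch that parses a small hypothetical HTML string with etree.HTML instead of reading python.html. The markup and variable names are assumptions for illustration only.

```python
from lxml import etree

# Hypothetical HTML with three li nodes under one ul (an assumption).
SAMPLE_HTML = """
<ul>
  <li><a href="link1.html">first</a></li>
  <li><a href="link2.html">second</a></li>
  <li><a href="link3.html">third</a></li>
</ul>
"""

html = etree.HTML(SAMPLE_HTML)

all_items = html.xpath('//li/a/text()')
print(all_items)     # ['first', 'second', 'third']

first = html.xpath('//li[1]/a/text()')
print(first)         # ['first']

last = html.xpath('//li[last()]/a/text()')
print(last)          # ['third']

first_two = html.xpath('//li[position()<=2]/a/text()')
print(first_two)     # ['first', 'second']

second_last = html.xpath('//li[last()-1]/a/text()')
print(second_last)   # ['second']

# XPath position 1 corresponds to Python index 0.
print(first[0] == all_items[0])  # True
```

Every query returns a list, even when a single node matches, so take element 0 when only one string is wanted.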

5-14 Node axis selection

Some common ways to use node axes:

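# Node axis selection

from lxml import etree as e

html = e.parse('./python.html', e.HTMLParser())
# Select all ancestor nodes of the first li node
result = html.xpath('//li[1]/ancestor::*')
print(result)
# Select a specific ancestor of the first li node (put the node name after ::)
result = html.xpath('//li[1]/ancestor::div')
print(result)
# Select all attribute values of the first li node (here its class attribute is known to be "li")
result = html.xpath('//li[1]/attribute::*')
print(result)
# Select the direct child a nodes of the first li node whose target attribute is "_blank"
result = html.xpath('//li[1]/child::a[@target="_blank"]')
print(result)
# Select all descendant a nodes of the first ul node
result = html.xpath('//ul[1]/descendant::a')
print(result)
# Select all nodes that come after the first ul node in document order (not its descendants)
result = html.xpath('//ul[1]/following::*')
print(result)
# Select all following sibling nodes of the first li node
result = html.xpath('//li[1]/following-sibling::*')
print(result)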

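The output depends on the contents of python.html: most calls print a list of Element objects, and the attribute axis call prints a list of attribute values.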

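The axes are easier to compare on a document whose structure is known. The sketch below uses a small hypothetical HTML string (not the original python.html) and prints tag names rather than Element objects. Note that etree.HTML wraps the fragment in html and body elements, which then show up as ancestors.

```python
from lxml import etree

# Hypothetical HTML (an assumption): a div holding a ul with three li
# nodes, followed by a p element.
SAMPLE_HTML = """
<div id="container">
  <ul>
    <li class="li"><a href="link1.html" target="_blank">first</a></li>
    <li class="li"><a href="link2.html">second</a></li>
    <li class="li"><a href="link3.html">third</a></li>
  </ul>
  <p>footer</p>
</div>
"""

html = etree.HTML(SAMPLE_HTML)

# ancestor: every ancestor element of the first li
ancestors = [el.tag for el in html.xpath('//li[1]/ancestor::*')]
print(ancestors)    # ['html', 'body', 'div', 'ul']

# attribute: attribute values of the first li
attributes = html.xpath('//li[1]/attribute::*')
print(attributes)   # ['li']

# child: direct a children with target="_blank"
children = [el.text for el in html.xpath('//li[1]/child::a[@target="_blank"]')]
print(children)     # ['first']

# descendant: every a element anywhere below the first ul
descendants = [el.text for el in html.xpath('//ul[1]/descendant::a')]
print(descendants)  # ['first', 'second', 'third']

# following: elements after the ul's closing tag, not its descendants
following = [el.tag for el in html.xpath('//ul[1]/following::*')]
print(following)    # ['p']

# following-sibling: later siblings of the first li
siblings = [el.tag for el in html.xpath('//li[1]/following-sibling::*')]
print(siblings)     # ['li', 'li']
```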

At this point we have covered the basic use of XPath.

That's all for today. To be continued!

Origin blog.csdn.net/szshiquan/article/details/124106791