Python crawler learning 27

Today we continue learning XPath, picking up where the previous article left off.

5. Using XPath (Part 2)

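If you are unfamiliar with XPath, review the basic rules first: nodename selects nodes with that name, / selects direct children of the current node, // selects descendants of the current node, . selects the current node, .. selects the parent of the current node, and @ selects attributes.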
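The basic rules can be tried out on a tiny document. The following is only a sketch: the HTML snippet and the variable names are invented for illustration and do not come from the files used in this series.

```python
from lxml import etree

# Hypothetical snippet, invented for illustration
SAMPLE_HTML = '<div id="box"><p class="note">hello</p></div>'
html = etree.HTML(SAMPLE_HTML)

p = html.xpath('//p')[0]                    # //  descendants, at any depth
child_text = html.xpath('//div/p')[0].text  # /   direct child of div
current_tag = p.xpath('.')[0].tag           # .   the current node
parent_tag = p.xpath('..')[0].tag           # ..  the parent of the current node
class_values = p.xpath('@class')            # @   an attribute of the current node

print(child_text, current_tag, parent_tag, class_values)
```

Note that xpath() can be called on any element, not only on the document root; relative expressions such as . and .. are then evaluated from that element.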

5-5 Child Nodes

Use / to find the direct child nodes of an element, and // to find all of its descendant nodes.

from lxml import etree

html = etree.parse('./text.html', etree.HTMLParser())
# Select all a nodes that are direct children of li nodes
result = html.xpath('//li/a')
print(result)

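Since text.html itself is not shown in this post, here is a self-contained sketch of the same query against an inline string. The HTML content and the names SAMPLE_HTML and links are assumptions made up for this example.

```python
from lxml import etree

# Hypothetical stand-in for text.html (the real file is not shown in this post)
SAMPLE_HTML = """
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
  </ul>
</div>
"""

html = etree.HTML(SAMPLE_HTML)
result = html.xpath('//li/a')

# xpath() returns a list of Element objects; read .tag, .get() or .text to inspect them
links = [(a.tag, a.get('href'), a.text) for a in result]
print(links)
```

Printing the raw result list only shows Element objects and their memory addresses, so pulling out the tag, attributes, and text like this is usually more informative.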

The example above finds only the direct child nodes of a given node. To find all descendant nodes within a node, use //:

from lxml import etree

html = etree.parse('./text.html', etree.HTMLParser())
result = html.xpath('//ul//a')
print(result)


For the HTML we chose, the two results happen to be the same, but the two expressions mean quite different things: //li/a matches only a nodes that are direct children of an li node, while //ul//a matches a nodes at any depth inside a ul node.
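To make the difference visible, here is a small sketch with HTML where the two expressions do not agree. The snippet is hypothetical: one link is wrapped in a span, so it is a descendant of li but not a direct child.

```python
from lxml import etree

# Hypothetical HTML: the second link sits inside a span, one level deeper
SAMPLE_HTML = """
<ul>
  <li><a href="link1.html">direct child</a></li>
  <li><span><a href="link2.html">nested deeper</a></span></li>
</ul>
"""

html = etree.HTML(SAMPLE_HTML)

# / only steps down one level, so the nested link is skipped
direct = [a.get('href') for a in html.xpath('//li/a')]
# // searches every level below ul, so both links are found
descendants = [a.get('href') for a in html.xpath('//ul//a')]

print(direct)       # ['link1.html']
print(descendants)  # ['link1.html', 'link2.html']
```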

Next, think about this: if we write the XPath as '//ul/a', what will it match?

from lxml import etree

html = etree.parse('./text.html', etree.HTMLParser())
result = html.xpath('//ul/a')
print(result)

In XPath, a single / selects only direct children, so //ul/a matches a nodes that are direct children of a ul node. In our HTML every a node sits inside an li node, so nothing matches and the result is an empty list.
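A minimal sketch of this case, again using an invented one-line HTML string rather than the real text.html:

```python
from lxml import etree

# Hypothetical HTML: the a node is a grandchild of ul, not a direct child
SAMPLE_HTML = "<ul><li><a href='link1.html'>first item</a></li></ul>"
html = etree.HTML(SAMPLE_HTML)

direct_under_ul = html.xpath('//ul/a')     # no a node directly under ul
full_path = html.xpath('//ul/li/a')        # spelling out every level works

print(direct_under_ul)  # []
print(len(full_path))   # 1
```

An empty list is the normal "no match" result of xpath(); it is not an error, so it is worth checking for it before indexing into the result.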

5-6 Parent Node

Now that we know how to query child nodes and descendant nodes, how do we query a parent node?

This time I switched to a different piece of HTML (python.html), in which we want to find a div node.

Suppose we have selected the h1 node whose class attribute is baikeLogo and want the id attribute of its parent node. We can use .. to step up to the parent:

from lxml import etree

html = etree.parse('./python.html', etree.HTMLParser())
result = html.xpath('//h1[@class="baikeLogo"]/../@id')
print(result)

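Here is a self-contained sketch of the same idea. The HTML is a made-up stand-in for python.html, and the id value "header" is an assumption chosen only for the example.

```python
from lxml import etree

# Hypothetical stand-in for python.html; the id value "header" is invented
SAMPLE_HTML = """
<div id="header">
  <h1 class="baikeLogo">Python</h1>
</div>
"""

html = etree.HTML(SAMPLE_HTML)

# .. steps up from h1 to its parent, then @id reads the parent's id attribute
parent_id = html.xpath('//h1[@class="baikeLogo"]/../@id')
print(parent_id)  # ['header']

# The same parent can also be reached from the Element object with getparent()
h1 = html.xpath('//h1[@class="baikeLogo"]')[0]
parent_tag = h1.getparent().tag
print(parent_tag)  # div
```

Selecting an attribute with @ returns a list of strings rather than a list of Element objects.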

You can also get the parent node through the parent:: axis:

# parent::
from lxml import etree

html = etree.parse('./python.html', etree.HTMLParser())
result = html.xpath('//h1[@class="baikeLogo"]/parent::*/@id')
print(result)

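One practical difference is that the parent:: axis accepts a node test, so the parent can be filtered by tag name. A sketch, using the same kind of invented HTML (the id value "header" is hypothetical):

```python
from lxml import etree

# Hypothetical stand-in for python.html; the id value "header" is invented
SAMPLE_HTML = '<div id="header"><h1 class="baikeLogo">Python</h1></div>'
html = etree.HTML(SAMPLE_HTML)

# parent::* accepts a parent element with any tag name
any_parent = html.xpath('//h1[@class="baikeLogo"]/parent::*/@id')
# parent::div only matches if the parent is a div
div_parent = html.xpath('//h1[@class="baikeLogo"]/parent::div/@id')
# parent::section finds nothing here, because the parent is not a section
section_parent = html.xpath('//h1[@class="baikeLogo"]/parent::section/@id')

print(any_parent)      # ['header']
print(div_parent)      # ['header']
print(section_parent)  # []
```

In XPath, .. is simply the abbreviated form of parent::node(), which is why the two approaches return the same value for the same document.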

5-7 Attribute Matching

To match nodes that have a specific attribute value, use @ inside square brackets, as we already saw in the previous example.

For example, suppose we want to match all li nodes whose class attribute is li:

# Attribute matching

from lxml import etree

html = etree.parse('./python.html', etree.HTMLParser())
result = html.xpath('//li[@class="li"]')
print(result)

Result: the HTML contains three li nodes whose class attribute is li, and the query also returns three results.
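A point worth keeping in mind is that [@class="li"] compares the whole attribute value as a string. The sketch below uses invented HTML (not the real python.html) to show that an li node whose class is "li active" is not matched.

```python
from lxml import etree

# Hypothetical HTML, invented to show exact attribute matching
SAMPLE_HTML = """
<ul>
  <li class="li">one</li>
  <li class="li">two</li>
  <li class="li">three</li>
  <li class="li active">four</li>
  <li class="other">five</li>
</ul>
"""

html = etree.HTML(SAMPLE_HTML)

# Only nodes whose class attribute is exactly the string "li" are returned
matched = html.xpath('//li[@class="li"]')
matched_text = [li.text for li in matched]

print(len(matched))   # 3
print(matched_text)   # ['one', 'two', 'three']
```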

So the question is: are the three nodes returned to us really the three we want?

That's all for today. I'll leave that question open until next time.

Origin blog.csdn.net/szshiquan/article/details/123999302