Python crawler learning 27
Today we continue learning XPath.
This picks up where the previous article left off.
5. Using XPath (Part 2)
If the syntax is unfamiliar, review the basic XPath rules first:
5-5 Child Nodes
Use / to select the direct child nodes of an element, and // to select all of its descendant nodes.
from lxml import etree
html = etree.parse('./text.html', etree.HTMLParser())
# Select every a node that is a direct child of an li node
result = html.xpath('//li/a')
print(result)
Output:
The example above finds only the direct child nodes of a specific node. To find all descendant nodes within a node, use // instead:
from lxml import etree
html = etree.parse('./text.html', etree.HTMLParser())
result = html.xpath('//ul//a')
print(result)
Output:
For the HTML used here the result happens to be the same, but the two expressions mean completely different things.
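To see the two expressions diverge, here is a minimal sketch on a hypothetical snippet (not the article's text.html) in which one link is wrapped in an extra span. The names SNIPPET, direct, and descendants are made up for illustration.

```python
from lxml import etree

# Hypothetical markup: the first a is a direct child of li,
# the second a is nested one level deeper inside a span.
SNIPPET = """
<ul>
  <li><a href="link1.html">first</a></li>
  <li><span><a href="link2.html">second</a></span></li>
</ul>
"""

doc = etree.HTML(SNIPPET)

# A single / requires a to be a direct child of li.
direct = doc.xpath('//li/a')
# A double // accepts a at any depth below ul.
descendants = doc.xpath('//ul//a')

print([a.get('href') for a in direct])       # ['link1.html']
print([a.get('href') for a in descendants])  # ['link1.html', 'link2.html']
```

On this snippet //li/a misses the second link because the span sits between li and a, while //ul//a still finds both.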
Next, think about this: if we write the XPath as follows, what will it match?
'//ul/a'
from lxml import etree
html = etree.parse('./text.html', etree.HTMLParser())
result = html.xpath('//ul/a')
print(result)
In XPath, a single / selects only direct children, so //ul/a matches only a nodes that are direct children of a ul node. The result is of course:
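As a quick check, here is a minimal sketch on a hypothetical one-item list, assuming (as in typical list markup) that every a is wrapped in an li. Under that assumption, //ul/a matches nothing, and spelling out the li step finds the link again.

```python
from lxml import etree

# Hypothetical markup, assumed to resemble text.html: the a sits inside an li.
doc = etree.HTML('<ul><li><a href="link1.html">first</a></li></ul>')

# a is a grandchild of ul here, so the direct-child step finds nothing.
no_match = doc.xpath('//ul/a')
# Naming the intermediate li step reaches the link.
one_match = doc.xpath('//ul/li/a')

print(no_match)        # []
print(len(one_match))  # 1
```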
5-6 Parent Node
Now that we know how to query child and descendant nodes, how do we query parent nodes?
This time I switched to a different piece of HTML. For example, suppose we want to find the div node here:
If we first select the h1 node whose class attribute is baikeLogo and then want the id attribute of its parent node, we can use .. like this:
from lxml import etree
html = etree.parse('./python.html', etree.HTMLParser())
result = html.xpath('//h1[@class="baikeLogo"]/../@id')
print(result)
Output:
You can also get the parent node through the parent:: axis:
# parent::
from lxml import etree
html = etree.parse('./python.html', etree.HTMLParser())
result = html.xpath('//h1[@class="baikeLogo"]/parent::*/@id')
print(result)
Output:
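The two forms can be compared side by side. This is a minimal sketch on a hypothetical one-line snippet; the id value wrapper is made up and not taken from python.html. It also shows lxml's getparent() method, which this article does not otherwise use.

```python
from lxml import etree

# Hypothetical markup: an h1 with class baikeLogo inside a div with an id.
doc = etree.HTML('<div id="wrapper"><h1 class="baikeLogo">logo</h1></div>')

# Abbreviated syntax: .. steps up to the parent node.
via_dots = doc.xpath('//h1[@class="baikeLogo"]/../@id')
# Full axis syntax: parent::* selects the parent element, whatever its tag.
via_axis = doc.xpath('//h1[@class="baikeLogo"]/parent::*/@id')

# Outside XPath, an lxml element can also return its parent directly.
h1 = doc.xpath('//h1[@class="baikeLogo"]')[0]
via_api = h1.getparent().get('id')

print(via_dots, via_axis, via_api)
```

The two XPath queries return a list of attribute values, while getparent().get('id') returns a single string.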
5-7 Attribute Matching
To match a node with a specific attribute, use @, which we already saw in the previous example.
For example, here we want to match all li nodes whose class attribute is li:
# Attribute matching
from lxml import etree
html = etree.parse('./python.html', etree.HTMLParser())
result = html.xpath('//li[@class="li"]')
print(result)
Output: the screenshot shows that there should be three matches, and the query does return three results.
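The same predicate can be tried without python.html. This is a minimal sketch on a hypothetical list in which three of four li nodes carry class="li"; the snippet and the name matched are made up for illustration.

```python
from lxml import etree

# Hypothetical markup: three li nodes have class "li", one has class "other".
SNIPPET = """
<ul>
  <li class="li">one</li>
  <li class="li">two</li>
  <li class="other">three</li>
  <li class="li">four</li>
</ul>
"""

doc = etree.HTML(SNIPPET)

# [@class="li"] keeps only li nodes whose class attribute equals "li" exactly.
matched = doc.xpath('//li[@class="li"]')

print(len(matched))                  # 3
print([li.text for li in matched])   # ['one', 'two', 'four']
```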
So the question is: are the three results returned to us the three we actually want?
That's it for today. I'll keep you in suspense for now...