Python crawler learning 28

Today we continue learning XPath.

This post picks up where the previous article left off.

Contents

- Python crawler learning 28
  - 5. Using XPath (Part 3)

5. Using XPath (Part 3)

As usual, here is the table of XPath rules:

[image: table of XPath rules]

5-8 Text Acquisition

Yesterday I left a question open, and this section (5-8) answers it. We can use text() to get the text under a node.

The HTML we used yesterday is as follows:

[image: the HTML source used in the previous article]

At that point we had successfully matched the li nodes whose class attribute is "li":

[image: output of the match from the previous article]

So how do we get the text inside them?

# Get text

from lxml import etree

html = etree.parse('./python.html', etree.HTMLParser())
result = html.xpath('//li[@class="li"]//text()')
print(result)

Result: all the text inside the matched nodes is returned as a list.

[image: output screenshot]

And there is the text we were after.

Now that we know how to use text(), think about what the following expression returns:

result = html.xpath('//li[@class="li"]/text()')

from lxml import etree

html = etree.parse('./python.html', etree.HTMLParser())
result = html.xpath('//li[@class="li"]/text()')
print(result)

Result:

[image: output screenshot]

A single / selects only direct children. In this case only the text that sits directly inside the li node is selected; the text inside the li node's child nodes is not matched. With //, the text of all descendant nodes is included as well.
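To make the difference between //text() and /text() concrete, here is a minimal sketch. The markup in SAMPLE is a made-up stand-in (the article's python.html is not reproduced here), and it is parsed from a string with etree.HTML instead of from a file.

```python
from lxml import etree

# Hypothetical sample markup, not the article's python.html.
SAMPLE = '<ul><li class="li">intro <a href="/a">first</a> tail</li></ul>'

root = etree.HTML(SAMPLE)

# //text() collects text from the li node and from all of its descendants.
all_text = root.xpath('//li[@class="li"]//text()')

# /text() collects only the text nodes that are direct children of li.
direct_text = root.xpath('//li[@class="li"]/text()')

print(all_text)     # ['intro ', 'first', ' tail']
print(direct_text)  # ['intro ', ' tail']
```

The word "first" lives inside the a node, so it only shows up when // is used.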

5-9 Attribute Acquisition

We already touched on node attributes when we covered attribute matching. To get an attribute's value, put @ and the attribute name at the end of the path:

# Get attributes

from lxml import etree

html = etree.parse('./python.html', etree.HTMLParser())
# This matches the href attribute of every a node under every li node
# Unlike attribute matching, attribute acquisition does not use []
result = html.xpath('//li//a/@href')
print(result)

Result: all matching attribute values are returned in a list.

[image: output screenshot]
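The contrast between attribute matching and attribute acquisition can be sketched as follows. The markup and the example.com URLs are invented for illustration and are not taken from the article's python.html.

```python
from lxml import etree

# Hypothetical sample markup with placeholder URLs.
SAMPLE = '''
<ul>
  <li class="li"><a href="https://example.com/one">one</a></li>
  <li class="li"><a href="https://example.com/two">two</a></li>
  <li><span>no link</span></li>
</ul>
'''
root = etree.HTML(SAMPLE)

# Attribute matching: [@class="li"] filters nodes and returns elements.
matched = root.xpath('//li[@class="li"]')

# Attribute acquisition: /@href returns the attribute values as strings.
hrefs = root.xpath('//li//a/@href')

print(len(matched))  # 2
print(hrefs)         # ['https://example.com/one', 'https://example.com/two']
```

Matching gives back element objects to work with further, while acquisition gives back plain strings.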

5-10 Attribute multi-value matching

First, be clear about the difference between attribute matching (filtering nodes by an attribute) and attribute acquisition (reading an attribute's value). Sometimes an attribute of a node has more than one value, as here:

[image: the modified HTML]

Suppose we slightly modify the class attribute of the first li node so that it holds two values (li and li-first), and then match with the usual attribute expression:


from lxml import etree

html = etree.parse('./python.html', etree.HTMLParser())
result = html.xpath('//li[@class="li"]')
print(result)

Result: only two nodes are matched, because @class="li" compares against the whole attribute value and the modified node no longer equals "li".

[image: output screenshot]

You may not be satisfied with that and try to force the full attribute value into the expression:

from lxml import etree

html = etree.parse('./python.html', etree.HTMLParser())
result = html.xpath('//li li-first[@class="li"]')
print(result)

Result: the expression turns out to be invalid.

[image: error screenshot]

In this case we need the contains() function:
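For reference, a small sketch of what goes wrong in that attempt, and of what an exact match on the full value looks like. The markup is a hypothetical stand-in for the modified python.html, and catching etree.XPathError (the base class of lxml's XPath errors) is a choice made for this example.

```python
from lxml import etree

# Hypothetical stand-in for the modified HTML.
SAMPLE = ('<ul><li class="li li-first">a</li>'
          '<li class="li">b</li><li class="li">c</li></ul>')
root = etree.HTML(SAMPLE)

# The attempted expression is not valid XPath, so lxml raises an error.
try:
    root.xpath('//li li-first[@class="li"]')
    error_raised = False
except etree.XPathError as exc:
    error_raised = True
    print(type(exc).__name__, exc)

# Writing the whole value inside the quotes is valid and matches exactly.
exact = root.xpath('//li[@class="li li-first"]')
print([li.text for li in exact])  # ['a']
```

An exact match works, but it only finds nodes whose class is precisely that string, which is why contains() is the more convenient tool here.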

from lxml import etree

html = etree.parse('./python.html', etree.HTMLParser())
# This means: a node matches if its class attribute contains "li"
result = html.xpath('//li[contains(@class, "li")]')
print(result)

Result:

[image: output screenshot]

That's all for today. To be continued tomorrow!
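One thing worth knowing is that contains() is a plain substring test, so it also matches class values such as "list-item". The sketch below compares it with an exact match and with a commonly used whole-token idiom built from concat() and normalize-space(). The markup, including the list-item class, is made up for illustration.

```python
from lxml import etree

# Hypothetical sample markup; "list-item" is added to show the substring effect.
SAMPLE = '''
<ul>
  <li class="li li-first">a</li>
  <li class="li">b</li>
  <li class="list-item">c</li>
</ul>
'''
root = etree.HTML(SAMPLE)

# Exact comparison against the whole attribute value.
exact = root.xpath('//li[@class="li"]/text()')

# Substring test: any class value containing the characters "li".
loose = root.xpath('//li[contains(@class, "li")]/text()')

# Whole-token test: pad with spaces so only the separate word "li" matches.
token = root.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/text()'
)

print(exact)  # ['b']
print(loose)  # ['a', 'b', 'c']
print(token)  # ['a', 'b']
```

For pages where no other class name contains "li", the simple contains(@class, "li") form is enough.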