XPath matching: a guide to avoiding pitfalls

This article summarizes some of the XPath pitfalls I run into in day-to-day work, written up so that others do not lose time on the same problems. Please credit the source when reprinting!

Table of Contents

1. XPath query tool

2. A tbody in the path causes an empty match

3. Matching alternating tags

4. Skipping a tag at a given position

5. Items with different tag structures give inconsistent match counts

6. To be updated from time to time...


1. XPath query tool

Chrome has an extension called "XPath Helper". Once it is installed, press Ctrl+Shift+X to open or close it. It is a convenient way to check whether an XPath expression is written correctly.

2. A tbody in the path causes an empty match

When the path contains tbody, the query may return an empty list. The usual reason is that the browser inserts tbody elements into tables while building the page's DOM, even when the raw HTML does not contain them. An XPath copied straight from the browser therefore includes a tbody step that does not exist in the HTML your Python code downloads, so nothing matches. For example, both of the following paths contain tbody:

XXX = XXXX.xpath("//div[@class='tabset']/table[2]/tbody")
XXX = XXXX.xpath("//div[@class='tabset']/table[2]/tbody/tr/td[2]/a/text()")

In that case, just delete "tbody" from the path.
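
The effect can be reproduced with a small sketch. It assumes a made-up page source (the RAW_HTML string below is hypothetical) and uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs relative paths starting with ".//". With Scrapy or lxml the same path steps would be passed to .xpath() instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source as the server sends it: the tables have no <tbody>.
RAW_HTML = """
<html><body>
  <div class="tabset">
    <table><tr><td>first table</td></tr></table>
    <table>
      <tr><td>1</td><td><a href="/a">Item A</a></td></tr>
      <tr><td>2</td><td><a href="/b">Item B</a></td></tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Path as copied from the browser, whose DOM shows a <tbody> step.
with_tbody = root.findall(".//div[@class='tabset']/table[2]/tbody/tr/td[2]/a")

# The same path with the tbody step deleted.
without_tbody = root.findall(".//div[@class='tabset']/table[2]/tr/td[2]/a")

# ElementTree has no text() step, so read the text from each element.
link_texts = [a.text for a in without_tbody]

print(with_tbody)   # empty: there is no tbody in the raw source
print(link_texts)   # the two link texts from the second table
```

If the downloaded HTML really does contain tbody, the step has to stay, so it is worth checking the raw response rather than the browser's element view.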

3. Matching alternating tags

For tags that alternate from row to row, there are two solutions. (1) Bypass the alternating tag and match on some other tag or attribute instead. (2) Match the two types separately: match type a first, then type b, and combine the results.
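
Both solutions might look like the sketch below. The table and its "odd" / "even" class names are assumptions made up for illustration, and the standard library's xml.etree.ElementTree stands in for Scrapy or lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical table whose rows alternate between two classes.
TABLE = """
<table>
  <tr class="odd"><td>row 1</td></tr>
  <tr class="even"><td>row 2</td></tr>
  <tr class="odd"><td>row 3</td></tr>
  <tr class="even"><td>row 4</td></tr>
</table>
"""

root = ET.fromstring(TABLE)

# Solution (1): ignore the alternating class and match on the tag alone.
all_rows = [td.text for td in root.findall("./tr/td")]

# Solution (2): match each type separately, then combine the results.
odd_rows = [td.text for td in root.findall("./tr[@class='odd']/td")]
even_rows = [td.text for td in root.findall("./tr[@class='even']/td")]

# Interleave to restore page order. This assumes the two types strictly
# alternate and have equal counts; zip() silently drops any leftover item.
merged = [text for pair in zip(odd_rows, even_rows) for text in pair]

print(all_rows)
print(merged)
```

Solution (1) is usually simpler when a shared tag or attribute exists; solution (2) is the fallback when the two row types need different paths.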

4. Skipping a tag at a given position

For example, to skip the first <tr> tag, use position()>1.

XXX.xpath("//tr[position()>1]/td[1]/input/@value").extract()

5. Items with different tag structures give inconsistent match counts

Some websites add labels such as "NEW" or "SALE" to certain items, so items in the same table on the page do not all have the same tag structure, and a fixed path matches only some of them. For example, the situation in the following figure:

In this case, find a way to bypass the extra "span" tag.

# Matches only the items whose link is wrapped in a span
XX.xpath("tr/td[2]/span/a/@href").extract()
# Bypass the span tag
XX.xpath("tr/td[2]//a/@href").extract()
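
The difference between the two paths can be checked on a small made-up table. The markup and the "label-new" class below are hypothetical, and the sketch uses the standard library's xml.etree.ElementTree, which has no @href step, so the attribute is read with .get() instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: only the second item has its link wrapped in a span.
TABLE = """
<table>
  <tr><td>1</td><td><a href="/item/1">Item 1</a></td></tr>
  <tr><td>2</td><td><span class="label-new"><a href="/item/2">Item 2</a></span></td></tr>
  <tr><td>3</td><td><a href="/item/3">Item 3</a></td></tr>
</table>
"""

root = ET.fromstring(TABLE)

# Fixed path through the span: finds only the wrapped link.
strict = [a.get("href") for a in root.findall("./tr/td[2]/span/a")]

# "//" matches <a> at any depth below td[2], with or without the span.
loose = [a.get("href") for a in root.findall("./tr/td[2]//a")]

print(strict)  # one link
print(loose)   # all three links
```

Note that "//" also picks up any other <a> nested inside the cell, so it works best when each cell holds a single link.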

6. To be updated from time to time...

Organizing is not easy...

Origin blog.csdn.net/Ryan_lee9410/article/details/107144213