[Thinking] Using XPath to select sibling, child, grandchild and current nodes, and the elements or text content between two nodes

XPath is very powerful. It supports comparison operators as well as filter expressions (predicates) and common built-in functions, which makes it especially handy for selecting nodes in bulk.

If you only want the method, skip straight to the next heading.

Today I ran into a small business scenario, shown in the figure below.
[Figure: screenshot of the website]
How do you extract the content between the two nodes name="ContentStart" and name="ContentEnd"? Most people will probably think of sibling nodes first.

//meta[@name="ContentStart"]/following-sibling::*

However, this expression also selects other nodes that are not needed.
[Figure: sibling nodes selected]
Here you can use XPath's built-in not() function to exclude the table tags, but when there are many different tags at the same level, the expression becomes redundant and hard to read.
A closer look shows that the tags between the two nodes are all p tags, so replacing "*" with "p" solves the problem.
But what if there are other tags besides p between the two nodes?
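
The sibling-based variants above can be sketched as follows. This is only an illustration: it assumes the third-party lxml library, and the HTML fragment is a made-up stand-in for the page in the post, which is not available.

```python
from lxml import etree  # third-party library, assumed to be installed

# Hypothetical fragment standing in for the real page.
HTML = """
<div>
  <meta name="ContentStart"/>
  <p>first paragraph</p>
  <table><tr><td>unwanted table</td></tr></table>
  <p>second paragraph</p>
  <meta name="ContentEnd"/>
  <div>footer</div>
</div>
"""

root = etree.fromstring(HTML.strip())

# 1) Every following sibling: also picks up the table, the end marker and the footer.
all_siblings = root.xpath('//meta[@name="ContentStart"]/following-sibling::*')

# 2) Exclude table elements with not(); each extra unwanted tag needs another condition.
no_table = root.xpath(
    '//meta[@name="ContentStart"]/following-sibling::*[not(self::table)]'
)

# 3) Keep only p elements; works only if everything wanted is a p.
only_p = root.xpath('//meta[@name="ContentStart"]/following-sibling::p')

print([e.tag for e in all_siblings])  # ['p', 'table', 'p', 'meta', 'div']
print([e.tag for e in no_table])      # ['p', 'p', 'meta', 'div']
print([e.text for e in only_p])       # ['first paragraph', 'second paragraph']
```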

A general method for selecting the content between two nodes

The method mainly relies on XPath axes; a detailed list of the axes is attached at the end.

Axis: defines a node-set relative to the current node. (w3school)

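The scenario above can be rewritten as follows.
Downward:

# following selects all nodes in the document after the closing tag of the current node.
//meta[@name="ContentStart"]/following::*

Upward:

# preceding selects all nodes that end before the current node starts: its preceding siblings and the preceding siblings of each of its ancestors, together with their descendants (the ancestors themselves are excluded).
//meta[@name="ContentEnd"]/preceding::*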

The results of the two XPath expressions must overlap, and the overlapping part is the content you want to extract.
[Figure: Jupyter Notebook output]

The rest is basic Python: use slicing or any other method to pull out the nodes that appear in both results.
If you need sibling, child or grandchild nodes instead, just swap in the corresponding axis.
This approach is time-consuming and laborious, so it suits scenarios where only the result matters!
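
A minimal sketch of the intersection step, again assuming the third-party lxml library and a made-up document (the original page and notebook are not available). The names `between` and `siblings_between` are illustrative only.

```python
from lxml import etree  # third-party library, assumed to be installed

# Hypothetical document with mixed tags between the two marker nodes.
HTML = """
<html>
  <body>
    <div>header</div>
    <meta name="ContentStart"/>
    <h2>title</h2>
    <p>first <b>bold</b> paragraph</p>
    <ul><li>item</li></ul>
    <meta name="ContentEnd"/>
    <div>footer</div>
  </body>
</html>
"""

root = etree.fromstring(HTML.strip())

# Downward: everything after the start marker.
after_start = root.xpath('//meta[@name="ContentStart"]/following::*')
# Upward: everything before the end marker.
before_end = root.xpath('//meta[@name="ContentEnd"]/preceding::*')

# Intersection: keep the nodes of the first result that also occur in the second.
# Both lists are kept alive, so the same node is the same Python object in each.
before_set = set(before_end)
between = [e for e in after_start if e in before_set]
print([e.tag for e in between])  # ['h2', 'p', 'b', 'ul', 'li']

# Swapping the axes restricts the result to the markers' siblings only.
sib_after = root.xpath('//meta[@name="ContentStart"]/following-sibling::*')
sib_before = set(root.xpath('//meta[@name="ContentEnd"]/preceding-sibling::*'))
siblings_between = [e for e in sib_after if e in sib_before]
print([e.tag for e in siblings_between])  # ['h2', 'p', 'ul']
```

Filtering the first list against a set built from the second keeps the nodes in the order of the downward query, so no slicing indices have to be worked out by hand.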


Attachment: XPath axes

Axis name            Result
ancestor             Selects all ancestors (parent, grandparent, etc.) of the current node.
ancestor-or-self     Selects all ancestors (parent, grandparent, etc.) of the current node as well as the current node itself.
attribute            Selects all attributes of the current node.
child                Selects all children of the current node.
descendant           Selects all descendants (children, grandchildren, etc.) of the current node.
descendant-or-self   Selects all descendants (children, grandchildren, etc.) of the current node as well as the current node itself.
following            Selects all nodes in the document after the closing tag of the current node.
following-sibling    Selects all sibling nodes after the current node.
namespace            Selects all namespace nodes of the current node.
parent               Selects the parent node of the current node.
preceding            Selects all nodes in the document that come before the current node, excluding its ancestors.
preceding-sibling    Selects all sibling nodes before the current node.
self                 Selects the current node.
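
To get a feel for the axes, the short sketch below evaluates several of them from one context node. It assumes the third-party lxml library; the tiny XML tree and the `tags` helper are made up for illustration.

```python
from lxml import etree  # third-party library, assumed to be installed

# Hypothetical tree: a contains b and f; b contains c and d; d contains e.
XML = "<a><b id='x'><c/><d><e/></d></b><f/></a>"
root = etree.fromstring(XML)
d = root.xpath('//d')[0]  # context node for every query below


def tags(expr):
    """Return the tag names selected by expr, evaluated relative to d."""
    return [node.tag for node in d.xpath(expr)]


print(tags('self::*'))               # the d element itself
print(tags('parent::*'))             # its parent, b
print(tags('ancestor::*'))           # a and b
print(tags('child::*'))              # its child, e
print(tags('descendant-or-self::*')) # d and e
print(tags('preceding-sibling::*'))  # c
print(tags('following-sibling::*'))  # nothing: d is the last child of b
print(tags('following::*'))          # f, which starts after d is closed
print(d.xpath('parent::*/attribute::id'))  # attribute values come back as strings
```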

Origin blog.csdn.net/qq_44491709/article/details/108002960