XPath
XPath is a language for locating information in an XML document. It navigates through the elements and attributes of the document, and it also works on HTML documents. Python web crawlers often use XPath to find and extract information from pages, so it is an important tool to learn.
1, XPath nodes
XPath treats an XML document as a tree of nodes. There are seven node types:
- Element
- Attribute
- Text
- Namespace
- Processing instruction
- Comment
- Document (root) node
Look at an example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<classroom>
<student>
<id>1001</id>
<name lang="en">marry</name>
<age>20</age>
<country>China</country>
</student>
</classroom>
<classroom>
is the root element node (its parent is the document node)
<id>1001</id>
is an element node
lang="en"
is an attribute node
marry
is a text node
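To make these node types concrete, here is a minimal sketch that reads the sample document above. It assumes Python's standard-library `xml.etree.ElementTree` as the parser, which the original text does not name. ElementTree exposes attributes and text as properties of an element rather than as separate node objects.

```python
import xml.etree.ElementTree as ET

# The sample document from above (XML declaration omitted for brevity).
XML = """<classroom>
  <student>
    <id>1001</id>
    <name lang="en">marry</name>
    <age>20</age>
    <country>China</country>
  </student>
</classroom>"""

root = ET.fromstring(XML)            # root element node: <classroom>
name = root.find("student/name")     # element node: <name>

element_tag = root.tag               # tag of the root element
attribute_value = name.get("lang")   # value of the attribute node lang="en"
text_value = name.text               # content of the text node inside <name>

print(element_tag, attribute_value, text_value)  # classroom en marry
```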
Nodes are related to one another as parents, children, siblings, ancestors, and descendants.
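These relationships can be explored with a short sketch. It again assumes the standard-library `xml.etree.ElementTree`; the variable names are illustrative only. ElementTree elements keep no pointer to their parent, so the parent is selected with the `..` step.

```python
import xml.etree.ElementTree as ET

XML = """<classroom>
  <student>
    <id>1001</id>
    <name lang="en">marry</name>
    <age>20</age>
    <country>China</country>
  </student>
</classroom>"""

root = ET.fromstring(XML)
student = root.find("student")
name = student.find("name")

# Children: the direct sub-elements of <student>.
children = [child.tag for child in student]

# Descendants: every element below <classroom>, at any depth.
descendants = [elem.tag for elem in root.iter() if elem is not root]

# Parent: select it with the ".." step, starting from the root element.
parent_of_name = root.find(".//name/..")

# Siblings: the other children of the same parent.
siblings = [child.tag for child in parent_of_name if child is not name]

print(children)            # ['id', 'name', 'age', 'country']
print(parent_of_name.tag)  # student
print(siblings)            # ['id', 'age', 'country']
```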
2, XPath syntax
Path expressions
XPath uses path expressions to select a node or a set of nodes in an XML document. Nodes are selected by following a path, one step at a time.
Common path expressions:
Expression | Description |
---|---|
nodename | Selects all nodes named nodename |
/ | Selects from the root node |
// | Selects matching nodes anywhere below the current node, at any depth |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |
Look at an example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<classroom>
<student>
<id>1001</id>
<name lang="en">marry</name>
<age>20</age>
<country>China</country>
</student>
<student>
<id>1002</id>
<name lang="en">jack</name>
<age>25</age>
<country>USA</country>
</student>
</classroom>
Result | Path expression |
---|---|
Select all nodes named classroom | classroom |
Select the root element classroom | /classroom |
Select all student elements that are children of classroom | classroom/student |
Select all student elements, wherever they are in the document | //student |
Select all student elements that are descendants of classroom, at any depth below it | classroom//student |
Select all attributes named lang | //@lang |
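The path expressions in the table can be tried out with a short sketch. It assumes the standard-library `xml.etree.ElementTree`, which supports only a subset of XPath: paths are evaluated relative to the element they are called on, so `/classroom/student` becomes `student` on the root element and `//student` becomes `.//student`. Attribute selection such as `//@lang` is not available there, so the sketch selects the elements carrying the attribute and reads its value. A library with full XPath 1.0 support, such as lxml, accepts the expressions as written in the table.

```python
import xml.etree.ElementTree as ET

XML = """<classroom>
  <student>
    <id>1001</id>
    <name lang="en">marry</name>
    <age>20</age>
    <country>China</country>
  </student>
  <student>
    <id>1002</id>
    <name lang="en">jack</name>
    <age>25</age>
    <country>USA</country>
  </student>
</classroom>"""

root = ET.fromstring(XML)  # the <classroom> root element

# /classroom/student: student children of the root element
child_students = root.findall("student")

# //student: student elements at any depth below the root element
all_students = root.findall(".//student")

# //@lang: emulated by finding elements with a lang attribute
langs = [elem.get("lang") for elem in root.findall(".//*[@lang]")]

ids = [student.findtext("id") for student in child_students]
print(ids)                # ['1001', '1002']
print(len(all_students))  # 2
print(langs)              # ['en', 'en']
```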
Predicates
A predicate selects a specific node, or the nodes that contain a specific value.
Predicates are written in square brackets []:
Result | Path expression |
---|---|
Select the first student element that is a child of classroom | /classroom/student[1] |
Select the last student element that is a child of classroom | /classroom/student[last()] |
Select the second-to-last student element that is a child of classroom | /classroom/student[last()-1] |
Select the first two student elements that are children of classroom | /classroom/student[position()<3] |
Select all name elements that have a lang attribute | //name[@lang] |
Select all name elements whose lang attribute has the value "en" | //name[@lang='en'] |
Select all student elements of classroom whose age element has a value greater than 20 | /classroom/student[age>20] |
Select the name elements of all student elements of classroom whose age element has a value greater than 20 | /classroom/student[age>20]/name |
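The predicates above can be sketched as follows, again assuming the standard-library `xml.etree.ElementTree`. It handles positional and attribute predicates directly but has no `position()<3` or numeric comparisons such as `age>20`, so those two rows are emulated here with a slice and a plain Python filter. That emulation is this sketch's substitute, not XPath itself.

```python
import xml.etree.ElementTree as ET

XML = """<classroom>
  <student>
    <id>1001</id>
    <name lang="en">marry</name>
    <age>20</age>
    <country>China</country>
  </student>
  <student>
    <id>1002</id>
    <name lang="en">jack</name>
    <age>25</age>
    <country>USA</country>
  </student>
</classroom>"""

root = ET.fromstring(XML)

# Positional predicates (positions start at 1, not 0).
first = root.find("student[1]")
last = root.find("student[last()]")
second_to_last = root.find("student[last()-1]")

# Attribute predicates.
with_lang = root.findall(".//name[@lang]")
english = root.findall(".//name[@lang='en']")

# student[position()<3], emulated with a slice.
first_two = root.findall("student")[:2]

# student[age>20]/name, emulated with a Python filter.
older_names = [
    student.findtext("name")
    for student in root.findall("student")
    if int(student.findtext("age")) > 20
]

print(first.findtext("id"), last.findtext("id"))  # 1001 1002
print(older_names)                                # ['jack']
```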
Wildcards
- The wildcard "*" matches any element node
- The "|" operator combines several paths in one expression
Result | Path expression |
---|---|
Select all child elements of the classroom element | /classroom/* |
Select all elements in the document | //* |
Select all name elements that have at least one attribute | //name[@*] |
Select all name and age elements of all student elements | //student/name | //student/age |
Select all name elements of student elements under classroom, plus all age elements in the document | /classroom/student/name | //age |
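A final sketch covers the wildcard rows, under the same assumption of the standard-library `xml.etree.ElementTree`. It supports `*` for elements, but not `[@*]` or the `|` operator, so those rows are emulated in plain Python. Note also that `.//*` starts below the root element, whereas `//*` in full XPath includes the root element itself.

```python
import xml.etree.ElementTree as ET

XML = """<classroom>
  <student>
    <id>1001</id>
    <name lang="en">marry</name>
    <age>20</age>
    <country>China</country>
  </student>
  <student>
    <id>1002</id>
    <name lang="en">jack</name>
    <age>25</age>
    <country>USA</country>
  </student>
</classroom>"""

root = ET.fromstring(XML)

# /classroom/*: all child elements of the root element
children = root.findall("*")

# All elements below the root element (//* would also count <classroom>).
below_root = root.findall(".//*")

# //name[@*], emulated: name elements with at least one attribute.
names_with_attrs = [elem for elem in root.iter("name") if elem.attrib]

# //student/name | //student/age, emulated in document order.
names_and_ages = [
    child.text
    for student in root.iter("student")
    for child in student
    if child.tag in ("name", "age")
]

print(len(children), len(below_root))  # 2 10
print(names_and_ages)                  # ['marry', '20', 'jack', '25']
```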