Python Web Crawlers: XPath

XPath

XPath is a language for finding information in an XML document. It is used to navigate through the elements and attributes of an XML document, and it works on HTML documents as well. Python crawlers often use XPath to locate and extract information from pages, so it is worth knowing well.

1. XPath nodes

In XPath, an XML document is treated as a tree of nodes. There are seven types of nodes:

  1. Element
  2. Attribute
  3. Text
  4. Namespace
  5. Processing instruction
  6. Comment
  7. Document (root) node

Look at an example:

<?xml version="1.0" encoding="ISO-8859-1"?>
<classroom>
    <student>
        <id>1001</id>
        <name lang="en">marry</name>
        <age>20</age>
        <country>China</country>
    </student>
</classroom>

<classroom> is the root element node (its parent is the document node, which represents the whole document)
<id>1001</id> is an element node
lang="en" is an attribute node
marry is a text node

Node relationships include parent, children, siblings, ancestors, and descendants.
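
The node types and relationships above can be inspected from Python. The following is a minimal sketch, assuming the standard-library module xml.etree.ElementTree (the original post does not name a library; crawlers often use a fuller XPath engine such as lxml). The XML declaration is left out of the string for brevity, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Same sample document as above, without the XML declaration.
XML = """<classroom>
    <student>
        <id>1001</id>
        <name lang="en">marry</name>
        <age>20</age>
        <country>China</country>
    </student>
</classroom>"""

root = ET.fromstring(XML)      # root element node: <classroom>
student = root[0]              # first child element of <classroom>
name = student.find("name")    # element node: <name>

element_tag = name.tag               # the element's name
attribute_value = name.get("lang")   # value of the attribute node lang="en"
text_value = name.text               # content of the text node

# Children of <student>, in document order.
children = [child.tag for child in student]

# Siblings of <name>: the other children of the same parent.
siblings = [child.tag for child in student if child is not name]

# Descendants of <classroom>: everything below the root element.
descendants = [el.tag for el in root.iter() if el is not root]

# ElementTree elements do not store a parent reference, so the parent
# is reached with the ".." step of a path expression instead.
parent_of_name = root.find(".//name/..").tag

print(element_tag, attribute_value, text_value)
print(children, siblings, descendants, parent_of_name)
```

ElementTree exposes attributes and text as properties of an element rather than as separate node objects, which is a simplification of the XPath data model described above.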

2. XPath syntax

Path expressions

XPath uses path expressions to select a node or a set of nodes in an XML document. A node is selected by following a path, that is, a sequence of steps.
Common path expressions:

Expression   Description
nodename     Selects the child elements of the current node that are named nodename
/            Selects from the root node
//           Selects matching nodes anywhere in the document, regardless of their position
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
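
As a rough illustration of these expressions, here is a minimal sketch, assuming the standard-library xml.etree.ElementTree module. It implements only a subset of XPath: paths are always relative to the element being searched (so a leading "/" is not accepted on an element), and "@" can only appear inside a predicate. The one-line document is a shortened, hypothetical version of the sample above.

```python
import xml.etree.ElementTree as ET

# Shortened sample document (assumption: one student is enough here).
root = ET.fromstring(
    "<classroom>"
    "<student><name lang='en'>marry</name><age>20</age></student>"
    "</classroom>"
)

# nodename: child elements of the context node with that name.
students = root.findall("student")

# "." is the context node; ".//" searches all of its descendants.
names = root.findall(".//name")

# ".." steps from a matched node up to its parent.
parents = root.findall(".//age/..")

# "@" addresses an attribute. ElementTree cannot return attribute nodes,
# so the matching elements are selected and the value is read from them.
with_lang = root.findall(".//*[@lang]")
lang_values = [el.get("lang") for el in with_lang]

print([s.tag for s in students], [n.text for n in names])
print([p.tag for p in parents], lang_values)
```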

Look at an example:

<?xml version="1.0" encoding="ISO-8859-1"?>
<classroom>
    <student>
        <id>1001</id>
        <name lang="en">marry</name>
        <age>20</age>
        <country>China</country>
    </student>
    <student>
        <id>1002</id>
        <name lang="en">jack</name>
        <age>25</age>
        <country>USA</country>
    </student>
</classroom>

Path expression      Result
classroom            Selects the classroom elements that are children of the current node
/classroom           Selects the root element classroom
classroom/student    Selects all student elements that are children of classroom
//student            Selects all student elements, regardless of their position in the document
classroom//student   Selects all student elements that are descendants of classroom, however deep beneath classroom they are
//@lang              Selects all attributes named lang

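
The expressions in this table can be tried against the two-student document. This is a minimal sketch, assuming the standard-library xml.etree.ElementTree module, where the search starts at the classroom root element. For that reason the absolute paths are rewritten as relative ones, and //@lang is emulated in plain Python because ElementTree cannot return attribute nodes.

```python
import xml.etree.ElementTree as ET

XML = """<classroom>
    <student>
        <id>1001</id>
        <name lang="en">marry</name>
        <age>20</age>
        <country>China</country>
    </student>
    <student>
        <id>1002</id>
        <name lang="en">jack</name>
        <age>25</age>
        <country>USA</country>
    </student>
</classroom>"""

root = ET.fromstring(XML)  # the context node is the <classroom> root element

# /classroom/student: student elements that are children of classroom.
child_students = root.findall("./student")
ids = [s.findtext("id") for s in child_students]

# //student and classroom//student: student elements at any depth.
all_students = root.findall(".//student")

# //@lang: emulated by reading the attribute from every element that has it.
langs = [el.get("lang") for el in root.iter() if "lang" in el.attrib]

print(ids, len(all_students), langs)
```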
Predicates

Predicates are used to find a specific node, or a node that contains a specific value.
A predicate is written in square brackets []:

Path expression                    Result
/classroom/student[1]              Selects the first student element that is a child of classroom
/classroom/student[last()]         Selects the last student element that is a child of classroom
/classroom/student[last()-1]       Selects the second-to-last student element that is a child of classroom
/classroom/student[position()<3]   Selects the first two student elements that are children of classroom
//name[@lang]                      Selects all name elements that have a lang attribute
//name[@lang='en']                 Selects all name elements whose lang attribute has the value "en"
/classroom/student[age>20]         Selects all student children of classroom whose age element has a value greater than 20
/classroom/student[age>20]/name    Selects the name elements of all student children of classroom whose age element has a value greater than 20

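
The predicates above can be sketched in Python as follows, assuming the standard-library xml.etree.ElementTree module. Its XPath subset handles positional predicates such as [1], [last()] and [last()-1] and attribute tests such as [@lang='en'], but not position()<3 or numeric comparisons like age>20, so those two are emulated with ordinary Python. The sample document is trimmed to the fields the predicates use.

```python
import xml.etree.ElementTree as ET

XML = """<classroom>
    <student>
        <id>1001</id>
        <name lang="en">marry</name>
        <age>20</age>
    </student>
    <student>
        <id>1002</id>
        <name lang="en">jack</name>
        <age>25</age>
    </student>
</classroom>"""

root = ET.fromstring(XML)  # searches start at the <classroom> root element

# Positional predicates (XPath positions start at 1, not 0).
first = root.find("./student[1]")
last = root.find("./student[last()]")
second_to_last = root.find("./student[last()-1]")

# Attribute predicates.
names_with_lang = root.findall(".//name[@lang]")
english_names = root.findall(".//name[@lang='en']")

# position()<3 is outside ElementTree's subset: slice the result list instead.
first_two = root.findall("./student")[:2]

# age>20 is also outside the subset: filter in Python.
older = [s for s in root.findall("./student") if int(s.findtext("age")) > 20]
older_names = [s.findtext("name") for s in older]

print(first.findtext("id"), last.findtext("id"), second_to_last.findtext("id"))
print([n.text for n in english_names], older_names)
```

With a full XPath 1.0 engine, the two emulated cases could be written directly as path expressions, as in the table.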
Wildcards
  • The wildcard "*" matches any element node
  • The "|" operator combines several paths into one selection
Path expression                    Result
/classroom/*                       Selects all child elements of the classroom element
//*                                Selects all elements in the document
//name[@*]                         Selects all name elements that have at least one attribute
//student/name | //student/age     Selects all name and age elements of student elements
/classroom/student/name | //age    Selects all name elements of the student children of classroom, and all age elements in the document
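
A minimal sketch of the wildcard and "|" examples, assuming the standard-library xml.etree.ElementTree module. It supports "*" in paths but not the "|" operator, so the union is emulated by concatenating two result lists. Unlike a real XPath union, the concatenated list is not merged into document order and duplicates are not removed.

```python
import xml.etree.ElementTree as ET

XML = """<classroom>
    <student>
        <id>1001</id>
        <name lang="en">marry</name>
        <age>20</age>
        <country>China</country>
    </student>
    <student>
        <id>1002</id>
        <name lang="en">jack</name>
        <age>25</age>
        <country>USA</country>
    </student>
</classroom>"""

root = ET.fromstring(XML)  # searches start at the <classroom> root element

# /classroom/*: every child element of classroom.
all_children = root.findall("./*")

# //*: every element below the context node. The root element itself is
# not included here, whereas //* in full XPath would include it.
every_element = root.findall(".//*")

# //name[@*]: emulated by keeping name elements with a non-empty attribute dict.
names_with_attr = [el for el in root.iter("name") if el.attrib]

# //student/name | //student/age: emulated by joining two result lists.
name_and_age = root.findall(".//student/name") + root.findall(".//student/age")

print([e.tag for e in all_children], len(every_element))
print([e.text for e in names_with_attr], [e.text for e in name_and_age])
```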

Origin www.cnblogs.com/lykxbg/p/12002936.html