[Python crawler] XPath

First, what is XML

  1, definition: XML stands for Extensible Markup Language.
  2, characteristics: XML is a self-describing, semi-structured data format.
  3, purpose: XML is designed primarily to transmit data. It can also serve as a configuration file format.

Second, the differences between XML and HTML

  1, different syntax requirements: XML syntax is stricter.

    (1) HTML is case-insensitive; XML is case-sensitive.
    (2) In HTML a closing tag can sometimes be omitted. XML cannot omit any tag, and tags must be strictly nested: the first tag opened is the last one closed.
    (3) Self-closing tags (a tag with no content, only attributes) are an XML feature: <a class='abc'/>
    (4) In HTML an attribute name can appear without a value. In XML every attribute must have a value.
    (5) In XML attribute values must be enclosed in quotes; in HTML the quotes can sometimes be omitted.

  2, different roles

    HTML is designed primarily to display data.
    XML is designed primarily to transfer data.

  3, different tags: XML tags are not fixed and can be defined freely; HTML tags are a fixed, predefined set.
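
The syntax differences above can be seen by feeding the same markup to lxml's two parsers. This is a minimal sketch, assuming lxml is installed; the snippet is made up for illustration and deliberately uses an unquoted attribute value and unclosed <li> tags.

```python
from lxml import etree

# Acceptable HTML, but not well-formed XML:
# the attribute value is unquoted and the <li> tags are never closed.
snippet = "<ul class=menu><li>one<li>two</ul>"

# The XML parser rejects the markup.
try:
    etree.fromstring(snippet)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser repairs the markup and builds a tree.
tree = etree.HTML(snippet)
items = tree.xpath("//li/text()")

print(xml_ok)  # False
print(items)   # ['one', 'two']
```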

Third, XPath

  1, what is XPath?

    XPath is a syntax for selecting elements from an HTML or XML document.

  2, some XML and HTML terms

    Element, tag, attribute, content

  3, two ways to parse XML

    DOM and SAX
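
The two approaches differ in how the document is consumed: DOM loads the whole document into a tree that is then queried, while SAX streams through it and fires callbacks. A rough sketch using only the standard library (xml.dom.minidom and xml.sax); the sample document and the BookHandler class are hypothetical.

```python
import xml.sax
from xml.dom import minidom

XML_DOC = "<books><book>Python</book><book>XPath</book></books>"

# DOM: parse everything into an in-memory tree, then query the tree.
dom = minidom.parseString(XML_DOC)
dom_titles = [node.firstChild.data for node in dom.getElementsByTagName("book")]


# SAX: react to start/end/text events while the parser streams the input.
class BookHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_book = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "book":
            self._in_book = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect it in a buffer.
        if self._in_book:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "book":
            self.titles.append("".join(self._buffer))
            self._in_book = False


handler = BookHandler()
xml.sax.parseString(XML_DOC.encode("utf-8"), handler)
sax_titles = handler.titles

print(dom_titles)  # ['Python', 'XPath']
print(sax_titles)  # ['Python', 'XPath']
```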

  4, XPath syntax

    (1) selecting nodes

        nodename --- selects the child elements named nodename of the current node.
        / --- selects starting from the root node.
        // --- selects matching nodes anywhere in the document, regardless of their position.
        //book --- selects every book tag in the XML document, regardless of position.
        . --- selects the current node.
        .. --- selects the parent of the current node.
        @ --- selects attributes.
        text() --- selects text content.
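
The selectors above can be tried out with lxml. A minimal sketch, assuming lxml is installed; the bookstore document is invented for illustration.

```python
from lxml import etree

XML_DOC = """
<bookstore>
  <book class="abc"><title>Python</title><price>45</price></book>
  <book class="xyz"><title>XPath</title><price>25</price></book>
</bookstore>
"""
root = etree.fromstring(XML_DOC)

# / starts at the root: bookstore is the root element, book its children.
from_root = root.xpath("/bookstore/book")

# // finds title elements anywhere in the document.
anywhere = root.xpath("//title")

# A bare nodename is relative to the current node (here: bookstore).
children = root.xpath("book")

# . is the current node, .. is its parent.
first_title = anywhere[0]
current = first_title.xpath(".")
parent = first_title.xpath("..")

# @ selects attribute values, text() selects text content.
classes = root.xpath("//book/@class")
titles = root.xpath("//title/text()")

print(len(from_root), len(anywhere), len(children))  # 2 2 2
print(current[0].tag, parent[0].tag)                 # title book
print(classes)                                       # ['abc', 'xyz']
print(titles)                                        # ['Python', 'XPath']
```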

    (2) predicates: a predicate acts as a filter; it restricts the expression written in front of it.

        [] is written directly after the thing it restricts, usually an element or tag.

        //book[@class='abc']

        Common predicates:
          [@class] --- selects nodes that have a class attribute.
          [@class='abc'] --- selects nodes whose class attribute is abc.
          [contains(@href, 'baidu')] --- selects tags whose href attribute contains baidu.
          [1] --- selects the first.
          [last()] --- selects the last.
          [last()-1] --- selects the second to last.
          [position()>2] --- skips the first two.
          book[price>30] --- selects book elements with a price child whose value is greater than 30.
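
Each of the predicates listed can be checked against a small document. This is a sketch under assumptions: lxml is installed, and the bookstore data, the URLs, and the titles() helper are hypothetical.

```python
from lxml import etree

XML_DOC = """
<bookstore>
  <book class="abc"><title>A</title><price>10</price><a href="https://www.baidu.com/a">x</a></book>
  <book class="abc"><title>B</title><price>35</price></book>
  <book><title>C</title><price>50</price></book>
  <book class="xyz"><title>D</title><price>20</price><a href="https://example.com/d">y</a></book>
</bookstore>
"""
root = etree.fromstring(XML_DOC)


def titles(expr):
    """Return the title text of every book matched by expr."""
    return [book.findtext("title") for book in root.xpath(expr)]


has_class = titles("//book[@class]")          # ['A', 'B', 'D']
class_abc = titles("//book[@class='abc']")    # ['A', 'B']
first = titles("//book[1]")                   # ['A']
last = titles("//book[last()]")               # ['D']
penultimate = titles("//book[last()-1]")      # ['C']
after_two = titles("//book[position()>2]")    # ['C', 'D']
expensive = titles("//book[price>30]")        # ['B', 'C']

# contains() tests for a substring inside the attribute value.
baidu_links = root.xpath("//a[contains(@href, 'baidu')]/@href")

print(has_class, class_abc, first, last, penultimate, after_two, expensive)
print(baidu_links)  # ['https://www.baidu.com/a']
```

Note that positions in XPath start at 1, not 0, and that [1] means "first among its siblings", so with several parent elements //book[1] can return more than one node.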

    (3) wildcards

        * --- matches any element node.
        @* --- matches any attribute.

    (4) selecting several paths

        | --- selects everything matched by the XPath on its left or the XPath on its right (a union).
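
Wildcards and the | operator can be sketched as follows, assuming lxml is installed; the document and attribute names are made up for illustration.

```python
from lxml import etree

XML_DOC = """
<bookstore>
  <book id="b1" lang="en"><title>Python</title><price>45</price></book>
</bookstore>
"""
root = etree.fromstring(XML_DOC)

# * matches any element: every child element of book, whatever its tag.
child_tags = [el.tag for el in root.xpath("//book/*")]

# @* matches any attribute of book.
attr_values = root.xpath("//book/@*")

# | is a union: nodes matched by either expression end up in one list.
union_text = root.xpath("//title/text() | //price/text()")

print(child_tags)
print(sorted(attr_values))
print(sorted(union_text))
```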

  5, the lxml module --- a Python module for processing HTML and XML.

    (1) parsing markup held in a string

from lxml import etree

text = '''
    HTML page content
'''
# etree.HTML() returns an Element object.
tree = etree.HTML(text)
# An Element object has an xpath() method that filters content with an XPath expression.
# Select the text of the a tags under li tags whose class attribute is item-1.
a_contents = tree.xpath('//li[@class="item-1"]/a/text()')

# How to turn an Element object into a string:
# html_str = etree.tostring(tree, pretty_print=True).decode('utf-8')
# print(type(html_str))

      # Filtering an Element object with xpath() returns a list.
      # If the XPath expression ends in an element (tag), the list holds Element objects.
      # If the XPath expression ends in an attribute, the list holds the attribute values as strings.
      # If the XPath expression ends in text(), the list holds the text content as strings.
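
The three kinds of result can be compared side by side. A minimal sketch, assuming lxml is installed; the page content and the item-9 class are hypothetical.

```python
from lxml import etree

text = '''
<ul>
  <li class="item-0"><a href="link1.html">first</a></li>
  <li class="item-1"><a href="link2.html">second</a></li>
</ul>
'''
tree = etree.HTML(text)

# Expression ends in a tag: list of Element objects.
elements = tree.xpath('//li[@class="item-1"]/a')
# Expression ends in an attribute: list of strings.
hrefs = tree.xpath('//li[@class="item-1"]/a/@href')
# Expression ends in text(): list of strings.
contents = tree.xpath('//li[@class="item-1"]/a/text()')
# Nothing matches: an empty list, not None.
missing = tree.xpath('//li[@class="item-9"]/a/text()')

# tostring() returns bytes, so decode to get a str.
html_str = etree.tostring(tree, pretty_print=True).decode('utf-8')

print(elements[0].tag)  # a
print(hrefs)            # ['link2.html']
print(contents)         # ['second']
print(missing)          # []
print(type(html_str))   # <class 'str'>
```

Because a failed match gives an empty list, indexing the result with [0] should be guarded by a check that the list is non-empty.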

    (2) parsing an XML or HTML file

from lxml import etree

# parse() uses the XML parser by default, so it raises an error if the file has syntax problems.
html = etree.parse('demo.html')

# print(html)  # an _ElementTree object
li_texts = html.xpath('//li/a/text()')

print(li_texts)
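
When the file is real-world HTML rather than well-formed XML, one option is to pass an etree.HTMLParser to parse(). The sketch below assumes lxml is installed; the page content is a hypothetical stand-in for demo.html, written to a temporary directory so the example is self-contained.

```python
import os
import tempfile

from lxml import etree

# Hypothetical stand-in for demo.html: fine as HTML, not well-formed XML.
page = (
    "<html><body><ul>"
    "<li><a href=a.html>first</a>"
    "<li><a href=b.html>second</a>"
    "</ul></body></html>"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(page)

    # Default XML parser: fails on the unquoted attributes and unclosed <li>.
    try:
        etree.parse(path)
        strict_ok = True
    except etree.XMLSyntaxError:
        strict_ok = False

    # With an HTMLParser, parse() tolerates and repairs such markup.
    html = etree.parse(path, parser=etree.HTMLParser())
    li_texts = html.xpath("//li/a/text()")

print(strict_ok)  # False
print(li_texts)   # ['first', 'second']
```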


Origin www.cnblogs.com/Tree0108/p/12074912.html