Extracting data, part one: XPath

Using XPath

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path, made up of steps, through the document tree.

Listed below are the most useful path expressions:

| Expression | Description | Example | Result |
| --- | --- | --- | --- |
| nodename | Selects the child elements of the current node that have this name. | bookstore | Selects the bookstore elements that are children of the current node. |
| / | At the start of a path, selects from the root node. Elsewhere, steps from a node to its direct children. | /bookstore | Selects the root element bookstore. Note: a path that starts with a forward slash (/) is always an absolute path to an element. |
| // | Selects matching nodes anywhere in the document, regardless of their location. | //book | Selects all book elements. |
| . | Selects the current node. | | |
| .. | Selects the parent of the current node. | | |
| @ | Selects attributes. | //book[@price] | Selects all book elements that have a price attribute. |
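
As a minimal sketch of the expressions in the table above, the snippet below runs each of them with lxml. The bookstore document is invented for illustration and is not part of the original examples.

```python
from lxml import etree

# Hypothetical sample document, made up for this sketch.
XML = b"""<bookstore>
  <book price="10"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)  # root is the <bookstore> element

# nodename: child elements of the context node with that name
books_by_name = root.xpath("book")

# /: absolute path starting at the document root
bookstore = root.xpath("/bookstore")

# //: matches anywhere in the document
all_books = root.xpath("//book")

# @ inside a predicate: only book elements that have a price attribute
priced = root.xpath("//book[@price]")

# . and ..: the current node and its parent
first_title = root.xpath("//title")[0]
same_node = first_title.xpath(".")[0]
parent = first_title.xpath("..")[0]

print(len(books_by_name), bookstore[0].tag, len(all_books), len(priced))
print(same_node.tag, parent.tag)
```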

1: Predicates

A predicate is used to find a specific node, or a node that contains a specified value. Predicates are written inside square brackets [].

Note: indexes start at 1, not 0.

| Path expression | Result |
| --- | --- |
| /bookstore/book[1] | Selects the first book element that is a child of the bookstore element. |
| /bookstore/book[last()] | Selects the last book element that is a child of the bookstore element. |
| /bookstore/book[last()-1] | Selects the second-to-last book element that is a child of the bookstore element. |
| /bookstore/book[position()<3] | Selects the first two book elements that are children of the bookstore element. |
| //book[@lang] | Selects all book elements that have a lang attribute. |
| //book[@price='10'] | Selects all book elements whose price attribute is equal to 10. |
| /bookstore/book[price>35.00] | Selects all book elements under bookstore whose price child element has a value greater than 35.00. |
| /bookstore/book[price>35.00]/title | Selects the title elements of all book elements under bookstore whose price child element has a value greater than 35.00. |
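
The predicates above can be tried out with lxml as follows. This is only a sketch: the four-book document and the `titles` helper are assumptions made up for the example.

```python
from lxml import etree

# Hypothetical sample data with four books, titled A to D.
XML = b"""<bookstore>
  <book lang="en" price="10"><title>A</title><price>30.00</price></book>
  <book lang="en"><title>B</title><price>29.99</price></book>
  <book price="12"><title>C</title><price>49.99</price></book>
  <book><title>D</title><price>39.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)


def titles(expr):
    """Return the title text of every book element matched by expr."""
    return [book.findtext("title") for book in root.xpath(expr)]


first = titles("/bookstore/book[1]")                 # index starts at 1
last = titles("/bookstore/book[last()]")
second_last = titles("/bookstore/book[last()-1]")
first_two = titles("/bookstore/book[position()<3]")
with_lang = titles("//book[@lang]")                  # has a lang attribute
price_attr_10 = titles("//book[@price='10']")        # attribute equals 10
expensive = titles("/bookstore/book[price>35.00]")   # child element value > 35

# Continue the path after the predicate to reach the title text directly.
expensive_titles = root.xpath("/bookstore/book[price>35.00]/title/text()")

print(first, last, second_last, first_two)
print(with_lang, price_attr_10, expensive, expensive_titles)
```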

2: Selecting unknown nodes

| Wildcard | Description | Example | Result |
| --- | --- | --- | --- |
| * | Matches any element node. | /bookstore/* | Selects all child elements of the bookstore element. |
| @* | Matches any attribute node. | //title[@*] | Selects all title elements that have at least one attribute. |
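
A short sketch of both wildcards with lxml, using a made-up document in which only one title carries an attribute:

```python
from lxml import etree

# Hypothetical sample data: only the first title has an attribute.
XML = b"""<bookstore>
  <book><title lang="en">A</title><price>30.00</price></book>
  <magazine><title>B</title></magazine>
</bookstore>"""

root = etree.fromstring(XML)

# *: every child element of bookstore, whatever its name
child_tags = [el.tag for el in root.xpath("/bookstore/*")]

# @*: title elements that carry at least one attribute of any name
titles_with_attrs = [el.text for el in root.xpath("//title[@*]")]

print(child_tags, titles_with_attrs)
```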

3: Selecting several paths

By using the "|" operator in a path expression, you can select several paths at once.

Path expression                      Result
//book/title | //book/price          Selects all title and all price elements of all book elements.
//title | //price                    Selects all title and all price elements in the document.
/bookstore/book/title | //price      Selects all title elements of the book elements under bookstore, plus all price elements in the document.
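
The union operator can be sketched with lxml like this. The sample document and the `tags` helper are assumptions for illustration; the helper sorts the tag names so the result does not depend on the order in which nodes are returned.

```python
from lxml import etree

# Hypothetical sample data: two books and one magazine.
XML = b"""<bookstore>
  <book><title>A</title><price>30.00</price></book>
  <book><title>B</title><price>29.99</price></book>
  <magazine><title>C</title><price>5.00</price></magazine>
</bookstore>"""

root = etree.fromstring(XML)


def tags(expr):
    """Return the sorted tag names of all elements matched by expr."""
    return sorted(el.tag for el in root.xpath(expr))


book_fields = tags("//book/title | //book/price")   # titles and prices of books only
all_fields = tags("//title | //price")              # every title and price
mixed = tags("/bookstore/book/title | //price")     # book titles plus every price

# A node matched by both sides of the union appears only once in the result.
overlap = root.xpath("//title | //book/title")

print(book_fields, len(all_fields), mixed, len(overlap))
```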

4: XPath syntax:

Usage:

Use // to search the whole page, then write the tag name, then add predicates to narrow down the match. For example:

//div[@class='abc']

Points to note:

  1. The difference between / and //: / selects only direct children, while // selects descendants at any depth. // is used more often in general, but which one to use depends on the situation.

  2. contains: sometimes an attribute holds multiple values; in that case you can use the contains function. Sample code is as follows:

    //div[contains(@class,'job_detail')]
  3. Predicate indexes start at 1, not 0.

  4. Parsing an HTML string: use lxml.etree.HTML. Sample code is as follows:

from lxml import etree     # import the etree module
text = ''' HTML code '''   # placeholder: put the HTML source here
htmlElement = etree.HTML(text)    # HTML() returns an Element object
result = etree.tostring(htmlElement)   # tostring() outputs the repaired HTML, but the result is of type bytes
print(result.decode('utf-8'))  # convert it to str

  5. Parsing an HTML file: use lxml.etree.parse. Sample code is as follows:

from lxml import etree
html = etree.parse("./tencent.html", etree.HTMLParser())   # parse() with an explicit HTML parser
result = etree.tostring(html)
print(result.decode('utf-8'))

  6. By default, etree.parse uses an XML parser, so it raises a parse error when it meets non-standard HTML. In that case, create an HTML parser yourself and pass it in.

parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse("lagou.html",parser=parser)
print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))
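
To tie the notes above together, here is a minimal sketch that parses deliberately sloppy HTML. The markup and class names are invented for the example. It shows the HTML parser repairing the input, the difference between / and //, and that contains() is a substring test, so it can match more than intended.

```python
from lxml import etree

# Hypothetical, deliberately sloppy HTML: no <html>/<body>, unclosed <li> tags.
text = """
<div class="job_detail main">
  <ul>
    <li>Python
    <li>XPath
  </ul>
</div>
<div class="job_detail_footer">contact</div>
"""

html = etree.HTML(text)  # the HTML parser repairs the markup

# / follows direct children only; // searches all descendants.
direct = html.xpath("/html/body/li")   # li is not a direct child of body
anywhere = html.xpath("//li")
items = [li.text.strip() for li in anywhere]

# An exact comparison fails when class holds several values.
exact = html.xpath("//div[@class='job_detail']")
# contains() is a substring match, so it also matches job_detail_footer.
partial = html.xpath("//div[contains(@class, 'job_detail')]")

# The strict XML parser rejects the same input.
try:
    etree.fromstring(text)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

print(len(direct), items, len(exact), len(partial), xml_ok)
```

If the substring behaviour is too loose for a given page, comparing against the full attribute value with @class="..." is the stricter option.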

Notes on using XPath with lxml:

  1. Using XPath syntax: call the Element.xpath method to run an XPath query. Sample code is as follows:

    trs = html.xpath("//tr[position()>1]")

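    For node-selecting expressions like this one, the xpath function always returns a list, which may be empty.

  2. Getting the value of an attribute:

    href = html.xpath("//a/@href")
    # get the value of the href attribute of every a tag
  3. Getting text: use XPath's text() function. Sample code is as follows: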

    address = tr.xpath("./td[4]/text()")[0]
  4. When you call xpath on an element to get the descendants under that element, put a dot before the slash so that the path is relative to the current element. Sample code is as follows:

    address = tr.xpath("./td[4]/text()")[0] 
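  5. Getting the parent node with ..:

    result = html.xpath('//a[@href="link1.html"]/../@class')
    # first find the a node whose href attribute is link1.html, then take its parent, and finally read the parent's class attribute
  6. Matching attributes: use @ in a predicate to filter by attribute

    .xpath('//li[@class="item-0"]')
    # gets all li tags with class="item-0"; returns a list
  7. Matching an attribute that holds multiple values

    .xpath('//li[contains(@class, "attribute value")]')
    # contains(): the first argument is the attribute, the second is the value to look for
  8. Matching on multiple attributes

    .xpath('//li[contains(@class, "attribute value") and @name="attribute value"]')
    .xpath('//li[@class="attribute value" and @name="attribute value"]')
    # operators such as and, or, <, >, <=, >= can also be used inside predicates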
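
The notes in this list can be combined into one small sketch. The table, links and attribute values below are made up for illustration. The example skips the header row, uses a leading dot for row-relative paths, reads attributes and text, walks up to a parent, and combines two conditions with and.

```python
from lxml import etree

# Hypothetical page fragment, invented for this sketch.
text = """
<table>
  <tr><th>Title</th><th>Type</th><th>Count</th><th>City</th></tr>
  <tr><td><a href="job1.html">Engineer</a></td><td>Tech</td><td>2</td><td>Shenzhen</td></tr>
  <tr><td><a href="job2.html">Designer</a></td><td>Design</td><td>1</td><td>Beijing</td></tr>
</table>
<ul>
  <li class="item-0 active" name="first"><a href="link1.html">one</a></li>
  <li class="item-1" name="second"><a href="link2.html">two</a></li>
</ul>
"""

html = etree.HTML(text)

rows = html.xpath("//tr[position()>1]")  # skip the header row

jobs = []
for tr in rows:
    # The leading dot keeps each path relative to this tr element.
    title = tr.xpath("./td[1]/a/text()")[0]
    href = tr.xpath("./td[1]/a/@href")[0]
    city = tr.xpath("./td[4]/text()")[0]
    jobs.append((title, href, city))

# Without the dot, // searches the whole document, not just this row.
cells_whole_document = rows[0].xpath("//td")
cells_this_row = rows[0].xpath(".//td")

# Parent step: from the a tag up to its li, then read the class attribute.
parent_class = html.xpath('//a[@href="link1.html"]/../@class')

# Two conditions in one predicate, joined with and.
both = html.xpath('//li[contains(@class, "item-0") and @name="first"]')

print(jobs)
print(len(cells_whole_document), len(cells_this_row), parent_class, len(both))
```

Because xpath returns a list, indexing with [0] raises IndexError when nothing matches, so checking the list before indexing is the safer pattern on real pages.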
    

Origin www.cnblogs.com/zhoujun007/p/12342963.html