Python Crawler 2.2 - XPath Usage Tutorial

Overview

This series of documents is a simple tutorial on Python crawler techniques. It is meant to consolidate my own technical knowledge, and if it happens to be useful to you as well, so much the better.
The Python version used is 3.7.4.

Previous articles explained how to crawl page data from a website and how to use BeautifulSoup to extract data from a page. This article continues with how to parse the crawled pages and extract the data we want. Main reference: the Rookie Tutorial (Runoob).

XPath Introduction

What is XPath

XPath (XML Path Language) is a language for finding information in XML and HTML documents. It can be used to traverse the elements and attributes of XML and HTML documents.

XPath path expression

XPath uses path expressions to select a node or a set of nodes in an XML document. These path expressions look very much like the paths used in a conventional computer file system.

XPath standard functions

XPath contains over 100 built-in functions. They cover string values, numeric values, date and time comparison, node and QName manipulation, sequence manipulation, Boolean values, and more.
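
As a small illustration of such functions, the sketch below evaluates a few of them with lxml (the library introduced later in this article, assumed to be installed). lxml implements XPath 1.0, whose core function set is much smaller than the full list mentioned above. The two-book sample document is made up for this example.

```python
from lxml import etree

# Made-up sample data, only for this illustration.
doc = etree.fromstring(
    "<books>"
    "<book><name>alpha</name><price>10.5</price></book>"
    "<book><name>beta</name><price>4.5</price></book>"
    "</books>"
)

# count() returns a number; lxml hands it back as a Python float.
book_count = doc.xpath("count(//book)")
# sum() adds up the numeric value of every selected node.
total_price = doc.xpath("sum(//price)")
# string() returns the string value of the first selected node.
first_name = doc.xpath("string(//book[1]/name)")
# boolean() is true when the node set is not empty.
has_beta = doc.xpath("boolean(//name[starts-with(., 'be')])")

print(book_count, total_price, first_name, has_beta)
```

Note that a function call returns a plain value (float, string, or bool) rather than a list of nodes.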

XPath Development Tools

  1. Chrome plug-in XPath Helper (recommended).
  2. Firefox plug-in Try XPath.

XPath syntax

XPath uses path expressions to select a node or a set of nodes in an XML document. Nodes are selected by following a path or a sequence of steps.

Example XML document

We will use this XML document in the examples below.

    <?xml version="1.0" encoding="UTF-8"?>
    <bookstore>
        <book>
            <title lang="eng">平凡的世界</title>
            <author>路遥</author>
            <price>40.8</price>
        </book>
        <book>
            <title lang="zh_CN"></title>
            <author>莫言</author>
            <price>23.6</price>
        </book>
    </bookstore>

Selecting nodes

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path or a sequence of steps. The most useful path expressions are listed below:

Expression    Description
nodename      Selects the child elements of the current node that have this name.
/             Selects from the root node.
//            Selects matching nodes anywhere in the document below the current node, regardless of their position.
.             Selects the current node.
..            Selects the parent of the current node.
@             Selects attributes.

Specific examples:

Path expression    Result
bookstore          Selects the bookstore elements that are children of the current node.
/bookstore         Selects the root element bookstore. Note: a path that starts with a forward slash (/) is always an absolute path to an element.
bookstore/book     Selects all book elements that are children of bookstore.
//book             Selects all book elements, regardless of their position in the document.
bookstore//book    Selects all book elements that are descendants of the bookstore element, no matter where they sit below bookstore.
//@lang            Selects all attributes named lang.
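
The path expressions above can be tried out with a short sketch like the following. It assumes lxml is installed (the library is introduced later in this article) and embeds the example bookstore document as a string without the XML declaration. The variable names are only illustrative.

```python
from lxml import etree

XML = """
<bookstore>
    <book>
        <title lang="eng">平凡的世界</title>
        <author>路遥</author>
        <price>40.8</price>
    </book>
    <book>
        <title lang="zh_CN"></title>
        <author>莫言</author>
        <price>23.6</price>
    </book>
</bookstore>
"""
root = etree.fromstring(XML.strip())

# Absolute path: selects the root element itself.
root_tags = [el.tag for el in root.xpath("/bookstore")]
# Relative path, evaluated from the bookstore element: its book children.
books = root.xpath("book")
# // finds book elements anywhere in the document.
all_books = root.xpath("//book")
# @ selects attributes; lxml returns the attribute values as strings.
langs = root.xpath("//@lang")
# .. steps from a title up to its parent element.
parent_tag = root.xpath("//title")[0].xpath("..")[0].tag

print(root_tags, len(books), len(all_books), langs, parent_tag)
```

Element results come back as a list of element objects, while attribute results come back as a list of strings.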

Predicates

A predicate is used to find a specific node or a node that contains a specific value.
Predicates are written inside square brackets.
The table below lists some path expressions with predicates, together with their results:

Path expression                       Result
/bookstore/book[1]                    Selects the first book element that is a child of bookstore.
/bookstore/book[last()]               Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]             Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]         Selects the first two book elements that are children of bookstore.
//title[@lang]                        Selects all title elements that have an attribute named lang.
//title[@lang='eng']                  Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]          Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title    Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00.
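
A minimal sketch of these predicates, run against the example bookstore document with lxml (assumed to be installed; the variable names are only illustrative):

```python
from lxml import etree

XML = """
<bookstore>
    <book>
        <title lang="eng">平凡的世界</title>
        <author>路遥</author>
        <price>40.8</price>
    </book>
    <book>
        <title lang="zh_CN"></title>
        <author>莫言</author>
        <price>23.6</price>
    </book>
</bookstore>
"""
root = etree.fromstring(XML.strip())

# Positional predicates: indexes start at 1.
first_author = root.xpath("/bookstore/book[1]/author/text()")
last_author = root.xpath("/bookstore/book[last()]/author/text()")
first_two = root.xpath("/bookstore/book[position()<3]")

# Attribute predicates: test for presence, or for a specific value.
titles_with_lang = root.xpath("//title[@lang]")
eng_titles = root.xpath("//title[@lang='eng']/text()")

# Value predicate: compare the numeric value of a child element.
expensive_titles = root.xpath("/bookstore/book[price>35.00]/title/text()")

print(first_author, last_author, len(first_two))
print(len(titles_with_lang), eng_titles, expensive_titles)
```

With only two books in the document, `position()<3` matches both of them and `last()-1` would point at the first book.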

Selecting unknown nodes

XPath wildcards can be used to select unknown XML elements.

Wildcard    Description
*           Matches any element node.
@*          Matches any attribute node.
node()      Matches a node of any type.

The table below lists some path expressions and their results:

Path expression    Result
/bookstore/*       Selects all child elements of the bookstore element.
//*                Selects all elements in the document.
//title[@*]        Selects all title elements that have at least one attribute.
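
The wildcards can be checked with a sketch like this one, again using lxml (assumed installed) and the example bookstore document:

```python
from lxml import etree

XML = """
<bookstore>
    <book>
        <title lang="eng">平凡的世界</title>
        <author>路遥</author>
        <price>40.8</price>
    </book>
    <book>
        <title lang="zh_CN"></title>
        <author>莫言</author>
        <price>23.6</price>
    </book>
</bookstore>
"""
root = etree.fromstring(XML.strip())

# * matches element nodes only: the two book children of bookstore.
child_tags = [el.tag for el in root.xpath("/bookstore/*")]
# //* matches every element in the document.
all_elements = root.xpath("//*")
# @* matches any attribute, so this keeps titles that have at least one.
titles_with_attrs = root.xpath("//title[@*]")
# node() matches any node type, so the whitespace text nodes
# between the book elements are returned as well.
child_nodes = root.xpath("/bookstore/node()")

print(child_tags, len(all_elements), len(titles_with_attrs))
print(len(child_nodes) > len(child_tags))
```

The last comparison shows the practical difference between `*` and `node()`: the latter also returns text nodes.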

Selecting several paths

By using the "|" operator in a path expression, you can select several paths at once.
The table below lists some path expressions and their results:

Path expression                    Result
//book/title | //book/price        Selects all title and price elements of all book elements.
//title | //price                  Selects all title and price elements in the document.
/bookstore/book/title | //price    Selects all title elements of the book elements of bookstore, as well as all price elements in the document.
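
A short sketch of the "|" operator with lxml (assumed installed), using the example bookstore document. The tags are sorted here only to make the printed result easy to compare.

```python
from lxml import etree

XML = """
<bookstore>
    <book>
        <title lang="eng">平凡的世界</title>
        <author>路遥</author>
        <price>40.8</price>
    </book>
    <book>
        <title lang="zh_CN"></title>
        <author>莫言</author>
        <price>23.6</price>
    </book>
</bookstore>
"""
root = etree.fromstring(XML.strip())

# Union of two paths: every title and every price under a book.
titles_and_prices = root.xpath("//book/title | //book/price")
union_tags = sorted(el.tag for el in titles_and_prices)

# The two sides of the union may use different kinds of paths.
mixed = root.xpath("/bookstore/book/title | //price")

print(union_tags, len(mixed))
```

Author elements are not part of either path, so they do not appear in the result.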

Points to note

  1. The difference between / and //: / selects direct child nodes only, while // selects all descendant nodes. In practice // is used more often, but which one to use depends on the situation.
  2. contains: an attribute sometimes holds several values; in that case you can use the contains function, for example:
    //input[contains(@class,"s_i")]
  3. Predicate indexes start from 1, not 0.
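
The three points above can be seen in a small sketch. The HTML fragment is hypothetical and exists only for this illustration; lxml is assumed to be installed.

```python
from lxml import etree

# Hypothetical HTML fragment, made up for this illustration.
HTML = """
<div id="head">
  <form>
    <input id="kw" class="s_ipt extra" name="wd">
  </form>
</div>
"""
doc = etree.HTML(HTML)

# / only looks at direct children: the input sits inside form, so no match.
direct = doc.xpath('//div[@id="head"]/input')
# // looks at all descendants, so the nested input is found.
nested = doc.xpath('//div[@id="head"]//input')

# An exact comparison fails because the class attribute holds two values.
exact = doc.xpath('//input[@class="s_ipt"]')
# contains() matches on a substring of the attribute value.
partial = doc.xpath('//input[contains(@class, "s_i")]')

# Indexes start at 1; index 0 never matches anything.
index_one = doc.xpath('//div[@id="head"]/form[1]')
index_zero = doc.xpath('//div[@id="head"]/form[0]')

print(len(direct), len(nested), len(exact), len(partial))
print(len(index_one), len(index_zero))
```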

XPath Examples

  1. Locating by attribute
    //input[@id='kw']
  2. Locating by index and by hierarchy
    //div[@id='head']/div/div[2]/a[1]
    //div[@id='head']//a[@class='toindex']
  3. Logical operations
    //input[@class="s_ipt" and @name="wd"]
  4. Fuzzy matching
    contains
    //input[contains(@class,"s_i")]
    starts-with
    //input[starts-with(@class,"s")]
  5. Getting text
    //div[@id="head"]//a/text()
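
To see what these expressions return, here is a sketch that runs them against a hypothetical page fragment. The markup, link texts, and href values are invented to fit the expressions; they are not taken from a real site. lxml is assumed to be installed.

```python
from lxml import etree

# Hypothetical page fragment shaped to fit the expressions above.
PAGE = """
<div id="head">
  <div>
    <div><a class="toindex" href="/">Home</a></div>
    <div><a class="mnav" href="/news">News</a><a class="mnav" href="/map">Map</a></div>
  </div>
  <input id="kw" class="s_ipt" name="wd">
</div>
"""
doc = etree.HTML(PAGE)

# 1. Locating by attribute
by_id = doc.xpath("//input[@id='kw']")
# 2. Locating by index and by hierarchy
indexed = doc.xpath("//div[@id='head']/div/div[2]/a[1]/text()")
by_class = doc.xpath("//div[@id='head']//a[@class='toindex']/@href")
# 3. Logical operations
logical = doc.xpath('//input[@class="s_ipt" and @name="wd"]')
# 4. Fuzzy matching
prefix = doc.xpath('//input[starts-with(@class, "s")]')
# 5. Getting text
texts = doc.xpath('//div[@id="head"]//a/text()')

print(len(by_id), indexed, by_class, len(logical), len(prefix), texts)
```

Appending `/text()` or `/@href` to a path turns the result from a list of elements into a list of strings, which is usually what a crawler wants.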

lxml library

lxml is an HTML/XML parser. Its main job is to parse HTML and XML and to extract data from them.

lxml is a third-party Python library, so you must install it before use:

    $ pip install lxml 

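Parsing HTML with lxml

  1. Parsing an HTML string: use lxml.etree.HTML, for example:
    # Import the lxml library
    from lxml import etree
    
    # Placeholder HTML string (assumed here so the snippet runs on its own)
    text = '<div><p>hello</p></div>'
    html_element = etree.HTML(text)
    print(etree.tostring(html_element, encoding='utf-8').decode())
  2. Parsing an HTML file: use lxml.etree.parse, for example:
    # Import the lxml library
    from lxml import etree
    
    # Build the tree object from the file
    html_element = etree.parse('xpath.html')
    print(etree.tostring(html_element, encoding='utf-8').decode())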

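This function uses the XML parser by default, so it fails on HTML that is not well-formed. In that case you need to create an HTML parser yourself. For example:

    # Import the lxml library
    from lxml import etree
    
    # Custom HTML parser
    parser = etree.HTMLParser(encoding='utf-8')
    # Build the tree object, passing the parser through the parser keyword
    html_element = etree.parse('xpath1.html', parser=parser)
    print(etree.tostring(html_element, encoding='utf-8').decode())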
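
The difference between the two parsers can be shown without any file on disk. The sketch below feeds the same non-well-formed HTML (made up for this example) to etree.parse through an in-memory BytesIO object, first with the default XML parser and then with an HTML parser.

```python
from io import BytesIO
from lxml import etree

# Made-up HTML with an unclosed <br> and unclosed <p> tags.
broken = b"<html><body><p>first<br><p>second</body></html>"

# The default XML parser rejects markup that is not well-formed.
try:
    etree.parse(BytesIO(broken))
    xml_parse_ok = True
except etree.XMLSyntaxError:
    xml_parse_ok = False

# An HTML parser repairs the markup and builds a usable tree.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(BytesIO(broken), parser=parser)
paragraphs = tree.xpath("//p/text()")

print(xml_parse_ok, paragraphs)
```

etree.parse accepts a file name or a file-like object, so the same code works with a real file in place of BytesIO.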

Using XPath syntax in lxml

Use etree.parse or etree.HTML depending on whether the HTML comes from a file or from a string:

    # Import the lxml library
    from lxml import etree
    
    # Build the tree object
    tree = etree.parse('xpath.html')
    # ret = tree.xpath('//div[@class="tang"]/ul/li[1]/text()')
    # ret = tree.xpath('//div[@class="tang"]/ul/li[last()]/a/@href')
    ret = tree.xpath('//div[@class="tang"]/ul/li[@class="love" and @name="yang"]')
    print(ret)
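
Since the contents of xpath.html are not shown here, the following sketch runs the same three queries against a hypothetical HTML string instead. The markup, texts, and example.com links are assumptions invented to match the expressions.

```python
from lxml import etree

# Hypothetical markup, invented to match the queries above.
HTML = """
<div class="tang">
  <ul>
    <li class="love" name="yang">first item</li>
    <li><a href="http://example.com/a">link A</a></li>
    <li><a href="http://example.com/b">link B</a></li>
  </ul>
</div>
"""
# The HTML comes from a string, so etree.HTML is used instead of etree.parse.
tree = etree.HTML(HTML)

# Text of the first li.
first_text = tree.xpath('//div[@class="tang"]/ul/li[1]/text()')
# href of the link inside the last li.
last_href = tree.xpath('//div[@class="tang"]/ul/li[last()]/a/@href')
# Elements matching both attribute conditions.
matches = tree.xpath('//div[@class="tang"]/ul/li[@class="love" and @name="yang"]')

print(first_text, last_href)
# Element results are objects; read .text or .get() to reach their data.
print([(el.tag, el.text, el.get("name")) for el in matches])
```

Printing an element list directly only shows object representations, which is why the last line pulls out the tag, text, and attribute explicitly.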

Origin blog.csdn.net/Zhihua_W/article/details/100688907