Python Crawler 2.2 - XPath Usage Tutorial
Overview
This series is a simple tutorial for learning Python crawler technology. I write it to explain and consolidate my own technical knowledge; if it also happens to be useful to you, so much the better.
The Python version used is 3.7.4.
Previous articles explained how to crawl page data from a website and how to use BeautifulSoup to extract data from the page. This article continues with how to parse the crawled pages to get the data we want. Main reference: Rookie Tutorial
XPath Introduction
What is XPath
XPath (XML Path Language) is a language for finding information in XML and HTML documents. It can be used to traverse the elements and attributes of XML and HTML documents.
XPath path expression
XPath uses path expressions to select a node or a set of nodes in an XML document. These path expressions look very similar to the paths used in a conventional computer file system.
XPath standard functions
XPath contains over 100 built-in functions. These functions cover string values, numeric values, date and time comparison, node and QName handling, sequence handling, Boolean values, and more.
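As a small illustration of built-in functions, the sketch below calls a few of them through lxml (the library introduced later in this article). It assumes lxml is installed. Note that lxml evaluates XPath 1.0, whose core function library is smaller than the 100+ functions of later XPath versions. The sample XML and the variable names are made up for illustration.

```python
from lxml import etree

# Illustrative sample data, not taken from a real site.
xml = (
    b"<bookstore>"
    b"<book><title>A</title><price>40.8</price></book>"
    b"<book><title>B</title><price>23.6</price></book>"
    b"</bookstore>"
)
root = etree.fromstring(xml)

# count() and sum() return numbers (Python floats in lxml).
book_count = root.xpath("count(//book)")
total_price = root.xpath("sum(//book/price)")

# string() returns the text value of the first matching node.
first_title = root.xpath("string(//book[1]/title)")

# boolean() is true when the node set is not empty.
has_cheap_book = root.xpath("boolean(//book[price < 30])")

print(book_count, round(total_price, 1), first_title, has_cheap_book)
```

When the expression is a function call rather than a path, `xpath()` returns a single number, string, or Boolean instead of a list of nodes.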
XPath Development Tools
- Chrome plug-in XPath Helper (recommended).
- Firefox add-on Try XPath.
XPath syntax
XPath uses path expressions to select a node or a set of nodes in an XML document. Nodes are selected by following a path or a sequence of steps.
XML instance document
We will use this XML document in the examples below.
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
<title lang="eng">平凡的世界</title>
<author>路遥</author>
<price>40.8</price>
</book>
<book>
<title lang="zh_CN">蛙</title>
<author>莫言</author>
<price>23.6</price>
</book>
</bookstore>
Selecting nodes
XPath uses path expressions to select nodes in an XML document. Listed below are the most useful path expressions:
Expression | Description |
---|---|
nodename | Selects the child elements of the current node that are named nodename. |
/ | Selects from the root node. |
// | Selects nodes in the document that match the selection, starting from the current node, regardless of where they are. |
. | Selects the current node. |
.. | Selects the parent of the current node. |
@ | Selects attributes. |
Specific examples:
Path expression | Result |
---|---|
bookstore | Selects the bookstore elements that are children of the current node. |
/bookstore | Selects the root element bookstore. Note: a path that starts with a forward slash (/) always represents an absolute path to an element. |
bookstore/book | Selects all book elements that are children of bookstore. |
//book | Selects all book elements, regardless of their position in the document. |
bookstore//book | Selects all book elements that are descendants of the bookstore element, regardless of where they sit below bookstore. |
//@lang | Selects all attributes named lang. |
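The path expressions above can be tried out with lxml (introduced later in this article). The following is a minimal sketch that assumes lxml is installed and loads the sample bookstore document from a string. The variable names are illustrative.

```python
from lxml import etree

# The sample document from this article. It is encoded to bytes because
# lxml rejects str input that carries an XML encoding declaration.
BOOKSTORE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
<title lang="eng">平凡的世界</title>
<author>路遥</author>
<price>40.8</price>
</book>
<book>
<title lang="zh_CN">蛙</title>
<author>莫言</author>
<price>23.6</price>
</book>
</bookstore>
""".encode("utf-8")

root = etree.fromstring(BOOKSTORE_XML)  # root is the <bookstore> element

root_elements = root.xpath("/bookstore")     # absolute path to the root element
child_books = root.xpath("book")             # relative: book children of the current node
all_books = root.xpath("//book")             # book elements anywhere in the document
parent_of_book = all_books[0].xpath("..")    # parent of the first book
lang_values = root.xpath("//@lang")          # all lang attribute values

print([e.tag for e in root_elements], len(child_books), len(all_books))
print(parent_of_book[0].tag, lang_values)
```

A relative expression such as `book` is evaluated against the element that `xpath()` is called on, while expressions starting with `/` or `//` start from the document root.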
Predicates
A predicate is used to find a specific node or a node that contains a specified value.
Predicates are written inside square brackets.
The table below lists some path expressions with predicates and the result of each expression:
Path expression | Result |
---|---|
/bookstore/book[1] | Selects the first book element that is a child of bookstore. |
/bookstore/book[last()] | Selects the last book element that is a child of bookstore. |
/bookstore/book[last()-1] | Selects the second-to-last book element that is a child of bookstore. |
/bookstore/book[position()<3] | Selects the first two book elements that are children of bookstore. |
//title[@lang] | Selects all title elements that have an attribute named lang. |
//title[@lang='eng'] | Selects all title elements that have a lang attribute with the value eng. |
/bookstore/book[price>35.00] | Selects all book elements of bookstore whose price element has a value greater than 35.00. |
/bookstore/book[price>35.00]/title | Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00. |
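To see the predicates in action, the sketch below evaluates several of them against the sample bookstore document with lxml. It assumes lxml is installed. The variable names are illustrative.

```python
from lxml import etree

# The sample document from this article, encoded to bytes for lxml.
BOOKSTORE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
<title lang="eng">平凡的世界</title>
<author>路遥</author>
<price>40.8</price>
</book>
<book>
<title lang="zh_CN">蛙</title>
<author>莫言</author>
<price>23.6</price>
</book>
</bookstore>
""".encode("utf-8")

root = etree.fromstring(BOOKSTORE_XML)

# Positional predicates (indexes start at 1).
first_title = root.xpath("/bookstore/book[1]/title/text()")
last_title = root.xpath("/bookstore/book[last()]/title/text()")
first_two = root.xpath("/bookstore/book[position()<3]")

# Attribute predicates.
titles_with_lang = root.xpath("//title[@lang]")
english_titles = root.xpath("//title[@lang='eng']/text()")

# Value predicate: books whose price is greater than 35.00.
expensive_titles = root.xpath("/bookstore/book[price>35.00]/title/text()")

print(first_title, last_title, len(first_two))
print(len(titles_with_lang), english_titles, expensive_titles)
```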
Selecting unknown nodes
XPath wildcards can be used to select unknown XML elements.
Wildcard | Description |
---|---|
* | Matches any element node. |
@* | Matches any attribute node. |
node() | Matches any node of any kind. |
The table below lists some path expressions and their results:
Path expression | Result |
---|---|
/bookstore/* | Selects all child elements of the bookstore element. |
//* | Selects all elements in the document. |
//title[@*] | Selects all title elements that have at least one attribute. |
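A short sketch of the wildcard expressions, again run against the sample bookstore document with lxml (assumed to be installed). The variable names are illustrative.

```python
from lxml import etree

# The sample document from this article, encoded to bytes for lxml.
BOOKSTORE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
<title lang="eng">平凡的世界</title>
<author>路遥</author>
<price>40.8</price>
</book>
<book>
<title lang="zh_CN">蛙</title>
<author>莫言</author>
<price>23.6</price>
</book>
</bookstore>
""".encode("utf-8")

root = etree.fromstring(BOOKSTORE_XML)

children = root.xpath("/bookstore/*")            # all child elements of bookstore
all_elements = root.xpath("//*")                 # every element in the document
titles_with_attrs = root.xpath("//title[@*]")    # title elements with any attribute
title_attr_values = root.xpath("//title/@*")     # values of all attributes on title

print([e.tag for e in children])
print(len(all_elements), len(titles_with_attrs), title_attr_values)
```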
Selecting several paths
By using the "|" operator in a path expression, you can select several paths.
The table below lists some path expressions and their results:
Path expression | Result |
---|---|
//book/title | //book/price | Selects all title and price elements of all book elements. |
//title | //price | Selects all title and price elements in the document. |
/bookstore/book/title | //price | Selects all title elements of the book elements of bookstore, as well as all price elements in the document. |
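The "|" operator can be sketched with lxml as follows. This assumes lxml is installed and reuses the sample bookstore document. The variable names are illustrative.

```python
from lxml import etree

# The sample document from this article, encoded to bytes for lxml.
BOOKSTORE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
<title lang="eng">平凡的世界</title>
<author>路遥</author>
<price>40.8</price>
</book>
<book>
<title lang="zh_CN">蛙</title>
<author>莫言</author>
<price>23.6</price>
</book>
</bookstore>
""".encode("utf-8")

root = etree.fromstring(BOOKSTORE_XML)

# Union of two paths: every title and every price under a book.
titles_and_prices = root.xpath("//book/title | //book/price")

# Each node is returned once, even if both paths match it.
no_duplicates = root.xpath("//title | //book/title")

for node in titles_and_prices:
    print(node.tag, node.text)
```

The result of a union is a single node set, so a node matched by both sides appears only once.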
Points to note
- Difference between / and //: / selects direct child nodes, while // selects all descendant nodes. In general // is used more often; which one to use depends on the situation.
- contains: an attribute sometimes contains several values. In that case you can use the contains function, as in the following sample code:
//input[contains(@class,"s_i")]
- Predicate indexes start at 1, not 0.
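The three points above can be checked with a small sketch. The HTML fragment below is hypothetical (it is not taken from a real page), and lxml is assumed to be installed.

```python
from lxml import etree

# Hypothetical HTML fragment, made up for illustration.
HTML = """
<div id="head">
  <form>
    <input class="s_ipt big" name="wd">
  </form>
  <a class="toindex" href="/">Home</a>
</div>
"""

root = etree.HTML(HTML)

# / only looks at direct children: the input sits inside <form>, so no match.
direct_inputs = root.xpath('//div[@id="head"]/input')
# // looks at all descendants, so the nested input is found.
nested_inputs = root.xpath('//div[@id="head"]//input')

# The class attribute holds two values; contains matches on a substring.
exact_match = root.xpath('//input[@class="s_ipt"]')
fuzzy_match = root.xpath('//input[contains(@class, "s_i")]')

# Predicate indexes start at 1; index 0 never matches anything.
first_link = root.xpath('//div[@id="head"]/a[1]')
zeroth_link = root.xpath('//div[@id="head"]/a[0]')

print(len(direct_inputs), len(nested_inputs))
print(len(exact_match), len(fuzzy_match))
print(len(first_link), len(zeroth_link))
```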
XPath Examples
- Locating by attribute
//input[@id='kw']
- Locating by index and by hierarchy
//div[@id='head']/div/div[2]/a[1]
//div[@id='head']//a[@class='toindex']
- Logical operations
//input[@class="s_ipt" and @name="wd"]
- Fuzzy matching
contains
//input[contains(@class,"s_i")]
starts-with
//input[starts-with(@class,"s")]
- Getting text
//div[@id="head"]//a/text()
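The example expressions above can be run against a page fragment with lxml. The HTML below is a hypothetical fragment built only so that each expression has something to match; lxml is assumed to be installed.

```python
from lxml import etree

# Hypothetical page fragment, made up so that every expression matches.
HTML = """
<div id="head">
  <div>
    <div><a href="/news">News</a></div>
    <div><a class="toindex" href="/">Home</a><a href="/more">More</a></div>
  </div>
  <form><input id="kw" class="s_ipt" name="wd"></form>
</div>
"""

root = etree.HTML(HTML)

by_id = root.xpath("//input[@id='kw']")
by_index = root.xpath("//div[@id='head']/div/div[2]/a[1]/text()")
by_hierarchy = root.xpath("//div[@id='head']//a[@class='toindex']/@href")
by_logic = root.xpath('//input[@class="s_ipt" and @name="wd"]')
by_prefix = root.xpath('//input[starts-with(@class,"s")]')
link_texts = root.xpath('//div[@id="head"]//a/text()')

print(by_id[0].get("name"), by_index, by_hierarchy)
print(len(by_logic), len(by_prefix), link_texts)
```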
lxml library
lxml is an HTML/XML parser whose main function is to parse HTML/XML and extract data from it.
lxml is a third-party Python library and must be installed before use:
$ pip install lxml
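Parsing HTML code with lxml
- Parsing an HTML string: use lxml.etree.HTML to parse it. Sample code:
# Import the lxml library
from lxml import etree
# Placeholder HTML string; in practice this is the crawled page source
text = '<div><ul><li>item 1</li><li>item 2</li></ul></div>'
# Parse the string into an element tree
html_element = etree.HTML(text)
print(etree.tostring(html_element, encoding='utf-8').decode())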
- Parsing an HTML file: use lxml.etree.parse to parse it. Sample code:
# Import the lxml library
from lxml import etree
# Build the tree object
html_element = etree.parse('xpath.html')
print(etree.tostring(html_element, encoding='utf-8').decode())
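This function uses the XML parser by default, so it raises a parse error when it meets HTML that is not well-formed. In that case you need to create an HTML parser yourself. Sample code:
# Import the lxml library
from lxml import etree
# Create a custom HTML parser
parser = etree.HTMLParser(encoding='utf-8')
# Build the tree object; the keyword argument is named parser
html_element = etree.parse('xpath1.html', parser=parser)
print(etree.tostring(html_element, encoding='utf-8').decode())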
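The difference between the default XML parser and an HTML parser can be sketched without any files on disk. The example below assumes lxml is installed and feeds a deliberately broken, made-up HTML snippet through `io.BytesIO`, which stands in for a file.

```python
import io

from lxml import etree

# Made-up HTML with unclosed <li> and <br> tags: fine for browsers,
# but not well-formed XML.
BROKEN_HTML = b"<html><body><ul><li>first<li>second</ul><br></body></html>"

# The default (XML) parser rejects the document.
try:
    etree.parse(io.BytesIO(BROKEN_HTML))
    xml_parse_ok = True
except etree.XMLSyntaxError:
    xml_parse_ok = False

# An explicit HTML parser repairs the markup while parsing.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(io.BytesIO(BROKEN_HTML), parser=parser)
items = tree.xpath("//li/text()")

print(xml_parse_ok, items)
```

Passing a file name instead of the `BytesIO` object works the same way.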
Using XPath syntax in lxml
Use etree.parse or etree.HTML depending on whether the HTML is a file or a string.
# Import the lxml library
from lxml import etree
# Build the tree object
tree = etree.parse('xpath.html')
# ret = tree.xpath('//div[@class="tang"]/ul/li[1]/text()')
# ret = tree.xpath('//div[@class="tang"]/ul/li[last()]/a/@href')
ret = tree.xpath('//div[@class="tang"]/ul/li[@class="love" and @name="yang"]')
print(ret)
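Since the contents of xpath.html are not shown in this article, the following sketch uses a hypothetical HTML string with the same div/ul/li shape, so the three expressions can be run as-is. It assumes lxml is installed; the markup and URLs are invented for illustration.

```python
from lxml import etree

# Hypothetical stand-in for xpath.html; the real file is not shown here.
HTML = """
<div class="tang">
  <ul>
    <li class="love" name="yang">first item</li>
    <li><a href="http://example.com/a">A</a></li>
    <li><a href="http://example.com/last">last</a></li>
  </ul>
</div>
"""

tree = etree.HTML(HTML)

# text() returns a list of strings.
first_text = tree.xpath('//div[@class="tang"]/ul/li[1]/text()')
# @href returns a list of attribute values.
last_href = tree.xpath('//div[@class="tang"]/ul/li[last()]/a/@href')
# An element path returns a list of element objects.
matched = tree.xpath('//div[@class="tang"]/ul/li[@class="love" and @name="yang"]')

print(first_text, last_href)
for li in matched:
    # Printing the list itself only shows element objects; read .text or
    # .get() to see the content.
    print(li.tag, li.text, li.get("name"))
```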