Python Crawler (x): XPath and the lxml Library

Some readers have said that they are not good with regular expressions and find processing HTML documents with them tiring. Is there another way?
Yes: XPath. We first parse the HTML document into an XML-style document, and then use XPath to find the HTML nodes or elements we need.

What is XML

    • XML stands for Extensible Markup Language.
    • XML is a markup language, much like HTML.
    • XML is designed to transmit data rather than to display it.
    • XML tags are not predefined; you define your own tags.
    • XML is designed to be self-descriptive.
    • XML is a W3C Recommendation.

W3School documentation: http://www.w3school.com.cn/xml/index.asp

The difference between XML and HTML

Data format   Description                      Design goal
XML           Extensible Markup Language       Designed to transmit and store data; the focus is on the content of the data.
HTML          HyperText Markup Language        Designed to display data; the focus is on how the data looks.
HTML DOM      Document Object Model for HTML   Gives access to all HTML elements, together with the text and attributes they contain. Their contents can be modified or deleted, and new elements can be created.

XML document example

<?xml version="1.0" encoding="utf-8"?>

<bookstore> 

  <book category="cooking"> 
    <title lang="en">Everyday Italian</title>  
    <author>Giada De Laurentiis</author>  
    <year>2005</year>  
    <price>30.00</price> 
  </book>  

  <book category="children"> 
    <title lang="en">Harry Potter</title>  
    <author>J K. Rowling</author>  
    <year>2005</year>  
    <price>29.99</price> 
  </book>  

  <book category="web"> 
    <title lang="en">XQuery Kick Start</title>  
    <author>James McGovern</author>  
    <author>Per Bothner</author>  
    <author>Kurt Cagle</author>  
    <author>James Linn</author>  
    <author>Vaidyanathan Nagarajan</author>  
    <year>2003</year>  
    <price>49.99</price> 
  </book> 

  <book category="web" cover="paperback"> 
    <title lang="en">Learning XML</title>  
    <author>Erik T. Ray</author>  
    <year>2003</year>  
    <price>39.95</price> 
  </book> 

</bookstore>

HTML DOM model example

The HTML DOM standard defines how to access and manipulate an HTML document, and it represents the document as a tree structure.

XML node relationships

1. Parent
Every element and attribute has one parent.
In the following example, the book element is the parent of the title, author, year, and price elements:

<?xml version="1.0" encoding="utf-8"?>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

 

2. Children
An element node may have zero, one, or more children.
In the following example, the title, author, year, and price elements are all children of the book element:

<?xml version="1.0" encoding="utf-8"?>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

3. Siblings
Siblings are nodes that have the same parent.
In the following example, the title, author, year, and price elements are all siblings:

<?xml version="1.0" encoding="utf-8"?>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

 

4. Ancestors
A node's ancestors are its parent, its parent's parent, and so on.
In the following example, the ancestors of the title element are the book element and the bookstore element:

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore>

 

5. Descendants
A node's descendants are its children, its children's children, and so on.
In the following example, the descendants of bookstore are the book, title, author, year, and price elements:

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore>
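The five relationships above can also be explored from Python. The following is a minimal sketch, assuming the lxml library (introduced later in this article) is installed; the variable names are illustrative only, and the document is the Harry Potter example written on one line.

```python
from lxml import etree

# The bookstore example from above, written without whitespace between tags.
root = etree.fromstring(
    '<bookstore><book>'
    '<title>Harry Potter</title>'
    '<author>J K. Rowling</author>'
    '<year>2005</year>'
    '<price>29.99</price>'
    '</book></bookstore>'
)

title = root.find('book/title')

# Parent: the element that directly contains title.
parent_tag = title.getparent().tag

# Children: iterating over an element yields its child elements.
child_tags = [child.tag for child in title.getparent()]

# Siblings: itersiblings() yields the siblings that follow title.
sibling_tags = [s.tag for s in title.itersiblings()]

# Ancestors: parent, parent's parent, and so on (nearest first).
ancestor_tags = [a.tag for a in title.iterancestors()]

# Descendants: children, children's children, and so on.
descendant_tags = [d.tag for d in root.iterdescendants()]

print(parent_tag)       # book
print(child_tags)       # ['title', 'author', 'year', 'price']
print(sibling_tags)     # ['author', 'year', 'price']
print(ancestor_tags)    # ['book', 'bookstore']
print(descendant_tags)  # ['book', 'title', 'author', 'year', 'price']
```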

 

What is XPath?

XPath (XML Path Language) is a language for finding information in an XML document. It can be used to traverse the elements and attributes of the document.

W3School documentation: http://www.w3school.com.cn/xpath/index.asp

XPath Development Tools

  1. Open-source XPath expression editor: XMLQuire (works with XML-format files)
  2. Chrome plug-in: XPath Helper
  3. Firefox plug-in: XPath Checker

Selecting nodes

XPath uses path expressions to select a node or a set of nodes in an XML document. These path expressions look very much like the paths used in an ordinary computer file system.
The most common path expressions are listed below:

Expression   Description
nodename     Selects the child elements of the current node that are named nodename.
/            Selects from the root node.
//           Selects matching nodes anywhere in the document, regardless of their position.
.            Selects the current node.
..           Selects the parent of the current node.
@            Selects attributes.

The table below lists some path expressions and their results:

Path expression   Result
bookstore         Selects all bookstore elements that are children of the current node.
/bookstore        Selects the root element bookstore. Note: a path that starts with a forward slash (/) is always an absolute path to an element.
bookstore/book    Selects all book elements that are children of bookstore.
//book            Selects all book elements, regardless of their position in the document.
bookstore//book   Selects all book elements that are descendants of the bookstore element, no matter where they sit below bookstore.
//@lang           Selects all attributes named lang.
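To see these path expressions in action, here is a small sketch that assumes the lxml library (covered later in this article) is installed. It runs them against a trimmed copy of the bookstore document; the constant and variable names are illustrative. Relative expressions such as book are evaluated from the element that xpath() is called on.

```python
from lxml import etree

# Trimmed copy of the bookstore document used earlier in this article.
BOOKSTORE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
</bookstore>
"""

root = etree.fromstring(BOOKSTORE_XML)  # the <bookstore> element

# /bookstore : absolute path to the root element.
root_tags = [e.tag for e in root.xpath('/bookstore')]

# book : child elements of the current node (root) that are named book.
child_books = root.xpath('book')

# //book : every book element, wherever it is in the document.
all_books = root.xpath('//book')

# //@lang : every attribute named lang.
langs = root.xpath('//@lang')

# . and .. : the current node and its parent.
first_title = root.xpath('//title')[0]
self_tag = first_title.xpath('.')[0].tag
parent_tag = first_title.xpath('..')[0].tag

print(root_tags)         # ['bookstore']
print(len(child_books))  # 2
print(len(all_books))    # 2
print(langs)             # ['en', 'en']
print(self_tag)          # title
print(parent_tag)        # book
```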

Predicates

A predicate is used to find a specific node, or a node that contains a specific value. Predicates are written inside square brackets.
The table below lists some path expressions with predicates and their results:

Path expression                      Result
/bookstore/book[1]                   Selects the first book element that is a child of bookstore.
/bookstore/book[last()]              Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]            Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]        Selects the first two book elements that are children of bookstore.
//title[@lang]                       Selects all title elements that have an attribute named lang.
//title[@lang="eng"]                 Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]         Selects all book elements under bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title   Selects the title elements of all book elements under bookstore whose price element has a value greater than 35.00.
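The predicates above can be tried out with a short sketch like the one below. It assumes lxml is installed and uses a trimmed copy of the four-book bookstore document; names are illustrative. Note that the sample document uses lang="en", so the sketch matches on "en" rather than "eng".

```python
from lxml import etree

root = etree.fromstring(
    '<bookstore>'
    '<book><title lang="en">Everyday Italian</title><price>30.00</price></book>'
    '<book><title lang="en">Harry Potter</title><price>29.99</price></book>'
    '<book><title lang="en">XQuery Kick Start</title><price>49.99</price></book>'
    '<book><title lang="en">Learning XML</title><price>39.95</price></book>'
    '</bookstore>'
)

# Positional predicates (XPath positions start at 1, not 0).
first = root.xpath('/bookstore/book[1]/title/text()')
last = root.xpath('/bookstore/book[last()]/title/text()')
second_to_last = root.xpath('/bookstore/book[last()-1]/title/text()')
first_two = root.xpath('/bookstore/book[position()<3]/title/text()')

# Attribute predicates.
with_lang = root.xpath('//title[@lang]')
lang_en = root.xpath('//title[@lang="en"]')
lang_eng = root.xpath('//title[@lang="eng"]')

# Value predicate: compare the numeric value of a child element.
expensive = root.xpath('/bookstore/book[price>35.00]/title/text()')

print(first)           # ['Everyday Italian']
print(last)            # ['Learning XML']
print(second_to_last)  # ['XQuery Kick Start']
print(first_two)       # ['Everyday Italian', 'Harry Potter']
print(len(with_lang), len(lang_en), len(lang_eng))  # 4 4 0
print(expensive)       # ['XQuery Kick Start', 'Learning XML']
```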

Selecting unknown nodes

XPath wildcards can be used to select unknown XML elements.

Wildcard   Description
*          Matches any element node.
@*         Matches any attribute node.
node()     Matches a node of any type.

The table below lists some path expressions and their results:

Path expression   Result
/bookstore/*      Selects all child elements of the bookstore element.
//*               Selects all elements in the document.
//title[@*]       Selects all title elements that have at least one attribute.
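A brief sketch of the wildcards, assuming lxml is installed. The small document here is made up for illustration: it is written without whitespace between tags, and the second book (with an attribute-free title) is a hypothetical addition so that title[@*] has something to exclude.

```python
from lxml import etree

root = etree.fromstring(
    '<bookstore>'
    '<book category="web" cover="paperback">'
    '<title lang="en">Learning XML</title><price>39.95</price>'
    '</book>'
    '<book><title>Plain Title</title></book>'
    '</bookstore>'
)

# * matches any element node.
children = [e.tag for e in root.xpath('/bookstore/*')]
all_elements = root.xpath('//*')

# @* matches any attribute node.
titles_with_attrs = root.xpath('//title[@*]/text()')
book_attrs = root.xpath('//book/@*')

# node() matches nodes of any type; here the titles contain only text nodes.
title_nodes = root.xpath('//title/node()')

print(children)            # ['book', 'book']
print(len(all_elements))   # 6
print(titles_with_attrs)   # ['Learning XML']
print(sorted(book_attrs))  # ['paperback', 'web']
print(title_nodes)         # ['Learning XML', 'Plain Title']
```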

Selecting several paths

By using the "|" operator in a path expression, you can select several paths at once.
The table below lists some path expressions and their results:

Path expression                   Result
//book/title | //book/price       Selects all the title and price elements of all book elements.
//title | //price                 Selects all title and price elements in the document.
/bookstore/book/title | //price   Selects all title elements of the book elements under bookstore, plus all price elements in the document.
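As a quick illustration of the "|" operator, the sketch below assumes lxml is installed and uses a two-book document; the names are illustrative. The combined result comes back as a single list in document order.

```python
from lxml import etree

root = etree.fromstring(
    '<bookstore>'
    '<book><title>Everyday Italian</title><year>2005</year><price>30.00</price></book>'
    '<book><title>Harry Potter</title><year>2005</year><price>29.99</price></book>'
    '</bookstore>'
)

# Select title and price elements in one query; year is left out.
nodes = root.xpath('//book/title | //book/price')
tags = [node.tag for node in nodes]
texts = [node.text for node in nodes]

print(tags)   # ['title', 'price', 'title', 'price']
print(texts)  # ['Everyday Italian', '30.00', 'Harry Potter', '29.99']
```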

XPath operators

[Image: table of XPath operators]
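A few of the standard XPath 1.0 operators (arithmetic, comparison, and the boolean and/or) can be sketched as follows. This assumes lxml is installed and is not a reproduction of the operator table; the document and variable names are illustrative. In lxml, numeric results come back as Python floats and boolean results as bools.

```python
from lxml import etree

root = etree.fromstring(
    '<bookstore>'
    '<book><title>Everyday Italian</title><price>30.00</price></book>'
    '<book><title>Harry Potter</title><price>29.99</price></book>'
    '<book><title>Learning XML</title><price>39.95</price></book>'
    '</bookstore>'
)

# Arithmetic operators: +, -, *, div, mod.
total = root.xpath('6 + 4')
quotient = root.xpath('8 div 4')
remainder = root.xpath('5 mod 2')

# Comparison operators: =, !=, <, <=, >, >= (count() is an XPath function).
has_three_books = root.xpath('count(//book) = 3')
not_thirty = root.xpath('//book[price != 30.00]/title/text()')

# Boolean operators: and, or.
mid_priced = root.xpath('//book[price > 29.99 and price < 35]/title/text()')
cheap_or_dear = root.xpath('//book[price < 30 or price > 35]/title/text()')

print(total, quotient, remainder)  # 10.0 2.0 1.0
print(has_three_books)             # True
print(not_thirty)                  # ['Harry Potter', 'Learning XML']
print(mid_priced)                  # ['Everyday Italian']
print(cheap_or_dear)               # ['Harry Potter', 'Learning XML']
```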

That covers the XPath syntax. To use it for scraping in Python, the HTML must first be converted into an XML document.

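The lxml library

lxml is an HTML/XML parser. Its main job is parsing HTML/XML and extracting data from it.
Like Python's regular-expression engine, lxml is implemented in C, which makes it a high-performance HTML/XML parser for Python. With the XPath syntax covered above, we can use it to quickly locate specific elements and node information.
lxml documentation: http://lxml.de/index.html
lxml depends on C libraries. Install it with pip: pip install lxml (or install it from a wheel).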

Getting started

We use it to parse HTML code. A simple example:

# lxml_test.py

# Use the etree module from lxml
from lxml import etree

# Note: the last <li> below is deliberately missing its closing </li> tag
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

# Parse the string into an HTML document with etree.HTML
html = etree.HTML(text)

# Serialize the document back to a string; tostring() returns bytes
result = etree.tostring(html)

print(result.decode('utf-8'))

 

Output:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>

lxml automatically repairs the HTML: in this example it not only adds the missing </li> tag but also wraps the content in html and body tags.
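The repair behaviour can be checked programmatically. The following is a minimal sketch, assuming lxml is installed; the one-line fragment is a made-up example with a missing </li>, and the variable names are illustrative.

```python
from lxml import etree

# A fragment with no <html>/<body> wrapper and a missing </li> tag.
broken = '<div><ul><li class="item-0"><a href="link1.html">first item</a></ul></div>'

root = etree.HTML(broken)

# etree.HTML returns the root element of the repaired document.
root_tag = root.tag
top_level = [child.tag for child in root]
items = root.xpath('//li')

# encoding='unicode' makes tostring() return str instead of bytes.
serialized = etree.tostring(root, encoding='unicode')

print(root_tag)                # html
print(top_level)               # ['body']
print(len(items))              # 1
print('</li>' in serialized)   # True
```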

Reading from a file:

Besides parsing a string directly, lxml can also read content from a file. Create a file named hello.html:

<!--hello.html-->
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>

Then read the file with the etree.parse() method.

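# lxml_parse.py
from lxml import etree

# Read the external file hello.html. etree.parse() uses the XML parser by
# default, so pass an HTMLParser to get the same HTML handling as etree.HTML().
html = etree.parse('./hello.html', etree.HTMLParser())
result = etree.tostring(html, pretty_print=True)

print(result.decode('utf-8'))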

The output has the same structure as before:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>
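One point worth illustrating: etree.parse() uses the strict XML parser unless told otherwise, so real-world HTML that is not well-formed XML may be rejected. The sketch below, assuming lxml is installed, writes a deliberately broken snippet to a temporary file (the file name and contents are made up for illustration) and compares the default parser with etree.HTMLParser().

```python
import os
import tempfile

from lxml import etree

# Neither <li> is closed, so this is not well-formed XML.
broken_html = '<ul><li>one<li>two</ul>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'broken.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(broken_html)

    # Default (XML) parser: strict, raises on mismatched tags.
    try:
        etree.parse(path)
        xml_parse_ok = True
    except etree.XMLSyntaxError:
        xml_parse_ok = False

    # HTML parser: tolerant, repairs the markup while parsing.
    tree = etree.parse(path, etree.HTMLParser())
    items = tree.xpath('//li/text()')

print(xml_parse_ok)  # False
print(items)         # ['one', 'two']
```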

 

XPath examples

1. Get all <li> tags

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html', etree.HTMLParser())
print(type(html))  # show the type returned by etree.parse()

result = html.xpath('//li')

print(result)  # print the list of <li> elements
print(len(result))
print(type(result))
print(type(result[0]))

Output:

<class 'lxml.etree._ElementTree'>
[<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]
5
<class 'list'>
<class 'lxml.etree._Element'>

2. Get the class attribute of every <li> tag

# xpath_li.py
from lxml import etree

html = etree.parse('hello.html', etree.HTMLParser())
result = html.xpath('//li/@class')

print(result)

Result:

['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']

 

3. Get the <a> tag under <li> whose href is link1.html

# xpath_li.py
from lxml import etree

html = etree.parse('hello.html', etree.HTMLParser())
result = html.xpath('//li/a[@href="link1.html"]')

print(result)

Result:

[<Element a at 0x10ffaae18>]

 

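4. Get all <span> tags under the <li> tags

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html', etree.HTMLParser())

# result = html.xpath('//li/span')
# Note: the expression above is wrong.
# "/" selects only direct children, and <span> is not a direct child of <li>,
# so use the double slash instead.

result = html.xpath('//li//span')

print(result)

Result: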

[<Element span at 0x10d698e18>]

 

5. Get every class attribute inside the <a> tags under <li>

from lxml import etree

html = etree.parse('hello.html', etree.HTMLParser())

result = html.xpath('//li/a//@class')

print(result)

Result:

['bold']

 

6. Get the href of the <a> in the last <li>

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html', etree.HTMLParser())

result = html.xpath('//li[last()]/a/@href')
# The predicate [last()] selects the last element

print(result)

Result:

['link5.html']

 

7. Get the content of the second-to-last element

# xpath_li.py
from lxml import etree

html = etree.parse('hello.html', etree.HTMLParser())

result = html.xpath('//li[last()-1]/a')

# The .text attribute holds the element's text content
print(result[0].text)

Result:

fourth item
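One caveat with .text: it only holds the text that appears directly inside an element, before its first child. In hello.html the third link wraps its text in a <span>, so .text on that <a> is None. The sketch below, assuming lxml is installed, uses a two-item excerpt of hello.html (names are illustrative) and shows the XPath string() function as one way to collect nested text.

```python
from lxml import etree

# Two of the list items from hello.html; the second nests its text in <span>.
HELLO_EXCERPT = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
    </ul>
</div>
'''

root = etree.HTML(HELLO_EXCERPT)
links = root.xpath('//li/a')

# .text sees only text placed directly inside <a>.
direct_text = [a.text for a in links]

# string(.) concatenates all text inside the element, including descendants.
full_text = [a.xpath('string(.)') for a in links]

# .get() reads an attribute value from an element.
hrefs = [a.get('href') for a in links]

print(direct_text)  # ['first item', None]
print(full_text)    # ['first item', 'third item']
print(hrefs)        # ['link1.html', 'link3.html']
```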

 

8. Get the tag name of the element whose class is bold

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html', etree.HTMLParser())

result = html.xpath('//*[@class="bold"]')

# The .tag attribute holds the element's tag name
print(result[0].tag)

Result:

span

 

Origin www.cnblogs.com/moying-wq/p/11569986.html