Python Web Scraping from Beginner to Advanced (X)

In the previous article we introduced Python regular expressions and the re module, and used them in a small crawler that scraped jokes from Qiushibaike ("Embarrassing Things Encyclopedia") and stored them locally. In this chapter we look at another way to extract data from crawled pages: XPath.

When we scraped Qiushibaike earlier, handling the HTML document with regular expressions felt tiring, and unless you are very familiar with regular expressions it is easy to get them wrong. Is there a more convenient way? Yes: we can first parse the HTML into an XML-style document tree and then locate HTML nodes or elements with XPath.

What is XML

  • XML stands for eXtensible Markup Language
  • XML is a markup language, much like HTML
  • XML was designed to transmit data, not to display data
  • XML tags are not predefined; you define your own tags
  • XML is designed to be self-descriptive
  • XML is a W3C Recommendation

The difference between XML and HTML

Format    Description    Design goal
XML    eXtensible Markup Language (可扩展标记语言)    Designed to transmit and store data; the focus is the data content.
HTML    HyperText Markup Language (超文本标记语言)    Designed to display data; the focus is how the data looks.
HTML DOM    Document Object Model for HTML (文档对象模型)    Through the HTML DOM you can access every HTML element, together with its text and attributes; elements can be modified and deleted, and new ones can be created.
XML document example
<?xml version="1.0" encoding="utf-8"?>
<bookstore>
  <book category="cooking">
    <title lang="en">this is title</title>
    <content>hello world</content>
  </book>
</bookstore>
HTML DOM model example

The HTML DOM defines a standard way to access and manipulate HTML documents, presenting an HTML document as a tree structure.

What is XPath?

XPath (XML Path Language) is a language for finding information in an XML document; it can be used to traverse the elements and attributes of an XML document.

XPath Development Tools

  1. Open-source XPath expression editor: XMLQuire (for XML files)
  2. Chrome extension: XPath Helper
  3. Firefox extension: XPath Checker

Selecting nodes

XPath uses path expressions to select a node or a set of nodes in an XML document. These path expressions are very similar to the path expressions used in an ordinary computer file system.

The following lists the most common path expressions:

Expression    Description
nodename    Selects all child nodes of the named node.
/    Selects from the root node.
//    Selects matching nodes anywhere in the document, regardless of their position.
.    Selects the current node.
..    Selects the parent of the current node.
@    Selects attributes.

 

In the table below we list some path expressions and their results (a runnable sketch follows the table):


Path expression    Result
bookstore    Selects all child nodes of the bookstore element.
/bookstore    Selects the root element bookstore. Note: if a path starts with a forward slash (/), it always represents an absolute path to an element.
bookstore/book    Selects all book elements that are children of bookstore.
//book    Selects all book elements, no matter where they are in the document.
bookstore//book    Selects all book elements that are descendants of the bookstore element, no matter where they sit below bookstore.
//@lang    Selects all attributes named lang.
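These expressions can be tried directly from Python. Below is a minimal sketch (it uses the lxml library, which is introduced later in this article) that runs a few of the expressions above against the sample bookstore document:

from lxml import etree

xml = '''<bookstore>
  <book category="cooking">
    <title lang="en">this is title</title>
    <content>hello world</content>
  </book>
</bookstore>'''

# Parse the XML string into an element tree
root = etree.fromstring(xml)

print(root.xpath('/bookstore/book'))  # absolute path from the root: the single book element
print(root.xpath('//book'))           # every book element, wherever it sits in the document
print(root.xpath('//@lang'))          # every attribute named lang -> ['en']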

Predicates

Predicates are used to find a specific node, or a node that contains a specific value; they are written inside square brackets.

In the table below we list some path expressions with predicates, and their results (a runnable sketch follows the table):

 

Path expression    Result
/bookstore/book[1]    Selects the first book element that is a child of bookstore.
/bookstore/book[last()]    Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]    Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]    Selects the first two book elements that are children of bookstore.
//title[@lang]    Selects all title elements that have an attribute named lang.
//title[@lang='eng']    Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]    Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title    Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00.
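As a quick illustration, here is a minimal lxml sketch (the two-book document is made up for the demonstration) showing how a few of these predicates evaluate:

from lxml import etree

xml = '''<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>'''

root = etree.fromstring(xml)

print(root.xpath('/bookstore/book[1]/title/text()'))            # first book     -> ['Harry Potter']
print(root.xpath('/bookstore/book[last()]/title/text()'))       # last book      -> ['Learning XML']
print(root.xpath('/bookstore/book[price>35.00]/title/text()'))  # price above 35 -> ['Learning XML']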

 

Selecting unknown nodes

XPath wildcards can be used to select unknown XML elements.

 

Wildcard    Description
*    Matches any element node.
@*    Matches any attribute node.
node()    Matches a node of any kind.

In the table below we list some path expressions and their results (a short sketch follows the table):

 

Path expression    Result
/bookstore/*    Selects all child elements of the bookstore element.
//*    Selects all elements in the document.
//title[@*]    Selects all title elements that have at least one attribute.
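A minimal lxml sketch of the wildcard selections above, run against a small sample document:

from lxml import etree

xml = '''<bookstore>
  <book category="cooking"><title lang="en">this is title</title></book>
</bookstore>'''

root = etree.fromstring(xml)

print(root.xpath('/bookstore/*'))  # all child elements of bookstore -> the book element
print(root.xpath('//*'))           # every element in the document
print(root.xpath('//title[@*]'))   # title elements that carry at least one attribute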

 

Selecting several paths

By using the "|" operator in a path expression, you can select several paths at once.

In the table below we list some path expressions and their results (a short sketch follows the table):

 

Path expression    Result
//book/title | //book/price    Selects all title and price elements of all book elements.
//title | //price    Selects all title and price elements in the document.
/bookstore/book/title | //price    Selects all title elements of the book elements of bookstore, plus all price elements in the document.
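A minimal lxml sketch of the "|" operator (the one-book document is made up for the demonstration):

from lxml import etree

xml = '''<bookstore>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>'''

root = etree.fromstring(xml)

# "|" merges the results of the two path expressions into a single node-set
result = root.xpath('//book/title | //book/price')
print([el.tag for el in result])  # ['title', 'price']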

XPath operators

The operators that can be used in XPath expressions are listed below (a short sketch of a few of them follows):

Operator    Description    Example
|    Computes two node-sets    //book | //cd
+    Addition    6 + 4
-    Subtraction    6 - 4
*    Multiplication    6 * 4
div    Division    8 div 4
=    Equal    price = 9.80
!=    Not equal    price != 9.80
<    Less than    price < 9.80
<=    Less than or equal to    price <= 9.80
>    Greater than    price > 9.80
>=    Greater than or equal to    price >= 9.80
or    Or    price = 9.80 or price = 9.70
and    And    price > 9.00 and price < 9.90
mod    Modulus (division remainder)    5 mod 2
That covers the core XPath syntax. To use it for scraping in Python, the HTML we fetch first has to be parsed into an XML-style document tree.
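Before moving on to lxml, here is a minimal sketch showing a couple of these operators in use (the sample document is made up for the demonstration):

from lxml import etree

xml = '''<bookstore>
  <book category="cooking"><price>30.00</price></book>
  <book category="web"><price>49.99</price></book>
</bookstore>'''

root = etree.fromstring(xml)

# comparison and boolean operators inside a predicate
print(root.xpath('//book[price>35.00 and @category="web"]'))  # only the second book matches
# "div" and "mod" are XPath's division and remainder operators
print(root.xpath('10 div 4'))  # 2.5
print(root.xpath('10 mod 4'))  # 2.0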

The lxml library

lxml is an HTML/XML parser; its main job is parsing and extracting HTML/XML data.

Like the re module, lxml is implemented in C, which makes it a high-performance Python HTML/XML parser. We can use the XPath syntax we just learned to quickly locate specific elements and node information.

Official lxml documentation: http://lxml.de/index.html

lxml depends on C libraries. It can be installed with pip: pip install lxml (or installed from a wheel).
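After installing, a quick way to confirm that the import works:

from lxml import etree

# If the import succeeds, lxml is installed; these constants report the bundled library versions
print(etree.LXML_VERSION)
print(etree.LIBXML_VERSION)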

Let's use it to parse some HTML. A simple example:

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a> # note: the closing </li> tag is deliberately missing here
     </ul>
 </div>
'''

# Use etree.HTML to parse the string into an HTML document
html = etree.HTML(text)

# Serialize the HTML document back into a string
# html = etree.tostring(html).decode("utf8")  # does not render Chinese characters correctly
html = etree.tostring(html, encoding="utf-8", pretty_print=True, method="html").decode("utf-8")  # renders Chinese characters correctly

print(html)

The output is as follows:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0">
<a href="link5.html">fifth item</a> # note: the closing </li> tag is deliberately missing here
     </li>
</ul>
 </div>
</body></html>

lxml automatically fixes up broken HTML: in this example it not only closed the li tag, it also added the missing body and html tags.

Reading from a file:

Besides parsing strings directly, lxml can also read content from a file. Let's create a new index.html file:

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

Then use the etree.parse() method to read the file.

from lxml import etree

# Read the external file index.html.
# Passing an HTMLParser tells lxml to repair whatever the HTML file is missing, such as the DOCTYPE declaration.
html = etree.parse('./index.html', etree.HTMLParser())
html = etree.tostring(html, encoding="utf-8", pretty_print=True, method="html").decode("utf-8")

print(html)

The output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div></body></html>

 

Next, let's put XPath through its paces with some examples.

1. Get all <li> tags

from lxml import etree

html = etree.parse('./index.html', etree.HTMLParser())
print(type(html))  # <class 'lxml.etree._ElementTree'>

result = html.xpath('//li')

print(result)  # [<Element li at 0x109c66248>, <Element li at 0x109c66348>, <Element li at 0x109c66388>, <Element li at 0x109c663c8>, <Element li at 0x109c66408>]
print(len(result))  # 5
print(type(result))  # <class 'list'>
print(type(result[0]))  # <class 'lxml.etree._Element'>

2. Get the class attribute of every <li> tag

from lxml import etree

html = etree.parse('./index.html', etree.HTMLParser())
result = html.xpath('//li/@class')

print(result)  # ['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']

3. Get the <a> tags under the <li> tags whose href is link1.html

from lxml import etree

html = etree.parse('./index.html', etree.HTMLParser())
result = html.xpath('//li/a[@href="link1.html"]')

print(result)  # [<Element a at 0x10b324288>]

4. Get all <span> tags under the <li> tags

from lxml import etree

html = etree.parse('./index.html', etree.HTMLParser())
# result = html.xpath('//li/span')
# Note: the line above is wrong. A single slash selects direct children only, and <span> is not a direct child of <li>, so a double slash is needed.

result = html.xpath('//li//span')

print(result)  # [<Element span at 0x10a59b308>]

5. Get all class attributes inside the <a> tags under the <li> tags

from lxml import etree

html = etree.parse('./index.html', etree.HTMLParser())
result = html.xpath('//li/a//@class')

print(result)  # ['bold']

6. Get the href of the <a> in the last <li>

from lxml import etree

html = etree.parse('./index.html', etree.HTMLParser())
result = html.xpath('//li[last()]/a/@href')
# The predicate [last()] selects the last matching element

print(result)  # ['link5.html']

7. Get the content of the second-to-last element

from lxml import etree

html = etree.parse('./index.html', etree.HTMLParser())
result = html.xpath('//li[last()-1]/a')

# The .text attribute gives the element's text content
print(result[0].text)  # fourth item
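As a small variation on this example, the text can also be selected inside the XPath expression itself with text(), so no .text access is needed afterwards:

from lxml import etree

html = etree.parse('./index.html', etree.HTMLParser())
# text() selects the text node directly
result = html.xpath('//li[last()-1]/a/text()')

print(result)  # ['fourth item']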

8. Get the tag name of the element whose class value is bold

from lxml import etree

html = etree.parse('./index.html', etree.HTMLParser())
result = html.xpath('//*[@class="bold"]')

# The .tag attribute gives the tag name
print(result[0].tag)  # span
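As a small addition to this example, an element's attributes can also be read from the matched node itself, for instance with .get() or .attrib:

from lxml import etree

html = etree.parse('./index.html', etree.HTMLParser())
result = html.xpath('//*[@class="bold"]')

span = result[0]
print(span.get('class'))  # bold
print(span.attrib)        # {'class': 'bold'}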

For more on XPath, see: http://www.w3school.com.cn/xpath/index.asp

For more on the Python lxml library, see: http://lxml.de/
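Finally, to tie this back to scraping: the sketch below shows the typical pattern of fetching a page and extracting nodes with XPath. The URL, the User-Agent header, and the expression are placeholders, and it assumes the third-party requests library; adapt all of them to the site you actually want to scrape.

import requests
from lxml import etree

url = 'https://example.com/'             # placeholder URL: replace with the page you actually want to scrape
headers = {'User-Agent': 'Mozilla/5.0'}  # many sites reject requests that carry no User-Agent

response = requests.get(url, headers=headers)
html = etree.HTML(response.text)         # parse the fetched HTML into an element tree

# placeholder expression: collect the text of every <a> tag on the page
print(html.xpath('//a/text()'))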

Original post: www.cnblogs.com/weijiutao/p/10879871.html