Some students say: "I'm not good with regular expressions, and processing HTML documents with them is exhausting. Is there another way?"
There is! It's XPath. We can first convert the HTML document into an XML document, and then use XPath to find HTML nodes or elements.
What is XML
- XML stands for Extensible Markup Language.
- XML is a markup language, much like HTML.
- XML is designed to transport data, not to display it.
- XML tags are not predefined; you define your own tags.
- XML is designed to be self-descriptive.
- XML is a W3C Recommendation.
W3School reference: http://www.w3school.com.cn/xml/index.asp
The difference between XML and HTML
Data format | Description | Design goal |
---|---|---|
XML | Extensible Markup Language | Designed to transport and store data; the focus is on what the data is. |
HTML | HyperText Markup Language | Designed to display data; the focus is on how the data looks. |
HTML DOM | Document Object Model for HTML | Through the HTML DOM you can access every HTML element, together with the text and attributes it contains. Content can be modified or deleted, and new elements can be created. |
XML document example

    <?xml version="1.0" encoding="utf-8"?>
    <bookstore>
      <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
      </book>
      <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
      </book>
      <book category="web">
        <title lang="en">XQuery Kick Start</title>
        <author>James McGovern</author>
        <author>Per Bothner</author>
        <author>Kurt Cagle</author>
        <author>James Linn</author>
        <author>Vaidyanathan Nagarajan</author>
        <year>2003</year>
        <price>49.99</price>
      </book>
      <book category="web" cover="paperback">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
      </book>
    </bookstore>

HTML DOM model example
The HTML DOM defines a standard way to access and manipulate HTML documents. It represents an HTML document as a tree structure.
XML node relationships
1. Parent
Every element and attribute has one parent.
In the following simple XML example, the book element is the parent of the title, author, year, and price elements:

    <?xml version="1.0" encoding="utf-8"?>
    <book>
      <title>Harry Potter</title>
      <author>J K. Rowling</author>
      <year>2005</year>
      <price>29.99</price>
    </book>

2. Children
An element node may have zero, one, or more children.
In the following example, the title, author, year, and price elements are all children of the book element:

    <?xml version="1.0" encoding="utf-8"?>
    <book>
      <title>Harry Potter</title>
      <author>J K. Rowling</author>
      <year>2005</year>
      <price>29.99</price>
    </book>

3. Siblings
Siblings are nodes that have the same parent.
In the following example, the title, author, year, and price elements are all siblings:

    <?xml version="1.0" encoding="utf-8"?>
    <book>
      <title>Harry Potter</title>
      <author>J K. Rowling</author>
      <year>2005</year>
      <price>29.99</price>
    </book>

4. Ancestors
A node's ancestors are its parent, its parent's parent, and so on.
In the following example, the ancestors of the title element are the book element and the bookstore element:

    <?xml version="1.0" encoding="utf-8"?>
    <bookstore>
      <book>
        <title>Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
      </book>
    </bookstore>

5. Descendants
A node's descendants are its children, its children's children, and so on.
In the following example, the descendants of the bookstore element are the book, title, author, year, and price elements:

    <?xml version="1.0" encoding="utf-8"?>
    <bookstore>
      <book>
        <title>Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
      </book>
    </bookstore>

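The five relationships can also be inspected from Python. The following is a minimal sketch, assuming the lxml library (covered later in this article) is installed; the variable names are illustrative only, and the XML declaration is left out so the document can be passed as a plain string.

```python
from lxml import etree

# Same shape as the bookstore example above (XML declaration omitted).
xml = """
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
"""

root = etree.fromstring(xml)      # the <bookstore> element
title = root.find("book/title")   # the <title> element
book = title.getparent()          # parent of <title>

parent = book.tag
children = [child.tag for child in book]                # children of <book>
siblings = [sib.tag for sib in title.itersiblings()]    # siblings that follow <title>
ancestors = [anc.tag for anc in title.iterancestors()]  # nearest ancestor first
descendants = [d.tag for d in root.iterdescendants()]   # in document order

print(parent)       # book
print(children)     # ['title', 'author', 'year', 'price']
print(siblings)     # ['author', 'year', 'price']
print(ancestors)    # ['book', 'bookstore']
print(descendants)  # ['book', 'title', 'author', 'year', 'price']
```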
What is XPath?
XPath (XML Path Language) is a language for finding information in XML documents. It can be used to navigate through the elements and attributes of an XML document.
W3School reference: http://www.w3school.com.cn/xpath/index.asp
XPath Development Tools
- Open-source XPath expression editor: XMLQuire (works with XML files)
- Chrome extension: XPath Helper
- Firefox add-on: XPath Checker
Selecting nodes
XPath uses path expressions to select nodes or node sets in an XML document. These path expressions look very much like the paths used in a regular computer file system.
The following table lists the most common path expressions:
Expression | Description |
---|---|
nodename | Selects all child elements of the current node that are named nodename. |
/ | Selects from the root node. |
// | Selects nodes in the document that match the selection, starting from the current node, no matter where they are. |
. | Selects the current node. |
.. | Selects the parent of the current node. |
@ | Selects attributes. |
The table below lists some path expressions and their results:
Path expression | Result |
---|---|
bookstore | Selects all bookstore elements that are children of the current node. |
/bookstore | Selects the root element bookstore. Note: a path that starts with a forward slash (/) always represents an absolute path to an element. |
bookstore/book | Selects all book elements that are children of bookstore. |
//book | Selects all book elements, no matter where they are in the document. |
bookstore//book | Selects all book elements that are descendants of the bookstore element, no matter where they are located below bookstore. |
//@lang | Selects all attributes named lang. |
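As a rough illustration of the path expressions above, the sketch below runs a few of them with lxml (covered later in this article). The XML is an abridged copy of the bookstore example, and the variable names are made up for this sketch.

```python
from lxml import etree

# Abridged bookstore document (authors and years left out for brevity).
xml = """
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">XQuery Kick Start</title>
    <price>49.99</price>
  </book>
  <book category="web" cover="paperback">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""
tree = etree.fromstring(xml)

root_tags = [e.tag for e in tree.xpath("/bookstore")]   # absolute path to the root
child_books = tree.xpath("/bookstore/book")             # book children of bookstore
all_books = tree.xpath("//book")                        # book elements anywhere
langs = tree.xpath("//@lang")                           # every lang attribute value

first_book = all_books[0]
parent_tag = first_book.xpath("..")[0].tag              # ".." selects the parent
first_titles = [t.text for t in first_book.xpath("title")]  # relative to first_book

print(root_tags)         # ['bookstore']
print(len(child_books))  # 4
print(len(all_books))    # 4
print(langs)             # ['en', 'en', 'en', 'en']
print(parent_tag)        # bookstore
print(first_titles)      # ['Everyday Italian']
```

Note that a relative expression such as `title` is evaluated against the element on which `xpath()` is called, while expressions starting with `/` or `//` are evaluated against the whole document.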
Predicates
Predicates are used to find a specific node or a node that contains a specific value. They are always written inside square brackets.
The table below lists some path expressions with predicates and their results:
Path expression | Result |
---|---|
/bookstore/book[1] | Selects the first book element that is a child of the bookstore element. |
/bookstore/book[last()] | Selects the last book element that is a child of the bookstore element. |
/bookstore/book[last()-1] | Selects the second-to-last book element that is a child of the bookstore element. |
/bookstore/book[position()<3] | Selects the first two book elements that are children of the bookstore element. |
//title[@lang] | Selects all title elements that have an attribute named lang. |
//title[@lang="eng"] | Selects all title elements whose lang attribute has the value eng. |
/bookstore/book[price>35.00] | Selects all book elements of the bookstore element whose price element has a value greater than 35.00. |
/bookstore/book[price>35.00]/title | Selects all title elements of the book elements of the bookstore element whose price element has a value greater than 35.00. |
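The predicates above can be tried out with a short script. This is only a sketch, assuming lxml (covered later in this article) and an abridged copy of the bookstore example. Because the sample document uses lang="en", the attribute test below compares against en rather than eng.

```python
from lxml import etree

# Abridged bookstore document (authors and years left out for brevity).
xml = """
<bookstore>
  <book><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>
"""
tree = etree.fromstring(xml)

# XPath positions start at 1, not 0.
first = tree.xpath("/bookstore/book[1]/title/text()")
last = tree.xpath("/bookstore/book[last()]/title/text()")
penultimate = tree.xpath("/bookstore/book[last()-1]/title/text()")
first_two = tree.xpath("/bookstore/book[position()<3]/title/text()")

# Attribute tests: existence, then a specific value.
with_lang = tree.xpath("//title[@lang]")
english = tree.xpath('//title[@lang="en"]')

# Value test on a child element.
expensive = tree.xpath("/bookstore/book[price>35.00]/title/text()")

print(first)           # ['Everyday Italian']
print(last)            # ['Learning XML']
print(penultimate)     # ['XQuery Kick Start']
print(first_two)       # ['Everyday Italian', 'Harry Potter']
print(len(with_lang))  # 4
print(len(english))    # 4
print(expensive)       # ['XQuery Kick Start', 'Learning XML']
```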
Selecting unknown nodes
XPath wildcards can be used to select unknown XML elements.
Wildcard | Description |
---|---|
* | Matches any element node |
@* | Matches any attribute node |
node() | Matches any type of node |
The table below lists some path expressions and their results:
Path expression | Result |
---|---|
/bookstore/* | Selects all child elements of the bookstore element. |
//* | Selects all elements in the document. |
//title[@*] | Selects all title elements that have at least one attribute. |
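A small sketch of the wildcards in action, assuming lxml (covered later in this article). The document here is a hypothetical two-book variation of the bookstore example in which the second title has no attributes, so that `//title[@*]` has something to filter out.

```python
from lxml import etree

# Hypothetical variation of the bookstore example: the second title has no attributes.
xml = """
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="web" cover="paperback">
    <title>Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""
tree = etree.fromstring(xml)

child_tags = [e.tag for e in tree.xpath("/bookstore/*")]   # any child element
all_tags = [e.tag for e in tree.xpath("//*")]              # every element
titled = [t.text for t in tree.xpath("//title[@*]")]       # titles with any attribute
attrs = tree.xpath("/bookstore/book[2]/@*")                # any attribute of book 2
# node() also matches text nodes, including the whitespace between elements.
nodes = tree.xpath("/bookstore/book[1]/node()")

print(child_tags)  # ['book', 'book']
print(all_tags)    # ['bookstore', 'book', 'title', 'price', 'book', 'title', 'price']
print(titled)      # ['Everyday Italian']
print(attrs)       # ['web', 'paperback']
print(len(nodes))  # 5 (two elements plus three whitespace text nodes)
```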
Selecting several paths
By using the "|" operator in a path expression, you can select several paths at once.
Examples
The table below lists some path expressions and their results:
Path expression | Result |
---|---|
//book/title \| //book/price | Selects all title and price elements of all book elements. |
//title \| //price | Selects all title and price elements in the document. |
/bookstore/book/title \| //price | Selects all title elements of the book elements of the bookstore element, and all price elements in the document. |
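The "|" operator can be sketched as follows, assuming lxml (covered later in this article) and a two-book abridgement of the bookstore example; the variable names are illustrative.

```python
from lxml import etree

# Two-book abridgement of the bookstore example.
xml = """
<bookstore>
  <book><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>
"""
tree = etree.fromstring(xml)

# One query that returns both the title and the price elements.
mixed = tree.xpath("//book/title | //book/price")
tag_names = sorted({e.tag for e in mixed})

# The same idea applied to text nodes.
texts = sorted(tree.xpath("//title/text() | //price/text()"))

print(len(mixed))  # 4
print(tag_names)   # ['price', 'title']
print(texts)       # ['30.00', '39.95', 'Everyday Italian', 'Learning XML']
```

The results are sorted here only to make the printed output easy to compare; the union itself is a single node set, not two separate lists.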
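XPath operators
That covers the XPath syntax. To use it when scraping with Python, the HTML must first be converted to XML.
The lxml library
lxml is an HTML/XML parser. Its main job is parsing HTML/XML and extracting data from it.
Like the regular expression engine, lxml is implemented in C. It is a high-performance HTML/XML parser for Python, and we can use the XPath syntax covered above to quickly locate specific elements and node information.
Official lxml documentation: http://lxml.de/index.html
lxml depends on C libraries and can be installed with pip: pip install lxml (or installed from a wheel file).
Getting started
We can use it to parse HTML code. A simple example: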

    # -*- coding: utf-8 -*-
    # lxml_test.py

    # Use the etree module from lxml
    from lxml import etree

    # Note: the last <li> below is missing its closing </li> tag
    text = '''
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
    '''

    # Parse the string into an HTML document with etree.HTML
    html = etree.HTML(text)

    # Serialize the document back to markup (tostring returns bytes)
    result = etree.tostring(html)

    print(result.decode('utf-8'))

Output:

    <html><body>
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    </body></html>

lxml can automatically repair HTML. In this example it not only completed the unclosed li tag, but also added the body and html tags.
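The repaired tree can be queried with XPath right away. The following is a minimal sketch, assuming a shortened two-item version of the HTML above in which the second li is left unclosed; the variable names are illustrative.

```python
from lxml import etree

# Shortened HTML fragment; the second <li> has no closing tag.
text = (
    '<div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a>'
    '</ul></div>'
)

html = etree.HTML(text)  # returns the root <html> element

root_tag = html.tag
top_children = [child.tag for child in html]   # wrapper elements added by the parser
hrefs = html.xpath('//li/a/@href')             # XPath works on the repaired tree
serialized = etree.tostring(html).decode('utf-8')

print(root_tag)      # html
print(top_children)  # ['body']
print(hrefs)         # ['link1.html', 'link2.html']
print('second item</a></li>' in serialized)  # True: the missing </li> was added
```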
Reading from a file:
Besides parsing a string directly, lxml can also read content from a file. Create a file named hello.html:

    <!--hello.html-->
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>

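Then use the etree.parse() method to read the file.

    # lxml_parse.py
    from lxml import etree

    # Read the external file hello.html.
    # etree.parse() uses an XML parser by default, which works here because
    # hello.html is well-formed; pass etree.HTMLParser() as the second
    # argument for HTML that is not.
    html = etree.parse('./hello.html')
    result = etree.tostring(html, pretty_print=True)

    print(result.decode('utf-8'))

Output. The list is the same as before, but because the default XML parser is used here, no html or body tags are added:

    <!--hello.html-->
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
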
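One caveat when reading files: etree.parse() uses an XML parser unless told otherwise, so markup with unclosed tags is rejected. The sketch below is an illustration under stated assumptions: it uses an in-memory io.BytesIO object in place of a real file, and a made-up fragment with unclosed li tags, to show how passing etree.HTMLParser() makes the parse tolerant.

```python
import io
from lxml import etree

# Made-up fragment with unclosed <li> tags, standing in for a file on disk.
broken = b"<div><ul><li>first item<li>second item</ul></div>"

# The default XML parser rejects markup that is not well-formed.
try:
    etree.parse(io.BytesIO(broken))
    xml_parse_ok = True
except etree.XMLSyntaxError:
    xml_parse_ok = False

# An HTML parser repairs the markup instead of failing.
tree = etree.parse(io.BytesIO(broken), etree.HTMLParser())
root_tag = tree.getroot().tag
items = tree.xpath("//li/text()")

print(xml_parse_ok)  # False
print(root_tag)      # html
print(items)         # ['first item', 'second item']
```

With a real file, the same call would look like etree.parse('hello.html', etree.HTMLParser()).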
XPath examples
1. Get all <li> tags

    # xpath_li.py
    from lxml import etree

    html = etree.parse('hello.html')
    print(type(html))  # type returned by etree.parse()

    result = html.xpath('//li')

    print(result)  # list of <li> elements
    print(len(result))
    print(type(result))
    print(type(result[0]))

Output (the memory addresses will differ from run to run):

    <class 'lxml.etree._ElementTree'>
    [<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]
    5
    <class 'list'>
    <class 'lxml.etree._Element'>

2. Get the class attribute of every <li> tag

    # xpath_li.py
    from lxml import etree

    html = etree.parse('hello.html')
    result = html.xpath('//li/@class')

    print(result)

Output:

    ['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']

3. Get the <a> tag under an <li> tag whose href is link1.html

    # xpath_li.py
    from lxml import etree

    html = etree.parse('hello.html')
    result = html.xpath('//li/a[@href="link1.html"]')

    print(result)

Output:

    [<Element a at 0x10ffaae18>]

4. Get all <span> tags under <li> tags

    # xpath_li.py
    from lxml import etree

    html = etree.parse('hello.html')

    # result = html.xpath('//li/span')
    # Note: the expression above is wrong. A single slash selects direct
    # children, and <span> is not a direct child of <li>, so use a double slash.
    result = html.xpath('//li//span')

    print(result)

Output:

    [<Element span at 0x10d698e18>]

5. Get every class attribute inside the <a> tags under <li> tags

    from lxml import etree

    html = etree.parse('hello.html')
    result = html.xpath('//li/a//@class')

    print(result)

Output:

    ['bold']

6. Get the href of the <a> in the last <li>

    # xpath_li.py
    from lxml import etree

    html = etree.parse('hello.html')

    # The predicate [last()] selects the last element
    result = html.xpath('//li[last()]/a/@href')

    print(result)

Output:

    ['link5.html']

7. Get the content of the second-to-last element

    # xpath_li.py
    from lxml import etree

    html = etree.parse('hello.html')
    result = html.xpath('//li[last()-1]/a')

    # The text attribute returns the element's text content
    print(result[0].text)

Output:

    fourth item

8. Get the tag name of the element whose class is bold

    # xpath_li.py
    from lxml import etree

    html = etree.parse('hello.html')
    result = html.xpath('//*[@class="bold"]')

    # The tag attribute returns the tag name
    print(result[0].tag)

Output:

    span
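The individual queries above can be combined. The sketch below is one possible way to do it, not the only one: it loops over the li elements and runs relative XPath expressions on each to collect (text, href) pairs. It assumes a shortened, inline copy of hello.html parsed with etree.HTML, and the variable names are illustrative.

```python
from lxml import etree

# Shortened inline copy of hello.html (two of the five items).
text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
    </ul>
</div>
'''
html = etree.HTML(text)

links = []
for li in html.xpath('//li'):
    # A leading "." makes the expression relative to the current <li>.
    href = li.xpath('./a/@href')[0]
    # string() concatenates all text inside <a>, including text in nested tags.
    label = li.xpath('string(./a)')
    links.append((str(label), str(href)))

# The text attribute only covers text directly inside <a>, before any child tag.
direct_text = html.xpath('//li[2]/a')[0].text

print(links)        # [('first item', 'link1.html'), ('third item', 'link3.html')]
print(direct_text)  # None
```

The last line shows why string() is used for the label: in the second item the text lives inside the nested span, so the text attribute of the a element is None.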