The Road to Data - Python Crawler - XPath

Introduction to XML

What is XML?

  • XML stands for EXtensible Markup Language.
  • XML is a markup language, much like HTML.
  • XML is designed to transmit data rather than to display it.
  • XML tags are not predefined; you define your own tags.
  • XML is designed to be self-descriptive.
  • XML is a W3C Recommendation.

W3School documentation: http://www.w3school.com.cn/xm ...

Differences between XML and HTML

Different syntax requirements

  • HTML tag names are case-insensitive; XML is strictly case-sensitive.
  • HTML is sometimes lenient: where the context makes the end of a paragraph or list item clear, the end tag </p> or </li> may be omitted. XML is a strict tree structure, and end tags must never be omitted.
  • In XML, an element that has no matching end tag must be closed with a / character (for example <br/>), so the parser knows not to look for an end tag.
  • In XML, attribute values must be enclosed in quotation marks. In HTML, the quotation marks are optional.
  • In HTML, an attribute name may appear without a value. In XML, every attribute must have a value.
  • In an XML document, the parser does not automatically remove whitespace; HTML, by contrast, collapses runs of whitespace.
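
The strictness rules above can be checked with a quick sketch. It uses Python's standard-library xml.etree.ElementTree parser rather than lxml (which this article installs later), and the helper name is_well_formed and the sample strings are made up for illustration:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if a strict XML parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Illustrative snippets, one per rule in the list above.
samples = {
    "well-formed": '<ul><li class="a">one</li><br/></ul>',
    "missing end tag": '<ul><li class="a">one</ul>',
    "case mismatch": '<ul><li class="a">one</LI></ul>',
    "unquoted attribute": '<ul><li class=a>one</li></ul>',
    "attribute without value": '<ul><li disabled>one</li></ul>',
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")
```

Only the first snippet should be accepted; an HTML parser would typically tolerate all five.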

Different design goals

  • XML is designed to transmit and store data; its focus is the content of the data.
  • HTML is designed to display data; its focus is how the data looks.

XML node relationships

1. Parent
Every element and attribute has exactly one parent.
In the following simple XML example, the book element is the parent of the title, author, year, and price elements:

<?xml version="1.0" encoding="utf-8"?>
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

2. Children
An element node may have zero, one, or more children.
In the following example, the title, author, year, and price elements are all children of the book element:

<?xml version="1.0" encoding="utf-8"?>
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

3. Siblings
Nodes that have the same parent.
In the following example, the title, author, year, and price elements are all siblings:

<?xml version="1.0" encoding="utf-8"?>
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

4. Ancestors
A node's parent, the parent's parent, and so on.
In the following example, the ancestors of the title element are the book element and the bookstore element:

<?xml version="1.0" encoding="utf-8"?>
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>

5. Descendants
A node's children, the children's children, and so on.
In the following example, the descendants of bookstore are the book, title, author, year, and price elements:

<?xml version="1.0" encoding="utf-8"?>
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
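
The five relationships can also be read off programmatically. The following is a rough sketch using the standard-library xml.etree.ElementTree module (lxml is introduced later in this article); the parent_of map is a helper built here because ElementTree elements do not keep a link to their parent:

```python
import xml.etree.ElementTree as ET

xml_text = """<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>"""

root = ET.fromstring(xml_text)   # the bookstore element
book = root.find("book")
title = book.find("title")

# Build a child -> parent map, since ElementTree elements have no parent link.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Parent and children
parent_tag = parent_of[title].tag
children = [child.tag for child in book]

# Siblings: the other children of the same parent
siblings = [el.tag for el in parent_of[title] if el is not title]

# Ancestors: follow parent links up to the root
ancestors = []
node = title
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

# Descendants: every element below bookstore, in document order
descendants = [el.tag for el in root.iter() if el is not root]

print(parent_tag, children, siblings, ancestors, descendants, sep="\n")
```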

XPath

What is XPath?

XPath (XML Path Language) is a language for finding information in an XML document; it can be used to traverse the elements and attributes of an XML document. It was originally designed for searching XML documents, but it works equally well for searching HTML documents.
A crawler can therefore use XPath to extract the information it needs.

W3School documentation: http://www.w3school.com.cn/xp ...

XPath development tools

  1. Open-source XPath expression editor: XMLQuire (works with XML files)
  2. Chrome plug-in: XPath Helper
  3. Firefox plug-in: XPath Checker

Using XPath

XPath uses path expressions to select a node or a set of nodes in an XML document. These path expressions look very much like the paths used in an ordinary computer file system.
1. Common XPath rules

Expression   Description
nodename     Selects the child elements of the current node that are named nodename
/            Selects from the root node when it starts a path; between steps, selects direct children
//           Selects descendants of the current node, at any depth
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
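
As a rough illustration of the rules above, the sketch below evaluates a few of them with lxml (installation is covered later in this article). The sample document, including its lang attribute, is a made-up example and not taken from the text:

```python
from lxml import etree

# Hypothetical sample document; the lang attribute is added for illustration.
root = etree.fromstring(
    '<bookstore><book><title lang="en">Harry Potter</title>'
    '<price>29.99</price></book></bookstore>'
)

# "//" searches descendants at any depth
title = root.xpath('//title')[0]

# "." is the current node, ".." is its parent
current_tag = title.xpath('.')[0].tag
parent_tag = title.xpath('..')[0].tag

# "@" selects an attribute of the current node
lang_values = title.xpath('@lang')

# "/" between steps walks to direct children
price_text = root.xpath('book/price')[0].text

print(current_tag, parent_tag, lang_values, price_text)
```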

2. XPath usage examples
Take the following XML document as an example:

<?xml version="1.0" encoding="utf-8"?>
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>

Path expression   Result
bookstore         Selects all bookstore elements that are children of the current node
/bookstore        Selects the root element bookstore. Note: a path that starts with a forward slash / is always an absolute path to an element
bookstore/book    Selects all book elements that are children of bookstore
//book            Selects all book elements, wherever they are in the document
bookstore//book   Selects all book elements that are descendants of the bookstore element, wherever they sit below it
//@lang           Selects all attributes named lang
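
The expressions in this table can be tried against the sample document with lxml. This is only a sketch: because the result of a relative path depends on the context node, the relative forms from the table are written here as absolute paths (or evaluated from the bookstore element), and the helper tags() is introduced just for printing:

```python
from lxml import etree

xml_bytes = b"""<?xml version="1.0" encoding="utf-8"?>
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>"""

root = etree.fromstring(xml_bytes)  # the bookstore element


def tags(nodes):
    """Return the tag names of the selected elements (helper for this sketch)."""
    return [node.tag for node in nodes]


root_match = tags(root.xpath('/bookstore'))          # absolute path to the root
child_books = tags(root.xpath('/bookstore/book'))    # book children of bookstore
all_books = tags(root.xpath('//book'))               # book elements anywhere
nested_books = tags(root.xpath('/bookstore//book'))  # book descendants of bookstore
relative_books = tags(root.xpath('book'))            # relative to the bookstore element
lang_attrs = root.xpath('//@lang')                   # this sample has no lang attribute

print(root_match, child_books, all_books, nested_books, relative_books, lang_attrs)
```

Since the sample document contains no lang attribute, the last expression returns an empty list here.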

lxml library

Installing the lxml library

1. Installing on Windows
Open a cmd command prompt and run:

pip3 install lxml

2. Installing on Ubuntu 16.04
Press Ctrl + Alt + T to open a terminal and run:

sudo apt-get install -y build-essential libssl-dev libffi-dev libxml2-dev libxslt1-dev zlib1g-dev

After the dependencies are installed, install lxml with pip:

sudo pip3 install lxml

3. Verifying the installation
Import the lxml module; if no error is raised, the installation succeeded.

$ python3
>>> import lxml
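
Beyond a bare import, a slightly more informative check is to print the versions lxml reports. A small sketch, assuming lxml imported successfully:

```python
from lxml import etree

# LXML_VERSION and LIBXML_VERSION are version tuples exposed by lxml.etree.
lxml_version = ".".join(str(part) for part in etree.LXML_VERSION)
libxml2_version = ".".join(str(part) for part in etree.LIBXML_VERSION)

print("lxml version:   ", lxml_version)
print("libxml2 version:", libxml2_version)
```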

etree module

Basic usage
The file is named lxml_test.py:

# Use the etree module from lxml
from lxml import etree

# Note: the last <li> below is deliberately missing its closing </li> tag
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

# etree.HTML() parses the string as an HTML document and repairs broken markup
html = etree.HTML(text)

# Serialize the HTML document to a byte string
ret = etree.tostring(html)

# tostring() returns bytes, so decode() it into a str before printing
print(ret.decode('utf-8'))

Output:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
</body></html>

The etree module can automatically repair HTML: in this example it not only completed the missing </li> tag but also added the html and body tags.

Reading from a file
Besides parsing a string directly, lxml can also read content from a file. Here I save the output of running lxml_test.py above as test.html:
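
Once the string has been parsed, the element returned by etree.HTML() can be queried with the XPath expressions introduced earlier. A minimal sketch, reusing the same sample markup (the variable names are only illustrative):

```python
from lxml import etree

# Same sample as above; the last <li> is deliberately left unclosed.
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

html = etree.HTML(text)  # returns the root <html> element

# The parser wraps the fragment in html and body elements
root_tag = html.tag
has_body = len(html.xpath('/html/body')) == 1

# "//" finds li elements anywhere; "@" picks out attribute values
li_count = len(html.xpath('//li'))
classes = html.xpath('//li/@class')
hrefs = html.xpath('//li/a/@href')

# Element text is available through the .text attribute
link_texts = [a.text for a in html.xpath('//li/a')]

print(root_tag, has_body, li_count)
print(classes)
print(hrefs)
print(link_texts)
```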

python3 lxml_test.py > test.html

The contents of the file, as shown by cat test.html:

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

Use the etree.parse() method to read the file.

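from lxml import etree

# Pass an HTML parser explicitly; HTMLParser lives in the etree module
html = etree.parse('./test.html', etree.HTMLParser())

# tostring() returns bytes, so decode it before printing
ret = etree.tostring(html)
print(ret.decode('utf-8'))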

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
</body></html>

The output contains an extra DOCTYPE declaration, but this does not affect the parsed result.
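
The file-based workflow can be sketched end to end as follows. To stay self-contained, the example writes a shortened, hypothetical stand-in for test.html into a temporary directory before parsing it; serializing the root element instead of the whole tree is one way to print the markup without the document-level DOCTYPE line:

```python
import os
import tempfile

from lxml import etree

# Hypothetical, shortened stand-in for the test.html file described above.
html_text = '''<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
     </ul>
 </div>
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html_text)
    # etree.parse() returns an ElementTree (a whole document), not an element
    tree = etree.parse(path, etree.HTMLParser())

# XPath works on the parsed tree just as it does on a parsed string
hrefs = tree.xpath('//li/a/@href')

# Serializing only the root element leaves out document-level items
root_only = etree.tostring(tree.getroot()).decode('utf-8')

print(hrefs)
print(root_only)
```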

Origin www.cnblogs.com/Iceredtea/p/11291973.html