Introduction to XPath syntax in Python

This section of the Python video tutorial column introduces XPath syntax as used from Python.

1. Introduction to XML
(1) What is XML
XML stands for Extensible Markup Language (EXtensible Markup Language).
XML is a markup language, very similar to HTML.
XML is designed to transport data, not to display it.
XML tags are not predefined; we define them ourselves.
XML is designed to be self-describing.
XML is a W3C Recommendation.
W3School official documentation: http://www.w3school.com.cn/xml/index.asp

(2) The difference between XML and HTML
Both are used to structure data, and they look broadly similar, but they differ clearly in purpose.

Data format    Description                       Design goal
XML            Extensible Markup Language        Designed to transport and store data; the focus is on the content of the data.
HTML           HyperText Markup Language         Designed to display data; the focus is on how to present the data well.
HTML DOM       Document Object Model for HTML    Through the HTML DOM you can access all HTML elements, together with the text and attributes they contain. Content can be modified and deleted, and new elements can be created.
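(3) XML node relationships

<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.00</price>
  </book>
</bookstore>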

1. Parent
Every element and attribute has one parent. In the XML example above, the book element is the parent of the title, author, year, and price elements.

2. Children
An element node can have zero, one, or more children. In the example above, the title, author, year, and price elements are all children of the book element.

3. Siblings
Siblings are nodes that share the same parent. In the example above, the title, author, year, and price elements are all siblings.

4. Ancestors
A node's ancestors are its parent, its parent's parent, and so on. In the example above, the ancestors of the title element are the book element and the bookstore element.

5. Descendants
A node's descendants are its children, its children's children, and so on. In the example above, the descendants of bookstore are the book, title, author, year, and price elements.
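
The five relationships can also be inspected from Python. The following is a minimal sketch, assuming lxml is installed; the sample document is a hypothetical one modeled on the bookstore/book structure described above, and the variable names are illustrative only.

```python
from lxml import etree

# Hypothetical sample document modeled on the bookstore/book structure described above.
xml = """
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.00</price>
  </book>
</bookstore>
"""
root = etree.fromstring(xml.strip())
title = root.find("book/title")

parent = title.getparent().tag                          # parent of title
children = [child.tag for child in title.getparent()]   # children of book
siblings = [s.tag for s in title.itersiblings()]        # siblings that follow title
ancestors = [a.tag for a in title.iterancestors()]      # nearest ancestor first
descendants = [d.tag for d in root.iterdescendants()]   # document order

print(parent, children, siblings, ancestors, descendants, sep="\n")
```

Note that itersiblings() yields only the siblings that come after the node by default, so title itself is not in the list.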

2. XPath
XPath (XML Path Language) is a language for finding information in XML documents. It can be used to traverse elements and attributes in XML documents.

(1) Selecting nodes
XPath uses path expressions to select nodes or node sets in XML documents. These path expressions are very similar to those we see in regular computer file systems. The most commonly used path expressions are listed below:

Expression    Description
nodename      Selects the child elements named nodename of the current node.
/             Selects from the root node.
//            Selects matching nodes at any depth (anywhere in the document when the path starts with //), regardless of their location.
.             Selects the current node.
..            Selects the parent node of the current node.
@             Selects attributes.
In the table below, we have listed some path expressions and their results:

Path expression     Result
bookstore           Selects the bookstore elements that are children of the current node.
/bookstore          Selects the root element bookstore. A path that starts with / is an absolute path.
bookstore/book      Selects all book elements that are children of bookstore.
//book              Selects all book elements, regardless of their position in the document.
bookstore//book     Selects all book elements that are descendants of the bookstore element, regardless of where they sit under bookstore.
//@lang             Selects all attributes named lang.
text()              Selects the text between an element's tags.
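
To see these expressions in action, here is a minimal sketch using lxml. The two-book sample document is hypothetical, and so are the variable names. In lxml, a relative path is evaluated against the element that xpath() is called on, so the sketch uses absolute paths for the bookstore examples and a book element as the context for . and ..

```python
from lxml import etree

# Hypothetical two-book sample; the element and attribute names follow the tables above.
xml = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""
root = etree.fromstring(xml.strip())
first_book = root[0]

root_match = root.xpath("/bookstore")           # absolute path to the root element
direct_books = root.xpath("/bookstore/book")    # book children of bookstore
all_books = root.xpath("//book")                # book elements anywhere in the document
nested_books = root.xpath("/bookstore//book")   # book descendants of bookstore
langs = root.xpath("//@lang")                   # value of every lang attribute
titles = root.xpath("//title/text()")           # text inside each title
current = first_book.xpath(".")                 # the context node itself
parent = first_book.xpath("..")                 # parent of the context node

print(langs, titles)
```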
(2) Predicates
A predicate finds a specific node or a node that contains a specified value. Predicates are written inside square brackets. The following table lists some path expressions with predicates and their results:

Path expression                  Result
/bookstore/book[1]               Selects the first book element that is a child of bookstore.
/bookstore/book[last()]          Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]        Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<2]    Selects the first book element that is a child of bookstore (the only position below 2).
//title[@lang]                   Selects all title elements that have an attribute named lang.
//title[@lang='eng']             Selects all title elements whose lang attribute has the value eng.
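
The predicates in the table can be tried out with a short lxml sketch. The three-book document and the titles() helper below are hypothetical and exist only for illustration.

```python
from lxml import etree

# Hypothetical three-book sample used only for illustration.
xml = """
<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <book><title lang="fr">Le Petit Prince</title></book>
  <book><title>Untitled Draft</title></book>
</bookstore>
"""
root = etree.fromstring(xml.strip())


def titles(path):
    """Return the title text of every book element matched by the XPath expression."""
    return [book.findtext("title") for book in root.xpath(path)]


first = titles("/bookstore/book[1]")                     # XPath positions start at 1
last = titles("/bookstore/book[last()]")
second_to_last = titles("/bookstore/book[last()-1]")
before_second = titles("/bookstore/book[position()<2]")
with_lang = root.xpath("//title[@lang]/text()")          # titles that have a lang attribute
english = root.xpath("//title[@lang='eng']/text()")      # titles whose lang is eng

print(first, last, second_to_last, before_second, with_lang, english, sep="\n")
```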
(3) Selecting unknown nodes
XPath wildcards can be used to select unknown XML elements.

Wildcard    Description
*           Matches any element node.
@*          Matches any attribute node.

In the table below, we list some path expressions and the results of these expressions:

Path expression    Result
/bookstore/*       Selects all child elements of the bookstore element.
//*                Selects all elements in the document.
//title[@*]        Selects all title elements that have at least one attribute.
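
A small lxml sketch of the wildcard expressions follows. The sample document, including the magazine element, is hypothetical and was added only so that the wildcards have different element names to match.

```python
from lxml import etree

# Hypothetical sample; the magazine element exists only to give * something else to match.
xml = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <magazine>
    <title>Weekly</title>
  </magazine>
</bookstore>
"""
root = etree.fromstring(xml.strip())

children = [el.tag for el in root.xpath("/bookstore/*")]   # direct children, whatever their name
everything = [el.tag for el in root.xpath("//*")]          # every element, in document order
with_attrs = root.xpath("//title[@*]/text()")              # titles that carry any attribute

print(children, everything, with_attrs, sep="\n")
```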
(4) Selecting several paths
By using the "|" operator in the path expression, you can select several paths. In the table below, we list some path expressions and the results of these expressions:

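Path expression                 Result
//book/title | //book/price     Selects all title and price elements of all book elements.
//title | //price               Selects all title and price elements in the document.
//price                         Selects all price elements in the document.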
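
The "|" operator can be sketched with lxml as follows. The two-book document is a hypothetical sample. The combined result comes back in document order rather than grouped by operand.

```python
from lxml import etree

# Hypothetical two-book sample used only for illustration.
xml = """
<bookstore>
  <book>
    <title>Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title>Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""
root = etree.fromstring(xml.strip())

# Union of two element paths: title and price elements interleaved as they appear.
tags = [el.tag for el in root.xpath("//book/title | //book/price")]

# Union of two text() paths: the text of each title and each price.
texts = root.xpath("//title/text() | //price/text()")

print(tags, texts, sep="\n")
```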

3. The lxml module
(1) lxml introduction and installation
lxml is an HTML/XML parser whose main job is to parse HTML/XML and extract data from it. With the XPath syntax covered above, we can quickly locate specific elements and node information.
Installation: pip install lxml

(2) Getting started with lxml
1. Parse an HTML string

XML material: http://www.cnblogs.com/zhangboblogs/p/10114698.html
Summary: lxml automatically corrects HTML code. In the example, it not only completes the missing li tag but also adds the body and html tags.
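
The auto-correction can be illustrated with a short sketch. The HTML fragment below is a hypothetical one, not the material linked above: its last li is never closed and it has no html or body tag.

```python
from lxml import etree

# Hypothetical fragment: the last <li> is never closed, and there is no <html> or <body>.
text = """
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a>
  </ul>
</div>
"""
html = etree.HTML(text)                             # parse with lxml's lenient HTML parser
result = etree.tostring(html, encoding="unicode")   # encoding="unicode" returns str, not bytes
links = html.xpath("//li/a/text()")

print(result)
print(links)
```

Printing result should show the fragment wrapped in html and body tags, with the second li closed.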

2. Read an HTML file with lxml

XML material: http://www.cnblogs.com/zhangboblogs/p/10114698.htm
In addition to parsing strings directly, lxml can read content from files. We create a new hello.html file and then read it with the etree.parse() method.
Note: etree.parse() uses an XML parser by default, so the file content must be well-formed XML. If a closing tag is missing, the file cannot be parsed.
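
A minimal sketch of reading from a file follows. The file contents and the broken.html name are hypothetical, and the files are written to a temporary directory so the example is self-contained. It also shows one way to read a file that is not well-formed: pass an etree.HTMLParser() to etree.parse().

```python
import os
import tempfile

from lxml import etree

well_formed = "<div><ul><li>first</li><li>second</li></ul></div>"
broken = "<div><ul><li>first</li><li>second</ul></div>"  # the second <li> is never closed

with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "hello.html")
    bad_path = os.path.join(tmp, "broken.html")
    with open(good_path, "w", encoding="utf-8") as f:
        f.write(well_formed)
    with open(bad_path, "w", encoding="utf-8") as f:
        f.write(broken)

    # Default XML parser: works because the file is well-formed.
    tree = etree.parse(good_path)
    items = tree.xpath("//li/text()")

    # Default XML parser on the broken file: raises XMLSyntaxError.
    try:
        etree.parse(bad_path)
        strict_ok = True
    except etree.XMLSyntaxError:
        strict_ok = False

    # HTML parser: tolerates the missing closing tag.
    lenient = etree.parse(bad_path, etree.HTMLParser())
    recovered = lenient.xpath("//li/text()")

print(items, strict_ok, recovered)
```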
4. XPath node information analysis:

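# Install lxml: pip install lxml

# 1. Import etree: there are two ways
# Option 1: import it directly
from lxml import etree
# Note: with this form some IDEs report an error (a red squiggle under etree); it does not affect normal use

# Option 2:
# from lxml import html
# etree = html.etree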

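# The markup inside this sample string was lost when the page was extracted.
# The tags and attributes below are reconstructed from the outputs quoted later in this section.
# Assumed placeholders (not in the original): the two example.com src URLs and lang="eng".
str = '<bookstore>' \
            '<book>' \
                '<title lang="bng" src="https://example.com/assumed-1">Harry Potter</title>' \
                '<price>29.99</price>' \
            '</book>' \
            '<book>' \
                '<title lang="eng">Learning XML</title>' \
                '<price>39.95</price>' \
            '</book>' \
            '<book>' \
                '<title lang="cng">西游记</title>' \
                '<price>69.95</price>' \
            '</book>' \
            '<book>' \
                '<title src="https://example.com/assumed-2">水浒传</title>' \
                '<price>29.95</price>' \
            '</book>' \
            '<book>' \
                '<title lang="dng" class="t1" src="https://www.jd.com">三国演义</title>' \
                '<price>29.95</price>' \
            '</book>' \
        '</bookstore>'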

# 2. etree.HTML() converts the string into an HTML element object and automatically adds any missing elements
html = etree.HTML(str)  # an Element object
# print(html)

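# 3. Methods:
# 3.1 tostring()  View the converted content (returned as bytes)
# To view it as a string, decode it
# To display Chinese characters, serialize with encoding='utf-8' first, then decode
# content = etree.tostring(html,encoding='utf-8')
# print(content.decode())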

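# 3.2 xpath() method  Purpose: extract data from the page; the return value is a list
# xpath() is always called on the object produced by etree.HTML()

# How does xpath extract data from the page?
# Answer: with path expressions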

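# 3.2.1 There are two kinds of xpath paths:
# First kind: /  searches level by level; a leading / stands for the root path
# bookstore = html.xpath('/html/body/bookstore')
# print(bookstore)  # [<Element bookstore at 0x...>]

# Second kind: //  matches at any depth; the focus is on the element itself
# Example: find the bookstore tag
# bookstore = html.xpath('//bookstore')
# print(bookstore)  # [<Element bookstore at 0x...>]

# Combining the first and second kinds
# Example: find all book tags
# book = html.xpath('//bookstore/book')
# print(book)  # [<Element book at 0x...>, <Element book at 0x...>, <Element book at 0x...>, <Element book at 0x...>, <Element book at 0x...>]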

# 3.2.2 /text()  Get the content between tags
# Example: get the content of all title tags
# Steps:
# 1. Find all title tags
# 2. Get the content
# title = html.xpath('//book/title/text()')
# print(title)  # ['Harry Potter', 'Learning XML', '西游记', '水浒传', '三国演义']

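# 3.3 Predicates  Use []; a predicate can be read as a condition
# 3.3.1 [n] selects the n-th element; n is a number, n >= 1
# Example: get the second title tag
# title = html.xpath('//book[2]/title/text()')
# title1 = html.xpath('//title[2]/text()')
# print(title)  # ['Learning XML']
# print(title1)  # []  (each book has only one title, so no title is second among its siblings)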

# last()  selects the last one
# Likewise: last()-1  selects the second-to-last one
# Example: get the content of the title tag of the last book
# title = html.xpath('//book[last()]/title/text()')
# title1 = html.xpath('//book[last()-1]/title/text()')
# print(title)  # ['三国演义']
# print(title1)  # ['水浒传']

# 3.3.2 position()  Position or range; supports > / < / = / >= / <= / !=
# Example: get the content of the title tags of the last two books
# Steps:
# 1. Select the last two books
# 2. Get the content
# title = html.xpath('//book[position()>3]/title/text()')
# print(title)  # ['水浒传', '三国演义']
# title = html.xpath('//book[position()>last()-2]/title/text()')
# print(title)  # ['水浒传', '三国演义']

# 3.3.3 Working with attributes: @attribute_name

# Example: get the content of the title tag whose lang attribute is cng
# title = html.xpath('//book/title[@lang="cng"]/text()')
# print(title)  # ['西游记']

# Example: get the content of title tags that have a src attribute
# title = html.xpath('//book/title[@src]/text()')
# print(title)  # ['Harry Potter', '水浒传', '三国演义']

# Example: get the content of title tags that have any attribute
# title = html.xpath('//book/title[@*]/text()')
# print(title)  # ['Harry Potter', 'Learning XML', '西游记', '水浒传', '三国演义']

# Example: get the value of the src attribute of the last book's title tag
# title = html.xpath('//book[last()]/title/@src')
# print(title)  # ['https://www.jd.com']

# Example: get the content of every tag that has a src attribute
# node = html.xpath('//*[@src]/text()')
# print(node)  # ['Harry Potter', '水浒传', '三国演义']

# 3.4 and  Logical AND; joins conditions inside a predicate
# Example: get the content of the title tag with lang="dng" and class="t1"
# title = html.xpath('//book/title[@lang="dng" and @class="t1"]/text()')
# title1 = html.xpath('//book/title[@lang="dng"][@class="t1"]/text()')
# print(title)  # ['三国演义']
# print(title1)  # ['三国演义']

# 3.5 or  Logical OR; joins conditions inside a predicate
# Example: find the content of title tags with lang="cng" or lang="bng"
# title = html.xpath('//book/title[@lang="cng" or @lang="bng"]/text()')
# print(title)  # ['Harry Potter', '西游记']

# 3.6 |  Joins paths
# Example: get the content of all title tags and price tags
# title = html.xpath('//title/text() | //price/text()')
# print(title)  # ['Harry Potter', '29.99', 'Learning XML', '39.95', '西游记', '69.95', '水浒传', '29.95', '三国演义', '29.95']

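# 3.8 parse()  Purpose: read data from a file
# Note: the file must be well-formed XML (no unclosed single tags; every tag must be closed)
content = etree.parse('test.html')
# print(content)  # <lxml.etree._ElementTree object at 0x...>
res = etree.tostring(content,encoding='utf-8')
print(res.decode())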
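
The and / or predicates from 3.4 and 3.5 and the position() range from 3.3.2 can also be checked in a self-contained form. This is a minimal sketch with a hypothetical three-book sample; the attribute values and variable names are chosen only for illustration.

```python
from lxml import etree

# Hypothetical sample; attribute names and values are chosen only for illustration.
xml = """
<bookstore>
  <book><title lang="bng" class="t1">Harry Potter</title></book>
  <book><title lang="cng">Learning XML</title></book>
  <book><title lang="dng" class="t1">Le Petit Prince</title></book>
</bookstore>
"""
root = etree.fromstring(xml.strip())

# and: both conditions must hold inside one predicate.
both = root.xpath('//title[@lang="dng" and @class="t1"]/text()')

# Chaining two predicates filters step by step and gives the same result here.
chained = root.xpath('//title[@lang="dng"][@class="t1"]/text()')

# or: either condition is enough.
either = root.xpath('//title[@lang="cng" or @lang="bng"]/text()')

# position() combined with last(): the final two book elements.
last_two = root.xpath('//book[position()>last()-2]/title/text()')

print(both, chained, either, last_two, sep="\n")
```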

This article comes from php Chinese website: python video tutorial column https://www.php.cn/course/list/30.html

Origin blog.csdn.net/Anna_xuan/article/details/110666931