Foreword
Previous articles in this series:
- Python Advanced Tutorial Series--Python3 Regular Expressions (1)
- Python Advanced Tutorial Series--Python3 CGI Programming (2)
- Python Advanced Series Tutorials--Python3 MySQL-mysql-connector Driver (3)
- Python Advanced Series Tutorials--Python3 MySQL Database Connection-PyMySQL Driver
- Python Advanced Series Tutorials--Python3 Network Programming (5)
- Python Advanced Tutorial Series--Python3 SMTP Sending Mail (6)
- Python Advanced Tutorial Series--Python3 Multithreading (7)
What is XML?
XML stands for eXtensible Markup Language. It is a subset of the Standard Generalized Markup Language (SGML) and is used to mark up electronic documents so that they have structure.
XML was designed to transmit and store data.
XML is a set of rules that define semantic tags that divide a document into parts and identify those parts.
It is also a meta-markup language, that is, a syntactic language that defines other domain-specific, semantic, and structured markup languages.
Python's parsing of XML
The common XML programming interfaces are DOM and SAX. The two handle XML files differently, so they suit different situations.
Python has three ways to parse XML: SAX, DOM, and ElementTree.
1. SAX (Simple API for XML)
The Python standard library includes a SAX parser. SAX uses an event-driven model: as it parses the XML it fires events one by one and calls user-defined callback functions to process the file.
2. DOM (Document Object Model)
DOM parses the XML data into a tree in memory, and you manipulate the XML by operating on that tree.
3. ElementTree
ElementTree (xml.etree.ElementTree in the standard library) represents the document as a lightweight tree of element objects.
The sample XML file movies.xml used in this chapter contains the following:
example
<collection shelf="New Arrivals">
    <movie title="Enemy Behind">
        <type>War, Thriller</type>
        <format>DVD</format>
        <year>2003</year>
        <rating>PG</rating>
        <stars>10</stars>
        <description>Talk about a US-Japan war</description>
    </movie>
    <movie title="Transformers">
        <type>Anime, Science Fiction</type>
        <format>DVD</format>
        <year>1989</year>
        <rating>R</rating>
        <stars>8</stars>
        <description>A schientific fiction</description>
    </movie>
    <movie title="Trigun">
        <type>Anime, Action</type>
        <format>DVD</format>
        <episodes>4</episodes>
        <rating>PG</rating>
        <stars>10</stars>
        <description>Vash the Stampede!</description>
    </movie>
    <movie title="Ishtar">
        <type>Comedy</type>
        <format>VHS</format>
        <rating>PG</rating>
        <stars>2</stars>
        <description>Viewable boredom</description>
    </movie>
</collection>
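ElementTree is named above as one of the three parsing approaches, but the examples in this chapter use only SAX and DOM. The following is a minimal sketch, not taken from the original tutorial, of reading the same kind of data with the standard-library xml.etree.ElementTree module. It parses a shortened copy of movies.xml from an inline string so that it runs on its own; SAMPLE_XML and summary are names assumed here for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of movies.xml, inlined so the sketch is self-contained.
# SAMPLE_XML is a hypothetical name used only in this example.
SAMPLE_XML = """\
<collection shelf="New Arrivals">
  <movie title="Enemy Behind">
    <type>War, Thriller</type>
    <format>DVD</format>
    <year>2003</year>
  </movie>
  <movie title="Ishtar">
    <type>Comedy</type>
    <format>VHS</format>
  </movie>
</collection>
"""

root = ET.fromstring(SAMPLE_XML)
print("Shelf:", root.get("shelf"))

summary = []
for movie in root.findall("movie"):
    # findtext returns the default when the child element is missing.
    year = movie.findtext("year", default="n/a")
    summary.append((movie.get("title"), movie.findtext("format"), year))

for title, fmt, year in summary:
    print(title, fmt, year)
```

To read the file itself instead of a string, ET.parse("movies.xml").getroot() would give the same kind of root element.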
Parsing XML with SAX in Python
SAX is an event-driven API.
Parsing XML documents using SAX involves two parts: parser and event handler.
The parser is responsible for reading the XML document and sending events to event handlers, such as element start and element end events.
The event handler is responsible for responding to the event and processing the passed XML data.
SAX is a good fit in the following situations:
1. You need to process large files.
2. You need only part of the file, or only specific information from it.
3. You want to build your own object model.
To process XML with SAX in Python, first import the parse function from xml.sax and the ContentHandler class from xml.sax.handler.
ContentHandler class methods
characters(content) method
When it is called:
For text between the start of a line and the next tag, content holds that text.
For text between one tag and the next tag, content holds that text.
For text between a tag and the end of the line, content holds that text.
The tag can be either an opening tag or a closing tag. The parser may therefore deliver the text of a single element in several calls.
startDocument() method
Called when the document starts.
endDocument() method
Called when the parser reaches the end of the document.
startElement(name, attrs) method
Called when an XML start tag is encountered. name is the name of the tag, and attrs is a dictionary-like object holding the tag's attributes.
endElement(name) method
Called when an XML closing tag is encountered.
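To make the order of these callbacks concrete, here is a minimal sketch that records every event the parser sends while reading a one-movie document. TraceHandler and its events list are names made up for this illustration; the overridden methods are the ContentHandler methods described above.

```python
import xml.sax

class TraceHandler(xml.sax.ContentHandler):
    """Hypothetical handler that records each callback in the order received."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        # attrs is dictionary-like; copy it into a plain dict for display.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Skip whitespace-only chunks to keep the trace short.
        if content.strip():
            self.events.append(("chars", content))

trace = TraceHandler()
xml.sax.parseString(b'<movie title="Ishtar"><format>VHS</format></movie>', trace)
for event in trace.events:
    print(event)
```

The trace begins with startDocument and ends with endDocument, and each start event is matched by an end event for the same tag.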
make_parser method
The following method creates a new parser object and returns it.
xml.sax.make_parser( [parser_list] )
Parameter Description:
parser_list - optional argument, a list of names of parser modules to try; the first one found is used
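One reason to call make_parser() directly is that the returned parser object can be driven by hand. The sketch below assumes the default parser in CPython, which is expat-based and supports the incremental feed() and close() methods; other parsers may not offer them. TitleCollector and the chunks list are hypothetical names, and the chunk boundaries are chosen arbitrarily to imitate data arriving in pieces.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler that keeps only the title of each <movie>."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        if name == "movie":
            self.titles.append(attrs.get("title"))

collector = TitleCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(collector)

# Simulated pieces of a document, e.g. blocks read from a large file.
chunks = [
    '<collection shelf="New Arrivals"><movie title="Enemy ',
    'Behind"/><movie title="Trig',
    'un"/></collection>',
]
for chunk in chunks:
    parser.feed(chunk)   # hand the parser one piece at a time
parser.close()           # signal the end of the document

print(collector.titles)
```

Because only the titles are stored, memory use stays small no matter how long the document is, which matches the large-file use case for SAX.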
parse method
The following method creates a SAX parser and parses the xml document:
xml.sax.parse( xmlfile, contenthandler[, errorhandler])
Parameter Description:
xmlfile - xml file name
contenthandler - must be a ContentHandler object
errorhandler - if specified, errorhandler must be a SAX ErrorHandler object
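As a small illustration of xml.sax.parse, the sketch below counts how often each tag occurs. The standard library documents the first argument as a file name or a stream, so an in-memory io.BytesIO object stands in for movies.xml here to keep the example self-contained. ElementCounter is a hypothetical name.

```python
import io
import xml.sax

class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts start tags by name."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

# In-memory stand-in for a file on disk.
stream = io.BytesIO(
    b'<collection><movie title="A"><format>DVD</format></movie>'
    b'<movie title="B"><format>VHS</format></movie></collection>'
)
counter = ElementCounter()
xml.sax.parse(stream, counter)
print(counter.counts)
```

Passing "movies.xml" instead of the stream would run the same handler over the sample file.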
parseString method
The parseString method creates an XML parser and parses the xml string:
xml.sax.parseString(xmlstring, contenthandler[, errorhandler])
Parameter Description:
- xmlstring - xml string
- contenthandler - must be a ContentHandler object
- errorhandler - if specified, errorhandler must be a SAX ErrorHandler object
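When no errorhandler is given, the default one raises an exception on malformed input. The following sketch, which assumes that default behavior, wraps parseString in a hypothetical helper named try_parse that reports whether a string is well-formed XML.

```python
import xml.sax

def try_parse(xml_text):
    """Return None if xml_text is well-formed, else a short error description."""
    try:
        # A bare ContentHandler ignores every event; only errors matter here.
        xml.sax.parseString(xml_text.encode("utf-8"), xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return "line %d: %s" % (exc.getLineNumber(), exc.getMessage())
    return None

good = try_parse("<movie><format>DVD</format></movie>")
bad = try_parse("<movie>\n<format>DVD</movie>")   # <format> is never closed
print(good)
print(bad)
```

The first call returns None; the second returns a message describing where the parser gave up.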
Python XML parsing example
example
#!/usr/bin/python3
import xml.sax

class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.type = ""
        self.format = ""
        self.year = ""
        self.rating = ""
        self.stars = ""
        self.description = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "movie":
            print("*****Movie*****")
            title = attributes["title"]
            print("Title:", title)

    # Called when an element ends
    def endElement(self, tag):
        if self.CurrentData == "type":
            print("Type:", self.type)
        elif self.CurrentData == "format":
            print("Format:", self.format)
        elif self.CurrentData == "year":
            print("Year:", self.year)
        elif self.CurrentData == "rating":
            print("Rating:", self.rating)
        elif self.CurrentData == "stars":
            print("Stars:", self.stars)
        elif self.CurrentData == "description":
            print("Description:", self.description)
        self.CurrentData = ""

    # Called when character data is read
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content
        elif self.CurrentData == "format":
            self.format = content
        elif self.CurrentData == "year":
            self.year = content
        elif self.CurrentData == "rating":
            self.rating = content
        elif self.CurrentData == "stars":
            self.stars = content
        elif self.CurrentData == "description":
            self.description = content

if __name__ == "__main__":
    # Create an XMLReader
    parser = xml.sax.make_parser()
    # Turn off namespace processing
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    # Install our ContentHandler subclass
    Handler = MovieHandler()
    parser.setContentHandler(Handler)
    parser.parse("movies.xml")
Running the code above produces the following output:
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom
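The handler above assigns content directly inside characters(). Because the parser may deliver one element's text in several calls, that approach can end up keeping only the last piece. A more defensive variant, sketched here under the assumption that each movie should become a plain dictionary, collects the pieces in a buffer and joins them when the element ends. MovieCollector, SAMPLE, and the underscore-prefixed attributes are hypothetical names.

```python
import xml.sax

class MovieCollector(xml.sax.ContentHandler):
    """Hypothetical handler: one dict per <movie>, text chunks joined per child."""

    def __init__(self):
        super().__init__()
        self.movies = []
        self._current = None   # dict for the movie being read
        self._buffer = []      # text chunks of the element being read

    def startElement(self, name, attrs):
        if name == "movie":
            self._current = {"title": attrs.get("title")}
        self._buffer = []

    def characters(self, content):
        # May be called more than once per element, so only accumulate here.
        self._buffer.append(content)

    def endElement(self, name):
        if name == "movie":
            self.movies.append(self._current)
            self._current = None
        elif self._current is not None:
            self._current[name] = "".join(self._buffer).strip()
        self._buffer = []

# Shortened, inlined excerpt of movies.xml.
SAMPLE = b"""<collection shelf="New Arrivals">
<movie title="Trigun">
   <type>Anime, Action</type>
   <episodes>4</episodes>
</movie>
</collection>"""

collector = MovieCollector()
xml.sax.parseString(SAMPLE, collector)
print(collector.movies)
```

Storing the values under the tag name also means that optional elements such as episodes need no extra branch in the handler.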
For the complete SAX API, see the Python documentation for the xml.sax package.
Parsing XML with xml.dom
The Document Object Model (DOM for short) is a standard programming interface recommended by the W3C for processing Extensible Markup Language documents.
When a DOM parser parses an XML document, it reads the entire document at once and stores all of its elements in a tree structure in memory. You can then use the functions provided by DOM to read or modify the content and structure of the document, and to write the modified content back to an XML file.
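The read, modify, and write cycle described above can be sketched with xml.dom.minidom as follows. The document is parsed from an inline string so the example runs on its own, and the new attribute and element values are arbitrary choices made for this illustration.

```python
from xml.dom.minidom import parseString

doc = parseString(
    '<collection shelf="New Arrivals">'
    '<movie title="Ishtar"><stars>2</stars></movie>'
    '</collection>'
)
root = doc.documentElement

# Modify an attribute and the text of an existing element.
root.setAttribute("shelf", "Classics")
stars = root.getElementsByTagName("stars")[0]
stars.firstChild.data = "3"

# Add a new <format> element to the movie.
movie = root.getElementsByTagName("movie")[0]
fmt = doc.createElement("format")
fmt.appendChild(doc.createTextNode("VHS"))
movie.appendChild(fmt)

# Serialize the modified tree back to XML text.
result = root.toxml()
print(result)
```

To save the result, the string can be written to a file opened in text mode, or doc.writexml() can be called with an open file object.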
The following example uses xml.dom.minidom to parse an XML file in Python:
example
#!/usr/bin/python3
import xml.dom.minidom

# Open the XML document with the minidom parser
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
    print("Root element : %s" % collection.getAttribute("shelf"))

# Get all the movies in the collection
movies = collection.getElementsByTagName("movie")

# Print the details of each movie
for movie in movies:
    print("*****Movie*****")
    if movie.hasAttribute("title"):
        print("Title: %s" % movie.getAttribute("title"))

    # The _node suffix avoids shadowing the built-ins type() and format()
    type_node = movie.getElementsByTagName('type')[0]
    print("Type: %s" % type_node.childNodes[0].data)
    format_node = movie.getElementsByTagName('format')[0]
    print("Format: %s" % format_node.childNodes[0].data)
    rating_node = movie.getElementsByTagName('rating')[0]
    print("Rating: %s" % rating_node.childNodes[0].data)
    description_node = movie.getElementsByTagName('description')[0]
    print("Description: %s" % description_node.childNodes[0].data)
Running the program above produces the following output:
Root element : New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom
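The example above prints only elements that every movie has. Elements such as year, stars, and episodes are missing from some entries in movies.xml, and indexing [0] on an empty result would raise IndexError. One way to handle this, sketched here with a hypothetical helper named child_text, is to check for the element before reading its text.

```python
from xml.dom.minidom import parseString

def child_text(element, tag, default=None):
    """Return the text of the first <tag> descendant, or default if absent or empty."""
    nodes = element.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return default
    return nodes[0].firstChild.data

# Inlined excerpt: the second movie has no <year> element.
doc = parseString(
    '<collection>'
    '<movie title="Enemy Behind"><year>2003</year></movie>'
    '<movie title="Trigun"><episodes>4</episodes></movie>'
    '</collection>'
)
years = {
    movie.getAttribute("title"): child_text(movie, "year", "unknown")
    for movie in doc.getElementsByTagName("movie")
}
print(years)
```

The same helper works for any optional child, so the printing loop does not need a separate check for each tag.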