[Python] Python Advanced Series Tutorials--Python3 XML Parsing (8)

foreword

What is XML?

XML stands for eXtensible Markup Language. It is a subset of the Standard Generalized Markup Language (SGML) and is used to mark up electronic documents so that they have a structure.

XML was designed to transmit and store data.

XML is a set of rules that define semantic tags that divide a document into parts and identify those parts.

It is also a meta-markup language, that is, a syntactic language that defines other domain-specific, semantic, and structured markup languages.

Python's parsing of XML

The common XML programming interfaces are DOM and SAX. The two interfaces process XML files in different ways, so they suit different situations.

Python offers three ways to parse XML: SAX, DOM, and ElementTree.

1. SAX (Simple API for XML)
The Python standard library includes a SAX parser. SAX uses an event-driven model: while it reads the XML file it fires events one by one and calls user-defined callback functions to handle them.

2. DOM (Document Object Model)
A DOM parser reads the XML data into a tree in memory, and you work with the XML by operating on that tree.

3. ElementTree
The xml.etree.ElementTree module in the standard library also builds a tree in memory, but with a lighter, more Pythonic API than DOM.

The sample XML file movies.xml used in this chapter has the following content:

example

<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
   <movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>

Parsing XML with SAX in Python

SAX is an event-driven API.

Parsing XML documents using SAX involves two parts: parser and event handler.

The parser is responsible for reading the XML document and sending events to event handlers, such as element start and element end events.

The event handler is responsible for responding to the event and processing the passed XML data.

SAX is a good fit in the following situations:

1. Processing large files.
2. When you need only part of the file, or only specific information from it.
3. When you want to build your own object model.

To process XML with SAX in Python, first import the parse function from xml.sax and the ContentHandler class from xml.sax.handler.
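As a rough illustration of the second situation (only part of the file is needed), the sketch below keeps just the title attribute of each movie and ignores everything else. The TitleCollector class and the inline SAMPLE data are hypothetical names made up for this example, and parseString is used instead of parse only so that the snippet does not depend on a file on disk.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# A small inline document standing in for movies.xml (illustrative data).
SAMPLE = b"""<collection shelf="New Arrivals">
<movie title="Enemy Behind"><format>DVD</format></movie>
<movie title="Ishtar"><format>VHS</format></movie>
</collection>"""


class TitleCollector(ContentHandler):
    """Hypothetical handler that keeps only the title of each movie."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        # All other elements are skipped, so nothing else is stored.
        if name == "movie":
            self.titles.append(attrs.get("title", ""))


collector = TitleCollector()
xml.sax.parseString(SAMPLE, collector)
print(collector.titles)
```

Because the handler stores only what it is interested in, memory use stays small even when the input document is large.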

ContentHandler class methods

characters(content) method

When it is called:

When there are characters between the start of a line and the next tag, content holds those characters.

When there are characters between one tag and the next tag, content holds those characters.

When there are characters between a tag and the end of the line, content holds those characters.

Here a tag can be either an opening tag or a closing tag. Note that the parser may deliver one run of text in several calls, so a single content value is not guaranteed to be the complete text of an element.
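To see what characters() actually receives, the following sketch records every call for a tiny document. TextRecorder is a hypothetical class written for this illustration; how the text is split into chunks depends on the parser, so the code prints the chunks rather than assuming a particular split.

```python
import xml.sax


class TextRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that records every characters() call."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


recorder = TextRecorder()
xml.sax.parseString(b"<a>one\n<b>two</b>three</a>", recorder)

# The number and size of the chunks is decided by the parser,
# so inspect them instead of relying on a fixed split.
for chunk in recorder.chunks:
    print(repr(chunk))

# Joining the chunks always gives back the full character data.
full_text = "".join(recorder.chunks)
print(repr(full_text))
```

Whitespace and line breaks between tags are reported too, which is why handlers usually filter or strip the text they collect.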

startDocument() method

Called when the document starts.

endDocument() method

Called when the parser reaches the end of the document.

startElement(name, attrs) method

Called when an XML start tag is encountered. name is the name of the tag, and attrs is a dictionary-like object that holds the tag's attributes.

endElement(name) method

Called when an XML closing tag is encountered.
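The order in which these callbacks fire can be observed with a handler that simply logs each event. This is a minimal sketch; EventLogger and its events list are names invented for the example, and whitespace-only text is dropped to keep the log short.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that logs each SAX event in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        # Copy the attributes into a plain dict for easy printing.
        self.events.append(("startElement", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        if content.strip():  # skip whitespace-only text
            self.events.append(("characters", content))


logger = EventLogger()
xml.sax.parseString(
    b'<movie title="Ishtar"><format>VHS</format></movie>', logger
)
for event in logger.events:
    print(event)
```

The log starts with startDocument, ends with endDocument, and every startElement is matched by an endElement for the same name.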

make_parser method

The following method creates a new parser object and returns it.

xml.sax.make_parser( [parser_list] )

Parameter Description:

parser_list - optional argument, a list of parser module names to try before the default parsers
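A short sketch of make_parser in use is shown below, assuming the default parser is acceptable. ElementCounter is a hypothetical handler, and an in-memory io.BytesIO stream stands in for a real file so the snippet runs on its own.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class ElementCounter(ContentHandler):
    """Hypothetical handler that counts start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


parser = xml.sax.make_parser()                # use the default parser list
parser.setFeature(feature_namespaces, False)  # plain (non-namespace) events
counter = ElementCounter()
parser.setContentHandler(counter)

# parse() accepts a file name or a file-like object.
parser.parse(io.BytesIO(b"<collection><movie/><movie/></collection>"))
print(counter.count)
```

Creating the parser explicitly is useful when you want to set features or reuse the same handler setup for several documents.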

parse method

The parse function creates a SAX parser and uses it to parse an XML document:

xml.sax.parse( xmlfile, contenthandler[, errorhandler])

Parameter Description:

xmlfile - the XML file name (a file object is also accepted)
contenthandler - must be a ContentHandler object
errorhandler - if specified, errorhandler must be a SAX ErrorHandler object
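The call can be tried out as follows. This is only a sketch: ShelfReader is a hypothetical handler, and the file is written to a temporary directory so the example does not depend on an existing movies.xml.

```python
import os
import tempfile
import xml.sax
from xml.sax.handler import ContentHandler


class ShelfReader(ContentHandler):
    """Hypothetical handler that reads the shelf attribute of the root."""

    def __init__(self):
        super().__init__()
        self.shelf = None

    def startElement(self, name, attrs):
        if name == "collection":
            self.shelf = attrs.get("shelf")


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "movies.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write('<collection shelf="New Arrivals"></collection>')

    reader = ShelfReader()
    xml.sax.parse(path, reader)  # the parser is created internally

print(reader.shelf)
```

Compared with make_parser, xml.sax.parse is the shorter route when no parser features need to be changed.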

parseString method

The parseString method creates an XML parser and parses the xml string:

xml.sax.parseString(xmlstring, contenthandler[, errorhandler])

Parameter Description:

  • xmlstring - xml string
  • contenthandler - must be a ContentHandler object
  • errorhandler - if specified, errorhandler must be a SAX ErrorHandler object
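parseString is convenient for checking XML that is already in memory. The sketch below wraps it in a hypothetical helper, is_well_formed, and relies on the default error handling, which raises xml.sax.SAXParseException when the input is not well-formed.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def is_well_formed(xml_bytes):
    """Return True if xml_bytes parses without a fatal error."""
    try:
        # A bare ContentHandler ignores every event; only errors matter here.
        xml.sax.parseString(xml_bytes, ContentHandler())
    except xml.sax.SAXParseException as exc:
        print("parse error at line", exc.getLineNumber(), ":", exc.getMessage())
        return False
    return True


good = is_well_formed(b"<movie><format>DVD</format></movie>")
bad = is_well_formed(b"<movie><format>DVD</movie>")  # mismatched tag
print(good, bad)
```

Passing your own ErrorHandler as the third argument is the alternative when errors should be logged or collected instead of raised.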

Python SAX parsing example

example

#!/usr/bin/python3

import xml.sax
import xml.sax.handler

class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.type = ""
        self.format = ""
        self.year = ""
        self.rating = ""
        self.stars = ""
        self.description = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "movie":
            print("*****Movie*****")
            title = attributes["title"]
            print("Title:", title)

    # Called when an element ends
    def endElement(self, tag):
        if self.CurrentData == "type":
            print("Type:", self.type)
        elif self.CurrentData == "format":
            print("Format:", self.format)
        elif self.CurrentData == "year":
            print("Year:", self.year)
        elif self.CurrentData == "rating":
            print("Rating:", self.rating)
        elif self.CurrentData == "stars":
            print("Stars:", self.stars)
        elif self.CurrentData == "description":
            print("Description:", self.description)
        # Reset so that whitespace between elements is ignored
        self.CurrentData = ""

    # Called when character data is read. If the parser splits the text
    # of one element into several calls, only the last chunk is kept here.
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content
        elif self.CurrentData == "format":
            self.format = content
        elif self.CurrentData == "year":
            self.year = content
        elif self.CurrentData == "rating":
            self.rating = content
        elif self.CurrentData == "stars":
            self.stars = content
        elif self.CurrentData == "description":
            self.description = content

if __name__ == "__main__":

    # Create an XMLReader
    parser = xml.sax.make_parser()
    # Turn off namespace processing
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    # Replace the default ContentHandler with our own
    Handler = MovieHandler()
    parser.setContentHandler(Handler)

    parser.parse("movies.xml")

Running the code above produces the following output:

*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom
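The handler in the example stores whatever the most recent characters() call delivered, which works for short values like these but could lose text if the parser splits a value into several chunks. A possible variation, sketched below under that assumption, buffers the chunks and joins them when the element ends. BufferedMovieHandler and the shortened SAMPLE data are hypothetical and not part of the original example.

```python
import xml.sax

# Shortened stand-in for movies.xml (illustrative data).
SAMPLE = b"""<collection shelf="New Arrivals">
<movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
</movie>
</collection>"""


class BufferedMovieHandler(xml.sax.ContentHandler):
    """Hypothetical handler that collects each movie into a dict."""

    def __init__(self):
        super().__init__()
        self.movies = []       # one dict per <movie>
        self._current = None   # the dict being filled, if any
        self._buffer = []      # text chunks of the element being read

    def startElement(self, name, attrs):
        self._buffer = []
        if name == "movie":
            self._current = {"title": attrs.get("title", "")}

    def characters(self, content):
        # Keep every chunk; they are joined in endElement.
        self._buffer.append(content)

    def endElement(self, name):
        if name == "movie":
            self.movies.append(self._current)
            self._current = None
        elif self._current is not None:
            self._current[name] = "".join(self._buffer).strip()
        self._buffer = []


handler = BufferedMovieHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.movies)
```

Collecting the values into dictionaries also separates parsing from printing, so the same handler can feed other code.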

For the complete SAX API, see the xml.sax documentation in the Python standard library reference.

Parsing XML with xml.dom

The Document Object Model (DOM for short) is a standard programming interface recommended by the W3C for processing Extensible Markup Language documents.

When a DOM parser parses an XML document, it reads the entire document at once and stores all of its elements in a tree structure in memory. You can then use the functions that DOM provides to read or modify the content and structure of the document, and to write the modified content back to an XML file.
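The read, modify, and write cycle described here might look like the following with xml.dom.minidom. This is a sketch on a made-up one-movie document; the new shelf value and the added rating element are chosen only for illustration.

```python
from xml.dom.minidom import parseString

doc = parseString(
    '<collection shelf="New Arrivals"><movie title="Ishtar"/></collection>'
)
root = doc.documentElement

# Modify an attribute on the root element.
root.setAttribute("shelf", "Classics")

# Add a new child element with a text node to the first movie.
rating = doc.createElement("rating")
rating.appendChild(doc.createTextNode("PG"))
root.getElementsByTagName("movie")[0].appendChild(rating)

# Serialize the modified tree back to an XML string.
updated_xml = root.toxml()
print(updated_xml)
```

To save the result to disk, the document's writexml method can be called with an open file object instead of building a string.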

The following example uses xml.dom.minidom to parse an XML file in Python:

example

#!/usr/bin/python3

import xml.dom.minidom

# Open the XML document with the minidom parser
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
    print("Root element : %s" % collection.getAttribute("shelf"))

# Get all the movies in the collection
movies = collection.getElementsByTagName("movie")

# Print the details of each movie
for movie in movies:
    print("*****Movie*****")
    if movie.hasAttribute("title"):
        print("Title: %s" % movie.getAttribute("title"))

    # The _node suffix avoids shadowing the built-ins type() and format()
    type_node = movie.getElementsByTagName('type')[0]
    print("Type: %s" % type_node.childNodes[0].data)
    format_node = movie.getElementsByTagName('format')[0]
    print("Format: %s" % format_node.childNodes[0].data)
    rating_node = movie.getElementsByTagName('rating')[0]
    print("Rating: %s" % rating_node.childNodes[0].data)
    description_node = movie.getElementsByTagName('description')[0]
    print("Description: %s" % description_node.childNodes[0].data)

Running the program above produces the following output:

Root element : New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom
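The DOM example does not print the year, and indexing getElementsByTagName('year')[0] would raise an IndexError for movies that have no year element. ElementTree, the third approach named at the start of this article, can express the same lookup with a default value. The sketch below is an illustration on shortened inline data, not part of the original tutorial; the summary list and the "unknown" placeholder are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for movies.xml (illustrative data).
SAMPLE = """<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <year>2003</year>
</movie>
<movie title="Trigun">
   <type>Anime, Action</type>
   <episodes>4</episodes>
</movie>
</collection>"""

root = ET.fromstring(SAMPLE)
print("Root element :", root.get("shelf"))

summary = []
for movie in root.findall("movie"):
    title = movie.get("title")
    movie_type = movie.findtext("type", default="")
    # findtext returns the default when the child element is missing.
    year = movie.findtext("year", default="unknown")
    summary.append((title, movie_type, year))

for row in summary:
    print(row)
```

For a file on disk, ET.parse("movies.xml").getroot() would give the same kind of root element.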

Origin blog.csdn.net/u011397981/article/details/131138242