12 Using XML in Python

Overview

        In the previous section, we introduced Python's regular expressions: their definition and syntax, and the re.search, re.match, re.findall, re.sub, re.compile, re.finditer, and re.split functions. In this section, we introduce how to work with XML in Python.

        XML stands for eXtensible Markup Language. It resembles HTML but has no predefined tags, so you can define your own tags to suit your design. Because the basic format of XML is standardized, a recipient can still parse the data when XML is shared or transmitted across systems or platforms, whether locally or over the Internet. In short, XML is designed to store and transmit data, whereas HTML is designed to display it.

        In Python, XML is usually processed in one of the following ways.

        SAX: Short for Simple API for XML. It reads an XML document sequentially and fires events as it parses, so it uses little memory and is fast. The disadvantage is that nothing is kept in memory once an event has been handled: if the handler does not save the data it receives, that data is lost.

        DOM: Short for Document Object Model. It reads the entire XML document into memory, parses it into a tree, and lets you operate on the XML by operating on the tree. This approach uses a lot of memory and is slow for large documents.

        ElementTree: A lightweight element tree that combines advantages of SAX and DOM. It uses less memory than DOM, is fast, and is easier to use.
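
        As a rough illustration of how the three approaches differ, the sketch below extracts the same names from one small XML string with each standard-library module. The XML_TEXT sample and the NameCollector class are made up for this comparison and are not part of any library.

```python
import xml.sax
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

# Hypothetical sample data, used only for this illustration
XML_TEXT = ('<Friends><Friend><Name>Mike</Name></Friend>'
            '<Friend><Name>Tom</Name></Friend></Friends>')

# SAX: react to events while the document streams past
class NameCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []

    def startElement(self, tag, attributes):
        if tag == 'Name':
            self.in_name = True
            self.names.append('')

    def endElement(self, tag):
        if tag == 'Name':
            self.in_name = False

    def characters(self, content):
        # Text may arrive in several chunks, so append to the current name
        if self.in_name:
            self.names[-1] += content

collector = NameCollector()
xml.sax.parseString(XML_TEXT.encode('utf-8'), collector)
sax_names = collector.names

# DOM: build the whole tree first, then query it
doc = minidom.parseString(XML_TEXT)
dom_names = [node.firstChild.data for node in doc.getElementsByTagName('Name')]

# ElementTree: also a tree, with a more compact API
root = ET.fromstring(XML_TEXT)
et_names = [node.text for node in root.iter('Name')]

print(sax_names, dom_names, et_names)
```

        All three lists should hold the same names. The difference lies in how much code each approach needs and how much of the document is held in memory at once.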

SAX

        SAX is an event-based interface for parsing XML. It does not load the entire document into memory; instead it reads the document incrementally and triggers the corresponding event as each piece is encountered. In Python, SAX parsing is available through the built-in xml.sax module, which provides handler base classes whose methods you override to process the different parts of an XML document, such as elements, attributes, and text. Because SAX does not load the entire document into memory, it is well suited to processing large XML documents.
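
        To show how a handler sees attributes, here is a minimal sketch that collects the attributes of every Friend element. The AttributeHandler class and the id and city attributes are assumptions made for this example; the Friends.xml file used later in this section has no attributes.

```python
import xml.sax

class AttributeHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.friends = []

    def startElement(self, tag, attributes):
        # The attributes object behaves like a read-only mapping
        if tag == 'Friend':
            self.friends.append(dict(attributes.items()))

# Hypothetical document with attributes instead of child elements
data = b'<Friends><Friend id="1" city="Paris"/><Friend id="2" city="Rome"/></Friends>'

handler = AttributeHandler()
xml.sax.parseString(data, handler)
print(handler.friends)
```

        The handler keeps only what it explicitly stores in self.friends. Everything else the parser reads is discarded as soon as the event has been handled.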

        Suppose we have the following Friends.xml file:

<?xml version='1.0' encoding='UTF-8'?>
<Friends>
  <Friend>
    <Name>Mike</Name>
    <Age>18</Age>
  </Friend>
  <Friend>
    <Name>Tom</Name>
    <Age>16</Age>
  </Friend>
</Friends>

        We can use the following sample code to read this Friends.xml file.

import os
import xml.sax

class FriendHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.current_tag = ''
        self.name = ''
        self.age = ''

    # Called when an element's start tag is encountered
    def startElement(self, tag, attributes):
        self.current_tag = tag
        if tag == 'Name':
            self.name = ''
        elif tag == 'Age':
            self.age = ''

    # Called when an element's end tag is encountered
    def endElement(self, tag):
        if tag == 'Name':
            print('Name is', self.name)
        elif tag == 'Age':
            print('Age is', self.age)
        self.current_tag = ''

    # Called for text inside an element; may be called more than once per element
    def characters(self, content):
        if self.current_tag == 'Name':
            self.name += content
        elif self.current_tag == 'Age':
            self.age += content

parser = xml.sax.make_parser()
handler = FriendHandler()
parser.setContentHandler(handler)
# Join with os.path.join so a path separator is placed before the file name
path = os.path.join(os.getcwd(), 'Friends.xml')
parser.parse(path)

        In the sample code above, we define a class named FriendHandler that inherits from xml.sax.ContentHandler and overrides three of its methods: startElement, endElement, and characters. The parser calls startElement when it encounters an element's start tag, endElement when it encounters an end tag, and characters when it encounters text inside an element. Running the sample code produces the following output:

Name is Mike
Age is 18
Name is Tom
Age is 16
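
        To make the order of these callbacks visible, the following sketch records each event in a list instead of printing values. The EventLogger class and the inline one-friend document are assumptions for this illustration. Whitespace-only text is skipped so the log stays short.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, tag, attributes):
        self.events.append(('start', tag))

    def endElement(self, tag):
        self.events.append(('end', tag))

    def characters(self, content):
        # Ignore the whitespace between elements
        if content.strip():
            self.events.append(('text', content))

logger = EventLogger()
xml.sax.parseString(b'<Friend><Name>Mike</Name><Age>18</Age></Friend>', logger)
for event in logger.events:
    print(event)
```

        Each element produces a start event, then any text events, then an end event, and child elements are nested between the start and end events of their parent.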

DOM

        DOM is a standard interface for representing HTML and XML documents. It gives developers a way to access and modify a document's content and structure programmatically. In Python, several libraries can be used for this, such as xml.dom.minidom from the standard library and the third-party lxml package.

        In the sample code below, we use xml.dom.minidom to parse the Friends.xml file shown above. To parse an XML string instead of a file, use minidom.parseString.

import xml.dom.minidom as minidom

doc = minidom.parse('Friends.xml')
root = doc.documentElement
# childNodes also contains whitespace text nodes, so filter by node name
for child in root.childNodes:
    if child.nodeName == 'Friend':
        name = child.getElementsByTagName('Name')[0]
        print('Name is', name.childNodes[0].data)
        age = child.getElementsByTagName('Age')[0]
        print('Age is', age.childNodes[0].data)
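
        Since DOM also supports modifying a document, here is a small sketch that parses an XML string with minidom.parseString, changes one value, adds a new element, and serializes the result. The City element and the new age value are invented for this example.

```python
import xml.dom.minidom as minidom

# Parse from a string rather than from a file
doc = minidom.parseString(
    '<Friends><Friend><Name>Mike</Name><Age>18</Age></Friend></Friends>')
friend = doc.getElementsByTagName('Friend')[0]

# Change the text of an existing element
age = friend.getElementsByTagName('Age')[0]
age.firstChild.data = '19'

# Create a new element (hypothetical <City>) and attach it to the tree
city = doc.createElement('City')
city.appendChild(doc.createTextNode('Paris'))
friend.appendChild(city)

# Serialize the root element back to a string
result = doc.documentElement.toxml()
print(result)
```

        New nodes are always created through the document object (createElement, createTextNode) and only become part of the tree once appendChild attaches them.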

ElementTree

        ElementTree provides a simple and efficient API for parsing and creating XML data. It uses a tree-based model to represent XML documents, allowing us to easily access and modify the elements and attributes of XML data.

        In the sample code below, we use ElementTree to parse the Friends.xml file mentioned above.

import xml.etree.ElementTree as ET

tree = ET.parse('Friends.xml')
root = tree.getroot()
for friend in root:
    name = friend[0]
    print('Name is', name.text)
    age = friend[1]
    print('Age is', age.text)
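
        Indexing with friend[0] and friend[1] depends on the order of the child elements. As an alternative sketch, the lookups can be done by tag name with findall, find, and findtext. The inline XML_TEXT copy of the data and the updated attribute are assumptions added so the example runs on its own.

```python
import xml.etree.ElementTree as ET

# Inline copy of the Friends.xml data so the sketch is self-contained
XML_TEXT = """<Friends>
  <Friend><Name>Mike</Name><Age>18</Age></Friend>
  <Friend><Name>Tom</Name><Age>16</Age></Friend>
</Friends>"""

root = ET.fromstring(XML_TEXT)

# Look children up by tag name instead of by position
friends = []
for friend in root.findall('Friend'):
    name = friend.findtext('Name')
    age = int(friend.findtext('Age'))
    friends.append((name, age))
print(friends)

# Elements can be modified in place
for friend in root.findall('Friend'):
    if friend.findtext('Name') == 'Tom':
        friend.find('Age').text = '17'
        friend.set('updated', 'yes')  # hypothetical attribute
```

        Looking elements up by name keeps the code working if the order of Name and Age changes in the file.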

        With ElementTree, we can also easily generate XML and save it to a file or a string. The Friends_new.xml file generated by the sample code below contains the same data as the Friends.xml file shown above.

import xml.etree.ElementTree as ET

root = ET.Element('Friends')

# First <Friend> element
child = ET.SubElement(root, 'Friend')
child_name = ET.SubElement(child, 'Name')
child_name.text = 'Mike'
child_age = ET.SubElement(child, 'Age')
child_age.text = '18'

# Second <Friend> element
child = ET.SubElement(root, 'Friend')
child_name = ET.SubElement(child, 'Name')
child_name.text = 'Tom'
child_age = ET.SubElement(child, 'Age')
child_age.text = '16'

tree = ET.ElementTree(root)
ET.indent(tree)  # add line breaks and indentation (requires Python 3.9+)
with open('Friends_new.xml', 'wb') as file:
    # Request the XML declaration explicitly rather than relying on the default
    tree.write(file, encoding='UTF-8', xml_declaration=True)
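
        To produce a string instead of a file, ET.tostring can be used. The sketch below builds a single Friend element and serializes it. Passing encoding='unicode' makes tostring return a str rather than bytes.

```python
import xml.etree.ElementTree as ET

root = ET.Element('Friends')
friend = ET.SubElement(root, 'Friend')
ET.SubElement(friend, 'Name').text = 'Mike'
ET.SubElement(friend, 'Age').text = '18'

# encoding='unicode' returns a str; other encodings return bytes
xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)

# The string can be parsed back into an element tree
parsed = ET.fromstring(xml_text)
print(parsed.find('Friend').findtext('Name'))
```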

Origin blog.csdn.net/hope_wisdom/article/details/132778570