What is XML
XML stands for Extensible Markup Language, and the markup is the key part. You create content and then mark it up with tags, so that each word, phrase, or block becomes identifiable, classifiable information.
Markup languages evolved gradually from early forms developed by private companies and governments into Standard Generalized Markup Language (SGML), then Hypertext Markup Language (HTML), and eventually XML. XML has the following characteristics:
- XML is designed to transport data rather than to display it
- XML tags are not predefined; you define your own tags
- XML is designed to be self-descriptive
- XML is a W3C Recommendation
Parsing XML documents with Python
The two common XML programming interfaces are DOM and SAX. They handle XML files in different ways and suit different situations.
DOM is a standard recommended by the W3C. It reads the entire XML file into memory and parses it into a tree, so you access the tags in the XML by navigating the nodes of that tree. This approach uses a lot of memory and is slow to parse, so try to avoid it when the file is very large.
SAX is event-driven: as the parser reads the XML, it fires events one by one and calls user-defined callback functions to handle them. It is faster and uses less memory, but you have to implement the callbacks yourself. The Python standard library documentation describes the limitation this way: SAX only gives you a view of a small part of the document at a time, and from the element you are currently looking at you cannot reach other elements.
Python provides many packages for parsing XML files, such as xml.dom, xml.sax, xml.dom.minidom, and xml.etree.ElementTree. This article focuses on xml.dom.minidom.
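To make the event-driven style concrete, here is a minimal sketch using xml.sax from the standard library. The handler class name TitleCollector and the inline sample document are assumptions made for illustration; only the movie/title structure mirrors the example file used later in this article.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records the title attribute of each <movie>."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        # The parser calls this each time it meets an opening tag.
        if name == "movie" and "title" in attrs:
            self.titles.append(attrs["title"])


# Illustrative sample data, not a file from the article.
SAMPLE = b'<collection><movie title="Enemy Behind"/><movie title="Ishtar"/></collection>'

handler = TitleCollector()
xml.sax.parseString(SAMPLE, handler)
print(handler.titles)
```

Note that the handler only ever sees one element at a time. Anything it needs later, such as the list of titles here, it has to store itself.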
xml.dom.minidom package
xml.dom.minidom is a minimal implementation of the DOM API. It is much simpler than the full DOM, and the package is also much smaller. The following example works on a movie.xml file.
    <collection shelf="New Arrivals">
      <movie title="Enemy Behind">
        <type>War, Thriller</type>
        <format>DVD</format>
        <year>2003</year>
        <rating>PG</rating>
        <stars>10</stars>
        <description>Talk about a US-Japan war</description>
      </movie>
      <movie title="Transformers">
        <type>Anime, Science Fiction</type>
        <format>DVD</format>
        <year>1989</year>
        <rating>R</rating>
        <stars>8</stars>
        <description>A schientific fiction</description>
      </movie>
      <movie title="Trigun">
        <type>Anime, Action</type>
        <format>DVD</format>
        <episodes>4</episodes>
        <rating>PG</rating>
        <stars>10</stars>
        <description>Vash the Stampede!</description>
      </movie>
      <movie title="Ishtar">
        <type>Comedy</type>
        <format>VHS</format>
        <rating>PG</rating>
        <stars>2</stars>
        <description>Viewable boredom</description>
      </movie>
    </collection>
Then we call the xml.dom.minidom.parse method to read the XML file and parse it into a DOM tree:
    import xml.dom.minidom

    # Open the XML document with the minidom parser
    DOMTree = xml.dom.minidom.parse("F:/project/Breast/codes/AllXML/aa.xml")
    collection = DOMTree.documentElement
    if collection.hasAttribute("shelf"):
        print("Root element : %s" % collection.getAttribute("shelf"))

    # Get all the movies in the collection
    movies = collection.getElementsByTagName("movie")

    # Print the details of each movie
    for movie in movies:
        print("*****Movie*****")
        if movie.hasAttribute("title"):
            print("Title: %s" % movie.getAttribute("title"))

        # Each lookup returns a list of matching elements; take the first one
        # and read the data of its text child node.
        type_node = movie.getElementsByTagName('type')[0]
        print("Type: %s" % type_node.childNodes[0].data)
        format_node = movie.getElementsByTagName('format')[0]
        print("Format: %s" % format_node.childNodes[0].data)
        rating = movie.getElementsByTagName('rating')[0]
        print("Rating: %s" % rating.childNodes[0].data)
        description = movie.getElementsByTagName('description')[0]
        print("Description: %s" % description.childNodes[0].data)
Running the program above prints the following:
    Root element : New Arrivals
    *****Movie*****
    Title: Enemy Behind
    Type: War, Thriller
    Format: DVD
    Rating: PG
    Description: Talk about a US-Japan war
    *****Movie*****
    Title: Transformers
    Type: Anime, Science Fiction
    Format: DVD
    Rating: R
    Description: A schientific fiction
    *****Movie*****
    Title: Trigun
    Type: Anime, Action
    Format: DVD
    Rating: PG
    Description: Vash the Stampede!
    *****Movie*****
    Title: Ishtar
    Type: Comedy
    Format: VHS
    Rating: PG
    Description: Viewable boredom
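The movies in the sample file do not all have the same children: Trigun has episodes instead of year, and Ishtar has neither. Indexing getElementsByTagName(...)[0] raises an IndexError when the tag is absent, so a small helper can be convenient for optional tags. The following is only a sketch; the helper name get_text and its default argument are assumptions for illustration, not part of minidom.

```python
import xml.dom.minidom


def get_text(node, tag, default=None):
    """Hypothetical helper: text of the first <tag> descendant, or default."""
    matches = node.getElementsByTagName(tag)
    if not matches or matches[0].firstChild is None:
        return default
    return matches[0].firstChild.data


# Two movies trimmed down from the sample file shown earlier.
SAMPLE = """<collection shelf="New Arrivals">
  <movie title="Enemy Behind"><year>2003</year></movie>
  <movie title="Ishtar"><format>VHS</format></movie>
</collection>"""

dom = xml.dom.minidom.parseString(SAMPLE)
years = {}
for movie in dom.documentElement.getElementsByTagName("movie"):
    # Fall back to a placeholder when the movie has no <year> element.
    years[movie.getAttribute("title")] = get_text(movie, "year", default="unknown")
print(years)
```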
In practice: batch-modifying XML files
Recently, while preparing a training data set for caffe-ssd, I found that the annotation XML files in the official data set were not in the standard format. Several tag names were wrong, which made it impossible to generate the lmdb file correctly, so the tags had to be changed. The following Python script modifies the XML files in batch.
    # -*- coding:utf-8 -*-
    import os
    import xml.dom.minidom

    xml_file_path = "/home/lyz/data/VOCdevkit/MyDataSet/Annotations/"
    lst_label = ["height", "width", "depth"]
    lst_dir = os.listdir(xml_file_path)
    for file_name in lst_dir:
        file_path = os.path.join(xml_file_path, file_name)
        tree = xml.dom.minidom.parse(file_path)
        root = tree.documentElement  # get the root node
        size_node = root.getElementsByTagName("size")[0]
        # Replace the child nodes under the size tag (img_height -> height, etc.)
        for size_label in lst_label:
            child_tag = "img_" + size_label
            child_node = size_node.getElementsByTagName(child_tag)[0]
            new_node = tree.createElement(size_label)
            text = tree.createTextNode(child_node.firstChild.data)
            new_node.appendChild(text)
            size_node.replaceChild(new_node, child_node)

        # Replace the bounding_box node under each object with a bndbox node
        lst_obj = root.getElementsByTagName("object")
        data = {}
        for obj_node in lst_obj:
            box_node = obj_node.getElementsByTagName("bounding_box")[0]
            new_box_node = tree.createElement("bndbox")
            for child_node in box_node.childNodes:
                tmp_node = child_node.cloneNode(True)  # deep copy
                new_box_node.appendChild(tmp_node)
            # Convert (left, top, width, height) into (xmin, ymin, xmax, ymax)
            x_node = new_box_node.getElementsByTagName("x_left_top")[0]
            xmin = x_node.firstChild.data
            data["xmin"] = (xmin, x_node)
            y_node = new_box_node.getElementsByTagName("y_left_top")[0]
            ymin = y_node.firstChild.data
            data["ymin"] = (ymin, y_node)
            w_node = new_box_node.getElementsByTagName("width")[0]
            xmax = str(int(xmin) + int(w_node.firstChild.data))
            data["xmax"] = (xmax, w_node)
            h_node = new_box_node.getElementsByTagName("height")[0]
            ymax = str(int(ymin) + int(h_node.firstChild.data))
            data["ymax"] = (ymax, h_node)
            for k, v in data.items():
                new_node = tree.createElement(k)
                text = tree.createTextNode(v[0])
                new_node.appendChild(text)
                new_box_node.replaceChild(new_node, v[1])
            obj_node.replaceChild(new_box_node, box_node)

        with open(file_path, 'w', encoding='utf-8') as f:
            tree.writexml(f, indent="\n", addindent="\t", encoding='utf-8')

        # Remove the XML declaration on the first line
        # (in some cases the file header can cause errors)
        lines = []
        with open(file_path, 'rb') as f:
            lines = f.readlines()[1:]
        with open(file_path, 'wb') as f:
            f.writelines(lines)

    print("-----------------done--------------------")
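The core of the script is the bounding-box conversion: a box given as a top-left corner plus width and height is rewritten as two corners. The sketch below runs the same arithmetic on an in-memory fragment so that it can be tried without any files. The sample values are made up, and for brevity it builds the bndbox element from scratch, a simplified variant of the clone-and-replace approach the script uses.

```python
import xml.dom.minidom

# Hypothetical annotation fragment in the non-standard layout the script expects.
SAMPLE = (
    "<object><bounding_box>"
    "<x_left_top>10</x_left_top><y_left_top>20</y_left_top>"
    "<width>30</width><height>40</height>"
    "</bounding_box></object>"
)

dom = xml.dom.minidom.parseString(SAMPLE)
box = dom.getElementsByTagName("bounding_box")[0]


def int_value(tag):
    """Read the integer text of the first <tag> element inside the box."""
    return int(box.getElementsByTagName(tag)[0].firstChild.data)


xmin, ymin = int_value("x_left_top"), int_value("y_left_top")
corners = {
    "xmin": xmin,
    "ymin": ymin,
    "xmax": xmin + int_value("width"),
    "ymax": ymin + int_value("height"),
}

# Build the replacement <bndbox> element and swap it in for <bounding_box>.
bndbox = dom.createElement("bndbox")
for name, value in corners.items():
    child = dom.createElement(name)
    child.appendChild(dom.createTextNode(str(value)))
    bndbox.appendChild(child)
box.parentNode.replaceChild(bndbox, box)

result = dom.documentElement.toxml()
print(result)
```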
About the writexml method: the indent parameter is the string written before the current node, and addindent is the additional string written before each of that node's child nodes.
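A small sketch of how these parameters affect the output, written to an in-memory buffer instead of a file. The sample element is made up for illustration. The call also passes newl, a further writexml parameter that is not used in the script: it is the string written after each node.

```python
import io
import xml.dom.minidom

# Illustrative fragment, not taken from the data set.
dom = xml.dom.minidom.parseString(
    "<size><width>300</width><height>200</height></size>"
)

buffer = io.StringIO()
# indent: written before the node itself
# addindent: added to the indentation for each nesting level
# newl: written after each node
dom.documentElement.writexml(buffer, indent="", addindent="\t", newl="\n")
pretty = buffer.getvalue()
print(pretty)
```

Each child element ends up on its own line, preceded by one tab. If the parsed document already contains whitespace between tags, those whitespace text nodes are written out as well, so the result can contain extra blank lines.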