Reading XML files that contain only tags

What is XML

XML is short for Extensible Markup Language, and markup is the key part. You create content and then mark it up with tags that define it, so that each word, phrase, or block becomes an identifiable, classifiable piece of information.
Markup languages evolved gradually from early forms developed by private companies and governments into the Standard Generalized Markup Language (SGML), then the Hypertext Markup Language (HTML), and eventually XML. XML has the following characteristics:

  • XML is designed to transport data rather than display it
  • XML tags are not predefined; you define your own tags
  • XML is designed to be self-descriptive
  • XML is a W3C Recommendation
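As a small illustration of the "define your own tags" point, here is a minimal sketch using Python's standard xml.dom.minidom. The `note`, `to`, and `body` tag names are made up for this example; XML itself does not predefine them.

```python
from xml.dom.minidom import parseString

# Hypothetical document: none of these tag names are predefined by XML.
xml_text = "<note><to>Alice</to><body>Meeting at 10:00</body></note>"

doc = parseString(xml_text)
root = doc.documentElement

# The tag names describe the data they wrap (self-descriptive).
fields = {child.tagName: child.firstChild.data for child in root.childNodes}
print(root.tagName)  # note
print(fields)        # {'to': 'Alice', 'body': 'Meeting at 10:00'}
```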

Parsing XML documents in Python


The two common programming interfaces for XML are DOM and SAX. They process XML files in different ways and suit different situations. DOM is the official W3C standard: it reads the entire XML file into memory and parses it into a tree, so tags can be reached by walking the tree's nodes. The cost is high memory use and slower parsing, so it is best avoided for very large files. SAX is event-driven: as it parses, it fires events one by one and calls user-defined callback functions to handle them. It is faster and uses less memory, but the user has to implement the callbacks. The Python standard library documentation describes SAX this way: SAX only lets you see a small part of the document at a time, and you cannot reach other elements from the element you are currently on. Python provides several packages for parsing XML, including xml.dom, xml.sax, xml.dom.minidom, and xml.etree.ElementTree. This article focuses on xml.dom.minidom.
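The difference between the two interfaces can be sketched side by side. The snippet below is only an illustration: the inline sample document and the `TitleHandler` class are made up for this example, and both parsers come from the standard library.

```python
import xml.sax
from xml.dom.minidom import parseString

# Small hypothetical sample in the same shape as the movie file used later.
XML_TEXT = (
    '<collection shelf="New Arrivals">'
    '<movie title="Enemy Behind"><format>DVD</format></movie>'
    '<movie title="Ishtar"><format>VHS</format></movie>'
    '</collection>'
)


class TitleHandler(xml.sax.ContentHandler):
    """Collects movie titles as the parser streams through the document."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        # Called once per opening tag; only the current element is visible.
        if name == "movie":
            self.titles.append(attrs.get("title"))


# SAX: event-driven, only what the handler stores is kept in memory.
handler = TitleHandler()
xml.sax.parseString(XML_TEXT.encode("utf-8"), handler)
sax_titles = handler.titles

# DOM: the whole tree is built first, then navigated freely.
dom = parseString(XML_TEXT)
dom_titles = [m.getAttribute("title") for m in dom.getElementsByTagName("movie")]

print(sax_titles)  # ['Enemy Behind', 'Ishtar']
print(dom_titles)  # ['Enemy Behind', 'Ishtar']
```

Both approaches reach the same titles here; the SAX version never holds the full tree, which is what makes it attractive for large files.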

xml.dom.minidom package

xml.dom.minidom is a minimal implementation of the DOM API. It is much simpler than a full DOM implementation and considerably smaller. The examples below operate on the following movie.xml file.

<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
   <movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>

Next, we call xml.dom.minidom.parse to read the XML file and parse it into a DOM tree:

from xml.dom.minidom import parse
import xml.dom.minidom

# Open the XML document with the minidom parser
DOMTree = xml.dom.minidom.parse("F:/project/Breast/codes/AllXML/aa.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
    print("Root element : %s" % collection.getAttribute("shelf"))

# Get all the movies in the collection
movies = collection.getElementsByTagName("movie")

# Print the details of each movie
for movie in movies:
    print("*****Movie*****")
    if movie.hasAttribute("title"):
        print("Title: %s" % movie.getAttribute("title"))

    type = movie.getElementsByTagName('type')[0]
    print("Type: %s" % type.childNodes[0].data)
    format = movie.getElementsByTagName('format')[0]
    print("Format: %s" % format.childNodes[0].data)
    rating = movie.getElementsByTagName('rating')[0]
    print("Rating: %s" % rating.childNodes[0].data)
    description = movie.getElementsByTagName('description')[0]
    print("Description: %s" % description.childNodes[0].data)

Running the program above produces the following output:

Root element : New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom
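Note that the script indexes `[0]` directly, which works because every movie has the four tags it reads. A tag such as `year` is missing from some movies in the sample, and reading it the same way would raise an IndexError. One possible way to guard against that is sketched below; `child_text` is a hypothetical helper, not part of minidom, and the inline excerpt is a shortened stand-in for the file.

```python
from xml.dom.minidom import parseString

# Hypothetical two-movie excerpt; the second movie has no <year> element.
XML_TEXT = """<collection shelf="New Arrivals">
<movie title="Enemy Behind"><year>2003</year><rating>PG</rating></movie>
<movie title="Ishtar"><rating>PG</rating></movie>
</collection>"""


def child_text(node, tag, default="N/A"):
    """Return the text of the first <tag> under node, or default if absent or empty."""
    matches = node.getElementsByTagName(tag)
    if not matches or matches[0].firstChild is None:
        return default
    return matches[0].firstChild.data


collection = parseString(XML_TEXT).documentElement
years = [child_text(m, "year") for m in collection.getElementsByTagName("movie")]
print(years)  # ['2003', 'N/A']
```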

In practice: batch-modifying XML files


Recently, while training caffe-ssd on a competition data set, I found that the XML annotation files in the official data set were not in the standard format: a few tags had the wrong names, so the lmdb files could not be generated correctly. The tags had to be fixed, and the Python script below modifies the XML files in batch.

# -*- coding:utf-8 -*-

import os

import xml.dom.minidom

xml_file_path = "/home/lyz/data/VOCdevkit/MyDataSet/Annotations/"
lst_label = ["height", "width", "depth"]
lst_dir = os.listdir(xml_file_path)


for file_name in lst_dir:
    file_path = xml_file_path + file_name
    tree = xml.dom.minidom.parse(file_path)
    root = tree.documentElement        # get the root node
    size_node = root.getElementsByTagName("size")[0]
    for size_label in lst_label:    # replace the child nodes under the size tag
        child_tag = "img_" + size_label
        child_node = size_node.getElementsByTagName(child_tag)[0]
        new_node = tree.createElement(size_label)
        text = tree.createTextNode(child_node.firstChild.data)
        new_node.appendChild(text)
        size_node.replaceChild(new_node, child_node)

    # replace the bounding_box node under each object
    lst_obj = root.getElementsByTagName("object")
    data = {}
    for obj_node in lst_obj:
        box_node = obj_node.getElementsByTagName("bounding_box")[0]
        new_box_node = tree.createElement("bndbox")
        for child_node in box_node.childNodes:
            tmp_node = child_node.cloneNode(True)  # deep copy
            new_box_node.appendChild(tmp_node)
        x_node = new_box_node.getElementsByTagName("x_left_top")[0]
        xmin = x_node.firstChild.data
        data["xmin"] = (xmin, x_node)
        y_node = new_box_node.getElementsByTagName("y_left_top")[0]
        ymin = y_node.firstChild.data
        data["ymin"] = (ymin, y_node)
        w_node = new_box_node.getElementsByTagName("width")[0]
        xmax = str(int(xmin) + int(w_node.firstChild.data))
        data["xmax"] = (xmax, w_node)
        h_node = new_box_node.getElementsByTagName("height")[0]
        ymax = str(int(ymin) + int(h_node.firstChild.data))
        data["ymax"] = (ymax, h_node)


        for k, v in data.items():
            new_node = tree.createElement(k)
            text = tree.createTextNode(v[0])
            new_node.appendChild(text)
            new_box_node.replaceChild(new_node, v[1])
        obj_node.replaceChild(new_box_node, box_node)

    with open(file_path, 'w') as f:
        tree.writexml(f, indent="\n", addindent="\t", encoding='utf-8')

    # Remove the XML declaration on the first line (in some cases it may cause errors)
    lines = []
    with open(file_path, 'rb') as f:
        lines = f.readlines()[1:]
    with open(file_path, 'wb') as f:
        f.writelines(lines)

print("-----------------done--------------------")
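The core of the conversion is the arithmetic xmax = xmin + width and ymax = ymin + height. The sketch below shows that step on a single in-memory annotation, so it can be tried without touching any files. The sample document is hypothetical, with tag names inferred from the script above, and `text_of` and `to_bndbox` are illustrative helpers. Unlike the script, this version builds the new `bndbox` node from scratch instead of cloning and replacing children.

```python
from xml.dom.minidom import parseString

# Hypothetical annotation, with tag names inferred from the script above.
XML_TEXT = (
    "<annotation><object><name>cat</name><bounding_box>"
    "<x_left_top>10</x_left_top><y_left_top>20</y_left_top>"
    "<width>30</width><height>40</height>"
    "</bounding_box></object></annotation>"
)


def text_of(parent, tag):
    """Text content of the first <tag> element under parent."""
    return parent.getElementsByTagName(tag)[0].firstChild.data


def to_bndbox(tree, box_node):
    """Build a <bndbox> with corner coordinates from a top-left/size box."""
    xmin = int(text_of(box_node, "x_left_top"))
    ymin = int(text_of(box_node, "y_left_top"))
    corners = {
        "xmin": xmin,
        "ymin": ymin,
        "xmax": xmin + int(text_of(box_node, "width")),
        "ymax": ymin + int(text_of(box_node, "height")),
    }
    bndbox = tree.createElement("bndbox")
    for tag, value in corners.items():
        child = tree.createElement(tag)
        child.appendChild(tree.createTextNode(str(value)))
        bndbox.appendChild(child)
    return bndbox


tree = parseString(XML_TEXT)
for obj_node in tree.getElementsByTagName("object"):
    old_box = obj_node.getElementsByTagName("bounding_box")[0]
    obj_node.replaceChild(to_bndbox(tree, old_box), old_box)

result = tree.documentElement.toxml()
print(result)
```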

About the writexml method: the indent parameter is the string written before the current node, and addindent is the additional string written before each of that node's child nodes.
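A minimal sketch of these parameters, writing to an in-memory buffer instead of a file. The `size` document is a made-up sample, and `newl` (the string written at the end of each line) is a further writexml parameter that the script above leaves at its default. The last two lines show one possible alternative to stripping the first line afterwards: serializing only the root element, which never emits the XML declaration.

```python
import io
from xml.dom.minidom import parseString

doc = parseString("<size><width>300</width><height>200</height></size>")

# indent: written before the root node; addindent: extra indentation per
# nesting level; newl: written at the end of each line.
buf = io.StringIO()
doc.writexml(buf, indent="", addindent="\t", newl="\n", encoding="utf-8")
pretty = buf.getvalue()
print(pretty)

# Serializing only the root element skips the <?xml ...?> declaration, so
# there is no header line to strip afterwards.
no_header = doc.documentElement.toxml()
print(no_header)
```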

 


Origin www.cnblogs.com/ziytong/p/11106381.html