Parsing XML in Python with minidom (reproduced)

1 XML rule summary
  The following is an example XML file from a blog site:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
        <channel>
                <title>pilgrim</title>
                <link>http://blog.xxx.com</link>
                <description>blog description</description>
                <generator>Terac Miracle 3.8</generator>
                <item>
                        <title>title of the article</title>
                        <link>http://blog.xxx.com/e_42678.html</link>
                        <description>content of the article</description>
                        <pubDate>Sun, 23 Sep 2007 23:32:00 +0800</pubDate>
                </item>
                <item>
                        <title>title of the article</title>
                        <link>http://blog.xxx.com/e_39749.html</link>
                        <description>content of the article</description>
                        <pubDate>Mon, 27 Aug 2007 23:58:00 +0800</pubDate>
                </item>
        </channel>
</rss>

  The first line of the document is the XML declaration, which gives the parser basic information about the document. An XML declaration is recommended but not required. If present, it must be the first thing in the document. The declaration may contain up to three name-value pairs. version is the version of XML used; the current value must be 1.0. encoding is the character set used in the document. If no encoding is specified, the XML parser assumes the document is in UTF-8. Finally, standalone (which can be yes or no) states whether the document can be processed without reading any other files. For example, if an XML document does not refer to any other file, you can specify standalone="yes". If the XML document references other files that describe the document's content, you can specify standalone="no". standalone="no" is the default.
Example: <?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
  Next comes the root element. An XML document must contain exactly one root element, which in this example is the rss element; it contains all the text and all the other elements in the document. version is an attribute of the rss element, and its value is the string "2.0". Attribute values must be enclosed in quotation marks, either single or double. </rss> is the end tag of the rss element; every element needs an end tag. If an element has no child nodes, the start tag and end tag can be combined: <tag attr='value' />.
  XML elements cannot overlap, and XML is case sensitive.
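
  The rules above can be tried out directly. The following is a minimal sketch, assuming Python 3; the helper name is_well_formed and the sample strings are made up for illustration. It hands a few small strings to minidom.parseString and reports whether the parser accepts them:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def is_well_formed(text):
    """Return True if minidom can parse text, False if the parser rejects it."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False

# Hypothetical samples: one correct document and two that break a rule.
good = '<rss version="2.0"><channel><title>pilgrim</title></channel></rss>'
overlapping = '<rss><channel><title>pilgrim</channel></title></rss>'
wrong_case = '<rss><Channel>pilgrim</channel></rss>'
combined_tag = "<tag attr='value' />"

for sample in (good, overlapping, wrong_case, combined_tag):
    print(is_well_formed(sample), sample)
```

  Overlapping elements and tags that differ only in case are both reported by the parser as errors, while the combined start/end tag is accepted.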

  There are three kinds of XML documents:
    * Invalid documents do not follow the syntax rules defined by the XML specification. If a developer has defined rules in a DTD or schema for what the document may contain, and a document does not follow those rules, that document is also invalid. (See "Defining document content" for more on DTDs and schemas for XML documents.)
    * Valid documents follow both the XML syntax rules and the rules defined in their DTD or schema.
    * Well-formed documents follow the XML syntax rules but have no DTD or schema.

2 Kinds of parsers
  For a computer to read an XML document, you need a parser.
  Parsers can be classified in several ways:

  1. Validating or non-validating parsers
  2. Parsers that support the Document Object Model (DOM)
  3. Parsers that support the Simple API for XML (SAX)
  4. Parsers written in a specific language (Java, C++, Perl, etc.)

  As mentioned earlier, an XML document that uses a DTD and conforms to the rules in that DTD is called a valid document. An XML document that conforms to the basic markup rules is called a well-formed document. The XML specification requires every parser to report an error when it finds that a document is not well-formed.
  A validating parser validates an XML document while parsing it, that is, it checks whether the document is valid. A non-validating parser ignores all validation errors. In other words, if an XML document is well-formed, a non-validating parser does not care whether the document follows the rules (if any) specified in its DTD.

  The Document Object Model (DOM) is an official recommendation of the World Wide Web Consortium (W3C). It defines an interface that lets programs access and update the style, structure and content of XML documents. XML parsers that support the DOM implement this interface.
  When a DOM parser parses an XML document, it reads the whole document at once and stores all of its elements in a tree structure in memory. You can then use the functions the DOM provides to read or modify the content and structure of the document, and write the modified content back to an XML file.

  The SAX API is another way of processing the content of XML documents. It is a de facto standard, developed by David Megginson and other members of the XML-Dev mailing list. Unlike the DOM, SAX does not read the whole document at once. It processes the document as a data stream, and the parser generates an event at each point in the document, for example at the start and end of an element. You decide how to handle each event.
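
  The event-driven style can be sketched with Python's standard xml.sax module. This is a minimal illustration, assuming Python 3; the class name TitleHandler and the sample string are made up here. The handler reacts to element start, character data and element end events to collect the text of every title element:

```python
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> element as the parser streams past it."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        # Event: an opening tag was read.
        if name == 'title':
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Event: character data; may fire several times for one run of text.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        # Event: a closing tag was read.
        if name == 'title':
            self.titles.append(''.join(self._buffer))
            self._in_title = False

sample = (b'<rss version="2.0"><channel><title>pilgrim</title>'
          b'<item><title>title of the article</title></item>'
          b'</channel></rss>')

handler = TitleHandler()
xml.sax.parseString(sample, handler)
print(handler.titles)
```

  Nothing is kept in memory except what the handler chooses to store, which is why this style suits large documents.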

  DOM parsers suit short documents and documents that need random access and modification; SAX suits reading large documents sequentially. Modifying a document with SAX is cumbersome.

3 The DOM and the Python library xml.dom.minidom
  The DOM stores each element, attribute, piece of text and other information in an XML document in a data type called a node. In xml.dom, Node is the parent class of every component of an XML document. The most common node types in XML are:
* Element: elements are the basic building blocks of XML. Typically, an element has child elements, text nodes, or a combination of both. Element nodes are also the only node type that can have attributes.
* Attribute: attribute nodes contain information about an element node, but are not actually considered children of the element, as with version in <rss version="2.0">.
* Text: a text node is exactly that: text. It can hold a lot of information or consist only of whitespace.
* Document: the document node is the parent of all other nodes in the document.
Other node types are less common, but still needed in some cases. They include:
* CDATA: short for character data, this is a special node containing information that the parser should not analyze. Instead, the information is passed through as plain text. For example, HTML markup might be stored for a specific purpose. Normally the processor would try to create an element for each stored tag, which could leave the document not well-formed. The problem can be avoided by using CDATA sections. These sections are written with a special notation:
<![CDATA[<B>
      Important: Please keep head and hands inside ride at <I>all
      times</I>.
      </B>]]>
* Comment: comments contain information about the data and are usually ignored by the application. They are written as follows:
<!-- This is a comment. -->
* Processing instruction: processing instructions are information aimed specifically at the application. Examples include code to be executed or information on where to find a stylesheet. For example:
<?xml-stylesheet type="text/xsl" href="foo.xsl"?>
In Python's xml.dom, the type of a node is given by the node's nodeType attribute, which takes one of the following values: ELEMENT_NODE,
ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_NODE,
PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE,
DOCUMENT_TYPE_NODE, NOTATION_NODE
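
As a small sketch of how these constants show up in practice (assuming Python 3; the sample string and the helper describe are made up for illustration), the following parses a document containing several of the node types described above and prints the name of each node's nodeType:

```python
from xml.dom import minidom, Node

# Hypothetical sample with a processing instruction, an attribute,
# a comment, plain text and a CDATA section.
sample = ('<?xml-stylesheet type="text/xsl" href="foo.xsl"?>'
          '<root attr="value"><!-- a comment -->text<![CDATA[<b>raw</b>]]></root>')
doc = minidom.parseString(sample)

# Map the numeric nodeType constants back to their names for printing.
names = {getattr(Node, n): n for n in dir(Node) if n.endswith('_NODE')}

def describe(node):
    """Return the constant name matching node.nodeType."""
    return names[node.nodeType]

print(describe(doc))
for child in doc.childNodes:
    print(describe(child))
for child in doc.documentElement.childNodes:
    print(describe(child))

attr = doc.documentElement.getAttributeNode('attr')
print(describe(attr))
```

Note that the attribute is reached through getAttributeNode rather than childNodes, which matches the point that attributes are not children of their element.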

Example: save the XML example above as test-utf8.xml; it must be saved in UTF-8 encoding. Then type at the Python prompt:
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('test-utf8.xml')

This gives a variable xmldoc of type Document, a tree structure that holds all the information in the document. The toxml() method of Node returns the XML string for a node. Because Document is a subclass of Node, toxml() can be applied to it:
>>> print(xmldoc.toxml())

<?xml version="1.0" ?><rss version="2.0">
        <channel>
                <title>pilgrim</title>
                <link>http://blog.xxx.com</link>
                <description>blog description</description>
                <generator>Terac Miracle 3.8</generator>
                <item>
                        <title>title of the article</title>
                        <link>http://blog.xxx.com/e_42678.html</link>
                        <description>content of the article</description>
                        <pubDate>Sun, 23 Sep 2007 23:32:00 +0800</pubDate>
                </item>
                <item>
                        <title>title of the article</title>
                        <link>http://blog.xxx.com/e_39749.html</link>
                        <description>content of the article</description>
                        <pubDate>Mon, 27 Aug 2007 23:58:00 +0800</pubDate>
                </item>
        </channel>
</rss>
To get the root node of the document, use the documentElement property of Document:
>>> root = xmldoc.documentElement
>>> root

<DOM Element: rss at 0x14f29e0>
>>> print(root.toxml())

<rss version="2.0">
        <channel>
                <title>pilgrim</title>
                <link>http://blog.xxx.com</link>
                <description>blog description</description>
                <generator>Terac Miracle 3.8</generator>
                <item>
                        <title>title of the article</title>
                        <link>http://blog.xxx.com/e_42678.html</link>
                        <description>content of the article</description>
                        <pubDate>Sun, 23 Sep 2007 23:32:00 +0800</pubDate>
                </item>
                <item>
                        <title>title of the article</title>
                        <link>http://blog.xxx.com/e_39749.html</link>
                        <description>content of the article</description>
                        <pubDate>Mon, 27 Aug 2007 23:58:00 +0800</pubDate>
                </item>
        </channel>
</rss>
root has one child element, channel. The child nodes of root are held in root.childNodes; the first child node is referenced by root.firstChild and the last by root.lastChild:

>>> print(root.firstChild.toxml())
>>> print(root.lastChild.toxml())
>>> print(root.childNodes[1].toxml())

<channel>
                <title>pilgrim</title>
                <link>http://blog.xxx.com</link>
                <description>blog description</description>
                <generator>Terac Miracle 3.8</generator>
                <item>
                        <title>title of the article</title>
                        <link>http://blog.xxx.com/e_42678.html</link>
                        <description>content of the article</description>
                        <pubDate>Sun, 23 Sep 2007 23:32:00 +0800</pubDate>
                </item>
                <item>
                        <title>title of the article</title>
                        <link>http://blog.xxx.com/e_39749.html</link>
                        <description>content of the article</description>
                        <pubDate>Mon, 27 Aug 2007 23:58:00 +0800</pubDate>
                </item>
        </channel>
Because firstChild and lastChild are text nodes made up of the spaces and line breaks before and after channel, they print as blank lines.
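
The whitespace nodes are easy to trip over, so here is a minimal sketch (assuming Python 3; the shortened sample string is made up for illustration) that shows them and then filters the child list down to element nodes only:

```python
from xml.dom import minidom

sample = """<rss version="2.0">
        <channel>
                <title>pilgrim</title>
        </channel>
</rss>"""
root = minidom.parseString(sample).documentElement

# firstChild and lastChild are the whitespace runs around <channel>.
print(len(root.childNodes))
print(repr(root.firstChild.data))
print(root.firstChild.nodeType == root.TEXT_NODE)

# Keep only element children to skip the whitespace text nodes.
elements = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]
print([e.tagName for e in elements])
```

Filtering on nodeType in this way is the same check that xml2dir.py uses later when it walks childNodes.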
Accessing xmldoc:
How the parts of an XML document correspond to minidom:
<Node.tagName Node.attributes.keys()="Node.attributes['key'].value">
attributes can be used like a dict: keys() returns the list of attribute names, and Node.attributes['key'].value returns an attribute's value.

        Node.childNodes holds the child nodes; the first is Node.firstChild and the last is Node.lastChild. It can be used like a list.
        <Node.tagName>TextNode.data</Node.tagName>
</Node.tagName>
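
A short sketch of these accessors, assuming Python 3 and a made-up sample string. It also uses getElementsByTagName and getAttribute, which are standard minidom methods:

```python
from xml.dom import minidom

sample = ('<rss version="2.0"><channel>'
          '<item><title>title of the article</title>'
          '<link>http://blog.xxx.com/e_42678.html</link></item>'
          '</channel></rss>')
root = minidom.parseString(sample).documentElement

print(root.tagName)
print(list(root.attributes.keys()))        # attribute names, dict-style
print(root.attributes['version'].value)    # attribute value via the map
print(root.getAttribute('version'))        # same value via the shortcut

for item in root.getElementsByTagName('item'):
    title = item.getElementsByTagName('title')[0]
    link = item.getElementsByTagName('link')[0]
    # The text of an element lives in a child text node, in its data attribute.
    print(title.firstChild.data, link.firstChild.data)
```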
Modifying xmldoc:
Adding a node: first create the node with Document.createElement(tagName), then insert it with
Node.appendChild(new) or Node.insertBefore(new, ref)
Removing a node: Node.removeChild(old)
Replacing a node: Node.replaceChild(new, old)
Adding, removing and changing attributes:
Element.setAttribute(name, value) and Element.removeAttribute(name). An attribute can also be set directly with
Element.attributes['key'] = value, where value is a unicode string.
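
These calls can be strung together as follows. This is a minimal sketch, assuming Python 3; the element and attribute names are made up for illustration, and createTextNode is a standard Document method used here to give the new element some text:

```python
from xml.dom import minidom

doc = minidom.parseString('<channel><title>pilgrim</title></channel>')
channel = doc.documentElement
title = channel.firstChild

# Create a new element with a text child and append it.
link = doc.createElement('link')
link.appendChild(doc.createTextNode('http://blog.xxx.com'))
channel.appendChild(link)

# Insert another element before an existing one.
generator = doc.createElement('generator')
channel.insertBefore(generator, link)

# Replace one node with another, then remove a node.
description = doc.createElement('description')
channel.replaceChild(description, generator)
channel.removeChild(title)

# Set, overwrite and remove attributes.
channel.setAttribute('lang', 'en')
channel.attributes['lang'] = 'zh'
channel.setAttribute('temp', '1')
channel.removeAttribute('temp')

print(channel.toxml())
```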

4 Examples
dir2xml.py generates an XML file from a directory structure, and xml2dir.py rebuilds the directory from that XML file. The files it creates are, of course, empty.
File dir2xml.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Traverse a directory and generate an XML file of its structure.
dir2xml dirname xmlfilename
"""
import os
from xml.dom import minidom as pydom
import sys

def usage():
    print("Usage:", sys.argv[0], "dirname xmlfilename")

def dir2xml(dirname):
    """
    dirname is a path name; it must be normalized, with no leading blanks and no trailing / or \\.
    """
    impl = pydom.getDOMImplementation()
    newdoc = impl.createDocument(None, "dir", None)
    rootdir = newdoc.documentElement
    rootdir.attributes['name'] = os.path.basename(dirname)

    def walkdir(dirname, node, document):  # recursive directory traversal; os.walk would be simpler here
        for name in os.listdir(dirname):
            if os.path.isfile(os.path.join(dirname, name)):
                newFileEl = document.createElement('file')
                newFileEl.attributes['name'] = name
                node.appendChild(newFileEl)
            elif os.path.isdir(os.path.join(dirname, name)):
                newFileEl = document.createElement('dir')
                newFileEl.attributes['name'] = name
                node.appendChild(newFileEl)
                walkdir(os.path.join(dirname, name), newFileEl, document)
    walkdir(dirname, rootdir, newdoc)
    return newdoc

if __name__ == '__main__':
    if len(sys.argv) < 3:
        usage()
        sys.exit()
    if not os.path.isdir(sys.argv[1]):
        print('Error:', sys.argv[1], 'is not a directory.')
        sys.exit()
    # Write as UTF-8, the default encoding for XML without an encoding declaration.
    xmlfile = open(sys.argv[2], 'w', encoding='utf-8')
    newdoc = dir2xml(os.path.normpath(sys.argv[1].strip()))
    # indent='\n' starts every element on a new line; addindent='  ' indents nested elements.
    newdoc.writexml(xmlfile, '\n', '  ')
    xmlfile.close()
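
The comment in walkdir notes that a library walker would make the traversal easier. As a hedged sketch of that idea (assuming Python 3, where the walker is os.walk; the function name dir2xml_walk and the throwaway directory layout are made up for illustration), the same dir/file tree can be built without explicit recursion:

```python
import os
import tempfile
from xml.dom import minidom

def dir2xml_walk(dirname):
    """Build the same <dir>/<file> tree as dir2xml, driven by os.walk."""
    doc = minidom.getDOMImplementation().createDocument(None, 'dir', None)
    root = doc.documentElement
    root.setAttribute('name', os.path.basename(dirname))
    nodes = {dirname: root}  # maps each visited path to its <dir> element
    for path, dirnames, filenames in os.walk(dirname):
        parent = nodes[path]
        for name in sorted(filenames):
            el = doc.createElement('file')
            el.setAttribute('name', name)
            parent.appendChild(el)
        for name in sorted(dirnames):
            el = doc.createElement('dir')
            el.setAttribute('name', name)
            parent.appendChild(el)
            # Remember the element so the subdirectory's entries attach to it.
            nodes[os.path.join(path, name)] = el
    return doc

# Try it on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, 'demo')
    os.makedirs(os.path.join(base, 'sub'))
    open(os.path.join(base, 'a.txt'), 'w').close()
    open(os.path.join(base, 'sub', 'b.txt'), 'w').close()
    result = dir2xml_walk(base).documentElement.toxml()
print(result)
```

Unlike the listing above, this variant sorts the names so the output order does not depend on the file system.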

File xml2dir.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Generate a directory from an XML document.
xml2dir xmlfilename dirname
"""
import os
from xml.dom import minidom as pydom
import sys

def usage():
    print("Usage:", sys.argv[0], "xmlfilename dirname")

def xml2dir(xmlElement, dirname):
    """Build a directory from an XML document.
    xml2dir(xmlElement, dirname)
    xmlElement is an element of the XML document: the tag name file represents a file, dir represents a directory, and the name attribute gives the file or directory name.
    dirname is the directory in which to save the whole xmlElement node; the xmlElement tree is built inside dirname.
    """
    if not os.path.exists(dirname):
        os.mkdir(dirname)
    cwd = os.getcwd()
    os.chdir(dirname)
    for childNode in xmlElement.childNodes:
        # Skip whitespace text nodes, comments and so on.
        if childNode.nodeType not in (childNode.ELEMENT_NODE, childNode.DOCUMENT_NODE):
            continue
        if childNode.tagName == 'file':
            # Create an empty file.
            open(childNode.attributes['name'].value, 'w').close()
        elif childNode.tagName == 'dir':
            if not os.path.exists(childNode.attributes['name'].value):
                os.mkdir(childNode.getAttribute('name'))
            xml2dir(childNode, childNode.getAttribute('name'))
    os.chdir(cwd)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        usage()
        sys.exit()
    try:
        # Binary mode lets the parser honor the document's own encoding.
        xmlfile = open(sys.argv[1], 'rb')
    except IOError:
        sys.stderr.write("XML file not found or cannot access.")
        sys.exit()
    xmldoc = pydom.parse(xmlfile)
    xml2dir(xmldoc, os.path.normpath(sys.argv[2].strip()))
    xmlfile.close()

Reproduced from: https://www.cnblogs.com/licheng/archive/2010/12/06/1897657.html

Origin blog.csdn.net/weixin_34337381/article/details/92627195