Parsing XML... backwards?

http://www.oreillynet.com/xml/blog/2007/03/parsing_xml_backwards.html

Okay, I’ve heard jokes about people parsing XML files backwards, starting at the end of the file and reporting SAX events in reverse document order, but it seems that someone has actually gone and done it. The justification sounds almost plausible: an instant messaging client (Adium on the Mac) that writes out XML message log files and uses backwards parsing as a method for retrieving the last N messages in constant time, regardless of how many messages the file contains in total. However, it’s crazy to think of doing this for XML in general.

First problem: the document encoding. You don’t know what it is unless you sniff the beginning of the file and read the XML declaration, if present. A specific application may always write out XML in the same encoding and thus not bother to check, but this is not good enough for the general case.

扫描二维码关注公众号,回复: 1176476 查看本文章

Second problem: the DOCTYPE declaration. This can define entities and fixed attribute values, and again it’s at the beginning of the file, not the end. If you parse the file backwards and hit an entity reference, you have no idea what to do with it. A specific application may decide it’s just not going to handle entities, but that won’t work for XML in general.

Third problem: comments. Say you’re parsing backwards through an XML file and you see this: -->. Must be the end of a comment, right? Wrong, it’s the end of an element: </hello-->. This is a killer for efficient parsing as it means you need potentially unbounded look-ahead (or look-behind, in this case) to decide what something is. (This problem could be avoided if comments were symmetrical and ended with --!>, but XML just wasn’t designed to be parsed backwards).

Fourth problem: processing instructions. As with the comments, how can you tell what this is:?>. The problem this time is that text can contain unescaped > characters (as long as they don’t follow ]]), so a backwards parser may need to look a very long way ahead to tell if this is the end of a processing instruction or just some text.

Fifth problem: documents that are not well-formed. I shudder to think what this parser would do with them, especially considering that it may stop parsing before it even reaches the beginning of the document.

Sixth problem: how do you append to the document in the first place if the root element is already closed? And if you never close the root element, then it’s not an XML file at all.

In summary, don’t use this technique! Sure, you might think that all of these problems can be avoided by making sure that your XML is in a fixed encoding, doesn’t use entities or processing instructions or whatever, but what’s the point of choosing an open standard text format if you’re going to impose arbitrary limitations on its use?

There are a number of other ways that applications can use to maintain growable log files. You can write multiple well-formed XML documents to a single file, following each one by a binary trailer that gives the size of the last chunk of XML. Then it is trivial for code to jump backwards through the file, grabbing a little document each time and passing it to a real XML parser. Or if your filesystem doesn’t suck, you can place each XML document in a separate file with a suitable filename and scan the directory, just like a maildir. Or if you don’t like writing this kind of code, grab a simple database library like Berkeley DB and let it do the work for you. Just don’t parse XML backwards!

[Note: unless you’re writing a syntax highlighting XML editor, and you want to do efficient update while the user edits the file. Then go for it.]

猜你喜欢

转载自ldbjakyo.iteye.com/blog/1720451