http://www.oreillynet.com/xml/blog/2007/03/parsing_xml_backwards.html
Okay, I’ve heard jokes about people parsing XML files backwards, starting at the end of the file and reporting SAX events in reverse document order, but it seems that someone has actually gone and done it. The justification sounds almost plausible: an instant messaging client (Adium on the Mac) that writes out XML message log files and uses backwards parsing as a method for retrieving the last N messages in constant time, regardless of how many messages the file contains in total. However, it’s crazy to think of doing this for XML in general.
First problem: the document encoding. You don’t know what it is unless you sniff the beginning of the file and read the XML declaration, if present. A specific application may always write out XML in the same encoding and thus not bother to check, but this is not good enough for the general case.
Second problem: the DOCTYPE declaration. This can define entities and fixed attribute values, and again it’s at the beginning of the file, not the end. If you parse the file backwards and hit an entity reference, you have no idea what to do with it. A specific application may decide it’s just not going to handle entities, but that won’t work for XML in general.
Third problem: comments. Say you’re parsing backwards through an XML file and you see this: -->. Must be the end of a comment, right? Wrong, it’s the end of an element: </hello-->. This is a killer for efficient parsing as it means you need potentially unbounded look-ahead (or look-behind, in this case) to decide what something is. (This problem could be avoided if comments were symmetrical and ended with --!>, but XML just wasn’t designed to be parsed backwards).
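To make the ambiguity concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The element name hello-- and the wrapper element log are made up for illustration. A forward parser accepts the document without complaint, even though the comment and the element end with the same three characters.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: a comment, and an element whose name ends in "--".
comment = "<!-- note -->"
element = "<hello-->hi</hello-->"
doc = "<log>" + comment + element + "</log>"

# A forward parser has no trouble: hyphens are legal name characters.
root = ET.fromstring(doc)
tags = [child.tag for child in root]  # comments are skipped by default

# Read from the right, both fragments look identical for the first 3 characters.
same_tail = comment.endswith("-->") and element.endswith("-->")
```

A backwards parser that sees --> cannot classify it until it has scanned far enough left to find either <!-- or </.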
Fourth problem: processing instructions. As with the comments, how can you tell what this is: ?>. The problem this time is that text can contain unescaped > characters (as long as they don’t follow ]]), so a backwards parser may need to look a very long way ahead to tell if this is the end of a processing instruction or just some text.
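The same point as a short sketch, again with Python's standard xml.etree.ElementTree and invented element and target names. The document below contains ?> twice: once closing a processing instruction and once as ordinary character data.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a real processing instruction, followed by
# text that happens to end with the same two characters.
doc = "<log><?target some data?><msg>is this a PI?></msg></log>"

root = ET.fromstring(doc)
msg_text = root.find("msg").text  # the trailing "?>" is just text here
```

Going forwards the two cases are trivially distinguished by what opened them; going backwards the parser has to keep scanning left until it finds either <? or some other markup.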
Fifth problem: documents that are not well-formed. I shudder to think what this parser would do with them, especially considering that it may stop parsing before it even reaches the beginning of the document.
Sixth problem: how do you append to the document in the first place if the root element is already closed? And if you never close the root element, then it’s not an XML file at all.
In summary, don’t use this technique! Sure, you might think that all of these problems can be avoided by making sure that your XML is in a fixed encoding, doesn’t use entities or processing instructions or whatever, but what’s the point of choosing an open standard text format if you’re going to impose arbitrary limitations on its use?
There are a number of other ways for applications to maintain growable log files. You can write multiple well-formed XML documents to a single file, following each one with a binary trailer that gives the size of the chunk of XML just written. Then it is trivial for code to jump backwards through the file, grabbing a little document each time and passing it to a real XML parser. Or if your filesystem doesn’t suck, you can place each XML document in a separate file with a suitable filename and scan the directory, just like a maildir. Or if you don’t like writing this kind of code, grab a simple database library like Berkeley DB and let it do the work for you. Just don’t parse XML backwards!
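The trailer idea might look roughly like the following sketch. The layout (a 4-byte big-endian length after each document) and the function names append_record and read_last are assumptions made for illustration, not a format that Adium or any other application is known to use.

```python
import os
import struct
import xml.etree.ElementTree as ET

# Assumed trailer layout: 4-byte big-endian length of the preceding document.
TRAILER = struct.Struct(">I")

def append_record(path, xml_text):
    """Append one well-formed XML document followed by its length trailer."""
    data = xml_text.encode("utf-8")
    with open(path, "ab") as f:
        f.write(data)
        f.write(TRAILER.pack(len(data)))

def read_last(path, n):
    """Return up to n of the most recent records, newest first."""
    records = []
    with open(path, "rb") as f:
        pos = f.seek(0, os.SEEK_END)
        while pos > 0 and len(records) < n:
            # Read the trailer, then jump back over the document it describes.
            f.seek(pos - TRAILER.size)
            (size,) = TRAILER.unpack(f.read(TRAILER.size))
            start = pos - TRAILER.size - size
            f.seek(start)
            # Each chunk goes through a real, forward XML parser.
            records.append(ET.fromstring(f.read(size)))
            pos = start
    return records
```

Reading the last N messages costs N seeks and N small parses regardless of file size, which is the property the backwards parser was after, and every chunk is still handled by a conforming parser.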
[Note: unless you’re writing a syntax highlighting XML editor, and you want to do efficient update while the user edits the file. Then go for it.]