Saturday, May 21, 2011

Nodetree design

I've spent the last two weeks charting out Nodetree, a XML data binding package for Python 3.

I'd rather not reiterate the multitude of reasons for this, but since they always get asked, I'll summarize this with needing a stream parser that supports xpath/xslt and passing chunks of an XML stream in Python, something that ElementTree and lxml do not support. SleekXMPP employs a fairly ingenious hack on lxml to achieve this, but I'd rather spend some time doing it right from the start for Concordance.

Since XPath and XSLT are best supported by libxml2 than any other free software library, and I'd rather not spend the next few months writing an implementation from scratch, the data must exist as a libxml2 DOM tree at the point of processing.

You can get a libxml2 DOM tree either by parsing a file (which gives you little control, it parses the entire file one-shot) or build the tree node by node. libxml2 also supports processing an XML stream through its SAX interface, node by node, which gives us the flexibility needed to pop and/or parse nodes from the stream as its being parsed while still having all the other tools we need by generating a DOM tree of the segments as we process them.

The problem here is that Nodetree isn't a DOM interface, its a Pythonic XML data binding interface, so such things as adding the same element to two documents (or being able to create a new document using a piece of another document while both are in memory) is a challenge. Even if we were not using libxml2/DOM, for XPath to work correctly every node must have a clear hierarchy which falls apart when your context is a node used in the same document three times and in two different documents.

One of the benefits of working on a wide variety of projects is reusing clever solutions used on one project for something completely different. For example, I recently implemented PySoy's atomic API whereas Python objects are created on the fly for underlying data structures and could be attached to multiple points of data (see ColorCanvas.py for example). This allows multiple data points to be updated with one step, and saves us from doing silly things like storing (ie) an object for every pixel in an image.

I'm implementing almost exactly the same mechanism for Nodetree, it'll limit XPath searches run from Python somewhat but should be fully XML compliant and Pythonic, something (IMHO) nobody has managed to do yet.

No comments: