Wednesday, December 08, 2010

XML parsing in Python

It's been a couple of months, so I'm going to give a brief update on what I've been working on.

Concordance is getting close to release; I plan to have the first release (0.1) out on January 1st. More on this toward the end of December.

One of the roadblocks I've hit (again and again) is the lack of a decent XML parsing package for Python. The standard library is a letdown when it comes to XML: there are at least four different modules (expat, sax, dom, etree) to choose from, and none of them supports full XPath (etree offers only a limited subset of path expressions). The most popular option, etree (or ElementTree), cannot even process an XML file while keeping its namespace prefixes intact.
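
To make the namespace complaint concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree. The sample document is made up for illustration, and the prefix that gets generated on output may differ between Python versions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the "stream" prefix is bound to a namespace URI.
SAMPLE = '<stream:features xmlns:stream="http://etherx.jabber.org/streams"/>'

root = ET.fromstring(SAMPLE)

# ElementTree stores the tag as "{uri}localname"; the original prefix is dropped.
tag = root.tag
print(tag)

# On output a prefix is generated unless one was registered beforehand
# with ET.register_namespace(), so "stream:" does not come back by itself.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

The namespace itself survives the round trip; it is the author's choice of prefix that gets lost.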

There's lxml, which offers an etree-compatible API and fixes many of ElementTree's major faults (namespace prefix preservation, XPath/XSLT support), but it still cannot handle stream processing and, due to ElementTree's API, does not expose multiple text nodes broken up by a child element, as in "<div>first string <br/> second string</div>".
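
As a small sketch of the text node issue, this is how the ElementTree API (shown here with the standard library module, whose interface lxml mirrors) represents the example above. The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring("<div>first string <br/> second string</div>")
br = div[0]

# There are no separate text nodes. The text before the first child is
# stored on the parent, and the text after a child is stored on that child.
leading = div.text
trailing = br.tail
print(repr(leading))   # 'first string '
print(repr(trailing))  # ' second string'

# itertext() walks the pieces in document order, which helps when reading,
# but the pieces still are not addressable nodes of their own.
pieces = list(div.itertext())
```

So the second string belongs to the br element as its tail rather than to the div, which is easy to get wrong when editing mixed content.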

To support XMPP streams we need to use expat or sax to handle the stream event by event: the full XML document is only available once the root element closes at the end of the stream, but the direct children of the root element (what we call "stanzas" in XMPP) need to be processed as complete objects as they arrive. While we may be able to hack something together using lxml, it would likely be less work to implement a new XML parsing package. As long as the resulting API doesn't diverge too greatly from etree, the work necessary to switch should be minimal.
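
A rough sketch of that idea, assuming nothing beyond the standard library: expat drives the events, and each direct child of the root is assembled into an ElementTree element and handed to a callback as soon as it closes. StanzaParser and on_stanza are hypothetical names for this example, not part of any existing package, and namespaces are ignored entirely.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class StanzaParser:
    """Hypothetical helper: emits each direct child of the root element."""

    def __init__(self, on_stanza):
        self.on_stanza = on_stanza
        self.depth = 0
        self.stack = []  # open elements of the stanza currently being built
        self.parser = expat.ParserCreate()
        self.parser.buffer_text = True
        self.parser.StartElementHandler = self._start
        self.parser.EndElementHandler = self._end
        self.parser.CharacterDataHandler = self._text

    def feed(self, data):
        self.parser.Parse(data, False)

    def close(self):
        self.parser.Parse("", True)

    def _start(self, name, attrs):
        self.depth += 1
        if self.depth == 1:
            return  # the root element stays open for the life of the stream
        elem = ET.Element(name, attrs)
        if self.stack:
            self.stack[-1].append(elem)
        self.stack.append(elem)

    def _end(self, name):
        self.depth -= 1
        if not self.stack:
            return  # the root element is closing
        elem = self.stack.pop()
        if not self.stack:
            self.on_stanza(elem)  # a complete stanza

    def _text(self, data):
        if not self.stack:
            return  # ignore text directly under the root
        current = self.stack[-1]
        if len(current):  # text following a child goes into that child's tail
            current[-1].tail = (current[-1].tail or "") + data
        else:
            current.text = (current.text or "") + data


stanzas = []
parser = StanzaParser(stanzas.append)
# Chunks can split anywhere, even in the middle of a tag.
parser.feed("<stream><message to='a@example.com'><bo")
parser.feed("dy>hi</body></message>")
parser.feed("<presence/>")
parser.feed("</stream>")
parser.close()
```

The callback normally fires as soon as a stanza's closing tag has been parsed, without waiting for the root to close, although expat may hold back events for a token it has not seen completely yet.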

Besides this, I've been working on a host of different packages around Concordance, from getting a JavaScript BOSH/XMPP library together to getting distutils2 ready for Python 3. I've even managed to ship a pitiful little serial library for Python 3, PyTTY, that we're using to interface with some Arduinos.
