Push me pull me?

Given an unrestricted choice between a push model and a pull model for low-level XML parsing, I will always go for the pull model. In Java the push model corresponds to the SAX API and the pull model corresponds to the StAX API (JSR 173), which was submitted to the JCP by BEA and is implemented by Sun (see the Sun Java Streaming XML Parser (SJSXP) FCS version 1.0) and others (see Woodstox).

As a Java developer I find the StAX API much more intuitive to use, because there is less mental juggling to manage parsing state: the application asks for the next event when it is ready for it, rather than reacting to callbacks. In terms of development this translates into clearer code that is easier to maintain and, as a consequence, is likely to be more efficient. Furthermore, because application-level parsing state and behaviour are often kept "localized", there is more scope for optimization by the JVM, which can translate into further efficiency. Although raw SAX and StAX parsing performance may be comparable, I would assert that once application-level parsing code is taken into account the pull model will likely win. (Note that there are some nifty optimizations available to pull implementations, since the pull model allows information to be skipped, whereas with the push model all information is reported.)
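To make the point about localized parsing state concrete, here is a minimal sketch of a pull loop using the standard `javax.xml.stream` API. The class name `PullTitles`, the `titles` helper, and the `<books>` document shape are made up for illustration; wrapping `XMLStreamException` in an unchecked exception is just a convenience for the sketch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; not part of any StAX implementation.
class PullTitles {

    // Collects the text of every <title> element, assuming each one
    // contains only character data (getElementText requires that).
    static List<String> titles(String xml) {
        List<String> out = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (r.hasNext()) {
                    // The application pulls the next event; nothing is pushed at it.
                    if (r.next() == XMLStreamConstants.START_ELEMENT
                            && "title".equals(r.getLocalName())) {
                        // Reads the text and leaves the reader on the matching end tag.
                        out.add(r.getElementText());
                    }
                }
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<books>"
                + "<book id=\"1\"><title>A</title><price>3</price></book>"
                + "<book id=\"2\"><title>B</title></book>"
                + "</books>";
        System.out.println(titles(xml));
    }
}
```

All of the parsing state here lives in local variables of one method. A SAX handler doing the same job would typically need a flag field and a text buffer shared across its `startElement`, `characters`, and `endElement` callbacks.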

The StAX API is not suited only to textual XML. It is equally suited to parsing and serializing other representations of the XML Information Set. Two examples are the StAX-based Fast Infoset implementation and the StAX-based XML Stream Buffer implementation. The former, Fast Infoset, is a binary encoding of the XML Information Set; the latter, XML Stream Buffer, efficiently buffers an XML infoset in memory for replay.

Overall I think the StAX API represents a great improvement for the processing of XML in Java, and I think it is possible to make it even better! The following is a wish list of improvements I would really like to see:
  • closer correspondence between parsing and serializing, so that it is easier to connect an XMLStreamReader to an XMLStreamWriter.
  • namespace management: the ability to iterate over the in-scope namespaces.
  • the choice to buffer attributes or iterate over attributes. Buffering attributes is an unnecessary performance hit for some applications.
  • accessing text content or attribute values as "primitive" Java types. The binding of, say, a sequence of UTF-8 encoded characters to an integer value can be performed more efficiently if the implementation can combine the layers of parsing and data binding.
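To illustrate the first wish, here is a rough sketch of the glue code one ends up writing by hand with plain StAX 1.0 to feed an XMLStreamReader into an XMLStreamWriter. The class `ReaderToWriter` and its methods are hypothetical names of mine, and the sketch deliberately handles only elements, namespace declarations, attributes, and character data; comments, processing instructions, DTDs, and so on are silently dropped.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper; not part of the StAX API.
class ReaderToWriter {

    private static boolean blank(String s) {
        return s == null || s.isEmpty();
    }

    // Writes the reader's current event to the writer (subset of event types only).
    static void copyCurrentEvent(XMLStreamReader r, XMLStreamWriter w)
            throws XMLStreamException {
        switch (r.getEventType()) {
            case XMLStreamConstants.START_ELEMENT: {
                if (blank(r.getNamespaceURI())) {
                    w.writeStartElement(r.getLocalName());
                } else {
                    String prefix = r.getPrefix();
                    w.writeStartElement(prefix == null ? "" : prefix,
                            r.getLocalName(), r.getNamespaceURI());
                }
                // Namespace declarations made on this element.
                for (int i = 0; i < r.getNamespaceCount(); i++) {
                    String uri = r.getNamespaceURI(i);
                    if (blank(r.getNamespacePrefix(i))) {
                        w.writeDefaultNamespace(uri == null ? "" : uri);
                    } else {
                        w.writeNamespace(r.getNamespacePrefix(i), uri);
                    }
                }
                // Attributes; readers and writers disagree on null vs "" for "no namespace".
                for (int i = 0; i < r.getAttributeCount(); i++) {
                    if (blank(r.getAttributeNamespace(i))) {
                        w.writeAttribute(r.getAttributeLocalName(i), r.getAttributeValue(i));
                    } else {
                        w.writeAttribute(r.getAttributePrefix(i), r.getAttributeNamespace(i),
                                r.getAttributeLocalName(i), r.getAttributeValue(i));
                    }
                }
                break;
            }
            case XMLStreamConstants.CHARACTERS:
            case XMLStreamConstants.SPACE:
                w.writeCharacters(r.getText());
                break;
            case XMLStreamConstants.END_ELEMENT:
                w.writeEndElement();
                break;
            default:
                break; // other event types are ignored in this sketch
        }
    }

    // Parses the given XML and re-serializes it through copyCurrentEvent.
    static String copy(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            StringWriter sink = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(sink);
            while (r.hasNext()) {
                r.next();
                copyCurrentEvent(r, w);
            }
            w.flush();
            r.close();
            w.close();
            return sink.toString();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not copy XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(copy("<a x=\"1\"><b>hi</b></a>"));
    }
}
```

Most of this code exists only to translate between the reader's and the writer's views of the same information, which is the kind of mismatch a closer correspondence between the two interfaces could remove.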
Comments:

I mostly agree. See my full comments at http://nothing-more.blogspot.com/2006/04/stax-of-future.html

Posted by derek on April 11, 2006 at 12:48 PM CEST #

Hi Paul! Good comments and insights, as usual. Some comments regarding wishlist:
  • Connecting stream readers and writers: I actually implemented something along these lines for Stax2 (experimental Woodstox extensions on top of StAX 1.0): there is a method XMLStreamWriter2.copyFrom(XMLStreamReader2) which allows for duplicating the current event efficiently.
  • Buffering attributes vs. streaming: alas, buffering pretty much must be done, due to namespace resolution. This is because namespace declarations need not come before the attributes that refer to them, so one cannot reliably pass attributes in a streaming way. That's a pity. Initially I hoped I would be able to do that (lazily parsing attribute values only if needed, for example).
  • Access to in-scope namespaces: this should be easy to do, and definitely a good suggestion.
  • Closer binding of attribute value parsing and primitives: the challenge is really figuring out a suitable plug-and-play API. It might be worth it... it's just so messy to find a way to connect the pieces. ;-)

Posted by Cowtowncoder on April 13, 2006 at 12:49 AM CEST #
