Fast Infoset and bnux performance results: oranges-to-apples?
By sandoz on Oct 20, 2005
At GridWorld/GGF15, Wolfgang Hoschek presented performance results comparing the XOM-based parsing and serializing of XOM documents encoded using Fast Infoset (FI) and encoded using the bnux format. These results do not show FI performing very well!
Before I delve into the performance comparisons, here is a bit of background on bnux.
Bnux is a simple binary encoding of the XML Information Set that performs similar tricks to FI. I would classify bnux as supporting a subset of the functionality that FI supports: bnux is simpler than FI but supports fewer features. For example, the bnux encoding restricts streaming support to deserialization. When serializing, bnux requires that all string-based information of an XML infoset be known in advance (e.g. from a tree representation such as a XOM document), since this string-based information is encoded at the head of the encoding. The advantage of this approach is that it is possible to 'pack' information based on the frequency of such string-based information. With FI it is possible to encode string-based information either up front or in-line, which makes the encoding slightly more complex than bnux.
Bnux is part of nux, a framework for highly optimized XML processing. I am impressed with bnux and nux. A lot of good work has gone into developing this highly optimized toolkit.
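To make the 'strings at the head, packed by frequency' idea concrete, here is a minimal Java sketch. It is not the bnux format: the class and method names (HeadStringTable, buildTable, encode) are made up for illustration, and the only point is that counting every string first lets the most frequent ones get the smallest indexes, which is why the whole document must be known before serializing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a frequency-ordered string table, NOT the real bnux layout.
class HeadStringTable {

    /** Returns the distinct strings, most frequent first (ties keep first-seen order). */
    public static List<String> buildTable(List<String> names) {
        final Map<String, Integer> counts = new LinkedHashMap<>();
        for (String n : names) {
            counts.merge(n, 1, Integer::sum);
        }
        List<String> table = new ArrayList<>(counts.keySet());
        // List.sort is stable, so equally frequent strings stay in first-seen order.
        table.sort((a, b) -> counts.get(b) - counts.get(a));
        return table;
    }

    /** Replaces each string by its index in the table written at the head. */
    public static int[] encode(List<String> names, List<String> table) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < table.size(); i++) {
            index.put(table.get(i), i);
        }
        int[] out = new int[names.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = index.get(names.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Element names in document order, as a tree walk might produce them.
        List<String> names = Arrays.asList("item", "name", "item", "price", "item", "name");
        List<String> table = buildTable(names);
        System.out.println(table);
        System.out.println(Arrays.toString(encode(names, table)));
    }
}
```

An in-line scheme would instead add each string to the table the first time it is seen, which allows streaming serialization but gives up the frequency ordering.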
Now back to the performance comparisons.
Although a lot of effort has been made to compare 'apples-to-apples', my view is that the comparison is more 'oranges-to-apples'. The manner in which FI is being measured incurs many additional costs that, when summed together, result in very poor results. I will try to explain why this might be so for the case of parsing.
Two forms of parsing are measured:
1) Parsing to a XOM document. This measures the cost of parsing and of creating the Java objects associated with the XOM document. (Such forms are called 'bnux?' and 'fi?' models in the presentation for bnux and FI respectively, see slide 6.)
2) Parsing using a 'Null' Node Factory. This measures parsing without the creation of the Java objects associated with the XOM document. (Such forms are called 'bnux?-NNF' and 'fi?-NNF' models in the presentation for bnux and FI respectively, see slide 6.)
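The difference between the two forms can be sketched with plain JDK SAX. This is only an analogy under stated assumptions: the Factory interface and both implementations below are hypothetical stand-ins, not XOM's NodeFactory API, and the standard JDK XML parser stands in for the bnux and FI parsers. In both forms the parser does the same work; only the object creation differs.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ParseForms {

    /** Hypothetical stand-in for a node factory; not the XOM API. */
    interface Factory {
        Object makeElement(String qName);
    }

    /** Form 1: allocates one object per element, as building a tree would. */
    static final class CountingTreeFactory implements Factory {
        int created = 0;
        public Object makeElement(String qName) {
            created++;
            return new StringBuilder(qName);
        }
    }

    /** Form 2: a 'null' factory that sees every event but creates nothing. */
    static final class NullFactory implements Factory {
        public Object makeElement(String qName) {
            return null;
        }
    }

    /** Parses xml, hands every start tag to the factory, returns the number of start tags. */
    public static int parse(String xml, final Factory factory) {
        final int[] events = {0};
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            spf.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    events[0]++;
                    factory.makeElement(qName);
                }
            });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return events[0];
    }

    public static void main(String[] args) {
        String xml = "<a><b/><b/></a>";
        CountingTreeFactory tree = new CountingTreeFactory();
        System.out.println(parse(xml, tree) + " events, " + tree.created + " objects created");
        System.out.println(parse(xml, new NullFactory()) + " events, 0 objects created");
    }
}
```

Timing the second form isolates the cost of the parser and handler; the difference between the two forms is the cost of object creation plus any checks done at creation time.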
For bnux measurements a very efficient parser is used that performs no well-formed checking and is optimally integrated with XOM such that features of the bnux encoding are taken advantage of to boost performance. The bnux parser relies on XOM to perform well-formed checks.
For FI measurements the FI SAX parser is used in conjunction with the well-formed checking XOM SAX handler. The FI SAX parser itself performs well-formed checking (for example, in-scope namespaces, duplicate attributes, incorrect use of the XML namespace and prefix, NCName and PCDATA character checks). So for FI, well-formed checking is performed twice for form 1) and once for form 2), whereas for bnux it is performed only once, and only for form 1). The well-formed checking XOM SAX handler also sets up the FI SAX parser to perform string interning (a very expensive operation for small documents) and to report namespace attributes as part of the array of attributes reported by the start element event (start and end prefix mapping events will still occur but are ignored by the XOM handler). Because the SAX API is used, it is not possible to rely on the FI-based well-formed checking alone, nor to take advantage of FI encoding features in the way the bnux parser takes advantage of bnux encoding features. All this puts FI at a big disadvantage.
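The two parser settings mentioned above correspond to standard SAX2 feature flags. The sketch below shows their effect using the JDK's built-in XML parser rather than the FI SAX parser, and SaxFeatureDemo and attributeNames are hypothetical names of mine; how XOM actually configures its reader is not shown here. It reports which attributes the start element event delivers with the namespace-prefixes feature on and off.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SaxFeatureDemo {

    /** Returns the qualified names of all attributes reported by start element events. */
    public static List<String> attributeNames(String xml, boolean namespacePrefixes) {
        final List<String> names = new ArrayList<>();
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();
            // When true, xmlns declarations are reported as ordinary attributes.
            reader.setFeature("http://xml.org/sax/features/namespace-prefixes", namespacePrefixes);
            try {
                // Ask for interned names; a parser may not recognize or support this flag.
                reader.setFeature("http://xml.org/sax/features/string-interning", true);
            } catch (SAXException e) {
                // Carry on without interning.
            }
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    for (int i = 0; i < atts.getLength(); i++) {
                        names.add(atts.getQName(i));
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<a xmlns:p='urn:x' p:b='1'/>";
        System.out.println(attributeNames(xml, false));
        System.out.println(attributeNames(xml, true));
    }
}
```

With the feature on, the namespace declaration shows up in the attribute list in addition to being signalled through the prefix mapping events, so a handler that ignores those events still sees every declaration.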
The only way to get closer to an 'apples-to-apples' comparison is to compare using the same level of optimizations/tricks and the same API. That means developing an optimal FI XOM parser or an optimal bnux SAX parser that performs all the well-formed checks. I am currently in the process of developing the former (using the SAX DOM parser as a template). For development I am using XOM-1.1b4 (patched with Wolfgang's performance patches) and an early access version of nux 1.4 that Wolfgang has kindly sent me.
Part of this development requires that I can measure things. I have used Japex to develop specific drivers for forms 1) and 2), in addition to drivers for the FI SAX parser and the soon-to-be-fully-implemented optimal FI XOM parser. Preliminary results produced from measuring these drivers (minus the FI XOM parser) on a small set of documents show that:
the FI SAX parser (performing well-formed checks) results are not that much slower than bnux form 2) results (using a null node factory, with no well-formed checking);
the FI SAX parser form 2) results are twice as slow as the FI SAX parser results, which indicates that the use of the well-formed checking XOM SAX handler with a null node factory is costly; and
the difference between the results of FI SAX parser form 1) and FI SAX parser form 2) is much larger than the difference between the results of bnux form 1) and bnux form 2). This indicates that the well-formed checking XOM SAX handler, in addition to Java object creation, is very expensive (some of the well-formed checking is performed at object creation).
Rather than present detailed results of the above and potentially compare 'apples-to-oranges', namely the FI SAX parser with the bnux parser using the null node factory, it will be far more convincing to wait until the optimal FI XOM parser is complete and then present results comparing it to bnux form 1), giving an 'apples-to-apples' comparison. The preliminary results suggest that an optimal FI XOM parser could perform a lot better than the FI SAX parser and could close the gap between FI and bnux. I hope this will be the case. Stay tuned!