Fast Infoset and bnux performance results: "red delicious"-to-"golden delicious"
By sandoz on Oct 27, 2005
In a previous blog i discussed some Fast Infoset and bnux performance results and how measurement of Fast Infoset was very different from the measurement of bnux ('oranges-to-apples').
For the first iteration to get close to an 'apples-to-apples' comparison I have developed an optimal FI XOM parser that integrates into XOM in a manner such that it is assumed the hallowed status of a trusted parser that performs correct well-formed checking (see previous blog for the configuration). This is for reasons for expediency as it is easy to modify the existing FI DOM parser (which already performs well-formed checking). Such a solution is not practical given that it avails of the package private methods for XOM node construction. However, the results will give a good indication of Fast Infoset performance and whether it is worth pursuing a more practical implementation along the lines of how bnux is integrated.
The differences between the FI and bnux measurement has been reduced from different layers of integration and 2 sets of well-formed checks to the same layer of integration and one set of well-formed checks. The only difference that remains is where the well-formed checks are being performed. The FI XOM parser performs the well-formed checks and the bnux XOM parser defers such checks to XOM (which is probably one of the, if not the, most rigorous well-formed checkers there is).
This is not an exact 'apples-to-apples' comparison, but I would classify it as different types of apple, say 'red delicious' to 'golden delicious'. Perhaps I should have originally compared two fruits that have an old common ancestor (what ever that might be, I briefly looked at the tree of life but could not find apple or orange tree present in the database!) then chosen a fruit with a closer common ancestor that is not an apple.
A 'red delicious' is great for snacking but not good for baking, kind of like the FI XOM implementation since it is not designed to be baked into XOM for practical use. A 'golden delicious' is great for snacking, cooking and baking, kind of like the bnux XOM implementation. I digress somewhat but I hope the point is clear: the measurement process is a lot closer to 'apples-to-apples' than before. (I may have swung slightly in favour of FI over bnux because the FI well-formed checking is likely to be more efficient than the XOM well-formed checking. I will return to this point later on).
The data chosen to measure was the set of 56 XML documents supplied with the nux distribution. I made sure the FI XOM parser passed tests for round-trip consistency on all the XML documents to ensure that all information was reported correctly and the parser was not cheating and gaining an unfair performance advantage.
To measure performance I created Japex drivers to:
report the size of bnux documents; and.
report the average time to parse documents using the FI XOM and bnux XOM parsers.
(I reused the FI size Japex driver from the Japex XML driver library to report the size of FI documents.)
Then, I created Japex configuration files to describe the measurement process to be performed by the Japex runtime (one for size and one for parsing). Running both configuration files produced some nice pretty charts :-)
Rather than present the individual results in all their the gory detail it is easier in this forum to show the average of all the results and provide a link to the zip file containing detailed results. Ideally I would have liked to show a log-log scale scatter plot of all the results (size/throughput vs parser time) but Japex does not have this functionality (not yet, but i want to add this capability!). So average results it is, with a 'government health warning' that these averages are taken over a wide range of sizes and parse times, so please download the detailed results if you want to further analyze.
The size results are presented in the following chart (the detailed report may be found here).
The average results are presented as a percentage of the XML document size, so the smaller the better. The amount of compression achieved by FI can depend on the degree to which repeating text content and attribute values are replaced with small integer values (commonly referred to as indexing). The FI serializers can control this degree using a simple heuristic based on the length of the character strings for text content and attribute values, if the length is less than 'n' than the character string is is indexed, otherwise it is not. The red bar labeled 'FastInfoset7SizeDriver' presents results for the fast infoset documents with the indexing of character strings less than 7 in length (the default value for the implementation of the Fast Infoset serializers), the green bar labeled 'FastInfoset64SizeDriver' presents results for the indexing of character strings less than 64 in length. It is my understanding that bnux indexes all character strings up front at the head of the document.
When an appropriate degree of indexing is chosen FI is in most cases slightly more efficient than bnux. In three cases bnux is more efficient than FI. My initial guess as to why this is the case is that FI, for the most part, is slightly more efficient at 'packing' information but in some cases the length prefix encoding of character strings for FI is more costly than zero terminating encoding of character strings for bnux.
The parse results are presented in the following chart (the detailed report may be found here).
The average results are presented as a percentage of the bnux parse time, so the smaller the better. These results show that FI is slightly faster than bnux. In three cases bnux is slightly faster. Although the average results do not show a difference between the degrees of indexing (64 and 4096) there is a difference in individual results when correlated to differences in size. These results are encouraging!
As i said earlier FI is likely to have an unfair advantage over bnux because of where the well-formed checking is performed. To ascertain how unfair this might be I hacked together an FI SAX parser that performed no well-formed checking then measured this against the well-formed checking SAX parser and the bnux parser using a XOM null node factory (which performs no creation of nodes and therefore no well-formed checking). The detailed report may be found here.
From these and the previous results it is possible to obtain the approximate cost of a FI XOM parser with XOM well-formed checking by doing the following:
subtract the null node factory parse time from the bnux parse time;
add the value obtained 1 to the hacked FI SAX parser that performs no well-formed checking
compare 2 against the bnux parse time.
The data indicates that such a FI XOM parser would be anywhere between 5 to 20% slower (depending on the document) with an average of 5%. This is also encouraging! Intuitively this is to be expected as bnux is a simpler format than Fast Infoset so i would expect it to be more efficient in this respect. (But note that the differences between the Fast Infoset encoding and the bnux encoding are not that great so, intuitively, i would not expect the differences to be that large.)
Overall the results using the FI XOM parser (with well-formed checking performed by the parser) and the possible results using a theoretical XOM parser (with well-formed checking performed by XOM) indicate that it is worth implementing an FI XOM parser in the same manner as bnux for practical use with XOM and also nux. Development of such a parser, in addition to an efficient serializer, is likely to take longer than the development of the prototype parser. I may need to rethink the design of vocabulary tables and external vocabularies such that implementations can optimize the vocabulary table representation and convert to and from a canonical vocabulary table representation.