Realistic generation and use of external vocabularies
By sandoz on Apr 12, 2006
In Fast Infoset terminology, an external vocabulary is something a fast infoset document can reference (via a URI). The referenced external vocabulary contains tables mapping information to indexes. Fast infoset documents that reference the external vocabulary therefore do not need to encode information that is already present in it and can encode the indexes instead. To benefit from an external vocabulary, encoders and decoders must share it out-of-band (so it can be considered external knowledge, a bit like a schema is external knowledge).
Such use can reduce the size of fast infoset documents and consequently increase encoding and decoding performance, because less data is produced and consumed.
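The idea of a shared table mapping information to indexes can be sketched in a few lines of Java. This is only a toy illustration, not the Fast Infoset library API: the class name SharedTableSketch, the choice to number entries from 1, and the element names used in main are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: a table shared out-of-band by an encoder and a decoder.
// This is NOT the Fast Infoset API; every name here is hypothetical.
final class SharedTableSketch {
    private final List<String> byIndex = new ArrayList<>();
    private final Map<String, Integer> byName = new HashMap<>();

    SharedTableSketch(List<String> names) {
        for (String name : names) {
            if (!byName.containsKey(name)) {      // ignore duplicates
                byIndex.add(name);
                byName.put(name, byIndex.size()); // this sketch numbers entries from 1
            }
        }
    }

    // Encoder side: the index to write instead of the name, or -1 if the name
    // is not in the table and would have to be encoded in full.
    int indexOf(String name) {
        return byName.getOrDefault(name, -1);
    }

    // Decoder side: recover the name from an index read from the document.
    String nameOf(int index) {
        return byIndex.get(index - 1);
    }

    public static void main(String[] args) {
        // Hypothetical element names, chosen only for the example.
        SharedTableSketch table = new SharedTableSketch(
                Arrays.asList("trade", "tradeHeader", "partyReference"));
        int index = table.indexOf("tradeHeader");
        System.out.println(index + " -> " + table.nameOf(index)); // prints "2 -> tradeHeader"
    }
}
```

Both sides must construct the table from the same list in the same order, which is the out-of-band sharing described above.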
External vocabularies are suited for the case where the unique markup of an XML document represents a high proportion of the overall XML document size, which can be the case for UBL and FpML infosets.
Size results generated when using an external vocabulary for UBL and FpML infosets showed promise. However, the generation of the external vocabulary was from the document itself, which is entirely unrealistic in practice.
To be realistic, external vocabularies should be generated from a schema and a set of representative sample documents. The schema primes the external vocabulary with information, and the set of samples optimizes the vocabulary so that information is assigned indexes in proportion to frequency of occurrence. I have created just such functionality as part of a newly created sub-project of the Fast Infoset project at java.net, FastInfosetUtilities. The SchemaProcessor class generates information from a schema, the FrequencyHandler class orders that information according to a set of sample documents, and the VocabularyGenerator class generates external vocabularies, from the ordered information, to be used for encoding and decoding.
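The prime-then-order approach can be sketched roughly as follows. This is not the code of SchemaProcessor, FrequencyHandler or VocabularyGenerator; the class FrequencyOrderSketch and its method are hypothetical, and the sketch assumes the schema and the samples have already been reduced to plain lists of names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "prime from a schema, then order by sample frequency".
// It does not reproduce the FastInfosetUtilities classes.
final class FrequencyOrderSketch {

    // schemaNames: names obtained from the schema (assumed already extracted).
    // samples: for each sample document, the names in the order they occur.
    // Returns the names with the most frequently occurring first.
    static List<String> order(List<String> schemaNames, List<List<String>> samples) {
        // Prime the table so every schema name is present, even if no sample uses it.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : schemaNames) {
            counts.put(name, 0);
        }
        // Count occurrences across all samples; names missing from the schema
        // are appended in the order they are first seen.
        for (List<String> sample : samples) {
            for (String name : sample) {
                counts.merge(name, 1, Integer::sum);
            }
        }
        // Sort by descending count. List.sort is stable, so names with equal
        // counts keep their original (schema) order.
        List<String> ordered = new ArrayList<>(counts.keySet());
        ordered.sort((a, b) -> Integer.compare(counts.get(b), counts.get(a)));
        return ordered;
    }
}
```

The position of a name in the returned list would then determine its index, so the names seen most often in the samples receive the smallest indexes.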
So now we are in a position to compare size results using realistic and unrealistic external vocabularies.
The FpML data has quite a few samples, 90 in all, so using these samples with the complete FpML schema should present a realistic use case. I created a few Japex config files, along with some drivers that operate on the Japex config parameters, to compare various configurations for measuring the size of fast infoset documents. I then ran Japex, which produced the following chart of the mean size in bytes, expressed as a percentage of the size of the XML document:
The red "XML" bar represents the size of XML documents. I configured Japex to measure relative to the "XML" bar which is why it is always at 100%.
The blue "FastInfoset" bar represents the size of fast infoset documents when using default settings for the Fast Infoset encoder.
The green "FastInfoset_UseSchema" bar represents the size of fast infoset documents when using an external vocabulary generated from the FpML schema but without using sample documents.
The yellow "FastInfoset_UseSchema_UseSamples" bar represents the size of fast infoset documents when using an external vocabulary generated from the FpML schema and using sample documents.
The orange "FastInfoset_UseTestCaseDocument" bar represents the size of fast infoset documents when using an external vocabulary generated from the (test case) document itself.
When using no external vocabulary, the fast infoset documents are about 50% of the size of the XML documents (when comparing the arithmetic means). When using a schema to generate an external vocabulary, fast infoset documents are about 33% of the size of the XML documents. When using a schema with samples to generate an external vocabulary, fast infoset documents are about 29% of the size of the XML documents, which is slightly smaller than the size of the fast infoset documents using an external vocabulary generated from the document itself.
Initially I was surprised by the last observation, but after a little reflection it makes sense: when using a set of samples the most frequent information is given the smaller indexes, which is not the case when using the document itself (the most frequently used information could occur towards the end of the document).
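A small worked example may make the reasoning concrete. It assumes a deliberately simplified cost model in which indexes up to some limit cost one byte and all others cost two; the limit, the class IndexCostSketch and the element names are made up for illustration and are not the actual Fast Infoset encoding rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical cost model, NOT the real Fast Infoset encoding:
// an index up to oneByteLimit costs 1 byte, any larger index costs 2 bytes.
final class IndexCostSketch {

    static int cost(int index, int oneByteLimit) {
        return index <= oneByteLimit ? 1 : 2;
    }

    // Total bytes spent on index references when the document is encoded
    // against a vocabulary whose indexes follow vocabularyOrder (starting at 1).
    static int totalCost(List<String> vocabularyOrder, List<String> document, int oneByteLimit) {
        int total = 0;
        for (String name : document) {
            int index = vocabularyOrder.indexOf(name) + 1;
            if (index == 0) {
                throw new IllegalArgumentException("not in vocabulary: " + name);
            }
            total += cost(index, oneByteLimit);
        }
        return total;
    }

    // A vocabulary built from the document itself: indexes follow first occurrence.
    static List<String> firstOccurrenceOrder(List<String> document) {
        return new ArrayList<>(new LinkedHashSet<>(document));
    }

    public static void main(String[] args) {
        // The most frequent name ("trade") first appears late in the document.
        List<String> document = Arrays.asList(
                "header", "party", "party", "trade", "trade", "trade", "trade");
        List<String> byFrequency = Arrays.asList("trade", "party", "header");

        System.out.println(totalCost(firstOccurrenceOrder(document), document, 1)); // prints 13
        System.out.println(totalCost(byFrequency, document, 1));                    // prints 10
    }
}
```

With the artificially small limit of 1, the first-occurrence vocabulary spends its only cheap index on a name used once, while the frequency-ordered vocabulary spends it on the name used four times.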
What is nice about this result is that, although previous size results were presented using an unrealistic technique, the results using the realistic technique are actually slightly better.