How to use an external vocabulary


In a previous blog post I explained how external vocabularies can be generated from a schema and a set of sample documents. In this post I will explain how to use the Fast Infoset implementation with an external vocabulary.

All the referenced classes are found in either FastInfoset.jar or FastInfosetUtilities.jar. Currently the FastInfosetUtilities sub-project is not packaged in the distribution, so for now you will have to get it from CVS and build it yourself. I will sort out the packaging very soon for those who want access via the distribution.

First, a schema is processed to obtain the relevant information it contains (namely the element and attribute declarations), ordered lexicographically:

  String args[] = ... // e.g. args from main
  SchemaProcessor sp = new SchemaProcessor(
      new File(args[0]).toURL(), true);
  sp.process();
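
To make "element and attribute declarations ordered lexicographically" concrete, here is a minimal sketch that uses only the JDK. It is not how SchemaProcessor is implemented; the class name SchemaNamesSketch and its fields are hypothetical, and it only looks at the name attribute of xs:element and xs:attribute declarations.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.SortedSet;
import java.util.TreeSet;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the FastInfosetUtilities SchemaProcessor.
class SchemaNamesSketch extends DefaultHandler {
    private static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    // TreeSet keeps names in lexicographic (String.compareTo) order.
    final SortedSet<String> elements = new TreeSet<>();
    final SortedSet<String> attributes = new TreeSet<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!XSD_NS.equals(uri)) {
            return; // ignore anything that is not part of the schema language
        }
        String name = atts.getValue("", "name");
        if (name == null) {
            return; // e.g. a declaration that uses ref= instead of name=
        }
        if ("element".equals(localName)) {
            elements.add(name);
        } else if ("attribute".equals(localName)) {
            attributes.add(name);
        }
    }

    // Parses a schema held in a string and returns the collected names.
    static SchemaNamesSketch read(String schema) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        SchemaNamesSketch handler = new SchemaNamesSketch();
        spf.newSAXParser().parse(
            new ByteArrayInputStream(schema.getBytes(StandardCharsets.UTF_8)), handler);
        return handler;
    }
}
```

For a schema that declares the elements order and item and the attribute id, the sketch yields the element names in the order item, order, regardless of the order in which they are declared.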

Next, the information from the schema processor is used to initialize a frequency handler, which processes a list of sample documents and orders the information in the sample documents and schema by frequency of occurrence. The sample documents need not be valid according to the schema, but obviously they should be closely associated with it:

  SAXParserFactory spf = SAXParserFactory.newInstance();
  spf.setNamespaceAware(true);
  SAXParser p = spf.newSAXParser();
  FrequencyHandler fh = new FrequencyHandler(sp);
  for (int i = 1; i < args.length; i++) {
      p.parse(new File(args[i]), fh);
  }
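
The frequency ordering itself can be illustrated with a small SAX handler built on the JDK alone. This is a sketch under assumptions, not the FrequencyHandler implementation: the class ElementFrequencySketch is hypothetical, it counts only element local names (no attributes, prefixes or values), and it does not take a schema as a starting point.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the FastInfosetUtilities FrequencyHandler.
class ElementFrequencySketch extends DefaultHandler {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Count one occurrence of this element's local name.
        counts.merge(localName, 1, Integer::sum);
    }

    // Names ordered by descending frequency; ties are broken lexicographically.
    List<String> namesByFrequency() {
        List<String> names = new ArrayList<>(counts.keySet());
        names.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return names;
    }

    // Feeds every sample document through the same handler, as in the loop above.
    static List<String> analyze(String... documents) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        SAXParser parser = spf.newSAXParser();
        ElementFrequencySketch handler = new ElementFrequencySketch();
        for (String doc : documents) {
            parser.parse(
                new ByteArrayInputStream(doc.getBytes(StandardCharsets.UTF_8)), handler);
        }
        return handler.namesByFrequency();
    }
}
```

The point of ordering by frequency is that the most common names end up with the smallest indexes in the vocabulary tables, and small indexes take the fewest bytes to encode.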

Once all the information has been generated and sorted, the serializer and parser vocabularies can be generated:

  VocabularyGenerator vg = new VocabularyGenerator(
      fh.getLists(),
      VocabularyGenerator.XmlApi.SAX);       
  SerializerVocabulary externalSerializerVocabulary =
      vg.getSerializerVocabulary();
  ParserVocabulary externalParserVocabulary =
      vg.getParserVocabulary();
  String externalVocabularyURI = args[0];

It is important to specify the API for which the VocabularyGenerator will generate the SerializerVocabulary and ParserVocabulary, as there are slightly different representations and optimizations for each API. This is something I want to improve by having a canonical representation of a vocabulary, leaving it up to the parser/serializer implementation to convert the canonical representation to the appropriate, optimal internal representation.

The SerializerVocabulary instance, externalSerializerVocabulary, can then be set on an instance of a SAXDocumentSerializer:

  SerializerVocabulary initialVocabulary =
      new SerializerVocabulary();
  initialVocabulary.setExternalVocabulary(
      externalVocabularyURI,
      externalSerializerVocabulary, false);
  SAXDocumentSerializer saxSerializer =
      new SAXDocumentSerializer();
  saxSerializer.setVocabulary(initialVocabulary);

Note that an initial vocabulary has to be created that contains the external vocabulary and the URI to be used for it. The ParserVocabulary instance, externalParserVocabulary, can then be set on an instance of a SAXDocumentParser:

  SAXDocumentParser saxParser = new SAXDocumentParser();
  Map externalVocabularyMap = new HashMap();
  externalVocabularyMap.put(externalVocabularyURI,
      externalParserVocabulary);
  saxParser.setExternalVocabularies(externalVocabularyMap);

A Map is used to associate an external vocabulary with its URI, since the same parser instance could parse two documents that have different external vocabulary URIs.
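
The reason for keying on the URI can be shown with a stripped-down sketch. Everything here is hypothetical: ExternalVocabularyLookupSketch is not a Fast Infoset class, a plain String[] stands in for a parser vocabulary, and indexes are zero-based purely for simplicity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; String[] stands in for a parser-side vocabulary.
class ExternalVocabularyLookupSketch {
    private final Map<String, String[]> vocabulariesByUri = new HashMap<>();

    // Associates a table of names with the URI that documents will reference.
    void register(String uri, String... names) {
        vocabulariesByUri.put(uri, names);
    }

    // Resolves an index found in a document that declares the given URI.
    String resolve(String uri, int index) {
        String[] names = vocabulariesByUri.get(uri);
        if (names == null) {
            throw new IllegalArgumentException(
                "No external vocabulary registered for " + uri);
        }
        return names[index];
    }
}
```

With two vocabularies registered under different URIs, the same index resolves to a different name depending on which URI the document being parsed refers to, so a single parser instance can handle both kinds of document.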

The saxSerializer and saxParser instances are now correctly initialized so that they can interoperate.

As you can see, it is important that they both agree on the external vocabulary. There is currently no general standard for specifying an external vocabulary generated from a schema and sample documents, so the nature of the external dependence has to be defined on a case-by-case basis, preferably in a known, tightly coupled scenario or by standards that explicitly declare the external vocabulary by other formal means.
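
Why agreement matters can be seen in a toy model where names are replaced by their position in a shared table. This is only a sketch of the general idea, not the Fast Infoset encoding: SharedTableSketch is a hypothetical class, and a real parser may fail in other ways when the vocabularies differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; models a vocabulary as an ordered list of names.
class SharedTableSketch {
    // Serializer side: replace each name with its index in the table.
    static List<Integer> encode(List<String> table, List<String> names) {
        List<Integer> indexes = new ArrayList<>();
        for (String name : names) {
            int index = table.indexOf(name);
            if (index < 0) {
                throw new IllegalArgumentException("Name not in table: " + name);
            }
            indexes.add(index);
        }
        return indexes;
    }

    // Parser side: replace each index with the name stored at that position.
    static List<String> decode(List<String> table, List<Integer> indexes) {
        List<String> names = new ArrayList<>();
        for (int index : indexes) {
            names.add(table.get(index));
        }
        return names;
    }
}
```

If the serializer encodes the names order, item against the table [item, order, price] and the parser decodes against [order, item, price], the result is item, order. Nothing fails, the names are silently swapped, which is why both sides need the same vocabulary behind the same URI.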

It is also possible to create your own serializer and parser vocabularies, but I would recommend using the SchemaProcessor, FrequencyHandler and VocabularyGenerator if possible, as this will minimize dependencies on the vocabulary API, which may change.

Comments:

Distribution and source snapshots with FastInfosetUtilities included are available now from here: https://fi.dev.java.net/servlets/ProjectDocumentList Paul.

Posted by Paul Sandoz on April 18, 2006 at 08:46 AM CEST #
