XPath: Expressive XML random access

I spoke a bit about how vendor-specific APIs sometimes provide useful features that you don't necessarily find in the more generalized APIs. One case I ran into recently with XML processing was the need to write some code to pull a couple of key pieces of information out of an XML file in a way that's reasonably resilient to future changes in the file format.

We have a utility called genrsp that takes a JDeveloper project file and spits out a response file to pass into the compiler. The idea is that our project files are the single point of truth in terms of compilation. If we need to add something to the classpath of some project, we just make the change in JDeveloper using the library support, and we can then be sure that everything will compile OK whether it's inside the IDE, on the command line or in the build system.

The version of genrsp we've been using for most of the 10.1.3 development cycle reused JDeveloper's routines for processing and interpreting the content of project files. This posed something of a bootstrap issue: before we could even start compiling the IDE using its checked-in project files, we first had to compile a considerable chunk of the IDE just to create the genrsp utility. This worked up to a point, but could sometimes lead to confusion. In particular, genrsp (and by extension, all the parts of the IDE code it used) could not benefit from the "single point of truth" model where the project file is king. If we introduced a dependency from the IDE to some other library, we had to update both the project file and the bootstrap compiler arguments that were used to build genrsp.

So one of the tasks I took on recently was to rewrite genrsp to resolve this bootstrapping issue. The aim was to make it independent of the IDE codebase, and to have as few other dependencies as possible (eventually, it ended up depending only on the XML parser and Ant for its DirectoryScanner API). Of course, there's a complication with this. If I write a whole bunch of code to process project files and they change in the future, we have to update two implementations: the *real* implementation in the IDE and the genrsp implementation. Ideally, I wanted to make as few assumptions about the project file format as possible.

The classic DOM or SAX approaches to parsing are fairly heavyweight. You have to write a fairly large amount of code to get to the part of the document you're interested in. And if the structure of that document ever changes, you're going to have to rewrite a significant portion of that code. What I'd really like is to treat the XML document somewhat like a path on a filesystem: I can grab chunks of data out of the XML file using a path, and construct paths that are relative to each other. If stuff moves around in the document, I need to change the paths, but the overall processing code stays more or less the same.
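
To make that cost concrete, here is a minimal sketch of the kind of hand-written DOM traversal I have in mind, using the standard JAXP DOM parser. The sample document and its element and attribute names (project, contentset, type, sourcepath) are illustrative assumptions rather than the real JDeveloper project format, and the class and method names are made up for the example. The thing to notice is that the document structure is baked into the nesting of the loops.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DomSourcePaths {

    // Hypothetical sample document, for illustration only.
    static final String SAMPLE =
        "<project>"
      + "<contentset type='java'><sourcepath>src/</sourcepath></contentset>"
      + "<contentset type='other'><sourcepath>ignored/</sourcepath></contentset>"
      + "</project>";

    /** Collects the text of every sourcepath under a contentset of type "java". */
    static List<String> javaSourcePaths(String xml) {
        List<String> result = new ArrayList<>();
        Element root = parse(xml).getDocumentElement();
        if (!"project".equals(root.getTagName())) {
            return result;
        }
        // Level 1: children of the root element.
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE
                    || !"contentset".equals(child.getNodeName())
                    || !"java".equals(((Element) child).getAttribute("type"))) {
                continue;
            }
            // Level 2: children of each matching contentset.
            NodeList grandChildren = child.getChildNodes();
            for (int j = 0; j < grandChildren.getLength(); j++) {
                Node grandChild = grandChildren.item(j);
                if (grandChild.getNodeType() == Node.ELEMENT_NODE
                        && "sourcepath".equals(grandChild.getNodeName())) {
                    result.add(grandChild.getTextContent());
                }
            }
        }
        return result;
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(javaSourcePaths(SAMPLE)); // prints [src/]
    }
}
```

If sourcepath ever moved one level deeper, another loop would have to be added, which is exactly the kind of rewrite I want to avoid.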

XML Path Language (XPath) is exactly that. It's a W3C standard way of expressing the path to some node or set of nodes in an XML document. It's used pretty extensively by another W3C standard, XSL Transformations (XSLT), which provides a powerful way to transform an XML document into another XML document.
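
The filesystem analogy, including relative paths, can be sketched roughly as follows. This uses the standard javax.xml.xpath package that ships with Java 5 and later, which is an assumption made for the sake of a self-contained example and not the API used for genrsp. The sample document and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class RelativePaths {

    // Hypothetical sample document, for illustration only.
    static final String SAMPLE =
        "<project>"
      + "<contentset type='java'>"
      + "<sourcepath>src/</sourcepath><classpath>classes/</classpath>"
      + "</contentset>"
      + "<contentset type='java'><sourcepath>test-src/</sourcepath></contentset>"
      + "</project>";

    /** Returns one "sourcepath -> classpath" string per Java content set. */
    static List<String> describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Absolute path, evaluated from the document root.
            NodeList sets = (NodeList) xpath.evaluate(
                    "/project/contentset[@type='java']", doc, XPathConstants.NODESET);

            List<String> out = new ArrayList<>();
            for (int i = 0; i < sets.getLength(); i++) {
                Node set = sets.item(i);
                // Relative paths, evaluated with the contentset as the context node.
                // A path that matches nothing evaluates to the empty string.
                String source = xpath.evaluate("sourcepath", set);
                String classes = xpath.evaluate("classpath", set);
                out.add(source + " -> " + classes);
            }
            return out;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        for (String line : describe(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

The relative expressions play the same role as relative filenames: they only make sense against a context node, and they keep working if the parent contentset moves elsewhere in the document.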

Oracle's XML parser implementation has a really elegant and powerful API called selectNodes for extracting nodes from an XML document using an XPath expression. Deepak Vohra wrote an OTN article covering this in some more detail. You can write fairly expressive queries like this:


NodeList nodes = document.selectNodes(
"/project/contentset[@type='java']/sourcepath" );

This would return all sourcepath child elements of contentset elements that have the value "java" for their type attribute and sit directly under the project root element. In the following document, it would select the two sourcepath elements (src/ and test-src/), but not the classpath element or anything in the "random" contentset:

<project>
  <contentset type="random">
  </contentset>
  <contentset type="java">
    <sourcepath>src/</sourcepath>
  </contentset>
  <contentset type="java">
    <classpath>classes/</classpath>
    <sourcepath>test-src/</sourcepath>
  </contentset>
</project>

Of course, the downside is that you break all the good stuff I spoke about in the previous post about trying to use the most abstract API possible. XPath evaluation is something that JAXP (at least in Java 1.4.2) doesn't support yet, so this code is tied to Oracle's parser. However, in this case it seems like a worthwhile tradeoff. If the format of the XML file I'm parsing ever changes, I just need to update a few strings to point to the new path.
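
For readers on Java 5 or later, the standard javax.xml.xpath package offers a comparable facility without the vendor API. The sketch below is not the genrsp code. It is a minimal illustration of the "just update a few strings" point, with a hypothetical helper (selectText) and a made-up future layout (MOVED_FORMAT) in which contentset has been moved under a new build element.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class PortableSelect {

    // Same shape as the sample document in this post.
    static final String CURRENT_FORMAT =
        "<project>"
      + "<contentset type='random'/>"
      + "<contentset type='java'><sourcepath>src/</sourcepath></contentset>"
      + "<contentset type='java'>"
      + "<classpath>classes/</classpath><sourcepath>test-src/</sourcepath>"
      + "</contentset>"
      + "</project>";

    // Hypothetical future layout, invented for this example.
    static final String MOVED_FORMAT =
        "<project><build>"
      + "<contentset type='java'><sourcepath>src/</sourcepath></contentset>"
      + "</build></project>";

    /** Returns the text content of every node matched by the XPath expression. */
    static List<String> selectText(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // prints [src/, test-src/]
        System.out.println(selectText(CURRENT_FORMAT,
                "/project/contentset[@type='java']/sourcepath"));
        // Only the path string changes for the moved layout; prints [src/]
        System.out.println(selectText(MOVED_FORMAT,
                "/project/build/contentset[@type='java']/sourcepath"));
    }
}
```

The processing code in selectText is identical for both layouts, so in a real tool the expressions could live in constants or a properties file and be the only thing touched when the format changes.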

About This Entry

This page contains a single entry from the blog posted on October 4, 2005 7:35 PM.
