Thursday Apr 03, 2008

A SAX Parser Based on JavaScript's String.replace() Method?

I've often wished browsers would offer native SAX implementations. SAX is lightweight and fast. Not only that, SAX is easy because it lets you ignore what's not interesting, unlike DOM, where you have to traverse the whole mess and keep it hanging around in memory. SAX also uses callback functions, which any JavaScript programmer should feel comfortable with.

[Read More]
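Roughly, the trick can be sketched in a few lines (a bare-bones illustration of the premise, not the actual parser from the post): String.replace() accepts a function as its second argument, and with a global regex that function fires once per token — which is all a SAX-style event stream really needs.

```javascript
// A SAX-flavored scanner built on String.replace(): the replacer
// callback fires one "event" per tag or text run, and no tree is built.
// The regex alternatives match, in order: close tag, self-closing tag,
// open tag, text run. (Naive by design: no CDATA, comments, or entities.)
function saxScan(xml, handler) {
  xml.replace(
    /<\/([^>]+)>|<([^\s\/>]+)([^>]*?)\/>|<([^\s\/>]+)([^>]*)>|([^<]+)/g,
    function (m, close, selfName, selfAttrs, open, attrs, text) {
      if (close) handler.endElement(close);
      else if (selfName) { handler.startElement(selfName); handler.endElement(selfName); }
      else if (open) handler.startElement(open);
      else if (text && /\S/.test(text)) handler.characters(text);
      return m; // replacement result is discarded; we only want the callbacks
    });
}

var events = [];
saxScan('<a><b>hi</b></a>', {
  startElement: function (n) { events.push('start:' + n); },
  endElement: function (n) { events.push('end:' + n); },
  characters: function (t) { events.push('text:' + t); }
});
// events → ['start:a', 'start:b', 'text:hi', 'end:b', 'end:a']
```

Like real SAX, the handler simply ignores any events it doesn't care about.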

Monday Feb 06, 2006

AJAX: What's the best tool for the job?

Choices, choices, choices. Is it better to pull XML from the server and manipulate it via DOM/XSL? Is it better to use JSON to convert data directly to JavaScript variables? Is it better to pull text in as tag soup, spackling it into place via innerHTML? Or is it better to use CSV and string.split()? There are no hard and fast rules, so I'll just convey some of my thoughts and experiences.

XML: An issue here is choosing which XML language to use. The XML language that sprang readily to my mind was, of course, XHTML. There are some pretty good reasons to use XHTML for something like this, and rolling your own XML language might not be such a good idea. But mainly it provides options. With minimal fuss I can use DOM to drop elements from the X(HT)ML right onto the page, or I can be agnostic about things and treat it like XML, reading data out of it and transforming it as necessary. I just make sure my XHTML sticks to reasonable structural conventions. XHTML also has built-in metadata capability, a well-understood syntax, and, more recently, microformats. All of this is gravy for me, the programmer. With this technique, there are some things to remember: first, an XHTML "mini-document" needs to be well-formed or the parser will reject it; second, the mini-document needs to use the XHTML namespace as the default namespace or the client may not treat it as XHTML.

JSON: Render unto Caesar that which is Caesar's. Thus, in my mind, XHTML is the best data format, and JSON is the best way to import dynamic behavior. Traditional JavaScript files can get huge, and given the direction things are headed, this problem will only get worse. Why not have a minimal JavaScript file, and from there load only the functionality you need, as needed? What this entails is, instead of using JSON notation like this: var foo = {"bar":"baz"}, using notation like this: var foo = {"bar":function(){/* do stuff */}}. It also might make sense to implement some design patterns on top of this, for example the command pattern or some kind of inversion of control pattern. The possibilities are endless.
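As a bare-bones sketch of the idea (the payload is inlined here for clarity; in practice it would arrive over the wire via XMLHttpRequest or a dynamically inserted script tag, and the command-registry shape is my own invention for the example):

```javascript
// A crude command-pattern registry: each on-demand payload maps
// command names to functions, so behavior loads only when needed.
var commands = {};

// Pretend this object was just downloaded as a function-bearing payload.
var payload = {
  "greet": function (name) { return "Hello, " + name; },
  "shout": function (s) { return s.toUpperCase() + "!"; }
};

// Merge the newly arrived functionality into the registry.
for (var key in payload) {
  if (payload.hasOwnProperty(key)) commands[key] = payload[key];
}

commands.greet("world"); // → "Hello, world"
```

The minimal bootstrap file only needs the registry and the loader; everything else ships later.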

TAG SOUP: I'm going to resist busting on this method too much because there's always a situation where a given method is the best tool for the job. I'll just say this: importing text as tag soup and then shoveling strings into innerHTML with JavaScript is not my preferred way to do AJAX. I guess I just tend to like either pure XML or no XML at all, (but don't hold me to that!).

CSV: Sometimes the sledgehammer approach is the best approach. CSV has advantages: the syntax is compact, simple, easy to understand, easy to consume, easy to generate. For transmitting mass quantities of simple, tabular data, CSV is hard to beat.
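A minimal consumer looks like this (naive by design: no quoted fields, no embedded commas or newlines):

```javascript
// Sledgehammer CSV parsing with string.split(): each non-empty line
// becomes an array of field strings.
function parseCsv(text) {
  var rows = [];
  var lines = text.split('\n');
  for (var i = 0; i < lines.length; i++) {
    if (lines[i]) rows.push(lines[i].split(','));
  }
  return rows;
}

parseCsv('a,b,c\n1,2,3'); // → [['a','b','c'], ['1','2','3']]
```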

Thursday Dec 08, 2005

HTML as Self-Describing Data

Despite Sun's status as a technology powerhouse and the innovator of Java, there are a few software concepts that, culturally, we've failed to catch on to. Here's one of them. Over the past five years I've seen numerous attempts to build a system that represents web content structurally. For example, I once helped convert documents to a custom XML schema for a web-based CMS. During the process I discovered that their XML schema basically duplicated HTML. The intent was apparently to build the one true web document format that avoided HTML's presentational pitfalls.

The problem was this: HTML already is a structural language, but it's been hacked and kludged by web designers for so long that programmers think of it as a presentational language. Rather than simply correcting this faulty notion in their own heads, they made everything more complicated by building a replacement for HTML and then building an XSL system to translate between them.

What they should have realized is that HTML isn't limited to desktop-browser-screen-based views into data; it can be a source of data. In other words, if written succinctly, HTML is a perfectly decent document schema. (Bracing self for a storm of WTFs from DocBook enthusiasts.) Two advantages of this are performance and simplicity. Documents need minimal processing/transformation before being delivered over the web.

Points of Contention

I can see all kinds of "yeah, but" reactions to this, so I'll debunk some of the main ones now.

Point of contention: HTML is SGML, not XML. We need an XML format.

Response: You can use XHTML, which is an exact reformulation of HTML as XML.

Point of contention: We already use XHTML, but it's corrupt as a document format because our pages don't validate.

Response: Whose fault is that? If you produce XHTML, as with any XML, you need to make sure it's well-formed and valid. It really isn't that hard. Tools like Tidy even exist that automatically fix well-formedness and validation errors in HTML documents.

Point of contention: We have no choice but to use HTML presentationally. For web documents to achieve real-world visual design standards, things like layout tables and spacer images are needed. HTML needs to be twisted and misused, therefore destroying its value as a structural schema.

Response: If you believe this, chances are you're not familiar with the advances in CSS over the last five years. CSS takes away the need to write HTML presentationally. As of 12/2005, the vast majority of browsers in use support CSS well enough to achieve real-world visual design standards. Even if CSS were not enough, XHTML documents can be transformed into any format you want via XSL.

Point of contention: HTML isn't rigorous enough and has all sorts of lame elements like <font>.

Response: Most of the lameness was weeded out and/or deprecated in HTML 4. For best results, ban deprecated elements and use one of these doctypes: HTML 4.01 Strict, XHTML 1.0 Strict, or XHTML 1.1.

Point of contention: HTML still isn't semantic. <p>, <h1>, <ul>, etc. have no real meaning, and elements like <br> and <hr> are especially presentational.

Response: First of all this confuses semantic/non-semantic with specific/generic. HTML is an ultra-generic language by design, which is a good thing. It's what makes HTML flexible and useful in so many different contexts. This is hard for a traditional XML programmer to wrap his head around, because he's used to specific elements like <recipe> and <ingredient>. Second, many in the W3C contend that <br> and <hr> were mistakes. Others maintain that they're useful in a structural document, and even have semantic meaning. In the design of XHTML 2 (the next version of XHTML, still in draft) <hr> has been replaced by the more semantic <separator> element. <br> is thought by many to have semantic meaning and is included in the draft, but the <l> element has also been introduced to represent a line of poetry, for example, and thus alleviate the need for <br>. Similar debates also swirl around <b>, <i> and <u>. In any case you're free to ignore elements you don't like.

Point of contention: We can't use HTML because we need proprietary hooks in our document schema, or constructs such as XLink.

Response: That's what namespaces are for. You can import the HTML namespace into a document, or you can import your own proprietary schema into your HTML docs. With XSL, it's easy and efficient to operate on this type of document to produce pure HTML for web delivery. Furthermore, the latest version of XHTML (1.1) is "modularized" so that it's possible to import pertinent modules of HTML into your schema.

Point of contention: But it wouldn't make sense for the canonical version of a document to be the same document we send to browsers, would it? It can't be served without being transformed and massaged on the server side, can it?

Response: Why, because software engineers fear simplicity? Okay, maybe your documents need server-side treatment such as adding company headings, lists of global company links, etc., before they're sent to a customer's web browser. The point is that much of the complexity of dealing with and maintaining HTML can be factored out of existence by simply updating your understanding of the language and pushing the HTML schema further back into your data model.

Wednesday Aug 11, 2004

XForms built into Mozilla

Not a reality yet, but a link to an announcement is pasted below. Since Sun is standardizing on Mozilla soon, maybe internal web applications here can eventually begin taking advantage of this new technology. I also hope Mozilla's XForms support will be more forthcoming than its SVG support; i.e. you won't have to download obscure builds and/or it won't involve licensing weirdness with libraries.

Mozilla's Announcement
XForms Specification

Wednesday Jul 14, 2004

Collapsing the Multiverse

In the web application I work on, the data in question exists in one of about four phases at any given time, depending on how you draw the distinctions. Most of the work I do is in trying to herd the data through these phases, expose it to some interface for consumption and/or manipulation, then herd it back across again. Yah!

Phase is actually the wrong word. Universe is better. The data has to shift in and out of different universes, each defining its own view of reality. RDBs and OOP come readily to mind. Both are built around basic concerns necessitated by fundamental paradigms in modern computer architecture; the data needs to, 1) be stored on a disk (RDB) and, 2) pulled into RAM and run through the CPU (OOP).

Since both these universes have their own internal data model empire (object model, schema, whatever), you have adapters like JDBC to convert between them. Dancing around the adapter, I believe, is where much of the pull-your-hair-out complexity comes from in writing software. The adapter's functionality is simple: acquire cnxn, execute stmt, release cnxn. The adapter's strengths and weaknesses are easy to understand: connections are expensive, network calls introduce latency, certain types of statements hang the DB server. Surrounding it all you have the ever-tightening noose of increasing drain on the computer's resources as the application gets used.

My observation is this: coding to this basic set of conditions while navigating the paradigmatic rift between the universes quickly results in Enormous Complexity, similar to how a cellular automaton with simple rules and preconditions propagates into a complex set of arrangements (e.g. chess or the Game of Life). Enormous Complexity necessitates clever exception handling, patterns and antipatterns, performance tuning, persistence frameworks, or baking as much logic into the DB layer as possible, but while these techniques help, they're essentially artifacts built around complexity and they don't inform the task at hand.

Therefore anything that eliminates the need for these adapters in the first place is a good thing. Case in point: XForms. Besides RDBs and OOP there are the universes of XML and web forms as name/value pairs. XForms replaced the name/value universe with basic XML, which means that, if I used XForms, I could cut out reams of tedious parameter mapping code and let the duty fall to my already-existing XML functions. XForms collapsed the multiverse.
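To illustrate the kind of mapping code I mean (a simplified sketch; the dotted-name convention for encoding structure into field names is made up for the example), here's the tedium of rebuilding a structure from flat name/value pairs — exactly what goes away when the form submits an XML instance directly:

```javascript
// Rebuild nested structure from flat form parameters like
// "order.item.qty=2" — the hand-rolled mapping XForms makes unnecessary.
function paramsToObject(query) {
  var obj = {};
  var pairs = query.split('&');
  for (var i = 0; i < pairs.length; i++) {
    var kv = pairs[i].split('=');
    var path = kv[0].split('.');
    var node = obj;
    // Walk/create intermediate objects for all but the last path segment.
    for (var j = 0; j < path.length - 1; j++) {
      if (!node[path[j]]) node[path[j]] = {};
      node = node[path[j]];
    }
    node[path[path.length - 1]] = decodeURIComponent(kv[1] || '');
  }
  return obj;
}

paramsToObject('order.item.qty=2&order.id=7');
// → { order: { item: { qty: '2' }, id: '7' } }
```

Every web app I've worked on carries reams of code shaped like this; an XML instance from the client would make it all fall to the existing XML functions.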

The question on my mind is, will the RDB/OOP multiverse ever collapse? Is it possible to store data in a form that can be exposed directly as a DOM-like construct, without any under-the-sheets translation between disparate RDB and OOP systems? If so, perhaps the database can be thought of as a virtualized instance of a Big Linked List whose links are more akin to URIs than object references, in that they point to resources within an abstract space that doesn't know about RAM or disk memory. A Container would map these links to real data behind the virtual layer, where this Big Linked List (BLL) would be backed by disk data, while different parts of it are instantiated in physical memory at any given moment according to some intelligent algorithm managed by the Container. (Imagine something remotely akin to swap memory.) Mutations against the BLL's nodes would be backed directly by changes to disk data, or queued in a transaction space. In front of the virtual layer, the BLL would be treated as fully instantiated, so that you could expose pieces of it with XPath- or XQuery-like statements and do work on them, and various nodes throughout the BLL could listen to (or be observed by) other nodes, reacting to conditions and events in useful ways.

As an example I'm thinking of a CMS. The either/or distinction between an XML document tree and a collection of relational data (either of which could be considered "content") is one that gets everybody in my org sufficiently jumbled as to cause me major grief; more casualties of the multiverse problem. In my theoretical system it's not either/or, it's both/and. The BLL is directly analogous to a DOM tree that can be serialized as XML or transformed, but its individual nodes—being resources unto themselves—can also be shared, cross-linked and made relational via an RDF-like meta-framework. CMS content is thus stored in a super-normal state that is inherently both document-centric and relational, therefore collapsing the multiverse and avoiding Enormous Complexity and subsequent artifacts.

Now I must disclaim that I'm a speculative nut with these kinds of things. Maybe it's a pipe dream, or maybe it's been done. Maybe it's a dumb idea for reasons I haven't thought of. It just seems that if such a framework could be built, data-driven applications would be an order of magnitude or two easier to write and maintain, and web-based applications would almost fall out of it, especially with XForms in the mix. I've heard interesting things about DOM databases that seem to hold some promise, but I must say I don't know a heck of a lot about them. I'll keep researching. Meanwhile I'd welcome any comments, corrections, hints or pointers about this stuff if anybody feels so inclined.

Well, I suppose I'd better get back to navigating rifts in the multiverse and dealing with Enormous Complexity and said artifacts, meanwhile attempting to mitigate the confusion created by my app's plurality of data representation. Thanks for reading my screed.

Wednesday Jun 23, 2004


As an XHTML/CSS advocate amongst Java/XML nuts, sometimes the Web Standards sermon falls by the wayside. One of the reasons for this, perhaps, is that the typical Java/XML nut lives in a heady world of dreams.

When I refer to a Java/XML nut (JXnut, from now on), I mean this: There are web developers who find Java and XML useful, and there are JXnuts. JXnuts are web developers too, in a manner of speaking, but JXnuts tend to say things like "The web is dead" and "So long for the web browser". JXnuts loathe the web and want it to go away. They want it to be replaced by something grander.

I say this because of history. During the late nineties JXnuts saw the convoluted, pulsating mass that the web had become, and they reeled. Being technical purists, they sought something cleaner and more expressive. JXnuts ascended into the well-structured world of Java and XML and, embracing it, never looked back.

If they had looked back, they would have seen that great sprawling mass shudder from end to end and begin to writhe like a vast, salted slug. XHTML and CSS reform had begun to sweep through the rank and file of web development.

I say all of this (in slightly exaggerated terms, perhaps) to highlight a subtle technical rift that exists among web application architects. At one extreme you have those who, in their own minds, deprecated HTML and its cohorts in disgust long ago and turned their focus to server-side solutions or alternative web architectures. At the other extreme you have those who, for various reasons, never got the memo that HTML and browsers were out of style, and went on to embrace XHTML and CSS as a powerful component of modern-day web applications.

Personally, I find this disturbing. I'm one of the latter, of course. The rift creates a lack of synergy between two forces that otherwise would form a powerful alliance. I've seen great development efforts afflicted by old-school client-side coding techniques because HTML and browsers are, annoyingly, still the de facto standard for web architecture and, inexplicably, never vanished from the face of the earth. I've also seen a trend where those who espouse the virtues of XHTML and CSS get relegated to the status of gibbering prat. They still get to "do their thing" if it doesn't sufficiently annoy any lead architects with circa-1997 HTML sensibilities, but their methodology is slow to seep into the strata that form the bedrock of modern-day web architecture, because that bedrock is largely built and controlled by JXnuts (or their counterparts in .NET/PHP/whatever land).

If I could make any suggestions to help rectify the situation, I'd tell the JXnut to swallow his/her pride and try to understand the value of abstracting logical and graphical presentation using nothing but XHTML and CSS, and how this can result in a better, more modular web application. I'd also tell the XHTML/CSS nut (the XCnut, myself being a prime candidate) to try to understand the bigger picture of web architecture, and embrace some of the possibilities that fall outside the comfortable world of HTML and browsers.


My name is Greg Reimer and I'm a web technologist for the Sun.COM web design team.

