Thursday Dec 08, 2005

HTML as Self-Describing Data

Despite Sun's status as technology powerhouse and innovater of Java, there are a few software concepts that, culturally, we've failed to catch on to. Here's one of them. Over the past five years I've seen numerous attempts to build a system that represents web content structurally. For example, I once helped in converting documents to a custom XML schema for a web-based CMS. During the process I discovered that their XML schema basically duplicated HTML. The intent was apparently to build the one true web document format that avoided HTML's presentational pitfalls.

The problem was this: HTML already is a structural language, but it's been hacked and kludged by web designers for so long that programmers think of it as a presentational language. Rather than simply correcting this faulty notion in their own heads, they made everything more complicated by building a replacement for HTML and then building an XSL system to translate between them.

What they should have realized is that HTML isn't limited to desktop-browser-screen-based views into data; it can be a source of data. In other words, if written succinctly, HTML is a perfectly decent document schema. (Bracing self for a storm of WTFs from DocBook enthusiasts.) Two advantages of this are performance and simplicity. Documents need minimal processing/transformation before being delivered over the web.

Points of Contention

I can see all kinds of "yeah, but" reactions to this, so I'll debunk some of the main ones now.

Point of contention: HTML is SGML, not XML. We need an XML format.

Response: You can use XHTML, which is an exact reformulation of HTML as XML.

Point of contention: We already use XHTML, but it's corrupt as a document format because our pages don't validate.

Response: Whose fault is that? If you produce XHTML, as with any XML, you need to make sure it's well-formed and valid. It really isn't that hard. Tools like Tidy even exist that automatically fix well-formedness and validation errors in HTML documents.

Point of contention: We have no choice but to use HTML presentationally. For web documents to achieve real-world visual design standards, things like layout tables and spacer images are needed. HTML needs to be twisted and misused, therefore destroying its value as a structural schema.

Response: If you believe this, chances are you're not familiar with the advances in CSS over the last five years. CSS takes away the need to write HTML presentationally. As of 12/2005, the vast majority of browsers in use support CSS well enough to achieve real-world visual design standards. Even if CSS were not enough, XHTML documents can be transformed into any format you want via XSL.

Point of contention: HTML isn't rigorous enough and has all sorts of lame elements like <font>.

Response: Most of the lameness was weeded out and/or deprecated in HTML 4. For best results, ban deprecated elements and use one of these doctypes: HTML 4.01 strict, XHTML 1.0 strict, or XHTML1.1.

Point of contention: HTML still isn't semantic. <p>, <h1>, <ul>, etc. have no real meaning, and elements like <br> and <hr> are especially presentational.

Response: First of all this confuses semantic/non-semantic with specific/generic. HTML is an ultra-generic language by design, which is a good thing. It's what makes HTML flexible and useful in so many different contexts. This is hard for a traditional XML programmer to wrap his head around, because he's used to specific elements like <recipe> and <ingredient>. Second, many in the W3C contend that <br> and <hr> were mistakes. Others maintain that they're useful in a structural document, and even have semantic meaning. In the design of XHTML 2 (the next version of XHTML, still in draft) <hr> has been replaced by the more semantic <separator> element. <br> is thought by many to have semantic meaning and is included in the draft, but the <l> element has also been introduced to represent a line of poetry, for example, and thus alleviate the need for <br>. Similar debates also swirl around <b>, <i> and <u>. In any case you're free to ignore elements you don't like.

Point of contention: We can't use HTML because we need proprietary hooks in our document schema, or constructs such as XLink.

Response: That's what namespaces are for. You can import the HTML namespace into a document, our you can import your own proprietary schema into your HTML docs. With XSL, it's easy and efficient to operate on this type of document to produce pure HTML for web delivery. Furthermore, The latest version of XHTML (1.1) is "modularized" so that it's possible to import pertinent modules of HTML into your schema.

Point of contention: But it wouldn't make sense for the canonical version of a document to be the same document we send to browsers, would it? It can't be served without being transformed and massaged on the server side, can it?

Response: Why, because software engineers fear simplicity? Okay, maybe your documents need server-side treatment such as adding company headings, lists of global company links, etc., before they're sent to a customer's web browser. The point is that much of the complexity of dealing with and maintaining HTML can be factored out of existence by simply updating your understanding of the language and pushing the HTML schema further back into your data model.

Wednesday Jul 14, 2004

Collapsing the Multiverse

In the web application I work on, the data in question exists in one of about four phases at any given time, depending on how you draw the distinctions. Most of the work I do is in trying to herd the data through these phases, expose it to some interface for consumption and/or manipulation, then herd it back across again. Yah!

Phase is actually the wrong word. Universe is better. The data has to shift in and out of different universes, each defining its own view of reality. RDBs and OOP come readily to mind. Both are built around basic concerns necessitated by fundamental paradigms in modern computer architecture; the data needs to, 1) be stored on a disk (RDB) and, 2) pulled into RAM and run through the CPU (OOP).

Since both these universes have their own internal data model empire (object model, schema, whatever), you have adapters like JDBC to convert between them. Dancing around the adapter, I believe, is where much of the pull-your-hair-out complexity comes from in writing software. The adapter's functionality is simple: acquire cnxn, execute stmt, release cnxn. The adapter's strengths and weaknesses are easy to understand: connections are expensive, network calls introduce latency, certain types of statements hang the DB server. Surrounding it all you have the ever tightening noose of increasing drain on the computer's resouces as the application gets used.

My observation is this: coding to this basic set of conditions while navigating the paradigmatic rift between the universes quickly results in Enormous Complexity, similar to how a cellular automata with simple rules and preconditions propagates into a complex set of arrangements (e.g. chess or the Game of Life). Enormous Complexity necessitates clever exception handling, patterns and antipatterns, performance tuning, persistence frameworks or backing as much logic into the DB layer as possible, but while these techniques help, they're essentially artifacts built around complexity and they don't inform the task at hand.

Therefore anything that eliminates the need for these adapters in the first place is a good thing. Case in point: XForms. Besides RDBs and OOP there are the universes of XML and web forms as name/value pairs. XForms replaced the name/value universe with basic XML, which means that, if I used XForms, I could cut out reams of tedious parameter mapping code and let the duty fall to my already-existing XML functions. XForms collapsed the multiverse.

The question on my mind is, will the RDB/OOP multiverse ever collapse? Is it possible to store data in a form that can be exposed directly as a DOM-like construct, without any under-the-sheets translation between disparate RDB and OOP systems? If so, perhaps the database can be thought of as a virtualized instance of a Big Linked List whose links are more akin to URIs than object references, in that they point to resources within an abstract space that doesn't know about RAM or disk memory. A Container would map these links to real data behind the virtual layer, where this Big Linked List (BLL) would be backed by disk data, wile different parts of it are instantiated in physical memory at any given moment according to some intelligent algorithm managed by the Container. (Imagine something remotely akin to swap memory.) Mutations against the BLL's nodes would be backed directly by changes to disk data, or queued in a transaction space. In front of the virtual layer, the BLL would be treated as fully instantiated, so that you could expose pieces of it with XPath- or XQuery-like statements and do work on them, and various nodes throughout the BLL could listen to (or be observed by) other nodes, reacting to conditions and events in useful ways.

As an example I'm thinking of a CMS. The either/or distinction between an XML document tree and a collection of relational data (either of which could be considered "content") is one that gets everybody in my org sufficiently jumbled as to cause me major grief; more casualties of the multiverse problem. In my theoretical system it's not either/or, it's both/and. The BLL is directly analogous to a DOM tree that can be serialized as XML or transformed, but its individual nodes—being resources unto themselves—can also be shared, cross-linked and made relational via an RDF-like meta-framework. CMS content is thus stored in a super-normal state that is inherently both document-centric and relational, therefore collapsing the multiverse and avoiding Enormous Complexity and subsequent artifacts.

Now I must disclaim that I'm a speculative nut with these kind of things. Maybe it's a pipe dream, or maybe it's been done. Maybe it's a dumb idea for reasons I haven't thought of. It just seems that if such a framework could be built, data-driven applications would be an order of magnitude or two easier to write and maintain, and web based applications would almost fall out of it, especially with XForms in the mix. I've heard interesting things about DOM databases that seem to hold some promise, but I must say I don't know a heck of a lot about them. I'll keep researching. Meanwhile I'd welcome any comments, corrections, hints or pointers about this stuff if anybody feels so inclined.

Well, I suppose I'd better get back to navigating rifts in the multiverse and dealing with Enormous Complexity and said artifacts, meanwhile attempting to mitigate the confusion created by my app's plurality of data representation. Thanks for reading my screed.


My name is Greg Reimer and I'm a web technologist for the Sun.COM web design team.


« July 2016