HTML as Self-Describing Data

Despite Sun's status as a technology powerhouse and innovator of Java, there are a few software concepts that, culturally, we've failed to catch on to. Here's one of them. Over the past five years I've seen numerous attempts to build a system that represents web content structurally. For example, I once helped convert documents to a custom XML schema for a web-based CMS. During the process I discovered that the XML schema basically duplicated HTML. The intent was apparently to build the one true web document format that avoided HTML's presentational pitfalls.

The problem was this: HTML already is a structural language, but it's been hacked and kludged by web designers for so long that programmers think of it as a presentational language. Rather than simply correcting this faulty notion in their own heads, they made everything more complicated by building a replacement for HTML and then building an XSL system to translate between them.

What they should have realized is that HTML isn't limited to desktop-browser-screen-based views into data; it can be a source of data. In other words, if written succinctly, HTML is a perfectly decent document schema. (Bracing self for a storm of WTFs from DocBook enthusiasts.) Two advantages of this are performance and simplicity: because the stored format is already the delivery format, documents need minimal processing or transformation before being delivered over the web.
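To make "HTML as a source of data" a little more concrete, here is a minimal sketch in Python that reads a well-formed XHTML document with the standard library's XML parser and pulls out its title and heading outline. The sample document and the function names (extract_title, extract_outline) are made up for illustration; nothing here is the CMS described above.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample document, invented for this illustration.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Release Notes</title></head>
  <body>
    <h1>Release Notes</h1>
    <p>Summary of changes.</p>
    <h2>New Features</h2>
    <ul><li>Faster startup</li><li>Smaller footprint</li></ul>
    <h2>Bug Fixes</h2>
    <p>Various fixes.</p>
  </body>
</html>"""

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


def extract_title(xhtml_text):
    """Return the text of <title>, or None if the document has none."""
    root = ET.fromstring(xhtml_text)
    title = root.find(f"{{{XHTML_NS}}}head/{{{XHTML_NS}}}title")
    return title.text if title is not None else None


def extract_outline(xhtml_text):
    """Return (level, text) pairs for every heading, in document order."""
    root = ET.fromstring(xhtml_text)
    outline = []
    for el in root.iter():
        name = el.tag.split("}")[-1]  # drop the "{namespace}" prefix
        if name in HEADINGS:
            outline.append((int(name[1]), "".join(el.itertext()).strip()))
    return outline


if __name__ == "__main__":
    print(extract_title(DOC))
    for level, text in extract_outline(DOC):
        print("  " * (level - 1) + text)
```

The same document could be sent to a browser unchanged, which is the point: the page and the data record are one file.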

Points of Contention

I can see all kinds of "yeah, but" reactions to this, so I'll debunk some of the main ones now.

Point of contention: HTML is SGML, not XML. We need an XML format.

Response: You can use XHTML, which is a reformulation of HTML 4 as XML.

Point of contention: We already use XHTML, but it's corrupt as a document format because our pages don't validate.

Response: Whose fault is that? If you produce XHTML, as with any XML, you need to make sure it's well-formed and valid. It really isn't that hard. Tools like Tidy can even fix many well-formedness and validation errors in HTML documents automatically.
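Checking well-formedness takes only a few lines. The sketch below uses Python's standard XML parser; the function name is_well_formed is my own. Note the assumptions: it checks well-formedness only, not validity against a DTD (that needs a validating parser), and this parser will also reject named entities such as &nbsp; unless they are declared.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return (True, None) if text parses as XML, else (False, error message)."""
    try:
        ET.fromstring(text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)


if __name__ == "__main__":
    samples = [
        "<p>Hello <em>world</em></p>",  # properly nested
        "<p>Hello <em>world</p>",       # <em> is never closed
        "<p>line one<br>line two</p>",  # HTML-style <br>, not <br/>
    ]
    for sample in samples:
        ok, error = is_well_formed(sample)
        print("OK  " if ok else "FAIL", sample, "" if ok else "(" + error + ")")
```

A check like this could run wherever documents are saved or published, so a broken page never makes it into the repository in the first place.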

Point of contention: We have no choice but to use HTML presentationally. For web documents to achieve real-world visual design standards, things like layout tables and spacer images are needed. HTML needs to be twisted and misused, therefore destroying its value as a structural schema.

Response: If you believe this, chances are you're not familiar with the advances in CSS over the last five years. CSS takes away the need to write HTML presentationally. As of December 2005, the vast majority of browsers in use support CSS well enough to achieve real-world visual design standards. Even if CSS were not enough, XHTML documents can be transformed into any format you want via XSL.

Point of contention: HTML isn't rigorous enough and has all sorts of lame elements like <font>.

Response: Most of the lameness was weeded out and/or deprecated in HTML 4. For best results, ban deprecated elements and use one of these doctypes: HTML 4.01 Strict, XHTML 1.0 Strict, or XHTML 1.1.
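A ban on deprecated elements can be enforced mechanically. Here is a rough sketch, assuming the documents are well-formed XHTML; the function name find_deprecated is hypothetical, and the element list is the set marked deprecated in HTML 4.01.

```python
import xml.etree.ElementTree as ET

# Elements marked deprecated in HTML 4.01.
DEPRECATED = frozenset({
    "applet", "basefont", "center", "dir", "font",
    "isindex", "menu", "s", "strike", "u",
})


def find_deprecated(xhtml_text):
    """Return the names of deprecated elements found, in document order."""
    root = ET.fromstring(xhtml_text)
    found = []
    for el in root.iter():
        name = el.tag.split("}")[-1]  # ignore any namespace prefix
        if name in DEPRECATED:
            found.append(name)
    return found


if __name__ == "__main__":
    page = (
        "<body><p><font color='red'>Hi</font> <u>there</u></p>"
        "<center>x</center></body>"
    )
    print(find_deprecated(page))
```

Validating against a Strict doctype catches the same elements; a script like this is just a lighter-weight way to get the same answer without a validating parser.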

Point of contention: HTML still isn't semantic. <p>, <h1>, <ul>, etc. have no real meaning, and elements like <br> and <hr> are especially presentational.

Response: First of all, this confuses semantic/non-semantic with specific/generic. HTML is an ultra-generic language by design, which is a good thing. It's what makes HTML flexible and useful in so many different contexts. This is hard for a traditional XML programmer to wrap his head around, because he's used to specific elements like <recipe> and <ingredient>. Second, many in the W3C contend that <br> and <hr> were mistakes. Others maintain that they're useful in a structural document, and even have semantic meaning. In the design of XHTML 2 (the next version of XHTML, still in draft), <hr> has been replaced by the more semantic <separator> element. <br> is thought by many to have semantic meaning and is included in the draft, but the <l> element has also been introduced to represent a line of poetry, for example, and thus alleviate the need for <br>. Similar debates also swirl around <b>, <i> and <u>. In any case, you're free to ignore elements you don't like.

Point of contention: We can't use HTML because we need proprietary hooks in our document schema, or constructs such as XLink.

Response: That's what namespaces are for. You can import the HTML namespace into a document, or you can import your own proprietary schema into your HTML docs. With XSL, it's easy and efficient to operate on this type of document to produce pure HTML for web delivery. Furthermore, the latest version of XHTML (1.1) is "modularized" so that it's possible to import pertinent modules of HTML into your schema.
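As a rough illustration of the "produce pure HTML for web delivery" step, the sketch below strips everything outside the XHTML namespace from a mixed document. It uses Python's standard library as a stand-in for XSL, not as a recommendation over it. The cms namespace URI, the cms:owner element, and the cms:review attribute are all hypothetical, as are the function names.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
CMS_NS = "http://example.com/ns/cms"  # hypothetical proprietary namespace

# Hypothetical document mixing XHTML with proprietary hooks.
DOC = f"""<html xmlns="{XHTML_NS}" xmlns:cms="{CMS_NS}">
  <head><title>Product Page</title><cms:owner>docs-team</cms:owner></head>
  <body>
    <h1 cms:review="pending">Product Page</h1>
    <p>Public text.</p>
  </body>
</html>"""


def strip_foreign(elem):
    """Remove, in place, child elements and attributes outside XHTML."""
    prefix = "{" + XHTML_NS + "}"
    # Namespaced attributes look like "{uri}name"; plain ones are kept.
    for name in list(elem.attrib):
        if name.startswith("{") and not name.startswith(prefix):
            del elem.attrib[name]
    previous = None
    for child in list(elem):
        if child.tag.startswith(prefix):
            strip_foreign(child)
            previous = child
        else:
            # Keep any text that followed the removed element.
            tail = child.tail or ""
            if previous is not None:
                previous.tail = (previous.tail or "") + tail
            else:
                elem.text = (elem.text or "") + tail
            elem.remove(child)


def to_pure_xhtml(text):
    """Return the document serialized with only XHTML content left."""
    ET.register_namespace("", XHTML_NS)  # serialize XHTML as the default namespace
    root = ET.fromstring(text)
    strip_foreign(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_pure_xhtml(DOC))
```

The canonical document keeps its proprietary hooks, and the delivery step is a small filter rather than a translation between two schemas.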

Point of contention: But it wouldn't make sense for the canonical version of a document to be the same document we send to browsers, would it? It can't be served without being transformed and massaged on the server side, can it?

Response: Why, because software engineers fear simplicity? Okay, maybe your documents need server-side treatment such as adding company headings, lists of global company links, etc., before they're sent to a customer's web browser. The point is that much of the complexity of dealing with and maintaining HTML can be factored out of existence by simply updating your understanding of the language and pushing the HTML schema further back into your data model.


Some very interesting points, Greg! One way I've tried incorporating semantics in XHTML is using the <div class="someblocktag"> structure for block-level elements and <span class="someinlinetag"> for inlines. It's reasonable, you can parse and query for it, and you can style it with CSS, too! Your approach is sounding suspiciously like microformats! :-)

Posted by Scott Hudson on December 09, 2005 at 04:46 AM MST #

Oh, yeah, since I'm a DocBook enthusiast: WTH! (What The Heck?!?) ;-)

Posted by Scott Hudson on December 09, 2005 at 04:48 AM MST #

Yeah, microformats are pretty interesting too. Structural XHTML is really catching on out in the wild, but unfortunately it seems to be somewhat of a blind spot for the Java crowd. Another cool idea that some people have begun to kick around is to use an XML parser (or even XSL) to unit-test the structure of HTML documents.

Posted by Greg Reimer on December 12, 2005 at 03:37 AM MST #


My name is Greg Reimer and I'm a web technologist for the Sun.COM web design team.

