Thursday Dec 08, 2005

HTML as Self-Describing Data

Despite Sun's status as a technology powerhouse and innovator of Java, there are a few software concepts that, culturally, we've failed to catch on to. Here's one of them. Over the past five years I've seen numerous attempts to build a system that represents web content structurally. For example, I once helped in converting documents to a custom XML schema for a web-based CMS. During the process I discovered that their XML schema basically duplicated HTML. The intent was apparently to build the one true web document format that avoided HTML's presentational pitfalls.

The problem was this: HTML already is a structural language, but it's been hacked and kludged by web designers for so long that programmers think of it as a presentational language. Rather than simply correcting this faulty notion in their own heads, they made everything more complicated by building a replacement for HTML and then building an XSL system to translate between them.

What they should have realized is that HTML isn't limited to desktop-browser-screen-based views into data; it can be a source of data. In other words, if written succinctly, HTML is a perfectly decent document schema. (Bracing self for a storm of WTFs from DocBook enthusiasts.) Two advantages of this are performance and simplicity. Documents need minimal processing/transformation before being delivered over the web.
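To make "HTML as a source of data" concrete, here's a minimal sketch in Python, assuming the document is well-formed XHTML. The sample document and the outline() helper are invented for illustration; nothing here comes from the CMS project mentioned above.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# ElementTree spells namespaced tags as "{namespace}localname".
# Map each qualified heading tag to its level: h1 -> 1 ... h6 -> 6.
HEADINGS = {"{%s}h%d" % (XHTML_NS, n): n for n in range(1, 7)}

# Hypothetical document, made up for this example.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Release Notes</title></head>
  <body>
    <h1>Release Notes</h1>
    <p>Intro paragraph.</p>
    <h2>New Features</h2>
    <ul><li>Faster startup</li><li>Smaller footprint</li></ul>
    <h2>Known Issues</h2>
    <p>None yet.</p>
  </body>
</html>"""


def outline(xhtml_text):
    """Return (level, text) pairs for every heading, in document order."""
    root = ET.fromstring(xhtml_text)
    result = []
    for element in root.iter():
        level = HEADINGS.get(element.tag)
        if level is not None:
            # itertext() also picks up text inside inline children like <em>.
            result.append((level, "".join(element.itertext()).strip()))
    return result


if __name__ == "__main__":
    for level, text in outline(DOC):
        print("  " * (level - 1) + text)
```

The same file that goes to the browser doubles as the thing you query, with no intermediate format in between.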

Points of Contention

I can see all kinds of "yeah, but" reactions to this, so I'll debunk some of the main ones now.

Point of contention: HTML is SGML, not XML. We need an XML format.

Response: You can use XHTML 1.0, which is a reformulation of HTML 4 as XML.

Point of contention: We already use XHTML, but it's corrupt as a document format because our pages don't validate.

Response: Whose fault is that? If you produce XHTML, as with any XML, you need to make sure it's well-formed and valid. It really isn't that hard. Tools like Tidy even exist that automatically fix well-formedness and validation errors in HTML documents.
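As a rough sketch of how cheap the first half of that check is, here's a well-formedness test using only Python's standard library. The function name is made up for this example, and it only covers well-formedness; validating against the XHTML DTD needs a validating parser, which the standard library doesn't provide.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(text):
    """Return None if text parses as XML, otherwise the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The message includes the line and column of the first problem.
        return str(err)
    return None


if __name__ == "__main__":
    samples = {
        "closed br": "<p>One<br/>two</p>",
        "unclosed br": "<p>One<br>two</p>",
    }
    for name, markup in samples.items():
        print(name, "->", well_formedness_error(markup) or "well-formed")
```

One caveat: because this parser doesn't read the DTD, a named entity such as &nbsp; gets reported as undefined even in an otherwise correct XHTML page, so a check like this suits documents that stick to numeric character references.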

Point of contention: We have no choice but to use HTML presentationally. For web documents to achieve real-world visual design standards, things like layout tables and spacer images are needed. HTML needs to be twisted and misused, therefore destroying its value as a structural schema.

Response: If you believe this, chances are you're not familiar with the advances in CSS over the last five years. CSS takes away the need to write HTML presentationally. As of 12/2005, the vast majority of browsers in use support CSS well enough to achieve real-world visual design standards. Even if CSS were not enough, XHTML documents can be transformed into any format you want via XSL.

Point of contention: HTML isn't rigorous enough and has all sorts of lame elements like <font>.

Response: Most of the lameness was weeded out and/or deprecated in HTML 4. For best results, ban deprecated elements and use one of these doctypes: HTML 4.01 Strict, XHTML 1.0 Strict, or XHTML 1.1.
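A ban is easier to enforce with a script. Here's a minimal sketch that reports deprecated elements using Python's standard html.parser module. The DEPRECATED set follows the elements HTML 4.01 marks as deprecated; the class and function names are invented for this example, and it looks at elements only, not deprecated attributes like align or bgcolor.

```python
from html.parser import HTMLParser

# Elements that HTML 4.01 marks as deprecated.
DEPRECATED = frozenset({
    "applet", "basefont", "center", "dir", "font",
    "isindex", "menu", "s", "strike", "u",
})


class DeprecatedElementFinder(HTMLParser):
    """Collects (tag, line) for each deprecated start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <FONT> and <font> both match.
        if tag in DEPRECATED:
            line, _column = self.getpos()
            self.found.append((tag, line))


def find_deprecated(html_text):
    """Return a list of (tag, line) pairs for deprecated elements."""
    finder = DeprecatedElementFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found


if __name__ == "__main__":
    sample = '<p>Fine</p>\n<center><font size="2">Old</font></center>\n<p><U>x</U></p>'
    for tag, line in find_deprecated(sample):
        print("line %d: <%s> is deprecated" % (line, tag))
```

Something like this could run in a build or a pre-publish step, which is a lot less work than inventing a new schema to keep <font> out.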

Point of contention: HTML still isn't semantic. <p>, <h1>, <ul>, etc. have no real meaning, and elements like <br> and <hr> are especially presentational.

Response: First of all this confuses semantic/non-semantic with specific/generic. HTML is an ultra-generic language by design, which is a good thing. It's what makes HTML flexible and useful in so many different contexts. This is hard for a traditional XML programmer to wrap his head around, because he's used to specific elements like <recipe> and <ingredient>. Second, many in the W3C contend that <br> and <hr> were mistakes. Others maintain that they're useful in a structural document, and even have semantic meaning. In the design of XHTML 2 (the next version of XHTML, still in draft) <hr> has been replaced by the more semantic <separator> element. <br> is thought by many to have semantic meaning and is included in the draft, but the <l> element has also been introduced to represent a line of poetry, for example, and thus alleviate the need for <br>. Similar debates also swirl around <b>, <i> and <u>. In any case you're free to ignore elements you don't like.

Point of contention: We can't use HTML because we need proprietary hooks in our document schema, or constructs such as XLink.

Response: That's what namespaces are for. You can import the HTML namespace into a document, or you can import your own proprietary schema into your HTML docs. With XSL, it's easy and efficient to operate on this type of document to produce pure HTML for web delivery. Furthermore, the latest version of XHTML (1.1) is "modularized" so that it's possible to import pertinent modules of HTML into your schema.
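Here's a small sketch of the idea: an XHTML document carrying proprietary markup in its own namespace, reduced to pure XHTML for delivery. I'd normally reach for XSL here; this version uses Python's ElementTree instead so that it's self-contained. The "cms" namespace URI, its elements and attributes, and the function names are all hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
CMS_NS = "http://example.com/ns/cms"  # hypothetical proprietary namespace

# Serialize XHTML elements without a prefix (note: this setting is global).
ET.register_namespace("", XHTML_NS)

SOURCE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:cms="http://example.com/ns/cms">
  <head><title>Pricing</title></head>
  <body>
    <cms:workflow state="draft" owner="docs-team"/>
    <h1 cms:id="a17">Pricing</h1>
    <p>Contact sales for a quote.</p>
  </body>
</html>"""


def in_namespace(name, namespace):
    """True if an ElementTree tag or attribute name belongs to namespace."""
    return name.startswith("{" + namespace + "}")


def to_pure_xhtml(source):
    """Drop every element and attribute in CMS_NS and serialize the rest."""
    root = ET.fromstring(source)
    for parent in list(root.iter()):
        for child in list(parent):
            if in_namespace(child.tag, CMS_NS):
                # Simplification: text trailing the removed element is lost.
                parent.remove(child)
        for attribute in list(parent.attrib):
            if in_namespace(attribute, CMS_NS):
                del parent.attrib[attribute]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_pure_xhtml(SOURCE))
```

The canonical document keeps its workflow hooks, and the browser never sees them.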

Point of contention: But it wouldn't make sense for the canonical version of a document to be the same document we send to browsers, would it? It can't be served without being transformed and massaged on the server side, can it?

Response: Why, because software engineers fear simplicity? Okay, maybe your documents need server-side treatment such as adding company headings, lists of global company links, etc., before they're sent to a customer's web browser. The point is that much of the complexity of dealing with and maintaining HTML can be factored out of existence by simply updating your understanding of the language and pushing the HTML schema further back into your data model.
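For what it's worth, that kind of server-side treatment can stay small when the canonical document is already XHTML. Below is a sketch, again in Python with ElementTree, that inserts a global link list at the top of the body. The links, the "global-nav" id, and the function names are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialize XHTML elements without a prefix (note: this setting is global).
ET.register_namespace("", XHTML_NS)

# Placeholder company-wide links.
GLOBAL_LINKS = [("Home", "/"), ("Products", "/products/")]

CANONICAL = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Pricing</title></head>"
    "<body><h1>Pricing</h1><p>Contact sales for a quote.</p></body>"
    "</html>"
)


def q(local_name):
    """Qualify a local name with the XHTML namespace, ElementTree style."""
    return "{%s}%s" % (XHTML_NS, local_name)


def decorate(canonical):
    """Return the document with a global link list added at the top of <body>."""
    root = ET.fromstring(canonical)
    body = root.find(q("body"))
    banner = ET.Element(q("div"), {"id": "global-nav"})
    link_list = ET.SubElement(banner, q("ul"))
    for label, href in GLOBAL_LINKS:
        item = ET.SubElement(link_list, q("li"))
        link = ET.SubElement(item, q("a"), {"href": href})
        link.text = label
    body.insert(0, banner)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(decorate(CANONICAL))
```

The document going in and the document coming out share the same schema, so there's no translation layer to maintain.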

Wednesday Jun 23, 2004

JXnuts

As an XHTML/CSS advocate amongst Java/XML nuts, sometimes the Web Standards sermon falls by the wayside. One of the reasons for this, perhaps, is that the typical Java/XML nut lives in a heady world of dreams.

When I refer to a Java/XML nut (JXnut, from now on), I mean this: There are web developers who find Java and XML useful, and there are JXnuts. JXnuts are web developers too, in a manner of speaking, but JXnuts tend to say things like "The web is dead" and "So long for the web browser". JXnuts loathe the web and want it to go away. They want it to be replaced by something grander.

I say this because of history. During the late nineties JXnuts saw the convoluted, pulsating mass that the web had become, and they reeled. Being technical purists they sought something cleaner and more expressive. JXnuts ascended into the well-structured world of Java and XML and, embracing it, never looked back.

If they had looked back, they would have seen that great sprawling mass shudder from end to end and begin to writhe like a vast, salted slug. XHTML and CSS reform had begun to sweep through the rank and file of web development.

I say all of this (in slightly exaggerated terms, perhaps) to highlight a subtle technical rift that exists among web application architects. At one extreme you have those who, in their own minds, deprecated HTML and its cohorts in disgust long ago and turned their focus to server-side solutions or alternative web architectures. At the other extreme you have those who, for various reasons, never got the memo that HTML and browsers were out of style, and went on to embrace XHTML and CSS as a powerful component of modern day web applications.

Personally, I find this disturbing. I'm one of the latter, of course. The rift creates a lack of synergy between two forces that otherwise would form a powerful alliance. I've seen great development efforts afflicted by old school client-side coding techniques because HTML and browsers are, annoyingly, still the de facto standard for web architecture and, inexplicably, never vanished from the face of the earth. I've also seen a trend where those who espouse the virtues of XHTML and CSS get relegated to the status of gibbering prat. They still get to "do their thing" if it doesn't sufficiently annoy any lead architects with circa-1997 HTML sensibilities, but their methodology is slow to seep into the strata that form the bedrock of modern day web architecture, because that bedrock is largely built and controlled by JXnuts (or their counterparts in .NET/PHP/whatever land).

If I could make any suggestions to help rectify the situation, I'd tell the JXnut to swallow his/her pride and try to understand the value of abstracting logical and graphical presentation using nothing but XHTML and CSS, and how this can result in a better, more modular web application. I'd also tell the XHTML/CSS nut (the XCnut, myself being a prime candidate) to try to understand the bigger picture of web architecture, and embrace some of the possibilities that fall outside the comfortable world of HTML and browsers.

About

My name is Greg Reimer and I'm a web technologist for the Sun.COM web design team.
