Wednesday Dec 14, 2005

On the Importance of URL Canonicity

When designing a linking scheme between their pages, web developers often fail to take into account the importance of URL canonicity. For example, when implementing a tabbed navigation design, it's tempting to switch the tab based on a querystring such as page.jsp?tab=1. Similarly, a CMS might use a URL like showArticle.jsp?article=123456. Also, most webservers accept two versions of a directory index URL, for example /foo/ and /foo/index.jsp. Finally, it's common for a website to exist at both and

All of these URLs create canonical issues, or ambiguity, regarding where exactly a piece of content lives. You might ask why there should be ambiguity when a given URL always gets you the same piece of content. In other words, if both /foo/ and /foo/index.jsp return the same content, what's the big deal? Well, for starters, there's the extra performance hit on your webserver when a browser has cached one version and then requests the other. In fact, any caching system is impacted by non-canonical URLs, because each variant is stored and fetched as if it were a separate resource.

Another problem with non-canonical URLs is in web analytics. Two versions of a URL might really be the same page, but a web analytics package doesn't know this unless you hard code a special rule about it somewhere. The fact is that, canonical or not, the world treats URLs as canonical, and so broken behavior results when they are not.

But what about querystring driven URL schemes? Isn't showArticle.jsp?article=123456 always the same? Strictly speaking, yes, so why should querystrings be bad for canonicity? When there are multiple variables inside a querystring, order generally doesn't matter. foo.jsp?cid=123&uid=456 is the same as foo.jsp?uid=456&cid=123, so in this sense canonicity is broken. But even if you take pains to ensure querystring values are always ordered consistently, the world at large doesn't know this. Querystring driven URLs are treated as ambiguous. For example, Google isn't as quick to index a querystring driven URL as it would be to index a URL that looks canonical.
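To make the ordering point concrete, here is a minimal sketch in Python of normalizing a querystring so that both orderings collapse to one form. The function name canonicalize_query is hypothetical, and sorting parameters alphabetically is just one assumed convention; any consistent ordering would do.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize_query(url):
    """Return the URL with its querystring parameters in sorted order.

    Hypothetical helper: alphabetical order is an assumed convention,
    not something the rest of the web knows about or agrees on.
    """
    parts = urlsplit(url)
    # parse_qsl keeps repeated keys; keep_blank_values preserves "?a=&b=1".
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), parts.fragment)
    )


first = canonicalize_query("foo.jsp?cid=123&uid=456")
second = canonicalize_query("foo.jsp?uid=456&cid=123")
print(first == second)
```

Note that this only helps inside your own application. Caches, crawlers, and analytics tools that see the raw URLs still have no way of knowing the two forms are equivalent.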

Fortunately, options exist to make URLs more canonical. Most webservers have URL rewriting capability that can redirect to a given URL from the corresponding URL, or vice versa. And most webservers can be similarly configured with regard to the /foo/ vs. /foo/index.html issue.
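As a rough illustration of what such a rewrite rule does for the directory index case, the sketch below is written in Python rather than in any particular webserver's configuration syntax. The names INDEX_NAMES, canonical_path, and redirect_for are hypothetical, and treating the trailing-slash form as the canonical one is an assumption; the opposite choice works just as well so long as it is applied consistently.

```python
# Index file names assumed for this sketch; a real server has its own list.
INDEX_NAMES = ("index.jsp", "index.html")


def canonical_path(path):
    """Map /foo/index.jsp style paths to the bare directory form /foo/.

    Assumes an absolute path starting with "/".
    """
    head, _, tail = path.rpartition("/")
    if tail in INDEX_NAMES:
        return head + "/"
    return path


def redirect_for(path):
    """Return (status, location) if the request should redirect, else None."""
    target = canonical_path(path)
    if target != path:
        # 301 marks the move as permanent, so caches and crawlers
        # can settle on the canonical address.
        return (301, target)
    return None


print(redirect_for("/foo/index.jsp"))
print(redirect_for("/foo/"))
```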

For web applications where pages are dynamically assembled, it's possible to use path info instead of querystrings. Consider the following URLs:

Here, "articles" serves the same role as "showArticle.jsp." Correspondingly, "?article=123456" and "/123456.jsp" serve as the pointer to the content piece. The implementation of this is beyond the scope of this post; suffice it to say that the capability is built into most web application environments.
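Although a full implementation is out of scope, the core idea of reading the content pointer from path info can be sketched in a few lines of Python. The path shape /articles/123456.jsp is pieced together from the fragments quoted above, and the names ARTICLE_PATH and article_id_from_path are hypothetical; a real web application environment would normally hand you the path info through its own API.

```python
import re

# Assumed path shape: /articles/<numeric id>.jsp
ARTICLE_PATH = re.compile(r"^/articles/(\d+)\.jsp$")


def article_id_from_path(path):
    """Extract the article id from a path-info style URL, or None if no match."""
    match = ARTICLE_PATH.match(path)
    if match is None:
        return None
    return int(match.group(1))


print(article_id_from_path("/articles/123456.jsp"))
print(article_id_from_path("/articles/latest.jsp"))
```

Because the id lives in the path, there is exactly one way to write the address of a given article, which is the property the querystring form lacks.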

Finally, here are a couple of related links:

Wednesday Jun 23, 2004


As an XHTML/CSS advocate amongst Java/XML nuts, sometimes the Web Standards sermon falls by the wayside. One of the reasons for this, perhaps, is that the typical Java/XML nut lives in a heady world of dreams.

When I refer to a Java/XML nut (JXnut, from now on), I mean this: There are web developers who find Java and XML useful, and there are JXnuts. JXnuts are web developers too, in a manner of speaking, but JXnuts tend to say things like "The web is dead" and "So long for the web browser". JXnuts loathe the web and want it to go away. They want it to be replaced by something grander.

I say this because of history. During the late nineties, JXnuts saw the convoluted, pulsating mass that the web had become, and they reeled. Being technical purists, they sought something cleaner and more expressive. JXnuts ascended into the well-structured world of Java and XML and, embracing it, never looked back.

If they had looked back, they would have seen that great sprawling mass shudder from end to end and begin to writhe like a vast, salted slug. XHTML and CSS reform had begun to sweep through the rank and file of web development.

I say all of this (in slightly exaggerated terms, perhaps) to highlight a subtle technical rift that exists among web application architects. At one extreme you have those who, in their own minds, deprecated HTML and its cohorts in disgust long ago and turned their focus to server-side solutions or alternative web architectures. At the other extreme you have those who, for various reasons, never got the memo that HTML and browsers were out of style, and went on to embrace XHTML and CSS as a powerful component of modern-day web applications.

Personally, I find this disturbing. I'm one of the latter, of course. The rift creates a lack of synergy between two forces that otherwise would form a powerful alliance. I've seen great development efforts afflicted by old-school client-side coding techniques because HTML and browsers are, annoyingly, still the de facto standard for web architecture and, inexplicably, never vanished from the face of the earth. I've also seen a trend where those who espouse the virtues of XHTML and CSS get relegated to the status of gibbering prat. They still get to "do their thing" if it doesn't sufficiently annoy any lead architects with circa-1997 HTML sensibilities, but their methodology is slow to seep into the strata that form the bedrock of modern-day web architecture, because that bedrock is largely built and controlled by JXnuts (or their counterparts in .NET/PHP/whatever land).

If I could make any suggestions to help rectify the situation, I'd tell the JXnut to swallow his/her pride and try to understand the value of abstracting logical and graphical presentation using nothing but XHTML and CSS, and how this can result in a better, more modular web application. I'd also tell the XHTML/CSS nut (the XCnut, myself being a prime candidate) to try to understand the bigger picture of web architecture, and embrace some of the possibilities that fall outside the comfortable world of HTML and browsers.


My name is Greg Reimer and I'm a web technologist for the Sun.COM web design team.

