On the Importance of URL Canonicity
By greimer on Dec 14, 2005
When designing a linking scheme between their pages, web developers often fail to take into account the importance of URL canonicity. For example, when implementing a tabbed navigation design, it's tempting to switch the tab based on a querystring such as page.jsp?tab=1. Similarly, a CMS might use a URL like showArticle.jsp?article=123456. Also, most webservers accept two versions of a directory index URL, for example /foo/ and /foo/index.jsp. Finally, it's common for a website to exist at both example.com and www.example.com.
All of these URLs create canonicity issues, that is, ambiguity about where exactly a piece of content lives. You might ask why there should be ambiguity when a given URL always gets you the same piece of content. In other words, if both /foo/ and /foo/index.jsp return the same content, what's the big deal? Well, for starters, there's the extra load on your webserver: a browser that has cached one version of the URL will still request the other. In fact, any caching system is affected by non-canonical URLs.
Another problem with non-canonical URLs is in web analytics. Two versions of a URL might really be the same page, but a web analytics package doesn't know this unless you hard code a special rule about it somewhere. The fact is that, canonical or not, the world treats URLs as canonical, and so broken behavior results when they are not.
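To illustrate the kind of special rule an analytics package would need, here is a minimal Python sketch that folds known aliases into one path before counting hits. The alias table and the count_hits function are hypothetical and not taken from any particular analytics product.

```python
from collections import Counter

# Hypothetical alias table: each non-canonical path maps to the path we
# choose to treat as canonical. Every duplicate needs its own entry,
# which is the hand-maintained rule the paragraph above describes.
ALIASES = {
    "/foo/index.jsp": "/foo/",
}

def count_hits(paths):
    """Count page views, folding known aliases into their canonical path."""
    return Counter(ALIASES.get(path, path) for path in paths)

hits = count_hits(["/foo/", "/foo/index.jsp", "/bar/", "/foo/"])
print(hits["/foo/"])  # 3
```

Without the alias table, the same log would report two hits for /foo/ and one for /foo/index.jsp, as if they were different pages.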
But what about querystring-driven URL schemes? Isn't showArticle.jsp?article=123456 always the same? Strictly speaking, yes, so why should querystrings be bad for canonicity? When there are multiple variables inside a querystring, order generally doesn't matter. foo.jsp?cid=123&uid=456 is the same as foo.jsp?uid=456&cid=123, so in this sense canonicity is broken. But even if you take pains to ensure querystring values are always ordered consistently, the world at large doesn't know this. Querystring-driven URLs are treated as ambiguous. For example, Google isn't as quick to index a querystring-driven URL as it would be to index a URL that looks canonical.
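For the parameter-ordering case, one way to take those pains is to sort the parameters before a URL is ever emitted. The following Python sketch uses the standard urllib.parse module; canonical_query is a hypothetical helper name. It assumes that parameter order carries no meaning for the application, which may not hold if the same parameter name is repeated.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonical_query(url):
    """Return url with its querystring parameters sorted by name, then value."""
    parts = urlsplit(url)
    # keep_blank_values preserves parameters such as "?flag=".
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode(sorted(pairs))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

a = canonical_query("foo.jsp?cid=123&uid=456")
b = canonical_query("foo.jsp?uid=456&cid=123")
print(a == b)  # True
print(a)       # foo.jsp?cid=123&uid=456
```

This only makes the site's own links consistent. It does nothing to tell outside parties, such as search engines, that the two orderings are equivalent.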
Fortunately, options exist to make URLs more canonical. Most webservers have URL rewriting capability that can redirect to a given example.com URL from the corresponding www.example.com URL, or vice versa. And most webservers can be similarly configured with regard to the /foo/ vs. /foo/index.jsp issue.
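The logic behind such rewrite rules is small. The Python sketch below is not a webserver configuration, just an illustration of the decision a rewrite rule makes. The choice of example.com over www.example.com as the preferred host, the list of index file names, and the redirect_target function are all assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "example.com"             # assumption: the bare domain is preferred
INDEX_FILES = ("index.jsp", "index.html")  # assumption: this site's index file names

def redirect_target(url):
    """Return the canonical URL to redirect to, or None if url is already canonical."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == "www." + CANONICAL_HOST:
        host = CANONICAL_HOST
    # Collapse /foo/index.jsp into /foo/.
    head, _, last = parts.path.rpartition("/")
    path = head + "/" if last in INDEX_FILES else parts.path
    canonical = urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))
    return None if canonical == url else canonical

print(redirect_target("http://www.example.com/foo/index.jsp"))  # http://example.com/foo/
print(redirect_target("http://example.com/foo/"))               # None
```

A server applying this rule would typically answer with a permanent (301) redirect to the returned URL, so that browsers, caches, and crawlers converge on a single address.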
For web applications where pages are dynamically assembled, it's possible to use path info instead of querystrings. Consider the following URLs:
Here, "articles" serves the same role as "showArticle.jsp." Correspondingly, "?article=123456" and "/123456.jsp" serve as the pointer to the content piece. The implementation of this is beyond the scope of this post; suffice it to say that the capability is built into most web application environments.
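As a rough idea of what the application side involves, the Python sketch below pulls the article identifier out of the path instead of the querystring. The /articles/<id>.jsp shape is assumed from the fragments quoted above, and article_id_from_path is a hypothetical helper. In a real web application environment, the framework's own path info or routing facility would normally do this work.

```python
import re

# Assumed URL shape: /articles/<numeric id>.jsp
ARTICLE_PATH = re.compile(r"/articles/(\d+)\.jsp")

def article_id_from_path(path):
    """Return the article id encoded in the path, or None if the path doesn't match."""
    match = ARTICLE_PATH.fullmatch(path)
    return int(match.group(1)) if match else None

print(article_id_from_path("/articles/123456.jsp"))  # 123456
print(article_id_from_path("/articles/latest.jsp"))  # None
```

Because the identifier is part of the path, there is exactly one way to write the URL for a given article and no parameter ordering to worry about.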
Finally, here are a couple of related links: