Tags, keywords, and inconsistency

James Governor pointed out a use for tags as a kind of rejoinder to Tim Bray's wondering whether tags were useful. I'm not sure that his example actually proves that tags are better, but I've been thinking about this, and I can't really decide who's right or who's wrong.

Here's an interesting fact upon which I'll base the rest of my argument: people are horribly inconsistent when assigning keywords to documents. If you give two people the same document and ask them to assign a set of keywords to describe it, then the sets of keywords that they assign will agree only about 20% of the time. This was one of the problems that led to the development of full text indexing systems. If we couldn't choose a few keywords from a document, we would use every word in the document as a keyword!
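As a rough illustration of what "agreeing about 20% of the time" can look like, here is a minimal sketch that scores the overlap between two indexers' keyword sets. It assumes Jaccard overlap (shared keywords divided by all distinct keywords) as the consistency measure, which is only one of several measures in use, and the two keyword sets are invented for the example rather than taken from any study.

```python
def keyword_overlap(keywords_a, keywords_b):
    """Jaccard overlap: shared keywords / all distinct keywords used."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 1.0  # two empty sets trivially agree
    return len(a & b) / len(a | b)


# Hypothetical keyword sets two people might assign to this very post.
indexer_1 = {"tagging", "folksonomy", "search", "blogs", "metadata", "indexing"}
indexer_2 = {"tags", "search", "technorati", "keywords", "web", "indexing"}

overlap = keyword_overlap(indexer_1, indexer_2)
print(f"overlap: {overlap:.0%}")
```

Both indexers picked six perfectly reasonable keywords, yet only two of the ten distinct keywords are shared.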

This so-called inter-indexer inconsistency is kinda-sorta the Halting Problem for Information Retrieval. If you can convince yourself that a problem you're looking at is really the keyword assignment problem, then you can pretty safely say that people will be inconsistent when doing whatever it is you're studying.

For example, people are inconsistent when assigning hypertext links within and between documents. Peter Willett did a study for the British Library that showed that, given a large document, different people will tend to assign hypertext links between different pairs of paragraphs. During my Ph.D., I showed a similar result for links between documents. There's a good summary in an ACM Computing Surveys article from 1999.

So, what does this mean? Let's say we have a system where people manually assign keywords to documents (as far as I can tell, this is what tagging is, but I'm happy to be corrected) and let's also say that people can run queries against this index of keywords. You can think of such a query as an attempt by the searcher to assign keywords to a document that he or she would like to get in response to the query. The problem is that the person who originally assigned a tag to the document and the searcher who "assigned a tag" to the document are going to be inconsistent, so the searcher won't pick quite the same tag.
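The mismatch can be sketched with a toy tag index. This is only an illustration under simple assumptions: tags are matched exactly (ignoring case), and the document ids and tags are made up.

```python
from collections import defaultdict


def build_tag_index(tagged_docs):
    """Map each tag (lowercased) to the set of documents carrying it."""
    index = defaultdict(set)
    for doc_id, tags in tagged_docs.items():
        for tag in tags:
            index[tag.lower()].add(doc_id)
    return index


def search(index, query_tag):
    """Exact-match lookup: the searcher must guess the tagger's word."""
    return set(index.get(query_tag.lower(), set()))


# Two hypothetical posts on the same subject, tagged by different people.
tagged_docs = {
    "post-1": ["folksonomy", "tagging"],
    "post-2": ["tags", "search"],
}
index = build_tag_index(tagged_docs)

print(search(index, "tagging"))   # finds only post-1
print(search(index, "tags"))      # finds only post-2
print(search(index, "keywords"))  # finds nothing
```

Every query is a sensible description of both posts, but none of them retrieves both, because the searcher and the taggers chose different words.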

So, that's why James Governor is wrong and Tim Bray is right about tagging: it's not really a new way of indexing documents, it's actually an old way that didn't work very well.

The only method that has been shown to improve the consistency of keyword assignment is to assign keywords from a fixed vocabulary (see, for example, MeSH, the Medical Subject Headings). Maintaining such a vocabulary is a non-trivial task. Obviously, the Technorati tagging system is not "controlled" in this sense (or possibly in any sense), but I'm wondering whether its web-scale nature can provide some benefit that one would not expect.

Here's what I mean: if hypertext link assignment is inconsistent, how come Google's PageRank does such a good job of finding relevant pages? The answer, at least in part, is that there are a lot of pages and a lot of links out there so that some agreement can be reached (at least on things that lots of people care about!). If the Technorati tags can be organized and made available in such a way that taggers are facing a recognition problem (as opposed to a recall problem), I'm wondering whether we couldn't get some of the benefits of a controlled vocabulary, at least for the popular tags.
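One way to picture the recognition-versus-recall idea is a tag box that suggests tags many people have already used. The sketch below is an assumption about how such a feature might work, not a description of Technorati's system; the function names, the `min_count` threshold, and the sample data are all hypothetical.

```python
from collections import Counter


def popular_tags(assignments, min_count=2):
    """Count how many taggers used each tag; keep those used often enough."""
    counts = Counter(
        tag.lower() for tags in assignments for tag in set(tags)
    )
    return {tag: n for tag, n in counts.items() if n >= min_count}


def suggest(popular, prefix, limit=5):
    """Offer popular tags starting with what the tagger has typed so far."""
    matches = [(t, n) for t, n in popular.items() if t.startswith(prefix.lower())]
    matches.sort(key=lambda pair: (-pair[1], pair[0]))  # most used first
    return [tag for tag, _ in matches[:limit]]


# Hypothetical tag sets from five different taggers.
assignments = [
    ["blog", "tagging"],
    ["blogs", "tagging"],
    ["blog", "search"],
    ["blogging", "tagging"],
    ["blog"],
]
popular = popular_tags(assignments)

print(suggest(popular, "bl"))  # ['blog']
```

A tagger who starts typing "bl" is shown "blog" and only has to recognize it, rather than recall it from scratch or coin "blogs" or "blogging" as yet another variant.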

So, that's why James Governor is right and Tim Bray is wrong about tagging: it may be more like assigning keywords from a controlled vocabulary.

Ultimately, I think Tim's caution is warranted, not that I think my opinion will keep people from tagging. This whole issue needs to be the subject of some actual retrieval evaluations.


Thanks for the link. you might take a look at my first answer to tim too. http://www.redmonk.com/jgovernor/archives/000605.html One thing to stress - i believe in synthesis of method. so being right and wrong is ok i guess. I really like Ken Norton's take http://tagsonomy.com/index.php/ken-norton-humans-at-both-ends-of-the-rope/ I am not trying to remove humans from semantic threads. we're really good at sloppy. a combination of tagsonomy and formal classification and flat search has value, i think. the example on the post you picked out was exactly that, an example. i couldnt find a picture of tim using the flickr search engine which was down, and google wasn't much help either. on the other hand using timbray as a tag in technorati got me to exactly where i wanted to be in seconds. it was an example, a use case, rather than a formal argument one way or another. Cheers!

Posted by James Governor on May 13, 2005 at 04:16 AM EDT #

I actually started to think about writing this when I read your first comment about Tim's concerns, but then real life (a.k.a. the Lab's Open House) took over.

Ken makes a very good point (probably better than I did :-). Actually, Einat Amitay, a Ph.D. student from my Macquarie University days, did an interesting thesis around extracting descriptions of Web pages from the links pointing to them. If you're interested it's here. Part of what she did was up on labs.google.com for a while, but it appears to be gone now.

I also think that autoclassification could help people do a better job of assigning tags consistently (i.e., using the content of the document as an indicator for tagging) while not stepping on their toes, creativity-wise. But it's possible that's just because I do document autoclassification research!

Posted by Stephen Green on May 13, 2005 at 04:46 AM EDT #

As my link shows, I dashed off a comment about the del.icio.us tagging system the other day which raised a similar question (among others). It doesn't take long to figure out that folks categorize identical items very differently. I still don't know, though, whether that's a strength or a weakness in the system. It could, as you say, depend on how many links are involved. It might depend, too, on exactly what someone's trying to accomplish, as the chaos makes it more likely that any particular searcher will find something of interest while simultaneously making it more difficult to assemble a complete list of articles/pages/whatever on a specific topic. Interesting question, regardless.

Posted by Joel Dinda on May 13, 2005 at 06:00 AM EDT #

Joel, I think your blog posting was very apropos for the whole subject: people can be inconsistent even with themselves!

I had a look at technorati's tags when I was writing the entry and I was surprised how many morphological variations there were (e.g., blog and blogs are treated separately.) If we really want to build folksonomies, it seems like we could at least try to control for this kind of obvious variation...

Posted by Stephen Green on May 13, 2005 at 06:59 AM EDT #


This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

