### Calculating document similarity

A few folks have been asking me how document similarity is calculated in search engines. The answer's actually pretty simple, but it involves a bit of background.

The first thing that you need to understand is term weighting schemes. Each term that the engine knows about can be assigned a weight for each document that it appears in. The idea is that the weight assigned to a term indicates how important that term is for that particular document. For example, one of the most popular term weighting schemes assigns high weights to terms that are frequent in a given document but infrequent in the rest of the collection. The idea is that such terms are good discriminators for a document.

Tim mentioned something like this approach in passing in his On Search series. There are a lot of term weighting functions out there. Moffat and Zobel have an interesting paper where they describe a framework into which all of the various functions can be placed, if you're interested in the gory details.

Once we've decided on our term weighting function, we need to talk about the Vector Space Model of Information Retrieval. Given a collection of documents, we'll call N the number of unique terms in the collection. Even for small collections, N can get pretty big. For example, a small collection of about 16,000 documents that I use has about 27,000 unique terms! In the Vector Space Model we represent each document as a vector in an N-dimensional space. The components of the vector are the term weights assigned to that document by the term weighting function for each of the N unique terms. Clearly, most of these are going to be 0, since only a few of the N terms actually appear in any given document.

There are a number of similarity measurements that we can use on this vector space representation of the documents (Moffat and Zobel discuss a number of these.) For example, one could simply calculate the Euclidean distance between the points. This might not be the best approach, especially since longer documents will tend to be artifically far away from shorter documents.

One of the more popular similarty measures is to calculate the angle between the vectors for two documents. In most of the formulations of this measure, we first normalize the document vectors to unit length. This allows us to approximate the cosine of the angle between the two vectors by taking the vectors' dot product and it avoids the longer-vs-shorter documents problem.

I came across a different algorithm when I was reading a paper on Low Bandwidth File Systems (LBFS). It was checksum based, which divided a large file(or a document) into a lot of smaller dynamic width sections and calculating checksum for each block and using it to compare the similarity. It was very good and hard to understand (I am no data mining guy, I am a systems guy). You might be interested in the paper.

Posted by Amjidanutpan Ramanujam on May 18, 2005 at 09:43 AM EDT #

If readers are interested in this, I have a (very basic and hence hopefully readable) vector-based similarity tester thing in Classifier4J. The code for it can be found at http://classifier4j.sourceforge.net/subprojects/core/xref/net/sf/classifier4J/vector/VectorClassifier.html#42 and is pretty easy to follow (I hope!).

Posted by Nick Lothian on May 18, 2005 at 12:44 PM EDT #

good information and links for Enterprize IT guys who don't have a lot of background in search. Really got me thinking, although intranet search isn't a high enough priority at my company that I'll be able to do any work on it. Thanks! cori

Posted by cori schlegel on May 19, 2005 at 02:28 AM EDT #

Comments are closed for this entry.