Calculating document similarity
By searchguy on May 18, 2005
The first thing that you need to understand is term weighting schemes. Each term that the engine knows about can be assigned a weight for each document that it appears in. The idea is that the weight assigned to a term indicates how important that term is for that particular document. For example, one of the most popular term weighting schemes assigns high weights to terms that are frequent in a given document but infrequent in the rest of the collection. The idea is that such terms are good discriminators for a document.
Tim mentioned something like this approach in passing in his On Search series. There are a lot of term weighting functions out there. Moffat and Zobel have an interesting paper where they describe a framework into which all of the various functions can be placed, if you're interested in the gory details.
Once we've decided on our term weighting function, we need to talk about the Vector Space Model of Information Retrieval. Given a collection of documents, we'll call N the number of unique terms in the collection. Even for small collections, N can get pretty big. For example, a small collection of about 16,000 documents that I use has about 27,000 unique terms! In the Vector Space Model we represent each document as a vector in an N-dimensional space. The components of the vector are the term weights assigned to that document by the term weighting function for each of the N unique terms. Clearly, most of these are going to be 0, since only a few of the N terms actually appear in any given document.
There are a number of similarity measurements that we can use on this vector space representation of the documents (Moffat and Zobel discuss a number of these.) For example, one could simply calculate the Euclidean distance between the points. This might not be the best approach, especially since longer documents will tend to be artifically far away from shorter documents.
One of the more popular similarty measures is to calculate the angle between the vectors for two documents. In most of the formulations of this measure, we first normalize the document vectors to unit length. This allows us to approximate the cosine of the angle between the two vectors by taking the vectors' dot product and it avoids the longer-vs-shorter documents problem.