### Calculating document similarity

#### By searchguy on May 18, 2005

The first thing that you need to understand is term weighting schemes.
Each term that the engine knows about can be assigned a weight for each
document that it appears in. The idea is that the weight assigned to a
term indicates how important that term is for that particular document.
For example, one of the most popular term weighting schemes assigns high
weights to terms that are *frequent* in a given document but *
infrequent* in the rest of the collection. The idea is that such
terms are good discriminators for a document.

Tim mentioned something like this approach in passing in his On Search series. There are a lot of term weighting functions out there. Moffat and Zobel have an interesting paper where they describe a framework into which all of the various functions can be placed, if you're interested in the gory details.

Once we've decided on our term weighting function, we need to talk about
the Vector Space Model of Information Retrieval. Given a collection of
documents, we'll call *N* the number of unique terms in the
collection. Even for small collections, *N* can get pretty big. For
example, a small collection of about 16,000 documents that I use has
about 27,000 unique terms! In the Vector Space Model we represent each
document as a vector in an N-dimensional space. The components of the
vector are the term weights assigned to that document by the term
weighting function for each of the N unique terms. Clearly, most of
these are going to be 0, since only a few of the N terms actually appear
in any given document.

There are a number of similarity measurements that we can use on this vector space representation of the documents (Moffat and Zobel discuss a number of these.) For example, one could simply calculate the Euclidean distance between the points. This might not be the best approach, especially since longer documents will tend to be artifically far away from shorter documents.

One of the more popular similarty measures is to calculate the angle between the vectors for two documents. In most of the formulations of this measure, we first normalize the document vectors to unit length. This allows us to approximate the cosine of the angle between the two vectors by taking the vectors' dot product and it avoids the longer-vs-shorter documents problem.

Posted by

Amjidanutpan Ramanujamon May 18, 2005 at 09:43 AM EDT #Posted by

Nick Lothianon May 18, 2005 at 12:44 PM EDT #Posted by

cori schlegelon May 19, 2005 at 02:28 AM EDT #