Semantics and Search
By searchguy on Feb 13, 2005
Search is a hard problem. One of the main reasons that it is so difficult is that the semantics of human languages are particularly difficult to deal with. Tim has mentioned this aspect of search before, but I wanted to take some time and a few posts to talk about the issues a bit more.
There are a lot of reasons that semantics complicate the search problem, but two of the main ones are synonymy and polysemy.
Synonymy occurs when we have multiple words that have the "same" meaning. For example, if we index a document that contains the word lunar and the query uses the word moon, then in most search engines, that query will never retrieve that document.
A typical response to synonymy in a search engine is to introduce a synonym thesaurus. When the system encounters a query term that has synonyms, all of the synonyms are tossed into the query. This seems like a pretty good idea, but it actually doesn't work very well. While tossing in the synonyms does increase the number of relevant documents that are retrieved, those relevant documents tend to be washed out by all the extra irrelevant documents that come along with them!
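To make the trade-off concrete, here is a minimal sketch of thesaurus-based query expansion over a toy inverted index. The corpus, the thesaurus entries, and the function names (build_index, expand, search) are all made up for illustration; a real engine would rank results rather than return a bare set of matches.

```python
from collections import defaultdict

# Hypothetical toy corpus, invented for this example.
DOCS = {
    1: "lunar eclipse visible tonight",
    2: "the moon orbits the earth",
    3: "keith moon played drums for the who",
    4: "satellite television dish installation",
}

# Hypothetical synonym thesaurus: query term -> terms to add to the query.
THESAURUS = {"moon": ["lunar", "satellite"]}


def build_index(docs):
    """Build an inverted index mapping each term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def expand(terms, thesaurus):
    """Return the query terms followed by every thesaurus synonym of each term."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(thesaurus.get(term, []))
    return expanded


def search(index, terms):
    """Boolean OR query: return every doc id that contains at least one term."""
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits


INDEX = build_index(DOCS)
plain = search(INDEX, ["moon"])
expanded = search(INDEX, expand(["moon"], THESAURUS))

print(sorted(plain))     # [2, 3]
print(sorted(expanded))  # [1, 2, 3, 4]
```

In this toy case, expansion picks up the lunar eclipse document that the plain query missed, but it also drags in the satellite television document, which has nothing to do with the moon. (Document 3 shows that the unexpanded query already has a polysemy problem of its own.)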
A more subtle problem with a traditional synonym thesaurus is that there are few true synonyms — words that mean exactly the same thing — in English. For the most part there is a relationship of generality between so-called synonyms. For example, one could imagine that the words dog and hound would appear as synonyms in a thesaurus, but a hound is a particular kind of dog. I'll discuss what you could do about this problem in later posts.
Polysemy, on the other hand, is when a single word has multiple senses. Let's say that you got the one-word query bank. Did the user mean the financial institution? The side of a river? The way that they slope roads so your car doesn't fly off of them in the turns? There's really no way to know. Most of the time, we're saved by the fact that queries tend to be more than one word long, which gives the ambiguous word some context. Figuring out which sense is meant in a given context is called "word sense disambiguation" in computational linguistics.
J.R. Firth said "You shall know a word by the company it keeps", and this is how search engines usually handle the problem of polysemy. While there are a lot of words with more than one sense, there are very few pairs of words that are co-ambiguous. In a query like "savings bank", the word savings picks out the financial sense of bank pretty well. Of course, this presumes that the words are close enough together in the document, which is not necessarily the case (yes, this is another plug for passage retrieval).
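Here is a minimal sketch of why closeness matters, assuming simple whitespace tokenization and a hypothetical window of three tokens. The two documents and the helper names (positions, min_distance, matches_near) are invented for illustration, and this is only a crude stand-in for real passage retrieval.

```python
# Hypothetical documents: both contain "savings" and "bank",
# but only the first uses "bank" in the financial sense.
DOCS = {
    "finance": "the savings bank raised its interest rates",
    "river": "we walked along the river bank and talked about our savings",
}


def positions(text, term):
    """Return the token offsets at which term occurs in text."""
    return [i for i, tok in enumerate(text.lower().split()) if tok == term]


def min_distance(text, a, b):
    """Smallest token distance between any occurrence of a and of b, or None."""
    pos_a, pos_b = positions(text, a), positions(text, b)
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


def matches_near(text, a, b, window=3):
    """True if a and b both occur in text within window tokens of each other."""
    distance = min_distance(text, a, b)
    return distance is not None and distance <= window


for name, text in DOCS.items():
    print(name, min_distance(text, "savings", "bank"),
          matches_near(text, "savings", "bank"))
# finance 1 True
# river 5 False
```

A plain AND of the two terms would return both documents. Requiring the terms to sit near each other keeps the one where savings really is the company that bank keeps.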
Some systems have put quite a bit of effort into disambiguating words both during indexing and during querying. The problem is that unless you do a fantastically good job of this (almost as good as a human would do), you will disambiguate some document terms or query terms incorrectly, and every such mistake means that you will miss documents.