Semantics and Search

Search is a hard problem. One of the main reasons is that the semantics of human languages are particularly difficult to deal with. Tim has mentioned this aspect of search before, but I wanted to take some time and a few posts to talk about the issues a bit more.

There are a lot of reasons that semantics complicate the search problem, but two of the main ones are synonymy and polysemy.

Synonymy occurs when we have multiple words that have the "same" meaning. For example, if we index a document that contains the word "lunar" and the query uses the word "moon", then in most search engines, that query will never retrieve that document.
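Here is a minimal sketch of why that happens, assuming a toy inverted index that matches terms exactly. The documents and the function names (build_index, search) are made up for illustration and are not taken from any particular engine.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, term):
    """Return the ids of documents that contain exactly this term."""
    return index.get(term.lower(), set())

# Hypothetical two-document collection.
docs = {
    1: "lunar surface samples returned by the mission",
    2: "the moon orbits the earth",
}
index = build_index(docs)

print(search(index, "moon"))   # {2}: the "lunar" document is never retrieved
print(search(index, "lunar"))  # {1}
```

The index only knows about strings, so "moon" and "lunar" are as unrelated to it as any other two terms.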

A typical response to synonymy in a search engine is to introduce a synonym thesaurus. When the system encounters a query term that has synonyms, all of the synonyms are tossed into the query. This seems like a pretty good idea, but it actually doesn't work very well. While tossing in the synonyms does increase the number of relevant documents that are retrieved, those documents tend to be washed out by all the extra irrelevant documents that come back with them!
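The trade-off can be sketched with a toy example. Everything here is an assumption made for illustration: the thesaurus entries, the five documents, and the relevance judgments. Real collections behave differently in detail, but the shape of the problem is the same: recall goes up and precision goes down.

```python
# Hypothetical thesaurus: "satellite" is the kind of loose synonym that causes trouble.
THESAURUS = {"moon": {"lunar", "satellite"}}

def tokenize(text):
    return set(text.lower().split())

def expand(query_terms, thesaurus):
    """Add every listed synonym of every query term to the query."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= thesaurus.get(term, set())
    return expanded

def retrieve(docs, terms):
    """Return ids of documents containing at least one of the terms."""
    return {doc_id for doc_id, text in docs.items() if tokenize(text) & terms}

def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

docs = {
    1: "the moon orbits the earth",
    2: "lunar surface samples from the mission",
    3: "satellite television dish installation",
    4: "weather satellite launch schedule",
    5: "satellite radio subscription plans",
}
relevant = {1, 2}  # assumed judgments for a user who wants to read about the moon

plain = retrieve(docs, {"moon"})
expanded = retrieve(docs, expand({"moon"}, THESAURUS))

print(precision_recall(plain, relevant))     # (1.0, 0.5)
print(precision_recall(expanded, relevant))  # (0.4, 1.0)
```

Expansion finds the "lunar" document, so recall doubles, but three of the five documents returned are now about television, weather, and radio.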

A more subtle problem with a traditional synonym thesaurus is that there are few true synonyms (words that mean exactly the same thing) in English. For the most part there is a relationship of generality between so-called synonyms. For example, one could imagine that the words "dog" and "hound" would appear as synonyms in a thesaurus, but a hound is a particular kind of dog. I'll discuss what you could do about this problem in later posts.

Polysemy, on the other hand, is when a single word has multiple senses. Let's say that you got the one-word query "bank". Did the user mean the financial institution? The side of a river? The way that they slope roads so your car doesn't fly off of them in the turns? There's really no way to know. Most of the time, we're saved by the fact that queries tend to be more than one word long, which gives the ambiguous word some context. Figuring out which sense is meant in a given context is called "word sense disambiguation" in computational linguistics.

J.R. Firth said "You shall know a word by the company it keeps", and this is how search engines usually handle the problem of polysemy. While there are a lot of words with more than one sense, there are very few pairs of words that are co-ambiguous. A query like "savings bank" disambiguates pretty well for the financial sense of "bank". Of course, this presumes that the words are close enough together in the document, which is not necessarily the case (yes, this is another plug for passage retrieval).
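A small sketch of the idea, using made-up documents and a made-up proximity window of five tokens. The helper names (matches_all, within) are hypothetical, and the proximity check is a crude stand-in for passage retrieval, not a description of how any real passage engine works.

```python
def positions(text, term):
    """Token offsets at which `term` occurs in `text`."""
    return [i for i, tok in enumerate(text.lower().split()) if tok == term]

def matches_all(text, terms):
    """True if every query term occurs somewhere in the document."""
    tokens = set(text.lower().split())
    return all(term in tokens for term in terms)

def within(text, a, b, window):
    """True if some occurrence of `a` is at most `window` tokens from one of `b`."""
    return any(abs(i - j) <= window
               for i in positions(text, a)
               for j in positions(text, b))

docs = {
    1: "open a savings account at the bank today",
    2: "we fished from the river bank all afternoon",
    3: "her savings paid for a cabin where the garden runs down to the bank of the stream",
}

one_word = {d for d, text in docs.items() if matches_all(text, ["bank"])}
two_words = {d for d, text in docs.items() if matches_all(text, ["savings", "bank"])}
nearby = {d for d, text in docs.items() if within(text, "savings", "bank", window=5)}

print(one_word)   # {1, 2, 3}: every sense of "bank"
print(two_words)  # {1, 3}: document 3 has both words, but far apart
print(nearby)     # {1}: only the financial sense survives
```

The second query term removes the fishing document, but document 3 still gets through because both words merely appear somewhere in it. Requiring them to be close together is what finally leaves only the financial sense.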

Some systems have put quite a bit of effort into disambiguating words both during indexing and during querying. The problem is that if you don't do a fantastically good job of this (almost as good as a human would do), then an incorrect disambiguation of a document term or a query term means that you will miss documents.
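To make the failure mode concrete, here is a sketch with a deliberately crude, hypothetical sense tagger: it labels every "bank" in a document as the river sense if a cue word such as "river" appears anywhere in the text, and as the financial sense otherwise. The cue words, the tag format, and the documents are all assumptions for illustration.

```python
RIVER_CUES = {"river", "stream", "water"}  # assumed cue words for the river sense

def tag_senses(text):
    """Return the document's terms, with 'bank' replaced by a sense-tagged form."""
    tokens = text.lower().split()
    sense = "river" if RIVER_CUES & set(tokens) else "finance"
    return {f"bank/{sense}" if tok == "bank" else tok for tok in tokens}

def search(index, term):
    """Return ids of documents whose term set contains `term`."""
    return {doc_id for doc_id, terms in index.items() if term in terms}

docs = {
    1: "the bank approved the loan",
    2: "the bank financed the river cleanup project",  # financial sense, river cue
}

plain_index = {d: set(text.lower().split()) for d, text in docs.items()}
tagged_index = {d: tag_senses(text) for d, text in docs.items()}

print(search(plain_index, "bank"))           # {1, 2}
print(search(tagged_index, "bank/finance"))  # {1}: document 2 was mis-tagged
print(search(tagged_index, "bank/river"))    # {2}
```

Document 2 is about a financial institution, but the tagger filed it under the river sense, so a query that was correctly disambiguated to the financial sense can no longer find it. The untagged index had no such problem.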

Comments:

Hi, Steve, I have to say I find the subject of search really fascinating. I'm enamoured of the elegance of the way in which the Universe has evolved with connections between all manner of things that, intuitively, aren't obviously linked; and computational linguistics, describing a mathematical space built on the relationships between words and their grammatical expression, and search applications thereof, are perfect examples of this. I haven't read anything specifically, other than the few bits that Tim Bray has hinted at, and now, your "place". I have found it difficult to decide just where to apply this enchantment, when I eventually complete my degree (there's a story in that in the telling of which I won't indulge). The types of things computational linguists and search engineers do seem really interesting, though, especially given the proximity of and potential extrapolation to artificial intelligence. Maybe that's a stretch, but I'd think not. So, this is my rambling way of telling you that I enjoy reading your thoughts on the subject. Maybe, if I'm subtle yet obsequious enough, before long I'll have ingratiated myself with you such that, upon completion of my undergrad work, you'll think kindly of my application to your or a related department.

Posted by Daniel LeVangie-Stricklen on February 14, 2005 at 05:37 AM EST #

I've been reading about your Passage Search project. It sounds like a good start to solving the problems with boolean and phrase search. I'll be interested to see where your project goes; keep up the updates.

Posted by Bob on February 23, 2005 at 05:37 PM EST #

About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.
