Automatically building semantic taxonomies

One of the capabilities of our research search engine (that is to say, one that doesn't ship with Sun's products) is the ability to build a semantic taxonomy from all of the words in the indexed material. The taxonomy is built on a relationship of generality, in which more general terms are said to subsume more specific terms. For example, the word canine is more general than the word dog.

We currently build such a taxonomy using two kinds of information. The first is a set of lexical axioms: facts about the world that simply have to be known. For example, that a dog is a kind of canine is something you have to be told; you can't deduce the relationship by looking at the words themselves.

Our engine has a lexicon of about 250K words, and about 80K of these have this kind of semantic information associated with them. The words in the lexicon are not specific to the computer industry; rather, they try to cover a good subset of general English. For example, the lexicon knows that a disk is a flat circular object, but it doesn't know that it's also an information storage device.

The second kind of information that we use is the morphology of the words. We say that any given term is subsumed by its morphological root. For example, the word allocation is subsumed by its root allocate. The system contains an extensive morphology engine that can handle prefixes and suffixes (in combination or separately) as well as lexical compounds (words like shoelace or bitmap).

The morphology can use information from the lexicon to help with deriving the root of a word. For example, we may only want to remove a suffix from a word if the root that we are left with is a noun. Even without the lexicon, though, the system can do a fairly amazing job of deriving the root forms of words. Here's an example from one of the text bases that we index:
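The lexicon-guided suffix stripping described above can be sketched as follows. The tiny lexicon, the single "-ation" rule, and the function name are all illustrative stand-ins for the engine's real lexicon and morphology; the idea is that a suffix is removed only when the resulting root has the expected part of speech.

```python
# Toy lexicon mapping known roots to their part of speech.
LEXICON = {"allocate": "verb", "nation": "noun"}

def root_of(word: str) -> str:
    """Return the morphological root of `word`, or the word
    itself if no rule applies."""
    if word.endswith("ation"):
        # e.g. allocation -> allocate, but only if the candidate
        # root is a verb we actually know about.
        candidate = word[: -len("ation")] + "ate"
        if LEXICON.get(candidate) == "verb":
            return candidate
    return word

print(root_of("allocation"))  # allocate
print(root_of("nation"))      # nation: "nate" is not a known verb
```

The second call shows why the lexicon check matters: blindly stripping "-ation" would mangle words like nation that merely end in the suffix.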

This graphic shows a part of the taxonomy around the term linux. The numbers in parentheses after the terms indicate the number of documents in which each term occurs. The first thing to note about this is that the term linux isn't even in our dictionary. The system simply assumes that it's a noun, which turns out to be a good assumption when you encounter a word that you don't know.

The word that's interesting here is linuxization. This is a nominalization of the verb linuxize. Note that the morphology has correctly computed this, but also note that the term linuxize doesn't actually occur anywhere in the text we've indexed. Linuxize is the verbalization of the noun linux, to which it is linked. You can extend this to whatever craziness you like and the morphology can cope. The state before linuxization? Prelinuxization. The state before you remove any traces of linuxization? Predelinuxization. As you can see, this is an incredibly useful capability to have, especially in fields like computer science where we make new words out of old words all the time.
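The predelinuxization example can be approximated by peeling affixes off recursively until only the root remains. The prefix and suffix rules below are a toy subset chosen to cover the example, and the decomposition strategy is an assumption about how such an engine might work, not the actual implementation.

```python
# Toy affix rules: prefixes to strip, and suffixes with their
# replacements (linuxization -> linuxize -> linux).
PREFIXES = ("pre", "de")
SUFFIXES = {"ization": "ize", "ize": ""}

def decompose(word):
    """Yield the chain of morphological ancestors of `word`."""
    changed = True
    while changed:
        changed = False
        for p in PREFIXES:
            if word.startswith(p):
                word = word[len(p):]
                yield word
                changed = True
                break
        else:
            for suffix, replacement in SUFFIXES.items():
                if word.endswith(suffix):
                    word = word[: -len(suffix)] + replacement
                    yield word
                    changed = True
                    break

print(list(decompose("predelinuxization")))
# ['delinuxization', 'linuxization', 'linuxize', 'linux']
```

Each intermediate form in the chain is a candidate link in the taxonomy, which is how linuxize can appear as a node even though it never occurs in the indexed text.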

The engine can use the taxonomy at query time to automatically add all of the terms subsumed by a given term that appears in the query. So, if a user enters a query with linux, we can add linuxization to the query without the user ever having to know that such a word exists (or even that it can exist). This can go a long way towards solving the problem of needing to know exactly which words to use in order to find a particular document.
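Query-time expansion then amounts to replacing a term with itself plus everything it transitively subsumes. Here is a hedged sketch; the child map below is illustrative data in the shape of the linux example, not the engine's stored taxonomy.

```python
# Illustrative subsumption links: each term maps to the more
# specific terms it subsumes.
CHILDREN = {
    "linux": ["linuxize", "linuxization"],
    "linuxization": ["prelinuxization", "delinuxization"],
}

def expand(term):
    """Return `term` plus all terms it transitively subsumes,
    suitable for OR-ing together in a query."""
    terms = [term]
    for child in CHILDREN.get(term, []):
        terms.extend(expand(child))
    return terms

print(expand("linux"))
# ['linux', 'linuxize', 'linuxization', 'prelinuxization', 'delinuxization']
```

The user who typed linux never sees this list; the engine simply matches documents containing any of the expanded terms.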

In addition, the ability to browse the automatically constructed taxonomy gives users a way to look at the terminology that's actually used in the index of documents, so that they can do a much better job of choosing query terms.



This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

