Search Engineering - the End of N-Grams? (or Not)
By seapegasus on Feb 11, 2007
A start-up company called Powerset has set out for the holy grail -- a natural language search engine! One that does not (like the other one that starts with Goo and ends in gle) "simply" strip stop words from the search string and then run parallel searches for lexemes and synonyms in a database of cleverly sorted n-grams.
Powerset say they will use technology from Xerox PARC. Okay, those Xerox guys are good, but having licensed their stuff doesn't solve the task all by itself. How and what exactly will they do? Dang, the web page doesn't tell... :-[ Hmmm... "Power set"...? Is that a hint of some sort? ;-)
Google is good because it already lets you search for natural input like "Who is that German who believes that 300 years of the history of the early middle ages are made up??" more or less successfully. Just hit "I'm Feeling Lucky" and find a page where somebody blogged about this original conspiracy theory. But I had to try five times with different sentences to find this perfect hit. -- Well, still better than getting laughed at by your friends for asking them about phantom time, right? In real natural language search you would not have to try paraphrases yourself.
But don't expect Powerset to implement natural language search requests like "How many non-American and non-Russian astronauts were deployed to the ISS?" that would get you the list of names and a number. This would require that the search engine not only understands the question perfectly, but also a) has access to a list of ISS astronauts, b) has access to a list of their respective nationalities, c) knows that "non" means I want it to filter all persons of American and Russian nationality from the list (and not people whose names happen to look like a nationality), and then d) interprets "how many" as a request to count the remaining entries, and finally prints the answer. That would "merely" require a mapping from each ever-so-crazy textbook question type the user comes up with to a step-by-step search solution, plus the tables to look up data from... Then after 12 years of research a user types in "So how many legs do a farmer, a dog and 17 chickens have together?" and your engine gripes "Gimme a break" and dies of karoshi. \*Sigh\*
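Just to show how little is left once steps a) to d) are spelled out, here is a minimal sketch in Python. Everything in it is made up for illustration: the `crew` table, the placeholder names and the `count_excluding` helper are assumptions of mine, not real ISS data and certainly not anything Powerset has announced. The hard part (getting from the English question to this function call) is exactly what the sketch leaves out.

```python
# Steps a) and b): a table of people and nationalities.
# Placeholder entries only -- NOT real ISS crew data.
crew = [
    {"name": "Astronaut A", "nationality": "American"},
    {"name": "Astronaut B", "nationality": "Russian"},
    {"name": "Astronaut C", "nationality": "German"},
    {"name": "Astronaut D", "nationality": "Japanese"},
    {"name": "Astronaut E", "nationality": "Russian"},
]


def count_excluding(people, excluded):
    """Step c): drop everyone whose nationality is excluded.
    Step d): count what is left."""
    remaining = [p["name"] for p in people if p["nationality"] not in excluded]
    return remaining, len(remaining)


names, how_many = count_excluding(crew, {"American", "Russian"})
print(names, how_many)
```

Note that the filter looks at the nationality column only, so somebody who happens to be called "Russian" by name would not be thrown out by accident.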
Answering math questions is not what Powerset attempts to achieve. (If they do, then they really found the holy grail!) Basically, search engines are 'only' a more user-friendly interface for common "a and b and (c or d or e)"-style search requests; they are not supposed to do math homework for you. Unless somebody wrote a web page about the exact same question using similar words, these questions won't work. You will have to search for "list of ISS astronauts and their nationalities" or something and count them yourself -- sorry.
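For the curious, an "a and b and (c or d or e)" request boils down to a few set operations. The following is a rough sketch in Python over a toy inverted index; the `index` contents, the page ids and the `search` helper are invented for this example, and real engines obviously do far more (ranking, stemming, synonyms).

```python
# Toy inverted index: search term -> set of page ids (invented data).
index = {
    "iss": {1, 2, 3},
    "astronauts": {1, 2, 4},
    "list": {1, 5},
    "names": {2},
    "nationalities": {3, 4},
}


def pages(term):
    """Pages containing the term; unknown terms match nothing."""
    return index.get(term, set())


def search(required, any_of):
    """Evaluate: every term in `required` AND at least one term in `any_of`."""
    result = None
    for term in required:
        result = pages(term) if result is None else result & pages(term)
    if result is None:
        result = set()
    alternatives = set().union(*(pages(term) for term in any_of))
    return result & alternatives


# iss AND astronauts AND (list OR names OR nationalities)
hits = search(["iss", "astronauts"], ["list", "names", "nationalities"])
print(sorted(hits))
```

Intersection does the "and", union does the "or". That is all the search box really asks for, which is why it cannot count astronauts for you.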
But there is still enough work to be done to make normal fact searching more intuitive, and that is more likely what Powerset are up to. They may try to use implicit context, so you could type in "Kubiak eat now" and the search would fill in your city (i.e. the city via which your provider connects you) as default location and list fast-food chains. Also, search results could be clustered and labeled better -- are they news articles (presumably more reliable) or blog entries or reviews (if you are searching for opinions), are they privately hosted or on company domains or on university sites, what media are they, and how old? If the user's browser is set to German but she searches for an English word, would she be interested in German results about the same topic too? Etc. I hope they hired a lot of user interface designers.
Ooh, Powerset even have job openings for computational linguists, man, I haven't seen that for a while. Well, currently I already have a job, thank you, ;-) so I will have to wait until the end of the year to see the first prototype. \*Sigh\* Come on, Powerset, give us a Beta! Some Google apps have been in beta for how long? Since, like, the last millennium? See!
Uh-oh. Speaking of Google. I feel a disturbance in the force... Google just published their corpus of n-grams? Free. Gzipped. On 6 DVDs. Woah. \*Waves at Thorsten Brants from Saarbrücken!\*
Of course this does not mean that Google is releasing their n-grams in response to Powerset's announcement, because, dunno, Powerset's new method will be the end of n-gram usage now or something. Obviously, Google did not publish the database column that says on which web page each n-gram was found (data which has to be refreshed regularly anyway)... :-P
If you don't know what an n-gram is: n-grams are just sequences of n words as they typically appear in text, counted and sorted by frequency. Collecting them involves tokenizing loads and loads of text. Amounts of text which can be found for free on the internet.
For instance, a 3-gram (trigram) is a typical sequence of three words torn out of context ("I am a, I am just, I am here, I am the, I am not" etc). So, if you have collected lots and lots of different trigrams ("not completed but, not count on, not belong to", or "to do it, to go away, to continue his" etc), and have sorted them by frequency, then! \*drumroll\* You can calculate the statistically most probable English sentence! Which would look like (I am making an example up here) "I am not belong to continue his words are the one of the president of stupid like a big piece of all the things are no doubt that he is another step to..." Wahahahaha! :-D ... :-| Okay, maybe applied linguistics jokes are only funny for computational linguists.
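If you want to produce this kind of nonsense at home, here is a minimal sketch in Python. Assumptions, loudly: the one-line `corpus` is made up, tokenizing is just lowercasing and splitting on whitespace, and `babble` greedily picks the most frequent next word given the last two. That is a crude stand-in for "the statistically most probable sentence", not how anyone builds a real language model.

```python
from collections import Counter


def trigrams(text):
    """Count every run of three consecutive words in the text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:], words[2:]))


def babble(counts, start, length=8):
    """Extend `start` (two words) by always taking the most frequent follower."""
    out = list(start)
    for _ in range(length):
        last_two = tuple(out[-2:])
        followers = {t[2]: c for t, c in counts.items() if t[:2] == last_two}
        if not followers:
            break  # no trigram continues from here
        out.append(max(followers, key=followers.get))
    return " ".join(out)


# Tiny invented corpus; the real thing needs loads and loads of text.
corpus = "i am not the one i am not the president i am the one of the things"
counts = trigrams(corpus)
print(counts.most_common(3))
print(babble(counts, ("i", "am"), length=2))
```

Each step only ever looks at the previous two words, which is why the output is locally plausible and globally gibberish.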
Anyway, the real use case is this: having alphabetically sorted n-grams annotated with the web page they came from speeds up the search process significantly for a search engine provider (because jumping to a position in the alphabet can be done faster than doing full-text search over and over again).
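The "jumping to a position in the alphabet" part is plain binary search. Here is a small sketch in Python using the standard `bisect` module; the `table` of n-grams and page ids is invented, the helpers `lookup` and `with_prefix` are hypothetical names of mine, and I am not claiming that any actual search engine stores its index like this.

```python
import bisect

# (n-gram, page ids) pairs, sorted alphabetically by n-gram. Invented data.
table = sorted([
    ("i am here", [3]),
    ("i am not", [1, 7]),
    ("not belong to", [2]),
    ("to do it", [4, 5]),
    ("to go away", [6]),
])
keys = [ngram for ngram, _ in table]


def lookup(ngram):
    """Binary search for an exact n-gram; returns its page ids or []."""
    i = bisect.bisect_left(keys, ngram)
    if i < len(keys) and keys[i] == ngram:
        return table[i][1]
    return []


def with_prefix(prefix):
    """All n-grams starting with `prefix`, found via two binary searches."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")
    return keys[lo:hi]


print(lookup("i am not"))
print(with_prefix("to "))
```

Each lookup takes a number of comparisons that grows with the logarithm of the table size, instead of scanning every page's text again for every query.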
Another example of a useful thing you can do with Google's n-grams, even without the page URL they came from, is training speech recognition systems: If the system didn't get whether you said "Oh painter knew him tivo'ed phile ending sword mine amen sedate" or "Open a new empty word file and insert my name and the date", a quick n-gram frequency comparison tells it: The second interpretation is a bit more likely to occur in the English language. Now, aren't you glad n-grams were invented and you can get them for free? :-)
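That frequency comparison can be sketched in a few lines of Python. Big caveats: the `training_text` is a tiny invented stand-in for a real n-gram corpus, I use bigrams to keep it short, and the `score` function (sum of log(1 + count)) is a deliberately crude assumption of mine. Real recognizers use properly smoothed probabilities and combine them with acoustic scores.

```python
import math
from collections import Counter


def bigram_counts(text):
    """Count every pair of consecutive words in the text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def score(sentence, counts):
    """Crude plausibility score: unseen word pairs add 0, frequent pairs add more."""
    words = sentence.lower().split()
    return sum(math.log(1 + counts[pair]) for pair in zip(words, words[1:]))


# Tiny invented training text standing in for a real n-gram corpus.
training_text = (
    "open a new file and insert the date "
    "open a new empty word file "
    "insert my name and the date"
)
counts = bigram_counts(training_text)

garbled = "Oh painter knew him tivo'ed phile ending sword mine amen sedate"
sensible = "Open a new empty word file and insert my name and the date"

best = max([garbled, sensible], key=lambda s: score(s, counts))
print(best)
```

None of the word pairs in the garbled version ever shows up in the training text, so it scores zero and the sensible reading wins.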