Tuesday Jan 02, 2007

Remember, it's not Achoo!, it's Achelbow!

From his very clean office, Paul links to achelbow.com, the site that I set up with my niece on Boxing Day (I'm Canadian, so I celebrate Boxing Day.)

When we first arrived, my sister (who works for Health Canada (dang, I do love those bilingual domain names; they've been making those silly things since the start of CA\*net in the late 80s!)) and my niece pointed out to my son and me that it's no longer considered correct to sneeze into one's hand, as that's a great way to pass on germs. Apparently, you're supposed to sneeze into the arm of your shirt (how that works out if you're wearing a t-shirt is left as an exercise for the reader!)

I asked my niece what it was called when you did that, and after some discussion, we came up with "achelbow". As a search guy, I appreciated that "achelbow" didn't occur in Google's index (and it still doesn't --- I should put up a script that runs the query every few seconds and see how long it takes to make it into the index...)

I said that we should put up an achelbow web site, but my sister didn't think I could do it. A quick trip to GoDaddy and a couple of hours of waiting got us the achelbow.com domain and a hosting plan.

We got some free AdWords money with the hosting plan and I must say that I was mightily tempted to buy the keyword "achelbow" so that when you run a search in Google all you'll see is achelbow.com and an ad for achelbow.com :-)

So, everyone remember: It's not Achoo!, it's Achelbow!

Update: we're now showing up in Google's blog search, but still no sign in the main index...

Friday Dec 15, 2006

Sun Search in the news

David Berlind wrote a nice post (Proof that the search for “great search” isn’t over just yet) in Berlind's Testbed about a recent visit he made to Sun Labs here in Burlington, where we talked about our approach to search in general and gave him a demo of The Blurbalyzer, our content-based recommender for Amazon.

I'm pretty sure the wife doesn't read my blog, so I'm hoping I won't be in any hot water over my opinions of her search-fu (but I am a better guesser!)

A lot of the stuff that the Blurbalyzer uses to do its work is not in the engine that's currently shipping in Sun's products. It's part of the "next generation" engine that we've been working on for a while in the labs. As Jim Mitchell is fond of saying, "Tech transfer is a contact sport," so we're hoping that we won't face too many hurdles getting the new stuff into the product.

Thursday Dec 14, 2006

IBM OmniFind Yahoo! Edition

This is pretty cool. On Nick Carr's blog I see that IBM's offering (in partnership with Yahoo!) a free version of OmniFind that's based on Lucene, an open source Java search engine.

The engine is good for up to 500K docs, and you can buy service from IBM if you're so inclined. This appears to be a bit of a nose-thumb at Google, whose Google Mini will run you $9K to handle a collection of that size. 500K docs may sound like a lot, but that's a pretty small collection these days.

The screencast of the search engine shows some of its useful features, like customizing the search pages and setting up "sponsored links." The system allows you to add synonyms for terms, which seems useful, but there doesn't appear to be a way in the UI to turn off synonym expansion. That's a problem, since a simple synonym list like the one they offer tends to "fuzz out" a system's results.

They don't mention it in the screencast, but they provide "tunable relevancy controls" that include "Web link analysis," which looks like it's doing a PageRank-y/hubs-and-authorities-y kind of thing. I'd be interested to see how much that actually helps on such a small collection.

Thursday Aug 10, 2006

Rexa.info: A search engine for research literature

I've been meaning to blog about this one for a while. Rexa.info is a new search engine for research literature. In a sense, it's like Citeseer or Google Scholar: You can search for particular terms and Rexa keeps track of citation references.

The interesting thing about Rexa is that it understands objects other than papers and relationships other than citation (i.e., the fact that one paper cited another.) Thanks to some pretty sophisticated machine learning algorithms, Rexa also understands people and grants.

Along with the recognition of these kinds of objects, the system also does deduplication and canonicalization, so that, for example, it recognizes that the same author has written multiple papers. Here's a screenshot of me searching for myself:

(The astute among you will note that I'm the first hit for Stephen Green. Score!) Here the system has figured out that I'm the Stephen Green who writes about computational linguistics and information retrieval, and not the Stephen Green who writes about molecular biology. I'm now a first-class entity in Rexa.

Here's the display associated with a single paper:

You can see how well the information extraction works: it's done a great job of extracting the authors as well as the references (recall that all of this is being done from PDF files!)

One of the interesting things here is that the system is tracking some top-level topics and each paper presents the percentage of membership in a subset of these topics. Here's a page for the "information retrieval" topic:

Clicking through the topics leads you to a page for the topic indicating the contribution to the topic from words and phrases found in the documents, as well as pointers to citing, cited, and co-occurring topics. The top papers display is nice, although I'm surprised that Brin and Page's Google paper from the 7th WWW conference isn't in that list. Oh, wait, I get it: it's not an "information retrieval" paper.

Rexa also has a nice bit of the Web 2.0 nature: you can assign tags and notes to papers and then search by tag. So you can easily find papers on conditional random fields. I'm pleased that they decided that tags should be allowed to have spaces in them. It means that you have to type tags separately, but at least you don't get the weirdness that you get when tags are space separated.

Anyways, go on and get an account and start searching!

Thursday Jun 01, 2006

Open House 2006: The Blurbalyzer

It is time again for the Sun Labs Open House. This is the couple of days a year when we folks in the Labs get a chance to show off the technologies we've been working on to a larger audience. The Open House used to be for Sun people only, but in the past few years we've been holding a public day for press and analysts as well.

If you're a Sun employee, be sure to stop by the Open House today. We'll be in MPK 16 in the big demo room, next to Paul Lamere (look for the big crowd and the cool tunes.)

This year we're demoing the Blurbalyzer, which analyzes the reviews of books from Amazon to help you find new books that are similar to books that you like. By using the review data (including the customer reviews) we end up with a set of recommendations that are very different from those provided by Amazon.

Anyone who's done demos knows how things work: you're at your home base and everything works fine; you move it out to the demo site and it stops working. Here's what I look like when the App Server won't start for some reason:

Just in case you're wondering, the answer is to remove all of the half-completed installations and do a fresh install. Stay tuned for more pictures, if Paul gets enough time during the day to take a few.

Thursday Apr 27, 2006

Open Text Mining Interface

One of the librarians here at Sun pointed me at a very interesting blog entry from the Nature blog on web technology and science. The idea is that, along with the usual formats, Nature could publish an XML file that contains a "machine readable" representation of the text.

In essence, they'll publish a small inverted file of the contents, along with the text of the article broken up into sentences.

Their motivation is that researchers in information retrieval can simply retrieve this version of the articles in Nature for use in their experiments, and they don't have to do all of the work that one normally has to do when dealing with HTML input (e.g., "what does this <h1> tag do in this context?").

There are a few small problems: they remove stop words, they don't include position information for the words, and their sentence boundary detection is a bit flaky (as noted in the comments.)

As for the first of these problems, I'm a bit dogmatic about not throwing information away when indexing, even the so-called "noise" words. Typically this is done to save space in the index, but if one is doing compression in one's postings entries, the space saving is typically pretty minimal. In Nature's case, they actually list the words that were stopped out of the article, so there's no reason why they couldn't just list the frequencies as they do with everything else.

The lack of position information means that you would have to parse the data anyways if you wanted to do proximity operations in your engine, including things like our passage retrieval algorithm. Of course, putting in the proximity data and the actual text means that your XML file is probably going to be twice the size of the original data.

The sentence boundary detection is pretty flaky: it appears to break at every period. This is probably an 80/20 algorithm for sentence end detection, but that 20% is pretty glaring (think about how many sentences are in a typical scientific article!) In general, finding sentence boundaries is pretty hard. I remember a talk given by someone from Lexis/Nexis about 15 years ago where it was claimed that their sentence boundary detector had greater than 90% accuracy and that it was their key differentiator.
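To see why breaking at every period is an 80/20 proposition, here's a quick sketch of the naive approach (my toy code, not Nature's) running on the kind of prose a scientific article is full of:

```java
import java.util.Arrays;
import java.util.List;

public class NaiveSentenceSplitter {
    // Naive rule: a sentence ends at every period followed by whitespace.
    public static List<String> split(String text) {
        return Arrays.asList(text.split("(?<=\\.)\\s+"));
    }

    public static void main(String[] args) {
        // Abbreviations, common in scientific prose, break the naive rule.
        String text = "The samples were prepared by Dr. Smith et al. in 2004. "
                    + "Results are shown in Fig. 2.";
        List<String> sentences = split(text);
        // A human sees 2 sentences; the naive splitter finds 5.
        System.out.println(sentences.size());
        for (String s : sentences) {
            System.out.println("  [" + s + "]");
        }
    }
}
```

Handling "Dr.", "et al.", and "Fig." is exactly the other 20% that makes the problem hard.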

Now, the trick is to generalize this to an entire site, so that your friendly neighborhood crawler only needs to request one file from your site and process it to index all of your content. The real real trick is to figure out how to do this without getting spammed like crazy.

Wednesday Apr 05, 2006

Why should you study Information Retrieval?

Because one day, they might name a street after you!

Tim could probably tell you more about Frank Tompa than I can, having worked with him on the New OED project. I knew him as an undergraduate and graduate student at Waterloo.

I still remember being blown away by a talk that he gave when I was at the University of Toronto about indexing using semi-infinite strings, and especially his description of the PAT search engine and the LECTOR text display system (which, if memory serves, included a lot of the ideas that we take for granted now with XML and XSLT).

Congratulations, Frank!

Tuesday Mar 14, 2006

Search: It's an integer thing

Jonathan's been blogging quite a bit about the try and buy program for our new SunFire T2000 servers. I must say, I'm tempted to see if I can order one delivered to a Sun office so that I can try running our search engine on one of these things.

The CoolThreads boxes are meant to handle Web-type work loads, but I'd be willing to bet that they would do a pretty good job at running a search engine as well.

The reason is that pretty much everything that a search engine does is integer based:

  • Documents are read and tokenized
  • Entries are added to dictionaries
  • Postings are added to postings lists
  • Index segments are written to disk
  • Dictionary entries and postings are read from disk
  • Queries are answered by processing postings lists
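The list above can be sketched with nothing but ints and maps. Here's a toy in-memory version (my illustration; a real engine keeps the dictionary and postings on disk, in compressed form):

```java
import java.util.*;

// A toy inverted index: the dictionary maps each term to a postings
// list of document IDs -- integers all the way down.
public class ToyIndex {
    private final Map<String, List<Integer>> dictionary = new HashMap<>();

    public void index(int docId, String text) {
        // Tokenize: lowercase and split on non-letters.
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            List<Integer> postings =
                dictionary.computeIfAbsent(token, t -> new ArrayList<>());
            // Postings stay sorted and duplicate-free because documents
            // are indexed in increasing ID order.
            if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                postings.add(docId);
            }
        }
    }

    // An AND query: intersect the postings lists of the query terms.
    public List<Integer> query(String... terms) {
        List<Integer> result = null;
        for (String term : terms) {
            List<Integer> postings =
                dictionary.getOrDefault(term, Collections.emptyList());
            if (result == null) {
                result = new ArrayList<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? Collections.emptyList() : result;
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.index(0, "search engines love integers");
        idx.index(1, "integers love postings lists");
        idx.index(2, "search postings");
        System.out.println(idx.query("search", "postings")); // [2]
    }
}
```

Not a single floating point operation in sight.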

In a lot of systems, the postings are compressed, using a variety of techniques. Lucene, for example, uses what it calls VInt encoding. You can see a description of this in the Lucene file formats page. All of these compression techniques are integer based and they require work at indexing time and query time.

Why compress them? Because the cost of doing longer I/Os with uncompressed postings is substantially larger than the cost of the decompression.
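Here's a minimal sketch of the variable-byte idea in the VInt style (seven payload bits per byte, low-order bits first, with the high bit flagging a continuation); this is my own illustration, not Lucene's actual code:

```java
import java.io.*;

public class VIntCodec {
    // Write value as a sequence of bytes, 7 payload bits each,
    // low-order bits first; the high bit says "more bytes follow".
    public static void writeVInt(OutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    public static int readVInt(InputStream in) throws IOException {
        int b = in.read();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int[] gaps = {1, 5, 127, 128, 16384};
        for (int g : gaps) writeVInt(buf, g);
        System.out.println(buf.size() + " bytes for " + gaps.length + " ints");

        ByteArrayInputStream in = new ByteArrayInputStream(buf.toByteArray());
        for (int g : gaps) System.out.print(readVInt(in) + " ");
    }
}
```

The small gaps between successive document IDs in a postings list are exactly the values that compress down to a single byte, and every operation in the codec is integer shifting and masking.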

Of course, having all of those hardware threads means that you can be handling a lot of work (indexing or querying) in parallel.

About the only place most search engines get anywhere near a floating point number is in calculating the scores that are assigned to documents for ranking. There are a few ways around even that:

  1. Precompute the scores when indexing. This saves you the calculation at query time, except perhaps for adding a few floats. I expect that Google does this, since query time is such an important thing to them, and the scores are unlikely to change due to the addition of new documents.
  2. Use fixed point arithmetic. Most scores are pretty small, so you should be able to get by with a 32 bit integer. Of course, you need to worry about overflow, underflow, etc. But a good fixed point arithmetic class would do the trick.
  3. Don't worry about it. When I'm profiling the query side of our search engine, the floating point calculations are completely dominated by the time spent decompressing the postings from the postings file. The method to compute the scores isn't even in the top thousand methods.
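Option 2 is easy to sketch. Assuming a Q16.16 representation (16 integer bits, 16 fractional bits; the split is my choice for illustration), score arithmetic stays entirely in integer registers:

```java
public class FixedPoint {
    // Q16.16 fixed point: 16 integer bits, 16 fractional bits.
    private static final int SHIFT = 16;
    private static final int ONE = 1 << SHIFT;

    public static int fromDouble(double d) {
        return (int) Math.round(d * ONE);
    }

    public static double toDouble(int fp) {
        return (double) fp / ONE;
    }

    // Multiply two fixed-point values; widen to long so the
    // intermediate product can't overflow.
    public static int mul(int a, int b) {
        return (int) (((long) a * b) >> SHIFT);
    }

    public static void main(String[] args) {
        int tf = fromDouble(0.25);  // a hypothetical term weight
        int idf = fromDouble(0.5);  // a hypothetical document weight
        int score = mul(tf, idf);
        System.out.println(toDouble(score)); // 0.125
    }
}
```

Addition and comparison of scores are then just plain int operations, which is exactly what the CoolThreads boxes are good at.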

Wednesday Mar 08, 2006

What's the difference between 18.5 million and 24 million?

I was reading an interesting paper from SIGIR 2001 where the task was to find the "entry page" for a Web site. For example, given the query "American Airlines" the task was to return http://www.aa.com. In passing, the authors mentioned that the index of 18.5 million Web pages that they were using for the experiment (The TREC VLC2 corpus, generated from a 1997 crawl by the Internet Archive) wasn't big enough for Google-style link counting techniques to provide any benefit.

I found this very interesting. The first big Google paper was The Anatomy of a Large-Scale Hypertextual Web Search Engine (apparently now one of the most cited papers in IR.) As it turns out, I attended the talk that Larry and Sergey gave for this paper at the Web conference in Australia in 1998 (I even spoke to them about evaluation afterwards. Who knew I was talking to future multi-billionaires!) At the time they wrote the paper, the Google index was only 24 million pages. Clearly, Google-style techniques were working on an index of this size.

So, what's the difference between the 18.5 million page crawl and the 24 million page crawl? Well, size, for one: Google's crawl was about 30% larger than the crawl that Craswell and company used for their experiments. Still, I don't think that would be enough to cause page ranking to fail to have any effect at all. The character of the pages probably has something to do with it, but I'm willing to bet that the original Google 24 million page crawl has gone the way of their Lego disk boxes.

Wednesday Jan 25, 2006

Extract Interface: An Adventure in Refactoring

Well, for small values of "Adventure" anyways.

I have to hand it to Netbeans. Despite my grousing about the editor (see this issue to find out more), the refactoring support in Netbeans is very handy.

I had a class from which I wanted to generate a new interface which the old class would then implement. It was as simple as selecting the class and then selecting "Extract interface..." from the context menu. Just pick the methods you want to pull up and Bob's your uncle. Previewing all changes let me pick and choose where I wanted to use the new interface and where I wanted to use the new implementation (almost nowhere, which is why I wanted an interface in the first place!)
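For the curious, here's roughly what the transformation amounts to, with made-up names standing in for my actual class:

```java
import java.util.List;

// After "Extract interface...": the methods you picked move up into a
// new interface, and the original class simply implements it.
interface Searcher {
    List<String> search(String query);
}

class SimpleSearcher implements Searcher {
    @Override
    public List<String> search(String query) {
        // Stand-in implementation: echo the query as the only hit.
        return List.of("hit for: " + query);
    }
}

public class ExtractInterfaceDemo {
    public static void main(String[] args) {
        // Callers can now be declared against the interface rather
        // than the concrete class.
        Searcher s = new SimpleSearcher();
        System.out.println(s.search("achelbow"));
    }
}
```

The "preview all changes" step is what lets you decide, call site by call site, whether a declaration should become `Searcher` or stay `SimpleSearcher`.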

Also nice was the fact that it's easy to pull methods up into the new interface when you realize that you've forgotten something.

As far as I can tell, the only thing missing is that I can't select two (or more) classes and extract an interface consisting of the intersection of their methods.

All in all, a very satisfying refactoring experience. It makes me want to go try the other things on that menu. What do you suppose "Use supertype where possible..." does?

Monday Dec 19, 2005

A quick taglib, Netbeans, and XEmacs

So, I finally decided that I would figure out how to write a nice small JSP tag library for doing search-related Web apps (the taglib's motto: "When Web Server's implementation of search is too much, and Portal Server's implementation is waaaaaay too much.")

I decided that I would give Netbeans (5.0 beta 2) a try for this, since I was starting from scratch. I must say that I was very pleasantly surprised about how helpful it was. It helped me figure out how I would get the web app to load a search index at startup time (ServletContextListener, for those playing along at home) and it knows how to frob a .tld file when I add a new tag handler class.

I was also pleasantly surprised to see that they now support a decent set of (X)Emacs keybindings. I don't need everything, but it's nice, for example, that Ctrl-X Ctrl-S saves a file. I'm at a point in my life where I don't have any conscious control over that keystroke sequence. I think saving a buffer is now part of my autonomic nervous system, like breathing.

The only thing that's bothering me is that it doesn't know how to fill comments! I was happily typing javadoc for a method and when I got to the 70th column.... nothing. I had to hit return to move down to the next line. It helpfully filled in an asterisk on the new line, but then didn't put in a space! At least the manual typewriter that I learned to type on would go "Ding!" when I got to the right margin!

Seriously, up till this point I was considering starting to use Netbeans for more development. People who know me (I'm looking at you, Paul!) will realize what a startling admission that is for me to make. I'm the kind of Emacs user that tries to convert vi users (successful conversions to date: 7+. Evangelical Emacs users take note: kill/yank-rectangle is pretty impressive to a vi user.)

You don't realize how much you depend on something like XEmacs autofill-mode until it's gone. All of a sudden I have to start thinking about how close I am to the end of the line and whether this word will fit instead of thinking about the comment that I'm writing. I have this $2,000 computer sitting next to my right knee that's not doing much of anything except watching me type and it can't put in a carriage return, tab, asterisk and space for me when I cross column 70? Really? It certainly knows enough about where I am to color the comments grey...

Anyways, if there's a module for Netbeans that does autofill, I'd be happy to hear about it. Perhaps I'll have to write it myself.

Sunday Nov 13, 2005

Why a Word Processor?

Well, not exactly a word processor. Tim Bray was asked to provide a recent résumé and ended up asking the question "assuming you had a Web editor with a good change tracker, why would anyone want a word processor any more?"

The one-word answer is "equations." MathML is a standard and some browsers support it, but have you ever tried to use it? That's definitely a markup language that requires assistance from a good editor (is there a MathML mode for (X)Emacs?). The problem is that the "equation editors" in most word processing software are not that useful if you're doing anything more complicated than a couple of fractions, and even then the results usually don't look very good.
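To give a concrete sense of the authoring burden, compare the quadratic formula in LaTeX (one line) with my hand-written Presentation MathML for the same expression (an illustration; a converter's output may differ):

```xml
<!-- The LaTeX source, one line: x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} -->
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>x</mi><mo>=</mo>
  <mfrac>
    <mrow>
      <mo>-</mo><mi>b</mi><mo>&#xB1;</mo>
      <msqrt>
        <msup><mi>b</mi><mn>2</mn></msup>
        <mo>-</mo><mn>4</mn><mi>a</mi><mi>c</mi>
      </msqrt>
    </mrow>
    <mrow><mn>2</mn><mi>a</mi></mrow>
  </mfrac>
</math>
```

Nobody wants to type the second version by hand, which is why editor support matters so much here.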

If you try to use MathML in Mozilla, the browser helpfully points out that you should install the TeX math fonts. If you want to do math, you should be doing (La)TeX!

Friday Nov 11, 2005

Autonomy and Verity

I try to stick to technical search topics in my blog, but there was big news in enterprise search this past week: Autonomy is buying Verity. Google is trying to make hay while the sun shines and capture Verity customers with the lure of free hardware.

I'm not sure what someone who buys a full scale GSA is going to want to do with a mini, but the guys at Anandtech sure had fun with theirs. I'm guessing they violated their license agreement in more than one way.

The thing that surprises me about the GSA is that the license limits you to the number of documents that you can index with it. They don't just offer some don't-cross-the-streams-style advice ("It would be bad"), they actually won't let you do it. And don't forget: a row of a database is a document! (Actually, I agree with that one.)

Folks interested in the GSA should probably have a look at Searching the Workplace Web, a paper from WWW2003 by a number of researchers at IBM Almaden that focused on how well techniques like PageRank work in an intranet setting.

Of course, I don't know that much about business (for example, I own far fewer 767s than some people (I think if I was worth 10 billion or so, I would just go for a 777, and then live in it full time, fly around the world, and solve crimes (that would make a good TV show: "He's a search billionaire, she's a freelance flight attendant. Together, they fight crime!" (Help, I'm trapped in a Lisp program!))))

Tuesday Nov 01, 2005

Sun Labs Open House Video and lonely software

Paul pointed out that we're both in the Open House video currently up at sun.com.

I'd have to say that my favorite part is the shot of my hand pushing the mouse around. You just don't see enough of that. Also, I'd completely forgotten that I had a goatee in April.

Paul extols my virtues as an xorg.conf fixer. The search engine that I used to fix the problem that he was having with the 24" panel is our engine running on an archive of about 3 million email messages that have been sent to Sun internal mailing lists. I solved his problem with a quick passage search. That kind of specific information request ("what's the correct Xorg modeline for a 24" flat panel?") is exactly what our passage retrieval algorithm does best.

The mail archive was actually having a bit of trouble today. I swear, the thing will run updates every five minutes for six months and then freak out for no obvious reason (well, every now and then a disk goes bad.) It's as though it gets lonely and wants me to stop by and pay some attention to it.

Oh, and I cannot believe that wikipedia has an entry for twm. I'll bet that twm is lonely for some attention.

Friday Oct 14, 2005

Indexing syndication feeds with Rome

I decided to try indexing a few (well 250) syndication feeds in order to run a few experiments (more on this later). I was going for quick-n-dirty, so I had initially thought that I would just pull the XML for the feed and then deal with that using the standard Java XML APIs.

A quick look at one of the feeds that I wanted to index told me that I would be better off with a slower-n-cleaner approach, so I downloaded and built (with a bit of hacking of various build.xmls) Rome.

I went from compiling Rome to indexing an actual feed in about 3 hours, which is not too bad. I used the FeedReader sample that comes with Rome to figure out how to pull a feed. One thing to keep in mind: if you're going to be pulling a lot of feeds, remember to close your XMLReader! I find this a lot in sample programs: they demonstrate a typical usage of the API, but they often forget the cleanup, since they tend to be very short.

In a real-world situation where a program will run for hours and open tens of thousands of files, you need to close the files you open. The GC won't do this for you, and if you run out of file descriptors you get a not-very-helpful java.io.FileNotFoundException.
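The fix is mechanical but easy to forget. Here's a sketch of the pattern using try-with-resources (a luxury we didn't have back then; a finally block does the same job):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CloseYourReaders {
    // Read the first line of each file, guaranteeing that every reader
    // is closed even if an IOException is thrown mid-loop.
    public static int countNonEmptyFirstLines(Iterable<Path> files)
            throws IOException {
        int count = 0;
        for (Path p : files) {
            try (BufferedReader r = Files.newBufferedReader(p)) {
                String line = r.readLine();
                if (line != null && !line.isEmpty()) count++;
            } // r.close() runs here, so file descriptors never pile up
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("feed", ".xml");
        Files.write(tmp, "<rss/>".getBytes());
        System.out.println(countNonEmptyFirstLines(List.of(tmp)));
        Files.delete(tmp);
    }
}
```

The same discipline applies to whatever reader type your feed parser hands you: if it's opened per feed, it has to be closed per feed.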

One other thing to keep in mind: if you're behind a cache or proxy, you want to make sure that you use URLConnection.setUseCaches(false) on the connections for your feeds to ensure that you're really getting the feed and not just a cached copy. This one led to a bit of head scratching and printf debugging for me.


This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

