Open Source TREC: TRECmentum!

Great minds think alike1, I suppose. Grant Ingersoll, a core committer for Lucene, posted last week about open source search engine relevance, proposing a TREC-like evaluation for open source engines. His much more comprehensive post explains a lot of what I mentioned in passing in a post a little while ago.

Getting the TREC data is definitely a barrier to entry for someone who just wants to try the TREC ad hoc queries to see how their favorite engine holds up against the standard TREC evaluation measures. At Sun, we actually participated in TREC 1.5 times and we've intended to compete other times, so we have all of the data, but that doesn't do anyone else any good.

It also takes a fair amount of hardware to run a TREC system through its paces. In 2000, we borrowed about 20 UltraSPARC machines to distribute indexing and retrieval for the question answering task. Of course, we messed up by deciding to compete only the week before the evaluation data went out!

Grant mentions a number of collections that we could use for the evaluation. I think we should also collect as many mail archives as we can get our hands on (I, for example, could see about getting the OpenSolaris mailing lists and associated queries), since that data tends to have an interesting structure and could eventually lead to tasks like topic detection and tracking and social network analysis. I'd even have a go at seeing whether we could host the evaluation collections somewhere around here, if that would be helpful.

I guess what I'm saying is "sign me up". I think this could be a great benefit to all of the open source engine communities. TREC certainly was to the academic and commercial communities.

Update: Paul reminds me in an email that we have a blog crawler and can pretty easily generate several million documents. We wouldn't have queries, though.

1. But as my grandfather always used to say: "Great minds think alike, but fools seldom differ."



Your idea to use mailing list data is interesting. It's definitely a potential collection. I guess I have a few questions:

What use cases should the collections be designed for? Things that come to mind: enterprise search, web search, product search, etc...

The TREC collections are freely available Government documents. The really interesting stuff is copyrighted (like web data). Re-distributing copyrighted data is problematic.

Could a solution be a platform like Alexa's Web Search Platform? The documents are hosted on the cluster and not distributed, but can be processed to create search indexes. You could even create a working service and collect queries. Who knows, you could even perform A/B relevance tests using live traffic. The Information Retrieval Facility is doing something similar for patents.

See my blog for more thoughts.

Posted by Jeff on May 21, 2008 at 05:16 PM EDT #

Not all of the TREC collections are government documents. Most of the stuff that was used for the ad hoc querying tracks was news data.

The redistribution problem is thorny, but it looks like there are about a million sites that crawl mailing lists, so I figured that data would be pretty safe to redistribute. Somehow Google gets away with caching the Web too, and the cost of the TREC web data is pretty high, so people are managing to do this sort of thing.

If we look at sites that we have relationships with (Open Solaris, Apache) then we can get historical queries and have a look at those.

These queries will be very different from the TREC ad hoc queries, which are from that long ago time (called "the late 80s") when queries to search engines had two paragraphs in them, not two words.

I'll have a look at your blog and comment there as necessary...

Posted by Stephen Green on May 22, 2008 at 01:27 AM EDT #

Interestingly, MarkMail is searching a lot of mailing list archives with MarkLogic. They've been charting a lot of their ingest processes in their blog.

Posted by Matt McKnight on May 22, 2008 at 07:19 AM EDT #

While the GOV collections are still expensive, some of the older, original newspaper TREC data has become quite cheap. TREC disks 4 and 5, which used to run about $1,000 ten years ago, now sell for $90 each.

And NIST makes available hundreds of topics, and relevance judgments, for these two disks.

So, it's still not completely open source, but $180 is not too expensive, overall.

Posted by jeremy on May 22, 2008 at 07:19 AM EDT #

Actually, I wonder if we couldn't just take advantage of the paths that Google has blazed, in their book scanning project. Google claims that it's ok for them to make full-text copies of books that do not belong to them, because they are only keeping copies of the index, which is a transformative work and does not allow the "reading" of the original book.

So what if, instead of distributing the full TREC dataset, we were to distribute the parsed indices? E.g., if everyone agreed on certain tokenizing rules, we could use Lucene or Minion to parse the TREC data, and then just distribute the Lucene or Minion index files rather than the full TREC data.
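A toy sketch of the idea in Python (the document IDs and tokenizing rule here are made up for illustration; the real artifact would be a Lucene or Minion index, not this): only the term-to-postings mapping gets serialized and shipped, never the document text itself.

```python
import json
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Build a toy inverted index: term -> {doc_id: term frequency}.

    A stand-in for what Lucene or Minion would produce; the point is
    that the output contains postings, not the original document text.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # The "agreed-upon tokenizing rules": lowercase alphanumerics.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return dict(index)

# Hypothetical mini-collection (IDs loosely styled after TREC doc IDs).
docs = {
    "FT911-1": "Stocks fell sharply in London trading.",
    "FT911-2": "London markets rallied as stocks rose.",
}
index = build_inverted_index(docs)

# This serialized index, not the documents, is what would be redistributed.
serialized = json.dumps(index)
```

Whether a court would agree that such an index is sufficiently "transformative" is, of course, exactly the open question.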

Would that work?

Posted by jeremy on May 22, 2008 at 07:22 AM EDT #

Jeremy, I'd be a lot happier with the Google approach if I had several billion dollars to spend on lawyers to help me out.

Plus, tokenizing might be (although it's unlikely) one engine's secret sauce, and then where would we be?

I like Grant's idea that you should have to contribute source along with your evaluations so that we can get repeatable experiments.

Posted by Stephen Green on May 22, 2008 at 07:26 AM EDT #


This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

