Wednesday May 21, 2008

Open Source TREC: TRECmentum!

Great minds think alike1, I suppose. Grant Ingersoll, a core committer for Lucene posted last week about open source search engine relevance, proposing a TREC-like evaluation for open source engines. His much more comprehensive post explains a lot of what I mentioned in passing in a post a little while ago.

Getting the TREC data is definitely a barrier to entry to someone who wants to just try the TREC ad-hoc queries to see how their favorite engine holds up against the standard TREC evaluation measures. At Sun, we actually participated in TREC 1.5 times and we've intended to compete other times, so we have all of the data but that doesn't actually do anyone else any good.

It also does take a fair amount of hardware to run a TREC system through its paces. In 2000, we borrowed about 20 Ultrasparc machines to distribute indexing and retrieval for the question answering task. Of course, we messed up by deciding to compete the week before the evaluation data was going out!

Grant mentions a number of collections that we could use for the evaluation. I think we should collect up as many mail archives as we could get our hands on as well (I, for example, could see about getting the OpenSolaris mailing lists and associated queries) since that data tends to have an interesting structure and it could lead (eventually) to tasks like topic detection and tracking and social network analysis. I'd even have a go at seeing if we could host the evaluation collections somewhere around here, if that was helpful.

I guess what I'm saying is "sign me up". I think this could be a great benefit to all of the open source engine communities. TREC certainly was to the academic and commercial communities.

Update: Paul reminds me in an email that we have a blog crawler and can pretty easily generate several million documents. We wouldn't have queries, though.

1. But as my grandfather always used to say: "Great minds think alike, but fools seldom differ."

About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

Search

Archives
« July 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
  
       
Today