By searchguy on May 21, 2008
Getting the TREC data is definitely a barrier to entry for anyone who just wants to try the TREC ad-hoc queries and see how their favorite engine holds up against the standard TREC evaluation measures. At Sun, we've actually participated in TREC one and a half times, and we've intended to compete other times, so we have all of the data, but that doesn't actually do anyone else any good.
It also takes a fair amount of hardware to run a TREC system through its paces. In 2000, we borrowed about 20 UltraSPARC machines to distribute indexing and retrieval for the question answering task. Of course, we messed up by deciding to compete the week before the evaluation data was going out!
Grant mentions a number of collections that we could use for the evaluation. I think we should also collect as many mail archives as we can get our hands on (I, for example, could see about getting the OpenSolaris mailing lists and associated queries), since that data tends to have an interesting structure and could lead (eventually) to tasks like topic detection and tracking and social network analysis. I'd even have a go at seeing if we could host the evaluation collections somewhere around here, if that would be helpful.
I guess what I'm saying is "sign me up". I think this could be a great benefit to all of the open source engine communities. TREC certainly was to the academic and commercial communities.
Update: Paul reminds me in an email that we have a blog crawler and could pretty easily generate a collection of several million documents. We wouldn't have queries for it, though.
1. But as my grandfather always used to say: "Great minds think alike, but fools seldom differ."