Come and get it! Fresh hot minion!

Jeff and I gave our talk at JavaOne today. I'll post a link to the audio and slides once it's up.

The big news today is that the source for Minion is now available at java.net.

This is pretty exciting for us. If you're interested in trying Minion, please do download the source, join the project and the mailing lists, read the getting started docs, and give it a go!

Minion's under active development, and we'll be working on documentation and tutorials as we go, so stay tuned.

Comments:

Great! But the source doesn't compile! Notably, there are lots of missing references to "LuceneParser" and "Node".

Posted by Jordan on May 07, 2008 at 04:26 PM EDT #

Great news! I'm planning on testing Minion for searching a local PDF collection (~13GB/2.6M docs).
I wonder if Minion easily supports PDF documents?

Posted by Sérgio Nunes on May 08, 2008 at 03:05 AM EDT #

Jordan, that's weird. We've been using the Minion SVN ourselves for a week or so, so it should compile. The files that are missing are generated by JavaCC via a rule in build.xml. Were you trying to build it outside of NetBeans, or possibly not using Ant?

Also, if you're going to give Minion a try, I would heartily recommend joining the mailing lists at java.net.

Posted by Stephen Green on May 08, 2008 at 03:57 AM EDT #

Sergio, we don't support indexing PDF directly. You'll need to use something like PDFBox (or Apache POI or Tika for office document formats) to extract the text from the PDF, and then index that text.

The usual trick here is to add a field to the document that you index indicating where the original PDF file is located, so that you can show your users the original PDF when they get a hit in a PDF document.
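
As a rough illustration of that trick, here is a minimal Java sketch that builds the set of fields for one PDF. It deliberately uses neither Minion's nor PDFBox's real API: `TextExtractor`, `buildFields`, and the field names `text` and `source-path` are hypothetical names made up for this example, and the extractor stands in for whatever tool pulls the text out of the PDF.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PdfFieldSketch {

    /** Hypothetical stand-in for a real PDF text extractor (e.g., a wrapper around PDFBox). */
    interface TextExtractor {
        String extract(String pdfPath);
    }

    /**
     * Builds the fields to index for one PDF: the extracted text, plus the
     * location of the original file so a hit can link back to the PDF.
     * The field names are assumptions chosen for this sketch.
     */
    static Map<String, Object> buildFields(String pdfPath, TextExtractor extractor) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("text", extractor.extract(pdfPath)); // searchable content
        fields.put("source-path", pdfPath);             // where the original PDF lives
        return fields;
    }

    public static void main(String[] args) {
        // A fake extractor so the sketch runs without any PDF library.
        TextExtractor fake = path -> "extracted text of " + path;
        Map<String, Object> fields = buildFields("/docs/report.pdf", fake);
        System.out.println(fields);
    }
}
```

The resulting map would then be handed to whatever indexing call the engine provides, and the `source-path` value read back from a hit to show the user the original PDF.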

Minion can certainly handle 2.6 million documents without any trouble.

Posted by Stephen Green on May 08, 2008 at 04:00 AM EDT #

Do you have any non-svn download methods (I mean freshmeat.net and sourceforge.net seem pretty easy to get big *.tar.gz files)? My current 'svn' reports an error like the following:
-------
svn checkout https://minion.dev.java.net/svn/minion/trunk minion \
--username jon_strabala
svn: PROPFIND request failed on '/svn/minion/trunk'
svn: PROPFIND of '/svn/minion/trunk': SSL negotiation failed: SSL disabled
due to library version mismatch (https://minion.dev.java.net)
-------

A fresh build of SVN seems to be an extensive pain, i.e., lots of dependencies:
-------
# subversion-1.4.3-sol10-x86-local.gz Subversion is an alternative to the CVS
version control system - installs in /usr/local.
To use subversion, you might also need to install:
neon-0.25.5,
apache-2.0.59,<--- NOTE THIS
swig-1.3.29,
expat-2.0.1,
gdbm-1.8.3,
libxml2-2.6.31,
db-4.2.52.NC,
openssl-0.9.8g,
libiconv-1.11,
zlib-1.2.3,
and to get /usr/local/lib/libgcc_s.so.1 and
/usr/local/lib/libstdc++.so.6 install
libgcc-3.4.6 or
gcc-3.4.6 or similar.
-------

Posted by Jon Strabala on May 08, 2008 at 08:38 AM EDT #

Jon, those requirements are a bit crazy (Apache? Does SVN do the same Web server thing that Mercurial does?). I'm trying to upload a tar file to java.net, but I'm at the end of a very thin wireless connection right now.

I'll try to get something up this evening back at the hotel, and I'll see if I can configure java.net to put up nightly tar.gzs or something.

I'll post another comment when it's ready.

Posted by Stephen Green on May 08, 2008 at 08:57 AM EDT #

Any progress on posting a non-svn tarball?

Posted by Jon Strabala on May 15, 2008 at 10:44 AM EDT #

Jon, sorry for the delay. I just uploaded it. It's at:

https://minion.dev.java.net/files/documents/8750/97116/minion.tar.gz

Posted by Stephen Green on May 17, 2008 at 03:34 AM EDT #

About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.