Wikipedia for indexer testing

Mikkel from the tracker mailing list had a nice idea: dump Wikipedia to text files for testing indexers. See:
http://mail.gnome.org/archives/tracker-list/2007-January/msg00180.html
So I've created a small application called JWikiTextGetter that can be run from the command line as well as with a Swing GUI.
To run it, just grab the binaries from HERE.
Unpack them and run with JRE 1.5 or later as follows:
java -jar JWikiTextGetter.jar --gui
or:
java -jar JWikiTextGetter.jar --help
This is a very quickly written application, so don't expect too much! It just does what it should. The quality of the text grabbed from the wiki depends on the htmlparser library. Each file (for en.wikipedia) contains 59 lines that maybe should be deleted, but I left them :-)
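Just to give an idea, here is a minimal sketch of what text extraction with the htmlparser library can look like. This is not the actual JWikiTextGetter code; the class name, the Special:Random URL and the output file name are placeholders picked for illustration:

import java.io.FileWriter;
import org.htmlparser.beans.StringBean;

public class WikiTextSketch {
    public static void main(String[] args) throws Exception {
        StringBean bean = new StringBean();
        bean.setLinks(false);                  // keep plain text only, drop link URLs
        bean.setReplaceNonBreakingSpaces(true);
        bean.setCollapse(true);                // collapse runs of whitespace
        bean.setURL("http://en.wikipedia.org/wiki/Special:Random");
        String text = bean.getStrings();       // fetch the page and extract its text
        FileWriter out = new FileWriter("article.txt");
        out.write(text);
        out.close();
    }
}

Compile and run it with the htmlparser jar on the classpath.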
There is a configuration file for each wiki, because the different localized Wikipedias have different URLs. If you want to create your own, go to the wikipedias folder in the main application folder and write one like the two examples there.
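As a purely hypothetical illustration, such a per-wiki file could contain little more than the URLs of that localized Wikipedia; the file name and keys below are made up, the real format is whatever the two bundled examples use:

# wikipedias/de.conf (hypothetical keys, for illustration only)
name=de.wikipedia
base_url=http://de.wikipedia.org
random_article_url=http://de.wikipedia.org/wiki/Spezial:Zuf%C3%A4llige_Seite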
UPDATE:
According to the Wikipedia page below, using web crawlers to download large numbers of articles is not welcome:
http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler
Comments:

Nice, thanks. Any hope to get images downloaded as well?

Posted by Berteh on January 24, 2007 at 12:46 AM UTC #

Hey migi.

Rather than downloading text from Wikipedia, I once searched for some stuff with Google and downloaded that whole Google search (all PDF documents, about 700 MB in total). Nobody complained :) - could this be an option?

BTW: this paper collection is still being indexed, as I still have it ;)

Cheers, Marcus

Posted by fritschy on September 07, 2007 at 03:46 AM UTC #

Hello Marcus,

I know that the wiki administrators ask not to use crawlers; that is why I have added the update to my weblog. On the other hand, downloading all the wiki dumps just to have a few pages can make other people (Internet providers, other users of the same subnet) unhappy - even more so if we want to get a few documents in each language. So in my opinion, downloading 50 pages in each language, which is a one-time process, is the "lesser evil" compared to downloading whole dumps.

There is another thing, a legal one. Getting documents from a Google search doesn't tell you anything about the license of those documents. Probably nobody will ever notice that someone is using such a dataset, but the one that I am making is also used inside my company, and I have to be aware of the licenses.

Now I am wondering how to get legal MP3s - any ideas? :-)

best
Michal Pryc

Posted by Michal Pryc on September 10, 2007 at 03:03 AM UTC #
