Indexing MediaWiki with SES


Introduction

MediaWiki is a popular Wiki system. It is the same software that powers Wikipedia and its many spin-offs and mirrors.

It is possible to index a MediaWiki with the Web Crawler in SES, but this has certain disadvantages:

  • The web crawler has to follow all links, which is not the most efficient way of discovering all pages
  • In order to determine whether a page has changed, the web crawler has to access all page headers, which can put a heavy load on the Wiki's web server
  • If some content is not public, there is no way for the web crawler to access the security information to see who can see a page.

Instead, we want to create a database crawler, which can access the underlying database holding the Wiki pages.

This article shows you how to do that. It makes certain assumptions:

  • The database storing the pages is MySQL. MediaWiki supports other databases, but this has not been tested
  • All content is public. It should be quite possible to extend this method for proper secure search, but that's not covered here.

This project was based on the Turnkey MediaWiki appliance - an easy-to-install, preconfigured machine with Ubuntu Linux, PHP, MediaWiki and MySQL. It was installed into a virtual machine with 512MB RAM and 8GB of disk running under VMware Server. The appliance version used was 2009.10.

Configuring MySQL

We need to ensure that we have a user who is able to connect to the wiki from the SES machine.

MySQL in the Turnkey system is configured to only allow connections from the local machine. We need to modify its config to allow external connections from the SES box.

This will normally require you to edit the file /etc/mysql/my.cnf. If you have a line:

bind-address = 127.0.0.1

you need to change it to something like

bind-address = 123.123.123.123

where the address is the external IP address of your MediaWiki server (find it with /sbin/ifconfig). You will then need to restart MySQL (for example with /etc/init.d/mysql restart) for the change to take effect.
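If you prefer to script the change, something like the following sed one-liner works. This is a sketch: it operates on a sample copy here so it is safe to run anywhere, and the IP address is a placeholder - on the wiki box the real file is /etc/mysql/my.cnf and the address is your server's own.

```shell
# Create a sample my.cnf standing in for /etc/mysql/my.cnf
cat > /tmp/my.cnf.sample <<'EOF'
[mysqld]
bind-address = 127.0.0.1
EOF

# Rewrite the bind-address line (placeholder IP - use your server's address)
sed -i 's/^bind-address.*/bind-address = 123.123.123.123/' /tmp/my.cnf.sample

# Show the result for a visual check
grep '^bind-address' /tmp/my.cnf.sample
```

After making the equivalent edit to the real /etc/mysql/my.cnf, remember to restart MySQL so the new bind address takes effect.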

Then you need to create a MySQL user who can access the wiki tables from the SES machine. On the wiki machine, do:

 mysql -u root -ppassword
 mysql> create user 'sesuser'@'%' identified by 'secretpassword';
 mysql> grant all on wiki_db.* to 'sesuser'@'%';

Note: the '%' should really be your SES host name, to restrict access to just the SES host; however, I was unable to get connections working that way. '%' allows the user to connect from any host. (Since SES only reads the tables, granting just select on wiki_db.* would also be tighter than grant all.) You can verify the grant by running mysql -h wikihost -u sesuser -psecretpassword wiki_db from the SES machine.

If you have an older version of MySQL in your installation, it may be configured with "skip_networking". To find out, run

 mysqladmin -u root -ppassword variables

If skip_networking is ON, you will need to change it; the MySQL reference manual's section on server options may help.
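The variables dump is long, so it is easiest to filter it for the one setting you care about. A sketch (the sample line below stands in for real mysqladmin output so this is runnable on its own; on the wiki machine the real pipeline is the commented-out one):

```shell
# Real invocation on the wiki machine:
#   mysqladmin -u root -ppassword variables | grep skip_networking

# Sample line standing in for the mysqladmin output
sample='| skip_networking | OFF |'
echo "$sample" | grep skip_networking
```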

Configuring SES to access MySQL

SES does not come, out of the box, with the JDBC drivers needed to access MySQL. We must download them and do some simple configuration. Most of this is discussed in the SES Admin Guide.

Download the Connector/J package from the MySQL web site. Open the archive, and extract the file mysql-connector-java-5.1.10-bin.jar (of course this name may change with future versions). Save the jar file into the directory $ORACLE_HOME/search/lib/plugins/oracleapplications on your SES server.

Now edit the file drivers.properties in the same directory. Add a line as follows:

 mysql: com.mysql.jdbc.Driver,PAGEID

This tells SES two things:

  • The name of the jdbc driver for MySQL
  • The name of the unique key column we are going to use. SES defaults to KEY as the key string, but KEY is a reserved word in MySQL

Finally we need to make sure SES can find the driver JAR file. For this we need to edit the MANIFEST file in the appsjdbc.jar file in the same oracleapplications directory.

If you're lucky enough to be an Emacs user on Linux, you can do this by opening the jar file in Emacs and directly editing the META-INF/MANIFEST.MF file inside it.

Edit the Class-Path line to read:

  Class-Path: sqljdbc.jar mysql-connector-java-5.1.10-bin.jar

And save the file. (Change the jar file name if you're using a different version).

If you use an inferior editor, or you're on Windows and emacs can't edit jar files, you will need to do

  jar xvf appsjdbc.jar
  <edit META-INF/MANIFEST.MF to make changes as above>
  jar cvmf META-INF/MANIFEST.MF appsjdbc.jar oracle
  rm -rf oracle META-INF   # or the equivalent Windows delete commands

Do make sure you're in the right (oracleapplications) directory when you run that last command!

There's no need to restart SES after making these changes. SES reads the files afresh at the start of each crawl.

Creating the Source in SES

Then we create a database crawler in SES. These are the parameters (you'll need to edit the database connect string and URL prefix to include the hostname or IP address of the MediaWiki server, and possibly change the port 3306 if your MySQL uses a non-standard port):

Database Connection String: jdbc:mysql://mediawiki.mydomain.com:3306/wiki_db

User ID: sesuser

Password: secretpassword

Query: select p.page_id as PAGEID, p.page_title as URL, t.old_text as CONTENT, p.page_title as TITLE, r.rev_timestamp as LASTMODIFIEDDATE, r.rev_user_text as AUTHOR, 'EN' as LANG from page as p left join revision as r on p.page_id = r.rev_page left join text as t on r.rev_text_id = t.old_id where r.rev_id in ( select max(rev_id) from revision group by rev_page )

URL Prefix: http://mediawiki.mydomain.com/index.php/

All other parameters can be left at their default values.

In page 2 (Authorization) you must set Authorization to "No Access Control List".

You MUST also clear the two lines Plug-in Class Name and Jar File Name.

We're done! Crawl away.

Here's that SQL in a more readable form:

  select
      p.page_id as PAGEID,
      p.page_title as URL,
      t.old_text as CONTENT,
      p.page_title as TITLE,
      r.rev_timestamp as LASTMODIFIEDDATE,
      r.rev_user_text as AUTHOR,
      'EN' as LANG
  from
      page as p
      left join revision as r on p.page_id = r.rev_page
      left join text as t on r.rev_text_id = t.old_id
  where r.rev_id in
      ( select max(rev_id)
        from revision
        group by rev_page )

This is a fairly basic crawl. The page contents, title and author are fetched for the latest revision of each page. The title is fetched in the slightly unfriendly internal form - for example Wiki_Index_Notes. A better crawl would transform this by replacing the underscores with spaces. There is also probably much other useful information which could be extracted from the various tables, including such things as the summary text, the link-in count (for relevance boosting) and perhaps link text as well. This could be achieved by further customization of the SQL above, by reference to the table descriptions and schema diagram in the links below.
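For example, the underscore-to-space transformation can be done directly in the crawler query with MySQL's replace() function. A sketch of the changed select-list entry (untested; the rest of the query stays exactly as above):

```sql
replace(p.page_title, '_', ' ') as TITLE
```

Note that the URL column must keep the raw page_title, since the underscored form is what appears in MediaWiki page URLs.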

Useful Links

MediaWiki.org Homepage

MediaWiki Database Layout Table Descriptions

MediaWiki Database Scheme (Large .PNG Image)

MySQL documentation library

Turnkey MediaWiki appliance
