Choosing which documents to index with a document service

A question posed recently was "How can I tell the NTFS crawler to index normal files but not folders?"

The initial answer was "There is no option available to prevent submitting of folders for indexing in an NTFS source".

But like many problems in SES, it can be solved quite easily with a document service.

Document services sit in the pipeline between the crawler (which fetches the documents from a source) and the indexer, which analyzes and indexes the document.  In a document service you can:

  1. Add or modify attributes of the document
  2. Change, add or delete some or all of the text of the document
  3. Specify whether the current document should be indexed or not

We want to use number 3 here: specify that certain documents should not be indexed. We know that NTFS folders have a URL which ends with a "/".  It would be easy enough to write a document service that checks for this and specifies that a matching document should not be indexed.  But let's make it a bit more flexible, and allow the user to specify a regular expression.  If the URL matches the regular expression, the document will not be indexed.

We do this by defining a parameter to the document service. The document service is installed in SES from a jar file, then we create an instance of the document service, complete with a parameter, which is the regular expression to match. In this case, the regular expression would be ".*/$" (without the quotes), meaning "match anything ending with a forward slash".
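To make the matching rule concrete, here is a minimal sketch of just the decision logic, outside of SES. The class name UrlRejectionRule and its method are hypothetical and not part of the SES SDK or of RejectionFilter.zip; the sketch also assumes the regular expression has to match the whole URL, which may differ from what the downloadable service does.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the SES SDK.
// It models the single decision "should this URL be kept out of the index?".
final class UrlRejectionRule {
    private final Pattern pattern;

    // regex is the value of the document service parameter, e.g. ".*/$"
    UrlRejectionRule(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // Assumption: the whole URL must match the expression (Matcher.matches()).
    boolean rejects(String url) {
        return url != null && pattern.matcher(url).matches();
    }
}
```

With the parameter set to ".*/$", a URL such as file://host/share/docs/ would be rejected, while file://host/share/docs/readme.txt would still be indexed.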

You can download the document service from this link: RejectionFilter.zip. It's a zip file, so unzip it with your favorite utility on Windows or with "unzip" on Unix/Linux. The file readme.txt contains full instructions for installing it (hopefully; if not, let me know).

UPDATE!

It seems I posted this a little early. Since I didn't have an NTFS source set up on the current 10.1.8.4 version, I tested this on the development machine, which runs an early release of 11g SES. It worked fine there. Unfortunately, it doesn't work with 10.1.8.4. In the current release, NTFS folder URLs do not have a trailing slash; indeed, there is no way to tell from looking at the URL whether a "document" is a folder or a file.  Fortunately, we can look at the document size instead: a folder has no content, so its size is zero. I modified the document service to include a new parameter for minimum document size. Set this to 1, and folders (plus any truly empty documents) will be ignored. You can get the updated document service from here: RejectionFilterV11.zip
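
The size check can be sketched in the same way. Again, this is only an illustration of the rule described above, not the code in RejectionFilterV11.zip: the class name SizeAwareRejectionRule, the parameter names, and the choice to treat the URL pattern as optional are all assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the SES SDK.
final class SizeAwareRejectionRule {
    private final Pattern urlPattern; // null means "no URL rule configured"
    private final long minSize;       // minimum document size in bytes

    SizeAwareRejectionRule(String urlRegex, long minSize) {
        this.urlPattern = (urlRegex == null) ? null : Pattern.compile(urlRegex);
        this.minSize = minSize;
    }

    // A document is rejected if it is smaller than minSize,
    // or if a URL rule is configured and the whole URL matches it.
    boolean rejects(String url, long sizeInBytes) {
        if (sizeInBytes < minSize) {
            return true;
        }
        return urlPattern != null && url != null
                && urlPattern.matcher(url).matches();
    }
}
```

With a minimum size of 1 and no URL pattern, a zero-byte folder entry is rejected and any document with at least one byte of content is kept.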






