Oracle Text query parser

Oracle Text provides a rich query syntax which enables powerful text searches.

However, this syntax isn't intended for use by inexperienced end-users.  If you provide a simple search box in your application, you probably want users to be able to type "Google-like" searches into the box, and have your application convert that into something that Oracle Text understands.

For example if your user types "windows nt networking" then you probably want to convert this into something like
"windows ACCUM nt ACCUM networking".  But beware - "NT" is a reserved word, and needs to be escaped.  So let's escape all words:
"{windows} ACCUM {nt} ACCUM {networking}".  That's fine - until you start introducing wild cards. Then you must escape only non-wildcarded searches:
"win% ACCUM {nt} ACCUM {networking}".  There are quite a few other "gotchas" that you might encounter along the way.

Then there's the issue of scoring.  Given a query for "oracle text query syntax", it would be nice if we could score a full phrase match higher than a hit where all four words are present but not in a phrase.  And then perhaps lower than that would be a document where three of the four terms are present.  Progressive relaxation helps you with this, but you need to code the "progression" yourself in most cases.

To help with this, I've developed a query parser which will take queries in Google-like syntax, and convert them into Oracle Text queries. It's designed to be as flexible as possible, and will generate either simple queries or progressive relaxation queries. The input string will typically just be a string of words, such as "oracle text query syntax" but the grammar does allow for more complex expressions:

  word : score will be improved if word exists
  +word : word must exist
  -word : word CANNOT exist
  "phrase words" : words treated as phrase (may be preceded by + or -)
  field:(expression) : find expression (which allows +,- and phrase as above) within "field".

So for example if I searched for
  +"oracle text" query +syntax -ctxcat
Then the results would have to contain the phrase "oracle text" and the word syntax. Any documents mentioning ctxcat would be excluded from the results.

All the instructions are in the top of the file (see "Downloads" at the bottom of this blog entry).  Please download the file, read the instructions, then try it out by running "parser.pls" in either SQL*Plus or SQL Developer.

I am also uploading a test file "test.sql". You can run this and/or modify it to run your own tests or run against your own text index. test.sql is designed to be run from SQL*Plus and may not produce useful output in SQL Developer (or it may, I haven't tried it).

I'm putting the code up here for testing and comments. I don't consider it "production ready" at this point, but would welcome feedback.  I'm particularly interested in comments such as
  • "The instructions are unclear - I couldn't figure out how to do XXX"
  • "It didn't work in my environment" (please provide as many details as possible)
  • "We can't use it in our application" (why not?)
  • "It needs to support XXX feature"
  • "It produced an invalid query output when I fed in XXXX"

Downloads

The "download links" are for saving to a file - on most browsers you can right click and "Save Target As" or "Save Link As". 

If you want to view the code in a browser just click on "open in browser" (the file will have an extra .txt suffix if you then save it from the browser page).

parser.pls - main program file download link / open in browser

test.sql - for testing procedure download link / open in browser


Comments:

Have a look at my other entry: https://blogs.oracle.com/searchtech/entry/indexing_data_from_multiple_tables

In order to prioritize columns you can use multipliers in the query.
For example:
select * from tab where contains (text, ' ( oracle WITHIN title )*2 OR ( oracle WITHIN body )*0.5' ) > 0

In that query, hits in the title section will score 4x as much as hits in the body section.

Posted by guest on January 09, 2014 at 11:58 AM GMT #

Post a Comment:
  • HTML Syntax: NOT allowed
About

bocadmin_ww

Search

Categories
Archives
« April 2014
MonTueWedThuFriSatSun
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
    
       
Today