An Oracle blog about Oracle Text Index

  • July 11, 2012

Oracle Text query parser

Roger Ford
Product Manager
Oracle Text provides a rich query syntax which enables powerful text searches.

However, this syntax isn't intended for use by inexperienced end-users.  If you provide a simple search box in your application, you probably want users to be able to type "Google-like" searches into the box, and have your application convert that into something that Oracle Text understands.

For example if your user types "windows nt networking" then you probably want to convert this into something like
"windows ACCUM nt ACCUM networking".  But beware - "NT" is a reserved word, and needs to be escaped.  So let's escape all words:
"{windows} ACCUM {nt} ACCUM {networking}".  That's fine - until you start introducing wild cards. Then you must escape only non-wildcarded searches:
"win% ACCUM {nt} ACCUM {networking}".  There are quite a few other "gotchas" that you might encounter along the way.

Then there's the issue of scoring.  Given a query for "oracle text query syntax", it would be nice if we could score a full phrase match higher than a hit where all four words are present but not in a phrase.  And then perhaps lower than that would be a document where three of the four terms are present.  Progressive relaxation helps you with this, but you need to code the "progression" yourself in most cases.

To help with this, I've developed a query parser which will take queries in Google-like syntax, and convert them into Oracle Text queries. It's designed to be as flexible as possible, and will generate either simple queries or progressive relaxation queries. The input string will typically just be a string of words, such as "oracle text query syntax" but the grammar does allow for more complex expressions:

  word : score will be improved if word exists
  +word : word must exist
  -word : word CANNOT exist
  "phrase words" : words treated as phrase (may be preceded by + or -)
  field:(expression) : find expression (which allows +,- and phrase as above) within "field".

So for example if I searched for
  +"oracle text" query +syntax -ctxcat
Then the results would have to contain the phrase "oracle text" and the word syntax. Any documents mentioning ctxcat would be excluded from the results.

All the instructions are in the top of the file (see "Downloads" at the bottom of this blog entry).  Please download the file, read the instructions, then
try it out by running "parser.pls" in either SQL*Plus or SQL Developer.

I am also uploading a test file "test.sql". You can run this and/or modify it to run your own tests or run against your own text index. test.sql is designed to be run from SQL*Plus and may not produce useful output in SQL Developer (or it may, I haven't tried it).

I'm putting the code up here for testing and comments. I don't consider it "production ready" at this point, but would welcome feedback.  I'm particularly interested in comments such as
  • "The instructions are unclear - I couldn't figure out how to do XXX"
  • "It didn't work in my environment" (please provide as many details as possible)
  • "We can't use it in our application" (why not?)
  • "It needs to support XXX feature"
  • "It produced an invalid query output when I fed in XXXX"


The "download links" are for saving to a file - on most browsers you can right click and "Save Target As" or "Save Link As". 

If you want to view the code in a browser just click on "open in browser" (the file will have an extra .txt suffix if you then save it from the browser page).

parser.pls - main program file : download link  open in browser

test.sql - for testing procedure : download link  open in browser


Join the discussion

Comments ( 16 )
  • guest Thursday, January 9, 2014

    Have a look at my other entry: https://blogs.oracle.com/searchtech/entry/indexing_data_from_multiple_tables

    In order to prioritize columns you can use multipliers in the query.

    For example:

    select * from tab where contains (text, ' ( oracle WITHIN title )*2 OR ( oracle WITHIN body )*0.5' ) > 0

    In that query, hits in the title section will score 4x as much as hits in the body section.

  • Dmitry Wednesday, February 24, 2016


    Is there any updated version of this code which is more production-ready?

    Because I'm working exactly on this kind of single input box interface to Oracle full text search.


  • guest Wednesday, February 24, 2016

    The current links have code which fixes a couple of issues people have found during testing. The code IS in use in production systems, so to that extent I would say it is now production ready. However, the code still has the status of "sample code" - if you use it, you must own it and be prepared to maintain it, if necessary. I'm very happy to personally help with any issues you might find (in terms of bugs, probably not upgrades to add further functionality), but Oracle Support will not be able to help with it.

  • guest Friday, February 26, 2016

    Hi, very nice project. However, it has significant bug.

    reParse procedure uses regular expressions to split string into single words and when there are two or more following phrases that should be treated as words, they will be matched as single phrase. So, having search string like

    "this is phrase1" "this is phrase2" "this is phrase3" word

    the coded regular expression will match only 2 words:

    1) "this is phrase1" "this is phrase2" "this is phrase3"

    2) word

    What I expected was outcome with 4 words:

    1) "this is phrase1"

    2) "this is phrase2"

    3) "this is phrase3"

    4) word

    I think the code that generates following regular expression: ([+-]?[[:alnum:]]+[*]?)|([+-]?".+") should be changed to generate ([+-]?[[:alnum:]]+[*]?)|([+-]?"[^"]+")

  • david Monday, November 6, 2017
    but near search does not seem to work there is a mistake.
  • Roger Monday, November 6, 2017
    You mean the parser.nearSearch function? How doesn't it work? A simple case like this works fine:
    SQL> select * from t where contains(c, parser.nearsearch('dog cat rabbit')) > 0;

    the blue dog chased the rabbit past the cat
  • LILIAN Monday, January 28, 2019
    Hi Roger,

    This is a marvelous example code that can fulfill most requests of our users. Thanks a lot.

    I have a question for the wildcard search, is there a way to put the wildcard in the front of the word instead of the back of the word?

    If I search like this -
    SELECT score (1) score, text
    FROM xx_avtest
    WHERE contains (text, parser.simplesearch ('rab*'), 1) > 0

    I can get the result without problem.
    But if I search like this-
    SELECT score (1) score, text
    FROM xx_avtest
    WHERE contains (text, parser.simplesearch ('*bit'), 1) > 0

    no result.

    How can I make it work? Thanks for any hints in advance.

    Best regards,
  • Roger Ford Monday, January 28, 2019
    Thanks for the comments, Lillian.

    Leading wildcards are not supported because they're often quite inefficient. Unless you have SUBSTRING_INDEX or PREFIX_INDEX switched on, then using a leading wildcard will prevent the use of the index on the $I token table - meaning it will need to do a full table scan to find the words that match your '*bit'.

    If you do have SUBSTRING_INDEX or PREFIX_INDEX on, then it would need a code change in the procedure to allow that. I've had a quick look and it might not be too complicated (though I could have misjudged it!), I'll see if I can get it to work at some point this week.

  • Lilian Tuesday, January 29, 2019
    Hi Roger,

    Thanks for your quick reply.

    I did set SUBSTRING_INDEX to TRUE. And I tried to add a wildcard before each word. (In 'reparse' procedure add: match := syswild || match;) This way, I can search with '*bit'.

    But when I search by parser.progrelax ('+*dog +*bit*'), the record without 'dog' also searched. Seems the wildcard make the word optional regardless the '+'. Can I still make each word requested?

    Sorry for keeping disturbing and thanks for any comments in advance.

    Best regards,
  • Roger Ford Tuesday, January 29, 2019
    It's a little more complicated, especially if we have to allow for double truncation (eg. '*bit*').

    In the code, where you see:

    if substr(match, length(match), 1) = userWild then
    match := substr(match, 1, length(match)-1);
    if length(joinChars) > 0 then
    match := regexp_replace (match, '(['||jc1||joinC||jc2||jc3||'])', '\1');
    end if;
    match := match || sysWild;
    match := regexp_replace (match, '^('||lc||')(.*)', '1{2}');
    end if;

    That code must be replaced with:

    if substr(match, length(match), 1) = userWild or substr(match, 1, 1) = userWild then

    if substr(match, length(match), 1) = userWild AND substr(match, 1, 1) = userWild then
    -- double truncation
    match := substr(match, 2, length(match)-2);
    if length(joinChars) > 0 then
    p('doing joinchars');
    match := regexp_replace (match, '(['||jc1||joinC||jc2||jc3||'])', '\1');
    end if;
    match := sysWild || match || sysWild;
    if substr(match, length(match), 1) = userWild then
    -- trailing wildcard
    match := substr(match, 1, length(match)-1);
    if length(joinChars) > 0 then
    match := regexp_replace (match, '(['||jc1||joinC||jc2||jc3||'])', '\1');
    end if;
    match := match || sysWild;
    -- leading wildcard
    match := substr(match, 2, length(match));
    if length(joinChars) > 0 then
    match := regexp_replace (match, '(['||jc1||joinC||jc2||jc3||'])', '\1');
    end if;
    match := sysWild || match;
    end if;
    end if;
    match := regexp_replace (match, '^('||lc||')(.*)', '1{2}');
    end if;

    I haven't done extensive testing, but as far as I can see it appears to work properly. Let me know if you find any problems.

  • Roger Ford Tuesday, January 29, 2019
    Lovely. The blog software has removed all the formatting. Sorry about that. Hopefully you can still use it, or email me for a full version. My email address is easy to guess (given we use first.last at oracle.com) or you can find me on LinkedIn.
  • Lilian Wednesday, January 30, 2019
    It works after adding a wildcard (ec) in pattern behind leading char (lc).

    Thank you very much.

  • Said Bouzenzana Tuesday, October 8, 2019
    does somebody knows why when doing a google like exact match (using double quote ) it’s removing the dot ???
    select parser.simpleSearch ('"XXXX YYB 12.0011U"') from dual
    is giving ({XXXX YYB 12 0011U})
  • Roger Ford Tuesday, October 8, 2019
    Because, generally speaking, the '.' character is a break character and therefore not indexed. All break characters cause the words to be split, and in simpleSearch words are separated by spaces.

    However, I can see now this could cause problems with numbers - or strings that look like numbers to the query lexer.

    I'm not sure how easy that will be to fix, I'll take a look at the code.

    Thanks for the feedback.
  • Vi Monday, October 28, 2019
    Hi Roger,

    Thanks a lot for your support.

    can we search with the following syntax:
    (+a or +b) -d -e

    it means the result must contain the word a OR b, and not contain d AND e.

    Thanks Roger.
  • Roger Ford Monday, October 28, 2019
    No, I don't think that's possible as there is no explicit OR in the syntax, and as soon as you use +a it means that it must be there, regardless of what else you include in the query.
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.