Oracle Text provides a rich query syntax which enables powerful text searches.
However, this syntax isn't intended for use by inexperienced end-users. If you provide a simple search box in your application, you probably want users to be able to type "Google-like" searches into the box, and have your application convert that into something that Oracle Text understands.
For example if your user types "windows nt networking" then you probably want to convert this into something like
"windows ACCUM nt ACCUM networking". But beware - "NT" is a reserved word, and needs to be escaped. So let's escape all words:
"{windows} ACCUM {nt} ACCUM {networking}". That's fine - until you start introducing wild cards. Then you must escape only non-wildcarded searches:
"win% ACCUM {nt} ACCUM {networking}". There are quite a few other "gotchas" that you might encounter along the way.
Then there's the issue of scoring. Given a query for "oracle text query syntax", it would be nice if we could score a full phrase match higher than a hit where all four words are present but not in a phrase. And then perhaps lower than that would be a document where three of the four terms are present.
Progressive relaxation helps you with this, but you need to code the "progression" yourself in most cases.
To help with this, I've developed a query parser which will take queries in Google-like syntax, and convert them into Oracle Text queries. It's designed to be as flexible as possible, and will generate either simple queries or progressive relaxation queries. The input string will typically just be a string of words, such as "oracle text query syntax" but the grammar does allow for more complex expressions:
word : score will be improved if word exists
+word : word must exist
-word : word CANNOT exist
"phrase words" : words treated as phrase (may be preceded by + or -)
field:(expression) : find expression (which allows +,- and phrase as above) within "field".
So for example if I searched for
+"oracle text" query +syntax -ctxcat
Then the results would have to contain the phrase "oracle text" and the word syntax. Any documents mentioning ctxcat would be excluded from the results.
All the instructions are in the top of the file (see "Downloads" at the bottom of this blog entry). Please download the file, read the instructions, then
try it out by running "parser.pls" in either SQL*Plus or SQL Developer.
I am also uploading a test file "test.sql". You can run this and/or modify it to run your own tests or run against your own text index. test.sql is designed to be run from SQL*Plus and may not produce useful output in SQL Developer (or it may, I haven't tried it).
I'm putting the code up here for testing and comments. I don't consider it "production ready" at this point, but would welcome feedback. I'm particularly interested in comments such as
- "The instructions are unclear - I couldn't figure out how to do XXX"
- "It didn't work in my environment" (please provide as many details as possible)
- "We can't use it in our application" (why not?)
- "It needs to support XXX feature"
- "It produced an invalid query output when I fed in XXXX"
Downloads
The "download links" are for saving to a file - on most browsers you can right click and "Save Target As" or "Save Link As".
If you want to view the code in a browser just click on "open in browser" (the file will have an extra .txt suffix if you then save it from the browser page).
parser.pls - main program file : download link open in browser
test.sql - for testing procedure : download link open in browser
Have a look at my other entry: https://blogs.oracle.com/searchtech/entry/indexing_data_from_multiple_tables
In order to prioritize columns you can use multipliers in the query.
For example:
select * from tab where contains (text, ' ( oracle WITHIN title )*2 OR ( oracle WITHIN body )*0.5' ) > 0
In that query, hits in the title section will score 4x as much as hits in the body section.
Roger,
Is there any updated version of this code which is more production-ready?
Because I'm working exactly on this kind of single input box interface to Oracle full text search.
Regards.
The current links have code which fixes a couple of issues people have found during testing. The code IS in use in production systems, so to that extent I would say it is now production ready. However, the code still has the status of "sample code" - if you use it, you must own it and be prepared to maintain it, if necessary. I'm very happy to personally help with any issues you might find (in terms of bugs, probably not upgrades to add further functionality), but Oracle Support will not be able to help with it.
Hi, very nice project. However, it has significant bug.
reParse procedure uses regular expressions to split string into single words and when there are two or more following phrases that should be treated as words, they will be matched as single phrase. So, having search string like
"this is phrase1" "this is phrase2" "this is phrase3" word
the coded regular expression will match only 2 words:
1) "this is phrase1" "this is phrase2" "this is phrase3"
2) word
What I expected was outcome with 4 words:
1) "this is phrase1"
2) "this is phrase2"
3) "this is phrase3"
4) word
I think the code that generates following regular expression: ([+-]?[[:alnum:]]+[*]?)|([+-]?".+") should be changed to generate ([+-]?[[:alnum:]]+[*]?)|([+-]?"[^"]+")
SQL> select * from t where contains(c, parser.nearsearch('dog cat rabbit')) > 0;
C
--------------------------------------------------------------------------------
the blue dog chased the rabbit past the cat
This is a marvelous example code that can fulfill most requests of our users. Thanks a lot.
I have a question for the wildcard search, is there a way to put the wildcard in the front of the word instead of the back of the word?
If I search like this -
SELECT score (1) score, text
FROM xx_avtest
WHERE contains (text, parser.simplesearch ('rab*'), 1) > 0
I can get the result without problem.
But if I search like this-
SELECT score (1) score, text
FROM xx_avtest
WHERE contains (text, parser.simplesearch ('*bit'), 1) > 0
no result.
How can I make it work? Thanks for any hints in advance.
Best regards,
Lilian
Leading wildcards are not supported because they're often quite inefficient. Unless you have SUBSTRING_INDEX or PREFIX_INDEX switched on, then using a leading wildcard will prevent the use of the index on the $I token table - meaning it will need to do a full table scan to find the words that match your '*bit'.
If you do have SUBSTRING_INDEX or PREFIX_INDEX on, then it would need a code change in the procedure to allow that. I've had a quick look and it might not be too complicated (though I could have misjudged it!), I'll see if I can get it to work at some point this week.
Roger
Thanks for your quick reply.
I did set SUBSTRING_INDEX to TRUE. And I tried to add a wildcard before each word. (In 'reparse' procedure add: match := syswild || match;) This way, I can search with '*bit'.
But when I search by parser.progrelax ('+*dog +*bit*'), the record without 'dog' also searched. Seems the wildcard make the word optional regardless the '+'. Can I still make each word requested?
Sorry for keeping disturbing and thanks for any comments in advance.
Best regards,
Lilian
In the code, where you see:
if substr(match, length(match), 1) = userWild then
match := substr(match, 1, length(match)-1);
if length(joinChars) > 0 then
match := regexp_replace (match, '(['||jc1||joinC||jc2||jc3||'])', '\1');
end if;
match := match || sysWild;
else
match := regexp_replace (match, '^('||lc||')(.*)', '1{2}');
end if;
That code must be replaced with:
if substr(match, length(match), 1) = userWild or substr(match, 1, 1) = userWild then
if substr(match, length(match), 1) = userWild AND substr(match, 1, 1) = userWild then
-- double truncation
match := substr(match, 2, length(match)-2);
if length(joinChars) > 0 then
p('doing joinchars');
match := regexp_replace (match, '(['||jc1||joinC||jc2||jc3||'])', '\1');
end if;
match := sysWild || match || sysWild;
else
if substr(match, length(match), 1) = userWild then
-- trailing wildcard
match := substr(match, 1, length(match)-1);
if length(joinChars) > 0 then
match := regexp_replace (match, '(['||jc1||joinC||jc2||jc3||'])', '\1');
end if;
match := match || sysWild;
else
-- leading wildcard
match := substr(match, 2, length(match));
if length(joinChars) > 0 then
match := regexp_replace (match, '(['||jc1||joinC||jc2||jc3||'])', '\1');
end if;
match := sysWild || match;
end if;
end if;
else
match := regexp_replace (match, '^('||lc||')(.*)', '1{2}');
end if;
I haven't done extensive testing, but as far as I can see it appears to work properly. Let me know if you find any problems.
Roger
Thank you very much.
Lilian
does somebody knows why when doing a google like exact match (using double quote ) it’s removing the dot ???
select parser.simpleSearch ('"XXXX YYB 12.0011U"') from dual
is giving ({XXXX YYB 12 0011U})
However, I can see now this could cause problems with numbers - or strings that look like numbers to the query lexer.
I'm not sure how easy that will be to fix, I'll take a look at the code.
Thanks for the feedback.
Thanks a lot for your support.
can we search with the following syntax:
(+a or +b) -d -e
it means the result must contain the word a OR b, and not contain d AND e.
Thanks Roger.