Getting started Part 2 - More queries

This post follows on from Getting Started with Oracle Text

We will be looking at Oracle Text queries in a bit more detail. We'll assume you've set up the QUICKSTART table as used in part one.

We mentioned before that Oracle Text CONTAINS queries look for whole words.  This means that a search for "smit" would not succeed, since "smit" does not occur as a whole word in any of our text.  If we want to do partial word matches, we must use wildcards. As with the SQL "LIKE" operator, we can use underscore ("_") to mean "any unknown single character" and percent ("%") to mean "any number of unknown characters - including none".

So to find words starting with "smit" we would do:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'smi%', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         7          3 Peter Smith-Smith
         4          2 John Smith

Again we see that the first record scores higher than the second as there are two matches for "smit%" in the first.

Note that a search for "%mit%" will work, but may be slow.  If you need to use leading wildcards (that is, a wild card at the beginning of the word) you should consider using the SUBSTRING_INDEX option - look it upif you need it.  But remember that Oracle Text is primarily a word-based index, it's not really designed, at least with its default settings, for finding substrings anywhere in the text.

So what if  we want to search for a phrase - two or more words together, in order?  That's easy, we just use the phrase in the search:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'peter smith', 99) > 0 order by score(99) desc;

Note that unlike search engine such as Google, the CONTAINS search is quite precise.  If you look for the phrase "peter smith", it will not find documents containing "Smith Peter" or "Peter John Smith".  If you wanted to find containing combinations of those words, you could use an AND or an OR query. Let's look at one of those now:


SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'john OR smith', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         7          3 Peter Smith-Smith
         4          1 John Doe
         4          2 John Smith

We'll take an aside to look at the scoring here - skip to the next paragraph if you're not interested.  The first row scores higher because "smith" occurs twice. OK, but why doesn't the third one score higher, as it has two hit terms. "john" and "smith"?  The answer is that the OR operator scores the higher of the two expressions it joins.  Since "john" and "smith" both score similar low scores, the result is just the one low score. In contrast, the AND operator scores the lower of two expressions.  This might seem unobvious, but it makes sense if you consider what happens when one of the search terms doesn't exist - and therefore scores zero.  The OR operator scores the higher of (zero and something), so always returns a score when one term is present. The AND operator scores the lower of (zero and something) - which is zero - so always returns a zero score unless both terms are present.

Let's see a few more simple queries and their results. First the AND operator we've already mention - both terms must be present.


SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'john AND smith', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         4          2 John Smith

Then there's the NOT operator - if the second term is present at all then the query fails: 

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'john NOT smith', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         4          1 John Doe

And the ACCUM operator - similar to OR but instead of scoring the lower of the two expressions, it adds them together (or accumulates them)

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'john ACCUM smith', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
        52          2 John Smith
         4          3 Peter Smith-Smith
         2          1 John Doe

 And a glimpse at one of the more powereful capabilities of the query language - the FUZZY operator: do an "inexact" search for words which are spelled or sound similar:

 SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'fuzzy((smythe))', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         7          3 Peter Smith-Smith
         4          2 John Smith

You can add these operators together in arbitrarily complex expressions, using parenthesis to establish precedent if required:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, '((john ACCUM peter) OR smith) NOT peter', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         4          2 John Smith
         2          1 John Doe

A few last things:

Oracle Text has a large number of reserved words - AND, OR, ACCUM and FUZZY all being in that list. If we want to search for any of those reserved words, we must enclose them in braces to negate their special meaning: 

select score(99), id, full_name from quickstart where contains ( full_name, '{fuzzy} OR {accum}', 99) > 0 order by score(99) desc;

Case sensitivity: CONTEXT indexes are case-insensitive by default in most environments. "smith" will match "Smith", etc. However, for certain languages (notably German) this is switched and the indexes are case-sensitive by default. The default language will be based on the normal NLS settings for your database.  Case sensitivity for an index can be chosen as a setting when the index is created.

This is only a glimpse at the capabilities of the Oracle Text CONTAINS query capabilities.  For a more detailed look at the operators available, see the documentation.

Move on to Part 3 - Index Maintenance

Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

bocadmin_ww

Search

Categories
Archives
« May 2015
MonTueWedThuFriSatSun
    
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
       
Today