Monday Jul 28, 2014

Getting started Part 2 - More queries

[Read More]

Thursday Jul 24, 2014

Getting started with Oracle Text

My previous posts have mostly been aimed at experienced Oracle Text users. But what if you (or your colleagues) have never used it before?  What are the basics to get started? I hope this post will help some of those people.

Because I'm old-school (or maybe just old), my examples will be listed as though you're entering them in command-line SQL*Plus. But they will work just as well if you use them in SQL Developer.

First of all, any user of Oracle Text needs a special role CTXAPP.   Let's create a new user DEMO with that role, as well as the common CONNECT and RESOURCE roles - and also CREATE JOB which we'll need later in this series of tutorials.

create user demo identified by demo default tablespace users temporary tablespace temp quota unlimited on users;

grant connect, resource, ctxapp, create job to demo;

Now if we're going to call Oracle Text functionality from within PL/SQL procedures, we need some extra grants (since grants to roles don't work in packages).  See Step 3 in the Text Developer's Guide. We won't cover that here.

Now we'll assume the user has an application which queries a table called QUICKSTART.  For the demo, we should login as the DEMO user we just created, and run:

create table quickstart (id number primary key, country varchar2(2), full_name varchar2(40));

 We'll insert some records into that:

insert into quickstart values (1, 'US', 'John Doe');

insert into quickstart values (2, 'GB', 'John Smith');

insert into quickstart values (3, 'NZ', 'Peter Smith-Smith');

Now a simple query against that table might look like

select * from quickstart where country = 'GB';

Now let's say we want to search for anyone with "Smith" in their name.  We could do this by simply extending the query to:

select * from quickstart where country = 'GB' and upper(full_name) like '%SMITH%';

but there are a few problems with this:

  • There is no index. The kernel must read every full_name field and scan it. Not a problem here, but certainly a problem if there are millions of records.
  • The query would match SMITHSON and BLACKSMITH. Not a problem if that's what we want, but tricky otherwise.

We can create a word-based index on the FULL_NAME column using the following syntax:

create index full_name_index on quickstart( full_name ) indextype is ctxsys.context;

 Note this is like any other "create index" statement, with the addition of the phrase "indextype is ctxsys.context", telling the kernel to create a specialized CONTEXT index and allow the use of query operators associated with that indextype.

For example we can do:

 select * from quickstart where country = 'GB' and contains ( full_name, 'smith') > 0;

A few things to note:

  • The CONTAINS operator can only be used if an index is present on the column specified in the first argument
  • The search string doesn't need any wildcards either side.  We are looking for the whole word 'smith' (if we wanted to match 'smithson' as well, then we could use a wildcard: 'smith%').
  • SQL doesn't have boolean functions, so CONTAINS returns zero for a non-match, and greater-than-zero for a match.  The actual value returned is the 'score' (see later) but there are very rarely any circumstances where you do anything other than testing where the return value is greater than zero.

Scoring: Most SQL queries just select between rows that match a criterion, and those that don't. There is rarely the concept of a good or better match. But if we're searching a lot of technical papers for the word "Exadata", then a document which has many occurrences of this term is likely to be a better match for our search than one where there is only a single match. Therefore CONTEXT indexes return a score for each search.  As mentioned, that is returned by the CONTAINS clause, but is rarely useful there.  It's usually better to use it in the SELECT clause, and perhaps in an ORDER BY clause as well.  A score is associated with a particular CONTAINS clause, of which a query may have several, so we use a number as the third argument in the CONTAINS function, and SCORE operator takes the same number.  It doesn't matter what that number is so long as it's the same in both cases.  We'll use 99 in this example:

select score(99), id, full_name from quickstart where contains ( full_name, 'smith', 99) > 0 order by score(99) desc;

This produces the output:

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         7          3 Peter Smith-Smith
         4          2 John Smith

Note that the first item here scores higher than the second, because the search term smith appears twice.  Note there is no "absolute" meaning to those scores - you can't say  the first is a "7% match" or anything like that - all you can say is that a record with a higher score is more relevant than one with a lower score.

Scoring is quite a complex topic, which we'll cover in more detail later.

Move ahead to Part 2 - More queries




« July 2014 »