Monday Aug 04, 2014

Getting started Part 4 - Advanced index customization

This post follows on from Part 1, Part 2 and Part 3 and uses the same example "quickstart" table.

Here we're going to look at some more advanced options for customizing the Oracle Text index.  If you just want basic full-text search on a text column, you may not need to know this at all. However, if you want to maximum use of Oracle Text, it's as well to understand the mechanisms used to customize the index.

So far we've looked at queries run against a default CONTEXT index, which you'll recall we created with the statement:

create index full_name_index on quickstart( full_name ) indextype is ctxsys.context;

CONTEXT indexes have many options available. We can't possibly cover them all here, but instead we'll look at the mechanism for applying customization to an index. This is done by creating a set of preferences, each of which contains its own attributes.  These preferences and attributes are created by calling PL/SQL stored procedures in the package CTX_DDL. It is in order to be able to call these procedures that we granted the CTXAPP role to our user in part one.

First of all, we'll consider defining join characters for our index. We saw earlier that the string "Smith-Smith" was indexed as two separate words, or tokens, in our index.  Indeed, we can check this by peeking at one of the tables which is created automatically when we created the index. The table we need to check is the so-called "$I table" - also known as the token table, which is given the name DR$index_name$I

SQL> select token_text from dr$full_name_index$i;

TOKEN_TEXT
----------------------------------------------------------------
DOE
JOHN
PETER
SMITH

Breaking up text into words or tokens is handled by a process known as the lexer.  By default, the hyphen character - like all non-alphanumerics charatcers -  is treated by the lexer as a break character. It causes a break between tokens.  We can also change it so that it joins tokens.  It can either become part of the token, so the token is "smith-smith" by defining it as a printjoins character, or it can invisibly join the token as "smithsmith" by defining it as a skipjoins character.  We'll make it a printjoins character.

Looking in the documentation, we find that PRINTJOINS is an attribute of the BASIC_LEXER.  So first we need to create a preference which uses the BASIC_LEXER:

We need to give the preference a name, so we'll call it my_lex_pref:

SQL> execute ctx_ddl.create_preference('my_lex_pref', 'BASIC_LEXER')

Then, for this preference, we need to set the printjoins attribute:

SQL> execute ctx_ddl.set_attribute('my_lex_pref', 'PRINTJOINS', '-')

So now we have a suitable preference, which we can include in the index via an index parameters clause.  First let's drop our existing index:

SQL> drop index full_name_index;

Index dropped.

And create a new one, specifying our newly-created lexer preference in the parameters clause:

SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context
  parameters ('lexer my_lex_pref');

Index created.

Now if we peek into the index table again, we'll see something different:

SQL> select token_text from dr$full_name_index$i;

TOKEN_TEXT
----------------------------------------------------------------
DOE
JOHN
PETER
SMITH
SMITH-SMITH

So we can see that now the "-" character has been included in the indexed token.  We can also test this with a query. By default the "-" character is an operator in the CONTAINS query language (it means the same as MINUS - subtract the score of expression2 from expression1, similar but not the same as NOT), so we will need to escape it to keep it as part of the query:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'smith\-smith', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         4          3 Peter Smith-Smith

We've examined one way of customizing the CONTEXT index. There are many other ways that we don't have space to consider here - see the documentation for more details.  Most of these customizations use the same model of creating a preference then setting an attribute for that preference - but there are a couple of exceptions which we'll look at.

Firstly, Oracle Text has the concept of a default stoplist. This contains a list of common "noise" words which are not particularly useful for searching and take up a lot of space in the index.  In English, examples are in, on, and, for, to, that, etc. If we chose, we can use an empty stoplist (predefined as CTXSYS.EMPTY_STOPLIST) or create our own custom stoplist, and add stopwords to it:

SQL> execute ctx_ddl.create_stoplist('my_stoplist')
PL/SQL procedure successfully completed.

SQL> execute ctx_ddl.add_stopword('my_stoplist', 'JOHN')
PL/SQL procedure successfully completed.

SQL> execute ctx_ddl.add_stopword('my_stoplist', 'PETER')
PL/SQL procedure successfully completed.

SQL> drop index full_name_index;
Index dropped.

SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context parameters ('stoplist my_stoplist');
Index created.

Now if we look in our $I table we will see a reduced set of tokens:

SQL> select token_text from dr$full_name_index$i;

TOKEN_TEXT
----------------------------------------------------------------
DOE
SMITH

So what happens if we use a stopword in a query? The answer is that it matches any word:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'john smith', 99) > 0 order by score(99) desc;

SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
       7            3 Peter Smith-Smith
       4            2 John Smith

The other customization which doesn't use preferences is section groups. These are designed to allow us to search within specific sections of the document such as title, author, etc. The simplest one to use is AUTO_SECTION_GROUP which automatically identifies any section marked up with XML-like tags.  To illustrate this, we'll need to add a new record to our table.

insert into quickstart values (7, 'US', '<first>John</first> <last>Smith</last>');

Then we'll create a new section group using AUTO_SECTION_GROUP, and recreate the index using it:

SQL> execute ctx_ddl.create_section_group('my_sec_grp', 'AUTO_SECTION_GROUP') PL/SQL procedure successfully completed.

SQL> drop index full_name_index;

Index dropped.

SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context parameters ('section group my_sec_grp');

Index created.

And we can then query in the FIRST section using the WITHIN operator:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'john WITHIN first', 99) > 0 order by score(99) desc;

SCORE(99)          ID FULL_NAME
---------- ---------- ----------------------------------------
         3          7 <first>John</first> <last>Smith</last>

That concludes our brief look at index customization.  There is much more to it - please look through the doc for more ideas.

Getting started Part 3 - Index maintenance

This post follows on from Part 1 and Part 2, and uses the same example "quickstart" table.

One thing that surprised new users is that Oracle Text CONTEXT indexes are not synchronous. That is, updates to the table are not immediately reflected in the index. Instead, the index must be synchronized. We can do this manually with a call to CTX_DLL.SYNC_INDEX, or we can arrange to have it done automatically for us.

Let's show the manual method first:. We'll insert a new run then search for it. We won't find it because the index is not up-to-date. Then we'll call SYNC_INDEX, giving it the name of the index, and search again:

SQL> insert into quickstart values (4, 'GB', 'Michael Smith');
1 row created.

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'michael', 99) > 0 order by score(99) desc;
no rows selected

SQL> execute ctx_ddl.sync_index('full_name_index')
PL/SQL procedure successfully completed.

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'michael', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ----------------------------------------
         5          4 Michael Smith

If we don't want to bother with manual SYNC calls,  we can use the parameter string "SYNC (ON COMMIT)" when we create the index. That means that immediately after updates are committed to the table, a SYNC operation will be run to get the index into step with the table.

SQL> drop index full_name_index;
Index dropped.

SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context parameters ('sync (on commit)');
Index created.

SQL> insert into quickstart values (5, 'US', 'Scott Peters');
1 row created

We can search for that, but we won't find it since we haven't yet committed - so the SYNC has not run: 

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'scott', 99) > 0 order by score(99) desc;
no rows selected

Now we can commit...

SQL> commit;
Commit complete.

And this time it will find it:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'scott', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ----------------------------------------
         5          5 Scott Peters

SYNC(ON COMMIT) is fine if you don't make too many updates to your table, but is not ideal if you have a very large table and lots of changes take place on it.

Because of the way CONTEXT indexes are built, frequent changes will cause fragmentation of the index, decreasing performance over time. So a common way is to arrange for indexes to be synchronized at certain fixed intervals.  We can do this manually, using the database scheduler, or we can have it done automatically with a "SYNC (EVERY ... )" string in the parameters clause.  The time clause in that uses scheduler syntax, and can be a little convoluted. Every 1 minute can be represented in days as 1/24/60", so we can try the following:

SQL> drop index full_name_index;
Index dropped.

SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context parameters ('sync ( every SYSDATE+1/24/60 )');
Index created.

SQL> insert into quickstart values (6, 'US', 'Paul Peters');
1 row created.

SQL> commit;
Commit complete.

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'paul', 99) > 0 order by score(99) desc;

no rows selected

So we've committed, but the regular SYNC hasn't kicked off yet. We can wait a minute and try again, and this time we'll find it:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'paul', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ----------------------------------------
         5          6 Paul Peters

Back in Part 1 we said that it was necessary to grant the "CREATE JOB" privilege to our user - this is why.  If the user doesn't have CREATE JOB privilege, then we will get an "insufficient privileges" error if we try to create the index with "SYNC (EVERY ...)".

Finally, we should talk about index optimization. CONTEXT indexes, over time, get less efficient as they get updated.  This inefficiency takes two forms:

  1. The index gets filled with garbage - it contains references to deleted or updated documents which are no longer current
  2. The "postings lists" get fragmented. The pointers to which documents contain each word get broken up into small chunks instead of the idea long strings.

We can fix this by running index optimization.  As with the manual call to sync indexes, this is done with a call to the PL/SQL package CTX_DDL:

SQL> execute ctx_ddl.optimize_index('full_name_index', 'FULL')
PL/SQL procedure successfully completed. 

The "FULL" parameter there is the mode of optimization. You can choose from the following:

  • FAST: Only posting fragmentation is fixed - no garbage collection occurs
  • FULL: Both fragmentation and garbage collection is dealt with
  • REBUILD: The entire index is copied to a fresh table (requires more disk space but produces the best results).

The more frequently you run SYNC, the more fragmented your index will become, and the more often you will need to optimize.

A typical "maintenance regime" at many customers is to run SYNC every five minutes, then run OPTIMIZE in FULL mode nightly, and OPTIMIZE in REBUILD mode once a week. 

The next topic is for advanced users: Getting Started Part 4 - Advanced Index Customization

About

bocadmin_ww

Search

Categories
Archives
« August 2014
MonTueWedThuFriSatSun
    
1
2
3
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
       
Today