Monday Aug 04, 2014

Getting started Part 4 - Advanced index customization

This post follows on from Part 1, Part 2 and Part 3 and uses the same example "quickstart" table.

Here we're going to look at some more advanced options for customizing the Oracle Text index.  If you just want basic full-text search on a text column, you may not need to know this at all. However, if you want to make maximum use of Oracle Text, it's well worth understanding the mechanisms used to customize the index.

So far we've looked at queries run against a default CONTEXT index, which you'll recall we created with the statement:

create index full_name_index on quickstart( full_name ) indextype is ctxsys.context;

CONTEXT indexes have many options available. We can't possibly cover them all here, but we will look at the mechanism for applying customization to an index. This is done by creating a set of preferences, each of which contains its own attributes.  These preferences and attributes are created by calling PL/SQL stored procedures in the package CTX_DDL. It was in order to call these procedures that we granted the CTXAPP role to our user in Part 1.

First of all, we'll consider defining join characters for our index. We saw earlier that the string "Smith-Smith" was indexed as two separate words, or tokens.  Indeed, we can check this by peeking at one of the tables which was created automatically alongside the index. The table we need to check is the so-called "$I table" - also known as the token table - which is given the name DR$index_name$I:

SQL> select token_text from dr$full_name_index$i;

TOKEN_TEXT
----------------------------------------------------------------
DOE
JOHN
PETER
SMITH

Breaking up text into words or tokens is handled by a process known as the lexer.  By default, the hyphen character - like all non-alphanumeric characters - is treated by the lexer as a break character: it causes a break between tokens.  We can also change it so that it joins tokens.  It can either become part of the token, so the token is "smith-smith", by defining it as a printjoins character, or it can invisibly join the token as "smithsmith" by defining it as a skipjoins character.  We'll make it a printjoins character.

Looking in the documentation, we find that PRINTJOINS is an attribute of the BASIC_LEXER.  So first we need to create a preference which uses the BASIC_LEXER:

We need to give the preference a name, so we'll call it my_lex_pref:

SQL> execute ctx_ddl.create_preference('my_lex_pref', 'BASIC_LEXER')

Then, for this preference, we need to set the printjoins attribute:

SQL> execute ctx_ddl.set_attribute('my_lex_pref', 'PRINTJOINS', '-')

So now we have a suitable preference, which we can include in the index via an index parameters clause.  First let's drop our existing index:

SQL> drop index full_name_index;

Index dropped.

And create a new one, specifying our newly-created lexer preference in the parameters clause:

SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context
  parameters ('lexer my_lex_pref');

Index created.

Now if we peek into the index table again, we'll see something different:

SQL> select token_text from dr$full_name_index$i;

TOKEN_TEXT
----------------------------------------------------------------
DOE
JOHN
PETER
SMITH
SMITH-SMITH

So we can see that now the "-" character has been included in the indexed token.  We can also test this with a query. By default the "-" character is an operator in the CONTAINS query language (it means the same as MINUS - subtract the score of expression2 from expression1, similar but not the same as NOT), so we will need to escape it to keep it as part of the query:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'smith\-smith', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         4          3 Peter Smith-Smith
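As an alternative to escaping the single character with a backslash, we can enclose the term in curly braces, which tells Oracle Text to treat everything inside them as literal text rather than as query operators:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, '{smith-smith}', 99) > 0 order by score(99) desc;

This should return the same row as the backslash-escaped query above.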

We've examined one way of customizing the CONTEXT index. There are many other ways that we don't have space to consider here - see the documentation for more details.  Most of these customizations use the same model of creating a preference then setting an attribute for that preference - but there are a couple of exceptions which we'll look at.
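As one more illustration of the preference-and-attribute model, here's a sketch of a wordlist preference - the name my_word_pref is just an example - which tells Oracle Text to build a prefix index. This can greatly speed up right-truncated wildcard queries such as 'smith%', at the cost of a somewhat larger index:

SQL> execute ctx_ddl.create_preference('my_word_pref', 'BASIC_WORDLIST')

SQL> execute ctx_ddl.set_attribute('my_word_pref', 'PREFIX_INDEX', 'YES')

SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context
  parameters ('wordlist my_word_pref');

(As before, you'd need to drop any existing index of the same name first.)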

Firstly, Oracle Text has the concept of a default stoplist. This contains a list of common "noise" words which are not particularly useful for searching and take up a lot of space in the index.  In English, examples are in, on, and, for, to, that, etc. If we choose, we can use an empty stoplist (predefined as CTXSYS.EMPTY_STOPLIST) or create our own custom stoplist, and add stopwords to it:

SQL> execute ctx_ddl.create_stoplist('my_stoplist')
PL/SQL procedure successfully completed.

SQL> execute ctx_ddl.add_stopword('my_stoplist', 'JOHN')
PL/SQL procedure successfully completed.

SQL> execute ctx_ddl.add_stopword('my_stoplist', 'PETER')
PL/SQL procedure successfully completed.

SQL> drop index full_name_index;
Index dropped.

SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context parameters ('stoplist my_stoplist');
Index created.

Now if we look in our $I table we will see a reduced set of tokens:

SQL> select token_text from dr$full_name_index$i;

TOKEN_TEXT
----------------------------------------------------------------
DOE
SMITH

So what happens if we use a stopword in a query? The answer is that it matches any word:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'john smith', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         7          3 Peter Smith-Smith
         4          2 John Smith

The other customization which doesn't use preferences is section groups. These are designed to allow us to search within specific sections of the document such as title, author, etc. The simplest one to use is AUTO_SECTION_GROUP which automatically identifies any section marked up with XML-like tags.  To illustrate this, we'll need to add a new record to our table.

insert into quickstart values (7, 'US', '<first>John</first> <last>Smith</last>');

Then we'll create a new section group using AUTO_SECTION_GROUP, and recreate the index using it:

SQL> execute ctx_ddl.create_section_group('my_sec_grp', 'AUTO_SECTION_GROUP')
PL/SQL procedure successfully completed.

SQL> drop index full_name_index;

Index dropped.

SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context parameters ('section group my_sec_grp');

Index created.

And we can then query in the FIRST section using the WITHIN operator:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'john WITHIN first', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ----------------------------------------
         3          7 <first>John</first> <last>Smith</last>

That concludes our brief look at index customization.  There is much more to it - please look through the doc for more ideas.

Getting started Part 3 - Index maintenance

This post follows on from Part 1 and Part 2, and uses the same example "quickstart" table.

One thing that surprises new users is that Oracle Text CONTEXT indexes are not synchronous. That is, updates to the table are not immediately reflected in the index. Instead, the index must be synchronized. We can do this manually with a call to CTX_DDL.SYNC_INDEX, or we can arrange to have it done automatically for us.

Let's show the manual method first. We'll insert a new row, then search for it. We won't find it, because the index is not up-to-date. Then we'll call SYNC_INDEX, giving it the name of the index, and search again:

SQL> insert into quickstart values (4, 'GB', 'Michael Smith');
1 row created.

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'michael', 99) > 0 order by score(99) desc;
no rows selected

SQL> execute ctx_ddl.sync_index('full_name_index')
PL/SQL procedure successfully completed.

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'michael', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ----------------------------------------
         5          4 Michael Smith

If we don't want to bother with manual SYNC calls,  we can use the parameter string "SYNC (ON COMMIT)" when we create the index. That means that immediately after updates are committed to the table, a SYNC operation will be run to get the index into step with the table.

SQL> drop index full_name_index;
Index dropped.

SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context parameters ('sync (on commit)');
Index created.

SQL> insert into quickstart values (5, 'US', 'Scott Peters');
1 row created.

We can search for that, but we won't find it since we haven't yet committed - so the SYNC has not run: 

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'scott', 99) > 0 order by score(99) desc;
no rows selected

Now we can commit...

SQL> commit;
Commit complete.

And this time it will find it:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'scott', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ----------------------------------------
         5          5 Scott Peters

SYNC(ON COMMIT) is fine if you don't make too many updates to your table, but is not ideal if you have a very large table and lots of changes take place on it.

Because of the way CONTEXT indexes are built, frequent changes will cause fragmentation of the index, decreasing performance over time. So a common approach is to arrange for indexes to be synchronized at certain fixed intervals.  We can do this manually, using the database scheduler, or we can have it done automatically with a "SYNC (EVERY ... )" string in the parameters clause.  The time clause uses scheduler syntax, and can be a little convoluted. Every minute can be represented in days as 1/24/60, so we can try the following:

SQL> drop index full_name_index;
Index dropped.

SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context parameters ('sync ( every SYSDATE+1/24/60 )');
Index created.

SQL> insert into quickstart values (6, 'US', 'Paul Peters');
1 row created.

SQL> commit;
Commit complete.

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'paul', 99) > 0 order by score(99) desc;

no rows selected

So we've committed, but the regular SYNC hasn't kicked off yet. We can wait a minute and try again, and this time we'll find it:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'paul', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ----------------------------------------
         5          6 Paul Peters

Back in Part 1 we said that it was necessary to grant the "CREATE JOB" privilege to our user - this is why.  If the user doesn't have CREATE JOB privilege, then we will get an "insufficient privileges" error if we try to create the index with "SYNC (EVERY ...)".

Finally, we should talk about index optimization. CONTEXT indexes, over time, get less efficient as they get updated.  This inefficiency takes two forms:

  1. The index gets filled with garbage - it contains references to deleted or updated documents which are no longer current
  2. The "postings lists" get fragmented. The pointers to which documents contain each word get broken up into small chunks instead of the idea long strings.

We can fix this by running index optimization.  As with the manual call to sync indexes, this is done with a call to the PL/SQL package CTX_DDL:

SQL> execute ctx_ddl.optimize_index('full_name_index', 'FULL')
PL/SQL procedure successfully completed. 

The "FULL" parameter there is the mode of optimization. You can choose from the following:

  • FAST: Only posting fragmentation is fixed - no garbage collection occurs
  • FULL: Both fragmentation and garbage are dealt with
  • REBUILD: The entire index is copied to a fresh table (requires more disk space but produces the best results).

The more frequently you run SYNC, the more fragmented your index will become, and the more often you will need to optimize.

A typical "maintenance regime" at many customers is to run SYNC every five minutes, then run OPTIMIZE in FULL mode nightly, and OPTIMIZE in REBUILD mode once a week. 

The next topic is for advanced users: Getting Started Part 4 - Advanced Index Customization

Thursday Jul 24, 2014

Getting started with Oracle Text

My previous posts have mostly been aimed at experienced Oracle Text users. But what if you (or your colleagues) have never used it before?  What are the basics to get started? I hope this post will help some of those people.

Because I'm old-school (or maybe just old), my examples will be listed as though you're entering them in command-line SQL*Plus. But they will work just as well if you use them in SQL Developer.

First of all, any user of Oracle Text needs a special role CTXAPP.   Let's create a new user DEMO with that role, as well as the common CONNECT and RESOURCE roles - and also CREATE JOB which we'll need later in this series of tutorials.

create user demo identified by demo default tablespace users temporary tablespace temp quota unlimited on users;

grant connect, resource, ctxapp, create job to demo;

Now if we're going to call Oracle Text functionality from within PL/SQL procedures, we need some extra grants (since grants to roles don't work in packages).  See Step 3 in the Text Developer's Guide. We won't cover that here.
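For reference, the direct grants look something like this - run as a suitably privileged user, and grant only the packages your code actually calls:

grant execute on ctxsys.ctx_ddl to demo;

grant execute on ctxsys.ctx_doc to demo;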

Now we'll assume the user has an application which queries a table called QUICKSTART.  For the demo, we should login as the DEMO user we just created, and run:

create table quickstart (id number primary key, country varchar2(2), full_name varchar2(40));

 We'll insert some records into that:

insert into quickstart values (1, 'US', 'John Doe');

insert into quickstart values (2, 'GB', 'John Smith');

insert into quickstart values (3, 'NZ', 'Peter Smith-Smith');

Now a simple query against that table might look like

select * from quickstart where country = 'GB';

Now let's say we want to search for anyone with "Smith" in their name.  We could do this by simply extending the query to:

select * from quickstart where country = 'GB' and upper(full_name) like '%SMITH%';

but there are a few problems with this:

  • There is no index. The kernel must read every full_name field and scan it. Not a problem here, but certainly a problem if there are millions of records.
  • The query would match SMITHSON and BLACKSMITH. Not a problem if that's what we want, but tricky otherwise.

We can create a word-based index on the FULL_NAME column using the following syntax:

create index full_name_index on quickstart( full_name ) indextype is ctxsys.context;

 Note this is like any other "create index" statement, with the addition of the phrase "indextype is ctxsys.context", telling the kernel to create a specialized CONTEXT index and allow the use of query operators associated with that indextype.

For example we can do:

 select * from quickstart where country = 'GB' and contains ( full_name, 'smith') > 0;

A few things to note:

  • The CONTAINS operator can only be used if an index is present on the column specified in the first argument
  • The search string doesn't need any wildcards either side.  We are looking for the whole word 'smith' (if we wanted to match 'smithson' as well, then we could use a wildcard: 'smith%').
  • SQL doesn't have boolean functions, so CONTAINS returns zero for a non-match, and greater-than-zero for a match.  The actual value returned is the 'score' (see later), but you will very rarely do anything other than test whether the return value is greater than zero.

Scoring: Most SQL queries just select between rows that match a criterion, and those that don't. There is rarely the concept of a good or better match. But if we're searching a lot of technical papers for the word "Exadata", then a document which has many occurrences of this term is likely to be a better match for our search than one where there is only a single occurrence. Therefore CONTEXT indexes return a score for each search.  As mentioned, that is returned by the CONTAINS clause, but is rarely useful there.  It's usually better to use it in the SELECT clause, and perhaps in an ORDER BY clause as well.  A score is associated with a particular CONTAINS clause, of which a query may have several, so we pass a number as the third argument to the CONTAINS function, and the SCORE operator takes the same number.  It doesn't matter what that number is so long as it's the same in both places.  We'll use 99 in this example:

select score(99), id, full_name from quickstart where contains ( full_name, 'smith', 99) > 0 order by score(99) desc;

This produces the output:

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         7          3 Peter Smith-Smith
         4          2 John Smith

Note that the first item here scores higher than the second, because the search term smith appears twice.  Note there is no "absolute" meaning to those scores - you can't say  the first is a "7% match" or anything like that - all you can say is that a record with a higher score is more relevant than one with a lower score.

Scoring is quite a complex topic, which we'll cover in more detail later.

Move ahead to Part 2 - More queries

Wednesday Sep 25, 2013

High Volume Indexing with Oracle Text

My presentation from OpenWorld San Francisco 2013 is available here: High Volume Indexing with Oracle Text.

The Hands On Lab workbook is here: Text Analytics with Oracle Text.  Note this doesn't include scripts, or any data for the Classification section.

Friday Feb 01, 2013

Partition Maintenance and Oracle Text Indexes

Oracle Text supports "locally partitioned indexes" on partitioned tables.

That means that if you have a RANGE partitioned base table, you can use the LOCAL keyword when you create your index, and Oracle Text will automatically produce a partitioned index, with one index partition for each partition of the base table.

A partitioned index is very much like having a separate Oracle Text index on each partition. There is a $I table, a $K table, a $R table, etc, for each partition. The naming convention is changed slightly - the $I table for an unpartitioned table is DR$index_name$I, while for a partitioned table it is DR#index_name0001$I, DR#index_name0002$I, etc (the eagle-eyed will have spotted an apparent limit of 9,999 partitions, but in fact this is lifted in 11.2, as it moves to an alphanumeric naming convention after 9999).

Note that currently only RANGE partitioned tables are supported with local Text indexes.
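As a sketch - the table and column names here are invented for illustration - creating such an index looks like an ordinary CONTEXT index with the LOCAL keyword added:

create table my_table (id number, price number, text varchar2(200))
  partition by range (price)
  ( partition p1 values less than ( 50 ),
    partition p2 values less than ( maxvalue ) );

create index my_text_index on my_table ( text )
  indextype is ctxsys.context local;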

An important feature of any partitioned table is the ability to split and merge partitions. Splitting means dividing one existing partition into two new partitions, and merging means combining two (or sometimes more) partitions into a single one.

Normally, any LOCAL indexes on partitions which have been split or merged become invalid - they must be rebuilt before they are useable. However, if you specify UPDATE GLOBAL INDEXES, the local indexes will be automatically rebuilt as part of the split or merge:
ALTER TABLE my_table
  MERGE PARTITIONS p2, p3
  INTO PARTITION p2and3
  UPDATE GLOBAL INDEXES;
That's fine - but if we've got lots of data in there, it's going to take quite some time to rebuild those indexes, and our application is going to be unavailable for all that time.

Is there a way we can do a split or merge without having to take the application offline while we rebuild the indexes? Yes, there is. We can make use of a feature known as Partition Exchange, which allows us to swap a complete table, including its ready-built indexes, in place of an existing partition in a partitioned table.  The syntax looks like:
ALTER TABLE my_table
  EXCHANGE PARTITION p2a WITH TABLE my_temp_table
  INCLUDING INDEXES
  WITHOUT VALIDATION;
The WITHOUT VALIDATION is optional, but makes it a lot quicker. It avoids the need for the system to check every partition key value to make sure the row is going into the correct partition. Obviously it needs to be used with care.

So now we have a solution.  When we do a merge, we first copy the data from the partitions to be merged into a temporary table. We create an index on the temporary table using all the same index options as for our main table. When that index is complete, we do the ALTER TABLE ... MERGE PARTITION as above, but without the "UPDATE GLOBAL INDEXES" clause (thus leaving our indexes invalid for a short period) and then we do the ALTER TABLE ... EXCHANGE PARTITION to swap in the partition complete with its freshly built index in place of the merged partition with its invalid index.

We can do something similar with SPLIT PARTITION. This time, though, we will need two temporary tables.  We copy the data from the first part of the partition to be split into one temporary table, and the data from the second part into the other.  We create indexes on both of those temporary tables (we can build the two indexes simultaneously if we have enough resources, or sequentially if not).  We then split the partition, without using UPDATE GLOBAL INDEXES:
ALTER TABLE my_table
  SPLIT PARTITION p2 AT ( 50 )
  INTO ( PARTITION p2a, PARTITION p2b );
and then we do two partition exchanges:
ALTER TABLE my_table
  EXCHANGE PARTITION p2a WITH TABLE my_temp_table_1
  INCLUDING INDEXES
  WITHOUT VALIDATION;
and again for the second:
ALTER TABLE my_table
  EXCHANGE PARTITION p2b WITH TABLE my_temp_table_2
  INCLUDING INDEXES
  WITHOUT VALIDATION;

... and that's it. Or is it? What about changes which have been made to the base table since we copied the data out into temporary tables?

Well the simple answer to that is to make sure we don't do any changes to the table. We could even make it read-only:

ALTER TABLE my_table READ ONLY

That solves the problem, but it's rather heavy-handed. After all, the reason we're doing all this is that it can take a long time to build text indexes, and we would prefer not to have to prevent updates to the table.  So instead we need to institute some kind of "change management".  There are doubtless several ways to achieve this, but I've done it by creating triggers which monitor any updates or inserts on the base table, and copy them to a temporary "staging" table. These transactions can then be copied back to the main table after the partition split or merge is complete, and the index sync'd in the normal way.
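A minimal sketch of that trigger-based change capture might look like the following - the staging table, trigger name, and column list are all invented for illustration, and a real implementation would also need to handle deletes:

create table my_staging as select * from my_table where 1 = 0;

create or replace trigger my_table_capture
  after insert or update on my_table
  for each row
begin
  insert into my_staging values (:new.id, :new.price, :new.text);
end;
/

After the split or merge completes, the rows in my_staging are copied back into the main table, the staging table is cleared, and the index is sync'd as usual.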

I've provided a pair of complete examples which illustrate this process. They do a split and a merge respectively on a simple partitioned table, creating indexes on temporary tables as necessary and exchanging those tables in place of partitions in the main table.  They also create the necessary triggers to monitor updates, and roll these back into the main table on completion.

One note about the example: I've partitioned the tables on the "price" column.  While this seemed a good idea in the context of a simple, understandable "e-commerce" type model, it's actually a pretty bad idea, since you can't change the value of a partition key if it would cause the row to move partitions.  So if we have a partition boundary at 50, and an item which costs 49, then we can't increase its price to 51.

Downloads

The "download links" are for saving to a file - on most browsers you can right click and "Save Target As" or "Save Link As". 

If you want to view the code in a browser just click on "open in browser" (the file will have an extra .txt suffix if you then save it from the browser page).

split_partitions.pls - download link / open in browser

merge_partitions.pls - download link / open in browser
