Wednesday Sep 25, 2013

High Volume Indexing with Oracle Text

My presentation from OpenWorld San Francisco 2013 is available here: High Volume Indexing with Oracle Text.

The Hands On Lab workbook is here: Text Analytics with Oracle Text.  Note this doesn't include scripts, or any data for the Classification section.

Friday Mar 22, 2013

Using Entity Extraction

Entity Extraction using Oracle Text was introduced with the 11.2.0.2 release of the Oracle Database. It's not a particularly well-known feature, so I thought I'd make some notes on how to use it.

Please note that Entity Extraction uses some third party technology "under the covers" which is deliberately disabled in 11.2.0.3. If you run these examples in 11.2.0.3 (or 11.2.0.4 when it's out) you will receive a "feature not generally available" message. The functionality can be re-enabled, but only by contacting Oracle Support so they can run some checks on whether your applications will be affected by the replacement of this third party technology in the next major release.

Entity extraction is the capability to identify, and extract, named entities within a text.

Entities are mainly nouns and noun phrases. They include names, places, times, coded strings (such as phone numbers and zip codes), percentages, monetary amounts and many others.

The ctx_entity package implements entity extraction by means of a built-in dictionary and set of rules for English text. The capabilities can be extended for English, or for other languages, by means of user-provided add-on dictionaries and rule sets.

Let's look at a very simple, basic example.  

Let's assume we have a clob containing the text:

New York, United States of America <p>
The Dow Jones Industrial Average climbed by 5% yesterday on news of a new software release from database giant Oracle Corporation.'

We will use ctx_entity.extract to find all the entities in the CLOB value. (For now, we won't worry about how the text got into the clob or how we provide the output clob - that's all in the examples if you want to look).

Entity extraction requires a new type of policy - an "extract policy" - which allows you to specify options. For now, we will create a default policy, thus:

ctx_entity.create_extract_policy( 'mypolicy' );

We can then call extract to do the work. It needs four arguments: the policy name, the document to process, the language, and the output clob (which must have been initialized, for example by calling dbms_lob.createtemporary).

ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob )

outclob contains the XML identifying extracted entities.  If we display the contents (preferably by selecting it as an xmltype so it gets formatted nicely) we will see:

<entities>
  <entity id="0" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="1" offset="150" length="18" source="SuppliedRule">
    <text>Oracle Corporation</text>
    <type>company</type>
  </entity>
  <entity id="2" offset="10" length="24" source="SuppliedDictionary">
    <text>United States of America</text>
    <type>country</type>
  </entity>
  <entity id="3" offset="83" length="2" source="SuppliedRule">
    <text>5%</text>
    <type>percent</type>
  </entity>
  <entity id="4" offset="113" length="8" source="SuppliedDictionary">
    <text>software</text>
    <type>product</type>
  </entity>
  <entity id="5" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>state</type>
  </entity>
</entities>

That's fine if we're going to process it with an XML-aware program. However, if we want it in a more "SQL friendly" view, we can use XML DB functions to convert it:

select xtab.offset, xtab.text, xtab.type, xtab.source
from xmltable( '/entities/entity'
PASSING xmltype(outclob)
  COLUMNS
    offset number       PATH '@offset',
    lngth number        PATH '@length',
    text   varchar2(50) PATH 'text/text()',
    type   varchar2(50) PATH 'type/text()',
    source varchar2(50) PATH '@source'
) as xtab order by offset;

which produces as output:

    OFFSET TEXT                      TYPE                 SOURCE
---------- ------------------------- -------------------- --------------------
         0 New York                  city                 SuppliedDictionary
         0 New York                  state                SuppliedDictionary
        10 United States of America  country              SuppliedDictionary
        83 5%                        percent              SuppliedRule
       113 software                  product              SuppliedDictionary
       150 Oracle Corporation        company              SuppliedRule

Hopefully that's fairly self-explanatory.  Each entity found is listed with it's offset (starting point within the text), length, text, entity type, and source (how it was found).  The entity "New York" is identified as both a City and a State, so it appears twice.

If we don't want all the different types of entity, we can select which types to fetch. We do this by adding a fourth argument to the "extract" procedure, with a comma-separated list of entity types.  For example:

ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob, 'city, country' )

That would give us the XML

<entities>
  <entity id="0" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="2" offset="10" length="24" source="SuppliedDictionary">
    <text>United States of America</text>
    <type>country</type>
  </entity>
</entities>

Next let's look at creating a new entity type using a user-defined rule. Rules are defined using a regular-expression-based syntax.
The rule is added to an extraction policy, and will then apply whenever that policy is used

We will create a rule to identify increases, for example in a stock index.

There are many ways to express an increase - we're hoping to match any of the following expressions:
  climbed by 5%
  increased by over 30 percent
  jumped 5.5%

So we'll create a regular expression which matches any of those, and create a new type of entity. User defined entities must start with the letter "x", so we'll call our entity "xPositiveGain":

ctx_entity.add_extract_rule( 'mypolicy', 1,
    '<rule>'                                                          ||
      '<expression>'                                                  ||
         '((climbed|gained|jumped|increasing|increased|rallied)'      ||
         '( (by|over|nearly|more than))* \d+(\.\d+)?( percent|%))'    ||
      '</expression>'                                                 ||
      '<type refid="1">xPositiveGain</type>'                          ||
    '</rule>');


Note the "refid" in there. This tells us which part of the regular expression to actually match, by referencing a pair of parentheses within it. In our case, we want the entire expression, so that's the outermost (and first occurring) parentheses, so that's refid=1.

This time, it is necessary to compile the policy before we can use it:

  ctx_entity.compile('mypolicy');

Then we can use it as before

  ctx_entity.extract('mypolicy', mydoc, null, myresults)

The (abbreviated) output of that would be

<entities>
  ...
  <entity id="6" offset="72" length="18" source="UserRule" ruleid="1">
    <text>climbed by over 5%</text>
    <type>xPositiveGain</type>
  </entity>
</entities>

Finally, we're going to add another user defined entity, but this time as using a dictionary.  We want to recognize "Dow Jones Industrial Average" as an entity of type xIndex.  While we're at it we'll add "S&P 500" as well. To do that, we create an XML file containing the following:

<dictionary>
  <entities>
    <entity>
      <value>dow jones industrial average</value>
      <type>xIndex</type>
    </entity>
    <entity>
      <value>S&amp;P 500</value>
      <type>xIndex</type>
    </entity>
  </entities>
</dictionary>

Case is not significant in this file, but note how the "&" in "S&P" must be specified as the XML entity "&amp;" - otherwise the XML would not be valid.

This XML file is loaded into the system using the CTXLOAD utility.  If the file was called "dict.load", we would use the command:

ctxload -user username/password -extract -name mypolicy -file dict.load

Again we need to compile the policy using ctx_entity.compile. Then when we run ctx_entity.extract we will see in the output:

<entities>
  ...
  <entity id="6" offset="72" length="18" source="UserRule" ruleid="1">
    <text>climbed by over 5%</text>
    <type>xPositiveGain</type>
  </entity>
  <entity id="7" offset="43" length="28" source="UserDictionary">
    <text>Dow Jones Industrial Average</text>
    <type>xIndex</type>
  </entity>
</entities>

Here's some example code so you can try it yourself.  It uses SQL*Plus bind variables, which don't work too well in SQL Developer - if you really don't like using SQL*Plus, you'll need to convert it - perhaps put the whole thing into a PL/SQL procedure and use PL/SQL variables instead of SQL*Plus variables.

Download link (right-click and "Save As" on most systems): entity_extraction.sql


Friday Feb 01, 2013

Partition Maintenance and Oracle Text Indexes

Oracle Text supports "locally partitioned indexes" on partitioned tables.

That means that if you have a RANGE partitioned base table, you can use the LOCAL keyword when you create your index, and Oracle Text will automatically produce a partitioned index, with one index partition for each partition of the base table.

A partitioned index is very much like having a separate Oracle Text index on each partition. There is a $I table, a $K table, a $R table, etc, for each partition. The naming convention for the index is changed slightly - the $I table for an unpartitioned table is DR$index_name$I, for a partitioned table it is DR#index_name0001$I, DR#index_name0002$I, etc (the eagle-eyed will have spotted an apparent limit of 9,999 partitions, but in fact this is lifted in 11.2, as it moves to an alphunumeric naming convention after 9999).

Note that currently only RANGE partitioned tables are supported with local Text indexes.

An important feature of any partitioned table is the ability to split and merge partitions. Splitting means dividing one existing partition into two new partitions, and merging means combining two (or sometimes more) partitions into a single one.

Normally, any LOCAL indexes on partitions which have been split or merged become invalid - they must be rebuilt before they are useable. However, if you specify UPDATE GLOBAL INDEXES, the local indexes will be automatically rebuilt as part of the split or merge:
ALTER TABLE my_table
  MERGE PARTITIONS p2, p3 
  INTO PARTITION p2and3;
  UPDATE GLOBAL INDEXES
That's fine - but if we've got lots of data in there, it's going to take quite some time to rebuild those indexes, and our application is going to be unavailable for all that time.

Is there a way we can do a split or merge without having to take the application offline while we rebuild the indexes? Yes, there is. We can make use of a feature known as Partition Exchange, which allows us to swap a complete table, including its ready-built indexes, in place of an existing partition in a partitioned table.  The syntax looks like:
ALTER TABLE my_table
  EXCHANGE PARTITION p2a WITH TABLE my_temp_table
  INCLUDING INDEXES
  WITHOUT VALIDATION;
The WITHOUT VALIDATION is optional, but makes it a lot quicker. It avoids the need for the system to check every partition key value to make sure the row is going into the correct partition. Obviously it needs to be used with care.

So now we have a solution.  When we do a merge, we first copy the data from the partitions to be merged into a temporary table. We create an index on the temporary table using all the same index options as for our main table. When that index is complete, we do the ALTER TABLE ... MERGE PARTITION as above, but without the "UPDATE GLOBAL INDEXES" clause (thus leaving our indexes invalid for a short period) and then we do the ALTER TABLE ... EXCHANGE PARTITION to swap in the partition complete with its freshly built index in place of the merged partition with its invalid index.

We can do something similar with SPLIT PARTITION. This time, though, we will need two temporary tables.  We copy the data out of the first part of the partition to be split into one temporary table, and the data from the second part of the partition to be split into the other temporary table.  We create indexes on both of those temporary tables (we can build the two indexes simultaneously if we have enough resources, or sequentially if not.  We then split the partition, without using UPDATE GLOBAL INDEXES:
ALTER TABLE my_table 
  SPLIT PARTITION p2 AT ( 50 )
  INTO ( PARTITION p2a, PARTITION p2b )
and then we do two partition exchanges
ALTER TABLE my_table
  EXCHANGE PARTITION p2a WITH TABLE my_temp_table_1
  INCLUDING INDEXES
  WITHOUT VALIDATION
and again for the second
ALTER TABLE my_table
  EXCHANGE PARTITION p2b WITH TABLE my_temp_table_2
  INCLUDING INDEXES
  WITHOUT VALIDATION

... and that's it. Or is it? What about changes which have been made to the base table since we copied the data out into temporary tables?

Well the simple answer to that is to make sure we don't do any changes to the table. We could even make it read-only:

ALTER TABLE my_table READ ONLY

That solves the problem, but it's rather heavy-handed. After all, the reason we're doing all this is that it can take a long time to build text indexes, and we would prefer not to have to prevent updates to the table.  So instead we need to institute some kind of "change management".  There are doubtless several ways to achieve this, but I've done it by creating triggers which monitor any updates or inserts on the base table, and copy them to a temporary "staging" table. These transactions can then be copied back to the main table after the partition split or merge is complete, and the index sync'd in the normal way.

I've provided a pair of complete examples which illustrate this process. They do a split and a merge respectively on a simple partitioned table, creating indexes on temporary tables as necessary and exchanging those tables in place of partitions in the main table.  They also create the necessary triggers to monitor updates, and roll these back into the main table on completion.

One note about the example: I've partitioned the tables on a the "price" column.  While this seemed a good idea in the context of a simple, understandable "e-commerce" type model, it's actually a pretty bad idea, since you can't change the value of a partition key if it would cause the row to move partitions.  So if we have a partition boundary at 50, and an item which costs 49, then we can't increas its price to 51.

Downloads

The "download links" are for saving to a file - on most browsers you can right click and "Save Target As" or "Save Link As". 

If you want to view the code in a browser just click on "open in browser" (the file will have an extra .txt suffix if you then save it from the browser page).

split_partitions.pls - download link / open in browser

merge_partitions.pls - download link / open in browser

Wednesday May 16, 2012

Oracle Text and accented characters

There are a number of lexer attributes in Oracle Text that affect the way accented characters are dealt with. We're going to look at each of these and consider the effects they have, alone and together.

Before we look into them, a quick aside on typing accented characters if you don't have them natively on your keyboard. I'm not going to go too deep into character sets and code pages, but if you're on a European Windows machine, you're probably using the WIN1252 character set.

If I want to type one of the accented characters in that set, in most Windows applications I can hold down the ALT key, then enter the four-digit decimal code for that character on my numeric keypad. So to enter a lower-case o-umlaut character ("o" with two dots above it: "ö"), I would hold down ALT and type 0246 on the numeric keypad.

BASE_LETTER

Let's start with perhaps the simplest - the BASE_LETTER attribute. If I set this to "true" then characters with accents in the text will effectively be indexed as their base form.

So for example if I index the German word "schön" (meaning beautiful in English) then the actual indexed token will be "schon". That means if the user searches for " schön" or "schon" then the word will be found (since the query term is processed in the same way).

That makes life much easier for people who don't have the oh-umlaut character on their keyboard, and don't know the ALT trick mentioned above (or don't want to have to look up the necessary character code).

So that's simple - might as well set that option on always, right? Well, not necessarily. Because in German, the word "schon" actually has a quite different meaning to " schön", and German users wouldn't want to find the unaccented word if they specifically meant to look for the accented word.

So the simple rule of thumb is this: If you think that most of your users will be searching for words for which they have the correct keyboard (for example German users searching German text) then it's best to set BASE_LETTER to false. But if you think users might want to search foreign words for which they do not have the correct keyboard (for example English, or multi-national users searching German or mixed text) then it's best to set BASE_LETTER to TRUE.

ALTERNATE_SPELLING

Now it starts to get a little more complex. Some languages - notably German again - allow you to avoid the use of accented characters by spelling the words differently. In German, an accented character is replaced by the un-accented form of the same letter, followed by an "e". So "shön" could equally validly be spelled "shoen", and every German user would recognise it as the same word. Equally "Muenchen" (the city English speakers call Munich) would be recognized as the same as "München".

So that we can treat these alternate spelling forms as equivalent Oracle Text has the ALTERNATE_SPELLING attribute. When set to "german", Oracle Text will look for the "alternate form" of "vowel + e" and index both that and the accented form of the word. When processing a query, it will equally translate "oe" into "ö", etc, thus ensuring that the word can always be found, regardless of which of the alternate forms is searched for, and which is in the indexed text.

Aside: anyone following closely might ask "why does it index both the alternate form "shoen" and the accented form "shön"? Surely it would be sufficient to just index the "shön" form? Well, mostly it would. But what happens if the word in question was actually an English word in the middle of the German text, such as "poet"? OK, it would be indexed as "pöt" and anybody seaching for "poet" would still find it (since the transformation is applied to the searchterm as well). But what if they used a wildcard and searched for "po%"? They would still expect to find the term, but if only the accented form was indexed, they wouldn't. Hence the reason we index both forms, just in case.

Combining ALTERNATE_SPELLING and BASE_LETTER

OK, so we want ALTERNATE_SPELLING set to "german" for our German text. But we also know that people with non-German keyboards are often going to search it. So we BASE_LETTER on as well. What happens now?

If the indexer comes across "schön", ALTERNATE_SPELLING would normally index that without any change. But BASE_LETTER forces it to the unaccented form, and "schon" is actually indexed. If the indexer comes across "shoen", then ALTERNATE_SPELLING decides it should be indexed as both "schoen" and "schön". But before the tokens are written to the index, BASE_LETTER is applied, so the tokens "shoen" and "shon" are indexed.

That all works fine, and we can find either term by searching for "shon", "shön" or "shoen". Great!

But (there's always a but) what happens if we index the French word "Rouède" (the name of a town near the Spanish border)? "uè" is not a candidate for ALTERNATE_SPELLING, so it is left alone. Then BASE_LETTER is applied, and the word "Rouede" is written to the index. If the user searches for "Rouède" then the query-time processing works the same, the search is converted to "Rouede" and the query works fine. However, if the user searches for the base letter form "Rouede", things don't go so well. This time ALTERNATE_SPELLING does get applied to the query term (since the query processor has no way of knowing that the "e" character should be accented) and the searchterm is converted to "Roüde". BASE_LETTER is then applied, and it looks for "Roude" in the index. But the indexed term is "Rouede", so nothing is found.

OVERRIDE_BASE_LETTER

To solve this problem, the OVERRIDE_BASE_LETTER attribute was introduced.

If you set OVERRIDE_BASE_LETTER to "true", then ALTERNATE_SPELLING will "mask" BASE_LETTER. That means that if we meet accented characters in the text which have a alternate form (such as "ö"), we will index them in their original, accented form and also in their alternate form. If we meet them in their alternate form (eg Muenchen) we will index ONLY the alternate form and not transform them.  Accented characters which do not have an alternate form (such as "è") have BASE_LETTER processing applied to them to transform them to their equivalent unaccented character.  Then at query time, we apply only ALTERNATE_SPELLING to any appropriate accented search terms, and BASE_LETTER to all others.This has the positive effect that our previous example "rouède" can be found, if searched for with or without the accent on the "e" character.

It does have the negative effect that base letter searches no longer work on German words - we can't search for "shon" anymore, only "shön" or "shoen" will work. So OVERRIDE_BASE_LETTER makes sense if we want to perform ALTERNATE_SPELLING on German (or other specified language) words, and BASE_LETTER on all other languages.

Appendix: Test Script

This is the script I used to test the effects of the various options. To avoid any issues with character set translation, I used the UNISTR() function to create Unicode characters for my accented characters. Note the German words are prefixed by two-letter codes "wa" - with accent, "na" - no accent and "af" alternate form. That allowed me to distinguish in the index between the index terms derived from "schön" and those derived from "schon".

begin
ctx_ddl.drop_preference ( 'my_lexer');
end;
/

begin
ctx_ddl.create_preference ( 'my_lexer', 'BASIC_LEXER' );
ctx_ddl.set_attribute( 'my_lexer', 'BASE_LETTER', 'true' );
ctx_ddl.set_attribute( 'my_lexer', 'OVERRIDE_BASE_LETTER', 'true');
ctx_ddl.set_attribute( 'my_lexer', 'ALTERNATE_SPELLING','german' );
end;
/

drop table tt;
create table tt(a1 number primary key,text varchar2(45));

-- town name "Rouède", accent on the e
insert into tt values (1,'rou'||unistr('\00E8')||'de');

-- shön with accent (wa)
insert into tt values (2,'wash'||unistr('\00F6')||'n');

-- shon no accent (na)
insert into tt values (3,'nashon');

-- muenchen alternate form (af)
insert into tt values (4,'afmuenchen');

commit;

select * from tt;

create index tta on tt(text) indextype is ctxsys.context
parameters ( 'lexer my_lexer' );

set feedback 2

select token_text, token_type from dr$tta$i;

PROMPT searching for the base letter form, without accent on the first e
select * from tt where contains(text,'Rouede')>0;

PROMPT and with the accent
select * from tt where contains(text,'Rou'||unistr('\00E8')||'de') > 0;

--select * from tt where contains(text,'m'||unistr('\00FC')||'nchen')>0;
--select * from tt where contains(text,'muenchen') > 0;

set echo on

--select * from tt where contains(text,'nashoen') > 0;
--select * from tt where contains(text,'nashon') > 0;
--select * from tt where contains(text,'na'||unistr('\00F6')||'n') > 0;

select * from tt where contains(text,'washon') > 0;
select * from tt where contains(text,'washoen') > 0;
select * from tt where contains(text,'wa'||unistr('\00F6')||'n') > 0;

set echo off

-- The following section shows how to see how the query has been transformed - it shows
-- the actual words looked for in the index.

drop table test_explain;
create table test_explain(
explain_id varchar2(30),
id number,
parent_id number,
operation varchar2(30),
options varchar2(30),
object_name varchar2(64),
position number,
cardinality number);

begin
ctx_query.explain(
index_name => 'tta',
text_query => 'wasch'||unistr('\00F6')||'n',
explain_table => 'test_explain',
sharelevel => 0,
explain_id => 'Test');
end;
/

col explain_id for a10
col id for 99
col parent_id for 99
col operation for a10
col options for a10
col object_name for a20
col position for 99

select explain_id, id, parent_id, operation, options, object_name, position
from test_explain order by id;

About

bocadmin_ww

Search

Categories
Archives
« April 2014
MonTueWedThuFriSatSun
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
    
       
Today