Friday Mar 22, 2013

Using Entity Extraction

Entity Extraction using Oracle Text was introduced with the 11.2.0.2 release of the Oracle Database. It's not a particularly well-known feature, so I thought I'd make some notes on how to use it.

Please note that Entity Extraction uses some third party technology "under the covers" which is deliberately disabled in 11.2.0.3. If you run these examples in 11.2.0.3 (or 11.2.0.4 when it's out) you will receive a "feature not generally available" message. The functionality can be re-enabled, but only by contacting Oracle Support so they can run some checks on whether your applications will be affected by the replacement of this third party technology in the next major release.

Entity extraction is the capability to identify, and extract, named entities within a text.

Entities are mainly nouns and noun phrases. They include names, places, times, coded strings (such as phone numbers and zip codes), percentages, monetary amounts and many others.

The ctx_entity package implements entity extraction by means of a built-in dictionary and set of rules for English text. The capabilities can be extended for English, or for other languages, by means of user-provided add-on dictionaries and rule sets.

Let's look at a very simple, basic example.  

Let's assume we have a clob containing the text:

New York, United States of America <p>
The Dow Jones Industrial Average climbed by 5% yesterday on news of a new software release from database giant Oracle Corporation.

We will use ctx_entity.extract to find all the entities in the CLOB value. (For now, we won't worry about how the text got into the clob or how we provide the output clob - that's all in the examples if you want to look).

Entity extraction requires a new type of policy - an "extract policy" - which allows you to specify options. For now, we will create a default policy, thus:

ctx_entity.create_extract_policy( 'mypolicy' );

We can then call extract to do the work. It needs four arguments: the policy name, the document to process, the language, and the output clob (which must have been initialized, for example by calling dbms_lob.createtemporary).

ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob )
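
Putting those pieces together, a minimal self-contained sketch might look like this (the sample text is embedded inline here rather than read from a table, and only the first 4000 bytes of the result are printed):

declare
  mydoc   clob;
  outclob clob;
begin
  -- create the extract policy (only needs to be done once;
  -- comment this out on subsequent runs)
  ctx_entity.create_extract_policy( 'mypolicy' );

  mydoc := 'New York, United States of America <p> '
        || 'The Dow Jones Industrial Average climbed by 5% yesterday '
        || 'on news of a new software release from database giant '
        || 'Oracle Corporation.';

  -- the output clob must be initialized before the call
  dbms_lob.createtemporary( outclob, true );

  ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob );

  dbms_output.put_line( dbms_lob.substr( outclob, 4000, 1 ) );
  dbms_lob.freetemporary( outclob );
end;
/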

outclob contains the XML identifying extracted entities.  If we display the contents (preferably by selecting it as an xmltype so it gets formatted nicely) we will see:

<entities>
  <entity id="0" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="1" offset="150" length="18" source="SuppliedRule">
    <text>Oracle Corporation</text>
    <type>company</type>
  </entity>
  <entity id="2" offset="10" length="24" source="SuppliedDictionary">
    <text>United States of America</text>
    <type>country</type>
  </entity>
  <entity id="3" offset="83" length="2" source="SuppliedRule">
    <text>5%</text>
    <type>percent</type>
  </entity>
  <entity id="4" offset="113" length="8" source="SuppliedDictionary">
    <text>software</text>
    <type>product</type>
  </entity>
  <entity id="5" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>state</type>
  </entity>
</entities>

That's fine if we're going to process it with an XML-aware program. However, if we want it in a more "SQL friendly" view, we can use XML DB functions to convert it:

select xtab.offset, xtab.text, xtab.type, xtab.source
from xmltable( '/entities/entity'
PASSING xmltype(outclob)
  COLUMNS
    offset number       PATH '@offset',
    lngth number        PATH '@length',
    text   varchar2(50) PATH 'text/text()',
    type   varchar2(50) PATH 'type/text()',
    source varchar2(50) PATH '@source'
) xtab order by offset;

which produces as output:

    OFFSET TEXT                      TYPE                 SOURCE
---------- ------------------------- -------------------- --------------------
         0 New York                  city                 SuppliedDictionary
         0 New York                  state                SuppliedDictionary
        10 United States of America  country              SuppliedDictionary
        83 5%                        percent              SuppliedRule
       113 software                  product              SuppliedDictionary
       150 Oracle Corporation        company              SuppliedRule

Hopefully that's fairly self-explanatory.  Each entity found is listed with its offset (starting point within the text), length, text, entity type, and source (how it was found).  The entity "New York" is identified as both a city and a state, so it appears twice.

If we don't want all the different types of entity, we can select which types to fetch. We do this by adding a fifth argument to the "extract" procedure: a comma-separated list of entity types.  For example:

ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob, 'city, country' )

That would give us the XML

<entities>
  <entity id="0" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="2" offset="10" length="24" source="SuppliedDictionary">
    <text>United States of America</text>
    <type>country</type>
  </entity>
</entities>

Next let's look at creating a new entity type using a user-defined rule. Rules are defined using a regular-expression-based syntax.
The rule is added to an extraction policy, and will then apply whenever that policy is used.

We will create a rule to identify increases, for example in a stock index.

There are many ways to express an increase - we're hoping to match any of the following expressions:
  climbed by 5%
  increased by over 30 percent
  jumped 5.5%

So we'll create a regular expression which matches any of those, and create a new type of entity. User defined entities must start with the letter "x", so we'll call our entity "xPositiveGain":

ctx_entity.add_extract_rule( 'mypolicy', 1,
    '<rule>'                                                          ||
      '<expression>'                                                  ||
         '((climbed|gained|jumped|increasing|increased|rallied)'      ||
         '( (by|over|nearly|more than))* \d+(\.\d+)?( percent|%))'    ||
      '</expression>'                                                 ||
      '<type refid="1">xPositiveGain</type>'                          ||
    '</rule>');


Note the "refid" in there. This tells us which part of the regular expression to actually match, by referencing a pair of parentheses within it. In our case, we want the entire expression, so that's the outermost (and first occurring) parentheses, so that's refid=1.

This time, it is necessary to compile the policy before we can use it:

  ctx_entity.compile('mypolicy');

Then we can use it as before

  ctx_entity.extract('mypolicy', mydoc, null, myresults)

The (abbreviated) output of that would be

<entities>
  ...
  <entity id="6" offset="72" length="18" source="UserRule" ruleid="1">
    <text>climbed by over 5%</text>
    <type>xPositiveGain</type>
  </entity>
</entities>

Finally, we're going to add another user-defined entity, but this time using a dictionary.  We want to recognize "Dow Jones Industrial Average" as an entity of type xIndex.  While we're at it, we'll add "S&P 500" as well. To do that, we create an XML file containing the following:

<dictionary>
  <entities>
    <entity>
      <value>dow jones industrial average</value>
      <type>xIndex</type>
    </entity>
    <entity>
      <value>S&amp;P 500</value>
      <type>xIndex</type>
    </entity>
  </entities>
</dictionary>

Case is not significant in this file, but note how the "&" in "S&P" must be specified as the XML entity "&amp;" - otherwise the XML would not be valid.

This XML file is loaded into the system using the CTXLOAD utility.  If the file was called "dict.load", we would use the command:

ctxload -user username/password -extract -name mypolicy -file dict.load

Again we need to compile the policy using ctx_entity.compile. Then when we run ctx_entity.extract we will see in the output:

<entities>
  ...
  <entity id="6" offset="72" length="18" source="UserRule" ruleid="1">
    <text>climbed by over 5%</text>
    <type>xPositiveGain</type>
  </entity>
  <entity id="7" offset="43" length="28" source="UserDictionary">
    <text>Dow Jones Industrial Average</text>
    <type>xIndex</type>
  </entity>
</entities>

Here's some example code so you can try it yourself.  It uses SQL*Plus bind variables, which don't work too well in SQL Developer - if you really don't like using SQL*Plus, you'll need to convert it - perhaps put the whole thing into a PL/SQL procedure and use PL/SQL variables instead of SQL*Plus variables.

Download link (right-click and "Save As" on most systems): entity_extraction.sql


Friday Mar 15, 2013

A staging table for Oracle Text index updates

Oracle Text CONTEXT indexes differ from standard BTREE indexes in that they are not synchronous.  That is, the index is not updated at the same time that the data changes. It is necessary to invoke an index sync in some manner before any updates (or inserts) to the table are reflected in the index.

The simplest way to do this is to specify SYNC (ON COMMIT) in the parameters clause of the index creation. That means that as soon as changes are committed, the index is sync'd.

 (This still isn't quite synchronous. The indexing doesn't start until the commit, so although changes to the base table will be visible to the user who made them before doing a commit, that user will NOT see those changes reflected in the index. Additionally, there will be a short period between the end of the commit and the end of the sync when other users will be able to see the table changes but not the index changes.)

SYNC(ON COMMIT) is popular, but it's not necessarily the best option.  This is because frequent syncs cause fragmentation of the index. New data is written in small chunks to the end of the $I postings-list table, and the more of these small chunks there are, the worse query performance will be until the index is optimized.

So there is usually a trade-off between update frequency, and the fragmentation of an index. Users would like to have their updates available for searching immediately by means of SYNC(ON COMMIT), but DBAs would prefer to have SYNCs spaced widely apart to avoid fragmentation of the index.

One way around this is to create a staging table to hold recent updates to the main table.  The staging table will be a copy of the main table's layout, with the addition of a timestamp column.

Indexes are created on both tables (main and stage).  These indexes are identical, except that the main table's index uses manual SYNC while the staging table's index uses SYNC(ON COMMIT).
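
As a concrete sketch (the table and column names main_table, stage_table, text, and ts here are illustrative, not taken from the downloadable example), the staging table copies the main table's layout plus a timestamp, and the two indexes differ only in their SYNC parameters:

-- staging table: same layout as the main table, plus a timestamp column
create table stage_table as
  select * from main_table where 1 = 0;

alter table stage_table add ( ts timestamp default systimestamp );

-- main table: manual sync (the default when no SYNC clause is given)
create index main_index on main_table( text )
  indextype is ctxsys.context;

-- staging table: sync on commit, so new rows are searchable immediately
create index stage_index on stage_table( text )
  indextype is ctxsys.context
  parameters( 'sync (on commit)' );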

We then create triggers which copy inserts and updates from the main table to the stage table, where they are indexed near-instantaneously.  Because the stage table is kept small, we needn't worry about fragmentation.  It's probably advisable to arrange things such that the stage table is "pinned" in the SGA buffer cache, but if it's used regularly, it will probably stay in memory anyway.
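
A minimal trigger along these lines might look as follows (again assuming the illustrative names main_table and stage_table with columns id and text):

create or replace trigger main_to_stage
  after insert or update on main_table
  for each row
begin
  -- drop any earlier staged copy of this row, then stage the new version
  delete from stage_table where id = :new.id;
  insert into stage_table( id, text, ts )
    values( :new.id, :new.text, systimestamp );
end;
/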

Queries which would previously have run against the main table are modified so that they do a UNION between the main table and the staging table. That way, we get all the unmodified rows from the main table, and the modified rows from the stage table.  If you're thinking "won't we get the modified rows in their old form from the base table?" then remember that DELETEs are actioned immediately in a CONTEXT index, and that updates are implemented internally as DELETE followed by INSERT.
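
With the same illustrative names, a combined query might be sketched as follows (using UNION rather than UNION ALL, so that any duplicates appearing in the window between the main sync finishing and the stage rows being deleted are eliminated):

select id, text
  from main_table
 where contains( text, 'oracle' ) > 0
union
select id, text
  from stage_table
 where contains( text, 'oracle' ) > 0;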

Then we just need to put a wrapper around the SYNC for the main table which performs the actual SYNC, then deletes all the rows from the staging table that were inserted before the start of the SYNC (hence the need for the timestamp column).  If we choose, we can also optimize the stage table index at this point.
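
A sketch of such a wrapper using the CTX_DDL package (index and table names as in the earlier sketches; error handling omitted):

create or replace procedure sync_main_index is
  sync_start timestamp := systimestamp;
begin
  -- sync the main index; this picks up all changes made before sync_start
  ctx_ddl.sync_index( 'main_index' );

  -- staged rows older than the sync start are now searchable via the
  -- main index, so they can be removed from the staging table
  delete from stage_table where ts < sync_start;
  commit;

  -- optionally defragment the staging index as well
  ctx_ddl.optimize_index( 'stage_index', 'FULL' );
end;
/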

I've written some sample code to demonstrate this on a simple table. It is intended to be run in SQL*Plus but will probably work in SQL Developer with little or no modification.  Be aware that it drops and recreates a user called TESTUSER.  You will need to edit the CONNECT line unless your SYSTEM password is "manager".

Download link (right click and save): staging_index.sql or view in browser: staging_index.sql.txt


