
An Oracle blog about Oracle Text Index

Recent Posts

Oracle Text Indexes - New logging and tracing methods in 12cR2

Prior to Oracle 12cR2, getting a log of your index session required you to call something like this before starting the index creation:

ctx_output.start_log('myfilename.log')

That would then create the output in a file $ORACLE_HOME/ctx/log/myfilename.log. If I'm using PARALLEL 2 in index creation, I'll also see files myfilename.log_0 and myfilename.log_1.

There are a few problems with this approach:

- It doesn't work in a "read only Oracle Home" environment.
- The user needs direct access to the $ORACLE_HOME/ctx/log/ directory to read the log files.
- It gets messy and difficult in a cloud and/or PDB environment.
- There's no way to start logging mid-session. If you're wondering how far your indexing has got after several hours and you didn't turn on logging, your only option is to kill the indexing, drop the part-built index, start logging and create the index from scratch.

So in 12cR2 (12.2.0.1) we have moved all logging and tracing to the standard database framework. In some respects this is a little more complex to deal with, but it offers powerful and secure features for analyzing logs.

Logging the current session

Assuming we remember to turn logging on before we start the indexing job, we can use:

alter session set events 'TRACE[CONTEXT_INDEX] disk=low';

Naturally this requires the ALTER SESSION privilege. The final word can be low, medium or high - we'll cover the meanings later. To find out where the logging is going, we can use:

select value from v$diag_info where name = 'Default Trace File';

(note that case is important). In my case that returned:

VALUE
-------------------------------------------------------------------
/u01/app/oracle/diag/rdbms/r12201/r12201/trace/r12201_ora_17260.trc

We can also get the same information from the package ctx_output - we have to call start_log, then logfilename:

SQL> exec ctx_output.start_log('anystringhere')

PL/SQL procedure successfully completed.

SQL> select ctx_output.logfilename from dual;

LOGFILENAME
-------------------------------------------------------------------
/u01/app/oracle/diag/rdbms/r12201/r12201/trace/r12201_ora_17260.trc

So if we have OS access to the machine, we can just go and look at that file. If we don't have OS access, we just need to know the file name (not the path) and we can use either internal views or the snappily-named Automatic Diagnostic Repository Command Interpreter (adrci).

Using ADRCI

The Automatic Diagnostic Repository Command Interpreter is a command-line tool for analyzing diagnostic files which was introduced in 11g. It requires shell access to the database server, so it is not much use in a Cloud DBaaS environment. However, it does have powerful capabilities. Look to the doc for details, but we'll cover the very basics here.

adrci> SHOW TRACE (co, op) <trcfilename> -xp "[co='CONTEXT_INDEX']"

'show trace' outputs the data to a file (*.ado) by default. You can push the output to a terminal with the "-term" switch:

SHOW TRACE (co, op) x3_ora_26659.trc -term -xp "[co='CONTEXT_INDEX']"

To list all the parallel indexing trace file names, sorted by timestamp in reverse order:

adrci> SHOW TRACEFILE %p00% -rt

Note: to view query logs, use component name 'CONTEXT_QUERY' rather than 'CONTEXT_INDEX'.

Using Views

There are new views in 12.2 that allow you to examine database trace files without having OS access.
Once you've established the name of the trace file (see above) you can run this:

select PAYLOAD from V$DIAG_TRACE_FILE_CONTENTS
 where COMPONENT_NAME = 'CONTEXT_INDEX'
   and TRACE_FILENAME = 'r12201_ora_17260.trc';

Note that 'ordinary' users can't SELECT from V$DIAG_TRACE_FILE_CONTENTS; you must be a DBA or have SELECT on that view explicitly granted to you. In theory you should add 'order by timestamp' to the end of that statement; in practice the rows do seem to come out in timestamp order by default, though by normal SQL rules that's not guaranteed.

Logging another Session

If you've already started an indexing job and need to know how far it's got, you won't be able to run ALTER SESSION for that session. Instead you must use an ALTER SYSTEM call, which will affect all processes:

alter system set events 'TRACE[CONTEXT_INDEX] disk=low';

Finding the name of the trace file can be challenging. You can't use v$diag_info as above, so you must construct the file name yourself. The file name will normally take the form <SID>_ora_<procid>.trc. The value for <procid> can be found by looking up the spid column in V$PROCESS. Determining which row in V$PROCESS is the one you're looking for is beyond the scope of this article.

Alternatively, if you have OS access to the system in question, you can just go to the trace file directory (find it with select value from v$diag_info where name = 'Default Trace File') and look for recent files, or grep for the string(s) you're expecting to find.

Tracing levels

There are three levels of tracing you can use: low, medium and high.

Trace level low:

alter session set events 'TRACE[CONTEXT_INDEX] disk=low';

This logs summary indexing information only (e.g. start, finish, number of rows) and intermediate writes to disk.

Trace level medium:

alter session set events 'TRACE[CONTEXT_INDEX] disk=medium';

This logs summary information and logs every 100 rows processed (similar to ctx_output.start_log).

Trace level high:

alter session set events 'TRACE[CONTEXT_INDEX] disk=high';

As medium, plus it logs individual rowids and tokens. The following events are enabled:

CTX_OUTPUT.EVENT_INDEX_PRINT_ROWID
CTX_OUTPUT.EVENT_OPT_PRINT_TOKEN
CTX_OUTPUT.EVENT_INDEX_PRINT_TOKEN
CTX_OUTPUT.EVENT_DRG_DUMP_ERRORSTACK (for all errors)

Query Logging

Logging is most commonly used for index creation and optimization, but you can log Oracle Text queries as well. For this we use the key string CONTEXT_QUERY. There is only one setting, low:

alter session set events 'TRACE[CONTEXT_QUERY] disk=low';

This logs query expressions only (similar to ctx_output.start_query_log).
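As a rough illustration of the V$PROCESS lookup mentioned above (this is my own sketch, not from the original post), you can join V$SESSION to V$PROCESS to get the spid for a session you have already identified by its SID - the SID 123 below is purely hypothetical:

select p.spid
  from v$process p, v$session s
 where s.paddr = p.addr       -- session's paddr points at its server process
   and s.sid   = 123;         -- substitute the SID of the indexing session

With that spid, the trace file should be <instance>_ora_<spid>.trc in the trace directory reported by v$diag_info.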



Oracle Text Filters - unofficial updates

If you want Oracle Text to index the contents of a binary document - maybe PDF, Microsoft Word, or Excel - then Oracle Text needs to extract the indexable text from the document before it can index it. It does this using Oracle's own Outside In Technology (OIT). Outside In recognizes over 150 different document formats.

The actual OIT implementation in Oracle Text consists of an executable "ctxhx" (which on Linux systems lives in $ORACLE_HOME/ctx/bin) and a set of libraries and data files (in $ORACLE_HOME/ctx/lib).

If you're on an older version of Oracle (let's assume 11.2.0.4) then the OIT libraries will date from a similar time. That means that they may not handle some newer file formats, and they may also contain bugs which have been fixed in newer versions. Such bug fixes can't always be backported. Oracle Text development are working on getting the latest OIT libraries back-ported to 11.2.0.4, but it's a big job: there are 177 shared libraries, and the source code runs to thousands of files.

In the meantime, is there any way you can run a more recent version of OIT with an earlier version of Oracle? Not a supported way, no. But if you just want to do some testing, then it can be done. The following instructions assume a Linux system. You can probably do something similar on a Solaris system - I have no idea whether it would work on Windows.

Just to repeat: this is not supported. If you do this, you are on your own when it comes to filtering problems. You can email me, but Oracle Support won't help.

Firstly, we need to install Oracle 12c, the latest version (currently 12.2.0.1). You can install that on the same machine as your 11.2.0.4 (or 12.1) system, or you can install it on a separate machine (but see the discussion on paths below). Just install the software - there is no need to create a database.

We then need to copy the ctxhx executable from the newly installed 12.2 system to the 11.2 system. ctxhx is found in $ORACLE_HOME/ctx/bin. Obviously keep the old version (rename it or move it) in case of problems. So we should now have the new ctxhx executable in $ORACLE_HOME/ctx/bin in the old 11.2 environment.

Now the ctxhx executable needs to call all the shared libraries normally found in $ORACLE_HOME/ctx/lib. You can run it on the command line using:

$ cd $ORACLE_HOME/ctx/bin
$ ./ctxhx

If you get a long usage message, it's working. If you installed 12.2 on a different machine, you will likely get an error such as:

./ctxhx: error while loading shared libraries: libsc_ca.so: cannot open shared object file: No such file or directory

This is because the path to the libraries is hard-coded into the executable when it is linked. If your 12.2 installation still exists, and is on the same machine, the executable will use the libraries from there. If the installation was on another machine, or you've deleted the 12.2 installation, it's not going to work.

In theory it ought to be possible to relink the ctxhx executable and use the new libraries in the old location, but I've not managed to get that to work (if you want to try, there's a makefile ins_ctxhx.mk in $ORACLE_HOME/ctx/lib). So we really have a few choices:

- If the 12.2 installation was on the same machine, leave the newly-installed 12.2 libraries where they were installed originally. You can delete the rest of the 12.2 installation, but leave $ORACLE_HOME/ctx/lib.
- If the 12.2 installation is done on a different machine, make sure the installation path on that machine exactly matches the path on the 11.2 installation. That way, the hard-coded paths will match. Then you'll need to replace all the files in 11.2's $ORACLE_HOME/ctx/lib with the files from the 12.2 installation on the other machine.
- I've not tried it, but it might be possible to set LD_LIBRARY_PATH to point to a new location for the libraries. Normally, LD_LIBRARY_PATH must be set before starting the TNS listener with 'lsnrctl start', since server processes inherit their environment from the listener.

I haven't done extensive testing on this. It works for the few file formats I've tried, but it's entirely possible that if you're filtering some obscure file format, it might cause problems. Let me know in the comments if you've tried this and it works - or it doesn't work - or even better if you've figured out how to get the 12.2 executable to relink with the proper paths.
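If you do try this, one simple way to check that filtering still works is to index a test document and look for filter errors afterwards. This is a minimal sketch of my own (not from the original post), assuming a throwaway table - here called filter_test - holding one of your problem documents in a BLOB column, which by default is filtered through ctxhx:

create table filter_test (id number primary key, doc blob);

-- load a representative test document into filter_test first,
-- e.g. via SQL*Loader or a BFILE as described elsewhere on this blog

create index filter_test_idx on filter_test (doc)
  indextype is ctxsys.context;

-- any rows that failed filtering are recorded here
select err_index_name, err_textkey, err_text
  from ctx_user_index_errors;

No rows in CTX_USER_INDEX_ERRORS (and sensible tokens in the DR$FILTER_TEST_IDX$I table) suggests the new ctxhx is being picked up and working for that format.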


Parallel Sync of Text Indexes

When syncing a CONTEXT index, we would normally use a command such as:

exec ctx_ddl.sync_index( index_name=>'MYINDEX', part_name=>'P1' )

However, we can "suggest" that this is run in parallel using the command:

exec ctx_ddl.sync_index( index_name=>'MYINDEX', part_name=>'P1', parallel_degree=>16 )

Why "suggest"? Well, sometimes the kernel thinks it knows better than you, and will decide not to run things in parallel even if you tell it to. But in most cases this will work.

Now, if you generate an AWR report, or otherwise examine your system's SQL, you may see a query something like this, which appears to be taking a great deal of resources:

select column_value from table(cast(ctxsys.drvparx.ParallelPopuIndex(cursor(select /*+ DYNAMIC_SAMPLING(0) ordered PARALLEL(pnd.dr$pending 16) INDEX_FFS(pnd.dr$pending) NOCACHE(base) */ base.rowid, NULL, NULL, NULL, NULL, nvl(pnd_lock_failed, 'N'), base."TEXT" from ctxsys.drv$pending pnd, "ROGER"."DOCS" PARTITION("P1") base where pnd.pnd_cid = :CID and pnd.pnd_pid = :PID and pnd.pnd_timestamp <= to_date(:TSTAMP, 'YYYYMMDDSSSSS') and base.rowid = pnd.pnd_rowid), :idxownid, :idxid, :idxown, :idxname, :ixpname, :popstate) as sys.odcivarchar2list))

What's going on there? The answer is that this is a "driver query" for the parallel sync. SQL doesn't provide generalized syntax for executing PL/SQL statements in parallel, but it does provide a parallel query mechanism. That mechanism knows things like how many CPUs the system has, and can make 'intelligent' decisions about whether running things in parallel is a good idea. So we call a parallel query, but put a PL/SQL function in the SELECT list. This PL/SQL function will be called in parallel by all the parallel query threads - thus giving us "free" parallel execution of our PL/SQL functions. By having a PARALLEL hint in the query, we can give the kernel a recommended degree of parallelism.

One side issue about that query: people often look at it, or similar queries, and ask "why is it using dynamic sampling?". Dynamic sampling is an optimizer technique which involves sampling data in the table to figure out which query path makes sense. It's not a good way to approach any query where you know for sure the best plan to use. The answer to that question is "it's not". The key is in the zero argument - DYNAMIC_SAMPLING(0) means "don't use dynamic sampling, even if it seems like a good idea".
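As an aside (not from the original post), one easy way to see whether a sync has anything left to do - and whether it has finished - is to count the outstanding DML in the pending queue via the CTX_USER_PENDING view. A minimal sketch, assuming the index MYINDEX used above:

select count(*)
  from ctx_user_pending
 where pnd_index_name = 'MYINDEX';

-- a non-zero count means rows are still waiting to be synced;
-- rerun after ctx_ddl.sync_index completes and the count should drop to zero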


Loading documents and other file data into the Oracle Database

Introduction

With recent versions of Oracle, there is more and more emphasis on handling unstructured text, as well as the traditional fully structured data such as numbers, dates and short text fields. Here I want to consider various methods of getting "documents" into the Oracle database. None of these methods is completely trivial, and the choice of which to use will depend on various factors.

What are the characteristics of a "document"? In general, I'm referring to text which may be too long to fit into a PL/SQL VARCHAR2 variable. These variables are limited to 32767 bytes. The document most likely originates from a file on a file system, and it may be in a character set which differs from the database character set. In many cases, documents will have metadata (such as author and creation date) that we want to load at the same time.

We should distinguish between four types of data:

- Metadata. This is short textual, numeric or date data which describes the main document. For example: document id, author, date created.
- Semi-structured data. This is textual data in a specific format, such as JSON or XML. We may be able to make assertions about such data, such as "will always start with '<'", or "does not contain any newline characters".
- Full text. This is text of arbitrary length. It can and usually will contain all sorts of characters, including newlines and quotes. It may consist of multibyte characters in UTF-8, UTF-16 or other character sets.
- Binary data. Arbitrary-length data which is not character encoded. It may contain any bytes or combinations of bytes, including nulls and end-of-file markers.

Loading a mixture of these types of data can be problematic. If we want to load metadata and full text from a single file, then we need some method of determining where the full text starts and finishes. The easiest way to do that is to enclose the full text in some sort of quotes. But then what if the full text contains those quote characters? So it's usually easier to arrange for each full text document to exist in a separate file, referenced from the main loading method. It would be impossible to load multiple binary data records from a single file, unless the binary data was first encoded using a scheme such as BASE64, since by its nature you can never specify a delimiter string that could not appear in the binary data.

For these examples, we're going to consider three use cases:

- A set of JSON documents. These documents reside in a single file, one per line.
- A set of files which contain one document per file. Each file/document has metadata associated with it in a "summary" file.
- As above, but one of the files contains multibyte (UTF-8) data.

Client / server considerations - where is your source data?

If your client and server are the same machine, then any of these methods should work. However, if you are connecting to your database from a remote client over SQL*Net, you might first decide whether you want to load files from the client (the machine from which you connect) or the server (the machine where the database resides).
Load Method        Data on Client   Data on Server
SQL*Plus           Yes              No
SQL*Loader         Yes              No
External Tables    No               Yes
BFILES             No               Yes

Method 1: Using SQL*Plus Scripts

Let's consider we have a table with several short columns and a CLOB column:

CREATE TABLE my_table
  (id        NUMBER,
   author    VARCHAR2(30),
   created   DATE,
   text      CLOB );

If I just want to do some "quick and dirty" loading of this table, I can just create a SQL script file and run that in SQL*Plus. For example:

INSERT INTO my_table VALUES ( 1, 'Roger', SYSDATE, 'the quick brown fox jumps over the lazy dog' );
INSERT INTO my_table VALUES ( 2, 'John', SYSDATE, 'the angry alligator ate the antelope');
COMMIT;

Simple and quick, and fairly easy to generate from source files using a language such as Perl or Python. But do this with any quantity of data and you'll rapidly hit limitations:

- Any single quotes in the text need to be doubled up:
  INSERT INTO my_table VALUES ( 3, 'Roger', SYSDATE, 'my best friend''s brother said ''hi''');
- Any ampersand ("&") characters in the script will cause problems unless you've put SET DEFINE OFF somewhere in the script.
- An input line to SQL*Plus cannot be longer than 2499 characters. If you want to include more than this you'll need to concatenate strings:
  INSERT INTO my_table VALUES ( 3, 'Roger', SYSDATE, 'this is a long line that continues over more than one line'
     || ' in the source code file');
  Even then, that concatenation method will only allow you to assemble up to 4000 characters. To continue longer than that would require you to create temporary LOBs and use DBMS_LOB.APPEND for each chunk.
- It's slower than most other methods, which will be an issue if you're loading large quantities of data.

Let's take our use case 1 - the JSON documents. We could cut-and-paste from the file into a hand-coded script, but it's probably better to do it using a text-processing language such as Perl. The attached file jsonToSql.pl replaces any single quotes with two single quotes, and breaks each line into chunks no longer than 2400 characters (allowing some space for syntactic overheads). Assuming we have Perl installed and available, we can call it from the command line (Windows or Unix) using:

perl jsonToSql.pl jsonsource.txt > json.sql

And we can then load the resulting file json.sql (after creating the necessary table - see above) using:

sqlplus user/pass @json.sql

SQL*Plus isn't very good for loading from separate files, so we'll leave scenarios two and three out for this method.

Multibyte considerations

SQL*Plus isn't very good at dealing with multibyte / UTF-8 characters. To get correct (and viewable!) output you will probably want to use SQL Developer to run the scripts.

Method 2: Using SQL*Loader

SQL*Loader is a utility from the very early days of Oracle. It's reasonably simple to use, but the vast number of options can make it intimidating for the uninitiated. SQL*Loader loads files from the client. It has no access to files on the server. It is normal to provide SQL*Loader with two inputs: a data file containing the data to be loaded to the database, and a control file which specifies how the data file is laid out, and which columns the data should be written to (it's possible to include the data inside the control file, but we'll stick with separate files for this example).

Use case 1: JSON documents in a single file

We'll use the same table "my_table" as defined earlier. This is fairly straightforward.
We create a control file called jsonloader.ctl which consists of:

LOAD DATA
INFILE 'jsonsource.txt'
INTO TABLE jsondocs
FIELDS
( jsontext CHAR(32000) )

Then we load it using the command:

sqlldr roger/roger control=jsonloader.ctl log=loader.log bad=loader.bad

The "log" and "bad" arguments aren't required, but can be most useful for debugging. Note that we've limited the length of the jsontext field to 32000 characters, but you can increase that size if you need to.

Use case 2: Loading from separate files

A quick reminder about our second scenario. We have a table which we'll define as:

CREATE TABLE my_table
  (id        NUMBER,
   author    VARCHAR2(30),
   created   DATE,
   text      CLOB );

and into this table we want to load the files file1.txt and file2.txt:

file1.txt: The quick brown fox jumps over the lazy dog
file2.txt: The angry alligator ate the antelope.

These files only have one line each, but they could just as easily have multiple lines per file. One file is loaded into each database record. If we want to use SQL*Loader to load CLOB or BLOB data from separate files, then the filenames for those files must be included in the data file. We use the FILLER syntax to specify that the filename isn't to be loaded into any column, and then the LOBFILE syntax to load the text into a column.

So here's our control file. We can see that it's similar to the previous control file, but the FIELDS section is more complicated, to allow for multiple metadata fields and the reference to the external file to be loaded.

loader.ctl:

LOAD DATA
INFILE 'loader.txt'
INTO TABLE my_table
FIELDS TERMINATED BY ','
( id         CHAR(10),
  author     CHAR(30),
  created    DATE "YYYY-MM-DD" ":created",
  fname      FILLER CHAR(80),
  text       LOBFILE(fname) TERMINATED BY EOF
)

"fname" is an arbitrary label - we could have used "fred" and it would have worked exactly the same. It just needs to be the same on the two lines where it is used. Note that the control file includes the name of our data file, loader.txt. loader.txt contains the metadata associated with each file, and the filename.

loader.txt:

1,John Smith,2015-04-29,file1.txt
2,Pete Jones,2013-01-31,file2.txt

So we're loading two files, along with the metadata for each. Now we can run SQL*Loader from the command line as follows:

sqlldr roger/roger control=loader.ctl log=loader.log bad=loader.bad

(again, the "log" and "bad" arguments aren't required, but can be most useful for debugging). Then we can use SQL*Plus or SQL Developer to examine our loaded table:

SQL> select * from my_table;

        ID AUTHOR           CREATED   TEXT
---------- ---------------- --------- ------------------------------------------------
         1 John Smith       29-APR-15 The quick brown fox jumps over the lazy dog
         2 Pete Jones       31-JAN-13 The angry alligator ate the antelope.

Common mistakes: do not be tempted to add spacing to format the loader.txt file nicely. If you use

1, John Smith, 2015-04-29, file1.txt

then not only will the author be loaded with a space in front of "John", but you'll see an obscure error:

SQL*Loader-509: System error: The device does not recognize the command.
SQL*Loader-502: unable to open data file ' file1.txt' for field TEXT table

Multi-byte considerations

So what if your file contains multi-byte characters? Perhaps some Chinese text mixed up with English? You simply need to specify the character set of the file in the LOBFILE directive.
For example, I have a file encoded in UTF-8, so I modify the control file to read:

LOAD DATA
INFILE 'loader.txt'
INTO TABLE my_table
FIELDS TERMINATED BY ','
( id         CHAR(10),
  author     CHAR(30),
  created    DATE "YYYY-MM-DD" ":created",
  fname      FILLER CHAR(80),
  text       LOBFILE(fname CHARACTERSET UTF8) TERMINATED BY EOF
)

Loading Binary Files with SQL*Loader

The above method (without the CHARACTERSET addition) will work exactly the same if our column is a binary BLOB column rather than a character CLOB. There is no difference in the loading mechanism - all that changes is that the database does not attempt to do any kind of character set conversion on the data as it is loaded into the database.

Method 3: Using BFILEs and PL/SQL

A BFILE is a non-standard datatype which provides read-only access to data held on an external file system. NOTE: this method will only work with files on the server. BFILEs have no access to files on the client machine.

It's possible to work directly with BFILEs, but it's perhaps more common to use them as a kind of temporary staging area to allow you to read data into database columns. BFILEs require a DIRECTORY - an internal database pointer to a folder/directory on the file system. To create a directory you must have the CREATE ANY DIRECTORY privilege - or you can have a DBA create the directory and grant you access to it. This provides a layer of protection to prevent users from reading or writing arbitrary files on the database server.

There is a supplied package DBMS_LOB which provides a number of utility functions, including LOADFROMFILE. On the face of it, this might seem all we need, but in fact there's some preparation work to do. LOADFROMFILE will only load into an open CLOB, so we need to prepare a LOB locator, open it, call LOADFROMFILE, then close it afterwards.

The following example shows a procedure which writes to our previously-created my_table. First it creates a BFILE from the supplied directory and filename, then it inserts the metadata into my_table, fetching back the LOB locator for the newly created row. It then uses LOADFROMFILE to write into that LOB:

CREATE SEQUENCE loadtest_seq;

CREATE TABLE my_table(id NUMBER, author VARCHAR2(100), load_date DATE, text CLOB);

CREATE OR REPLACE PROCEDURE LOAD_A_FILE
   ( dir_name   VARCHAR2,
     file_name  VARCHAR2,
     author     VARCHAR2
   ) IS
  l_bfile   BFILE;
  l_clob    CLOB;
BEGIN
   l_bfile := BFILENAME(dir_name, file_name);
   IF (dbms_lob.fileexists(l_bfile) = 1) THEN
      dbms_output.put_line('File Exists');
      INSERT INTO my_table T
        VALUES (loadtest_seq.NEXTVAL,
                author,
                sysdate,
                EMPTY_CLOB()
        ) RETURN text INTO l_clob;
      l_bfile := bfilename(dir_name, file_name);
      dbms_lob.fileopen( l_bfile, dbms_lob.FILE_READONLY );
      dbms_lob.loadfromfile( l_clob, l_bfile, dbms_lob.getlength(l_bfile) );
      dbms_lob.fileclose( l_bfile );
      COMMIT;
   ELSE
     dbms_output.put_line('File does not exist');
   END IF;
END;
/

To use this procedure, we might do:

CREATE OR REPLACE DIRECTORY my_directory AS '/scratch/raford/Projects/loadingFiles';
-- remember that needs the CREATE ANY DIRECTORY privilege

EXECUTE load_a_file( 'MY_DIRECTORY', 'file1.txt', 'Roger');

Note that in this case we're using SYSDATE for the creation date, which varies slightly from previous examples. We could of course include create_date as another argument to our procedure.
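As a quick sanity check (my own addition, not from the original post, and assuming the my_table and directory created just above), you can compare the loaded CLOB length against the source file size:

select id, author, load_date, dbms_lob.getlength(text) as clob_length
  from my_table
 order by id;

-- clob_length should roughly match the file size (exactly, for single-byte data)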
Common Mistakes:

- Remember that the file must be found on the database server. You cannot use this method to load a file which is located on a local machine if you are using SQL*Plus or SQL Developer remotely.
- The directory name is an identifier, similar to a username, table name or column name. As such, if you specify it without double quotes, it will be automatically upper-cased. But LOADFROMFILE expects the directory name to be passed in as a string, so you must specify it in upper case if it was originally created without quotes.
- No character set conversion is applied by LOADFROMFILE. This generally means the above procedure will not work if your data is not simple single-byte characters. Often you'll find the text is inexplicably truncated, because of the difference between the number of characters and the number of bytes. See the next section for a solution.

Multibyte Considerations

If we want to load multibyte character data, we must forget DBMS_LOB.LOADFROMFILE and copy the contents of the BFILE into the CLOB using a more "manual" technique. This involves reading chunks into a RAW buffer, and then using UTL_RAW.CAST_TO_VARCHAR2 to allow us to write the RAW data into the CLOB value:

CREATE OR REPLACE PROCEDURE LOAD_MULTIBYTE_FILE
   ( dir_name   VARCHAR2,
     file_name  VARCHAR2,
     author     VARCHAR2
   ) IS
  l_bfile   BFILE;
  l_clob    CLOB;
  buf       RAW(32767);
  vc        VARCHAR2(32767);
  maxsize   INTEGER := 8192;  -- a char can take up to 4 bytes,
                              -- so this is the maximum safe length in chars
  amt       INTEGER;
  amtvc     INTEGER;
  v_offset  INTEGER := 1;
BEGIN
  l_bfile := bfilename(dir_name, file_name);
  IF (dbms_lob.fileexists(l_bfile) = 1) THEN
    dbms_output.put_line('File Exists');
    INSERT INTO my_table T
      VALUES (loadtest_seq.NEXTVAL,
              author,
              sysdate,
              EMPTY_CLOB()
      ) RETURN text INTO l_clob;
    l_bfile := BFILENAME(dir_name, file_name);
    amt := dbms_lob.getlength( l_bfile );
    dbms_lob.fileopen( l_bfile, dbms_lob.FILE_READONLY );
    WHILE amt > 0 LOOP
      IF amt > maxsize THEN
        amt := maxsize;
      END IF;
      dbms_lob.read( l_bfile, amt, v_offset, buf );
      vc := utl_raw.cast_to_varchar2(buf);
      amtvc := LENGTH(vc);
      dbms_lob.writeappend( l_clob, amtvc, vc );
      v_offset := v_offset + amt;
      amt := dbms_lob.getlength( l_bfile ) - v_offset + 1;
    END LOOP;
    dbms_lob.fileclose( l_bfile );
    COMMIT;
  ELSE
    dbms_output.put_line('File does not exist');
  END IF;
END;
/

Method 4: Using External Tables

External tables appear to the database much like normal tables, but their data resides on the file system. The layout of the external files is described using a variation on SQL*Loader syntax. However, SQL*Loader loads from the client file system, whereas external tables expect their data files to exist on the server file system. External tables are particularly useful for loading large numbers of records from a single file, so that each record appears in its own row in the table.

CREATE TABLE my_ext_table
( id      NUMBER,
  author  VARCHAR2(30),
  created DATE,
  text    CLOB )
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY MY_DIRECTORY
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY 0x'0A'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    ( id      CHAR(10),
      author  CHAR(30),
      created DATE "YYYY-MM-DD",
      text    CHAR(131071)
    )
  )
  LOCATION ( MY_DIRECTORY:'external.dmp' )
);

And here's our data file external.dmp:

1,Roger,2015-08-08,The quick brown fox jumps over the lazy dog
2,John,2012-01-01,"The angry alligator, acting alone, ate the antelope"

Note the second line requires quotes around the full text string, as otherwise the field would be terminated at the comma.
Of course we meet the usual problem that the full text string may contain quotes. We can counter that by doubling them:

2,John,2012-01-01,"The angry alligator, acting ""alone"", ate the antelope"

But we still have the issue that the full text might contain the record delimiter character (0x'0A', or newline). So as with SQL*Loader, we can specify a separate file for each document. External tables do not support the FILLER directive, so instead we have to use a COLUMN TRANSFORMS line to specify that our fname field contains the name of our external file:

DROP TABLE my_ext_table2;

CREATE TABLE my_ext_table2
( id      NUMBER,
  author  VARCHAR2(30),
  created DATE,
  text    CLOB )
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY MY_DIRECTORY
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY 0x'0A'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    ( id      CHAR(10),
      author  CHAR(30),
      created DATE "YYYY-MM-DD",
      fname   CHAR(100)
    )
    COLUMN TRANSFORMS (text FROM LOBFILE(fname) FROM (MY_DIRECTORY))
  )
  LOCATION ( 'loader.txt' )
);

select * from my_ext_table2;

Note: the "FROM (MY_DIRECTORY)" clause is not strictly necessary here, since we've already specified that as the default directory. However, if the loader.txt file and the individual CLOB files were in different locations, it would be needed.

Once your data is in an external table, you can either leave it there and query it like any other table, or you can use the external table as a "staging" table to get the data into a normal table. You can simply do:

CREATE TABLE normaltable AS SELECT * FROM externaltable;

to create a new normal table. And of course you can similarly do:

INSERT INTO normaltable SELECT * FROM externaltable;

Conclusion

There are several ways to load documents into LOB columns in Oracle database tables. The choice of method will depend on the nature and location of your data, the performance required for loading, and perhaps the loading method you're already most familiar with. The author welcomes corrections and comments: roger.ford@oracle.com


Getting started Part 4 - Advanced index customization

This post follows on from Part 1, Part 2 and Part 3 and uses the same example "quickstart" table.

Here we're going to look at some more advanced options for customizing the Oracle Text index. If you just want basic full-text search on a text column, you may not need to know this at all. However, if you want to make maximum use of Oracle Text, it's as well to understand the mechanisms used to customize the index.

So far we've looked at queries run against a default CONTEXT index, which you'll recall we created with the statement:

create index full_name_index on quickstart( full_name ) indextype is ctxsys.context;

CONTEXT indexes have many options available. We can't possibly cover them all here; instead we'll look at the mechanism for applying customization to an index. This is done by creating a set of preferences, each of which contains its own attributes. These preferences and attributes are created by calling PL/SQL stored procedures in the package CTX_DDL. It is in order to be able to call these procedures that we granted the CTXAPP role to our user in part one.

First of all, we'll consider defining join characters for our index. We saw earlier that the string "Smith-Smith" was indexed as two separate words, or tokens, in our index. Indeed, we can check this by peeking at one of the tables which is created automatically when we created the index. The table we need to check is the so-called "$I table" - also known as the token table - which is given the name DR$index_name$I:

SQL> select token_text from dr$full_name_index$i;

TOKEN_TEXT
----------------------------------------------------------------
DOE
JOHN
PETER
SMITH

Breaking up text into words or tokens is handled by a process known as the lexer. By default, the hyphen character - like all non-alphanumeric characters - is treated by the lexer as a break character: it causes a break between tokens. We can also change it so that it joins tokens. It can either become part of the token, so the token is "smith-smith", by defining it as a printjoins character, or it can invisibly join the token as "smithsmith" by defining it as a skipjoins character (there's a short skipjoins sketch at the end of this post). We'll make it a printjoins character here.

Looking in the documentation, we find that PRINTJOINS is an attribute of the BASIC_LEXER. So first we need to create a preference which uses the BASIC_LEXER. We need to give the preference a name, so we'll call it my_lex_pref:

SQL> execute ctx_ddl.create_preference('my_lex_pref', 'BASIC_LEXER')

Then, for this preference, we need to set the printjoins attribute:

SQL> execute ctx_ddl.set_attribute('my_lex_pref', 'PRINTJOINS', '-')

So now we have a suitable preference, which we can include in the index via an index parameters clause. First let's drop our existing index:

SQL> drop index full_name_index;

Index dropped.

And create a new one, specifying our newly-created lexer preference in the parameters clause:

SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context parameters ('lexer my_lex_pref');

Index created.

Now if we peek into the index table again, we'll see something different:

SQL> select token_text from dr$full_name_index$i;

TOKEN_TEXT
----------------------------------------------------------------
DOE
JOHN
PETER
SMITH
SMITH-SMITH

So we can see that now the "-" character has been included in the indexed token. We can also test this with a query.
By default the "-" character is an operator in the CONTAINS query language (it means the same as MINUS - subtract the score of expression2 from expression1, similar but not the same as NOT), so we will need to escape it to keep it as part of the query: SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'smith\-smith', 99) > 0 order by score(99) desc;  SCORE(99)         ID FULL_NAME---------- ---------- ------------------------------         4          3 Peter Smith-Smith We've examined one way of customizing the CONTEXT index. There are many other ways that we don't have space to consider here - see the documentation for more details.  Most of these customizations use the same model of creating a preference then setting an attribute for that preference - but there are a couple of exceptions which we'll look at. Firstly, Oracle Text has the concept of a default stoplist. This contains a list of common "noise" words which are not particularly useful for searching and take up a lot of space in the index.  In English, examples are in, on, and, for, to, that, etc. If we chose, we can use an empty stoplist (predefined as CTXSYS.EMPTY_STOPLIST) or create our own custom stoplist, and add stopwords to it: SQL> execute ctx_ddl.create_stoplist('my_stoplist')PL/SQL procedure successfully completed. SQL> execute ctx_ddl.add_stopword('my_stoplist', 'JOHN')PL/SQL procedure successfully completed. SQL> execute ctx_ddl.add_stopword('my_stoplist', 'PETER')PL/SQL procedure successfully completed. SQL> drop index full_name_index;Index dropped. SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context parameters ('stoplist my_stoplist');Index created. Now if we look in our $I table we will see a reduced set of tokens: SQL> select token_text from dr$full_name_index$i; TOKEN_TEXT----------------------------------------------------------------DOESMITH So what happens if we use a stopword in a query? The answer is that it matches any word: SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'john smith', 99) > 0 order by score(99) desc; SCORE(99)         ID FULL_NAME---------- ---------- ------------------------------        7            3 Peter Smith-Smith        4            2 John Smith The other customization which doesn't use preferences is section groups. These are designed to allow us to search within specific sections of the document such as title, author, etc. The simplest one to use is AUTO_SECTION_GROUP which automatically identifies any section marked up with XML-like tags.  To illustrate this, we'll need to add a new record to our table. insert into quickstart values (7, 'US', '<first>John</first> <last>Smith</last>'); Then we'll create a new section group using AUTO_SECTION_GROUP, and recreate the index using it: SQL> execute ctx_ddl.create_section_group('my_sec_grp', 'AUTO_SECTION_GROUP')PL/SQL procedure successfully completed. SQL> drop index full_name_index; Index dropped. SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context parameters ('section group my_sec_grp'); Index created. 
And we can then query in the FIRST section using the WITHIN operator:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'john WITHIN first', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ----------------------------------------
         3          7 <first>John</first> <last>Smith</last>

That concludes our brief look at index customization. There is much more to it - please look through the doc for more ideas.
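As promised earlier, here is a minimal sketch (my own, not from the original post) of the skipjoins alternative, which makes the hyphen invisible so that "Smith-Smith" is indexed as the single token SMITHSMITH. The preference name my_skip_lex is just an example:

execute ctx_ddl.create_preference('my_skip_lex', 'BASIC_LEXER')
execute ctx_ddl.set_attribute('my_skip_lex', 'SKIPJOINS', '-')

drop index full_name_index;

create index full_name_index on quickstart( full_name )
  indextype is ctxsys.context
  parameters ('lexer my_skip_lex');

-- a query for 'smithsmith' should now match 'Peter Smith-Smith'
select id, full_name from quickstart
 where contains ( full_name, 'smithsmith') > 0;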


Getting started Part 3 - Index maintenance

This post follows on from Part 1 and Part 2, and uses the same example "quickstart" table.

One thing that surprises new users is that Oracle Text CONTEXT indexes are not synchronous. That is, updates to the table are not immediately reflected in the index. Instead, the index must be synchronized. We can do this manually with a call to CTX_DDL.SYNC_INDEX, or we can arrange to have it done automatically for us.

Let's show the manual method first. We'll insert a new row, then search for it. We won't find it, because the index is not up to date. Then we'll call SYNC_INDEX, giving it the name of the index, and search again:

SQL> insert into quickstart values (4, 'GB', 'Michael Smith');

1 row created.

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'michael', 99) > 0 order by score(99) desc;

no rows selected

SQL> execute ctx_ddl.sync_index('full_name_index')

PL/SQL procedure successfully completed.

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'michael', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ----------------------------------------
         5          4 Michael Smith

If we don't want to bother with manual SYNC calls, we can use the parameter string "SYNC (ON COMMIT)" when we create the index. That means that immediately after updates are committed to the table, a SYNC operation will be run to bring the index into step with the table.

SQL> drop index full_name_index;

Index dropped.

SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context parameters ('sync (on commit)');

Index created.

SQL> insert into quickstart values (5, 'US', 'Scott Peters');

1 row created.

We can search for that, but we won't find it since we haven't yet committed - so the SYNC has not run:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'scott', 99) > 0 order by score(99) desc;

no rows selected

Now we can commit...

SQL> commit;

Commit complete.

And this time it will find it:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'scott', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ----------------------------------------
         5          5 Scott Peters

SYNC (ON COMMIT) is fine if you don't make too many updates to your table, but it is not ideal if you have a very large table and lots of changes take place on it. Because of the way CONTEXT indexes are built, frequent changes will cause fragmentation of the index, decreasing performance over time. So a common approach is to arrange for indexes to be synchronized at certain fixed intervals. We can do this manually, using the database scheduler, or we can have it done automatically with a "SYNC (EVERY ... )" string in the parameters clause. The time clause uses scheduler syntax, and can be a little convoluted. Every one minute can be represented in days as 1/24/60, so we can try the following:

SQL> drop index full_name_index;

Index dropped.

SQL> create index full_name_index on quickstart( full_name ) indextype is ctxsys.context parameters ('sync ( every SYSDATE+1/24/60 )');

Index created.

SQL> insert into quickstart values (6, 'US', 'Paul Peters');

1 row created.

SQL> commit;

Commit complete.

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'paul', 99) > 0 order by score(99) desc;

no rows selected

So we've committed, but the regular SYNC hasn't kicked off yet.
We can wait a minute and try again, and this time we'll find it:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'paul', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ----------------------------------------
         5          6 Paul Peters

Back in Part 1 we said that it was necessary to grant the "CREATE JOB" privilege to our user - this is why. If the user doesn't have the CREATE JOB privilege, then we will get an "insufficient privileges" error if we try to create the index with "SYNC (EVERY ...)".

Finally, we should talk about index optimization. CONTEXT indexes, over time, get less efficient as they get updated. This inefficiency takes two forms:

- The index gets filled with garbage - it contains references to deleted or updated documents which are no longer current.
- The "postings lists" get fragmented. The pointers to which documents contain each word get broken up into small chunks instead of the ideal long strings.

We can fix this by running index optimization. As with the manual call to sync indexes, this is done with a call to the PL/SQL package CTX_DDL:

SQL> execute ctx_ddl.optimize_index('full_name_index', 'FULL')

PL/SQL procedure successfully completed.

The "FULL" parameter there is the mode of optimization. You can choose from the following:

- FAST: only postings fragmentation is fixed - no garbage collection occurs.
- FULL: both fragmentation and garbage collection are dealt with.
- REBUILD: the entire index is copied to a fresh table (requires more disk space but produces the best results).

The more frequently you run SYNC, the more fragmented your index will become, and the more often you will need to optimize. A typical "maintenance regime" at many customers is to run SYNC every five minutes, then run OPTIMIZE in FULL mode nightly and OPTIMIZE in REBUILD mode once a week (see the scheduler sketch below).

The next topic is for advanced users: Getting Started Part 4 - Advanced Index Customization
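For reference, here is a minimal sketch (my own, not from the original post) of how the five-minute SYNC part of that regime might be set up with DBMS_SCHEDULER; the job name is purely illustrative:

begin
  dbms_scheduler.create_job(
    job_name        => 'SYNC_FULL_NAME_INDEX',   -- hypothetical job name
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'begin ctx_ddl.sync_index(''full_name_index''); end;',
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=5',
    enabled         => TRUE);
end;
/

The nightly FULL and weekly REBUILD optimize jobs can be created the same way, with ctx_ddl.optimize_index as the job action and a daily or weekly repeat_interval.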


Getting started Part 2 - More queries

This post follows on from Getting Started with Oracle Text. We will be looking at Oracle Text queries in a bit more detail. We'll assume you've set up the QUICKSTART table as used in part one.

We mentioned before that Oracle Text CONTAINS queries look for whole words. This means that a search for "smit" would not succeed, since "smit" does not occur as a whole word in any of our text. If we want to do partial word matches, we must use wildcards. As with the SQL LIKE operator, we can use underscore ("_") to mean "any unknown single character" and percent ("%") to mean "any number of unknown characters - including none". So to find words starting with "smit" we would do:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'smit%', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         7          3 Peter Smith-Smith
         4          2 John Smith

Again we see that the first record scores higher than the second, as there are two matches for "smit%" in the first. Note that a search for "%mit%" will work, but may be slow. If you need to use leading wildcards (that is, a wildcard at the beginning of the word) you should consider using the SUBSTRING_INDEX option - look it up if you need it. But remember that Oracle Text is primarily a word-based index; it's not really designed, at least with its default settings, for finding substrings anywhere in the text.

So what if we want to search for a phrase - two or more words together, in order? That's easy, we just use the phrase in the search:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'peter smith', 99) > 0 order by score(99) desc;

Note that unlike search engines such as Google, the CONTAINS search is quite precise. If you look for the phrase "peter smith", it will not find documents containing "Smith Peter" or "Peter John Smith". If you wanted to find documents containing combinations of those words, you could use an AND or an OR query. Let's look at one of those now:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'john OR smith', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         7          3 Peter Smith-Smith
         4          1 John Doe
         4          2 John Smith

We'll take an aside to look at the scoring here - skip to the next paragraph if you're not interested. The first row scores higher because "smith" occurs twice. OK, but why doesn't the third one score higher, as it has two hit terms, "john" and "smith"? The answer is that the OR operator scores the higher of the two expressions it joins. Since "john" and "smith" both score similar low scores, the result is just the one low score. In contrast, the AND operator scores the lower of the two expressions. This might seem unobvious, but it makes sense if you consider what happens when one of the search terms doesn't exist - and therefore scores zero. The OR operator scores the higher of (zero and something), so it always returns a score when one term is present. The AND operator scores the lower of (zero and something) - which is zero - so it always returns a zero score unless both terms are present.

Let's see a few more simple queries and their results. First the AND operator we've already mentioned - both terms must be present.
SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'john AND smith', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         4          2 John Smith

Then there's the NOT operator - if the second term is present at all, then the query fails:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'john NOT smith', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         4          1 John Doe

And the ACCUM operator - similar to OR, but instead of scoring the higher of the two expressions, it adds them together (or accumulates them):

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'john ACCUM smith', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
        52          2 John Smith
         4          3 Peter Smith-Smith
         2          1 John Doe

And a glimpse at one of the more powerful capabilities of the query language - the FUZZY operator, which does an "inexact" search for words which are spelled similarly or sound similar:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, 'fuzzy((smythe))', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         7          3 Peter Smith-Smith
         4          2 John Smith

You can combine these operators in arbitrarily complex expressions, using parentheses to establish precedence if required:

SQL> select score(99), id, full_name from quickstart where contains ( full_name, '((john ACCUM peter) OR smith) NOT peter', 99) > 0 order by score(99) desc;

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         4          2 John Smith
         2          1 John Doe

A few last things. Oracle Text has a large number of reserved words - AND, OR, ACCUM and FUZZY all being in that list. If we want to search for any of those reserved words, we must enclose them in braces to negate their special meaning:

select score(99), id, full_name from quickstart where contains ( full_name, '{fuzzy} OR {accum}', 99) > 0 order by score(99) desc;

Case sensitivity: CONTEXT indexes are case-insensitive by default in most environments - "smith" will match "Smith", and so on. However, for certain languages (notably German) this is switched and the indexes are case-sensitive by default. The default language will be based on the normal NLS settings for your database. Case sensitivity for an index can be chosen as a setting when the index is created (there's a short sketch of this at the end of this post).

This is only a glimpse at the capabilities of the Oracle Text CONTAINS query language. For a more detailed look at the operators available, see the documentation. Move on to Part 3 - Index Maintenance
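As referenced above, here is a minimal sketch (my own, not from the original post) of making an index case-sensitive. The BASIC_LEXER has a MIXED_CASE attribute; the preference name my_case_lex is just an example:

execute ctx_ddl.create_preference('my_case_lex', 'BASIC_LEXER')
execute ctx_ddl.set_attribute('my_case_lex', 'MIXED_CASE', 'YES')

drop index full_name_index;

create index full_name_index on quickstart( full_name )
  indextype is ctxsys.context
  parameters ('lexer my_case_lex');

-- with MIXED_CASE set to YES, a search for 'smith' will no longer match 'Smith'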


Getting started with Oracle Text

My previous posts have mostly been aimed at experienced Oracle Text users. But what if you (or your colleagues) have never used it before? What are the basics to get started? I hope this post will help some of those people.

Because I'm old-school (or maybe just old), my examples will be listed as though you're entering them in command-line SQL*Plus. But they will work just as well if you use them in SQL Developer.

First of all, any user of Oracle Text needs a special role, CTXAPP. Let's create a new user DEMO with that role, as well as the common CONNECT and RESOURCE roles - and also CREATE JOB, which we'll need later in this series of tutorials:

create user demo identified by demo default tablespace users temporary tablespace temp quota unlimited on users;

grant connect, resource, ctxapp, create job to demo;

Now if we're going to call Oracle Text functionality from within PL/SQL procedures, we need some extra grants (since grants to roles don't work in packages). See Step 3 in the Text Developer's Guide. We won't cover that here.

Now we'll assume the user has an application which queries a table called QUICKSTART. For the demo, we should log in as the DEMO user we just created, and run:

create table quickstart (id number primary key, country varchar2(2), full_name varchar2(40));

We'll insert some records into that:

insert into quickstart values (1, 'US', 'John Doe');
insert into quickstart values (2, 'GB', 'John Smith');
insert into quickstart values (3, 'NZ', 'Peter Smith-Smith');

Now a simple query against that table might look like:

select * from quickstart where country = 'GB';

Now let's say we want to search for anyone with "Smith" in their name. We could do this by simply extending the query to:

select * from quickstart where country = 'GB' and upper(full_name) like '%SMITH%';

but there are a few problems with this:

- There is no index. The kernel must read every full_name field and scan it. Not a problem here, but certainly a problem if there are millions of records.
- The query would match SMITHSON and BLACKSMITH. Not a problem if that's what we want, but tricky otherwise.

We can create a word-based index on the FULL_NAME column using the following syntax:

create index full_name_index on quickstart( full_name ) indextype is ctxsys.context;

Note this is like any other "create index" statement, with the addition of the phrase "indextype is ctxsys.context", telling the kernel to create a specialized CONTEXT index and allow the use of query operators associated with that indextype. For example we can do:

select * from quickstart where country = 'GB' and contains ( full_name, 'smith') > 0;

A few things to note:

- The CONTAINS operator can only be used if an index is present on the column specified in the first argument.
- The search string doesn't need any wildcards on either side. We are looking for the whole word 'smith' (if we wanted to match 'smithson' as well, then we could use a wildcard: 'smith%').
- SQL doesn't have boolean functions, so CONTAINS returns zero for a non-match and greater than zero for a match. The actual value returned is the 'score' (see later), but there are very rarely any circumstances where you do anything other than test whether the return value is greater than zero.

Scoring: most SQL queries just distinguish between rows that match a criterion and those that don't. There is rarely the concept of a good or better match.
But if we're searching a lot of technical papers for the word "Exadata", then a document which has many occurrences of this term is likely to be a better match for our search than one where there is only a single match. Therefore CONTEXT indexes return a score for each search. As mentioned, that score is returned by the CONTAINS clause, but it is rarely useful there. It's usually better to use it in the SELECT clause, and perhaps in an ORDER BY clause as well. A score is associated with a particular CONTAINS clause, of which a query may have several, so we use a number as the third argument in the CONTAINS function, and the SCORE operator takes the same number. It doesn't matter what that number is, so long as it's the same in both cases. We'll use 99 in this example:

select score(99), id, full_name from quickstart where contains ( full_name, 'smith', 99) > 0 order by score(99) desc;

This produces the output:

 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         7          3 Peter Smith-Smith
         4          2 John Smith

Note that the first item here scores higher than the second, because the search term smith appears twice. Note there is no "absolute" meaning to those scores - you can't say the first is a "7% match" or anything like that - all you can say is that a record with a higher score is more relevant than one with a lower score. Scoring is quite a complex topic, which we'll cover in more detail later.

Move ahead to Part 2 - More queries


Datastore triggers in 12c

If you want to create an Oracle Text CONTEXT index on multiple columns, you would normally use a MULTI_COLUMN_DATASTORE or USER_DATASTORE to bring in the various columns for indexing. This works well, but the kernel doesn't know which columns you've used in the index.  As far as the kernel is concerned, you have created the index on a specific column, and it will only update the index when the contents of that column change. Hence it is common to create a trigger something like the following:

CREATE OR REPLACE TRIGGER customer_update_trg
  BEFORE UPDATE OF lastname ON customers
  FOR EACH ROW
BEGIN
  :new.firstname := :new.firstname;
END;

In this case, the index was created on the firstname column, but the multi_column_datastore fetches in both lastname and firstname.  The trigger ensures that whenever lastname is changed, we force an update of firstname in order to reindex the row.

However, things have changed in 12c. As a fix for bug 14155722, triggers check the "before" and "after" state of any referenced column values, and if the column has not changed, then no index updates take place.  So the trigger above won't cause an index update if only lastname has changed.

The original behavior can be restored by an alter session or alter system operation, as follows:

alter session set "_fix_control"='14155722:OFF';

or

alter system set "_fix_control"='14155722:OFF';

If done as an alter session, it must be done before the update operation which invokes the trigger, and will need the ALTER SESSION privilege.  If done as an alter system it will apply globally, but must be run by a DBA user.  It can also be put into the database initialization file, either by adding "scope=spfile" to the alter system command, or, if a pfile is in use, by adding it explicitly there.
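For context, here is a minimal sketch of the kind of setup the trigger above is protecting. The table, preference and index names are illustrative, not taken from a real application:

begin
  ctx_ddl.create_preference('cust_ds', 'MULTI_COLUMN_DATASTORE');
  ctx_ddl.set_attribute('cust_ds', 'COLUMNS', 'firstname, lastname');
end;
/

-- The index is nominally on FIRSTNAME, but the datastore pulls in both columns
create index customers_text on customers(firstname)
  indextype is ctxsys.context
  parameters ('datastore cust_ds');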


Using Entity Extraction

Entity Extraction using Oracle Text was introduced with the 11.2.0.2 release of the Oracle Database. It's not a particularly well-known feature, so I thought I'd make some notes on how to use it. Please note that Entity Extraction uses some third party technology "under the covers" which is deliberately disabled in 11.2.0.3. If you run these examples in 11.2.0.3 (or 11.2.0.4 when it's out) you will receive a "feature not generally available" message. The functionality can be re-enabled, but only by contacting Oracle Support so they can run some checks on whether your applications will be affected by the replacement of this third party technology in the next major release.

Entity extraction is the capability to identify, and extract, named entities within a text. Entities are mainly nouns and noun phrases. They include names, places, times, coded strings (such as phone numbers and zip codes), percentages, monetary amounts and many others. The ctx_entity package implements entity extraction by means of a built-in dictionary and set of rules for English text. The capabilities can be extended for English, or for other languages, by means of user-provided add-on dictionaries and rule sets.

Let's look at a very simple, basic example.  Let's assume we have a clob containing the text:

New York, United States of America <p>The Dow Jones Industrial Average climbed by 5% yesterday on news of a new software release from database giant Oracle Corporation.

We will use ctx_entity.extract to find all the entities in the CLOB value. (For now, we won't worry about how the text got into the clob or how we provide the output clob - that's all in the examples if you want to look.)

Entity extraction requires a new type of policy - an "extract policy" - which allows you to specify options. For now, we will create a default policy, thus:

ctx_entity.create_extract_policy( 'mypolicy' );

We can then call extract to do the work. It needs four arguments: the policy name, the document to process, the language, and the output clob (which must have been initialized, for example by calling dbms_lob.createtemporary).

ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob )

outclob contains the XML identifying extracted entities.  If we display the contents (preferably by selecting it as an xmltype so it gets formatted nicely) we will see:

<entities>
  <entity id="0" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="1" offset="150" length="18" source="SuppliedRule">
    <text>Oracle Corporation</text>
    <type>company</type>
  </entity>
  <entity id="2" offset="10" length="24" source="SuppliedDictionary">
    <text>United States of America</text>
    <type>country</type>
  </entity>
  <entity id="3" offset="83" length="2" source="SuppliedRule">
    <text>5%</text>
    <type>percent</type>
  </entity>
  <entity id="4" offset="113" length="8" source="SuppliedDictionary">
    <text>software</text>
    <type>product</type>
  </entity>
  <entity id="5" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>state</type>
  </entity>
</entities>

That's fine if we're going to process it with an XML-aware program.
However, if we want it in a more "SQL friendly" view, we can use XML DB functions to convert it:

select xtab.offset, xtab.text, xtab.type, xtab.source
from xmltable( '/entities/entity'
  PASSING xmltype(outclob)
  COLUMNS
    offset number       PATH '@offset',
    lngth  number       PATH '@length',
    text   varchar2(50) PATH 'text/text()',
    type   varchar2(50) PATH 'type/text()',
    source varchar2(50) PATH '@source'
) as xtab
order by offset;

which produces as output:

    OFFSET TEXT                      TYPE                 SOURCE
---------- ------------------------- -------------------- --------------------
         0 New York                  city                 SuppliedDictionary
         0 New York                  state                SuppliedDictionary
        10 United States of America  country              SuppliedDictionary
        83 5%                        percent              SuppliedRule
       113 software                  product              SuppliedDictionary
       150 Oracle Corporation        company              SuppliedRule

Hopefully that's fairly self-explanatory.  Each entity found is listed with its offset (starting point within the text), length, text, entity type, and source (how it was found).  The entity "New York" is identified as both a City and a State, so it appears twice.

If we don't want all the different types of entity, we can select which types to fetch. We do this by adding a fifth argument to the "extract" procedure, with a comma-separated list of entity types.  For example:

ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob, 'city, country' )

That would give us the XML:

<entities>
  <entity id="0" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="2" offset="10" length="24" source="SuppliedDictionary">
    <text>United States of America</text>
    <type>country</type>
  </entity>
</entities>

Next let's look at creating a new entity type using a user-defined rule. Rules are defined using a regular-expression-based syntax. The rule is added to an extraction policy, and will then apply whenever that policy is used.

We will create a rule to identify increases, for example in a stock index. There are many ways to express an increase - we're hoping to match any of the following expressions:

  climbed by 5%
  increased by over 30 percent
  jumped 5.5%

So we'll create a regular expression which matches any of those, and create a new type of entity. User-defined entities must start with the letter "x", so we'll call our entity "xPositiveGain":

ctx_entity.add_extract_rule( 'mypolicy', 1,
   '<rule>'                                                          ||
     '<expression>'                                                  ||
        '((climbed|gained|jumped|increasing|increased|rallied)'      ||
        '( (by|over|nearly|more than))* \d+(\.\d+)?( percent|%))'    ||
     '</expression>'                                                 ||
     '<type refid="1">xPositiveGain</type>'                          ||
   '</rule>');

Note the "refid" in there. This tells us which part of the regular expression to actually match, by referencing a pair of parentheses within it.
In our case, we want the entire expression, so that's the outermost (and first occurring) pair of parentheses, so refid=1.

This time, it is necessary to compile the policy before we can use it:

  ctx_entity.compile('mypolicy');

Then we can use it as before:

  ctx_entity.extract('mypolicy', mydoc, null, myresults)

The (abbreviated) output of that would be:

<entities>
  ...
  <entity id="6" offset="72" length="18" source="UserRule" ruleid="1">
    <text>climbed by over 5%</text>
    <type>xPositiveGain</type>
  </entity>
</entities>

Finally, we're going to add another user-defined entity, but this time using a dictionary.  We want to recognize "Dow Jones Industrial Average" as an entity of type xIndex.  While we're at it, we'll add "S&P 500" as well. To do that, we create an XML file containing the following:

<dictionary>
  <entities>
    <entity>
      <value>dow jones industrial average</value>
      <type>xIndex</type>
    </entity>
    <entity>
      <value>S&amp;P 500</value>
      <type>xIndex</type>
    </entity>
  </entities>
</dictionary>

Case is not significant in this file, but note how the "&" in "S&P" must be specified as the XML entity "&amp;" - otherwise the XML would not be valid.

This XML file is loaded into the system using the CTXLOAD utility.  If the file was called "dict.load", we would use the command:

ctxload -user username/password -extract -name mypolicy -file dict.load

Again we need to compile the policy using ctx_entity.compile. Then when we run ctx_entity.extract we will see in the output:

<entities>
  ...
  <entity id="6" offset="72" length="18" source="UserRule" ruleid="1">
    <text>climbed by over 5%</text>
    <type>xPositiveGain</type>
  </entity>
  <entity id="7" offset="43" length="28" source="UserDictionary">
    <text>Dow Jones Industrial Average</text>
    <type>xIndex</type>
  </entity>
</entities>

Here's some example code so you can try it yourself.  It uses SQL*Plus bind variables, which don't work too well in SQL Developer - if you really don't like using SQL*Plus, you'll need to convert it - perhaps put the whole thing into a PL/SQL procedure and use PL/SQL variables instead of SQL*Plus variables.

Download link (right-click and "Save As" on most systems): entity_extraction.sql
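If you just want something quick to paste into SQL*Plus before tackling the full entity_extraction.sql example, the following self-contained sketch pulls together the basic calls described above. The policy name is arbitrary, and the output is simply printed rather than converted with XMLTABLE:

set serveroutput on

declare
  mydoc   clob := 'New York, United States of America <p>The Dow Jones '
               || 'Industrial Average climbed by 5% yesterday on news of a new '
               || 'software release from database giant Oracle Corporation.';
  outclob clob;
begin
  ctx_entity.create_extract_policy( 'mypolicy' );
  dbms_lob.createtemporary( outclob, true );
  ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob );
  dbms_output.put_line( dbms_lob.substr(outclob, 4000, 1) );  -- raw XML result
  dbms_lob.freetemporary( outclob );
  ctx_entity.drop_extract_policy( 'mypolicy' );               -- tidy up
end;
/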


A staging table for Oracle Text index updates

Oracle Text CONTEXT indexes differ from standard BTREE indexes in that they are not synchronous.  That is, the index is not updated at the same time that the data changes. It is necessary to invoke an index sync in some manner before any updates (or inserts) to the table are reflected in the index.

The simplest way to do this is to specify SYNC (ON COMMIT) in the parameters clause of the index creation. That means that as soon as changes are committed, the index is sync'd.  (This still isn't quite synchronous. The indexing doesn't start until the commit, so although changes to the base table will be visible to the user who made them before doing a commit, that user will NOT see those changes reflected in the index. Additionally, there will be a short period between the end of the commit and the end of the sync when other users will be able to see the table changes but not the index changes.)

SYNC(ON COMMIT) is popular, but it's not necessarily the best option.  This is because frequent syncs cause fragmentation of the index. New data is written in small chunks to the end of the $I postings list table, and the more and smaller these are, the worse query performance will be until the index is optimized. So there is usually a trade-off between update frequency and the fragmentation of an index. Users would like to have their updates available for searching immediately by means of SYNC(ON COMMIT), but DBAs would prefer to have SYNCs spaced widely apart to avoid fragmentation of the index.

One way around this is to create a staging table to hold recent updates to the main table.  The staging table is a copy of the main table's layout, with the addition of a timestamp column. Indexes are created on both tables (main and stage).  These indexes are identical, except that the main index uses manual sync and the staging table uses SYNC(ON COMMIT). We then create triggers which copy inserts and updates from the main table to the stage table, where they are indexed near-instantaneously.  Because the stage table is kept small, we needn't worry about fragmentation.  It's probably advisable to arrange things such that the stage table is "pinned" in the SGA buffer cache, but if it's used regularly, it will probably stay in memory anyway.

Queries which would previously have run against the main table are modified so that they do a UNION between the main table and the staging table. That way, we get all the unmodified rows from the main table, and the modified rows from the stage table.  (If you're thinking "won't we get the modified rows in their old form from the base table?" then remember that DELETEs are actioned immediately in a CONTEXT index, and that updates are implemented internally as DELETE followed by INSERT.)

Then we just need to put a wrapper around the SYNC for the main table which performs the actual SYNC, then deletes all the rows from the staging table which were inserted before the start of the SYNC (hence the need for a timestamp column).  If we choose, we can also optimize the stage table index at this point too.

I've written some sample code to demonstrate this on a simple table. It is intended to be run in SQL*Plus but will probably work in SQL Developer with little or no modification.  Be aware that it drops and recreates a user called TESTUSER.  You will need to edit the CONNECT line unless your SYSTEM password is "manager".

Download link (right click and save): staging_index.sql or view in browser: staging_index.sql.txt
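The downloadable script is the definitive example, but as a rough sketch of the two pieces described above (the table, column and index names here are invented), the combined query and the sync wrapper look something like this:

-- Query both copies: unchanged rows come from the main table,
-- recently changed rows from the stage table
select id, text from main_table  where contains(text, 'oracle') > 0
union
select id, text from stage_table where contains(text, 'oracle') > 0;

-- Sync wrapper: sync the main index, then discard stage rows
-- that were inserted before the sync started
declare
  sync_start timestamp := systimestamp;
begin
  ctx_ddl.sync_index( idx_name => 'main_table_index' );
  delete from stage_table where added_when < sync_start;
  commit;
end;
/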


Partition Maintenance and Oracle Text Indexes

Oracle Text supports "locally partitioned indexes" on partitioned tables. That means that if you have a RANGE partitioned base table, you can use the LOCAL keyword when you create your index, and Oracle Text will automatically produce a partitioned index, with one index partition for each partition of the base table.

A partitioned index is very much like having a separate Oracle Text index on each partition. There is a $I table, a $K table, a $R table, etc, for each partition. The naming convention for the index is changed slightly - the $I table for an unpartitioned table is DR$index_name$I, while for a partitioned table it is DR#index_name0001$I, DR#index_name0002$I, etc (the eagle-eyed will have spotted an apparent limit of 9,999 partitions, but in fact this is lifted in 11.2, as it moves to an alphanumeric naming convention after 9999). Note that currently only RANGE partitioned tables are supported with local Text indexes.

An important feature of any partitioned table is the ability to split and merge partitions. Splitting means dividing one existing partition into two new partitions, and merging means combining two (or sometimes more) partitions into a single one. Normally, any LOCAL indexes on partitions which have been split or merged become invalid - they must be rebuilt before they are usable. However, if you specify UPDATE GLOBAL INDEXES, the local indexes will be automatically rebuilt as part of the split or merge:

ALTER TABLE my_table
  MERGE PARTITIONS p2, p3
  INTO PARTITION p2and3
  UPDATE GLOBAL INDEXES;

That's fine - but if we've got lots of data in there, it's going to take quite some time to rebuild those indexes, and our application is going to be unavailable for all that time. Is there a way we can do a split or merge without having to take the application offline while we rebuild the indexes?

Yes, there is. We can make use of a feature known as Partition Exchange, which allows us to swap a complete table, including its ready-built indexes, in place of an existing partition in a partitioned table.  The syntax looks like:

ALTER TABLE my_table
  EXCHANGE PARTITION p2a WITH TABLE my_temp_table
  INCLUDING INDEXES
  WITHOUT VALIDATION;

The WITHOUT VALIDATION is optional, but makes it a lot quicker. It avoids the need for the system to check every partition key value to make sure the row is going into the correct partition. Obviously it needs to be used with care.

So now we have a solution.  When we do a merge, we first copy the data from the partitions to be merged into a temporary table. We create an index on the temporary table using all the same index options as for our main table. When that index is complete, we do the ALTER TABLE ... MERGE PARTITION as above, but without the "UPDATE GLOBAL INDEXES" clause (thus leaving our indexes invalid for a short period), and then we do the ALTER TABLE ... EXCHANGE PARTITION to swap in the partition, complete with its freshly built index, in place of the merged partition with its invalid index.

We can do something similar with SPLIT PARTITION. This time, though, we will need two temporary tables.  We copy the data out of the first part of the partition to be split into one temporary table, and the data from the second part of the partition to be split into the other temporary table.  We create indexes on both of those temporary tables (we can build the two indexes simultaneously if we have enough resources, or sequentially if not).
We then split the partition, without using UPDATE GLOBAL INDEXES:

ALTER TABLE my_table
  SPLIT PARTITION p2 AT ( 50 )
  INTO ( PARTITION p2a, PARTITION p2b );

and then we do two partition exchanges:

ALTER TABLE my_table
  EXCHANGE PARTITION p2a WITH TABLE my_temp_table_1
  INCLUDING INDEXES
  WITHOUT VALIDATION;

and again for the second:

ALTER TABLE my_table
  EXCHANGE PARTITION p2b WITH TABLE my_temp_table_2
  INCLUDING INDEXES
  WITHOUT VALIDATION;

... and that's it. Or is it? What about changes which have been made to the base table since we copied the data out into temporary tables?

Well, the simple answer to that is to make sure we don't make any changes to the table. We could even make it read-only:

ALTER TABLE my_table READ ONLY;

That solves the problem, but it's rather heavy-handed. After all, the reason we're doing all this is that it can take a long time to build text indexes, and we would prefer not to have to prevent updates to the table.  So instead we need to institute some kind of "change management".  There are doubtless several ways to achieve this, but I've done it by creating triggers which monitor any updates or inserts on the base table, and copy them to a temporary "staging" table. These transactions can then be copied back to the main table after the partition split or merge is complete, and the index sync'd in the normal way.

I've provided a pair of complete examples which illustrate this process. They do a split and a merge respectively on a simple partitioned table, creating indexes on temporary tables as necessary and exchanging those tables in place of partitions in the main table.  They also create the necessary triggers to monitor updates, and roll these back into the main table on completion.

One note about the example: I've partitioned the tables on the "price" column.  While this seemed a good idea in the context of a simple, understandable "e-commerce" type model, it's actually a pretty bad idea, since you can't change the value of a partition key if it would cause the row to move partitions.  So if we have a partition boundary at 50, and an item which costs 49, then we can't increase its price to 51.

Downloads

The "download links" are for saving to a file - on most browsers you can right click and "Save Target As" or "Save Link As".  If you want to view the code in a browser just click on "open in browser" (the file will have an extra .txt suffix if you then save it from the browser page).

split_partitions.pls - download link / open in browser
merge_partitions.pls - download link / open in browser
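One step the prose above takes for granted is the initial creation of the locally partitioned index, and of the matching index on the temporary table that will later be exchanged in. As a rough sketch (names are illustrative; the downloadable scripts show the full detail):

-- Locally partitioned CONTEXT index: one index partition per table partition
create index my_table_idx on my_table(text)
  indextype is ctxsys.context
  local;

-- The temporary table must carry an index built with the same options
-- (no LOCAL keyword here, since the temporary table is not partitioned)
create index my_temp_table_1_idx on my_temp_table_1(text)
  indextype is ctxsys.context;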


Oracle Text query parser

Oracle Text provides a rich query syntax which enables powerful text searches. However, this syntax isn't intended for use by inexperienced end-users.  If you provide a simple search box in your application, you probably want users to be able to type "Google-like" searches into the box, and have your application convert that into something that Oracle Text understands.

For example if your user types "windows nt networking" then you probably want to convert this into something like "windows ACCUM nt ACCUM networking".  But beware - "NT" is a reserved word, and needs to be escaped.  So let's escape all words: "{windows} ACCUM {nt} ACCUM {networking}".  That's fine - until you start introducing wild cards. Then you must escape only non-wildcarded searches: "win% ACCUM {nt} ACCUM {networking}".  There are quite a few other "gotchas" that you might encounter along the way.

Then there's the issue of scoring.  Given a query for "oracle text query syntax", it would be nice if we could score a full phrase match higher than a hit where all four words are present but not in a phrase.  And then perhaps lower than that would be a document where three of the four terms are present.  Progressive relaxation helps you with this, but you need to code the "progression" yourself in most cases.

To help with this, I've developed a query parser which will take queries in Google-like syntax, and convert them into Oracle Text queries. It's designed to be as flexible as possible, and will generate either simple queries or progressive relaxation queries. The input string will typically just be a string of words, such as "oracle text query syntax", but the grammar does allow for more complex expressions:

  word : score will be improved if word exists
  +word : word must exist
  -word : word CANNOT exist
  "phrase words" : words treated as phrase (may be preceded by + or -)
  field:(expression) : find expression (which allows +, - and phrase as above) within "field"

So for example if I searched for

  +"oracle text" query +syntax -ctxcat

then the results would have to contain the phrase "oracle text" and the word syntax. Any documents mentioning ctxcat would be excluded from the results.

All the instructions are in the top of the file (see "Downloads" at the bottom of this blog entry).  Please download the file, read the instructions, then try it out by running "parser.pls" in either SQL*Plus or SQL Developer. I am also uploading a test file "test.sql". You can run this and/or modify it to run your own tests or run against your own text index. test.sql is designed to be run from SQL*Plus and may not produce useful output in SQL Developer (or it may, I haven't tried it).

I'm putting the code up here for testing and comments. I don't consider it "production ready" at this point, but would welcome feedback.  I'm particularly interested in comments such as:

  "The instructions are unclear - I couldn't figure out how to do XXX"
  "It didn't work in my environment" (please provide as many details as possible)
  "We can't use it in our application" (why not?)
  "It needs to support XXX feature"
  "It produced an invalid query output when I fed in XXXX"

Downloads

The "download links" are for saving to a file - on most browsers you can right click and "Save Target As" or "Save Link As".  If you want to view the code in a browser just click on "open in browser" (the file will have an extra .txt suffix if you then save it from the browser page).
parser.pls - main program file : download link / open in browser
test.sql - for testing procedure : download link / open in browser
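As a flavour of what the parser has to do, here is a trivial sketch of just the escaping rule described above - it is not the downloadable parser.pls itself, merely an illustration that braces non-wildcarded terms and joins them with ACCUM:

create or replace function simple_text_query (p_input varchar2)
  return varchar2 is
  l_word   varchar2(200);
  l_result varchar2(4000);
begin
  for i in 1 .. regexp_count(p_input, '\S+') loop
    l_word := regexp_substr(p_input, '\S+', 1, i);
    -- leave wildcarded words alone, brace everything else
    if instr(l_word, '%') = 0 and instr(l_word, '*') = 0 then
      l_word := '{' || l_word || '}';
    end if;
    l_result := l_result || case when i > 1 then ' ACCUM ' end || l_word;
  end loop;
  return l_result;
end;
/
-- select simple_text_query('win% nt networking') from dual;
-- --> win% ACCUM {nt} ACCUM {networking}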


Oracle Text and accented characters

There are a number of lexer attributes in Oracle Text that affect the way accented characters are dealt with. We're going to look at each of these and consider the effects they have, alone and together.

Before we look into them, a quick aside on typing accented characters if you don't have them natively on your keyboard. I'm not going to go too deep into character sets and code pages, but if you're on a European Windows machine, you're probably using the WIN1252 character set. If I want to type one of the accented characters in that set, in most Windows applications I can hold down the ALT key, then enter the four-digit decimal code for that character on my numeric keypad. So to enter a lower-case o-umlaut character ("o" with two dots above it: "ö"), I would hold down ALT and type 0246 on the numeric keypad.

BASE_LETTER

Let's start with perhaps the simplest - the BASE_LETTER attribute. If I set this to "true" then characters with accents in the text will effectively be indexed as their base form. So for example if I index the German word "schön" (meaning beautiful in English) then the actual indexed token will be "schon". That means if the user searches for "schön" or "schon" then the word will be found (since the query term is processed in the same way). That makes life much easier for people who don't have the o-umlaut character on their keyboard, and don't know the ALT trick mentioned above (or don't want to have to look up the necessary character code).

So that's simple - might as well set that option on always, right? Well, not necessarily. Because in German, the word "schon" actually has a quite different meaning to "schön", and German users wouldn't want to find the unaccented word if they specifically meant to look for the accented word.

So the simple rule of thumb is this: if you think that most of your users will be searching for words for which they have the correct keyboard (for example German users searching German text) then it's best to set BASE_LETTER to false. But if you think users might want to search foreign words for which they do not have the correct keyboard (for example English, or multi-national, users searching German or mixed text) then it's best to set BASE_LETTER to TRUE.

ALTERNATE_SPELLING

Now it starts to get a little more complex. Some languages - notably German again - allow you to avoid the use of accented characters by spelling the words differently. In German, an accented character is replaced by the un-accented form of the same letter, followed by an "e". So "schön" could equally validly be spelled "schoen", and every German user would recognise it as the same word. Equally "Muenchen" (the city English speakers call Munich) would be recognized as the same as "München".

So that we can treat these alternate spelling forms as equivalent, Oracle Text has the ALTERNATE_SPELLING attribute. When set to "german", Oracle Text will look for the "alternate form" of "vowel + e" and index both that and the accented form of the word. When processing a query, it will equally translate "oe" into "ö", etc, thus ensuring that the word can always be found, regardless of which of the alternate forms is searched for, and which is in the indexed text.

Aside: anyone following closely might ask "why does it index both the alternate form "schoen" and the accented form "schön"? Surely it would be sufficient to just index the "schön" form?" Well, mostly it would. But what happens if the word in question was actually an English word in the middle of the German text, such as "poet"?
OK - it would be indexed as "pöt", and anybody searching for "poet" would still find it (since the transformation is applied to the search term as well). But what if they used a wildcard and searched for "po%"? They would still expect to find the term, but if only the accented form was indexed, they wouldn't. Hence the reason we index both forms, just in case.

Combining ALTERNATE_SPELLING and BASE_LETTER

OK, so we want ALTERNATE_SPELLING set to "german" for our German text. But we also know that people with non-German keyboards are often going to search it. So we turn BASE_LETTER on as well. What happens now?

If the indexer comes across "schön", ALTERNATE_SPELLING would normally index that without any change, but BASE_LETTER forces it to the unaccented form, and "schon" is actually indexed. If the indexer comes across "schoen", then ALTERNATE_SPELLING decides it should be indexed as both "schoen" and "schön". But before the tokens are written to the index, BASE_LETTER is applied, so the tokens "schoen" and "schon" are indexed. That all works fine, and we can find either term by searching for "schon", "schön" or "schoen". Great!

But (there's always a but) what happens if we index the French word "Rouède" (the name of a town near the Spanish border)? "uè" is not a candidate for ALTERNATE_SPELLING, so it is left alone. Then BASE_LETTER is applied, and the word "Rouede" is written to the index. If the user searches for "Rouède" then the query-time processing works the same way: the search is converted to "Rouede" and the query works fine. However, if the user searches for the base-letter form "Rouede", things don't go so well. This time ALTERNATE_SPELLING does get applied to the query term (since the query processor has no way of knowing that the "e" character should be accented) and the search term is converted to "Roüde". BASE_LETTER is then applied, and it looks for "Roude" in the index. But the indexed term is "Rouede", so nothing is found.

OVERRIDE_BASE_LETTER

To solve this problem, the OVERRIDE_BASE_LETTER attribute was introduced. If you set OVERRIDE_BASE_LETTER to "true", then ALTERNATE_SPELLING will "mask" BASE_LETTER. That means that if we meet accented characters in the text which have an alternate form (such as "ö"), we will index them in their original, accented form and also in their alternate form. If we meet them in their alternate form (eg Muenchen) we will index ONLY the alternate form and not transform them.  Accented characters which do not have an alternate form (such as "è") have BASE_LETTER processing applied to them to transform them to their equivalent unaccented character.  Then at query time, we apply only ALTERNATE_SPELLING to any appropriate accented search terms, and BASE_LETTER to all others.

This has the positive effect that our previous example "Rouède" can be found, whether searched for with or without the accent on the "e" character. It does have the negative effect that base-letter searches no longer work on German words - we can't search for "schon" any more; only "schön" or "schoen" will work. So OVERRIDE_BASE_LETTER makes sense if we want to perform ALTERNATE_SPELLING on German (or other specified language) words, and BASE_LETTER on all other languages.

Appendix: Test Script

This is the script I used to test the effects of the various options. To avoid any issues with character set translation, I used the UNISTR() function to create Unicode characters for my accented characters. Note the German words are prefixed by two-letter codes: "wa" - with accent, "na" - no accent and "af" - alternate form.
That allowed me to distinguish in the index between the index terms derived from "schön" and those derived from "schon".

begin
  ctx_ddl.drop_preference ('my_lexer');
end;
/

begin
  ctx_ddl.create_preference ('my_lexer', 'BASIC_LEXER' );
  ctx_ddl.set_attribute( 'my_lexer', 'BASE_LETTER', 'true' );
  ctx_ddl.set_attribute( 'my_lexer', 'OVERRIDE_BASE_LETTER', 'true');
  ctx_ddl.set_attribute( 'my_lexer', 'ALTERNATE_SPELLING', 'german' );
end;
/

drop table tt;

create table tt (a1 number primary key, text varchar2(45));

-- town name "Rouède", accent on the e
insert into tt values (1, 'rou'||unistr('\00E8')||'de');
-- shön with accent (wa)
insert into tt values (2, 'wash'||unistr('\00F6')||'n');
-- shon no accent (na)
insert into tt values (3, 'nashon');
-- muenchen alternate form (af)
insert into tt values (4, 'afmuenchen');

commit;

select * from tt;

create index tta on tt(text) indextype is ctxsys.context
parameters ( 'lexer my_lexer' );

set feedback 2

select token_text, token_type from dr$tta$i;

PROMPT searching for the base letter form, without accent on the first e
select * from tt where contains(text, 'Rouede') > 0;

PROMPT and with the accent
select * from tt where contains(text, 'Rou'||unistr('\00E8')||'de') > 0;

--select * from tt where contains(text, 'm'||unistr('\00FC')||'nchen') > 0;
--select * from tt where contains(text, 'muenchen') > 0;

set echo on

--select * from tt where contains(text, 'nashoen') > 0;
--select * from tt where contains(text, 'nashon') > 0;
--select * from tt where contains(text, 'na'||unistr('\00F6')||'n') > 0;

select * from tt where contains(text, 'washon') > 0;
select * from tt where contains(text, 'washoen') > 0;
select * from tt where contains(text, 'wa'||unistr('\00F6')||'n') > 0;

set echo off

-- The following section shows how to see how the query has been transformed -
-- it shows the actual words looked for in the index.

drop table test_explain;

create table test_explain(
  explain_id  varchar2(30),
  id          number,
  parent_id   number,
  operation   varchar2(30),
  options     varchar2(30),
  object_name varchar2(64),
  position    number,
  cardinality number);

begin
  ctx_query.explain(
    index_name    => 'tta',
    text_query    => 'wasch'||unistr('\00F6')||'n',
    explain_table => 'test_explain',
    sharelevel    => 0,
    explain_id    => 'Test');
end;
/

col explain_id for a10
col id for 99
col parent_id for 99
col operation for a10
col options for a10
col object_name for a20
col position for 99

select explain_id, id, parent_id, operation, options, object_name, position
from test_explain order by id;


Oracle Text on Solid State Disks

We've known for some time that putting Oracle Text indexes onto flash-memory-based solid state disks (SSDs) gives some big advantages. A while back, we tried benchmarking Oracle Text on Fusion I/O devices, which showed huge improvements. But those PCI-based devices are expensive, whereas consumer-level SSDs are getting cheaper and cheaper. I decided to compare some Oracle Text queries on an inexpensive SSD against the same queries on a rotational disk.

Test Hardware:

Dell D6410 laptop
8GB RAM
2.4GHz Intel i5 CPU (2 cores, 4 threads)
Mushkin Chronos 120GB SSD
Seagate 750GB conventional disk (5400 rpm)

How I tested

First I created two Oracle users, "flash" and "nonflash". Then I created two tablespaces of 20GB each, one on the SSD and one on the conventional disk, and set the default tablespace for each user to the appropriate one.

Next I loaded five million rows of random text, totalling about 10GB. For this I used an OCI C program, "randomtext_oci.c". This creates random-length texts up to 4K in length, composed of random word lengths and random characters. The words are composed of a fairly small set of characters - the consonants BCDFG alternating with the vowels AEIO. Although this is not really representative of real text, it's better than completely random characters. In our 10GB of text, there are around 210,000 unique words, which is comparable to what you'd expect in a similarly-sized collection of clean text (by "clean", I mean very few spelling mistakes or made-up words).

This text was then copied from one user to the other (using CREATE TABLE mydocs AS SELECT * FROM flash.mydocs UNRECOVERABLE) and each table was indexed with a simple Oracle Text index:

create index mydocs_index on mydocs(text) indextype is ctxsys.context
parameters ('memory 500000000')
/

The index took around 40 minutes to create. It was a little quicker on the SSD, but not much.

I then ran several queries to check the performance with different numbers of hits returned. Each query took the form:

SELECT COUNT(*) FROM mydocs WHERE CONTAINS (text, '<searchterms>') > 0;

Using COUNT(*) means that we're testing only the cost of fetching rowids from the text index - if we fetched other data from the table we'd be measuring the cost of fetching base table blocks as well.

Before running each query, I rebooted the machine (thankfully quite quick as the OS is on SSD!) then pre-loaded the $R table using the LoadAllDollarR procedure as documented here. The $R table generally gets fully cached in memory within a short time, so we are mainly interested in the time needed to fetch the $I table info. I also ran a query which fetched zero results, to make sure the Oracle Text data dictionary was loaded into cache.

Here are the queries, and the results I found:

Query        No. of Hits   Conventional Disk   SSD        Difference
bab          1,283,890     1.14 sec            0.53 sec   2.2 x
bad OR bac   2,149,822     2.73 sec            0.90 sec   3.0 x
ba           3,478,811     5.44 sec            1.23 sec   4.4 x
b            4,609,044     8.26 sec            1.95 sec   4.2 x

We can see this graphically (number of hits on the horizontal axis, number of seconds on the vertical axis).

Conclusions

Because Oracle Text queries require lots of random I/O to fetch "posting lists" from the index, we see big improvements from using solid state disks.  Although the advantages would be less marked if we compared against faster rotational disks (such as 15K rpm SAS disks), we would still most likely see a significant improvement.
And the SSD used here was a very inexpensive model (at around US $1 per GB), so we might expect to see even better improvements using a more advanced professional device, or by using a PCI-based solution rather than a SATA device.
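If you want to reproduce the general shape of the test against your own data (timings will of course differ), the queries were simply of this form:

set timing on
select count(*) from mydocs where contains (text, 'bab') > 0;
select count(*) from mydocs where contains (text, 'bad OR bac') > 0;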


Indexing data from multiple tables with Oracle Text

It's well known that Oracle Text indexes perform best when all the data to be indexed is combined into a single index. The query

select * from mytable where contains (title, 'dog') > 0 or contains (body, 'cat') > 0

will tend to perform much worse than

select * from mytable where contains (text, 'dog WITHIN title OR cat WITHIN body') > 0

For this reason, Oracle Text provides the MULTI_COLUMN_DATASTORE, which will combine data from multiple columns into a single index. Effectively, it constructs a "virtual document" at indexing time, which might look something like:

<title>the big dog</title>
<body>the ginger cat smiles</body>

This virtual document can be indexed using either AUTO_SECTION_GROUP, or by explicitly defining sections for title and body, allowing the query as expressed above. Note that we've used a column called "text" - this might be a dummy column added to the table simply to allow us to create an index on it - or we could have created the index on either of the "real" columns, title or body.

It should be noted that MULTI_COLUMN_DATASTORE doesn't automatically handle updates to the columns it uses - if you create the index on the column text, but specify that columns title and body are to be indexed, you will need to arrange triggers such that the text column is updated whenever title or body are altered.

That works fine for single tables. But what if we actually want to combine data from multiple tables? In that case there are two approaches which work well:

Create a real table which contains a summary of the information, and create the index on that using the MULTI_COLUMN_DATASTORE. This is simple and effective, but it does use a lot of disk space, as the information to be indexed has to be duplicated.

Create our own "virtual" documents using the USER_DATASTORE. The user datastore allows us to specify a PL/SQL procedure which will be used to fetch the data to be indexed, returned in a CLOB, or occasionally in a BLOB or VARCHAR2. This PL/SQL procedure is called once for each row in the table to be indexed, and is passed the ROWID value of the current row being indexed. The actual contents of the procedure are entirely up to the owner, but it is normal to fetch data from one or more columns of database tables.

In both cases, we still need to take care of updates - making sure that we have all the triggers necessary to update the indexed column (and, in the first case, the summary table) whenever any of the data to be indexed gets changed.

I've written full examples of both these techniques, as SQL scripts to be run in the SQL*Plus tool. You will need to run them as a user who has the CTXAPP role and the CREATE DIRECTORY privilege. Part of the data to be indexed is a Microsoft Word file called "1.doc". You should create this file in Word, preferably containing the single line of text: "test document". This file can be saved anywhere, but the SQL scripts need to be changed so that the "create or replace directory" command refers to the right location. In the example, I've used C:\doc.

multi_table_indexing_1.sql : creates a summary table containing all the data, and uses multi_column_datastore
Download link / View in browser

multi_table_indexing_2.sql : creates "virtual" documents using a procedure as a user_datastore
Download link / View in browser
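To make the second approach a little more concrete, here is a rough sketch of what a USER_DATASTORE procedure and preference can look like. The table and column names (MYTABLE with TITLE and BODY, plus a related MYCOMMENTS table) are invented for illustration - the downloadable scripts contain the real worked examples:

create or replace procedure my_user_ds_proc
  (rid in rowid, tlob in out nocopy clob) is
  v_doc clob;
begin
  -- build a "virtual document" from the row being indexed...
  for m in (select title, body from mytable where rowid = rid) loop
    v_doc := '<title>' || m.title || '</title>' ||
             '<body>'  || m.body  || '</body>';
  end loop;
  -- ...plus related rows from a second table
  for c in (select c.comment_text
              from mycomments c, mytable m
             where m.rowid = rid and c.doc_id = m.id) loop
    v_doc := v_doc || '<comments>' || c.comment_text || '</comments>';
  end loop;
  tlob := v_doc;
end;
/

begin
  ctx_ddl.create_preference('my_uds', 'USER_DATASTORE');
  ctx_ddl.set_attribute('my_uds', 'PROCEDURE', 'my_user_ds_proc');
end;
/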


Indexing MediaWiki with SES

Introduction

MediaWiki is a popular Wiki system. It is basically the same software used for Wikipedia and its many spin-offs and copies. It is possible to index a MediaWiki with the Web Crawler in SES, but this has certain disadvantages:

The web crawler has to follow all links, which is not the most efficient way of getting all pages.
In order to determine whether a page has changed, the web crawler has to access all page headers, which can put a heavy load on the Wiki's web server.
If some content is not public, there is no way for the web crawler to access the security information to see who can see a page.

Instead, we want to create a database crawler, which can access the underlying database holding the Wiki pages. This article shows you how to do that. It makes certain assumptions:

The database storing the pages is MySQL. MediaWiki supports other databases, but this has not been tested.
All content is public. It should be quite possible to extend this method for proper secure search, but that's not covered here.

This project was based on the Turnkey MediaWiki appliance - an easy-to-install preconfigured machine with Ubuntu Linux, PHP, MediaWiki and MySQL. It was installed into a virtual machine with 512MB RAM and 8GB of disk running under VMWare Server. The version is 2009.10.

Configuring MySQL

We need to ensure that we have a user who is able to connect to the wiki from the SES machine. MySQL in the Turnkey system is configured to only allow connections from the local machine. We need to modify its config to allow external connections from the SES box. This will normally require you to edit the file /etc/mysql/my.cnf. If you have a line:

bind-address = 127.0.0.1

you need to change it to something like

bind-address = 123.123.123.321

where the address is the external IP address for your MediaWiki server (find it with /sbin/ifconfig).

Then you need to create a MySQL user who can access the wiki tables from the SES machine. On the wiki machine, do:

mysql -u root -ppassword
mysql> create user 'sesuser'@'%' identified by 'secretpassword';
mysql> grant all on wiki_db.* to 'sesuser'@'%';

Note: the '%' should really contain your SES host name to restrict access to just the SES host. However, I was unable to get connections working this way. '%' allows the user to connect from any host.

If you have an older version of MySQL in your installation, it may be configured with "skip_networking". To find out, run

mysqladmin -u root -ppassword variables

If skip_networking is ON, you will need to change it. This page may help.

Configuring SES to access MySQL

SES does not come, out of the box, with the JDBC drivers needed to access MySQL. We must download them and do some simple configuration. Most of this is discussed in the SES Admin Guide.

Go here and download the Connector/J package. Open the archive, and extract the file mysql-connector-java-5.1.10-bin.jar (of course this name may change with future versions). Save the jar file into the directory $ORACLE_HOME/search/lib/plugins/oracleapplications on your SES server.

Now edit the file drivers.properties in the same directory. Add a line as follows:

mysql: com.mysql.jdbc.Driver,PAGEID

This tells SES two things:

The name of the JDBC driver for MySQL
The name of the unique key column we are going to use. SES defaults to KEY as the key string, but KEY is a reserved word in MySQL.

Finally we need to make sure SES can find the driver JAR file.
For this we need to edit the MANIFEST file in the appsjdbc.jar file in the same oracleapplications directory. If you're lucky enough to be an Emacs user on Linux, you can do this by opening the jar file in emacs and directly editing the META-INF/MANIFEST.MF file. Edit the Class-Path line to read:

Class-Path: sqljdbc.jar mysql-connector-java-5.1.10-bin.jar

And save the file. (Change the jar file name if you're using a different version.)

If you use an inferior editor, or you're on Windows and emacs can't edit jar files, you will need to do:

jar xvf appsjdbc.jar
<edit META-INF/MANIFEST.MF to make changes as above>
jar cvmf META-INF/MANIFEST.MF appsjdbc.jar oracle
rm -rf oracle   # or similar windows delete command

Do make sure you're in the right (oracleapplications) directory when you run that last command! There's no need to restart SES after making these changes. SES reads the files afresh at the start of each crawl.

Creating the Source in SES

Then we create a database crawler in SES. These are the parameters (you'll need to edit the database connect string and URL prefix to include the hostname or IP address of the MediaWiki server, and possibly change the port 3306 if your MySQL uses a non-standard port):

Database Connection String: jdbc:mysql://mediawiki.mydomain.com:3306/wiki_db
User ID: sesuser
Password: secretpassword
Query: select p.page_id as PAGEID, p.page_title as URL, t.old_text as CONTENT, p.page_title as TITLE, r.rev_timestamp as LASTMODIFIEDDATE, r.rev_user_text as AUTHOR, 'EN' as LANG from page as p left join revision as r on p.page_id = r.rev_page left join text as t on r.rev_text_id = t.old_id where r.rev_id in ( select max(rev_id) from revision group by rev_page )
URL Prefix: http://mediawiki.mydomain.com/index.php/

All other parameters can be left at their default values. In page 2 (Authorization) you must set Authorization to "No Access Control List". You MUST also clear the two lines Plug-in Class Name and Jar File Name.

We're done! Crawl away.

Here's that SQL in a more readable form:

select
  p.page_id        as PAGEID,
  p.page_title     as URL,
  t.old_text       as CONTENT,
  p.page_title     as TITLE,
  r.rev_timestamp  as LASTMODIFIEDDATE,
  r.rev_user_text  as AUTHOR,
  'EN'             as LANG
from page as p
left join revision as r on p.page_id = r.rev_page
left join text as t on r.rev_text_id = t.old_id
where r.rev_id in ( select max(rev_id) from revision group by rev_page )

This is a fairly basic crawl. The page contents, title and author are fetched for the latest revision of each page. The title is fetched in the slightly unfriendly internal form - for example Wiki_Index_Notes. A better crawl would transform this by replacing the underscores with spaces. There is also probably much other useful information which could be extracted from the various tables, including such things as the summary text, the link-in count (for relevance boosting) and perhaps link text as well. This could be achieved by further customization of the SQL above, by reference to the table descriptions and schema diagram in the links below.

Useful Links

MediaWiki.org Homepage
MediaWiki Database Layout Table Descriptions
MediaWiki Database Scheme (Large .PNG Image)
MySQL documentation library
Turnkey MediaWiki appliance
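As a small example of the kind of customization suggested above, the TITLE column could be made friendlier by swapping underscores for spaces directly in the crawler query, using MySQL's REPLACE function (the rest of the statement is unchanged):

select
  p.page_id                        as PAGEID,
  p.page_title                     as URL,
  t.old_text                       as CONTENT,
  replace(p.page_title, '_', ' ')  as TITLE,   -- friendlier title
  r.rev_timestamp                  as LASTMODIFIEDDATE,
  r.rev_user_text                  as AUTHOR,
  'EN'                             as LANG
from page as p
left join revision as r on p.page_id = r.rev_page
left join text as t on r.rev_text_id = t.old_id
where r.rev_id in ( select max(rev_id) from revision group by rev_page )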


Office 2007 files in SES 10.1.8.4

Following on from my previous article, here's another handy use for a document service. SES 10.1.8.4 cannot currently handle the Office 2007 file formats (.docx, .pptx and .xlsx), but these can be handled by the filters in the latest versions of Oracle Text. The next release of SES will also use these new filters, and thus have full support for Office 2007. Meanwhile, if we happen to have an installation of the Oracle 11.1.0.7 database on the same machine as SES, we can use the 11.1.0.7 filters to do the filtering within SES.

We can do this because document services have access to the original pre-filtered binary documents. So even though the built-in SES filters will have failed/refused to index the documents, we can pick them up in the document service, filter them using the external filter executable from the 11g installation, and feed the resulting HTML stream back to SES for indexing.

The document service to do this can be found here: OfficexFilter.zip. Unzip it and check the readme.txt file for installation instructions. Please note that although Oracle 11.1.0.6 is downloadable from Oracle.com, you need the 11.1.0.7 patchset to get the new filters, which is available from Metalink. Also, I'm not trying to guess what the licensing implications are here - you would need to discuss that with your Oracle Sales Representative, or someone else who deals with that sort of thing (which isn't me!).


Choosing which documents to index with a document service

A question posed recently was "How can I tell the NTFS crawler to index normal files but not folders?" The initial answer was "There is no option available to prevent submitting of folders for indexing in an NTFS source". But like many problems in SES, it can be solved quite easily with a document service.

Document services sit in the pipeline between the crawler (which fetches the documents from a source) and the indexer, which analyzes and indexes the document.  In a document service you can:

Add or modify attributes of the document
Change, add or delete some or all of the text of the document
Specify whether the current document should be indexed or not

We want to use number 3 here - specify that certain documents should not be indexed. We know that NTFS folders have a URL which ends with a "/".  It would be easy enough to write a document service that checks for this and specifies that a matching document should not be indexed.  But let's make it a bit more flexible, and allow the user to specify a regular expression.  If the URL matches the regular expression, the document will not be indexed.

We do this by defining a parameter to the document service. The document service is installed in SES from a jar file, then we create an instance of the document service, complete with a parameter, which is the regular expression to match. In this case, the regular expression would be ".*/$" (without the quotes), meaning "match anything ending with a forward slash".

You can download the document service to do this from this link: RejectionFilter.zip. It's a zip file, so unzip it with your favorite utility on Windows or using "unzip" on Unix/Linux. The file readme.txt contains full instructions for installing it (hopefully - if not, let me know).

UPDATE!

It seems I posted this a little early. Since I didn't have an NTFS source set up on the current 10.1.8.4 version, I tested this on the development machine, running an early release of 11g SES. It worked fine. Unfortunately, it seems it doesn't work with 10.1.8.4. In the current release, NTFS folders do not have a trailing slash - indeed there is no way to tell from looking at the URL whether a "document" is a folder or a file.  Fortunately, however, we can look at the document size instead - a folder has no content, so the size is zero. I modified the document service to include a new parameter for minimum document size. Set this to 1, and folders (plus any truly empty documents) will get ignored. You can get the updated document service from here: RejectionFilterV11.zip
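If you want to sanity-check a regular expression before plugging it into the document service parameter, the matching rule itself is easy to try out in the database. This is just an illustration of the pattern, and has nothing to do with the SES plug-in API:

-- ".*/$" matches anything ending in a forward slash, i.e. a folder URL
select case when regexp_like('smb://server/share/folder/', '.*/$')
            then 'reject (folder)'
            else 'index (file)'
       end as decision
  from dual;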


Launching SES schedules from the command line

The scheduler within the SES admin system has some limitations. For example you can't ask a schedule to run every weekday but not at weekends. For this reason some users prefer to use a scheduler such as "cron" instead.

There is a utility called "searchadminctl" which can be used to launch schedules, but this will not "queue" schedules - if the schedule is already running then searchadminctl will just exit. The shell script below checks the status of the schedule, and if it is in "launching" or "executing" status it will wait until it completes before restarting it.

NB: to check the schedule, it looks in an internal table. It is possible, though unlikely, that this table name or layout could change in a future version.

#!/bin/bash
# Shell script to launch a schedule. If the schedule is already running,
# wait for it to complete.
# Note: if you get an error "./launch.sh: line 29: [: too many arguments"
# it probably means you have the schedule name wrong

# SET THESE NEXT FOUR VARIABLES TO SUIT YOUR ENVIRONMENT

# File to set up environment variables ORACLE_HOME, ORACLE_SID and PATH
export ENV_FILE=/home/raford/ses.env
# SES Admin Password
export SES_PWD=sespassword
# Name of Schedule to launch
export SCHEDULE_NAME="My Schedule Name"
# Delay time to wait between tests if schedule is running (seconds)
export WAIT_TIME=60

. $ENV_FILE

export crawlstat=EXECUTING
export firsttime=TRUE

until [ $crawlstat != EXECUTING -a $crawlstat != LAUNCHING ] ; do

  if [ $firsttime != TRUE ] ; then
    echo Sleeping...
    sleep $WAIT_TIME
  fi
  export firsttime=FALSE

  sqlplus -s eqsys/$SES_PWD > /tmp/sql$$ <<EOF
set feedback off
set heading off
exec eq_adm.use_instance(1)
select cs_state from eq\$crawler_sched where cs_name = '$SCHEDULE_NAME';
quit
EOF

  export crawlstat=`tail -1 /tmp/sql$$`
  echo status is $crawlstat

done

echo Launching Schedule...

searchadminctl -p $SES_PWD <<EOF
schedule -start "$SCHEDULE_NAME"
quit
EOF

# tidy up
rm /tmp/sql$$
echo ' '
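As an illustration of the cron side (the script path and timing here are placeholders), a crontab entry that runs the script above at 02:00 on weekdays only might look like this:

# minute hour day-of-month month day-of-week (1-5 = Monday to Friday)
0 2 * * 1-5 /home/raford/launch.sh >> /home/raford/launch.log 2>&1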


The "oc4j log file" in SES

A common message in Secure Enterprise Search (SES) is something like "xxx operation has failed. See the oc4j log file for more details". So what is the oc4j log file, and how do we read it?

"oc4j" is Oracle Containers for Java - the light-weight web server used by SES. The oc4j log file contains all the logging (and sometimes trace/debug) messages written by oc4j.

Accessing the oc4j log file is easy if you have login access to the server on which SES is running.  If we assume that $SES_HOME represents the installation directory for SES, you will find the oc4j log file in $SES_HOME/oc4j/j2ee/OC4J_SEARCH/log. In 10.1.8.3 or earlier, there will only be a single file, "oc4j.log". In 10.1.8.4 there may be multiple files which include the date they were created.

If you don't have login access to the server, things are rather more difficult.  However, we can solve it with a one-off installation of a special jsp file. I found an excellent general-purpose jsp-based file browser, but this is rather too powerful for our needs. Even configured for read-only access to the specific oc4j log directory, it gives a lot of information about other drives and directories. Also, the oc4j log file can sometimes contain information you wouldn't want just anyone to see - like the titles of documents indexed and the names of users - so we really want to restrict access to people who are already logged in to the SES admin system.

So I wrote a new jsp file which can be downloaded from here. You will need access to the server once to install this file, or you can ask a system administrator to install it on your behalf. Save the downloaded oc4jlogs.jsp file to the directory $SES_HOME/oc4j/j2ee/OC4J_SEARCH/applications/search_admin/admin/control

It can then be accessed with the URL http://server:port/search/admin/control/oc4jlogs.jsp (replacing "server" and "port" with the values usually used by SES). That will give you a list of all the files in the oc4j log directory, and you can click on any file to view the contents (it will not allow access to any files outside of the oc4j logs directory).

You must be logged in to the admin system to use this jsp.  If you are not logged in, it will redirect to the login page. Entering the password will drop you into the "main" admin page rather than returning you to the oc4j logs page, and you will need to manually re-enter the URL for the oc4j logs.
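(If you do have shell access to the server, the quickest route is simply to look in the directory mentioned above, for example:)

cd $SES_HOME/oc4j/j2ee/OC4J_SEARCH/log
ls -lt             # newest files first (10.1.8.4 log names include the date)
tail -f oc4j.log   # 10.1.8.3 and earlier use a single oc4j.log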
