
An Oracle blog about Oracle Text Index

Oracle Text Filters - unofficial updates

Roger Ford
Product Manager

If you want Oracle Text to index the contents of a binary document - maybe PDF, Microsoft Word, or Excel - then Oracle Text needs to extract the indexable text from the document before it can index it.

It does this using Oracle's own Outside In Technology (OIT). Outside In recognizes over 150 different document formats.

The actual OIT implementation in Oracle Text consists of an executable "ctxhx" (which on Linux systems lives in $ORACLE_HOME/ctx/bin) and a set of libraries and data files (in $ORACLE_HOME/ctx/lib).

If you're on an older version of Oracle (let's assume 11.2.0.4) then the OIT libraries will date from a similar time. That means they may not handle some newer file formats, and they may also contain bugs which have been fixed in newer versions. Such bug fixes can't always be backported. The Oracle Text development team is working on backporting the latest OIT libraries to 11.2.0.4, but it's a big job. There are 177 shared libraries, and the source code runs to thousands of files.

In the meantime, is there any way you can run a more recent version of OIT with an earlier version of Oracle?

Not a supported way, no. But if you just want to do some testing, then it can be done.  The following instructions assume a Linux system.  You can probably do similar on a Solaris system - I have no idea if it would work on Windows.

Just to repeat: this is not supported. If you do this, you are on your own when it comes to filtering problems. You can email me, but Oracle Support won't help.

Firstly, we need to install Oracle 12c, the latest version (currently 12.2.0.1).  You can install that on the same machine as your 11.2.0.4 (or 12.1) system, or you can install it on a separate machine (but see the discussion on paths below). Just install the software, no need to create a database.

We then need to copy the ctxhx executable from the newly installed 12.2 system to the 11.2 system. ctxhx is found in $ORACLE_HOME/ctx/bin. Obviously keep the old version (rename it or move it) in case of problems. So we should now have the new ctxhx executable in $ORACLE_HOME/ctx/bin in the old 11.2 environment.
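As a rough sketch of that copy step, the function below keeps the original executable under a ".orig" suffix and then copies the newer one into place. The function name swap_ctxhx, the ".orig" suffix and the two arguments (the old and new ORACLE_HOME directories) are illustrative assumptions, not part of any Oracle tooling.

```shell
# Sketch: back up the old ctxhx and copy in the newer one.
# $1 = ORACLE_HOME of the old (11.2) install, $2 = ORACLE_HOME of the new (12.2) install.
# Both arguments and the ".orig" suffix are hypothetical choices for illustration.
swap_ctxhx() {
    old_bin="$1/ctx/bin/ctxhx"
    new_bin="$2/ctx/bin/ctxhx"

    # Refuse to touch anything if the new executable is not there.
    [ -f "$new_bin" ] || { echo "missing: $new_bin" >&2; return 1; }

    # Keep the original once; never overwrite an existing backup.
    if [ -f "$old_bin" ] && [ ! -e "$old_bin.orig" ]; then
        mv "$old_bin" "$old_bin.orig" || return 1
    fi

    # -p preserves the mode and timestamps of the copied executable.
    cp -p "$new_bin" "$old_bin"
}
```

It would be called with the two home directories, for example swap_ctxhx "$ORACLE_HOME" /path/to/12.2/home, where the second path is wherever 12.2 was installed. To go back, move ctxhx.orig over ctxhx again.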

Now the ctxhx executable needs to call all the shared libraries normally found in $ORACLE_HOME/ctx/lib. You can run it on the command line using:

$ cd $ORACLE_HOME/ctx/bin

$ ./ctxhx

If you get a long usage message, it's working.  If you installed 12.2 on a different machine, you will likely get an error such as:

./ctxhx: error while loading shared libraries: libsc_ca.so: cannot open shared object file: No such file or directory

This is because the path to the libraries is hard-coded into the executable when it is linked. If your 12.2 installation still exists, and is on the same machine, the executable will use the libraries from there. If the installation was on another machine, or you've deleted the 12.2 installation, it's not going to work.
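To see which libraries fail to resolve and which path was baked in, the standard Linux tools ldd and readelf can help. The sketch below assumes the hard-coded path is stored as an RPATH or RUNPATH entry in the ELF dynamic section, which is the usual mechanism on Linux but has not been verified for ctxhx. The helper names missing_libs and embedded_lib_path are made up for illustration.

```shell
# Print the names of shared libraries that ldd reports as unresolved.
# Reads ldd output on stdin, e.g.:  ldd ./ctxhx | missing_libs
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Print any RPATH/RUNPATH entries recorded in a binary.
# Needs readelf from binutils; prints nothing if the binary has no such entry.
embedded_lib_path() {
    readelf -d "$1" | grep -E 'RPATH|RUNPATH'
}
```

Running ldd "$ORACLE_HOME/ctx/bin/ctxhx" | missing_libs should list names such as libsc_ca.so when the libraries cannot be found, and embedded_lib_path "$ORACLE_HOME/ctx/bin/ctxhx" should show the directory the executable expects, if one is recorded.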

In theory it ought to be possible to relink the ctxhx executable, and use the new libraries in the old location. But I've not managed to get that to work (if you want to try - there's a make file ins_ctxhx.mk in $ORACLE_HOME/ctx/lib).

So we really have a few choices:

  1. If the 12.2 installation was on the same machine, leave the newly-installed 12.2 libraries where they were installed originally. You can delete the rest of the 12.2 installation, but leave $ORACLE_HOME/ctx/lib.
  2. If the 12.2 installation is done on a different machine, make sure the installation path on that machine exactly matches the path on the 11.2 installation.  That way, the hard-coded paths will match. Then you'll need to replace all the files in 11.2's $ORACLE_HOME/ctx/lib with the files from the 12.2 installation on the other machine.
  3. I've not tried it, but it might be possible to set LD_LIBRARY_PATH to point to a new location for the libraries.  Normally, LD_LIBRARY_PATH must be set before starting the TNS listener, using 'lsnrctl start', since environment settings are inherited from the listener.
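Option 3 could be tried from the command line along the following lines. This is an untested sketch: the helper name run_with_libs and the directory passed to it are hypothetical, and it only covers running ctxhx by hand, not the listener case mentioned above.

```shell
# Run a command with an extra directory prepended to LD_LIBRARY_PATH.
# $1 = directory holding the copied 12.2 libraries (hypothetical location),
# remaining arguments = the command to run.
run_with_libs() {
    libdir=$1
    shift
    # Append the existing LD_LIBRARY_PATH only if it is set and non-empty.
    env LD_LIBRARY_PATH="$libdir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" "$@"
}
```

For example, run_with_libs /path/to/new/ctx/lib "$ORACLE_HOME/ctx/bin/ctxhx" should print the usage message if the loader finds everything it needs. Note that on Linux an RPATH entry in the executable is searched before LD_LIBRARY_PATH, so if the original hard-coded directory still exists and contains the libraries, those will be picked up first.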

I haven't done extensive testing on this. It works for the few file formats I've tried, but it's entirely possible that filtering some obscure file format will cause problems.

Let me know in the comments if you've tried this and it works - or it doesn't work - or even better if you've figured out how to get the 12.2 executable to relink with the proper paths.

 


Comments ( 1 )
  • Mark Powers Thursday, January 31, 2019
    Very useful post. I am running into the opposite issue, 11.2.0.4 is working on our PDFs, but is not 12.2.