These days, getting useful insights from all kinds of information is more important than ever. Vector search is becoming really popular because it lets you search by meaning, not just specific words, and it can quickly find answers based on similarities. To make searches even more accurate and relevant, keyword search (known as Oracle Text) comes into play. This is the foundation of Hybrid Search, which combines both keyword search and vector-based search. By integrating these methods, hybrid search can address a broader range of queries, including those requiring both precise results and a deeper understanding of user intent.

Nevertheless, Oracle Text, Oracle’s full-text search tool, included in all database editions, remains popular and significant on its own, and it is utilized in numerous customer projects. For example, it’s used for all sorts of use cases, like searching for company names, spotting financial fraud by matching partial or misspelled names and addresses, managing documents with checkin/checkout/download features, searching through metadata, and much more. Oracle Text, is built into every edition of Oracle Database and works both in the cloud and on-premises.  

In addition there is a deep integration with other Oracle Database features and technologies including but not limited to JSON, Spatial, Graph, but also security, availability, performance, partitioning, sharding (aka Oracle Global Distributed Databases), RAC, Exadata, and more. In general, data held in the Oracle Database can be directly accessed via SQL, without the need to convert it into an intermediary form. Combining data in different formats is easy – all you need to do is join! 

In this tutorial, I’ll show you how to use Oracle Text search on Oracle Autonomous Database Serverless (ADB-S). Whether you want a refresher on Oracle Text or want to see some of the cool new features in Autonomous Database, you’ll find something useful here. Many thanks to Oracle Text development for the review, valuable comments and additions to this blog.

The following topics will be covered in this tutorial:

What is Oracle Text – a short review 

Oracle Text is included in all database editions and is available directly without any further installation. You can start immediately in a “normal” database schema. Oracle Text uses SQL to index, search, and analyze text and documents – stored inside and outside the database.  Queries against text indexes allow for inexact searches, with advanced, customizable relevance ranking algorithms to return the most appropriate results first. The simplest use case is to  perform a full-text search within a single column. But even more complex cases are supported: Search within multiple columns in the same data source (MULTI_COLUMN_DATASTORE). Plain text is typically stored in VARCHAR2 and CLOB columns, but you can also index binary documents: Oracle Text can handle over 150 different binary document formats, include Office files (Word, Powerpoint, Excell and so on), PDF files, and even archives such as ZIP or TAR files. Such files will be filtered, to use the Oracle Text terminology, and the text will be extracted.
Oracle Text supports multiple languages and character sets, including all space separated languages and pictogram languages such as Chinese, Japanese or Korean.

Set up the database table NEWS_DOC and the application user

Let’s setup the search scenario and use documents (around 200,000 short news texts) – stored in a database table called NEWS_DOC. If you already have your own data set in a database table, feel free to skip the setup steps and continue with the next section.

For this tutorial, I’ll use a file called dataset_200K.txt that I’ve uploaded to Oracle Object Store in a bucket “bucketus”. To get some sample data, you can use the “dataset_220K.txt” file available here; this file was put together by my colleague Stephane for another blog post. I’ll access the file using the external table feature. To read data from an object store with external tables, you’ll need to set up database credentials. If you haven’t already done so, you can create those credentials using a command similar to the example below.
 
However first, let’s connect as user ADMIN and create a database user named TEXTUSER with the following privileges.
Note: In 23ai, you can use the role DB_DEVELOPER_ROLE.

create user textuser identified by <password>;
grant db_developer_role to textuser;
alter user textuser quota unlimited on users;

-- and for cloud resource usage
grant all on dbms_cloud to TEXTUSER;
grant all on directory DATA_PUMP_DIR to TEXTUSER;

In any case, make sure that the user has granted the CTXAPP role and quota privileges on the tablespace users.

Now connect as user TEXTUSER and create the credentials if you haven’t already done.

-- Drop and create database credentials
set define off; 
BEGIN
  DBMS_CLOUD.DROP_CREDENTIAL(credential_name => 'CREDUS');
END;
/
BEGIN
    DBMS_CLOUD.CREATE_CREDENTIAL(
    credential_name => 'CREDUS',
    username        => '...IdentityCloudService/ulrike.schwinn@oracle.com',
    password        => 'password');
END;
/

After that, you can create the external table using the procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE as shown below. Be sure to check the documentation for more details on syntax if you need it.

-- drop table news_tmp_ext purge;

BEGIN
    DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
    table_name      => 'NEWS_TMP_EXT',
    credential_name => 'CREDUS',
    file_uri_list   => '...../bucketus/o/dataset_200K.txt',
    column_list     => 'NEWS_TEXT varchar2(4000)',
    format          => json_object('ignoremissingcolumns' value 'true','recorddelimiter' value '0x''0A''', 
                                   'rejectlimit' value 'unlimited'));
END;
/

Let’s review the content; 199964 rows can be accessed. 

select count(*) from news_tmp_ext;
COUNT(*) 
-------- 
  199964 

Now we create the table NEWS_DOC in the schema TEXTUSER, with the following columns:

-- drop table news_doc;

create table news_doc(
  id   NUMBER generated by default as identity primary key,
  news VARCHAR2(4000));

Let’s insert the data from table NEWS_TMP_EXT and commit.

insert into textuser.news_doc (news) select * from news_tmp_ext;
199,964 rows inserted.

commit;

In the next step, we will create the domain index and illustrate some query examples.

Working with Oracle Text 

Connect to application user TEXTUSER. In the next step, we create an Oracle Text index with default preferences and show some simple examples how we can query the data.

In 23ai, the following simplified syntax for index creation can be used. With CREATE SEARCH INDEX syntax you can create a search index for indexing and querying structured, unstructured, or semi-structured data, such as textual, JSON, and XML documents.

In our case we create an index on a textual column with defaults setting. 

-- drop index idx_news force
create search index idx_news on news_doc(news);

Of course, you can use an updated enhanced CREATE statement to provide parallelism or use partitions, but also optionally specify indexing parameters for memory, maintenance operations or for index preferences.

create search index idx_news on news_doc(news) PARALLEL; 

Please consult the syntax in the documentation e.g. here for CREATE SEARCH index. Please note: Handling “parallel degree” DOP on Autonomous Database depends on the predefined database services you are connected with. Please check out the chapter “Oracle Text and  Autonomous Database” below.

In 19c, you may use the following syntax. Here is an example also with the default settings.

create index idx_news on news_doc(news)
indextype is ctxsys.context;

Let’s check the index.

select index_type, table_name, status, domidx_status, domidx_opstatus 
from all_indexes where index_name ='IDX_NEWS';

INDEX_TYPE                  TABLE_NAME           STATUS   DOMIDX_STATU DOMIDX
--------------------------- -------------------- -------- ------------ ------
DOMAIN                      NEWS_DOC             VALID    VALID        VALID

Use CTX_USER_INDEXES to show details about all Oracle Text indexes for the current user. It can be queried by all users.
As we can see, all documents (199964 docids) of the table are indexed. Automatic maintenance (see value AUTO) is the default mode in release 23 for synchronizing newly created Oracle Text indexes. Indexes with automatic maintenance are synchronized in the background without any manual user intervention.

col idx_name format a20
col idx_table_owner format a20
col idx_table format a20
col IDX_KEY_NAME format a10
col IDX_TEXT_NAME format a15

select idx_name, idx_table_owner, idx_table, idx_key_name, idx_text_name, idx_docid_count, idx_status, 
idx_type, idx_sub_type, idx_maintenance_type
from ctx_user_indexes;

IDX_NAME             IDX_TABLE_OWNER      IDX_TABLE            IDX_KEY_NA IDX_TEXT_NAME   IDX_DOCID_COUNT IDX_STATUS   IDX_TYPE      IDX_ IDX_MA
-------------------- -------------------- -------------------- ---------- --------------- --------------- ------------ ------------- ---- ------
IDX_NEWS             TEXTUSER             NEWS_DOC             ID         NEWS                     199964 INDEXED      SEARCH        TEXT AUTO

And last but not least check on errors. In my case  no errors were found, so I can continue.

select * from ctx_user_index_errors;
no rows selected.

Now we can start to execute Oracle Text queries. You must use the CONTAINS operator in the WHERE clause to enter your query. The basic Oracle Text query takes a query expression, usually a word with or without operators, as input. Oracle Text returns all documents (previously indexed) that satisfy the expression along with a relevance score for each document. You can use the scores to order the documents in the result set.

Let’s search for documents containing the word “Munich” ordered by the Oracle Text score. In our example the highest score is 25.

Note: Oracle Text uses an inverse frequency algorithm based on Salton’s formula. Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole. More details about Oracle Text score can be found here

select score(1) score, news from news_doc where contains(news, 'Munich', 1)>0 
order by score desc fetch first 2 rows only; 

SCORE
---------
NEWS                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
       25 
Hong Kong, Mandarin Oriental has appointed Mr Michael Ziemer as general manager of the luxury hotel group - s renowned Munich property, Mandarin Oriental, Munich.                                                      
       25 
MUNICH, Oct 26 (SW) - Bayern Munich have brought the gap with league leaders VfL Wolfsburg back to one point while Borussia Dortmund finally managed to reach a win against Berlin based Hertha BSC.

Oracle provides a long list of CONTAINS query operators that allows users to formulate powerful queries. The next example is a simple boolean search (AND operator) requiring both the word “Munich” and “Olympic” for a match.

select news from news_doc 
where contains(news, 'Munich and Olympic')>0 fetch first 4 rows only;

NEWS                                                                                                                               
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ATHENS  Michael Phelps certainly will not beat Mark Spitz's record seven Olympic swimming gold medals in Munich in 1972, but the American youngster is still racking up the gold in Athens.

Israel's first Olympic gold medallist  dedicated his victory on Wednesday to his 11 countrymen killed  in the Munich massacre of 1972, vowing he would visit their  memorial to show them his medal.

ATHENS  Michael Phelps certainly will not beat Mark Spitz's record seven Olympic swimming gold medals in Munich in 1972, but the American youngster is still racking up the gold in Athens.

BERLIN : Bayern Munich tasted their first home defeat of the season with Schalke 04 grabbing a late 1-0 victory at the Olympic Stadium.

Let’s check the execution plan. It uses the domain index.

SQL> select * from dbms_xplan.display();

Plan hash value: 1684221444 
----------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |   100 | 17800 |    27   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| NEWS_DOC |   100 | 17800 |    27   (0)| 00:00:01 |
|*  2 |   DOMAIN INDEX              | IDX_NEWS |       |       |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("CTXSYS"."CONTAINS"("NEWS",' Olympic and Munich')>0)
Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1 because of no expensive parallel operation

A simple misspelling of “Munich” such as “Muench” would yield no results in our example. Therefore, operators like FUZZY allow returning results to match words that have a similar spelling as the specified term. In our example we will receive documents with including the term Munich but also containing the word Munch.  

select news from news_doc
where contains(news, 'fuzzy(Muench)', 1)>0 order by score(1) desc fetch first 4 rows only;

NEWS                                                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hong Kong, Mandarin Oriental has appointed Mr Michael Ziemer as general manager of the luxury hotel group - s renowned Munich property, Mandarin Oriental, Munich.

MUNICH, Oct 26 (SW) - Bayern Munich have brought the gap with league leaders VfL Wolfsburg back to one point while Borussia Dortmund finally managed to reach a win against Berlin based Hertha BSC.

Bayern Munich's Sebastian Deisler was readmitted to the Max-Planck-Institute for Psychiatry in Munich on Tuesday after a minor relapse of depression.

Oslo's Munch Museum could remain closed until June 2005 while officials upgrade its security system following a brazen daylight robbery in August of two Edvard Munch masterpieces.

Some news are long, so that it would be useful to display only relevant fragments for a document that contains the query term. Let’s illustrate this in the next example.

select news, length(news) len from news_doc
where contains(news,'Olympic')>0 and length(news)>1000;
                                                                                                       
NEWS
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
       LEN
----------
(InfoWorld) - The Big Wait continues: Xudong Wang, Minister of China's Ministry of Information Industry (MII), spoke about the growth of the country's communications sector, about the rollout of telecommunication infrastructure to rural areas and about the need to fight cybercrime. But the high-ranking official, in his keynote speech Monday at the opening of the Telecom World event in Hong Kong, avoided discussing an issue of huge interest to many attendees: when and how the government plans to issue 3G (third-generation) mobile telecommunications licenses. Manufacturers, operators, and millions of consumers in China eagerly await the arrival of high-speed mobile phone service in the world's largest wireless market. But the government, despite growing pressure to unveil its 3G plans ahead of the Olympic Games in 2008, remains tight-lipped. At present, WCDMA (Wideband Code Division Multiple Access), which dominates Europe and huge chunks of Asia, the U.S.-supported CDMA 2000, and China's self-developed TD-SCDMA (Time Division Synchronous Code Division Multiple Access) are the three potential 3G standards for China. TD-SCDMA has secured extended financial and policy support from the Chinese government. Insiders expect the government to begin issuing 3G licenses in early 2007 after the TD-SCDMA standard is ready for commercial deployment. While Wang preferred to sidestep 3G in his keynote, Viviane Reding, Commissioner for Inf
ormation Society and Media at the European Commission, chose to address the issue directly in her speech. \                                                       
      1575

Use CTX_DOC.SNIPPET as follows to return one or more most relevant fragments for a document that contains the query term “Olympic”. Because CTX_DOC.SNIPPET returns surrounding text, you can immediately evaluate how useful the returned term is. More details can be found in the documentation here.  

select ctx_doc.snippet('IDX_NEWS', id, 'Olympic') snippet
from news_doc
where contains(news,'Olympic')>0 and length(news)>1000;

SNIPPET           
----------------------------------------------------------------------------
unveil its 3G plans ahead of the <b>Olympic</b> Games in 2008, remains tight

Oracle Text and Autonomous Database

Oracle Autonomous Database Serverless also offers some great new features. We just talked about how you can quickly search through text stored inside the database itself. But what if your data actually lives in files on Object Storage instead of the database? No problem – that’s possible too.

You can index objects stored in the OCI Object Store using a variety of approaches. With the Autonomous Database, the DBMS_CLOUD.CREATE_EXTERNAL_TEXT_INDEX procedure makes this process straightforward—simply provide the URI of your Object Storage bucket, and Oracle will automatically index the content of each file within that bucket. This method is very convenient, though it offers less customization compared to creating an index using the traditional CREATE INDEX statement. Alternatively, in ADB-S 23ai, you can use the DBMS_SEARCH API to index content from an external table, or create a custom user datastore that connects to OCI Object Store (also possible in 19c).

Let’s see how this works with the DBMS_CLOUD approach. For this demonstration, I’ll use a bucket named “bucketdocs” where I’ve uploaded several files. Let’s start by examining the contents of that bucket using DBMS_CLOUD.LIST_OBJECTS.

select object_name, bytes 
from DBMS_CLOUD.LIST_OBJECTS(
        credential_name => 'CREDUS',
        location_uri    => 'https://.../b/bucketdocs/o/');

OBJECT_NAME                                                  BYTES
------------------------------------------------------- ----------
Oracle Text Overview May 2022.pptx                         3255497
Registrations_with.csv                                       35608
ai-vector-search-users-guide.pdf                           5507528
cv_us_de_draft.docx                                          27311
database-development-guide.pdf                             5123544
oracle-database-23ai-new-features-guide (1).pdf            1010850
text-application-developers-guide.pdf                      1928110

Now let’s create an Oracle text index.

-- execute DBMS_CLOUD.DROP_EXTERNAL_TEXT_INDEX('IDX_DOCS');
begin
    DBMS_CLOUD.CREATE_EXTERNAL_TEXT_INDEX (
        credential_name => 'CREDUS',
        location_uri    => 'https://.../b/bucketdocs/o/',
        index_name      => 'IDX_DOCS',
        format          => JSON_OBJECT('binary_files' value true)
    );
end;

The LOCATION_URI argument is where you specify the URI of your Object Storage bucket—so in this case, it contains “bucketdocs.” You have to provide this parameter, it’s required. There’s also a FORMAT argument, which is a JSON string where you can set extra options like how often the index should refresh, whether to include binary files, or if you want to use a stop word list.

In our example, we’re using the default refresh rate of 5 minutes, and we’re not using a stop word list. For file types, we’ve allowed binary files (like PDFs, Word docs, Excel files, zip files, etc.) to be included in the index. If you leave the default (which is FALSE), only plain text, HTML, CSV, JSON, and XML files will be indexed—any binary files will get skipped.

Now, let’s see what our index looks like using CTX_USER_INDEXES.

col idx_name format a20
col idx_table_owner format a20
col idx_table format a20
col IDX_KEY_NAME format a10
col IDX_TEXT_NAME format a15

select idx_name, idx_table_owner, idx_table, idx_key_name, idx_text_name, idx_docid_count, idx_status, 
idx_type, idx_sub_type, idx_maintenance_type
from ctx_user_indexes;

IDX_NAME             IDX_TABLE_OWNER      IDX_TABLE            IDX_KEY_NA IDX_TEXT_NAME   IDX_DOCID_COUNT IDX_STATUS   IDX_TYPE      IDX_ IDX_MA
-------------------- -------------------- -------------------- ---------- --------------- --------------- ------------ ------------- ---- ------
IDX_DOCS$IDX         TEXTUSER             IDX_DOCS$TXTIDX                 OBJECT_NAME                   7 INDEXED      CONTEXT       TEXT AUTO  
IDX_NEWS             TEXTUSER             NEWS_DOC             ID         NEWS                     199964 INDEXED      SEARCH        TEXT AUTO

A new table named IDX_DOCS$TXTIDX is created and it has created a text index on the contents of each file.

desc idx_docs$txtidx

Name        Null? Type                        
----------- ----- --------------------------- 
MTIME             TIMESTAMP(6) WITH TIME ZONE 
OBJECT_NAME       VARCHAR2(4000)              
BYTES             NUMBER                      
OBJECT_PATH       VARCHAR2(4000)

The text index is on the OBJECT_NAME column, so that’s the column we’ll need to query. We can do this using a standard CONTAINS query. Let’s ‘s look for the term ‘Ulrike’. Et voila, the result shows the Excel with the registration list and a draft cv in docx format.

select object_name from idx_docs$txtidx 
where contains(object_name, 'Ulrike')>0;

OBJECT_NAME                             
----------------------------------------
Registrations_with.csv
cv_us_de_draft.docx

A simple misspelling of “unrestricted” such as “unnrestrictid” with the FUZZY operator results in a document.

select object_name from idx_docs$txtidx 
where contains(object_name, 'fuzzy(unnrestrictid)')>0;

OBJECT_NAME                                                 
------------------------------------------------------------
oracle-database-23ai-new-features-guide (1).pdf

New files will be indexed automatically. You can see whether the files have been detected by querying the table, e.g. IDX_DOCS$TXTIDX, but note that there may be a short delay between files appearing in that table and the index being updated.

You want to try it out, but you do not have a bucket available yet? Try a public bucket e.g. the MOVIE_STREAM public bucket – also mentioned in Roger Ford’s posting. In that case, we don’t need the credential_name, and can leave it out or set it to NULL. We can even leave it out the format argument (but will need to be aware it could be five minutes before our files are available).  

begin
  DBMS_CLOUD.CREATE_EXTERNAL_TEXT_INDEX (
    location_uri => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/dwcsprod/b/MovieStream/o/',
    index_name   => 'movies'
  );
end;

Check out the files in the bucket.

select object_name from movies$txtidx;

OBJECT_NAME                                                 
----------------------
Countries.csv
Days.csv
Devices.csv
Months.csv
Movie_Sales_2020.csv

You can look inside the files or simply try something out.

select * from movies$txtidx where contains(object_name, 'PC')>0;

OBJECT_NAME                                                 
------------------------------------------------------------
Devices.csv
Movie_Sales_2020.csv

For more information check out the autonomous database documenation “Use Full-Text Search on Files in Object Storage” and the posting Indexing Text in Object Storage by former PM Roger Ford who wrote the first posting about it.

As we’ve seen so far, there aren’t any restrictions when it comes to using Oracle Text. However, to maintain both security and performance in Autonomous Database, certain Oracle Text features are limited. The main limitation involves logging: all logging functions and related APIs, like ctx_report.query_log_summary, aren’t supported.

Additionally, for older database versions, if you want to use the OPTIMIZE Index or CREATE INDEX with the BIG_IO option, you’ll need to grant the CREATE TRIGGER privilege (see the documentation on OPTIMIZE for further details).

Keep in mind there’s also a naming update for some datastore types starting with Oracle Database 19c: FILE_DATASTORE is now deprecated and replaced by DIRECTORY_DATASTORE, and URL_DATASTORE has been replaced by NETWORK_DATASTORE. For more information, please refer to the relevant documentation link.

When using Autonomous Database, keep in mind that there are several predefined database services, each designed with different performance and concurrency profiles. You’ll need to pick the right service for each application or user connecting to your database.
If you select the LOW service, parallelism isn’t available—any specified degree of parallelism (DOP) will be ignored. The HIGH and MEDIUM services, on the other hand, have parallel queries, DDL, and DML enabled by default. These services can even scale beyond the DOP you specify, meaning they may use more parallel resources than you set. 
For the MEDIUM service, you can use the UI console to control the level of concurrency so that the DOP for Text DDL will actually be followed. If you want to make sure the DOP is always honored, users should use the TPURGENT service (if it’s available; note that TPURGENT isn’t an option for Data Warehouse workloads). To learn more about Autonomous Database Services please consult the documentation Database Service Names for Autonomous Database.