Using JSON documents and don’t know what you’re looking for? 23c Search Indexes to the rescue

May 23, 2023 | 10 minute read
Roger Ford
Principal Product Manager
Text Size 100%:

Introduction

Oracle has powerful capabilities for handling JSON. It also has flexible capabilities for full-text searching, like keyword search, phrase search, or proximity search. We're going to see how these capabilities meet in the JSON search index to provide the powerful functionality of full text search in an optimized manner for all your JSON documents.

What are text indexes?

In its basic form, a text index allows you to create a word-based index on a textual field in the database. It is then possible to search the table for fields containing particular words or phrases.

Let's say we have a table emp which contains some employee details:

create table emp(name varchar2(40), salary number, qualifications varchar2(200));
insert into emp values ('John', 1500, 'PhD in physics, Msc Math');
commit;

In 23c I would create a text index on that table using:

create search index emp_qual on emp(qualifications);

In earlier versions I would do:

create index emp_qual on emp(qualifications) indextype is ctxsys.context;

After having created my text index, I can search it using a CONTAINS query, such as:

select * from emp where contains(qualifications, 'physics') > 0;

That’s the simplest example of how to use a text index. That search is looking for the word “physics” somewhere in the qualifications field. The query could be much more complex – we’ll see a few more advanced examples when we look at JSON_TEXTCONTAINS in a moment.

What’s special about a Search Index?

On the face of it, we might think that we could just do a substring search on the qualifications field to look for the word ‘physics’. But substring searches are quite limited

  • They can’t use an index
  • They are case-sensitive
  • Punctuation and spacing will affect the searches
  • We can’t do  oolean searches such as AND, OR or NOT within the substring search itself, we’d have to do multiple searches which would get very inefficient

Having a search index means that each word in our text has its own index entry, so we are able to rapidly find references to individual words or phrases, or boolean combinations of words, without having to scan the original text.

What are JSON indexes?

Now let's say our employee data was stored as JSON, and we need to search using salary ranges. For small numbers of rows, a full scan of the JSON will be plenty fast enough, especially if we're using the binary JSON datatype in 21c/23c.

create table empj(empdata json);

insert into empj values ('{ "name":"john", "salary":1500, "qualifications": "PhD in physics, Msc Math"}');

insert into empj values ('{ "name":"bobby", "salary":900,
"qualifications": "Msc Math", "hobbies": "physics" }');

commit;

select * from empj e where e.empdata.salary.number() > 1000;

But what if we have a very large number of rows in our JSON table? In that case, we'll want to create an index on the salary element of the JSON:

create index emp_salary on empj e(e.empdata.salary.number());

But … what if we don't know what fields we need to search? Or we do know, but we want to do word-based searches on those fields? Ultimately, JSON data is flexible and does not have a static, fits-all schema definition like our relational table before.

Both of these problems are solved with a JSON Search Index. A JSON search index indexes all the data in a JSON object, without having to pre-declare any data types or even to know the attributes in your documents. It’s all JSON. Not only does it index your JSON documents in the most efficient manner, it provides full text search capabilities with this index.

OK ... I need a JSON search index. How do I create one?

Creating a JSON search index is as easy as this:

create search index emp_search on empj(empdata) for json;

With this JSON search index you killed two birds with one stone: First, you have an index for a ‘normal’ JSON search like our salary example before, as seen in the explain plan:

explain plan for select * from empj e where e.empdata.salary.number() > 1000;

select * from table(dbms_xplan.display);

------------------------------------------------------------------------------------------
| Id  | Operation                   | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |            |     1 |  4114 |     4   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID| EMPJ       |     1 |  4114 |     4   (0)| 00:00:01 |
|*  2 |   DOMAIN INDEX              | EMP_SEARCH |       |       |     4   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

The line with Id=2 shows that we are using a 'domain index' for the search - that's our JSON search index being used to speed up access to the2 salary field - even though we didn't specify that we wanted to specifically index that field.

Second, your JSON search index also allows for full-text search on all string fields within the JSON. We would do that using the JSON_TEXTCONTAINS operator, which takes the column name, a JSON path for where to search, and a full text search expression as arguments. For example:

select e.empdata.name from empj e
where json_textcontains(empdata, '$', 'physics');

NAME
--------------------------------------------------------------------------------
"john"
"bobby"

The '$' in there represents the root, or base of the JSON document (it's a "JSON path expression" if you want to look up more detail) and means we should search for the word 'physics' anywhere in the JSON document. If we wanted to search a particular part of the document, say "qualifications", we could express that as a path instead:

select e.empdata.name from empj e
where json_textcontains(empdata, '$.qualifications', 'physics');

NAME
--------------------------------------------------------------------------------
"john"

That is the power of JSON Search Indexes. Depending on your ‘search space’ – the whole document or a specific attribute – the JSON search index is speeding up your request.

We could have done that with a simple JSON equality operator and some wildcards, but it would not have been fast on a large collection. And we certainly couldn't have done more complex searches like:

select e.empdata.name, e.empdata.qualifications from empj e
where json_textcontains(empdata, '$.qualifications', 'physics AND msc math');

NAME      QUALIFICATIONS
_________ _____________________________
"john"    "PhD in physics, Msc Math"

Meaning "the qualifications field contains the single word "physics" and the contiguous phrase "msc math". Nor could we do a 'fuzzy' search such as:

select e.empdata.name, e.empdata.qualifications from empj e
where json_textcontains(empdata, '$.qualifications', 'fuzzy(phisiks'));

NAME      QUALIFICATIONS
_________ _____________________________
"john"    "PhD in physics, Msc Math"

Very useful if you're unsure of your spelling, or that of whoever created the JSON in the first place. The JSON search index comes to the rescue and you’ll find what you’re looking for!

We can even do relevance ranking in 23c with JSON_TEXTCONTAINS. Let's add a couple more JSON documents to our table:

insert into empj values ('{ "name":"bill", "salary":1000, "qualifications": "Math professor"}');

insert into empj values ('{ "name":"mike", "salary":2000, "qualifications": "Physics student"}');

commit;

We'll need to issue a COMMIT and wait two or three seconds for our index to get updated, then we can run a query with a SCORE. Note the extra final argument to JSON_TEXTCONTAINS - that's a number which associates the SCORE() function with this particular JSON_TEXTCONTAINS.

select score(1), e.empdata.name, e.empdata.qualifications from empj e
where json_textcontains(empdata, '$.qualifications', 'math ACCUM physics', 1)
order by score(1) desc;

The ACCUM operator guarantees that if both terms are found, the record will score higher than if only one term is found. There can be more than two terms, and the higher number of terms found will always score higher. So the query above gives us:

SCORE(1)    NAME       QUALIFICATIONS
___________ __________ _____________________________
         52 "john"     "PhD in physics, Msc Math"
          2 "bobby"    "Msc Math"
          2 "bill"     "Math professor"
          2 "mike"     "Physics student"

The absolute value of the score is not that important - it's not something we'd generally show to the user. Instead it is used, as here, to order the results according to their relevance. You find what you are looking for easily, with the help of JSON Search Indexes.

And that’s just the beginning. There is so much more to discover with Oracle JSON search indexes (and Oracle Text indexes in general). The search query syntax used in JSON_TEXTCONTAINS is the same as for the Oracle Text CONTAINS operator, so check out the many more details on what we've discussed here, and for many more text query operators, see the Oracle Text Reference manual

 

Roger Ford

Principal Product Manager

Roger Ford has been at Oracle since 1987. He started writing code for Oracle 6, and has worked in Development, Support, Technical Marketing and Product Management.


Previous Post

Lock-free reservation in 23c: how to start with

Ulrike Schwinn | 11 min read

Next Post


Develop MongoDB Applications with Oracle Autonomous Database on Dedicated Exadata Infrastructure