In this post, we will explore using hybrid vector indexes in AI Vector Search. A hybrid vector index combines the benefits of fast similarity search using AI Vector Search with fast keyword search using an Oracle Text domain index. You should consider using a hybrid vector index when you have similarity searches that can be enhanced with a keyword search.

When to use a hybrid vector index

The following are two scenarios where a hybrid search may help. The first is where you are trying to narrow the focus of a similarity search. In other words, the keyword query is different from the similarity search. For example, you might add keywords to eliminate results when the similarity search alone returns too broad a result set. In the examples below, one of the similarity searches I came up with is “What have scientists discovered?”, and I then added a keyword search of “elements” to narrow the focus to articles about scientific discoveries involving elements.

The second scenario is to reinforce the similarity search by essentially creating the same query for both the semantic search and the keyword search to help ensure that no results are missed. If the semantic search misses some results, the keyword search may capture them, and vice versa. For example, maybe you really want to know what elements have been discovered, so you run a semantic search with the question “What elements have scientists discovered?” and then add a keyword search with the word “element” to try to ensure that you don’t miss any results.

What is a hybrid vector index?

A hybrid vector index is made up of both an Oracle Text domain index and an AI Vector Search vector index. The vector index can be either an HNSW vector index or an IVF vector index. See the blog posts Using HNSW Vector Indexes and Using IVF Vector Indexes for an explanation of these two types of vector indexes. One of the benefits of creating a hybrid vector index is that it automatically performs the chunking and vector embedding as part of the index creation.

In case you are not familiar with chunking, it is the process of splitting large textual data into smaller “chunks”. This is done primarily for two reasons. The first is to enhance the relevance of the text being vectorized: large portions of text may cover many different concepts that are not necessarily related to each other, which diminishes the effectiveness of a similarity search. The other reason is that embedding models limit the size of the text they can work on. Unfortunately, there is no “one size fits all” chunk size, so finding the ideal size for your data may take some experimentation to see what works best.
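To make the idea concrete, here is a toy Python sketch of word-based chunking with an optional overlap. This is entirely my own illustration, not Oracle's chunking code; the database's VECTORIZER preference ("by": "words", "max", "overlap", "split": "recursively") is more sophisticated, for example splitting recursively on paragraph and sentence boundaries.

```python
def chunk_by_words(text, max_words=100, overlap=0):
    """Split text into chunks of at most max_words words, where each
    chunk repeats the last `overlap` words of the previous chunk."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), step)]

# A toy 250-word document: "word0 word1 ... word249"
doc = " ".join(f"word{i}" for i in range(250))
chunks = chunk_by_words(doc, max_words=100, overlap=10)
# Produces 3 chunks covering words 0-99, 90-189, and 180-249
```

Experimenting with `max_words` and `overlap` on a sketch like this is a cheap way to build intuition before re-creating an index with different preferences.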

How does a hybrid search work?

By default, a hybrid search accesses a hybrid vector index and performs both a keyword search and a similarity search, and then combines the results into a single result set. Each search produces its own score, a keyword score for the keyword search and a semantic score for the similarity search, and a scoring mechanism combines the two to rank the final result set. It is also possible to perform just a keyword search or just a similarity search on the hybrid vector index, and many options are available to adjust how the join and scoring are done.
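Conceptually, the default flow looks something like the following Python sketch. This is entirely my own illustration of the idea (the real interface is the DBMS_HYBRID_VECTOR.SEARCH API, which has many more options): each search returns scored rowids, the two result sets are joined, and a combined score ranks the top-N.

```python
def hybrid_search(keyword_hits, vector_hits, fusion="INTERSECT", top_n=5):
    """keyword_hits / vector_hits: dicts mapping rowid -> score (0-100).

    Joins the two result sets, averages the scores (equal weights),
    and returns the top_n (rowid, combined_score) pairs."""
    if fusion == "INTERSECT":
        rowids = keyword_hits.keys() & vector_hits.keys()
    else:  # "UNION": a rowid missing from one search scores 0 there
        rowids = keyword_hits.keys() | vector_hits.keys()
    combined = {
        r: (keyword_hits.get(r, 0) + vector_hits.get(r, 0)) / 2
        for r in rowids
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

The sketch uses equal weights; as we will see, the real API exposes a score_weight for each side so you can bias the ranking toward either search.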

For the following examples, we will continue using the pre-built all-MiniLM-L12-v2 model that we started with in our first blog, Getting Started with Oracle Database 23ai AI Vector Search. In that post, I detailed how to obtain and then load that model into an Oracle database. I decided to use a new data set for this post using a small set of Wikipedia articles. These articles have a relatively large amount of text, and I thought they would be better suited to showing the benefits of hybrid vector indexes than the Chicago crime incidents dataset that I used in the previous posts Getting Started with Oracle Database 23ai AI Vector Search, Using HNSW Vector Indexes, and Using IVF Vector Indexes.

The Wikipedia articles can be downloaded from this OpenAI hosted link, unzipped into a source CSV file, and loaded into an Oracle database. The CREATE TABLE statement I used was the following:

create table wiki_data
(
  id             number,
  url            varchar2(200),
  title          varchar2(100),
  text           clob,
  title_vector   clob,
  content_vector clob,
  vector_id      number
);

You can then use the following control file, which I called wiki.ctl, to load the data using SQL*Loader:

OPTIONS(skip=0)
load data
 infile 'wiki_articles.csv'
 into table wiki_data
 append
  fields csv with embedded terminated by ',' optionally enclosed by '"'
(id,
 url,
 title,
 text char(1000000000),
 title_vector char(1000000000),
 content_vector char(1000000000),
 vector_id)

The SQL*Loader command I used was:

sqlldr userid=<user>/<pwd>@<pdb name> control=wiki.ctl log=loader.log

Another, and perhaps simpler, approach was suggested by Jeff Smith (who is also an Oracle Product Manager) in his blog on using SQLcl. He showed how to load the Chicago Crimes dataset from the previous blog posts using SQLcl, and the same process would apply to the Wikipedia articles above.

The full Wikipedia dataset above contains 25,000 articles, and the vector embeddings included in it were created by OpenAI. If you load the whole dataset, it will take about 2 GB of space. To make the data easier to work with, I chose to use only 1000 articles and generate vector embeddings from the pre-built all-MiniLM-L12-v2 model I described above.

The final table that I used was called wiki_articles with the following attributes:

 Name                          Null?    Type
 ----------------------------- -------- --------------------
 ID                                     NUMBER
 URL                                    VARCHAR2(200)
 TITLE                                  VARCHAR2(100)
 TEXT                                   CLOB

I then created a hybrid vector index with the following syntax:

CREATE HYBRID VECTOR INDEX wiki_articles_hybrid_idx
ON wiki_articles(text)
PARAMETERS ('MODEL minilm_l12_v2');

It is also possible to change the parameters used to create the index by setting specific preferences. After some experimentation, I found that, in my examples, it worked better not to have any “overlap” of text between chunks. This is actually the default, but I was originally following an example in the Oracle AI Vector Search User’s Guide that used an overlap. So I dropped the previous index and re-created it using the following preferences:

begin
  DBMS_VECTOR_CHAIN.CREATE_PREFERENCE(
    'my_vectorizer_pref',
     dbms_vector_chain.vectorizer,
    json('{
            "vector_idxtype":  "ivf",
            "distance"      :  "cosine",
            "accuracy"      :  95,
            "model"         :  "minilm_l12_v2",
            "by"            :  "words",
            "max"           :  100,
            "overlap"       :  0,
            "split"         :  "recursively"          }'
        ));
end;
/

CREATE HYBRID VECTOR INDEX wiki_articles_hybrid_idx
ON wiki_articles(text)
PARAMETERS ('VECTORIZER my_vectorizer_pref');

Notice that I also created the vector index portion of the hybrid vector index as an IVF index, which again is the default. The AI Vector Search User’s Guide documents how to do this and what choices are available.

Once a hybrid vector index has been created, it is possible to display the characteristics of the index by accessing a view of the format <index_name>$VECTORS. The following shows the WIKI_ARTICLES_HYBRID_IDX$VECTORS view that was created as a result of creating the index above:

desc wiki_articles_hybrid_idx$vectors

Name                          Null?    Type
----------------------------- -------- --------------------
DOC_ROWID                     NOT NULL ROWID
DOC_CHUNK_ID                  NOT NULL NUMBER
DOC_CHUNK_COUNT               NOT NULL NUMBER
DOC_CHUNK_OFFSET              NOT NULL NUMBER
DOC_CHUNK_LENGTH              NOT NULL NUMBER
DOC_CHUNK_TEXT                         VARCHAR2(4000)
DOC_EMBEDDING                          VECTOR(*, *, DENSE)

This view can be handy if you want to look at the details of the index. Notice the DOC_CHUNK attributes and the DOC_EMBEDDING column: the CREATE HYBRID VECTOR INDEX statement took our articles, chunked them, and then generated vector embeddings for those chunks.

After looking at a few of the 1000 articles that I loaded into the WIKI_ARTICLES table, and as I mentioned at the beginning of this post, I decided to try a similarity search with the question “What have scientists discovered?”. Since I only had a limited dataset, I wanted to make sure that my question and keywords would result in at least a few matches. I added the keyword “elements” to see if it would narrow the results to scientific discoveries involving elements. I ran the following query using the new DBMS_HYBRID_VECTOR API to query hybrid vector indexes:

select json_Serialize(
  DBMS_HYBRID_VECTOR.SEARCH(
    json(
         '{
            "hybrid_index_name" : "wiki_articles_hybrid_idx",
            "search_scorer"     : "rsf",
            "search_fusion"     : "INTERSECT",
            "vector":
                      {
                        "search_text"   : "What have scientists discovered?",
                        "search_mode"   : "DOCUMENT",
                        "aggregator"    : "MAX",
                        "score_weight"  : 1
                      },
            "text":
                      {
                       "contains"       : "$elements",
                       "score_weight"   : 1
                      },
            "return":
                      {
                        "values"        : [ "rowid", "score", "vector_score", "text_score" ],
                        "topN"          : 5
                      }
          }'
    )
  ) RETURNING CLOB pretty);

My results were the following:

[
  {
    "rowid" : "AAAS0VAAMAAAGZ7AAD",
    "score" : 82.74,
    "vector_score" : 65.48,
    "text_score" : 100
  },
  {
    "rowid" : "AAAS0VAAMAAAFDdAAF",
    "score" : 82.2,
    "vector_score" : 64.4,
    "text_score" : 100
  },
  {
    "rowid" : "AAAS0VAAMAAE27rAAE",
    "score" : 48.14,
    "vector_score" : 59.27,
    "text_score" : 37
  },
  {
    "rowid" : "AAAS0VAAMAAAGZ7AAJ",
    "score" : 47.22,
    "vector_score" : 57.43,
    "text_score" : 37
  },
  {
    "rowid" : "AAAS0VAAMAAE2/aAAD",
    "score" : 37.06,
    "vector_score" : 59.12,
    "text_score" : 15
  }
]
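One quick observation about these numbers: with equal score weights, the combined score looks like the simple average of the vector and text scores. The check below is my own inference from the five results above, not a statement about how the rsf scorer is implemented internally:

```python
# (vector_score, text_score, combined score) triples from the results above
results = [
    (65.48, 100, 82.74),
    (64.40, 100, 82.20),
    (59.27, 37, 48.14),
    (57.43, 37, 47.22),
    (59.12, 15, 37.06),
]

def fuse(vector_score, text_score, w_vector=1, w_text=1):
    # Weighted average of the two component scores
    return (w_vector * vector_score + w_text * text_score) / (w_vector + w_text)

for v, t, combined in results:
    # Equal weights reproduce each reported score to rounding precision
    assert abs(fuse(v, t) - combined) < 0.01
```

This also suggests how the score_weight parameters would let you bias the final ranking toward either the semantic or the keyword side.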

I then took a look at the five articles. Parts of the first two articles are shown below:

SQL> select text from wiki_articles where rowid = '&1';
Enter value for 1: AAAS0VAAMAAAGZ7AAD
old   1: select text from wiki_articles where rowid = '&1'
new   1: select text from wiki_articles where rowid = 'AAAS0VAAMAAAGZ7AAD'

TEXT
--------------------------------------------------------------------------------
Chemistry is a branch of  science that deals with chemical elements and compounds,
and how they work together and change. In other words, chemistry is the branch
of science that study fundamental properties of matter and chemical reactions.

History
Before 1600, people studied substances to figure out how to do things such as turn
lead into gold, but no one managed to do that. This was called alchemy. After
1600, using the scientific method alchemists became chemists. Chemists separated
the air into many parts and isolated the noble gases from it.

<the rest of the article is not included>
SQL> select text from wiki_articles where rowid = '&1';
Enter value for 1: AAAS0VAAMAAAFDdAAF
old   1: select text from wiki_articles where rowid = '&1'
new   1: select text from wiki_articles where rowid = 'AAAS0VAAMAAAFDdAAF'

TEXT
--------------------------------------------------------------------------------
Atoms are very small pieces of matter. There are many different types of atoms,
each with its own name, mass and size.    These different types of atoms are called
chemical elements.  The chemical elements are organized on the periodic table.
Examples of elements are hydrogen, carbon, chlorine, and gold etc.

Atoms are very small, but their exact size depends on the element.  Atoms range
from 0.1 to 0.5 nanometers in width. One nanometer is about 100,000 times smaller
than the width of a human hair. This makes atoms impossible to see without special
tools. Scientists discover how they work and interact with other atoms through
experiments.

<the rest of the article is not included>

Note that these first two articles have text scores of 100, contain passages about science, and include the keyword “elements”. In fact, even though the text score decreases for articles three through five, they all contained the keyword and were about scientific topics. The last one was about John Dalton and his scientific theories about atoms and includes the word “elements”, even with a text score of only 15.

If you’re interested in the syntax and options that I used, you can look in the Understand Hybrid Search section of the Oracle AI Vector Search User’s Guide. The example for Keyword and Semantic in Document Mode search inspired the query I used above.

My next thought was how well I would have done with just a similarity search. In other words, how important was the combination of a similarity search and a text search together?

The following example represents the search results for just a similarity search:

select json_Serialize(
  DBMS_HYBRID_VECTOR.SEARCH(
    json(
         '{
            "hybrid_index_name" : "wiki_articles_hybrid_idx",
            "search_scorer"     : "rsf",
            "search_fusion"     : "INTERSECT",
            "vector":
                      {
                        "search_text"   : "What have scientists discovered?",
                        "search_mode"   : "DOCUMENT",
                        "aggregator"    : "MAX",
                        "score_weight"  : 1
                      },
            "return":
                      {
                        "values"        : [ "rowid", "score", "vector_score" ],
                        "topN"          : 5
                      }
          }'
    )
  ) RETURNING CLOB pretty);


The results were the following:

[
  {
    "rowid" : "AAAS0VAAMAAE27pAAF",
    "score" : 74.61,
    "vector_score" : 74.61
  },
  {
    "rowid" : "AAAS0VAAMAAAFDfAAB",
    "score" : 70.1,
    "vector_score" : 70.1
  },
  {
    "rowid" : "AAAS0VAAMAAAFDdAAG",
    "score" : 67.96,
    "vector_score" : 67.96
  },
  {
    "rowid" : "AAAS0VAAMAAAGZ7AAD",   <- This rowid is the first rowid in
    "score" : 65.48,                     the previous result above
    "vector_score" : 65.48
  },
  {
    "rowid" : "AAAS0VAAMAAE3R8AAC",
    "score" : 64.87,
    "vector_score" : 64.87
  }
]

Notice that the similarity search without the text search produced different results for all but one rowid, and that one rowid was first in the combined similarity and text search above but only fourth in the similarity search alone. I should also note that the first two articles were scientific in nature, about caecilians and Australian geography, but did not mention anything about elements. It is pretty clear to me that, in this example, adding a text search significantly narrows or reinforces the search results.

Now I realize that this is a pretty simple example on a very limited dataset, but I think it does illustrate that similarity search is a very powerful new tool, and that a hybrid vector index can help make similarity searches even more precise when combined with a text search. It is not hard to imagine that in a dynamic environment, adding a reinforcing text search to a similarity search could help produce the best-quality answers.