A couple of time ago, a colleague asked me for code examples demonstrating how to implement hybrid vector search in Oracle Database. He has read Hybrid Vector Index – a combination of AI Vector Search with Text Search and was interested in seeing more examples. I took this opportunity to create a follow-up post on Hybrid Vector Search, including additional code snippets and explanations to illustrate various use cases of hybrid search.
To create a comprehensive tutorial or an end-to-end example, follow the sequence of posts below:
- Getting started with vectors in 23ai: Demonstrates how to load an ONNX file into the database, create CCNEWS with 200000 rows, and perform some similarity search queries.
- Hybrid Vector Index – a combination of AI Vector Search with Text Search (part 1): Introduces the concepts and explanations of hybrid vector indexes, shows how to create a hybrid vector index on CCNEWS, and presents an initial query using INTERSECT.
- More examples on hybrid vector search: Provides more examples related to hybrid vector search (the current post).
For detailed syntax and additional information, please refer to the official documentation, e.g.
- Oracle AI Vector Search User's Guide: Manage Hybrid Vector Indexes
- Oracle AI Vector Search User's Guide: Perform Hybrid Search
- Oracle AI Vector Search User's Guide: DBMS_SEARCH
To start with additional examples, let's first review our setup. We demonstrated the simplest case – with very short text snippets that do not need to be chunked etc. or even indexed. Please note, depending on the size of your input data, which can range from small strings to very large documents, the data passes through a pipeline of optional transformation stages from plain Text to Chunks to Tokens to Vectors, with Vector Index as the endpoint.
0. The setup
For this demonstration, we are working with table CCNEWS with 200000 rows.
SQL> desc ccnews Name Null? Type ----------------------------------------- -------- ---------------------------- ID NOT NULL NUMBER(10) INFO VARCHAR2(4000) VEC VECTOR(*, *, DENSE) SQL> select id, info from ccnews where rownum<03; ID ---------- INFO -------------------------------------------------------------------------------- 1 BOGOTA, Colombia - A U.S.-made helicopter on an anti-drugs mission crashed in the Colombian jungle on Thursday, killing all 20 Colombian soldiers aboard, the army said. 2 UNIONTOWN, Pa. - A police officer used a Taser to subdue a python that had wrapp ed itself around a man's arm and would not let go.
And here is the information about the hybrid vector index we created in Hybrid Vector Index – a combination of AI Vector Search with Text Search (part 1).
SQL> select idx_name, idx_table, idx_status, idx_docid_count, idx_maintenance_type, idx_model_name, idx_vector_type from ctx_user_indexes; IDX_NAME IDX_TABLE IDX_STATUS IDX_DOCID_COUNT -------------------- --------------- ---------- --------------- IDX_MAINTENANCE_TY IDX_MODEL_NAME IDX_VECTOR_TYPE ------------------ -------------------- --------------- MY_HYBRID_IDX CCNEWS INDEXED 200000 AUTO DOC_MODEL IVF
1. Hybrid Vector Search
Let's explore different approaches to hybrid vector search. As you're aware, hybrid search combines both vector-based and keyword-based search components to produce a set of document IDs. Using DBMS_HYBRID_VECTOR.SEARCH you can perform textual queries, vector similarity queries, or hybrid queries against hybrid vector indexes. The results of keyword and semantic search are finally fused or merged together. You have control over how these results are combined by adjusting the "search_fusion" parameter. This parameter accepts several valid values, including: "INTERSECT", "UNION", "TEXT_ONLY", "VECTOR_ONLY", "MINUS_TEXT", and "MINUS_VECTOR".
For example, if your application requires retrieving all unique rows from both text search results and vector search results with scores greater than 0, you can set "search_fusion" to "UNION". Or in another example if you want only distinct rows that satisfy both searches, you would choose "INTERSECT". An example implementation for the "INTERSECT" fusion method is available here.
a. The usage of UNION
The following statement illustrates an example using "UNION". Per definition, "UNION" combines all distinct rows from both text search results and vector search results where text score and vector score is greater than 0.
set long 1000000
set longc 1000000
set pagesize 1000
select json_serialize(
dbms_hybrid_vector.search(
json(
'{ "hybrid_index_name" : "my_hybrid_idx",
"vector":
{
"search_text" : "newspaper",
"score_weight" : 1
},
"text":
{
"contains" : "British newspaper or Times or Sun",
"score_weight" : 1,
"snippet" :250
},
"search_fusion" :"UNION",
"return":
{
"values" : [ "score", "text_score", "rowid", "chunk_text", "vector_score"],
"topN" : 10
}
}' )) returning clob pretty) output;
Here's how it works:
- The keyword search retrieves a list of doc IDs satisfying the "contains" text query string (here "British newspaper or Times or Sun").
- The vector search retrieves a list of doc IDs satisfying the "search_text" similarity query string (here "newspaper").
- Then the results are fused using the "search_fusion" operator (here "UNION"). It returns only the rows that are common to both – text search results and vector results.
- The final scoring uses the specified "score_weight" (here 1 in both) for vector or text score.
- Finally, the defined "TopN" results with values for "score", "text_score", "rowid", "chunk_text", "vector_score" are returned.
Note:
We used "snippet" to enable text snippet search results of maximum length 250. A snippet is a text fragment which returns one or more most relevant fragments for a document that contains the query term. The resulting snippet is returned in "chunk_text".
The result looks like …
OUTPUT
--------------------------------------------------------------------------------
[
{
"score" : 50,
"text_score" : 100,
"rowid" : "AAARvGAAAAAAFH4AAB",
"chunk_text" : "InfoWorld) - Sun Microsystems\\u00c2\\u00a0on has been mor
e than two years since Sun \\u00c2\\u00a0released OpenSolaris in November of la
st year. Sun hopes opening up its software will enable High Availability Clust
er, Sun will release its Solaris",
"vector_score" : 0
},
{
"score" : 50,
"text_score" : 100,
"rowid" : "AAARvGAAAAAAPEzAAA",
"chunk_text" : "of the first customers for Sun Microsystems' Proj
ect Blackbox, which in a few weeks, according to Sun .SLAC's Blackbox was d
elivered It is one of four customers that Sun identified Tuesday to illustrate
the Sun Modular Datacenter, or Sun MD.",
"vector_score" : 0
},
...
},
{
"score" : 44,
"text_score" : 19,
"rowid" : "AAARvGAAAAAAN83AAB",
"chunk_text" : "The New York Times debuts the Times Reader, an application t
hat attempts to deliver the experience of a real newspaper to the computer scree
n. Is this the end of ink-stained fingers? By Jeff Koyen.",
"vector_score" : 68.99
},
{
"score" : 43.61,
"text_score" : 19,
"rowid" : "AAARvGAAAAAAEzPAAA",
"chunk_text" : "had 264 newspapers in its consortium, including the San Fran
cisco Chronicle, the Miami Herald, the Atlanta Journal-Constitution, and the Hou
ston Chronicle. On a conference call reporting quarterly financial results Tuesd
ay, Yahoo president Susan Decker said the newspaper consortium is a core area of
focus for the company as it tries to grow revenue after\\u00c2\\u00a0top execut
ive changes, poor financial results, and criticism from investors.",
"vector_score" : 68.22
},
...
},
{
"score" : 37.13,
"text_score" : 0,
"rowid" : "AAARvGAAAAAASrtAAD",
"chunk_text" : "in the press: report",
"vector_score" : 74.26
}
]
b. The usage of TEXT_ONLY
The following statement illustrates an example using "TEXT_ONLY". The fused results contain the text search results that appear in text search, including those that appear in both. The score text condition must be greater than 0.
select json_serialize(
dbms_hybrid_vector.search(
json(
'{ "hybrid_index_name" : "my_hybrid_idx",
"vector":
{
"search_text" : "newspaper",
"score_weight" : 1
},
"text":
{
"contains" : "British newspaper or Times or Sun",
"score_weight" : 1,
"snippet" : 250
},
"search_fusion" :"TEXT_ONLY",
"return":
{
"values" : [ "score", "text_score", "rowid", "chunk_text", "vector_score"],
"topN" : 10
}
}' )) returning clob pretty) output;
The output looks like …
OUTPUT
--------------------------------------------------------------------------------
[
{
"score" : 50,
"text_score" : 100,
"rowid" : "AAARvGAAAAAAFH4AAB",
"chunk_text" : "InfoWorld) - Sun Microsystems\\u00c2\\u00a0on has
been more than two years since Sun \\u00c2\\u00a0re
leased OpenSolaris in November of last year. Sun h
opes opening up its software will enable High Availa
bility Cluster, Sun will release its Solaris",
"vector_score" : 0
},
{
"score" : 50,
"text_score" : 100,
"rowid" : "AAARvGAAAAAAPEzAAA",
"chunk_text" : "of the first customers for Sun Microsystems'
Project Blackbox, which in a few weeks, according t
o Sun .SLAC's Blackbox was delivered It is one
of four customers that Sun identified Tuesday to i
llustrate the Sun Modular Datacenter, or Sun MD.
",
"vector_score" : 0
},
...
{
"score" : 46.5,
"text_score" : 93,
"rowid" : "AAARvGAAAAAATkrAAA",
"chunk_text" : "InfoWorld) - Sun Microsystems\\u00c2\\u00
a0is going to announce enhancements to its Sun Jav
a System Content Delivery new that they'll pay extra
for. Sun is one of many vendors promising up and
attracting users. Sun 's CDS provides a common",
"vector_score" : 0
},
{
"score" : 45.54,
"text_score" : 10,
"rowid" : "AAARvGAAAAAAQa4AAB",
"chunk_text" : ": British newspaper",
"vector_score" : 81.07
},
...
{
"score" : 42,
"text_score" : 19,
"rowid" : "AAARvGAAAAAATCGAAD",
"chunk_text" : "After more than two centuries as a b
roadsheet newspaper, The Times of London has gone st
rictly tabloid. On Monday, The Times moved to a tota
lly compact format after almost a year of dual publi
cation.",
"vector_score" : 65
}
]
c. The usage of VECTOR_ONLY
The following statement illustrates an example using "VECTOR_ONLY". The fused results contain the vector search results that appear in vector search, including those that appear in both. Vector score condition must be greater than 0.
select json_serialize(
dbms_hybrid_vector.search(
json(
'{ "hybrid_index_name" : "my_hybrid_idx",
"vector":
{
"search_text" : "newspaper",
"score_weight" : 1
},
"text":
{
"contains" : "British newspaper or Times or Sun",
"score_weight" : 1,
"snippet" : 250
},
"search_fusion" :"VECTOR_ONLY",
"return":
{
"values" : [ "score", "text_score", "rowid", "chunk_text", "vector_score"],
"topN" : 10
}
}' )) returning clob pretty) output;
The output looks like …
OUTPUT
--------------------------------------------------------------------------------
[
{
"score" : 44,
"text_score" : 19,
"rowid" : "AAARvGAAAAAAN83AAB",
"chunk_text" : "The New York Times debuts the Times Reader, an application t
hat attempts to deliver the experience of a real newspaper to the computer scree
n. Is this the end of ink-stained fingers? By Jeff Koyen.",
"vector_score" : 68.99
},
{
"score" : 43.61,
"text_score" : 19,
"rowid" : "AAARvGAAAAAAEzPAAA",
"chunk_text" : "had 264 newspapers in its consortium, including the San Fran
cisco Chronicle, the Miami Herald, the Atlanta Journal-Constitution, and the Hou
ston Chronicle. On a conference call reporting quarterly financial results Tuesd
ay, Yahoo president Susan Decker said the newspaper consortium is a core area of
focus for the company as it tries to grow revenue after\\u00c2\\u00a0top execut
ive changes, poor financial results, and criticism from investors.",
"vector_score" : 68.22
},
{
"score" : 40.54,
"text_score" : 0,
"rowid" : "AAARvGAAAAAAQa4AAB",
"chunk_text" : ": British newspaper",
"vector_score" : 81.07
},
...
{
"score" : 35.83,
"text_score" : 0,
"rowid" : "AAARvGAAAAAAKIvAAA",
"chunk_text" : "massive numbers of users of written news. Now that computer
monitors are cheap, and we have little computers that can get us news that fits
in our pocket, we can try out lots of ways of arranging it, and maybe we'll even
discover something new. == me, as a software designer, it's no surprise that th
ere are lots of ways to view news. That there used to be one main way to do it i
s also not a surprise, there were technical limits, that aren't there anymore. T
he skill of laying out a paper presentation of the day's news on a big sheet of
paper is now an obsolete craft. The only reason we needed people to do that in t
he past was that was the only way to get written news to",
"vector_score" : 71.66
},
{
"score" : 35.57,
"text_score" : 0,
"rowid" : "AAARvGAAAAAAF4pAAD",
"chunk_text" : "working: paper",
"vector_score" : 71.13
}
]
For detailed syntax and additional information, please refer to the official documentation:
