Imagine a insurance specialist investigates water damage complaints. A semantic search for “water damage caused by heavy rain” should also find documents describing flooding, storms, leaking roofs, or weather-related property damage such as basement flooded etc., even if those exact words are not used. However, the specialist may also need documents that explicitly mention a specific policy clause, product code, or exclusion such as “flood exclusion”.
Pure semantic search return related claims but miss the exact policy. Traditional keyword search might find the clause but miss relevant claims that use different terminology. This is where Hybrid Vector Search comes into play. Hybrid search can address both components of such a query by running traditional full-text (lexical) search and vector search on the same data and then combining the two search results into a single result set. This leads to more relevant results and fewer false positives than either approach alone. If you’d like to learn more about the technology, we’ve included links to additional blog posts and documentation at the end of this article.
If you’ve already played with Similarity Search and are curious about what comes next, this tutorial is for you. We’ll focus on how Hybrid Vector Search combines semantic search with traditional keyword search to deliver more precise and relevant results. We recently used this demo during a customer workshop and thought it would be useful to share it with a wider audience.
What is the use case in this tutorial?
Consider a travel website where users search for “places for activities such as fishing, swimming, etc”. Semantic search can identify parks offering similar outdoor activities, even if the exact words are not present in the description. However, users may also want results that explicitly mention picnic areas or exclude locations where swimming is prohibited. This is a perfect example of where Hybrid Vector Search can improve the quality and precision of search results.
Tutorial Overview
- Setup
- Similarity Search
- Creating a Hybrid Vector Index and Running Queries
- Using Oracle Text Features such as Fuzzy Search
- Further Reading
Setup
To make it simple for you to try it out yourself, we use parts of the National Parks sample data set from the very well received Oracle LiveLab Getting Started with AI Vector Search built by the product management team AI for Data.
You can either run the exercises directly in the LiveLab environment inside the provided sandbox or in your own Oracle Database 26ai setup. The example uses the table PARKS and the embedding model MINILM_L12_V2 for the creation of the Hybrid Vector Index.
The table PARKS should look like this:
desc parks
Name Null? Type
--------------- -------- -------------------
PARK_ID NOT NULL VARCHAR2(36)
PARK_CODE VARCHAR2(4)
URL VARCHAR2(34)
NAME NOT NULL VARCHAR2(51)
DESCRIPTION VARCHAR2(414)
LATITUDE NUMBER(11,8)
LONGITUDE NUMBER(11,8)
LOCATION PUBLIC.SDO_GEOMETRY
DIRECTIONS_INFO VARCHAR2(524)
DIRECTIONS_URL VARCHAR2(162)
CONTACT_PHONE VARCHAR2(13)
EMAIL_ADDRESS VARCHAR2(44)
STREET VARCHAR2(95)
CITY VARCHAR2(35)
STATES VARCHAR2(2)
POSTAL_CODE VARCHAR2(10)
COUNTRY_ID NOT NULL VARCHAR2(3)
DESC_VECTOR VECTOR
SELECT count(*) FROM parks;
COUNT(*)
----------
472
The sample table PARK dmp file can also be downloaded here.
If not already done you can import the model with the procedure DBMS_VECTOR.LOAD_ONNX_MODEL (see also documentation). The embedding model can be accessed here.
BEGIN
DBMS_VECTOR.LOAD_ONNX_MODEL(
directory => 'DATA_PUMP_DIR',
file_name => 'all_MiniLM_L12_v2.onnx',
model_name => 'minilm_l12_v2',
metadata => JSON('{"function" : "embedding", "embeddingOutput" : "embedding", "input": {"input": ["DATA"]}}'));
END;
/
You can verify the available models in the database with the following query:
SELECT model_name, mining_function,algorithm, algorithm_type, model_size
FROM user_mining_models;
MODEL_NAME MINING_FUN ALGORITHM ALGORITHM_ MODEL_SIZE
------------- ---------- ---------- ---------- ----------
MINILM_L12_V2 EMBEDDING ONNX NATIVE 133322334
Similarity Search
Let’s start with a simple similarity search. Semantic search is great at finding information based on meaning. Search for “activities such as fishing and swimming” and you’ll find relevant parks even if those exact words never appear in the description. But what if the result must also contain the word “picnic”?
The following query generates an embedding for the search phrase “activities such as fishing, swimming, etc” and compares it with the embeddings stored in the DESC_VECTOR column. The parks name and description with the smallest vector distance are returned first.
SELECT name, description
FROM parks
ORDER BY VECTOR_DISTANCE(desc_vector,
VECTOR_EMBEDDING(minilm_l12_v2 using 'activities such as fishing, swimming, etc' AS data), COSINE)
FETCH EXACT FIRST 10 ROWS ONLY;
NAME DESCRIPTION
---------------------- ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Chickasaw Springs, streams, lakes - whatever its form, water is the attraction at Chickasaw National Recreation Area.
Boston Harbor Islands . . . where you can walk a Civil War-era fort, view historic lighthouses, explore tide pools, hike lush trails, camp under the stars, or relax while fishing, picnicking, or swimming—all within reach of downtown Boston. Youth programs, visitor services, research, wildlife management, and more are coordinated on the park's 34 islands and peninsulas by the Boston Harbor Islands Partnership.
Lake Mead Experience outdoor thrills amongst striking landscapes and brilliant blue waters. Hike, swim, boat, cycle, camp, fish, volunteer, and learn! This year-round national park spreads across 1.5 million acres of mountains, canyons, valleys and two vast lakes. Just a stone's throw from the dazzling lights of Las Vegas, this extraordinary playground beckons with a symphony of possibilities.
Delaware Water Gap Situated within the most densely populated region of the United States, Delaware Water Gap National Recreation Area provides a unique opportunity to experience tranquil landscapes, rich human history, and striking scenery along 40 miles of the longest free-flowing river east of the Mississippi. The park offers year-round recreation including hiking, paddling, fishing, and hunting.
...
10 rows selected.
The results include parks such as Chickasaw, Boston Harbor Islands, Lake Mead, and Delaware Water Gap—all locations which are related to our search intent.
To get a better overview, we can display only the park names:
SELECT name FROM parks ORDER BY VECTOR_DISTANCE(desc_vector, VECTOR_EMBEDDING(minilm_l12_v2 using 'activities such as fishing, swimming, etc' AS data), COSINE) FETCH EXACT FIRST 10 ROWS ONLY;
NAME
---------------------------------------------------
Chickasaw
Boston Harbor Islands
Lake Mead
Delaware Water Gap
Cape Lookout
Pictured Rocks
Niobrara
Saint Croix
Curecanti
Cape Cod
10 rows selected.
Now let’s add keyword search capabilities by creating a Hybrid Vector Index.
Creating a Hybrid Vector Index and Running Queries
The Hybrid Vector Index combines Oracle Text and AI Vector Search into a single search structure. This allows us to run pure keyword searches, pure vector searches, or combine both approaches in a single query.
CREATE HYBRID VECTOR INDEX parks_hybrid_idx ON parks(description)
PARAMETERS('MODEL MINILM_L12_V2');
Our example shows the most simplest form of an hybrid vector index creation statement, we only specify the model. However, you can also specify vectorizer preferences e.g. for externally hosted vector embedding models or Text Search Preferences such as filter, lexer, storage etc. More details on the syntax can be found in the documentation.
If you’re curious, you can inspect the index metadata using data dictionary views such as USER_IND_COLUMNS, V$VECTOR_INDEX, and CTX_USER_INDEXES. These views provide information about the vector and text components that make up the hybrid index.
Now let’s execute some hybrid vector queries. The DBMS_HYBRID_VECTOR.SEARCH function provides a JSON-based interface for querying Hybrid Vector Indexes (see documentation).
The query can contain:
- a vector clause for semantic search
- a text clause for keyword search,
- and a fusion strategy that defines how the results are combined.
In the following example, we search for parks that are semantically related to “activities such as fishing, swimming, etc.” while also containing the keyword “Picnic” in the park description.
The fusion strategy is INTERSECT, because it should only return results that satisfy both conditions. In other words, a park must be relevant to the semantic search and contain the specified keyword to be included in the final result set. The result is returned as JSON and includes the combined relevance score, the rowid of the matching row, and a text snippet (chunk_text) showing the matching content.
Note: You don’t need to refer to the table. All what you need is the index name (parks_hybrid_idx).
set long 10000
SELECT DBMS_HYBRID_VECTOR.SEARCH(JSON(
'{"hybrid_index_name": "parks_hybrid_idx",
"vector": {"search_text": "activities such as fishing, swimming, etc"},
"text": {"contains": "Picnic"},
"search_fusion": "INTERSECT",
"return": {"values": ["score", "rowid", "chunk_text"],"topN": "10"}}')) result;
RESULT
--------------------------------------------------------------------------------
[{"score":57.2,"rowid":"AAAmtuAAAAABeb/AAG","chunk_text":"Today the river valle
y attracts us for so many reasons. Take a solitary walk to enjoy nature's displa
y, raft leisurely through the rocky shoals with friends, fish the misty waters a
s the sun comes up, or have a picnic on a Sunday afternoon. Get Outdoors and exp
erience your Chattahoochee River National Recreation Area as you have never done
before."},{"score":56.02,"rowid":"AAAmtuAAAAAB0OvAAF","chunk_text":"Experience
a park so rich it supports 19 distinct ecosystems with over 2, 000 plant and ani
mal species. Go for a hike, enjoy a vista, have a picnic or learn about the cent
uries of overlapping history from California's indigenous cultures, Spanish colo
nialism, the Mexican Republic, US military expansion and the growth of San Franc
isco. All of this and more awaits you, so get out and find your park."}]
To identify the parks behind the returned rowids, we can query the PARKS table:
SELECT name FROM parks
WHERE rowid IN ('AAAmtuAAAAABeb/AAG','AAAmtuAAAAAB0OvAAF' );
NAME
-------------------------
Chattahoochee River
Golden Gate
You usually want to work with relational data instead of JSON. Use the standard table function DBMS_HYBRID_VECTOR.SEARCHPIPELINE to return a pipeline of row records. The results is of type RECORD which contains fields such as doc_rowid, score, vector_score, text_score etc.
You can see an example here. We use DBMS_HYBRID_VECTOR.SEARCHPIPELINE and join the returned rowids back to the PARKS table to get in addition the park names.
SELECT
p.name
h.score,
h.vector_score,
h.text_score,
h.vector_rank,
h.text_rank,
h.chunk_text
FROM parks p,
DBMS_HYBRID_VECTOR.SEARCHPIPELINE(JSON(
'{"hybrid_index_name": "parks_hybrid_idx",
"vector": {"search_text": "activities such as fishing, swimming, etc"},
"text": {"contains": "Picnic"},
"search_fusion": "INTERSECT"
}')) h
WHERE doc_rowid=p.rowid;
NAME SCORE VECTOR_SCORE TEXT_SCORE VECTOR_RANK TEXT_RANK
--------------------------------------------------- ---------- ------------ ---------- ----------- ----------
CHUNK_TEXT
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Chattahoochee River 57.2 61.92 10 6 1
Today the river valley attracts us for so many reasons. Take a solitary walk to enjoy nature's display, raft leisurely through the rocky shoals with friends, fish the misty waters as the sun comes up, or have a picnic on a Sunday afternoon. Get Outdoors and experience your Chattahoochee River National Recreation Area as you have never done before.
Golden Gate 56.02 60.62 10 12 1
Experience a park so rich it supports 19 distinct ecosystems with over 2, 000 plant and animal species. Go for a hike, enjoy a vista, have a picnic or learn about the centuries of overlapping history from California's indigenous cultures, Spanish colonialism, the Mexican Republic, US military expansion and the growth of San Francisco. All of this and more awaits you, so get out and find your park.
If you do not want to use the table function DBMS_HYBRID_VECTOR.SEARCHPIPELINE(), the original SEARCH API can be wrapped in a JSON_TABLE specification. Using JSON_TABLE, we can project the JSON response into rows and join the returned rowids back to the PARKS table. Here is an example:
SELECT p.name, h.score, h.text_score, h.vector_score
FROM parks p,
JSON_TABLE (
(SELECT DBMS_HYBRID_VECTOR.SEARCH(JSON(
'{"hybrid_index_name": "parks_hybrid_idx",
"vector": {"search_text": "activities such as fishing, swimming, etc"},
"text": {"contains": "Picnic"},
"search_fusion": "INTERSECT",
"return":{"values": ["score","text_score", "vector_score", "rowid"],"topN": "10"}
}'))), '$[*]'
COLUMNS (
score number(5,2) PATH '$."score"',
text_score number(5,2) PATH '$."text_score"',
vector_score number(5,2) PATH '$."vector_score"',
row_id varchar2(18) PATH '$."rowid"') ) h
WHERE p.rowid=h.row_id;
NAME SCORE TEXT_SCORE VECTOR_SCORE
---------------------- ---------- ---------- ------------
Chattahoochee River 57.2 10 61.92
Golden Gate 56.02 10 60.62
You can experiment with other fusion strategies such as UNION, VECTOR_ONLY etc.
Using Oracle Text Features such as Fuzzy Search
One of the advantages of Hybrid Vector Search is that it gives you access to the powerful linguistic capabilities of Oracle Text. For example, users often misspell search terms. Let’s intentionally misspell Picnic and search for Piccnics using Oracle Text’s fuzzy matching:
"text": {
"contains": "fuzzy(Piccnics)"
}
Even though the keyword is misspelled, Oracle Text still finds relevant matches. In our example, an additional park, Fort Dupont Park, is returned because its description contains the term Picnics.
SELECT p.name, h.score, h.text_score, h.vector_score
FROM parks p,
JSON_TABLE (
(SELECT DBMS_HYBRID_VECTOR.SEARCH(JSON(
'{"hybrid_index_name": "parks_hybrid_idx",
"vector": {"search_text": "activities such as fishing, swimming, etc"},
"text": {"contains": "fuzzy(Piccnics)"},
"search_fusion": "INTERSECT",
"return":{"values": ["score","text_score", "vector_score", "rowid"],"topN":
"10"}}'))), '$[*]'
COLUMNS (
score number(5,2) PATH '$."score"',
text_score number(5,2) PATH '$."text_score"',
vector_score number(5,2) PATH '$."vector_score"',
row_id varchar2(18) PATH '$."rowid"') ) h
WHERE p.rowid=h.row_id;
NAME SCORE TEXT_SCORE VECTOR_SCORE
--------------------------------------------- ---------- ---------- ------------
Chattahoochee River 57.2 10 61.92
Golden Gate 56.02 10 60.62
Fort Dupont Park 55.5 10 60.05
This illustrates the strength of Hybrid Vector Index Search: semantic similarity helps us find conceptually related content, while Oracle Text contributes linguistic intelligence such as fuzzy matching, stemming, and other advanced text-search capabilities. Try a few searches of your own and experiment with different fusion strategies such as UNION, INTERSECT, TEXT_ONLY, and VECTOR_ONLY to see how the ranking changes and other Oracle Text terms.
Further Readings
- Oracle AI Vector Search (oracle.com)
- Oracle AI Vector Search User’s Guide
- DBMS_HYBRID_VECTOR Package
- LiveLab Getting Started with AI Vector Search
