In my previous post, Getting Started with Oracle Database 23ai AI Vector Search, we explored how AI Vector Search allows you to find similar content based on the semantic representation or meaning rather than the raw values (words or pixels). In the examples we performed exact similarity searches, comparing every item in the dataset to find the closest matches. While this approach helps guarantee accuracy, it doesn’t scale well – imagine searching through millions or even billions of vectors one by one!
This is where vector indexes come in. Instead of checking every possible match, an approximate similarity search uses a class of algorithms referred to as Approximate Nearest Neighbor (ANN). Using vector indexes reduces the number of distance calculations, making searches faster and more efficient with only a slight penalty in accuracy.
There are currently three types of vector indexes available in AI Vector Search:
- Hierarchical Navigable Small World (HNSW) indexes
- Inverted File Flat (IVF) indexes
- Hybrid Vector Indexes
In this post we will focus on HNSW indexes and leave IVF and Hybrid Vector Indexes for future posts.
The examples I used in the Getting Started with Oracle Database 23ai AI Vector Search post performed exact similarity searches. An exact search to find the closest matches for a given vector is accurate but can be slow, since costly vector distance computations are needed for all vectors in a column. The following is the SQL from the first example that I ran in the Getting Started blog post to perform an exact similarity search:
select description, TO_NUMBER(vector_distance) as vector_distance
from (
select id, description,
vector_distance(vector_desc, VECTOR_EMBEDDING(minilm_l12_v2 USING 'ROBBERY' as data),
COSINE) as vector_distance
from search_data
order by 3
fetch exact first 50000 rows only )
group by description, vector_distance
order by 2
fetch first 10 rows only ;
DESCRIPTION VECTOR_DISTANCE
---------------------------------------- ---------------
RETAIL THEFT .299974263
ATTEMPT THEFT .336217165
KIDNAPPING .449222624
PREDATORY .450500667
OTHER CRIME INVOLVING PROPERTY .462044001
AGGRAVATED FINANCIAL IDENTITY THEFT .462046862
EXTORTION .466129601
FORGERY .481875181
ATTEMPT FINANCIAL IDENTITY THEFT .485770345
THEFT/RECOVERY: AUTOMOBILE .494581938
And the execution plan is the following:
----------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | Cost (%CPU)| A-Rows | A-Time | Buffers | Reads | Writes |
----------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 702K(100)| 10 |00:00:01.50 | 117K| 117K| 377 |
| 1 | COUNT STOPKEY | | 1 | | 10 |00:00:01.50 | 117K| 117K| 377 |
| 2 | VIEW | | 1 | 702K (2)| 10 |00:00:01.50 | 117K| 117K| 377 |
| 3 | SORT ORDER BY STOPKEY | | 1 | 702K (2)| 10 |00:00:01.50 | 117K| 117K| 377 |
| 4 | HASH GROUP BY | | 1 | 702K (2)| 59 |00:00:01.50 | 117K| 117K| 377 |
| 5 | VIEW | | 1 | 702K (2)| 50000 |00:00:01.50 | 117K| 117K| 377 |
| 6 | COUNT STOPKEY | | 1 | | 50000 |00:00:01.49 | 117K| 117K| 377 |
| 7 | VIEW | | 1 | 702K (2)| 50000 |00:00:01.49 | 117K| 117K| 377 |
| 8 | SORT ORDER BY STOPKEY| | 1 | 702K (2)| 50000 |00:00:01.48 | 117K| 117K| 377 |
| 9 | TABLE ACCESS FULL | SEARCH_DATA | 1 | 32758 (3)| 469K|00:00:00.72 | 117K| 117K| 0 |
----------------------------------------------------------------------------------------------------------------------------
Since we have no indexes defined, we see a full table scan of the SEARCH_DATA table (Id 9, TABLE ACCESS FULL). I have included the vector distance calculation in the result since this will help when we compare this exact search result to the approximate search results later in this post.
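For reference, the plans in this post are shown with runtime statistics (A-Rows, A-Time, Buffers). A minimal sketch of how such a plan can be captured, assuming the gather_plan_statistics hint and DBMS_XPLAN.DISPLAY_CURSOR (the exact format options used for the plans shown here may differ):

-- Re-run the query of interest with the gather_plan_statistics hint, for example:
--   select /*+ gather_plan_statistics */ description, ... from ( ... );
-- Then display the plan of the last statement executed in this session:
select * from table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));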
To use HNSW indexes, you must enable a new memory area in the database called the Vector Pool. The Vector Pool is memory allocated from the System Global Area (SGA) to store HNSW vector indexes and their associated metadata. The Vector Pool can also be used for IVF index creation and for DML operations, but we will not elaborate on that in this post.
There is a new database initialization parameter called VECTOR_MEMORY_SIZE that specifies the size of the Vector Pool and enables HNSW vector index creation.
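A minimal sketch of how the parameter might be set (the SCOPE and the need for a restart are assumptions; they depend on whether you size the pool at the CDB or the PDB level):

-- Size the Vector Pool; at the CDB level this typically requires a restart to take effect
alter system set vector_memory_size = 5000M scope=spfile;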
Details of the pool can be displayed by querying the view V$VECTOR_MEMORY_POOL:
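The query behind the listing below is not shown in the original session, but a minimal version selecting the columns that appear in the output would be:

select pool, alloc_bytes, used_bytes, populate_status, con_id
from v$vector_memory_pool;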
POOL ALLOC_BYTES USED_BYTES POPULATE_STATUS CON_ID
------------------ ----------- ---------- ---------------- ----------
1MB POOL 3758096384 0 DONE 3
64KB POOL 1593835520 0 DONE 3
IM POOL METADATA 16777216 16777216 DONE 3
How do you know how big to make the Vector Pool?
The DBMS_VECTOR package has an INDEX_VECTOR_MEMORY_ADVISOR procedure that can help evaluate the number of indexes that can fit for a simulated vector memory size. Both HNSW and IVF indexes are supported. An actual column can be supplied, along with the table owner and name, or you can just supply the dimension count and dimension type for the type of index that you would like to size. Here is an example estimate for our SEARCH_DATA table and the VECTOR_DESC column:
set serveroutput on
variable rjson clob;
begin
DBMS_VECTOR.INDEX_VECTOR_MEMORY_ADVISOR(
TABLE_OWNER=>'VECTOR',
TABLE_NAME=>'SEARCH_DATA',
COLUMN_NAME=>'VECTOR_DESC',
INDEX_TYPE=>'HNSW',
RESPONSE_JSON=> :rjson);
end;
/
Using default accuracy: 90%
Suggested vector memory pool size: 1348169142 Bytes
PL/SQL procedure successfully completed.
This result suggests that we will need approximately 1286 MB of memory for this index in our Vector Pool. For this environment I set the VECTOR_MEMORY_SIZE parameter to 5000M, so we should have plenty of memory.
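As a side note, the advisor does not require an existing table: as mentioned earlier, you can also supply just the number of vectors, the dimension count, and the dimension type. Here is a sketch of that overload, where the vector count matches the NUM_VECTORS value we will see in V$VECTOR_INDEX below, and the 384-dimension FLOAT32 format is an assumption based on the minilm_l12_v2 model:

set serveroutput on
variable rjson clob;
begin
  DBMS_VECTOR.INDEX_VECTOR_MEMORY_ADVISOR(
    INDEX_TYPE    => 'HNSW',
    NUM_VECTORS   => 469380,    -- number of vectors in SEARCH_DATA
    DIM_COUNT     => 384,       -- assumed dimension count for minilm_l12_v2 embeddings
    DIM_TYPE      => 'FLOAT32', -- assumed storage format
    RESPONSE_JSON => :rjson);
end;
/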
To create an HNSW vector index on the vector column VECTOR_DESC in the SEARCH_DATA table the following statement can be used:
CREATE VECTOR INDEX search_data_hnsw_idx ON search_data (vector_desc)
ORGANIZATION INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 90;
This statement creates an in-memory HNSW index with a TARGET ACCURACY of 90 percent.
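If you want finer control over how the graph is built, the same statement can also specify the HNSW construction parameters explicitly. A sketch of that alternative form (the neighbors and efconstruction values are illustrative, not tuned for this data set); for the rest of this post I use the simpler form shown above:

-- Alternative form: drop the existing index first, or use a different index name
CREATE VECTOR INDEX search_data_hnsw_idx ON search_data (vector_desc)
ORGANIZATION INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 90
PARAMETERS (type HNSW, neighbors 40, efconstruction 500);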
You can verify the memory usage of a vector index with the view V$VECTOR_INDEX. For example, if we query the information about the index we just created we see the following:
select owner, index_name, index_organization, allocated_bytes, used_bytes, num_vectors
from v$vector_index where index_organization = 'INMEMORY NEIGHBOR GRAPH';
ALLOCATED
OWNER INDEX_NAME INDEX_ORGANIZATION BYTES USED_BYTES NUM_VECTORS
---------- ------------------------------ ------------------------- ---------- ---------- -----------
VECTOR SEARCH_DATA_HNSW_IDX INMEMORY NEIGHBOR GRAPH 865796096 861246736 469380
Let's query V$VECTOR_MEMORY_POOL again and see how much memory is now being used in the Vector Pool:
POOL ALLOC_BYTES USED_BYTES POPULATE_STATUS CON_ID
-------------------------- ----------- ---------- -------------------------- ----------
1MB POOL 3758096384 857735168 DONE 3
64KB POOL 1593835520 8388608 DONE 3
IM POOL METADATA 16777216 16777216 DONE 3
Note that the actual used space is a bit off from the estimate we got from the INDEX_VECTOR_MEMORY_ADVISOR procedure in the previous section (roughly 826 MB used versus the 1286 MB estimated).
Since HNSW indexes are created in-memory, when a database instance is restarted the index is no longer in memory and must be rebuilt. By default, a reload mechanism is triggered at instance restart. This is controlled by the initialization parameter VECTOR_INDEX_NEIGHBOR_GRAPH_RELOAD, which defaults to RESTART. To facilitate faster reloads after an instance restart, a full checkpoint on-disk structure is also maintained.
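As a sketch, the automatic reload could be turned off like this, which is what I do for the demonstration below (OFF and RESTART are the documented values; the appropriate scope for your environment is an assumption to verify):

alter system set vector_index_neighbor_graph_reload = OFF;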
As we looked at previously, you can verify the memory usage of a vector index with the view V$VECTOR_INDEX. For example, at instance startup and before an index rebuild to reload the index into memory, we see the following index information:
select owner, index_name, index_organization, allocated_bytes, used_bytes, num_vectors
from v$vector_index where index_organization = 'INMEMORY NEIGHBOR GRAPH';
ALLOCATED
OWNER INDEX_NAME INDEX_ORGANIZATION BYTES USED_BYTES NUM_VECTORS
---------- ------------------------------ ------------------------- ---------- ---------- -----------
VECTOR SEARCH_DATA_HNSW_IDX INMEMORY NEIGHBOR GRAPH 0 0 0
Next we will explicitly rebuild the index and re-query the V$VECTOR_INDEX view:
exec dbms_vector.rebuild_index('SEARCH_DATA_HNSW_IDX');
ALLOCATED
OWNER INDEX_NAME INDEX_ORGANIZATION BYTES USED_BYTES NUM_VECTORS
---------- ------------------------------ ------------------------- ---------- ---------- -----------
VECTOR SEARCH_DATA_HNSW_IDX INMEMORY NEIGHBOR GRAPH 865796096 861246736 469380
Note that I disabled the automatic index reload just for this example. Normally you do not need to worry about explicitly rebuilding any HNSW indexes; this happens by default, but you should be aware that it does take some time to re-populate the index(es) after an instance restart.
Now we can run the same query that was used in the exact similarity search section previously, but with the APPROX keyword to enable an approximate similarity search and a target accuracy that matches the index we created:
select description, TO_NUMBER(vector_distance) as vector_distance
from (
select id, description,
vector_distance(vector_desc, VECTOR_EMBEDDING(minilm_l12_v2 USING 'ROBBERY' as data),
COSINE) as vector_distance
from search_data
order by 3
fetch approx first 50000 rows only with target accuracy 90 )
group by description, vector_distance
order by 2
fetch first 10 rows only ;
DESCRIPTION VECTOR_DISTANCE
---------------------------------------- ---------------
RETAIL THEFT .299974263
ATTEMPT THEFT .336217165
KIDNAPPING .449222624
PREDATORY .450500667
OTHER CRIME INVOLVING PROPERTY .462044001
AGGRAVATED FINANCIAL IDENTITY THEFT .462046862
EXTORTION .466129601
FORGERY .481875181
ATTEMPT FINANCIAL IDENTITY THEFT .485770345
THEFT/RECOVERY: AUTOMOBILE .494581938
Note that since we added the APPROX keyword, we can use the HNSW index we created previously. We can verify this by displaying the execution plan:
-----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | Cost (%CPU)| A-Rows | A-Time | Buffers | Reads |
-----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 59388 (100)| 10 |00:00:00.47 | 37785 | 37274 |
| 1 | COUNT STOPKEY | | 1 | | 10 |00:00:00.47 | 37785 | 37274 |
| 2 | VIEW | | 1 | 59388 (2)| 10 |00:00:00.47 | 37785 | 37274 |
| 3 | SORT ORDER BY STOPKEY | | 1 | 59388 (2)| 10 |00:00:00.47 | 37785 | 37274 |
| 4 | HASH GROUP BY | | 1 | 59388 (2)| 296 |00:00:00.47 | 37785 | 37274 |
| 5 | VIEW | | 1 | 59327 (2)| 37884 |00:00:00.46 | 37785 | 37274 |
| 6 | COUNT STOPKEY | | 1 | | 37884 |00:00:00.46 | 37785 | 37274 |
| 7 | VIEW | | 1 | 59327 (2)| 37884 |00:00:00.45 | 37785 | 37274 |
| 8 | SORT ORDER BY STOPKEY | | 1 | 59327 (2)| 37884 |00:00:00.45 | 37785 | 37274 |
| 9 | TABLE ACCESS BY INDEX ROWID| SEARCH_DATA | 1 | 1 (0)| 37884 |00:00:00.43 | 37785 | 37274 |
| 10 | VECTOR INDEX HNSW SCAN | SEARCH_DATA_HNSW_IDX | 1 | 1 (0)| 37884 |00:00:00.06 | 9 | 1 |
-----------------------------------------------------------------------------------------------------------------------------------
You can see the use of the vector index that was created in the query execution plan at Id 10. Note that Oracle AI Vector Search is fully integrated into Oracle Database 23ai, so SQL execution plans will show the use of vector indexes. Also note that the elapsed time and the number of buffers accessed are much lower than in the previous exact search execution (0.47 seconds versus 1.50 seconds, and roughly 38K buffers versus 117K).
It is important, however, to recognize that a search using a vector index is approximate as opposed to exact. This means that the results can differ from those of an exact search, and depending on the data they can differ quite a bit. In the queries above we are only fetching the first 10 rows out of a vector search of the first 50,000 rows. In this case our results are the same, but they were returned much faster and with less work.
However, in the case of our SEARCH_DATA table we have many incidents per description, which translates into many rows with the same vector distance. If we looked at all the descriptions returned from our vector search of the 50,000 rows, instead of just the first 10, we would have gotten 59 descriptions for the exact query and 291 for the HNSW index query. Why such a discrepancy? It turns out that HNSW indexes don't do well with duplicate data, and our data set contains many incidents for each description (which are really descriptions of crimes). The following query shows just how much "duplication" exists:
SELECT description, count(*)
FROM (
SELECT id, description
FROM search_data
ORDER BY vector_distance(VECTOR_DESC,
VECTOR_EMBEDDING(minilm_l12_v2 USING 'ROBBERY' as data), COSINE)
FETCH EXACT FIRST 50000 ROWS ONLY )
GROUP BY description;
DESCRIPTION COUNT(*)
-------------------------------------------------- --------------------
RETAIL THEFT 12142
ATTEMPT THEFT 760
KIDNAPPING 60
PREDATORY 160
OTHER CRIME INVOLVING PROPERTY 553
AGGRAVATED FINANCIAL IDENTITY THEFT 1
EXTORTION 16
FORGERY 1443
ATTEMPT FINANCIAL IDENTITY THEFT 206
THEFT/RECOVERY: AUTOMOBILE 2013
You can see that there are a lot of incidents for each of the crime descriptions listed above, and this is just the first 10 of the 59 returned for an exact similarity search. The approximate similarity search using the HNSW index returns 291 different crime descriptions, but the good news is that the top 10 are the same for both. This isn't a bad thing, but we do need to recognize that the data plays a role in which search method and/or indexing strategy we might ultimately decide to use.
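For completeness, a sketch of how the 59 versus 291 description counts quoted above could be reproduced (this is the approximate variant; change APPROX ... WITH TARGET ACCURACY 90 back to EXACT to get the exact-search count):

select count(distinct description) as num_descriptions
from (
  select description
  from search_data
  order by vector_distance(vector_desc,
           VECTOR_EMBEDDING(minilm_l12_v2 USING 'ROBBERY' as data), COSINE)
  fetch approx first 50000 rows only with target accuracy 90 );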
In our next blog post we will investigate Inverted File Flat (IVF) partition-based indexes and we'll see how an IVF index does with our data set.
Andy Rivenes is a Senior Principal Product Manager at Oracle Corporation and is a Product Manager for Database In-Memory. Andy has been working with Oracle products since 1992 and spent a large part of his career managing Oracle databases for both custom and E-Business Suite applications. Andy now spends his time helping customers implement Database In-Memory and providing information about Database In-Memory through blogs, conferences, and customer visits.