Oracle announced the availability of Oracle Database 23ai in May 2024. Oracle AI Database builds on the more than 300 new features introduced in Oracle Database 23ai, so there’s a lot to learn. Dominic Giles highlights several marquee features in his blog post, but one of the most exciting new features of Oracle AI Database is Oracle AI Vector Search.
An excellent introduction to AI Vector Search is available here. It covers in quite a bit of detail the additions to Oracle AI Database to support similarity search, the key benefits, and some of the use cases that AI Vector Search addresses.
There are also several blogs and labs that walk through the steps needed to create vectors using vector embedding models and run similarity searches. However, there are fewer articles that explain what all these things are and why you need to know about them. In this post, I will provide additional explanations and show examples of how AI Vector Search delivers these capabilities in Oracle AI Database.
Note: Since this post was originally written we have added an Oracle LiveLab where you can try out AI Vector Search and examples similar to what we cover in this post. More details are available in this post: New AI Vector Search LiveLabs Workshop Available or you can give it a try here.
At the heart of AI Vector Search is the ability to do a similarity search. A similarity search works with the semantic representation of your data rather than the values themselves (words or pixels) and finds similar objects quickly. For example, find other images or documents that look like this one. It uses vectors, or more precisely vector embeddings, to search for semantically similar objects based on their proximity to each other. In other words, vector embeddings are a way of representing almost any kind of data, like text, images, videos, and even music, as points in a multidimensional space where the locations of those points, and their proximity to other data, are semantically meaningful.
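To make “proximity” concrete, here is a toy example using hand-written three-dimensional vectors (real embeddings typically have hundreds of dimensions) and the VECTOR_DISTANCE function that we will meet again later in this post:

SQL> SELECT VECTOR_DISTANCE(VECTOR('[1, 0, 0]'),
            VECTOR('[0.9, 0.1, 0]'), COSINE) AS distance;

Because the two vectors point in almost the same direction, the cosine distance returned is close to zero. Comparing VECTOR('[1, 0, 0]') with VECTOR('[0, 1, 0]') instead would return a distance close to one, since the vectors are orthogonal.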
Traditionally, relational searches have been based on attribute values or keywords. With AI Vector Search, structured and, more importantly, unstructured data can be searched for similarities based on the semantic meaning of the data. This opens up a whole new world of possibilities, which is why there is so much excitement around this new feature. Use cases like chatbots, natural language processing, data analysis, image processing, and geographic analysis can leverage AI Vector Search in new and exciting ways.
Oracle AI Vector Search includes a collection of features to enable semantic or similarity search within the Oracle AI Database. These features include:
• a new VECTOR data type for storing vector embeddings
• importing pre-trained embedding models into Oracle AI Database
• creating embeddings in the database locally or loading externally generated embeddings
• new state-of-the-art vector indexes for fast similarity searches (a creation sketch follows this list)
• new SQL operators and syntax to easily combine relational search on business data with semantic search
• support for Retrieval Augmented Generation (RAG) to augment LLM responses with enterprise-specific content
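To give a quick taste of the index feature (we will not create or use a vector index in this post’s examples), a creation statement for an in-memory neighbor graph (HNSW) index might look like the following sketch, where the table and column names are just placeholders:

SQL> CREATE VECTOR INDEX my_vector_idx ON my_table (my_vector_col)
     ORGANIZATION INMEMORY NEIGHBOR GRAPH
     DISTANCE COSINE
     WITH TARGET ACCURACY 95;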
Rather than repeat a description of each of these features, the rest of this post and future posts will walk through examples of these features along with a description of how these features are used to enable and use similarity search.
Getting Started
Oracle AI Vector Search requires Oracle AI Database. Currently, it is available on Oracle Cloud (including Oracle Autonomous AI Database), Oracle Cloud@Customer, Oracle Exadata, Oracle Database Appliance, Oracle Database@Azure, Oracle Database@Google Cloud, Oracle Database@AWS and Oracle AI Database Free. For the latest information on platform availability, see the MOS Note Release Schedule of Current Database Releases (Doc ID 742060.1).
Sample Data Set
For the following examples in this post, we’re going to use part of the crime incidents dataset available on the Chicago Data Portal. This dataset contains category fields along with a text description, which I will use to highlight how to do similarity searches and how they can be combined with attribute filtering to mix relational search with similarity search. The data that I downloaded contains roughly 6.8 million rows spanning multiple years.
We will be using a subset of the dataset with the following columns in a new table called SEARCH_DATA:
SQL> desc search_data
Name Null? Type
----------------------------- -------- --------------------
ID NUMBER
DESCRIPTION VARCHAR2(100)
LOCATION_DESC VARCHAR2(100)
DISTRICT VARCHAR2(20)
WARD NUMBER
COMMUNITY VARCHAR2(20)
C_YEAR NUMBER
The SEARCH_DATA table has 469,380 rows for the year 2004. The DESCRIPTION column mentioned above has contents like the following:
DESCRIPTION
------------------------------------------------------
TO PROPERTY
POSS: CANNABIS 30GMS OR LESS
AGGRAVATED: HANDGUN
OTHER VEHICLE OFFENSE
AUTOMOBILE
Vector Embedding Models
To enable similarity search, we will need to create vector embeddings for the column(s) we would like to search on, which, in our examples, is the DESCRIPTION column. Vector embeddings are stored in a column of data type VECTOR. A vector embedding is a mathematical vector representation of data points or, more simply, an array of numbers. Vector embeddings are generated using Machine Learning models. This is the first place we see AI come into play.
How do you decide which embedding model to use? After all, there are open-source embedding models and proprietary embedding models that you might have to pay for, or you could create and train your own embedding models. To add to the confusion, each embedding model has been trained on specific data. The type of embedding model you use will depend on the type of data that you plan to embed and how well that model performs for the searches you or your application need to perform.
Once you decide on one or more embedding models to try, you can choose to create vector embeddings outside the database or inside the database by importing the models directly into Oracle AI Database if they are compatible with the Open Neural Network Exchange (ONNX) standard. Since Oracle AI Database implements an ONNX runtime directly within the database, these imported models can be used to generate vector embeddings in Oracle AI Database.
In the following examples, we are going to be operating on text-based data, so we will use a sentence-transformer-based model. Here is a blog post that details how to convert pre-trained transformer models from Hugging Face, a site that hosts open-source Python libraries for deep learning and offers thousands of pre-trained models, into ONNX format using an Oracle utility called OML4Py, and then how to load the model(s) into Oracle AI Database.
Here is another blog post that details how to load a pre-built vector embedding model into the database without first augmenting the model with the necessary pre- and post-processing steps to convert it into ONNX format. And finally, here is a post that details the types of embedding models that have been tested with AI Vector Search.
For our example, we will take the easy path and use the pre-built all-MiniLM-L12-v2 model that was detailed in the blog post about using a pre-built vector embedding model above. It is a compact yet powerful model that was built using the sentence-transformers library, and it will work for our data requirements. It has been unzipped into an OS directory on the database server, and a database directory called DM_DUMP has been created that points to that OS directory. The following command can then be run to import the ONNX file into the database:
SQL> exec dbms_vector.load_onnx_model('DM_DUMP','all_MiniLM_L12_v2.onnx','minilm_l12_v2', -
JSON('{"function" : "embedding", "embeddingOutput" : "embedding", "input": {"input": ["DATA"]}}'));
Running the following SQL shows that the model “MINILM_L12_V2” has been loaded into the database:
SQL> SELECT model_name, mining_function, algorithm, algorithm_type, model_size
FROM user_mining_models;
MODEL_NAME MINING_FUNCTION ALGORITHM ALGORITHM_TYPE MODEL_SIZE
-------------- ---------------- ---------- -------------- ----------
MINILM_L12_V2 EMBEDDING ONNX NATIVE 133322334
SQL>
We can also display the model’s characteristics:
SQL> SELECT model_name, attribute_name, attribute_type, data_type, vector_info
FROM user_mining_model_attributes;
MODEL_NAME ATTRIBUTE_NAME ATTRIBUTE_TYPE DATA_TYPE VECTOR_INFO
-------------- ---------------- --------------- ----------- -------------------
MINILM_L12_V2 ORA$ONNXTARGET VECTOR VECTOR VECTOR(384,FLOAT32)
MINILM_L12_V2 INPUT TEXT VARCHAR2
SQL>
Notice the VECTOR_INFO column. It shows the number of dimensions and the dimension format of the model’s output vector. Dimension formats will be explained further in the section Creating Vector Embeddings.
We can test what we have done so far by running a simple query to embed the word ‘hello’:
SQL> SELECT TO_VECTOR(VECTOR_EMBEDDING(minilm_l12_v2 USING 'hello' as input));
TO_VECTOR(VECTOR_EMBEDDING(MINILM_L12_v2USING'HELLO'ASINPUT))
------------------------------------------------------------------------------------------------------------------------------------------------------
[-7.49069825E-002,-1.44330608E-002,4.86499295E-002,-2.713810
28E-002,-4.30882089E-002,-1.47763401E-001,6.88331053E-002,-1
.37038985E-002,-5.35686724E-002,2.69752908E-002,-6.28339127E
-003,-3.98834869E-002,7.65678426E-003,-3.78089696E-002,-1.17
558083E-002,-3.46409045E-002,1.29357144E-001,-2.52777878E-00
2,-1.52099226E-002,7.30356318E-004,-8.06888491E-002,2.693785
54E-002,-9.87356976E-002,-3.41076851E-002,-2.70294175E-002,-
7.32003674E-002,5.08588664E-002,-1.72562376E-002,7.28218481E
< A bunch of numbers have been deleted >
41633E-002,5.40374406E-002,4.60668281E-003,4.81108278E-002,-
1.18950203E-001,-4.22098711E-002,4.28496249E-004,-4.60483041
E-003,-6.80256784E-002,2.4777215E-002,5.72777987E-002,3.3987
131E-002,-3.80932316E-002,5.46789682E-003,1.50439981E-002,-1
.71866838E-003,-4.49497951E-003,8.36174041E-002,3.61522138E-
002,-2.27608755E-002,1.0930731E-002,-4.64579314E-002,-2.5119
7945E-002,3.10342927E-002,1.40036559E-002,2.80776881E-002,-7
.75460666E-003,-3.13466154E-002,5.54159284E-002]
SQL>
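Since output like this is hard to eyeball, a quick sanity check is to confirm that the dimension count of the generated vector matches the VECTOR(384,FLOAT32) format we saw in USER_MINING_MODEL_ATTRIBUTES:

SQL> SELECT VECTOR_DIMENSION_COUNT(
            VECTOR_EMBEDDING(minilm_l12_v2 USING 'hello' as input)) AS dimensions;

DIMENSIONS
----------
       384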
Creating Vector Embeddings
Now, we’re ready to create vector embeddings for the DESCRIPTION column. First, we need to add a column to our table to store the vector embeddings. The column will be of data type VECTOR. You can optionally specify the number of dimensions and their format. If you don’t specify them, you can store vectors of different dimensions and formats in the same column (each individual vector still has a single dimension count and format). This is a simplification to help you get started with using vectors in Oracle AI Database, and it avoids having to recreate the vector definition if you later decide to change to a vector embedding model that uses a different number of dimensions and/or a different format.
The number of dimensions must be strictly greater than zero, with a maximum of 65,535 dimensions.
The possible dimension formats are:
• INT8 (8-bit integers)
• FLOAT32 (32-bit IEEE floating-point numbers)
• FLOAT64 (64-bit IEEE floating-point numbers)
• BINARY (packed UINT8 bytes where each dimension is a single bit)
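For example, a column constrained to match the 384-dimension FLOAT32 output of the model we imported earlier could be declared as follows (an illustrative table, not part of our example schema):

SQL> CREATE TABLE embedding_demo (
       id          NUMBER,
       vector_desc VECTOR(384, FLOAT32));

With this definition, the database will reject vectors whose dimension count or format does not match the declaration.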
To add a vector column to an existing table just requires an ALTER TABLE statement:
SQL> ALTER TABLE search_data ADD vector_desc VECTOR;
Notice below in the describe command that when the number of dimensions and the format are not included, you will get a definition of VECTOR(*, *).
SQL> desc search_data
Name Null? Type
----------------------------- -------- --------------------
ID NUMBER
DESCRIPTION VARCHAR2(100)
LOCATION_DESC VARCHAR2(100)
DISTRICT VARCHAR2(20)
WARD NUMBER
COMMUNITY VARCHAR2(20)
C_YEAR NUMBER
VECTOR_DESC VECTOR(*, *)
There are various ways to generate vector embeddings and load them into a VECTOR data type column. For our example, I chose to run a CREATE TABLE AS SELECT command because I thought that was the easiest and fastest method since we are using an embedding model that has been loaded into the database.
SQL> DROP TABLE search_data PURGE;
SQL> CREATE TABLE search_data AS
SELECT id, description, location_desc, district, ward, community, c_year,
vector_embedding(minilm_l12_v2 using DESCRIPTION as input) as vector_desc
FROM chicago_data WHERE c_year = 2004;
There are many other ways I could have accomplished the same thing, as shown below. I could have run an UPDATE statement on the table after adding the VECTOR column, or I could have INSERTed the data into another table using the corresponding ID column from the original table. I could even have generated the vector embeddings outside the database and used SQL*Loader to load the data, just to name a few.
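As an illustration, the UPDATE approach (a sketch that assumes the VECTOR_DESC column added with the earlier ALTER TABLE is still in place) would look like this:

SQL> UPDATE search_data
     SET vector_desc = VECTOR_EMBEDDING(minilm_l12_v2 USING description AS input);

SQL> COMMIT;

Embedding all 469,380 rows in a single statement takes time, which is one reason the CREATE TABLE AS SELECT above was the more convenient option here.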
Similarity Search
It may seem like we have taken a big detour, but now that we have created our vector embeddings, we are ready to get back to our crime incident data and explore how we can make use of them. The reason we are interested in using a similarity search is that we want to search our data for information based on semantically similar terms. In our first example, I wanted to see if we could find descriptions of robberies. However, nowhere in our DESCRIPTION column does the word “robbery” appear.
SQL> SELECT count(*) FROM search_data
2 WHERE UPPER(description) LIKE '%ROBBERY%';
COUNT(*)
----------
0
Now, it’s true that we might be able to sift through the data and come up with semantically similar words on our own and then plug them into the SQL query above. But with large data sets and many different search terms, that doesn’t sound very appealing. Let’s see what happens if we run a similarity search with AI Vector Search. The following is the SQL and the result:
select description, TO_NUMBER(vector_distance) as vector_distance
from (
select id, description,
vector_distance(vector_desc, VECTOR_EMBEDDING(minilm_l12_v2 USING 'ROBBERY' as data),
COSINE) as vector_distance
from search_data
order by 3
fetch exact first 50000 rows only )
group by description, vector_distance
order by 2
fetch first 10 rows only ;
DESCRIPTION VECTOR_DISTANCE
---------------------------------------- ---------------
RETAIL THEFT .299974263
ATTEMPT THEFT .336217165
KIDNAPPING .449222624
PREDATORY .450500667
OTHER CRIME INVOLVING PROPERTY .462044001
AGGRAVATED FINANCIAL IDENTITY THEFT .462046862
EXTORTION .466129601
FORGERY .481875181
ATTEMPT FINANCIAL IDENTITY THEFT .485770345
THEFT/RECOVERY: AUTOMOBILE .494581938
Notice that the SQL syntax for a similarity search might be a little different from what you are used to. Since we are operating on vectors, and their distances to other similar vectors, a new VECTOR_DISTANCE function is used. In this example, we are comparing the distance from the vectorized search term ‘ROBBERY’ to the other vectors in our SEARCH_DATA table. I included the vector distance as a point of reference to show how each description moves a bit farther from our search vector.
Also notice that the VECTOR_DISTANCE function is part of the ORDER BY clause, with the number 3 referring to the position of the VECTOR_DISTANCE calculation in the SELECT list. This is because a similarity search looks at the relative distance of the query vector compared to the other vectors in the data set and returns the top-k nearest vectors.
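As an aside, Oracle also provides shorthand distance operators (<-> for Euclidean distance, <=> for cosine distance, and <#> for negative dot product). A stripped-down version of the inner query, without the deduplicating GROUP BY of the full statement, could be sketched as:

SQL> SELECT description,
            vector_desc <=> VECTOR_EMBEDDING(minilm_l12_v2 USING 'ROBBERY' as data) AS vector_distance
     FROM search_data
     ORDER BY 2
     FETCH FIRST 10 ROWS ONLY;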
The query results look promising. Certainly, descriptions containing the word “theft” would qualify as robberies. In further analysis of the dataset, I discovered that there is another attribute called PRIMARY_TYPE that actually categorizes the descriptions of the crime incidents. At first, I was disappointed by my discovery. Why bother using AI Vector Search when you could just query on the primary type, which includes the categories ROBBERY, THEFT, and BURGLARY, to determine the different types of robberies? Then I realized that this is really a huge advantage because it provides us with a way to check just how accurate similarity search can be. After all, one of the big advantages of similarity search is the ability to search for semantically similar data.
Here’s the same query with the PRIMARY_TYPE attribute added (I recreated the table since I hadn’t initially included the PRIMARY_TYPE column and that seemed to be the easiest method of adding the attribute and data):
select primary_type, description
from (
select primary_type, description,
vector_distance(vector_desc, VECTOR_EMBEDDING(minilm_l12_v2 USING 'ROBBERY' as data),
COSINE) as vector_distance
from search_data
order by 3
fetch exact first 50000 rows only )
group by primary_type, description, vector_distance
order by TO_NUMBER(vector_distance)
fetch first 10 rows only ;
PRIMARY_TYPE DESCRIPTION
-------------------- ----------------------------------------
THEFT RETAIL THEFT
THEFT ATTEMPT THEFT
KIDNAPPING KIDNAPPING
CRIM SEXUAL ASSAULT PREDATORY
OTHER OFFENSE OTHER CRIME INVOLVING PROPERTY
DECEPTIVE PRACTICE AGGRAVATED FINANCIAL IDENTITY THEFT
INTIMIDATION EXTORTION
DECEPTIVE PRACTICE FORGERY
THEFT ATTEMPT FINANCIAL IDENTITY THEFT
MOTOR VEHICLE THEFT THEFT/RECOVERY: AUTOMOBILE
Comparing the PRIMARY_TYPE and DESCRIPTION column data, it seems that our similarity search did pretty well. We are in the ballpark for robberies: a keyword search using “robbery” or “theft” would not have yielded crimes like purse snatching, and it might have missed robberies in other categories like DECEPTIVE PRACTICE and probably MOTOR VEHICLE THEFT. In this example, I think similarity search gave us an advantage, and this is really just the tip of the iceberg. Just think of needing to search data, especially non-textual data, where it is not nearly as easy to see the similarities.
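And although we did not need it here, the attribute filtering mentioned at the start of this post combines naturally with similarity search. Here is a sketch, where WARD 42 is just an illustrative value:

SQL> SELECT description
     FROM search_data
     WHERE ward = 42    -- ordinary relational predicate
     ORDER BY VECTOR_DISTANCE(vector_desc,
              VECTOR_EMBEDDING(minilm_l12_v2 USING 'ROBBERY' as data), COSINE)
     FETCH FIRST 10 ROWS ONLY;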
Summary
We have covered a lot of ground in this post and there is a lot more to talk about in future posts. We started off with an introduction to similarity search, the heart of AI Vector Search. We then identified a sample data set, described vector embedding models, and imported one into our database. We then created vector embeddings using our imported model and stored those vector embeddings in a VECTOR data type column. Finally, we ran some similarity searches and looked at the power of semantic comparisons as opposed to simple keyword searches for our text-based data.
