Similarity Search in Oracle Autonomous Database 19c

INTRODUCTION

You have documents, text, image, audio files and relational data stored in your Oracle Autonomous Database, and you need to compliment your business analysis with semantic search capabilities to analyze this data. Whether you are looking for usage anomalies related to fraud, monitoring news related to your business or simply tracking customer support calls related to broken ski helmet buckles, Autonomous Database offers semantic search capabilities today, in 19c, by invoking Oracle Infrastructure Cloud (OCI), AWS Lambda or Azure cloud functions as SQL functions to perform vector search in the Oracle Autonomous Database without requiring AI or ML skills.

Vector search is a method of information retrieval where texts, documents and queries are represented as vectors instead of plain text. In vector search, Large Language embedding Models (LLMs) generate vector representations of source inputs, which can be text, images, or other content. It is an efficient way to query and browse unstructured data while providing a semantic understanding of content. User-defined functions are used for vector conversion and can be written in PL/SQL (calling external REST APIs using the DBMS_CLOUD package) or Python code wrapped as a User-Defined OCI function.

So how do we accomplish this today in Oracle Autonomous Database? The use case we explore for this blog is a semantic search through aggregated news article titles to answer a very specific question ‘What is the future of crypto currency?’ demonstrating these two common methods – PL/SQL and Python.

Method 1: PL/SQL USER DEFINED FUNCTIONS

For PL/SQL, we create two user defined functions, one is for embedding text query to vector mapping, another is cosine similarity function to determine similarity between two vectors.

Vector Embedding Function:
Requires OPENAI_CRED described later in this page

CREATE OR REPLACE FUNCTION VECTOREMBED(QUERY in VARCHAR2)
RETURN CLOB
IS
    l_headers                 VARCHAR2(4096);
    l_req_payload             JSON_OBJECT_T;
    l_data_array              JSON_ARRAY_T;
    l_query                   JSON_ARRAY_T;
    l_embedding               JSON_ARRAY_T;
    l_resp                    dbms_cloud_types.resp;
    l_ans                     CLOB;
BEGIN
    l_headers := '{''Content-Type'':''application/json''}';
    l_query := new json_array_t;
    l_query.append(QUERY);
    l_req_payload := json_object_t.parse('{}');
    l_req_payload.put('input', l_query);
    l_req_payload.put('model', 'text-embedding-3-small');
    l_req_payload.put('dimensions', 200);
    l_resp := DBMS_CLOUD.SEND_REQUEST (
        credential_name => 'OPENAI_CRED',
        uri             => 'bearer://api.openai.com/v1/embeddings',
        headers         => l_headers,
        body            => l_req_payload.to_blob,
        method          => DBMS_CLOUD.METHOD_POST
        );
    l_data_array := json_object_t(DBMS_CLOUD.get_response_text(l_resp)).get_array('data');
    l_ans := json_object_t(l_data_array.get(0)).get_array('embedding').to_clob;
    return l_ans;
END VECTOREMBED;

Cosine Function:

CREATE OR REPLACE FUNCTION COSINE_SIMILARITY(VECTOR1 in clob, VECTOR2 in clob)
RETURN NUMBER
IS
    l_arr_1        JSON_ARRAY_T;
    l_arr_2        JSON_ARRAY_T;
    l_ans_1        NUMBER;
    l_ans_2        NUMBER;
    l_ans_3        NUMBER;
    num_1          NUMBER;
    num_2          NUMBER;
    ans            NUMBER;
BEGIN
    l_arr_1 := JSON_ARRAY_T.parse(VECTOR1);
    l_arr_2 := JSON_ARRAY_T.parse(VECTOR2);
    l_ans_1 := 0;
    l_ans_2 := 0;
    l_ans_3 := 0;
    for indx in 0 .. l_arr_1.get_size - 1
    LOOP
        num_1 := l_arr_1.get_number(indx);
        num_2 := l_arr_2.get_number(indx);
        l_ans_1 := l_ans_1 + num_1 * num_2;
        l_ans_2 := l_ans_2 + num_1 * num_1;
        l_ans_3 := l_ans_3 + num_2 * num_2;
    END LOOP;
    ans := l_ans_1 / sqrt(l_ans_3) / sqrt(l_ans_2);
    RETURN ROUND(ans, 5);
END COSINE_SIMILARITY;

Method 2: OCI PYTHON USER DEFINED FUNCTIONS

Oracle Cloud Infrastructure (OCI) Functions is a serverless compute service based on the open source Fn Project that lets developers create, run and scale applications without managing any infrastructure. As with the previous PL/SQL example, we create two user defined functions: one is for the user-defined function for cosine similarity, func_cosine.py, to determine similarity between two vectors; the other function, func_embed.py is for embedding text query to vector mapping.

Create and deploy OCI Functions – Steps to create and deploy OCI Functions
Include required packages in function specific requirement.txt files located in the function specific foders created by OCI Functions init.

requirements.txt for func_cosine.py function:

fdk==0.1.60
numpy

Create User-Defined Function func_cosine.py

import io
import json
from fdk import response
import numpy as np

def cosine_distance(query_vector, text_vector):
    query_vector = np.asarray(query_vector, dtype=float)
    text_vector = np.asarray(text_vector, dtype=float)
    # Use cosine as measurement of distance
    dot_product = np.dot(query_vector, text_vector)
    denominator = (np.linalg.norm(query_vector) * np.linalg.norm(text_vector))
    return dot_product/denominator
def handler(ctx, data: io.BytesIO=None):
    try:
        body = json.loads(data.getvalue())
        sql_query = body.get("QUERY")
        text_vector = body.get("TEXT_VECTOR")
        response_body = json.loads(sql_query)["RESPONSE_BODY"]
        json_query = json.loads(response_body)["message"]
        query_vector = list(json_query)
        # Normalize text vector and embed query
        norm_text = text_vector.strip('][').split(', ')
        for i in range(len(norm_text)):
            query_vector[i] = float(query_vector[i])
            norm_text[i] = float(norm_text[i])
        d = round(cosine_distance(query_vector, norm_text), 5)
    except (Exception, ValueError) as ex:
        print(str(ex), flush=True)
    return response.Response(
        ctx, response_data=json.dumps(
            {"message": d}),
        headers={"Content-Type": "application/json"}
    )

Here, query_vector is embedded from a query like “What is the future of crypto currency?”, text_vector represents each vector row ‘embedded’ in the database. Cosine distance measures the distance or similarity between two vectors of an inner product space

Create OCI Function func_embed.py using OCI Function Development Kit (FDK) for Python. The function handler is the entry point to invoke the UDF.

requirements.txt for func_embed.py function:

fdk==0.1.60
langchain
langchain_openai

NOTE: if Langchain package is required, fdk package has to be an older version, otherwise it will throw unexpected error)

func_embed.py

import io
import json
from langchain_openai import OpenAIEmbeddings
from fdk import response

OPENAI_API_KEY = "****"
EMBEDDINGS_OPENAI = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model="text-embedding-3-small", dimensions=200)
def handler(ctx, data: io.BytesIO=None):
    try:
        body = json.loads(data.getvalue())
        sql_query = body.get("QUERY")
        # Normalize text vector and embed query
        query_vector =EMBEDDINGS_OPENAI.embed_query(sql_query)
        for i in range(len(query_vector)):
            query_vector[i] = round(query_vector[i], 8)
    except (Exception, ValueError) as ex:
        print(str(ex), flush=True)
    return response.Response(
        ctx, response_data=json.dumps(
            {"message": query_vector}),
        headers={"Content-Type": "application/json"}
    )

NOTE: Replace OPENAI_API_KEY = “****” with your actual OPENAI_API_KEY value

Above is the OCI cloud function handler func_embed.py created with FDK package. This function embeds the query into query_vector and calculates the distance between query vector and text vectors stored in database. Here we using the OpenAI Embedding LLM.

Embed and Insert

In this section, we will transform texts to vectors and store vectors in database tables using the two methods.

For the Python method we create two tables, UCI_NEWS_AGGREGATOR and NEWS_EMBEDDED. Load the news-aggregator-dataset in to the table UCI_NEWS_AGGREGATOR using the table description defined below. The NEWS_EMBEDDED table is used to embed and store the field TITLE data into vectors.

When using the PL/SQL method, we create the UCI_NEWS_AGGREGATOR table as in the Python method, but for the purposes of this blog we create a separate NEWS_EMBEDDED_SQL table so that we can compare the vector embedding between the two methods. The NEWS_EMBEDDED_SQL table for the PL/SQL Vector Transfer function described below, is identical to the Python NEWS_EMBEDDED table.

For this blog we use a public domain news aggregator dataset:

https://www.kaggle.com/datasets/uciml/news-aggregator-dataset

Sample Data

Create and insert data into the UCI_NEWS_AGGREGATOR table

This table has the following definition:

desc UCI_NEWS_AGGREGATOR

Name Null? Type
——— —– ————–
ID NUMBER
TITLE VARCHAR2(4000)
URL VARCHAR2(4000)
PUBLISHER VARCHAR2(4000)
CATEGORY VARCHAR2(4000)
STORY VARCHAR2(4000)
HOSTNAME VARCHAR2(4000)
TIMESTAMP NUMBER

NOTE: This table can be created as an external table eliminating the need to execute a load operation to populate the table. For more detail on creating external tables in Oracle Autonomous Database using DBMS_CLOUD please see DBMS_CLOUD.CREATE_EXTERNAL_TABLE.

Create NEWS_EMBEDDED and NEWS_EMBEDDED_SQL tables with the following description

We need to embed field TITLE into vectors and store them in the NEWS_EMBEDDED table. This table has the following definition:

desc NEWS_EMBEDDED
Name Null? Type
———— —– ————–
TITLE VARCHAR2(4000)
URL VARCHAR2(4000)
CATEGORY VARCHAR2(4000)
TIMESTAMP NUMBER
TITLE_VECTOR CLOB

NOTE: as mentioned above, for PL/SQL our example uses the same table description for the NEWS_EMBEDDED_SQL for the TITLE embedding.

When using the PL/SQL method, create an OPENAI_CREDENTIAL and use the following PL/SQL to transfer and embed the TITLE vectors

Method 1: PL/SQL Embed and Insert

Create OPENAI_CREDENTIAL

BEGIN
    DBMS_NETWORK_ACL_ADMIN.APPEND_HOST_ACE(
         host => 'api.openai.com',
         ace  => xs$ace_type(privilege_list => xs$name_list('http'),
                             principal_name => 'ADMIN',
                             principal_type => xs_acl.ptype_db)
   );
END;

EXEC DBMS_CLOUD.CREATE_CREDENTIAL('OPENAI_CRED', 'OPENAI', 'OPENAIKEY');

Use PL/SQL to perform TITLE embedding directly by DBMS_CLOUD.SEND_REQUEST API.

Create and execute the tranfer vector PL/SQL

-- Vector Transfer Using OPENAI Embedding LLM
-- Inserts in to table NEWS_EMBEDDED_SQL
DECLARE
    type refcur               is ref cursor;
    l_cursor                  refcur;
    l_sqlstmt                 clob;
    l_insert_stmt             clob;
    l_headers                 VARCHAR2(4096);
    l_title                   VARCHAR2(4096);
    l_resp                    dbms_cloud_types.resp;
    l_req_payload             JSON_OBJECT_T;
    l_resp_json               JSON_OBJECT_T;
    l_data_array              JSON_ARRAY_T;
    l_titles                  JSON_ARRAY_T;
    l_embedding               JSON_ARRAY_T;
BEGIN
    l_headers := '{''Content-Type'':''application/json''}';
    l_sqlstmt := 'select title from UCI_NEWS_AGGREGATOR';
    l_titles := new json_array_t;
    OPEN l_cursor FOR l_sqlstmt;
    LOOP
        FETCH l_cursor INTO l_title;
        EXIT WHEN l_cursor%NOTFOUND;
        BEGIN
            l_titles.append(l_title);
        END;
    END LOOP;
    l_req_payload := json_object_t.parse('{}');
    l_req_payload.put('input', l_titles);
    l_req_payload.put('model', 'text-embedding-3-small');
    l_req_payload.put('dimensions', 200);
    l_resp := DBMS_CLOUD.SEND_REQUEST (
        credential_name => 'OPENAI_CRED',
        uri             => 'bearer://api.openai.com/v1/embeddings',
        headers         => l_headers,
        body            => l_req_payload.to_blob,
        method          => DBMS_CLOUD.METHOD_POST
        );
    l_resp_json := json_object_t(DBMS_CLOUD.get_response_text(l_resp));
    l_data_array := l_resp_json.get_array('data');
    FOR indx in 0 .. l_data_array.get_size - 1
    LOOP
        l_embedding := JSON_OBJECT_t(l_data_array.get(indx)).get_array('embedding');
        l_insert_stmt := 'insert into NEWS_EMBEDDED_SQL(title, title_vector) values (:1, :2)';
        EXECUTE IMMEDIATE l_insert_stmt USING l_titles.get(indx).to_string, l_embedding.to_clob;
    END LOOP;
END;

Method 2: Python Embed and Insert

Create ReadAndInsert.py

For the Python method, we will use a python script to perform TITLE embedding. We will read the dataset using the Python oracledb package, embed texts of TITLE to vectors, and insert these vectors to NEWS_EMBEDDED table using the same Python oracledb package.

# Python script to embed TITLE vectors in the NEWS_EMBEDDED
# table using OPENAI embeddings LLM

from langchain_openai import OpenAIEmbeddings
import oracledb
OPENAI_API_KEY = "****"
EMBEDDINGS_OPENAI = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model="text-embedding-3-small", dimensions=200)
SQL_QUERY = "select TITLE,URL,CATEGORY,TIMESTAMP from UCI_NEWS_AGGREGATOR"
def read_insert_data():
    oracle_connection = oracledb.connect(user=USERNAME, password=PASSWORD, dsn=CONNECTION_STRING)
    cursor = oracle_connection.cursor()
    cursor.execute(SQL_QUERY)
    rows = cursor.fetchall()
    all_data = []
    for row in rows:
        all_data.append(list(row))
    # Embed title into vector
    embeddings = EMBEDDINGS_OPENAI.embed_documents([row[0] for row in all_data])
    for x in embeddings:
        for i in range(len(x)):
            x[i] = round(x[i], 8)
    # Insert embedded data into table
    for i in range(len(all_data)):
        cursor.execute("""insert into news_embedded values (:1, :2, :3, :4, :5)""", [all_data[i][0],
                                                                                     all_data[i][1],
                                                                                     all_data[i][2],
                                                                                     all_data[i][3],
                                                                                     str(embeddings[i])])
    cursor.close()
    oracle_connection.commit()
    oracle_connection.close()
read_insert_data()

NOTE: Replace OPENAI_API_KEY=”****” with your OPEN_API_KEY

Invoke OCI functions as SQL functions

In this section, we will use DBMS_CLOUD_FUNCTION package to invoke the above OCI cloud function as a SQL function. Similar instructions can be found here OCI: oci user-defined function. The following is specific to the Python method and does not apply to PL/SQL method.

Create OCI Credential

Create credential, private_key and fingerprint generation: Step to get singing key and fingerprint

BEGIN

  DBMS_CLOUD.CREATE_CREDENTIAL (
                   credential_name => 'OCI_CRED2',
                   user_ocid       => 'user_ocid',
                   tenancy_ocid    => 'tenancy_ocid',
                   private_key     => 'private_key',
                   fingerprint     => 'fingerprint'
   );
END;
/

Create function catalog OCI_DEMO_CATALOG

BEGIN

    DBMS_CLOUD_FUNCTION.CREATE_CATALOG (
            credential_name  => 'OCI_CRED2',
           catalog_name     => 'OCI_DEMO_CATALOG',
           service_provider => 'OCI',
           cloud_params     => '{"region_id":"us-phoenix-1",
          "compartment_id":"comp_id"}'
   );
END;
/

A catalog is a collection of wrapper functions that reference and call their respective cloud functions via their API endpoints. It provides a structured way to organize and manage functions, make it easier to find and access them. It is required for OCI Function use. Please see dbms-cloud-function documentation for details.

Create SQL functions VECTOREMBEDDED and VECTORSEARCH(cosine_similarity)

# SQL function for VECTOREMBEDDED and VECTORSEARCH(cosine_similarity) 
BEGIN
    DBMS_CLOUD_FUNCTION.CREATE_FUNCTION (
        credential_name  => 'OCI_CRED2',
        catalog_name     => 'OCI_DEMO_CATALOG',
        function_name    => 'vectorembedded',
        function_id      => 'ocid1.fnfunc.oc1.phx.aaaaaaaa3or4vaylyhyxpsvp4nr2ucpfrr5aspj6mqfak7r7cfuoq3gl37ga',
        input_args       => TO_CLOB('{"QUERY":"VARCHAR2"}')
 );
END;

CREATE SQL FUNCTION VECTORSEARCH

# SQL Function for VECTORSEARCH cosine similarity
VAR function_args CLOB;
EXEC :function_args :=
TO_CLOB('{"QUERY":”VARCHAR2”,”TEXT_VECTOR”:”CLOB"}');
BEGIN
        DBMS_CLOUD_FUNCTION.CREATE_FUNCTION (
        credential_name  => 'OCI_CRED',
        catalog_name     => 'OCI_DEMO_CATALOG',
        function_name    => 'vectorsearch',
        function_id      => 'ocid1.fnfunc.sample',
        input_args       => :function_args
);
END;
/

Vectorsearch function description

desc VECTORSEARCH

FUNCTION VECTORSEARCH RETURNS CLOB

Argument Name Type In/Out Default?
————- ——– —— ——–
QUERY VARCHAR2 IN unknown
TEXT_VECTOR CLOB IN unknown

VECTORSEARCH queries – PL/SQL and Python

PL/SQL method

In this step we use PL/SQL function VECTORSEARCH to perform a vector search to answer the question:

‘What is the future about crypto currency?’

Sql statement for PL/SQL function:

with queryvector as (select VECTOREMBED('What is the future of crypto currency?') as qvector from DUAL)
select title, COSINE_SIMILARITY(a.qvector, b.title_vector) as distance from queryvector a, NEWS_EMBEDDED_SQL b order by distance desc fetch first 5 rows only;

Here are the top 5 relevant rows with largest cosine similarity (distance) ordered by relevance:

PLSQL-VectorSearch

BENEFITS OF EXECUTING QUERY IN SQL

When functions are executed within the database, it often leads to better performance compared to connecting to database through external Python scripts to fetch data. Executing SQL will minimize the data transformation between database and external application, it will also reduce latency due to network response. Additionally, running functions through external python scripts requires wallet and password configuration, which may lead to the security leakage of database credentials.

Python method

SQL statement for OCI Python cloud functions

with queryvector as (select VECTOREMBEDDED('What is the future of crypto currency?') as qvector from dual)
select b.title, JSON_VALUE(JSON_VALUE(VECTORSEARCH(a.qvector, b.TITLE_VECTOR), '$.RESPONSE_BODY'), '$.message') as distance from queryvector a, NEWS_EMBEDDED b order by distance desc fetch first 5 rows only;

Python-VectorSearch-top5

As you can see, whether you use the PL/SQL method or the Python OCI function method the inference and cosign similarity is identical.

CONCLUSION

Leveraging OCI Functions and the inherent security and core functionality of the Oracle Autonomous Database, we illustrate two common methods where you can create embedded vectors on your target data set and execute vector search with cosine similarity in Oracle Autonomous Database 19c today. Oracle is committed to enriching the built-in AI and vector search functionality in the Oracle Autonomous Database but there is no need to wait! Get started today!

Contributing Author: Xiao Sai

Similarity Search in Oracle Autonomous Database 19c

INTRODUCTION

Method 1: PL/SQL USER DEFINED FUNCTIONS

Method 2: OCI PYTHON USER DEFINED FUNCTIONS

Embed and Insert

Method 1: PL/SQL Embed and Insert

Method 2: Python Embed and Insert

VECTORSEARCH queries – PL/SQL and Python

CONCLUSION

John McHugh

Senior Principal Product Manager

Using variables in Data Transforms

Using Data Studio Data Load APIs in PL/SQL

Similarity Search in Oracle Autonomous Database 19c

INTRODUCTION

Method 1: PL/SQL USER DEFINED FUNCTIONS

Method 2: OCI PYTHON USER DEFINED FUNCTIONS

Embed and Insert

Method 1: PL/SQL Embed and Insert

Method 2: Python Embed and Insert

VECTORSEARCH queries – PL/SQL and Python

CONCLUSION

Authors

John McHugh

Senior Principal Product Manager

Using variables in Data Transforms

Using Data Studio Data Load APIs in PL/SQL