Creating vector embedding using Data Transforms for ML/AI application

 

This is a guide to help you learn how to create and populate vector columns with text embedding in Autonomous database Data Studio using Data Transforms tool in a no-code UI.

Oracle Database 23ai has introduced a native VECTOR data type and AI Vector Search capability. This feature is used in various applications such as chatbots, document semantic search using similarity analysis and supports Retrieval Augmented Generation (RAG), a technique that augments LLM with private data for more accurate responses.

First step in all the above applications is to create vector embedding for the data at hand. This task is made easier in Data Transforms by providing a drag and drop Vector embedding step. There is no coding needed. Once the data flow is defined to populate the Vector embedding column, it can be scheduled to run periodically to keep it up to date as new data becomes available.

Note that this blog assumes basic knowledge of Data Transforms.

The steps are as follows:

  • Register OCI Generative AI service
  • Create Vector column in the target table if it doesn’t exist already
  • Create Data Flow to populate the target table
  • Schedule Data Flow

Register OCI Generative AI service
First step is to connect to OCI Generative AI service which will be used for generating Vector embeddings. Go to Connections menu in Data Transforms and find OCI Generative AI service in the Services section.

Fill in the credentials as follows:


OCI URL – The endpoint URL of the OCI Generative AI service.

User OCID – The user OCID from the Oracle Cloud Infrastructure Console.

Tenancy OCID – The tenancy OCID from the Oracle Cloud Infrastructure Console.

Compartment OCID – The compartment OCID from the Oracle Cloud Infrastructure Console.

Private Key – The private key in the PEM format. Specify the path to your downloaded private key file.

Fingerprint – The fingerprint of the key that was just added.

Test the connection for success. You will need to test it against an Oracle 23ai database since it is a feature of 23ai.

Create Vector column in the target table if it doesn’t exist already
Import the source and target table definitions in the Data Entities menu. In this example the source table MOVIE contains data for movies. There is a SUMMARY column that will be the source for Vector embedding.

This SUMMARY column has long textual summary of movies. After we create Vector embedding for this column, it will be possible to run similarity search queries on movies table and possibly use the Vector embedding for other ML/AI applications.

The SUMMARY data looks like this.


Now create a new column of VECTOR data type in the target table MOVIE_SUMMARY_VECTOR_EMBEDDING.

 

Now we are ready to create the Data Flow to populate this Vector column.

Create Data Flow to populate the target table
Navigate to your project and create a new Data Flow. Drag the source table MOVIE in the UI.
On the top of the UI, there are many transforms bucket. Click on Machine Learning bucket and drag Text Embedding Vector transform step into the canvas. Join the source and the transform step.

 

Open the properties of the embedding step and configure is as below in the General section.

Select the AI Service, Connection and the AI Model in the General section. The Connection is the previously defined OCI Generative AI service and the AI Model should be appropriate for your usage. If not sure, then consult with the application user using the embedding for the right AI model.

In the Attributes section, you will notice there is an output called VECTOR_EMBEDDING. Optionally you can change this name to be more meaningful. We will leave it as it is.

In the Column Mapping section, map the source column SUMMARY to the input_text by dragging the column over.


Now, as a last step, drag the target table MOVIE_SUMMARY_VECTOR_EMBEDDING on the canvas and join it to the embedding step.

 

Open the Column Mapping in the property page of the target table and make sure the Vector column is mapped to the output of Vector embedding step.

Navigate to the Options tab and make sure the target loading options are as needed. You have an option to simply truncate and load the target, append to existing rows, or even incrementally update it (based on PK). Please refer to Data Transforms documentation for the advanced loading options.
 

In this example the target is simply being dropped and recreated before loading.


Save the Data Flow and execute it by clicking on the triangle icon on the top menu.

After the load is successful, you can do the data preview to look at the newly populated columns.

 

Schedule Data Flow
Optionally you may want to run this Data Flow on a periodic basic for new incoming data. Data Transforms provides a full scheduler to schedule your Data Flow. You can also include this Data Flow in a workflow so that it is part of a bigger data integration job.

Conclusion

Vector embedding is a very powerful feature of Oracle 23ai database but using it normally requires learning new SQL syntax and learning details of setting up the Generative AI service.  Data Transforms makes this task very easy with its no-code UI for AI application users.