Oracle Machine Learning now supports the vector data type for clustering, classification, regression, anomaly detection, and feature extraction. With this new feature, you can provide vector data as input to several in-database algorithms to complement other structured data or to use alone.
The addition of vector data expands the set of machine learning use cases. Using dense vector embeddings alone or in combination with other structured data has the potential to improve model quality. Further, it complements Oracle Machine Learning’s ability to import to the database and use text and image embedding models (transformers) to generate such embeddings as input to the in-database machine learning algorithms.
There are many applicable use cases involving vectors as predictors for traditional machine learning models, for example:
OML supports the vector data type for the following algorithms and the scoring operators:
Techniques | Algorithms | Scoring Operators |
---|---|---|
Classification |
SVM Neural Network GLM |
PREDICTION PREDICTION_PROBABILITY PREDICTION_SET PREDICTION_BOUNDS |
Anomaly Detection |
One-class SVM |
PREDICTION PREDICTION_PROBABILITY PREDICTION_SET |
Clustering | k-Means Expectation Maximization |
CLUSTER_ID CLUSTER_PROBABILITY CLUSTER_SET, CLUSTER_DISTANCE |
Feature Extraction | SVD PCA |
FEATURE_ID FEATURE_VALUE FEATURE_SET VECTOR_EMBEDDING |
In this simple example, we start with a table IMAGEVEC that minimally contains two columns: ID of type NUMBER and EMBEDDING of type VECTOR. You can create a vector column from image data using an ONNX-format transformer loaded into the in-database, or you can generate the vectors externally and save them to a VECTOR column in your table using features of AI Vector Search in the DBMS_VECTOR package.
To reduce the dimensionality of the vectors, here we set the number of features (dimensions) to 20 using the Singular Value Decomposition algorithm in combination with the Principal Component Analysis (PCA) scoring mode. The following code produces the model in our database schema for immediate use with SQL queries:
DECLARE
v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
v_setlst('ALGO_NAME') := 'ALGO_SINGULAR_VALUE_DECOMP';
v_setlst('SVDS_SCORING_MODE') := 'SVDS_SCORING_PCA';
v_setlst('FEAT_NUM_FEATURES') := '20';
DBMS_DATA_MINING.CREATE_MODEL2(
MODEL_NAME => 'PCA_MODEL_1',
MINING_FUNCTION => 'FEATURE_EXTRACTION',
DATA_QUERY => 'SELECT * FROM IMAGEVEC ',
CASE_ID_COLUMN_NAME => 'ID',
SET_LIST => v_setlst);
END;
From here, we can transform each vector using the VECTOR_EMBEDDING operator and the model PCA_MODEL_1 created above. We could create a view instead of a table but for the most efficient processing during the next phase of our example, we materialize a table with a vector column containing the reduced number of dimensions.
CREATE TABLE REDUCED_IMAGEVEC as
SELECT ID,
VECTOR_EMBEDDING(PCA_MODEL_1 using *) EMBEDDING
FROM IMAGEVEC;
Now, we’re ready to build a K-means clustering model using the vectors with reduced dimensions. We’ll produce a model with four clusters.
DECLARE
v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
v_setlst('ALGO_NAME') := 'ALGO_KMEANS';
v_setlst('KMNS_DETAILS') := 'KMNS_DETAILS_ALL';
v_setlst('CLUS_NUM_CLUSTERS') := '4';
DBMS_DATA_MINING.CREATE_MODEL2(
MODEL_NAME => 'KM_MODEL_USING_VECTORS',
MINING_FUNCTION => 'CLUSTERING',
DATA_QUERY => 'SELECT * FROM REDUCED_IMAGEVEC',
CASE_ID_COLUMN_NAME => 'ID',
SET_LIST => v_setlst);
END;
Using the USER_MINING_MODEL_ATTRIBUTES table, we can view the attributes (predictors) used in a model.
SELECT MODEL_NAME MODEL_NAME, ATTRIBUTE_NAME, DATA_TYPE, TARGET, VECTOR_INFO
FROM USER_MINING_MODEL_ATTRIBUTES
WHERE MODEL_NAME ='KM_MODEL_USING_VECTORS' ORDER BY attribute_name;
Here, we show one vector attribute of length 20.
MODEL_NAME ATTRIBUTE_NAME DATA_TYPE TARGET VECTOR_INFO
KM_MODEL_USING_VECTORS EMBEDDING VECTOR NO VECTOR(20,FLOAT64)
Using the SQL cluster operators, we can assign each image to a cluster, along with a corresponding probability.
SELECT ID,
cluster_id(KM_MODEL_USING_VECTORS using *) cluster_id,
cluster_probability(KM_MODEL_USING_VECTORS using *) probability
FROM REDUCED_IMAGEVEC ORDER BY id;
ID CLUSTER_ID PROBABILITY
1 2 0.3200635911914684
2 2 0.3431603123967911
3 3 0.27137847598263004
4 3 0.26255812851723653
5 2 0.2954242359224814
Doing some simple aggregation, we can summarize the number of images assigned to each cluster – visualized using OML Notebooks’ built-in visualizations.
SELECT count(*) cnt, cluster_id
FROM (SELECT id,
cluster_id(KM_MODEL_USING_VECTORS using *) cluster_id,
cluster_probability(KM_MODEL_USING_VECTORS using *) probability
FROM pca_data_from_vector)
GROUP BY cluster_id;
This powerful new capability is available in Oracle Database 23ai (specifically, Oracle Database 23.7). For more information about this and other 23ai features, check out the OML4SQL product documentation and changes introduced for 23ai.
Mark Hornick is Senior Director, Machine Learning and AI Product Management. Mark has more than 20 years of experience integrating and leveraging machine learning with Oracle software as well as working with internal and external customers to apply Oracle’s machine learning technologies. Mark is Oracle’s representative to the R Consortium and is an Oracle Adviser and founding member of the Analytics and Data Oracle User Community. He has been issued seven US patents. Mark holds a bachelor’s degree from Rutgers University and a master’s degree from Brown University, both in computer science. Follow him on Twitter and connect on LinkedIn.