Announcing vector support for in-database machine learning algorithms

January 24, 2025 | 6 minute read
Mark Hornick
Senior Director, Machine Learning and AI Product Management
Text Size 100%:

Oracle Machine Learning now supports the vector data type for clustering, classification, regression, anomaly detection, and feature extraction. With this new feature, you can provide vector data as input to several in-database algorithms to complement other structured data or to use alone.

The addition of vector data expands the set of machine learning use cases. Using dense vector embeddings alone or in combination with other structured data has the potential to improve model quality. Further, it complements Oracle Machine Learning’s ability to import to the database and use text and image embedding models (transformers) to generate such embeddings as input to the in-database machine learning algorithms.

Use Case Examples

There are many applicable use cases involving vectors as predictors for traditional machine learning models, for example:

  • Text Classification—Consider email spam classification. Add a vector column (using email text and an in-database transformer model) to your table that contains a target column (what you want to predict, e.g., IS_SPAM) and possibly other structured data columns (for example, email sender and length). Then provide this data to classifiers like SVM, GLM, and neural networks.
  • Image Clustering—Consider product image catalog organization. Vectorize images and provide the vector column, possibly with other structured data, to a clustering algorithm, like K-means or Expectation Maximization.
  • Anomaly Detection—Consider identifying images with flawed crystalline structure. Vectorize proper images and supply the vector column to an anomaly detection algorithm like One-class SVM, Expectation Maximization. As new images are scored, images that significantly differ from the training data will be flagged as anomalies.
  • Dimensionality Reduction—Reduce the number of dimensions in a column of vectors for more efficient processing. Once you have your vector column, use a feature extraction algorithm like PCA or SVD to produce new vectors of a reduced dimension. This can be used as a preprocessing step to other algorithms like K-means clustering.

OML supports the vector data type for the following algorithms and the scoring operators:

ML techniques and algorihtms supporting VECTOR data type
Techniques Algorithms Scoring Operators

 Classification
 Regression

 SVM
 Neural Network
 GLM
 PREDICTION
 PREDICTION_PROBABILITY
 PREDICTION_SET
 PREDICTION_BOUNDS
 Anomaly Detection

 One-class SVM
 Expectation Maximization

 PREDICTION
 PREDICTION_PROBABILITY
 PREDICTION_SET
 Clustering  k-Means
 Expectation Maximization
 CLUSTER_ID
 CLUSTER_PROBABILITY
 CLUSTER_SET, CLUSTER_DISTANCE
 Feature Extraction  SVD
 PCA
 FEATURE_ID
 FEATURE_VALUE
 FEATURE_SET
 VECTOR_EMBEDDING

  
Dimensionality Reduction Example

In this simple example, we start with a table IMAGEVEC that minimally contains two columns: ID of type NUMBER and EMBEDDING of type VECTOR. You can create a vector column from image data using an ONNX-format transformer loaded into the in-database, or you can generate the vectors externally and save them to a VECTOR column in your table using features of AI Vector Search in the DBMS_VECTOR package.

To reduce the dimensionality of the vectors, here we set the number of features (dimensions) to 20 using the Singular Value Decomposition algorithm in combination with the Principal Component Analysis (PCA) scoring mode. The following code produces the model in our database schema for immediate use with SQL queries:

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
DECLARE
  v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlst('ALGO_NAME')         := 'ALGO_SINGULAR_VALUE_DECOMP';
  v_setlst('SVDS_SCORING_MODE') := 'SVDS_SCORING_PCA';
  v_setlst('FEAT_NUM_FEATURES') := '20';
 
  DBMS_DATA_MINING.CREATE_MODEL2(
        MODEL_NAME          => 'PCA_MODEL_1',
        MINING_FUNCTION     => 'FEATURE_EXTRACTION',
        DATA_QUERY          => 'SELECT * FROM IMAGEVEC ',
        CASE_ID_COLUMN_NAME => 'ID',
        SET_LIST            => v_setlst);    
END;

From here, we can transform each vector using the VECTOR_EMBEDDING operator and the model PCA_MODEL_1 created above. We could create a view instead of a table but for the most efficient processing during the next phase of our example, we materialize a table with a vector column containing the reduced number of dimensions.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
CREATE TABLE REDUCED_IMAGEVEC as
  SELECT ID,
         VECTOR_EMBEDDING(PCA_MODEL_1 using *) EMBEDDING
  FROM IMAGEVEC;

Now, we’re ready to build a K-means clustering model using the vectors with reduced dimensions. We’ll produce a model with four clusters.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
DECLARE
  v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlst('ALGO_NAME')         := 'ALGO_KMEANS';
  v_setlst('KMNS_DETAILS')      := 'KMNS_DETAILS_ALL';
  v_setlst('CLUS_NUM_CLUSTERS') := '4';

  DBMS_DATA_MINING.CREATE_MODEL2(
        MODEL_NAME          => 'KM_MODEL_USING_VECTORS',
        MINING_FUNCTION     => 'CLUSTERING',
        DATA_QUERY          => 'SELECT * FROM REDUCED_IMAGEVEC',
        CASE_ID_COLUMN_NAME => 'ID',
        SET_LIST            => v_setlst);    
END;

Using the USER_MINING_MODEL_ATTRIBUTES table, we can view the attributes (predictors) used in a model.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
SELECT MODEL_NAME MODEL_NAME, ATTRIBUTE_NAME, DATA_TYPE, TARGET, VECTOR_INFO
FROM   USER_MINING_MODEL_ATTRIBUTES
WHERE  MODEL_NAME ='KM_MODEL_USING_VECTORS' ORDER BY attribute_name;

Here, we show one vector attribute of length 20.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
MODEL_NAME                 ATTRIBUTE_NAME    DATA_TYPE    TARGET    VECTOR_INFO
KM_MODEL_USING_VECTORS    EMBEDDING           VECTOR           NO    VECTOR(20,FLOAT64)

Using the SQL cluster operators, we can assign each image to a cluster, along with a corresponding probability.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
SELECT ID,
       cluster_id(KM_MODEL_USING_VECTORS using *) cluster_id,
       cluster_probability(KM_MODEL_USING_VECTORS using *) probability  
FROM REDUCED_IMAGEVEC ORDER BY id;

ID    CLUSTER_ID    PROBABILITY
1              2    0.3200635911914684
2              2    0.3431603123967911
3              3    0.27137847598263004
4              3    0.26255812851723653
5              2    0.2954242359224814


Doing some simple aggregation, we can summarize the number of images assigned to each cluster – visualized using OML Notebooks’ built-in visualizations.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
SELECT count(*) cnt, cluster_id
FROM (SELECT id, 
             cluster_id(KM_MODEL_USING_VECTORS using *) cluster_id,
             cluster_probability(KM_MODEL_USING_VECTORS using *) probability  
      FROM pca_data_from_vector)
GROUP BY cluster_id;

Vector clustering model histogram of cluster sizes

For More Information

This powerful new capability is available in Oracle Database 23ai (specifically, Oracle Database 23.7). For more information about this and other 23ai features, check out the OML4SQL product documentation and changes introduced for 23ai.

 

Mark Hornick

Senior Director, Machine Learning and AI Product Management

Mark Hornick is Senior Director, Machine Learning and AI Product Management. Mark has more than 20 years of experience integrating and leveraging machine learning with Oracle software as well as working with internal and external customers to apply Oracle’s machine learning technologies. Mark is Oracle’s representative to the R Consortium and is an Oracle Adviser and founding member of the Analytics and Data Oracle User Community. He has been issued seven US patents. Mark holds a bachelor’s degree from Rutgers University and a master’s degree from Brown University, both in computer science. Follow him on Twitter and connect on LinkedIn.

Show more
Oracle Chatbot
Disconnected