Oracle Machine Learning for Python (OML4Py) is a Python API that supports the machine learning process, including data exploration and preparation, machine learning modeling, and solution deployment, using your Oracle Database or Oracle Autonomous Database. In the OML4Py 2.1 release, we introduce several new capabilities designed to expand support for bring-your-own-model, take advantage of new 23ai features like vectors, and facilitate solution deployment across databases:
- Take advantage of the new Oracle Database 23ai VECTOR datatype via:
  - New Python vector data type, oml.Vector that can be a part of OML4Py DataFrame proxy objects
  - Using vectors with in-database machine learning (ML) algorithms
- Expand support for bring-your-own-model via:
  - Ability to import traditional ML models and transformers using the Open Neural Network Exchange (ONNX) format
  - Ability to convert models easily from Hugging Face to ONNX format using the ONNX Pipeline Models capability, available for Oracle Database 23.7
  - Ability to score data using OML4Py functions and the native SQL prediction operators
- Export and import user-defined Python functions in the in-database script repository
OML4Py 2.1 is available for download and installation on Oracle Database 23ai and, coming soon, for Oracle Autonomous Database Serverless supporting Oracle Database 23ai.
Vector type support
Vectors are compact semantic representations of unstructured data like text and images. Vectors play a crucial role in Oracle Database 23ai AI Vector Search for which the datatype VECTOR was introduced. OML4Py 2.1 includes a new vector data type, oml.Vector, which can be part of OML4Py DataFrame proxy objects – whether reading from or writing to the database or for manipulation using OML4Py functions. Vector columns can also be used for compatible machine learning (ML) algorithms for model training and prediction. This capability facilitates using vectors as predictors and producing vectors. Such vector results could be used for similarity searches, e.g., with AI Vector Search.
Vectors in DataFrame proxy objects
The class oml.Vector represents a single column of VECTOR data in an Oracle Database table or view.
Here, we use oml.push to create a temporary table holding a vector column with integer (int8) dimensions. The function oml.push returns a DataFrame proxy object for use with other functions, such as dimension, which returns the dimension of the vector column. DataFrame proxy objects enable manipulating database data without loading the table data into Python memory.
    >>> import array
    >>> import pandas as pd
    >>> import oml
    >>> value1 = array.array("b", [1, 2, 3])
    >>> dataframe = pd.DataFrame({'VECTOR': [value1, value1]})
    >>> vector1 = oml.push(dataframe, dbtypes = "VECTOR(3, int8)")
    >>> vector1
          VECTOR
    0  [1, 2, 3]
    1  [1, 2, 3]
    >>> vector1['VECTOR'].dimension()
    3
We can also use oml.create to create a persistent table, here named MY_VECTOR_TABLE2, with float vector dimensions.
    >>> value2 = array.array("f", [1.5, 2.1, 3.3])
    >>> dataframe = pd.DataFrame({'VECTOR': [value2, value2]})
    >>> vector2 = oml.create(dataframe, table = 'MY_VECTOR_TABLE2', dbtypes = "VECTOR(3,float32)")
    >>> vector2
                                             VECTOR
    0  [1.5, 2.0999999046325684, 3.299999952316284]
    1  [1.5, 2.0999999046325684, 3.299999952316284]
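The trailing digits in the float output come from 32-bit storage rather than from OML4Py itself. The following sketch uses only Python's standard array module, with no database connection, to show how the "b" and "f" typecodes used above correspond to 1-byte integers and 4-byte floats. The pairing of these typecodes with the int8 and float32 vector formats is taken from the examples above.

```python
import array

# "b" holds signed 1-byte integers, matching the int8 format used above.
int8_values = array.array("b", [1, 2, 3])

# "f" holds 4-byte floats, matching the float32 format used above.
float32_values = array.array("f", [1.5, 2.1, 3.3])

print(int8_values.itemsize, float32_values.itemsize)

# Reading the float32 values back as Python floats (64-bit) exposes the
# rounding that happened when 2.1 and 3.3 were stored in 32 bits.
print(list(float32_values))

# Values outside the signed 1-byte range are rejected by the array type.
try:
    array.array("b", [200])
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)
```

Choosing the array typecode to match the declared vector format avoids surprises of this kind when values are written to and read back from the table.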
In the following example, we get a proxy object for the database table MY_TEXT_TABLE (which has ID and DATA columns), load our ONNX-format embedding model from a file, and use that model to generate vectors for DATA.
    >>> input_dat = oml.sync(table = "MY_TEXT_TABLE")
    >>> onnx_mod = oml.onnx(onnx_file = "all-MiniLM-L6-v2.onnx",
    ...                     mining_function = "embedding",
    ...                     embedding_output = 'embedding',
    ...                     model_input = {'input': ['DATA']})
    >>> # model_name holds the name to give the model in the database
    >>> onnx_mod = onnx_mod.load2db(model_name)
    >>> vector_embedding = onnx_mod.vector_embedding(input_dat, supplemental_cols = input_dat).sort_values(by = ['ID'])
    >>> vector_embedding
       ID                    DATA                                  VECTOR_EMBEDDINGS
    0   1  Hello, my dog is cute.  [-0.011534097604453564, -0.02878376282751560,...
    1   2               I am sad.  [-0.03587627410888672, 0.016700683161616325, ...
    2   3             I am happy.  [-0.00781599897891283, 0.020758874714374542, ...
    3   4             I am angry.  [-0.006832946091890335, -0.00432532839477062,...
    4   5            I am others.  [0.021059399470686913, -0.061699334532022476,...
Using vectors with in-database ML algorithms
Using DataFrame proxy objects that reference VECTOR columns, you can use the following algorithms, which take vectors as predictors:
- Classification and Regression: Neural Network, Generalized Linear Model, Support Vector Machines
- Anomaly Detection: One-class Support Vector Machines, Expectation Maximization
- Clustering: K-Means, Expectation Maximization
- Feature Extraction: Singular Value Decomposition, Principal Component Analysis
The following algorithms can produce vectors as output:
- Feature Extraction: Singular Value Decomposition, Principal Component Analysis, Non-negative Matrix Factorization, Explicit Semantic Analysis
Example using vectors with an in-database ML algorithm
The example below shows how to use one of the feature extraction algorithms, Singular Value Decomposition (SVD), with the vector_embedding function to vectorize relational data, build similarity indexes, and perform a semantic similarity search. We’ll use the customer bank marketing data, which includes a mix of numeric and categorical columns and has more than 40,000 rows.
To perform a semantic similarity search, we first vectorize the relational data using the OML feature extraction algorithm, which projects the data onto a more compact vector space. We’ll use SVD to perform a Principal Component Analysis (PCA) projection to produce five features – or vector dimensions.
    BANK = oml.sync(table="BANK")
    svd_mod = oml.svd(ODMS_DETAILS = 'ODMS_ENABLE',
                      PREP_AUTO = 'ON',
                      SVDS_SCORING_MODE = 'SVDS_SCORING_PCA',
                      FEAT_NUM_FEATURES = 5)
    svd_mod = svd_mod.fit(BANK, case_id='id', model_name='PCA_TEST_MODEL')
Next, we use the vector_embedding function in a SQL query to output the SVD projection results as vectors. The dimension of the vector column is the same as the number of features in the PCA model, and the value of the vector represents the PCA projection results of the original row data.
    SELECT id, vector_embedding(PCA_TEST_MODEL USING *) embedding
    FROM bank
    WHERE id=10000;

           ID EMBEDDING
    --------- --------------------------------------------------
        10000 [-2.3551013972411354E+002,2.8160084506788273E+001,
              5.2821278275005774E+001,-1.8960922352439308E-002,
              -2.5441143639048378E+000]
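As a rough illustration of what a PCA projection does to one row (not how the in-database SVD algorithm is implemented), the sketch below centers a numeric row and takes its dot product with each component. The mean and components values are made up for the example, and it uses two components instead of five; in the database these quantities are learned by the model.

```python
def project(row, mean, components):
    """Project one numeric row onto the given components.

    The result has one value per component, which is why the vector
    dimension equals the number of features in the model.
    """
    centered = [x - m for x, m in zip(row, mean)]
    return [sum(c * x for c, x in zip(component, centered))
            for component in components]


# Hypothetical values: a 3-column row projected onto 2 orthonormal components.
mean = [2.0, 4.0, 6.0]
components = [
    [1.0, 0.0, 0.0],
    [0.0, 0.6, 0.8],
]
row = [3.0, 9.0, 6.0]

embedding = project(row, mean, components)
print(len(embedding))
print(embedding)
```

With five components, the same function would return a five-value list, matching the five-dimensional vector in the query output above.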
After creating a table with the vector output, we build a vector index, for example, using IVF with cosine distance.
    CREATE TABLE pca_output AS
      SELECT id, vector_embedding(PCA_TEST_MODEL USING *) embedding
      FROM bank;

    CREATE VECTOR INDEX my_ivf_idx ON pca_output(embedding)
      ORGANIZATION NEIGHBOR PARTITIONS
      DISTANCE COSINE
      WITH TARGET ACCURACY 95;
Using this index, we can find the top five clients most similar to our client of interest. By joining the PCA output table with the BANK table, we can retrieve additional client information:
    SELECT p.id id, b.PDAYS PDAYS, b.EURIBOR3M EURIBOR3M, b.CONTACT CONTACT,
           b.EMP_VAR_RATE EMP_VAR_RATE, b.DAY_OF_WEEK DAY_OF_WEEK
    FROM pca_output p, bank b
    WHERE p.id <> 10000 AND p.id=b.id
    ORDER BY VECTOR_DISTANCE(embedding, (select embedding from pca_output where id=10000), COSINE)
    FETCH APPROXIMATE FIRST 5 ROWS ONLY;

         ID  PDAYS  EURIBOR3M CONTACT    EMP_VAR_RATE DAY_OF_WEEK
    ------- ------ ---------- ---------- ------------ -----------
       9416    999      4.967 telephone           1.4 fri
      13485    999      4.963 telephone           1.4 thu
       9800    999      4.959 telephone           1.4 wed
      11607    999      4.959 telephone           1.4 fri
       8264    999      4.864 telephone           1.4 tue
The results show that the five closest records have very similar column values.
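To make the index and the approximate fetch more concrete, here is a small pure-Python sketch of the idea behind a neighbor-partition (IVF) search with cosine distance: vectors are grouped by their nearest centroid, and a query scans only the closest partition or partitions. The centroids, the sample vectors, and the n_probe parameter are hypothetical. This is not Oracle's implementation; the database chooses centroids and probing on its own.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def build_partitions(vectors, centroids):
    """Assign each row id to the partition of its nearest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for row_id, vec in vectors.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: cosine_distance(vec, centroids[i]))
        partitions[nearest].append(row_id)
    return partitions


def search(query, vectors, centroids, partitions, k, n_probe=1):
    """Scan only the n_probe closest partitions and return the top k ids."""
    order = sorted(range(len(centroids)),
                   key=lambda i: cosine_distance(query, centroids[i]))
    candidates = [rid for i in order[:n_probe] for rid in partitions[i]]
    candidates.sort(key=lambda rid: cosine_distance(query, vectors[rid]))
    return candidates[:k]


# Hypothetical 2-dimensional data with two obvious groups.
vectors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [0.1, 0.9]}
centroids = [[1.0, 0.0], [0.0, 1.0]]

partitions = build_partitions(vectors, centroids)
top = search([1.0, 0.05], vectors, centroids, partitions, k=2)
print(partitions)
print(top)
```

Because only some partitions are scanned, a close neighbor that happens to sit in an unprobed partition can be missed, which is the trade-off a target accuracy setting is meant to control.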
Bring more of your models to the database
Machine learning and other AI models can originate from many sources, e.g., in-database models, native Python or R models, AI model repositories like Hugging Face, and others. Your Oracle Database enables you to bring your own model to the database to support AI application development.
OML4Py 2.1 on Oracle Database 23.7 expands this “bring your own model” capability by supporting the automated conversion of Hugging Face text, image, and multi-modal transformers to the Open Neural Network Exchange (ONNX) format using ONNX Pipeline Models, and the import of those models. These capabilities make it easier to leverage a broader set of text, image, and multi-modal transformers (embedding models), as well as text classification and reranking models. OML4Py 2.1 also supports importing traditional ML models that you’ve converted to ONNX format for use with the in-database ONNX Runtime. The ONNX Runtime eliminates the need to call out to separately hosted transformer models.
Using ONNX Pipeline Models
If you do not have a pretrained transformer in the required ONNX format, OML4Py can help with the new “ONNX Pipeline Models,” which cover:
- Text Embedding – ONNX Pipeline Models: Text Embedding
- Image Embedding – ONNX Pipeline Models: Image Embedding
- Multi-modal Embedding – ONNX Pipeline Models: CLIP Multi-Modal Embedding
- Text Classification – ONNX Pipeline Models: Text Classification
- Reranking – ONNX Pipeline Models: Reranking Pipeline
This capability automates the following pipeline tasks for using pretrained models from the Hugging Face repository within your Oracle database:
- Downloading the pretrained model from the Hugging Face repository to your system
- Augmenting the model with pre-processing and post-processing steps
- Creating a new ONNX-format model bundle
- Validating the augmented ONNX-format model bundle, and
- Loading the model bundle in the database as a first-class Oracle Machine Learning “data mining model” or optionally exporting the model to a file, possibly for import to multiple databases.
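To give a feel for what "augmenting the model with pre-processing and post-processing steps" can mean for a text embedding model, here is a toy pipeline in plain Python. The specific steps (whitespace tokenization, mean pooling, L2 normalization) and the toy_model stand-in are assumptions chosen for illustration; the list above only says that such steps are added, and the real pipeline wraps them into the ONNX model bundle rather than running them in Python.

```python
import math


def tokenize(text):
    """Pre-processing stand-in: lowercase and split on whitespace."""
    return text.lower().split()


def toy_model(tokens):
    """Stand-in for the transformer: one small vector per token."""
    return [[float(len(tok)), float(tok.count("a")) + 1.0] for tok in tokens]


def mean_pool(token_vectors):
    """Post-processing: average the per-token vectors into one vector."""
    if not token_vectors:
        return []
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def l2_normalize(vector):
    """Post-processing: scale the vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def embed(text):
    """Run the whole toy pipeline: text in, fixed-size unit vector out."""
    return l2_normalize(mean_pool(toy_model(tokenize(text))))


embedding = embed("I am happy.")
length = math.sqrt(sum(x * x for x in embedding))
print(len(embedding), round(length, 6))
```

Bundling steps like these with the model means callers can pass raw text and receive a single vector, without reimplementing the surrounding logic in each client.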
After importing such models to your database, you can use them in the AI Vector Search operators, the SQL prediction operators or the OML4Py predict functions, as appropriate.
You can work with any of the Hugging Face models that have been validated with OML4Py. Alternatively, you can use models directly from your file system using the built-in templates, which contain common configurations for text or image models.
The Python package oml.utils handles ONNX pipeline generation and export. The oml.onnx class allows you to import your own ONNX-format model, load it in the database, and score data using the OML prediction operators. You can load embedding (transformer), regression, classification, and clustering ONNX-format models and score using database data from Python and SQL. Once loaded in the database, an ONNX-format model can again be accessed using the onnx class.
Example of how ONNX Pipeline Model works
Here’s an example using the ONNX Pipeline Model interface in oml.utils. After transforming the model using ONNXPipeline, we export a pretrained transformer (embedding model) to the database; you can also export it to your local file system.
In the case of multi-modal transformers like CLIP, OML4Py produces two .onnx files with an image suffix (“_img”) and a text suffix (“_txt”) added automatically (e.g., “clip_txt.onnx” and “clip_img.onnx”).
    import oml
    from oml.utils import ONNXPipeline, ONNXPipelineConfig

    pipeline = ONNXPipeline(model_name="openai/clip-vit-large-patch14")

    # Export to the database for use with in-database ONNX Runtime
    pipeline.export2db("CLIP")

    # Export to file
    pipeline.export2file("clip", output_dir="/tmp/models")
To use this model from a file, move the ONNX file to a directory on the target database server and create a directory in the database for the import.
Use oml.onnx to create an ONNX model object and then load that model to the database.
    import oml
    from oml.algo import onnx

    onnx_model = onnx('clip_img.onnx',
                      mining_function="embedding",
                      model_input={"input": ["DATA"]})
    onnx_model.load2db("CLIP_IMAGE")
You can verify that the models exist in your database schema by getting a proxy object for the following SQL query.
    model_list = oml.sync(query = "SELECT model_name, algorithm, mining_function "
                                  "FROM user_mining_models "
                                  "WHERE model_name LIKE 'CLIP_%'")
Importing traditional ONNX-format machine learning models
Native in-database machine learning models benefit from multiple optimizations for running in the core database. However, you may have models that were produced in external systems and want to use the OML SQL prediction operators and Python functions with these models. You can use the onnx class as above.
Example of how a traditional ONNX-format ML model works
In this example, we load a classification ONNX-format model into the database. This model is based on the iris dataset and predicts the species target.
    import oml
    from oml.algo import onnx

    # Load your ONNX-format model into the database - providing the ONNX file (.onnx) and associated metadata
    model_2 = onnx(onnx_file="path_to_model.onnx",
                   mining_function='CLASSIFICATION',
                   classification_prob_output="output",
                   model_input={"tensor_of_dim_4":['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width']},
                   apply_softmax=True,
                   labels=['setosa','versicolor','virginica'],
                   description={"description":"This model is useful.","author":"John Doe"},
                   default_on_null={'Sepal_Length':3.2,'Sepal_Width':2.5,'Petal_Length':46,'Petal_Width':1.8})
    model_2.load2db("MY_CLASSIFICATION_MODEL")
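The apply_softmax, labels, and default_on_null arguments in the call above describe how raw model inputs and outputs should be interpreted. The sketch below is a plain-Python reading of what those names suggest: filling missing inputs with defaults, then turning raw class scores into labeled probabilities. The fill_nulls helper, the sample row, and the raw scores are hypothetical, and this is not a description of how the in-database ONNX Runtime performs these steps.

```python
import math


def fill_nulls(row, defaults):
    """Replace missing (None) inputs with the configured default values."""
    return {name: defaults[name] if value is None else value
            for name, value in row.items()}


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


defaults = {'Sepal_Length': 3.2, 'Sepal_Width': 2.5}
row = fill_nulls({'Sepal_Length': None, 'Sepal_Width': 3.0}, defaults)

labels = ['setosa', 'versicolor', 'virginica']
raw_scores = [2.0, 1.0, 0.1]  # hypothetical raw model outputs

probabilities = dict(zip(labels, softmax(raw_scores)))
predicted = max(probabilities, key=probabilities.get)
print(row)
print(predicted)
```

The order of the labels list matters in a mapping like this, since each label is paired with the score at the same position.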
You can also use the onnx class to reference a model previously loaded into the database. In the following example, we specify the model loaded above, MY_CLASSIFICATION_MODEL, owned by user OMLUSER. Once the model is loaded, you can view the model information by displaying the model object.
    from oml.algo import onnx

    model_2 = onnx(model_name="MY_CLASSIFICATION_MODEL", model_owner="OMLUSER")
Now, we can use the model from Python to score the iris dataset.
    # Get a proxy object to the IRIS database table
    oml_iris = oml.sync(table = 'IRIS')

    model_2.predict(oml_iris.drop('Species'),
                    supplemental_cols=oml_iris[:,['Sepal_Length', 'Sepal_Width', 'Species']],
                    proba=True)
Export and import user-defined Python scripts
The Python Script Repository allows you to store user-defined functions (UDFs) in the database and load them into Python memory for immediate use. You can also invoke UDFs using OML4Py embedded Python execution from Python, SQL, and REST APIs.
To facilitate sharing and migrating scripts from one database to another, user-defined Python functions, also called scripts, can be exported or imported among Python script repositories across database instances and schemas. This simplifies solution deployment—for example, deploying from development to test and then production systems.
Use the oml.export_script function to export UDFs from the OML4Py script repository to a destination file. OML4Py enables exporting scripts in JSON or Python format.
- If you provide a ‘.json’ suffix, the exported JSON file contains a list of dictionaries. Each dictionary corresponds to a script with content: script name, the UDF, and script description.
- If you provide a ‘.py’ suffix, the exported Python file contains a list of Python scripts, each with its own doc string containing the script name and description.
See Export a User-Defined Python Function for more details. Use the oml.import_script function to import UDFs to the OML4Py script repository. See Import a User-Defined Python Function for more details.
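As a rough picture of the JSON layout described above (a list of dictionaries, one per script), the following sketch builds such a list and round-trips it with Python's json module. The key names "name", "script", and "description", and the build_lm function, are assumptions for illustration; the exact keys written by oml.export_script may differ.

```python
import json

# Hypothetical export content: one dictionary per script.
scripts = [
    {
        "name": "build_lm",
        "script": "def build_lm(dat):\n    return len(dat)\n",
        "description": "Toy example that returns the number of rows",
    },
]

# Serialize to JSON text and read it back, as a file export/import would.
exported_text = json.dumps(scripts, indent=2)
loaded = json.loads(exported_text)

# Turn the stored source text back into a callable function.
namespace = {}
exec(loaded[0]["script"], namespace)
result = namespace[loaded[0]["name"]]([1, 2, 3])
print(loaded[0]["name"], result)
```

Storing the function as source text alongside its name and description is what makes the file portable between script repositories.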
Example of how exporting and importing UDF works
The following example lists the UDFs in the script repository that are available to the current user and whose names contain “LM”. It then exports user-owned and global scripts to the named files and verifies that both files are present in the current directory.
    import oml
    import os

    oml.script.dir(name="LM", regex_match=True, sctype="all")[['owner', 'name', 'script', 'description']]

    oml.export_script(file_name="script_user.json", sctype="user")
    oml.export_script(file_name="script_global.py", sctype="global")

    {"script_user.json", "script_global.py"}.issubset(os.listdir(path="./"))
The following example imports scripts in the file “script_user.json” as private UDFs and the file “script_global.py” as global UDFs.
    import oml

    oml.import_script(file_name="script_user.json")
    oml.script.dir()[['name', 'script', 'description']]

    oml.import_script(file_name="script_global.py", is_global=True)
    oml.script.dir(name="LM", regex_match=True, sctype="all")[['owner', 'name', 'script','description']]
Why use OML4Py?
All the new capabilities are great, but why should you use OML4Py? Here are a few of the main reasons why:
- Performance and scalability: Oracle Machine Learning for Python (OML4Py) leverages the database as a high-performance computing environment to explore, transform, and analyze data faster and at scale from Python. Using overloaded functions on Pandas DataFrame proxy objects, database tables and views can be manipulated in the database—taking advantage of database parallelism, query optimization, and column indexes, among other features. The in-database parallelized machine learning algorithms are exposed through a natural Python interface and in-database models can be used through ML model proxy objects.
- Ease of collaboration and solution deployment: Data scientists and other Python users can create user-defined Python functions that are managed in the database and leverage third-party packages to augment native OML4Py and database functionality. Using the datastore, you can store Python objects directly in the database—as opposed to being managed in flat files. These capabilities facilitate collaboration across the data science team by enabling convenient hand-off of work products from data scientists to application developers and production system administrators for immediate deployment.
- ML for experts and non-experts: OML4Py supports automated machine learning (AutoML) for classification and regression models. This not only enhances data scientist productivity but also enables non-experts to use and benefit from machine learning. AutoML can help produce more accurate models faster through automated algorithm selection, feature selection, and model tuning and selection.
- Choice of models: OML4Py also facilitates importing transformer models from Hugging Face as well as other ONNX-format models for in-database use.
For more information…
You can try these OML4Py 2.1 capabilities on your local Oracle Database instance. Download the OML4Py 2.1 client today.
See these resources:
- OML4Py documentation
- OML4Py ONNX class documentation
- OML4Py API Reference
- OML4Py 2.0 LiveLab Workshop
- OML4Py 2.1 client download