In March 2021, a new feature landed on Oracle Machine Learning that sits on top of Oracle Autonomous Database: AutoML. The new AutoML user interface is a low-code, browser-based solution that allows you to create and deploy models using data that resides within Autonomous Database. In this blog post, we go through the process of loading data, cleaning that data, creating models with the AutoML UI, model deployment, and finally generating predictions.
What does the AutoML UI do?
AutoML doesn’t simply create models. It runs an entire experiment pipeline to generate tuned models with appropriate features. Using the data selected within the database, AutoML chooses the algorithm, samples the data, performs feature selection, and finally tunes hyperparameters when appropriate. The following graphic shows this process:
You can read more about AutoML features in this blog post before following along with the demo.
Loading data
AutoML runs directly on the data within the Autonomous Database. As long as the data sits in a table that your user has access to, you can train and deploy a model with that data.
Throughout this blog, we work with a dataset that measures the fuel efficiency in miles per gallon of various vehicles based on a few different features. You can find the dataset at Kaggle, which is originally sourced from the UCI Machine Learning repository.
When you have the CSV file, you can load it into Autonomous Database either through the cloud-native Database Actions console, found within the service console of your Autonomous Database instance, or through your local SQL Developer application. For instructions on deploying an Autonomous Database on the Oracle Cloud Infrastructure (OCI) Free Tier and more information, see Using Oracle Autonomous Database on Shared Exadata Infrastructure and the documentation.
Let’s look at the first few rows of the dataset:
The first column is our response variable, miles per gallon. We have a suite of predictors: cylinder count, displacement, horsepower, weight, acceleration, model year, country of origin, and finally "Car Name," a field that contains some useful text. Most of these fields seem ready to go for machine learning, but we still have some data translation to do!
Preprocessing the data with OML4Py
No matter the tool you choose, you must first prepare your data for the model training process. We can write a simple OML4Py script to run within the Oracle Machine Learning notebooks that connect to Autonomous Database. OML4Py lets you interact with the data in the autonomous database using Python.
For in-depth information on OML notebooks, see Using Oracle Machine Learning on Autonomous Database. You can also familiarize yourself with OML4Py before jumping into the script with the documentation, which has a set of links to some helpful LiveLabs.
The following code snippets were run in a Machine Learning notebook running on the same Autonomous Database instance that contains the data. Simply create a notebook to run this code in. If your Autonomous Database user isn’t set up for Machine Learning on Autonomous Database, follow the documentation.
To follow along, click Notebooks.
Then click Create.
First, import the following Python packages and load the data in.
%python
import oml
import numpy as np
import pandas as pd
query = 'SELECT * FROM USER1.AUTO_MPG_RAW'
data = oml.sync(query=query)
In this code block, we use a simple select query to create a Python object, “data,” which is a proxy for the table AUTO_MPG_RAW.
To do some data translation with pandas, we can use the pull method on this object to create an in-memory pandas DataFrame:
%python
df = data.pull()
Next, we do some cleaning.
In the following code, we make all the column names lowercase, replace any question marks in the data with NaN, and adjust the type of horsepower because the missing values, originally designated with a “?,” caused it to be read in as a string.
%python
df.columns = [x.lower() for x in df.columns]
df = df.replace('?', np.nan)
df.horsepower = df.horsepower.astype('float')
The "Car Name" column has a lot of information in it that can be valuable for our model, such as the brand name. We can extract the brand into a new column, "brand." One simple way to do that, sketched below, is to take the first word of each car name:
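%python
# A sketch of the brand extraction: this assumes the brand is the first word of
# the car name (for example, 'chevrolet chevelle malibu' -> 'chevrolet') and that
# the column is named 'car name' after the lowercasing step
df['brand'] = df['car name'].str.split().str[0]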
To make the information in this table more readable, we also convert the "origin" feature into the regions that the cars originated from; in the Auto MPG data, the codes 1, 2, and 3 conventionally correspond to the USA, Europe, and Japan:
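%python
# Map the numeric origin codes to readable region names
# (1 = USA, 2 = Europe, 3 = Japan in the standard Auto MPG encoding)
df['origin'] = df['origin'].replace({1: 'USA', 2: 'Europe', 3: 'Japan'})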
Finally, let’s create a table in the database with our transformed data using the create method. We can drop any features that we no longer need with the following command:
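%python
# Drop the raw text column now that we've extracted the brand, then persist the
# cleaned data as a new database table (the name AUTO_MPG_CLEAN is illustrative)
df = df.drop(columns=['car name'])
oml.create(df, table='AUTO_MPG_CLEAN')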
Creating a model with the AutoML UI

Back in the AutoML UI, create an experiment: give it a name, select the cleaned table as the data source, and choose mpg as the prediction target. Under Additional settings, leave all values as default and set the metric to R2.
You have some granular control in the additional settings over the models you train, the metric you use, and the run duration for training. We chose R2, which measures how much of the variance in the response is explained by your features: an R2 of 1 means the features explain all of the variance, while 0 means they explain none of it.
Speaking of features, for the last step, select the features for your model. In this case, select all the features present, because we dropped the unnecessary ones at the end of our OML4Py script.
Now you can click Start and then Better Accuracy to run the experiment. This selection increases the training time but allows for a more effective set of models to be generated.
As the models are being trained, you can watch the leaderboard develop and track the current stage of the experiment pipeline.
Viewing results
When the experiment is completed, you have a set of models trained and available within Autonomous Data Warehouse. You can see them ranked on the final leaderboard based on the metric you selected earlier, in this case R2. In the following screenshot, we can see that the Support Vector Machine model with a Gaussian kernel performed best:
If we click the name of this model, we can see the features that had the highest impact:
In this case, the weight, horsepower, and model year of the car had the largest impact on predicting the fuel efficiency of the car, which makes sense!
Deploying a model
Now that these models are trained, they’re available for you to call as functions using SQL within Autonomous Database or as model artifacts in a connected instance of Oracle Analytics Cloud. You can also deploy them easily as a REST endpoint from the AutoML UI directly.
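As a quick sketch of the SQL route, you can score in-database from an OML notebook with Oracle's PREDICTION operator; the model name SVMG_MPG and table name AUTO_MPG_CLEAN here are illustrative, so substitute the model name shown on your leaderboard:

%python
# Score rows in the database with the trained model; the data never
# leaves Autonomous Database (model and table names are illustrative)
scored = oml.sync(query='SELECT mpg, PREDICTION(SVMG_MPG USING *) AS pred_mpg FROM AUTO_MPG_CLEAN')
scored.head()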
First, highlight the SVM (Gaussian) model and click Deploy.
Fill in the required details.
Here, I used “svm_mpg” for both the name and URI and 0.1 as the version.
Now, this model is deployed as a REST endpoint. You can view it under the Deployments tab of the Models section of Oracle Machine Learning, but more importantly you can issue a call to generate predictions.
Let’s do exactly that using cURL.
Generating predictions
As a basic example, I can generate predictions by setting some environment variables in a local Bash shell and issuing a few cURL commands. For details, see the documentation.
First, let’s set some environment variables that detail your tenancy and Autonomous Data Warehouse deployment:
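Those variables can look something like the following sketch (all values here are placeholders):

export tenant='ocid1.tenancy.oc1..<your_tenancy_ocid>'
export username='<database_username>'
export password='<database_user_password>'
export database='<database_name>'
export omlserver='https://adb.us-ashburn-1.oraclecloud.com'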
In this code block, I set a few values, such as the tenancy OCID, the username and password for the database user, the database name, and the URL for the OML server. In this example, I’m using a deployment in Ashburn, but you can change the ‘us-ashburn-1’ portion of the URL to your relevant region.
Next, we get an access token, which provides authentication when calling the REST endpoint. Using the previous environment variables, we can issue the following call:
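The token request can look like this sketch; the exact path can vary by OML release, so verify it against the OML REST API documentation:

curl -X POST --header 'Content-Type: application/json' \
  -d '{"grant_type":"password", "username":"'"$username"'", "password":"'"$password"'"}' \
  "$omlserver/omlusers/tenant/$tenant/database/$database/api/oauth2/v1/token"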
This command outputs a JSON object, with the value of the “accesstoken” key being a long string. Copy and paste the following command to make one last environment variable, including the apostrophes wrapping the token:
export token='<accesstoken>'
Now, we can issue some calls against the model’s endpoint. We can issue the following command to get some model metadata and the model’s input parameters:
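Assuming the "svm_mpg" URI from the deployment step, the metadata call can look like this sketch (again, verify the path for your deployment):

curl -X GET --header "Authorization: Bearer $token" "$omlserver/omlmod/v1/deployment/svm_mpg/api"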
Here, we can see the modeling task at hand (regression), the algorithm used to train the model (Support vector machine), and the different inputs and their type.
Now let’s actually generate a prediction!
In the following call, I’m including a JSON object that details the car that I want to predict the miles per gallon for. The following code block shows that object:
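A sketch of that scoring call, with an illustrative input record describing a light, four-cylinder car; the field names must match the model inputs returned by the metadata call:

curl -X POST --header "Authorization: Bearer $token" --header 'Content-Type: application/json' \
  -d '{"inputRecords":[{"cylinders":4, "displacement":140, "horsepower":90, "weight":2264, "acceleration":15.5, "model year":71, "origin":"USA", "brand":"chevrolet"}]}' \
  "$omlserver/omlmod/v1/deployment/svm_mpg/score"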
We can see that the model predicted that this car has a fuel efficiency of about 30 miles per gallon.
Conclusion
We covered a lot of ground with the easy-to-use framework provided by Oracle Machine Learning sitting on top of Autonomous Database.
We manipulated data using a Python interface for SQL, stored that cleansed data in the database, completed a full end-to-end model creation pipeline in a few clicks with AutoML, performed a one-click deployment of a model, and finally generated predictions using that model.
This process shows the power and ease of use of machine learning on Autonomous Database and specifically the power that’s delivered to the end-user through the AutoML UI!