
Learn about Oracle Machine Learning for Oracle Database and Big Data, on-premises and Oracle Cloud

  • February 24, 2021

Getting started with Oracle Machine Learning for Python

Mark Hornick
Senior Director, Data Science and Machine Learning

As noted in Introducing Oracle Machine Learning for Python, OML4Py is included with Oracle Autonomous Database, making the open source Python scripting language and environment ready for the enterprise and big data. 

To get started with OML4Py, log in to your Oracle Machine Learning Notebooks account and create a new notebook. If you don't have an account yet, you can create an Autonomous Database instance using Oracle's Always Free services and follow this OML Notebooks tutorial.

Load the OML package

In the initial paragraph, specify %python as your interpreter. At this point, you can invoke Python code; however, to use OML4Py, first import the oml package. Click the "run this paragraph" button. You can optionally invoke oml.isconnected() to verify your connection, which should return True.

%python

import oml
oml.isconnected()

Load a Pandas DataFrame to the database

There are several ways to load data into Oracle Autonomous Database. In this first example, we create a table using the sklearn iris data set. We combine the target and predictors into a single Pandas DataFrame and load this DataFrame object into an Oracle Autonomous Database table using the create function.

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
x = pd.DataFrame(iris.data, 
                 columns = ["SEPAL_LENGTH", "SEPAL_WIDTH", 
                            "PETAL_LENGTH", "PETAL_WIDTH"])
y = pd.DataFrame(list(map(lambda t: {0: 'setosa', 1: 'versicolor',
                                     2: 'virginica'}[t], iris.target)),
                 columns = ['Species'])
iris_df = pd.concat([x,y], axis=1)

IRIS = oml.create(iris_df, table="IRIS")
print("Shape:",IRIS.shape)
print("Columns:",IRIS.columns)
IRIS.head(4)

The script above produces the following output. Note that we access shape and columns properties on the proxy object, just as we would with a Pandas DataFrame. Similarly, we invoke the overloaded head function on the IRIS proxy object.

Shape: (150, 5)
Columns: ['SEPAL_LENGTH', 'SEPAL_WIDTH', 'PETAL_LENGTH', 'PETAL_WIDTH', 'Species']

   SEPAL_LENGTH  SEPAL_WIDTH  PETAL_LENGTH  PETAL_WIDTH Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa

This table is also readily available in the user schema under the name IRIS, just as any other database table.
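Because the proxy object mirrors the pandas API, the same shape, columns, and head calls run unchanged against the local DataFrame. As a quick sanity check that needs no database connection, a minimal sketch rebuilding the same DataFrame locally:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=["SEPAL_LENGTH", "SEPAL_WIDTH",
                                           "PETAL_LENGTH", "PETAL_WIDTH"])
# target_names maps 0/1/2 to 'setosa'/'versicolor'/'virginica'
iris_df["Species"] = [iris.target_names[t] for t in iris.target]

# The same attribute and method calls work on the local DataFrame
# as on the IRIS proxy object shown above.
print("Shape:", iris_df.shape)        # Shape: (150, 5)
print("Columns:", list(iris_df.columns))
print(iris_df.head(4))
```

The difference is where the work happens: with the proxy, these calls are translated to SQL against the IRIS table rather than executed on local data.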

Using overloaded functions

Using the numeric columns, we compute the correlation matrix on the in-database table IRIS using the overloaded corr function. Here, we see that petal length and petal width are highly correlated.

IRIS.corr()

With the output:

              SEPAL_LENGTH  SEPAL_WIDTH  PETAL_LENGTH  PETAL_WIDTH
SEPAL_LENGTH      1.000000    -0.109369      0.871754     0.817954
SEPAL_WIDTH      -0.109369     1.000000     -0.420516    -0.356544
PETAL_LENGTH      0.871754    -0.420516      1.000000     0.962757
PETAL_WIDTH       0.817954    -0.356544      0.962757     1.000000

OML4Py overloads graphics functions as well. Here, we use boxplot to show the distribution of the numeric columns. In such overloaded functions, the statistical computations take place in the database - avoiding data movement and leveraging Autonomous Database as a high performance compute engine - returning only the summary statistics needed to produce the plot.

import matplotlib.pyplot as plt
plt.style.use('seaborn')
plt.figure(figsize=[10,5])

oml.graphics.boxplot(IRIS[:, :4], notch=True, showmeans = True,
                     labels=IRIS.columns[:4])
plt.title('Distribution of IRIS Attributes')
plt.ylabel('cm');
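To see why only summary statistics need to leave the database, here is a local sketch of the per-column numbers a boxplot requires: quartiles, medians, and whisker bounds. This is computed with pandas for illustration; the in-database computation is analogous in spirit, though exact whisker handling may differ:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=["SEPAL_LENGTH", "SEPAL_WIDTH",
                                      "PETAL_LENGTH", "PETAL_WIDTH"])

# A boxplot needs only these per-column summary statistics,
# not the raw rows -- which is why pushing the computation
# into the database avoids moving the full table.
summary = df.quantile([0.25, 0.5, 0.75])
iqr = summary.loc[0.75] - summary.loc[0.25]
whisker_low = summary.loc[0.25] - 1.5 * iqr
whisker_high = summary.loc[0.75] + 1.5 * iqr
print(summary)
```

For 150 rows the savings are negligible, but against a table with millions of rows, returning a handful of quantiles per column instead of the raw data makes a real difference.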

In-database attribute importance

Let's rank the relative importance of each variable (also known as an attribute or predictor) for predicting the target 'Species' from the IRIS table.

We define the ai (attribute importance) object, compute the result, and show the attribute importance ranking.

In the result, notice that petal width is most predictive of the target species. The importance value produced by this algorithm provides a relative ranking to be used to distinguish importance among variables.

from oml import ai
# Use sync to get a proxy object for the existing IRIS table
IRIS = oml.sync(table = "IRIS")
IRIS_x = IRIS.drop('Species')
IRIS_y = IRIS['Species']

ai_obj = ai()  # Create attribute importance object
ai_obj = ai_obj.fit(IRIS_x, IRIS_y)
ai_obj 

With the output:

Algorithm Name: Attribute Importance

Mining Function: ATTRIBUTE_IMPORTANCE

Settings: 
                   setting name            setting value
0                     ALGO_NAME              ALGO_AI_MDL
1                  ODMS_DETAILS              ODMS_ENABLE
2  ODMS_MISSING_VALUE_TREATMENT  ODMS_MISSING_VALUE_AUTO
3                 ODMS_SAMPLING    ODMS_SAMPLING_DISABLE
4                     PREP_AUTO                       ON

Global Statistics: 
  attribute name  attribute value
0       NUM_ROWS              150

Attributes: 
PETAL_LENGTH
PETAL_WIDTH
SEPAL_LENGTH
SEPAL_WIDTH

Partition: NO

Importance: 

       variable  importance  rank
0   PETAL_WIDTH    1.050935     1
1  PETAL_LENGTH    1.030633     2
2  SEPAL_LENGTH    0.454824     3
3   SEPAL_WIDTH    0.191514     4
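The in-database algorithm (ALGO_AI_MDL, a Minimum Description Length-based ranking) has no direct open source equivalent, but for intuition, a local analogue using scikit-learn's mutual information yields the same relative ordering on this data. The scores are on a different scale than the importance values above; this is an illustrative sketch, not a reproduction of the in-database computation:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

iris = load_iris()
cols = ["SEPAL_LENGTH", "SEPAL_WIDTH", "PETAL_LENGTH", "PETAL_WIDTH"]

# Mutual information is a different measure than MDL, so the
# scores differ, but the relative ranking is comparable.
mi = mutual_info_classif(iris.data, iris.target, random_state=0)
ranking = sorted(zip(cols, mi), key=lambda p: -p[1])
for name, score in ranking:
    print(f"{name:13s} {score:.4f}")
```

As with the in-database result, the petal measurements rank first and sepal width last.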

Change your service level

In your notebook, you can change the service level of your connection to Oracle Autonomous Database to take advantage of different parallelism options; the parallelism available is relative to your Autonomous Database compute resource settings. Click the gear icon in the upper right (as indicated by the arrow in the figure), click individual interpreters to turn them on or off, and click and drag each interpreter box to change the default service level. The 'low' binding runs your functions and queries without parallelism, 'medium' allows limited parallelism, and 'high' allows your functions and queries to use up to the maximum number of compute resources allocated to your Autonomous Database.

In my next post, we'll look at building predictive models.
